
Databricks Lakehouse

Direct Load

Starting with version 4.0.0, the Databricks Lakehouse destination uses the Direct Load architecture. The connector writes data directly to final tables without intermediate raw tables, which improves performance and reduces storage costs.

For migration details and backward compatibility options, see the Databricks Migration Guide.

Prerequisites

  • A Databricks workspace with Unity Catalog enabled.
  • A SQL warehouse or compute cluster to run queries against.
  • Authentication credentials: an OAuth2 client ID and secret (recommended), or a personal access token.
  • Acceptance of the Databricks JDBC/ODBC driver license. By using this connector, you agree that it may only be used to connect third-party applications to Apache Spark SQL within a Databricks offering using the ODBC and/or JDBC protocols.

Network access

If you're using Airbyte Cloud and this destination uses IP-based access controls, add Airbyte's IP addresses to your allowlist.

Step 1: Set up Databricks

You will need the following information from your Databricks workspace:

Server Hostname / HTTP Path / Port

  1. Open the workspace console.

  2. Open your SQL warehouse.

  3. Open the Connection Details tab.

  4. Note the Server Hostname, HTTP Path, and Port values.

  5. You will also need the Databricks Unity Catalog Name — the name of the Unity Catalog that contains the database you want to write to. This is not found on the Connection Details tab; look for it in the Databricks workspace sidebar under Catalog.
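
Values copied from the Connection Details tab often carry stray whitespace or a pasted URL scheme. The following is a minimal sketch of a pre-flight check for the four values collected above. The function and class names are hypothetical, the host, path, and catalog values are placeholders, and the rule that the hostname must be a bare host without `https://` is an assumption rather than documented connector behavior. The port is kept as a string because the config fields reference lists it with that type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionDetails:
    """Values from the Connection Details tab, plus the Unity Catalog name."""
    hostname: str
    http_path: str
    port: str
    catalog: str


def normalize_connection_details(hostname: str, http_path: str,
                                 port: str, catalog: str) -> ConnectionDetails:
    """Trim whitespace and reject values that are obviously malformed."""
    hostname = hostname.strip()
    # Assumption: the field expects a bare host, so drop a pasted URL scheme.
    for scheme in ("https://", "http://"):
        if hostname.lower().startswith(scheme):
            hostname = hostname[len(scheme):]
    hostname = hostname.rstrip("/")
    http_path = http_path.strip()
    port = port.strip()
    catalog = catalog.strip()

    if not hostname or "/" in hostname:
        raise ValueError("Server Hostname must be a bare host name")
    if not http_path:
        raise ValueError("HTTP Path must not be empty")
    if not port.isdigit() or not 1 <= int(port) <= 65535:
        raise ValueError("Port must be a number between 1 and 65535")
    if not catalog:
        raise ValueError("Databricks Unity Catalog Name must not be empty")
    return ConnectionDetails(hostname, http_path, port, catalog)


details = normalize_connection_details(
    " https://example-workspace.cloud.databricks.com/ ",  # placeholder host
    "/sql/1.0/warehouses/abc123",                          # placeholder path
    "443",
    "main",                                                # placeholder catalog
)
print(details)
```

Running a check like this before you paste the values into Airbyte can surface simple copy errors earlier than a failed connection test would.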

Authentication

OAuth2 (recommended)

Create a service principal in your Databricks workspace, then generate an OAuth2 client ID and secret for it.

Personal Access Token

  1. Open your workspace console.

  2. Click your user icon in the top-right corner, go to Settings, then Developer, and click Manage under Access tokens.

  3. Enter a description for the token and how long it should remain valid. Leave the lifetime blank to create a token that does not expire.

Step 2: Set up the Databricks destination in Airbyte

  1. Log in to your Airbyte account.
  2. In the left navigation bar, click Destinations. In the top-right corner, click + New destination.
  3. Find and select Databricks Lakehouse from the list of available destinations.
  4. Enter the Server Hostname, HTTP Path, Port, and Databricks Unity Catalog Name from Step 1.
  5. Select your Authentication method and enter the required credentials.
  6. Configure the remaining options:
    • Default Schema - The schema that will contain your data. You can later override this on a per-connection basis.
    • CDC deletion mode - Whether CDC deletions are propagated as hard deletes (the row is removed) or soft deletes (the row is kept with a tombstone). Defaults to hard delete.
    • Purge Staging Files and Tables - Whether to delete staging files after loading them into tables. Disable for debugging.
  7. Click Set up destination.
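
To make the two CDC deletion modes concrete, here is a small in-memory sketch of how a delete event could affect a table under each mode. It is an illustration only, not the connector's implementation. The tombstone column name `_ab_cdc_deleted_at`, the `id` key, and the mode strings `"hard"` and `"soft"` are assumptions made for this example.

```python
from typing import Any, Dict

Row = Dict[str, Any]

# Assumption: a CDC delete is signalled by a non-null value in this column.
TOMBSTONE_COLUMN = "_ab_cdc_deleted_at"


def apply_cdc_record(table: Dict[str, Row], record: Row, mode: str = "hard") -> None:
    """Apply one CDC record to `table`, which is keyed by the record's `id`."""
    if mode not in ("hard", "soft"):
        raise ValueError(f"unknown deletion mode: {mode}")
    key = record["id"]
    is_delete = record.get(TOMBSTONE_COLUMN) is not None
    if is_delete and mode == "hard":
        # Hard delete: the row is removed from the table.
        table.pop(key, None)
    else:
        # Insert or update. Under soft delete, a deleted row is kept
        # and carries its tombstone value.
        table[key] = record


events = [
    {"id": "1", "name": "Ada", TOMBSTONE_COLUMN: None},
    {"id": "2", "name": "Grace", TOMBSTONE_COLUMN: None},
    {"id": "1", "name": "Ada", TOMBSTONE_COLUMN: "2024-01-01T00:00:00Z"},
]

hard_table: Dict[str, Row] = {}
soft_table: Dict[str, Row] = {}
for event in events:
    apply_cdc_record(hard_table, event, mode="hard")
    apply_cdc_record(soft_table, event, mode="soft")

print(sorted(hard_table))  # row "1" is gone
print(sorted(soft_table))  # row "1" remains, with a tombstone
```

With hard deletes, downstream queries never see the deleted row. With soft deletes, you filter on the tombstone yourself, which keeps an audit trail at the cost of extra rows.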

Supported sync modes

| Sync mode | Supported? |
| --- | --- |
| Full Refresh - Overwrite | Yes |
| Full Refresh - Append | Yes |
| Full Refresh - Overwrite + Deduped | Yes |
| Incremental Sync - Append | Yes |
| Incremental Sync - Append + Deduped | Yes |
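
The difference between the Append and Append + Deduped modes can be sketched in a few lines. This is a generic illustration of deduplication by primary key, keeping the record with the latest cursor value. The connector's actual deduplication logic is not described here, and the field names `id` and `updated_at` are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def dedupe_latest(records: Iterable[Record], primary_key: str, cursor: str) -> List[Record]:
    """Keep one record per primary key: the one with the greatest cursor value."""
    latest: Dict[Any, Record] = {}
    for record in records:
        key = record[primary_key]
        # On a tie, the record that arrives later wins.
        if key not in latest or record[cursor] >= latest[key][cursor]:
            latest[key] = record
    return list(latest.values())


synced = [
    {"id": 1, "status": "new", "updated_at": "2024-01-01"},
    {"id": 2, "status": "new", "updated_at": "2024-01-01"},
    {"id": 1, "status": "paid", "updated_at": "2024-01-02"},
]

appended = list(synced)  # Append: every record is kept
deduped = dedupe_latest(synced, primary_key="id", cursor="updated_at")

print(len(appended), len(deduped))
```

Append keeps the full history of each key, while the deduped result holds one current row per key.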

Output schema

Each stream is written directly to a final table in your configured schema. The table includes your data columns plus the following Airbyte metadata columns:

| Column | Type | Notes |
| --- | --- | --- |
| _airbyte_raw_id | STRING | A UUID assigned by Airbyte to each processed event. |
| _airbyte_extracted_at | TIMESTAMP | Timestamp when the event was pulled from the data source. |
| _airbyte_meta | STRING | JSON metadata about the record, including sync information. |
| _airbyte_generation_id | LONG | See the refreshes documentation. |
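
When you read rows back from a final table, it can help to separate your data columns from the Airbyte metadata columns and to decode `_airbyte_meta`, which is stored as a JSON string. The sketch below builds a sample row and splits it. The helper names are hypothetical, and the `_airbyte_meta` payload shown contains only a `sync_id` entry for illustration. The real payload may contain more fields.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Tuple

METADATA_COLUMNS = (
    "_airbyte_raw_id",
    "_airbyte_extracted_at",
    "_airbyte_meta",
    "_airbyte_generation_id",
)


def with_metadata(data: Dict[str, Any], sync_id: int, generation_id: int) -> Dict[str, Any]:
    """Build a sample row shaped like a final-table row (illustrative values)."""
    row = dict(data)
    row["_airbyte_raw_id"] = str(uuid.uuid4())
    row["_airbyte_extracted_at"] = datetime.now(timezone.utc)
    row["_airbyte_meta"] = json.dumps({"sync_id": sync_id})  # stored as a JSON string
    row["_airbyte_generation_id"] = generation_id
    return row


def split_row(row: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (data columns, metadata columns), decoding `_airbyte_meta`."""
    data = {k: v for k, v in row.items() if k not in METADATA_COLUMNS}
    meta = {k: row[k] for k in METADATA_COLUMNS}
    meta["_airbyte_meta"] = json.loads(meta["_airbyte_meta"])
    return data, meta


row = with_metadata({"id": 1, "email": "ada@example.com"}, sync_id=42, generation_id=3)
data, meta = split_row(row)
print(data)
print(meta["_airbyte_meta"])
```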

Data type map

| Airbyte Type | Databricks Type | Notes |
| --- | --- | --- |
| string | STRING | |
| number | DECIMAL(38, 10) | Max 28 integer digits, 10 fractional |
| integer | LONG | 64-bit integer |
| boolean | BOOLEAN | |
| object | STRING | Serialized as JSON |
| array | STRING | Serialized as JSON |
| timestamp_with_timezone | TIMESTAMP | Microsecond precision |
| timestamp_without_timezone | TIMESTAMP_NTZ | Microsecond precision, no timezone |
| time_with_timezone | STRING | No native Databricks equivalent |
| time_without_timezone | STRING | No native Databricks equivalent |
| date | DATE | |
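
The mapping above can be expressed as a lookup table, which is handy for predicting the column types of a stream before you sync it. The sketch below also checks whether a value fits the `DECIMAL(38, 10)` limits of 28 integer digits and 10 fractional digits. The function names are hypothetical, and the check counts digits exactly as written, so trailing zeros count toward the fractional limit.

```python
from decimal import Decimal

AIRBYTE_TO_DATABRICKS = {
    "string": "STRING",
    "number": "DECIMAL(38, 10)",
    "integer": "LONG",
    "boolean": "BOOLEAN",
    "object": "STRING",   # serialized as JSON
    "array": "STRING",    # serialized as JSON
    "timestamp_with_timezone": "TIMESTAMP",
    "timestamp_without_timezone": "TIMESTAMP_NTZ",
    "time_with_timezone": "STRING",
    "time_without_timezone": "STRING",
    "date": "DATE",
}


def databricks_type(airbyte_type: str) -> str:
    """Look up the Databricks column type for an Airbyte type."""
    try:
        return AIRBYTE_TO_DATABRICKS[airbyte_type]
    except KeyError:
        raise ValueError(f"unmapped Airbyte type: {airbyte_type}") from None


def fits_number_column(value: Decimal) -> bool:
    """True if `value` has at most 28 integer and 10 fractional digits."""
    _sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):  # NaN or Infinity
        return False
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return integer_digits <= 28 and fractional_digits <= 10


stream_schema = {"id": "integer", "price": "number", "tags": "array"}
column_types = {name: databricks_type(t) for name, t in stream_schema.items()}
print(column_types)
print(fits_number_column(Decimal("12345.6789")))
```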

Naming conventions

  • Schema and table names are lowercased automatically. Databricks treats them as case-insensitive identifiers.
  • Column names preserve the casing from your source data.
  • Special characters in identifiers are escaped automatically by the connector.
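
As a rough sketch of these rules, the helpers below lowercase schema and table names, preserve column casing, and quote every identifier. Databricks SQL delimits identifiers with backticks and escapes an embedded backtick by doubling it. Whether the connector escapes identifiers in exactly this way is an assumption, and the function names are hypothetical. The catalog name is passed through unchanged because the rules above do not cover it.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def table_identifier(catalog: str, schema: str, table: str) -> str:
    """Fully qualified table name with lowercased schema and table parts."""
    parts = (catalog, schema.lower(), table.lower())
    return ".".join(quote_identifier(part) for part in parts)


def column_identifier(name: str) -> str:
    """Column names keep the casing from the source data."""
    return quote_identifier(name)


print(table_identifier("main", "Sales", "Order Items"))
print(column_identifier("userId"))
```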

Namespace support

This destination supports namespaces. The namespace maps to a Databricks schema.
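
The namespace-to-schema mapping can be sketched as a small resolver. It assumes that a stream without a namespace is written to the Default Schema from the destination settings, and it applies the lowercasing described under naming conventions. The function name is hypothetical.

```python
from typing import Optional


def resolve_schema(namespace: Optional[str], default_schema: str) -> str:
    """Pick the Databricks schema for a stream."""
    # Assumption: an empty or missing namespace falls back to the default schema.
    chosen = namespace if namespace else default_schema
    return chosen.lower()


print(resolve_schema("CRM", "public"))
print(resolve_schema(None, "public"))
```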

Reference

Config fields reference

| Type | Property name |
| --- | --- |
| boolean | accept_terms |
| object | authentication |
| string | database |
| string | hostname |
| string | http_path |
| string | cdc_deletion_mode |
| string | port |
| boolean | purge_staging_data |
| string | schema |
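
If you create the destination through an API or a configuration file instead of the UI, a quick type check against the reference above can catch mistakes such as passing the port as a number. The sketch below is a minimal example with a hypothetical function name. It checks only the types of the properties that are present, because the reference does not state which properties are required. The contents of the `authentication` object are not listed in the reference either, so the example leaves it empty.

```python
from typing import Any, Dict, List

CONFIG_FIELD_TYPES = {
    "accept_terms": bool,
    "authentication": dict,
    "database": str,
    "hostname": str,
    "http_path": str,
    "cdc_deletion_mode": str,
    "port": str,
    "purge_staging_data": bool,
    "schema": str,
}


def config_type_errors(config: Dict[str, Any]) -> List[str]:
    """Return one message per property that is unknown or has the wrong type."""
    errors = []
    for name, value in config.items():
        expected = CONFIG_FIELD_TYPES.get(name)
        if expected is None:
            errors.append(f"{name}: unknown property")
        elif not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


config = {
    "accept_terms": True,
    "authentication": {},  # contents depend on the chosen method (not shown)
    "hostname": "example-workspace.cloud.databricks.com",  # placeholder
    "http_path": "/sql/1.0/warehouses/abc123",             # placeholder
    "port": "443",
    "database": "main",                                    # placeholder
    "schema": "default",
    "purge_staging_data": True,
}

print(config_type_errors(config))
print(config_type_errors({"port": 443}))
```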

Changelog

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 4.0.0 | 2026-06-29 | #80951 | Major rewrite: upgraded to Direct-Load architecture using the Bulk CDK |
| 3.3.8 | 2026-03-11 | #74732 | Add JDBC ConnectTimeout and SocketTimeout to prevent indefinite hangs when Databricks SQL warehouse is paused or unresponsive |
| 3.3.7 | 2025-07-15 | #63311 | Support arbitrary number of streams in findExisitngTable query |
| 3.3.6 | 2025-03-24 | #56355 | Upgrade to airbyte/java-connector-base:2.0.1 to be M4 compatible. |
| 3.3.5 | 2025-03-07 | #55232 | fix table name collision multiple connections same schema |
| 3.3.3 | 2025-01-10 | #51506 | Use a non root base image |
| 3.3.2 | 2024-12-18 | #49898 | Use a base image: airbyte/java-connector-base:1.0.0 |
| 3.3.1 | 2024-12-02 | #48779 | bump resource reqs for check |
| 3.3.0 | 2024-09-18 | #45438 | upgrade all dependencies. |
| 3.2.5 | 2024-09-12 | #45439 | Move to integrations section. |
| 3.2.4 | 2024-09-09 | #45208 | Fix CHECK to create missing namespace if not exists. |
| 3.2.3 | 2024-09-03 | #45115 | Clarify Unity Catalog Name option. |
| 3.2.2 | 2024-08-22 | #44941 | Clarify Unity Catalog Path option. |
| 3.2.1 | 2024-08-22 | #44506 | Handle uppercase/mixed-case stream name/namespaces |
| 3.2.0 | 2024-08-12 | #40712 | Rely solely on PAT, instead of also needing a user/pass |
| 3.1.0 | 2024-07-22 | #40692 | Support for refreshes and resumable full refresh. WARNING: You must upgrade to platform 0.63.7 before upgrading to this connector version. |
| 3.0.0 | 2024-07-12 | #40689 | (Private release, not to be used for production) Add _airbyte_generation_id column, and sync_id entry in _airbyte_meta |
| 2.0.0 | 2024-05-17 | #37613 | (Private release, not to be used for production) Alpha release of the connector to use Unity Catalog |
| 1.1.2 | 2024-04-04 | #36846 | (incompatible with CDK, do not use) Remove duplicate S3 Region |
| 1.1.1 | 2024-01-03 | #33924 | (incompatible with CDK, do not use) Add new ap-southeast-3 AWS region |
| 1.1.0 | 2023-06-02 | #26942 | Support schema evolution |
| 1.0.2 | 2023-04-20 | #25366 | Fix default catalog to be hive_metastore |
| 1.0.1 | 2023-03-30 | #24657 | Fix support for external tables on S3 |
| 1.0.0 | 2023-03-21 | #23965 | Added: Managed table storage type, Databricks Catalog field |
| 0.3.1 | 2022-10-15 | #18032 | Add SSL=1 to the JDBC URL to ensure SSL connection. |
| 0.3.0 | 2022-10-14 | #15329 | Add support for Azure storage. |
| | 2022-09-01 | #16243 | Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields) |
| 0.2.6 | 2022-08-05 | #14801 | Fix multiply log bindings |
| 0.2.5 | 2022-07-15 | #14494 | Make S3 output filename configurable. |
| 0.2.4 | 2022-07-14 | #14618 | Removed additionalProperties: false from JDBC destination connectors |
| 0.2.3 | 2022-06-16 | #13852 | Updated stacktrace format for any trace message errors |
| 0.2.2 | 2022-06-13 | #13722 | Rename to "Databricks Lakehouse". |
| 0.2.1 | 2022-06-08 | #13630 | Rename to "Databricks Delta Lake" and add field orders in the spec. |
| 0.2.0 | 2022-05-15 | #12861 | Use new public Databricks JDBC driver, and open source the connector. |
| 0.1.5 | 2022-05-04 | #12578 | In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging. |
| 0.1.4 | 2022-02-14 | #10256 | Add -XX:+ExitOnOutOfMemoryError JVM option |
| 0.1.3 | 2022-01-06 | #7622 #9153 | Upgrade Spark JDBC driver to 2.6.21 to patch Log4j vulnerability; update connector fields title/description. |
| 0.1.2 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.1 | 2021-10-05 | #6792 | Require users to accept Databricks JDBC Driver Terms & Conditions. |
| 0.1.0 | 2021-09-14 | #5998 | Initial private release. |