Domain 3: Prepare and Process Data

Ingesting Data: Lakeflow Connect & Notebooks

Ingest data using Lakeflow Connect's pre-built connectors and custom notebook code: batch and streaming patterns for getting data into your lakehouse.

Two paths to ingestion

Simple explanation

Lakeflow Connect is like a pre-built plumbing kit. Notebooks are like custom plumbing you build yourself.

Lakeflow Connect: pick your source (Salesforce, SAP, databases), configure connection details, and data flows automatically. No coding, just configuration.

Notebooks: write Python or SQL code to read data, transform it, and write it to your lakehouse. Full control, but you build and maintain everything.

Lakeflow Connect

Batch ingestion with Lakeflow Connect

Ravi uses Lakeflow Connect to ingest CRM data from Salesforce:

  1. Create a connection: specify the source system and credentials
  2. Configure ingestion: select tables, choose full or incremental sync
  3. Schedule: set the refresh cadence (hourly, daily)
  4. Monitor: track ingestion status in the Lakeflow dashboard

Lakeflow Connect automatically handles schema mapping, type conversion, and incremental extraction using watermark columns or change tracking.
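
The watermark idea behind incremental extraction can be illustrated in plain Python. This is a minimal sketch of the general technique, not Lakeflow Connect's actual implementation; the `updated_at` column, the sample rows, and the `extract_incremental` helper are all assumptions made for illustration.

```python
from datetime import datetime


def extract_incremental(rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    rows: list of dicts, each with a hypothetical 'updated_at' datetime column.
    """
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark only if something new arrived.
    new_watermark = max(
        (r["updated_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark


# Illustrative source rows (made-up data).
source = [
    {"id": 1, "name": "Acme", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Globex", "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "name": "Initech", "updated_at": datetime(2024, 1, 5)},
]

# Only rows modified after the stored watermark are extracted.
batch, watermark = extract_incremental(source, datetime(2024, 1, 2))
print([r["id"] for r in batch], watermark.date())  # [2, 3] 2024-01-05
```

A full sync would ignore the watermark and reload every row. Note that a strict "greater than" comparison can miss rows that share the exact watermark timestamp, which is one reason change tracking is often preferred when the source supports it.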

Streaming ingestion with Lakeflow Connect

For sources that support change streams (databases with CDC enabled), Lakeflow Connect can stream changes continuously:

  • Source sends changes → Lakeflow Connect reads the change stream
  • Writes to Delta table in near-real-time
  • Handles schema evolution: new columns in the source are automatically added
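
The flow in the list above can be pictured with a small plain-Python sketch. It is only a conceptual model under stated assumptions: the event format (`op`, `row`, `id`), the in-memory "table", and the `apply_changes` helper are hypothetical and do not reflect how Lakeflow Connect is implemented.

```python
def apply_changes(table, columns, changes):
    """Apply change events to an in-memory table keyed by primary key.

    table:   dict mapping id -> row dict
    columns: list of known column names (extended when new columns appear)
    changes: events such as {"op": "upsert", "row": {...}}
             or {"op": "delete", "id": ...}
    """
    for change in changes:
        if change["op"] == "delete":
            table.pop(change["id"], None)
            continue
        row = change["row"]
        # Schema evolution: register unseen columns and backfill
        # existing rows with None.
        for col in row:
            if col not in columns:
                columns.append(col)
                for existing in table.values():
                    existing.setdefault(col, None)
        # Upsert: start from the current row (if any), then overlay the change.
        merged = {c: None for c in columns}
        merged.update(table.get(row["id"], {}))
        merged.update(row)
        table[row["id"]] = merged
    return table


table = {}
columns = ["id", "amount"]
events = [
    {"op": "upsert", "row": {"id": 1, "amount": 10}},
    {"op": "upsert", "row": {"id": 2, "amount": 25}},
    # The source added a 'currency' column.
    {"op": "upsert", "row": {"id": 1, "amount": 12, "currency": "USD"}},
    {"op": "delete", "id": 2},
]
apply_changes(table, columns, events)
print(columns)
print(table)
```

After the events are applied, the table holds only row 1 with its latest values, and the column list has grown to include `currency`.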

Notebook-based ingestion

Batch ingestion with notebooks

# Read CSV files from ADLS landing zone
raw_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/*.csv"))

# Write to Delta table
(raw_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.raw_sales"))

Streaming ingestion with notebooks

# Read the bronze Delta table as a streaming source
stream_df = (spark.readStream
    .format("delta")
    .table("bronze.raw_transactions"))

# Keep rows with a positive amount and write them to silver as a streaming query
(stream_df
    .filter("amount > 0")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/silver_txn")
    .toTable("silver.valid_transactions"))

Key concept: Streaming uses readStream and writeStream instead of read and write. The checkpoint location tracks what data has been processed.
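
To make the role of the checkpoint concrete, here is a simplified plain-Python analogy. It is a sketch, not how Structured Streaming works internally: a real checkpoint directory stores more than a single offset, and the JSON file layout and the `process_new_records` helper are assumptions for illustration.

```python
import json
import os
import tempfile


def process_new_records(records, checkpoint_path):
    """Process only the records past the offset saved in the checkpoint file."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new_records = records[offset:]
    # Same rule as the streaming example: keep positive amounts only.
    valid = [r for r in new_records if r["amount"] > 0]
    # Persist progress so a restart resumes where this run stopped.
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(records)}, f)
    return valid


checkpoint = os.path.join(tempfile.mkdtemp(), "silver_txn.json")
source = [{"amount": 5}, {"amount": -1}]

first = process_new_records(source, checkpoint)   # sees both records
source.append({"amount": 7})
second = process_new_records(source, checkpoint)  # sees only the new record
print(first, second)
```

Without the saved offset, the second run would reprocess the first two records and write duplicates. That is the failure a checkpoint location is meant to prevent.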

Feature | Lakeflow Connect | Notebooks
Setup effort | Low (configuration) | High (code + testing)
Custom logic | Limited | Unlimited
Error handling | Built-in retries | You implement
Schema evolution | Automatic | Manual or with mergeSchema
Monitoring | Lakeflow dashboard | Spark UI + custom logging
Best for | Standard sources, quick setup | Complex transforms, custom sources

Exam tip: Schema evolution in notebook ingestion

When a source adds new columns, appending to an existing Delta table fails by default because Delta enforces the table's schema. Enable schema evolution on the write:

# Allow new columns to be added automatically
df.write.option("mergeSchema", "true").mode("append").saveAsTable("my_table")

Or enable schema merging session-wide:

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

Exam tip: If the question mentions "source schema changes" or "new columns added", mergeSchema is the answer.
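
The effect of mergeSchema on an append can be mimicked in plain Python. This is a conceptual sketch only: the `append_rows` helper, its `merge_schema` flag, and the sample rows are hypothetical, and it covers added columns, not type changes.

```python
def append_rows(table_columns, table_rows, new_rows, merge_schema=False):
    """Append new_rows to a table, optionally widening its schema.

    Unknown columns are rejected unless merge_schema is True.
    """
    incoming = []
    for row in new_rows:
        for col in row:
            if col not in table_columns and col not in incoming:
                incoming.append(col)
    if incoming and not merge_schema:
        raise ValueError(f"schema mismatch, unexpected columns: {incoming}")
    columns = table_columns + incoming
    # Old rows get None for new columns; new rows get None for columns they lack.
    rows = [{c: r.get(c) for c in columns} for r in table_rows + new_rows]
    return columns, rows


cols = ["sku", "qty"]
rows = [{"sku": "A1", "qty": 3}]
new = [{"sku": "B2", "qty": 1, "origin": "NZ"}]  # supplier added 'origin'

try:
    append_rows(cols, rows, new)  # schema enforcement rejects the append
except ValueError as err:
    print("append failed:", err)

cols, rows = append_rows(cols, rows, new, merge_schema=True)
print(cols)
print(rows)
```

With merging enabled, the table gains the `origin` column and the older row is backfilled with an empty value for it.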

Question

What is the difference between Lakeflow Connect and notebook-based ingestion?

Answer

Lakeflow Connect: low-code, pre-built connectors, automatic schema handling. Notebooks: full code control, unlimited custom logic, but you build and maintain everything.

Question

What makes streaming different from batch in notebook code?

Answer

Streaming uses readStream/writeStream (not read/write), requires a checkpoint location for tracking processed data, and runs continuously or with trigger intervals.

Question

How do you handle schema evolution during notebook ingestion?

Answer

Use .option('mergeSchema', 'true') on the write operation. This allows new columns from the source to be automatically added to the target Delta table.

Knowledge check

Mei Lin is ingesting data from 15 different Freshmart suppliers. Each supplier sends daily CSV files to an ADLS landing zone. Some suppliers occasionally add new columns. What ingestion approach handles this best?


Next up: Ingesting Data: SQL Methods & CDC (CTAS, CREATE OR REPLACE, COPY INTO, and change data capture feeds).