Ingesting Data: Lakeflow Connect & Notebooks

Two paths to ingestion

Simple explanation

Lakeflow Connect is like a pre-built plumbing kit. Notebooks are like custom plumbing you build yourself.

Lakeflow Connect: pick your source (Salesforce, SAP, databases), configure connection details, and data flows automatically. No coding — just configuration.

Notebooks: write Python or SQL code to read data, transform it, and write it to your lakehouse. Full control, but you build and maintain everything.

Lakeflow Connect

Batch ingestion with Lakeflow Connect

Ravi uses Lakeflow Connect to ingest CRM data from Salesforce:

1. Create a connection: specify the source system and credentials
2. Configure ingestion: select tables, choose full or incremental sync
3. Schedule: set the refresh cadence (hourly, daily)
4. Monitor: track ingestion status in the Lakeflow dashboard

Lakeflow Connect automatically handles schema mapping, type conversion, and incremental extraction using watermark columns or change tracking.
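
The watermark idea mentioned above can be illustrated in plain Python. This is a conceptual sketch only, not how Lakeflow Connect is implemented: the row layout, the `modified_at` column, and the `extract_incremental` helper are all hypothetical names chosen for illustration.

```python
from datetime import datetime

def extract_incremental(rows, last_watermark, watermark_col="modified_at"):
    """Return rows changed after last_watermark, plus the new watermark."""
    new_rows = [r for r in rows if r[watermark_col] > last_watermark]
    # Advance the watermark only if something new arrived.
    new_watermark = max((r[watermark_col] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Hypothetical source rows (for example, CRM accounts)
source = [
    {"id": 1, "name": "Acme", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Globex", "modified_at": datetime(2024, 1, 3)},
    {"id": 3, "name": "Initech", "modified_at": datetime(2024, 1, 5)},
]

batch, watermark = extract_incremental(source, datetime(2024, 1, 2))
print([r["id"] for r in batch])  # [2, 3]
print(watermark)                 # 2024-01-05 00:00:00
```

Each sync passes in the watermark saved by the previous run, so only rows modified since then are pulled. A second run with the returned watermark would extract nothing until the source changes again.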

Streaming ingestion with Lakeflow Connect

For sources that support change streams (databases with CDC enabled), Lakeflow Connect can stream changes continuously:

Source sends changes → Lakeflow Connect reads the change stream
Writes to Delta table in near-real-time
Handles schema evolution — new columns in the source are automatically added
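
To make the change-stream flow concrete, here is a toy sketch in plain Python of applying ordered change events to a target keyed by primary key. The event shape (`op` and `row` fields) and the `apply_changes` helper are assumptions made for illustration; the actual format depends on the source system and is handled internally by the connector.

```python
def apply_changes(table, events, key="id"):
    """Apply ordered change events to a dict keyed by primary key."""
    for event in events:
        k = event["row"][key]
        if event["op"] == "delete":
            table.pop(k, None)
        else:
            # "insert" or "update": keep the latest image of the row
            table[k] = event["row"]
    return table

# Hypothetical target state and change events
target = {1: {"id": 1, "status": "new"}}
events = [
    {"op": "update", "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "row": {"id": 2, "status": "new"}},
    {"op": "delete", "row": {"id": 2}},
]

apply_changes(target, events)
print(target)  # {1: {'id': 1, 'status': 'shipped'}}
```

Order matters here: the insert and delete for row 2 cancel out only because they are applied in the sequence the source emitted them.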

Notebook-based ingestion

Batch ingestion with notebooks

# Read CSV files from ADLS landing zone
raw_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")  # Spark makes an extra pass over the files to guess column types
    .load("abfss://landing@storage.dfs.core.windows.net/sales/*.csv"))

# Write to Delta table
(raw_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.raw_sales"))

Streaming ingestion with notebooks

# Read the bronze Delta table as a stream (new rows are picked up incrementally)
stream_df = (spark.readStream
    .format("delta")
    .table("bronze.raw_transactions"))

# Transform and write as a streaming query
(stream_df
    .filter("amount > 0")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/silver_txn")
    .toTable("silver.valid_transactions"))

Key concept: Streaming uses readStream and writeStream instead of read and write. The checkpoint location records which data has already been processed, so a restarted query resumes where it left off instead of reprocessing everything.
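
The role of the checkpoint can be sketched without Spark. The toy below stores a single offset in a JSON file; the file layout and the `process_new_records` helper are hypothetical, and real Structured Streaming checkpoints hold more than one number, but the resume-from-last-position behaviour is the same idea.

```python
import json
import os
import tempfile

def process_new_records(log, checkpoint_path):
    """Process only records past the saved offset, then save the new offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new_records = log[offset:]
    # Mirrors the "amount > 0" filter in the streaming query above
    valid = [r for r in new_records if r["amount"] > 0]
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(log)}, f)
    return valid

checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")
log = [{"amount": 10}, {"amount": -5}]

first = process_new_records(log, checkpoint)
log.append({"amount": 7})
second = process_new_records(log, checkpoint)

print(first)   # [{'amount': 10}]
print(second)  # [{'amount': 7}]
```

The second call sees only the record appended after the first run. Deleting the checkpoint file would make the next call start from the beginning of the log again.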

Feature	Lakeflow Connect	Notebooks
Setup effort	Low (configuration)	High (code + testing)
Custom logic	Limited	Unlimited
Error handling	Built-in retries	You implement
Schema evolution	Automatic	Manual or with mergeSchema
Monitoring	Lakeflow dashboard	Spark UI + custom logging
Best for	Standard sources, quick setup	Complex transforms, custom sources

Exam tip: Schema evolution in notebook ingestion

When a source schema changes (for example, new columns are added), a notebook append can fail because Delta Lake enforces the target table's schema on write. Enable schema evolution on the write:

# Allow new columns to be added automatically
df.write.option("mergeSchema", "true").mode("append").saveAsTable("my_table")

Or enable schema merging session-wide:

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
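
What schema merging does to a table can be shown with a small plain-Python model. This is an illustrative sketch, not Delta Lake's implementation; the `append_with_merge` helper and the `sku`, `qty`, and `discount` columns are made up for the example.

```python
def append_with_merge(table_columns, table_rows, new_rows):
    """Append rows, adding unseen columns and filling gaps with None."""
    columns = list(table_columns)
    for row in new_rows:
        for col in row:
            if col not in columns:
                columns.append(col)  # new columns go at the end
    merged = [{c: r.get(c) for c in columns} for r in table_rows + new_rows]
    return columns, merged

cols, rows = append_with_merge(
    ["sku", "qty"],
    [{"sku": "A1", "qty": 3}],
    [{"sku": "B2", "qty": 1, "discount": 0.1}],
)

print(cols)     # ['sku', 'qty', 'discount']
print(rows[0])  # {'sku': 'A1', 'qty': 3, 'discount': None}
```

Existing rows are not rewritten with real values for the new column; they simply read as null for it, which is why downstream queries should tolerate nulls after a schema change.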

Exam tip: If the question mentions “source schema changes” or “new columns added” — mergeSchema is the answer.

Question

What is the difference between Lakeflow Connect and notebook-based ingestion?

Answer

Lakeflow Connect: low-code, pre-built connectors, automatic schema handling. Notebooks: full code control, unlimited custom logic, but you build and maintain everything.

Question

What makes streaming different from batch in notebook code?

Answer

Streaming uses readStream/writeStream (not read/write), requires a checkpoint location for tracking processed data, and runs continuously or with trigger intervals.

Question

How do you handle schema evolution during notebook ingestion?

Answer

Use .option('mergeSchema', 'true') on the write operation. This allows new columns from the source to be automatically added to the target Delta table.

Knowledge check

Mei Lin is ingesting data from 15 different Freshmart suppliers. Each supplier sends daily CSV files to an ADLS landing zone. Some suppliers occasionally add new columns. What ingestion approach handles this best?

Next up: Ingesting Data: SQL Methods & CDC — CTAS, CREATE OR REPLACE, COPY INTO, and change data capture feeds.