Domain 3 β€” Module 7 of 10 70%
17 of 28 overall
Domain 3: Prepare and Process Data Free ⏱ ~14 min read

Auto Loader & Declarative Pipelines

Scalable file ingestion with Auto Loader and production-grade data flows with Lakeflow Spark Declarative Pipelines β€” the two tools that simplify lakehouse ETL.

Auto Loader

Simple explanation

Auto Loader is a smart mailroom for your data files.

New files land in a storage folder. Auto Loader detects them, processes them, and loads them into a Delta table β€” automatically, without duplicates, even if millions of files arrive. It uses cloud notifications (Azure Event Grid) to know when new files appear, rather than scanning the entire folder every time.

Auto Loader code pattern

# Auto Loader ingestion
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/schemas/raw_sales")
    .option("header", "true")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/"))

# Write to Delta
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/raw_sales")
    .trigger(availableNow=True)
    .toTable("bronze.raw_sales"))

File discovery modes

ModeHow It WorksBest For
Directory listing (default)Scans directory for new filesSmall to medium file volumes
File notificationAzure Event Grid notifies on file creationMillions of files, low latency

Auto Loader vs COPY INTO

FeatureAuto LoaderCOPY INTO
File trackingCheckpoint-basedInternal file tracking
StreamingYes (continuous or triggered)No (batch only)
File discoveryDirectory listing + Event GridDirectory listing only
ScaleMillions of filesThousands of files
Schema inferenceAutomatic with evolutionManual or inferSchema

Exam default: Auto Loader is preferred over COPY INTO for most file ingestion scenarios due to better scalability and streaming support.

Lakeflow Spark Declarative Pipelines

Simple explanation

Declarative Pipelines are like a recipe card system for your kitchen.

Instead of writing step-by-step cooking instructions (imperative code), you describe what each dish should look like (declarative). The system figures out the cooking order, handles retries, and validates quality automatically.

You define your bronze, silver, and gold tables as declarations. The pipeline engine handles execution order, dependencies, and data quality checks.

SQL Declarative Pipeline

-- Bronze: ingest raw files with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_sales
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/sales/',
  'csv',
  map('header', 'true')
);

-- Silver: clean and validate
CREATE OR REFRESH STREAMING TABLE silver_sales (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_date EXPECT (sale_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT
  sale_id, customer_id, CAST(amount AS DECIMAL(10,2)) AS amount, sale_date
FROM STREAM(LIVE.bronze_sales);

-- Gold: aggregate for reporting
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
AS SELECT
  sale_date, SUM(amount) AS total_revenue, COUNT(*) AS txn_count
FROM LIVE.silver_sales
GROUP BY sale_date;

Key concepts

ConceptWhat It Does
STREAMING TABLEAppend-only table that processes incrementally
MATERIALIZED VIEWPrecomputed query result, refreshed automatically
LIVE.table_nameReference to another table in the same pipeline
STREAM(LIVE.table)Read streaming changes from an upstream table
ExpectationsData quality constraints (EXPECT, ON VIOLATION)

Pipeline expectations (data quality)

Violation ActionBehaviour
DROP ROWBad rows are silently removed
FAIL UPDATEPipeline fails if any row violates the constraint
(no action)Bad rows are kept, violation is logged in metrics
FeatureDeclarative PipelineNotebook Pipeline
ApproachDeclare WHAT tables look likeWrite HOW to build tables
Dependency managementAutomaticManual (task order)
Data qualityBuilt-in expectationsCustom validation code
Error recoveryAutomatic retryManual retry logic
MonitoringPipeline event log + metricsSpark UI + custom logging
Best forStandard medallion ETLComplex custom logic
Question

What is Auto Loader and when should you use it over COPY INTO?

Click or press Enter to reveal answer

Answer

Auto Loader (cloudFiles) incrementally ingests new files using checkpoints and Event Grid notifications. Use over COPY INTO for millions of files, streaming support, and automatic schema evolution.

Click to flip back

Question

What are the three violation actions for pipeline expectations?

Click or press Enter to reveal answer

Answer

DROP ROW (remove bad rows silently), FAIL UPDATE (pipeline fails on any violation), no action (keep bad rows, log violation in metrics).

Click to flip back

Question

What is the difference between a STREAMING TABLE and a MATERIALIZED VIEW in Declarative Pipelines?

Click or press Enter to reveal answer

Answer

STREAMING TABLE: append-only, processes data incrementally. MATERIALIZED VIEW: precomputed query result, fully recomputed or incrementally refreshed. Use streaming tables for raw/cleaned data, materialized views for aggregates.

Click to flip back

Knowledge check

Knowledge Check

Mei Lin receives 50,000 new CSV files daily from Freshmart's 5,000 stores. She needs to ingest them incrementally with zero duplicates and automatic schema evolution. Which tool is best?


Next up: Cleansing & Profiling Data β€” data profiling, choosing column types, and handling duplicates and nulls.