Auto Loader & Declarative Pipelines

Auto Loader

Simple explanation

Auto Loader is a smart mailroom for your data files.

New files land in a storage folder. Auto Loader detects them, processes them, and loads them into a Delta table — automatically, without duplicates, even if millions of files arrive. It uses cloud notifications (Azure Event Grid) to know when new files appear, rather than scanning the entire folder every time.

Auto Loader code pattern

# Auto Loader ingestion
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/schemas/raw_sales")
    .option("header", "true")
    .load("abfss://landing@storage.dfs.core.windows.net/sales/"))

# Write to Delta
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/raw_sales")
    .trigger(availableNow=True)
    .toTable("bronze.raw_sales"))

File discovery modes

Mode	How It Works	Best For
Directory listing (default)	Scans directory for new files	Small to medium file volumes
File notification	Azure Event Grid notifies on file creation	Millions of files, low latency

Auto Loader vs COPY INTO

Feature	Auto Loader	COPY INTO
File tracking	Checkpoint-based	Internal file tracking
Streaming	Yes (continuous or triggered)	No (batch only)
File discovery	Directory listing + Event Grid	Directory listing only
Scale	Millions of files	Thousands of files
Schema inference	Automatic with evolution	Manual or inferSchema

Exam default: Auto Loader is preferred over COPY INTO for most file ingestion scenarios due to better scalability and streaming support.

Lakeflow Spark Declarative Pipelines

Simple explanation

Declarative Pipelines are like a recipe card system for your kitchen.

Instead of writing step-by-step cooking instructions (imperative code), you describe what each dish should look like (declarative). The system figures out the cooking order, handles retries, and validates quality automatically.

You define your bronze, silver, and gold tables as declarations. The pipeline engine handles execution order, dependencies, and data quality checks.

SQL Declarative Pipeline

-- Bronze: ingest raw files with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_sales
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/sales/',
  'csv',
  map('header', 'true')
);

-- Silver: clean and validate
CREATE OR REFRESH STREAMING TABLE silver_sales (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_date EXPECT (sale_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT
  sale_id, customer_id, CAST(amount AS DECIMAL(10,2)) AS amount, sale_date
FROM STREAM(LIVE.bronze_sales);

-- Gold: aggregate for reporting
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
AS SELECT
  sale_date, SUM(amount) AS total_revenue, COUNT(*) AS txn_count
FROM LIVE.silver_sales
GROUP BY sale_date;

Key concepts

Concept	What It Does
STREAMING TABLE	Append-only table that processes incrementally
MATERIALIZED VIEW	Precomputed query result, refreshed automatically
LIVE.table_name	Reference to another table in the same pipeline
STREAM(LIVE.table)	Read streaming changes from an upstream table
Expectations	Data quality constraints (EXPECT, ON VIOLATION)

Pipeline expectations (data quality)

Violation Action	Behaviour
DROP ROW	Bad rows are silently removed
FAIL UPDATE	Pipeline fails if any row violates the constraint
(no action)	Bad rows are kept, violation is logged in metrics

Feature	Declarative Pipeline	Notebook Pipeline
Approach	Declare WHAT tables look like	Write HOW to build tables
Dependency management	Automatic	Manual (task order)
Data quality	Built-in expectations	Custom validation code
Error recovery	Automatic retry	Manual retry logic
Monitoring	Pipeline event log + metrics	Spark UI + custom logging
Best for	Standard medallion ETL	Complex custom logic

Question

What is Auto Loader and when should you use it over COPY INTO?

Click or press Enter to reveal answer

Answer

Auto Loader (cloudFiles) incrementally ingests new files using checkpoints and Event Grid notifications. Use over COPY INTO for millions of files, streaming support, and automatic schema evolution.

Click to flip back

Question

What are the three violation actions for pipeline expectations?

Click or press Enter to reveal answer

Answer

DROP ROW (remove bad rows silently), FAIL UPDATE (pipeline fails on any violation), no action (keep bad rows, log violation in metrics).

Click to flip back

Question

What is the difference between a STREAMING TABLE and a MATERIALIZED VIEW in Declarative Pipelines?

Click or press Enter to reveal answer

Answer

STREAMING TABLE: append-only, processes data incrementally. MATERIALIZED VIEW: precomputed query result, fully recomputed or incrementally refreshed. Use streaming tables for raw/cleaned data, materialized views for aggregates.

Click to flip back

Knowledge check

Knowledge Check

Mei Lin receives 50,000 new CSV files daily from Freshmart's 5,000 stores. She needs to ingest them incrementally with zero duplicates and automatic schema evolution. Which tool is best?

Next up: Cleansing & Profiling Data — data profiling, choosing column types, and handling duplicates and nulls.