Domain 4 โ€” Module 1 of 8 13%
21 of 28 overall
Domain 4: Deploy and Maintain Data Pipelines and Workloads Free โฑ ~14 min read

Building Data Pipelines

Design order of operations, choose between notebook and Declarative Pipelines, implement task logic, error handling, and build production pipelines.

Designing pipelines

Simple explanation

A data pipeline is an assembly line for data.

Raw materials arrive (ingestion), get cleaned (bronze โ†’ silver), assembled into products (silver โ†’ gold), and shipped (to dashboards). You design the order of stations, decide which machines to use (notebooks or Declarative Pipelines), and plan what happens when something breaks (error handling).

Notebook vs Declarative Pipelines

AspectNotebook PipelineDeclarative Pipeline
ApproachImperative โ€” you code each stepDeclarative โ€” you define desired state
Dependency managementManual (Lakeflow Jobs task graph)Automatic (inferred from LIVE references)
Error handlingtry/except in codeBuilt-in retry + expectations
Data qualityCustom validation codeBuilt-in expectations (EXPECT)
ComputeAny cluster typeDedicated pipeline compute (serverless default)
MonitoringSpark UI + custom loggingPipeline event log + metrics dashboard
Best forComplex logic, ML integration, custom APIsStandard medallion ETL
Exam preferenceWhen 'custom logic' is mentionedWhen 'managed' or 'automated quality' is mentioned

Notebook pipeline with Lakeflow Jobs

A notebook pipeline chains multiple notebooks as tasks in a Lakeflow Job:

Task 1: ingest_raw       โ†’  Task 2: clean_validate  โ†’  Task 3: build_gold
(bronze layer)                (silver layer)              (gold layer)
                                                            โ†˜
                                                   Task 4: update_dashboard

Precedence constraints

Tasks can have dependencies: Task 3 only runs if Tasks 1 and 2 succeed.

# In a notebook: return an output value to the job
dbutils.notebook.exit("row_count=15000")  # returns a value; task succeeds on normal completion

# Or raise an error to fail the task
if error_count > threshold:
    raise Exception(f"Data quality failed: {error_count} errors exceeded threshold")

Error handling in notebooks

try:
    # Main processing logic
    df = spark.read.table("bronze.raw_orders")
    clean_df = transform_orders(df)
    clean_df.write.mode("append").saveAsTable("silver.orders")

except Exception as e:
    # Log the error
    print(f"Pipeline failed: {str(e)}")
    # Optionally write to an error table
    spark.sql(f"INSERT INTO pipeline_errors VALUES (CURRENT_TIMESTAMP(), '{str(e)}')")
    # Re-raise to fail the job task
    raise

Declarative Pipeline implementation

-- Complete medallion pipeline in Declarative Pipelines

-- Bronze: Auto Loader ingestion
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM cloud_files(
  'abfss://landing@storage.dfs.core.windows.net/orders/',
  'json'
);

-- Silver: cleaned with quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT
  CAST(order_id AS BIGINT) AS order_id,
  customer_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  TO_DATE(order_date) AS order_date
FROM STREAM(LIVE.bronze_orders);

-- Gold: materialized view for dashboards
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
AS SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue
FROM LIVE.silver_orders
GROUP BY order_date;

The pipeline engine automatically:

  • Determines execution order from LIVE. references
  • Handles incremental processing (streaming tables)
  • Enforces expectations and logs quality metrics
  • Retries on transient failures
Exam decision tree: notebook or declarative?
  1. Standard bronze โ†’ silver โ†’ gold ETL? โ†’ Declarative Pipeline
  2. Need custom Python ML models in the pipeline? โ†’ Notebook
  3. Need built-in data quality metrics? โ†’ Declarative Pipeline
  4. Need to call external APIs during processing? โ†’ Notebook
  5. Want automatic dependency management? โ†’ Declarative Pipeline
  6. Need complex control flow (if/else branching)? โ†’ Notebook
Question

When should you use a notebook pipeline vs a Declarative Pipeline?

Click or press Enter to reveal answer

Answer

Notebook: complex logic, ML integration, external APIs, branching control flow. Declarative: standard medallion ETL, automatic dependencies, built-in quality expectations, managed compute.

Click to flip back

Question

How do Declarative Pipelines handle dependency ordering?

Click or press Enter to reveal answer

Answer

Automatically โ€” the engine infers dependencies from LIVE. references. If silver_orders reads from LIVE.bronze_orders, the engine runs bronze first. No manual dependency graph needed.

Click to flip back

Question

How should you handle errors in notebook pipelines?

Click or press Enter to reveal answer

Answer

Use try/except blocks, log errors to an error table, and re-raise the exception to fail the job task. Use dbutils.notebook.exit('SUCCESS') to signal success for downstream task dependencies.

Click to flip back

Knowledge check

Knowledge Check

Dr. Sarah Okafor needs to build a standard bronze โ†’ silver โ†’ gold ETL pipeline for Athena Group. The pipeline should have built-in data quality checks and automatic dependency management. Which approach should she use?


Next up: Lakeflow Jobs: Create & Configure โ€” creating, configuring, and triggering Lakeflow Jobs.