Domain 4, Module 6 of 8: Deploy and Maintain Data Pipelines and Workloads

Monitoring Clusters & Troubleshooting

Monitor cluster consumption, troubleshoot Lakeflow Jobs, and diagnose Spark job failures: the operational skills that keep production running.

Monitoring cluster consumption

Simple explanation

Monitoring is reading the dashboard gauges while driving.

The gauges are speed (throughput), fuel level (cost), and engine temperature (resource usage). If you don’t check them, you run out of fuel (budget) or overheat (OOM errors) without warning.

Key metrics to monitor

| Metric | Where to find | What it tells you |
| --- | --- | --- |
| DBU consumption | Account console → Usage | Cost by workspace, cluster, job |
| CPU utilisation | Cluster UI → Metrics | Whether you’re under/over-provisioned |
| Memory usage | Cluster UI → Metrics | Risk of OOM errors |
| Spill to disk | Spark UI → Stages | Memory pressure (data doesn’t fit in RAM) |
| Job duration trends | Job run history | Performance degradation over time |
| Cluster idle time | Compute UI | Wasted spend on idle clusters |
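As a rough illustration of the "cost by job" view, the sketch below sums DBUs per job and ranks jobs by estimated spend. The usage rows, the job names other than nightly_etl, and the price per DBU are all made-up placeholders, not values from any real export or price list.

```python
from collections import defaultdict

# Hypothetical usage rows: (job_name, dbus_consumed). Real figures would come
# from a usage export; these numbers are invented for illustration.
usage_rows = [
    ("nightly_etl", 12.5),
    ("nightly_etl", 14.0),
    ("hourly_sync", 3.0),
    ("hourly_sync", 2.5),
    ("adhoc_notebook", 8.0),
]

ASSUMED_PRICE_PER_DBU = 0.30  # placeholder rate, not a real list price


def dbus_by_job(rows):
    """Sum DBUs per job name."""
    totals = defaultdict(float)
    for job_name, dbus in rows:
        totals[job_name] += dbus
    return dict(totals)


def estimated_cost(totals, price_per_dbu):
    """Convert DBU totals to estimated cost, highest spender first."""
    costs = {job: round(dbus * price_per_dbu, 2) for job, dbus in totals.items()}
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


totals = dbus_by_job(usage_rows)
ranking = estimated_cost(totals, ASSUMED_PRICE_PER_DBU)
for job, cost in ranking:
    print(f"{job}: {totals[job]} DBUs, approx. ${cost}")
```

Ranking by spend first makes it easier to decide which job's cluster metrics deserve a closer look.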

Cost optimisation actions

| Issue | Fix |
| --- | --- |
| High idle time | Reduce auto-termination timeout |
| Over-provisioned (low CPU) | Reduce worker count or node size |
| Under-provisioned (high spill) | Increase memory or worker count |
| Expensive always-on clusters | Switch to job compute or serverless |
| Dev clusters running overnight | Set auto-termination to 30 min |
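The issue-to-fix mapping above can be sketched as a small rule function. The `ClusterStats` fields and every threshold below are hypothetical choices for illustration; sensible cut-offs depend on the workload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterStats:
    avg_cpu_pct: float    # average CPU utilisation, 0-100
    spill_gb: float       # data spilled to disk across recent runs
    idle_fraction: float  # share of uptime with no work, 0-1
    always_on: bool       # True for a long-lived interactive cluster


# Illustrative thresholds only (assumptions, not recommended values).
LOW_CPU_PCT = 30.0
HIGH_SPILL_GB = 10.0
HIGH_IDLE_FRACTION = 0.4


def recommend(stats: ClusterStats) -> List[str]:
    """Map observed metrics to the fixes listed in the table."""
    actions = []
    if stats.idle_fraction > HIGH_IDLE_FRACTION:
        actions.append("Reduce auto-termination timeout")
    if stats.spill_gb > HIGH_SPILL_GB:
        # Heavy spill signals memory pressure, so never shrink in this case.
        actions.append("Increase memory or worker count")
    elif stats.avg_cpu_pct < LOW_CPU_PCT:
        actions.append("Reduce worker count or node size")
    if stats.always_on:
        actions.append("Switch to job compute or serverless")
    return actions


print(recommend(ClusterStats(avg_cpu_pct=15.0, spill_gb=0.0,
                             idle_fraction=0.6, always_on=True)))
```

Spill is checked before low CPU so that a memory-starved cluster is not downsized just because its CPUs look quiet.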

Troubleshooting Lakeflow Jobs

Repair runs

When a job fails, you don’t have to re-run everything:

Job: nightly_etl (5 tasks)
  ✅ ingest_crm       (completed)
  ✅ ingest_pos       (completed)
  ✅ clean_data       (completed)
  ❌ build_reports    (FAILED: OOM error)
  ⏭️ notify_team      (skipped)

A repair run re-runs only build_reports and notify_team; the three successful tasks are not repeated.
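To make the selection rule concrete, here is a minimal sketch that works out which tasks a repair would cover: the failed tasks plus everything downstream of them. The dependency edges are an assumption for illustration, since the job definition itself is not shown, and the function models the idea rather than the scheduler's actual implementation.

```python
# Assumed task dependencies for nightly_etl (hypothetical edges).
depends_on = {
    "ingest_crm": [],
    "ingest_pos": [],
    "clean_data": ["ingest_crm", "ingest_pos"],
    "build_reports": ["clean_data"],
    "notify_team": ["build_reports"],
}

# Task states taken from the failed run shown above.
status = {
    "ingest_crm": "completed",
    "ingest_pos": "completed",
    "clean_data": "completed",
    "build_reports": "failed",
    "notify_team": "skipped",
}


def tasks_to_repair(depends_on, status):
    """Return failed tasks plus every task downstream of a failed task."""
    to_rerun = {task for task, state in status.items() if state == "failed"}
    changed = True
    while changed:  # keep propagating until no new downstream task is added
        changed = False
        for task, parents in depends_on.items():
            if task not in to_rerun and any(p in to_rerun for p in parents):
                to_rerun.add(task)
                changed = True
    # Report tasks in the job's declaration order.
    return [task for task in depends_on if task in to_rerun]


print(tasks_to_repair(depends_on, status))
# prints ['build_reports', 'notify_team']
```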

Common job failures

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Task timeout | Query too slow, data too large | Increase timeout, optimize query, add nodes |
| OOM (Out of Memory) | Data doesn’t fit in memory | Increase node memory, reduce partition size, use disk-based operations |
| Cluster start failure | Quota exceeded, region capacity | Try a different node type or region |
| Source unavailable | Network/auth issue | Check connectivity, rotate expired credentials |
| Concurrent run conflict | Previous run still active | Set max concurrent runs to 1 |
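Several of these fixes are job settings rather than code changes. The sketch below shows a settings dictionary with a timeout, a retry policy, and a concurrency limit, plus a small lint helper. The field names follow the Jobs API as commonly documented (`max_concurrent_runs`, `timeout_seconds`, `max_retries`, `min_retry_interval_millis`, `retry_on_timeout`), but treat them as assumptions to verify against the current reference; the values and the `lint_job_settings` helper are hypothetical.

```python
# Illustrative job settings; all values are placeholders.
job_settings = {
    "name": "nightly_etl",
    "max_concurrent_runs": 1,         # prevents overlap with an active run
    "timeout_seconds": 2 * 60 * 60,   # job-level limit of two hours
    "tasks": [
        {
            "task_key": "build_reports",
            "timeout_seconds": 45 * 60,
            "max_retries": 2,                     # retry transient failures
            "min_retry_interval_millis": 60_000,  # wait a minute between tries
            "retry_on_timeout": False,            # a slow query will not get faster
        }
    ],
}


def lint_job_settings(settings):
    """Hypothetical helper: flag settings that invite the failures above."""
    warnings = []
    if settings.get("max_concurrent_runs", 1) > 1:
        warnings.append("concurrent runs allowed: overlapping runs may conflict")
    if not settings.get("timeout_seconds"):
        warnings.append("no job-level timeout: a hung run can burn compute")
    for task in settings.get("tasks", []):
        if task.get("max_retries", 0) == 0:
            warnings.append(f"{task['task_key']}: no retries for transient failures")
    return warnings


print(lint_job_settings(job_settings))
```

Retries help with transient problems such as a briefly unavailable source; they do not fix an OOM error, which needs more memory or a different query plan.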

Job operations

| Action | When to use |
| --- | --- |
| Run | Start a new execution |
| Repair | Re-run only the failed tasks (and their downstream dependents) from a failed run |
| Restart | Cancel current run and start fresh |
| Stop/Cancel | Stop a running execution |

Troubleshooting Spark jobs

Common Spark issues

| Issue | Symptom | Investigation |
| --- | --- | --- |
| Slow stage | One stage takes much longer | Check Spark UI → Stages for skew |
| OOM error | Driver or executor out of memory | Reduce collect() calls, increase memory |
| Job hangs | Progress stops, no errors | Check for deadlocks, broadcast timeout |
| Data skew | One task processes much more data | Check Spark UI → Task metrics for uneven distribution |
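"Uneven distribution" can be put into numbers by comparing the largest task's input with the median task's input. The sketch below assumes you have copied per-task input sizes out of the Spark UI by hand; the sample values and the threshold of 5 are illustrative assumptions.

```python
from statistics import median

ASSUMED_SKEW_THRESHOLD = 5.0  # illustrative cut-off, not a Spark default


def skew_ratio(task_bytes):
    """Ratio of the largest task's input to the median task's input."""
    mid = median(task_bytes)
    if mid == 0:
        # Avoid dividing by zero when most tasks read nothing.
        return float("inf") if max(task_bytes) > 0 else 1.0
    return max(task_bytes) / mid


def is_skewed(task_bytes, threshold=ASSUMED_SKEW_THRESHOLD):
    return skew_ratio(task_bytes) > threshold


# Hypothetical per-task input sizes in MB for one stage.
balanced = [100, 110, 95, 105, 98]
skewed = [100, 110, 95, 105, 4000]

print(is_skewed(balanced), is_skewed(skewed))
# prints False True
```

The median is used rather than the mean because a single huge task inflates the mean and hides the very skew being measured.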

Cluster restart for recovery

Sometimes the simplest fix is a cluster restart:

  • When: persistent driver issues, memory leaks, corrupt state
  • How: Stop and restart the cluster (or let auto-termination handle it)
  • Caution: streaming jobs lose in-flight micro-batch state, but checkpoints let them resume without data loss

Question

What is a repair run and when should you use it?

Answer

A repair run re-executes only failed tasks and their downstream dependents from a failed job run. Use it to avoid re-running successful tasks, saving time and compute cost.

Question

What are the top cost optimisation actions for Databricks clusters?

Answer

Reduce auto-termination timeout (idle clusters), right-size nodes (match CPU/memory to workload), switch to job compute for scheduled work, and shut down dev clusters outside business hours.

Question

What causes an Out of Memory (OOM) error in Spark?

Answer

Data doesn't fit in executor or driver memory. Common causes: collect() pulling too much data to the driver, large broadcast joins, too few partitions, or skewed data. Fixes: increase memory, avoid or limit collect(), repartition.

Knowledge check

Ravi's nightly ETL job at DataPulse failed on task 4 of 5. Tasks 1-3 completed successfully and produced correct output. What is the most efficient way to recover?


Next up: Spark Performance: DAG & Query Profile, investigating caching, skew, spilling, and shuffle issues.