Monitoring Clusters & Troubleshooting

Monitoring cluster consumption

Simple explanation

Monitoring is like reading the dashboard gauges while driving.

The gauges are speed (throughput), fuel level (cost), and engine temperature (resource usage). If you don’t check them, you run out of fuel (budget) or overheat (OOM errors) without warning.

Key metrics to monitor

Metric	Where to Find	What It Tells You
DBU consumption	Account console → Usage	Cost by workspace, cluster, job
CPU utilisation	Cluster UI → Metrics	Whether you’re under/over-provisioned
Memory usage	Cluster UI → Metrics	Risk of OOM errors
Spill to disk	Spark UI → Stages	Memory pressure (data doesn’t fit in RAM)
Job duration trends	Job run history	Performance degradation over time
Cluster idle time	Compute UI	Wasted spend on idle clusters

Cost optimisation actions

Issue	Fix
High idle time	Reduce auto-termination timeout
Over-provisioned (low CPU)	Reduce worker count or node size
Under-provisioned (high spill)	Increase memory or worker count
Expensive always-on clusters	Switch to job compute or serverless
Dev clusters running overnight	Set auto-termination to 30 min
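
The mapping from symptom to fix in the table above can be sketched as a small rule-based check. This is only an illustration: the `ClusterStats` fields, the thresholds, and the cluster name are hypothetical and would need tuning against your own metrics.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical summary of one cluster's metrics over a review window."""
    name: str
    avg_cpu_pct: float   # average CPU utilisation, 0-100
    spill_gb: float      # total spill to disk seen in the Spark UI
    idle_hours: float    # hours the cluster ran with no active work
    always_on: bool      # cluster is never terminated
    runs_only_scheduled_jobs: bool = False


# Illustrative thresholds only (assumptions, not Databricks defaults).
LOW_CPU_PCT = 30.0
HIGH_SPILL_GB = 10.0
HIGH_IDLE_HOURS = 2.0


def suggest_actions(stats: ClusterStats) -> list[str]:
    """Map observed metrics to the fixes listed in the table above."""
    actions = []
    if stats.idle_hours > HIGH_IDLE_HOURS:
        actions.append("Reduce auto-termination timeout")
    if stats.spill_gb > HIGH_SPILL_GB:
        # Heavy spill means memory pressure, so never recommend shrinking here.
        actions.append("Increase memory or worker count")
    elif stats.avg_cpu_pct < LOW_CPU_PCT:
        actions.append("Reduce worker count or node size")
    if stats.always_on and stats.runs_only_scheduled_jobs:
        actions.append("Switch to job compute or serverless")
    return actions


if __name__ == "__main__":
    dev = ClusterStats("dev-shared", avg_cpu_pct=12.0, spill_gb=0.0,
                       idle_hours=9.5, always_on=False)
    for action in suggest_actions(dev):
        print(action)
```

Spill is checked before CPU on purpose: a cluster can show low average CPU while still being short of memory, and shrinking it would make the spill worse.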

Troubleshooting Lakeflow Jobs

Repair runs

When a job fails, you don’t have to re-run everything:

Job: nightly_etl (5 tasks)
  ✅ ingest_crm       (completed)
  ✅ ingest_pos        (completed)
  ✅ clean_data        (completed)
  ❌ build_reports     (FAILED — OOM error)
  ⏭️ notify_team      (skipped)

A repair run re-runs only build_reports and notify_team. The three successful tasks are not repeated.
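
To make the selection rule concrete, here is a minimal sketch of how the set of tasks to re-run could be derived: start from the failed tasks and walk downstream. The dependency graph is an assumption (the example above does not show one), and `tasks_to_repair` is a hypothetical helper, not part of any Databricks API.

```python
from collections import deque

# Task statuses from the nightly_etl example above.
statuses = {
    "ingest_crm": "SUCCESS",
    "ingest_pos": "SUCCESS",
    "clean_data": "SUCCESS",
    "build_reports": "FAILED",
    "notify_team": "SKIPPED",
}

# ASSUMPTION: one plausible dependency graph; the source does not specify it.
depends_on = {
    "ingest_crm": [],
    "ingest_pos": [],
    "clean_data": ["ingest_crm", "ingest_pos"],
    "build_reports": ["clean_data"],
    "notify_team": ["build_reports"],
}


def tasks_to_repair(statuses, depends_on):
    """Return failed tasks plus everything downstream of them, sorted by name."""
    # Invert the graph: task -> tasks that depend on it.
    dependents = {task: [] for task in depends_on}
    for task, parents in depends_on.items():
        for parent in parents:
            dependents[parent].append(task)

    to_rerun = set()
    queue = deque(t for t, s in statuses.items() if s == "FAILED")
    while queue:
        task = queue.popleft()
        if task in to_rerun:
            continue
        to_rerun.add(task)
        queue.extend(dependents[task])
    return sorted(to_rerun)


if __name__ == "__main__":
    print(tasks_to_repair(statuses, depends_on))
```

In practice the Jobs UI works this out for you when you click Repair run. The Jobs REST API also exposes a repair request that takes a run ID and a list of task keys to re-run; check the current API reference for the exact fields before relying on it.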

Common job failures

Symptom	Likely Cause	Fix
Task timeout	Query too slow, data too large	Increase timeout, optimise query, add nodes
OOM (Out of Memory)	Data doesn’t fit in memory	Increase node memory, reduce partition size, use disk-based operations
Cluster start failure	Quota exceeded, region capacity	Try a different node type or region
Source unavailable	Network/auth issue	Check connectivity, rotate expired credentials
Concurrent run conflict	Previous run still active	Set max concurrent runs to 1
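
Two of the fixes above (timeouts and concurrent run limits) are job settings rather than code changes. The sketch below builds a settings payload in the shape used by the Jobs API, as far as I know it: `max_concurrent_runs`, `timeout_seconds`, `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`. Treat the field names as something to verify against the current API reference; the job name, task key, and notebook path are hypothetical.

```python
import json


def build_job_settings(name, task_key, notebook_path,
                       timeout_seconds=3600, max_retries=1):
    """Return a job settings dict with a timeout, retries, and no overlapping runs."""
    return {
        "name": name,
        # Prevents a new run from starting while the previous one is active.
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": task_key,
                "notebook_task": {"notebook_path": notebook_path},
                # Fail the task instead of letting it run indefinitely.
                "timeout_seconds": timeout_seconds,
                # Retry transient failures (e.g. a source briefly unavailable).
                "max_retries": max_retries,
                "min_retry_interval_millis": 60_000,
                # Retrying after a timeout usually just repeats the slow run.
                "retry_on_timeout": False,
            }
        ],
    }


if __name__ == "__main__":
    settings = build_job_settings(
        name="nightly_etl",
        task_key="build_reports",
        notebook_path="/Workspace/etl/build_reports",  # hypothetical path
    )
    print(json.dumps(settings, indent=2))
```

Retries help with transient causes such as network or capacity issues. They do not help with OOM errors or slow queries, which fail the same way on every attempt until the underlying cause is fixed.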

Job operations

Action	When to Use
Run	Start a new execution
Repair	Re-run only the failed tasks and their downstream dependents from a failed run
Restart	Cancel current run and start fresh
Stop/Cancel	Stop a running execution

Troubleshooting Spark jobs

Common Spark issues

Issue	Symptom	Investigation
Slow stage	One stage takes much longer	Check Spark UI → Stages for skew
OOM error	Driver or executor out of memory	Reduce collect() calls, increase memory
Job hangs	Progress stops, no errors	Check for deadlocks, broadcast timeout
Data skew	One task processes much more data	Check Spark UI → Task metrics for uneven distribution
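
When reading task metrics for skew, a rough rule of thumb is to compare the largest task against the median. The sketch below assumes you have copied per-task input sizes out of the Spark UI; `skew_ratio`, `is_skewed`, and the threshold of 5 are illustrative choices, not Spark settings.

```python
from statistics import median


def skew_ratio(task_bytes):
    """Ratio of the largest task's input size to the median task's input size."""
    if not task_bytes:
        raise ValueError("need at least one task")
    mid = median(task_bytes)
    if mid == 0:
        # All-zero input is balanced; any non-zero outlier is extreme skew.
        return float("inf") if max(task_bytes) > 0 else 1.0
    return max(task_bytes) / mid


def is_skewed(task_bytes, threshold=5.0):
    """Flag a stage whose largest task is at least `threshold` times the median."""
    return skew_ratio(task_bytes) >= threshold


if __name__ == "__main__":
    balanced = [100, 120, 80, 150]   # MB read per task (made-up numbers)
    skewed = [100, 110, 90, 2000]
    print(is_skewed(balanced), is_skewed(skewed))
```

The median is used instead of the mean because a single huge task inflates the mean and hides the very outlier you are looking for.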

Cluster restart for recovery

Sometimes the simplest fix is a cluster restart:

When: persistent driver issues, memory leaks, corrupt state
How: Stop and restart the cluster (or let auto-termination handle it)
Caution: a restart interrupts streaming jobs mid-micro-batch, so in-flight state is lost. With checkpointing enabled, the stream resumes from the last checkpoint and no data is lost.

Question

What is a repair run and when should you use it?

Answer

A repair run re-executes only failed tasks and their downstream dependents from a failed job run. Use it to avoid re-running successful tasks, saving time and compute cost.

Question

What are the top cost optimisation actions for Databricks clusters?

Answer

Reduce auto-termination timeout (idle clusters), right-size nodes (match CPU/memory to workload), switch to job compute for scheduled work, and shut down dev clusters outside business hours.

Question

What causes an Out of Memory (OOM) error in Spark?

Answer

Data doesn't fit in the executor or driver memory. Common causes: collect() pulling too much data to driver, large broadcast joins, insufficient partition count, or skewed data. Fix: increase memory, reduce collect(), repartition.

Knowledge check

Ravi's nightly ETL job at DataPulse failed on task 4 of 5. Tasks 1-3 completed successfully and produced correct output. What is the most efficient way to recover?

Next up: Spark Performance: DAG & Query Profile — investigating caching, skew, spilling, and shuffle issues.