Domain 3 β€” Module 3 of 8 38%
21 of 26 overall
Domain 3: Monitor and Optimize an Analytics Solution Free ⏱ ~12 min read

Troubleshoot Notebooks & SQL

Identify and resolve Spark notebook errors and T-SQL failures β€” OOM errors, data skew, schema mismatches, query timeouts, and debugging techniques.

Notebook errors

Simple explanation

Think of a Spark notebook as a team of workers processing data.

Errors happen when: a worker runs out of desk space (OOM β€” out of memory), one worker gets all the heavy files while others sit idle (data skew), the input data doesn’t match expectations (schema mismatch), or the instructions themselves are wrong (code error).

Common notebook errors

Read the error message carefully β€” Spark errors are verbose but informative
ErrorCauseFix
java.lang.OutOfMemoryErrorDataset too large for driver/executor memoryIncrease pool size, reduce data with filters before collect(), avoid collect() on large DataFrames
AnalysisException: cannot resolve columnColumn name doesn't exist (typo or schema change)Check column names with df.printSchema(), verify source data
Data skew (one task takes 10x longer)One partition key has far more data than othersRepartition data, use salting technique, or broadcast smaller table
Py4JJavaError with NullPointerExceptionNull values in a column used for operationsFilter nulls before processing, use coalesce() or fillna()
SchemaConflictException on writeDataFrame schema doesn't match existing Delta tableUse mergeSchema option or fix DataFrame to match
Cluster startup timeoutNo available capacity for Spark nodesWait and retry, use starter pool, or request capacity increase
Scenario: Carlos debugs an OOM error

Carlos’s transformation notebook crashes with OutOfMemoryError on the driver. He investigates:

  1. The line that failed: result = df_500m_rows.collect() β€” collects all 500M rows to the driver!
  2. Root cause: collect() pulls the entire distributed DataFrame into the single driver node’s memory
  3. Fix: Replace collect() with .write.format("delta").save() to write directly to the lakehouse without pulling data to the driver

Rule: Never collect() large DataFrames. Write to Delta tables or use show(20) to preview.

Common T-SQL errors

ErrorCauseFix
Query timeoutComplex query exceeds time limitOptimize query (add WHERE filters, simplify joins), check for missing statistics
Insufficient permissionsUser lacks READ/WRITE on tableGrant appropriate permissions (ReadAll for queries, Contributor role for writes)
Invalid object nameTable or view doesn’t exist (typo, wrong schema)Verify object name and schema β€” use SELECT * FROM INFORMATION_SCHEMA.TABLES
Data type conversion failedINSERT/UPDATE with incompatible typesCast data explicitly: CAST(column AS DECIMAL(10,2))
DeadlockTwo queries blocking each otherReview query execution plans, reduce transaction scope, retry with backoff

Debugging techniques

TechniqueToolWhen to Use
Spark UIBuilt into notebookInvestigate slow stages, data skew, shuffle metrics
df.printSchema()PySparkVerify column names and types before operations
df.show(5)PySparkPreview data at each transformation step
EXPLAINT-SQL / Spark SQLView query execution plan
Cell-by-cell executionNotebookIsolate which transformation step fails

Question

What is the most common cause of OOM errors in Spark notebooks?

Click or press Enter to reveal answer

Answer

Using collect() on large DataFrames (pulling millions of rows to the single driver node), or processing very wide datasets without filtering first. Fix: write to Delta tables instead of collecting, filter early, increase pool memory.

Click to flip back

Question

How do you diagnose data skew in a Spark notebook?

Click or press Enter to reveal answer

Answer

Check the Spark UI β€” look for tasks within a stage where one task takes much longer or processes much more data than others. High shuffle read/write on one executor is a key indicator.

Click to flip back


Knowledge Check

A PySpark notebook fails with 'AnalysisException: cannot resolve column order_total.' The DataFrame was loaded from a Delta table. What should the engineer check first?

Knowledge Check

A T-SQL query in a Fabric warehouse times out after 10 minutes. The query joins two large tables without WHERE filters. What is the best first step?

Next up: Troubleshoot Streaming & Shortcuts β€” resolve Eventhouse, Eventstream, and OneLake shortcut errors.