Troubleshoot Notebooks & SQL

Notebook errors

Simple explanation

Think of a Spark notebook as a team of workers processing data.

Errors happen when: a worker runs out of desk space (OOM — out of memory), one worker gets all the heavy files while others sit idle (data skew), the input data doesn’t match expectations (schema mismatch), or the instructions themselves are wrong (code error).

Common notebook errors

Read the error message carefully — Spark errors are verbose but informative
Error	Cause	Fix
java.lang.OutOfMemoryError	Dataset too large for driver/executor memory	Increase pool size, reduce data with filters before collect(), avoid collect() on large DataFrames
AnalysisException: cannot resolve column	Column name doesn't exist (typo or schema change)	Check column names with df.printSchema(), verify source data
Data skew (one task takes 10x longer)	One partition key has far more data than others	Repartition data, use salting technique, or broadcast smaller table
Py4JJavaError with NullPointerException	Null values in a column used for operations	Filter nulls before processing, use coalesce() or fillna()
SchemaConflictException on write	DataFrame schema doesn't match existing Delta table	Set .option("mergeSchema", "true") on the write if the new columns are intentional, or fix the DataFrame to match the table
Cluster startup timeout	No available capacity for Spark nodes	Wait and retry, use starter pool, or request capacity increase
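
The salting fix for data skew can be illustrated without a cluster. The sketch below is plain Python, not Spark code: it mimics a hash partitioner with zlib.crc32 and shows how appending a random suffix (the salt) to a hot key spreads its rows over several partitions. The key names, NUM_PARTITIONS, and NUM_SALTS are made-up values for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key):
    """Assign a key to a partition, roughly as a hash partitioner would."""
    # crc32 is deterministic across runs, unlike the built-in hash() for str
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def add_salt(key, rng):
    """Append a random suffix so one logical key becomes several physical keys."""
    return f"{key}_{rng.randrange(NUM_SALTS)}"


rng = random.Random(42)
# 9,000 rows for one hot customer plus 1,000 rows spread over 100 other customers
keys = ["customer_hot"] * 9000 + [f"customer_{i}" for i in range(100)] * 10

unsalted = partition_sizes(keys)
salted = partition_sizes([add_salt(k, rng) for k in keys])

print("largest partition without salt:", max(unsalted))
print("largest partition with salt:   ", max(salted))
```

In PySpark the same idea is usually applied by adding a salt column to the skewed side before the join or aggregation, and replicating the matching rows on the other side once per salt value.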

Scenario: Carlos debugs an OOM error

Carlos’s transformation notebook crashes with OutOfMemoryError on the driver. He investigates:

The line that failed: result = df_500m_rows.collect() — collects all 500M rows to the driver!
Root cause: collect() pulls the entire distributed DataFrame into the single driver node’s memory
Fix: Replace collect() with .write.format("delta").save() to write directly to the lakehouse without pulling data to the driver

Rule: Never collect() large DataFrames. Write to Delta tables or use show(20) to preview.
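
Carlos's fix can be sketched as a small helper. This is a minimal sketch, assuming df is a pyspark.sql.DataFrame; the function name, the overwrite mode, and the example path are illustrative assumptions, not part of the scenario.

```python
def persist_and_preview(df, path, preview_rows=20):
    """Write a Spark DataFrame to Delta and show a small preview.

    `df` is expected to be a pyspark.sql.DataFrame. `path` is a
    hypothetical lakehouse location such as "Tables/orders_clean".
    """
    # Executors write their partitions in parallel; no rows are pulled to the driver.
    # mode("overwrite") is an assumption here; use "append" if the table should grow.
    df.write.format("delta").mode("overwrite").save(path)

    # show() brings only `preview_rows` rows back to the driver.
    df.show(preview_rows)
```

In the notebook, a call such as persist_and_preview(df_500m_rows, "Tables/orders_clean") would replace the failing collect() line.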

Common T-SQL errors

Error	Cause	Fix
Query timeout	Complex query exceeds time limit	Optimize query (add WHERE filters, simplify joins), check for missing statistics
Insufficient permissions	User lacks READ/WRITE on table	Grant appropriate permissions (ReadData for T-SQL queries, Contributor role for writes)
Invalid object name	Table or view doesn’t exist (typo, wrong schema)	Verify object name and schema — use `SELECT * FROM INFORMATION_SCHEMA.TABLES`
Data type conversion failed	INSERT/UPDATE with incompatible types	Cast data explicitly: `CAST(column AS DECIMAL(10,2))`
Deadlock	Two queries blocking each other	Review query execution plans, reduce transaction scope, retry with backoff
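
The "retry with backoff" fix for deadlocks can be sketched as follows. DeadlockError is a hypothetical stand-in for whatever exception your database driver raises on a deadlock, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver-specific deadlock exception."""


def run_with_backoff(run_query, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call run_query(), retrying on DeadlockError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # 0.5 s, 1 s, 2 s, ... plus up to 100 ms of jitter so that
            # competing sessions do not retry in lockstep
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Pass the function a zero-argument callable that runs the statement. Keeping the transaction inside that callable short also addresses the "reduce transaction scope" advice.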

Debugging techniques

Technique	Tool	When to Use
Spark UI	Built into notebook	Investigate slow stages, data skew, shuffle metrics
df.printSchema()	PySpark	Verify column names and types before operations
df.show(5)	PySpark	Preview data at each transformation step
EXPLAIN / SHOWPLAN_XML	Spark SQL (EXPLAIN) / T-SQL (SET SHOWPLAN_XML ON)	View query execution plan
Cell-by-cell execution	Notebook	Isolate which transformation step fails
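
The schema check can also be automated before a transformation runs. The sketch below assumes you pass df.columns from a real DataFrame as the first argument; the helper name and the column names are hypothetical.

```python
def missing_columns(actual_columns, required_columns):
    """Return the required column names that are absent, in the order given."""
    actual = set(actual_columns)
    # The comparison is case-sensitive, which is stricter than Spark's default
    return [c for c in required_columns if c not in actual]


# With a real DataFrame you would pass df.columns as the first argument.
actual = ["order_id", "customer_id", "order_amount"]  # hypothetical schema
print(missing_columns(actual, ["order_id", "order_total"]))  # prints ['order_total']
```

Running a check like this at the top of a notebook turns a late "cannot resolve column" failure into an early, readable message.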

Question

What is the most common cause of OOM errors in Spark notebooks?

Answer

Using collect() on large DataFrames (pulling millions of rows to the single driver node), or processing very wide datasets without filtering first. Fix: write to Delta tables instead of collecting, filter early, increase pool memory.

Question

How do you diagnose data skew in a Spark notebook?

Answer

Check the Spark UI — look for tasks within a stage where one task takes much longer or processes much more data than others. High shuffle read/write on one executor is a key indicator.
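
The comparison described above can be made concrete with a small calculation. This is a rough sketch: the durations are invented numbers of the kind you would read off one stage in the Spark UI, and the threshold of 5 is an arbitrary assumption.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task within one stage."""
    return max(task_durations_s) / median(task_durations_s)


# Hypothetical task durations (seconds) copied from one stage in the Spark UI
durations = [12, 11, 13, 12, 140, 12, 11, 13]
ratio = skew_ratio(durations)

if ratio > 5:  # arbitrary threshold; tune it for your workload
    print(f"possible skew: slowest task is {ratio:.1f}x the median")
```

A ratio close to 1 means the tasks are balanced. A large ratio points at one partition carrying most of the data.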

Knowledge Check

A PySpark notebook fails with 'AnalysisException: cannot resolve column order_total.' The DataFrame was loaded from a Delta table. What should the engineer check first?

Knowledge Check

A T-SQL query in a Fabric warehouse times out after 10 minutes. The query joins two large tables without WHERE filters. What is the best first step?

Next up: Troubleshoot Streaming & Shortcuts — resolve Eventhouse, Eventstream, and OneLake shortcut errors.