Domain 3 β€” Module 5 of 8 63%
23 of 26 overall
Domain 3: Monitor and Optimize an Analytics Solution Free ⏱ ~14 min read

Optimize Lakehouse Tables: Delta Tuning

Use OPTIMIZE, VACUUM, Z-ordering, and V-ordering to make your Delta Lake tables fast, compact, and storage-efficient.

Why optimize Delta tables?

Simple explanation

Think of a filing cabinet that nobody organises.

After months of use, you have thousands of tiny folders (small files), old documents that should have been shredded (orphaned files), and no logical order to anything (unsorted data). Finding something takes forever.

Delta table optimization is like hiring a professional organiser: OPTIMIZE combines tiny files into bigger ones (compaction). VACUUM removes old files that are no longer needed. Z-ordering sorts data so queries on common columns skip irrelevant files. V-ordering is Fabric’s special compression for even faster reads.

OPTIMIZE (compaction)

Combines many small files into fewer larger files. This is the single most impactful optimization.

-- Compact all files in the table
OPTIMIZE lakehouse.FactOrders

-- Compact with Z-ordering on a specific column
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date)

-- Compact a specific partition
OPTIMIZE lakehouse.FactOrders WHERE year = 2026

Before and after

MetricBefore OPTIMIZEAfter OPTIMIZE
Number of files12,847 small files156 optimally-sized files
Average file size2 MB128 MB
Query scan time45 seconds6 seconds
Why small files are a problem

Each Parquet file has overhead: file open, metadata read, footer parsing. With 12,847 files, Spark spends more time on overhead than actual data reading.

OPTIMIZE reads all the small files and rewrites them as fewer, optimally-sized files (target: ~128 MB each). The data is identical β€” only the physical file layout changes.

Best practice: Run OPTIMIZE after batch loads, especially after many small appends or streaming micro-batches.

VACUUM (cleanup)

Removes files that are no longer referenced by the transaction log. These are leftover from UPDATE, DELETE, MERGE, or OPTIMIZE operations.

-- Remove files older than 7 days (default retention)
VACUUM lakehouse.FactOrders

-- Remove files older than 24 hours (careful β€” reduces time travel window)
VACUUM lakehouse.FactOrders RETAIN 24 HOURS

What VACUUM does NOT delete

  • Current data files (referenced by the latest transaction log)
  • Files within the retention period (default: 7 days)
  • The transaction log itself (_delta_log/)
Exam tip: VACUUM and time travel

Critical trade-off: VACUUM removes old file versions. If you VACUUM with a 24-hour retention, you can only time travel back 24 hours. Files needed for older versions are gone.

Exam pattern: β€œAfter running VACUUM, time travel to version 5 fails.” Why? VACUUM deleted the files that version 5 referenced.

Safe default: Keep the 7-day retention period unless storage cost is critical.

Z-ordering

Z-ordering physically sorts data within files by specified columns. This enables data skipping β€” Spark reads file metadata (min/max values per column) and skips files that can’t contain matching rows.

-- Z-order by the most commonly filtered column
OPTIMIZE lakehouse.FactOrders ZORDER BY (customer_id)

-- Z-order by multiple columns (first column gets priority)
OPTIMIZE lakehouse.FactOrders ZORDER BY (order_date, region)

When to Z-order

ScenarioZ-Order ColumnWhy
Queries filter by dateorder_dateDate range queries skip non-matching files
Queries filter by customercustomer_idCustomer lookups read fewer files
Queries join on product_idproduct_idJoin scans only relevant files
Scenario: Carlos Z-orders production data

Carlos’s FactProduction table has 500M rows. Most queries filter by factory_id and production_date. Without Z-ordering, every query scans all files.

After running OPTIMIZE FactProduction ZORDER BY (production_date, factory_id):

  • A query for β€œFactory F01, April 2026” scans 12 files instead of 1,500
  • Query time drops from 35 seconds to 3 seconds

V-ordering

V-ordering is Fabric’s proprietary write-time optimization that applies special sorting and compression to Parquet files. It’s enabled by default for Fabric lakehouse tables.

Z-ordering is manual + column-specific; V-ordering is automatic + engine-wide
FeatureZ-orderingV-ordering
When appliedManually with OPTIMIZE commandAutomatically on write (default in Fabric)
What it doesSorts data by specific columns for data skippingApplies optimized compression and read-path optimization
Configured byYou choose the columnsEnabled explicitly via table properties, session config, or OPTIMIZE VORDER option
Works withSpark engine (data skipping)All Fabric engines (Spark, SQL endpoint, Power BI Direct Lake)
RelationshipCan be combined with V-orderingComplements Z-ordering

Table maintenance strategy

TaskFrequencyCommand
OPTIMIZEAfter batch loads, daily for heavy tablesOPTIMIZE table_name
OPTIMIZE + Z-ORDERWeekly or after major loadsOPTIMIZE table ZORDER BY (column)
VACUUMWeeklyVACUUM table_name
Analyze tableAfter major schema or data changesANALYZE TABLE table_name COMPUTE STATISTICS

Question

What does OPTIMIZE do to a Delta table?

Click or press Enter to reveal answer

Answer

Combines many small files into fewer, optimally-sized files (~128 MB each). This reduces file open overhead and dramatically improves query scan time. Data content is unchanged β€” only the physical layout changes.

Click to flip back

Question

What is the trade-off of running VACUUM with a short retention period?

Click or press Enter to reveal answer

Answer

VACUUM removes old file versions. Short retention (e.g., 24 hours) frees storage but limits time travel β€” you can only query versions within the retention window. Default 7-day retention is the safe balance.

Click to flip back

Question

What is Z-ordering and when should you use it?

Click or press Enter to reveal answer

Answer

Z-ordering sorts data within files by specified columns, enabling data skipping (Spark reads min/max metadata and skips irrelevant files). Use it on columns that appear frequently in WHERE filters or JOIN conditions.

Click to flip back

Question

What is V-ordering in Fabric?

Click or press Enter to reveal answer

Answer

An automatic, proprietary write-time optimization that applies special compression and read-path optimization to Parquet files. Enabled by default in Fabric. Works with all engines (Spark, SQL endpoint, Direct Lake). No configuration needed.

Click to flip back


Knowledge Check

Carlos runs VACUUM on a Delta table with RETAIN 2 HOURS. Later, he tries to time travel to a version from 3 hours ago. The query fails. Why?

Knowledge Check

A lakehouse table has 15,000 small files (average 1.5 MB each) after months of streaming micro-batches. Queries take 40+ seconds. What is the single most impactful optimization?

Next up: Optimize Spark: Speed Up Your Code β€” partition strategies, caching, broadcast joins, and Spark configuration tuning.