Domain 2 β€” Module 8 of 14 57%
15 of 29 overall
Domain 2: Prepare Data Free ⏱ ~15 min read

Star Schema Design

The star schema is the most important data modelling pattern in DP-600. Fact tables, dimension tables, grain, and denormalization β€” with a worked retail example.

What is a star schema?

Simple explanation

Think of a receipt and a set of reference cards.

When you buy groceries, your receipt lists every item: product code, quantity, price, store, date, time. That receipt is a fact table β€” it records what happened (the transaction).

But the receipt alone is not very useful for analysis. To answer β€œwhich category sold the most last month?”, you need a product card (name, category, brand), a store card (location, region, size), and a date card (month, quarter, year). These are dimension tables.

A star schema puts the fact table in the centre and surrounds it with dimension tables β€” like a star. It is the foundation of fast analytics and Power BI performance.

Fact tables vs dimension tables

Facts measure what happened; dimensions describe context
FeatureFact TableDimension Table
What it recordsBusiness events β€” transactions, measurements, metricsDescriptive attributes β€” who, what, where, when
RowsMany (millions to billions)Few (hundreds to millions)
ColumnsKeys (foreign keys to dimensions) + measures (numbers)Attributes (text, categories, hierarchies)
ChangesAppend-mostly β€” new events added over timeSlowly changing β€” attributes updated occasionally
ExamplesSales transactions, website clicks, sensor readingsProducts, customers, stores, dates, employees
GrainThe level of detail (one row per transaction, per day, per hour)One row per entity (one row per product, per customer)

A worked example: FreshCart’s star schema

Anita at FreshCart designs a star schema for sales analytics. Here is the complete model:

Fact table: fact_sales

ColumnTypeDescription
sale_idINTUnique transaction line ID
date_keyINTFK β†’ dim_date
store_keyINTFK β†’ dim_store
product_keyINTFK β†’ dim_product
customer_keyINTFK β†’ dim_customer
quantityINTNumber of items sold (measure)
unit_priceDECIMALPrice per item (measure)
total_amountDECIMALquantity x unit_price (measure)
discount_amountDECIMALDiscount applied (measure)

Grain: One row per product per transaction (the most detailed level).

Dimension tables

dim_date β€” when the sale happened

ColumnExample Values
date_key20260421
full_date2026-04-21
day_nameTuesday
month_nameApril
quarterQ2
year2026
is_weekendNo
is_holidayNo

dim_store β€” where the sale happened

ColumnExample Values
store_key1042
store_nameFreshCart Auckland CBD
cityAuckland
regionNorth Island
store_typeMetro
opening_date2019-03-15

dim_product β€” what was sold

ColumnExample Values
product_key5678
product_nameOrganic Avocado
categoryFresh Produce
subcategoryFruits
brandFreshGreen
unit_of_measureEach

dim_customer β€” who bought it

ColumnExample Values
customer_key90123
customer_nameJane Smith
loyalty_tierGold
signup_date2022-01-10
cityWellington

The star shape

  dim_date ──────┐
                  β”‚
  dim_store ──── fact_sales ──── dim_product
                  β”‚
  dim_customer β”€β”€β”˜

Every query follows the same pattern: start from the fact table, join to dimensions for context.

-- Total sales by region and month
SELECT
    s.region,
    d.month_name,
    SUM(f.total_amount) AS total_revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2026
GROUP BY s.region, d.month_name
ORDER BY total_revenue DESC

Why denormalize?

In transactional databases (OLTP), you normalise data to reduce redundancy β€” separate tables linked by foreign keys. In analytical databases (OLAP), you denormalize β€” merge related attributes into fewer, wider tables.

Normalization vs denormalization

ApproachNormalised (OLTP)Denormalised (OLAP / Star Schema)
GoalEliminate redundancyOptimise query performance
TablesMany narrow tables with joinsFewer wide tables with redundant attributes
Joins neededMany (complex queries)Few (simple queries)
Write performanceFast (each fact stored once)Slower (redundant data updated in multiple places)
Read performanceSlower (many joins)Faster (fewer joins, columnar storage)
Best forTransaction processingReporting and analytics

In the FreshCart example, dim_store already includes city and region β€” even though a normalised design would have separate city and region tables. This redundancy makes queries faster because you join fact_sales to dim_store once instead of joining to store, then city, then region.

Exam tip: When to denormalize

The exam tests your judgment on denormalization. General rules:

  • Denormalize dimension tables β€” flatten hierarchies into the dimension (city, region, country all in dim_store)
  • Keep fact tables normalised β€” fact tables should contain only keys and measures
  • Create aggregate tables for common high-level queries (daily totals instead of per-transaction)
  • Do NOT denormalize when dimension attributes change frequently β€” updates ripple across all redundant copies

Aggregate tables

For very large fact tables, pre-computing common aggregations dramatically improves performance:

-- Create an aggregate table: daily sales by store
CREATE TABLE agg_daily_store_sales AS
SELECT
    date_key,
    store_key,
    COUNT(*) AS transaction_count,
    SUM(quantity) AS total_units,
    SUM(total_amount) AS total_revenue,
    SUM(discount_amount) AS total_discount
FROM fact_sales
GROUP BY date_key, store_key
Query LevelSourceRows (FreshCart)
Per transactionfact_sales5.4 billion/year
Per day per storeagg_daily_store_sales730,000/year
Per month per regionCustom aggregate48/year

Power BI and Direct Lake semantic models can automatically use aggregate tables when available (via user-defined aggregations).

Question

What is a fact table?

Click or press Enter to reveal answer

Answer

A fact table records measurable business events β€” sales transactions, website clicks, sensor readings. Rows are numerous (millions to billions). Columns contain foreign keys to dimension tables plus numeric measures (quantity, amount, duration).

Click to flip back

Question

What is a dimension table?

Click or press Enter to reveal answer

Answer

A dimension table contains descriptive attributes used for filtering, grouping, and labelling β€” products, customers, stores, dates. Rows are relatively few. Attributes are denormalised (city, region, country in one table) for query performance.

Click to flip back

Question

What is the grain of a fact table?

Click or press Enter to reveal answer

Answer

The grain defines what one row represents β€” the most detailed level of measurement. For FreshCart: one row per product per transaction. Defining grain correctly is the most important design decision because it determines what questions the model can answer.

Click to flip back

Question

Why do we denormalize dimension tables in a star schema?

Click or press Enter to reveal answer

Answer

Denormalization reduces the number of joins needed for analytical queries. Instead of joining store β†’ city β†’ region β†’ country, all attributes live in one dim_store table. This is slower for writes but much faster for reads β€” exactly the trade-off analytics requires.

Click to flip back

Knowledge Check

Anita at FreshCart is designing a star schema for sales analytics. She has transaction-level data (one row per item per sale) plus reference data for products, stores, customers, and dates. How should she structure the schema?

Knowledge Check

Raj at Atlas Capital notices that Power BI dashboards showing 'total portfolio value by region' take 30+ seconds to load. The fact table has 2 billion rows at transaction-level grain. The query aggregates by region and month. What is the best optimisation?


Next up: SQL Objects: Views, Functions & Stored Procedures β€” programmable SQL in Fabric warehouses and lakehouse SQL endpoints.