Domain 2 β€” Module 8 of 10 80%
16 of 26 overall
Domain 2: Ingest and Transform Data Free ⏱ ~14 min read

Transform Data with SQL & KQL

Use T-SQL in Fabric warehouses and KQL in Eventhouses to transform, aggregate, and shape data β€” stored procedures, views, materialized queries, and time-series patterns.

Two query languages, two data stores

Simple explanation

Think of two different kitchens with different tools.

The warehouse kitchen uses T-SQL β€” the same language database admins have used for decades. It excels at joins, stored procedures, and creating views for BI reports. It’s precise, powerful, and familiar.

The Eventhouse kitchen uses KQL (Kusto Query Language) β€” a language built for speed on time-series data. It excels at slicing millions of events by time windows, calculating percentiles, and rendering real-time dashboards. It reads left-to-right like a pipeline.

Both transform data. The difference is what data they work on and how they think about it.

T-SQL transformations in warehouses

Views for reusable transformations

-- Create a view that denormalizes orders with customer info
CREATE VIEW vw_OrdersWithCustomers AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.region,
    o.quantity,
    o.revenue
FROM FactOrders o
INNER JOIN DimCustomer c ON o.customer_key = c.customer_key
WHERE c.is_current = 1;   -- SCD Type 2: only current dimension rows

Stored procedures for repeatable ETL

-- Stored procedure: load and deduplicate daily orders
CREATE PROCEDURE usp_LoadDailyOrders
    @load_date DATE
AS
BEGIN
    -- Step 1: Stage new data
    INSERT INTO staging.DailyOrders
    SELECT * FROM external_source.Orders
    WHERE order_date = @load_date;

    -- Step 2: Deduplicate (keep latest per order_id)
    WITH RankedOrders AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY order_id
                ORDER BY modified_date DESC
            ) AS rn
        FROM staging.DailyOrders
    )
    DELETE FROM RankedOrders WHERE rn > 1;

    -- Step 3: MERGE into fact table
    MERGE INTO FactOrders AS target
    USING staging.DailyOrders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN
        UPDATE SET target.quantity = source.quantity,
                   target.revenue = source.revenue,
                   target.status = source.status
    WHEN NOT MATCHED THEN
        INSERT (order_id, order_date, customer_key, quantity, revenue, status)
        VALUES (source.order_id, source.order_date, source.customer_key,
                source.quantity, source.revenue, source.status);

    -- Step 4: Clean staging
    TRUNCATE TABLE staging.DailyOrders;
END;
What's happening: The stored procedure explained

This stored procedure runs four steps in order:

  1. Stage β€” load new data from the external source into a staging table
  2. Deduplicate β€” use ROW_NUMBER to rank duplicates and delete all but the latest
  3. MERGE β€” upsert into the fact table (update matches, insert new)
  4. Clean β€” truncate the staging table for the next run

The pipeline calls this procedure with a date parameter: EXEC usp_LoadDailyOrders @load_date = '2026-04-21'

Handling duplicates with T-SQL

-- Find duplicates
SELECT order_id, COUNT(*) as dupe_count
FROM FactOrders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Remove duplicates (keep the row with latest modified_date)
WITH Dupes AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY modified_date DESC) AS rn
    FROM FactOrders
)
DELETE FROM Dupes WHERE rn > 1;

KQL transformations in Eventhouses

KQL uses a pipe syntax β€” each line transforms the output of the previous line.

Basic KQL patterns

// Count events by hour for the last 24 hours
OrderEvents
| where Timestamp > ago(24h)                    // Filter: last 24 hours
| summarize EventCount = count() by bin(Timestamp, 1h)  // Group by hour
| order by Timestamp asc                         // Sort chronologically
| render timechart                               // Visualize as line chart

Aggregation patterns

// Revenue by region, with percentiles
OrderEvents
| where EventType == "purchase"
| summarize
    TotalRevenue = sum(Amount),
    AvgRevenue = avg(Amount),
    P95Revenue = percentile(Amount, 95),
    OrderCount = count()
    by Region
| order by TotalRevenue desc

Time-series functions

// Create a time series of hourly order counts (fills gaps with 0)
OrderEvents
| make-series OrderCount = count() on Timestamp step 1h
| render timechart
Scenario: Zoe's clickstream analysis

Zoe at WaveMedia needs to find the top 10 most-watched videos in the last hour, with average watch time and drop-off rate:

PlaybackEvents
| where Timestamp > ago(1h)
| where EventType == "video_play"
| summarize
    AvgWatchSeconds = avg(WatchDuration),
    TotalViews = count(),
    DropOffRate = countif(WatchDuration < 30) * 100.0 / count()
    by VideoId, VideoTitle
| top 10 by TotalViews desc

This runs in under 2 seconds on 50M events. The same query in T-SQL on a warehouse would take minutes.

T-SQL vs KQL

T-SQL for relational work; KQL for time-series speed
FactorT-SQL (Warehouse)KQL (Eventhouse)
Syntax styleSELECT ... FROM ... WHERE ... (set-based)Table | where | summarize | render (pipe-based)
Write supportFull DML (INSERT, UPDATE, DELETE, MERGE)Limited (ingestion only β€” no UPDATE/DELETE on rows)
Best forRelational transforms, dimensional modeling, stored procsTime-series analysis, log exploration, streaming aggregations
Time functionsDATEADD, DATEDIFF, FORMATago(), bin(), datetime_diff(), make-series()
VisualizationNone built-in (use Power BI)render operator (timechart, barchart, piechart)
Stored procedures?YesStored functions only (no side effects)
Views?Yes (standard + materialized views planned)Materialized views (pre-computed aggregations)

Handling data quality issues

Missing data

StrategyT-SQLPySpark
Replace with defaultISNULL(column, default_value)fillna({"column": default})
Remove rowsWHERE column IS NOT NULLdropna(subset=["column"])
Flag for reviewCASE WHEN column IS NULL THEN 'missing' ENDwhen(col("column").isNull(), "missing")

Duplicate data

StrategyT-SQLPySpark
DetectGROUP BY key HAVING COUNT(*) > 1groupBy("key").count().filter("count > 1")
RemoveROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) then DELETE WHERE rn > 1dropDuplicates(["key"]) or window + filter

Question

What is the key difference between T-SQL and KQL syntax?

Click or press Enter to reveal answer

Answer

T-SQL is set-based (SELECT ... FROM ... WHERE ...). KQL is pipe-based (Table | where ... | summarize ... | render ...). Each KQL operator transforms the output of the previous one, reading left-to-right.

Click to flip back

Question

How do you remove duplicate rows in T-SQL?

Click or press Enter to reveal answer

Answer

Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) to rank duplicates, then DELETE WHERE rn > 1. This keeps the most recent version of each duplicated key.

Click to flip back

Question

What is KQL's bin() function?

Click or press Enter to reveal answer

Answer

bin() rounds timestamp values to a time bucket β€” bin(Timestamp, 1h) groups events into hourly buckets. Essential for time-series aggregations like 'count events per hour.'

Click to flip back


Knowledge Check

A warehouse table has duplicate order records due to a pipeline bug. Each order_id appears 2-3 times with different modified_date values. Carlos needs to keep only the most recent version of each order. Which T-SQL pattern is correct?

Knowledge Check

Zoe needs to calculate the number of video plays per 15-minute interval for the last 6 hours. Which KQL query is correct?

Next up: Eventstreams & Spark Streaming: Real-Time Ingestion β€” process streaming data as it arrives using Eventstreams and Spark Structured Streaming.