Domain 2: Ingest and Transform Data (Module 9 of 10, ~14 min read)

Eventstreams & Spark Streaming: Real-Time Ingestion

Process streaming data as it arrives using Fabric Eventstreams and Spark Structured Streaming. Choose the right streaming engine for your workload.

Streaming in Fabric

Simple explanation

Think of batch loading as a postal service: you get a delivery once a day. Streaming is a phone call: you hear every word the moment it's spoken.

Fabric offers two streaming tools: Eventstreams (a managed service that captures events from sources like Event Hubs and routes them to destinations) and Spark Structured Streaming (code in a notebook that processes events with PySpark transformations).

Use Eventstreams when you want a visual, no-code pipeline for routing events. Use Spark Structured Streaming when you need complex transformations on the stream.

Choosing a streaming engine

Eventstreams for routing; Spark Structured Streaming for complex transforms
| Factor | Eventstreams | Spark Structured Streaming |
| --- | --- | --- |
| Interface | Visual canvas (drag-and-drop) | Code (PySpark notebook) |
| Coding required? | No (built-in operators) | Yes (Python/Spark) |
| Transformations | Filter, project, aggregate, join (built-in operators) | Any PySpark transformation (joins, ML, complex logic) |
| Latency | Sub-second to seconds | Seconds to minutes (micro-batch) |
| Destinations | KQL database, lakehouse, derived streams, custom endpoints | Lakehouse (Delta tables) |
| Best for | Routing events, simple transforms, multi-destination fanout | Complex transformations, ML scoring, stateful processing |
| Monitoring | Built-in metrics and health dashboard | Spark UI + custom logging |
| Exactly-once? | At-least-once (dedup at destination) | Exactly-once with checkpointing |
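
The "dedup at destination" note in the last row can be illustrated with a small sketch. With at-least-once delivery the same event may arrive more than once, so the destination discards repeats by key. The `event_id` field and the `deduplicate` helper are hypothetical names chosen for illustration; they are not part of Eventstreams.

```python
from typing import Iterable


def deduplicate(events: Iterable[dict], key: str = "event_id") -> list[dict]:
    """Keep the first occurrence of each key; drop redelivered copies."""
    seen: set = set()
    unique: list[dict] = []
    for event in events:
        if event[key] in seen:
            continue  # duplicate from an at-least-once redelivery
        seen.add(event[key])
        unique.append(event)
    return unique


delivered = [
    {"event_id": "a1", "event_type": "play"},
    {"event_id": "a2", "event_type": "pause"},
    {"event_id": "a1", "event_type": "play"},  # redelivered copy
]
deduped = deduplicate(delivered)
```

In a real destination the same idea would typically be expressed in that store's own query or merge logic, keyed on whatever unique identifier the events carry.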

Eventstreams

How Eventstreams work

Source (Event Hub, IoT Hub, CDC)
  → Eventstream (filter, transform, route)
    → Destination 1: KQL Database (real-time queries)
    → Destination 2: Lakehouse (Delta table for batch analytics)
    → Destination 3: Derived stream (feed another Eventstream)

Built-in operators

| Operator | What It Does | Example |
| --- | --- | --- |
| Filter | Keep only events matching a condition | Keep only event_type == "purchase" |
| Manage fields | Select, rename, or remove columns | Drop PII columns before routing to analytics |
| Group by | Aggregate events in time windows | Count events per minute |
| Union | Combine multiple streams | Merge orders from 3 regional Event Hubs |
| Expand | Flatten nested JSON arrays | Expand items[] array into individual rows |

Scenario: Zoe's clickstream pipeline

WaveMedia generates 2 million playback events per minute. Zoe builds an Eventstream:

  1. Source: Azure Event Hub receiving playback events
  2. Filter: Keep only event_type IN ("play", "pause", "complete"), which drops heartbeat noise
  3. Manage fields: Remove user_ip (PII) before analytics
  4. Destinations:
    • KQL Database → real-time dashboard (what's being watched right now?)
    • Lakehouse → Delta table for daily batch analysis (weekly trends)

Two destinations from one stream. The KQL database answers "what's hot right now?" while the lakehouse answers "what was popular this week?"
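
Zoe's steps can be sketched in plain Python to show what the filter and manage-fields operators do to each event before fan-out. This is only an illustration of the logic: Eventstreams is configured on a visual canvas, not in code, and the `route` function and sample events are hypothetical.

```python
KEEP_TYPES = {"play", "pause", "complete"}


def route(events):
    """Filter out noise, drop the PII field, and fan out to two destinations."""
    kql_rows, lakehouse_rows = [], []
    for event in events:
        if event["event_type"] not in KEEP_TYPES:
            continue  # step 2: drop heartbeat noise
        # step 3: remove the PII column before analytics
        cleaned = {k: v for k, v in event.items() if k != "user_ip"}
        kql_rows.append(cleaned)        # step 4: real-time destination
        lakehouse_rows.append(cleaned)  # step 4: batch destination
    return kql_rows, lakehouse_rows


sample = [
    {"event_type": "play", "title": "Ep1", "user_ip": "203.0.113.7"},
    {"event_type": "heartbeat", "title": "Ep1", "user_ip": "203.0.113.7"},
    {"event_type": "complete", "title": "Ep2", "user_ip": "203.0.113.9"},
]
kql_rows, lakehouse_rows = route(sample)
```

Both destinations receive the same cleaned events; what differs is how each store is queried afterwards.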

Spark Structured Streaming

Reading from a stream

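# Read from Event Hub.
# Assumes the Azure Event Hubs Spark connector is available and that
# `spark` and `sc` are provided by the notebook session.
connection_string = "<event-hub-connection-string>"  # placeholder: supply your own

eh_config = {
    # The connector expects the connection string in encrypted form
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default",
}

stream_df = (
    spark.readStream
    .format("eventhubs")
    .options(**eh_config)
    .load()
)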

Transforming streaming data

from pyspark.sql.functions import col, from_json, window
from pyspark.sql.functions import sum as spark_sum  # avoid Python's built-in sum
from pyspark.sql.types import (
    DoubleType, StringType, StructField, StructType, TimestampType
)

# Schema of the JSON payload carried in the event body
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
    StructField("region", StringType())
])

# Parse the JSON body, then flatten the struct into top-level columns
parsed = stream_df.select(
    from_json(col("body").cast("string"), schema).alias("data")
).select("data.*")

# Windowed aggregation: revenue per region per 5-minute window.
# Parentheses replace backslash continuations so inline comments are legal.
windowed = (
    parsed
    .withWatermark("timestamp", "10 minutes")       # Late data tolerance
    .groupBy(
        window("timestamp", "5 minutes"),           # 5-min tumbling window
        "region"
    )
    .agg(spark_sum("amount").alias("window_revenue"))
)
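
To see what the 5-minute tumbling window does, here is a plain-Python sketch of the bucketing arithmetic: each event goes to the window whose start is its timestamp rounded down to a multiple of 5 minutes, and amounts are summed per (window, region). It does not use Spark and ignores watermarks; the helper names and sample orders are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute tumbling window."""
    return ts - (ts - EPOCH) % WINDOW


def revenue_per_window(orders):
    """Sum amounts per (window start, region)."""
    totals = defaultdict(float)
    for ts, region, amount in orders:
        totals[(window_start(ts), region)] += amount
    return dict(totals)


def at(minute: int, second: int = 0) -> datetime:
    """Build a sample timestamp on 2024-01-01 during the 12:00 hour (UTC)."""
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)


orders = [
    (at(1), "EU", 10.0),
    (at(4, 59), "EU", 5.0),  # same window as 12:01
    (at(5), "EU", 7.0),      # next window starts at 12:05
    (at(2), "US", 3.0),
]
totals = revenue_per_window(orders)
```

An event at 12:04:59 and one at 12:05:00 land in different windows, which is what "tumbling" means: windows are fixed-size and do not overlap.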

Writing to Delta Lake

# Write streaming results to a Delta table
query = windowed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/order_revenue") \
    .start("Tables/OrderRevenue5Min")

query.awaitTermination()   # Keep the stream running

What's happening: Streaming watermarks

The .withWatermark("timestamp", "10 minutes") line tells Spark: "Accept events up to 10 minutes late. After that, drop them." Lateness is measured against the latest event time Spark has seen so far, not against the wall clock.

Why it matters: In streaming, events don't always arrive in order. A 10-minute watermark means Spark keeps state for a window until the watermark (the latest event time seen minus 10 minutes) passes the window's end time. This balances completeness (catching late events) with memory usage (not keeping state forever).

Exam pattern: "Streaming aggregation is missing late events" → increase the watermark duration. "Streaming notebook is running out of memory" → the watermark might be too large.
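
The watermark rule can be sketched without Spark. In this simplified model the watermark is the largest event time seen so far minus the allowed lateness, and an event is dropped when its window ended at or before the watermark. Real Spark advances the watermark between micro-batches rather than per event, so treat this as an approximation; the function and variable names are illustrative.

```python
def simulate_watermark(event_minutes, lateness=10, window=5):
    """Classify events as accepted or dropped under a simplified watermark model.

    Times are minutes since an arbitrary start. The watermark is
    (max event time seen so far) - lateness. An event is dropped when the
    window it belongs to ended at or before the current watermark.
    """
    max_seen = None
    accepted, dropped = [], []
    for t in event_minutes:
        watermark = None if max_seen is None else max_seen - lateness
        window_end = (t // window) * window + window
        if watermark is not None and window_end <= watermark:
            dropped.append(t)   # this window's state is already finalized
        else:
            accepted.append(t)
        max_seen = t if max_seen is None else max(max_seen, t)
    return accepted, dropped


# Events arrive out of order: 8 and 3 show up after minute 21 has been seen.
arrivals = [1, 12, 21, 8, 3]
accepted_10, dropped_10 = simulate_watermark(arrivals, lateness=10)
accepted_20, dropped_20 = simulate_watermark(arrivals, lateness=20)
```

Raising the lateness from 10 to 20 minutes recovers both late events, at the cost of keeping window state around for longer.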

Output modes

| Mode | Behaviour | Use With |
| --- | --- | --- |
| Append | Only new rows are written to the sink | Non-aggregated streams, Delta tables |
| Complete | Entire result table is written on every trigger | Small aggregation results (not for large state) |
| Update | Only changed rows are written | Aggregations where you want just the updated values |
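
The three modes differ in which rows of the result table reach the sink on each trigger. The sketch below compares the result table before and after one trigger; it is a toy model of the semantics, not Spark code, and `emit` is a hypothetical helper. In particular, it treats rows with previously unseen keys as the "new rows" for append mode.

```python
def emit(previous: dict, current: dict, mode: str) -> dict:
    """Return the rows a sink would receive for one trigger (toy model)."""
    if mode == "complete":
        return dict(current)  # the whole result table, every time
    if mode == "update":
        # rows that are new or whose value changed since the last trigger
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # rows that did not exist before; existing rows are never revised
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown mode: {mode}")


# Result table (count per region) after trigger 1 and after trigger 2
after_t1 = {"EU": 2, "US": 1}
after_t2 = {"EU": 5, "US": 1, "APAC": 4}

complete_rows = emit(after_t1, after_t2, "complete")
update_rows = emit(after_t1, after_t2, "update")
append_rows = emit(after_t1, after_t2, "append")
```

In Spark itself, append mode on a windowed aggregation emits a window's row only once the watermark has passed that window, which is why rows written in append mode never need to be revised.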

Question

What is the key difference between Eventstreams and Spark Structured Streaming?

Answer

Eventstreams: visual, no-code, sub-second latency, routes events to multiple destinations (KQL, lakehouse). Spark Structured Streaming: code-based (PySpark), seconds-to-minutes latency, complex transformations and ML scoring, writes to Delta tables.

Question

What is a streaming watermark in Spark?

Answer

A watermark defines how late an event can arrive and still be included in a windowed aggregation. withWatermark('timestamp', '10 minutes') means events arriving more than 10 minutes late are dropped. It balances completeness against memory usage.

Question

What is the difference between append and update output modes in streaming?

Answer

Append: only new rows are emitted (good for non-aggregated streams). Update: only changed rows are emitted (good for aggregations where you want incremental updates). Complete: the entire result table is rewritten on every trigger.


Knowledge Check

Zoe needs to route clickstream events to both a KQL database (real-time dashboard) and a lakehouse (weekly analytics) with simple filtering but no complex transformations. Which streaming tool should she use?

Knowledge Check

A Spark Structured Streaming job calculates 5-minute revenue windows. Late-arriving events (up to 15 minutes late) must be included. The current watermark is set to 5 minutes, and some late events are being dropped. What should the engineer change?

Next up: Real-Time Intelligence: KQL & Windowing. Query streaming data with KQL, choose between native tables and shortcuts, and build windowing functions.