Eventstreams & Spark Streaming: Real-Time Ingestion

Streaming in Fabric

Simple explanation

Think of batch loading as a postal service — you get a delivery once a day. Streaming is a phone call — you hear every word the moment it’s spoken.

Fabric offers two streaming tools: Eventstreams (a managed service that captures events from sources like Event Hubs and routes them to destinations) and Spark Structured Streaming (code in a notebook that processes events with PySpark transformations).

Use Eventstreams when you want a visual, no-code pipeline for routing events. Use Spark Structured Streaming when you need complex transformations on the stream.

Choosing a streaming engine

Eventstreams for routing; Spark Structured Streaming for complex transforms
Factor	Eventstreams	Spark Structured Streaming
Interface	Visual canvas (drag-and-drop)	Code (PySpark notebook)
Coding required?	No (built-in operators)	Yes (Python/Spark)
Transformations	Filter, project, aggregate, join (built-in operators)	Any PySpark transformation (joins, ML, complex logic)
Latency	Sub-second to seconds	Seconds to minutes (micro-batch)
Destinations	KQL database, lakehouse, derived streams, custom endpoints	Lakehouse (Delta tables)
Best for	Routing events, simple transforms, multi-destination fanout	Complex transformations, ML scoring, stateful processing
Monitoring	Built-in metrics and health dashboard	Spark UI + custom logging
Exactly-once?	At-least-once (dedup at destination)	Exactly-once with checkpointing

Eventstreams

How Eventstreams work

Source (Event Hub, IoT Hub, CDC)
  → Eventstream (filter, transform, route)
    → Destination 1: KQL Database (real-time queries)
    → Destination 2: Lakehouse (Delta table for batch analytics)
    → Destination 3: Derived stream (feed another Eventstream)

Built-in operators

Operator	What It Does	Example
Filter	Keep only events matching a condition	Keep only `event_type == "purchase"`
Manage fields	Select, rename, or remove columns	Drop PII columns before routing to analytics
Group by	Aggregate events in time windows	Count events per minute
Union	Combine multiple streams	Merge orders from 3 regional Event Hubs
Expand	Flatten nested JSON arrays	Expand `items[]` array into individual rows

Scenario: Zoe's clickstream pipeline

WaveMedia generates 2 million playback events per minute. Zoe builds an Eventstream:

Source: Azure Event Hub receiving playback events
Filter: Keep only event_type IN ("play", "pause", "complete") — drop heartbeat noise
Manage fields: Remove user_ip (PII) before analytics
Destinations:
- KQL Database → real-time dashboard (what’s being watched right now?)
- Lakehouse → Delta table for daily batch analysis (weekly trends)

Two destinations from one stream. The KQL database answers “what’s hot right now?” while the lakehouse answers “what was popular this week?”

Spark Structured Streaming

Reading from a stream

# Read from Event Hub
eh_config = {
    "eventhubs.connectionString": sc._jvm.org.apache.spark
        .eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default"
}

stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**eh_config) \
    .load()

Transforming streaming data

from pyspark.sql.functions import from_json, col, window

# Parse JSON payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
    StructField("region", StringType())
])

parsed = stream_df.select(
    from_json(                                     # Parse the JSON body
        col("body").cast("string"), schema
    ).alias("data")
).select("data.*")                                 # Flatten into columns

# Windowed aggregation: revenue per region per 5-minute window
windowed = parsed \
    .withWatermark("timestamp", "10 minutes") \    # Late data tolerance
    .groupBy(
        window("timestamp", "5 minutes"),           # 5-min tumbling window
        "region"
    ) \
    .agg(sum("amount").alias("window_revenue"))

Writing to Delta Lake

# Write streaming results to a Delta table
query = windowed.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/order_revenue") \
    .start("Tables/OrderRevenue5Min")

query.awaitTermination()   # Keep the stream running

What's happening: Streaming watermarks

The .withWatermark("timestamp", "10 minutes") line tells Spark: “Accept events up to 10 minutes late. After that, drop them.”

Why it matters: In streaming, events don’t always arrive in order. A 10-minute watermark means Spark keeps state for the window until 10 minutes past the window’s end time. This balances completeness (catching late events) with memory usage (not keeping state forever).

Exam pattern: “Streaming aggregation is missing late events” → increase the watermark duration. “Streaming notebook is running out of memory” → the watermark might be too large.

Output modes

Mode	Behaviour	Use With
Append	Only new rows are written to the sink	Non-aggregated streams, delta tables
Complete	Entire result table is written on every trigger	Small aggregation results (not for large state)
Update	Only changed rows are written	Aggregations where you want just the updated values

Question

What is the key difference between Eventstreams and Spark Structured Streaming?

Click or press Enter to reveal answer

Answer

Eventstreams: visual, no-code, sub-second latency, routes events to multiple destinations (KQL, lakehouse). Spark Structured Streaming: code-based (PySpark), seconds-to-minutes latency, complex transformations and ML scoring, writes to Delta tables.

Click to flip back

Question

What is a streaming watermark in Spark?

Click or press Enter to reveal answer

Answer

A watermark defines how late an event can arrive and still be included in a windowed aggregation. withWatermark('timestamp', '10 minutes') means events arriving more than 10 minutes late are dropped. Balances completeness vs memory usage.

Click to flip back

Question

What is the difference between append and update output modes in streaming?

Click or press Enter to reveal answer

Answer

Append: only new rows are emitted (good for non-aggregated streams). Update: only changed rows are emitted (good for aggregations where you want incremental updates). Complete: the entire result table is rewritten on every trigger.

Click to flip back

Knowledge Check

Zoe needs to route clickstream events to both a KQL database (real-time dashboard) and a lakehouse (weekly analytics) with simple filtering but no complex transformations. Which streaming tool should she use?

Knowledge Check

A Spark Structured Streaming job calculates 5-minute revenue windows. Late-arriving events (up to 15 minutes late) must be included. The current watermark is set to 5 minutes, and some late events are being dropped. What should the engineer change?

Next up: Real-Time Intelligence: KQL & Windowing — query streaming data with KQL, choose between native tables and shortcuts, and build windowing functions.