Choosing the Right Messaging Service

Why this module exists

Simple explanation

Azure has four overlapping messaging services and they look interchangeable until you’ve used them. The exam loves to give you a scenario where two of them would technically work but only one is the textbook fit.

Four quick mental hooks:

Service Bus = work queue. “I need a worker to do this thing reliably.”
Event Grid = router. “Something happened — fan out the news.”
Event Hubs = telemetry firehose. “Millions of small events per second from many producers.”
Storage Queue = the cheap one. “I just need a queue, no fancy features.”

The matrix

Use this matrix as the first pass; pick the row that matches the scenario.
Feature	Service Bus	Event Grid	Event Hubs	Storage Queues
Pattern	Queue or pub/sub topic	Push event routing	Partitioned event log	Simple queue (no ordering guarantee)
Direction	Pull (receivers poll)	Push (Event Grid pushes to subscribers)	Pull (consumers read offsets)	Pull (receivers poll)
Throughput	Thousands of msgs/sec per Premium messaging unit	Millions of events/sec across topics	Millions of events/sec on Event Hubs Premium / Dedicated	Modest (scalability target of up to ~2,000 msgs/sec per queue for 1 KiB messages)
Order	FIFO with sessions	No order guarantee	Per-partition order	No ordering guarantee
Retention	TTL per message; configurable up to effectively unlimited on Standard/Premium (14-day max applies on Basic only)	24-hour default retry window, then dead-lettered if a dead-letter destination is configured (otherwise dropped)	Days of replay (configurable, up to 90 days on Premium)	7-day default TTL; on API ≥ 2017-07-29 you can set any positive value or -1 (no expiry)
Best for AI	Reliable async work, RAG ingest, agent backplane	Reactive flows: blob created → embed; cosmos updated → audit	Telemetry from inference, click streams, training pipelines	Cheap small queues, dev/test

The decision flow

Are you reacting to a state change in another Azure service or your own app?
  └── Yes → Event Grid (or Event Grid → Service Bus for durability)

Are you ingesting telemetry / events at very high volume from many producers?
  └── Yes → Event Hubs

Do you need reliable, ordered, transactional async work?
  └── Yes → Service Bus

Just need a basic queue, cost-sensitive, small scale?
  └── Yes → Storage Queues
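
The flow above can be sketched as a small function that asks the same questions in the same order. This is only an illustration: the `Scenario` fields and the `choose_service` name are hypothetical, and a real decision would weigh more than four booleans.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical description of a workload; field names are illustrative."""
    reacts_to_state_change: bool = False
    needs_durable_work_after_event: bool = False
    high_volume_telemetry: bool = False
    needs_reliable_ordered_work: bool = False


def choose_service(s: Scenario) -> str:
    # Questions are asked in the same order as the decision flow.
    if s.reacts_to_state_change:
        if s.needs_durable_work_after_event:
            return "Event Grid -> Service Bus"
        return "Event Grid"
    if s.high_volume_telemetry:
        return "Event Hubs"
    if s.needs_reliable_ordered_work:
        return "Service Bus"
    # Nothing above applied: a basic, cost-sensitive, small-scale queue.
    return "Storage Queues"


print(choose_service(Scenario(high_volume_telemetry=True)))
```

Because the checks run top to bottom, a scenario that is both a state change and high volume resolves to Event Grid first, which mirrors how the flow is written rather than a hard rule.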

Common architectural patterns

Pattern 1 — Event Grid → Service Bus → Container Apps

The classic “reactive AI worker” pipeline:

Blob created (Event Grid system topic)
  → Service Bus queue (subscriber)
  → Container App + KEDA (scales on Service Bus depth)

Event Grid handles the routing/filtering, Service Bus provides durability + DLQ for the worker, Container Apps + KEDA scales workers.
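
To make "scales on Service Bus depth" concrete, here is a rough sketch of the arithmetic a queue-depth scaler performs: divide the backlog by a per-replica target and clamp the result. The function name, parameter names, and default limits are assumptions for illustration; the real scaler also applies polling intervals and cooldown periods that are not modelled here.

```python
import math


def desired_replicas(queue_depth: int, messages_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Approximate the replica count a queue-depth scaler would aim for."""
    if messages_per_replica <= 0:
        raise ValueError("messages_per_replica must be positive")
    # One replica per `messages_per_replica` waiting messages, rounded up.
    wanted = math.ceil(queue_depth / messages_per_replica)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 12, 1000):
    print(depth, desired_replicas(depth, messages_per_replica=5))
```

With a minimum of zero the workers scale away entirely when the queue is empty, which is a large part of why this pattern suits bursty workloads.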

Pattern 2 — Event Hubs → Stream processor → Cosmos

High-throughput ingestion + processing:

Many producers (devices, ad servers, sensors) → Event Hubs
  → Azure Functions (or Stream Analytics, or Spark on AKS)
  → Cosmos DB for state, vector store for embeddings, telemetry to a data lake

Event Hubs absorbs the firehose; downstream services do the work.
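
Per-partition ordering is easier to reason about with a toy model. The sketch below hashes a partition key to a partition index and appends events to that partition's log, so events sharing a key keep their relative order. It is illustrative only: `partition_for` and `route` are hypothetical helpers, and the real service uses its own hashing scheme.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(events, partition_count=4):
    """Append (key, payload) events to per-partition logs in arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, partition_count)].append((key, payload))
    return partitions


events = [("sensor-a", 1), ("sensor-b", 1), ("sensor-a", 2), ("sensor-a", 3)]
logs = route(events)
for index in sorted(logs):
    print(index, logs[index])
```

Order is preserved within a partition but not across partitions, so anything that must stay in sequence needs to share a partition key.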

Pattern 3 — Service Bus topic with multiple subscribers

"order-events" topic
  → "fulfilment" subscription (filter on type=NewOrder)  → fulfilment Container App
  → "loyalty" subscription (no filter)                    → loyalty Container App
  → "audit" subscription (no filter)                       → audit Function App

One publisher, three downstream concerns, broker-side filtering, durable delivery.
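
Broker-side filtering can be illustrated with a small in-memory stand-in for a topic. This is not the Service Bus SDK; the `Topic` class is a hypothetical toy, and the `OrderCancelled` message type is invented purely to show a message that the fulfilment filter rejects.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
Filter = Optional[Callable[[Message], bool]]


class Topic:
    """Toy in-memory stand-in for a topic with filtered subscriptions."""

    def __init__(self) -> None:
        self._filters: Dict[str, Filter] = {}
        self.delivered: Dict[str, List[Message]] = {}

    def subscribe(self, name: str, rule: Filter = None) -> None:
        self._filters[name] = rule
        self.delivered[name] = []

    def publish(self, message: Message) -> None:
        # Every subscription whose filter matches receives its own copy.
        for name, rule in self._filters.items():
            if rule is None or rule(message):
                self.delivered[name].append(dict(message))


topic = Topic()
topic.subscribe("fulfilment", lambda m: m.get("type") == "NewOrder")
topic.subscribe("loyalty")  # no filter: receives everything
topic.subscribe("audit")    # no filter: receives everything
topic.publish({"type": "NewOrder", "id": "1"})
topic.publish({"type": "OrderCancelled", "id": "2"})  # hypothetical type
print({name: len(msgs) for name, msgs in topic.delivered.items()})
```

The publisher never learns who is subscribed or which filters exist, which is what lets a fourth concern be added later without touching the publishing code.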

Pattern 4 — Cosmos DB change feed → Function App

When the trigger source IS Cosmos:

Cosmos write → Cosmos change feed → Function App

Skip messaging entirely — the change feed itself acts as the queue. Cheaper and simpler than tee-ing every write to Service Bus or Event Grid first.
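
The reason a change feed can stand in for a queue is that the consumer tracks its own position and only advances it after processing succeeds. The sketch below models that with a plain list and an integer checkpoint. `ChangeFeedReader` is a hypothetical toy, not the Cosmos DB SDK, and a real consumer would persist its checkpoint durably rather than in memory.

```python
from typing import List


class ChangeFeedReader:
    """Toy reader over an append-only list standing in for a change feed."""

    def __init__(self, log: List[dict]) -> None:
        self._log = log
        self.checkpoint = 0  # index of the next unread change

    def poll(self, max_items: int = 100) -> List[dict]:
        # Reading does not advance the checkpoint, so a crash before
        # commit() means the same batch is read again on restart.
        return self._log[self.checkpoint:self.checkpoint + max_items]

    def commit(self, batch: List[dict]) -> None:
        # Advance past the batch only after it was processed successfully.
        self.checkpoint += len(batch)


log = [{"id": 1}, {"id": 2}, {"id": 3}]
reader = ChangeFeedReader(log)
batch = reader.poll(max_items=2)
reader.commit(batch)
print(reader.poll(max_items=2))
```

Since an uncommitted batch is re-read, processing is at least once, and the handler should tolerate seeing the same change twice.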

Real-world example: Mira's choice between two patterns

Mira’s robots upload shipping label photos to Blob storage. Two downstream concerns: an OCR worker extracting label text, and an audit log of every upload.

Naive choice: put all logic in a Function App with a Blob trigger. Works, but the Blob trigger has reliability gotchas at scale (polling-based detection, so triggers can be delayed or missed under heavy load).
Better choice: Blob created event → Event Grid → Service Bus topic with two subscriptions (ocr and audit). Container Apps run the OCR worker, scaled by KEDA on Service Bus depth.

The second option absorbs the burst at shift change, dead-letters bad images, and audits every upload. Both services deliver at least once, so the OCR worker should be idempotent to avoid repeating work. This is the kind of workload Service Bus and Event Grid were designed for.
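
With at-least-once delivery, avoiding duplicate work and dead-lettering bad images both come down to how the worker settles each message. The sketch below folds both behaviours into one hypothetical `Worker` class for illustration. In Service Bus the broker itself moves a message to the dead-letter queue once its delivery count is exceeded, and the limit of 3 here is an assumed value, not the service default.

```python
MAX_DELIVERY_COUNT = 3  # assumed limit; real brokers make this configurable


class Worker:
    """Toy OCR worker that is idempotent and dead-letters poison messages."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # stand-in for a durable dedupe store
        self.dead_letters = []

    def receive(self, message_id, body, delivery_count):
        """Return 'skipped', 'completed', 'abandoned' or 'dead-lettered'."""
        if message_id in self.processed_ids:
            return "skipped"  # redelivery of work that already succeeded
        try:
            self.handler(body)
        except Exception:
            if delivery_count >= MAX_DELIVERY_COUNT:
                self.dead_letters.append((message_id, body))
                return "dead-lettered"
            return "abandoned"  # leave it for another delivery attempt
        self.processed_ids.add(message_id)
        return "completed"


def ocr(body):
    # Hypothetical handler: fails on an unreadable image.
    if body == "corrupt":
        raise ValueError("unreadable label")


worker = Worker(ocr)
print(worker.receive("m1", "label.png", 1))
print(worker.receive("m1", "label.png", 2))
print(worker.receive("m2", "corrupt", 3))
```

Keeping the dedupe check ahead of the handler means a redelivered message costs one lookup instead of a second OCR run.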

Anti-patterns

Anti-pattern	Why it’s wrong	Better choice
Event Grid for durable work assignment	At-least-once + 24-hour retry window is fine for events, but you can’t store work in Event Grid	Service Bus queue
Service Bus for high-throughput telemetry	Designed for thousands of msgs/sec, not millions; per-message overhead	Event Hubs
Storage Queue for production AI workloads	No built-in DLQ, no sessions, no ordering guarantee, no duplicate detection or transactions	Service Bus
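Two messaging services chained “just in case”	Doubles failure modes and operational surface when the extra hop has no distinct job	Pick one and lean on its features; chain only when each service plays a separate role, as in Pattern 1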

When in doubt — Microsoft’s own decision tree

Microsoft documents this exact decision flow at “Compare Azure messaging services”. Two big rules of thumb from there:

Discrete events that report state change → Event Grid. “Blob created”, “Cosmos changed”, “Resource Health changed”.
Series of related events that need to stay in order → Event Hubs (telemetry/log) or Service Bus (work).

Don’t memorise feature tables — internalise the intent of each service, and the right answer reveals itself from the scenario.