Choosing the Right Messaging Service
Service Bus, Event Grid, Event Hubs, Storage Queues: four very similar-sounding services with different jobs. The decision framework that turns 'I need messaging' into the right choice.
Why this module exists
Azure has four overlapping messaging services and they look interchangeable until you've used them. The exam loves to give you a scenario where two of them would technically work but only one is the textbook fit.
Four quick mental hooks:
- Service Bus = work queue. "I need a worker to do this thing reliably."
- Event Grid = router. "Something happened, fan out the news."
- Event Hubs = telemetry firehose. "Millions of small events per second from many producers."
- Storage Queue = the cheap one. "I just need a queue, no fancy features."
The matrix
| Feature | Service Bus | Event Grid | Event Hubs | Storage Queues |
|---|---|---|---|---|
| Pattern | Queue or pub/sub topic | Push event routing | Partitioned event log | Simple queue |
| Direction | Pull (receivers poll) | Push (Event Grid pushes to subscribers) | Pull (consumers read offsets) | Pull (receivers poll) |
| Throughput | Up to ~thousands of msgs/sec per Premium MU | Millions of events/sec across topics | Millions of events/sec on Event Hubs Premium / Dedicated | Modest (up to ~2,000 msgs/sec per queue with 1 KiB messages) |
| Order | FIFO with sessions | No order guarantee | Per-partition order | No order guarantee |
| Retention | TTL per message; configurable up to effectively unlimited on Standard/Premium (14-day max applies on Basic only) | 24-hour retry window by default, then dead-lettered if a dead-letter destination is configured (otherwise dropped) | Days of replay (configurable, up to 90 days on Premium) | 7-day default TTL; on API ≥ 2017-07-29 you can set any positive value or -1 (no expiry) |
| Best for AI | Reliable async work, RAG ingest, agent backplane | Reactive flows: blob created → embed; Cosmos updated → audit | Telemetry from inference, click streams, training pipelines | Cheap small queues, dev/test |
The decision flow
Are you reacting to a state change in another Azure service or your own app?
└── Yes → Event Grid (or Event Grid → Service Bus for durability)
Are you ingesting telemetry / events at very high volume from many producers?
└── Yes → Event Hubs
Do you need reliable, ordered, transactional async work?
└── Yes → Service Bus
Just need a basic queue, cost-sensitive, small scale?
└── Yes → Storage Queues
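The flow above reads top to bottom, and the first "Yes" wins. A minimal Python sketch of that ordering follows; the `Scenario` class and its field names are hypothetical and exist only to make the four questions explicit.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Answers to the questions in the decision flow (hypothetical field names)."""
    reacts_to_state_change: bool = False
    needs_durable_handoff: bool = False      # the "Event Grid -> Service Bus" variant
    high_volume_telemetry: bool = False
    needs_reliable_ordered_work: bool = False


def choose_service(s: Scenario) -> str:
    """Walk the questions in order and return the first match."""
    if s.reacts_to_state_change:
        if s.needs_durable_handoff:
            return "Event Grid -> Service Bus"
        return "Event Grid"
    if s.high_volume_telemetry:
        return "Event Hubs"
    if s.needs_reliable_ordered_work:
        return "Service Bus"
    # Fall-through: basic, cost-sensitive, small-scale queue.
    return "Storage Queues"


if __name__ == "__main__":
    print(choose_service(Scenario(high_volume_telemetry=True)))
```

The order of the checks matters: a scenario that both reacts to a state change and needs reliable work lands on Event Grid first, with Service Bus added behind it for durability.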
Common architectural patterns
Pattern 1: Event Grid → Service Bus → Container Apps
The classic "reactive AI worker" pipeline:
Blob created (Event Grid system topic)
→ Service Bus queue (subscriber)
→ Container App + KEDA (scales on Service Bus depth)
Event Grid handles the routing/filtering, Service Bus provides durability + DLQ for the worker, Container Apps + KEDA scales workers.
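On the worker side, the first step is usually turning the routed event into a work item. The sketch below assumes the Service Bus message body is a single event in the Event Grid schema; the sample values, the `to_work_item` helper, and the shape of the returned dict are illustrative, not part of any SDK.

```python
import json
from typing import Optional

# Illustrative payload in the Event Grid event schema (values are made up).
SAMPLE_EVENT = json.dumps({
    "id": "evt-001",
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/labels/blobs/robot-7/label-001.jpg",
    "data": {
        "url": "https://example.blob.core.windows.net/labels/robot-7/label-001.jpg",
        "contentType": "image/jpeg",
    },
})


def to_work_item(message_body: str) -> Optional[dict]:
    """Convert a BlobCreated event into a work item; ignore other event types."""
    event = json.loads(message_body)
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    # subject: /blobServices/default/containers/<container>/blobs/<blob path>
    _, _, rest = event["subject"].partition("/containers/")
    container, _, blob_name = rest.partition("/blobs/")
    data = event.get("data", {})
    return {
        "event_id": event["id"],   # handy as an idempotency key downstream
        "container": container,
        "blob": blob_name,
        "url": data.get("url"),
        "content_type": data.get("contentType"),
    }


if __name__ == "__main__":
    print(to_work_item(SAMPLE_EVENT))
```

Keeping the event `id` on the work item gives the worker something to deduplicate on, since delivery along this path is at-least-once.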
Pattern 2: Event Hubs → Stream processor → Cosmos
High-throughput ingestion + processing:
Many producers (devices, ad servers, sensors) → Event Hubs
→ Azure Functions (or Stream Analytics, or Spark on AKS)
→ Cosmos DB for state, vector store for embeddings, telemetry to a data lake
Event Hubs absorbs the firehose; downstream services do the work.
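The per-partition ordering that makes this pattern work comes from routing every event with the same partition key to the same partition. The following is a toy in-memory model of that idea; the SHA-256 hash here is an assumption chosen for determinism and is not the hash Event Hubs actually uses.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (partition key, sequence number), illustrative only


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition deterministically (stand-in for the service's hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(events: Iterable[Event], partition_count: int) -> Dict[int, List[Event]]:
    """Append each event to the log of the partition its key maps to."""
    partitions: Dict[int, List[Event]] = defaultdict(list)
    for key, seq in events:
        partitions[partition_for(key, partition_count)].append((key, seq))
    return partitions


if __name__ == "__main__":
    stream = [("device-a", 1), ("device-b", 1), ("device-a", 2),
              ("device-b", 2), ("device-a", 3)]
    for pid, log in sorted(route(stream, partition_count=4).items()):
        print(pid, log)
```

Events from different keys may interleave within a partition, but each key's events stay in send order, which is the only ordering guarantee a partitioned log provides.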
Pattern 3: Service Bus topic with multiple subscribers
"order-events" topic
→ "fulfilment" subscription (filter on type=NewOrder) → fulfilment Container App
→ "loyalty" subscription (no filter) → loyalty Container App
→ "audit" subscription (no filter) → audit Function App
One publisher, three downstream concerns, broker-side filtering, durable delivery.
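Broker-side filtering means the topic decides which subscriptions receive a copy, so the publisher never needs to know who is listening. Here is a toy in-memory model of that behaviour; the `Topic` class is hypothetical, and real Service Bus subscriptions express rules as SQL or correlation filters on message properties rather than Python callables.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]
Rule = Callable[[Message], bool]


class Topic:
    """Toy pub/sub topic: each subscription has a rule and its own inbox."""

    def __init__(self) -> None:
        self._subs: Dict[str, Tuple[Rule, List[Message]]] = {}

    def subscribe(self, name: str, rule: Optional[Rule] = None) -> None:
        # No rule means "match everything", like a subscription with no filter.
        self._subs[name] = (rule or (lambda m: True), [])

    def publish(self, message: Message) -> None:
        # The broker evaluates every rule; matching subscriptions get their own copy.
        for rule, inbox in self._subs.values():
            if rule(message):
                inbox.append(dict(message))

    def inbox(self, name: str) -> List[Message]:
        return self._subs[name][1]


if __name__ == "__main__":
    topic = Topic()
    topic.subscribe("fulfilment", lambda m: m.get("type") == "NewOrder")
    topic.subscribe("loyalty")
    topic.subscribe("audit")
    topic.publish({"type": "NewOrder", "order_id": "1001"})
    topic.publish({"type": "OrderCancelled", "order_id": "1001"})
    for name in ("fulfilment", "loyalty", "audit"):
        print(name, len(topic.inbox(name)))
```

Each subscription holds an independent copy, so a slow audit consumer does not hold up fulfilment.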
Pattern 4: Cosmos DB change feed → Function App
When the trigger source IS Cosmos:
Cosmos write → Cosmos change feed → Function App
Skip messaging entirely: the change feed itself acts as the queue. Cheaper and simpler than tee-ing every write to Service Bus or Event Grid first.
Real-world example: Mira's choice between three patterns
Mira's robots upload shipping label photos to Blob storage. Two downstream concerns: an OCR worker extracting label text, and an audit log of every upload.
- Naive choice: put all logic in a Function App with a Blob trigger. Works, but the polling-based Blob trigger has reliability gotchas at scale (delayed or missed events under heavy load).
- Better choice: Blob created event → Event Grid → Service Bus topic with two subscriptions ("ocr" and "audit"). Container Apps run the OCR worker, scaled by KEDA on Service Bus depth.
The second pattern handles the burst at shift change, dead-letters bad images, audits everything, and keeps duplicate work to a minimum (delivery is at-least-once, so the OCR worker should still be idempotent). That is exactly the job Service Bus + Event Grid was designed for.
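The "dead-letters bad images" behaviour comes from the queue's maximum delivery count: a message that keeps failing is retried a fixed number of times and then moved aside. The sketch below simulates that in memory; `drain` and `ocr` are hypothetical helpers, and the limit is lowered from Service Bus's default of 10 to keep the trace short.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

MAX_DELIVERY_COUNT = 3  # Service Bus defaults to 10; lowered for the demo


def drain(items: Iterable[str], handler: Callable[[str], None],
          max_delivery_count: int = MAX_DELIVERY_COUNT) -> Tuple[List[str], List[str]]:
    """Process a queue, retrying failures and dead-lettering repeat offenders."""
    queue = deque((item, 0) for item in items)
    completed: List[str] = []
    dead_letter: List[str] = []
    while queue:
        item, deliveries = queue.popleft()
        deliveries += 1
        try:
            handler(item)
            completed.append(item)                # analogous to completing the message
        except Exception:
            if deliveries >= max_delivery_count:
                dead_letter.append(item)          # moved aside for inspection
            else:
                queue.append((item, deliveries))  # abandoned, redelivered later
    return completed, dead_letter


def ocr(label: str) -> None:
    """Stand-in OCR step that fails on unreadable images."""
    if label.endswith(".corrupt"):
        raise ValueError(f"cannot read {label}")


if __name__ == "__main__":
    print(drain(["a.jpg", "b.corrupt", "c.jpg"], ocr))
```

The poison message never blocks the good ones: healthy items complete on their first delivery while the corrupt one cycles to the back until it hits the limit.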
Anti-patterns
| Anti-pattern | Why it's wrong | Better choice |
|---|---|---|
| Event Grid for durable work assignment | At-least-once + 24-hour retry window is fine for events, but you can't store work in Event Grid | Service Bus queue |
| Service Bus for high-throughput telemetry | Designed for thousands of msgs/sec, not millions; per-message overhead | Event Hubs |
| Storage Queue for production AI workloads | No built-in DLQ, no sessions, no ordering guarantee, no topics or broker-side filtering | Service Bus |
| Two messaging services chained "just in case" | Doubles failure modes and operational surface | Pick one and lean on its features |
When in doubt: Microsoft's own decision tree
Microsoft documents this exact decision flow at "Compare Azure messaging services". Two big rules of thumb from there:
- Discrete events that report state change → Event Grid. "Blob created", "Cosmos changed", "Resource Health changed".
- Series of related events that need to stay in order → Event Hubs (telemetry/log) or Service Bus (work).
Don't memorise feature tables; internalise the intent of each service, and the right answer reveals itself from the scenario.
Knowledge check
Theo's clinical assistant ingests 850k clinical events per minute from monitoring devices across hospitals. Each event is small (200 bytes); all events for a given patient must stay in order. Which Azure messaging service fits best?
Mira wants to react when a Blob is created in a Storage container: embed the image, write the result to Cosmos, alert if the embedding worker is down. Which architecture fits?
Lin's POC needs the cheapest possible queue for a hobby AI demo handling 10 messages/minute. No DLQ needed; if a message is lost, it doesn't matter. Which service?