Azure Service Bus: Queues + Dead-Letter Handling
Reliable async messaging for AI back-ends. Queue mechanics, sessions, dead-letter queues, message lock and complete, and how to scale a Container App on Service Bus depth.
Why Service Bus is the workhorse for async AI
Azure Service Bus is a durable, ordered message broker for back-end work. When an HTTP request arrives but the work is too slow to do synchronously (embedding an article, generating a summary, calling an external API), drop a message on a Service Bus queue and let a worker pick it up. The user gets an instant response; the worker chews through messages on its own schedule.
For AI workloads, the killer features are:
- Reliable delivery: once a message is in the queue, it stays there until processed
- Dead-letter queue (DLQ): messages that fail too many times go to a side queue you can inspect
- Sessions: guarantee FIFO order within a session (e.g., per-user message order)
- KEDA scaling: Container Apps scales replicas based on queue depth
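To make queue-depth scaling concrete, here is a rough sketch of the arithmetic a queue-length scaler performs: divide the number of waiting messages by a target number of messages per replica, then clamp to the configured bounds. The function name and its parameters (`messages_per_replica`, `min_replicas`, `max_replicas`) are illustrative assumptions rather than actual Container Apps settings, and real scaling also involves polling intervals and cool-down periods that this ignores.

```python
import math


def desired_replicas(queue_depth: int, messages_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Approximate replica count for a given queue depth (illustrative only)."""
    if messages_per_replica <= 0:
        raise ValueError("messages_per_replica must be positive")
    # One replica per `messages_per_replica` waiting messages, rounded up.
    wanted = math.ceil(queue_depth / messages_per_replica)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 5 messages per replica, a backlog of 12 messages asks for 3 replicas, and an empty queue can scale to zero when `min_replicas` is 0.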
The lifecycle of a message
1. Sender places a message on the queue
2. Receiver calls Receive(): the message is locked (default lock duration 1 minute, configurable up to 5 minutes)
3. Receiver processes the message
4. Receiver calls Complete(): message is removed from the queue
   OR Abandon(): lock released, message returns to the queue (delivery count++)
   OR DeadLetter(): message moved to the DLQ with a reason
5. If the lock expires before any of those, message returns to the queue (delivery count++)
6. After MaxDeliveryCount attempts, the message auto-dead-letters
The "lock" is critical: while a receiver holds the lock on a message, no other receiver gets it. If the receiver crashes, the lock expires and another receiver picks up the message. This gives at-least-once delivery.
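The lifecycle can be mimicked with a small in-memory toy, which may help when reasoning about delivery counts. This is a simplified sketch and not the Service Bus SDK: `ToyQueue` and `Message` are hypothetical names, lock expiry is modelled as an explicit `abandon()` call rather than a timer, and the dead-letter rule assumes a message is moved once its delivery count reaches `max_delivery_count`.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity so list.remove() targets this exact message
class Message:
    body: str
    delivery_count: int = 0


@dataclass
class ToyQueue:
    """In-memory stand-in for a peek-lock queue (illustrative only)."""
    max_delivery_count: int = 10
    active: list = field(default_factory=list)       # visible to receivers
    locked: list = field(default_factory=list)       # held by a receiver
    dead_letter: list = field(default_factory=list)  # the DLQ

    def send(self, body: str) -> None:
        self.active.append(Message(body))

    def receive(self):
        """Lock the oldest visible message; return None if the queue is empty."""
        if not self.active:
            return None
        msg = self.active.pop(0)
        self.locked.append(msg)
        return msg

    def complete(self, msg: Message) -> None:
        """Successful processing: the message is gone for good."""
        self.locked.remove(msg)

    def abandon(self, msg: Message) -> None:
        """Release the lock (also models lock expiry) and count the attempt."""
        self.locked.remove(msg)
        msg.delivery_count += 1
        if msg.delivery_count >= self.max_delivery_count:
            self.dead_letter.append(msg)   # auto dead-letter
        else:
            self.active.insert(0, msg)     # visible again, keeps its position


# A message that always fails is dead-lettered after three attempts.
q = ToyQueue(max_delivery_count=3)
q.send("corrupt.jpg")
while (m := q.receive()) is not None:
    q.abandon(m)
```

After the loop, `q.active` is empty and `q.dead_letter` holds the one message with a delivery count of 3.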
The five outcomes of a Receive()
| Feature | Complete | Abandon | DeadLetter | Defer | Lock expired |
|---|---|---|---|---|---|
| When | Successful processing | Transient failure, retry later | Poison message, give up | Process out of order, defer for later | Crash, slow processing, network drop |
| Effect | Removed from queue | Returns to queue, delivery count++ | Moves to DLQ with reason + description | Removed from main queue, retrievable by SequenceNumber | Returns to queue, delivery count++ |
| Counts toward MaxDeliveryCount | No | Yes | Yes (last attempt) | No | Yes |
import asyncio

from azure.identity.aio import DefaultAzureCredential
from azure.servicebus.aio import ServiceBusClient

# process(), TransientError, and PoisonError are application-defined placeholders.


async def main():
    credential = DefaultAzureCredential()
    async with credential, ServiceBusClient(
        fully_qualified_namespace="roo-sb.servicebus.windows.net",
        credential=credential,
    ) as client:
        # max_wait_time=30: stop iterating after 30 s with no new messages
        receiver = client.get_queue_receiver(queue_name="image-jobs", max_wait_time=30)
        async with receiver:
            async for msg in receiver:
                try:
                    await process(msg)
                    await receiver.complete_message(msg)  # success: remove from queue
                except TransientError:
                    await receiver.abandon_message(msg)  # release lock, retry later
                except PoisonError as e:
                    await receiver.dead_letter_message(
                        msg, reason="Poison", error_description=str(e)
                    )


asyncio.run(main())
Dead-letter queues: the safety net
Every queue has an automatic sub-queue for dead messages. Two paths in:
| Path | When |
|---|---|
| Auto dead-letter on max delivery | A message exceeds MaxDeliveryCount (default 10), usually because it keeps being abandoned |
| Explicit dead-letter | Receiver calls DeadLetter() with a reason; the clean way to handle "this will never succeed" |
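The DLQ lives at the path <queue>/$DeadLetterQueue. In the Python SDK, the idiomatic way to open it is the sub_queue parameter:
from azure.servicebus import ServiceBusSubQueue

dlq_receiver = client.get_queue_receiver(
    queue_name="image-jobs",
    sub_queue=ServiceBusSubQueue.DEAD_LETTER,  # resolves to image-jobs/$DeadLetterQueue
    max_wait_time=10,
)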
The dead-letter queue is a queue too: you can inspect, reprocess, fix, and resubmit. Most production systems have a small admin tool that reads the DLQ, lets a human triage, and either re-sends or discards.
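A triage tool usually starts by grouping dead-lettered messages by reason, so a human can see at a glance what went wrong. The sketch below shows that grouping step on plain Python objects. `DeadLetterEntry` and `summarize_by_reason` are hypothetical names; in a real tool the reason and description would be copied from each received DLQ message (the Python SDK exposes them as `dead_letter_reason` and `dead_letter_error_description`).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DeadLetterEntry:
    """Hypothetical snapshot of one DLQ message, copied out for triage."""
    message_id: str
    reason: str
    description: str


def summarize_by_reason(entries):
    """Count dead-lettered messages per reason, most frequent first."""
    return Counter(entry.reason for entry in entries).most_common()


entries = [
    DeadLetterEntry("m1", "BadPayload", "truncated JPEG"),
    DeadLetterEntry("m2", "MaxDeliveryCountExceeded", "upstream timeout"),
    DeadLetterEntry("m3", "BadPayload", "invalid header"),
]
summary = summarize_by_reason(entries)
```

A summary like this lets the operator decide per reason: resubmit the transient failures in bulk, and inspect or discard the bad payloads.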
Real-world example: Mira's poison-image handling
Mira's worker embeds product images. About 0.1% of incoming images are corrupted (broken JPEGs, network truncation). Without explicit dead-lettering, each of those messages would be retried until MaxDeliveryCount is exhausted, wasting worker time on work that can never succeed.
Mira's pattern:
- Worker tries to process the image
- On TransientError (network blip, OpenAI quota): Abandon, so Service Bus retries
- On CorruptImageError (file is broken): DeadLetter with reason "BadPayload"
- After 3 abandons (which implies MaxDeliveryCount is set to 3 rather than the default 10), Service Bus auto-dead-letters with reason "MaxDeliveryCountExceeded"
- A nightly job reads the DLQ, re-encodes JPEGs that look fixable, re-submits them, deletes the rest
Result: the main queue never blocks on broken data. The DLQ is a small worklist, not a black hole.
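Mira's decision rule can be written as one small function that maps the outcome of processing to a disposition. This is a sketch under assumptions: the two exception classes are defined here as stand-ins for whatever the real worker raises, and treating unknown errors as retryable is a design choice, not something the pattern above prescribes.

```python
class TransientError(Exception):
    """Stand-in for a retryable failure (network blip, quota exceeded)."""


class CorruptImageError(Exception):
    """Stand-in for a payload that can never be processed."""


def choose_disposition(error):
    """Return (action, dead_letter_reason) for a processing outcome.

    `error` is None on success, otherwise the exception that was raised.
    """
    if error is None:
        return ("complete", None)
    if isinstance(error, CorruptImageError):
        # Retrying cannot help, so skip the retry budget entirely.
        return ("dead_letter", "BadPayload")
    # TransientError and anything unrecognised: abandon and let
    # MaxDeliveryCount cap the number of retries.
    return ("abandon", None)
```

The worker would call this inside its `except` handling and then invoke the matching settle method on the receiver.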
Sessions: FIFO within a key
Service Bus sessions group related messages and guarantee FIFO order within the session. Useful when message order matters per user / conversation / order-id.
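import json

from azure.servicebus import NEXT_AVAILABLE_SESSION, ServiceBusMessage

# Send: tag each message with a session ID (the queue must have sessions enabled)
sender = client.get_queue_sender(queue_name="orders")
await sender.send_messages(
    ServiceBusMessage(body=json.dumps(payload), session_id=user_id)
)

# Receive: get messages from one session at a time
session_receiver = client.get_queue_receiver(
    queue_name="orders", session_id=NEXT_AVAILABLE_SESSION,
)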
The session lock lasts as long as one receiver holds it; other receivers can get other sessions in parallel. So sessions give you per-key FIFO without serialising the whole queue.
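The effect of sessions on ordering can be illustrated without the SDK: messages from different keys arrive interleaved, but each session is an independent FIFO lane that a single worker drains in order. The sketch below only models the grouping; `group_by_session` is a hypothetical helper, and real session receivers also hold and renew a session lock that is not shown.

```python
from collections import defaultdict, deque


def group_by_session(messages):
    """Split (session_id, body) pairs into one FIFO lane per session."""
    sessions = defaultdict(deque)
    for session_id, body in messages:
        sessions[session_id].append(body)  # arrival order is kept per session
    return sessions


# Interleaved arrivals from two users.
arrivals = [
    ("alice", "a1"), ("bob", "b1"), ("alice", "a2"),
    ("bob", "b2"), ("alice", "a3"),
]
sessions = group_by_session(arrivals)

# One worker could drain "alice" while another drains "bob" in parallel;
# within each lane the order is exactly the send order.
alice_order = list(sessions["alice"])
bob_order = list(sessions["bob"])
```

Throughput scales with the number of distinct sessions, so a key with a handful of values (for example, one session for the whole system) would serialise the queue again.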
Authentication: managed identity recommended
# Grant the Container App's managed identity the right RBAC role
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Azure Service Bus Data Receiver" \
  --scope "$(az servicebus namespace show -n roo-sb -g roo-prod --query id -o tsv)"
Three built-in roles (data-plane):
| Role | Permissions | Use for |
|---|---|---|
| Azure Service Bus Data Owner | Full data-plane | Admin tools |
| Azure Service Bus Data Sender | Send only | Producers |
| Azure Service Bus Data Receiver | Receive (peek, complete, abandon, dead-letter) | Consumers |
Receive modes: peek-lock vs receive-and-delete
| Mode | What happens | Use for |
|---|---|---|
| PeekLock (default) | Message is locked but stays in queue; you Complete or Abandon | Reliable processing with recovery from failures |
| ReceiveAndDelete | Message is removed from queue immediately on receive | Telemetry where it's OK to lose a message on crash |
ReceiveAndDelete is faster but risky for AI workloads: a worker crash drops the message permanently. Default to PeekLock unless you have a clear reason.
Scheduled and TTL messages
from datetime import datetime, timedelta, timezone

# TTL: message expires if not consumed within 1 hour
message = ServiceBusMessage(body=payload, time_to_live=timedelta(hours=1))

# Scheduled: make the message available for processing in 5 minutes
scheduled_time = datetime.now(timezone.utc) + timedelta(minutes=5)
sequence_numbers = await sender.schedule_messages(message, scheduled_time)  # list of sequence numbers
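A per-message TTL can shorten the queue's default TTL but not extend it; if the message value is longer, the queue default applies. Expired messages are either moved to the dead-letter queue (if the queue has dead-lettering on expiration enabled) or silently discarded.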
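The interaction between the queue default and a per-message TTL comes down to taking the shorter of the two. The helper below is a minimal sketch of that rule, assuming the broker caps a message's TTL at the queue default; `expires_at` is a hypothetical function for reasoning about expiry, not an SDK call.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def expires_at(enqueued_at: datetime,
               queue_default_ttl: timedelta,
               message_ttl: Optional[timedelta] = None) -> datetime:
    """Return when a message expires, given queue and message TTLs."""
    if message_ttl is None:
        effective = queue_default_ttl
    else:
        # Assumption: a message TTL longer than the queue default is capped.
        effective = min(message_ttl, queue_default_ttl)
    return enqueued_at + effective


enqueued = datetime(2024, 1, 1, tzinfo=timezone.utc)
short_lived = expires_at(enqueued, timedelta(days=14), timedelta(hours=1))
defaulted = expires_at(enqueued, timedelta(days=14))
capped = expires_at(enqueued, timedelta(days=14), timedelta(days=30))
```

Here `short_lived` expires after one hour, while `defaulted` and `capped` both expire after the 14-day queue default.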
Knowledge check
Mira's worker is processing image-embedding messages. A message contains a corrupted JPEG and will never succeed. What's the cleanest disposition to use?
Theo's clinical-message processing must guarantee that all messages for the same patient are processed in the order they were sent, even with multiple worker replicas. What feature does Theo need?
Lin's Container App connects to Service Bus with a connection string in app settings. The security team wants the connection string removed. What's the recommended replacement?