End-to-End Observability: Putting It All Together
Stitching Key Vault, App Configuration, OpenTelemetry, and KQL into a production observability story. The trace, the dashboard, the alert, the runbook, the post-incident review, and Mira's pager.
What "production observability" actually means
Observability is the property of a system that lets you ask any question of it without changing the code. For an AI back-end, that means answering: "is it up?", "is it fast?", "is it accurate?", "where did the cost come from?", "did it leak a secret?", "what did this user see?"
The four Azure pieces fit together like this:
- Key Vault: secrets out of code, audited every time they're read
- App Configuration: feature flags + environment-specific config + audit of changes
- OpenTelemetry: traces, logs, metrics flowing to Application Insights
- KQL: the query language that ties it all back together
Golden signals for AI back-ends
The classic four golden signals (latency, traffic, errors, saturation) translate cleanly. Add a fifth for AI:
| Signal | What | Where |
|---|---|---|
| Latency | p50/p95/p99 of request duration per route | requests |
| Traffic | requests/sec per route | requests \| summarize count() by bin(...) |
| Errors | success=false rate per route | requests \| countif(success == false) |
| Saturation | replica count, CPU%, memory% | Container Insights, Container Apps system logs |
| AI quality | token cost, retrieval quality, refusal rate, hallucination markers | Custom metrics + spans |
A reasonable starter dashboard renders these five plus a few drill-downs.
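To make the table concrete, here is a minimal sketch of how the latency, traffic, and error signals could be computed from raw request records. The `Request` type, the sample data, and the nearest-rank percentile method are illustrative assumptions; KQL's `percentile()` uses its own approximation, and saturation and AI quality are left out because they come from other sources.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for one row of the `requests` table."""
    route: str
    duration_ms: float
    success: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Latency, traffic, and error signals for one route over one time window."""
    durations = [r.duration_ms for r in requests]
    failures = sum(1 for r in requests if not r.success)
    return {
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "requests_per_sec": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
    }


# Five sample requests in a 10-second window; the slowest one also failed.
sample = [Request("POST /chat", float(ms), ms != 2000) for ms in (100, 200, 300, 400, 2000)]
signals = golden_signals(sample, window_seconds=10)
```

With so few samples the p95 and p99 both land on the single slow request, which is why percentile panels are only meaningful once a route has enough traffic.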
SLOs: what counts as "broken"
A service-level objective is a numeric target that defines the contract:
- 99.5% of POST /chat requests complete in under 2 seconds (rolling 30 days)
- Error rate < 1% rolling 1 hour
- Token-cost per request < $0.05 p95 rolling 24 hours
Translate each SLO to a KQL query. Translate each query to an alert. The alert + the runbook = an actionable response.
// Error-rate SLO (rolling 1 hour, alert if > 1%)
requests
| where timestamp > ago(1h)
| where name startswith "POST /chat"
| summarize errors = countif(success == false), total = count()
| extend error_rate = todouble(errors) / total
| where error_rate > 0.01
Schedule that as an Azure Monitor alert and route it to PagerDuty, Teams, or email; when it fires, the on-call engineer follows the runbook (next section).
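An SLO such as "99.5% of requests meet the objective" also implies an error budget: the number of requests that may miss the target before the objective is breached. The arithmetic can be sketched as follows; the traffic volume and the count of bad requests are hypothetical numbers chosen for illustration.

```python
def error_budget(total_requests: int, slo_target: float) -> int:
    """Requests allowed to miss the objective, rounded to the nearest whole request."""
    return round(total_requests * (1 - slo_target))


def budget_consumed(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget already used (1.0 means the SLO is exactly exhausted)."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        raise ValueError("no error budget at this volume and target")
    return bad_requests / budget


# Hypothetical month: 1,000,000 POST /chat requests against the 99.5% latency SLO.
monthly_requests = 1_000_000
budget = error_budget(monthly_requests, slo_target=0.995)
consumed = budget_consumed(bad_requests=1_200, total_requests=monthly_requests, slo_target=0.995)
```

Tracking the consumed fraction alongside the raw alert gives the on-call engineer a sense of how much room is left in the window, not just whether the threshold was crossed.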
Alerts: the operational primitive
Azure Monitor alerts evaluate a condition against your telemetry and notify an action group when it is met. Three flavours:
| Type | What it does |
|---|---|
| Metric alert | Threshold on an Azure-native metric (CPU, requests/sec) |
| Log alert | Run KQL on a schedule; alert if rows match (or count exceeds threshold) |
| Activity log alert | Trigger on Azure resource events (resource deleted, role assignment changed) |
For AI back-ends, log alerts on KQL are the workhorse: they cover everything you instrumented, including custom AI metrics.
az monitor scheduled-query create \
--name "chat-error-rate" \
--resource-group roo-prod \
--action-group $AG_RESOURCE_ID \
--condition "count 'rows' > 0" \
--condition-query rows='requests | where timestamp > ago(15m) | where name startswith "POST /chat" | summarize errors = countif(success == false), total = count() | where todouble(errors) / total > 0.01' \
--description "Chat error rate > 1% rolling 15 min" \
--evaluation-frequency 5m \
--window-size 15m \
--severity 2 \
--scopes $APP_INSIGHTS_RESOURCE_ID
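The interaction between `--window-size` and `--evaluation-frequency` is easier to see in a small simulation. The sketch below is a simplified model of a scheduled log alert, not Azure Monitor's actual implementation: every `frequency` it looks back over `window` and checks the error rate against the threshold. The function name and the event data are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_log_alert(
    events: list[tuple[datetime, bool]],  # (timestamp, success) per request
    start: datetime,
    end: datetime,
    window: timedelta = timedelta(minutes=15),
    frequency: timedelta = timedelta(minutes=5),
    threshold: float = 0.01,
) -> list[datetime]:
    """Return the evaluation times at which the error-rate condition holds."""
    fired = []
    now = start
    while now <= end:
        # Look back over the half-open window (now - window, now].
        in_window = [ok for ts, ok in events if now - window < ts <= now]
        if in_window:
            error_rate = in_window.count(False) / len(in_window)
            if error_rate > threshold:
                fired.append(now)
        now += frequency
    return fired


# One successful request per minute, plus a single failure at 08:12.
t0 = datetime(2024, 1, 1, 8, 0)
events = [(t0 + timedelta(minutes=m), True) for m in range(-15, 31)]
events.append((t0 + timedelta(minutes=12), False))
fired = evaluate_log_alert(events, start=t0, end=t0 + timedelta(minutes=30))
```

In this model the single failure keeps the condition true for three consecutive evaluations (08:15, 08:20, 08:25), because each 15-minute window still contains it. That is worth remembering when reading alert history: a long-firing alert does not necessarily mean the failures are ongoing.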
Runbooks: what to do when an alert fires
A runbook is the human-readable answer to "the alert just paged me at 2am, now what?" Three sections:
- Triage: how to tell what's actually wrong (specific KQL queries, App Insights links, dashboard URLs)
- Mitigate: actions that buy time (scale up, kill the bad feature flag, swap deployment slot back, redirect traffic)
- Investigate: once mitigated, the deeper diagnosis (full trace inspection, logs, customer impact assessment)
Real-world example: Mira's chat-error-rate runbook
Alert: chat error rate > 1% rolling 15 min
Triage queries:
requests | where timestamp > ago(15m) and name startswith "POST /chat"
| summarize errors = countif(success==false), total = count() by bin(timestamp, 1m)
exceptions | where timestamp > ago(15m) | summarize count() by problemId | top 5 by count_
Mitigate (in this order):
- Set feature flag EnableNewRagPipeline to 0% (App Configuration → save → takes effect within 30 s)
- If still failing: swap deployment slots (production ↔ staging) for an instant rollback
- If Service Bus queue is backed up: scale max replicas up
Investigate:
- Check App Insights end-to-end transactions for the slowest 10 requests
- Search exceptions for new error types in the past hour
- Check Cosmos / Postgres metrics: RU exhaustion or connection storms?
A worked example: Mira's morning
At 08:00 the alert fires: the error rate on POST /chat is 4% (normal: 0.2%). Mira starts with the runbook's triage query:
exceptions | where timestamp > ago(15m) | summarize count() by problemId | top 5 by count_
Top exception: httpx.TimeoutException at openai_client.embed. The embedding API is slow. Mira looks at:
dependencies
| where timestamp > ago(15m) and name == "POST /openai/embeddings"
| summarize p95 = percentile(duration, 95), errors = countif(success==false) by bin(timestamp, 1m)
| render timechart
p95 latency on embeddings has gone from 300 ms to 8 s. Azure OpenAI is throttling.
Mitigate: flip the App Configuration feature flag EmbeddingFallback from false to true. The worker now uses the cached embedding for cache hits, and skips re-embedding edits if the most recent embedding is less than 24 hours old.
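The fallback behaviour can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the worker's real code: `CachedEmbedding`, `get_embedding`, the in-memory `dict` cache, and `fake_embed` are hypothetical names, and the flag value is passed in as a plain boolean rather than read from App Configuration. The sketch also collapses the two rules in the text into a single freshness check.

```python
from dataclasses import dataclass
from typing import Callable

MAX_AGE_SECONDS = 24 * 60 * 60  # "less than 24 hours old", from the text above


@dataclass
class CachedEmbedding:
    vector: list[float]
    created_at: float  # Unix seconds


def get_embedding(
    doc_id: str,
    text: str,
    cache: dict[str, CachedEmbedding],
    embed: Callable[[str], list[float]],
    fallback_enabled: bool,
    now: float,
) -> list[float]:
    """Return a cached vector when the fallback flag is on and the entry is fresh."""
    cached = cache.get(doc_id)
    if fallback_enabled and cached is not None and now - cached.created_at < MAX_AGE_SECONDS:
        return cached.vector  # skip the slow or throttled embedding call
    vector = embed(text)
    cache[doc_id] = CachedEmbedding(vector, created_at=now)
    return vector


calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for the real embedding client; records each call."""
    calls.append(text)
    return [float(len(text))]


# Flag on, cache entry one hour old: the embedding API is not called.
cache = {"doc-1": CachedEmbedding(vector=[1.0], created_at=0.0)}
served = get_embedding("doc-1", "edited text", cache, fake_embed, fallback_enabled=True, now=3600.0)
```

The trade-off is explicit in the code: while the flag is on, an edited document may be served with an embedding up to 24 hours stale, in exchange for taking load off the throttled dependency.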
Ten minutes later, the error rate is back to 0.2%. Mira files a ticket with Azure OpenAI to raise the quota, schedules a retro for tomorrow, and goes back to her coffee. The whole incident is recorded in App Insights with full trace evidence; the rollback was a single config change.
Tying the four pieces together
| Service | What it covered in Mira's morning |
|---|---|
| Key Vault | Held the OpenAI key; nothing exposed in the alert workflow; rotation possible without downtime |
| App Configuration | The EmbeddingFallback feature flag was the mitigation: a runtime change, no redeploy |
| OpenTelemetry | Traces and dependency records made the breakdown visible |
| KQL | The investigation queries that pinned the problem in 90 seconds |
This is the integrated production story AI-200 expects you to internalise.
Common AI-specific alerts to ship from day one
| Alert | Query shape |
|---|---|
| Error rate spike | requests \| where timestamp > ago(15m) \| summarize error_rate = todouble(countif(success == false)) / count() \| where error_rate > 0.01 |
| Cosmos throttling | dependencies \| where target endswith "documents.azure.com" \| summarize countif(resultCode == "429") |
| OpenAI throttling | dependencies \| where target endswith "openai.azure.com" \| summarize countif(resultCode == "429") |
| Probe failures (Container Apps) | ContainerAppSystemLogs_CL \| where Reason_s == "Unhealthy" |
| Replica restarts | ContainerAppSystemLogs_CL \| where Reason_s in ("Killing", "BackOff") |
| Token cost runaway | dependencies \| where name == "rag.generate" \| summarize sum(toint(customDimensions['gen_ai.usage.input_tokens'])) |
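The token-cost alert ties back to the cost SLO (under $0.05 per request). A rough sketch of the per-request arithmetic is below. The prices are hypothetical placeholders, not real Azure OpenAI rates, and the sketch counts output tokens as well as the input tokens that the query above sums. It flags individual requests rather than computing the p95.

```python
INPUT_PRICE_PER_1K = 0.001   # hypothetical USD per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.002  # hypothetical USD per 1,000 output tokens
COST_SLO_USD = 0.05          # per-request target from the SLO list


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one generation call at the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_1K + output_tokens * OUTPUT_PRICE_PER_1K) / 1000


# Hypothetical (input_tokens, output_tokens) pairs keyed by request ID.
usage = {
    "req-a": (4_000, 500),
    "req-b": (60_000, 2_000),  # an oversized prompt, e.g. too many retrieved chunks
    "req-c": (1_200, 300),
}
costs = {rid: request_cost_usd(i, o) for rid, (i, o) in usage.items()}
over_budget = sorted(rid for rid, cost in costs.items() if cost > COST_SLO_USD)
```

In practice the token counts would come from the span attributes recorded on each generation call, and the prices from whatever rate applies to the deployed model.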
Final exam framing: the next steps
You've now covered every domain on AI-200:
- Domain 1 β Containers (ACR, App Service containers, Container Apps + KEDA, AKS, troubleshooting)
- Domain 2 β Data services (Cosmos NoSQL incl vectors + change feed, PostgreSQL + pgvector, Managed Redis)
- Domain 3 β Connect (Service Bus, Event Grid, Functions, choosing between them)
- Domain 4 β Secure / monitor / troubleshoot (Key Vault, App Configuration, OpenTelemetry, KQL, end-to-end observability)
The exam pattern: every question maps a real-world scenario (often featuring developers like Mira, Theo, Priya, or Lin) to the right Azure primitive at the right scope with the right configuration. There's a short list of "common pivot points" (managed identity vs passwords, scale-to-zero vs always-warm, single-partition vs cross-partition, push vs pull, secret in Key Vault vs config in App Configuration), and most questions test which side of the pivot fits.
Trust your intuition. The AI-200 exam rewards the simplest service that fits; over-engineered answers are usually wrong, even when they technically work.
Good luck.
Knowledge check
Mira's pager goes off: chat error rate spiked to 4% from 0.2%. What's the first thing she should run?
Theo's runbook calls for a feature-flag mitigation. The flag lives in App Configuration. He needs the change to take effect across 12 Container App replicas within 30 seconds. Which mechanism makes that timing realistic?
Lin's team wants automated monitoring for Cosmos throttling. Which approach combines OTel data with Azure Monitor alerts most cleanly?