End-to-End Observability: Putting It All Together

What “production observability” actually means

Simple explanation

Observability is the property of a system that lets you answer new questions about its behaviour from the telemetry it already emits, without changing the code. For an AI back-end, that means answering: “is it up?”, “is it fast?”, “is it accurate?”, “where did the cost come from?”, “did it leak a secret?”, “what did this user see?”

The four Azure pieces fit together like this:

Key Vault — secrets out of code, audited every time they’re read
App Configuration — feature flags + environment-specific config + audit of changes
OpenTelemetry — traces, logs, metrics flowing to Application Insights
KQL — the query language that ties it all back together

Golden signals for AI back-ends

The classic four golden signals — latency, traffic, errors, saturation — translate cleanly. Add a fifth for AI:

Signal	What to measure	Where to find it
Latency	p50/p95/p99 of request duration per route	`requests`
Traffic	requests/sec per route	`requests | summarize count() by bin(...)`
Errors	rate of `success == false` per route	`requests | summarize countif(success == false)`
Saturation	replica count, CPU%, memory%	Container Insights, Container Apps system logs
AI quality	token cost, retrieval quality, refusal rate, hallucination markers	Custom metrics + spans

A reasonable starter dashboard renders these five plus a few drill-downs.
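To make the first three signals concrete, here is a minimal Python sketch of what the dashboard queries compute. It assumes a hypothetical `RequestRecord` type standing in for one row of the `requests` table, and it uses an exact nearest-rank percentile, whereas KQL's `percentile()` returns an approximation. Treat it as an illustration of the arithmetic, not as how Application Insights works internally.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical stand-in for one row of the `requests` table."""
    route: str
    duration_ms: float
    success: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(records: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Latency, traffic, and error signals for one route over one time window."""
    durations = [r.duration_ms for r in records]
    failures = sum(1 for r in records if not r.success)
    return {
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "requests_per_sec": len(records) / window_seconds,
        "error_rate": failures / len(records),
    }
```

Saturation and AI quality are left out on purpose: they come from platform metrics and custom spans rather than from request rows.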

SLOs — what counts as “broken”

A service-level objective is a numeric target that defines the contract:

- 99.5% of POST /chat requests complete in under 2 seconds (rolling 30 days)
- Error rate < 1% (rolling 1 hour)
- p95 token cost per request < $0.05 (rolling 24 hours)

Translate each SLO to a KQL query. Translate each query to an alert. The alert + the runbook = an actionable response.

// Error-rate SLO (rolling 1 hour, alert if > 1%)
requests
| where timestamp > ago(1h)
| where name startswith "POST /chat"
| summarize errors = countif(success == false), total = count()
| extend error_rate = todouble(errors) / total
| where error_rate > 0.01

Schedule that query as an Azure Monitor log alert and route it to PagerDuty, Teams, or email. When it fires, the on-call engineer follows the runbook (covered below).
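The arithmetic behind these SLOs can be sketched in a few lines of Python. The first function mirrors the error-rate query above. The second computes how much of an error budget is left, a common way to read a target such as 99.5%: out of every 1,000 requests, 5 may miss the target. The function names are illustrative assumptions, not part of any Azure SDK.

```python
def error_rate_breached(errors: int, total: int, threshold: float = 0.01) -> bool:
    """True when errors / total exceeds the threshold (1% by default)."""
    if total == 0:
        return False  # no traffic in the window, nothing to alert on
    return errors / total > threshold


def error_budget_remaining(total: int, bad: int, target: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached.

    `target` is the SLO as a fraction, for example 0.995 for 99.5%.
    `bad` counts requests that missed the objective (too slow, or failed).
    """
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - target) * total
    return 1.0 - bad / allowed_bad
```

For example, with 200,000 requests in the window and a 99.5% target, 1,000 requests may miss the objective; if 500 already have, about half of the budget remains.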

Alerts — the operational primitive

Azure Monitor alerts evaluate a condition and notify an action group when it is met. Three flavours:

Type	What it does
Metric alert	Threshold on an Azure-native metric (CPU, requests/sec)
Log alert	Run KQL on a schedule; alert if rows match (or count exceeds threshold)
Activity log alert	Trigger on Azure resource events (resource deleted, role assignment changed)

For AI back-ends, log alerts on KQL are the workhorse — they cover everything you instrumented, including custom AI metrics.

# 'rows' in --condition is a placeholder name; --condition-query binds it to the KQL text.
az monitor scheduled-query create \
  --name "chat-error-rate" \
  --resource-group roo-prod \
  --scopes "$APP_INSIGHTS_RESOURCE_ID" \
  --action-groups "$AG_RESOURCE_ID" \
  --condition "count 'rows' > 0" \
  --condition-query rows='requests | where timestamp > ago(15m) | where name startswith "POST /chat" | summarize errors = countif(success == false), total = count() | where todouble(errors) / total > 0.01' \
  --description "Chat error rate > 1% rolling 15 min" \
  --evaluation-frequency 5m \
  --window-size 15m \
  --severity 2
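The interaction between `--evaluation-frequency` and `--window-size` is easier to see in a small simulation. The Python sketch below is a simplified model under stated assumptions: it evaluates a rule every 5 minutes over the trailing 15 minutes and fires when the error rate in that window exceeds 1%. The function name and the event format are hypothetical, and the real service's evaluation timing and alignment may differ.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def evaluate_rule(
    events: list[tuple[datetime, bool]],
    start: datetime,
    end: datetime,
    frequency: timedelta = timedelta(minutes=5),
    window: timedelta = timedelta(minutes=15),
    threshold: float = 0.01,
) -> list[datetime]:
    """Return the evaluation times at which the rule would fire.

    Each event is (timestamp, success). At every tick, only events in the
    half-open window (tick - window, tick] are considered.
    """
    fired = []
    tick = start
    while tick <= end:
        in_window = [ok for ts, ok in events if tick - window < ts <= tick]
        if in_window:
            error_rate = in_window.count(False) / len(in_window)
            if error_rate > threshold:
                fired.append(tick)
        tick += frequency
    return fired
```

Because the window is three times the evaluation frequency, one burst of failures is seen by up to three consecutive evaluations. That overlap is worth keeping in mind when deciding how an alert should resolve.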

Runbooks — what to do when an alert fires

A runbook is the human-readable answer to “the alert just paged me at 2am — now what?” Three sections:

Triage — how to tell what’s actually wrong (specific KQL queries, App Insights links, dashboard URLs)
Mitigate — actions that buy time (scale up, kill the bad feature flag, swap deployment slot back, redirect traffic)
Investigate — once mitigated, the deeper diagnosis (full trace inspection, logs, customer impact assessment)

Real-world example: Mira's chat-error-rate runbook

Alert: chat error rate > 1% rolling 15 min

Triage queries:

requests | where timestamp > ago(15m) and name startswith "POST /chat"
         | summarize errors = countif(success==false), total = count() by bin(timestamp, 1m)

exceptions | where timestamp > ago(15m) | summarize count() by problemId | top 5 by count_

Mitigate (in this order):

Set feature flag EnableNewRagPipeline to 0% (App Configuration → save; takes effect within about 30 s, depending on the app's refresh interval)
If still failing: swap deployment slots (production ↔ staging) — instant rollback
If Service Bus queue is backed up: scale max replicas up

Investigate:

Check App Insights end-to-end transactions for the slowest 10 requests
Search exceptions for new error types in the past hour
Check Cosmos / Postgres metrics — RU exhaustion or connection storms?

A worked example — Mira’s morning

08:00 — alert fires: error rate on POST /chat is 4% (normal: 0.2%).

exceptions | where timestamp > ago(15m) | summarize count() by problemId | top 5 by count_

Top exception: httpx.TimeoutException at openai_client.embed. The embedding API is slow. Mira looks at:

dependencies
| where timestamp > ago(15m) and name == "POST /openai/embeddings"
| summarize p95 = percentile(duration, 95), errors = countif(success==false) by bin(timestamp, 1m)
| render timechart

p95 latency on embeddings has gone from 300 ms to 8 s, which points to Azure OpenAI throttling.

Mitigate: flip the App Configuration feature flag EmbeddingFallback from false to true. The worker now serves the cached embedding on cache hits and skips re-embedding edited content if the most recent embedding is less than 24 hours old.

Ten minutes later, the error rate is back to 0.2%. Mira files a ticket with Azure OpenAI to raise the quota, schedules a retro for tomorrow, and goes back to her coffee. The whole incident is recorded in App Insights with full trace evidence; the mitigation was a single config change.
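The fallback behind the EmbeddingFallback flag could look roughly like the Python sketch below. It is a hedged illustration, not Mira's actual worker: the flag is passed in as a plain boolean (in a real service it would be read from App Configuration), the cache is an in-memory dict keyed by a hypothetical `doc_id`, and `embed_remote` stands in for the call to the embedding API.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable

MAX_CACHE_AGE_SECONDS = 24 * 60 * 60  # the 24-hour freshness rule from the incident


@dataclass
class CachedEmbedding:
    vector: list[float]
    created_at: float  # Unix timestamp of when the vector was computed


def get_embedding(
    doc_id: str,
    text: str,
    cache: dict[str, CachedEmbedding],
    embed_remote: Callable[[str], list[float]],
    fallback_enabled: bool,
    now: float | None = None,
) -> list[float]:
    """Return an embedding, reusing a recent cached one while the fallback flag is on."""
    now = time.time() if now is None else now
    cached = cache.get(doc_id)
    is_fresh = cached is not None and now - cached.created_at < MAX_CACHE_AGE_SECONDS
    if fallback_enabled and is_fresh:
        return cached.vector  # skip the slow or throttled remote call
    vector = embed_remote(text)
    cache[doc_id] = CachedEmbedding(vector, now)
    return vector
```

The trade-off is explicit: while the flag is on, an edited document may be searched with an embedding up to 24 hours stale, in exchange for taking load off the throttled dependency.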

Tying the four pieces together

Service	What it covered in Mira’s morning
Key Vault	Held the OpenAI key; nothing exposed in the alert workflow; rotation possible without downtime
App Configuration	The `EmbeddingFallback` feature flag was the mitigation — runtime change, no redeploy
OpenTelemetry	Traces and dependency records made the breakdown visible
KQL	The investigation queries that pinned down the problem in 90 seconds

This is the integrated production story AI-200 expects you to internalise.

Common AI-specific alerts to ship from day one

Alert	Query shape
Error rate spike	`requests | where timestamp > ago(15m) | summarize error_rate = todouble(countif(success == false)) / count() | where error_rate > 0.01`
Cosmos throttling	`dependencies | where target endswith "documents.azure.com" | summarize throttled = countif(resultCode == "429")`
OpenAI throttling	`dependencies | where target endswith "openai.azure.com" | summarize throttled = countif(resultCode == "429")`
Probe failures (Container Apps)	`ContainerAppSystemLogs_CL | where Reason_s == "Unhealthy"`
Replica restarts	`ContainerAppSystemLogs_CL | where Reason_s in ("Killing", "BackOff")`
Token cost runaway	`dependencies | where name == "rag.generate" | summarize sum(toint(customDimensions['gen_ai.usage.input_tokens']))`
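The token-cost alert only sums tokens; turning that into money and checking it against a budget such as "p95 under $0.05 per request" takes one more step. The Python sketch below shows that step. The per-1,000-token prices are parameters because real prices vary by model and region, and both function names are illustrative assumptions.

```python
from math import ceil


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one generation call, given caller-supplied prices per 1,000 tokens."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


def p95_cost_breached(costs_usd: list[float], budget_usd: float = 0.05) -> bool:
    """True when the nearest-rank p95 of per-request cost exceeds the budget."""
    if not costs_usd:
        return False  # no requests in the window
    ordered = sorted(costs_usd)
    rank = max(1, ceil(95 * len(ordered) / 100))
    return ordered[rank - 1] > budget_usd
```

Using p95 rather than the mean keeps a handful of very long prompts from hiding behind many cheap requests, while still tolerating the occasional outlier.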

Final exam framing — the next steps

You’ve now covered every domain on AI-200:

Domain 1 — Containers (ACR, App Service containers, Container Apps + KEDA, AKS, troubleshooting)
Domain 2 — Data services (Cosmos NoSQL incl. vectors + change feed, PostgreSQL + pgvector, Managed Redis)
Domain 3 — Connect (Service Bus, Event Grid, Functions, choosing between them)
Domain 4 — Secure / monitor / troubleshoot (Key Vault, App Configuration, OpenTelemetry, KQL, end-to-end observability)

The exam pattern: every question maps a real-world scenario (often featuring developers like Mira, Theo, Priya, or Lin) to the right Azure primitive at the right scope with the right configuration. There’s a short list of “common pivot points” — managed identity vs passwords, scale-to-zero vs always-warm, single-partition vs cross-partition, push vs pull, secret in Key Vault vs config in App Configuration — and most questions test which side of the pivot fits.

Trust your intuition. The AI-200 exam rewards the simplest service that fits — over-engineered answers are usually wrong, even when they technically work.

Good luck.