KQL for AI Apps: Querying Logs + Metrics

Why KQL is non-negotiable

Simple explanation

Kusto Query Language (KQL) is how you read everything Azure observes. Application Insights traces, Container Apps logs, KubeEvents, Service Bus diagnostics, AKS Container Insights, Log Analytics — they all answer to KQL.

Three rules of thumb to memorise:

Filter early with where on time and identifier columns. Less data scanned means faster queries, and on Basic and Auxiliary log tables queries are billed per GB scanned
Project to what you need with project after filters
Summarise with summarize for aggregations — sum, count, percentile, average
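The three rules can be tried without any Azure resource. The sketch below is a minimal illustration: the inline datatable and its rows are hypothetical stand-ins for the requests table, not real telemetry.

```kusto
// Hypothetical sample rows standing in for the `requests` table.
let sample_requests = datatable(timestamp: datetime, name: string, duration: real, success: bool)
[
    datetime(2024-01-01 10:00:00), "POST /embed", 120.0, true,
    datetime(2024-01-01 10:00:30), "POST /embed", 480.0, true,
    datetime(2024-01-01 10:01:10), "GET /health", 5.0, true,
    datetime(2024-01-01 10:01:40), "POST /embed", 950.0, false
];
sample_requests
| where timestamp >= datetime(2024-01-01 10:00:00) and name == "POST /embed"   // 1. filter early
| project timestamp, duration, success                                          // 2. keep only the columns needed
| summarize calls = count(),                                                    // 3. aggregate
            failures = countif(success == false),
            avg_ms = avg(duration)
```

With these four rows the result is a single row: 3 calls, 1 failure, and an average of roughly 516.7 ms.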

The exam tests reading KQL — given a query, what does it tell you? It also tests writing KQL for common scenarios — p95 latency, error rates, dependency analysis.

Tables you must know

Table	Source	Holds
`requests`	Application Insights	Inbound HTTP requests handled by your app
`dependencies`	Application Insights	Outbound calls (HTTP, SQL, Service Bus, Cosmos, etc.) — including span data
`traces`	Application Insights	Application logs (info / warn / error) emitted via the OTel SDK
`exceptions`	Application Insights	Captured exceptions with stack traces
`customMetrics`	Application Insights	Metrics emitted via the OTel meter API
`ContainerAppConsoleLogs_CL`	Log Analytics	Container app stdout/stderr
`ContainerAppSystemLogs_CL`	Log Analytics	Container app platform events (image pulls, scale, restarts)
`ContainerLog` (AKS)	Log Analytics	AKS pod stdout/stderr (newer schema: `ContainerLogV2`)
`KubeEvents` (AKS)	Log Analytics	AKS pod events (scheduling, restarts, OOMKilled)
`AzureDiagnostics`	Log Analytics	Diagnostic logs for many Azure services

Five queries that solve real problems

1. P95 latency over time

requests
| where timestamp > ago(1h)
| where name == "POST /embed"
| summarize p50 = percentile(duration, 50),
            p95 = percentile(duration, 95),
            p99 = percentile(duration, 99)
            by bin(timestamp, 1m)
| render timechart

A line chart of latency percentiles per minute. The first thing to look at when “the API feels slow.”
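The per-minute grouping comes from bin(timestamp, 1m), which rounds each timestamp down to the start of its minute. A small sketch with hypothetical rows shows the effect:

```kusto
// Hypothetical rows; bin() rounds each timestamp down to its 1-minute bucket.
datatable(timestamp: datetime, duration: real)
[
    datetime(2024-01-01 10:03:05), 120.0,
    datetime(2024-01-01 10:03:47), 480.0,
    datetime(2024-01-01 10:04:10), 950.0
]
| summarize calls = count(), max_ms = max(duration) by bin(timestamp, 1m)
| order by timestamp asc
```

The first two rows land in the 10:03 bucket (2 calls, max 480 ms) and the third in the 10:04 bucket (1 call, max 950 ms).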

2. Error rate per route

requests
| where timestamp > ago(1d)
| summarize total = count(), errors = countif(success == false) by name
| extend error_rate_pct = round(100.0 * errors / total, 2)
| where total > 100
| order by error_rate_pct desc

Routes ranked by error rate, with a minimum-volume gate so noisy low-traffic routes don’t dominate.
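To see why the volume gate matters, here is the same shape of query on a tiny hypothetical sample. The route names and the lowered threshold of 3 are assumptions chosen to fit the sample size.

```kusto
// Hypothetical sample: one busy route and one route with a single failed call.
let sample_requests = datatable(name: string, success: bool)
[
    "POST /embed", true,
    "POST /embed", false,
    "POST /embed", true,
    "POST /embed", true,
    "GET /rare",   false
];
sample_requests
| summarize total = count(), errors = countif(success == false) by name
| extend error_rate_pct = round(100.0 * errors / total, 2)
| where total >= 3            // tiny gate for a tiny sample; the query above uses 100
| order by error_rate_pct desc
```

Without the gate, GET /rare would top the list at 100% on the strength of one call. With it, only POST /embed remains, at 25%.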

3. Dependency breakdown for slow requests

requests
| where timestamp > ago(1h) and duration > 5000   // slow ones (duration is in ms)
| project oid = operation_Id, total_ms = duration
| join kind=inner (
    dependencies
    | where timestamp > ago(1h)                    // bound the joined table too
    | project oid = operation_Id, dep_target = target,
              dep_type = type, dep_ms = duration
) on oid
| summarize total_dep_ms = sum(dep_ms), n = count() by dep_target, dep_type
| order by total_dep_ms desc

For requests over 5 seconds, where did the time go? Which downstream service ate the budget?

4. AI-specific custom dimension query

dependencies
| where timestamp > ago(1h)
| where name == "rag.generate"
| extend model = tostring(customDimensions['gen_ai.request.model'])
| extend tokens_out = toint(customDimensions['gen_ai.usage.output_tokens'])
| summarize calls = count(), total_tokens = sum(tokens_out),
            p95_ms = percentile(duration, 95) by model
| order by total_tokens desc

Token usage and latency per model — pulled from custom dimensions you set on your span attributes.
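Custom dimension values arrive as strings inside a dynamic column, which is why the query casts with tostring and toint. The sketch below reproduces that on hypothetical rows; the model names and token counts are invented for illustration.

```kusto
// Hypothetical spans; attribute values are stored as strings inside the dynamic bag.
let sample_dependencies = datatable(name: string, duration: real, customDimensions: dynamic)
[
    "rag.generate", 800.0,  dynamic({"gen_ai.request.model": "model-a", "gen_ai.usage.output_tokens": "120"}),
    "rag.generate", 1200.0, dynamic({"gen_ai.request.model": "model-a", "gen_ai.usage.output_tokens": "300"}),
    "rag.generate", 400.0,  dynamic({"gen_ai.request.model": "model-b", "gen_ai.usage.output_tokens": "50"})
];
sample_dependencies
| where name == "rag.generate"
| extend model = tostring(customDimensions["gen_ai.request.model"])
| extend tokens_out = toint(customDimensions["gen_ai.usage.output_tokens"])   // cast before summing
| summarize calls = count(), total_tokens = sum(tokens_out) by model
| order by total_tokens desc
```

On this sample, model-a shows 2 calls and 420 tokens, and model-b shows 1 call and 50 tokens.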

5. Container Apps system events

ContainerAppSystemLogs_CL
| where TimeGenerated > ago(30m)
| where ContainerAppName_s == "roo-vision"
| where Reason_s in ("Failed", "BackOff", "Killing", "Unhealthy")
| project TimeGenerated, Reason_s, Log_s, RevisionName_s
| order by TimeGenerated desc

Anything alarming the platform reported about a specific container app: failed image pulls, crash-loop back-offs, killed containers, failed health probes.

Exam tip: 'where' before 'summarize'

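Every row a query scans costs time, and on Basic and Auxiliary log tables it also costs money, because those queries are billed per GB scanned. A query that runs summarize ... by name and only THEN filters with where aggregates the whole table before discarding most of it. Filtering first (by time, by app name, by route) keeps cost and latency low.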

Order: where on the time column (timestamp in Application Insights tables, TimeGenerated in Log Analytics tables) → other where filters → extend (computed columns) → summarize → order by → render.
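A compact sketch of that ordering, using hypothetical console-log rows. The log text and the fixed time window are assumptions so the example gives the same result whenever it runs.

```kusto
// Hypothetical log rows; column names follow the Container Apps tables above.
let sample_logs = datatable(TimeGenerated: datetime, ContainerAppName_s: string, Log_s: string)
[
    datetime(2024-01-01 10:00:05), "roo-vision", "ERROR timeout calling model",
    datetime(2024-01-01 10:00:40), "roo-vision", "INFO request served",
    datetime(2024-01-01 10:06:10), "roo-vision", "ERROR timeout calling model",
    datetime(2024-01-01 10:07:00), "other-app",  "ERROR unrelated"
];
sample_logs
| where TimeGenerated between (datetime(2024-01-01 10:00:00) .. datetime(2024-01-01 10:10:00))   // time filter first
| where ContainerAppName_s == "roo-vision"                  // then the other filters
| extend is_error = Log_s startswith "ERROR"                // computed column
| summarize errors = countif(is_error), lines = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```

The result has two buckets: 10:00 with 1 error out of 2 lines, and 10:05 with 1 error out of 1 line. The other-app row never reaches the aggregation.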

The most useful operators in one place

// Filter
| where col == "value" and othercol > 100

// Pick columns
| project a, b, c

// Add computed columns without dropping
| extend duration_s = duration / 1000.0

// Aggregate
| summarize count(), sum(x), avg(x), percentile(x, 95) by groupCol, bin(timestamp, 5m)

// Join
| join kind=inner (otherTable | project key, val) on key

// Take top N by some metric
| top 10 by duration desc

// Render hint
| render timechart        // or barchart, piechart, columnchart

Joins — when correlation matters

// Find the dependency call chains for the slowest 20 requests
requests
| where timestamp > ago(1h)
| top 20 by duration desc
| project oid = operation_Id, request_dur = duration, request_name = name
| join kind=inner (
    dependencies
    | where timestamp > ago(1h)        // bound the right side of the join too
    | project oid = operation_Id, dep_dur = duration,
              dep_target = target, dep_name = name
) on oid

operation_Id is the trace ID — all spans in a single trace share it. That’s how you reconstruct an end-to-end story.
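The correlation can be seen on two small hypothetical tables. The trace IDs, targets, and durations below are invented for illustration.

```kusto
// Hypothetical request and dependency rows that share operation_Id values.
let sample_requests = datatable(operation_Id: string, name: string, duration: real)
[
    "trace-1", "POST /embed", 6200.0,
    "trace-2", "POST /embed", 150.0
];
let sample_dependencies = datatable(operation_Id: string, target: string, duration: real)
[
    "trace-1", "vector-db",      900.0,
    "trace-1", "model-endpoint", 5100.0,
    "trace-2", "vector-db",      100.0
];
sample_requests
| where duration > 5000                                   // keep only the slow request
| project operation_Id, request_name = name, request_ms = duration
| join kind=inner (
    sample_dependencies
    | project operation_Id, dep_target = target, dep_ms = duration
) on operation_Id
| project operation_Id, request_name, request_ms, dep_target, dep_ms
| order by dep_ms desc
```

Only trace-1 survives the filter, and the join returns its two dependency rows, with model-endpoint (5100 ms) ahead of vector-db (900 ms).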

Functions and parsing

// String functions
| extend route_short = substring(name, 0, 30)
| where url contains "openai"
| extend host = tostring(parse_url(url).Host)

// Numeric / time
| extend dt = todatetime(customDimensions["created_at"])
| extend tier = case(duration < 100, "fast",
                     duration < 1000, "ok",
                     "slow")

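// JSON parsing (customDimensions is already dynamic, so parse_json returns it as-is;
// the call is only required when a column holds JSON as a string)
| extend parsed = parse_json(customDimensions)
| extend model = tostring(parsed.model)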

case, iff, and parse_json are the everyday kitchen tools.
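A minimal sketch that uses all three functions together. The raw column and its JSON payload are hypothetical, added so that parse_json has a string to work on.

```kusto
// Hypothetical rows; `raw` holds JSON as a string, so parse_json is needed here.
datatable(name: string, duration: real, success: bool, raw: string)
[
    "POST /embed", 80.0,   true,  '{"model": "model-a"}',
    "POST /embed", 2400.0, false, '{"model": "model-b"}'
]
| extend outcome = iff(success, "ok", "failed")                  // two-way branch
| extend tier = case(duration < 100, "fast",                     // multi-way branch
                     duration < 1000, "ok",
                     "slow")
| extend model = tostring(parse_json(raw).model)                 // string -> dynamic -> string
| project name, outcome, tier, model
```

The first row comes out as ok / fast / model-a and the second as failed / slow / model-b.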

Workbooks and dashboards

KQL queries become reusable through:

Surface	What it does
Workbooks	Interactive parameterised reports — pick a time range, an app, an environment; the query reruns
Dashboards	Pinned charts on the Azure portal home
Alerts	Run a KQL query on a schedule; trigger if rows match (e.g., error rate > 5%)

Most teams build a small Workbook per service that answers the standard “is everything OK” questions in one place.
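As an illustration of the alert row in the table above, a log search alert can be built around a query that returns a row only when the condition is breached. The sketch below is one possible shape, not a prescribed one: the generated sample rows are hypothetical, and in a real alert the let statement would be replaced by requests filtered to the evaluation window.

```kusto
// Hypothetical stand-in for `requests | where timestamp > ago(15m)`:
// 20 generated rows, of which rows 10 and 20 are failures.
let sample_requests = range i from 1 to 20 step 1
    | extend success = (i % 10 != 0);
sample_requests
| summarize total = count(), errors = countif(success == false)
| extend error_rate_pct = round(100.0 * errors / total, 2)
| where error_rate_pct > 5            // a returned row means the alert condition is met
```

Here the error rate is 10%, so the query returns one row. With a rate at or below 5% it returns nothing, and the alert stays quiet.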