Domain 4: Secure, monitor, and troubleshoot Azure solutions

OpenTelemetry: Distributed Tracing for AI Apps

Trace a request from user click → API → vector search → LLM call → response. This module covers the OpenTelemetry SDK on Azure, exporting to Application Insights, the three signals (traces, metrics, logs), and what shows up in Azure Monitor.

Why OpenTelemetry, why now

Simple explanation

OpenTelemetry (OTel) is the open standard for telemetry: traces, metrics, and logs. Azure has fully adopted it. Application Insights has switched its recommended SDK to the OpenTelemetry-based one; new docs lead with OTel.

An AI request is naturally distributed: HTTP API → vector search → LLM call → response. A trace stitches all those steps into one timeline so you can answer "where did the 7 seconds go?", which is the only question that matters when something's slow.
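To make the "where did the 7 seconds go?" question concrete, here is a minimal sketch that treats a trace as a list of timed steps and picks the slowest one. The `SpanRecord` type, the step names, and the timings are all hypothetical; real spans carry far more detail.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical, simplified span: a name plus start/end offsets in seconds."""
    name: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def slowest_step(spans: list[SpanRecord]) -> SpanRecord:
    """Return the span that accounts for the most wall-clock time."""
    return max(spans, key=lambda s: s.duration_s)


# Illustrative numbers only: one 7-second request broken into its steps.
trace_steps = [
    SpanRecord("http.receive", 0.0, 0.1),
    SpanRecord("vector.search", 0.1, 0.6),
    SpanRecord("llm.call", 0.6, 6.9),
    SpanRecord("http.respond", 6.9, 7.0),
]

bottleneck = slowest_step(trace_steps)
print(f"{bottleneck.name}: {bottleneck.duration_s:.1f}s")
```

The end-to-end transaction view in Application Insights answers the same question visually, without any code like this.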

The exam tests three things: instrumenting your code with OpenTelemetry, adding custom spans for AI-specific operations, and exporting traces to Azure Monitor / Application Insights.

The three signals

Each signal answers different questions; production AI apps emit all three.

| Feature | Traces | Metrics | Logs |
| --- | --- | --- | --- |
| What | A timeline of operations across services | Numeric measurements over time | Timestamped event records |
| Examples | HTTP request → SQL query → LLM call → response | Requests/sec, latency p99, queue depth | Worker started, error caught, audit event |
| Where it shines for AI | End-to-end latency breakdown of a RAG call | Token throughput, embedding cache hit rate, RU consumption | Discrete events: agent decisions, tool calls |

Setting it up: one configuration

# Python: Azure Monitor OpenTelemetry distribution
from azure.monitor.opentelemetry import configure_azure_monitor
import logging
import os

# One call wires up trace, metric, and log export.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    # Turn individual bundled instrumentations on or off.
    instrumentation_options={
        "azure_sdk": {"enabled": True},
        "django": {"enabled": False},
        "fastapi": {"enabled": True},
        "psycopg2": {"enabled": True},
    },
)

logger = logging.getLogger(__name__)
# Python's default effective level is WARNING, which would drop INFO records.
logger.setLevel(logging.INFO)
logger.info("worker started")

After configure_azure_monitor:

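  • Inbound HTTP requests are auto-traced for the enabled frameworks (FastAPI / Flask / Django middleware)
  • Outbound HTTP calls and PostgreSQL queries (psycopg2) are auto-traced; libraries outside the distribution's bundled set, such as Redis clients, need their own OpenTelemetry instrumentation package installed
  • logger.info(...) flows to Application Insights, provided the logger's level allows INFO
  • Standard metrics (request duration, exception count, and so on) are collected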
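Because everything hangs off one connection string, it can help to fail fast at startup if that value is missing or malformed. The sketch below is a hypothetical pre-flight check, not part of the Azure SDK. It assumes only the documented `Key=Value;Key=Value` shape of an Application Insights connection string, and `parse_connection_string` is an illustrative name.

```python
def parse_connection_string(raw: str) -> dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs and fail fast if the key is missing."""
    parts: dict[str, str] = {}
    for pair in raw.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {pair!r}")
        parts[key.strip()] = value.strip()
    if "InstrumentationKey" not in parts:
        raise ValueError("connection string has no InstrumentationKey")
    return parts


# Placeholder values, not a real resource.
example = (
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;"
    "IngestionEndpoint=https://example.invalid/"
)
settings = parse_connection_string(example)
print(sorted(settings))
```

A check like this would run before `configure_azure_monitor`, so a misconfigured container stops with a clear error instead of silently sending no telemetry.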

Custom spans for AI-specific operations

The auto-instrumentation covers infrastructure. For AI-specific operations (LLM calls, embedding generation, retrieval steps) you typically add your own spans.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def answer_question(question: str, user_id: str):
    with tracer.start_as_current_span("rag.answer") as root:
        root.set_attribute("user_id", user_id)
        root.set_attribute("question_length", len(question))

        with tracer.start_as_current_span("rag.embed_query"):
            query_vec = await embed(question)

        with tracer.start_as_current_span("rag.retrieve") as retrieve:
            retrieve.set_attribute("retriever", "pgvector")
            retrieve.set_attribute("top_k", 5)
            docs = await retrieve_similar(query_vec, k=5)
            retrieve.set_attribute("docs_returned", len(docs))

        with tracer.start_as_current_span("rag.generate") as generate:
            generate.set_attribute("model", "gpt-4o")
            answer = await call_llm(question, docs)
            generate.set_attribute("tokens_in", answer.tokens_in)
            generate.set_attribute("tokens_out", answer.tokens_out)

        return answer

The three child spans nest inside the parent rag.answer span. In Application Insights' end-to-end transaction view this renders as a Gantt chart, so you immediately see which step is the bottleneck.
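The snippet above calls `embed`, `retrieve_similar`, and `call_llm` without defining them. To exercise the span structure locally, a set of stand-ins along these lines would do. Everything here is hypothetical: the `LLMAnswer` type, the canned documents, and the word-count "token" numbers are placeholders, not real API behaviour.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class LLMAnswer:
    """Hypothetical result type carrying the fields the spans read."""
    text: str
    tokens_in: int
    tokens_out: int


async def embed(text: str) -> list[float]:
    # Stand-in for an embedding API call: a fake 3-dimensional vector.
    await asyncio.sleep(0)
    return [float(len(text)), 0.0, 1.0]


async def retrieve_similar(query_vec: list[float], k: int) -> list[str]:
    # Stand-in for a vector search: returns at most k canned documents.
    await asyncio.sleep(0)
    corpus = ["doc-a", "doc-b", "doc-c"]
    return corpus[:k]


async def call_llm(question: str, docs: list[str]) -> LLMAnswer:
    # Stand-in for a chat-completion call; "tokens" are just word counts.
    await asyncio.sleep(0)
    prompt_words = len(question.split()) + sum(len(d.split()) for d in docs)
    text = f"Answer based on {len(docs)} documents."
    return LLMAnswer(text=text, tokens_in=prompt_words, tokens_out=len(text.split()))


async def demo() -> LLMAnswer:
    vec = await embed("what is a span?")
    docs = await retrieve_similar(vec, k=5)
    return await call_llm("what is a span?", docs)


result = asyncio.run(demo())
print(result)
```

With these in place, `answer_question` can run end to end in a test, and the span attributes (`docs_returned`, `tokens_in`, `tokens_out`) receive plausible values.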

Exam tip: span attributes are searchable

Attributes you set on spans (like model, tokens_in, top_k) flow to Application Insights as customDimensions on the dependency record. KQL queries can filter on them, for example to find the slowest GPT-4o calls.
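As a rough illustration of that kind of query, the sketch below builds a KQL string in Python. It assumes the classic Application Insights schema (a `dependencies` table with `customDimensions`, `duration`, `timestamp`, `name`, and `operation_Id` columns). `slowest_calls_query` is a hypothetical helper, not an SDK function.

```python
def slowest_calls_query(model: str, limit: int = 10) -> str:
    """Build a KQL query over the dependencies table, filtered by a span attribute."""
    # Escape backslashes and double quotes so the value stays inside the KQL string literal.
    safe_model = model.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "dependencies\n"
        f'| where tostring(customDimensions["model"]) == "{safe_model}"\n'
        f"| top {int(limit)} by duration desc\n"
        "| project timestamp, name, duration, operation_Id"
    )


print(slowest_calls_query("gpt-4o"))
```

The resulting text can be pasted into the Logs blade; the `operation_Id` column links each row back to its end-to-end transaction.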

Use attribute names consistent with the OpenTelemetry GenAI semantic conventions (gen_ai.provider.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) so future tools and dashboards understand them automatically. Note: as of mid-2026 these conventions are still in Development status, not Stable, so names may shift in future spec revisions.
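One low-effort way to adopt those names is to translate ad-hoc keys in one place before setting them on a span. The mapping below is a sketch: the convention names come from the list above, while `GENAI_ATTRIBUTE_NAMES` and `to_genai_attributes` are hypothetical helpers.

```python
# Ad-hoc attribute name -> GenAI semantic-convention name (as listed above).
GENAI_ATTRIBUTE_NAMES = {
    "model": "gen_ai.request.model",
    "tokens_in": "gen_ai.usage.input_tokens",
    "tokens_out": "gen_ai.usage.output_tokens",
}


def to_genai_attributes(attrs: dict) -> dict:
    """Rename known ad-hoc keys; pass every other key through unchanged."""
    return {GENAI_ATTRIBUTE_NAMES.get(key, key): value for key, value in attrs.items()}


renamed = to_genai_attributes({"model": "gpt-4o", "tokens_in": 812, "top_k": 5})
print(renamed)
```

Keeping the mapping in a single dictionary means a future rename in the spec is a one-line change.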

Sampling: the other side of cost control

In a high-volume system you can't keep every trace. OpenTelemetry defines several samplers:

| Sampler | What it does |
| --- | --- |
| AlwaysOn / AlwaysOff | Keep every trace / drop every trace |
| TraceIdRatioBased | Keep a fixed fraction of traces (e.g., 0.1 = 10%) |
| ParentBased | Defer to the parent's sampling decision (e.g., keep only if upstream did) |

configure_azure_monitor(
    connection_string=...,
    sampling_ratio=0.1,   # 10% of traces
)

Adaptive sampling (Application Insights' default for some SDKs) targets a specific events-per-second rate by adjusting the ratio dynamically. Useful when traffic is bursty.
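For contrast with adaptive sampling, a fixed-ratio decision can be sketched in a few lines. This shows the general idea behind trace-ID-based ratio sampling: compare part of the trace ID against a threshold. It is a simplified stand-in, and the sampler the Azure Monitor distribution actually uses may compute its decision differently. `keep_trace` is a hypothetical name.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # look only at the low 64 bits of the 128-bit trace ID


def keep_trace(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always yields the same answer."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


rng = random.Random(42)  # seeded so the sketch is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(keep_trace(tid, 0.1) for tid in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so a trace is kept or dropped as a whole.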

Logs: bridging Python logging to Azure Monitor

configure_azure_monitor automatically wires Python's logging module to send records to Application Insights. Set log levels per logger:

logging.getLogger("azure.monitor").setLevel(logging.WARNING)
logging.getLogger("openai").setLevel(logging.INFO)
logging.getLogger(__name__).setLevel(logging.DEBUG)
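Per-logger levels are inherited by child loggers, which is what makes the lines above effective for whole packages. A small standard-library sketch of that behaviour follows; the logger names `azure.monitor.exporter` and `myapp.rag` are only illustrative.

```python
import logging

logging.getLogger("azure.monitor").setLevel(logging.WARNING)
app_logger = logging.getLogger("myapp.rag")  # hypothetical application logger name
app_logger.setLevel(logging.DEBUG)

# A child logger with no level of its own inherits the nearest configured ancestor's level.
exporter_logger = logging.getLogger("azure.monitor.exporter")
print(exporter_logger.getEffectiveLevel() == logging.WARNING)
print(exporter_logger.isEnabledFor(logging.INFO))
print(app_logger.isEnabledFor(logging.DEBUG))
```

Records filtered out at the logger level never reach the exporting handler, so quieting noisy libraries this way also reduces ingestion volume.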

Logs join traces in Application Insights: a log line emitted inside a span is correlated to that span, so the end-to-end transaction view shows them inline.

Metrics: counters and histograms

import time
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
embed_calls = meter.create_counter("ai.embed.calls", description="Number of embedding API calls")
embed_latency = meter.create_histogram("ai.embed.latency_ms", unit="ms")

start = time.perf_counter()
# ... call the embedding API here ...
elapsed_ms = (time.perf_counter() - start) * 1000

# Attributes become dimensions you can split and filter on.
embed_calls.add(1, {"model": "text-embedding-3-small"})
embed_latency.record(elapsed_ms, {"model": "text-embedding-3-small"})

These appear in Azure Monitor under custom metrics. You can graph and alert on them just like Azure-native metrics.
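The `elapsed_ms` value has to be measured around each call. One way to avoid repeating that timing code is a small context manager that hands the elapsed time to any recorder callable. This is a sketch using only the standard library. `timed` is a hypothetical helper, and the list-based recorder stands in for a real histogram.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(record: Callable[[float, dict], None], attributes: dict) -> Iterator[None]:
    """Measure the wrapped block and pass elapsed milliseconds to `record`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failures are timed too.
        elapsed_ms = (time.perf_counter() - start) * 1000
        record(elapsed_ms, attributes)


# Stand-in recorder so the sketch runs without an SDK.
samples: list[tuple[float, dict]] = []

with timed(lambda ms, attrs: samples.append((ms, attrs)), {"model": "text-embedding-3-small"}):
    time.sleep(0.01)  # placeholder for the real embedding call

print(len(samples))
```

In the instrumented app, `embed_latency.record` could plausibly be passed as `record`, since it takes an amount followed by an attributes mapping.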

Where to look in Azure Monitor

| View | What it shows |
| --- | --- |
| Application Map | Service dependency graph computed from traces |
| End-to-end transaction | A single trace with timeline, dependencies, exceptions, and logs |
| Failures | Aggregated failed requests / dependencies with grouping |
| Performance | Latency distributions per request type and dependency |
| Logs (KQL) | Full query language, covered in the next module |

Key terms

Question

What is OpenTelemetry?

Answer

The CNCF's vendor-neutral standard for observability: traces, metrics, and logs. It defines APIs and SDKs for every major language, with exporters to back ends like Azure Monitor, Jaeger, and Prometheus. Microsoft has standardised on it for Application Insights.

Question

What's the recommended setup for OpenTelemetry on Azure?

Answer

The Azure Monitor OpenTelemetry distribution: a meta-package that wraps the OTel SDK with Azure Monitor as the configured exporter. One pip install (or NuGet, etc.), one `configure_azure_monitor(connection_string=...)` call, and auto-instrumentation kicks in.

Question

What's a span?

Answer

A unit of work in a trace: a function call, an HTTP request, a database query. Spans have a name, a start time, an end time, and attributes (key/value tags). Spans nest into a tree to form a trace.

Question

Why add custom spans to an AI app?

Answer

Auto-instrumentation covers infrastructure (HTTP, SQL, Redis). Custom spans cover AI-specific operations: embedding generation, retrieval, LLM calls. Attributes on those spans (model, top_k, tokens) make AI behaviour searchable in Application Insights.

Question

What does TraceIdRatioBased sampling do?

Answer

Keeps a fixed fraction of traces (e.g., 10%). The decision is derived deterministically from the trace ID, so all spans within a trace are kept or dropped together. Useful in high-volume systems where keeping every trace is impractical or expensive.

Knowledge check

Mira's Python container is using FastAPI, psycopg, and openai. She wants Application Insights to show end-to-end traces with no per-route instrumentation. What's the simplest way?

Theo wants to find the slowest GPT-4o calls in production. He's already instrumented LLM calls as spans with `model` and `tokens_in` as attributes. Where can he query?

Lin's app is high-volume and Application Insights costs are climbing. He wants to keep ~10% of traces while keeping all error traces. Which sampling approach fits?