OpenTelemetry: Distributed Tracing for AI Apps

Why OpenTelemetry, why now

Simple explanation

OpenTelemetry (OTel) is the open standard for telemetry: traces, metrics, and logs. Azure Monitor has adopted it: Application Insights now recommends its OpenTelemetry-based SDK (the Azure Monitor OpenTelemetry distribution) for new applications, and new docs lead with OTel.

An AI request is naturally distributed: HTTP API → vector search → LLM call → response. A trace stitches all those steps into one timeline so you can answer “where did the 7 seconds go?” — which is the only question that matters when something’s slow.

The exam tests three things: instrumenting your code with OpenTelemetry, adding custom spans for AI-specific operations, and exporting traces to Azure Monitor / Application Insights.

The three signals

Each signal answers different questions; production AI apps emit all three.
Aspect	Traces	Metrics	Logs
What	A timeline of operations across services	Numeric measurements over time	Timestamped event records
Examples	HTTP request → SQL query → LLM call → response	Requests/sec, latency p99, queue depth	Worker started, error caught, audit event
Where it shines for AI	End-to-end latency breakdown of a RAG call	Token throughput, embedding cache hit rate, Cosmos DB request unit (RU) consumption	Discrete events such as agent decisions and tool calls

Setting it up — one configuration

# Python — Azure Monitor OpenTelemetry distribution
from azure.monitor.opentelemetry import configure_azure_monitor
import logging
import os

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    instrumentation_options={
        "azure_sdk": {"enabled": True},
        "django": {"enabled": False},
        "fastapi": {"enabled": True},
        "psycopg2": {"enabled": True},
    },
)

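logger = logging.getLogger(__name__)
# Python's root logger defaults to WARNING; lower the level so INFO records are emitted
logger.setLevel(logging.INFO)
logger.info("worker started")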

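After configure_azure_monitor runs:

Inbound HTTP requests are auto-traced for the enabled web frameworks (FastAPI in the configuration above; Flask and Django are also supported)
Outbound HTTP calls (requests, urllib, urllib3), Azure SDK calls, and PostgreSQL queries via psycopg2 are auto-traced; other clients such as Redis need their own OpenTelemetry instrumentation package
logger.info(...) records flow to Application Insights, provided the logger's level allows them
Standard metrics such as request duration and exception count are collected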

Custom spans for AI-specific operations

The auto-instrumentation covers infrastructure. For AI-specific operations — LLM calls, embedding generation, retrieval steps — you typically add your own spans.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def answer_question(question: str, user_id: str):
    with tracer.start_as_current_span("rag.answer") as root:
        root.set_attribute("user_id", user_id)
        root.set_attribute("question_length", len(question))

        with tracer.start_as_current_span("rag.embed_query"):
            query_vec = await embed(question)

        with tracer.start_as_current_span("rag.retrieve") as retrieve:
            retrieve.set_attribute("retriever", "pgvector")
            retrieve.set_attribute("top_k", 5)
            docs = await retrieve_similar(query_vec, k=5)
            retrieve.set_attribute("docs_returned", len(docs))

        with tracer.start_as_current_span("rag.generate") as generate:
            generate.set_attribute("model", "gpt-4o")
            answer = await call_llm(question, docs)
            generate.set_attribute("tokens_in", answer.tokens_in)
            generate.set_attribute("tokens_out", answer.tokens_out)

        return answer

Three spans nested inside the parent. In Application Insights’ end-to-end transaction view this renders as a Gantt chart — you immediately see which step is the bottleneck.
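Reading that timeline amounts to comparing child span durations. The sketch below does this in plain Python. SpanRecord is a hypothetical stand-in for what a backend stores per span, and the timings are invented for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical, simplified view of one finished span."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# Illustrative timings for the three child spans of rag.answer (made up).
children = [
    SpanRecord("rag.embed_query", 0.0, 120.0),
    SpanRecord("rag.retrieve", 120.0, 480.0),
    SpanRecord("rag.generate", 480.0, 6900.0),
]

# The bottleneck is the child with the longest duration.
slowest = max(children, key=lambda s: s.duration_ms)
total_ms = children[-1].end_ms - children[0].start_ms
share = slowest.duration_ms / total_ms

print(f"{slowest.name} took {slowest.duration_ms:.0f} ms ({share:.0%} of the request)")
```

With numbers like these, tuning retrieval would barely change the end-to-end latency, which is the kind of conclusion the Gantt view makes visible at a glance.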

Exam tip: span attributes are searchable

Attributes you set on spans (like model, tokens_in, top_k) flow to Application Insights as customDimensions on the dependency record. KQL queries can filter on them — for example, finding the slowest GPT-4o calls.

Use attribute names consistent with the OpenTelemetry GenAI semantic conventions (gen_ai.provider.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) so future tools and dashboards understand them automatically. Note: as of mid-2026 these conventions are still in Development status, not Stable — names may shift in future spec revisions.
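One way to keep names consistent is to build the attribute dictionary in a single helper and hand it to the span. This is a minimal sketch: gen_ai_attributes is a hypothetical helper, not part of any SDK, and the token counts are placeholder values. The keys are the convention names listed above, which may still change while the conventions are in Development status.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int]


def gen_ai_attributes(
    provider: str,
    model: str,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> Dict[str, AttrValue]:
    """Build span attributes keyed by GenAI semantic convention names.

    Values that are None are left out, since span attributes should not be None.
    """
    candidates = {
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Placeholder values for illustration only.
attrs = gen_ai_attributes("openai", "gpt-4o", input_tokens=812, output_tokens=164)
partial = gen_ai_attributes("openai", "gpt-4o")  # token usage not known yet
```

Inside the rag.generate span from the earlier example, the result could be applied in one call with generate.set_attributes(attrs) instead of several set_attribute calls with ad hoc names.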

Sampling — the other side of cost control

In a high-volume system you can’t keep every trace. OpenTelemetry defines several built-in samplers:

Sampler	What it does
AlwaysOn / AlwaysOff	Keep every trace / drop every trace
TraceIdRatioBased	Keep N% of traces (e.g., 0.1 = 10%)
ParentBased	Defer to parent’s sampling decision (e.g., keep only if upstream did)

configure_azure_monitor(
    connection_string=...,
    sampling_ratio=0.1,   # 10% of traces
)
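To see why ratio sampling keeps or drops whole traces rather than random spans, it helps to look at how the decision can be derived from the trace ID alone. The following is a simplified sketch of that idea in plain Python, not the SDK's actual code. ratio_keeps and parent_based_keeps are hypothetical names, and comparing the low 64 bits of the ID against a bound is an assumption modeled on how ratio samplers commonly work.

```python
import random
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the 128-bit trace ID are compared


def ratio_keeps(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_keeps(parent_sampled: Optional[bool], trace_id: int, ratio: float) -> bool:
    """Follow the parent's decision when there is one; otherwise fall back to the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_keeps(trace_id, ratio)


# Simulate many random trace IDs and measure the fraction kept at ratio 0.1.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept_fraction = sum(ratio_keeps(t, 0.1) for t in trace_ids) / len(trace_ids)
print(f"kept {kept_fraction:.1%} of traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict without coordination, so a sampled trace arrives complete rather than with missing spans.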

Adaptive sampling, the default in some classic Application Insights SDKs such as the one for ASP.NET, targets a specific events-per-second rate by adjusting the ratio dynamically, which is useful when traffic is bursty.

Logs — bridging Python `logging` to Azure Monitor

configure_azure_monitor automatically wires Python’s logging module to send records to Application Insights. Set log levels per logger:

logging.getLogger("azure.monitor").setLevel(logging.WARNING)
logging.getLogger("openai").setLevel(logging.INFO)
logging.getLogger(__name__).setLevel(logging.DEBUG)
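Per-logger levels act before any handler sees a record, so a record below its logger's level never reaches an exporter attached further up the hierarchy. The standard-library sketch below illustrates this with a list-backed handler standing in for the exporter's handler. ListHandler and the logger name app.rag are hypothetical, and attaching the handler to the root logger is an assumption about how the records are collected.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for an exporting handler: it just remembers what it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.name}:{record.levelname}:{record.getMessage()}")


capture = ListHandler()
root = logging.getLogger()
root.addHandler(capture)

logging.getLogger("openai").setLevel(logging.INFO)
logging.getLogger("app.rag").setLevel(logging.DEBUG)

logging.getLogger("openai").debug("request payload")  # below INFO: dropped at the logger
logging.getLogger("openai").info("request sent")      # reaches the handler
logging.getLogger("app.rag").debug("cache miss")      # reaches the handler

root.removeHandler(capture)
print(capture.messages)
```

Raising the level on noisy library loggers is therefore also a cost control: filtered records are never exported, so they are never ingested or billed.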

Logs join traces in Application Insights — a log line emitted inside a span is correlated to that span, so the end-to-end transaction view shows them inline.

Metrics — counters and histograms

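import time
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
embed_calls = meter.create_counter("ai.embed.calls", description="Number of embedding API calls")
embed_latency = meter.create_histogram("ai.embed.latency_ms", unit="ms")

# Time the embedding call, then record both instruments with the same attributes
start = time.perf_counter()
# ... call the embedding API here ...
elapsed_ms = (time.perf_counter() - start) * 1000

embed_calls.add(1, {"model": "text-embedding-3-small"})
embed_latency.record(elapsed_ms, {"model": "text-embedding-3-small"})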

These appear in Azure Monitor under custom metrics. You can graph and alert on them just like Azure-native metrics.
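Latency measurements tend to be repeated around every external call, so it can help to wrap the timing in a small context manager. The sketch below uses only the standard library. record_latency_ms is a hypothetical helper, and FakeHistogram is a stand-in used so the example runs without an exporter. The assumption is that the real histogram exposes the same record(amount, attributes) call as embed_latency above.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_latency_ms(histogram, attributes):
    """Time the enclosed block and record the elapsed milliseconds, even on error."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        histogram.record(elapsed_ms, attributes)


class FakeHistogram:
    """Demonstration stand-in that stores samples instead of exporting them."""

    def __init__(self):
        self.samples = []

    def record(self, amount, attributes=None):
        self.samples.append((amount, attributes))


demo = FakeHistogram()
with record_latency_ms(demo, {"model": "text-embedding-3-small"}):
    time.sleep(0.01)  # placeholder for the real embedding call

print(f"recorded {len(demo.samples)} sample(s)")
```

Recording in the finally clause means failed calls are measured too, which keeps the latency distribution honest when timeouts are part of the problem.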

Where to look in Azure Monitor

View	What it shows
Application Map	Service dependency graph computed from traces
End-to-end transaction	A single trace with timeline, dependencies, exceptions, and logs
Failures	Aggregated failed requests / dependencies with grouping
Performance	Latency distributions per request type and dependency
Logs (KQL)	Full query language, covered in the next module