OpenTelemetry: Distributed Tracing for AI Apps
Trace a request from user click → API → vector search → LLM call → response. This module covers the OpenTelemetry SDK on Azure, exporting to Application Insights, the three signals (traces, metrics, logs), and what shows up in Azure Monitor.
Why OpenTelemetry, why now
OpenTelemetry (OTel) is the open standard for telemetry: traces, metrics, and logs. Azure has fully adopted it. Application Insights has switched its recommended SDK to the OpenTelemetry-based one; new docs lead with OTel.
An AI request is naturally distributed: HTTP API → vector search → LLM call → response. A trace stitches all those steps into one timeline so you can answer "where did the 7 seconds go?", which is the only question that matters when something's slow.
The exam tests three things: instrumenting your code with OpenTelemetry, adding custom spans for AI-specific operations, and exporting traces to Azure Monitor / Application Insights.
The three signals
| Feature | Traces | Metrics | Logs |
|---|---|---|---|
| What | A timeline of operations across services | Numeric measurements over time | Timestamped event records |
| Examples | HTTP request → SQL query → LLM call → response | Requests/sec, latency p99, queue depth | Worker started, error caught, audit event |
| Where it shines for AI | End-to-end latency breakdown of a RAG call | Token throughput, embedding cache hit rate, RU consumption | Discrete events: agent decisions, tool calls |
Setting it up: one configuration
```python
# Python: Azure Monitor OpenTelemetry distribution
import logging
import os

from azure.monitor.opentelemetry import configure_azure_monitor

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    instrumentation_options={
        "azure_sdk": {"enabled": True},
        "django": {"enabled": False},
        "fastapi": {"enabled": True},
        "psycopg2": {"enabled": True},
    },
)

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING, which would drop the INFO record below
logger.info("worker started")
```
After configure_azure_monitor:
- HTTP requests are auto-traced (FastAPI / Flask / Django middleware)
- Outbound HTTP calls and SQL queries are auto-traced (Redis and other libraries need their own OpenTelemetry instrumentation package installed)
- logger.info(...) flows to Application Insights
- Metrics: request duration, exception count, etc.
Custom spans for AI-specific operations
The auto-instrumentation covers infrastructure. For AI-specific operations (LLM calls, embedding generation, retrieval steps) you typically add your own spans.
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


# embed, retrieve_similar, and call_llm are the application's own async helpers (not shown).
async def answer_question(question: str, user_id: str):
    # Parent span covering the whole RAG request
    with tracer.start_as_current_span("rag.answer") as root:
        root.set_attribute("user_id", user_id)
        root.set_attribute("question_length", len(question))

        # Child span: embed the query
        with tracer.start_as_current_span("rag.embed_query"):
            query_vec = await embed(question)

        # Child span: vector retrieval
        with tracer.start_as_current_span("rag.retrieve") as retrieve:
            retrieve.set_attribute("retriever", "pgvector")
            retrieve.set_attribute("top_k", 5)
            docs = await retrieve_similar(query_vec, k=5)
            retrieve.set_attribute("docs_returned", len(docs))

        # Child span: LLM generation
        with tracer.start_as_current_span("rag.generate") as generate:
            generate.set_attribute("model", "gpt-4o")
            answer = await call_llm(question, docs)
            generate.set_attribute("tokens_in", answer.tokens_in)
            generate.set_attribute("tokens_out", answer.tokens_out)

        return answer
```
The three child spans nest inside the parent. In Application Insights' end-to-end transaction view this renders as a Gantt chart, so you immediately see which step is the bottleneck.
Exam tip: span attributes are searchable
Attributes you set on spans (like model, tokens_in, top_k) flow to Application Insights as customDimensions on the dependency record. KQL queries can filter on them, for example to find the slowest GPT-4o calls.
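As a rough illustration of such a filter, the sketch below builds the KQL text in Python. It assumes the spans land in the `dependencies` table with a `model` entry in `customDimensions`, as described above; `slowest_calls_query` is a hypothetical helper name, and the resulting string is meant to be pasted into the Logs blade or handed to whatever query client you use.

```python
def slowest_calls_query(model: str, limit: int = 10) -> str:
    """Build a KQL query for the slowest dependency records tagged with a model."""
    # Escape backslashes and double quotes so the value stays inside the KQL string literal.
    safe_model = model.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "dependencies\n"
        f'| where tostring(customDimensions["model"]) == "{safe_model}"\n'
        f"| top {int(limit)} by duration desc\n"
        "| project timestamp, name, duration, customDimensions"
    )


print(slowest_calls_query("gpt-4o"))
```

The `tostring(...)` wrapper is there because `customDimensions` is a dynamic column, so its values need a cast before a string comparison.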
Use attribute names consistent with the OpenTelemetry GenAI semantic conventions (gen_ai.provider.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) so future tools and dashboards understand them automatically. Note: as of mid-2026 these conventions are still in Development status, not Stable, so names may shift in future spec revisions.
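One way to keep those names consistent across a codebase is to build the attribute dictionary in a single place. The helper below is a minimal sketch: `genai_span_attributes` is a hypothetical name, and the provider value and token counts in the usage line are made-up illustrative inputs.

```python
from typing import Dict, Union

AttrValue = Union[str, int]


def genai_span_attributes(
    provider: str, model: str, input_tokens: int, output_tokens: int
) -> Dict[str, AttrValue]:
    """Map application values onto the GenAI semantic-convention attribute names."""
    return {
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# Illustrative values only.
attrs = genai_span_attributes("openai", "gpt-4o", 812, 164)
print(attrs)
```

Inside a span, the whole dictionary can be attached in one call with `span.set_attributes(attrs)`. If the conventions rename a key later, only this helper needs to change.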
Sampling: the other side of cost control
In a high-volume system you can't keep every trace. OpenTelemetry ships several samplers:
| Sampler | What it does |
|---|---|
| AlwaysOn / AlwaysOff | Keep every trace / drop every trace |
| TraceIdRatioBased | Keep N% of traces (e.g., 0.1 = 10%) |
| ParentBased | Defer to parent's sampling decision (e.g., keep only if upstream did) |
```python
configure_azure_monitor(
    connection_string=...,
    sampling_ratio=0.1,  # 10% of traces
)
```
Adaptive sampling (Application Insights' default for some SDKs) targets a specific events-per-second rate by adjusting the ratio dynamically. It is useful when traffic is bursty.
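To make the ratio-based and parent-based rows of the table concrete, here is a simplified, standard-library-only sketch of the idea: a trace is kept when the low 64 bits of its trace ID fall below `ratio * 2^64`, so every service that sees the same trace ID reaches the same decision. The function names are hypothetical, and this is not the exact code used by the Azure Monitor exporter, which applies its own sampler.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of the 128-bit trace ID


def ratio_keeps(trace_id: int, ratio: float) -> bool:
    """Deterministic keep/drop: the same trace ID always yields the same answer."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_keeps(parent_sampled, trace_id: int, ratio: float) -> bool:
    """Follow the parent's decision when there is one; otherwise fall back to the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_keeps(trace_id, ratio)


# Simulate random 128-bit trace IDs and measure the fraction kept at ratio 0.1.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(ratio_keeps(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

The printed fraction should land close to 10%. Because the decision depends only on the trace ID, a trace is either kept whole or dropped whole, rather than losing individual spans at random.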
Logs: bridging Python logging to Azure Monitor
configure_azure_monitor automatically wires Python's logging module to send records to Application Insights. Set log levels per logger:
```python
import logging

logging.getLogger("azure.monitor").setLevel(logging.WARNING)
logging.getLogger("openai").setLevel(logging.INFO)
logging.getLogger(__name__).setLevel(logging.DEBUG)
```
Logs join traces in Application Insights: a log line emitted inside a span is correlated to that span, so the end-to-end transaction view shows them inline.
Metrics: counters and histograms
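```python
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
embed_calls = meter.create_counter("ai.embed.calls", description="Number of embedding API calls")
embed_latency = meter.create_histogram("ai.embed.latency_ms", unit="ms")

# Time the embedding call; the call itself is the application's own code (not shown).
start = time.perf_counter()
# ... call the embedding API here ...
elapsed_ms = (time.perf_counter() - start) * 1000

embed_calls.add(1, {"model": "text-embedding-3-small"})
embed_latency.record(elapsed_ms, {"model": "text-embedding-3-small"})
```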
These appear in Azure Monitor under custom metrics. You can graph and alert on them just like Azure-native metrics.
Where to look in Azure Monitor
| View | What it shows |
|---|---|
| Application Map | Service dependency graph computed from traces |
| End-to-end transaction | A single trace with timeline, dependencies, exceptions, and logs |
| Failures | Aggregated failed requests / dependencies with grouping |
| Performance | Latency distributions per request type and dependency |
| Logs (KQL) | Full query language, covered in the next module |
Knowledge check
Mira's Python container is using FastAPI, psycopg, and openai. She wants Application Insights to show end-to-end traces with no per-route instrumentation. What's the simplest way?
Theo wants to find the slowest GPT-4o calls in production. He's already instrumented LLM calls as spans with `model` and `tokens_in` as attributes. Where can he query?
Lin's app is high-volume and Application Insights costs are climbing. He wants to keep ~10% of traces while keeping all error traces. Which sampling approach fits?