Troubleshooting Containers: Logs, Events, Connectivity

When the container won't start, won't scale, or won't talk to its dependencies, this is the toolkit: Container Apps logs, AKS pod debugging, network probes, and the patterns that pass exam troubleshooting questions.

The triage hierarchy

Simple explanation

When a container goes wrong, three things can be broken: the image, the runtime, or the network. Triage in that order.

  • Image: Did it pull? Did it start? (Check pull events and exit codes.)
  • Runtime: Is it crashing? Hitting OOM? Failing health probes? (Check logs and resource metrics.)
  • Network: Can it reach its dependencies, such as Cosmos, Service Bus, or Key Vault? (Check DNS and outbound rules.)

The right Azure-native tool for each layer:

  • Container Apps → Log Analytics + system events + az containerapp exec
  • AKS → kubectl describe pod + kubectl logs + Log Analytics

Common failure patterns and what they mean

The six patterns that cover most container failures on Azure.
| Symptom | Likely cause | First thing to check |
| --- | --- | --- |
| ImagePullBackOff (AKS) / pull failed (Container Apps) | No permission, wrong tag, registry unreachable | Did the kubelet/managed identity get AcrPull? Is the tag spelt right? |
| CrashLoopBackOff | Container starts then exits with non-zero code | `kubectl logs --previous` on AKS; on Container Apps, `az containerapp logs show --type system` plus the console logs in Log Analytics |
| OOMKilled (exit code 137) | Container exceeded memory limit | Bump `resources.limits.memory`, or fix the leak |
| Liveness probe failing | Probe path returns non-2xx | Try the probe URL from inside the pod with curl |
| ContainerCannotRun / exec format error | Wrong CPU architecture (ARM image on x86 node, or vice versa) | Rebuild for the right platform; use `--platform=linux/amd64` |
| Outbound calls timing out | NSG, private endpoint missing, DNS, or wrong identity for managed-identity APIs | kubectl exec / containerapp exec → curl/dig from inside; check effective NSG rules |
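
The patterns in the table above can be kept as a small lookup for a triage script. The sketch below is illustrative only: the dictionary, the function names, and the fallback wording are assumptions, not part of any Azure or Kubernetes SDK. It also shows the usual convention behind exit code 137: codes above 128 mean the process was killed by signal (code minus 128).

```python
from typing import Optional, Tuple

# Symptom -> (layer, first check), transcribed from the failure-pattern table.
# Names here are hypothetical helpers, not an Azure API.
FIRST_CHECKS = {
    "ImagePullBackOff": ("image", "AcrPull role on the pulling identity; image tag spelling"),
    "CrashLoopBackOff": ("runtime", "logs of the previous (crashed) container"),
    "OOMKilled": ("runtime", "resources.limits.memory, or a memory leak"),
    "ContainerCannotRun": ("image", "image CPU architecture vs node architecture"),
}


def signal_from_exit_code(exit_code: int) -> Optional[int]:
    """Return the terminating signal for exit codes of the form 128 + N, else None."""
    return exit_code - 128 if exit_code > 128 else None


def first_check(reason: str, exit_code: Optional[int] = None) -> Tuple[str, str]:
    """Look up the layer and the first thing to check for a status reason."""
    if reason in FIRST_CHECKS:
        return FIRST_CHECKS[reason]
    # Exit 137 = 128 + 9 (SIGKILL): consistent with an OOM kill, but not proof of one.
    if exit_code is not None and signal_from_exit_code(exit_code) == 9:
        return ("runtime", "SIGKILL: confirm whether the reason is OOMKilled")
    return ("unknown", "start with pull events, then logs, then connectivity")


print(first_check("ImagePullBackOff"))
print(signal_from_exit_code(137))
```

An exit code of 137 on its own only says the process received SIGKILL; the OOMKilled reason in the container status is what confirms a memory-limit kill.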

Container Apps: the troubleshooting toolkit

# Stream logs (live)
az containerapp logs show \
  --name roo-vision \
  --resource-group roo-prod \
  --follow

# Logs from a previous (crashed) revision
az containerapp logs show \
  --name roo-vision -g roo-prod \
  --revision roo-vision--v3-2 \
  --tail 200

# Open a shell inside a running replica
az containerapp exec \
  --name roo-vision -g roo-prod \
  --command "/bin/sh"

# System events (KEDA scale events, image pull events, restarts)
az containerapp logs show \
  --name roo-vision -g roo-prod \
  --type system

In Log Analytics:

// Console logs: your app's stdout/stderr
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| project TimeGenerated, RevisionName_s, ContainerName_s, Log_s
| order by TimeGenerated desc

// System logs: image pulls, scale events, container restarts
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| where Reason_s in ("Pulled", "Failed", "Created", "Started", "Killing", "BackOff")
| project TimeGenerated, Reason_s, Log_s

These two tables (ContainerAppConsoleLogs_CL and ContainerAppSystemLogs_CL) are the spine of Container Apps observability.

Exam tip: 'system' logs vs 'console' logs

Container Apps separates system logs (the platform's view: image pulls, scale events, container restarts) from console logs (your app's stdout/stderr). When troubleshooting a startup failure, system logs explain what the platform saw; console logs show what your code printed. Both matter; the exam asks about both.

Match the table name to the question: "the container is restarting in a loop" → system logs to confirm BackOff, console logs to see why each restart fails.
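
To make the "confirm BackOff in system logs" step concrete, here is a minimal sketch that counts `Reason_s` values in rows exported from `ContainerAppSystemLogs_CL`. The sample rows, the function names, and the threshold of three BackOff events are all assumptions chosen for illustration; only the column names come from the query shown earlier.

```python
from collections import Counter
from typing import Dict, Iterable

Row = Dict[str, str]


def summarize_reasons(rows: Iterable[Row]) -> Counter:
    """Count how often each Reason_s value appears in system-log rows."""
    return Counter(row["Reason_s"] for row in rows)


def looks_like_restart_loop(rows: Iterable[Row], min_backoffs: int = 3) -> bool:
    """Heuristic: several BackOff events in the window suggest a restart loop.

    The threshold is an arbitrary assumption, not a platform rule.
    """
    return summarize_reasons(rows)["BackOff"] >= min_backoffs


# Hypothetical rows, shaped like the projected columns of the system-log query.
sample_rows = [
    {"Reason_s": "Pulled", "Log_s": "image pulled"},
    {"Reason_s": "Started", "Log_s": "container started"},
    {"Reason_s": "BackOff", "Log_s": "back-off restarting failed container"},
    {"Reason_s": "BackOff", "Log_s": "back-off restarting failed container"},
    {"Reason_s": "BackOff", "Log_s": "back-off restarting failed container"},
]

print(summarize_reasons(sample_rows))
print(looks_like_restart_loop(sample_rows))
```

Once the loop is confirmed, the console-log table for the same time window is where the application's own error output should appear.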

AKS: the kubectl toolkit

# What's actually running?
kubectl get pods -A
kubectl get pod roo-vision-7d4b -o yaml

# Why did this pod misbehave?
kubectl describe pod roo-vision-7d4b
# ↑ The Events section at the bottom is the gold: it shows pull, schedule, and probe events with timestamps

# Logs from the current container
kubectl logs roo-vision-7d4b -c vision

# Logs from the PREVIOUS crashed container (key for CrashLoopBackOff)
kubectl logs roo-vision-7d4b -c vision --previous

# Shell into the pod
kubectl exec -it roo-vision-7d4b -c vision -- /bin/sh

# Check CPU and memory usage per pod and per node
kubectl top pods -A
kubectl top nodes

The single most important command on AKS for triage is kubectl describe pod. Its Events: section narrates the pod's recent state transitions (pull, schedule, start, probe failures) with timestamps, and each container's Last State field shows how it last terminated, for example OOMKilled.
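
The same status information is available in machine-readable form from `kubectl get pod <name> -o json`. The sketch below pulls out the fields most useful for triage (`restartCount`, `state.waiting.reason`, `lastState.terminated`). The function name and the sample document are hypothetical; the sample is trimmed to the fields used here and is not real cluster output.

```python
import json
from typing import Any, Dict, List


def container_findings(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract restart count, waiting reason, and last termination per container."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        last = status.get("lastState", {}).get("terminated") or {}
        findings.append({
            "name": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "last_exit_reason": last.get("reason"),
            "last_exit_code": last.get("exitCode"),
        })
    return findings


# Hypothetical, heavily trimmed stand-in for `kubectl get pod ... -o json` output.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "vision",
        "restartCount": 7,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

for finding in container_findings(sample_pod):
    print(finding)
```

In practice the JSON would be read from the command's standard output rather than from an embedded string.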

Connectivity troubleshooting from inside a container

When the container starts but can't reach a dependency:

# Open a shell
kubectl exec -it roo-vision-7d4b -- /bin/sh
# OR
az containerapp exec -n roo-vision -g roo-prod --command "/bin/sh"

# Inside the container:
nslookup roo-cosmos.documents.azure.com    # DNS resolution
curl -v https://roo-cosmos.documents.azure.com/   # TLS reachability
nc -vz roo-cosmos.documents.azure.com 443         # Raw TCP reachability

Patterns:

| What you see | What it means |
| --- | --- |
| DNS lookup fails | No DNS forwarder, private DNS zone not linked, or the name doesn't exist |
| DNS resolves to private IP, curl times out | NSG / firewall rule blocking outbound, or private endpoint not deployed |
| TCP succeeds but TLS handshake fails | Certificate issue, SNI mismatch, or wrong endpoint |
| 401 / 403 | 401: no valid token was presented. 403: the app identity is correct but has no role assignment on the target resource |
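
Minimal images often lack nslookup, curl, and nc. If the image ships a Python interpreter, the same DNS, TCP, and TLS stages can be checked with the standard library alone. This is a sketch under that assumption; the function name, the stage labels, and the 5-second default timeout are illustrative choices.

```python
import socket
import ssl

# Stage -> meaning, mirroring the patterns listed above.
STAGE_MEANING = {
    "dns": "name did not resolve: DNS forwarder, private DNS zone link, or wrong name",
    "tcp": "no TCP connection: NSG/firewall rule, or private endpoint not reachable",
    "tls": "TCP connected but TLS failed: certificate, SNI mismatch, or wrong endpoint",
    "ok": "DNS, TCP, and TLS all succeeded; look at auth (401/403) next",
}


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the first stage that fails ('dns', 'tcp', 'tls') or 'ok'."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:  # refused, unreachable, or timed out
        return "tcp"
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host):
            return "ok"
    except (ssl.SSLError, OSError):  # handshake error or handshake timeout
        return "tls"
    finally:
        sock.close()
```

From a shell inside the container, something like `stage = check_endpoint("roo-cosmos.documents.azure.com")` followed by `print(stage, STAGE_MEANING[stage])` gives the failing layer in one step.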

Probes: readiness, liveness, startup

A probe failure can keep a perfectly good container in the doghouse. Three kinds:

| Probe | What it controls | Failure consequence |
| --- | --- | --- |
| Liveness | "Is the process healthy?" | Container is killed and restarted |
| Readiness | "Should traffic be routed here?" | Pod removed from Service endpoints (no traffic) |
| Startup | "Has the app finished initialising?" | Liveness and readiness probes are held off until startup passes (good for slow-loading models); if it never passes, the container is killed and restarted |

spec:
  containers:
    - name: vision
      startupProbe:
        httpGet: { path: /health, port: 8000 }
        failureThreshold: 30
        periodSeconds: 10        # Allow 5 minutes for startup
      readinessProbe:
        httpGet: { path: /ready, port: 8000 }
        periodSeconds: 5
      livenessProbe:
        httpGet: { path: /health, port: 8000 }
        periodSeconds: 10

For AI workloads where the first inference call loads a multi-GB model, startup probes are essential: without them, the pod's liveness probe fails during model load and the pod restarts in a loop.
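
The grace window a probe allows is roughly failureThreshold × periodSeconds, plus any initial delay. The sketch below runs that arithmetic for the manifest above. Assumptions: the function names are illustrative, the liveness failureThreshold of 3 is the Kubernetes default (the manifest does not set it), the 180-second load time is a made-up figure, and probe timeouts are ignored for simplicity.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a container has before a failing probe triggers a kill."""
    return initial_delay_seconds + failure_threshold * period_seconds


def survives_load(budget_seconds: int, load_seconds: int) -> bool:
    """True if the app finishes loading before the probe budget runs out."""
    return budget_seconds >= load_seconds


MODEL_LOAD_SECONDS = 180  # assumed warm-up time for illustration

startup_budget = probe_budget_seconds(failure_threshold=30, period_seconds=10)
liveness_budget = probe_budget_seconds(failure_threshold=3, period_seconds=10)

print("startup budget:", startup_budget, "s")
print("liveness-only budget:", liveness_budget, "s")
print("survives with startup probe:", survives_load(startup_budget, MODEL_LOAD_SECONDS))
print("survives with liveness only:", survives_load(liveness_budget, MODEL_LOAD_SECONDS))
```

With only a liveness probe, the container gets about 30 seconds before it is restarted, far less than the assumed load time; the startup probe stretches that to 300 seconds.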

Real-world example: Mira's model-load CrashLoopBackOff

Mira deployed a new vision model. Pods went straight to CrashLoopBackOff. Logs showed the model was still downloading from blob storage when liveness probes started, and after 3 failed probes the pod was killed.

Fix: add a startup probe with failureThreshold: 30 and periodSeconds: 10. That is 30 × 10 s = 300 s, a 5-minute grace window. Once the startup probe passes, the liveness probe takes over.

Lesson: liveness probes assume your app is up. Startup probes acknowledge that AI containers often need a long warm-up.

End-to-end connectivity check: Network Watcher Connection Troubleshoot

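Azure Network Watcher's Connection Troubleshoot can verify reachability from a VM-based source, such as an AKS node (a scale-set instance), to any Azure resource (Cosmos, Key Vault, Service Bus) and tell you which hop drops the packet. Container Apps replicas are not VMs you can name as a source, so for Container Apps rely on the in-container checks described earlier.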

az network watcher test-connectivity \
  --source-resource $POD_VM_RESOURCE_ID \
  --dest-address roo-kv.vault.azure.net \
  --dest-port 443

Output reports the connection status, latency, and the next hops: a programmatic equivalent of "where exactly does the packet die?"
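
If the command's JSON output is captured, the failing hop can be picked out in a few lines. The sketch below assumes the output follows the Network Watcher connectivity-information shape (`connectionStatus`, `avgLatencyInMs`, and a `hops` list whose entries carry `address` and `issues`); treat those field names as an assumption to verify against real output. The sample document and the function names are hypothetical.

```python
from typing import Any, Dict, List


def hops_with_issues(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the hops that report at least one issue, in path order."""
    return [hop for hop in result.get("hops", []) if hop.get("issues")]


def summarize(result: Dict[str, Any]) -> str:
    """One-line summary: overall status plus the first problematic hop, if any."""
    status = result.get("connectionStatus", "Unknown")
    bad = hops_with_issues(result)
    if not bad:
        return f"{status}: no hop reported an issue"
    first = bad[0]
    issue_types = ", ".join(issue.get("type", "Unknown") for issue in first["issues"])
    return f"{status}: first issue at {first.get('address')} ({issue_types})"


# Hypothetical output, trimmed to the fields used above.
sample_result = {
    "connectionStatus": "Unreachable",
    "avgLatencyInMs": None,
    "hops": [
        {"address": "10.0.1.4", "issues": [{"type": "NetworkSecurityRule"}]},
        {"address": "10.0.2.5", "issues": []},
    ],
}

print(summarize(sample_result))
```

An issue type pointing at a network security rule would send you back to the effective NSG rules on the source subnet or NIC.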

Key terms

Question

What does CrashLoopBackOff mean?

Answer

A container starts, exits with a non-zero status, Kubernetes restarts it, it crashes again, and the platform applies an exponential back-off between restart attempts. On AKS, diagnose with `kubectl logs --previous` to see the last failure. On Container Apps, use `az containerapp logs show --type system` to confirm the restarts and the console logs in Log Analytics to see what the crashed container printed.

Question

What is exit code 137 / OOMKilled?

Answer

The container exceeded its memory limit and the Linux kernel's OOM killer terminated it (SIGKILL is signal 9, and 128 + 9 = exit code 137). Fix: increase `resources.limits.memory`, fix a memory leak, or scale out so each replica handles fewer concurrent requests.

Question

What's the difference between liveness, readiness, and startup probes?

Answer

Liveness: is the process healthy? Failure restarts the container. Readiness: should traffic flow here? Failure removes the pod from Service endpoints but doesn't restart it. Startup: has init finished? Until it passes, liveness probes are suppressed (key for AI workloads with multi-GB model loads).

Question

Which Container Apps log table holds your application's stdout/stderr?

Answer

`ContainerAppConsoleLogs_CL`. The companion table `ContainerAppSystemLogs_CL` holds platform events: image pulls, scale events, container restarts. Both go to the environment's Log Analytics workspace.

Question

Where do AKS pod events live, and how do you see them?

Answer

`kubectl describe pod <name>` shows the Events section at the bottom: chronological pull, schedule, start, and probe events. They're also available in Log Analytics as `KubeEvents` when Container Insights is enabled.

Knowledge check

Knowledge Check

Theo's clinical assistant pod restarts every 90 seconds in CrashLoopBackOff. The model is large (4 GB) and takes about 3 minutes to load on startup. The pod has a liveness probe but no startup probe. What's the most likely cause and fix?

Knowledge Check

Mira's Container App `roo-vision` started failing yesterday with `ImagePullBackOff`. The image, registry, and tag are all correct. Nothing changed in the manifests. What's the most likely cause?

Knowledge Check

Lin shells into a Container Apps replica and runs `curl https://roo-kv.vault.azure.net/secrets/foo`. It returns `Authentication failed`. The Container App has system-assigned managed identity enabled. What's the most likely missing step?