Troubleshooting Containers: Logs, Events, Connectivity
When the container won't start, won't scale, or won't talk to its dependencies, this is the toolkit: Container Apps logs, AKS pod debugging, network probes, and the patterns that pass exam troubleshooting questions.
The triage hierarchy
When a container goes wrong, three things can be broken: the image, the runtime, or the network. Triage in that order.
- Image: Did it pull? Did it start? (Check pull events and exit codes.)
- Runtime: Is it crashing? Hitting OOM? Failing health probes? (Check logs and resource metrics.)
- Network: Can it reach its dependencies (Cosmos, Service Bus, Key Vault)? (Check DNS and outbound rules.)
The right Azure-native tool for each layer:
- Container Apps: Log Analytics + system events + az containerapp exec
- AKS: kubectl describe pod + kubectl logs + Log Analytics
Common failure patterns and what they mean
| Symptom | Likely cause | First thing to check |
|---|---|---|
| ImagePullBackOff (AKS) / pull failed (Container Apps) | No permission, wrong tag, registry unreachable | Did the kubelet/managed identity get AcrPull? Is the tag spelt right? |
| CrashLoopBackOff | Container starts then exits with non-zero code | `kubectl logs --previous` on AKS; on Container Apps, `az containerapp logs show` plus ContainerAppConsoleLogs_CL for earlier restarts |
| OOMKilled (exit code 137) | Container exceeded memory limit | Bump `resources.limits.memory`, or fix the leak |
| Liveness probe failing | Probe path returns non-2xx | Try the probe URL from inside the pod with curl |
| ContainerCannotRun / exec format error | Wrong CPU architecture (ARM image on x86 node, or vice versa) | Rebuild for the right platform, e.g. `--platform=linux/amd64` for x86 nodes |
| Outbound calls timing out | NSG, private endpoint missing, DNS, or wrong identity for managed-identity APIs | `kubectl exec` / `az containerapp exec`, then curl/dig from inside; check effective NSG rules |
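The exit codes in the table follow the usual shell convention that a code above 128 means "128 plus the signal number", which is why OOMKilled shows up as 137 (128 + 9, SIGKILL). A minimal sketch of a decoder, assuming you have already read the exit code from `kubectl describe pod` or the system logs; the function name `explain_exit_code` is hypothetical and the hints are rules of thumb rather than a diagnosis:

```shell
# Decode a container exit code into a triage hint.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found: check the image entrypoint" ;;
    137) echo "killed by SIGKILL (128 + 9): often the OOM killer" ;;
    143) echo "terminated by SIGTERM (128 + 15): normal shutdown request" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (exit code $code): read the previous container logs"
      fi
      ;;
  esac
}
```

For example, `explain_exit_code 137` points at the memory limit, while a small non-zero code such as 1 usually means the application itself failed and the previous container's logs are the next stop.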
Container Apps: the troubleshooting toolkit
# Stream logs (live)
az containerapp logs show \
--name roo-vision \
--resource-group roo-prod \
--follow
# Logs from a previous (crashed) revision
az containerapp logs show \
--name roo-vision -g roo-prod \
--revision roo-vision--v3-2 \
--tail 200
# Open a shell inside a running replica
az containerapp exec \
--name roo-vision -g roo-prod \
--command "/bin/sh"
# System events (KEDA scale events, image pull events, restarts)
az containerapp logs show \
--name roo-vision -g roo-prod \
--type system
In Log Analytics:
// Console logs: your app's stdout/stderr
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| project TimeGenerated, RevisionName_s, ContainerName_s, Log_s
| order by TimeGenerated desc
// System logs: image pulls, scale events, container restarts
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| where Reason_s in ("Pulled", "Failed", "Created", "Started", "Killing", "BackOff")
| project TimeGenerated, Reason_s, Log_s
These two tables (ContainerAppConsoleLogs_CL and ContainerAppSystemLogs_CL) are the spine of Container Apps observability.
Exam tip: 'system' logs vs 'console' logs
Container Apps separates system logs (the platform's view: image pulls, scale events, container restarts) from console logs (your app's stdout/stderr). When troubleshooting a startup failure, system logs explain what the platform saw; console logs show what your code printed. Both matter; the exam asks about both.
Match the table name to the question: for "the container is restarting in a loop", use system logs to confirm BackOff, then console logs to see why each restart fails.
AKS: the kubectl toolkit
# What's actually running?
kubectl get pods -A
kubectl get pod roo-vision-7d4b -o yaml
# Why did this pod misbehave?
kubectl describe pod roo-vision-7d4b
# The Events section at the bottom is the gold: it shows pull, schedule, and probe events with timestamps
# Logs from the current container
kubectl logs roo-vision-7d4b -c vision
# Logs from the PREVIOUS crashed container (key for CrashLoopBackOff)
kubectl logs roo-vision-7d4b -c vision --previous
# Shell into the pod
kubectl exec -it roo-vision-7d4b -c vision -- /bin/sh
# Check resource usage per pod and per node
kubectl top pods -A
kubectl top nodes
The single most important command on AKS for triage is kubectl describe pod. Its Events section narrates every state transition (pull, schedule, start, probe failure, kill) with timestamps, and the container's Last State field shows termination reasons such as OOMKilled.
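When the `describe` output is long, the same information can be pulled out directly. The sketch below wraps two standard kubectl queries in helper functions; the function names `pod_events` and `last_termination` are hypothetical, and the namespace defaults to `default` as an assumption:

```shell
# Print the events recorded for one pod, oldest first.
pod_events() {
  local pod="$1" ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Print, per container: name, reason and exit code of the previous termination.
# Fields are empty for containers that have never been restarted.
last_termination() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A call such as `last_termination roo-vision-7d4b` would then show at a glance whether the previous container ended with a reason like `OOMKilled` or `Error`, before you read any logs.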
Connectivity troubleshooting from inside a container
When the container starts but can't reach a dependency:
# Open a shell
kubectl exec -it roo-vision-7d4b -- /bin/sh
# OR
az containerapp exec -n roo-vision -g roo-prod --command "/bin/sh"
# Inside the container:
nslookup roo-cosmos.documents.azure.com # DNS resolution
curl -v https://roo-cosmos.documents.azure.com/ # TLS reachability
nc -vz roo-cosmos.documents.azure.com 443 # Raw TCP reachability
Patterns:
| What you see | What it means |
|---|---|
| DNS lookup fails | No DNS forwarder, private DNS zone not linked, or the name doesn't exist |
| DNS resolves to private IP, curl times out | NSG / firewall rule blocking outbound, or private endpoint not deployed |
| TCP succeeds but TLS handshake fails | Certificate issue, SNI mismatch, or wrong endpoint |
| 401 / 403 | 401: no valid token was presented (identity missing or token invalid). 403: the app identity is correct but has no role assignment on the target resource |
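The table reads most easily as an ordered check: DNS first, then raw TCP, then TLS. A minimal sketch of that sequence as one function, assuming `nslookup`, `nc`, and `curl` are present in the image (slim images often lack them); the function name `check_dependency` and its return codes are hypothetical:

```shell
# Report the first layer that fails for host:port, checking DNS, then TCP, then TLS/HTTP.
# Returns 0 when all layers pass, 1 for DNS, 2 for TCP, 3 for TLS/HTTP.
check_dependency() {
  local host="$1" port="${2:-443}"
  if ! nslookup "$host" >/dev/null 2>&1; then
    echo "DNS: cannot resolve $host"
    return 1
  fi
  if ! nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "TCP: $host:$port unreachable"
    return 2
  fi
  # Without -f, curl exits 0 on any HTTP response (including 401/403),
  # so a failure here points at the TLS handshake or the connection itself.
  if ! curl -sS -o /dev/null --max-time 10 "https://$host:$port/" 2>/dev/null; then
    echo "TLS/HTTP: handshake or request to $host:$port failed"
    return 3
  fi
  echo "OK: $host:$port reachable over TLS"
}
```

Running `check_dependency roo-cosmos.documents.azure.com` inside the container stops at the first broken layer, which maps directly onto a row of the table. An "OK" result combined with 401 or 403 responses from the application moves the investigation from the network to identity and role assignments.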
Probes: readiness, liveness, startup
A probe failure can keep a perfectly good container in the doghouse. Three kinds:
| Probe | What it controls | Failure consequence |
|---|---|---|
| Liveness | "Is the process healthy?" | Container is killed and restarted |
| Readiness | "Should traffic be routed here?" | Pod removed from Service endpoints (no traffic) |
| Startup | "Has the app finished initialising?" Liveness and readiness probes are held off until it passes (good for slow-loading models) | Container is killed and restarted once failureThreshold is exhausted |
spec:
  containers:
    - name: vision
      startupProbe:
        httpGet: { path: /health, port: 8000 }
        failureThreshold: 30
        periodSeconds: 10   # Allow 5 minutes for startup
      readinessProbe:
        httpGet: { path: /ready, port: 8000 }
        periodSeconds: 5
      livenessProbe:
        httpGet: { path: /health, port: 8000 }
        periodSeconds: 10
For AI workloads where the first inference call loads a multi-GB model, startup probes are essential. Without them, the pod's liveness probe fails during model load and the pod restarts in a loop.
Real-world example: Mira's model-load CrashLoopBackOff
Mira deployed a new vision model. Pods went straight to CrashLoopBackOff. Logs showed the model was still downloading from blob storage when liveness probes started, and after 3 failed probes the pod was killed.
Fix: add a startup probe with failureThreshold: 30 and periodSeconds: 10. That gives 30 × 10 s = 300 s, a 5-minute grace window. Once the startup probe passes, the liveness probe takes over.
Lesson: liveness probes assume your app is up. Startup probes acknowledge that AI containers often need a long warm-up.
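The grace window is simple arithmetic, so it is worth checking before you deploy. The sketch below computes the window and the smallest failureThreshold that covers a known load time. Both function names are hypothetical, and the calculation is an approximation that ignores initialDelaySeconds and per-probe timeouts:

```shell
# Seconds a startup probe keeps trying before the container is restarted.
startup_window() {
  local failure_threshold="$1" period_seconds="$2"
  echo $((failure_threshold * period_seconds))
}

# Smallest failureThreshold whose window covers load_seconds (ceiling division).
min_failure_threshold() {
  local load_seconds="$1" period_seconds="$2"
  echo $(( (load_seconds + period_seconds - 1) / period_seconds ))
}
```

With Mira's settings, `startup_window 30 10` gives 300 seconds. For a model that takes about 3 minutes to load, `min_failure_threshold 180 10` gives 18, so anything comfortably above that leaves headroom for a slow blob download.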
End-to-end connectivity check: Network Watcher for AKS and Container Apps
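Azure Network Watcher's Connection Troubleshoot can verify reachability from a VM-based source, such as an AKS node, to any Azure resource (Cosmos, Key Vault, Service Bus) and tell you which hop drops the packet. The source has to be a virtual machine, so for Container Apps run the check from a VM in the same virtual network.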
az network watcher test-connectivity \
--source-resource $POD_VM_RESOURCE_ID \
--dest-address roo-kv.vault.azure.net \
--dest-port 443
Output reports the connection status, latency, and the next hops: a programmatic equivalent of "where exactly does the packet die?"
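For scripted checks it can help to reduce that output to a single word. The sketch below wraps the same command and uses the CLI's global `--query` and `--output` options to extract one field. It assumes the result exposes a `connectionStatus` property (for example `Reachable` or `Unreachable`); verify the field name against your CLI version's JSON output. The function name `connection_status` is hypothetical:

```shell
# Print the connection status for a source VM -> host:port check.
# Assumes the JSON result has a top-level "connectionStatus" field.
connection_status() {
  local source_id="$1" host="$2" port="${3:-443}"
  az network watcher test-connectivity \
    --source-resource "$source_id" \
    --dest-address "$host" \
    --dest-port "$port" \
    --query connectionStatus --output tsv
}
```

A call such as `connection_status "$POD_VM_RESOURCE_ID" roo-kv.vault.azure.net` could then gate a deployment step. When the answer is not the one you expect, rerun the full command without `--query` to inspect the hop list.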
Knowledge check
Theo's clinical assistant pod restarts every 90 seconds in CrashLoopBackOff. The model is large (4 GB) and takes about 3 minutes to load on startup. The pod has a liveness probe but no startup probe. What's the most likely cause and fix?
Mira's Container App `roo-vision` started failing yesterday with `ImagePullBackOff`. The image, registry, and tag are all correct. Nothing changed in the manifests. What's the most likely cause?
Lin shells into a Container Apps replica and runs `curl https://roo-kv.vault.azure.net/secrets/foo`. It returns `Authentication failed`. The Container App has system-assigned managed identity enabled. What's the most likely missing step?