Troubleshooting Containers: Logs, Events, Connectivity

The triage hierarchy

Simple explanation

When a container goes wrong, three things can be broken: the image, the runtime, or the network. Triage in that order.

Image: Did it pull? Did it start? (Check pull events and exit codes.)
Runtime: Is it crashing? Hitting OOM? Failing health probes? (Check logs and resource metrics.)
Network: Can it reach its dependencies — Cosmos, Service Bus, Key Vault? (Check DNS and outbound rules.)
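
As a rough sketch of this triage order, a container's status reason can be mapped to the layer worth inspecting first. The function name `triage_layer` is hypothetical, and the list of reasons is illustrative rather than exhaustive.

```shell
# triage_layer REASON
# Hypothetical helper: maps a container status reason to the layer to check first.
triage_layer() {
  case "$1" in
    ImagePullBackOff|ErrImagePull|InvalidImageName)
      echo "image" ;;      # the pull never succeeded: check tag, registry, AcrPull
    CrashLoopBackOff|OOMKilled|Error)
      echo "runtime" ;;    # the container started and then died: check logs and limits
    *)
      echo "network" ;;    # the container is up: test DNS and outbound calls next
  esac
}

triage_layer ImagePullBackOff   # prints: image
triage_layer OOMKilled          # prints: runtime
triage_layer Running            # prints: network
```

The fall-through to "network" is an assumption of this sketch: it only says that once image and runtime look healthy, connectivity is the next thing to test.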

The right Azure-native tools for each platform:

Container Apps → Log Analytics + system events + az containerapp exec
AKS → kubectl describe pod + kubectl logs + Log Analytics

Common failure patterns and what they mean

The six patterns that cover most container failures on Azure.
Symptom	Likely cause	What to check or fix
ImagePullBackOff (AKS) / pull failed (Container Apps)	No permission, wrong tag, registry unreachable	Did the kubelet/managed identity get AcrPull? Is the tag spelt right?
CrashLoopBackOff	Container starts then exits with non-zero code	kubectl logs --previous on AKS; on Container Apps, `az containerapp logs show --type system` plus the console logs table in Log Analytics
OOMKilled (exit code 137)	Container exceeded memory limit	Bump `resources.limits.memory`, or fix the leak
Liveness probe failing	Probe path returns an error status (outside 200-399) or times out	Try the probe URL from inside the pod with curl
ContainerCannotRun / exec format error	Wrong CPU architecture (ARM image on x86 node, or vice versa)	Rebuild for the right platform; use `--platform=linux/amd64`
Outbound calls timing out	NSG, private endpoint missing, DNS, or wrong identity for managed-identity APIs	kubectl exec / containerapp exec → curl/dig from inside; check effective NSG rules
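
The 137 in the OOMKilled row follows a general convention: an exit code above 128 usually means the process was killed by a signal, and the signal number is the code minus 128. So 137 is 128 + 9, and signal 9 is SIGKILL. A minimal sketch of that arithmetic follows; `describe_exit_code` is a hypothetical helper, not part of kubectl or the Azure CLI.

```shell
# describe_exit_code CODE
# Hypothetical helper: explains a container exit code using the 128 + signal convention.
describe_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
    # Codes above 128 conventionally mean "terminated by signal (code - 128)".
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit code $code)"
  fi
}

describe_exit_code 137   # prints: killed by signal 9
describe_exit_code 1     # prints: application error (exit code 1)
```

SIGKILL can come from sources other than the out-of-memory killer, so treat 137 as a hint and confirm it against the OOMKilled reason in the pod status.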

Container Apps — the troubleshooting toolkit

# Stream logs (live)
az containerapp logs show \
  --name roo-vision \
  --resource-group roo-prod \
  --follow

# Logs from a previous (crashed) revision
az containerapp logs show \
  --name roo-vision -g roo-prod \
  --revision roo-vision--v3-2 \
  --tail 200

# Open a shell inside a running replica
az containerapp exec \
  --name roo-vision -g roo-prod \
  --command "/bin/sh"

# System events (KEDA scale events, image pull events, restarts)
az containerapp logs show \
  --name roo-vision -g roo-prod \
  --type system

In Log Analytics:

// Console logs — your app's stdout/stderr
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| project TimeGenerated, RevisionName_s, ContainerName_s, Log_s
| order by TimeGenerated desc

// System logs — image pulls, scale events, container restarts
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "roo-vision"
| where TimeGenerated > ago(1h)
| where Reason_s in ("Pulled", "Failed", "Created", "Started", "Killing", "BackOff")
| project TimeGenerated, Reason_s, Log_s

These two tables (ContainerAppConsoleLogs_CL and ContainerAppSystemLogs_CL) are the spine of Container Apps observability.

Exam tip: 'system' logs vs 'console' logs

Container Apps separates system logs (the platform’s view — image pulls, scale events, container restarts) from console logs (your app’s stdout/stderr). When troubleshooting a startup failure, system logs explain what the platform saw; console logs show what your code printed. Both matter; the exam asks about both.

Match the table name to the question: “the container is restarting in a loop” → system logs to confirm BackOff, console logs to see why each restart fails.

AKS — the kubectl toolkit

# What's actually running?
kubectl get pods -A
kubectl get pod roo-vision-7d4b -o yaml

# Why did this pod misbehave?
kubectl describe pod roo-vision-7d4b
# ↑ Events section at the bottom is the gold — it shows pull, schedule, probe events with timestamps

# Logs from the current container
kubectl logs roo-vision-7d4b -c vision

# Logs from the PREVIOUS crashed container (key for CrashLoopBackOff)
kubectl logs roo-vision-7d4b -c vision --previous

# Shell into the pod
kubectl exec -it roo-vision-7d4b -c vision -- /bin/sh

# Check resource usage at the pod and node level
kubectl top pods -A
kubectl top nodes

The single most important command on AKS for triage is kubectl describe pod. Its Events: section narrates pull, schedule, start, and probe events with timestamps, and the Last State field above it records termination reasons such as OOMKilled.
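
When only the reason for the last termination is needed, the same information can be read straight from the pod status instead of scanning the full describe output. The sketch below assumes a single-container pod (it reads index 0 of `containerStatuses`); `last_termination` is a hypothetical helper name.

```shell
# last_termination POD
# Hypothetical helper: prints "<reason> <exitCode>" for the first container's previous run.
# Assumes kubectl is already pointed at the right cluster and namespace.
last_termination() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

Calling `last_termination roo-vision-7d4b` prints the reason and exit code of the previous container instance, or nothing useful if the container has never been restarted.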

Connectivity troubleshooting from inside a container

When the container starts but can’t reach a dependency:

# Open a shell
kubectl exec -it roo-vision-7d4b -- /bin/sh
# OR
az containerapp exec -n roo-vision -g roo-prod --command "/bin/sh"

# Inside the container:
nslookup roo-cosmos.documents.azure.com    # DNS resolution
curl -v https://roo-cosmos.documents.azure.com/   # TLS reachability
nc -vz roo-cosmos.documents.azure.com 443         # Raw TCP reachability

Patterns:

What you see	What it means
DNS lookup fails	No DNS forwarder, private DNS zone not linked, or the name doesn’t exist
DNS resolves to private IP, curl times out	NSG / firewall rule blocking outbound, or private endpoint not deployed
TCP succeeds but TLS handshake fails	Certificate issue, SNI mismatch, or wrong endpoint
401 / 403	401: the token is missing or invalid (wrong identity or audience). 403: the identity is valid but has no role assignment on the target resource
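
The table reads top to bottom as a decision order: DNS first, then TCP, then TLS, then authorization. A minimal sketch of that order follows, assuming the results of the individual checks have already been collected by hand; `classify_connectivity` and its yes/no arguments are hypothetical.

```shell
# classify_connectivity DNS_OK TCP_OK TLS_OK HTTP_STATUS
# Hypothetical helper: takes "yes"/"no" for each check plus the HTTP status code,
# and prints the first layer that failed.
classify_connectivity() {
  dns_ok="$1"; tcp_ok="$2"; tls_ok="$3"; http_status="$4"
  if [ "$dns_ok" != "yes" ]; then
    echo "dns"        # DNS forwarder, private DNS zone link, or hostname
  elif [ "$tcp_ok" != "yes" ]; then
    echo "network"    # NSG / firewall rule or private endpoint
  elif [ "$tls_ok" != "yes" ]; then
    echo "tls"        # certificate, SNI, or wrong endpoint
  elif [ "$http_status" = "401" ] || [ "$http_status" = "403" ]; then
    echo "auth"       # identity or role assignment on the target resource
  else
    echo "ok"
  fi
}

classify_connectivity yes no no 000     # prints: network
classify_connectivity yes yes yes 403   # prints: auth
```

Checking in this order matters because a failure at one layer makes the results of the later checks meaningless.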

Probes — readiness, liveness, startup

A probe failure can keep a perfectly good container in the doghouse. Three kinds:

Probe	What it controls	Failure consequence
Liveness	“Is the process healthy?”	Container is killed and restarted
Readiness	“Should traffic be routed here?”	Pod removed from Service endpoints (no traffic)
Startup	“Has the app finished initialising?”	Liveness and readiness probes wait until it passes, which suits slow-loading models; if it never passes, the container is killed and restarted

spec:
  containers:
    - name: vision
      startupProbe:
        httpGet: { path: /health, port: 8000 }
        failureThreshold: 30
        periodSeconds: 10        # Allow 5 minutes for startup
      readinessProbe:
        httpGet: { path: /ready, port: 8000 }
        periodSeconds: 5
      livenessProbe:
        httpGet: { path: /health, port: 8000 }
        periodSeconds: 10

For AI workloads where the first inference call loads a multi-GB model, startup probes are essential: without them, the liveness probe fails during model load and the container restarts in a loop.
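
The time a probe tolerates before the container is killed is roughly failureThreshold × periodSeconds. The sketch below compares the two budgets from the manifest above against a model load time. The 180-second load time is a made-up figure, the liveness threshold of 3 is the Kubernetes default (the manifest does not set one), and the calculation ignores `initialDelaySeconds` and `timeoutSeconds`.

```shell
# probe_budget_seconds FAILURE_THRESHOLD PERIOD_SECONDS
# Hypothetical helper: approximate time a probe tolerates before the container is killed.
probe_budget_seconds() {
  echo $(( $1 * $2 ))
}

# covers_load BUDGET_SECONDS LOAD_SECONDS
# Hypothetical helper: says whether the budget is long enough for the load to finish.
covers_load() {
  if [ "$2" -le "$1" ]; then echo "ok"; else echo "restart loop"; fi
}

model_load_seconds=180                          # hypothetical model load time

startup_budget=$(probe_budget_seconds 30 10)    # startup probe: 30 x 10 = 300 s
liveness_budget=$(probe_budget_seconds 3 10)    # liveness alone: 3 x 10 = 30 s

echo "startup probe: $(covers_load "$startup_budget" "$model_load_seconds")"   # prints: startup probe: ok
echo "liveness only: $(covers_load "$liveness_budget" "$model_load_seconds")"  # prints: liveness only: restart loop
```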

Real-world example: Mira's model-load CrashLoopBackOff

Mira deployed a new vision model. Pods went straight to CrashLoopBackOff. Logs showed the model was still downloading from blob storage when liveness probes started, and after 3 failed probes the pod was killed.

Fix: add a startup probe with failureThreshold: 30 and periodSeconds: 10. That gives 30 × 10 s = 300 s, a 5-minute grace window. Once the startup probe passes, the liveness probe takes over.

Lesson: liveness probes assume your app is up. Startup probes acknowledge that AI containers often need a long warm-up.

End-to-end connectivity check: Azure Network Watcher

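Azure Network Watcher’s Connection Troubleshoot can verify reachability from a virtual machine or scale set instance, such as an AKS node, to any Azure resource (Cosmos, Key Vault, Service Bus) and tell you which hop drops the packet. A Container App replica is not a VM you can name as the source, so for Container Apps rely on the in-container checks instead.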

az network watcher test-connectivity \
  --source-resource $POD_VM_RESOURCE_ID \
  --dest-address roo-kv.vault.azure.net \
  --dest-port 443

Output reports the connection status, latency, and the next hops — a programmatic equivalent of “where exactly does the packet die?”
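
For scripted checks, the full JSON is often more than needed. The sketch below wraps the same command and keeps only the connection status, using the Azure CLI's global `--query` and `--output` arguments. The wrapper name `connection_status` is hypothetical, and the `connectionStatus` field name is taken from the command's JSON output as I understand it, so verify it against your CLI version.

```shell
# connection_status SOURCE_RESOURCE_ID DEST_HOST DEST_PORT
# Hypothetical wrapper: prints only the connection status from the connectivity test.
# Assumes the source is a VM resource ID that Network Watcher can probe from.
connection_status() {
  az network watcher test-connectivity \
    --source-resource "$1" \
    --dest-address "$2" \
    --dest-port "$3" \
    --query connectionStatus \
    --output tsv
}
```

A call such as `connection_status "$POD_VM_RESOURCE_ID" roo-kv.vault.azure.net 443` then yields a single word that a script can branch on, while the unfiltered command remains the better choice when you need the hop-by-hop detail.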