Kubernetes Troubleshooting Guide
Overview
This skill provides systematic approaches to diagnosing and resolving common Kubernetes issues including pod failures, networking problems, and resource constraints.
Diagnostic Workflow
Step 1: Identify the Problem
# Check pod status across all namespaces
kubectl get pods -A | grep -v Running
# View recent events sorted by time
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Check node health
kubectl get nodes
kubectl top nodes
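The `grep -v Running` filter above still lists Completed pods and hides pods that are Running but not ready (for example `0/1`). A minimal sketch of a stricter filter follows, assuming the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is hypothetical, not part of kubectl.

```shell
# unhealthy_pods: reads `kubectl get pods -A` output on stdin.
# Keeps the header, drops Completed pods, and prints any pod that is
# either not Running or Running with fewer ready containers than expected.
unhealthy_pods() {
  awk 'NR == 1 { print; next }
       { split($3, ready, "/") }
       $4 == "Completed" { next }
       $4 != "Running" || ready[1] != ready[2] { print }'
}
```

Usage: `kubectl get pods -A | unhealthy_pods`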
Step 2: Gather Details
# Describe problematic pod
kubectl describe pod <pod-name> -n <namespace>
# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container
# Check resource usage
kubectl top pod <pod-name> -n <namespace>
Pod Issue Resolution
CrashLoopBackOff
Symptoms: Pod repeatedly crashes and restarts
Diagnostic Steps:
- Check logs: kubectl logs <pod> --previous
- Check resources: kubectl describe pod <pod>
- Check events: kubectl get events --field-selector involvedObject.name=<pod>
Common Causes & Fixes:
| Cause | Indicator | Fix |
|---|---|---|
| OOMKilled | Exit code 137 | Increase memory limits |
| Missing config | ConfigMap/Secret errors | Verify CM/Secret exists |
| Failed probes | Liveness probe failed | Adjust probe thresholds |
| App crash | Application error in logs | Fix application code |
| Bad command | Error starting container | Verify command/args |
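The exit code of the last terminated container is often the quickest way to choose a row in the table above. The sketch below maps a few well-known codes to likely causes. It assumes the usual shell and signal conventions (128 plus the signal number) and is a heuristic, not a diagnosis. The function name `explain_exit_code` is hypothetical.

```shell
# explain_exit_code CODE
# Prints a likely cause for a container exit code.
explain_exit_code() {
  case $1 in
    0)   echo "exited cleanly; check why the container is expected to keep running" ;;
    1)   echo "application error; check logs" ;;
    126) echo "command found but not executable; verify command/args" ;;
    127) echo "command not found; verify command/args" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15); container was asked to stop" ;;
    *)   echo "unrecognized exit code $1; check logs and events" ;;
  esac
}
```

The code for the first container in a pod can usually be read with `kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` and passed to the function.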
ImagePullBackOff
Symptoms: Container image cannot be pulled
Diagnostic Steps:
- Verify image exists: docker pull <image>
- Check image name spelling
- Verify registry credentials
Fixes:
# Check secret exists
kubectl get secret <registry-secret> -n <namespace>
# Create registry secret
kubectl create secret docker-registry regcred \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<password> \
-n <namespace>
# Verify pod has imagePullSecrets
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
Pending State
Symptoms: Pod stuck in Pending, not scheduled
Diagnostic Steps:
- Check node resources: kubectl describe nodes | grep -A 5 "Allocated resources"
- Check PVC status: kubectl get pvc -n <namespace>
- Check taints/tolerations: kubectl describe node <node> | grep Taint
Common Causes:
| Cause | Check | Fix |
|---|---|---|
| Insufficient CPU/Memory | kubectl describe node | Add nodes or reduce requests |
| PVC not bound | kubectl get pvc | Check storage class |
| Node selector miss | Pod spec nodeSelector | Update selector or label nodes |
| Taint not tolerated | Node taints | Add toleration to pod |
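The FailedScheduling event on a Pending pod usually names which row of the table applies. The sketch below matches a scheduler message against commonly seen phrases. The exact wording varies between Kubernetes versions, so the patterns are assumptions and may need adjusting. Only the first matching cause is reported, and the function name `classify_pending` is hypothetical.

```shell
# classify_pending MESSAGE
# Maps a FailedScheduling event message to a cause and fix from the table.
classify_pending() {
  case $1 in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "Insufficient CPU/Memory: add nodes or reduce requests" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "PVC not bound: check storage class" ;;
    *"node affinity/selector"*|*"node selector"*)
      echo "Node selector miss: update selector or label nodes" ;;
    *taint*)
      echo "Taint not tolerated: add toleration to pod" ;;
    *)
      echo "Unrecognized message: review kubectl describe pod output" ;;
  esac
}
```

Usage: copy the message from the Events section of `kubectl describe pod <pod>` and pass it as a single quoted argument.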
Service Issues
No Endpoints
Symptoms: Service returns no endpoints, traffic not reaching pods
Diagnostic Steps:
# Check endpoints
kubectl get endpoints <service> -n <namespace>
# Verify pod labels match selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
Fix: Ensure pod labels match service selector exactly.
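Every key=value pair in the service selector must appear in the pod's labels; extra labels on the pod are fine. A minimal sketch of that subset check follows, assuming both inputs are comma-separated `key=value` lists in the form printed by `--show-labels`. The function name `selector_matches` is hypothetical.

```shell
# selector_matches SELECTOR LABELS
# Returns 0 when every selector pair appears in LABELS.
# Otherwise prints the first missing pair and returns 1.
selector_matches() {
  sm_selector=$1
  sm_labels=$2
  sm_old_ifs=$IFS
  IFS=,
  for sm_pair in $sm_selector; do
    case ",$sm_labels," in
      *",$sm_pair,"*) ;;
      *) IFS=$sm_old_ifs
         echo "missing label: $sm_pair"
         return 1 ;;
    esac
  done
  IFS=$sm_old_ifs
  return 0
}
```

Usage: `selector_matches "app=web,tier=frontend" "app=web,pod-template-hash=abc123"` prints `missing label: tier=frontend`.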
DNS Resolution Failures
Symptoms: Pods cannot resolve service names
Diagnostic Steps:
# Test DNS from pod
kubectl exec -it <pod> -- nslookup kubernetes.default
kubectl exec -it <pod> -- cat /etc/resolv.conf
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
Resource Issues
OOMKilled
Symptoms: Container killed due to memory limits
Diagnostic Steps:
# Check container status
kubectl describe pod <pod> | grep -A 10 "Last State"
# Check memory usage
kubectl top pod <pod>
Fix:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi" # Increase this
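Before raising the limit, it can help to see how close current usage is to it. The sketch below computes usage as a whole-number percentage of the limit. It assumes quantities with `Ki`, `Mi`, or `Gi` suffixes, as typically shown by `kubectl top pod` and in pod specs, and rejects other suffixes. The function names `to_mib` and `mem_pct` are hypothetical.

```shell
# to_mib QUANTITY
# Converts a Ki/Mi/Gi memory quantity to whole MiB (rounded down).
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# mem_pct USED LIMIT
# Prints USED as an integer percentage of LIMIT (rounded down).
mem_pct() {
  mp_used=$(to_mib "$1") || return 1
  mp_limit=$(to_mib "$2") || return 1
  echo $(( mp_used * 100 / mp_limit ))
}
```

Usage: `mem_pct 480Mi 512Mi` prints `93`, which suggests the container is running close to its limit.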
CPU Throttling
Symptoms: Slow application response, high latency
Diagnostic Steps:
# Check CPU usage vs limits
kubectl top pod <pod>
kubectl describe pod <pod> | grep -A 5 "Limits"
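Usage close to the limit only hints at throttling. One way to confirm it, assuming the container image includes `cat` and exposes the cgroup filesystem, is to read the `nr_periods` and `nr_throttled` counters from the container's `cpu.stat` file. The path is typically `/sys/fs/cgroup/cpu.stat` on cgroup v2 and `/sys/fs/cgroup/cpu/cpu.stat` on cgroup v1, and which one applies depends on the node. The sketch below computes the share of scheduling periods that were throttled. The function name `throttle_pct` is hypothetical.

```shell
# throttle_pct: reads cpu.stat content on stdin and prints the integer
# percentage of CFS periods in which the container was throttled.
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%d\n", throttled * 100 / periods
         else print 0
       }'
}
```

Usage: `kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat | throttle_pct`. A persistently high value suggests the CPU limit is too low for the workload.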
Quick Reference
| Symptom | First Check | Common Fix |
|---|---|---|
| Pod not starting | describe pod | Fix image/config |
| Service no endpoints | Pod labels | Match selector |
| OOMKilled | Memory limits | Increase limits |
| CrashLoopBackOff | Pod logs --previous | Fix app error |
| Pending pod | Node resources | Scale cluster |
| ImagePullBackOff | Image name/secret | Fix registry auth |
Useful Aliases
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgpa='kubectl get pods -A'
alias kdp='kubectl describe pod'
alias kl='kubectl logs'
alias klf='kubectl logs -f'
alias kge='kubectl get events --sort-by=.lastTimestamp'
Escalation Checklist
Before escalating:
- Checked pod logs (current and previous)
- Described pod and reviewed events
- Verified resource availability
- Checked network connectivity
- Reviewed recent deployments/changes
- Documented timeline of issue