Skill: Kubernetes Pod Crash Handler
Purpose: Diagnose and resolve Kubernetes pod crashes and deployment issues for Phase IV/V.
Trigger Conditions
Invoke this skill when:
- Pod status is CrashLoopBackOff, Error, or ImagePullBackOff
- User mentions "pod crashing", "deployment failing", or "container restarting"
- Kubernetes deployment not reaching Ready state
Capabilities
1. Diagnostic Commands
bash
# Check pod status
kubectl get pods -n <namespace>

# Get detailed pod info
kubectl describe pod <pod-name> -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace>

# Check previous container logs (if restarting)
kubectl logs <pod-name> -n <namespace> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
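When a namespace holds many pods, it can help to filter the `kubectl get pods` listing down to the ones that need attention. The following is a minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE); the function name `unhealthy_pods` is hypothetical and not part of kubectl.

```shell
#!/bin/bash
# Print the name and status of every pod whose STATUS column is neither
# Running nor Completed. Reads `kubectl get pods --no-headers` output on stdin.
# Assumes STATUS is the third whitespace-separated column.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Against a live cluster (namespace is a placeholder):
#   kubectl get pods -n <namespace> --no-headers | unhealthy_pods
```

A pod can report Running while its containers are still not Ready, so this filter is a first pass rather than a full health check.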
2. Common Crash Causes & Solutions
CrashLoopBackOff
| Cause | Diagnosis | Solution |
|---|---|---|
| App error on startup | Check logs for Python/Node errors | Fix application code |
| Missing env vars | Logs show "KeyError" or "undefined" | Add ConfigMap/Secret |
| DB connection fail | Logs show connection errors | Check DB service/credentials |
| Wrong command | Container exits immediately | Fix Dockerfile CMD/ENTRYPOINT |
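The diagnosis column above can be approximated with a small log scanner. This is a sketch under stated assumptions: the function name `guess_crash_cause`, the result labels, and the specific connection-error patterns are illustrative choices, and the matching is a heuristic that will miss many real failure messages.

```shell
#!/bin/bash
# Guess the likely crash cause from container logs read on stdin.
# Patterns follow the table above; the connection-error wording is an assumption.
guess_crash_cause() {
  local logs
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -Eq 'KeyError|undefined'; then
    echo "missing-env"       # add the variable to a ConfigMap/Secret
  elif printf '%s\n' "$logs" | grep -Eiq 'connection (refused|reset|timed out)|could not connect'; then
    echo "db-connection"     # check the DB service and credentials
  elif printf '%s\n' "$logs" | grep -Eq 'Traceback|Error:'; then
    echo "app-error"         # fix the application code
  else
    echo "unknown"
  fi
}

# Against a restarting pod (names are placeholders):
#   kubectl logs <pod-name> -n <namespace> --previous | guess_crash_cause
```

The missing-variable check runs first because a `KeyError` usually also appears inside a Python traceback, which would otherwise be reported as a generic application error.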
Define the missing variables in a ConfigMap (non-sensitive settings) and a Secret (credentials):
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: todo-backend-config
data:
  CORS_ORIGINS: "https://your-frontend.vercel.app"
  DEBUG: "false"
---
apiVersion: v1
kind: Secret
metadata:
  name: todo-backend-secrets
type: Opaque
stringData:
  DATABASE_URL: "postgresql+asyncpg://user:pass@host/db?sslmode=require"
  BETTER_AUTH_SECRET: "your-secret-key-min-32-chars"
Reference both from the Deployment's pod template (under spec.template.spec):
yaml
spec:
  containers:
    - name: backend
      envFrom:
        - configMapRef:
            name: todo-backend-config
        - secretRef:
            name: todo-backend-secrets
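After wiring up `envFrom`, it is worth confirming that the container actually sees every variable the application reads. The sketch below assumes `env`-style input (one `NAME=value` per line); the function name `missing_env_vars` is hypothetical, and the variable names in the usage comment are simply the ones from the manifests above.

```shell
#!/bin/bash
# Report which required variable names are absent from `env`-style input
# (NAME=value lines on stdin). Returns 1 if any are missing, 0 otherwise.
missing_env_vars() {
  local env_dump name missing=0
  env_dump="$(cat)"
  for name in "$@"; do
    if ! printf '%s\n' "$env_dump" | grep -q "^${name}="; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Inside a running pod (pod name and namespace are placeholders):
#   kubectl exec <pod-name> -n <namespace> -- env \
#     | missing_env_vars CORS_ORIGINS DEBUG DATABASE_URL BETTER_AUTH_SECRET
# Locally, before deploying:
#   env | missing_env_vars DATABASE_URL BETTER_AUTH_SECRET
```

`kubectl exec` only works while the container is up, so for a pod that crashes immediately the local form is likely the more practical one.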
ImagePullBackOff
| Cause | Diagnosis | Solution |
|---|---|---|
| Wrong image name | Check describe pod output | Fix image name in deployment |
| Private registry | "unauthorized" in events | Add imagePullSecrets |
| Image doesn't exist | "not found" in events | Build and push image |
Fix for private registry:
bash
# Create secret for Docker Hub
kubectl create secret docker-registry dockerhub-secret \
  --docker-server=docker.io \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>

# Add to deployment spec:
#   imagePullSecrets:
#     - name: dockerhub-secret
OOMKilled (Out of Memory)
Solution: raise the container's memory limit so it sits above the application's peak usage:
yaml
spec:
  containers:
    - name: backend
      resources:
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"
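Before raising limits, it helps to confirm that the last restart really was an out-of-memory kill. A container's previous termination reason is exposed in the pod status under `lastState.terminated.reason`. The helper below is a minimal sketch; the function name `was_oom_killed` is hypothetical.

```shell
#!/bin/bash
# Succeed (exit 0) if any whitespace-separated termination reason read on
# stdin is exactly "OOMKilled".
was_oom_killed() {
  tr -s '[:space:]' '\n' | grep -qx 'OOMKilled'
}

# Against a live pod (pod name and namespace are placeholders):
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}' \
#     | was_oom_killed && echo "Last termination was OOMKilled"
```

If the reason is something else (for example `Error`), increasing memory is unlikely to help and the logs are the better next stop.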
3. Health Check Configuration
yaml
spec:
  containers:
    - name: backend
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
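A liveness probe that fires before the application has finished starting will restart the container repeatedly, which looks much like a crash loop. As a rough sanity check, the time a never-healthy container gets before the probe gives up can be estimated as `initialDelaySeconds + periodSeconds * failureThreshold`. This is an approximation that ignores `timeoutSeconds` and kubelet timing jitter; the function name is hypothetical.

```shell
#!/bin/bash
# Rough startup budget in seconds before a probe reaches failureThreshold:
#   initialDelaySeconds + periodSeconds * failureThreshold
# Ignores timeoutSeconds and kubelet scheduling jitter.
probe_budget_seconds() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + period * failure_threshold ))
}

probe_budget_seconds 30 10 3   # liveness values above: prints 60
probe_budget_seconds 5 5 3     # readiness values above: prints 20
```

If the application needs longer than the liveness estimate to start serving `/health`, raising `initialDelaySeconds` is one option to consider.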
4. Quick Recovery Scripts
restart-deployment.sh:
bash
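#!/bin/bash
# Restart a deployment and wait for the rollout to finish.
# Exits on the first failing command, so "Restart complete!" is only printed on success.
set -euo pipefail

NAMESPACE=${1:-default}
DEPLOYMENT=${2:-todo-backend}

echo "Restarting $DEPLOYMENT in $NAMESPACE..."
kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"
kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE"
echo "Restart complete!"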
debug-pod.sh:
bash
#!/bin/bash
# Print status, description, and recent logs for a single pod.
POD=${1:?Usage: debug-pod.sh <pod-name> [namespace]}
NAMESPACE=${2:-default}

echo "=== Pod Status ==="
kubectl get pod "$POD" -n "$NAMESPACE"
printf '\n=== Pod Description ===\n'
kubectl describe pod "$POD" -n "$NAMESPACE" | tail -50
printf '\n=== Pod Logs ===\n'
kubectl logs "$POD" -n "$NAMESPACE" --tail=100
printf '\n=== Previous Logs (if any) ===\n'
kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=50 2>/dev/null || echo "No previous logs"
5. Pod Crash Decision Tree
code
Pod Not Running
├── CrashLoopBackOff
│   ├── Check logs → App error → Fix code
│   ├── "KeyError"/"undefined" → Missing env → Add ConfigMap/Secret
│   └── Connection error → Check service/DB
│
├── ImagePullBackOff
│   ├── "unauthorized" → Add imagePullSecrets
│   ├── "not found" → Build and push image
│   └── Wrong name → Fix image reference
│
├── OOMKilled
│   └── Increase memory limits
│
├── Pending
│   ├── "Insufficient cpu" → Reduce requests or scale cluster
│   └── "Unschedulable" → Check node resources
│
└── Error
    └── Check events with kubectl get events
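The top level of the tree above can be sketched as a simple status-to-action lookup. The function name `next_step` and the wording of each suggestion are illustrative. `ErrImagePull` is not in the tree; it is added here as an assumption because it is the status Kubernetes reports before backing off to `ImagePullBackOff`.

```shell
#!/bin/bash
# Map a pod status (as printed by `kubectl get pods`) to the first step
# suggested by the decision tree. Unknown statuses fall through to a default.
next_step() {
  case "$1" in
    CrashLoopBackOff)
      echo "Check logs: kubectl logs <pod> --previous" ;;
    ImagePullBackOff|ErrImagePull)   # ErrImagePull is an addition, see above
      echo "Check image name and imagePullSecrets: kubectl describe pod <pod>" ;;
    OOMKilled)
      echo "Increase memory limits" ;;
    Pending)
      echo "Check scheduling events: kubectl describe pod <pod>" ;;
    Error)
      echo "Check events: kubectl get events" ;;
    *)
      echo "No rule for status: $1" ;;
  esac
}

# Example: next_step CrashLoopBackOff
```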
Checklist
- Check pod status: kubectl get pods
- Read pod logs: kubectl logs <pod>
- Check events: kubectl get events
- Verify environment variables in ConfigMap/Secret
- Check resource limits (memory/CPU)
- Verify image exists and is pullable
- Check health probe configuration
- Test locally with same env vars before deploying