Kubernetes Debugger
Systematic debugging workflows for Kubernetes issues using MCP kubernetes tools.
Prerequisites
Install Kubernetes MCP Server
bash
claude mcp add kubernetes --scope user -- npx mcp-server-kubernetes
Requirements:
- •Access to a Kubernetes cluster configured for kubectl (minikube, Rancher Desktop, GKE, EKS, AKS, etc.)
- •kubeconfig at ~/.kube/config (default) or KUBECONFIG env var set
- •Helm v3 in PATH (optional, for Helm operations)
Alternative installation methods:
bash
# Global install
npm install -g mcp-server-kubernetes
# Or run directly with npx (no install)
npx mcp-server-kubernetes
Verify installation:
bash
claude mcp list # Should show 'kubernetes' server
Quick Reference: MCP Tools
| Tool | Use For |
|---|---|
| kubectl_get | List resources, check status, find resource names |
| kubectl_describe | Detailed info, events, conditions |
| kubectl_logs | Container stdout/stderr, application errors |
| exec_in_pod | Run commands inside containers |
| kubectl_rollout | Deployment rollout status/history |
| node_management | Cordon/drain/uncordon nodes |
Debugging Decision Tree
code
Issue reported
│
├─ Pod not running? ──────────► See: Pod Debugging Workflow
│
├─ Service unreachable? ──────► See: Service/Network Debugging
│
├─ Deployment stuck? ─────────► See: Deployment Debugging
│
├─ Node issues? ──────────────► See: Node Debugging
│
└─ Performance/Resources? ────► See: Resource Debugging
Pod Debugging Workflow
Step 1: Get Pod Status
code
kubectl_get(resourceType="pods", namespace="<ns>")
Common statuses and their meaning:
- •Pending: Scheduling issues (resources, node selector, affinity)
- •CrashLoopBackOff: Container crashing repeatedly
- •ImagePullBackOff/ErrImagePull: Cannot pull container image
- •Running but not ready: Readiness probe failing
- •Terminating: Stuck deletion (finalizers, PDB)
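Statuses such as CrashLoopBackOff and ImagePullBackOff are container waiting reasons, not values of `status.phase`, so they have to be read from the container statuses. A minimal sketch, assuming the pod has been fetched as JSON (for example with `kubectl get pod <pod> -o json`); the helper name `waiting_reasons` and the sample pod are hypothetical:

```python
from typing import Any


def waiting_reasons(pod: dict[str, Any]) -> dict[str, str]:
    """Map container name -> waiting reason (e.g. CrashLoopBackOff)."""
    reasons: dict[str, str] = {}
    status = pod.get("status", {})
    # Init containers are reported separately from regular containers.
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key, []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                reasons[cs["name"]] = waiting.get("reason", "Unknown")
    return reasons


# Hypothetical sample, trimmed to the fields the function reads.
sample_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "sidecar", "state": {"running": {}}},
        ],
    }
}
print(waiting_reasons(sample_pod))
```

An empty result for a pod that is not Ready points toward probes or scheduling rather than a crashing container.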
Step 2: Check Events and Conditions
code
kubectl_describe(resourceType="pod", name="<pod>", namespace="<ns>")
Look for in output:
- •Events section: Scheduling failures, image pull errors, probe failures
- •Conditions: PodScheduled, Initialized, ContainersReady, Ready
- •Container State: Waiting (reason), Running, Terminated (exit code)
Step 3: Get Container Logs
code
kubectl_logs(resourceType="pod", name="<pod>", namespace="<ns>", container="<container>")
Options:
- •previous=true: Logs from crashed container
- •tail=100: Last N lines
- •since="1h": Logs from last hour
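When the MCP server is unavailable, the same options can be expressed as plain `kubectl logs` flags. The sketch below builds the argument list; it assumes the MCP options correspond to kubectl's `--previous`, `--tail`, and `--since` flags, which the source does not confirm, and the helper name `kubectl_logs_args` is hypothetical:

```python
from typing import Optional


def kubectl_logs_args(
    pod: str,
    namespace: str,
    container: Optional[str] = None,
    previous: bool = False,
    tail: Optional[int] = None,
    since: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `kubectl logs` without running it."""
    args = ["kubectl", "logs", pod, "-n", namespace]
    if container:
        args += ["-c", container]
    if previous:
        args.append("--previous")  # logs from the previous container instance
    if tail is not None:
        args.append(f"--tail={tail}")  # last N lines
    if since:
        args.append(f"--since={since}")  # e.g. "1h"
    return args


print(" ".join(kubectl_logs_args("web-0", "prod", previous=True, tail=100)))
```

The list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems.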
Step 4: Exec Into Container (if running)
code
exec_in_pod(name="<pod>", namespace="<ns>", command=["sh", "-c", "<cmd>"])
Useful commands:
- •["cat", "/etc/resolv.conf"] - Check DNS config
- •["env"] - Verify environment variables
- •["ls", "-la", "/app"] - Check mounted files
- •["nc", "-zv", "<host>", "<port>"] - Test connectivity
Common Pod Issues
CrashLoopBackOff
- •Get logs: kubectl_logs(previous=true) for crashed container
- •Check exit code in kubectl_describe output
- •Common causes:
- •Exit code 1: Application error
- •Exit code 137: SIGKILL, typically OOMKilled (check memory limits)
- •Exit code 143: SIGTERM (graceful shutdown issue)
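Exit codes above 128 conventionally mean the process was ended by a signal, with the signal number being the code minus 128. A small sketch of that arithmetic, assuming a Linux or macOS Python (signal names are platform dependent); the function name `describe_exit_code` is hypothetical:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a rough interpretation of a container exit code."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application error"


for code in (1, 137, 143):
    print(code, describe_exit_code(code))
```

This is only a convention: an application can also return such a code itself, so confirm against the termination reason in the describe output.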
ImagePullBackOff
- •Check image name/tag in describe output
- •Verify image exists in registry
- •Check imagePullSecrets if private registry
- •Look for "Failed to pull image" in events
Pending Pod
- •Check events for scheduling failure reason
- •Common causes:
- •Insufficient cpu/memory: Node capacity exhausted
- •node(s) didn't match node selector: Wrong labels
- •PersistentVolumeClaim not bound: Storage issue
- •0/N nodes are available, citing a taint the pod does not tolerate: Taints/tolerations mismatch
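A scheduling event can list several reasons at once, so it helps to scan the message for each one. The sketch below does a plain substring match; the substrings and the sample message are assumptions for illustration, since the exact wording varies between Kubernetes versions:

```python
# (substring, likely cause) pairs. The substrings are assumptions.
CAUSES = [
    ("Insufficient", "Node capacity exhausted"),
    ("selector", "Wrong labels"),
    ("PersistentVolumeClaim", "Storage issue"),
    ("taint", "Taints/tolerations mismatch"),
]


def scheduling_causes(message: str) -> list[str]:
    """Return every likely cause whose substring appears in the message."""
    return [cause for needle, cause in CAUSES if needle in message]


# Hypothetical event message, for illustration only.
sample_message = (
    "0/3 nodes are available: 1 Insufficient cpu, "
    "2 node(s) had untolerated taint {dedicated: gpu}."
)
print(scheduling_causes(sample_message))
```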
Service/Network Debugging
Step 1: Verify Service Exists
code
kubectl_get(resourceType="services", namespace="<ns>")
kubectl_describe(resourceType="service", name="<svc>", namespace="<ns>")
Step 2: Check Endpoints
code
kubectl_get(resourceType="endpoints", name="<svc>", namespace="<ns>")
No endpoints? Check:
- •Pod labels match service selector
- •Pods are Running and Ready
- •Target port matches container port
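The label check can be done mechanically: a Service with an equality-based selector picks a pod only if every selector key is present on the pod with the same value. A minimal sketch, with the function name `selector_matches` and the sample labels being hypothetical:

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod labels."""
    # A Service without a selector does not select pods at all.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical values, as read from the Service spec and pod metadata.
svc_selector = {"app": "web", "tier": "frontend"}
pod_labels = {"app": "web", "tier": "front-end", "version": "v2"}
print(selector_matches(svc_selector, pod_labels))
```

Here the near-miss `front-end` versus `frontend` is the kind of mismatch that leaves a Service with no endpoints.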
Step 3: Test DNS Resolution
code
exec_in_pod(name="<debug-pod>", command=["nslookup", "<service>.<namespace>.svc.cluster.local"])
Step 4: Test Connectivity
code
exec_in_pod(name="<debug-pod>", command=["nc", "-zv", "<service>", "<port>"])
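Minimal images often lack nslookup and nc. If the debug pod has a Python interpreter (an assumption, not something the tools above guarantee), the same two checks can be sketched with the standard library. The helper names are hypothetical, and `cluster.local` is the cluster domain used in the example above:

```python
import socket


def service_fqdn(service: str, namespace: str, domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{domain}"


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Rough equivalent of `nc -zv`: try a TCP connect and report success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


print(service_fqdn("api", "prod"))
```

A call such as `port_open(service_fqdn("api", "prod"), 8080)` covers both steps, because name resolution errors also surface as `OSError`.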
Deployment Debugging
Check Rollout Status
code
kubectl_rollout(subCommand="status", resourceType="deployment", name="<deploy>", namespace="<ns>")
View Rollout History
code
kubectl_rollout(subCommand="history", resourceType="deployment", name="<deploy>", namespace="<ns>")
Rollback if Needed
code
kubectl_rollout(subCommand="undo", resourceType="deployment", name="<deploy>", namespace="<ns>")
Common Issues
- •Progressing stuck: New pods failing (check ReplicaSet pods)
- •Available < desired: Pods not passing readiness probes
- •Surge/unavailable conflicts: Check deployment strategy
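These symptoms can also be read from the Deployment object itself. A sketch under the assumption that the Deployment was fetched as JSON (for example with `kubectl get deployment <deploy> -o json`); the function name `rollout_problems` and the sample object are hypothetical:

```python
from typing import Any


def rollout_problems(deploy: dict[str, Any]) -> list[str]:
    """List rollout symptoms visible in a Deployment's spec and status."""
    desired = deploy.get("spec", {}).get("replicas", 1)  # API default is 1
    status = deploy.get("status", {})
    problems = []
    if status.get("updatedReplicas", 0) < desired:
        problems.append("new pods not fully rolled out")
    if status.get("availableReplicas", 0) < desired:
        problems.append("available < desired")
    for cond in status.get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            problems.append("progress deadline exceeded")
    return problems


# Hypothetical sample, trimmed to the fields the function reads.
sample_deploy = {
    "spec": {"replicas": 3},
    "status": {
        "updatedReplicas": 1,
        "availableReplicas": 2,
        "conditions": [
            {"type": "Progressing", "status": "False",
             "reason": "ProgressDeadlineExceeded"},
        ],
    },
}
print(rollout_problems(sample_deploy))
```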
Node Debugging
Check Node Status
code
kubectl_get(resourceType="nodes")
kubectl_describe(resourceType="node", name="<node>")
Node Conditions to Check
| Condition | Problem If |
|---|---|
| Ready | False or Unknown |
| MemoryPressure | True |
| DiskPressure | True |
| PIDPressure | True |
| NetworkUnavailable | True |
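The table translates directly into a lookup. The sketch below assumes the node was fetched as JSON (for example with `kubectl get node <node> -o json`) and reads `status.conditions`; the function name `unhealthy_conditions` and the sample node are hypothetical:

```python
from typing import Any

# Condition type -> status values that indicate a problem (from the table above).
PROBLEM_STATUS = {
    "Ready": {"False", "Unknown"},
    "MemoryPressure": {"True"},
    "DiskPressure": {"True"},
    "PIDPressure": {"True"},
    "NetworkUnavailable": {"True"},
}


def unhealthy_conditions(node: dict[str, Any]) -> list[str]:
    """Return the node condition types whose status signals a problem."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        c["type"]
        for c in conditions
        if c.get("status") in PROBLEM_STATUS.get(c.get("type"), set())
    ]


# Hypothetical sample, trimmed to the fields the function reads.
sample_node = {
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "Ready", "status": "Unknown"},
        ]
    }
}
print(unhealthy_conditions(sample_node))
```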
Drain Node for Maintenance
code
node_management(operation="cordon", nodeName="<node>")  # Prevent new pods
node_management(operation="drain", nodeName="<node>", confirmDrain=true)  # Evict pods
# After maintenance:
node_management(operation="uncordon", nodeName="<node>")
Resource Debugging
Check Resource Usage
code
kubectl_generic(command="top", resourceType="pods", namespace="<ns>")
kubectl_generic(command="top", resourceType="nodes")
OOMKilled Detection
- •kubectl_describe pod - look for "OOMKilled" in container state
- •Check memory limits vs actual usage
- •Solutions:
- •Increase memory limits
- •Fix memory leak in application
- •Add memory requests for better scheduling
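Comparing usage against the limit means converting Kubernetes memory quantities to bytes first. The sketch below handles only plain integers or decimals with the common suffixes (k, M, G, T and Ki, Mi, Gi, Ti); the full quantity grammar has more forms, and the helper names are hypothetical:

```python
import re

# Decimal (power of 10) and binary (power of 2) suffixes. A subset only.
_UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def memory_usage_ratio(usage: str, limit: str) -> float:
    """Fraction of the memory limit currently in use."""
    return parse_memory(usage) / parse_memory(limit)


# Hypothetical values: usage from `kubectl top`, limit from the pod spec.
print(f"{memory_usage_ratio('450Mi', '512Mi'):.0%}")
```

A ratio that stays close to 1 before restarts suggests the limit is too tight or the application is leaking.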
CPU Throttling
- •Check if CPU limits are too restrictive
- •Consider removing CPU limits (keep requests)
- •Use kubectl top pods to see actual usage
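`kubectl top` shows usage but not throttling itself. On Linux nodes the container's cgroup `cpu.stat` file records throttling counters, and its contents can be read with exec_in_pod and `cat`. The sketch below parses that text; it assumes the file exposes `nr_periods` and `nr_throttled`, that its path depends on the cgroup version (commonly `/sys/fs/cgroup/cpu.stat` on cgroup v2), and the function name `throttle_ratio` and sample text are hypothetical:

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    stats: dict[str, int] = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Keep only well-formed "name value" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical cpu.stat contents, for illustration only.
sample_cpu_stat = "nr_periods 200\nnr_throttled 50\nthrottled_usec 1200000\n"
print(throttle_ratio(sample_cpu_stat))
```

A ratio that stays high while `kubectl top` shows usage well below the limit is a hint that the limit is too restrictive for bursty workloads.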
Reference Files
- •references/pod-states.md: Complete pod state reference
- •references/common-errors.md: Error messages and solutions
- •references/network-debug.md: Network troubleshooting details