Kubernetes Troubleshooting Guide

Overview

This skill provides systematic approaches to diagnosing and resolving common Kubernetes issues including pod failures, networking problems, and resource constraints.

Diagnostic Workflow

Step 1: Identify the Problem

bash

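# List pods that are not Running, across all namespaces
# (Completed pods also appear; Running pods that are not Ready do not)
kubectl get pods -A | grep -v Running

# View the 20 most recent events in the current namespace, oldest first
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Check node health (kubectl top requires metrics-server)
kubectl get nodes
kubectl top nodes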
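Filtering with grep -v Running hides pods that are Running but not Ready (for example READY 0/1). The sketch below is one way to catch those as well. The function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...).

```bash
# Reads `kubectl get pods -A` output on stdin and prints pods that look
# unhealthy: any status other than Running/Completed, or Running with
# fewer ready containers than expected.
# Assumes the default column order: NAMESPACE NAME READY STATUS ...
unhealthy_pods() {
  awk 'NR == 1 { next }                 # skip the header row
       {
         split($3, ready, "/")          # READY column, e.g. "1/2"
         if ($4 == "Completed") next    # finished Jobs are not failures
         if ($4 != "Running" || ready[1] != ready[2])
           print $1 "/" $2 " " $4 " " $3
       }'
}
```

Usage: kubectl get pods -A | unhealthy_pods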

Step 2: Gather Details

bash

# Describe problematic pod
kubectl describe pod <pod-name> -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Logs from the previous (terminated) container instance

# Check resource usage (requires metrics-server)
kubectl top pod <pod-name> -n <namespace>

Pod Issue Resolution

CrashLoopBackOff

Symptoms: The container crashes repeatedly, and the kubelet restarts it with an increasing back-off delay

Diagnostic Steps:

• Check logs: kubectl logs <pod> --previous
• Check resources: kubectl describe pod <pod>
• Check events: kubectl get events --field-selector involvedObject.name=<pod>

Common Causes & Fixes:

Cause	Indicator	Fix
OOMKilled	Exit code 137	Increase memory limits
Missing config	ConfigMap/Secret errors	Verify the ConfigMap/Secret exists in the pod's namespace
Failed probes	Liveness probe failed	Adjust probe thresholds
App crash	Application error in logs	Fix application code
Bad command	Error starting container	Verify command/args
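The exit code of the last terminated container is often the quickest way into the table above. The helper below is a minimal sketch: explain_exit_code is a hypothetical name, and the mapping relies on the common shell convention that codes above 128 mean 128 plus a signal number. Treat the hints as starting points, not a diagnosis.

```bash
# Maps a container exit code to a likely cause from the table above.
# The mapping is a convention-based heuristic, not a kubectl feature.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to be long-running" ;;
    1)   echo "application error; read the logs from the previous container" ;;
    126) echo "command found but not executable; verify command/args and file permissions" ;;
    127) echo "command not found; verify command/args and the image contents" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, so check memory limits" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop, for example after a failed liveness probe" ;;
    *)   echo "unrecognized; inspect logs and events" ;;
  esac
}
```

Usage, assuming the first container in the pod is the one crashing: explain_exit_code "$(kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"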

ImagePullBackOff

Symptoms: The kubelet cannot pull the container image, so the container never starts

Diagnostic Steps:

• Verify the image exists: docker pull <image>
• Check the image name and tag for typos
• Verify registry credentials

Fixes:

bash

# Check that the secret exists in the pod's namespace
kubectl get secret <registry-secret> -n <namespace>

# Create the registry secret in the same namespace as the pod
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Verify the pod references the secret in imagePullSecrets
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
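If the pod spec has no imagePullSecrets, one option is to attach the secret to the ServiceAccount the pod runs as, so that newly created pods pick it up automatically. The sketch below assumes the secret is named regcred as above. The function names pull_secret_patch and attach_pull_secret are hypothetical. Note that the patch replaces the ServiceAccount's imagePullSecrets list, and existing pods are not updated.

```bash
# Builds the patch body that sets a single image pull secret.
pull_secret_patch() {
  printf '{"imagePullSecrets":[{"name":"%s"}]}\n' "$1"
}

# Patches a ServiceAccount so that new pods using it get the pull secret.
# $1 = secret name, $2 = namespace, $3 = service account (defaults to "default")
attach_pull_secret() {
  kubectl patch serviceaccount "${3:-default}" -n "$2" \
    -p "$(pull_secret_patch "$1")"
}
```

Usage: attach_pull_secret regcred <namespace>, then delete the failing pod so its controller recreates it.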

Pending State

Symptoms: Pod is stuck in Pending and has not been scheduled to a node

Diagnostic Steps:

• Check node resources: kubectl describe nodes | grep -A 5 "Allocated resources"
• Check PVC status: kubectl get pvc -n <namespace>
• Check taints/tolerations: kubectl describe node <node> | grep Taint

Common Causes:

Cause	Check	Fix
Insufficient CPU/Memory	`kubectl describe node`	Add nodes or reduce requests
PVC not bound	`kubectl get pvc`	Check storage class
Node selector miss	Pod spec nodeSelector	Update selector or label nodes
Taint not tolerated	Node taints	Add toleration to pod
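The scheduler records why a pod could not be placed in a FailedScheduling event, and its message usually points at one row of the table above. The sketch below maps a message to a cause by keyword. The function name classify_scheduling_message and the category labels are hypothetical, and the exact message wording varies between Kubernetes versions, so the keywords are assumptions. When a message lists several reasons, only the first match is reported.

```bash
# Classifies a FailedScheduling event message by keyword.
# Keywords are assumptions; scheduler wording differs across versions.
classify_scheduling_message() {
  case "$1" in
    *Insufficient*)                      echo "insufficient-resources" ;;
    *PersistentVolumeClaim*)             echo "pvc-not-bound" ;;
    *taint*)                             echo "taint-not-tolerated" ;;
    *"node affinity"*|*"node selector"*) echo "node-selector-miss" ;;
    *)                                   echo "unknown" ;;
  esac
}
```

The messages can be listed with kubectl get events -n <namespace> --field-selector reason=FailedScheduling and passed to the function one at a time.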

Service Issues

No Endpoints

Symptoms: The Service has no endpoints, so traffic does not reach any pod

Diagnostic Steps:

bash

# Check endpoints
kubectl get endpoints <service> -n <namespace>

# Verify pod labels match selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'

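Fix: Ensure the pod labels match the Service selector exactly, and that the pods are Ready. Pods that fail their readiness probe are not listed as endpoints.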
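A selector matches only when every key=value pair in it appears on the pod, so a single differing value is enough to leave the Service without endpoints. The comparison can be sketched as below. The function name labels_match_selector is hypothetical, and both arguments are assumed to be comma-separated key=value lists, which is the form shown by --show-labels and by the SELECTOR column of kubectl get svc -o wide.

```bash
# Returns 0 if every selector pair is present in the pod's labels.
# $1 = pod labels, $2 = service selector, both "k1=v1,k2=v2".
# The body runs in a subshell so the IFS change does not leak out.
labels_match_selector() (
  labels=",$1,"
  IFS=','
  for pair in $2; do
    case "$labels" in
      *",$pair,"*) ;;        # this selector pair is present on the pod
      *) return 1 ;;         # missing key or different value: no match
    esac
  done
  return 0
)
```

Usage: labels_match_selector 'app=web,tier=frontend' 'app=web' succeeds, while labels_match_selector 'app=web-v2' 'app=web' fails. For a live check, kubectl get pods -n <namespace> -l <selector> lists the pods the Service would select.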

DNS Resolution Failures

Symptoms: Pods cannot resolve service names

Diagnostic Steps:

bash

# Test DNS from pod
kubectl exec -it <pod> -- nslookup kubernetes.default
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
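When nslookup fails, it helps to confirm that the pod is pointed at the cluster DNS service at all. The sketch below extracts the nameserver entries from resolv.conf so they can be compared with the ClusterIP of the kube-dns Service. The function name resolv_nameservers is hypothetical. The comparison assumes a default setup; clusters that run a node-local DNS cache will show a different address.

```bash
# Reads resolv.conf content on stdin and prints each nameserver address.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}
```

Usage: run kubectl exec <pod> -- cat /etc/resolv.conf | resolv_nameservers and compare the result with kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'.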

Resource Issues

OOMKilled

Symptoms: Container is killed for exceeding its memory limit

Diagnostic Steps:

bash

# Check container status (look for Reason: OOMKilled and Exit Code: 137)
kubectl describe pod <pod> | grep -A 10 "Last State"

# Check current memory usage (requires metrics-server)
kubectl top pod <pod>

Fix:

yaml

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Increase this
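Before raising the limit, it can help to see how close current usage is to it. The sketch below converts the quantities shown by kubectl top pod and the pod spec to MiB and reports usage as a percentage of the limit. The function names to_mib and memory_usage_percent are hypothetical, and only whole-number Ki, Mi, and Gi quantities are handled.

```bash
# Converts a whole-number Kubernetes memory quantity (Ki, Mi, Gi) to MiB.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Prints usage as an integer percentage of the limit (rounded down).
# $1 = current usage (e.g. from `kubectl top pod`), $2 = memory limit
memory_usage_percent() {
  usage_mib=$(to_mib "$1") || return 1
  limit_mib=$(to_mib "$2") || return 1
  echo $(( usage_mib * 100 / limit_mib ))
}
```

Usage: memory_usage_percent 480Mi 512Mi prints 93, which suggests the limit leaves little headroom.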

CPU Throttling

Symptoms: Slow application responses and high latency while CPU usage sits at or near the limit

Diagnostic Steps:

bash

# Check CPU usage vs limits
kubectl top pod <pod>
kubectl describe pod <pod> | grep -A 5 "Limits"
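kubectl top shows usage but not throttling itself. On Linux nodes the kernel records throttling in the container's cgroup cpu.stat file, which can be read from inside the pod. The sketch below computes the share of scheduling periods in which the container was throttled. The function name throttled_percent is hypothetical. The file path is an assumption about the node: /sys/fs/cgroup/cpu.stat on cgroup v2, typically /sys/fs/cgroup/cpu/cpu.stat on cgroup v1. The image must also include cat.

```bash
# Reads cpu.stat content on stdin and prints the percentage of CFS
# periods in which the cgroup was throttled (integer, rounded down).
throttled_percent() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%d\n", throttled * 100 / periods
         else print 0
       }'
}
```

Usage on a cgroup v2 node: kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat | throttled_percent. A consistently high value suggests raising the CPU limit or reducing the workload's CPU demand.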

Quick Reference

Symptom	First Check	Common Fix
Pod not starting	`describe pod`	Fix image/config
Service no endpoints	Pod labels	Match selector
OOMKilled	Memory limits	Increase limits
CrashLoopBackOff	Pod logs `--previous`	Fix app error
Pending pod	Node resources	Scale cluster
ImagePullBackOff	Image name/secret	Fix registry auth

Useful Aliases

bash

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgpa='kubectl get pods -A'
alias kdp='kubectl describe pod'
alias kl='kubectl logs'
alias klf='kubectl logs -f'
alias kge='kubectl get events --sort-by=.lastTimestamp'

Escalation Checklist

Before escalating:

• Checked pod logs (current and previous)
• Described pod and reviewed events
• Verified resource availability
• Checked network connectivity
• Reviewed recent deployments/changes
• Documented timeline of issue