Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

Incident Triage

Quick Health Check

code

1. get_nodes()                    # Node status
2. get_pods(namespace="kube-system")  # Control plane
3. get_events(namespace)          # Recent events

Severity Assessment

Indicator	Severity	Action
Multiple nodes NotReady	Critical	Escalate immediately
kube-system pods failing	Critical	Control plane issue
Single pod CrashLoop	Medium	Debug pod
High latency	Medium	Check resources

Runbook: Pod Failures

CrashLoopBackOff

code

1. get_pod_logs(name, namespace, previous=True)
2. describe_pod(name, namespace)
3. get_events(namespace, field_selector="involvedObject.name=<pod>")
4. get_pod_metrics(name, namespace)

Common Causes:

•OOMKilled → Increase memory limits
•Exit code 1 → Application error in logs
•Exit code 137 → Killed by OOM or SIGKILL
•Exit code 143 → Graceful SIGTERM

ImagePullBackOff

code

1. describe_pod(name, namespace)  # Check image name
2. get_secrets(namespace)         # Check imagePullSecrets

Common Causes:

•Wrong image name/tag
•Private registry, no imagePullSecret
•Registry rate limiting

Pending Pod

code

1. describe_pod(name, namespace)
2. get_nodes()
3. get_events(namespace)

Common Causes:

•Insufficient resources
•Node selector mismatch
•Taints without tolerations
•PVC not bound

Runbook: Node Issues

Node NotReady

code

1. describe_node(name)
2. get_events(namespace="", field_selector="involvedObject.name=<node>")
3. node_logs_tool(name, "kubelet")

Common Causes:

•kubelet not running
•Network partition
•Disk pressure
•Memory pressure

Node DiskPressure

code

1. describe_node(name)
2. get_pods(field_selector="spec.nodeName=<node>")
3. # Check large containers/logs

Actions:

•Clean up container logs
•Evict low-priority pods
•Expand node disk

Runbook: Network Issues

Service Not Accessible

code

1. get_services(namespace)
2. get_endpoints(namespace)        # Check backends
3. get_pods(namespace, label_selector="<service-selector>")
4. get_network_policies(namespace)

Common Causes:

•No matching pods (empty endpoints)
•Pods not ready
•NetworkPolicy blocking traffic

DNS Resolution Failures

code

1. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
2. get_pod_logs("coredns-xxx", "kube-system")

With Cilium

code

cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

code

istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

code

1. describe_pvc(name, namespace)
2. get_storage_classes()
3. get_events(namespace)

Common Causes:

•No matching PV
•StorageClass not provisioning
•Quota exceeded

Pod Stuck in ContainerCreating

code

1. describe_pod(name, namespace)
2. get_pvc(namespace)
3. get_events(namespace)

Common Causes:

•PVC not bound
•Volume mount error
•Image pull taking time

Runbook: Control Plane Issues

API Server Unavailable

code

1. get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
2. get_events(namespace="kube-system")

etcd Issues

code

1. get_pods(namespace="kube-system", label_selector="component=etcd")
2. get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Cordon Node (Prevent Scheduling)

code

# Via kubectl (not in MCP tools yet)
# kubectl cordon <node>

Drain Node (Evict Pods)

code

# Via kubectl
# kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Force Delete Pod

code

delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

code

rollback_deployment(name, namespace, revision=0)  # Previous version

Helm Rollback

code

rollback_helm_release(name, namespace, revision=1)

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Collects:

•Pod logs and events
•Node conditions
•Resource usage
•Network policies
•Recent changes

Multi-Cluster Incident Response

Check all clusters:

code

for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)

Post-Incident

Document Timeline

•When did the incident start?
•What was the impact?
•What was the root cause?
•What fixed it?

Prevent Recurrence

•Add monitoring/alerting
•Improve resource limits
•Add readiness probes
•Document runbook

Related Skills

•k8s-troubleshoot - Detailed debugging
•k8s-security - Security incidents