AgentSkillsCN

debugging-k8s-pods

Debugs Kubernetes pod failures, including CrashLoopBackOff, OOMKilled, ImagePullBackOff, init container failures, and CreateContainerConfigError. Use when a pod keeps crashing, restarts repeatedly, fails to start, or shows container errors.

SKILL.md
---
name: debugging-k8s-pods
description: Debugs Kubernetes pod failures including CrashLoopBackOff, OOMKilled, ImagePullBackOff, init container failures, and CreateContainerConfigError. Use when pods crash, restart repeatedly, fail to start, or show container errors.
allowed-tools: Bash
---

# Debugging Kubernetes Pods

Investigates pod lifecycle issues and container failures.

## Pod Failure Patterns

| Status | Likely Cause | First Check |
|--------|--------------|-------------|
| CrashLoopBackOff | App crash or misconfiguration | Logs + exit code |
| ImagePullBackOff | Wrong image, missing tag, auth failure | Image name + pull secret |
| OOMKilled | Memory limit exceeded | Resource limits vs actual usage |
| CreateContainerConfigError | Missing ConfigMap/Secret | Referenced configs exist |
| Init:Error | Init container failed | Init container logs |
| Pending | Scheduling issue | Load `debugging-k8s-scheduling` |

## Investigation Workflow

### Step 1: Get Pod Status

```bash
kubectl get pod <pod> -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
```

Look for:

- `Status` and `Reason` fields
- `Last State` for exit codes
- `Events` section at the bottom of the output

### Step 2: Check Container Logs

```bash
# Current container logs
kubectl logs <pod> -n <ns>

# Previous crashed container logs
kubectl logs <pod> -n <ns> --previous

# Specific container in a multi-container pod
kubectl logs <pod> -n <ns> -c <container>

# Init container logs
kubectl logs <pod> -n <ns> -c <init-container-name>
```

### Step 3: Exit Code Analysis

| Exit Code | Meaning |
|-----------|---------|
| 0 | Success (check why it exited) |
| 1 | Application error |
| 137 | SIGKILL (OOMKilled or external kill) |
| 139 | SIGSEGV (segmentation fault) |
| 143 | SIGTERM (graceful shutdown requested) |

Get exit code:

```bash
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```
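
Exit codes above 128 follow the shell convention `128 + signal number`, which is how 137 and 143 map to SIGKILL and SIGTERM. A plain-shell helper to decode them (no cluster required; `decode_exit_code` is an illustrative name, not a kubectl feature):

```shell
# Decode a container exit code: values > 128 encode a fatal
# signal as 128 + signal number; anything else is the app's own status.
decode_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "signal $((code - 128))"
  else
    echo "exit status $code"
  fi
}

decode_exit_code 137   # signal 9  (SIGKILL)
decode_exit_code 139   # signal 11 (SIGSEGV)
decode_exit_code 143   # signal 15 (SIGTERM)
decode_exit_code 1     # exit status 1
```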

## Specific Issues

### CrashLoopBackOff

```bash
# Check logs from the crashed container
kubectl logs <pod> -n <ns> --previous

# Check restart count and last state
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].restartCount}'
```

Common causes:

- Application startup failure
- Missing environment variables
- Missing dependencies (files, services)
- Liveness probe failing too quickly
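
When watching a CrashLoopBackOff it helps to know the restart cadence: the kubelet delays restarts with an exponential backoff that by default starts at 10s, doubles after each crash, caps at 5 minutes, and resets after 10 minutes of clean running. A plain-shell sketch of that schedule (no cluster required; `backoff_delay` is an illustrative helper):

```shell
# Kubelet restart backoff for CrashLoopBackOff (default behavior):
# 10s initial delay, doubled after each crash, capped at 300s (5m).
backoff_delay() {
  local n=$1 delay=10
  while [ "$n" -gt 1 ]; do
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    n=$((n - 1))
  done
  echo "$delay"
}

for restart in 1 2 3 4 5 6; do
  echo "restart $restart: wait $(backoff_delay "$restart")s"
done
```

This is why a pod can sit in CrashLoopBackOff for minutes between attempts even though the crash itself is instant.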

### ImagePullBackOff

```bash
# Check the image name in events
kubectl describe pod <pod> -n <ns> | grep -A5 "Events:"

# Check whether the pull secret exists
kubectl get secrets -n <ns>
```

Common causes:

- Typo in image name or tag
- Private registry without imagePullSecret
- Tag doesn't exist (e.g., latest removed)
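
For a private registry, the pod spec must reference a docker-registry pull secret. A minimal sketch, assuming a secret named `regcred` already exists in the namespace (the secret name, image, and registry below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcred        # must exist in the same namespace as the pod
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3   # full registry path, explicit tag
```

A matching secret can be created with `kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<password>`.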

### OOMKilled

```bash
# Check memory limits
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources}'

# Check if the container was OOMKilled
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

If the reason is OOMKilled, either increase the memory limit or investigate a memory leak in the application.
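
Raising the limit means setting it explicitly in the container spec. A minimal sketch of a `resources` block (the values are illustrative; size them from observed usage):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the container
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this gets the container OOMKilled
```

Keeping the limit close to the request gives more predictable eviction behavior; a limit far above actual usage only hides a leak for longer.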

### CreateContainerConfigError

```bash
# Check which ConfigMap/Secret is referenced
kubectl get pod <pod> -n <ns> -o yaml | grep -A10 "env:\|envFrom:\|volumes:"

# Verify the ConfigMap exists
kubectl get configmap -n <ns>

# Verify the Secret exists
kubectl get secrets -n <ns>
```
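
If a referenced ConfigMap or Secret is missing, the fix is either to create it or, where the value is genuinely optional, to mark the reference as such. A sketch of both reference styles (the names `app-config` and `app-secrets` are illustrative):

```yaml
containers:
  - name: app
    envFrom:
      - configMapRef:
          name: app-config       # missing ConfigMap -> CreateContainerConfigError
    env:
      - name: API_TOKEN
        valueFrom:
          secretKeyRef:
            name: app-secrets
            key: api-token
            optional: true       # a missing Secret/key no longer blocks startup
```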

### Init Container Failures

```bash
# List init containers
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.initContainers[*].name}'

# Check init container logs
kubectl logs <pod> -n <ns> -c <init-container-name>
```

## Quick Debug Commands

```bash
# Full pod YAML for deep inspection
kubectl get pod <pod> -n <ns> -o yaml

# Events for this pod only
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>

# Status of all containers in the pod
kubectl get pod <pod> -n <ns> -o jsonpath='{range .status.containerStatuses[*]}{.name}: {.state}{"\n"}{end}'
```

## Notes

- Load `retrieving-k8s-logs` for advanced log patterns
- Load `debugging-k8s-resources` if OOMKilled due to limits
- Load `debugging-k8s-scheduling` if stuck in Pending