AgentSkillsCN

k8s-troubleshoot

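Troubleshoot Kubernetes issues using kubectl, logs, events, and resource inspection. Applies to scenarios such as "pod not starting", "crash loop", "OOMKilled", "debug k8s", and "why is pod failing".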

SKILL.md
--- frontmatter
name: k8s-troubleshoot
description: "Debug Kubernetes issues using kubectl, logs, events, and resource inspection. Use on 'pod not starting', 'crash loop', 'OOMKilled', 'debug k8s', 'why is pod failing'."

Kubernetes Troubleshooting Skill

Systematic debugging for Kubernetes issues.

When to Use

  • Pods stuck in Pending/CrashLoopBackOff
  • OOMKilled containers
  • Service connectivity issues
  • Deployment rollout failures
  • PVC/storage problems

Diagnostic Flow

1. Get Status

bash
kubectl get pods -o wide
kubectl get events --sort-by='.lastTimestamp'
kubectl describe pod <pod>
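
In a busy namespace the status listing above can be long, so a small filter helps. The following is a minimal sketch, not part of the original skill: the function name `unhealthy_pods` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...), so pass `-n <ns>` rather than `-A`, which would shift the columns.

```bash
# List pods that look unhealthy: STATUS is neither Running nor Completed,
# or the pod is Running but not all containers are ready.
# Assumes default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods --no-headers "$@" | awk '
    {
      split($2, ready, "/")   # READY looks like "1/2"
      bad_status = ($3 != "Running" && $3 != "Completed")
      not_ready  = ($3 == "Running" && ready[1] != ready[2])
      if (bad_status || not_ready) print $1, $2, $3
    }'
}
```

Usage: `unhealthy_pods -n <ns>`, then run `kubectl describe pod` on whatever it prints.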

2. Check Logs

bash
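kubectl logs <pod> --previous  # previous (crashed) instance of the container
kubectl logs <pod> -c <container>  # specific container in a multi-container pod
stern <pod-prefix>  # tail multiple pods (stern is a separate tool, not part of kubectl)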
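
`kubectl logs --previous` exits with an error when the container has never restarted, which is easy to trip over in scripts. A possible wrapper, sketched here under the assumption that falling back to the current instance is acceptable; the name `crash_logs` is hypothetical.

```bash
# Print logs from the previous (crashed) container instance when one exists,
# otherwise fall back to the current instance.
# Extra arguments (for example -c <container> or -n <ns>) are passed through.
crash_logs() {
  local pod="$1"; shift
  kubectl logs "$pod" --previous "$@" 2>/dev/null || kubectl logs "$pod" "$@"
}
```

Usage: `crash_logs <pod> -c <container>`.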

3. Resource Issues

bash
kubectl top pods  # requires metrics-server in the cluster
kubectl describe node <node> | grep -A5 "Allocated resources"  # requests/limits already committed on the node

Common Issues

Pending Pod

| Cause | Check | Fix |
|---|---|---|
| Insufficient resources | kubectl describe pod -> Events | Lower resource requests or add nodes |
| No matching node | Check nodeSelector/affinity | Fix selectors |
| PVC not bound | kubectl get pvc | Check storage class |
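
For a Pending pod, the scheduler's own events usually name the cause (for example an "Insufficient cpu" message). A minimal sketch that narrows the event list to scheduling failures for one pod; the function name `pending_events` is hypothetical, while `involvedObject.name` and `reason` are standard event field selectors.

```bash
# Show only FailedScheduling events for a single pod, oldest first.
# Extra arguments (for example -n <ns>) are passed through to kubectl.
pending_events() {
  local pod="$1"; shift
  kubectl get events "$@" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}
```

Usage: `pending_events <pod> -n <ns>`.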

CrashLoopBackOff

| Cause | Check | Fix |
|---|---|---|
| App error | kubectl logs --previous | Fix app code |
| Missing config | Check ConfigMap/Secret mounts | Create missing resources |
| Bad command | Check command/args in spec | Fix entrypoint |
| OOMKilled | kubectl describe pod -> State | Increase memory limit |
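
The last termination reason and exit code can also be read directly from the pod status instead of scanning `describe` output. This is a sketch with hypothetical function names (`last_termination`, `describe_exit_code`). It relies on the common convention that an exit code above 128 means the process was terminated by signal (code - 128), so 137 corresponds to SIGKILL, which is what an OOM kill produces.

```bash
# Print "<container> <reason> <exit code>" for each container's last termination.
last_termination() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Give a rough reading of a container exit code.
describe_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "application exited with error code $code"
  fi
}
```

Usage: run `last_termination <pod>`, then pass the exit code to `describe_exit_code`.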

ImagePullBackOff

| Cause | Check | Fix |
|---|---|---|
| Wrong image | Check image name/tag | Fix image reference |
| Private registry | Check imagePullSecrets | Add registry credentials |
| Rate limit | Check events | Use registry mirror |

Service Not Reachable

bash
# Check endpoints exist
kubectl get endpoints <service>

# Check selector matches pods
kubectl get pods -l <selector>

# Test from inside cluster
kubectl run debug --rm -it --restart=Never --image=alpine -- wget -qO- <service>:<port>
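
An empty endpoints list usually means the Service selector matches no ready pods. One way to check this without copying labels by hand is to read the selector from the Service and reuse it. This is a sketch; the function name `service_pods` is hypothetical, and it assumes the Service uses a plain label selector.

```bash
# List the pods a Service's selector actually matches.
service_pods() {
  local svc="$1" selector
  # Render .spec.selector as "key=value,key=value," using kubectl's go-template output.
  selector=$(kubectl get service "$svc" -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  selector=${selector%,}   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -l "$selector" -o wide
}
```

If this prints no pods, compare the Service selector with the pod template labels in the Deployment.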

Quick Commands

bash
# Pods not in the Running phase (includes Completed pods; CrashLoopBackOff pods still report Running)
kubectl get pods --field-selector=status.phase!=Running

# Events for namespace
kubectl get events --sort-by='.lastTimestamp' -n <ns>

# Resource usage
kubectl top pods --sort-by=memory

# Shell into pod
kubectl exec -it <pod> -- /bin/sh

# Port forward for debugging
kubectl port-forward <pod> 8080:80

# Restart deployment
kubectl rollout restart deployment/<name>

# Check rollout status
kubectl rollout status deployment/<name>

Log Patterns to Search

bash
# Errors
kubectl logs <pod> | grep -i error

# Python tracebacks
kubectl logs <pod> | grep -A 20 "Traceback"

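# OOM messages written by the app itself (a kernel OOM kill is reported as OOMKilled in the pod status, not in container logs)
kubectl logs <pod> --previous | grep -i "out of memory\|oom\|killed"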
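
When comparing several pods it can be quicker to count matches than to read them. A minimal sketch that tallies the patterns above from log text on stdin; the function name `log_summary` and the output format are hypothetical.

```bash
# Count case-insensitive occurrences of common failure patterns in log text read from stdin.
log_summary() {
  awk '
    { line = tolower($0) }
    line ~ /error/         { errors++ }
    line ~ /traceback/     { tracebacks++ }
    line ~ /out of memory/ { oom++ }
    END { printf "errors=%d tracebacks=%d oom=%d\n", errors, tracebacks, oom }'
}
```

Usage: `kubectl logs <pod> --previous | log_summary`.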