AgentSkillsCN

k8s-troubleshooter

Diagnoses Kubernetes issues and provides remediation advice. Use when pods are failing, services are unreachable, or cluster health is degraded.

SKILL.md
--- frontmatter
name: k8s-troubleshooter
description: Diagnoses Kubernetes issues and provides remediation steps. Use when pods are failing, services are unreachable, or cluster health is degraded.

Kubernetes Troubleshooting Guide

Overview

This skill provides systematic approaches to diagnosing and resolving common Kubernetes issues including pod failures, networking problems, and resource constraints.

Diagnostic Workflow

Step 1: Identify the Problem

bash
# List pods in all namespaces that are neither Running nor Completed
kubectl get pods -A | grep -Ev 'Running|Completed'

# View the 20 most recent events across all namespaces, sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Check node health (kubectl top requires metrics-server in the cluster)
kubectl get nodes
kubectl top nodes
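
A status-only grep still hides pods that report Running while one of their containers is not ready. A minimal sketch of a stricter filter, assuming the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS); `unhealthy_pods` is a hypothetical helper name, not part of kubectl:

```bash
# Hypothetical helper: filter `kubectl get pods -A` output read from stdin.
# Keeps the header plus every pod that is neither Completed nor fully ready.
unhealthy_pods() {
  awk 'NR == 1 { print; next }        # always keep the header row
       $4 == "Completed" { next }     # finished Job pods are not failures
       {
         split($3, ready, "/")        # READY column has the form ready/total
         if ($4 != "Running" || ready[1] != ready[2]) print
       }'
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```

Reading from stdin keeps the filter usable on saved output as well as on a live cluster.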

Step 2: Gather Details

bash
# Describe problematic pod
kubectl describe pod <pod-name> -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Logs from the previous (terminated) container instance

# Check resource usage
kubectl top pod <pod-name> -n <namespace>
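
The commands in this step can be bundled so their output is saved for later review. A rough sketch, assuming `kubectl` is on the PATH; the function name `collect_pod_diagnostics` and the output file names are hypothetical:

```bash
# Hypothetical helper: save describe output, logs, and usage for one pod.
# Arguments: pod name, namespace, optional output directory.
collect_pod_diagnostics() {
  local pod="$1" namespace="$2" outdir="${3:-./diag-$1}"
  mkdir -p "$outdir" || return 1
  kubectl describe pod "$pod" -n "$namespace" > "$outdir/describe.txt"
  kubectl logs "$pod" -n "$namespace" > "$outdir/logs.txt" 2>&1
  # --previous fails when the container has never restarted; tolerate that
  kubectl logs "$pod" -n "$namespace" --previous > "$outdir/logs-previous.txt" 2>&1 || true
  # kubectl top fails without metrics-server; tolerate that as well
  kubectl top pod "$pod" -n "$namespace" > "$outdir/top.txt" 2>&1 || true
  echo "$outdir"
}

# Usage:
#   collect_pod_diagnostics <pod-name> <namespace>
```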

Pod Issue Resolution

CrashLoopBackOff

Symptoms: Pod repeatedly crashes and restarts

Diagnostic Steps:

  1. Check logs: kubectl logs <pod> -n <namespace> --previous
  2. Check state and resource limits: kubectl describe pod <pod> -n <namespace>
  3. Check events: kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>

Common Causes & Fixes:

| Cause | Indicator | Fix |
|---|---|---|
| OOMKilled | Exit code 137 | Increase memory limits |
| Missing config | ConfigMap/Secret errors | Verify CM/Secret exists |
| Failed probes | Liveness probe failed | Adjust probe thresholds |
| App crash | Application error in logs | Fix application code |
| Bad command | Error starting container | Verify command/args |
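
The exit code of the last terminated container often narrows down the cause before reading any logs. A small sketch mapping common codes to likely causes; `explain_exit_code` is a hypothetical helper, and the mapping follows the usual shell convention that codes above 128 mean "killed by signal (code minus 128)":

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the container is expected to keep running" ;;
    1)   echo "application error; read the logs from the previous container" ;;
    126) echo "command found but not executable; check command/args and file permissions" ;;
    127) echo "command not found; check command/args and the image contents" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV (128+11); the process crashed with a segmentation fault" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)   echo "no common mapping; inspect logs and events" ;;
  esac
}

# Usage (reads the first container's last exit code):
#   code=$(kubectl get pod <pod> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```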

ImagePullBackOff

Symptoms: Container image cannot be pulled

Diagnostic Steps:

  1. Verify image exists: docker pull <image>
  2. Check image name spelling
  3. Verify registry credentials

Fixes:

bash
# Check secret exists
kubectl get secret <registry-secret> -n <namespace>

# Create registry secret (it must live in the same namespace as the pod)
kubectl create secret docker-registry <registry-secret> \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Verify pod has imagePullSecrets
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'
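
If the pod spec lists no imagePullSecrets, one option is to attach the secret to the service account the pod runs as, so that newly created pods inherit it. A minimal sketch using `kubectl patch serviceaccount`; the wrapper name `attach_pull_secret` is hypothetical, and the `default` service account is an assumption:

```bash
# Hypothetical helper: attach an image pull secret to a service account.
# Arguments: secret name, namespace, optional service account (default: "default").
attach_pull_secret() {
  local secret="$1" namespace="$2" sa="${3:-default}"
  kubectl patch serviceaccount "$sa" -n "$namespace" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$secret\"}]}"
}

# Usage:
#   attach_pull_secret <registry-secret> <namespace>
```

Existing pods are not updated; they need to be recreated to pick up the secret.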

Pending State

Symptoms: Pod stuck in Pending, not scheduled

Diagnostic Steps:

  1. Check node resources: kubectl describe nodes | grep -A 5 "Allocated resources"
  2. Check PVC status: kubectl get pvc -n <namespace>
  3. Check taints/tolerations: kubectl describe node <node> | grep Taint

Common Causes:

| Cause | Check | Fix |
|---|---|---|
| Insufficient CPU/Memory | kubectl describe node | Add nodes or reduce requests |
| PVC not bound | kubectl get pvc | Check storage class |
| Node selector miss | Pod spec nodeSelector | Update selector or label nodes |
| Taint not tolerated | Node taints | Add toleration to pod |
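
The scheduler usually states why a pod is Pending in a FailedScheduling event. A rough sketch that maps such a message to the causes listed above; `classify_pending` is a hypothetical helper, and the matched keywords are assumptions because the exact wording varies between Kubernetes versions:

```bash
# Hypothetical helper: map a FailedScheduling message to a likely cause.
# The patterns are keyword guesses; scheduler wording differs across versions.
classify_pending() {
  case "$1" in
    *Insufficient*)          echo "insufficient resources: add nodes or reduce requests" ;;
    *PersistentVolumeClaim*) echo "PVC not bound: check the storage class" ;;
    *taint*)                 echo "taint not tolerated: add a toleration to the pod" ;;
    *selector*|*affinity*)   echo "node selector miss: update the selector or label nodes" ;;
    *)                       echo "unrecognized: read the full event message" ;;
  esac
}

# Usage (classifies the first FailedScheduling event returned):
#   msg=$(kubectl get events -n <namespace> \
#     --field-selector reason=FailedScheduling -o jsonpath='{.items[0].message}')
#   classify_pending "$msg"
```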

Service Issues

No Endpoints

Symptoms: Service has no endpoints, so traffic does not reach the pods

Diagnostic Steps:

bash
# Check endpoints
kubectl get endpoints <service> -n <namespace>

# Verify pod labels match selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'

Fix: Ensure every key/value pair in the service selector appears in the pod's labels. Pods must also be Ready before they are listed as endpoints.
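
The label comparison can be done mechanically. A minimal sketch, assuming both the selector and the pod labels are given as comma-separated `key=value` lists (the format printed by `--show-labels`); `selector_matches` is a hypothetical helper:

```bash
# Hypothetical helper: check that every selector pair appears in the pod labels.
# Arguments: selector ("k1=v1,k2=v2") and pod labels in the same format.
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;   # this selector pair is present on the pod
      *) echo "missing label: $pair"; return 1 ;;
    esac
  done
  echo "all selector labels present"
}

# Usage:
#   selector_matches 'app=web,tier=frontend' 'app=web,pod-template-hash=abc,tier=frontend'
```

Extra labels on the pod are fine; only the selector's pairs have to be present.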

DNS Resolution Failures

Symptoms: Pods cannot resolve service names

Diagnostic Steps:

bash
# Test DNS from pod (the container image must include nslookup)
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- cat /etc/resolv.conf

# Check CoreDNS (its pods keep the legacy k8s-app=kube-dns label)
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
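
A quick consistency check is whether the pod's resolv.conf points at the cluster DNS Service. A sketch under several assumptions: the Service is named `kube-dns` in `kube-system`, the pod uses the default ClusterFirst DNS policy, and no node-local DNS cache rewrites the nameserver. Both function names are hypothetical:

```bash
# Hypothetical helper: print nameserver addresses from resolv.conf text on stdin.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# Hypothetical helper: succeed if the given DNS Service IP is one of them.
uses_cluster_dns() {
  local dns_ip="$1"
  resolv_nameservers | grep -qxF "$dns_ip"
}

# Usage:
#   dns_ip=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
#   kubectl exec <pod> -n <namespace> -- cat /etc/resolv.conf | uses_cluster_dns "$dns_ip" \
#     && echo "pod uses cluster DNS" || echo "nameserver mismatch"
```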

Resource Issues

OOMKilled

Symptoms: Container killed for exceeding its memory limit

Diagnostic Steps:

bash
# Check container status
kubectl describe pod <pod> | grep -A 10 "Last State"

# Check memory usage
kubectl top pod <pod>

Fix:

yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Increase this
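
For a workload managed by a Deployment, the same change can be applied from the command line with `kubectl set resources` instead of editing the manifest. A minimal sketch; the wrapper name `raise_memory_limit` is hypothetical and the new limit value is up to the operator:

```bash
# Hypothetical helper: raise the memory limit on all containers of a Deployment.
# Arguments: deployment name, namespace, new memory limit (for example 1Gi).
raise_memory_limit() {
  local deployment="$1" namespace="$2" limit="$3"
  kubectl set resources deployment "$deployment" -n "$namespace" \
    --limits="memory=$limit"
}

# Usage:
#   raise_memory_limit <deployment> <namespace> 1Gi
```

Changing the pod template this way triggers a rollout, so the pods restart with the new limit.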

CPU Throttling

Symptoms: Slow application response, high latency

Diagnostic Steps:

bash
# Check CPU usage vs limits
kubectl top pod <pod>
kubectl describe pod <pod> | grep -A 5 "Limits"
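
`kubectl top` shows usage but not throttling itself. The kernel's per-cgroup `cpu.stat` file reports it directly through the `nr_periods` and `nr_throttled` counters. A rough sketch that turns those counters into a percentage, assuming the container can read `/sys/fs/cgroup/cpu.stat` (cgroup v2; on cgroup v1 the file is typically `/sys/fs/cgroup/cpu/cpu.stat`); `throttle_ratio` is a hypothetical helper:

```bash
# Hypothetical helper: read cpu.stat text on stdin and print the share of
# scheduling periods in which the cgroup was throttled.
throttle_ratio() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%.1f%%\n", 100 * throttled / periods
         else print "n/a"
       }'
}

# Usage:
#   kubectl exec <pod> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat | throttle_ratio
```

A high percentage alongside the symptoms above suggests the CPU limit is too low for the workload; the counters are cumulative since the container started.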

Quick Reference

| Symptom | First Check | Common Fix |
|---|---|---|
| Pod not starting | describe pod | Fix image/config |
| Service no endpoints | Pod labels | Match selector |
| OOMKilled | Memory limits | Increase limits |
| CrashLoopBackOff | Pod logs --previous | Fix app error |
| Pending pod | Node resources | Scale cluster |
| ImagePullBackOff | Image name/secret | Fix registry auth |

Useful Aliases

bash
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgpa='kubectl get pods -A'
alias kdp='kubectl describe pod'
alias kl='kubectl logs'
alias klf='kubectl logs -f'
alias kge='kubectl get events --sort-by=.lastTimestamp'

Escalation Checklist

Before escalating:

  • Checked pod logs (current and previous)
  • Described pod and reviewed events
  • Verified resource availability
  • Checked network connectivity
  • Reviewed recent deployments/changes
  • Documented timeline of issue