
kubernetes-troubleshooting

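Comprehensively analyze and troubleshoot the health and failures of Kubernetes and OpenShift clusters. Use this skill when you need to: (1) proactively assess cluster health and security; (2) analyze Pod/container logs for errors or anomalies; (3) interpret cluster events (kubectl get events); (4) debug Pod failures such as CrashLoopBackOff, ImagePullBackOff, and OOMKilled; (5) diagnose networking issues: DNS, Service connectivity, Ingress/Route failures; (6) investigate storage issues: PVCs stuck in Pending, mount failures; (7) analyze node problems: NotReady, resource pressure, tainted nodes; (8) troubleshoot OCP-specific issues: SCCs, Routes, Operators, Builds; (9) analyze performance and optimize resources; (10) assess security vulnerabilities and validate RBAC.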

SKILL.md
---
name: kubernetes-troubleshooting
description: |
  Comprehensive Kubernetes and OpenShift cluster health analysis and troubleshooting. Use this skill when:
  (1) Proactive cluster health assessment and security analysis
  (2) Analyzing pod/container logs for errors or issues
  (3) Interpreting cluster events (kubectl get events)
  (4) Debugging pod failures: CrashLoopBackOff, ImagePullBackOff, OOMKilled
  (5) Diagnosing networking issues: DNS, Service connectivity, Ingress/Route problems
  (6) Investigating storage issues: PVC pending, mount failures
  (7) Analyzing node problems: NotReady, resource pressure, taints
  (8) Troubleshooting OCP-specific issues: SCCs, Routes, Operators, Builds
  (9) Performance analysis and resource optimization
  (10) Security vulnerability assessment and RBAC validation
metadata:
  author: cluster-skills
  version: "1.0.0"
---

Kubernetes / OpenShift Troubleshooting Guide

Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.

Current Versions & Tools (January 2026)

| Platform | Version | Key Changes |
| --- | --- | --- |
| Kubernetes | 1.31.x | Sidecar containers GA, Pod lifecycle improvements |
| OpenShift | 4.17.x | OVN-Kubernetes default, enhanced web terminal |
| EKS | 1.31 | Pod Identity, Auto Mode, Karpenter 1.x |
| AKS | 1.31 | Cilium CNI, Workload Identity GA |
| GKE | 1.31 | Autopilot improvements, Gateway API GA |

Troubleshooting Tools

| Tool | Install | Purpose |
| --- | --- | --- |
| k9s | brew install k9s | Terminal UI |
| stern | brew install stern | Multi-pod log tailing |
| kubectx/kubens | brew install kubectx | Context switching |
| kubectl-node-shell | kubectl krew install node-shell | Node access |

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

  • OpenShift/ARO clusters: Replace kubectl with oc
  • Standard Kubernetes (AKS, EKS, GKE): Use kubectl as shown

Cluster Health Scoring (Popeye-Style)

Health scores range from 0 to 100. Each issue found deducts points from the score according to its severity:

  • BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
  • WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
  • INFO (Informational): -5 points - Non-critical issues, optimization opportunities
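
The deductions above can be sketched as a small shell function. This is a minimal illustration, not Popeye's actual scoring algorithm: the function name `health_score` is hypothetical, and it assumes the score starts at 100 and is clamped at 0 so that it stays inside the stated range.

```shell
# health_score BOOM_COUNT WARN_COUNT INFO_COUNT
# Hypothetical helper. The body runs in a subshell "( ... )" so its
# variables do not leak into the caller's environment.
health_score() (
  boom=${1:-0}
  warn=${2:-0}
  info=${3:-0}
  # Assumed starting score of 100, minus the per-severity deductions.
  score=$(( 100 - 50 * boom - 20 * warn - 5 * info ))
  # Clamp at 0 so the result stays within 0-100.
  if [ "$score" -lt 0 ]; then
    score=0
  fi
  echo "$score"
)
```

For example, `health_score 1 1 1` prints `25` (100 - 50 - 20 - 5), and any cluster with two or more BOOM findings scores 0.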

Quick Cluster Health Assessment

bash
#!/bin/bash
# cluster-health-check.sh
echo "=== CLUSTER HEALTH ASSESSMENT ==="

# 1. Node Health (Critical)
echo "### NODE HEALTH ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
  echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

# 2. Pod Issues (Critical)
echo -e "\n### POD HEALTH ###"
# Phase-based filter: catches Pending/Failed/Unknown pods, but not
# CrashLoopBackOff pods, whose phase usually remains Running
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ "$POD_ISSUES" -gt 0 ]; then
    echo "WARN: $POD_ISSUES pods not in Running or Succeeded phase"
    kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
    echo "✓ All pods in Running or Succeeded phase"
fi

# 3. Security (Critical)
echo -e "\n### SECURITY ASSESSMENT ###"
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $PRIVILEGED -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

# 4. Resource Configuration (Warning)
echo -e "\n### RESOURCE CONFIGURATION ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $NO_LIMITS -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

# 5. Storage (Warning)
echo -e "\n### STORAGE HEALTH ###"
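# PVCs do not support a status.phase field selector, so filter the JSON with jq instead
PENDING_PVC=$(kubectl get pvc -A -o json | jq '[.items[] | select(.status.phase != "Bound")] | length')
[ "${PENDING_PVC:-0}" -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"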

# OpenShift: Cluster Operators
if command -v oc &> /dev/null; then
    echo -e "\n### OPENSHIFT OPERATORS ###"
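    # Count operators that are unavailable (Available != True) or degraded (Degraded == True).
    # Matching the text columns with "False.*False" would also count healthy rows (True False False).
    DEGRADED=$(oc get clusteroperators -o json | jq '[.items[] | select(any(.status.conditions[]?; (.type == "Available" and .status != "True") or (.type == "Degraded" and .status == "True")))] | length')
    [ "${DEGRADED:-0}" -gt 0 ] && echo "BOOM: $DEGRADED operators unavailable or degraded!" || echo "✓ All operators healthy"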
fi
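
The pod check in the script filters on pod phase, so a pod stuck in CrashLoopBackOff (phase still Running) can slip through. One possible complement, sketched here, counts unhealthy pods from the STATUS and READY columns instead. The function name `count_unhealthy_pods` is hypothetical, and the sketch assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# count_unhealthy_pods: reads `kubectl get pods -A --no-headers` output on stdin.
# Hypothetical helper; assumes column 3 is READY (x/y) and column 4 is STATUS.
count_unhealthy_pods() {
  awk '
    {
      split($3, ready, "/")
      # Any status other than Running/Completed (e.g. CrashLoopBackOff, Pending)
      bad_status = ($4 != "Running" && $4 != "Completed")
      # Running pods whose ready count is below the container count
      not_ready  = ($4 == "Running" && ready[1] != ready[2])
      if (bad_status || not_ready) n++
    }
    END { print n + 0 }'
}
```

Typical usage would be `kubectl get pods -A --no-headers | count_unhealthy_pods`, with the result fed into the same WARN threshold as the script above.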

Quick Diagnostic Commands

bash
# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide

# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

# Multi-pod log streaming
stern -n ${NAMESPACE} ${POD_PREFIX}
stern -A -l app=${APP_NAME} --since 1h

# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}

# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes

Pod Status Interpretation

Pod Phase States

| Phase | Meaning | Action |
| --- | --- | --- |
| Pending | Not scheduled or pulling images | Check events, node resources, PVC status |
| Running | At least one container running | Check container statuses if issues |
| Succeeded | All containers completed successfully | Normal for Jobs |
| Failed | All containers terminated, at least one failed | Check logs, exit codes |
| Unknown | Cannot determine state | Node communication issue |

Container Waiting States

| Reason | Cause | Resolution |
| --- | --- | --- |
| ContainerCreating | Setting up container | Check events, volume mounts |
| ImagePullBackOff | Cannot pull image | Verify image name, registry access, credentials |
| ErrImagePull | Image pull failed | Check image exists, network, ImagePullSecrets |
| CreateContainerConfigError | Config error | Check ConfigMaps, Secrets exist |
| CrashLoopBackOff | Container repeatedly crashing | Check logs --previous, fix application |

Container Exit Codes

| Exit Code | Signal | Cause | Resolution |
| --- | --- | --- | --- |
| 0 | - | Normal exit | Expected for Jobs |
| 1 | - | Application error | Check logs for stack trace |
| 126 | - | Command not executable | Fix permissions |
| 127 | - | Command not found | Fix command path |
| 137 | SIGKILL | OOM or forced termination | Increase memory limit |
| 143 | SIGTERM | Graceful shutdown | Normal during updates |
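
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (9) and 143 to SIGTERM (15). The sketch below decodes a code along those lines. The function name `explain_exit_code` is hypothetical, and the signal names listed are the common Linux numbers for a handful of signals only.

```shell
# explain_exit_code CODE
# Hypothetical helper; runs in a subshell so variables stay local.
explain_exit_code() (
  code=$1
  case "$code" in
    0)   echo "normal exit"; exit 0 ;;
    1)   echo "application error"; exit 0 ;;
    126) echo "command not executable"; exit 0 ;;
    127) echo "command not found"; exit 0 ;;
  esac
  if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    # Convention: exit code = 128 + signal number
    sig=$(( code - 128 ))
    case "$sig" in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="unlisted signal" ;;
    esac
    echo "terminated by $name (signal $sig)"
  else
    echo "application-defined exit code $code"
  fi
)
```

The exit code of the last crashed container can be read with `kubectl get pod ${POD_NAME} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` and passed to the function.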

Event Analysis

Critical Events to Monitor

Scheduling Events

| Event | Meaning | Resolution |
| --- | --- | --- |
| FailedScheduling | Cannot place pod | Check node resources, taints, affinity |
| Unschedulable | No suitable node | Add nodes, adjust requirements |

FailedScheduling Messages:

code
"Insufficient cpu"           → Reduce requests or add capacity
"Insufficient memory"        → Reduce requests or add capacity
"node(s) had taint"          → Add toleration or remove taint
"node(s) didn't match selector" → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name

Image Events

| Event | Meaning | Resolution |
| --- | --- | --- |
| BackOff | Repeated pull failures | Check image name, registry, auth |
| ErrImageNeverPull | Image not local | Change imagePullPolicy or pre-pull |

ImagePullBackOff Diagnosis:

bash
# Check image name
kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.containers[*].image}'

# Verify ImagePullSecrets
kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}

Volume Events

| Event | Meaning | Resolution |
| --- | --- | --- |
| FailedMount | Cannot mount volume | Check PVC, storage class |
| FailedAttachVolume | Cannot attach | Check cloud provider, volume exists |

PVC Pending Diagnosis:

bash
kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
kubectl get storageclass
kubectl get pv

Log Analysis Patterns

Common Error Patterns

bash
# Search for errors
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"

# Patterns to look for in the output, and what they usually mean:
# Java OOM:          "java.lang.OutOfMemoryError" → Increase memory, tune JVM heap
# Connection refused: "ECONNREFUSED", "Connection refused" → Dependency not available
# DNS failure:       "ENOTFOUND", "getaddrinfo" → DNS resolution failed, check service name
# Permission denied: "Permission denied" → Check securityContext, runAsUser, fsGroup
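
To get a quick sense of which of these patterns dominates a log, the matches can be tallied per category. The following is a rough sketch: the function name `classify_log_errors` and the output format are hypothetical, and the patterns are only the ones listed above.

```shell
# classify_log_errors: reads log lines on stdin and prints one summary line.
# Hypothetical helper; categories mirror the patterns listed above.
classify_log_errors() {
  awk '
    /java\.lang\.OutOfMemoryError/     { oom++ }
    /ECONNREFUSED|Connection refused/  { conn++ }
    /ENOTFOUND|getaddrinfo/            { dns++ }
    /Permission denied/                { perm++ }
    END {
      # Unset counters print as 0 with %d
      printf "oom=%d conn_refused=%d dns=%d permission=%d\n", oom, conn, dns, perm
    }'
}
```

Usage would look like `kubectl logs ${POD} -n ${NS} | classify_log_errors`. A line that matches several patterns is counted once in each matching category.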

Memory Issues (OOMKilled)

code
Last State: Terminated
Reason: OOMKilled
Exit Code: 137

→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. For JVM: Set -Xmx < container limit (leave ~25% headroom)
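
The ~25% headroom rule can be turned into a concrete -Xmx value. The sketch below assumes the container limit is an integer quantity in Mi or Gi and returns 75% of it in MiB. The function name `jvm_heap_mb` is hypothetical, and fractional quantities such as 1.5Gi are not handled.

```shell
# jvm_heap_mb LIMIT   (LIMIT is an integer quantity such as 512Mi or 2Gi)
# Hypothetical helper; runs in a subshell so variables stay local.
jvm_heap_mb() (
  limit=$1
  case "$limit" in
    *Gi) mib=$(( ${limit%Gi} * 1024 )) ;;
    *Mi) mib=${limit%Mi} ;;
    *)   echo "unsupported unit: $limit" >&2; exit 1 ;;
  esac
  # Keep ~25% of the limit for non-heap memory (metaspace, threads, buffers)
  echo $(( mib * 75 / 100 ))
)
```

For a 2Gi limit this yields 1536, which could be used as `java -Xmx"$(jvm_heap_mb 2Gi)"m`. On JVMs that support it, `-XX:MaxRAMPercentage=75.0` is an alternative that derives the heap size from the container limit at startup.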

Node Troubleshooting

Node Conditions

| Condition | Status | Meaning |
| --- | --- | --- |
| Ready | True | Node healthy |
| Ready | False | Kubelet not healthy |
| Ready | Unknown | No heartbeat |
| MemoryPressure | True | Low memory |
| DiskPressure | True | Low disk space |
| PIDPressure | True | Too many processes |
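
The table reduces to a simple rule: Ready should be True, and every pressure condition should not be True. A minimal sketch of that rule follows. The function name `flag_node_conditions` and the "TYPE STATUS" input format are assumptions made for illustration, and the sketch treats any non-Ready condition reported as True as a problem.

```shell
# flag_node_conditions: reads "TYPE STATUS" pairs on stdin, one per line.
# Hypothetical helper; prints each problematic condition, or OK if none.
flag_node_conditions() {
  awk '
    # Ready must be True; False or Unknown indicates a problem
    $1 == "Ready" && $2 != "True" { print "BAD " $1 "=" $2; bad++ }
    # Pressure-style conditions are a problem when they are True
    $1 != "Ready" && $2 == "True" { print "BAD " $1 "=" $2; bad++ }
    END { if (!bad) print "OK" }'
}
```

The input can be produced with `kubectl get node ${NODE_NAME} -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' | flag_node_conditions`.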

Node NotReady Diagnosis

bash
kubectl describe node ${NODE_NAME}

# On the node (SSH or debug)
systemctl status kubelet
journalctl -u kubelet -f

# Check resources
df -h
free -m
top

Networking Troubleshooting

DNS Issues

bash
# Test DNS resolution
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
  nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Service Connectivity

bash
# Verify service and endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}

# Test from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}

Ingress/Route Issues

bash
# Check Ingress
kubectl describe ingress ${INGRESS} -n ${NS}

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get pods -n openshift-ingress

OpenShift-Specific Troubleshooting

Cluster Operators

bash
# Check overall health
oc get clusteroperators

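# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
# Namespace and pod labels vary by operator; if this selector matches nothing,
# find the operator's namespace and pods first (e.g. oc get pods -A | grep ${OPERATOR})
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator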

Security Context Constraints (SCC)

bash
# List SCCs
oc get scc

# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc

# Common error fix
# "unable to validate against any security context constraint"
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}

Build Failures

bash
# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
oc logs build/${BUILD} -n ${NS}

Cloud Provider Troubleshooting

EKS (AWS)

bash
aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
eksctl get nodegroup --cluster ${CLUSTER}

AKS (Azure)

bash
az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}

GKE (Google Cloud)

bash
gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10

Diagnostic Decision Tree

Pod Not Starting

code
Pod scheduled to a node?
├── No (Pending with FailedScheduling events) → Check Scheduling
│   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
│   ├── "node(s) had taint" → Add toleration
│   ├── "PVC not found" → Create PVC
│   └── No events → Check API server
│
└── Yes (Pending or Running) → Check Container Status
    ├── ImagePullBackOff → Fix image name/auth
    ├── CrashLoopBackOff → Check logs --previous
    ├── CreateContainerConfigError → Fix ConfigMap/Secret
    └── Running but not ready → Check readiness probe
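
The tree above can be sketched as a lookup function. This is one possible encoding, not a definitive triage tool: the function name `triage_pod` is hypothetical, and it assumes an unscheduled pod is recognized by an empty `spec.nodeName`, since pods that are still pulling images also report the Pending phase.

```shell
# triage_pod NODE_NAME [WAITING_REASON]
# Hypothetical helper; NODE_NAME is the pod's spec.nodeName (empty if unscheduled),
# WAITING_REASON is the container's state.waiting.reason (empty if not waiting).
triage_pod() (
  node=$1
  reason=${2:-}
  if [ -z "$node" ]; then
    echo "check scheduling events"
    exit 0
  fi
  case "$reason" in
    ImagePullBackOff|ErrImagePull) echo "fix image name/auth" ;;
    CrashLoopBackOff)              echo "check logs --previous" ;;
    CreateContainerConfigError)    echo "fix ConfigMap/Secret" ;;
    "")                            echo "check readiness probe" ;;
    *)                             echo "describe pod for: $reason" ;;
  esac
)
```

Both inputs are available through jsonpath: `{.spec.nodeName}` for the node and `{.status.containerStatuses[0].state.waiting.reason}` for the waiting reason of the first container.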

Application Not Responding

code
Can reach Service?
├── No → Check Service
│   ├── No endpoints → Fix selector labels
│   ├── Wrong port → Fix targetPort
│   └── NetworkPolicy blocking → Adjust policy
│
└── Yes → Check Pod
    ├── Probe failing → Fix probe or application
    ├── High latency → Check resources, dependencies
    └── Errors in logs → Fix application

Performance Analysis

Resource Optimization

bash
# Compare usage vs requests
kubectl top pods -n ${NS}

kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory

# Find pods without limits
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.containers[].resources.limits == null) |
   "\(.metadata.namespace)/\(.metadata.name)"'

Right-Sizing Recommendations

| Symptom | Indication | Action |
| --- | --- | --- |
| CPU throttling | CPU limit too low | Increase CPU limit |
| OOMKilled | Memory limit too low | Increase memory limit |
| Low utilization | Over-provisioned | Reduce requests |
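
Judging "low utilization" means comparing observed usage against the request. The sketch below does this for CPU values as printed by `kubectl top` and by the requests column above. The function names `to_millicores` and `cpu_utilization_pct` are hypothetical, only integer cores and millicore values (such as 2 or 250m) are handled, and no threshold for "low" is implied.

```shell
# to_millicores VALUE   (VALUE like 250m or 2)
# Hypothetical helper; fractional cores such as 0.5 are not handled.
to_millicores() (
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
)

# cpu_utilization_pct USAGE REQUEST
# Prints usage as an integer percentage of the request, or n/a if no request is set.
cpu_utilization_pct() (
  used=$(to_millicores "$1")
  req=$(to_millicores "$2")
  if [ "$req" -eq 0 ]; then
    echo "n/a"
    exit 0
  fi
  echo $(( used * 100 / req ))
)
```

For example, a pod using 50m against a 500m request gives `cpu_utilization_pct 50m 500m`, which prints `10`.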