
k8s-incident

Respond quickly to incidents in Kubernetes clusters with runbooks and diagnostic tools. Use it for outages, pod failures, node issues, network problems, and emergency response.

SKILL.md
---
name: k8s-incident
description: Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
---

Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

Incident Triage

Quick Health Check

code
1. get_nodes()                    # Node status
2. get_pods(namespace="kube-system")  # Control plane
3. get_events(namespace)          # Recent events

Severity Assessment

| Indicator | Severity | Action |
|---|---|---|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
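
The severity table can be expressed as a small decision function. This is only a sketch: `ClusterSnapshot` and its field names are hypothetical and not part of the skill's tools, and the fallback for findings the table does not list is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of triage findings (names are illustrative)."""
    not_ready_nodes: int = 0
    failing_kube_system_pods: int = 0
    crashloop_pods: int = 0
    high_latency: bool = False


def assess_severity(snap: ClusterSnapshot) -> tuple[str, str]:
    """Return (severity, action), checking the most severe indicators first."""
    if snap.not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if snap.failing_kube_system_pods > 0:
        return ("Critical", "Control plane issue")
    # The table only lists the single-pod case; treating several
    # crash-looping pods the same way is an assumption of this sketch.
    if snap.crashloop_pods > 0:
        return ("Medium", "Debug pod")
    if snap.high_latency:
        return ("Medium", "Check resources")
    # Not covered by the table (for example, exactly one NotReady node).
    return ("Unclassified", "Assess manually")
```

Checking critical indicators first means a cluster with both a control plane failure and a crash-looping pod is reported as Critical rather than Medium.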

Runbook: Pod Failures

CrashLoopBackOff

code
1. get_pod_logs(name, namespace, previous=True)
2. describe_pod(name, namespace)
3. get_events(namespace, field_selector="involvedObject.name=<pod>")
4. get_pod_metrics(name, namespace)

Common Causes:

  • OOMKilled → Increase memory limits
  • Exit code 1 → Application error in logs
  • Exit code 137 → Killed with SIGKILL (128 + 9), often by the OOM killer
  • Exit code 143 → Terminated with SIGTERM (128 + 15), usually a requested shutdown
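
Exit codes above 128 conventionally encode a signal number as 128 + N, which is how 137 and 143 map to SIGKILL and SIGTERM. A minimal sketch of that decoding, where the helper name `explain_exit_code` is hypothetical and signal names are resolved by the local platform (Linux assumed):

```python
import signal


def explain_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application error; check the logs"
```

The exit code appears in the container's last terminated state in the pod description, so it can be passed straight to a helper like this.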

ImagePullBackOff

code
1. describe_pod(name, namespace)  # Check image name
2. get_secrets(namespace)         # Check imagePullSecrets

Common Causes:

  • Wrong image name/tag
  • Private registry, no imagePullSecret
  • Registry rate limiting
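
To check the "no imagePullSecret" cause, one option is to compare the secrets a pod references with the secrets that exist in its namespace. The sketch below assumes the pod is available as a dict shaped like `kubectl get pod -o json` output and that the secret names have already been collected; `missing_pull_secrets` is a hypothetical helper, not one of the skill's tools.

```python
def missing_pull_secrets(pod: dict, secret_names: set[str]) -> list[str]:
    """Return imagePullSecrets referenced by the pod that do not exist."""
    refs = pod.get("spec", {}).get("imagePullSecrets") or []
    wanted = [ref.get("name", "") for ref in refs]
    return [name for name in wanted if name not in secret_names]


# Example: the pod references "regcred", but only "other" exists.
pod = {"spec": {"imagePullSecrets": [{"name": "regcred"}]}}
print(missing_pull_secrets(pod, {"other"}))
```

An empty result does not rule out an authentication problem: the secret may exist but hold wrong credentials, or the pull secret may come from the pod's service account instead.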

Pending Pod

code
1. describe_pod(name, namespace)
2. get_nodes()
3. get_events(namespace)

Common Causes:

  • Insufficient resources
  • Node selector mismatch
  • Taints without tolerations
  • PVC not bound
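
The scheduler usually states the reason in the `FailedScheduling` event message, so the causes above can often be matched by keyword. This is a rough sketch: the keyword list is an assumption, scheduler wording differs between Kubernetes versions, and `classify_pending` is a hypothetical helper.

```python
# (cause, keywords) pairs; keywords are matched case-insensitively.
CAUSE_KEYWORDS = [
    ("Insufficient resources", ("insufficient",)),
    ("Node selector mismatch", ("node selector", "affinity/selector")),
    ("Taints without tolerations", ("taint",)),
    ("PVC not bound", ("persistentvolumeclaim",)),
]


def classify_pending(message: str) -> list[str]:
    """Map a FailedScheduling message to the likely causes listed above."""
    text = message.lower()
    return [
        cause
        for cause, keywords in CAUSE_KEYWORDS
        if any(keyword in text for keyword in keywords)
    ]
```

A single message can name several reasons across nodes, which is why the function returns a list rather than one cause.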

Runbook: Node Issues

Node NotReady

code
1. describe_node(name)
2. get_events(namespace="", field_selector="involvedObject.name=<node>")
3. node_logs_tool(name, "kubelet")

Common Causes:

  • kubelet not running
  • Network partition
  • Disk pressure
  • Memory pressure
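
Most of these causes show up in the node's `status.conditions`. A minimal sketch of reading them, assuming the node is available as a dict shaped like `kubectl get node -o json` output (`unhealthy_conditions` is a hypothetical helper):

```python
def unhealthy_conditions(node: dict) -> list[str]:
    """List node conditions that indicate a problem.

    Ready is healthy when "True"; the pressure and NetworkUnavailable
    conditions are healthy when "False".
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            if status != "True":
                problems.append(f"Ready={status}")
        elif status == "True":
            problems.append(ctype)
    return problems
```

`Ready=Unknown` typically means the kubelet has stopped reporting, which points at the kubelet itself or a network partition rather than resource pressure.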

Node DiskPressure

code
1. describe_node(name)
2. get_pods(field_selector="spec.nodeName=<node>")
3. # Look for containers and log files that consume the most disk

Actions:

  • Clean up container logs
  • Evict low-priority pods
  • Expand node disk

Runbook: Network Issues

Service Not Accessible

code
1. get_services(namespace)
2. get_endpoints(namespace)        # Check backends
3. get_pods(namespace, label_selector="<service-selector>")
4. get_network_policies(namespace)

Common Causes:

  • No matching pods (empty endpoints)
  • Pods not ready
  • NetworkPolicy blocking traffic
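
The first two causes can be told apart by matching the Service selector against pod labels and then checking readiness. A sketch under the assumption that pods are dicts shaped like `kubectl get pods -o json` items; `diagnose_service` is a hypothetical helper and only handles simple equality selectors, which is what Services use.

```python
def diagnose_service(selector: dict, pods: list[dict]) -> str:
    """Return the most likely reason a Service has no usable backends."""
    if not selector:
        return "Service has no selector; endpoints are managed manually"

    def matches(pod: dict) -> bool:
        labels = pod.get("metadata", {}).get("labels") or {}
        return all(labels.get(key) == value for key, value in selector.items())

    def is_ready(pod: dict) -> bool:
        conditions = pod.get("status", {}).get("conditions", [])
        return any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )

    matching = [pod for pod in pods if matches(pod)]
    if not matching:
        return "No matching pods (empty endpoints)"
    if not any(is_ready(pod) for pod in matching):
        return "Pods not ready"
    return "Backends look healthy; check NetworkPolicy next"
```

If both checks pass, the remaining cause from the list is a NetworkPolicy blocking traffic, which has to be inspected separately.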

DNS Resolution Failures

code
1. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
2. get_pod_logs("coredns-xxx", "kube-system")

With Cilium

code
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

code
istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

code
1. describe_pvc(name, namespace)
2. get_storage_classes()
3. get_events(namespace)

Common Causes:

  • No matching PV
  • StorageClass not provisioning
  • Quota exceeded
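
A quick way to narrow down the first two causes is to check whether the StorageClass requested by the PVC exists at all. The sketch assumes dicts shaped like `kubectl get pvc -o json` and `kubectl get storageclass -o json` items; `check_storage_class` is a hypothetical helper.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def check_storage_class(pvc: dict, storage_classes: list[dict]) -> str:
    """Explain how the PVC's storageClassName relates to existing classes."""
    names = {sc["metadata"]["name"] for sc in storage_classes}
    requested = pvc.get("spec", {}).get("storageClassName")

    if requested is None:
        # No class requested: the cluster default is used, if one is marked.
        has_default = any(
            (sc["metadata"].get("annotations") or {}).get(DEFAULT_ANNOTATION) == "true"
            for sc in storage_classes
        )
        if has_default:
            return "uses the default StorageClass"
        return "no storageClassName set and no default StorageClass"
    if requested == "":
        # An empty string disables dynamic provisioning for this claim.
        return "dynamic provisioning disabled; a matching PV must already exist"
    if requested not in names:
        return f"StorageClass {requested!r} does not exist"
    return f"StorageClass {requested!r} exists; check provisioner events"
```

When the class exists, the PVC's events usually carry the provisioner's error message, so the events step of the runbook is the next place to look.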

Pod Stuck in ContainerCreating

code
1. describe_pod(name, namespace)
2. get_pvc(namespace)
3. get_events(namespace)

Common Causes:

  • PVC not bound
  • Volume mount error
  • Image pull taking time

Runbook: Control Plane Issues

API Server Unavailable

code
1. get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
2. get_events(namespace="kube-system")

etcd Issues

code
1. get_pods(namespace="kube-system", label_selector="component=etcd")
2. get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Cordon Node (Prevent Scheduling)

code
# Via kubectl (not in MCP tools yet)
# kubectl cordon <node>

Drain Node (Evict Pods)

code
# Via kubectl
# kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
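
Since cordon and drain are run through kubectl rather than a tool, it can help to build the exact command lines in one place before executing them. A minimal sketch: the helper names are hypothetical, the optional `context` argument is an addition that maps to kubectl's global `--context` flag, and nothing is executed here.

```python
import shlex
from typing import Optional


def _kubectl(context: Optional[str]) -> list[str]:
    """Base kubectl invocation, optionally pinned to a kubeconfig context."""
    return ["kubectl"] + (["--context", context] if context else [])


def cordon_command(node: str, context: Optional[str] = None) -> list[str]:
    """Command that marks the node unschedulable."""
    return _kubectl(context) + ["cordon", node]


def drain_command(node: str, context: Optional[str] = None) -> list[str]:
    """Command that evicts pods from the node, as in the runbook step above."""
    return _kubectl(context) + [
        "drain", node, "--ignore-daemonsets", "--delete-emptydir-data",
    ]


# Print the commands for review before running them.
for cmd in (cordon_command("node-1"), drain_command("node-1")):
    print(shlex.join(cmd))
```

The returned lists can be passed to `subprocess.run(cmd, check=True)` once reviewed. Keeping them as argument lists avoids shell quoting problems with unusual node names.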

Force Delete Pod

code
delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

code
rollback_deployment(name, namespace, revision=0)  # Previous version

Helm Rollback

code
rollback_helm_release(name, namespace, revision=1)

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Collects:

  • Pod logs and events
  • Node conditions
  • Resource usage
  • Network policies
  • Recent changes

Multi-Cluster Incident Response

Check all clusters:

code
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
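
During an incident one cluster may be unreachable, and a plain loop would stop at the first failure. The sketch below wraps the same loop so that each check is isolated. The checks are passed in as callables because the real tool functions are not importable Python; `check_all_clusters` and the report shape are assumptions of this example.

```python
from typing import Any, Callable


def check_all_clusters(
    contexts: list[str],
    checks: dict[str, Callable[[str], Any]],
) -> dict[str, dict[str, dict[str, Any]]]:
    """Run every check against every context, recording errors per check."""
    report: dict[str, dict[str, dict[str, Any]]] = {}
    for context in contexts:
        report[context] = {}
        for name, check in checks.items():
            try:
                report[context][name] = {"ok": True, "result": check(context)}
            except Exception as exc:
                # Keep going: one failing cluster should not hide the others.
                report[context][name] = {"ok": False, "error": str(exc)}
    return report
```

In practice the callables would be thin wrappers such as `lambda ctx: get_nodes(context=ctx)`, with the context list taken from the loop above.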

Post-Incident

Document Timeline

  1. When did the incident start?
  2. What was the impact?
  3. What was the root cause?
  4. What fixed it?

Prevent Recurrence

  • Add monitoring/alerting
  • Improve resource limits
  • Add readiness probes
  • Document runbook

Related Skills