k8s

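Kubernetes operations skill for deploying, running, and troubleshooting services on Kubernetes. Suited to writing manifests and Helm configuration, configuring Deployments/Services/Ingress, setting up autoscaling, building observability, managing RBAC, handling Secrets and ConfigMaps, performing rollouts and rollbacks, debugging production incidents, and running production readiness checks.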

SKILL.md

--- frontmatter

name: k8s
description: Kubernetes ops skill for deploying, operating, and troubleshooting services on Kubernetes. Use for tasks like writing manifests/Helm, configuring deployments/services/ingress, autoscaling, observability, RBAC, secrets/configmaps, rollout/rollback, incident debugging, and production readiness checks.

k8s

Use this skill for Kubernetes operations and release work.

Defaults / assumptions to confirm

•Cluster type: managed (EKS/GKE/ACK) vs self-hosted
•Packaging: raw YAML vs Helm vs Kustomize
•Ingress: NGINX/ALB/APISIX/Istio
•Observability stack: Prometheus/Grafana, Loki/ELK, tracing

Workflow

•Understand service requirements

•Ports, protocols, health checks, resources (CPU/mem), storage needs.
•SLOs: latency, availability, RPO/RTO.
•Dependencies: DB, cache, MQ, external APIs.

•Deployment design

•Use a Deployment for stateless workloads; use a StatefulSet when pods need stable identities or per-pod storage.
•Define readinessProbe and livenessProbe (and startupProbe if needed).
•Set resources.requests/limits and choose the QoS class they imply (Guaranteed, Burstable, or BestEffort).
•Use PodDisruptionBudget for availability during maintenance.
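
A minimal Python sketch of the deployment design points above, assuming a hypothetical HTTP service with a `/healthz` endpoint. The name, image, port, probe timings, and resource figures are illustrative placeholders, not recommendations. `qos_class` is a simplified reading of the QoS rules: it compares quantities as strings and ignores init containers.

```python
import json


def qos_class(containers):
    """Approximate the QoS class Kubernetes would assign to a pod.

    Simplified: quantities are compared as strings ("500m" != "0.5"),
    and init containers are not considered.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        res = c.get("resources", {})
        requests = res.get("requests", {})
        limits = res.get("limits", {})
        if requests or limits:
            any_set = True
        for key in ("cpu", "memory"):
            limit = limits.get(key)
            request = requests.get(key, limit)  # requests default to limits
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def deployment(name, image, port, replicas=2):
    """Build a Deployment manifest with probes and resources set."""
    probe = {"httpGet": {"path": "/healthz", "port": port}}  # assumed endpoint
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "readinessProbe": {**probe, "periodSeconds": 5, "failureThreshold": 3},
        "livenessProbe": {**probe, "periodSeconds": 10, "failureThreshold": 3},
        # startupProbe gives slow starters up to 30 * 5s before liveness applies
        "startupProbe": {**probe, "periodSeconds": 5, "failureThreshold": 30},
        "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
    }
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = deployment("web", "registry.example.com/web:1.0.0", 8080)
    print(json.dumps(manifest, indent=2))  # JSON is accepted by kubectl apply -f
    print(qos_class(manifest["spec"]["template"]["spec"]["containers"]))
```

With the placeholder values, the CPU request is lower than the CPU limit, so the pod lands in the Burstable class; setting requests equal to limits for both CPU and memory would make it Guaranteed.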

•Config & secrets

•Config: ConfigMap (non-sensitive), mounted as files or injected as env vars.
•Secrets: Secret (sensitive) + external secret manager if available.
•Never commit plaintext secrets; prefer sealed/external secrets.
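
To illustrate why a plain Secret manifest should not be committed, here is a short Python sketch showing that the `data` field of a Secret is only base64-encoded, not encrypted. The secret name and value are made-up placeholders.

```python
import base64


def secret_manifest(name, values):
    """Build an Opaque Secret manifest; values are base64-encoded, not encrypted."""
    data = {k: base64.b64encode(v.encode()).decode() for k, v in values.items()}
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


def decode_secret(manifest):
    """Recover the plaintext from a Secret manifest; no key is needed."""
    return {k: base64.b64decode(v).decode() for k, v in manifest["data"].items()}


if __name__ == "__main__":
    m = secret_manifest("db-credentials", {"password": "not-a-real-password"})
    print(m["data"])          # looks opaque
    print(decode_secret(m))   # but anyone with the file can read it back
```

This is the reason to prefer sealed or externally managed secrets when manifests live in version control.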

•Networking

•Service types and DNS.
•Ingress/Gateway routing, TLS termination, timeouts.
•NetworkPolicy, if the cluster's network plugin enforces it.
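
A hedged Python sketch of the networking pieces above: a ClusterIP Service, the in-cluster DNS name it resolves under, and a NetworkPolicy that admits traffic only from pods with a given label. The names, namespace, ports, and labels are hypothetical, and `cluster.local` is the common default cluster domain, which a given cluster may override.

```python
def service(name, namespace, port, target_port, svc_type="ClusterIP"):
    """Build a Service that selects pods labeled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": svc_type,
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port, "protocol": "TCP"}],
        },
    }


def service_dns(name, namespace, cluster_domain="cluster.local"):
    """Fully qualified in-cluster DNS name of a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def allow_ingress_from(name, namespace, client_labels, port):
    """NetworkPolicy: only pods matching client_labels may reach app=<name>."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{name}-allow-clients", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": name}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": client_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(service_dns("web", "shop"))
```

The policy has an effect only when the cluster's network plugin implements NetworkPolicy; otherwise the object is accepted but not enforced.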

•Scaling & resilience

•HPA based on CPU/memory/custom metrics.
•Graceful shutdown (preStop, terminationGracePeriodSeconds).
•Retry/backoff at client; avoid retry storms.
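
The scaling and retry points above can be sketched in Python as follows. `hpa_desired_replicas` follows the documented HPA ratio formula with a 10% tolerance (the usual default) but leaves out stabilization windows and scaling policies. `backoff_delays` uses capped exponential backoff with full jitter, which is one common way to avoid retry storms, not the only one. All numeric values are illustrative.

```python
import math
import random


def hpa_desired_replicas(current, current_metric, target_metric,
                         min_replicas, max_replicas, tolerance=0.1):
    """desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Capped exponential backoff with full jitter, in seconds."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


if __name__ == "__main__":
    # 4 replicas at 90% average CPU against a 60% target
    print(hpa_desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
    print(backoff_delays(5))
```

Randomizing each delay spreads retries from many clients over time, so a brief outage does not turn into a synchronized wave of requests when the service recovers.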

•Observability

•Standard logs with correlation IDs.
•Metrics: RPS, p95 latency, error rate, saturation.
•Alerts and dashboards; runbook links.
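
As a rough illustration of the metrics listed above, this Python sketch computes p95 latency (nearest-rank method) and error rate from raw samples. In practice these usually come from the metrics backend rather than application code; the sample data and the choice to count only 5xx responses as errors are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses with a 5xx status (assumed definition of an error)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic samples: 1..100 ms
    print(percentile(latencies_ms, 95))
    print(error_rate([200, 200, 500, 503]))
```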

•Release operations

•Rolling updates, canary/blue-green if needed.
•kubectl rollout status + rollback plan.
•Post-deploy verification checks and smoke tests.
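
One way to script the rollout check and rollback plan above, as a Python sketch around `kubectl rollout status` and `kubectl rollout undo`. The deployment name, namespace, and timeout are placeholders, and `deploy_with_rollback` is a hypothetical helper; it assumes `kubectl` is on the PATH and already pointed at the right cluster.

```python
import subprocess


def rollout_status_cmd(name, namespace, timeout="120s"):
    """Command that blocks until the rollout finishes or the timeout expires."""
    return ["kubectl", "rollout", "status", f"deployment/{name}",
            "-n", namespace, f"--timeout={timeout}"]


def rollout_undo_cmd(name, namespace):
    """Command that rolls the Deployment back to its previous revision."""
    return ["kubectl", "rollout", "undo", f"deployment/{name}", "-n", namespace]


def deploy_with_rollback(name, namespace, run=subprocess.run):
    """Wait for the rollout; undo it if the status check fails."""
    status = run(rollout_status_cmd(name, namespace))
    if status.returncode == 0:
        return "rolled-out"
    run(rollout_undo_cmd(name, namespace), check=True)
    return "rolled-back"


if __name__ == "__main__":
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(rollout_status_cmd("web", "shop")))
    print(" ".join(rollout_undo_cmd("web", "shop")))
```

The `run` parameter exists so the flow can be exercised with a stub; smoke tests would normally run after a "rolled-out" result and before the release is declared done.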

•Troubleshooting checklist

•kubectl get/describe pods, events, and logs.
•Check probes, image pull, env/config, DNS, network, and resource throttling.
•For performance: node pressure, HPA behavior, GC/heap, connection pool limits.
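
The checklist above could be partly automated along these lines. This Python sketch reads a pod object as returned by `kubectl get pod <name> -o json` and maps common container status reasons to a next check. The reason strings are ones Kubernetes reports; the suggested checks and the `triage` helper are illustrative, not exhaustive.

```python
# Container status reasons mapped to a suggested next step (illustrative).
NEXT_CHECKS = {
    "CrashLoopBackOff": "read previous logs: kubectl logs <pod> --previous",
    "ImagePullBackOff": "verify image name/tag and imagePullSecrets",
    "ErrImagePull": "verify image name/tag and registry access",
    "OOMKilled": "compare memory limit with actual usage; check heap settings",
    "CreateContainerConfigError": "check referenced ConfigMaps and Secrets exist",
}


def triage(pod):
    """Return (container, reason, next_check) for each recognized reason."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        last = cs.get("lastState", {})
        reasons = [
            state.get("waiting", {}).get("reason"),
            state.get("terminated", {}).get("reason"),
            last.get("terminated", {}).get("reason"),
        ]
        for reason in reasons:
            if reason in NEXT_CHECKS:
                findings.append((cs.get("name"), reason, NEXT_CHECKS[reason]))
    return findings


if __name__ == "__main__":
    sample = {  # hand-written sample, not real cluster output
        "status": {
            "containerStatuses": [
                {
                    "name": "web",
                    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    "lastState": {"terminated": {"reason": "OOMKilled"}},
                }
            ]
        }
    }
    for finding in triage(sample):
        print(finding)
```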

Output expectations when making changes

•Provide manifests (or Helm values/templates) + brief deployment notes.
•Include resource sizing rationale and probe settings.
•Include rollback instructions and verification steps.