
k8s

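Kubernetes operations skill for deploying, running, and troubleshooting services on Kubernetes. Suited to tasks such as writing manifests and Helm configuration, configuring Deployments/Services/Ingress, implementing autoscaling, building observability, managing RBAC, handling Secrets and ConfigMaps, performing rollouts and rollbacks, debugging production incidents, and running production readiness checks.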

SKILL.md
--- frontmatter
name: k8s
description: Kubernetes ops skill for deploying, operating, and troubleshooting services on Kubernetes. Use for tasks like writing manifests/Helm, configuring deployments/services/ingress, autoscaling, observability, RBAC, secrets/configmaps, rollout/rollback, incident debugging, and production readiness checks.

k8s

Use this skill for Kubernetes operations and release work.

Defaults / assumptions to confirm

  • Cluster type: managed (EKS/GKE/ACK) vs self-hosted
  • Packaging: raw YAML vs Helm vs Kustomize
  • Ingress: NGINX/ALB/APISIX/Istio
  • Observability stack: Prometheus/Grafana, Loki/ELK, tracing

Workflow

  1. Understand service requirements
  • Ports, protocols, health checks, resources (CPU/mem), storage needs.
  • SLOs: latency, availability, RPO/RTO.
  • Dependencies: DB, cache, MQ, external APIs.
  2. Deployment design
  • Use Deployment for stateless; StatefulSet for stable identities/storage.
  • Define readinessProbe and livenessProbe (and startupProbe if needed).
  • Set resources.requests/limits and choose the appropriate QoS class (Guaranteed, Burstable, or BestEffort).
  • Use PodDisruptionBudget for availability during maintenance.
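
As a rough illustration of the deployment design points above, here is a minimal sketch of a stateless Deployment with all three probes, resource settings, and a PodDisruptionBudget. The name `example-api`, the image, port 8080, the `/healthz` and `/ready` paths, and every numeric value are hypothetical placeholders, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                # hypothetical service name
  labels:
    app: example-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: app
          image: registry.example.com/example-api:1.0.0   # placeholder image
          ports:
            - name: http
              containerPort: 8080
          # startupProbe holds off the other probes until the app has booted
          # (here: up to 30 x 5s = 150s).
          startupProbe:
            httpGet:
              path: /healthz       # assumed health endpoint
              port: http
            periodSeconds: 5
            failureThreshold: 30
          # readinessProbe controls whether the pod receives Service traffic.
          readinessProbe:
            httpGet:
              path: /ready         # assumed readiness endpoint
              port: http
            periodSeconds: 5
            failureThreshold: 3
          # livenessProbe restarts the container if it stops responding.
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 10
            failureThreshold: 3
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 256Mi        # no CPU limit, so the QoS class is Burstable
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 2                  # keep at least 2 pods during voluntary disruptions
  selector:
    matchLabels:
      app: example-api
```

To get the Guaranteed QoS class instead, every container needs CPU and memory limits that equal its requests.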
  3. Config & secrets
  • Config: ConfigMap (non-sensitive), mounted or env.
  • Secrets: Secret (sensitive) + external secret manager if available.
  • Never commit plaintext secrets; prefer sealed/external secrets.
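
One way the config and secret split could look is sketched below. It assumes a Secret named `example-api-secrets` with a `db-password` key already exists in the namespace, created by an external secret manager or a sealed-secrets controller rather than committed to the repository. All names, keys, and values are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-api-config         # hypothetical name
data:
  LOG_LEVEL: "info"                # non-sensitive settings only
  CACHE_TTL_SECONDS: "60"
---
apiVersion: v1
kind: Pod
metadata:
  name: example-api-config-demo    # a Pod is used only to keep the sketch short
spec:
  containers:
    - name: app
      image: registry.example.com/example-api:1.0.0   # placeholder image
      envFrom:
        - configMapRef:
            name: example-api-config      # every key becomes an env var
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: example-api-secrets   # assumed to be provisioned out of band
              key: db-password
```

Environment variables are read once at container start, so a changed ConfigMap or Secret only takes effect after the pods restart. Mounted volumes are refreshed in place, unless they are mounted with `subPath`.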
  4. Networking
  • Service types and DNS.
  • Ingress/Gateway routing, TLS termination, timeouts.
  • NetworkPolicy, if the cluster's network plugin enforces it.
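
The networking pieces might fit together roughly as in the sketch below. It assumes pods labeled `app: example-api` that expose a container port named `http` on 8080, the NGINX ingress controller running in a namespace called `ingress-nginx`, and a TLS Secret named `example-api-tls`. The hostname and timeout value are placeholders, and the timeout annotation is specific to ingress-nginx, so other controllers need their own equivalent.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api
spec:
  type: ClusterIP                  # reachable in-cluster as example-api.<namespace>.svc
  selector:
    app: example-api
  ports:
    - name: http
      port: 80
      targetPort: http             # resolves to the container port named "http"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-api
  annotations:
    # ingress-nginx specific: upstream read timeout in seconds
    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
spec:
  ingressClassName: nginx          # assumed ingress class
  tls:
    - hosts:
        - api.example.com          # placeholder hostname
      secretName: example-api-tls  # TLS is terminated at the ingress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-api
                port:
                  name: http
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-api-allow-ingress
spec:
  podSelector:
    matchLabels:
      app: example-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed controller namespace
      ports:
        - protocol: TCP
          port: 8080
```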
  5. Scaling & resilience
  • HPA based on CPU/memory/custom metrics.
  • Graceful shutdown (preStop, terminationGracePeriodSeconds).
  • Retry/backoff at client; avoid retry storms.
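
A minimal sketch of CPU-based autoscaling and graceful shutdown follows. It assumes a Deployment named `example-api` whose containers set CPU requests (utilization targets are computed against requests) and an image that ships a `sleep` binary. The replica bounds, the 70% target, and the timings are illustrative only.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api              # hypothetical target
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged over pods
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping on the way down
---
apiVersion: v1
kind: Pod
metadata:
  name: example-api-shutdown-demo  # a Pod is used only to keep the sketch short
spec:
  # Must cover the preStop delay plus the app's own drain time after SIGTERM.
  terminationGracePeriodSeconds: 45
  containers:
    - name: app
      image: registry.example.com/example-api:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Short pause so endpoint removal can propagate before SIGTERM arrives.
            command: ["sleep", "10"]
```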
  6. Observability
  • Standard logs with correlation IDs.
  • Metrics: RPS, p95 latency, error rate, saturation.
  • Alerts and dashboards; runbook links.
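
If the stack is Prometheus, alerts on error rate and p95 latency could be sketched as a rule file like the one below. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `job` and `code` labels, the thresholds, and the runbook URLs are all assumptions, so substitute whatever the service actually exports.

```yaml
groups:
  - name: example-api
    rules:
      - alert: ExampleApiHighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{job="example-api",code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="example-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "example-api 5xx ratio above 5% for 10 minutes"
          runbook_url: "https://runbooks.example.com/example-api/high-error-rate"   # placeholder
      - alert: ExampleApiHighP95Latency
        # p95 latency in seconds, estimated from histogram buckets.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="example-api"}[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "example-api p95 latency above 500ms for 10 minutes"
          runbook_url: "https://runbooks.example.com/example-api/high-latency"      # placeholder
```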
  7. Release operations
  • Rolling updates, canary/blue-green if needed.
  • kubectl rollout status + rollback plan.
  • Post-deploy verification checks and smoke tests.
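
For a plain rolling update, the rollout-related Deployment fields might be set as in this patch-style fragment. It lists only the fields relevant to releases and would be merged into a full Deployment, for example as a Kustomize patch. The name `example-api` and all numbers are hypothetical. The `kubectl` commands in the comments are the standard rollout subcommands.

```yaml
# Watch the rollout:   kubectl rollout status deployment/example-api --timeout=5m
# Inspect revisions:   kubectl rollout history deployment/example-api
# Roll back one step:  kubectl rollout undo deployment/example-api
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                # hypothetical name
spec:
  minReadySeconds: 10              # a new pod must stay Ready this long to count as available
  progressDeadlineSeconds: 300     # the rollout is reported as failed if it stalls past this
  revisionHistoryLimit: 5          # old ReplicaSets kept for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # at most one extra pod during the update
      maxUnavailable: 0            # never drop below the desired replica count
```

With `maxUnavailable: 0`, a new pod that never passes its readiness probe stalls the rollout without taking capacity away from the old version, which leaves time to run the rollback.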
  8. Troubleshooting checklist
  • kubectl get/describe pods, events, and logs.
  • Check probes, image pull, env/config, DNS, network, and resource throttling.
  • For performance: node pressure, HPA behavior, GC/heap, connection pool limits.

Output expectations when making changes

  • Provide manifests (or Helm values/templates) + brief deployment notes.
  • Include resource sizing rationale and probe settings.
  • Include rollback instructions and verification steps.