k8s-continual-improvement

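Use this skill when defining and managing SLOs, optimizing cluster costs (FinOps), measuring and reducing toil, planning capacity, assessing platform maturity, implementing feedback loops, or creating improvement roadmaps.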

SKILL.md
--- frontmatter
name: k8s-continual-improvement
description: Use when defining and managing SLOs, optimizing cluster costs (FinOps), measuring and reducing toil, conducting capacity planning, assessing platform maturity, implementing feedback loops, or creating improvement roadmaps

Kubernetes Continual Service Improvement

Continuously improve Kubernetes platforms through SLO management, cost optimization (FinOps), performance tuning, toil reduction, and platform maturity assessment.

Keywords

kubernetes, slo, sli, sla, error budget, cost optimization, finops, capacity, performance, improvement, maturity, metrics, toil, feedback, defining, managing, optimizing, measuring, reducing, planning, assessing, implementing, creating

When to Use This Skill

  • Defining and managing SLOs
  • Optimizing cluster costs (FinOps)
  • Measuring and reducing toil
  • Conducting capacity planning
  • Assessing platform maturity
  • Implementing feedback loops
  • Creating improvement roadmaps

Quick Reference

| Metric | Target | Calculation |
| --- | --- | --- |
| Availability | 99.9% | uptime / total_time |
| Error Budget | 43.2 min/mo | (1 - SLO) * time_period |
| CPU Efficiency | >60% | actual / requested |
| MTTR | <4h (P1) | mean(resolve_time - alert_time) |
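
The MTTR formula in the table, mean(resolve_time - alert_time), can be sketched in Python as below. The incident timestamps are hypothetical sample data, and the `mttr` helper is an illustrative name, not part of any tool referenced here.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttr(incidents):
    """Mean time to resolve: mean(resolve_time - alert_time) over all incidents."""
    seconds = [(resolved - alerted).total_seconds() for alerted, resolved in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical P1 incidents as (alert_time, resolve_time) pairs.
p1_incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 30)),     # 2.5 h
    (datetime(2024, 1, 17, 22, 15), datetime(2024, 1, 18, 1, 45)),  # 3.5 h
]

result = mttr(p1_incidents)
print(result)                        # 3:00:00
print(result < timedelta(hours=4))   # True: within the <4h P1 target
```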

SLO Framework

Service Level Indicators

Availability:

yaml
- record: platform:availability:ratio_5m
  expr: |
    # Fraction of scrape targets that were up, averaged over the last 5m
    avg(avg_over_time(up{job=~"kubernetes-.*"}[5m]))

Latency (p99):

yaml
- record: platform:latency:p99_5m
  expr: |
    histogram_quantile(0.99,
      sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m]))
      by (le))

Error Rate:

yaml
- record: platform:error_rate:ratio_5m
  expr: |
    sum(rate(apiserver_request_total{code=~"5.."}[5m]))
    / sum(rate(apiserver_request_total[5m]))

SLO Targets

| Service | SLI | SLO | Error Budget/mo |
| --- | --- | --- | --- |
| API Server | Availability | 99.9% | 43.2 min |
| API Server | p99 Latency | <500ms | - |
| Ingress | Availability | 99.95% | 21.6 min |
| Workloads | Pod Start | <60s p95 | - |
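
The error budget figures above follow from (1 - SLO) * time_period. A minimal Python sketch of that arithmetic, assuming a 30-day month (the function names are illustrative):

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for the period: (1 - SLO) * time_period."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, downtime_minutes, days=30):
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, days)


print(round(error_budget_minutes(0.999), 1))    # 43.2  (API Server)
print(round(error_budget_minutes(0.9995), 1))   # 21.6  (Ingress)
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75 after 10.8 min of downtime
```

A month with a different number of days shifts the budget slightly, so pass `days` explicitly if the reporting period is a calendar month.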

Error Budget Alerts

yaml
- alert: ErrorBudgetBurnRate
  expr: |
    (1 - platform:availability:ratio_5m) > (1 - 0.999) * 14.4
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Error budget burning fast"
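
The 14.4 multiplier in the alert is a burn rate: how many times faster than "exactly on budget" the error budget is being spent. The sketch below works through the arithmetic, assuming a 30-day budget period. The helper names are illustrative, and the one-hour window is an assumption chosen to show where 14.4 comes from (the alert above evaluates a 5m ratio).

```python
def burn_rate(error_ratio, slo):
    """Observed error ratio divided by the ratio the SLO allows."""
    return error_ratio / (1 - slo)


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


def hours_to_exhaustion(rate, period_hours=30 * 24):
    """Hours until the whole budget is gone at a constant burn rate."""
    return period_hours / rate


print(round(burn_rate(0.0144, 0.999), 1))    # 14.4
print(round(budget_consumed(14.4, 1), 4))    # 0.02 -> 2% of the monthly budget in one hour
print(round(hours_to_exhaustion(14.4), 1))   # 50.0 hours until the budget is exhausted
```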

Cost Optimization (FinOps)

Efficiency Metrics

| Metric | Formula | Target |
| --- | --- | --- |
| CPU Efficiency | actual_cpu / requested_cpu | >60% |
| Memory Efficiency | actual_mem / requested_mem | >70% |
| Cost per Tenant | cluster_cost * (tenant_usage / total) | Track |
| Idle Resources | unused_capacity / total | <20% |
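
As a rough illustration of the formulas above, the following Python sketch computes CPU efficiency and a proportional cost split. All figures and tenant names are hypothetical, and `parse_cpu_millicores` handles only the two common CPU quantity forms (`250m` and whole or fractional cores).

```python
def parse_cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '250m' or '2' to millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000


def efficiency(actual, requested):
    """actual / requested, as used for both CPU and memory efficiency."""
    return actual / requested


def cost_per_tenant(cluster_cost, usage_by_tenant):
    """cluster_cost * (tenant_usage / total) for every tenant."""
    total = sum(usage_by_tenant.values())
    return {tenant: cluster_cost * usage / total
            for tenant, usage in usage_by_tenant.items()}


# Hypothetical pod using 450m of a 1-core request.
cpu_eff = efficiency(parse_cpu_millicores("450m"), parse_cpu_millicores("1"))
print(f"CPU efficiency: {cpu_eff:.0%}")  # CPU efficiency: 45% (below the >60% target)

# Hypothetical monthly cluster cost split by CPU-core usage.
costs = cost_per_tenant(10_000, {"team-a": 6.0, "team-b": 3.0, "team-c": 1.0})
print(costs)
```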

Resource Analysis

bash
# Total CPU usage across all pods, in millicores (compare with total requests for efficiency)
kubectl top pods -A --no-headers | awk '{
  split($3, cpu, "m"); actual+=cpu[1]
} END {print "Total CPU: " actual "m"}'

# Deployments scaled to zero replicas (candidates for cleanup)
kubectl get deployments -A -o json | \
  jq -r '.items[] | select(.spec.replicas==0) | "\(.metadata.namespace)/\(.metadata.name)"'

# Bound PVCs not mounted by any pod, listed as namespace/name
kubectl get pvc -A -o json | jq -r '.items[] | select(.status.phase=="Bound") | "\(.metadata.namespace)/\(.metadata.name)"' | sort -u > bound.txt
kubectl get pods -A -o json | jq -r '.items[] | .metadata.namespace as $ns | .spec.volumes[]? | select(.persistentVolumeClaim != null) | "\($ns)/\(.persistentVolumeClaim.claimName)"' | sort -u > mounted.txt
comm -23 bound.txt mounted.txt

Cost Reduction Strategies

| Strategy | Savings | Effort |
| --- | --- | --- |
| Right-size requests | 20-40% | Medium |
| Spot/preemptible nodes | 60-80% | High |
| Cluster autoscaling | 10-30% | Low |
| Namespace quotas | Prevents waste | Low |
| Resource cleanup | 5-15% | Low |

Cost Allocation Labels

yaml
metadata:
  labels:
    cost-center: engineering
    team: platform
    environment: production
    application: api-gateway
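
Cost allocation only works if every workload carries these labels. One way to check this, sketched in Python under the assumption that manifests have already been loaded into dictionaries (for example from `kubectl get ... -o json`). `REQUIRED_LABELS` and `missing_cost_labels` are illustrative names.

```python
REQUIRED_LABELS = ("cost-center", "team", "environment", "application")


def missing_cost_labels(manifest):
    """Return the required cost-allocation labels that are absent or empty."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Hypothetical manifests: one fully labeled, one missing two labels.
labeled = {"metadata": {"labels": {
    "cost-center": "engineering", "team": "platform",
    "environment": "production", "application": "api-gateway"}}}
unlabeled = {"metadata": {"labels": {"team": "platform", "environment": "production"}}}

print(missing_cost_labels(labeled))    # []
print(missing_cost_labels(unlabeled))  # ['cost-center', 'application']
```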

Toil Measurement

Toil Indicators

  • Manual, repetitive tasks
  • No lasting value
  • Scales with service size
  • Automatable

Toil Tracking

yaml
toil_tasks:
  - name: "Manual tenant onboarding"
    frequency: "5/week"
    duration: "30min"
    annual_hours: 130
    automation_effort: "M"

  - name: "Certificate rotation"
    frequency: "4/year"
    duration: "2h"
    annual_hours: 8
    automation_effort: "S"
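
The `annual_hours` values above can be derived from `frequency` and `duration` rather than entered by hand. A minimal Python sketch, assuming the `N/period` and `Nmin`/`Nh` string formats shown in the tracking entries and 52 weeks per year:

```python
import re

PERIODS_PER_YEAR = {"day": 365, "week": 52, "month": 12, "year": 1}
MINUTES_PER_UNIT = {"min": 1, "h": 60}


def annual_hours(frequency, duration):
    """Hours per year for a task given e.g. frequency='5/week', duration='30min'."""
    count, period = frequency.split("/")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(min|h)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    amount, unit = match.groups()
    minutes = float(count) * PERIODS_PER_YEAR[period] * float(amount) * MINUTES_PER_UNIT[unit]
    return minutes / 60


print(annual_hours("5/week", "30min"))  # 130.0  (manual tenant onboarding)
print(annual_hours("4/year", "2h"))     # 8.0    (certificate rotation)
```

Sorting tasks by annual hours and weighing them against `automation_effort` gives a simple first ordering for the automation backlog.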

Toil Reduction Target

  • Current: X hours/week
  • Target: 50% reduction in 6 months
  • Method: Automation, self-service

Platform Maturity Model

| Level | Name | Characteristics |
| --- | --- | --- |
| 1 | Initial | Ad-hoc, manual |
| 2 | Managed | Documented, repeatable |
| 3 | Defined | Standardized, measured |
| 4 | Quantified | Data-driven, optimized |
| 5 | Optimizing | Continuous improvement |

Capability Assessment

yaml
capabilities:
  provisioning:
    current: 2
    target: 4
    gap: "No self-service"
  monitoring:
    current: 3
    target: 4
    gap: "Missing SLOs"
  security:
    current: 3
    target: 4
    gap: "Manual audits"
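
To turn the assessment into priorities, the distance between `current` and `target` can be ranked. The Python sketch below mirrors the YAML above as a dictionary. Ranking by level difference alone is an assumption made for illustration; a real prioritization would likely weigh effort and impact as well.

```python
capabilities = {
    "provisioning": {"current": 2, "target": 4, "gap": "No self-service"},
    "monitoring": {"current": 3, "target": 4, "gap": "Missing SLOs"},
    "security": {"current": 3, "target": 4, "gap": "Manual audits"},
}


def rank_by_gap(caps):
    """Return (name, levels_to_close) pairs, largest maturity gap first."""
    gaps = [(name, cap["target"] - cap["current"]) for name, cap in caps.items()]
    # sorted() is stable, so equal gaps keep their original order.
    return sorted(gaps, key=lambda item: item[1], reverse=True)


for name, levels in rank_by_gap(capabilities):
    print(f"{name}: {levels} level(s) to close ({capabilities[name]['gap']})")
```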

Feedback Loops

Tenant Satisfaction (CSAT)

yaml
survey:
  - "How satisfied are you with platform stability? (1-5)"
  - "How easy is it to deploy applications? (1-5)"
  - "How responsive is support? (1-5)"
  - "What should we improve?"

Platform Metrics Dashboard

yaml
dashboards:
  executive:
    - Availability %
    - Cost per tenant
    - Incident count
  tenant:
    - Resource usage
    - Deploy success rate
    - Error rates
  platform_team:
    - All SLIs
    - Error budget remaining
    - Capacity utilization

Improvement Cadence

| Cadence | Activities |
| --- | --- |
| Weekly | Incident review, quick wins |
| Monthly | SLO review, cost analysis, backlog |
| Quarterly | Maturity assessment, OKRs |
| Annually | Strategy, tech radar, budget |

Reporting Template

markdown
# Platform Report - ${MONTH} ${YEAR}

## Availability
- SLO: 99.9% | Actual: ${ACTUAL}%
- Error Budget: ${REMAINING}% remaining

## Incidents
- P1: ${P1_COUNT} | P2: ${P2_COUNT}
- MTTR: ${MTTR}

## Cost
- Total: ${TOTAL}
- Per Tenant: ${AVG}
- MoM: ${CHANGE}%

## Capacity
- CPU: ${CPU}% | Memory: ${MEM}%

## Improvements
1. ${DELIVERED_1}
2. ${DELIVERED_2}

## Next Month
1. ${PLANNED_1}
2. ${PLANNED_2}
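
The `${NAME}` placeholders in the template happen to match the syntax of Python's standard `string.Template`, so the report can be filled in with a few lines. The sketch below renders only the header and availability section, and every value passed in is a hypothetical example.

```python
from string import Template

# Excerpt of the report template; the remaining sections work the same way.
REPORT = Template("""\
# Platform Report - ${MONTH} ${YEAR}

## Availability
- SLO: 99.9% | Actual: ${ACTUAL}%
- Error Budget: ${REMAINING}% remaining
""")

# substitute() raises KeyError if a placeholder is left unfilled,
# which catches incomplete reports before they are sent out.
rendered = REPORT.substitute(MONTH="January", YEAR=2024, ACTUAL=99.95, REMAINING=50)
print(rendered)
```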

Improvement Backlog Template

markdown
## ${TITLE}

**Category**: Performance | Reliability | Security | Cost | UX
**Priority**: P1 | P2 | P3
**Effort**: S | M | L | XL

**Current**: ${PROBLEM}
**Target**: ${GOAL}
**Metrics**: Before: X → Target: Y
**Dependencies**: ${DEPS}