Kubernetes Platform Operations
Manage day-2 operations of Kubernetes platforms including monitoring, incident response, capacity planning, and operational excellence.
Keywords
kubernetes, operations, monitoring, incident, capacity, maintenance, backup, recovery, health check, runbook, on-call, escalation, node, pod, troubleshooting, performing, responding, planning, managing, conducting, creating
When to Use This Skill
- •Performing cluster health checks
- •Responding to incidents and alerts
- •Planning and managing capacity
- •Conducting maintenance operations
- •Managing backups and recovery
- •Creating and following runbooks
Related Skills
- •k8s-platform-tenancy - Tenant management
- •k8s-security-hardening - Security operations
- •k8s-continual-improvement - SLOs and metrics
- •flux-troubleshooting - GitOps troubleshooting
Quick Reference
| Task | Command |
|---|---|
| Cluster health | kubectl get nodes && kubectl get cs |
| All pods status | kubectl get pods -A | grep -v Running |
| Resource usage | kubectl top nodes && kubectl top pods -A |
| Recent events | kubectl get events -A --sort-by='.lastTimestamp' |
Health Monitoring
Cluster Health Checks
bash
kubectl get nodes -o wide kubectl top nodes kubectl get pods -n kube-system kubectl get pods -n platform-system kubectl get --raw='/healthz?verbose' kubectl get --raw='/livez?verbose' kubectl get --raw='/readyz?verbose' kubectl get --raw='/healthz/etcd'
Key Metrics to Monitor
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node CPU | >70% | >85% | Scale or optimize |
| Node Memory | >75% | >90% | Scale or evict |
| Pod restarts | >3/hr | >10/hr | Investigate |
| API latency p99 | >500ms | >1s | Check etcd/load |
| etcd disk | >70% | >85% | Expand/compact |
| PVC usage | >75% | >90% | Expand or clean |
Incident Response
Severity Levels
| Level | Impact | Response | Examples |
|---|---|---|---|
| P1 | Platform down | 15 min | API unreachable, etcd failure |
| P2 | Major degradation | 30 min | Node failures, ingress down |
| P3 | Partial impact | 2 hours | Single tenant affected |
| P4 | Minor issue | 24 hours | Non-critical alerts |
Incident Workflow
code
1. DETECT → Alert fires or user report 2. TRIAGE → Assess severity and impact 3. COMMUNICATE → Update status page/channel 4. INVESTIGATE → Gather evidence 5. MITIGATE → Restore service (temporary fix OK) 6. RESOLVE → Fix root cause 7. REVIEW → Post-incident analysis
Escalation Path
code
L1 (On-call) → 15 min no progress
↓
L2 (Senior SRE) → 30 min no progress
↓
L3 (Platform Lead) → Critical impact
↓
Management (if customer-facing outage)
On-Call Handoff Template
markdown
## On-Call Handoff - ${DATE}
### Active Issues
- [P2] Issue description - Status
### Recent Changes
- ${CHANGE_DESCRIPTION} - ${TIME}
### Upcoming Maintenance
- ${MAINTENANCE} - ${SCHEDULED_TIME}
### Watchlist
- ${ITEM_TO_MONITOR}
### Notes
- ${ADDITIONAL_CONTEXT}
Common Runbooks
Node NotReady
bash
# 1. Check node status
kubectl describe node ${NODE_NAME}
# 2. Check kubelet (SSH to node)
journalctl -u kubelet -n 100 --no-pager
# 3. Check resources
kubectl top node ${NODE_NAME}
# 4. If unrecoverable
kubectl cordon ${NODE_NAME}
kubectl drain ${NODE_NAME} --ignore-daemonsets --delete-emptydir-data
Pod CrashLoopBackOff
bash
# 1. Get details
kubectl describe pod ${POD} -n ${NS}
# 2. Check logs
kubectl logs ${POD} -n ${NS}
kubectl logs ${POD} -n ${NS} --previous
# 3. Check events
kubectl get events -n ${NS} --sort-by='.lastTimestamp' | grep ${POD}
# 4. Check resources
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.containers[*].resources}'
High Memory Pressure
bash
# 1. Find top consumers
kubectl top pods -A --sort-by=memory | head -20
# 2. Check evictions
kubectl get pods -A --field-selector=status.phase=Failed | grep Evicted
# 3. Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
# 4. Identify leaks (high restarts)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} restarts={.status.containerStatuses[0].restartCount}{"\n"}{end}' | sort -t= -k2 -rn | head -20
Capacity Planning
Resource Analysis
bash
# Cluster summary
kubectl top nodes --no-headers | awk '{cpu+=$3; mem+=$5} END {print "Avg CPU:", cpu/NR"%", "Avg Mem:", mem/NR"%"}'
# Namespace consumption
kubectl top pods -A --no-headers | awk '{ns[$1]+=$3} END {for(n in ns) print n, ns[n]"m"}' | sort -k2 -rn
# Over-provisioned (requests >> actual)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} req={.spec.containers[0].resources.requests.cpu}{"\n"}{end}'
Capacity Thresholds
- •Green: <60% - Healthy headroom
- •Yellow: 60-80% - Plan scaling
- •Red: >80% - Scale immediately
Maintenance Operations
Node Maintenance
bash
# 1. Cordon (prevent new pods)
kubectl cordon ${NODE}
# 2. Drain (evict pods)
kubectl drain ${NODE} \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=5m
# 3. Perform maintenance...
# 4. Uncordon
kubectl uncordon ${NODE}
Rolling Restart
bash
kubectl rollout restart deployment/${NAME} -n ${NS}
kubectl rollout status deployment/${NAME} -n ${NS}
Certificate Check
bash
kubeadm certs check-expiration
Backup & Recovery
etcd Backup
bash
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \ --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/server.crt \ --key=/etc/kubernetes/pki/etcd/server.key
Velero Backup
bash
velero backup create platform-$(date +%Y%m%d) \ --include-namespaces platform-system \ --ttl 720h velero schedule create daily-platform \ --schedule="0 2 * * *" \ --include-namespaces platform-system
Operational Checklists
Daily
- • Review alerting dashboard
- • Check node health
- • Verify backup completion
- • Review capacity metrics
Weekly
- • Audit resource quotas
- • Check certificate expiry
- • Review pending updates
- • Update runbooks
Monthly
- • Capacity planning review
- • Security patch assessment
- • Cost optimization review
- • Tenant usage reports
Post-Incident Review Template
markdown
## Incident Review: ${INCIDENT_ID}
### Summary
- **Duration**: ${START} - ${END}
- **Severity**: P${LEVEL}
- **Impact**: ${IMPACT_DESCRIPTION}
### Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert fired |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Service restored |
### Root Cause
${ROOT_CAUSE}
### What Went Well
- ${POSITIVE_1}
### What Could Be Improved
- ${IMPROVEMENT_1}
### Action Items
| Action | Owner | Due |
|--------|-------|-----|
| ${ACTION} | ${OWNER} | ${DATE} |
MCP Tools
- •
mcp__flux-operator-mcp__get_kubernetes_resources - •
mcp__flux-operator-mcp__get_kubernetes_logs - •
mcp__flux-operator-mcp__get_kubernetes_metrics