Kubernetes Operations
Purpose
Operating Kubernetes clusters in production requires mastery of resource management, scheduling patterns, networking architecture, storage strategies, security hardening, and autoscaling. This skill provides operations-first frameworks for right-sizing workloads, implementing high-availability patterns, securing clusters with RBAC and Pod Security Standards, and systematically troubleshooting common failures.
Use this skill when deploying applications to Kubernetes, configuring cluster resources, implementing NetworkPolicies for zero-trust security, setting up autoscaling (HPA, VPA, KEDA), managing persistent storage, or diagnosing operational issues like CrashLoopBackOff or resource exhaustion.
When to Use This Skill
Common Triggers:
- "Deploy my application to Kubernetes"
- "Configure resource requests and limits"
- "Set up autoscaling for my pods"
- "Implement NetworkPolicies for security"
- "My pod is stuck in Pending/CrashLoopBackOff"
- "Configure RBAC with least privilege"
- "Set up persistent storage for my database"
- "Spread pods across availability zones"
Operations Covered:
- Resource management (CPU/memory, QoS classes, quotas)
- Advanced scheduling (affinity, taints, topology spread)
- Networking (NetworkPolicies, Ingress, Gateway API)
- Storage operations (StorageClasses, PVCs, CSI)
- Security hardening (RBAC, Pod Security Standards, policies)
- Autoscaling (HPA, VPA, KEDA, cluster autoscaler)
- Troubleshooting (systematic debugging playbooks)
Resource Management
Quality of Service (QoS) Classes
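Kubernetes assigns each pod a QoS class based on the resource requests and limits of its containers. The class influences which pods are evicted first when a node runs short of resources: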
Guaranteed (Highest Priority):
- Requests equal limits for CPU and memory in every container
- Last to be evicted under node pressure; a container is still killed if it exceeds its memory limit
- Use for critical production services
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # Same as request
    cpu: "500m"
Burstable (Medium Priority):
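- At least one container sets a request or limit, but the pod does not meet the Guaranteed criteria (for example, requests lower than limits, or only requests set)
- Can burst above requests, up to the limits
- Evicted before Guaranteed pods under node pressure
- Use for web servers, most applications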
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # 2x request
    cpu: "500m"
BestEffort (Lowest Priority):
- No requests or limits set
- First to be evicted under pressure
- Use only for development/testing
Decision Framework: Which QoS Class?
| Workload Type | QoS Class | Configuration |
|---|---|---|
| Critical API/Database | Guaranteed | requests == limits |
| Web servers, services | Burstable | limits 1.5-2x requests |
| Batch jobs | Burstable | Low requests, high limits |
| Dev/test environments | BestEffort | No requests or limits |
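The classification rules above can be sketched in a few lines of Python. This is a simplified illustration, not the kubelet's implementation: the function name `qos_class` and the dictionary shape for containers are assumptions made for the example, and quantities are compared as literal strings (so `"1"` and `"1000m"` would not be treated as equal).

```python
from typing import Dict, List

# Hypothetical container shape: {"requests": {...}, "limits": {...}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod from its containers' resources (simplified sketch)."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A limit without a request implies request == limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]))  # Burstable
```

Note that a single container without limits is enough to move the whole pod out of the Guaranteed class.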
Resource Quotas and LimitRanges
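Enforce multi-tenancy with ResourceQuotas (aggregate limits per namespace) and LimitRanges (per-container defaults and bounds). The ResourceQuota below caps what one team's namespace can consume in total: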
# ResourceQuota: Namespace-level limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
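To make the quota arithmetic concrete, the following Python sketch checks whether a new pod would push a namespace past its hard limits. It is an illustration under simplifying assumptions, not the API server's admission logic: the helper names (`cpu_millicores`, `memory_bytes`, `exceeded_keys`) are hypothetical, and only millicore CPU values and binary memory suffixes are parsed.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity to millicores: '500m' -> 500, '10' -> 10000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity to bytes; only binary suffixes are handled."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return round(float(quantity[:-2]) * factor)
    return int(quantity)


def _parser(key: str):
    if key == "pods":
        return int
    return memory_bytes if key.endswith("memory") else cpu_millicores


def exceeded_keys(hard: dict, used: dict, pod: dict) -> list:
    """Return the quota keys the pod would exceed (empty list means it fits)."""
    result = []
    for key, limit in hard.items():
        parse = _parser(key)
        if parse(used.get(key, "0")) + parse(pod.get(key, "0")) > parse(limit):
            result.append(key)
    return result


hard = {"requests.cpu": "10", "requests.memory": "20Gi", "pods": "50"}
used = {"requests.cpu": "9500m", "requests.memory": "10Gi", "pods": "49"}
print(exceeded_keys(hard, used, {"requests.cpu": "750m", "pods": "1"}))  # ['requests.cpu']
```

Once a quota covers CPU or memory, pods that omit the corresponding requests or limits can be rejected at creation, which is one reason a LimitRange with defaults is usually deployed alongside the quota.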
For detailed resource management patterns including Vertical Pod Autoscaler (VPA), see references/resource-management.md.
Advanced Scheduling
Node Affinity
Control which nodes pods schedule on with required (hard) or preferred (soft) constraints:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge  # GPU instance
Taints and Tolerations
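Reserve nodes for specific workloads. Taints repel pods from a node unless the pod carries a matching toleration (the inverse of affinity, which attracts pods). A toleration only permits scheduling onto the tainted node; pair it with node affinity to steer the workload there: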
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule

# Pod tolerates GPU taint
tolerations:
- key: "workload"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
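The matching between a toleration and a taint follows a few simple rules, sketched below in Python. This is a simplified model for illustration only: the function names and plain-dictionary inputs are assumptions, and `tolerationSeconds` is ignored.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and matches every taint.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod can be placed if every hard taint on the node is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint["effect"] in ("NoSchedule", "NoExecute")  # PreferNoSchedule is soft
    )


gpu_taint = {"key": "workload", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "workload", "operator": "Equal",
                  "value": "gpu", "effect": "NoSchedule"}
print(schedulable([], [gpu_taint]))                # False
print(schedulable([gpu_toleration], [gpu_taint]))  # True
```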
Topology Spread Constraints
Distribute pods evenly across failure domains (zones, nodes):
topologySpreadConstraints:
- maxSkew: 1  # Max difference in pod count
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-app
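The effect of `maxSkew` is easiest to see as arithmetic: a zone is eligible only if, after placing the new pod there, its matching-pod count exceeds the smallest zone's count by no more than `maxSkew`. The Python sketch below illustrates that rule; the function name `allowed_zones` and the zone names are hypothetical, and node-level filtering is left out.

```python
def allowed_zones(pods_per_zone: dict, max_skew: int = 1) -> list:
    """Return zones where one more matching pod keeps the skew within max_skew."""
    global_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


print(allowed_zones({"zone-a": 2, "zone-b": 2, "zone-c": 1}))  # ['zone-c']
print(allowed_zones({"zone-a": 1, "zone-b": 1, "zone-c": 1}))  # ['zone-a', 'zone-b', 'zone-c']
```

With `whenUnsatisfiable: DoNotSchedule`, a pod stays Pending when this list is empty or when none of the eligible zones has a node with enough free capacity.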
For advanced scheduling patterns including pod priority and preemption, see references/scheduling-patterns.md.
Networking
NetworkPolicies (Zero-Trust Security)
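Start from a default-deny policy, then allow only the traffic each workload needs. NetworkPolicies are enforced by the cluster's network plugin, so they have no effect if the plugin does not support them. Because the first policy below also denies egress, frontend pods need their own egress rule (and usually a DNS allowance) before they can reach the backend: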
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}  # Empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific ingress (frontend → backend)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: production  # Same namespace as the default-deny policy
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
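NetworkPolicies are additive: a pod selected by no policy accepts all traffic, and once any policy selects it, a connection is allowed only if at least one rule permits it. The Python sketch below models that evaluation for ingress within a single namespace. The data shapes are deliberately flattened and hypothetical (selectors are plain label dictionaries, ports are integers), and namespace selectors, IP blocks, and egress are omitted.

```python
def selects(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every pair must be present; {} matches all pods."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies: list, src_labels: dict,
                    dst_labels: dict, port: int) -> bool:
    applicable = [
        p for p in policies
        if "Ingress" in p["policyTypes"] and selects(p["podSelector"], dst_labels)
    ]
    if not applicable:
        return True  # Pods not selected by any policy accept all traffic.
    for policy in applicable:
        for rule in policy.get("ingress", []):
            from_ok = any(selects(peer, src_labels) for peer in rule["from"])
            if from_ok and port in rule["ports"]:
                return True
    return False


default_deny = {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}
allow_frontend = {
    "podSelector": {"app": "backend"},
    "policyTypes": ["Ingress"],
    "ingress": [{"from": [{"app": "frontend"}], "ports": [8080]}],
}
policies = [default_deny, allow_frontend]
print(ingress_allowed(policies, {"app": "frontend"}, {"app": "backend"}, 8080))  # True
print(ingress_allowed(policies, {"app": "other"}, {"app": "backend"}, 8080))     # False
```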
Ingress vs. Gateway API
Ingress (Legacy):
- Widely supported, mature ecosystem
- Limited expressiveness
- Use for existing applications
Gateway API (Modern):
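- Role-oriented design (cluster ops vs. app devs)
- More expressive (HTTPRoute, TCPRoute, TLSRoute)
- Recommended for new applications (core resources such as Gateway and HTTPRoute reached GA in Gateway API v1.0; the API ships as CRDs installed separately from Kubernetes itself)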
# Gateway API example
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
spec:
  parentRefs:
  - name: production-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: backend
      port: 8080
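A `PathPrefix` match compares whole path segments rather than raw string prefixes, so `/api` matches `/api/v1` but not `/apiv2`. The Python sketch below illustrates that behavior; `path_prefix_matches` is a hypothetical helper written for this example, not part of any Gateway API client library.

```python
def path_prefix_matches(prefix: str, request_path: str) -> bool:
    """Match on '/'-separated segments; trailing slashes are ignored."""
    prefix_parts = [part for part in prefix.split("/") if part]
    path_parts = [part for part in request_path.split("/") if part]
    return path_parts[:len(prefix_parts)] == prefix_parts


print(path_prefix_matches("/api", "/api/v1/users"))  # True
print(path_prefix_matches("/api", "/apiv2"))         # False
```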
For detailed networking patterns including service mesh integration, see references/networking.md.
Storage
StorageClasses (Define Performance Tiers)
StorageClasses define storage tiers for different workload needs:
# AWS EBS SSD (high performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iopsPerGB: "50"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
Storage Decision Matrix
| Workload | Performance | Access Mode | Storage Class |
|---|---|---|---|
| Database | High | ReadWriteOnce | SSD (gp3/io2) |
| Shared files | Medium | ReadWriteMany | NFS/EFS |
| Logs (temp) | Low | ReadWriteOnce | Standard HDD |
| ML models | High | ReadOnlyMany | Object storage (S3) |
Access Modes:
- ReadWriteOnce (RWO): Single node read-write (most common)
- ReadOnlyMany (ROX): Multiple nodes read-only
- ReadWriteMany (RWX): Multiple nodes read-write (requires network storage)
For detailed storage operations including volume snapshots and CSI drivers, see references/storage.md.
Security
RBAC (Role-Based Access Control)
Implement least-privilege access with RBAC:
# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding (assign role to user)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: User
  name: jane@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
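RBAC rules are purely additive: a request is allowed if any bound rule matches its API group, resource, and verb, and there is no deny rule. The Python sketch below illustrates that evaluation against the `pod-reader` rule. The function names are assumptions for the example, and `resourceNames` and non-resource URLs are not modeled.

```python
def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    def matches(allowed: list, value: str) -> bool:
        return "*" in allowed or value in allowed

    return (
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
    )


def authorized(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """Allow the request if any rule matches; otherwise it is denied."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


pod_reader = [{
    "apiGroups": [""],
    "resources": ["pods", "pods/log"],
    "verbs": ["get", "list", "watch"],
}]
print(authorized(pod_reader, "", "pods", "list"))      # True
print(authorized(pod_reader, "", "pods", "delete"))    # False
print(authorized(pod_reader, "", "pods/exec", "get"))  # False
```

Subresources such as `pods/log` must be listed explicitly; granting `pods` alone does not cover them.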
Pod Security Standards
Enforce secure pod configurations at the namespace level:
# Namespace with Restricted PSS (most secure)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Pod Security Levels:
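- Restricted: Most secure; enforces current pod-hardening practices such as running as non-root, dropping all capabilities, and disallowing privilege escalation (use for applications)
- Baseline: Minimally restrictive, prevents known privilege escalations
- Privileged: Unrestricted (only for system workloads)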
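As a rough pre-flight check, the Python sketch below tests a container's `securityContext` against a subset of the Restricted requirements. It is an illustration only: the function name is hypothetical, it covers just five checks, and it looks only at container-level fields even though `runAsNonRoot` and `seccompProfile` may also be set at the pod level.

```python
def restricted_violations(container: dict) -> list:
    """Return messages for a subset of Restricted-level checks that fail."""
    sc = container.get("securityContext", {})
    violations = []
    if sc.get("privileged"):
        violations.append("privileged must not be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if sc.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        violations.append("capabilities.drop must include ALL")
    if sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


hardened = {
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    }
}
print(restricted_violations(hardened))  # []
```

Setting the `warn` and `audit` labels before `enforce` is a common way to surface such violations without blocking deployments.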
For detailed security patterns including policy enforcement (Kyverno/OPA) and secrets management, see references/security.md.
Autoscaling
Horizontal Pod Autoscaler (HPA)
Scale pod replicas based on CPU, memory, or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Percent of the pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
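The HPA's core calculation is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured replica bounds and skipped when the ratio is close to 1. The Python sketch below works through that arithmetic for the 70% CPU target; the function name is hypothetical, the 10% tolerance is the commonly documented default, and stabilization windows and unready pods are not modeled.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # Close enough to target: no scaling action.
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


print(desired_replicas(4, 90, 70, 2, 10))   # 6
print(desired_replicas(4, 72, 70, 2, 10))   # 4
print(desired_replicas(8, 140, 70, 2, 10))  # 10
```

Because utilization is measured against CPU requests, pods without a CPU request cannot be scaled on this metric.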
KEDA (Event-Driven Autoscaling)
Scale based on events beyond CPU/memory (queues, cron schedules, Prometheus metrics):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaler
spec:
  scaleTargetRef:
    name: message-processor
  minReplicaCount: 0  # Scale to zero when queue empty
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      mode: QueueLength
      value: "10"  # Target of about 10 messages per replica
      hostFromEnv: RABBITMQ_HOST  # Assumed env var on the target holding the AMQP URL
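With a queue-length target of 10, the replica count roughly tracks the backlog divided by 10, bounded by the minimum and maximum replica counts. The Python sketch below shows that approximation; `queue_replicas` is a hypothetical helper, and real behavior also depends on KEDA's polling interval, cooldown period, and the HPA it manages.

```python
import math


def queue_replicas(messages: int, target_per_replica: int = 10,
                   min_replicas: int = 0, max_replicas: int = 30) -> int:
    if messages <= 0:
        return min_replicas  # Empty queue: fall back to the minimum (zero here).
    wanted = math.ceil(messages / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(queue_replicas(0))     # 0
print(queue_replicas(25))    # 3
print(queue_replicas(1000))  # 30
```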
Autoscaling Decision Matrix
| Scenario | Use HPA | Use VPA | Use KEDA | Use Cluster Autoscaler |
|---|---|---|---|---|
| Stateless web app with traffic spikes | ✅ | ❌ | ❌ | Maybe |
| Single-instance database | ❌ | ✅ | ❌ | Maybe |
| Queue processor (event-driven) | ❌ | ❌ | ✅ | Maybe |
| Pods pending (insufficient nodes) | ❌ | ❌ | ❌ | ✅ |
For detailed autoscaling patterns including VPA and cluster autoscaler configuration, see references/autoscaling.md.
Troubleshooting
Common Pod Issues
Pod Stuck in Pending:
kubectl describe pod <pod-name>
# Common causes:
# - Insufficient CPU/memory: Reduce requests or add nodes
# - Node selector mismatch: Fix nodeSelector or add labels
# - PVC not bound: Create PVC or fix name
# - Taint intolerance: Add toleration or remove taint
CrashLoopBackOff:
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Check previous crash
# Common causes:
# - Application crash: Fix code or configuration
# - Missing environment variables: Add to deployment
# - Liveness probe failing: Increase initialDelaySeconds
# - OOMKilled: Increase memory limit or fix leak
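CrashLoopBackOff means the kubelet is waiting before the next restart attempt. The documented default delay doubles from 10 seconds up to a cap of five minutes, which the Python sketch below reproduces; `restart_delays` is a hypothetical helper for illustration.

```python
def restart_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Seconds waited before each successive restart attempt."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a pod that has been crashing for a while can appear idle for minutes after a fix is deployed; deleting the pod skips the remaining wait.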
ImagePullBackOff:
kubectl describe pod <pod-name>
# Common causes:
# - Image doesn't exist: Fix image name/tag
# - Authentication required: Create imagePullSecrets
# - Network issues: Check NetworkPolicies, firewall rules
Service Not Accessible:
kubectl get endpoints <service-name>  # Should list pod IPs
# If endpoints empty:
# - Service selector doesn't match pod labels
# - Pods aren't ready (readiness probe failing)
# - Check NetworkPolicies blocking traffic
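The first two causes of an empty endpoint list can be reasoned about as a simple filter: a pod becomes an endpoint only if it carries every label in the Service selector and reports ready. The Python sketch below models that filter; the pod dictionary shape, IPs, and labels are made up for the example.

```python
def ready_endpoints(selector: dict, pods: list) -> list:
    """Return IPs of pods that match the selector and are ready."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"ip": "10.0.0.1", "ready": True, "labels": {"app": "web", "tier": "frontend"}},
    {"ip": "10.0.0.2", "ready": False, "labels": {"app": "web"}},
    {"ip": "10.0.0.3", "ready": True, "labels": {"app": "webapp"}},
]
print(ready_endpoints({"app": "web"}, pods))  # ['10.0.0.1']
```

Comparing the Service selector with the pod labels side by side usually finds a mismatch faster than inspecting network configuration.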
For systematic troubleshooting playbooks including networking and storage issues, see references/troubleshooting.md.
Reference Documentation
Deep Dives
- references/resource-management.md - Resource requests/limits, QoS classes, ResourceQuotas, VPA
- references/scheduling-patterns.md - Node affinity, taints/tolerations, topology spread, priority
- references/networking.md - NetworkPolicies, Ingress, Gateway API, service mesh integration
- references/storage.md - StorageClasses, PVCs, CSI drivers, volume snapshots
- references/security.md - RBAC, Pod Security Standards, policy enforcement, secrets
- references/autoscaling.md - HPA, VPA, KEDA, cluster autoscaler configuration
- references/troubleshooting.md - Systematic debugging playbooks for common failures
Examples
- examples/manifests/ - Copy-paste ready YAML manifests
- examples/python/ - Automation scripts (audit, cost analysis, validation)
- examples/go/ - Operator development examples
Tools
- scripts/validate-resources.sh - Audit pods without resource limits
- scripts/audit-networkpolicies.sh - Find namespaces without NetworkPolicies
- scripts/cost-analysis.sh - Resource cost breakdown by namespace
Related Skills
- building-ci-pipelines - Deploy to Kubernetes from CI/CD (kubectl apply, Helm, GitOps)
- observability - Monitor clusters and workloads (Prometheus, Grafana, tracing)
- secret-management - Secure secrets in Kubernetes (External Secrets, Sealed Secrets)
- testing-strategies - Test manifests and deployments (Kubeval, Conftest, Kind)
- infrastructure-as-code - Provision Kubernetes clusters (Terraform, Cluster API)
- gitops-workflows - Declarative cluster management (Flux, ArgoCD)
Best Practices Summary
Resource Management:
- Always set CPU/memory requests and limits
- Use VPA for automated rightsizing
- Implement resource quotas per namespace
- Monitor actual usage vs. requests
Scheduling:
- Use topology spread constraints for high availability
- Apply taints for workload isolation (GPU, spot instances)
- Set pod priority for critical workloads
Networking:
- Implement NetworkPolicies with default-deny
- Use Gateway API for new applications
- Apply rate limiting at ingress layer
Storage:
- Use CSI drivers (not legacy provisioners)
- Define StorageClasses per performance tier
- Enable volume snapshots for stateful apps
Security:
- Enforce Pod Security Standards (Restricted for apps)
- Implement RBAC with least privilege
- Use policy engines for guardrails (Kyverno/OPA)
- Scan images for vulnerabilities
Autoscaling:
- Use HPA for stateless workloads
- Use KEDA for event-driven workloads
- Enable cluster autoscaler with limits
- Set PodDisruptionBudgets to prevent over-disruption