K8s Diagnostic Routine
Use this skill to diagnose common failures in a Kubernetes home lab environment.
Diagnostic Steps
- •Check Pod Status:
powershell
kubectl get pods -n <namespace>
- •Describe for Events: Use this for
CrashLoopBackOfforPendingpods.powershellkubectl describe pod <pod-name> -n <namespace>
- •Inspect Logs: Check both current and previous logs if a pod is crashing.
powershell
kubectl logs <pod-name> -n <namespace> --tail=50 kubectl logs <pod-name> -n <namespace> --previous
- •Node Affinity & Resources: For
Pendingpods, check node resources and affinities (especially for GPU/NVIDIA nodes).powershellkubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com/gpu"
Advanced Observability
If pod-level diagnostics are insufficient, use the cluster's observability stack:
Prometheus (Metrics & Querying)
- •Access: Available on the local network via its
HTTPRoutehostname. - •Proxy: If the hostname is unreachable, use a port-forward:
powershell
kubectl port-forward -n monitoring svc/prometheus-k8s 9090
- •Usage: Query for resource pressure, networking issues, or specific application metrics to identify trends or correlates for failures.
Loki (Log Aggregation)
- •Usage: Use Loki when
kubectl logs(even with--previous) is insufficient or when you need to correlate logs across multiple replicas or services. - •Access: Typically accessed via the Grafana dashboard.
Common Failure Patterns
- •NFS Provisioning Failure (
mount.nfs: Cannot allocate memory):- •Symptoms: Pods stuck in
ContainerCreatingor PVCs stuck inPendingwith events showing "failed to provision volume ... mount failed: exit status 32 ... Cannot allocate memory". - •Rectification: Restart the storage-nfs components:
powershell
kubectl rollout restart ds -n storage-nfs kubectl rollout restart deploy -n storage-nfs
- •Symptoms: Pods stuck in
- •Generic MountVolume.SetUp failed: Check NFS server availability or PVC name mismatches.
- •ImagePullBackOff: Verify the tag exists and Renovate bot hasn't pushed a non-existent version.
- •OOMKilled: Compare
kubectl top podresults with theresources.limitsdefined in the manifest.