Platform Health Check Skill
This skill helps you perform comprehensive platform health checks and identify issues quickly.
When to Use
- After deployments or cluster restarts
- Before making changes (baseline health)
- During incident investigation
- Regular health monitoring
- After running tests
- User requests "check platform" or "is everything working"
What This Skill Does
- Quick Health Overview: One-command platform status
- ArgoCD Apps: Health and sync status of all applications
- Pod Health: Check pods across all namespaces
- Service Accessibility: Test Gateway routes and certificates
- Resource Usage: CPU/memory consumption
- Component-Specific Checks: Detailed validation per component
Quick Health Check
Comprehensive Platform Status
bash
# Single command for full platform health (includes pytest tests)
./scripts/platform-status.sh

# What it checks:
# ✓ ArgoCD applications (health & sync status)
# ✓ Platform pods (all namespaces)
# ✓ Gateway & certificates
# ✓ Istio mTLS configuration
# ✓ Service accessibility (via Gateway)
# ✓ OAuth authentication
# ✓ Integration tests (pytest)
Expected Output:
code
=== ArgoCD Applications Status ===
✓ gateway-api: Healthy, Synced
✓ cert-manager: Healthy, Synced
✓ istio-base: Healthy, Synced
...

=== Platform Pods ===
observability   grafana-xxx      2/2   Running
observability   prometheus-xxx   2/2   Running
...

=== Gateway & Certificates ===
✓ external-gateway: Programmed
✓ grafana-cert: Ready
...

=== Integration Tests ===
PASSED tests/validation/test_app_state.py::test_critical_apps
...
Quick Status Commands
bash
# ArgoCD apps summary
argocd app list --port-forward --port-forward-namespace argocd --grpc-web

# All pods summary
kubectl get pods -A

# Failing pods only
kubectl get pods -A | grep -vE "Running|Completed"

# Service endpoints
kubectl get svc -A

# Gateway status
kubectl get gateway -A

# Certificate status
kubectl get certificate -A
Detailed Health Checks
1. ArgoCD Application Health
bash
# List all apps with health status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  -o json | jq -r '.[] | "\(.metadata.name): \(.status.health.status), \(.status.sync.status)"'

# Check for unhealthy apps
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  | grep -E "Degraded|OutOfSync|Unknown|Missing"

# Get details for specific app
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check app sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
Expected States:
- Health: Healthy (✓), Progressing (⚠️), Degraded (❌), Missing (❌)
- Sync: Synced (✓), OutOfSync (⚠️)
Critical Apps (must be Healthy):
- gateway-api
- cert-manager
- istio-base, istiod
- tekton-pipelines
- keycloak
- kagenti-operator, kagenti-platform-operator
- kagenti-platform
- kagenti-ui
Optional Apps (can be Progressing):
- observability (large images, slow startup)
- kiali
- ollama
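The critical/optional split can also be checked mechanically. The following is a minimal sketch, assuming its input is the `name: health, sync` lines produced by the `jq` command in this section. The function name `check_critical_apps`, the `MISSING`/`UNHEALTHY` labels, and the choice to treat an absent app as a failure are illustrative assumptions, not part of `platform-status.sh`.

```shell
# Hypothetical helper: reads "<name>: <health>, <sync>" lines on stdin and
# reports critical apps that are absent or not Healthy. Optional apps
# (observability, kiali, ollama) are ignored. Exits non-zero on any finding.
check_critical_apps() {
  awk -v critical="gateway-api cert-manager istio-base istiod tekton-pipelines keycloak kagenti-operator kagenti-platform-operator kagenti-platform kagenti-ui" '
    BEGIN { n = split(critical, names, " ") }
    {
      name = $1;   sub(/:$/, "", name)     # strip trailing colon
      health = $2; sub(/,$/, "", health)   # strip trailing comma
      seen[name] = health
    }
    END {
      bad = 0
      for (i = 1; i <= n; i++) {
        app = names[i]
        if (!(app in seen)) {
          print "MISSING " app; bad = 1
        } else if (seen[app] != "Healthy") {
          print "UNHEALTHY " app " (" seen[app] ")"; bad = 1
        }
      }
      exit bad
    }'
}
```

To use it, pipe the output of the "List all apps with health status" command into `check_critical_apps`. An empty output with exit code 0 means every critical app reported Healthy.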
2. Pod Health by Namespace
bash
# All pods with status
kubectl get pods -A -o wide

# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Pods with issues
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Specific namespace health
kubectl get pods -n observability
kubectl get pods -n keycloak
kubectl get pods -n kagenti-system
Check for these statuses:
- ❌ CrashLoopBackOff: Application crashes on startup
- ❌ ImagePullBackOff: Image not available
- ❌ Error: Container exited with error
- ⚠️ Pending: Waiting for resources or scheduling
- ⚠️ Init: Init containers still running
- ✓ Running: Pod healthy
- ✓ Completed: Job finished successfully
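The status list above maps naturally onto a small classifier. This is a sketch only, assuming the input is `kubectl get pods -A --no-headers` output, where the fourth column is STATUS. The function name `classify_pods` and the `OK`/`WARN`/`FAIL`/`UNKNOWN` labels are hypothetical. Statuses not named in the list are reported as `UNKNOWN` rather than guessed at.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints "<severity> <namespace>/<pod> <status>" for each pod.
classify_pods() {
  awk '{
    status = $4   # columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE
    if (status ~ /(CrashLoopBackOff|ImagePullBackOff|Error)$/)   sev = "FAIL"
    else if (status == "Pending" || status ~ /^Init:/)            sev = "WARN"
    else if (status == "Running" || status == "Completed")        sev = "OK"
    else                                                          sev = "UNKNOWN"
    print sev, $1 "/" $2, status
  }'
}
```

For example, `kubectl get pods -A --no-headers | classify_pods | grep -v '^OK'` would show only the pods that need attention.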
3. Service Accessibility
bash
# Test all platform services via Gateway
for service in grafana prometheus tempo phoenix kiali keycloak kagenti; do
  echo "=== Testing https://$service.localtest.me:9443/ ==="
  curl -k -I -m 5 "https://$service.localtest.me:9443/" 2>&1 | head -3
  echo
done

# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default

# Check HTTPRoutes
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>

# Check service endpoints (should have IP addresses)
kubectl get endpoints -A | grep -v "<none>"
Expected Results:
- Grafana: HTTP/2 302 (redirect to /login)
- Prometheus: HTTP/2 302 (OAuth redirect)
- Keycloak: HTTP/2 200
- Kagenti UI: HTTP/2 200
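The expected codes above can be compared against what `curl -I` returns instead of being read by eye. The sketch below assumes the status line is the first line of the response headers, as in `HTTP/2 302`. The helper names `http_status` and `expect_status` are hypothetical and do not come from the platform scripts.

```shell
# Hypothetical helper: extracts the numeric status code from the first line of
# `curl -I` output read on stdin (e.g. "HTTP/2 302" -> 302).
http_status() {
  awk 'NR == 1 { print $2 }' | tr -d '\r'
}

# Hypothetical helper: compares an observed status code with the expected one.
# Usage: expect_status <service> <expected-code> <observed-code>
expect_status() {
  if [ "$3" = "$2" ]; then
    echo "OK $1: $3"
  else
    echo "FAIL $1: expected $2, got ${3:-no response}"
    return 1
  fi
}
```

A possible invocation: `code="$(curl -k -s -I -m 5 "https://grafana.localtest.me:9443/" | http_status)"; expect_status grafana 302 "$code"`.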
4. Certificate Health
bash
# All certificates status
kubectl get certificate -A

# Check certificate details
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs for issues
kubectl logs -n cert-manager deployment/cert-manager --tail=50

# Verify certificate expiration
kubectl get certificate -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter)"'
Expected State: All certificates show Ready=True
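To turn the Ready=True expectation into a pass/fail check, something like the following could be used. It is a sketch that assumes the default column layout of `kubectl get certificate -A --no-headers` (NAMESPACE, NAME, READY, SECRET, AGE). The function name `certs_not_ready` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get certificate -A --no-headers` on stdin
# and prints every certificate whose READY column is not "True".
# Exits non-zero if at least one such certificate is found.
certs_not_ready() {
  awk '$3 != "True" { print $1 "/" $2 " (READY=" $3 ")"; bad = 1 }
       END { exit bad + 0 }'
}
```

Usage: `kubectl get certificate -A --no-headers | certs_not_ready`. No output and exit code 0 means all certificates are Ready.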
5. Istio Service Mesh Health
bash
# Check Istio components
kubectl get pods -n istio-system

# Verify sidecar injection (should show 2/2 containers)
kubectl get pods -A -o wide | grep "2/2"

# Check mTLS policies
kubectl get peerauthentication -A
kubectl get destinationrule -A

# Istio proxy status
istioctl proxy-status

# Check specific pod mesh config
istioctl x describe pod <pod-name> -n <namespace>
6. Resource Usage
bash
# Node resources
kubectl top nodes

# Cluster-wide pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20

# Namespace resource usage
kubectl top pods -n observability
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system

# Check for resource pressure (prints each node's MemoryPressure/DiskPressure condition status)
kubectl get nodes -o json | jq -r '.items[] | .metadata.name as $node | .status.conditions[] | select(.type=="MemoryPressure" or .type=="DiskPressure") | "\($node): \(.type)=\(.status)"'
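For a quick "who is over budget" view, the `kubectl top` output can be filtered against a threshold. The sketch below assumes `kubectl top pods -A --no-headers` output with columns NAMESPACE, NAME, CPU, MEMORY, and memory values in the form `850Mi`; rows in any other unit are skipped. The function name `pods_over_memory` and the threshold itself are hypothetical, since the source does not define resource limits.

```shell
# Hypothetical helper: reads `kubectl top pods -A --no-headers` on stdin and
# prints pods whose memory usage is at or above a threshold given in Mi.
# Usage: pods_over_memory <threshold-mi>
pods_over_memory() {
  awk -v limit="$1" '
    $4 ~ /^[0-9]+Mi$/ {
      mem = $4; sub(/Mi$/, "", mem)   # "850Mi" -> "850"
      if (mem + 0 >= limit + 0) print $1 "/" $2, mem "Mi"
    }'
}
```

Usage: `kubectl top pods -A --no-headers | pods_over_memory 512`.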
7. Storage Health
bash
# PersistentVolumes
kubectl get pv

# PersistentVolumeClaims
kubectl get pvc -A

# Check PVC usage via metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100' \
  | python3 -m json.tool
Component-Specific Health Checks
Observability Stack
bash
# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready

# Grafana
kubectl get pods -n observability -l app=grafana
curl -k -I https://grafana.localtest.me:9443/api/health

# Loki
kubectl get pods -n observability -l app=loki
kubectl exec -n observability deployment/grafana -- \
  curl -s http://loki.observability.svc:3100/ready

# Tempo
kubectl get pods -n observability -l app=tempo
kubectl exec -n observability deployment/grafana -- \
  curl -s http://tempo-query-frontend.observability.svc:3100/ready

# Phoenix
kubectl get pods -n observability -l app=phoenix
curl -k -I https://phoenix.localtest.me:9443/

# AlertManager
kubectl get pods -n observability -l app=alertmanager
kubectl exec -n observability deployment/alertmanager -c alertmanager -- \
  wget -qO- http://localhost:9093/-/ready
Authentication & Authorization
bash
# Keycloak
kubectl get pods -n keycloak -l app=keycloak
kubectl exec -n keycloak statefulset/keycloak -- \
  curl -s http://localhost:8080/health/ready | python3 -m json.tool

# OAuth2-Proxy instances
kubectl get pods -n oauth2-proxy
kubectl get deployment -n oauth2-proxy

# Test Keycloak SSO
curl -k "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration"
Platform Components
bash
# Kagenti Operator
kubectl get pods -n kagenti-operator
kubectl logs -n kagenti-operator deployment/kagenti-operator --tail=20

# Kagenti Platform Operator
kubectl get pods -n kagenti-platform-operator
kubectl logs -n kagenti-platform-operator deployment/kagenti-platform-operator --tail=20

# Kagenti UI
kubectl get pods -n kagenti-platform -l app=kagenti-ui
curl -k -I https://kagenti.localtest.me:9443/

# Tekton Pipelines
kubectl get pods -n tekton-pipelines
kubectl get pipelineruns -A
Health Check Checklists
Post-Deployment Health Check
- All ArgoCD apps Healthy and Synced
- No pods in CrashLoopBackOff/ImagePullBackOff
- All services have endpoints
- All certificates Ready
- All Gateway routes Programmed
- Services accessible via browser
- Integration tests passing
- No firing critical alerts
Pre-Change Health Check
- Capture platform snapshot: ./scripts/capture-platform-snapshot.sh before-change
- All critical apps Healthy
- No existing incidents in TODO_INCIDENTS.md
- Resource usage within limits
- Recent Git commits validated
Incident Investigation Health Check
- Identify degraded components
- Check recent events
- Collect logs from affected pods
- Query metrics for anomalies
- Check for correlated failures
- Review recent changes (Git history)
Common Health Issues
Issue: Pods stuck in Pending
bash
# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory
# - No nodes matching nodeSelector
# - Unbound PersistentVolumeClaim
Issue: Pods CrashLoopBackOff
bash
# Check previous logs
kubectl logs <pod-name> -n <namespace> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Common causes:
# - Application error on startup
# - Missing configuration
# - Dependency not available
Issue: Service not accessible
bash
# Check pod status
kubectl get pods -n <namespace> -l app=<service>

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Check HTTPRoute
kubectl get httproute -n <namespace>

# Test from inside cluster
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl http://<service-name>.<namespace>.svc:PORT
Issue: Certificate not Ready
bash
# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

# Common causes:
# - DNS validation failing
# - Rate limit reached
# - Invalid configuration
Issue: High resource usage
bash
# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10

# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory"

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Limits:"
Automation & Monitoring
Continuous Health Monitoring
bash
# Watch pod status
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'

# Watch ArgoCD apps
watch -n 10 'argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -vE "Healthy.*Synced"'

# Monitor specific namespace
watch -n 5 'kubectl get pods -n observability'
Scheduled Health Checks
bash
# Cron job for periodic health checks (local dev)
# Add to crontab: crontab -e
*/15 * * * * /path/to/kagenti-demo-deployment/scripts/platform-status.sh > /tmp/health-$(date +\%Y\%m\%d-\%H\%M).log 2>&1

# Compare snapshots over time
./scripts/capture-platform-snapshot.sh hourly-check
Related Documentation
- CLAUDE.md Platform Status - Monitoring commands
- scripts/platform-status.sh - Automated health check
- TODO_INCIDENTS.md - Active incidents
- docs/INTEGRATION_TESTS.md - Test strategy
Integration with Other Skills
If the health check finds issues:
- Use investigate-incident skill for RCA
- Use check-logs skill to examine error logs
- Use check-metrics skill for performance analysis
- Use check-alerts skill to see if alerts fired
Pro Tips
- Always baseline first: Run health check BEFORE making changes
- Use platform-status.sh: Single command for comprehensive check
- Capture snapshots: Use capture-platform-snapshot.sh for historical comparison
- Check critical apps first: Focus on gateway-api, istio, keycloak, operators
- Look for patterns: Multiple pods failing often indicates cluster-wide issue
- Check Git history: Recent commits may explain new issues
- Verify after fixes: Always re-run health check after remediation
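The "look for patterns" tip can be made concrete by counting unhealthy pods per namespace: failures spread across many namespaces hint at a cluster-wide cause, while a single namespace points to one component. This is a sketch assuming `kubectl get pods -A --no-headers` input. The function name `failing_by_namespace` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints "<count> <namespace>" for pods whose STATUS is neither Running nor
# Completed, highest count first.
failing_by_namespace() {
  awk '$4 != "Running" && $4 != "Completed" { count[$1]++ }
       END { for (ns in count) print count[ns], ns }' | sort -k1,1nr -k2,2
}
```

Usage: `kubectl get pods -A --no-headers | failing_by_namespace`.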