RCA-HyperShift: Root Cause Analysis with Live Cluster
Systematic root cause analysis with full cluster access for deep investigation.
rca:hypershift vs rca:ci
| Aspect | rca:hypershift | rca:ci |
|---|---|---|
| Access | Full cluster (pods, logs, secrets, configs) | CI logs only |
| Debugging | Real-time with k8s:* skills | Static log analysis |
| State | Can inspect current state | Historical artifacts only |
| When | Have cluster, need deep investigation | No cluster available |
When to Use
- rca:ci was inconclusive
- Need to inspect live pod state, secrets, or configs
- Want to reproduce failure with debugging enabled
- Complex multi-component issues
Prerequisites
Auto-approved: All read operations on hosted clusters are auto-approved. Run each command separately for auto-approve to work.
Create working directory for analysis:
mkdir -p /tmp/kagenti/rca
Set cluster context:
export CLUSTER=<suffix> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
export KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig
Verify connection:
kubectl get nodes
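Before running any checks, it can help to confirm that the kubeconfig file actually exists. The following is a minimal sketch, assuming the path layout shown in the export lines above; the helper names `rca_kubeconfig_path` and `rca_use_cluster` are hypothetical and not part of the skill.

```shell
# Hypothetical helpers; the path layout mirrors the export lines above.

# Print the kubeconfig path for a cluster suffix and an optional tag.
rca_kubeconfig_path() {
  suffix="$1"
  tag="${2:-kagenti-hypershift-custom}"
  printf '%s\n' "$HOME/clusters/hcp/$tag-$suffix/auth/kubeconfig"
}

# Export KUBECONFIG only if the file exists, so a typo in the suffix
# fails here with a clear message instead of a confusing kubectl error.
rca_use_cluster() {
  path=$(rca_kubeconfig_path "$@")
  if [ ! -f "$path" ]; then
    echo "kubeconfig not found: $path" >&2
    return 1
  fi
  export KUBECONFIG="$path"
}
```

Usage would look like `rca_use_cluster <suffix>` followed by `kubectl get nodes`.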
RCA Workflow
┌──────────────────────────────────────────────┐
│ 1. OBSERVE                                   │
│    ├─ Check pod status                       │
│    ├─ Get recent events                      │
│    └─ Review current logs                    │
├──────────────────────────────────────────────┤
│ 2. INSPECT                                   │
│    ├─ Examine failing component              │
│    ├─ Check configs and secrets              │
│    └─ Verify connectivity                    │
├──────────────────────────────────────────────┤
│ 3. REPRODUCE                                 │
│    ├─ Run failing test with verbose output   │
│    ├─ Watch logs in real-time                │
│    └─ Capture exact failure                  │
├──────────────────────────────────────────────┤
│ 4. TRACE                                     │
│    ├─ Follow request through components      │
│    ├─ Identify where failure occurs          │
│    └─ Determine root cause                   │
├──────────────────────────────────────────────┤
│ 5. DOCUMENT                                  │
│    ├─ Root cause with evidence               │
│    ├─ Reproduction steps                     │
│    └─ Fix and verification plan              │
└──────────────────────────────────────────────┘
Phase 1: Observe Current State
Pod Status
Check all pods:
kubectl get pods -A
Find problem pods:
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
Check kagenti-system namespace:
kubectl get pods -n kagenti-system
Check team1 namespace:
kubectl get pods -n team1
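The phase-based field selector above does not catch everything: a pod in CrashLoopBackOff, or one whose containers are not all ready, still reports phase Running. The sketch below filters the plain `kubectl get pods -A --no-headers` output instead. It assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...), and `filter_problem_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints pods that are neither fully ready and Running nor Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
filter_problem_pods() {
  awk '
    {
      split($3, ready, "/")               # READY looks like "1/2"
      not_ready = (ready[1] != ready[2])
      if ($4 == "Completed") next         # finished Jobs are fine
      if ($4 != "Running" || not_ready)
        print $1 "/" $2 ": " $4 " (ready " $3 ")"
    }'
}
```

Usage: `kubectl get pods -A --no-headers | filter_problem_pods`.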
Recent Events
Cluster-wide events:
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Namespace events:
kubectl get events -n kagenti-system --sort-by='.lastTimestamp'
Current Logs
OTEL Collector logs:
kubectl logs -n kagenti-system deployment/otel-collector --tail=100
MLflow logs:
kubectl logs -n kagenti-system deployment/mlflow --tail=100
Agent logs:
kubectl logs -n team1 deployment/weather-service --tail=100
Filter for errors:
kubectl logs -n kagenti-system deployment/otel-collector --tail=500 | grep -iE "error|fail|warn"
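When the error filter returns a lot of lines, a quick count per keyword shows whether a component is mostly warning or actually failing. This is a small sketch using the same three keywords as the grep above; `summarize_log_levels` is a hypothetical helper name.

```shell
# Hypothetical helper: count log lines on stdin by the keywords used in the
# grep above. A line matching several keywords is counted once per keyword.
summarize_log_levels() {
  awk '
    { line = tolower($0) }
    line ~ /error/ { e++ }
    line ~ /fail/  { f++ }
    line ~ /warn/  { w++ }
    END { printf "error=%d fail=%d warn=%d\n", e, f, w }'
}
```

Usage: `kubectl logs -n kagenti-system deployment/otel-collector --tail=500 | summarize_log_levels`.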
Phase 2: Inspect Components
Check Pod Details
Describe failing pod:
kubectl describe pod <pod-name> -n <namespace>
Check container status:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'
Examine Configuration
List ConfigMaps:
kubectl get configmap -n kagenti-system
View specific ConfigMap:
kubectl get configmap otel-collector-config -n kagenti-system -o yaml
List Secrets (check existence, not values):
kubectl get secrets -n kagenti-system
Check secret keys:
kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data}' | jq 'keys'
Decode specific secret value:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d
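Often it is enough to know that a secret key is populated, without printing its value into the terminal or the RCA notes. The sketch below reports only the decoded length in bytes. `secret_value_length` is a hypothetical helper name, and it assumes the key contains no characters that need escaping in a JSONPath expression.

```shell
# Hypothetical helper: print the decoded size in bytes of one secret key,
# so the value itself never reaches the terminal or the RCA notes.
# Assumes the key needs no escaping inside a JSONPath expression.
secret_value_length() {
  ns="$1"; name="$2"; key="$3"
  kubectl get secret "$name" -n "$ns" -o jsonpath="{.data.$key}" \
    | base64 -d | wc -c | tr -d ' '
}
```

Usage: `secret_value_length kagenti-system mlflow-oauth-secret OIDC_CLIENT_ID`. A result of `0` suggests the key is missing or empty.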
Verify Connectivity
Service endpoints:
kubectl get endpoints -n kagenti-system
Test internal connectivity:
kubectl run -it --rm debug --image=curlimages/curl -- \
  curl -v http://mlflow.kagenti-system.svc.cluster.local:5000/health
Check routes (OpenShift):
kubectl get routes -A
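A Service with no endpoint addresses is a common reason for connection failures, and it is easy to miss in the full endpoints listing. The sketch below checks a single Service; `service_has_endpoints` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the Service has at least one ready
# endpoint address. An empty result usually means the Service selector
# matches no ready pods.
service_has_endpoints() {
  ns="$1"; svc="$2"
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "$ns/$svc: $ips"
  else
    echo "$ns/$svc: no ready endpoints" >&2
    return 1
  fi
}
```

Usage: `service_has_endpoints kagenti-system mlflow`.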
Phase 3: Reproduce with Debugging
Run Failing Test
Set environment variables:
export CLUSTER=<suffix> WORKTREE=<worktree> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
Run specific test with verbose output:
KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig \
  .worktrees/$WORKTREE/.github/scripts/local-setup/hypershift-full-test.sh $CLUSTER \
  --include-test --pytest-filter "<test_name>" --pytest-args "-v -s"
Watch Logs in Real-Time
Watch component logs (in separate terminal):
kubectl logs -f -n kagenti-system deployment/otel-collector
Or use stern for multiple pods:
stern -n kagenti-system .
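To capture the exact failure as evidence, the logs around the reproduction can be saved into the working directory created in Prerequisites. This is a minimal sketch: `capture_logs` is a hypothetical helper name, and the 15-minute window is an arbitrary assumption to adjust as needed.

```shell
# Hypothetical helper: save recent logs of one deployment into the working
# directory created in Prerequisites, and print the resulting file path.
capture_logs() {
  ns="$1"; deploy="$2"
  outdir="${3:-/tmp/kagenti/rca}"
  mkdir -p "$outdir" || return 1
  file="$outdir/$ns-$deploy-$(date -u +%Y%m%dT%H%M%SZ).log"
  # --timestamps prefixes each line with its time; --since bounds the window.
  kubectl logs -n "$ns" "deployment/$deploy" --timestamps --since=15m \
    > "$file" || return 1
  echo "$file"
}
```

Usage: run `capture_logs kagenti-system otel-collector` right after the failing test finishes.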
Phase 4: Trace the Failure
Request Flow Analysis
For a typical agent request:
Client → Gateway → Agent → Tool → Agent → Gateway → Client
                     ↓
              OTEL Collector → MLflow
Check each hop:
- Did the request reach the gateway?
- Did the agent receive it?
- Did the tool respond?
- Were traces exported?
- Did MLflow receive them?
Component-Specific Checks
OTEL Collector:
kubectl logs -n kagenti-system deployment/otel-collector | grep -i "span\|trace\|export"
MLflow:
kubectl logs -n kagenti-system deployment/mlflow | grep -i "trace\|experiment\|error"
Agent:
kubectl logs -n team1 deployment/weather-service | grep -i "request\|response\|error"
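To see how far a single request got, the same identifier can be counted in each component's logs. The sketch below only helps if the components actually log that identifier (for example a trace or request ID), which is an assumption here; `count_id_per_component` is a hypothetical helper name.

```shell
# Hypothetical helper: for each "namespace/deployment" argument, count the
# log lines that mention a given ID, to see how far a request got.
# Assumes the components write that ID to their logs at all.
count_id_per_component() {
  id="$1"; shift
  for target in "$@"; do
    ns="${target%%/*}"
    deploy="${target#*/}"
    # grep -c exits non-zero on zero matches but still prints 0.
    n=$(kubectl logs -n "$ns" "deployment/$deploy" --tail=2000 2>/dev/null \
      | grep -c -F -- "$id" || true)
    echo "$target: $n"
  done
}
```

Usage: `count_id_per_component "<id>" team1/weather-service kagenti-system/otel-collector kagenti-system/mlflow`. The first component reporting `0` is a reasonable place to start looking.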
Auth Flow Analysis
Get Keycloak host:
KEYCLOAK_HOST=$(kubectl get route keycloak -n keycloak -o jsonpath='{.spec.host}')
Get OAuth client credentials:
CLIENT_ID=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d)
CLIENT_SECRET=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_SECRET}' | base64 -d)
Test OAuth token exchange:
curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET"
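If the token exchange succeeds but later calls are still rejected, looking at the claims inside the token can narrow things down. The sketch below decodes the payload segment of a JWT. It assumes the `access_token` returned by the call above is a JWT, and `jwt_payload` is a hypothetical helper name. It does not verify the signature.

```shell
# Hypothetical helper: print the JSON payload (claims) of a JWT.
# Decoding only; the signature is NOT verified.
jwt_payload() {
  # Take the second dot-separated segment and map base64url to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

Usage: extract the token with `jq -r .access_token` from the curl response, then run `jwt_payload "$TOKEN" | jq .` to inspect the issuer, audience, and expiry claims.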
Phase 5: Document Findings
## Root Cause Analysis

**Failure**: [Test name / description]
**Cluster**: [cluster suffix]
**Date**: [timestamp]

### Symptoms
- [What failed]
- [Error messages]

### Investigation Steps
1. [Step taken] → [Finding]
2. [Step taken] → [Finding]

### Root Cause
[Clear statement with evidence]

### Evidence
[Relevant log lines or command output]

### Fix
[Proposed solution]

### Verification
[How to confirm fix works]
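The report skeleton can be written straight into the working directory so findings are recorded as the investigation proceeds. This is a minimal sketch; `new_rca_report` and the file naming scheme are hypothetical.

```shell
# Hypothetical helper: write the report skeleton above to a file in the
# working directory, with failure, cluster, and UTC date filled in.
new_rca_report() {
  cluster="$1"; failure="$2"
  outdir="${3:-/tmp/kagenti/rca}"
  mkdir -p "$outdir" || return 1
  file="$outdir/rca-$cluster-$(date -u +%Y%m%d).md"
  cat > "$file" <<EOF
## Root Cause Analysis

**Failure**: $failure
**Cluster**: $cluster
**Date**: $(date -u +%Y-%m-%dT%H:%M:%SZ)

### Symptoms

### Investigation Steps

### Root Cause

### Evidence

### Fix

### Verification
EOF
  echo "$file"
}
```

Usage: `new_rca_report "$CLUSTER" "<test_name> failed"` prints the path of the new file. Running it twice on the same day overwrites the earlier skeleton.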
Quick Reference
| Task | Command |
|---|---|
| Pod status | kubectl get pods -A |
| Pod logs | kubectl logs -n <ns> deployment/<name> |
| Describe pod | kubectl describe pod <name> -n <ns> |
| Get secret | kubectl get secret <name> -n <ns> -o yaml |
| Events | kubectl get events -A --sort-by='.lastTimestamp' |
| ConfigMap | kubectl get configmap <name> -n <ns> -o yaml |
Related Skills
- rca:ci - RCA from CI logs only
- tdd:hypershift - Fix iteration after RCA
- k8s:pods - Pod debugging
- k8s:logs - Log analysis
- k8s:health - Platform health check
- k8s:live-debugging - Iterative debugging
- auth:mlflow-oidc-auth - MLflow OIDC auth debugging
- openshift:debug - Debug OpenShift-specific operators, SCCs, builds
- openshift:routes - Debug route/ingress issues