AgentSkillsCN

mape-k-troubleshoot

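Diagnoses and fixes issues in the MAPE-K self-healing loop of x0tta6bl4. Use this skill when the user mentions "self-healing broken", "MAPE-K not working", "node not recovering", "healing loop stuck", "MTTR too high", "auto-recovery failed", or "diagnose self-healing".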

SKILL.md
--- frontmatter
name: mape-k-troubleshoot
description: >
  Diagnoses and fixes issues in the MAPE-K self-healing loop of x0tta6bl4.
  Use when user says "self-healing broken", "MAPE-K not working",
  "node not recovering", "healing loop stuck", "MTTR too high",
  "auto-recovery failed", or "diagnose self-healing".
metadata:
  author: x0tta6bl4
  version: 1.0.0
  category: operations
  tags: [mape-k, self-healing, troubleshooting, autonomic]

MAPE-K Self-Healing Troubleshooting

Overview

x0tta6bl4 uses the MAPE-K (Monitor-Analyze-Plan-Execute over Knowledge) autonomic computing loop for self-healing. Target MTTR is under 3 minutes.

Key files:

  • src/core/mape_k_loop.py - Core MAPE-K implementation
  • src/self_healing/mape_k.py - Self-healing integration
  • src/self_healing/mape_k_integrated.py - Full integrated loop
  • src/core/health.py - Health check providers

Instructions

Step 1: Identify the Stuck Phase

The MAPE-K loop has 4 phases. Determine which phase is failing:

Monitor phase issues:

  • Symptoms: No metrics being collected, stale data
  • Check: src/monitoring/metrics.py, src/monitoring/prometheus_client.py
  • Verify Prometheus scraping is active on port 9090

Analyze phase issues:

  • Symptoms: Anomalies not detected, false positives
  • Check: src/ml/graphsage_anomaly_detector.py
  • Verify anomaly threshold (default 0.6, adjustable)
  • Check if model is in observe-only mode: src/ml/graphsage_observe_mode.py

Plan phase issues:

  • Symptoms: Correct detection but no recovery plan generated
  • Check: Planning logic in src/self_healing/mape_k_integrated.py
  • Verify action policies are not too restrictive

Execute phase issues:

  • Symptoms: Plan generated but not executed
  • Check: Circuit breaker state (may be open after too many failures)
  • Check: SPIFFE identity valid (execution requires authenticated context)

Step 2: Check Health Endpoints

bash
# Overall health
curl -s http://localhost:8080/health

# Detailed status with MAPE-K state
curl -s http://localhost:8080/api/v1/mesh/status

# Prometheus metrics for MAPE-K
# (URL quoted so the shell does not treat "?" as a glob character)
curl -s 'http://localhost:9090/api/v1/query?query=mape_k_cycle_duration_seconds'
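
When these checks are scripted rather than typed, the query string needs to be encoded properly, especially once the query is a full PromQL expression rather than a bare metric name. A minimal sketch, assuming the Prometheus host and port shown above; the helper name `prometheus_query_url` is hypothetical and not part of x0tta6bl4:

```python
from urllib.parse import urlencode


def prometheus_query_url(query: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API.

    `base` is an assumption taken from the curl example; adjust it
    to wherever Prometheus actually listens.
    """
    # urlencode percent-encodes characters such as ( ) [ ] that
    # would otherwise break the URL or confuse the shell.
    return f"{base}/api/v1/query?{urlencode({'query': query})}"


print(prometheus_query_url("mape_k_cycle_duration_seconds"))
```

The resulting string can be passed to `curl` or to `urllib.request.urlopen` unchanged.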

Step 3: Review Logs

bash
# Docker
docker-compose logs --tail=100 app | grep -i "mape\|heal\|anomal"

# Kubernetes
kubectl logs -n x0tta6bl4 deployment/proxy-api --tail=100 | grep -i "mape\|heal\|anomal"

# Local
grep -i "mape\|heal\|anomal" /var/log/x0tta6bl4/app.log

Look for these patterns:

  • MAPE-K cycle completed - Loop is running
  • Anomaly detected - Analysis phase working
  • Recovery plan generated - Planning phase working
  • Executing recovery action - Execute phase working
  • Circuit breaker OPEN - Executions halted (too many failures)
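
The patterns above can be turned into a rough triage helper. The following is a heuristic sketch, not part of the x0tta6bl4 codebase: the function name `first_missing_phase` is hypothetical, and it assumes the log messages contain the marker strings exactly as listed (matched case-insensitively). A missing "Anomaly detected" line can also simply mean that nothing went wrong during the window, so treat the result as a hint about where to look first.

```python
# Marker strings taken from the list above, in loop order.
PHASE_MARKERS = [
    ("analyze", "anomaly detected"),
    ("plan", "recovery plan generated"),
    ("execute", "executing recovery action"),
]


def first_missing_phase(lines):
    """Return the first phase with no evidence in the log, or None."""
    text = "\n".join(lines).lower()
    # An open circuit breaker halts executions regardless of other markers.
    if "circuit breaker open" in text:
        return "execute"
    # No cycle and no phase markers at all: the loop is probably not
    # running or not receiving data, so start with the Monitor phase.
    if "mape-k cycle completed" not in text and not any(
        marker in text for _, marker in PHASE_MARKERS
    ):
        return "monitor"
    for phase, marker in PHASE_MARKERS:
        if marker not in text:
            return phase
    return None


sample = ["MAPE-K cycle completed", "Anomaly detected on node-7"]
print(first_missing_phase(sample))
```

Feed it the output of any of the log commands above, for example by reading `sys.stdin` line by line.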

Step 4: Iterative Fix

Based on the stuck phase, apply fixes:

Fix Monitor Phase

  1. Verify metrics collection:
    python
    from src.monitoring.metrics import MetricsRegistry
    MetricsRegistry.get_all_metrics()  # Should return dict
    
  2. Check Prometheus target is up: http://localhost:9090/targets
  3. Verify health providers are registered in src/core/health.py

Fix Analyze Phase

  1. Check GraphSAGE model state:
    python
    from src.ml.graphsage_anomaly_detector import GraphSAGEAnomalyDetector
    detector = GraphSAGEAnomalyDetector()
    # If model is None, torch not available - falls back to rule-based
    print(f"Model: {detector.model}, Threshold: {detector.anomaly_threshold}")
    
  2. Adjust threshold if too high (missing anomalies) or too low (false positives)
  3. Check if observe mode is stuck: disable with detector.observe_mode = False
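
To see why the threshold matters, it helps to count how many samples would be flagged at different settings. The sketch below uses made-up scores and assumes that a higher score means "more anomalous" and that a sample is flagged when its score is at or above the threshold; the detector's actual comparison may differ.

```python
def flagged(scores, threshold=0.6):
    """Return the scores at or above the threshold (assumed comparison)."""
    return [s for s in scores if s >= threshold]


# Made-up anomaly scores, for illustration only.
scores = [0.15, 0.42, 0.58, 0.61, 0.73, 0.94]

for t in (0.5, 0.6, 0.8):
    print(f"threshold={t}: {len(flagged(scores, t))} flagged")
```

Raising the threshold from 0.6 to 0.8 drops two of the three flagged samples here, which is the "missing anomalies" case; lowering it to 0.5 adds a borderline sample, which is where false positives come from.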

Fix Plan Phase

  1. Verify recovery strategies are registered
  2. Check if action quotas are exhausted
  3. Verify SPIFFE identity for cross-node recovery

Fix Execute Phase

  1. Reset circuit breaker if stuck open:
    python
    # Circuit breaker auto-resets after timeout
    # Force reset by restarting the MAPE-K loop
    
  2. Check SPIFFE certificate expiry
  3. Verify target node is reachable

Step 5: Validate Fix

After applying a fix, verify that the loop completes:

bash
# Watch for a full MAPE-K cycle
curl -s http://localhost:8080/api/v1/mesh/status | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f'MAPE-K status: {data.get(\"mape_k_status\", \"unknown\")}')
print(f'Last cycle: {data.get(\"last_mape_k_cycle\", \"never\")}')
"

If the issue persists, repeat Steps 1 to 4, checking each phase in order until the full loop completes within the 3-minute MTTR target.
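
Instead of re-running the status check by hand, the wait can be automated. This is a minimal sketch, assuming the status endpoint returns JSON with a `last_mape_k_cycle` field whose value changes on every completed cycle (the field name comes from the command above; its format is not specified, so the code only compares values for inequality). The names `fetch_status` and `wait_for_new_cycle` are hypothetical.

```python
import json
import time
import urllib.request


def fetch_status(url="http://localhost:8080/api/v1/mesh/status"):
    """Fetch and decode the mesh status JSON."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_new_cycle(fetch=fetch_status, timeout_s=180.0, interval_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until `last_mape_k_cycle` changes.

    Returns True if a new cycle was observed within timeout_s
    (180 s matches the 3-minute MTTR target), otherwise False.
    """
    baseline = fetch().get("last_mape_k_cycle")
    deadline = clock() + timeout_s
    while clock() < deadline:
        sleep(interval_s)
        current = fetch().get("last_mape_k_cycle")
        if current is not None and current != baseline:
            return True
    return False
```

Calling `wait_for_new_cycle()` right after applying a fix returns `True` once a fresh cycle is reported. The `clock` and `sleep` parameters exist only so the function can be exercised without real waiting.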

Common Issues

Circuit breaker stuck open

Cause: Too many consecutive recovery failures

Solution: Fix the underlying failure first, then wait for circuit breaker half-open timeout or restart the service
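
The reason to fix the underlying failure first becomes clearer with a model of the breaker's states. The class below is a generic closed / open / half-open circuit breaker written for illustration; it is not the x0tta6bl4 implementation, and the failure limit and timeout values are assumptions.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker sketch; thresholds are illustrative."""

    def __init__(self, max_failures=3, reset_timeout_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"  # a single trial action is allowed
        return "open"

    def record_failure(self):
        self.failures += 1
        # A failed trial in half-open, or too many failures while
        # closed, (re)opens the breaker and restarts the timeout.
        if self.state == "half-open" or self.failures >= self.max_failures:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

In this model, a trial action that fails during the half-open window sends the breaker straight back to open, so waiting for the timeout without fixing the root cause only repeats the cycle.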

GraphSAGE model not loading

Cause: torch-geometric not installed or GPU not available

Solution: Falls back to rule-based detection automatically. Install torch-geometric for ML-based detection: pip install torch-geometric

MAPE-K cycle too slow (MTTR > 3 min)

Cause: Analysis phase taking too long, or network latency

Solution: Reduce anomaly detection batch size, increase monitoring frequency
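
To confirm that MTTR really exceeds the target before tuning anything, it can be computed from incident timestamps. A minimal sketch, assuming MTTR is measured as the mean time from detection to recovery and that such timestamp pairs can be extracted from logs or metrics; the sample data is made up.

```python
from datetime import datetime
from statistics import mean

MTTR_TARGET_S = 180  # the 3-minute target from the overview


def mttr_seconds(incidents):
    """Mean of (recovered - detected) in seconds over all incidents."""
    return mean((rec - det).total_seconds() for det, rec in incidents)


# Made-up (detected, recovered) pairs for illustration.
incidents = [
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 2, 0)),  # 120 s
    (datetime(2024, 1, 1, 13, 0, 0), datetime(2024, 1, 1, 13, 5, 0)),  # 300 s
]

value = mttr_seconds(incidents)
print(f"MTTR: {value:.0f} s, within target: {value < MTTR_TARGET_S}")
```

Here one slow recovery pulls the mean to 210 s, above the target, even though the other incident recovered in 2 minutes, so it is worth looking at individual incidents as well as the average.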