Debugging Patterns -- LSTM Autoencoder Anomaly Detection

Diagnostic Decision Tree

```
Symptom reported
├── Container won't start / restart loop
│   ├── Check: docker logs <container> --since 5m
│   ├── Common: missing model files (FileNotFoundError on preprocessor.joblib or .weights.h5)
│   │   └── Fix: run training first -- docker-compose run --rm anomaly-detection python scripts/train.py
│   └── Common: import error after code change
│       └── Fix: check import paths, rebuild image if dependency changed
│
├── All data flagged as anomalous (100% false positives)
│   ├── Check: per-feature reconstruction error breakdown
│   ├── Common: scaler mismatch (model trained with StandardScaler, inference uses fixed_minmax or vice versa)
│   │   └── Fix: ensure preprocessor.joblib matches the scaler_type in config/data.yaml
│   ├── Common: rate()[5m] warm-up -- first ~9 minutes after stack start produce near-zero rates
│   │   └── Fix: wait for warm-up, self-resolves
│   └── Common: synthetic training distribution doesn't match real Prometheus data
│       └── Fix: compare formulas in train.py generate_synthetic_data vs mock_service/app.py
│
├── Empty Prometheus query results
│   ├── Check: curl -s --get "http://localhost:9090/api/v1/query" --data-urlencode "query=up"
│   ├── Common: Prometheus not running or mock-service not scraped yet
│   ├── Common: wrong query syntax (brackets not URL-encoded)
│   └── Common: time range exceeds 11,000-point limit
│       └── Fix: auto-adjustment in prometheus_client.py handles this, verify step parameter
│
└── Training crashes
    ├── Check: full traceback
    ├── Common: too few data points (len(df) < window_size * 5)
    │   └── Fix: min_rows validation already in train.py, verify Prometheus has enough history
    └── Common: shape mismatch in model (n_features changed)
        └── Fix: retrain from scratch, delete old model files
```
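The "brackets not URL-encoded" branch above is easy to reproduce and fix from Python. A minimal sketch of building a correctly encoded range query, assuming Prometheus at `localhost:9090` and an illustrative metric name (the project's real query logic lives in prometheus_client.py):

```python
from urllib.parse import urlencode

def build_range_query_url(base_url, promql, start, end, step):
    """Build a Prometheus /api/v1/query_range URL with the PromQL
    expression URL-encoded. Raw brackets in e.g. rate(...)[5m] break
    a plain GET, which is why curl needs --data-urlencode."""
    params = urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"

# Hypothetical metric name, for illustration only.
url = build_range_query_url(
    "http://localhost:9090",
    'rate(http_requests_total{job="mock-service"}[5m])',
    1700000000, 1700003600, "15s",
)
print(url)
```

Note that `(end - start) / step` must also stay under Prometheus's 11,000-sample limit per series; the auto-adjustment in prometheus_client.py widens `step` when it would not.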

Per-Feature MSE Analysis

When reconstruction error is high, break it down per feature to find the culprit. First confirm what the saved preprocessor actually expects:

```bash
# Run from the host; executes inside the running anomaly-detection container
docker exec tv-anomaly-detector python -c "
from src.data.preprocessor import DataPreprocessor

preprocessor = DataPreprocessor()
preprocessor.load_scaler('models/preprocessor.joblib')

print('Scaler type:', preprocessor.scaler_type)
print('Features:', preprocessor.feature_columns)
print('Fixed bounds:', preprocessor.fixed_bounds)
"
```

Known Historical Issues

| # | Issue | Root cause | File(s) fixed | Date |
|---|-------|------------|---------------|------|
| 1 | Prometheus data lost on restart | No volume mount | docker-compose.yml | Feb 2026 |
| 2 | Synthetic data ranges wrong | Formulas didn't account for PromQL aggregation | train.py, inference.py | Feb 2026 |
| 3 | Query fails > 7 days | 11K-point limit | prometheus_client.py | Feb 2026 |
| 4 | get_tv_metrics TypeError | Missing queries/step params | prometheus_client.py | Feb 2026 |
| 5 | Training crash on few rows | No min_rows check | train.py | Feb 2026 |
| 6 | 100% false positives | StandardScaler memorized noise | preprocessor.py, data.yaml | Feb 2026 |
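Issue 6 is worth internalizing, because it is the same failure mode as the scaler-mismatch branch in the decision tree. A pure-numpy sketch of the mechanism -- all values and bounds here are illustrative, not the project's actual config:

```python
import numpy as np

# A StandardScaler fit on a near-constant synthetic feature ends up with
# a tiny std, so a modest real-world shift maps to an enormous z-score
# and every window looks anomalous. Fixed min-max bounds keep the same
# shift comfortably inside [0, 1].
rng = np.random.default_rng(42)
synthetic = 0.5 + 0.001 * rng.normal(size=1000)  # training data: noise around 0.5

mean, std = synthetic.mean(), synthetic.std()
real_value = 0.6  # a plausible live Prometheus reading

z_score = (real_value - mean) / std  # StandardScaler-style transform
lo, hi = 0.0, 1.0                    # fixed_minmax-style bounds
minmax = (real_value - lo) / (hi - lo)

print(f"z-score: {z_score:.1f}")  # on the order of 100 stds away -> flagged
print(f"min-max: {minmax:.2f}")   # stays in range -> normal
```

This is why the fix paired preprocessor.py with data.yaml: the scaler type is a contract between training and inference, and fixed bounds make that contract explicit instead of learned from whatever noise the training set happened to contain.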