Debugging Patterns -- LSTM Autoencoder Anomaly Detection

Diagnostic Decision Tree

```
Symptom reported
├── Container won't start / restart loop
│   ├── Check: docker logs <container> --since 5m
│   ├── Common: missing model files (FileNotFoundError on preprocessor.joblib or .weights.h5)
│   │   └── Fix: run training first -- docker-compose run --rm anomaly-detection python scripts/train.py
│   └── Common: import error after code change
│       └── Fix: check import paths, rebuild image if dependency changed
│
├── All data flagged as anomalous (100% false positives)
│   ├── Check: per-feature reconstruction error breakdown
│   ├── Common: scaler mismatch (model trained with StandardScaler, inference uses fixed_minmax or vice versa)
│   │   └── Fix: ensure preprocessor.joblib matches the scaler_type in config/data.yaml
│   ├── Common: rate()[5m] warm-up -- first ~9 minutes after stack start produce near-zero rates
│   │   └── Fix: wait for warm-up, self-resolves
│   └── Common: synthetic training distribution doesn't match real Prometheus data
│       └── Fix: compare formulas in train.py generate_synthetic_data vs mock_service/app.py
│
├── Empty Prometheus query results
│   ├── Check: curl -s --get "http://localhost:9090/api/v1/query" --data-urlencode "query=up"
│   ├── Common: Prometheus not running or mock-service not scraped yet
│   ├── Common: wrong query syntax (brackets not URL-encoded)
│   └── Common: time range exceeds 11,000-point limit
│       └── Fix: auto-adjustment in prometheus_client.py handles this, verify step parameter
│
└── Training crashes
    ├── Check: full traceback
    ├── Common: too few data points (len(df) < window_size * 5)
    │   └── Fix: min_rows validation already in train.py, verify Prometheus has enough history
    └── Common: shape mismatch in model (n_features changed)
        └── Fix: retrain from scratch, delete old model files
```
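The "brackets not URL-encoded" branch above is easy to reproduce and fix from Python. A minimal sketch of building a correctly encoded range query, assuming Prometheus at `localhost:9090` and an illustrative metric name (the project's real query logic lives in prometheus_client.py):

```python
from urllib.parse import urlencode

def build_range_query_url(base_url, promql, start, end, step):
    """Build a Prometheus /api/v1/query_range URL with the PromQL
    expression URL-encoded. Raw brackets in e.g. rate(...)[5m] break
    a plain GET, which is why curl needs --data-urlencode."""
    params = urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"

# Hypothetical metric name, for illustration only.
url = build_range_query_url(
    "http://localhost:9090",
    'rate(http_requests_total{job="mock-service"}[5m])',
    1700000000, 1700003600, "15s",
)
print(url)
```

Note that `(end - start) / step` must also stay under Prometheus's 11,000-sample limit per series; the auto-adjustment in prometheus_client.py widens `step` when it would not.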

Per-Feature MSE Analysis

When reconstruction error is high, break it down per feature to find the culprit. First confirm what the saved preprocessor actually expects:

```bash
# Run from the host; executes inside the running anomaly-detection container
docker exec tv-anomaly-detector python -c "
from src.data.preprocessor import DataPreprocessor

preprocessor = DataPreprocessor()
preprocessor.load_scaler('models/preprocessor.joblib')

print('Scaler type:', preprocessor.scaler_type)
print('Features:', preprocessor.feature_columns)
print('Fixed bounds:', preprocessor.fixed_bounds)
"
```

Known Historical Issues

| # | Issue | Root cause | File(s) fixed | Date |
|---|-------|------------|---------------|------|
| 1 | Prometheus data lost on restart | No volume mount | docker-compose.yml | Feb 2026 |
| 2 | Synthetic data ranges wrong | Formulas didn't account for PromQL aggregation | train.py, inference.py | Feb 2026 |
| 3 | Query fails > 7 days | 11K-point limit | prometheus_client.py | Feb 2026 |
| 4 | get_tv_metrics TypeError | Missing queries/step params | prometheus_client.py | Feb 2026 |
| 5 | Training crash on few rows | No min_rows check | train.py | Feb 2026 |
| 6 | 100% false positives | StandardScaler memorized noise | preprocessor.py, data.yaml | Feb 2026 |
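Issue 6 is worth internalizing, because it is the same failure mode as the scaler-mismatch branch in the decision tree. A pure-numpy sketch of the mechanism -- all values and bounds here are illustrative, not the project's actual config:

```python
import numpy as np

# A StandardScaler fit on a near-constant synthetic feature ends up with
# a tiny std, so a modest real-world shift maps to an enormous z-score
# and every window looks anomalous. Fixed min-max bounds keep the same
# shift comfortably inside [0, 1].
rng = np.random.default_rng(42)
synthetic = 0.5 + 0.001 * rng.normal(size=1000)  # training data: noise around 0.5

mean, std = synthetic.mean(), synthetic.std()
real_value = 0.6  # a plausible live Prometheus reading

z_score = (real_value - mean) / std  # StandardScaler-style transform
lo, hi = 0.0, 1.0                    # fixed_minmax-style bounds
minmax = (real_value - lo) / (hi - lo)

print(f"z-score: {z_score:.1f}")  # on the order of 100 stds away -> flagged
print(f"min-max: {minmax:.2f}")   # stays in range -> normal
```

This is why the fix paired preprocessor.py with data.yaml: the scaler type is a contract between training and inference, and fixed bounds make that contract explicit instead of learned from whatever noise the training set happened to contain.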