Troubleshooting History
Known issues discovered and fixed during development. Check here first when debugging.
Issue 1: Prometheus data loss on stack teardown
Symptom: All Prometheus data disappeared after docker-compose down.
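Cause: No volume was mounted for the Prometheus data directory, so the TSDB lived in the container's writable layer and was deleted along with the container.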
Fix: Added prometheus_data:/prometheus volume in docker-compose.yml.
Issue 2: Synthetic data distribution mismatch
Symptom: Model trained on synthetic data flagged all real Prometheus data as anomalous.
Cause: Synthetic data formulas did not match the mock service's actual output after PromQL aggregation (rate(), histogram_quantile()).
Fix: Derived correct formulas mathematically from mock service source code and verified against live Prometheus. Updated generators in both train.py and inference.py.
Issue 3: Prometheus 11,000-point query limit
Symptom: query_range returned an empty result or an error for 7-day queries at a 30s step (20,160 points, above the 11,000-point limit).
Fix: Added _adjust_step_if_needed() to PrometheusClient that auto-increases step when needed.
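The step adjustment can be sketched as follows. This is a minimal standalone illustration, not the actual PrometheusClient method: the function name without the leading underscore, its parameters, and the MAX_POINTS constant are assumptions made for the example.

```python
import math

# Prometheus rejects range queries that would return more than 11,000 points per series.
MAX_POINTS = 11_000


def adjust_step_if_needed(start_ts, end_ts, step_s, max_points=MAX_POINTS):
    """Return a step in seconds that keeps a range query within max_points.

    Hypothetical helper: timestamps are Unix seconds, step_s is the requested step.
    """
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    span = end_ts - start_ts
    if span < 0:
        raise ValueError("end_ts must not precede start_ts")

    # A range query evaluates at start, start + step, ... up to end.
    points = math.floor(span / step_s) + 1
    if points <= max_points:
        return step_s

    # Widen the step so the number of evaluation points fits under the limit.
    return math.ceil(span / (max_points - 1))


week = 7 * 24 * 3600
print(adjust_step_if_needed(0, week, 30))   # 55
print(adjust_step_if_needed(0, 3600, 30))   # 30 (already within the limit)
```

A wider step lowers the resolution of the returned series, so any code that assumes a fixed 30s spacing would need to read the step actually used.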
Issue 4: StandardScaler generalization failure (critical)
Symptom: Model reconstruction error 1.97 vs threshold 0.41 on Prometheus data. Even fresh synthetic data got 100% false positive rate.
Root cause: StandardScaler memorizes training data's exact mean/std. Any new data (even same distribution, different random seed) with different base_memory or noise produces shifted z-scores. Model only learned one specific distribution.
Diagnosis: Running the same model with a fresh scaler fit on the test data itself gave a 1.1% FP rate. This confirmed that the scaler, not the model, was the problem.
Fix: Implemented fixed_minmax scaler mode with predefined bounds in DataPreprocessor. Also changed stride from 20 to 1 (20x more training samples). Result: training loss 0.0021, Prometheus MSE 0.005, 80.5% headroom below threshold.
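The difference between the two scaling approaches can be illustrated with a small standard-library sketch. The bounds, function names, and sample values below are illustrative assumptions, not the real DataPreprocessor code or measurements from the project.

```python
from statistics import mean, pstdev

# Hypothetical bounds for a single feature (memory in MB).
MEMORY_BOUNDS = (0.0, 1024.0)


def zscore(values, fit_mean, fit_std):
    """Standardize using the mean/std memorized from the training set."""
    return [(v - fit_mean) / fit_std for v in values]


def fixed_minmax(values, bounds, clip=True):
    """Scale to [0, 1] using predefined bounds that do not depend on any dataset."""
    lo, hi = bounds
    if hi <= lo:
        raise ValueError("upper bound must exceed lower bound")
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


# Illustrative runs: same shape and noise, but a different base memory level.
train = [198.0, 200.0, 202.0, 201.0, 199.0]
fresh = [258.0, 260.0, 262.0, 261.0, 259.0]

mu, sigma = mean(train), pstdev(train)
z_fresh = zscore(fresh, mu, sigma)             # dozens of standard deviations from training
mm_train = fixed_minmax(train, MEMORY_BOUNDS)
mm_fresh = fixed_minmax(fresh, MEMORY_BOUNDS)  # small offset on the same fixed scale

print([round(z, 1) for z in z_fresh])
print([round(v, 3) for v in mm_train], [round(v, 3) for v in mm_fresh])
```

With fixed bounds the shift in base memory shows up as a small offset in the scaled input rather than as values far outside anything the model saw in training. The trade-off is that the bounds must be wide enough to cover real data, since clipped values lose information.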
Issue 5: Startup transient false anomaly
Symptom: The detector reports anomalies for roughly the first 9 minutes after a cold start.
Cause: rate(...[5m]) produces artificially low values until Prometheus has collected 5 minutes of scrapes.
Status: Known behavior, self-resolves. The anomaly detector correctly shows "RESOLVED" after the warm-up period.
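The warm-up effect can be approximated with a simplified model. This sketch assumes a counter that grows at a constant rate and divides the observed increase by the full 5-minute window; it deliberately ignores the boundary extrapolation Prometheus applies, so the numbers are indicative only. The function name and parameters are hypothetical.

```python
WINDOW_S = 300  # the 5m range in rate(...[5m])


def approx_rate(per_second_increase, seconds_of_data, window_s=WINDOW_S):
    """Simplified rate during warm-up: observed increase divided by the full window."""
    # Only the part of the window that actually contains samples contributes.
    covered = min(max(seconds_of_data, 0), window_s)
    observed_increase = per_second_increase * covered
    return observed_increase / window_s


# A counter truly increasing by 10 per second, observed at various times after start.
for elapsed in (60, 150, 300, 600):
    print(elapsed, approx_rate(10.0, elapsed))
```

Under these assumptions the reported rate climbs toward the true value as the window fills and stays there once 5 minutes of samples exist.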
Issue 6: get_tv_metrics missing parameters
Symptom: inference.py passed queries parameter that get_tv_metrics didn't accept (latent TypeError).
Fix: Updated get_tv_metrics signature to accept queries and step parameters. Updated callers in both train.py and inference.py.
Issue 7: Keras optimizer loading warning
Symptom: When loading the model, Keras warns: "Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 42 variables."
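Cause: We persist the model with save_weights and rebuild it at load time. The freshly created Adam optimizer has only 2 variables because it has not yet built its per-weight state, so it does not match the 42 optimizer variables recorded in the saved file, and Keras skips restoring the optimizer state.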
Impact: None. Optimizer state is irrelevant for inference. Weights load correctly.
Status: Expected and safe. No action required.
Diagnostic commands
For the full troubleshooting journal with detailed analysis, see TROUBLESHOOTING_JOURNAL.md in the project root.