ATFT Training Skill
Mission
- Launch production-grade training for the Graph Attention Network forecaster with correct dataset/version parity.
- Tune hyper-parameters (LR, batch size, horizons, latent dims) exploiting 80GB GPU headroom.
- Safely resume, stop, or monitor long-running jobs and record experiment metadata.
Engagement Triggers
- Requests to “train”, “fine-tune”, “HP optimize”, “resume training”, or “monitor training logs”.
- Need to validate new dataset compatibility with model code.
- Investigations into training stalls, divergence, or GPU under-utilization.
Preflight Safety Checks
- Dataset freshness: `ls -lh output/ml_dataset_latest_full.parquet`, then `python scripts/utils/dataset_guard.py --assert-recency 72`.
- Environment health: `tools/project-health-check.sh --section training`.
- GPU allocation: `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv` (target >60% util, <76GB used baseline).
- Git hygiene: `git status --short` to ensure the working-tree state is understood (avoid accidental overrides during long runs).
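The first and third checks above can also be scripted. The following is a minimal sketch, not the logic of `scripts/utils/dataset_guard.py` (whose internals are not shown here). It assumes that the `72` passed to `--assert-recency` is a number of hours, that "76GB" can be treated as 76 GiB, and that `nvidia-smi` prints its usual CSV layout with a header row and unit suffixes. The names `is_recent`, `parse_gpu_csv`, and `gpu_within_targets` are hypothetical.

```python
import os
import time

# Example of what `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv`
# prints for a single GPU (values are illustrative, not measured).
SAMPLE_OUTPUT = """utilization.gpu [%], memory.used [MiB]
73 %, 41250 MiB
"""


def is_recent(path, max_age_hours, now=None):
    """Return True if `path` was modified within the last `max_age_hours` hours."""
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600.0
    return age_hours <= max_age_hours


def parse_gpu_csv(csv_text):
    """Parse the CSV text into (utilization_percent, memory_used_mib) per GPU."""
    rows = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        util_field, mem_field = [field.strip() for field in line.split(",")]
        util = float(util_field.split()[0])    # "73 %"      -> 73.0
        mem_mib = float(mem_field.split()[0])  # "41250 MiB" -> 41250.0
        rows.append((util, mem_mib))
    return rows


def gpu_within_targets(rows, min_util=60.0, max_mem_gib=76.0):
    """Check every GPU against the >60% utilization and <76GB targets.

    Assumption: the document's "GB" is treated as GiB (1024 MiB).
    """
    return all(
        util > min_util and mem_mib / 1024.0 < max_mem_gib
        for util, mem_mib in rows
    )


if __name__ == "__main__":
    print(gpu_within_targets(parse_gpu_csv(SAMPLE_OUTPUT)))
```

In practice the CSV text would come from running the `nvidia-smi` command itself, for example through `subprocess.run(..., capture_output=True, text=True)`.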
Training Playbooks
1. Production Optimized Training (default 120 epochs)
- `make train-optimized DATASET=output/ml_dataset_latest_full.parquet`: compiles TorchInductor + FlashAttention2.
- `make train-monitor`: tails `_logs/training/train-optimized.log`.
- `make train-status`: polls background process; ensure ETA < 7h.
- Post-run validation:
  - `python scripts/eval/aggregate_metrics.py runs/latest`: compute Sharpe, RankIC, hit ratios.
  - Update `results/latest_training_summary.md`.
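For readers unfamiliar with the three metrics named in the post-run validation step, here is a rough sketch using common textbook definitions. It is an assumption that `scripts/eval/aggregate_metrics.py` uses the same conventions; the annualization factor of 252, the zero risk-free rate, and all function names below are illustrative.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate taken as 0)."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample standard deviation
    return mean / std * math.sqrt(periods_per_year)


def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_ic(predictions, realized):
    """RankIC: Spearman rank correlation between predictions and realized returns."""
    return pearson(average_ranks(predictions), average_ranks(realized))


def hit_ratio(predictions, realized):
    """Fraction of samples where prediction and realized return share a sign."""
    hits = sum(1 for p, r in zip(predictions, realized) if p * r > 0)
    return hits / len(predictions)
```

A RankIC of 1.0 means the model ordered the assets exactly as the realized returns did, and a hit ratio of 0.5 is what random sign guesses would give on average.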
2. Quick Validation / Smoke
- `make train-quick EPOCHS=3`: run in foreground.
- `python scripts/smoke_test.py --max-epochs 1 --subset 512` for additional regression guard.
- `pytest tests/integration/test_training_loop.py::test_forward_backward` if gradients look suspicious.
3. Safe Mode / Debug
- `make train-safe`: disables compile, single-worker dataloading.
- `make train-stop` if hung jobs are detected (consult `_logs/training/pids/`).
- `python scripts/integrated_ml_training_pipeline.py --profile --epochs 2 --no-compile`: capture flamegraph to `benchmark_output/`.
4. Hyper-Parameter Exploration
- Ensure the `mlflow` backend is running if required (`make mlflow-up`).
- `make hpo-run HPO_TRIALS=24 HPO_STUDY=atft_prod_lr_sched`: uses Optuna integration.
- `make hpo-status`: track trial completions.
- Promote winning config → `configs/training/atft_prod.yaml` and document in `EXPERIMENT_STATUS.md`.
Monitoring & Telemetry
- Training logs: `_logs/training/*.log` (includes gradient norms, learning rate schedule, GPU temp).
- Metrics JSONL: `runs/<timestamp>/metrics.jsonl`.
- Checkpoint artifacts: `models/checkpoints/<timestamp>/epoch_###.pt`.
- GPU telemetry: `watch -n 30 nvidia-smi` or `python tools/gpu_monitor.py --pid $(cat _logs/training/pids/train.pid)`.
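A metrics JSONL file like the one listed above can be scanned for early signs of divergence. The sketch below is illustrative only: the schema of `metrics.jsonl` is not documented here, so the field names `epoch` and `train_loss`, the spike threshold of 3x, and the helper names are all assumptions.

```python
import json
import math

# Hypothetical records; the real field names may differ.
SAMPLE_LINES = [
    '{"epoch": 1, "train_loss": 0.90}',
    '{"epoch": 2, "train_loss": 0.85}',
    '{"epoch": 3, "train_loss": 3.10}',
    '{"epoch": 4, "train_loss": NaN}',
]


def load_metrics(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


def find_anomalies(records, loss_key="train_loss", spike_factor=3.0):
    """Return (epoch, reason) pairs for non-finite losses or sudden loss spikes."""
    anomalies = []
    previous = None
    for record in records:
        loss = record.get(loss_key)
        epoch = record.get("epoch")
        if loss is None:
            continue  # record does not carry the loss field
        if not math.isfinite(loss):
            anomalies.append((epoch, "non-finite loss"))
            continue  # keep the last finite loss as the comparison baseline
        if previous is not None and loss > spike_factor * previous:
            anomalies.append((epoch, "loss spike"))
        previous = loss
    return anomalies


if __name__ == "__main__":
    for epoch, reason in find_anomalies(load_metrics(SAMPLE_LINES)):
        print(f"epoch {epoch}: {reason}")
```

To run it against a real file, pass an open file handle to `load_metrics`, since iterating a text file yields one line at a time.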
Failure Handling
- NaN loss → run `make train-safe` with `FP32=1`, inspect `runs/<ts>/nan_batches.json`.
- Slow dataloading → regenerate dataset with `make dataset-gpu GRAPH_WINDOW=90` or enable PyTorch compile caching.
- OOM → set `GRADIENT_ACCUMULATION_STEPS=2` or reduce `BATCH_SIZE`; check for memory fragmentation via `python tools/gpu_memory_report.py`.
- Divergent metrics → verify `configs/training/schedule.yaml`; run `pytest tests/unit/test_loss_functions.py`.
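The OOM remedy relies on gradient accumulation: several smaller micro-batches are processed in turn and their gradients summed, so peak memory drops while the effective batch size stays the same. How the project's training loop implements `GRADIENT_ACCUMULATION_STEPS` is not shown here. The sketch below only illustrates the general idea on a one-parameter linear model with a mean-squared-error loss, and every name in it is hypothetical.

```python
def mse_gradient(weight, xs, ys):
    """Gradient of mean((weight * x - y) ** 2) with respect to weight."""
    n = len(xs)
    return sum(2.0 * (weight * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(weight, xs, ys, accumulation_steps):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is scaled by 1 / accumulation_steps so that the
    sum matches the gradient of the mean loss over the full batch.
    Assumes len(xs) is divisible by accumulation_steps.
    """
    micro = len(xs) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        total += mse_gradient(weight, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.5]
    full = mse_gradient(0.5, xs, ys)
    split = accumulated_gradient(0.5, xs, ys, accumulation_steps=2)
    print(abs(full - split) < 1e-12)
```

The two gradients agree up to floating-point rounding, which is why halving the per-step batch and setting the accumulation steps to 2 is expected to leave the optimization behavior largely unchanged. Layers that depend on batch statistics are the usual exception.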
Codex Collaboration
- Invoke `./tools/codex.sh --max "Design a new learning rate policy for ATFT-GAT-FAN"` when a novel optimizer or architecture strategy is required.
- Use `codex exec --model gpt-5-codex "Analyze runs/<timestamp>/metrics.jsonl and suggest fixes"` for automated postmortems.
- Share Codex-discovered tuning insights in `results/training_runs/` and update config files and documents accordingly.
Post-Training Handoff
- Persist summary in `results/training_runs/<timestamp>.md`, noting dataset hash and commit SHA.
- Push model weights to `models/artifacts/` with naming `gatfan_<date>_Sharpe<score>.pt`.
- Notify research team via `docs/research/changelog.md`.
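The handoff metadata can be gathered with a few standard-library calls. This is a sketch under stated assumptions: the document does not specify the hash algorithm, the `<date>` format, or the `<score>` precision, so SHA-256, `YYYYMMDD`, and three decimals are illustrative choices, and the helper names are hypothetical.

```python
import hashlib
import subprocess
from datetime import date


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit_sha():
    """Return the full SHA of HEAD; must be run inside a git working tree."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def artifact_name(sharpe, run_date):
    """Build a name following the gatfan_<date>_Sharpe<score>.pt pattern.

    Assumption: date as YYYYMMDD and score with three decimals.
    """
    return f"gatfan_{run_date:%Y%m%d}_Sharpe{sharpe:.3f}.pt"


def summary_markdown(dataset_hash, commit_sha, sharpe):
    """Render a minimal run summary noting dataset hash and commit SHA."""
    return "\n".join([
        f"- Dataset SHA-256: {dataset_hash}",
        f"- Commit SHA: {commit_sha}",
        f"- Sharpe: {sharpe:.3f}",
    ])


if __name__ == "__main__":
    print(artifact_name(0.8491, date(2025, 1, 31)))
```

Recording the dataset hash next to the commit SHA lets a later reader confirm which data and code produced a given set of weights.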