Training Runner
Use this skill for end-to-end model training execution, optimized for from-scratch pretraining.
Framework Support
Generate/adapt training workflows for:
- •NanoGPT
- •LitGPT
- •Hugging Face Trainer
Script Generation Defaults
Training scripts should include at minimum:
- •cosine learning rate schedule
- •gradient clipping
- •mixed precision (bf16/fp16 based on hardware)
- •deterministic seeds where feasible
- •structured logging for losses and throughput
Execution Patterns
1. SSH / Remote Execution (RunPod/Vast/Lambda)
- •Use
sshto run training commands on provisioned instances. - •Use
screenortmuxto keep processes alive. - •Stream logs back via
tail -f.
2. Kaggle Kernels
- •Prepare a
kernel-metadata.jsonwithenable_gpu: trueandenable_internet: true. - •Use
kaggle kernels push -p <path>to start training. - •Monitor with
kaggle kernels status <user>/<kernel>. - •Retrieve logs/outputs via
kaggle kernels output <user>/<kernel>.
3. Background Terminal Execution
For long runs, prefer background terminal execution:
- •start via
exec - •set
background: true - •set
pty: truefor interactive-safe output
This keeps the agent responsive while training proceeds asynchronously.
Live Monitoring
Use process:log to stream and inspect ongoing training output.
Parse and report key metrics:
- •train loss
- •eval/validation loss
- •learning rate
- •tokens/sec or samples/sec
- •checkpoint save events
If divergence appears (NaN, exploding loss, repeated OOM), propose immediate mitigation and recovery.
Multi-Phase Training Strategy
Support staged curricula:
- •broad pretraining phase
- •curated high-quality phase
- •domain-specific phase
- •cooldown/anneal phase
For each phase, define target steps/tokens, LR policy, and success criteria.
Checkpoint Management
- •Save checkpoints on predictable intervals and key milestones.
- •Keep latest + best + periodic snapshots.
- •Validate checkpoint integrity after save.
- •Support clean resume from last good checkpoint after interruption.
- •Maintain a concise checkpoint index with timestamp/step/metrics.
Run Outputs
- •
train.shor equivalent launcher script - •
training_config.yaml(or framework-native config) - •
training_status.md(current metrics + risk notes) - •
checkpoints_index.json