Training Configure
Set up and configure model training runs using training_hub.
Training Methods
Standard SFT (Supervised Fine-Tuning)
python
from training_hub import sft
sft(
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
ckpt_output_dir="./checkpoints/sft",
effective_batch_size=128,
num_epochs=3,
learning_rate=2e-5,
max_seq_length=4096,
)
OSFT (Optimizer-State Fine-Tuning)
python
from training_hub import osft
osft(
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
ckpt_output_dir="./checkpoints/osft",
unfreeze_rank_ratio=0.3, # Key OSFT parameter
effective_batch_size=128,
num_epochs=3,
learning_rate=2e-5,
)
LoRA SFT
python
from training_hub import lora_sft
lora_sft(
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
ckpt_output_dir="./checkpoints/lora",
lora_rank=64,
lora_alpha=128,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
effective_batch_size=128,
)
Key Parameters
Batch Size Configuration
python
# Effective batch size = per_device_batch_size * gradient_accumulation * num_gpus
sft(
effective_batch_size=128, # Total effective batch size
per_device_batch_size=4, # Per GPU (auto-calculated if not set)
gradient_accumulation_steps=8, # Auto-calculated based on GPU count
)
Learning Rate & Schedule
python
sft(
learning_rate=2e-5,
lr_scheduler_type="cosine", # cosine, linear, constant
warmup_ratio=0.1, # 10% warmup
weight_decay=0.01,
)
Sequence Length & Packing
python
sft(
max_seq_length=4096,
packing=True, # Pack multiple short sequences into one
pad_to_multiple_of=64, # Efficient padding
)
Checkpointing
python
sft(
ckpt_output_dir="./checkpoints",
save_strategy="steps",
save_steps=500,
save_total_limit=3, # Keep only last 3 checkpoints
resume_from_checkpoint="./checkpoints/checkpoint-1000",
)
Data Format
Expected JSONL format (messages style)
json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Alternative formats
python
sft(
data_path="./data/train.jsonl",
data_format="alpaca", # instruction, input, output columns
# OR
data_format="sharegpt", # conversations column
)
Distributed Training
Multi-GPU (single node)
python
# Automatically uses all available GPUs
sft(
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
# Will auto-detect GPUs and configure DDP
)
Multi-node with FSDP
python
sft(
model_path="meta-llama/Llama-3.1-70B",
data_path="./data/train.jsonl",
fsdp=True,
fsdp_config={
"sharding_strategy": "FULL_SHARD",
"cpu_offload": False,
"mixed_precision": "bf16",
},
)
DeepSpeed integration
python
sft(
model_path="meta-llama/Llama-3.1-70B",
data_path="./data/train.jsonl",
deepspeed="./ds_config_zero3.json",
)
OSFT-Specific Configuration
python
osft(
# Standard params
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
# OSFT-specific
unfreeze_rank_ratio=0.3, # Fraction of ranks to unfreeze (0.0-1.0)
unfreeze_strategy="magnitude", # magnitude, random, gradient
warmup_frozen_epochs=1, # Epochs with all frozen before unfreezing
)
Memory Optimization
python
sft(
# Gradient checkpointing (saves memory, slower training)
gradient_checkpointing=True,
# Mixed precision
bf16=True, # Use bf16 (requires Ampere+ GPU)
# OR
fp16=True, # Use fp16 (older GPUs)
# Optimizer memory
optim="adamw_8bit", # 8-bit Adam
# OR
optim="paged_adamw_32bit", # Paged optimizer for large models
)
Evaluation During Training
python
sft(
eval_data_path="./data/eval.jsonl",
eval_strategy="steps",
eval_steps=500,
metric_for_best_model="eval_loss",
load_best_model_at_end=True,
)
Common Configurations
Small model, quick iteration
python
sft(
model_path="meta-llama/Llama-3.1-8B",
data_path="./data/train.jsonl",
effective_batch_size=32,
num_epochs=1,
max_seq_length=2048,
bf16=True,
)
Large model, production training
python
osft(
model_path="meta-llama/Llama-3.1-70B",
data_path="./data/train.jsonl",
effective_batch_size=256,
num_epochs=3,
max_seq_length=4096,
fsdp=True,
gradient_checkpointing=True,
bf16=True,
unfreeze_rank_ratio=0.2,
)
Related Skills
- •
/training-debug- Debug training issues - •
/pipeline-design- Design end-to-end pipelines