Training Configuration

Set up and configure model training runs using training_hub.

Training Methods

Standard SFT (Supervised Fine-Tuning)

python

from training_hub import sft

sft(
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",
    ckpt_output_dir="./checkpoints/sft",
    effective_batch_size=128,
    num_epochs=3,
    learning_rate=2e-5,
    max_seq_length=4096,
)
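
When choosing values such as checkpoint or evaluation intervals, it helps to know roughly how many optimizer steps a run will take. The following is a minimal sketch, assuming each optimizer step consumes `effective_batch_size` samples and that a final partial batch in each epoch still counts as a step. `total_optimizer_steps` is a hypothetical helper for illustration, not part of training_hub, and the sample count is made up.

```python
import math


def total_optimizer_steps(num_samples: int, effective_batch_size: int, num_epochs: int) -> int:
    """Estimate optimizer steps for a run (hypothetical helper, not a training_hub API)."""
    if num_samples <= 0 or effective_batch_size <= 0 or num_epochs <= 0:
        raise ValueError("all arguments must be positive")
    # A trailing partial batch is assumed to count as one step.
    steps_per_epoch = math.ceil(num_samples / effective_batch_size)
    return steps_per_epoch * num_epochs


# 50,000 samples is an illustrative dataset size.
print(total_optimizer_steps(num_samples=50_000, effective_batch_size=128, num_epochs=3))
```

If sequence packing is enabled, the number of samples per step may differ, so treat the result as an estimate only.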

OSFT (Orthogonal Subspace Fine-Tuning)

python

from training_hub import osft

osft(
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",
    ckpt_output_dir="./checkpoints/osft",
    unfreeze_rank_ratio=0.3,  # Key OSFT parameter
    effective_batch_size=128,
    num_epochs=3,
    learning_rate=2e-5,
)

LoRA SFT

python

from training_hub import lora_sft

lora_sft(
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",
    ckpt_output_dir="./checkpoints/lora",
    lora_rank=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    effective_batch_size=128,
)
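
To get a feel for what `lora_rank` and `lora_alpha` imply, the sketch below computes the update scaling and the number of trainable parameters added per adapted layer. It assumes the standard LoRA formulation (scaling of `alpha / r`, adapter matrices of shape `r x d_in` and `d_out x r`). Whether training_hub applies exactly this scaling is an assumption, and the 4096 hidden size is only illustrative.

```python
def lora_scaling(lora_alpha: float, lora_rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_rank


def lora_trainable_params(d_in: int, d_out: int, lora_rank: int) -> int:
    """Parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return lora_rank * (d_in + d_out)


# Rank and alpha come from the example above; 4096 x 4096 is an assumed layer shape.
print(lora_scaling(128, 64))
print(lora_trainable_params(4096, 4096, 64))
```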

Key Parameters

Batch Size Configuration

python

# effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
# The values below are consistent on 4 GPUs: 4 * 8 * 4 = 128
sft(
    effective_batch_size=128,       # Total samples per optimizer step
    per_device_batch_size=4,        # Per GPU (auto-calculated if not set)
    gradient_accumulation_steps=8,  # Auto-calculated based on GPU count
)
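
The relationship in the comment above can be inverted to find the accumulation steps for a given GPU count. This is a small sketch of that arithmetic only. `gradient_accumulation_steps_for` is a hypothetical helper, and how training_hub handles batch sizes that do not divide evenly is not assumed here, so the sketch simply raises an error.

```python
def gradient_accumulation_steps_for(
    effective_batch_size: int, per_device_batch_size: int, num_gpus: int
) -> int:
    """Solve effective = per_device * accumulation * num_gpus for accumulation."""
    samples_per_micro_step = per_device_batch_size * num_gpus
    if effective_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"{effective_batch_size} is not divisible by "
            f"{per_device_batch_size} * {num_gpus} = {samples_per_micro_step}"
        )
    return effective_batch_size // samples_per_micro_step


print(gradient_accumulation_steps_for(128, 4, 4))  # 4 GPUs
print(gradient_accumulation_steps_for(128, 4, 8))  # 8 GPUs
```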

Learning Rate & Schedule

python

sft(
    learning_rate=2e-5,
    lr_scheduler_type="cosine",  # cosine, linear, constant
    warmup_ratio=0.1,            # 10% warmup
    weight_decay=0.01,
)
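
To visualize what `lr_scheduler_type="cosine"` with `warmup_ratio=0.1` typically means, here is a minimal sketch of a linear warmup followed by cosine decay to zero. This is the commonly used shape. The exact schedule training_hub applies (for example, a non-zero minimum learning rate) may differ, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to zero (illustrative only)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```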

Sequence Length & Packing

python

sft(
    max_seq_length=4096,
    packing=True,  # Pack multiple short sequences into one
    pad_to_multiple_of=64,  # Efficient padding
)
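
The two ideas above, packing and padding to a multiple, can be illustrated with plain arithmetic on sequence lengths. The sketch below uses a first-fit-decreasing heuristic, which is one common packing strategy. It is not claimed to be the algorithm training_hub uses, and both helper names are hypothetical.

```python
def round_up(length: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= length."""
    return ((length + multiple - 1) // multiple) * multiple


def pack_first_fit(lengths, max_seq_length: int):
    """Group sequence lengths into bins whose totals do not exceed max_seq_length."""
    bins = []
    for length in sorted(lengths, reverse=True):
        if length > max_seq_length:
            raise ValueError(f"sequence of length {length} exceeds {max_seq_length}")
        for current in bins:
            if sum(current) + length <= max_seq_length:
                current.append(length)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([length])
    return bins


print(round_up(100, 64))
print(pack_first_fit([3000, 1000, 2000, 2000, 96], max_seq_length=4096))
```

In this example five sequences fit into two packed rows instead of five padded ones.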

Checkpointing

python

sft(
    ckpt_output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only last 3 checkpoints
    resume_from_checkpoint="./checkpoints/checkpoint-1000",
)
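
When picking a value for `resume_from_checkpoint`, the step numbers should be compared numerically, not as strings. The sketch below assumes checkpoint directories are named `checkpoint-<step>`, as in the path above, and that `save_total_limit` keeps the most recent ones. Both helpers are hypothetical and operate on a list of names, not on the filesystem.

```python
import re

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def sort_checkpoints(names):
    """Return checkpoint names ordered by step, ignoring unrelated entries."""
    found = []
    for name in names:
        match = _CKPT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


def checkpoints_to_delete(names, save_total_limit: int):
    """Names that would be dropped if only the newest `save_total_limit` are kept."""
    ordered = sort_checkpoints(names)
    if save_total_limit <= 0:
        return []
    return ordered[:-save_total_limit]


names = ["checkpoint-500", "checkpoint-1500", "checkpoint-1000", "checkpoint-2000", "logs"]
print(sort_checkpoints(names)[-1])       # latest checkpoint by step
print(checkpoints_to_delete(names, 3))   # what a limit of 3 would remove
```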

Data Format

Expected JSONL format (messages style)

json

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
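
Before starting a long run, it can be worth checking that every line matches the messages format shown above. The following is a minimal validation sketch using only the standard library. The set of accepted roles is taken from the two example lines, and `validate_messages_line` is a hypothetical helper, not a training_hub function.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_messages_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    return problems


good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
print(validate_messages_line(good))
print(validate_messages_line('{"messages": []}'))
```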

Alternative formats

python

sft(
    data_path="./data/train.jsonl",
    data_format="alpaca",      # instruction, input, output columns
    # data_format="sharegpt",  # Alternative: conversations column (pass only one data_format)
)
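
If a dataset is in Alpaca style, it can also be converted to the messages format ahead of time. The sketch below assumes the conventional Alpaca fields (`instruction`, optional `input`, `output`) and joins instruction and input with a blank line, which is one reasonable choice and not a rule from training_hub. `alpaca_to_messages` is a hypothetical helper.

```python
def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca-style record into the messages format."""
    instruction = record["instruction"].strip()
    extra_input = record.get("input", "").strip()
    # Append the optional input to the instruction when it is present.
    user_content = f"{instruction}\n\n{extra_input}" if extra_input else instruction
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }


example = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(alpaca_to_messages(example))
```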

Distributed Training

Multi-GPU (single node)

python

# Automatically uses all available GPUs
sft(
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",
    # Will auto-detect GPUs and configure DDP
)

Multi-node with FSDP

python

sft(
    model_path="meta-llama/Llama-3.1-70B",
    data_path="./data/train.jsonl",
    fsdp=True,
    fsdp_config={
        "sharding_strategy": "FULL_SHARD",
        "cpu_offload": False,
        "mixed_precision": "bf16",
    },
)

DeepSpeed integration

python

sft(
    model_path="meta-llama/Llama-3.1-70B",
    data_path="./data/train.jsonl",
    deepspeed="./ds_config_zero3.json",
)

OSFT-Specific Configuration

python

osft(
    # Standard params
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",

    # OSFT-specific
    unfreeze_rank_ratio=0.3,      # Fraction of ranks to unfreeze (0.0-1.0)
    unfreeze_strategy="magnitude", # magnitude, random, gradient
    warmup_frozen_epochs=1,        # Epochs with all frozen before unfreezing
)
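
As a rough illustration of what a rank ratio means, the sketch below counts how many singular directions of a weight matrix would be trainable. It assumes the ratio applies to the matrix's maximum possible rank, `min(rows, cols)`, and is rounded down. Both the rounding rule and the helper `unfrozen_ranks` are assumptions for illustration, not confirmed training_hub behavior, and the matrix shapes are examples only.

```python
import math


def unfrozen_ranks(rows: int, cols: int, unfreeze_rank_ratio: float) -> int:
    """Number of trainable ranks under the assumptions stated above."""
    if not 0.0 <= unfreeze_rank_ratio <= 1.0:
        raise ValueError("unfreeze_rank_ratio must be between 0.0 and 1.0")
    max_rank = min(rows, cols)
    return math.floor(max_rank * unfreeze_rank_ratio)


print(unfrozen_ranks(4096, 4096, 0.3))    # square projection matrix
print(unfrozen_ranks(4096, 14336, 0.2))   # rectangular MLP-style matrix
```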

Memory Optimization

python

sft(
    # Gradient checkpointing (saves memory, slower training)
    gradient_checkpointing=True,

    # Mixed precision: enable exactly one of bf16 / fp16
    bf16=True,    # Use bf16 (requires an Ampere or newer GPU)
    # fp16=True,  # Alternative: use fp16 on older GPUs

    # Optimizer memory: choose one
    optim="adamw_8bit",           # 8-bit Adam
    # optim="paged_adamw_32bit",  # Alternative: paged optimizer for large models
)
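
To judge which of these options are needed, a back-of-the-envelope estimate of training state is often enough. The sketch below uses the commonly cited figure of about 16 bytes per parameter for mixed-precision Adam (bf16 weights and gradients, plus fp32 master weights and two fp32 moment buffers). It ignores activations and framework overhead, and it assumes state is split evenly across GPUs when fully sharded. `training_state_gib` is a hypothetical helper.

```python
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "adam_fp32_states": 12,  # fp32 master weights + two fp32 moment buffers
}


def training_state_gib(num_params: float, num_gpus: int = 1) -> float:
    """Approximate per-GPU memory (GiB) for weights, gradients, and optimizer state."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    # Dividing by num_gpus assumes full sharding of all three components.
    return total_bytes / num_gpus / 2**30


print(round(training_state_gib(8e9), 1))              # 8B model on one GPU
print(round(training_state_gib(8e9, num_gpus=8), 1))  # same model, fully sharded over 8 GPUs
```

Because activations are excluded, real usage will be higher, which is where gradient checkpointing helps.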

Evaluation During Training

python

sft(
    eval_data_path="./data/eval.jsonl",
    eval_strategy="steps",
    eval_steps=500,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
)
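
Conceptually, selecting the best model by `eval_loss` means tracking the evaluation result at each interval and keeping the step with the lowest value. A minimal sketch of that selection follows. The loss values are made up for illustration and `best_checkpoint` is a hypothetical helper, not a training_hub function.

```python
def best_checkpoint(eval_history):
    """Return the (step, eval_loss) pair with the lowest loss."""
    if not eval_history:
        raise ValueError("eval_history is empty")
    return min(eval_history, key=lambda item: item[1])


# Illustrative numbers only: loss improves, then starts to rise (possible overfitting).
history = [(500, 1.42), (1000, 1.31), (1500, 1.35)]
print(best_checkpoint(history))
```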

Common Configurations

Small model, quick iteration

python

sft(
    model_path="meta-llama/Llama-3.1-8B",
    data_path="./data/train.jsonl",
    effective_batch_size=32,
    num_epochs=1,
    max_seq_length=2048,
    bf16=True,
)

Large model, production training

python

osft(
    model_path="meta-llama/Llama-3.1-70B",
    data_path="./data/train.jsonl",
    effective_batch_size=256,
    num_epochs=3,
    max_seq_length=4096,
    fsdp=True,
    gradient_checkpointing=True,
    bf16=True,
    unfreeze_rank_ratio=0.2,
)

Related Skills

• /training-debug - Debug training issues
• /pipeline-design - Design end-to-end pipelines