LLM Fine-Tuning
Modern LLM fine-tuning using Hugging Face TRL, PEFT, and optimizations like Flash Attention and Liger Kernels.
Decision Tree
Need to fine-tune an LLM?
├── Consumer GPU (24GB VRAM)? → QLoRA (4-bit quantization + LoRA)
├── Multi-GPU cluster? → Full fine-tuning or Spectrum with DeepSpeed ZeRO3
├── Single high-end GPU (48GB+)? → Spectrum (selective layer training)
└── Limited compute budget? → Start with QLoRA, benchmark against base model
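The decision tree above can also be written as a small function. This is only a sketch: `choose_method` is a hypothetical helper, and the thresholds (more than one GPU, 48 GB per GPU) are read off the tree rather than taken from any library.

```python
def choose_method(num_gpus: int, vram_gb_per_gpu: int) -> str:
    """Map a hardware budget to a fine-tuning method (hypothetical helper).

    The thresholds mirror the decision tree; treat them as rough
    starting points rather than hard limits.
    """
    if num_gpus > 1:
        return "Full fine-tuning or Spectrum with DeepSpeed ZeRO3"
    if vram_gb_per_gpu >= 48:
        return "Spectrum (selective layer training)"
    # Single consumer-class GPU or a limited budget: start with QLoRA.
    return "QLoRA (4-bit quantization + LoRA)"


if __name__ == "__main__":
    for gpus, vram in [(1, 24), (1, 48), (8, 80)]:
        print(f"{gpus} GPU(s) x {vram} GB -> {choose_method(gpus, vram)}")
```

Whatever the function returns, the last branch of the tree still applies: benchmark the result against the base model before investing in a heavier method.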
Quick Start
# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes trl peft
# Train with YAML config
python run_sft.py --config config.yaml
# Or use TRL CLI directly
trl sft --model_name_or_path meta-llama/Llama-3.1-8B \
    --dataset_name your-dataset --output_dir ./output \
    --per_device_train_batch_size 4 --learning_rate 2e-4 \
    --packing --max_length 1024 --gradient_checkpointing
Training Methods
QLoRA (Recommended for Consumer GPUs)
4-bit quantization with LoRA adapters. Fits 8B models on 24GB VRAM.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA config
lora_config = LoraConfig(
    r=16,  # Rank (8-64 typical)
    lora_alpha=16,  # Scaling (usually equal to r)
    target_modules="all-linear",  # Or specific: ["q_proj", "v_proj"]
    modules_to_save=["lm_head", "embed_tokens"],  # Train output layers for new tokens
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
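To get a feel for what the rank `r` costs, the sketch below counts the adapter parameters that LoRA adds when every linear layer in the decoder blocks is targeted. Each adapted layer gains two small matrices, A (r x in_features) and B (out_features x r). The layer shapes are assumed values for Llama-3.1-8B, and `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
# Linear layer shapes (in_features, out_features) per decoder block.
# Assumed values for Llama-3.1-8B: hidden size 4096, MLP size 14336,
# 8 key/value heads of dimension 128, 32 decoder blocks.
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_BLOCKS = 32


def lora_param_count(rank, shapes=LINEAR_SHAPES, num_blocks=NUM_BLOCKS):
    """Count adapter parameters: each layer gets A (r x in) and B (out x r)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_blocks


if __name__ == "__main__":
    for r in (8, 16, 64):
        print(f"r={r}: {lora_param_count(r):,} adapter parameters")
```

Under these assumptions, r=16 gives about 42M adapter parameters, a small fraction of the 8B base weights. Modules listed in modules_to_save are trained as full copies, so they add their entire parameter count on top of this figure.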
Spectrum (Selective Layer Training)
Train only the layers with the highest signal-to-noise ratio (SNR). Performance is close to full fine-tuning at reduced compute.
# Generate Spectrum config for your model
python spectrum.py --model-name meta-llama/Llama-3.1-8B --top-percent 30
# Use in training
python run_sft.py --config config.yaml --spectrum_config_path spectrum_config.yaml
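The idea behind `--top-percent` can be illustrated with a simplified ranking step. This is a sketch under stated assumptions, not the logic of spectrum.py: the SNR scores are made up, the module names are placeholders, and `select_top_percent` is a hypothetical helper that ranks all modules in one pool.

```python
import math


def select_top_percent(snr_by_module, top_percent):
    """Return the names of the highest-SNR modules, keeping top_percent of them."""
    keep = max(1, math.ceil(len(snr_by_module) * top_percent / 100))
    ranked = sorted(snr_by_module, key=snr_by_module.get, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    # Made-up SNR scores for illustration only.
    scores = {
        "layers.0.mlp.down_proj": 0.8,
        "layers.1.mlp.down_proj": 2.4,
        "layers.2.mlp.down_proj": 1.1,
        "layers.3.mlp.down_proj": 3.0,
        "layers.4.mlp.down_proj": 0.3,
        "layers.5.mlp.down_proj": 1.9,
        "layers.6.mlp.down_proj": 0.6,
        "layers.7.mlp.down_proj": 2.7,
        "layers.8.mlp.down_proj": 1.4,
        "layers.9.mlp.down_proj": 0.9,
    }
    trainable = select_top_percent(scores, top_percent=30)
    print("Unfrozen modules:", trainable)
```

Everything not selected stays frozen, which is where the compute savings relative to full fine-tuning come from.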
Full Fine-Tuning (Multi-GPU)
# With DeepSpeed ZeRO3
accelerate launch --config_file deepspeed_zero3.yaml \
    --num_processes 8 run_sft.py --config config.yaml
FSDP + QLoRA (Large Models on Consumer GPUs)
Combine FSDP sharding with QLoRA for 70B+ models:
| Configuration | GPU Requirement (70B) |
|---|---|
| Full fine-tune + FSDP | 16x 80GB |
| FSDP + LoRA | 8x 80GB |
| FSDP + QLoRA | 2x 40GB |
| FSDP + QLoRA + CPU offload | 4x 24GB |
# accelerate config for FSDP + QLoRA
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  backward_prefetch: "backward_pre"
  use_orig_params: "false"
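A rough back-of-the-envelope calculation shows why quantization changes the GPU requirement so much. The sketch below estimates memory for the weights alone, and it deliberately ignores gradients, optimizer states, activations, and quantization constants, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params_70b = 70e9
    for name, bits in [("bf16", 16), ("4-bit (NF4)", 4)]:
        gb = weight_memory_gb(params_70b, bits)
        print(f"{name}: {gb:.0f} GB of weights")
```

At 16 bits, a 70B model needs far more memory for weights than a single 80GB card offers, while at 4 bits the weights can be sharded across two 40GB cards with room left for adapters and activations.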
Dataset Preparation
Prompt-Completion Format (Recommended for completion_only_loss)
Use this format when you want to train only on assistant responses:
from datasets import load_dataset
def to_prompt_completion(sample):
    return {
        "prompt": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
        ],
        "completion": [
            {"role": "assistant", "content": sample["answer"]}
        ]
    }
dataset = load_dataset("your-dataset", split="train")
dataset = dataset.map(to_prompt_completion, remove_columns=dataset.features)
Messages Format (For language modeling without masking)
Use this format when training on entire conversations:
def to_messages(sample):
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }
Format Selection Guide
| Format | Use When | completion_only_loss |
|---|---|---|
| Prompt-Completion | Training only on responses (most tasks) | ✅ Works |
| Messages | Language modeling on full conversations | ❌ No effect |
| Text-only | Pre-formatted text, custom templates | ❌ No effect |
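Before training, it can help to check which of the three formats a dataset actually uses. The sketch below classifies a record by its keys. `detect_format` and `supports_completion_only_loss` are hypothetical helpers, and the `text` key for pre-formatted data is an assumption about how that column is named.

```python
def detect_format(sample):
    """Classify one dataset record by the keys it carries (hypothetical helper)."""
    if "prompt" in sample and "completion" in sample:
        return "prompt-completion"
    if "messages" in sample:
        return "messages"
    if "text" in sample:  # assumed column name for pre-formatted text
        return "text-only"
    raise ValueError(f"Unrecognized record keys: {sorted(sample)}")


def supports_completion_only_loss(sample):
    """Per the table above, only prompt-completion records get masked."""
    return detect_format(sample) == "prompt-completion"


if __name__ == "__main__":
    record = {
        "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
        "completion": [{"role": "assistant", "content": "4"}],
    }
    print(detect_format(record), supports_completion_only_loss(record))
```

Running a check like this on the first few records is a cheap way to catch a format mismatch before a long training run.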
Training Configuration
See references/config-examples.md for complete YAML configs.
Key parameters:
| Parameter | QLoRA | Spectrum | Full |
|---|---|---|---|
| learning_rate | 2e-4 | 2e-5 | 1e-5 |
| lr_scheduler_type | constant | cosine | cosine |
| per_device_train_batch_size | 4-8 | 4-8 | 2-4 |
| gradient_accumulation_steps | 2-4 | 2-4 | 4-8 |
| max_length | 1024-2048 | 2048-4096 | 2048-4096 |
| num_train_epochs | 1-3 | 1-3 | 1-2 |
| max_grad_norm | 0.3-1.0 | 1.0 | 1.0 |
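The batch size and accumulation settings in the table combine into an effective batch size, which in turn determines roughly how many optimizer steps a run takes. The sketch below shows the arithmetic. Both helpers are hypothetical, and the step count is an approximation: packing changes the number of training sequences, and the trainer's own rounding may differ slightly.

```python
import math


def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


def optimizer_steps(num_samples, epochs, batch):
    """Approximate number of optimizer steps for the whole run."""
    return math.ceil(num_samples / batch) * epochs


if __name__ == "__main__":
    batch = effective_batch_size(per_device=4, grad_accum=4, num_gpus=1)
    steps = optimizer_steps(num_samples=10_000, epochs=3, batch=batch)
    print(f"effective batch size: {batch}, approx. optimizer steps: {steps}")
```

When VRAM forces a smaller per-device batch, raising gradient_accumulation_steps by the same factor keeps the effective batch size, and therefore the learning-rate behavior, roughly unchanged.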
Completion-Only Loss (Response Masking)
Train only on assistant responses, not system/user prompts. This focuses learning on the actual task rather than chat template tokens.
Dataset Format Matters
Critical: completion_only_loss requires prompt-completion format, not messages format:
# CORRECT - Prompt-completion format (completion_only_loss works)
def to_prompt_completion(example):
    return {
        "prompt": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": example["input"]},
        ],
        "completion": [
            {"role": "assistant", "content": example["output"]},
        ]
    }
# WRONG - Messages format (completion_only_loss will NOT work)
def to_messages(example):
    return {
        "messages": [
            {"role": "system", "content": "..."},
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "..."},  # No clear boundary!
        ]
    }
Configuration
# In SFTConfig (TRL 0.25.0+)
completion_only_loss: true
chat_template_path: "HuggingFaceTB/SmolLM2-360M-Instruct"  # For base models
packing: false  # Required when using completion_only_loss
Note: setup_chat_format() is deprecated. Use chat_template_path in SFTConfig instead.
Verifying It Works
Check that labels are properly masked after creating the trainer:
sample_batch = trainer.data_collator([trainer.train_dataset[0]])
labels = sample_batch["labels"][0]
masked = (labels == -100).sum().item()
total = len(labels)
print(f"Masked (prompt): {masked/total*100:.1f}%") # Should be 30-60%
print(f"Trained (response): {(total-masked)/total*100:.1f}%")
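Expected output: roughly 30-60% masked (prompt tokens with label=-100), with the remainder trained (response tokens). The exact split depends on how long the prompts are relative to the responses.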
If 0% masked → completion_only_loss is not working. Check dataset format.
How It Works
During training, the model predicts the next token at every position, including prompt positions. Setting the labels of prompt tokens to -100 has the following effect:
- Model still sees and processes all tokens in forward pass
- Loss is only computed where label ≠ -100 (response tokens)
- Gradients only flow from response positions
This means the loss value only measures performance on the actual task, not on predicting chat template tokens like <|im_start|> which the base model doesn't know.
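The masking described above can be reproduced in a few lines of plain Python. This is a toy sketch, not TRL's implementation: `build_labels` and `masked_nll` are hypothetical helpers, the per-token probabilities stand in for a softmax over the vocabulary, and the one-position label shift used in real causal LM training is left out for clarity. The value -100 matches the default ignore_index of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(prompt_ids, completion_ids):
    """Mask prompt positions and keep completion token ids as targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not masked.

    token_probs[i] is the probability the model assigned to the correct
    token at position i (a toy stand-in for real model output).
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    labels = build_labels(prompt_ids=[11, 12, 13], completion_ids=[21, 22])
    # The model is bad at the prompt tokens but decent on the response.
    probs = [0.01, 0.01, 0.01, 0.5, 0.5]
    print("labels:", labels)
    print("loss:", round(masked_nll(probs, labels), 4))
```

Changing the three prompt probabilities leaves the loss untouched, which is the point: only response positions contribute to the reported value and to the gradients.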
Optimizations
Enable all for best performance:
# In config.yaml
attn_implementation: flash_attention_2  # Flash Attention
use_liger: true  # Liger Kernels (fused ops)
gradient_checkpointing: true  # Reduce VRAM
gradient_checkpointing_kwargs:
  use_reentrant: false  # Required for TRL compatibility
packing: true  # Efficient batching (disable if using completion_only_loss)
torch_dtype: bfloat16  # Mixed precision
Performance Benchmarks (Llama-3.1-8B, 10K samples, 1x L4 24GB)
| Configuration | Training Time |
|---|---|
| QLoRA baseline | ~360 min |
| + Flash Attention | ~290 min |
| + Liger Kernels | ~220 min |
| + Packing (all opts) | ~135 min |
Spectrum (30% layers) achieves ~58% GSM8K accuracy vs ~54% for QLoRA.
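The training times in the table above translate into cumulative speedups over the QLoRA baseline. The sketch below only does the division on the numbers from the table. `speedups` is a hypothetical helper.

```python
# Training times in minutes, copied from the benchmark table.
TIMES = [
    ("QLoRA baseline", 360),
    ("+ Flash Attention", 290),
    ("+ Liger Kernels", 220),
    ("+ Packing (all opts)", 135),
]


def speedups(times):
    """Return (name, speedup) pairs relative to the first entry."""
    baseline = times[0][1]
    return [(name, baseline / minutes) for name, minutes in times]


if __name__ == "__main__":
    for name, factor in speedups(TIMES):
        print(f"{name}: {factor:.2f}x")
```

With all optimizations enabled the run is a little over 2.5 times faster than the baseline, and packing accounts for the largest single step.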
Post-Training
Merge Adapter Weights
from peft import AutoPeftModelForCausalLM
import torch
model = AutoPeftModelForCausalLM.from_pretrained(
"path/to/adapter",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
)
model = model.merge_and_unload()
model.save_pretrained("merged-model", safe_serialization=True)
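Conceptually, merging folds each adapter into its base weight as W + (lora_alpha / r) * B A, after which the adapter matrices are no longer needed. The sketch below shows that arithmetic on tiny nested lists. It is an illustration with made-up numbers, not PEFT's implementation, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A as a new matrix."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1, 0], [0, 1]]  # base weight, 2 x 2
    lora_a = [[1, 2]]          # A: r x in, with r = 1
    lora_b = [[0.5], [0.0]]    # B: out x r
    merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
    print(merged)
```

With the settings used earlier in this document (r=16, lora_alpha=16) the scale factor is 1, so the merged weight is simply the base weight plus B A.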
Evaluation
# Using lm-eval harness
lm_eval --model local-chat-completions \
    --tasks gsm8k_cot \
    --model_args model=your-model,base_url=http://localhost:8080/v1/chat/completions \
    --apply_chat_template
Resources
- references/config-examples.md: Complete YAML configurations for QLoRA, Spectrum, and distributed training
- references/troubleshooting.md: Common issues and solutions
- scripts/run_sft.py: Complete training script with Spectrum support
- scripts/merge_adapter_weights.py: Merge LoRA adapters with base model