LLM Fine-Tuning

A guide to fine-tuning LLMs with Hugging Face TRL and PEFT, using optimizations such as Flash Attention and Liger Kernels.

Decision Tree

code

Need to fine-tune an LLM?
├── Consumer GPU (24GB VRAM)? → QLoRA (4-bit quantization + LoRA)
├── Multi-GPU cluster? → Full fine-tuning or Spectrum with DeepSpeed ZeRO3
├── Single high-end GPU (48GB+)? → Spectrum (selective layer training)
└── Limited compute budget? → Start with QLoRA, benchmark against base model
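
The tree above can be read as a small lookup function. This is a minimal sketch: the function name and the 48GB threshold handling are illustrative assumptions, and the "limited compute budget" branch is folded into the QLoRA default.

```python
def choose_method(num_gpus: int, vram_gb_per_gpu: float) -> str:
    """Map a hardware setup to a starting method, following the decision tree."""
    if num_gpus > 1:
        # Multi-GPU cluster
        return "full fine-tuning or Spectrum with DeepSpeed ZeRO3"
    if vram_gb_per_gpu >= 48:
        # Single high-end GPU
        return "Spectrum"
    # Consumer GPU or limited budget: start with QLoRA, then benchmark against the base model
    return "QLoRA"


print(choose_method(num_gpus=1, vram_gb_per_gpu=24))  # QLoRA
```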

Quick Start

bash

# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes trl peft

# Train with YAML config
python run_sft.py --config config.yaml

# Or use TRL CLI directly
trl sft --model_name_or_path meta-llama/Llama-3.1-8B \
  --dataset_name your-dataset --output_dir ./output \
  --per_device_train_batch_size 4 --learning_rate 2e-4 \
  --packing --max_length 1024 --gradient_checkpointing

Training Methods

QLoRA (Recommended for Consumer GPUs)

Quantizes the frozen base model to 4-bit and trains LoRA adapters on top. Fits 8B models in 24GB of VRAM.

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config
lora_config = LoraConfig(
    r=16,                    # Rank (8-64 typical)
    lora_alpha=16,           # Effective scale is lora_alpha / r (often set to r or 2*r)
    target_modules="all-linear",  # Or specific: ["q_proj", "v_proj"]
    modules_to_save=["lm_head", "embed_tokens"],  # Fully train embeddings and output head; only needed when adding new tokens
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
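
To get a feel for how small the adapters are, the sketch below counts the LoRA parameters that r=16 on all linear layers would add. The layer shapes are assumptions based on the published Llama-3.1-8B architecture, the helper names are illustrative, and the full copies created by modules_to_save are not counted.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed shapes (d_in, d_out) of the linear layers in one Llama-3.1-8B block
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def total_lora_params(r: int) -> int:
    """Sum adapter parameters over all linear layers in all blocks."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in LAYER_SHAPES.values())
    return per_layer * NUM_LAYERS


print(f"{total_lora_params(r=16):,}")  # 41,943,040
```

Under these assumptions the adapters hold about 42M parameters, roughly 0.5% of the base model, while the 4-bit base weights take about 0.5 bytes per parameter (around 4GB for 8B parameters).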

Spectrum (Selective Layer Training)

Train only the layers with the highest signal-to-noise ratio (SNR) and freeze the rest. Performance is close to full fine-tuning at reduced compute cost.

bash

# Generate Spectrum config for your model
python spectrum.py --model-name meta-llama/Llama-3.1-8B --top-percent 30

# Use in training
python run_sft.py --config config.yaml --spectrum_config_path spectrum_config.yaml
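
The selection idea behind --top-percent can be sketched as ranking layers by SNR and keeping the top share. This is an illustration only: the function name and the scores are made up, and the actual Spectrum tool may rank within module groups rather than over one flat list.

```python
import math


def select_top_layers(snr_by_layer: dict, top_percent: float) -> list:
    """Return the names of the top `top_percent` of layers by SNR, sorted by name."""
    k = max(1, math.ceil(len(snr_by_layer) * top_percent / 100))
    ranked = sorted(snr_by_layer, key=snr_by_layer.get, reverse=True)
    return sorted(ranked[:k])


# Hypothetical SNR scores for ten layers
scores = [0.2, 0.9, 0.4, 0.8, 0.1, 0.7, 0.3, 0.6, 0.5, 0.05]
snr = {f"layers.{i}.self_attn.q_proj": s for i, s in enumerate(scores)}

trainable = select_top_layers(snr, top_percent=30)
print(trainable)  # layers 1, 3 and 5; every other layer stays frozen
```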

Full Fine-Tuning (Multi-GPU)

bash

# With DeepSpeed ZeRO3
accelerate launch --config_file deepspeed_zero3.yaml \
  --num_processes 8 run_sft.py --config config.yaml

FSDP + QLoRA (Large Models on Consumer GPUs)

Combine FSDP sharding with QLoRA for 70B+ models:

Configuration	GPU Requirement (70B)
Full fine-tune + FSDP	16x 80GB
FSDP + LoRA	8x 80GB
FSDP + QLoRA	2x 40GB
FSDP + QLoRA + CPU offload	4x 24GB

yaml

# Training-argument entries (fsdp, fsdp_config) for FSDP + QLoRA
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  backward_prefetch: "backward_pre"
  use_orig_params: "false"
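
A back-of-the-envelope estimate helps explain the GPU counts in the table above. The sketch assumes about 16 bytes per parameter for full fine-tuning with Adam in mixed precision (weights, gradients, and fp32 optimizer states) and 0.5 bytes per parameter for 4-bit weights. It ignores activations, adapters, and quantization overhead, so treat the numbers as lower bounds.

```python
def weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory in GB (1e9 bytes) for a given per-parameter footprint."""
    return num_params * bytes_per_param / 1e9


PARAMS_70B = 70e9

full_ft_gb = weights_gb(PARAMS_70B, 16)   # assumed: bf16 weights + grads + fp32 Adam states
bf16_gb = weights_gb(PARAMS_70B, 2)       # frozen bf16 base weights (LoRA)
nf4_gb = weights_gb(PARAMS_70B, 0.5)      # frozen 4-bit base weights (QLoRA)

print(full_ft_gb, bf16_gb, nf4_gb)  # 1120.0 140.0 35.0
```

About 1120GB of training state is consistent with needing on the order of 16x 80GB GPUs, while 35GB of 4-bit weights can be sharded across 2x 40GB.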

Dataset Preparation

Prompt-Completion Format (Recommended for completion_only_loss)

Use this format when you want to train only on assistant responses:

python

from datasets import load_dataset

def to_prompt_completion(sample):
    return {
        "prompt": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
        ],
        "completion": [
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

dataset = load_dataset("your-dataset", split="train")
dataset = dataset.map(to_prompt_completion, remove_columns=dataset.column_names)

Messages Format (For language modeling without masking)

Use this format when training on entire conversations:

python

def to_messages(sample):
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

Format Selection Guide

Format	Use When	completion_only_loss
Prompt-Completion	Training only on responses (most tasks)	✅ Works
Messages	Language modeling on full conversations	❌ No effect
Text-only	Pre-formatted text, custom templates	❌ No effect
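
Before enabling completion_only_loss, it can help to check which format a dataset is actually in. The helper below is a hypothetical sketch that inspects the keys of one sample. The key names follow the formats in this guide, with "text" assumed for the text-only case.

```python
def detect_format(sample: dict) -> str:
    """Classify a dataset sample by the keys it exposes."""
    if "prompt" in sample and "completion" in sample:
        return "prompt-completion"   # completion_only_loss works
    if "messages" in sample:
        return "messages"            # completion_only_loss has no effect
    if "text" in sample:
        return "text"                # completion_only_loss has no effect
    return "unknown"


sample = {
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
    "completion": [{"role": "assistant", "content": "4"}],
}
print(detect_format(sample))  # prompt-completion
```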

Training Configuration

See references/config-examples.md for complete YAML configs.

Key parameters:

Parameter	QLoRA	Spectrum	Full
`learning_rate`	2e-4	2e-5	1e-5
`lr_scheduler_type`	constant	cosine	cosine
`per_device_train_batch_size`	4-8	4-8	2-4
`gradient_accumulation_steps`	2-4	2-4	4-8
`max_length`	1024-2048	2048-4096	2048-4096
`num_train_epochs`	1-3	1-3	1-2
`max_grad_norm`	0.3-1.0	1.0	1.0
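
The batch-size and accumulation values in the table combine into an effective batch size, which determines how many optimizer steps an epoch takes. A rough sketch follows. The helper names are illustrative, and packing or dataloader settings will shift the real step count.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_device * grad_accum * num_gpus


def steps_per_epoch(num_samples: int, per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Approximate optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / effective_batch_size(per_device, grad_accum, num_gpus))


batch = effective_batch_size(per_device=4, grad_accum=4)
steps = steps_per_epoch(num_samples=10_000, per_device=4, grad_accum=4)
print(batch, steps)  # 16 625
```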

Completion-Only Loss (Response Masking)

Train only on assistant responses, not system/user prompts. This focuses learning on the actual task rather than chat template tokens.

Dataset Format Matters

Critical: completion_only_loss requires prompt-completion format, not messages format:

python

# CORRECT - Prompt-completion format (completion_only_loss works)
def to_prompt_completion(example):
    return {
        "prompt": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": example["input"]},
        ],
        "completion": [
            {"role": "assistant", "content": example["output"]},
        ]
    }

# WRONG - Messages format (completion_only_loss will NOT work)
def to_messages(example):
    return {
        "messages": [
            {"role": "system", "content": "..."},
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "..."},  # No clear boundary!
        ]
    }

Configuration

yaml

# In SFTConfig (TRL 0.25.0+)
completion_only_loss: true
chat_template_path: "HuggingFaceTB/SmolLM2-360M-Instruct"  # For base models
packing: false  # Required when using completion_only_loss

Note: setup_chat_format() is deprecated. Use chat_template_path in SFTConfig instead.

Verifying It Works

Check that labels are properly masked after creating the trainer:

python

# Assumes `trainer` is an already-constructed SFTTrainer
sample_batch = trainer.data_collator([trainer.train_dataset[0]])
labels = sample_batch["labels"][0]

masked = (labels == -100).sum().item()
total = len(labels)
print(f"Masked (prompt): {masked/total*100:.1f}%")  # Typically 30-60%, depending on prompt length
print(f"Trained (response): {(total-masked)/total*100:.1f}%")

Expected output: roughly 30-60% masked (prompt tokens with label=-100), with the remainder trained (response tokens). The exact split depends on the prompt and response lengths in your dataset.

If 0% masked → completion_only_loss is not working. Check dataset format.

How It Works

During training, the model predicts the next token at every position, including prompt positions. Setting the labels of prompt tokens to -100 has the following effect:

• The model still sees and processes all tokens in the forward pass
• Loss is only computed where label ≠ -100 (response tokens)
• Gradients only flow from response positions

This means the loss value only measures performance on the actual task, not on predicting chat template tokens like <|im_start|>, which a base model has not been trained on.
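
The masking described above can be illustrated without any framework. The sketch below builds labels from prompt and completion token ids and averages the negative log-likelihood only over unmasked positions. The function names and token ids are made up, and the one-position shift between inputs and targets is omitted for clarity.

```python
import math

IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def build_labels(prompt_ids: list, completion_ids: list) -> list:
    """Mask every prompt token and keep completion tokens as targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


def masked_nll(target_log_probs: list, labels: list) -> float:
    """Mean negative log-likelihood over positions whose label is not masked."""
    kept = [-lp for lp, label in zip(target_log_probs, labels) if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("all positions are masked")
    return sum(kept) / len(kept)


labels = build_labels(prompt_ids=[11, 12, 13], completion_ids=[21, 22])
# Hypothetical log-probabilities the model assigned to each target token:
# poor on the prompt (0.1), better on the response (0.5)
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2

loss = masked_nll(log_probs, labels)
print(labels)           # [-100, -100, -100, 21, 22]
print(round(loss, 4))   # 0.6931
```

The poor predictions on the prompt positions do not affect the result: the loss equals -log(0.5), the value for the response tokens alone.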

Optimizations

Enable all of these for best throughput, except packing when completion_only_loss is in use:

yaml

# In config.yaml
attn_implementation: flash_attention_2  # Flash Attention
use_liger: true                          # Liger Kernels (fused ops)
gradient_checkpointing: true             # Reduce VRAM
gradient_checkpointing_kwargs:
  use_reentrant: false                   # Required for TRL compatibility
packing: true                            # Efficient batching (disable if using completion_only_loss)
torch_dtype: bfloat16                    # Load model weights in bfloat16
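
Packing saves compute by filling every sequence to max_length instead of padding short samples. The following is a simplified concatenate-and-chunk sketch, not TRL's implementation. Real implementations typically insert separator tokens and may avoid splitting samples or dropping the tail.

```python
def pack_sequences(tokenized: list, max_length: int) -> list:
    """Concatenate token lists and cut the stream into full blocks of max_length."""
    stream = [tok for seq in tokenized for tok in seq]
    n_full = len(stream) // max_length
    # The incomplete final block is dropped in this sketch
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_full)]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(samples, max_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three samples of uneven length become two full blocks with no padding tokens.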

Performance Benchmarks (Llama-3.1-8B, 10K samples, 1x L4 24GB)

Configuration	Training Time
QLoRA baseline	~360 min
+ Flash Attention	~290 min
+ Liger Kernels	~220 min
+ Packing (all opts)	~135 min

Spectrum (30% layers) achieves ~58% GSM8K accuracy vs ~54% for QLoRA.
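
The timing table above translates into cumulative speedups over the QLoRA baseline. A short calculation using only the numbers from the table:

```python
# Training times in minutes, copied from the benchmark table
timings_min = {
    "QLoRA baseline": 360,
    "+ Flash Attention": 290,
    "+ Liger Kernels": 220,
    "+ Packing (all opts)": 135,
}

baseline = timings_min["QLoRA baseline"]
speedups = {name: round(baseline / minutes, 2) for name, minutes in timings_min.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

With all optimizations enabled, training is roughly 2.7x faster than the baseline.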

Post-Training

Merge Adapter Weights

python

from peft import AutoPeftModelForCausalLM
import torch

model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/adapter",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
model = model.merge_and_unload()
model.save_pretrained("merged-model", safe_serialization=True)
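
Conceptually, merging folds each low-rank update into its base weight: W' = W + (lora_alpha / r) * B @ A. The pure-Python sketch below shows this on a tiny matrix. The helper names and values are illustrative and this is not PEFT's implementation.

```python
def matmul(a: list, b: list) -> list:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W: list, A: list, B: list, lora_alpha: float, r: int) -> list:
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2 x 2)
A = [[1.0, 2.0]]               # LoRA A (r=1 x d_in=2)
B = [[0.5], [0.0]]             # LoRA B (d_out=2 x r=1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and served like a regular checkpoint.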

Evaluation

bash

# Using lm-eval harness
lm_eval --model local-chat-completions \
  --tasks gsm8k_cot \
  --model_args model=your-model,base_url=http://localhost:8080/v1/chat/completions \
  --apply_chat_template

Resources

• references/config-examples.md: Complete YAML configurations for QLoRA, Spectrum, and distributed training
• references/troubleshooting.md: Common issues and solutions
• scripts/run_sft.py: Complete training script with Spectrum support
• scripts/merge_adapter_weights.py: Merge LoRA adapters with base model