AgentSkillsCN

llm-fine-tuning

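Fine-tune large language models using modern techniques (QLoRA, Spectrum, full fine-tuning) with Hugging Face TRL and PEFT. Use this skill when a user needs to fine-tune, train, or adapt an LLM for a specific task, domain, or dataset. The guide covers dataset preparation, training configuration, distributed training, model merging, and model evaluation.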

SKILL.md
--- frontmatter
name: llm-fine-tuning
description: Fine-tune large language models using modern techniques (QLoRA, Spectrum, full fine-tuning) with Hugging Face TRL and PEFT. Use when asked to fine-tune, train, or adapt an LLM for specific tasks, domains, or datasets. Covers dataset preparation, training configuration, distributed training, model merging, and evaluation.

LLM Fine-Tuning

Modern LLM fine-tuning using Hugging Face TRL, PEFT, and optimizations like Flash Attention and Liger Kernels.

Decision Tree

code
Need to fine-tune an LLM?
├── Consumer GPU (24GB VRAM)? → QLoRA (4-bit quantization + LoRA)
├── Multi-GPU cluster? → Full fine-tuning or Spectrum with DeepSpeed ZeRO3
├── Single high-end GPU (48GB+)? → Spectrum (selective layer training)
└── Limited compute budget? → Start with QLoRA, benchmark against base model
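
The tree above can be read as a small lookup. The sketch below is only an illustration: `choose_method` is a hypothetical helper, and the 48 GB threshold and the order of the checks are assumptions taken from the branches as written.

```python
def choose_method(num_gpus: int, vram_gb_per_gpu: int) -> str:
    """Hypothetical helper that mirrors the decision tree above."""
    if num_gpus > 1:
        # Multi-GPU cluster branch
        return "Full fine-tuning or Spectrum with DeepSpeed ZeRO3"
    if vram_gb_per_gpu >= 48:
        # Single high-end GPU branch
        return "Spectrum"
    # Consumer GPU or limited budget: start with QLoRA
    return "QLoRA"


print(choose_method(num_gpus=1, vram_gb_per_gpu=24))
print(choose_method(num_gpus=1, vram_gb_per_gpu=80))
print(choose_method(num_gpus=8, vram_gb_per_gpu=80))
```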

Quick Start

bash
# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes trl peft

# Train with YAML config
python run_sft.py --config config.yaml

# Or use TRL CLI directly
trl sft --model_name_or_path meta-llama/Llama-3.1-8B \
  --dataset_name your-dataset --output_dir ./output \
  --per_device_train_batch_size 4 --learning_rate 2e-4 \
  --packing --max_length 1024 --gradient_checkpointing

Training Methods

QLoRA (Recommended for Consumer GPUs)

4-bit quantization with LoRA adapters. Fits 8B models on 24GB VRAM.

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config
lora_config = LoraConfig(
    r=16,                    # Rank (8-64 typical)
    lora_alpha=16,           # Scaling numerator; effective scale is lora_alpha / r (usually set equal to r)
    target_modules="all-linear",  # Or specific: ["q_proj", "v_proj"]
    modules_to_save=["lm_head", "embed_tokens"],  # Fully train the output head and embeddings (needed for new tokens)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
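
To get a feel for how small the adapter is, the following sketch counts LoRA parameters by hand. The layer shapes are an assumption taken from the public Llama-3.1-8B model config, not from this guide, and the count covers only the LoRA matrices (it ignores the fully trained `modules_to_save` layers).

```python
# Assumed Llama-3.1-8B shapes (from the public model config, not from this guide).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=LINEAR_SHAPES, num_layers=NUM_LAYERS):
    # Each adapted layer adds A (r x in_features) and B (out_features x r).
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


r, lora_alpha = 16, 16
print(f"LoRA trainable parameters: {lora_param_count(r):,}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the adapter has about 42M parameters, well under 1% of the 8B base model, and doubling `r` doubles that count.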

Spectrum (Selective Layer Training)

Spectrum trains only the layers with the highest signal-to-noise ratio (SNR) and freezes the rest. Performance stays close to full fine-tuning at reduced compute.

bash
# Generate Spectrum config for your model
python spectrum.py --model-name meta-llama/Llama-3.1-8B --top-percent 30

# Use in training
python run_sft.py --config config.yaml --spectrum_config_path spectrum_config.yaml
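
The selection idea behind `--top-percent` can be sketched in a few lines. This is not the code in `spectrum.py`: `select_top_layers` is a hypothetical helper, the SNR values are made up, and both the grouping by module type and the rounding up are assumptions.

```python
import math
from collections import defaultdict


def select_top_layers(snr_by_layer, top_percent):
    """Hypothetical sketch: keep the top-percent highest-SNR layers per module type."""
    groups = defaultdict(list)
    for name, snr in snr_by_layer.items():
        module_type = name.rsplit(".", 1)[-1]  # e.g. "q_proj"
        groups[module_type].append((snr, name))

    selected = []
    for entries in groups.values():
        keep = max(1, math.ceil(len(entries) * top_percent / 100))
        entries.sort(reverse=True)  # highest SNR first
        selected.extend(name for _, name in entries[:keep])
    return sorted(selected)


# Made-up SNR values, for illustration only.
snr = {
    "layers.0.self_attn.q_proj": 0.9,
    "layers.1.self_attn.q_proj": 0.4,
    "layers.2.self_attn.q_proj": 0.7,
    "layers.3.self_attn.q_proj": 0.1,
    "layers.0.mlp.down_proj": 0.2,
    "layers.1.mlp.down_proj": 0.8,
    "layers.2.mlp.down_proj": 0.3,
    "layers.3.mlp.down_proj": 0.6,
}
trainable = select_top_layers(snr, top_percent=50)
print(trainable)
```

Every parameter outside the selected list would then be frozen before training.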

Full Fine-Tuning (Multi-GPU)

bash
# With DeepSpeed ZeRO3
accelerate launch --config_file deepspeed_zero3.yaml \
  --num_processes 8 run_sft.py --config config.yaml

FSDP + QLoRA (Large Models on Consumer GPUs)

Combine FSDP sharding with QLoRA for 70B+ models:

| Configuration | GPU Requirement (70B) |
| --- | --- |
| Full fine-tune + FSDP | 16x 80GB |
| FSDP + LoRA | 8x 80GB |
| FSDP + QLoRA | 2x 40GB |
| FSDP + QLoRA + CPU offload | 4x 24GB |

yaml
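# FSDP + QLoRA settings (these are TrainingArguments fields in the training config)
fsdp: "full_shard auto_wrap offload"  # Shard params/grads/optimizer state, auto-wrap layers, offload to CPU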
fsdp_config:
  backward_prefetch: "backward_pre"
  use_orig_params: "false"
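
A rough back-of-the-envelope check shows why quantization changes the GPU requirement so much. The sketch below counts only the memory for the model weights; it ignores activations, gradients, optimizer state, and quantization constants, so real requirements are higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9
bf16_gb = weight_memory_gb(params_70b, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params_70b, 4)    # 4-bit quantized weights

print(f"70B weights in bf16:  {bf16_gb:.0f} GB")
print(f"70B weights in 4-bit: {nf4_gb:.0f} GB")
```

The 4-bit weights (about 35 GB) fit when sharded across two 40 GB GPUs, while the 16-bit weights (about 140 GB) already exceed a single 80 GB GPU before any training state is added.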

Dataset Preparation

Prompt-Completion Format (Recommended for completion_only_loss)

Use this format when you want to train only on assistant responses:

python
from datasets import load_dataset

def to_prompt_completion(sample):
    return {
        "prompt": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
        ],
        "completion": [
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

dataset = load_dataset("your-dataset", split="train")
dataset = dataset.map(to_prompt_completion, remove_columns=dataset.column_names)  # Drop the original columns

Messages Format (For language modeling without masking)

Use this format when training on entire conversations:

python
def to_messages(sample):
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

Format Selection Guide

| Format | Use When | completion_only_loss |
| --- | --- | --- |
| Prompt-Completion | Training only on responses (most tasks) | ✅ Works |
| Messages | Language modeling on full conversations | ❌ No effect |
| Text-only | Pre-formatted text, custom templates | ❌ No effect |
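
A quick sanity check before training can catch a format mismatch early. The sketch below encodes the table above; `detect_format` and `completion_only_loss_applies` are hypothetical helpers, and the `text` column name for the text-only format is an assumption.

```python
def detect_format(sample: dict) -> str:
    """Hypothetical helper: classify one dataset row by its column names."""
    if "prompt" in sample and "completion" in sample:
        return "prompt-completion"
    if "messages" in sample:
        return "messages"
    if "text" in sample:  # assumed column name for pre-formatted text
        return "text"
    raise ValueError(f"Unrecognized columns: {sorted(sample)}")


def completion_only_loss_applies(sample: dict) -> bool:
    # Per the table above, only the prompt-completion format supports masking.
    return detect_format(sample) == "prompt-completion"


row = {
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
    "completion": [{"role": "assistant", "content": "4"}],
}
print(detect_format(row), completion_only_loss_applies(row))
```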

Training Configuration

See references/config-examples.md for complete YAML configs.

Key parameters:

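| Parameter | QLoRA | Spectrum | Full |
| --- | --- | --- | --- |
| learning_rate | 2e-4 | 2e-5 | 1e-5 |
| lr_scheduler_type | constant | cosine | cosine |
| per_device_train_batch_size | 4-8 | 4-8 | 2-4 |
| gradient_accumulation_steps | 2-4 | 2-4 | 4-8 |
| max_length | 1024-2048 | 2048-4096 | 2048-4096 |
| num_train_epochs | 1-3 | 1-3 | 1-2 |
| max_grad_norm | 0.3-1.0 | 1.0 | 1.0 |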
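
The batch-size and accumulation settings combine into an effective batch size, which in turn determines the number of optimizer steps. The sketch below shows the arithmetic; both helpers are hypothetical, packing is assumed to be off, and the trainer's own step count may differ slightly because of how it rounds partial batches.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    # Samples consumed per optimizer step.
    return per_device * grad_accum * num_gpus


def optimizer_steps(num_samples, per_device, grad_accum, num_gpus=1, epochs=1):
    # Approximate number of optimizer steps over the whole run.
    per_epoch = math.ceil(num_samples / effective_batch_size(per_device, grad_accum, num_gpus))
    return per_epoch * epochs


# Example: QLoRA-style settings on a single GPU with 10,000 samples.
print(effective_batch_size(per_device=4, grad_accum=4))
print(optimizer_steps(num_samples=10_000, per_device=4, grad_accum=4, epochs=3))
```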

Completion-Only Loss (Response Masking)

Train only on assistant responses, not system/user prompts. This focuses learning on the actual task rather than chat template tokens.

Dataset Format Matters

Critical: completion_only_loss requires prompt-completion format, not messages format:

python
# CORRECT - Prompt-completion format (completion_only_loss works)
def to_prompt_completion(example):
    return {
        "prompt": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": example["input"]},
        ],
        "completion": [
            {"role": "assistant", "content": example["output"]},
        ]
    }

# WRONG - Messages format (completion_only_loss will NOT work)
def to_messages(example):
    return {
        "messages": [
            {"role": "system", "content": "..."},
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "..."},  # No clear boundary!
        ]
    }

Configuration

yaml
# In SFTConfig (TRL 0.25.0+)
completion_only_loss: true
chat_template_path: "HuggingFaceTB/SmolLM2-360M-Instruct"  # For base models
packing: false  # Required when using completion_only_loss

Note: setup_chat_format() is deprecated. Use chat_template_path in SFTConfig instead.

Verifying It Works

Check that labels are properly masked after creating the trainer:

python
sample_batch = trainer.data_collator([trainer.train_dataset[0]])
labels = sample_batch["labels"][0]

masked = (labels == -100).sum().item()
total = len(labels)
print(f"Masked (prompt): {masked/total*100:.1f}%")  # Should be 30-60%
print(f"Trained (response): {(total-masked)/total*100:.1f}%")

Expected output: roughly 30-60% masked (prompt tokens with label=-100), with the remainder trained (response tokens). The exact split depends on the prompt and response lengths.

If 0% masked → completion_only_loss is not working. Check dataset format.

How It Works

During training, the model predicts every next token, including prompt tokens. Setting labels to -100 at prompt positions changes only which predictions count toward the loss:

  • Model still sees and processes all tokens in forward pass
  • Loss is only computed where label ≠ -100 (response tokens)
  • Gradients only flow from response positions

As a result, the loss value measures performance only on the actual task, not on predicting chat template tokens such as <|im_start|>, which the base model does not know.
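
The masking behaviour can be reproduced with a few lines of plain Python. This is a minimal sketch, not TRL's implementation: `masked_cross_entropy` is a hypothetical function, the logits are made up, and the next-token shift of the labels is omitted for brevity. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


# Four positions over a 3-token vocabulary; the first two are prompt tokens.
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
loss = masked_cross_entropy(logits, labels)

# Changing the predictions at the masked positions leaves the loss unchanged.
changed = [[0.0, 9.0, 0.0], [9.0, 0.0, 0.0]] + logits[2:]
loss_changed_prompt = masked_cross_entropy(changed, labels)

print(f"{loss:.4f} {loss_changed_prompt:.4f}")
```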

Optimizations

Enable all of the following for the best training speed and memory use:

yaml
# In config.yaml
attn_implementation: flash_attention_2  # Flash Attention
use_liger: true                          # Liger Kernels (fused ops)
gradient_checkpointing: true             # Reduce VRAM
gradient_checkpointing_kwargs:
  use_reentrant: false                   # Required for TRL compatibility
packing: true                            # Efficient batching (disable if using completion_only_loss)
torch_dtype: bfloat16                    # Load model weights in bfloat16
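
Packing speeds up training because several short samples share one row of length `max_length` instead of each being padded to it. The sketch below illustrates the idea with a greedy first-fit-decreasing strategy; it is not TRL's packing code, and `pack_sequences` and the sample lengths are hypothetical.

```python
def pack_sequences(lengths, max_length):
    """Greedy first-fit-decreasing: place each sequence in the first row with room."""
    rows = []
    for length in sorted(lengths, reverse=True):
        if length > max_length:
            raise ValueError(f"Sequence of length {length} exceeds max_length")
        for row in rows:
            if sum(row) + length <= max_length:
                row.append(length)
                break
        else:
            rows.append([length])  # no existing row fits: start a new one
    return rows


lengths = [900, 600, 400, 300, 120, 100]  # made-up token counts
rows = pack_sequences(lengths, max_length=1024)

padding_unpacked = len(lengths) * 1024 - sum(lengths)
padding_packed = len(rows) * 1024 - sum(lengths)
print(rows)
print(f"padding tokens: {padding_unpacked} unpacked vs {padding_packed} packed")
```

In this toy case six padded rows shrink to three packed rows, so far fewer padding tokens pass through the model.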

Performance Benchmarks (Llama-3.1-8B, 10K samples, 1x L4 24GB)

| Configuration | Training Time |
| --- | --- |
| QLoRA baseline | ~360 min |
| + Flash Attention | ~290 min |
| + Liger Kernels | ~220 min |
| + Packing (all opts) | ~135 min |
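
The cumulative speedups implied by these timings can be computed directly. The snippet below only restates the figures from the table above as ratios against the QLoRA baseline.

```python
# Training times in minutes, copied from the table above.
timings_min = {
    "QLoRA baseline": 360,
    "+ Flash Attention": 290,
    "+ Liger Kernels": 220,
    "+ Packing (all opts)": 135,
}

baseline = timings_min["QLoRA baseline"]
speedups = {name: baseline / minutes for name, minutes in timings_min.items()}

for name, minutes in timings_min.items():
    print(f"{name:<22} {minutes:>4} min  {speedups[name]:.2f}x")
```

With all optimizations enabled, the run is roughly 2.7x faster than the baseline.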

Spectrum (30% layers) achieves ~58% GSM8K accuracy vs ~54% for QLoRA.

Post-Training

Merge Adapter Weights

python
from peft import AutoPeftModelForCausalLM
import torch

model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/adapter",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
model = model.merge_and_unload()
model.save_pretrained("merged-model", safe_serialization=True)
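
Conceptually, merging folds each adapter into its base weight as W + (lora_alpha / r) * B @ A, so the merged model produces the same outputs without the extra adapter matmuls. The sketch below checks this on tiny made-up matrices in plain Python; `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions.

```python
def matmul(a, b):
    # Plain-Python matrix product of two nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (out x in)
A = [[1.0, 2.0]]              # LoRA A (r x in), rank r = 1
B = [[0.5], [1.0]]            # LoRA B (out x r)
x = [[1.0], [1.0]]            # input column vector

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
y_merged = matmul(merged, x)

# Unmerged forward pass: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged, y_merged, y_unmerged)
```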

Evaluation

bash
# Using lm-eval harness
lm_eval --model local-chat-completions \
  --tasks gsm8k_cot \
  --model_args model=your-model,base_url=http://localhost:8080/v1/chat/completions \
  --apply_chat_template

Resources