AgentSkillsCN

Llm Fine Tuning

LLM 微调专家。涵盖 LoRA、QLoRA、PEFT、数据集准备、Hugging Face Trainer/TRL、RLHF、DPO、量化(GPTQ/AWQ/GGUF)、模型合并、DeepSpeed/FSDP 分布式训练、硬件选型,以及微调模型的生产级部署。适用于:微调、fine-tuning、LoRA、QLoRA、PEFT、RLHF、DPO、GPTQ、AWQ、GGUF、LLM 训练、模型训练、量化、DeepSpeed。

SKILL.md
--- frontmatter
description: "LLM Fine-Tuning expert. Covers LoRA, QLoRA, PEFT, dataset preparation, Hugging Face Trainer/TRL, RLHF, DPO, quantization (GPTQ/AWQ/GGUF), model merging, distributed training with DeepSpeed/FSDP, hardware selection, and production deployment of fine-tuned models. Activates for: fine-tune, fine-tuning, LoRA, QLoRA, PEFT, RLHF, DPO, GPTQ, AWQ, GGUF, LLM training, model training, quantization, DeepSpeed."
model: opus
context: fork

LLM Fine-Tuning

Expert guidance for fine-tuning large language models efficiently. Covers parameter-efficient methods, dataset preparation, training pipelines, alignment techniques, quantization, and production deployment.

When to Fine-Tune: Decision Framework

Fine-Tune vs RAG vs Prompt Engineering

ApproachBest WhenCostTimeMaintenance
Prompt EngineeringBehavior is achievable with instructionsLowestMinutesLow
RAGModel needs access to specific knowledgeMediumHoursMedium (index updates)
Fine-TuningModel needs new behavior/style/formatHighHours-DaysHigh (retraining)
Fine-Tune + RAGNeed both new behavior AND specific knowledgeHighestDaysHighest

Fine-Tuning Decision Checklist

Choose fine-tuning when:

  • Prompt engineering cannot achieve the desired output format or style
  • RAG retrieval adds too much latency or cost for production
  • You need consistent behavior that prompt instructions cannot guarantee
  • You have 500+ high-quality training examples
  • The task is repetitive and well-defined (classification, extraction, formatting)
  • You need to reduce token usage (shorter prompts after fine-tuning)
  • Domain-specific terminology or reasoning patterns are required

Do NOT fine-tune when:

  • Prompt engineering or few-shot examples work well enough
  • Your data is small (< 100 examples) or low quality
  • The knowledge changes frequently (use RAG instead)
  • You need the model to access external information (use tools/RAG)

LoRA (Low-Rank Adaptation)

How LoRA Works

LoRA freezes the pretrained model weights and injects trainable low-rank decomposition matrices into transformer layers. Instead of updating a weight matrix W (d x d), it learns A (d x r) and B (r x d) where r << d.

Basic LoRA Configuration

python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: 8-64 typical, higher = more capacity
    lora_alpha=32,                 # Scaling factor, usually 2x rank
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=[               # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP (optional)
    ],
    bias="none",                   # "none", "all", or "lora_only"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,553,600 || all params: 6,744,682,496 || trainable%: 0.0971

LoRA Rank Selection Guide

Rank (r)Trainable ParamsUse Case
4-8Very fewSimple style transfer, format adaptation
16-32ModerateGeneral fine-tuning, most tasks
64-128ManyComplex domain adaptation, multilingual
256+LargeApproaching full fine-tune quality

Target Module Selection

python
# Find all available modules
from peft import PeftModel
import re

def find_target_modules(model):
    """List all linear layers that can be targeted by LoRA."""
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the module name
            names = name.split(".")
            linear_modules.add(names[-1])
    return list(linear_modules)

# Common target modules by model family:
# Llama/Mistral: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
# GPT-NeoX: query_key_value, dense, dense_h_to_4h, dense_4h_to_h
# Falcon: query_key_value, dense, dense_h_to_4h, dense_4h_to_h

QLoRA: Fine-Tuning on Consumer Hardware

QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of 70B models on a single 48GB GPU.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16
    bnb_4bit_use_double_quant=True,       # Nested quantization saves ~0.4 bits/param
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

# Prepare for k-bit training (freezes, casts, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # Shorthand: targets all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

Hardware Requirements for QLoRA

Model SizeVRAM (QLoRA)VRAM (Full Fine-Tune)GPU Example
7-8B6-10 GB60+ GBRTX 3090/4090
13B10-16 GB100+ GBRTX 4090/A6000
34B20-28 GB250+ GBA100 40GB
70B36-48 GB500+ GBA100 80GB

Dataset Preparation

Instruction Format

python
from datasets import Dataset

# Alpaca-style format
training_data = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "The quick brown fox jumped over the lazy dog. The dog was sleeping...",
        "output": "A fox jumped over a sleeping dog."
    },
    {
        "instruction": "Translate the following English text to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
]

dataset = Dataset.from_list(training_data)

Chat Template Format

python
# Chat/conversation format (preferred for modern models)
chat_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "What ICD-10 code for type 2 diabetes?"},
            {"role": "assistant", "content": "The ICD-10 code for type 2 diabetes mellitus is E11."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Code for acute myocardial infarction?"},
            {"role": "assistant", "content": "The ICD-10 code for acute myocardial infarction is I21."},
        ]
    },
]

dataset = Dataset.from_list(chat_data)

# Apply chat template during tokenization
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)

Data Quality Checklist

  • Minimum 500 examples (1000+ preferred for robust results)
  • Consistent format across all examples
  • Diverse inputs covering expected use cases
  • Correct and high-quality outputs (garbage in = garbage out)
  • No personally identifiable information (PII)
  • Balanced distribution across categories/topics
  • Deduplicated (no exact or near-duplicates)
  • Train/validation split (90/10 or 80/20)

Training with Hugging Face TRL

SFTTrainer (Supervised Fine-Tuning)

python
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size = 4 * 4 = 16
    gradient_checkpointing=True,       # Saves VRAM at cost of ~20% speed
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_seq_length=2048,
    packing=True,                      # Pack multiple short examples per sequence
    bf16=True,                         # Use bf16 mixed precision
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="wandb",                 # Weights & Biases logging
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,           # Pass LoRA config directly
)

trainer.train()

# Save the LoRA adapter (small, typically 10-100 MB)
trainer.save_model("./final_adapter")

Training Hyperparameter Guide

ParameterRecommended RangeNotes
Learning rate1e-5 to 3e-42e-4 is a safe default for LoRA
Epochs1-5More epochs risk overfitting; monitor eval loss
Batch size (effective)8-64Larger = more stable, use grad accumulation
Warmup ratio0.03-0.1Prevents early training instability
Weight decay0.01-0.1Regularization, 0.01 is safe default
Max seq length512-4096Match your use case; longer = more VRAM
LoRA rank8-64Start with 16, increase if underfitting

RLHF and DPO (Alignment)

Direct Preference Optimization (DPO)

DPO is simpler than full RLHF since it does not require training a separate reward model.

python
from trl import DPOTrainer, DPOConfig

# Dataset format: each example has a prompt, chosen response, and rejected response
dpo_data = [
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computing uses quantum bits (qubits) that can be 0, 1, or both simultaneously...",
        "rejected": "Quantum computing is a complex field involving superposition and entanglement of quantum states in Hilbert space...",
    },
]
dpo_dataset = Dataset.from_list(dpo_data)

dpo_config = DPOConfig(
    output_dir="./dpo_results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,                # Lower LR for DPO
    beta=0.1,                          # KL penalty strength (0.1-0.5)
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="wandb",
)

dpo_trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)

dpo_trainer.train()

RLHF Pipeline (Full)

python
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

# Step 1: Train a reward model on preference data
# Step 2: Use PPO to optimize the policy model against the reward model

ppo_config = PPOConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,
    kl_penalty="kl",
    target_kl=6.0,
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

# Training loop
for batch in dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=256)
    rewards = reward_model(query_tensors, response_tensors)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

Quantization Formats

Comparison

FormatBitsSpeedQualityUse Case
BF1616BaselineBestTraining, high-end inference
GPTQ4/8Fast (GPU)Very goodGPU deployment
AWQ4Fastest (GPU)Very goodGPU deployment, best perf
GGUF2-8Good (CPU/GPU)Variesllama.cpp, local inference
bitsandbytes4/8ModerateGoodTraining (QLoRA), prototyping

GPTQ Quantization

python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Quantize a model to 4-bit GPTQ
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",                     # Calibration dataset
    group_size=128,                    # Quantization group size
    desc_act=True,                     # Better quality, slightly slower
    sym=True,                          # Symmetric quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

model.save_pretrained("./llama-3.1-8b-gptq-4bit")

AWQ Quantization

python
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",                # GEMM or GEMV
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./llama-3.1-8b-awq-4bit")

GGUF Conversion (for llama.cpp)

bash
# Convert HF model to GGUF
python llama.cpp/convert_hf_to_gguf.py \
    ./model-directory \
    --outfile model.gguf \
    --outtype q4_k_m

# Quantization levels:
# q2_k   - 2-bit (smallest, lowest quality)
# q4_0   - 4-bit (fast, decent quality)
# q4_k_m - 4-bit k-quant medium (best balance)
# q5_k_m - 5-bit k-quant medium (good quality)
# q6_k   - 6-bit (high quality)
# q8_0   - 8-bit (near lossless)

Model Merging

Combine multiple fine-tuned models without additional training.

python
# Using mergekit
# Install: pip install mergekit

# mergekit-config.yaml for TIES merge
"""
models:
  - model: base-model/path
    parameters:
      density: 1.0
      weight: 1.0
  - model: code-adapter/path
    parameters:
      density: 0.5
      weight: 0.7
  - model: math-adapter/path
    parameters:
      density: 0.5
      weight: 0.3

merge_method: ties
base_model: base-model/path
parameters:
  normalize: true
  int8_mask: true
dtype: bfloat16
"""
bash
# Run merge
mergekit-yaml mergekit-config.yaml ./merged-model --cuda

Merge Method Selection

MethodBest ForQuality
LinearTwo similar modelsGood baseline
SLERPTwo models, smooth interpolationBetter for dissimilar
TIESMultiple models, resolve conflictsBest for 3+ models
DAREMultiple models with pruningGood for diverse skills

Distributed Training

DeepSpeed ZeRO

python
# deepspeed_config.json
"""
{
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "allgather_partitions": true,
        "reduce_scatter": true
    },
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""

# In SFTConfig
training_args = SFTConfig(
    ...,
    deepspeed="deepspeed_config.json",
)

# Launch
# deepspeed --num_gpus=4 train.py

FSDP (Fully Sharded Data Parallel)

python
from transformers import TrainingArguments

training_args = TrainingArguments(
    ...,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
        "fsdp_backward_prefetch": "backward_pre",
        "fsdp_forward_prefetch": True,
        "fsdp_use_orig_params": True,
        "fsdp_cpu_ram_efficient_loading": True,
        "fsdp_sync_module_states": True,
    },
)

# Launch
# torchrun --nproc_per_node=4 train.py

ZeRO Stage Selection

StageShardsVRAM SavingsCommunicationUse When
ZeRO-1Optimizer states~4xLow2-4 GPUs, large batch
ZeRO-2+ Gradients~8xMedium4-8 GPUs, standard
ZeRO-3+ Parameters~NxHighModel too large for 1 GPU
ZeRO-3 + Offload+ CPU offloadMaximumHighestMaximum VRAM savings

Evaluation

Automated Metrics

python
import evaluate

# Perplexity (lower is better, measures model confidence)
perplexity = evaluate.load("perplexity")
results = perplexity.compute(
    predictions=generated_texts,
    model_id="meta-llama/Llama-3.1-8B",
)

# Task-specific evaluation
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    batch_size=8,
)

Catastrophic Forgetting Detection

python
def evaluate_forgetting(base_model, fine_tuned_model, general_benchmarks):
    """Compare base vs fine-tuned on general tasks to detect forgetting."""
    base_scores = run_benchmarks(base_model, general_benchmarks)
    ft_scores = run_benchmarks(fine_tuned_model, general_benchmarks)

    for task in general_benchmarks:
        delta = ft_scores[task] - base_scores[task]
        if delta < -0.05:  # > 5% degradation
            print(f"WARNING: Forgetting detected on {task}: {delta:.2%}")

    return {"base": base_scores, "fine_tuned": ft_scores}

Prevention Strategies

  • Use low LoRA rank (8-16) to limit weight changes
  • Short training (1-3 epochs) to avoid overfitting
  • Include general-purpose examples in training data (5-10%)
  • Use regularization (weight decay, dropout)
  • Evaluate on general benchmarks throughout training

Deployment of Fine-Tuned Models

Merge and Export LoRA

python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, "./final_adapter")
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Push to Hub
merged_model.push_to_hub("my-org/my-fine-tuned-model")
tokenizer.push_to_hub("my-org/my-fine-tuned-model")

Serve with vLLM

bash
# Install: pip install vllm
# Serve the merged model
vllm serve ./merged_model \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9 \
    --port 8000

# Or serve with LoRA adapter on the fly
vllm serve meta-llama/Llama-3.1-8B \
    --enable-lora \
    --lora-modules my-adapter=./final_adapter \
    --max-lora-rank 64

Serve with TGI (Text Generation Inference)

bash
docker run --gpus all --shm-size 1g \
    -v ./merged_model:/model \
    -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /model \
    --quantize bitsandbytes-nf4 \
    --max-input-length 2048 \
    --max-total-tokens 4096