LLM Fine-Tuning
Expert guidance for fine-tuning large language models efficiently. Covers parameter-efficient methods, dataset preparation, training pipelines, alignment techniques, quantization, and production deployment.
When to Fine-Tune: Decision Framework
Fine-Tune vs RAG vs Prompt Engineering
| Approach | Best When | Cost | Time | Maintenance |
|---|---|---|---|---|
| Prompt Engineering | Behavior is achievable with instructions | Lowest | Minutes | Low |
| RAG | Model needs access to specific knowledge | Medium | Hours | Medium (index updates) |
| Fine-Tuning | Model needs new behavior/style/format | High | Hours-Days | High (retraining) |
| Fine-Tune + RAG | Need both new behavior AND specific knowledge | Highest | Days | Highest |
Fine-Tuning Decision Checklist
Choose fine-tuning when:
- • Prompt engineering cannot achieve the desired output format or style
- • RAG retrieval adds too much latency or cost for production
- • You need consistent behavior that prompt instructions cannot guarantee
- • You have 500+ high-quality training examples
- • The task is repetitive and well-defined (classification, extraction, formatting)
- • You need to reduce token usage (shorter prompts after fine-tuning)
- • Domain-specific terminology or reasoning patterns are required
Do NOT fine-tune when:
- •Prompt engineering or few-shot examples work well enough
- •Your data is small (< 100 examples) or low quality
- •The knowledge changes frequently (use RAG instead)
- •You need the model to access external information (use tools/RAG)
LoRA (Low-Rank Adaptation)
How LoRA Works
LoRA freezes the pretrained model weights and injects trainable low-rank decomposition matrices into transformer layers. Instead of updating a weight matrix W (d x d), it learns A (d x r) and B (r x d) where r << d.
Basic LoRA Configuration
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank: 8-64 typical, higher = more capacity
lora_alpha=32, # Scaling factor, usually 2x rank
lora_dropout=0.05, # Dropout for regularization
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # MLP (optional)
],
bias="none", # "none", "all", or "lora_only"
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,553,600 || all params: 6,744,682,496 || trainable%: 0.0971
LoRA Rank Selection Guide
| Rank (r) | Trainable Params | Use Case |
|---|---|---|
| 4-8 | Very few | Simple style transfer, format adaptation |
| 16-32 | Moderate | General fine-tuning, most tasks |
| 64-128 | Many | Complex domain adaptation, multilingual |
| 256+ | Large | Approaching full fine-tune quality |
Target Module Selection
# Find all available modules
from peft import PeftModel
import re
def find_target_modules(model):
"""List all linear layers that can be targeted by LoRA."""
linear_modules = set()
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
# Get the last part of the module name
names = name.split(".")
linear_modules.add(names[-1])
return list(linear_modules)
# Common target modules by model family:
# Llama/Mistral: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
# GPT-NeoX: query_key_value, dense, dense_h_to_4h, dense_4h_to_h
# Falcon: query_key_value, dense, dense_h_to_4h, dense_4h_to_h
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of 70B models on a single 48GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 (best for LLMs)
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16
bnb_4bit_use_double_quant=True, # Nested quantization saves ~0.4 bits/param
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token
# Prepare for k-bit training (freezes, casts, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# Apply LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear", # Shorthand: targets all linear layers
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
Hardware Requirements for QLoRA
| Model Size | VRAM (QLoRA) | VRAM (Full Fine-Tune) | GPU Example |
|---|---|---|---|
| 7-8B | 6-10 GB | 60+ GB | RTX 3090/4090 |
| 13B | 10-16 GB | 100+ GB | RTX 4090/A6000 |
| 34B | 20-28 GB | 250+ GB | A100 40GB |
| 70B | 36-48 GB | 500+ GB | A100 80GB |
Dataset Preparation
Instruction Format
from datasets import Dataset
# Alpaca-style format
training_data = [
{
"instruction": "Summarize the following text in one sentence.",
"input": "The quick brown fox jumped over the lazy dog. The dog was sleeping...",
"output": "A fox jumped over a sleeping dog."
},
{
"instruction": "Translate the following English text to French.",
"input": "Hello, how are you?",
"output": "Bonjour, comment allez-vous?"
},
]
dataset = Dataset.from_list(training_data)
Chat Template Format
# Chat/conversation format (preferred for modern models)
chat_data = [
{
"messages": [
{"role": "system", "content": "You are a medical coding assistant."},
{"role": "user", "content": "What ICD-10 code for type 2 diabetes?"},
{"role": "assistant", "content": "The ICD-10 code for type 2 diabetes mellitus is E11."},
]
},
{
"messages": [
{"role": "system", "content": "You are a medical coding assistant."},
{"role": "user", "content": "Code for acute myocardial infarction?"},
{"role": "assistant", "content": "The ICD-10 code for acute myocardial infarction is I21."},
]
},
]
dataset = Dataset.from_list(chat_data)
# Apply chat template during tokenization
def format_chat(example):
text = tokenizer.apply_chat_template(
example["messages"],
tokenize=False,
add_generation_prompt=False,
)
return {"text": text}
dataset = dataset.map(format_chat)
Data Quality Checklist
- • Minimum 500 examples (1000+ preferred for robust results)
- • Consistent format across all examples
- • Diverse inputs covering expected use cases
- • Correct and high-quality outputs (garbage in = garbage out)
- • No personally identifiable information (PII)
- • Balanced distribution across categories/topics
- • Deduplicated (no exact or near-duplicates)
- • Train/validation split (90/10 or 80/20)
Training with Hugging Face TRL
SFTTrainer (Supervised Fine-Tuning)
from trl import SFTTrainer, SFTConfig
training_args = SFTConfig(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
gradient_checkpointing=True, # Saves VRAM at cost of ~20% speed
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.01,
max_seq_length=2048,
packing=True, # Pack multiple short examples per sequence
bf16=True, # Use bf16 mixed precision
logging_steps=10,
save_strategy="steps",
save_steps=100,
eval_strategy="steps",
eval_steps=100,
save_total_limit=3,
load_best_model_at_end=True,
report_to="wandb", # Weights & Biases logging
dataset_text_field="text",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
peft_config=lora_config, # Pass LoRA config directly
)
trainer.train()
# Save the LoRA adapter (small, typically 10-100 MB)
trainer.save_model("./final_adapter")
Training Hyperparameter Guide
| Parameter | Recommended Range | Notes |
|---|---|---|
| Learning rate | 1e-5 to 3e-4 | 2e-4 is a safe default for LoRA |
| Epochs | 1-5 | More epochs risk overfitting; monitor eval loss |
| Batch size (effective) | 8-64 | Larger = more stable, use grad accumulation |
| Warmup ratio | 0.03-0.1 | Prevents early training instability |
| Weight decay | 0.01-0.1 | Regularization, 0.01 is safe default |
| Max seq length | 512-4096 | Match your use case; longer = more VRAM |
| LoRA rank | 8-64 | Start with 16, increase if underfitting |
RLHF and DPO (Alignment)
Direct Preference Optimization (DPO)
DPO is simpler than full RLHF since it does not require training a separate reward model.
from trl import DPOTrainer, DPOConfig
# Dataset format: each example has a prompt, chosen response, and rejected response
dpo_data = [
{
"prompt": "Explain quantum computing simply.",
"chosen": "Quantum computing uses quantum bits (qubits) that can be 0, 1, or both simultaneously...",
"rejected": "Quantum computing is a complex field involving superposition and entanglement of quantum states in Hilbert space...",
},
]
dpo_dataset = Dataset.from_list(dpo_data)
dpo_config = DPOConfig(
output_dir="./dpo_results",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-6, # Lower LR for DPO
beta=0.1, # KL penalty strength (0.1-0.5)
bf16=True,
gradient_checkpointing=True,
logging_steps=10,
save_strategy="epoch",
report_to="wandb",
)
dpo_trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=dpo_dataset,
tokenizer=tokenizer,
peft_config=lora_config,
)
dpo_trainer.train()
RLHF Pipeline (Full)
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
# Step 1: Train a reward model on preference data
# Step 2: Use PPO to optimize the policy model against the reward model
ppo_config = PPOConfig(
model_name="meta-llama/Llama-3.1-8B-Instruct",
learning_rate=1e-5,
batch_size=16,
mini_batch_size=4,
ppo_epochs=4,
kl_penalty="kl",
target_kl=6.0,
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)
# Training loop
for batch in dataloader:
query_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=256)
rewards = reward_model(query_tensors, response_tensors)
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
Quantization Formats
Comparison
| Format | Bits | Speed | Quality | Use Case |
|---|---|---|---|---|
| BF16 | 16 | Baseline | Best | Training, high-end inference |
| GPTQ | 4/8 | Fast (GPU) | Very good | GPU deployment |
| AWQ | 4 | Fastest (GPU) | Very good | GPU deployment, best perf |
| GGUF | 2-8 | Good (CPU/GPU) | Varies | llama.cpp, local inference |
| bitsandbytes | 4/8 | Moderate | Good | Training (QLoRA), prototyping |
GPTQ Quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Quantize a model to 4-bit GPTQ
quantization_config = GPTQConfig(
bits=4,
dataset="c4", # Calibration dataset
group_size=128, # Quantization group size
desc_act=True, # Better quality, slightly slower
sym=True, # Symmetric quantization
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quantization_config,
device_map="auto",
)
model.save_pretrained("./llama-3.1-8b-gptq-4bit")
AWQ Quantization
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM", # GEMM or GEMV
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./llama-3.1-8b-awq-4bit")
GGUF Conversion (for llama.cpp)
# Convert HF model to GGUF
python llama.cpp/convert_hf_to_gguf.py \
./model-directory \
--outfile model.gguf \
--outtype q4_k_m
# Quantization levels:
# q2_k - 2-bit (smallest, lowest quality)
# q4_0 - 4-bit (fast, decent quality)
# q4_k_m - 4-bit k-quant medium (best balance)
# q5_k_m - 5-bit k-quant medium (good quality)
# q6_k - 6-bit (high quality)
# q8_0 - 8-bit (near lossless)
Model Merging
Combine multiple fine-tuned models without additional training.
# Using mergekit
# Install: pip install mergekit
# mergekit-config.yaml for TIES merge
"""
models:
- model: base-model/path
parameters:
density: 1.0
weight: 1.0
- model: code-adapter/path
parameters:
density: 0.5
weight: 0.7
- model: math-adapter/path
parameters:
density: 0.5
weight: 0.3
merge_method: ties
base_model: base-model/path
parameters:
normalize: true
int8_mask: true
dtype: bfloat16
"""
# Run merge mergekit-yaml mergekit-config.yaml ./merged-model --cuda
Merge Method Selection
| Method | Best For | Quality |
|---|---|---|
| Linear | Two similar models | Good baseline |
| SLERP | Two models, smooth interpolation | Better for dissimilar |
| TIES | Multiple models, resolve conflicts | Best for 3+ models |
| DARE | Multiple models with pruning | Good for diverse skills |
Distributed Training
DeepSpeed ZeRO
# deepspeed_config.json
"""
{
"bf16": {"enabled": true},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {"device": "cpu", "pin_memory": true},
"allgather_partitions": true,
"reduce_scatter": true
},
"gradient_accumulation_steps": 4,
"gradient_clipping": 1.0,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
"""
# In SFTConfig
training_args = SFTConfig(
...,
deepspeed="deepspeed_config.json",
)
# Launch
# deepspeed --num_gpus=4 train.py
FSDP (Fully Sharded Data Parallel)
from transformers import TrainingArguments
training_args = TrainingArguments(
...,
fsdp="full_shard auto_wrap",
fsdp_config={
"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
"fsdp_backward_prefetch": "backward_pre",
"fsdp_forward_prefetch": True,
"fsdp_use_orig_params": True,
"fsdp_cpu_ram_efficient_loading": True,
"fsdp_sync_module_states": True,
},
)
# Launch
# torchrun --nproc_per_node=4 train.py
ZeRO Stage Selection
| Stage | Shards | VRAM Savings | Communication | Use When |
|---|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4x | Low | 2-4 GPUs, large batch |
| ZeRO-2 | + Gradients | ~8x | Medium | 4-8 GPUs, standard |
| ZeRO-3 | + Parameters | ~Nx | High | Model too large for 1 GPU |
| ZeRO-3 + Offload | + CPU offload | Maximum | Highest | Maximum VRAM savings |
Evaluation
Automated Metrics
import evaluate
# Perplexity (lower is better, measures model confidence)
perplexity = evaluate.load("perplexity")
results = perplexity.compute(
predictions=generated_texts,
model_id="meta-llama/Llama-3.1-8B",
)
# Task-specific evaluation
from lm_eval import evaluator
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=./fine-tuned-model",
tasks=["mmlu", "hellaswag", "arc_challenge"],
batch_size=8,
)
Catastrophic Forgetting Detection
def evaluate_forgetting(base_model, fine_tuned_model, general_benchmarks):
"""Compare base vs fine-tuned on general tasks to detect forgetting."""
base_scores = run_benchmarks(base_model, general_benchmarks)
ft_scores = run_benchmarks(fine_tuned_model, general_benchmarks)
for task in general_benchmarks:
delta = ft_scores[task] - base_scores[task]
if delta < -0.05: # > 5% degradation
print(f"WARNING: Forgetting detected on {task}: {delta:.2%}")
return {"base": base_scores, "fine_tuned": ft_scores}
Prevention Strategies
- •Use low LoRA rank (8-16) to limit weight changes
- •Short training (1-3 epochs) to avoid overfitting
- •Include general-purpose examples in training data (5-10%)
- •Use regularization (weight decay, dropout)
- •Evaluate on general benchmarks throughout training
Deployment of Fine-Tuned Models
Merge and Export LoRA
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Load and merge adapter
model = PeftModel.from_pretrained(base_model, "./final_adapter")
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
# Push to Hub
merged_model.push_to_hub("my-org/my-fine-tuned-model")
tokenizer.push_to_hub("my-org/my-fine-tuned-model")
Serve with vLLM
# Install: pip install vllm
# Serve the merged model
vllm serve ./merged_model \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--port 8000
# Or serve with LoRA adapter on the fly
vllm serve meta-llama/Llama-3.1-8B \
--enable-lora \
--lora-modules my-adapter=./final_adapter \
--max-lora-rank 64
Serve with TGI (Text Generation Inference)
docker run --gpus all --shm-size 1g \
-v ./merged_model:/model \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id /model \
--quantize bitsandbytes-nf4 \
--max-input-length 2048 \
--max-total-tokens 4096