LLM Fine-Tuning
When to Use What
| Method | Approx. VRAM (7B) | Approx. quality vs full FT | When to Use |
|---|---|---|---|
| Full fine-tuning | 60-80 GB | Baseline | Unlimited compute, max quality, own the weights |
| LoRA (r=64) | 18-24 GB | 95-99% | Production adapters, multi-tenant serving |
| QLoRA (4-bit + LoRA) | 6-10 GB | 90-97% | Single GPU, prototyping, budget-constrained |
| LoRA (r=8-16) | 14-18 GB | 90-95% | Quick experiments, narrow domain tasks |
Decision rule: Start with QLoRA to validate the task is learnable, then move to LoRA r=64 or full FT for production.
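The VRAM column can be sanity-checked with simple arithmetic on the weights alone. The sketch below is a lower bound under stated assumptions (a round 7e9 parameter count, 1 GB = 1e9 bytes); gradients, optimizer state, activations, and quantization constants all come on top, so it does not reproduce the table's ranges. The function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_7B = 7e9  # assumed round parameter count for a "7B" model

bf16_gb = weight_memory_gb(PARAMS_7B, 16)  # base weights for LoRA / full FT
nf4_gb = weight_memory_gb(PARAMS_7B, 4)    # 4-bit base weights for QLoRA

print(f"bf16 weights: {bf16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
```

The gap between the two figures is the main reason QLoRA fits on a single consumer GPU while bf16 LoRA usually does not.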
Dataset Formatting
Instruction Format
def format_instruction(example):
    """Alpaca-style. Works with most base models."""
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example.get('input', '')}\n\n"
            f"### Response:\n{example['output']}"
        )
    }
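The formatter above always emits an Input section, even when the example has no input. A possible variant, shown here as a sketch rather than a replacement, drops that section when it is empty. The function name and sample records are made up for illustration.

```python
def format_instruction_optional_input(example):
    """Alpaca-style prompt that omits the Input section when it is empty."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    user_input = (example.get("input") or "").strip()
    if user_input:
        parts.append(f"### Input:\n{user_input}")
    parts.append(f"### Response:\n{example['output']}")
    return {"text": "\n\n".join(parts)}


# Hypothetical records, one with an input field and one without.
with_input = format_instruction_optional_input(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = format_instruction_optional_input(
    {"instruction": "Name a prime number.", "output": "7"}
)
print(without_input["text"])
```

Whichever variant you pick, use the same one at inference time so the prompt layout matches what the model saw in training.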
Chat Template Format
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def format_chat(example):
    """Use the model's native chat template. Always prefer this for instruct models."""
    messages = [
        {"role": "system", "content": example.get("system", "You are a helpful assistant.")},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
Dataset Prep Pipeline
from datasets import load_dataset, DatasetDict

def prepare_dataset(path, tokenizer, max_length=2048, test_size=0.05):
    ds = load_dataset("json", data_files=path, split="train")
    ds = ds.map(format_chat, remove_columns=ds.column_names)
    # Filter overlength samples rather than truncating -- avoids training on garbage
    ds = ds.filter(lambda x: len(tokenizer.encode(x["text"])) <= max_length)
    split = ds.train_test_split(test_size=test_size, seed=42)
    return DatasetDict({"train": split["train"], "test": split["test"]})
Gotcha: Truncating mid-response teaches the model to produce incomplete outputs. Filter or increase max_length instead.
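Before settling on a max_length, it helps to know how much data the filter would discard. The sketch below reports the drop rate for a given limit. The helper names are illustrative, and the whitespace word counter is a stand-in for a real tokenizer so the example runs without any model download.

```python
def length_filter_report(texts, count_tokens, max_length):
    """Split texts into kept/dropped by token count and compute the drop rate."""
    kept, dropped = [], []
    for text in texts:
        (kept if count_tokens(text) <= max_length else dropped).append(text)
    drop_rate = len(dropped) / len(texts) if texts else 0.0
    return kept, dropped, drop_rate


def whitespace_token_count(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


# Toy samples; the middle one is deliberately too long for the limit below.
samples = [
    "short question short answer",
    "a much longer example " * 5,
    "medium length example with a few more words",
]
kept, dropped, drop_rate = length_filter_report(samples, whitespace_token_count, max_length=10)
print(f"kept={len(kept)} dropped={len(dropped)} drop_rate={drop_rate:.0%}")
```

With a real tokenizer, pass something like lambda t: len(tokenizer.encode(t)) as count_tokens. If the drop rate is high, raising max_length is usually a better trade than losing the data.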
PEFT / LoRA Configuration
Standard LoRA Config
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,  # alpha = 2*r is a solid default
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type=TaskType.CAUSAL_LM,
    bias="none",
)
# `model` is a base model loaded beforehand (see QLoRA Setup)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Expect 1-3% of total
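The "1-3% of total" figure can be checked by hand: each adapted linear layer of shape d_in x d_out gains r * (d_in + d_out) trainable parameters. The sketch below assumes Llama-2-7B-like dimensions (hidden size 4096, MLP size 11008, 32 layers, about 6.74e9 base parameters). These numbers are assumptions for illustration, not values read from a real config.

```python
def lora_param_count(layer_shapes, r, num_layers):
    """Trainable LoRA parameters: every adapted (d_in, d_out) linear layer adds
    an r x d_in matrix A and a d_out x r matrix B, i.e. r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed Llama-2-7B-like dimensions (illustrative only).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
BASE_PARAMS = 6.74e9  # approximate base parameter count, also an assumption

shapes = (
    [(HIDDEN, HIDDEN)] * 4            # q_proj, k_proj, v_proj, o_proj
    + [(HIDDEN, INTERMEDIATE)] * 2    # gate_proj, up_proj
    + [(INTERMEDIATE, HIDDEN)]        # down_proj
)

trainable = lora_param_count(shapes, r=64, num_layers=NUM_LAYERS)
print(f"{trainable:,} trainable params, {trainable / BASE_PARAMS:.2%} of base")
```

Under these assumptions r=64 on all seven projections gives roughly 160 million trainable parameters, a little over 2% of the base model. Halving r halves that count.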
Target Module Selection
| Model Family | Recommended Targets | Notes |
|---|---|---|
| Llama/Mistral | q,k,v,o_proj + gate,up,down_proj | All linear layers for best quality |
| GPT-NeoX/Pythia | query_key_value, dense | Fused QKV attention |
| Phi | q_proj,k_proj,v_proj,dense | Check model config for names |
Use model.named_modules() to discover the actual layer names if unsure.
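One way to turn named_modules() into a target_modules list is to collect the last path component of every linear-like module. The sketch below matches on class name so it runs without torch; with a real model, isinstance(module, torch.nn.Linear) is the more usual check. The helper name and the fake model are illustrative.

```python
def candidate_target_modules(model, linear_class_names=("Linear", "Linear4bit", "Linear8bitLt")):
    """Return the sorted leaf names (last path component) of linear-like modules."""
    names = set()
    for full_name, module in model.named_modules():
        if type(module).__name__ in linear_class_names:
            names.add(full_name.rsplit(".", 1)[-1])
    return sorted(names)


# Tiny stand-ins so the sketch runs without torch; a real model exposes
# the same named_modules() interface.
class Linear:
    pass


class LayerNorm:
    pass


class FakeModel:
    def named_modules(self):
        return [
            ("model.layers.0.self_attn.q_proj", Linear()),
            ("model.layers.0.self_attn.k_proj", Linear()),
            ("model.layers.0.input_layernorm", LayerNorm()),
            ("model.layers.1.self_attn.q_proj", Linear()),
            ("lm_head", Linear()),
        ]


found = candidate_target_modules(FakeModel())
print(found)
```

The output head (often named lm_head) typically shows up in this list as well; it is commonly left out of target_modules.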
QLoRA Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Nested quantization saves ~0.4 bits per parameter (~0.05 GB per billion params)
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires the flash-attn package
)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
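Gotcha: Pass use_reentrant=False when combining LoRA with gradient checkpointing. The reentrant implementation (use_reentrant=True, the default in many transformers releases) can leave the LoRA params without gradients when the checkpointed inputs do not require grad, for example when the embedding layer is frozen.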
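A cheap way to catch this failure is to run one forward/backward pass and list the trainable parameters that received no gradient. The sketch below relies only on the named_parameters() interface and uses stand-in objects so it runs without a GPU. The helper name and parameter names are illustrative.

```python
from types import SimpleNamespace


def trainable_params_without_grad(model):
    """Names of trainable parameters whose .grad is still None after backward()."""
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


# Stand-in objects mimicking a torch module's named_parameters() output.
fake_model = SimpleNamespace(
    named_parameters=lambda: [
        ("layers.0.q_proj.weight", SimpleNamespace(requires_grad=False, grad=None)),
        ("layers.0.q_proj.lora_A.weight", SimpleNamespace(requires_grad=True, grad=None)),
        ("layers.0.q_proj.lora_B.weight", SimpleNamespace(requires_grad=True, grad=0.1)),
    ]
)

missing = trainable_params_without_grad(fake_model)
print(missing)
```

On a real model, call this right after the first loss.backward(). A non-empty list means some adapter weights are not being trained.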
Training with SFTTrainer
from trl import SFTTrainer, SFTConfig

# Argument names follow the TRL release this was written against; newer releases
# rename some of them (e.g. tokenizer -> processing_class), so check your version.
training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 4 x 4 = 16 per device
    learning_rate=2e-4,  # LoRA tolerates higher LR than full FT
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=True,  # Pack short sequences together for efficiency
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

# `dataset` is the DatasetDict returned by prepare_dataset()
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=lora_config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./final_adapter")
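The batch, epoch, and warmup settings above translate into optimizer-step counts, which is what eval_steps and save_steps are measured in. The sketch below is an approximation: the trainer's exact rounding may differ, and with packing=True the number of training sequences is not the number of raw examples. The dataset size of 10,000 is hypothetical.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, num_devices,
                      epochs, warmup_ratio):
    """Approximate optimizer-step counts for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical run: 10,000 sequences on a single GPU with the settings above.
schedule = training_schedule(
    num_examples=10_000, per_device_batch=4, grad_accum=4, num_devices=1,
    epochs=3, warmup_ratio=0.05,
)
print(schedule)
```

If total_steps comes out smaller than eval_steps, the run finishes before the first evaluation, so lower eval_steps and save_steps for small datasets.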
Adapter Merging
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base + adapter, merge, save full model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./final_adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
# Reload the training tokenizer (assumes it was saved alongside the adapter)
tokenizer = AutoTokenizer.from_pretrained("./final_adapter")
tokenizer.save_pretrained("./merged_model")
Gotcha: Don't merge QLoRA adapters directly onto the quantized base. Load the base in full precision (bf16/fp16) first, then load the adapter, then merge.
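Merging folds the adapter into the base weights as W + (alpha / r) * B A, with A of shape r x d_in and B of shape d_out x r. Because that sum happens in weight space, a 4-bit base would have to be re-quantized afterwards, which is one way to see why a full-precision base is the safer starting point. The toy example below works through the arithmetic on nested lists. It is a sketch of the formula, not PEFT's implementation, and the matrices are made up.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) for nested-list matrices."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # r x d_in
B = [[0.5], [0.25]]    # d_out x r

merged_alpha_2r = merge_lora(W, A, B, alpha=2, r=1)
merged_alpha_r = merge_lora(W, A, B, alpha=1, r=1)
print(merged_alpha_2r)
print(merged_alpha_r)
```

Comparing the two results shows the effect of the scaling factor: with alpha = r the adapter's update is half as large as with alpha = 2*r.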
Eval Loop
import torch

def evaluate(model, tokenizer, eval_dataset, max_new_tokens=256):
    # extract_prompt, extract_response, and compute_metrics are task-specific
    # helpers you supply; they are not defined here.
    model.eval()
    predictions, references = [], []
    for example in eval_dataset:
        prompt = extract_prompt(example["text"])  # Everything before assistant response
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # Greedy for reproducible eval
                temperature=1.0,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Decode only the newly generated tokens, not the echoed prompt
        pred = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        predictions.append(pred.strip())
        references.append(extract_response(example["text"]))
    return compute_metrics(predictions, references)
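The eval loop leaves extract_prompt, extract_response, and compute_metrics undefined. One possible shape for them is sketched below: splitting on the last occurrence of a response marker and scoring by normalized exact match. The marker shown matches the Alpaca-style format; for a chat template it would be that template's assistant header, which is model-specific. Exact match is only one metric choice among many, and the sample strings are made up.

```python
import re


def split_prompt_response(text, marker):
    """Split formatted text at the last occurrence of the response marker."""
    head, sep, tail = text.rpartition(marker)
    if not sep:
        raise ValueError("response marker not found in text")
    return head + sep, tail


def normalize(s):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", s.strip().lower())


def compute_metrics(predictions, references):
    """Exact-match accuracy after case/whitespace normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return {"exact_match": matches / len(references) if references else 0.0}


MARKER = "### Response:\n"  # marker from the Alpaca-style format
text = "### Instruction:\nAdd 2 and 2.\n\n### Response:\n4"
prompt, response = split_prompt_response(text, MARKER)

all_right = compute_metrics(["4", " Paris "], [response, "paris"])
half_right = compute_metrics(["4", "Lyon"], [response, "paris"])
print(all_right, half_right)
```

Exact match suits short, closed-form answers. Free-form generation tasks usually need a softer metric or a judged comparison.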
Gotchas and Anti-Patterns
Chat Template Mismatch
Training with one template and running inference with another destroys quality. Always save the tokenizer alongside the adapter and use apply_chat_template consistently.
Padding Direction
- Training: tokenizer.padding_side = "right" (standard for causal LM)
- Batch inference: tokenizer.padding_side = "left" (so generation starts at the right position)
- Forgetting to switch between train/inference is a silent quality killer
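The reason for left padding at inference is visible in a toy batch: generation continues from the last position of each row, so that position has to hold a real token. The sketch below pads plain token-id lists by hand. The helper name and token ids are illustrative, and a real tokenizer does this for you once padding_side is set.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        pad_mask = [0] * len(pad)
        real_mask = [1] * len(seq)
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


batch = [[11, 12, 13], [21]]  # made-up token ids
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

# Last position of each row: where generation would continue from.
print([row[-1] for row in left_ids], [row[-1] for row in right_ids])
```

With right padding the short row ends in a pad token, so new tokens would be appended after padding rather than after the prompt.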
Common Mistakes
- Setting lora_alpha = r instead of 2*r -- halves the adapter's scaling factor (alpha/r) and underscales its contribution
- Using packing=True without checking that your dataset has many short examples -- packing gains little when examples are already near max_seq_length, and it can split examples across sequence boundaries
- Not setting pad_token -- many models (Llama) don't have one by default: tokenizer.pad_token = tokenizer.eos_token
- Training on the prompt tokens -- use DataCollatorForCompletionOnlyLM (not compatible with packing=True) or mask labels manually
- Evaluating with do_sample=True -- introduces variance that makes comparison meaningless
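For the "mask labels manually" option in the list above, the usual approach is to copy input_ids into labels and overwrite the prompt positions with -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch follows, assuming you already know how many tokens the prompt occupies. The helper name and token ids are illustrative.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Build labels that train only on the response: prompt positions are ignored."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up token ids: three prompt tokens followed by two response tokens.
input_ids = [101, 102, 103, 201, 202]
labels = mask_prompt_labels(input_ids, prompt_len=3)
print(labels)
```

One way to get prompt_len is to tokenize the prompt portion on its own with the same tokenizer settings. Check that the tokenization of the prompt alone is a prefix of the full sequence, since boundary tokens can merge differently.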