TRL Training Skill
You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.
Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
- SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- DPO (Direct Preference Optimization): Align models using preference data
- GRPO (Group Relative Policy Optimization): Sample a group of completions per prompt, score each one, and optimize using each completion's reward relative to the rest of its group
- RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
- Reward Model Training: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
Core Commands
trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.
Full training:
    trl sft \
      --model_name_or_path Qwen/Qwen2-0.5B \
      --dataset_name trl-lib/Capybara \
      --learning_rate 2.0e-5 \
      --num_train_epochs 1 \
      --packing \
      --per_device_train_batch_size 2 \
      --gradient_accumulation_steps 8 \
      --eos_token '<|im_end|>' \
      --eval_strategy steps \
      --eval_steps 100 \
      --output_dir Qwen2-0.5B-SFT \
      --push_to_hub
Train with LoRA adapters:
    trl sft \
      --model_name_or_path Qwen/Qwen2-0.5B \
      --dataset_name trl-lib/Capybara \
      --learning_rate 2.0e-4 \
      --num_train_epochs 1 \
      --packing \
      --per_device_train_batch_size 2 \
      --gradient_accumulation_steps 8 \
      --eos_token '<|im_end|>' \
      --eval_strategy steps \
      --eval_steps 100 \
      --use_peft \
      --lora_r 32 \
      --lora_alpha 16 \
      --output_dir Qwen2-0.5B-SFT \
      --push_to_hub
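To make the `--lora_r 32` and `--lora_alpha 16` flags above more concrete, here is a minimal sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape `d_out x d_in` receives a low-rank update scaled by `alpha / r`. The function name `lora_stats` and the 1024 x 1024 layer size are hypothetical and not taken from Qwen2-0.5B.

```python
def lora_stats(d_in: int, d_out: int, r: int, alpha: int) -> dict:
    """Summarize one LoRA-adapted linear layer (standard formulation assumed)."""
    full_params = d_in * d_out          # frozen base weight
    adapter_params = r * (d_in + d_out) # A is (r x d_in), B is (d_out x r)
    return {
        "scaling": alpha / r,
        "full_params": full_params,
        "adapter_params": adapter_params,
        "trainable_fraction": adapter_params / full_params,
    }


# Hypothetical 1024 x 1024 projection with the flag values used above.
stats = lora_stats(d_in=1024, d_out=1024, r=32, alpha=16)
print(stats)
```

With these values the adapter update is scaled by 0.5 and trains a small fraction of the layer's parameters, which is why LoRA runs typically tolerate a higher learning rate than full fine-tuning.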
trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).
Full training:
    trl dpo \
      --dataset_name trl-lib/ultrafeedback_binarized \
      --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
      --learning_rate 5.0e-7 \
      --num_train_epochs 1 \
      --per_device_train_batch_size 2 \
      --max_steps 1000 \
      --gradient_accumulation_steps 8 \
      --eval_strategy steps \
      --eval_steps 50 \
      --output_dir Qwen2-0.5B-DPO \
      --no_remove_unused_columns
Train with LoRA adapters:
    trl dpo \
      --dataset_name trl-lib/ultrafeedback_binarized \
      --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
      --learning_rate 5.0e-6 \
      --num_train_epochs 1 \
      --per_device_train_batch_size 2 \
      --max_steps 1000 \
      --gradient_accumulation_steps 8 \
      --eval_strategy steps \
      --eval_steps 50 \
      --output_dir Qwen2-0.5B-DPO \
      --no_remove_unused_columns \
      --use_peft \
      --lora_r 32 \
      --lora_alpha 16
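For intuition about what DPO optimizes on each chosen/rejected pair, the following is a minimal sketch of the standard sigmoid DPO loss for a single example. It is not TRL's implementation: the function name `dpo_loss`, the scalar log-probability inputs, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each completion than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen completion: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen completion's log-probability relative to the reference and lowers the rejected one's.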
trl grpo - Group Relative Policy Optimization
Train models with rewards computed on their own sampled generations, using reward functions or an LLM-as-a-judge to score each completion.
Basic usage:
    trl grpo \
      --model_name_or_path Qwen/Qwen2.5-0.5B \
      --dataset_name trl-lib/gsm8k \
      --reward_funcs accuracy_reward \
      --output_dir Qwen2-0.5B-GRPO \
      --push_to_hub
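The "group relative" part of GRPO can be sketched as follows: each completion's reward is compared against the other completions sampled for the same prompt. This is a simplified illustration, assuming mean-and-standard-deviation normalization within the group. The function name and `eps` value are hypothetical, and normalization details (sample vs. population standard deviation, or whether scaling is applied at all) vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is worth checking when a reward function appears to have no effect.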
trl rloo - Reinforce Leave One Out
Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
    trl rloo \
      --model_name_or_path Qwen/Qwen2.5-0.5B \
      --dataset_name trl-lib/tldr \
      --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
      --output_dir Qwen2-0.5B-RLOO \
      --push_to_hub
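The "leave one out" in RLOO refers to the baseline: each sampled completion is compared against the average reward of the other completions for the same prompt. A minimal sketch of that idea, with a hypothetical function name and without any of the extra terms a full trainer may add:

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample relative to the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


print(rloo_advantages([1.0, 0.0, 0.0]))
```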
trl reward - Reward Model Training
Train a reward model to score text quality for RLHF.
Full training:
    trl reward \
      --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
      --dataset_name trl-lib/ultrafeedback_binarized \
      --output_dir Qwen2-0.5B-Reward \
      --per_device_train_batch_size 8 \
      --num_train_epochs 1 \
      --learning_rate 1.0e-5 \
      --eval_strategy steps \
      --eval_steps 50 \
      --max_length 2048
Train with LoRA adapters:
    trl reward \
      --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
      --dataset_name trl-lib/ultrafeedback_binarized \
      --output_dir Qwen2-0.5B-Reward-LoRA \
      --per_device_train_batch_size 8 \
      --num_train_epochs 1 \
      --learning_rate 1.0e-4 \
      --eval_strategy steps \
      --eval_steps 50 \
      --max_length 2048 \
      --use_peft \
      --lora_task_type SEQ_CLS \
      --lora_r 32 \
      --lora_alpha 16
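A reward model trained on chosen/rejected pairs is usually fit with a pairwise (Bradley-Terry style) objective: the chosen response should score higher than the rejected one. The sketch below assumes that objective. The function name is hypothetical and the scalar scores stand in for the model's outputs.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """-log(sigmoid(chosen - rejected)) for one preference pair."""
    diff = chosen_score - rejected_score
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Correctly ordered pair: small loss.
print(pairwise_reward_loss(2.0, -1.0))
# Mis-ordered pair: large loss.
print(pairwise_reward_loss(-1.0, 2.0))
```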
Configuration Files
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
    model_name_or_path: Qwen/Qwen2.5-0.5B
    dataset_name: trl-lib/Capybara
    learning_rate: 2.0e-5
    num_train_epochs: 1
    per_device_train_batch_size: 8
    gradient_accumulation_steps: 2
    output_dir: ./sft_output
    use_peft: true
    lora_r: 16
    lora_alpha: 16
    report_to: trackio
Launch with config:
trl sft --config sft_config.yaml
Override config values:
trl sft --config sft_config.yaml --learning_rate 1.0e-5
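The override behaviour shown above amounts to "command-line values win over config-file values". The sketch below illustrates only that precedence rule with plain dictionaries. It is not how TRL's argument parser is implemented, and `resolve_args` is a hypothetical name.

```python
def resolve_args(config: dict, cli_overrides: dict) -> dict:
    """Merge config-file values with CLI values; explicit CLI values take precedence."""
    resolved = dict(config)
    # Flags the user did not pass are represented as None and are skipped.
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved


config = {"learning_rate": 2.0e-5, "num_train_epochs": 1, "use_peft": True}
cli = {"learning_rate": 1.0e-5, "num_train_epochs": None}
resolved = resolve_args(config, cli)
print(resolved)
```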
Distributed Training
TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
    trl sft \
      --config sft_config.yaml \
      --num_processes 4
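Adding processes changes the number of sequences consumed per optimizer step, which can matter when reusing a learning rate tuned on a single GPU. A small sketch of the arithmetic, assuming plain data parallelism where every process handles its own per-device batch (the function name is hypothetical):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_processes: int = 1) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return per_device * grad_accum * num_processes


# Values from the example sft_config.yaml, run on 1 process and on 4.
single = effective_batch_size(per_device=8, grad_accum=2)
multi = effective_batch_size(per_device=8, grad_accum=2, num_processes=4)
print(single, multi)
```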
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
    trl sft \
      --config sft_config.yaml \
      --accelerate_config zero2
Custom Accelerate config:
    # Generate custom config
    accelerate config

    # Use custom config
    trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
trl sft --config sft_config.yaml --accelerate_config zero3
Troubleshooting
CUDA Out of Memory
- Reduce `--per_device_train_batch_size` and increase `--gradient_accumulation_steps`
- Enable `--use_peft` for LoRA training
- Use `--gradient_checkpointing` to save memory
- Try a smaller model or a shorter maximum sequence length
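The first suggestion above works because shrinking the per-device batch lowers peak activation memory, while raising gradient accumulation by the same factor keeps the number of sequences per optimizer step unchanged. A minimal sketch of that rebalancing (the helper name `rebalance` is hypothetical):

```python
def rebalance(per_device: int, grad_accum: int, factor: int = 2) -> tuple[int, int]:
    """Shrink the per-device batch by `factor` and grow accumulation to compensate."""
    if per_device % factor != 0:
        raise ValueError("per_device batch size must be divisible by factor")
    return per_device // factor, grad_accum * factor


# Starting from --per_device_train_batch_size 8 --gradient_accumulation_steps 2.
new_per_device, new_grad_accum = rebalance(8, 2)
print(new_per_device, new_grad_accum)
```

The trade-off is wall-clock time: more accumulation steps mean more sequential forward and backward passes per update.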
Dataset Loading Issues
- Verify the dataset exists: check the Hugging Face Hub or the local path
- Check that the dataset format matches the expected columns
- Use `--dataset_config` for multi-config datasets
- Inspect the dataset: `from datasets import load_dataset; ds = load_dataset(name)`
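A quick column check can catch format mismatches before a run starts. The sketch below is an assumption-laden helper, not part of TRL: the column sets reflect my reading of TRL's documented dataset formats and should be confirmed against the documentation for your version, and `EXPECTED_COLUMNS` and `matches_schema` are hypothetical names.

```python
# Accepted column sets per training method (assumed; verify against the TRL docs).
EXPECTED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}, {"chosen", "rejected"}],
    "reward": [{"chosen", "rejected"}],
    "grpo": [{"prompt"}],
    "rloo": [{"prompt"}],
}


def matches_schema(method: str, columns: list[str]) -> bool:
    """True if the columns contain at least one accepted set for the method."""
    present = set(columns)
    return any(required <= present for required in EXPECTED_COLUMNS[method])


# With a loaded dataset this would be: matches_schema("dpo", ds.column_names)
print(matches_schema("dpo", ["prompt", "chosen", "rejected", "score"]))
print(matches_schema("dpo", ["text"]))
```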
Model Loading Issues
- Verify the model exists on the Hugging Face Hub
- Check whether a gated model requires authentication: `hf auth login`
- For local models, provide an absolute path
- Ensure sufficient disk space and memory
Slow Training
- Enable dataset `--packing` for short sequences
- Use a larger `--per_device_train_batch_size` if memory allows
- Enable `--tf32` for faster computation on Ampere or newer GPUs
- Use `--bf16` on supported hardware
- Consider multi-GPU training with `--num_processes`
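To see why packing helps with short sequences, the sketch below packs a handful of sequence lengths into fixed-size windows with a greedy first-fit-decreasing strategy and compares the token budget with padding every sequence to the maximum length. The strategy, the lengths, and the 512-token window are illustrative assumptions. TRL's own packing may work differently.

```python
def pack_lengths(lengths: list[int], max_length: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing of sequence lengths into bins."""
    bins: list[list[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > max_length:
            raise ValueError(f"sequence of length {n} exceeds max_length")
        for b in bins:
            if sum(b) + n <= max_length:
                b.append(n)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([n])
    return bins


lengths = [300, 120, 80, 500, 60, 200]
bins = pack_lengths(lengths, max_length=512)
padded_tokens = len(lengths) * 512  # one padded row per sequence
packed_tokens = len(bins) * 512     # one row per packed bin
print(bins, padded_tokens, packed_tokens)
```

In this toy case the packed batch processes half as many token positions for the same data.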
Generation Issues (GRPO/RLOO)
- Check the prompt format in the dataset
- Adjust `--temperature` and `--top_p` for generation
- Verify the reward function
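When tuning `--temperature` and `--top_p`, it helps to recall what they do to the next-token distribution. The sketch below shows the standard definitions on a toy three-token vocabulary. It is a pure-Python illustration with hypothetical function names, not the generation code TRL uses.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float = 0.9) -> dict[int, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.5))
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))
```

For GRPO and RLOO, sampling that is too narrow can make all completions in a group nearly identical, which leaves little reward variation to learn from.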
Additional Resources
- Documentation: https://huggingface.co/docs/trl
- GitHub: https://github.com/huggingface/trl
- Examples: https://github.com/huggingface/trl/tree/main/examples
Best Practices
- Start with SFT: Always fine-tune base models with SFT before preference alignment
- Use LoRA for efficiency: Enable `--use_peft` for faster training and lower memory
- Monitor training: Use `--report_to trackio` (or `--report_to wandb` or `--report_to tensorboard`) for tracking
- Save checkpoints: TRL automatically saves checkpoints in `--output_dir`
- Test on small datasets first: Verify the pipeline works before full training
- Use configuration files: Create YAML configs for reproducibility
- Leverage Accelerate: Use multi-GPU training for faster iteration
When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify that the dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features