
training-hub

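Fine-tune LLMs with the Red Hat training-hub library using the SFT, LoRA, and OSFT algorithms. Use this skill when preparing JSONL datasets, running training jobs, configuring hardware, scaling to clusters, evaluating models, or deploying with vLLM.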

SKILL.md
---
name: training-hub
description: Fine-tune LLMs using Red Hat training-hub library with SFT, LoRA, and OSFT algorithms. Use when preparing JSONL datasets, running training jobs, configuring hardware, scaling to clusters, evaluating models, or deploying with vLLM.
---

Training Hub

Red Hat's unified library for LLM post-training: SFT, LoRA, and OSFT (continual learning).

Quick Reference

| Task | Command |
|------|---------|
| Recommend config | python scripts/recommend_config.py --model <model> --hardware <hw> |
| Estimate memory | python scripts/estimate_memory.py --model <model> --method sft --hardware h100 |
| Validate dataset | python scripts/validate_dataset.py data.jsonl |
| Full fine-tuning | from training_hub import sft |
| LoRA training | from training_hub import lora_sft |
| OSFT (continual) | from training_hub import osft |

Installation

bash
pip install training-hub                               # Basic
pip install "training-hub[lora]"                       # LoRA with Unsloth (2x faster)
pip install "training-hub[cuda]" --no-build-isolation  # CUDA support (quotes keep the shell from globbing the brackets)

Get Started Fast

bash
# Get optimal config for your hardware
python scripts/recommend_config.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --hardware rtx-5090

Data Format

Training data must be JSONL: one JSON object per line, each with a messages list of role/content turns:

json
{"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]}

Validate before training:

bash
python scripts/validate_dataset.py ./training_data.jsonl

For data preparation details, see DATA-FORMATS.md.
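As a rough illustration of what checking this format involves, the sketch below tests each line for a non-empty messages list with string role and content fields. It is not the logic of scripts/validate_dataset.py, which this document does not show; the function names and the accepted role set (system, user, assistant) are assumptions made for the example.

```python
import json

# Assumed set of accepted roles; the real validator may accept others.
VALID_ROLES = {"system", "user", "assistant"}


def check_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


def check_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every bad line in a file."""
    bad = {}
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.strip():  # skip blank lines
                problems = check_record(line)
                if problems:
                    bad[lineno] = problems
    return bad
```

Calling check_file("./training_data.jsonl") returns an empty dict when every line passes these checks.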

Training Methods

Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning: every weight is updated, so VRAM must hold the weights, their gradients, and the optimizer state.

python
from training_hub import sft

result = sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./checkpoints",
    num_epochs=3,
    effective_batch_size=8,
    learning_rate=2e-5,
    max_seq_len=2048,
    max_tokens_per_gpu=45000,
)

LoRA Fine-Tuning

Memory-efficient adaptation: only small low-rank adapter matrices are trained while the base weights stay frozen (up to 2x faster, 70% less VRAM):

python
from training_hub import lora_sft

result = lora_sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    num_epochs=3,
    learning_rate=2e-4,
)

QLoRA (4-bit): Add load_in_4bit=True to the lora_sft call to load the base model in 4-bit precision, for large models on limited VRAM.
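To see why lora_r and lora_alpha keep training cheap, the sketch below applies the standard LoRA formulation: a frozen d_out x d_in weight gets a trainable update B @ A of rank r, scaled by alpha / r. The hidden size of 4096 is a hypothetical value chosen for illustration, and whether training-hub's backend uses exactly this scaling is an assumption.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in); the base weight stays frozen.
    """
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


d = 4096                                  # hypothetical hidden size
full = d * d                              # parameters in one square projection
added = lora_param_count(d, d, r=16)      # parameters LoRA trains instead
print(added, full, f"{added / full:.2%}")
print(lora_scale(alpha=32, r=16))
```

With lora_r=16 and lora_alpha=32 as in the example above, the adapter for one such matrix has under 1% of that matrix's parameters, and the update is scaled by 2.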

OSFT (Continual Learning)

Adapt a model to new data while limiting catastrophic forgetting of its existing capabilities:

python
from training_hub import osft

result = osft(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./domain_data.jsonl",
    ckpt_output_dir="./checkpoints",
    unfreeze_rank_ratio=0.25,
    effective_batch_size=16,
    learning_rate=2e-5,
)

For all parameters, see ALGORITHMS.md.
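This document does not define unfreeze_rank_ratio. One plausible reading is that each weight matrix is decomposed by SVD, the directions with the largest singular values are frozen, and only the remaining fraction is trained. The sketch below illustrates that reading on a plain list of singular values; the function name and the rounding rule are assumptions, not training-hub's implementation.

```python
import math


def split_ranks(singular_values: list[float], unfreeze_rank_ratio: float):
    """Split singular values into (frozen, trainable) parts.

    Assumption: the largest singular directions are frozen and the smallest
    `unfreeze_rank_ratio` fraction of ranks is left trainable.
    """
    if not 0.0 <= unfreeze_rank_ratio <= 1.0:
        raise ValueError("unfreeze_rank_ratio must be in [0, 1]")
    ordered = sorted(singular_values, reverse=True)
    n_frozen = math.floor(len(ordered) * (1.0 - unfreeze_rank_ratio))
    return ordered[:n_frozen], ordered[n_frozen:]


frozen, trainable = split_ranks([8.0, 4.0, 2.0, 1.0], unfreeze_rank_ratio=0.25)
print(frozen, trainable)
```

Under this reading, a higher ratio gives the model more room to learn new data at the cost of weaker protection for what it already knows.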

Hardware Support

| Hardware | VRAM | Best For |
|----------|------|----------|
| RTX 5090 | 32GB | 8B LoRA, 70B QLoRA |
| DGX Spark | 128GB | 70B SFT |
| H100 | 80GB | 14B SFT, 70B LoRA |
| 8×H100 | 640GB | 70B SFT |

bash
# Check if your config fits
python scripts/estimate_memory.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --method lora \
  --hardware h100 \
  --num-gpus 8

For hardware-specific configs, see HARDWARE.md.
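For a quick sanity check before running the estimator, the weights alone set a lower bound on memory: parameter count times bytes per parameter. The sketch below computes only that bound. It ignores gradients, optimizer state, and activations, is not the method used by scripts/estimate_memory.py, and its function names, 80% headroom default, and even-sharding assumption are illustrative choices.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(num_params: float, dtype: str, vram_gb: float,
                num_gpus: int = 1, headroom: float = 0.8) -> bool:
    """True if the weights use at most `headroom` of the total VRAM.

    Assumes weights are sharded evenly across GPUs; real training needs
    considerably more memory than the weights themselves.
    """
    return weight_memory_gb(num_params, dtype) <= vram_gb * num_gpus * headroom


print(weight_memory_gb(8e9, "bf16"))   # 8B model in bf16
print(weight_memory_gb(8e9, "4bit"))   # same model, 4-bit quantized
print(weights_fit(8e9, "bf16", vram_gb=32))
```

If even this lower bound does not fit, the configuration cannot work; if it does fit, the full estimate is still needed.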

Scaling

Multi-GPU:

python
result = sft(..., nproc_per_node=8)

Multi-node:

python
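# Run this on every node, each with its own node_rank (0 .. nnodes-1).
# rdzv_endpoint should be an address of the rank-0 node that all nodes can
# reach; "0.0.0.0" only makes sense on the rank-0 node itself.
result = sft(..., nnodes=2, node_rank=0, nproc_per_node=8, rdzv_endpoint="0.0.0.0:29500")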

For Slurm, Kubernetes, and datacenter deployments, see SCALE.md.
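In a multi-node run, every node executes the same script and differs only in its node_rank. A minimal sketch of deriving those arguments from the environment follows. The variable names NNODES, NODE_RANK, and MASTER_ADDR are hypothetical and must be exported by your launcher; the helper itself is not part of training-hub.

```python
import os


def multinode_kwargs(env=None, gpus_per_node: int = 8, port: int = 29500) -> dict:
    """Build the multi-node keyword arguments shown above from environment values.

    NNODES, NODE_RANK and MASTER_ADDR are assumed names, not ones that
    training-hub defines; adapt them to what your scheduler provides.
    """
    env = os.environ if env is None else env
    return {
        "nnodes": int(env["NNODES"]),
        "node_rank": int(env["NODE_RANK"]),
        "nproc_per_node": gpus_per_node,
        # Every node points at the same rank-0 host for rendezvous.
        "rdzv_endpoint": f"{env['MASTER_ADDR']}:{port}",
    }


# Intended use on each node (not executed here):
#   result = sft(..., **multinode_kwargs())
print(multinode_kwargs({"NNODES": "2", "NODE_RANK": "1", "MASTER_ADDR": "node0"}))
```

Keeping the rank and endpoint out of the script means the same file can be submitted unchanged to every node.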

Algorithm Selection

| Scenario | Method |
|----------|--------|
| First-time fine-tuning, large dataset | SFT |
| Memory constrained | LoRA |
| Very large model (70B+), limited VRAM | LoRA + QLoRA |
| Preserve existing capabilities | OSFT |
| Domain adaptation, small dataset | OSFT |
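The selection table can be written as a small decision function. The table does not rank its rows, so the precedence used here (capability preservation first, then memory limits, then plain SFT) is an assumption, as are the function name and its parameters.

```python
def choose_method(preserve_capabilities: bool = False,
                  small_domain_dataset: bool = False,
                  memory_constrained: bool = False,
                  model_params_b: float = 8.0) -> str:
    """Pick a training method following the scenario table.

    The order in which the conditions are checked is an assumed precedence.
    """
    if preserve_capabilities or small_domain_dataset:
        return "OSFT"
    if memory_constrained:
        # 70B+ models with limited VRAM additionally use 4-bit loading.
        return "LoRA + QLoRA" if model_params_b >= 70 else "LoRA"
    return "SFT"


print(choose_method(memory_constrained=True, model_params_b=70))
```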

Documentation

| Topic | File |
|-------|------|
| Hardware profiles & configs | HARDWARE.md |
| All algorithm parameters | ALGORITHMS.md |
| Data formats & conversion | DATA-FORMATS.md |
| Datacenter & cluster setup | SCALE.md |
| Model evaluation | EVALUATION.md |
| vLLM inference & serving | INFERENCE.md |
| Advanced techniques | ADVANCED.md |
| Model-specific configs | MODELS.md |
| Troubleshooting | TROUBLESHOOTING.md |
| Distributed training | DISTRIBUTED.md |

Utility Scripts

| Script | Purpose |
|--------|---------|
| recommend_config.py | Generate optimal config for model + hardware |
| estimate_memory.py | Estimate GPU memory requirements |
| validate_dataset.py | Validate JSONL dataset format |
| convert_to_jsonl.py | Convert CSV, Alpaca, ShareGPT to JSONL |
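To show what such a conversion amounts to, the sketch below maps one Alpaca-style record (instruction, input, output) and one ShareGPT-style record (conversations with from/value turns) onto the messages format. It uses the commonly seen field names for those formats and is not the code of convert_to_jsonl.py; the function names are made up for the example.

```python
import json

# Common ShareGPT speaker labels mapped to chat roles (assumed mapping).
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def alpaca_to_messages(rec: dict) -> dict:
    """Convert an Alpaca-style record into the messages format."""
    prompt = rec["instruction"]
    if rec.get("input"):  # optional extra context for the instruction
        prompt = f"{prompt}\n\n{rec['input']}"
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": rec["output"]},
    ]}


def sharegpt_to_messages(rec: dict) -> dict:
    """Convert a ShareGPT-style record into the messages format."""
    return {"messages": [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in rec["conversations"]
    ]}


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)


print(to_jsonl_line(alpaca_to_messages(
    {"instruction": "Translate to French", "input": "cat", "output": "chat"})))
```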

Troubleshooting

CUDA OOM: Reduce max_tokens_per_gpu, use LoRA + QLoRA, or add GPUs

Dataset errors: Run python scripts/validate_dataset.py first

LoRA multi-GPU: Launch the training script with torchrun --nproc-per-node=N script.py, where N is the number of GPUs

Training diverges: Lower learning_rate (try 1e-5 for SFT, 1e-4 for LoRA)

For more, see TROUBLESHOOTING.md.