Training Debug

Diagnose and fix common training issues in training_hub.

Out of Memory (OOM) Issues

Symptoms

code

CUDA out of memory. Tried to allocate X GiB
RuntimeError: CUDA error: out of memory

Diagnosis Steps

•Check GPU memory usage

bash

nvidia-smi -l 1  # Monitor GPU memory in real-time

•Estimate memory requirements

python

from training_hub.utils import estimate_memory

estimate_memory(
    model_path="meta-llama/Llama-3.1-8B",
    batch_size=4,
    seq_length=4096,
    precision="bf16",
)
# Output: Estimated memory: 24.5 GB per GPU

Solutions

Reduce batch size

python

sft(
    per_device_batch_size=2,  # Reduce from 4
    gradient_accumulation_steps=16,  # Increase to maintain effective batch
)

Enable gradient checkpointing

python

sft(
    gradient_checkpointing=True,  # Trades compute for memory
)

Use memory-efficient optimizers

python

sft(
    optim="adamw_8bit",  # 8-bit Adam saves ~50% optimizer memory
    # OR
    optim="paged_adamw_32bit",  # Pages optimizer states to CPU
)

Reduce sequence length

python

sft(
    max_seq_length=2048,  # Reduce from 4096
)

Use FSDP for large models

python

sft(
    fsdp=True,
    fsdp_config={"sharding_strategy": "FULL_SHARD"},
)

NaN/Inf Loss Issues

Symptoms

code

Loss: nan
Loss: inf
Training loss spiked to very large values

Diagnosis

python

# Enable anomaly detection
import torch
torch.autograd.set_detect_anomaly(True)

# Check for NaN in inputs
from training_hub.utils import check_data_quality

issues = check_data_quality("./data/train.jsonl")
print(issues)  # Reports NaN, inf, empty examples

Solutions

Fix learning rate

python

sft(
    learning_rate=1e-5,  # Lower from 2e-5
    warmup_ratio=0.1,     # Ensure warmup
)

Use loss scaling for fp16

python

sft(
    fp16=True,
    fp16_opt_level="O1",
    loss_scale=128.0,  # Or "dynamic"
)

Prefer bf16 over fp16

python

sft(
    bf16=True,  # More stable than fp16, no loss scaling needed
)

Gradient clipping

python

sft(
    max_grad_norm=1.0,  # Clip gradients
)

Check and clean data

python

# Remove problematic examples
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.filter(lambda x: len(x["messages"]) > 0)
dataset = dataset.filter(lambda x: all(
    isinstance(m.get("content", ""), str) for m in x["messages"]
))

Distributed Training Issues

NCCL Timeout

code

RuntimeError: NCCL timeout
Watchdog caught collective operation timeout

Solutions:

bash

# Increase timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Debug NCCL
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

GPU Mismatch

code

CUDA error: invalid device ordinal
RuntimeError: Expected all tensors to be on the same device

Solutions:

bash

# Verify GPU visibility
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Explicitly set GPUs

# Check GPU topology
nvidia-smi topo -m

Data Loading Bottleneck

python

# Symptoms: Low GPU utilization, CPUs maxed out

sft(
    dataloader_num_workers=8,  # Increase workers
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=4,
)

FSDP-Specific Issues

Checkpoint Loading Failures

python

# Use FSDP-compatible checkpoint saving
sft(
    fsdp=True,
    fsdp_config={
        "state_dict_type": "SHARDED_STATE_DICT",  # For large models
    },
)

Memory Fragmentation

python

sft(
    fsdp=True,
    fsdp_config={
        "limit_all_gathers": True,
        "forward_prefetch": True,
    },
)

Data Issues

Tokenization Errors

python

from training_hub.utils import validate_data

# Check if data tokenizes correctly
errors = validate_data(
    data_path="./data/train.jsonl",
    model_path="meta-llama/Llama-3.1-8B",
)

for error in errors:
    print(f"Row {error['index']}: {error['message']}")

Sequence Length Distribution

python

from training_hub.utils import analyze_data

stats = analyze_data(
    data_path="./data/train.jsonl",
    model_path="meta-llama/Llama-3.1-8B",
)

print(f"Mean length: {stats['mean_tokens']}")
print(f"Max length: {stats['max_tokens']}")
print(f"Examples > 4096: {stats['exceeds_4096']}")

Debugging Workflow

•Start with minimal config

python

sft(
    model_path="...",
    data_path="...",
    per_device_batch_size=1,
    max_steps=10,
    bf16=True,
)

•Gradually increase complexity

python

# Step 1: Increase batch size until OOM
# Step 2: Add gradient checkpointing if needed
# Step 3: Add evaluation
# Step 4: Full training run

•Enable logging

python

import logging
logging.basicConfig(level=logging.DEBUG)

sft(
    logging_steps=1,
    logging_first_step=True,
)

Diagnostic Commands

bash

# GPU status
nvidia-smi

# Check CUDA version
nvcc --version
python -c "import torch; print(torch.cuda.is_available())"

# Check distributed setup
python -c "import torch.distributed as dist; print(dist.is_available())"

# Memory profiling
python -c "import torch; torch.cuda.memory_summary()"

Related Skills

•/training-configure - Configure training runs
•/pipeline-design - Design end-to-end pipelines