Training on Google Colab Skill
This skill enables efficient model training using Google Colab's GPU resources while maintaining code and data synchronization with the local project via Google Drive.
When to Use
- Need GPU acceleration for model training
- Training experiments that exceed local hardware capabilities
- Long-running training jobs that benefit from cloud execution
- Testing hyperparameter variations at scale
Prerequisites
- Google account with Colab Pro (recommended) or Colab Free
- Google Drive with project folder structure
- Training data uploaded to Google Drive
- Local project configured for Drive sync (see `docs/reference/colab-pro-setup.md`)
Workflow
Step 1: Prepare Training Configuration
Create or update a training config in training/configs/:
```yaml
# training/configs/experiment_name.yaml
model:
  name: resnet50
  num_classes: 2
  pretrained: true

training:
  epochs: 100
  batch_size: 32
  learning_rate: 0.001
  optimizer: adam

data:
  train_path: /content/drive/MyDrive/traina/data/train
  val_path: /content/drive/MyDrive/traina/data/val
  augmentation:
    enabled: true
    normalize: false  # Set based on production requirements
```
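A config like the one above is easy to break with a missing key, and the failure may only surface after a Colab session has already started. The following is a minimal sketch of a check that could run locally first. It assumes the file has already been parsed into a dictionary (for example with `yaml.safe_load` from PyYAML); the `REQUIRED_KEYS` table and the `validate_config` helper are hypothetical names, not part of the project.

```python
# Hypothetical list of required keys, mirroring the example config above.
REQUIRED_KEYS = {
    "model": ["name", "num_classes", "pretrained"],
    "training": ["epochs", "batch_size", "learning_rate", "optimizer"],
    "data": ["train_path", "val_path"],
}


def validate_config(cfg: dict) -> list:
    """Return a list of readable problems; an empty list means the config looks usable."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = cfg.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in values:
                problems.append(f"missing key: {section}.{key}")

    # Basic sanity checks on numeric hyperparameters, only when they are present.
    training = cfg.get("training")
    if isinstance(training, dict):
        if training.get("batch_size", 1) <= 0:
            problems.append("training.batch_size must be positive")
        if training.get("learning_rate", 1.0) <= 0:
            problems.append("training.learning_rate must be positive")
    return problems


example = {
    "model": {"name": "resnet50", "num_classes": 2, "pretrained": True},
    "training": {"epochs": 100, "batch_size": 32, "learning_rate": 0.001, "optimizer": "adam"},
    "data": {
        "train_path": "/content/drive/MyDrive/traina/data/train",
        "val_path": "/content/drive/MyDrive/traina/data/val",
    },
}
print(validate_config(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it possible to fix every issue in a single edit before syncing.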
Step 2: Sync Code to Google Drive
Before syncing, perform a Pre-flight Check:
- Run a local smoke test (e.g., `python smoke_test.py`, or set `DRY_RUN=True` in your notebook).
- Verify that 1 epoch runs on CPU with a few batches.
- Only sync after local verification passes.
```bash
# From project root
./training/scripts/sync_to_drive.sh
```
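The pre-flight check described in this step could look roughly like the sketch below. It is framework-agnostic on purpose: `smoke_test` and `fake_train_step` are hypothetical names, and the contents of the project's own `smoke_test.py` are not shown in this document. The idea is only to run a handful of batches and fail fast on an empty dataset or a non-finite loss.

```python
import math
from typing import Callable, Iterable


def smoke_test(train_step: Callable, batches: Iterable, max_batches: int = 3) -> int:
    """Run train_step on at most max_batches batches and return how many ran.

    Raises ValueError if a loss is NaN/inf or if no batch was produced.
    """
    ran = 0
    for batch in batches:
        if ran >= max_batches:
            break
        loss = float(train_step(batch))
        if not math.isfinite(loss):
            raise ValueError(f"non-finite loss {loss} on batch {ran}")
        ran += 1
    if ran == 0:
        raise ValueError("no batches were produced; check the data path")
    return ran


# Stand-in for a real training step: returns a fake loss for each batch.
def fake_train_step(batch):
    return sum(batch) / len(batch)


batches_run = smoke_test(fake_train_step, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
print(batches_run)  # prints 3
```

In a real project, `train_step` would wrap one forward and backward pass on CPU, and `batches` would be the training data loader.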
Step 3: Launch Colab Notebook
- Open `training/notebooks/colab_training.ipynb` in Google Colab
- Connect to GPU runtime (Runtime → Change runtime type → GPU)
- Mount Google Drive:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```

- Execute training cells
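Before executing the training cells, it can help to confirm that the mount actually exposes the folders the config points at. The sketch below uses only the standard library; `missing_paths` is a hypothetical helper, and the two paths are the ones from the example config in Step 1.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


# Paths taken from the example config in Step 1.
expected = [
    "/content/drive/MyDrive/traina/data/train",
    "/content/drive/MyDrive/traina/data/val",
]

missing = missing_paths(expected)
if missing:
    print("Missing on Drive:", missing)
else:
    print("All expected folders found.")
```

Outside Colab, both paths will normally be reported as missing, which is the expected result when Drive is not mounted.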
Step 4: Download Results
After training completes:
```bash
# Sync trained models back from Drive
./training/scripts/sync_from_drive.sh \
  --source "drive/MyDrive/traina/experiments/" \
  --dest "training/experiments/"
```
Configuration Options
Environment Modes
| Mode | Description | Use Case |
|---|---|---|
| Development | Quick iterations, small dataset | Testing configs, debugging |
| Training | Full dataset, GPU acceleration | Production model training |
| Evaluation | Validation/test metrics only | Assessing trained models |
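One way to make the modes in the table selectable from a single notebook variable is sketched below. Everything in it is an assumption for illustration: the `Mode` enum, the `MODE_SETTINGS` table, and the specific values (such as a 5% data fraction for Development) are hypothetical and not taken from the project.

```python
from enum import Enum


class Mode(Enum):
    DEVELOPMENT = "development"
    TRAINING = "training"
    EVALUATION = "evaluation"


# Hypothetical per-mode settings; the values are illustrative only.
MODE_SETTINGS = {
    Mode.DEVELOPMENT: {"data_fraction": 0.05, "epochs": 1, "train": True},
    Mode.TRAINING: {"data_fraction": 1.0, "epochs": 100, "train": True},
    Mode.EVALUATION: {"data_fraction": 1.0, "epochs": 0, "train": False},
}


def settings_for(mode_name: str) -> dict:
    """Look up settings by case-insensitive mode name; raises ValueError if unknown."""
    return MODE_SETTINGS[Mode(mode_name.lower())]


print(settings_for("Development")["epochs"])  # prints 1
```

Using an enum means a typo in the mode name fails immediately instead of silently falling through to a default.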
GPU Optimization
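```python
# Enable mixed precision for faster training.
# Assumes model, criterion, optimizer, and dataloader are already defined,
# that the model is on the GPU, and that each batch is an (inputs, targets) pair.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with autocast():  # run the forward pass in reduced precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # scale the loss to limit fp16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()                # adjust the scale factor for the next iteration
```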
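If a batch still does not fit in memory with mixed precision, the Troubleshooting section suggests gradient accumulation. The sketch below is a framework-free illustration of why that works, not project code: for a mean-reduced loss and equal-sized micro-batches, summing each micro-batch gradient divided by the number of accumulation steps gives the same gradient as one large batch. The 1-D model and `mse_grad` helper are hypothetical.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One large batch of 4 samples.
full_grad = mse_grad(w, xs, ys)

# Two micro-batches of 2 samples each. Each micro-batch gradient is divided by
# the number of accumulation steps before being added, mirroring the common
# practice of scaling the loss by 1 / accumulation_steps.
accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps
accumulated = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
    accumulated += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(math.isclose(full_grad, accumulated))  # prints True
```

In a training loop, the equivalent is to call the optimizer step only once every `accumulation_steps` micro-batches.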
Best Practices
- Version Control
  - Git tracks code and configs locally
  - Drive stores datasets and model weights
  - Never commit large `.pt` files to git
- Experiment Tracking
  - Use descriptive experiment names
  - Log all hyperparameters
  - Save training curves and metrics
- Data Management
  - Keep training data in Drive under versioned folders
  - Use symbolic links for large datasets
  - Validate data integrity before training
- Cost Optimization
  - Colab Free: Limited GPU hours per day
  - Colab Pro: Faster GPUs, longer sessions
  - Monitor runtime limits and save checkpoints regularly to avoid losing progress when a session disconnects
- "Run All" Configuration Pattern
  - Use config cells at the top of notebooks with skip flags
  - Allows selective experiment execution when using "Run All"
  - Example pattern:

    ```python
    # === EXPERIMENT CONFIGURATION ===
    DRY_RUN = False                 # Smoke test mode
    SKIP_EXPERIMENT_1 = True        # Skip if inconclusive/slow
    THRESHOLD_MODEL = 'production'  # Model selection
    ```

  - Each experiment cell checks its skip flag before executing
  - Ensures reproducible "Run All" behavior
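The skip-flag check in each experiment cell might be structured as in the sketch below. `DRY_RUN` and `SKIP_EXPERIMENT_1` come from the pattern above; `SKIP_EXPERIMENT_2`, `run_experiment`, and the two experiment functions are hypothetical names added for illustration.

```python
# === EXPERIMENT CONFIGURATION ===
DRY_RUN = False
SKIP_EXPERIMENT_1 = True
SKIP_EXPERIMENT_2 = False  # hypothetical second flag, added for illustration


def run_experiment(name, skip, fn):
    """Run fn unless skip is set; return a status string for logging."""
    if skip:
        print(f"[skip] {name}")
        return "skipped"
    print(f"[run] {name} (dry run: {DRY_RUN})")
    fn()
    return "ran"


def experiment_1():
    print("experiment 1 body")


def experiment_2():
    print("experiment 2 body")


# In a notebook, each of these calls would sit in its own cell.
status_1 = run_experiment("experiment_1", SKIP_EXPERIMENT_1, experiment_1)
status_2 = run_experiment("experiment_2", SKIP_EXPERIMENT_2, experiment_2)
```

Routing every experiment through one helper keeps the skip logic in a single place, so "Run All" produces a consistent log of what ran and what was skipped.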
Troubleshooting
| Issue | Solution |
|---|---|
| Drive not mounting | Re-authenticate, check permissions |
| CUDA out of memory | Reduce batch_size, use gradient accumulation |
| Training interrupted | Enable checkpointing, resume from last epoch |
| Slow data loading | Pre-cache dataset to local Colab storage |
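For the "Training interrupted" row, resuming requires knowing which checkpoint is the most recent. The sketch below finds it using only the standard library. The `epoch_<N>.pt` naming scheme and the helper names are assumptions for illustration; the project's actual checkpoint layout is not described in this document.

```python
import os
import re

# Hypothetical naming scheme: checkpoints are saved as epoch_<N>.pt
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    found = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(checkpoint_dir, name)))
    return max(found) if found else None


def start_epoch(checkpoint_dir):
    """Epoch to start from: one past the last saved epoch, or 0 for a fresh run."""
    latest = latest_checkpoint(checkpoint_dir)
    return 0 if latest is None else latest[0] + 1
```

Parsing the epoch as an integer matters here: a plain string sort would rank `epoch_9.pt` above `epoch_10.pt`. Restoring the model and optimizer state from the returned path would typically go through `torch.load` and `load_state_dict`, which this sketch leaves out.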