Modal GPU Training
Overview
Modal is a serverless platform for running Python code on cloud GPUs. It provides:
- •Serverless GPUs: On-demand access to T4, A10G, A100 GPUs
- •Container Images: Define dependencies declaratively with pip
- •Remote Execution: Run functions on cloud infrastructure
- •Result Handling: Return Python objects from remote functions
Two patterns:
- •Single Function: Simple script with
@app.functiondecorator - •Multi-Function: Complex workflows with multiple remote calls
Quick Reference
| Topic | Reference |
|---|---|
| Basic Structure | Getting Started |
| GPU Options | GPU Selection |
| Data Handling | Data Download |
| Results & Outputs | Results |
| Troubleshooting | Common Issues |
Installation
bash
pip install modal modal token set --token-id <id> --token-secret <secret>
Minimal Example
python
import modal
app = modal.App("my-training-app")
image = modal.Image.debian_slim(python_version="3.11").pip_install(
"torch",
"einops",
"numpy",
)
@app.function(gpu="A100", image=image, timeout=3600)
def train():
import torch
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
# Training code here
return {"loss": 0.5}
@app.local_entrypoint()
def main():
results = train.remote()
print(results)
Common Imports
python
import modal from modal import Image, App # Inside remote function import torch import torch.nn as nn from huggingface_hub import hf_hub_download
When to Use What
| Scenario | Approach |
|---|---|
| Quick GPU experiments | gpu="T4" (16GB, cheapest) |
| Medium training jobs | gpu="A10G" (24GB) |
| Large-scale training | gpu="A100" (40/80GB, fastest) |
| Long-running jobs | Set timeout=3600 or higher |
| Data from HuggingFace | Download inside function with hf_hub_download |
| Return metrics | Return dict from function |
Running
bash
# Run script modal run train_modal.py # Run in background modal run --detach train_modal.py
External Resources
- •Modal Documentation: https://modal.com/docs
- •Modal Examples: https://github.com/modal-labs/modal-examples