Lambda Labs GPU Cloud
Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.
When to use Lambda Labs
Use Lambda Labs when:
- Need dedicated GPU instances with full SSH access
- Running long training jobs (hours to days)
- Want simple pricing with no egress fees
- Need persistent storage across sessions
- Require high-performance multi-node clusters (16-512 GPUs)
- Want pre-installed ML stack (Lambda Stack with PyTorch, CUDA, NCCL)
Key features:
- GPU variety: B200, H100, GH200, A100, A10, A6000, V100
- Lambda Stack: Pre-installed PyTorch, TensorFlow, CUDA, cuDNN, NCCL
- Persistent filesystems: Keep data across instance restarts
- 1-Click Clusters: 16-512 GPU Slurm clusters with InfiniBand
- Simple pricing: Pay-per-minute, no egress fees
- Global regions: 12+ regions worldwide
Use alternatives instead:
- Modal: For serverless, auto-scaling workloads
- SkyPilot: For multi-cloud orchestration and cost optimization
- RunPod: For cheaper spot instances and serverless endpoints
- Vast.ai: For GPU marketplace with lowest prices
Quick start
Account setup
- Create account at https://lambda.ai
- Add payment method
- Generate API key from dashboard
- Add SSH key (required before launching instances)
Launch via console
- Go to https://cloud.lambda.ai/instances
- Click "Launch instance"
- Select GPU type and region
- Choose SSH key
- Optionally attach filesystem
- Launch and wait 3-15 minutes
Connect via SSH
bash
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>
# Or with a specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
GPU instances
Available GPUs
| GPU | VRAM | Price/GPU/hr | Best For |
|---|---|---|---|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |
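The per-GPU prices above turn into a job cost once multiplied by GPU count and wall-clock time. A minimal sketch, assuming per-minute billing rounds up to whole minutes (the rounding rule is an assumption, not something the table states); `estimate_cost` is a hypothetical helper and the prices are copied from the table:

```python
import math

# Per-GPU hourly prices in USD, copied from the table above.
PRICE_PER_GPU_HOUR = {
    "B200 SXM6": 4.99,
    "H100 PCIe": 2.49,
    "GH200": 1.49,
    "A100 80GB": 1.79,
    "A100 40GB": 1.29,
    "A10": 0.75,
    "A6000": 0.80,
    "V100": 0.55,
}


def estimate_cost(gpu: str, num_gpus: int, minutes: float) -> float:
    """Estimate job cost in USD, billing whole minutes (assumed rounding)."""
    billed_minutes = math.ceil(minutes)
    per_minute = PRICE_PER_GPU_HOUR[gpu] * num_gpus / 60
    return round(per_minute * billed_minutes, 2)


if __name__ == "__main__":
    # 8x A100 40GB for 90 minutes
    print(estimate_cost("A100 40GB", 8, 90))
```

Check the pricing page before relying on these numbers, since list prices change.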
Instance configurations
code
8x GPU: Best for distributed training (DDP, FSDP)
4x GPU: Large models, multi-GPU training
2x GPU: Medium workloads
1x GPU: Fine-tuning, inference, development
Launch times
- Single-GPU: 3-5 minutes
- Multi-GPU: 10-15 minutes
Lambda Stack
All instances come with Lambda Stack pre-installed:
bash
# Included software
- Ubuntu 22.04 LTS
- NVIDIA drivers (latest)
- CUDA 12.x
- cuDNN 8.x
- NCCL (for multi-GPU)
- PyTorch (latest)
- TensorFlow (latest)
- JAX
- JupyterLab
Verify installation
bash
# Check GPU
nvidia-smi
# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"
# Check CUDA version
nvcc --version
Python API
Installation
bash
pip install lambda-cloud-client
Authentication
python
import os
import lambda_cloud_client
# Configure with API key
configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"]
)
List available instances
python
with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)
    # Get available instance types
    types = api.instance_types()
    for name, info in types.data.items():
        print(f"{name}: {info.instance_type.description}")
Launch instance
python
from lambda_cloud_client.models import LaunchInstanceRequest
request = LaunchInstanceRequest(
    region_name="us-west-1",
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],  # Optional
    name="training-job"
)
response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
List running instances
python
instances = api.list_instances()
for instance in instances.data:
    print(f"{instance.name}: {instance.ip} ({instance.status})")
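Because launches take several minutes, scripts usually poll the instance status before connecting. A minimal sketch of such a loop: `wait_until_active` is a hypothetical helper, `get_status` is any callable you supply that returns the current status string, and the `"active"` status value is an assumption to confirm against the API reference:

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 15 * 60,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "active" or the timeout elapses.

    Returns True if the instance became active, False on timeout.
    """
    waited = 0.0
    while True:
        if get_status() == "active":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

With the client above, `get_status` could look up the instance by ID in `api.list_instances()` and return its `status`. The injectable `sleep` keeps the loop easy to test without real delays.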
Terminate instance
python
from lambda_cloud_client.models import TerminateInstanceRequest
request = TerminateInstanceRequest(
    instance_ids=[instance_id]
)
api.terminate_instance(request)
SSH key management
python
from lambda_cloud_client.models import AddSshKeyRequest
# Add SSH key
request = AddSshKeyRequest(
    name="my-key",
    public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)
# List keys
keys = api.list_ssh_keys()
# Delete key (key_id is the ID of a key returned by list_ssh_keys)
api.delete_ssh_key(key_id)
CLI with curl
List instance types
bash
curl -u $LAMBDA_API_KEY: \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq
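The JSON from this call can be filtered to find instance types that currently have capacity, which helps when a launch fails in one region. A minimal sketch, assuming each entry under `data` carries a `regions_with_capacity_available` list of objects with a `name` field (an assumed schema; check the API reference). `available_regions` is a hypothetical helper:

```python
from typing import Dict, List


def available_regions(response: dict) -> Dict[str, List[str]]:
    """Map instance type name -> regions reporting capacity (assumed schema)."""
    result: Dict[str, List[str]] = {}
    for name, info in response.get("data", {}).items():
        regions = [r["name"] for r in info.get("regions_with_capacity_available", [])]
        if regions:  # skip types with no capacity anywhere
            result[name] = regions
    return result


if __name__ == "__main__":
    # Hand-written sample in the assumed shape, not real API output.
    sample = {
        "data": {
            "gpu_1x_h100_sxm5": {"regions_with_capacity_available": [{"name": "us-west-1"}]},
            "gpu_1x_a10": {"regions_with_capacity_available": []},
        }
    }
    print(available_regions(sample))
```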
Launch instance
bash
curl -u $LAMBDA_API_KEY: \
-X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
-H "Content-Type: application/json" \
-d '{
"region_name": "us-west-1",
"instance_type_name": "gpu_1x_h100_sxm5",
"ssh_key_names": ["my-key"]
}' | jq
Terminate instance
bash
curl -u $LAMBDA_API_KEY: \
-X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
-H "Content-Type: application/json" \
-d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
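The same calls can be made from Python's standard library when `curl` and `jq` are not available. A minimal sketch mirroring the `curl -u $LAMBDA_API_KEY:` form above, which is HTTP basic auth with the API key as the username and an empty password; `build_request` is a hypothetical helper:

```python
import base64
import json
import urllib.request
from typing import Optional

API_BASE = "https://cloud.lambdalabs.com/api/v1"


def build_request(api_key: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request (no payload) or a JSON POST request (with payload)."""
    # Basic auth: base64("<key>:") with an empty password, as curl -u does.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode()
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(f"{API_BASE}{path}", data=data, headers=headers, method=method)
```

To send a request, pass the result to `urllib.request.urlopen` and decode the body with `json.load`. For example, `build_request(key, "/instance-operations/terminate", {"instance_ids": ["<INSTANCE-ID>"]})` corresponds to the terminate call above.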
Persistent storage
Filesystems
Filesystems persist data across instance restarts:
bash
# Mount location
/lambda/nfs/<FILESYSTEM_NAME>
# Example: save checkpoints
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
Create filesystem
- Go to Storage in Lambda console
- Click "Create filesystem"
- Select region (must match instance region)
- Name and create
Attach to instance
Filesystems must be attached at instance launch time:
- Via console: Select filesystem when launching
- Via API: Include file_system_names in the launch request
Best practices
bash
# Store on filesystem (persists)
/lambda/nfs/storage/
├── datasets/
├── checkpoints/
├── models/
└── outputs/
# Local SSD (faster, ephemeral)
/home/ubuntu/
└── working/    # Temporary files
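One way to combine the two tiers is to copy a dataset from the persistent filesystem to local disk once per instance and read from the local copy during training. A minimal sketch; `stage_to_local` is a hypothetical helper and the example paths follow the layout above:

```python
import shutil
from pathlib import Path


def stage_to_local(src: Path, local_root: Path) -> Path:
    """Copy the directory src into local_root unless a copy already exists.

    Returns the path of the local copy.
    """
    dest = local_root / src.name
    if not dest.exists():
        local_root.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dest)
    return dest


if __name__ == "__main__":
    # Hypothetical dataset name; adjust to your own layout.
    local = stage_to_local(
        Path("/lambda/nfs/storage/datasets/my-dataset"),
        Path("/home/ubuntu/working"),
    )
    print(f"Training data staged at {local}")
```

The existence check only detects a previous copy, not a partial or stale one, so delete the local directory if the source changes.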
SSH configuration
Add SSH key
bash
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key
# Add public key to Lambda console
# Or via API
Multiple keys
bash
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
Import from GitHub
bash
# On instance
ssh-import-id gh:username
SSH tunneling
bash
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>
# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
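When the set of forwarded ports varies between projects, the tunnel command can be assembled programmatically. A minimal sketch; `tunnel_command` is a hypothetical helper, and the added `-N` flag (forward ports without running a remote command) is a choice made here, not part of the commands above:

```python
import subprocess
from typing import List, Optional


def tunnel_command(
    host: str,
    ports: List[int],
    user: str = "ubuntu",
    key_path: Optional[str] = None,
) -> List[str]:
    """Build an ssh argv that forwards each port to the same port on the instance."""
    cmd = ["ssh", "-N"]
    if key_path:
        cmd += ["-i", key_path]
    for port in ports:
        cmd += ["-L", f"{port}:localhost:{port}"]
    cmd.append(f"{user}@{host}")
    return cmd


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address; use your instance IP.
    argv = tunnel_command("203.0.113.10", [8888, 6006])
    print(" ".join(argv))
    # subprocess.run(argv)  # uncomment to open the tunnel
```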
JupyterLab
Launch from console
- Go to Instances page
- Click "Launch" in Cloud IDE column
- JupyterLab opens in browser
Manual access
bash
# On instance
jupyter lab --ip=0.0.0.0 --port=8888
# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Open http://localhost:8888
Training workflows
Single-GPU training
bash
# SSH to instance
ssh ubuntu@<IP>
# Clone repo
git clone https://github.com/user/project
cd project
# Install dependencies
pip install -r requirements.txt
# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
Multi-GPU training (single node)
python
# train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def main():
    dist.init_process_group("nccl")
    # torchrun sets LOCAL_RANK to this process's GPU index on the node
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = MyModel().to(local_rank)  # MyModel: your own nn.Module
    model = DDP(model, device_ids=[local_rank])
    # Training loop...
    dist.destroy_process_group()
if __name__ == "__main__":
    main()
bash
# Launch with torchrun (8 GPUs)
torchrun --nproc_per_node=8 train_ddp.py
Checkpoint to filesystem
python
import os
import torch
checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)
# Save checkpoint (epoch, model, optimizer, loss come from the training loop)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
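Resuming after an interruption means finding the most recent file written with the `checkpoint_{epoch}.pt` naming used above. A minimal sketch; `latest_checkpoint` is a hypothetical helper that compares epoch numbers numerically, so `checkpoint_10.pt` ranks above `checkpoint_9.pt`:

```python
import re
from pathlib import Path
from typing import Optional

# Matches the checkpoint_{epoch}.pt naming scheme used when saving.
_CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    best_path: Optional[Path] = None
    best_epoch = -1
    for path in Path(checkpoint_dir).glob("checkpoint_*.pt"):
        match = _CHECKPOINT_RE.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = path
    return best_path
```

The returned path can be passed to `torch.load`, after which the `model_state_dict` and `optimizer_state_dict` entries are restored with `load_state_dict` and training continues from the stored `epoch`.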
1-Click Clusters
Overview
High-performance Slurm clusters with:
- 16-512 NVIDIA H100 or B200 GPUs
- NVIDIA Quantum-2 400 Gb/s InfiniBand
- GPUDirect RDMA at 3200 Gb/s
- Pre-installed distributed ML stack
Included software
- Ubuntu 22.04 LTS + Lambda Stack
- NCCL, Open MPI
- PyTorch with DDP and FSDP
- TensorFlow
- OFED drivers
Storage
- 24 TB NVMe per compute node (ephemeral)
- Lambda filesystems for persistent data
Multi-node training
bash
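# On Slurm cluster: start one torchrun per node; torchrun spawns the 8 workers
# MASTER_ADDR must be set to the hostname of the first node
srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
  torchrun --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py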
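Inside `train.py`, each worker learns its place in the job from environment variables that torchrun sets: `RANK`, `LOCAL_RANK` and `WORLD_SIZE`. With 4 nodes and 8 processes per node the world size is 32. A minimal sketch for reading them; `DistEnv` and `read_dist_env` are hypothetical names, and the single-process defaults are a convenience for running without torchrun:

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process (and its GPU) on its node
    world_size: int  # total number of processes in the job


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the torchrun-provided variables, defaulting to a single process."""
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    if dist_env.rank == 0:
        # Log or save checkpoints from one process only to avoid duplicate writes.
        print(f"Running with {dist_env.world_size} processes")
```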
Networking
Bandwidth
- Inter-instance (same region): up to 200 Gbps
- Internet outbound: 20 Gbps max
Firewall
- Default: Only port 22 (SSH) open
- Configure additional ports in Lambda console
- ICMP traffic allowed by default
Private IPs
bash
# Find private IP
ip addr show | grep 'inet '
Common workflows
Workflow 1: Fine-tuning LLM
bash
# 1. Launch 8x H100 instance with filesystem
# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft
# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"
# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
--model_path /lambda/nfs/storage/models/llama-2-7b \
--output_dir /lambda/nfs/storage/outputs \
--checkpoint_dir /lambda/nfs/storage/checkpoints
Workflow 2: Batch inference
bash
# 1. Launch A10 instance (cost-effective for inference)
# 2. Run inference
python inference.py \
  --model /lambda/nfs/storage/models/fine-tuned \
  --input /lambda/nfs/storage/data/inputs.jsonl \
  --output /lambda/nfs/storage/data/outputs.jsonl
Cost optimization
Choose right GPU
| Task | Recommended GPU |
|---|---|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |
Reduce costs
- Use filesystems: Avoid re-downloading data
- Checkpoint frequently: Resume interrupted training
- Right-size: Don't over-provision GPUs
- Terminate idle instances: There is no auto-stop, so terminate them manually
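Since nothing stops an idle instance automatically, a small watchdog on the instance can flag when the GPUs have gone quiet. A minimal sketch of the detection logic only, assuming a 5% utilization threshold (an arbitrary choice); `parse_utilization`, `is_idle` and `sample_gpus` are hypothetical helpers, and what to do once idleness is detected (alert, or call the terminate API) is left to you:

```python
import subprocess
from typing import List


def parse_utilization(csv_output: str) -> List[int]:
    """Parse one utilization percentage per line, one line per GPU."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def is_idle(samples: List[List[int]], threshold_pct: int = 5) -> bool:
    """True if every GPU stayed at or below the threshold in every sample."""
    return bool(samples) and all(u <= threshold_pct for sample in samples for u in sample)


def sample_gpus() -> List[int]:
    """Query current utilization of all GPUs with nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_utilization(out)
```

Collecting a sample every minute and checking `is_idle` over the last 30 samples gives a rough "idle for half an hour" signal. Data loading or CPU-only preprocessing also shows as low GPU utilization, so treat the result as a hint.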
Monitor usage
- Dashboard shows real-time GPU utilization
- API for programmatic monitoring
Common issues
| Issue | Solution |
|---|---|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |
References
- Advanced Usage - Multi-node training, API automation
- Troubleshooting - Common issues and solutions
Resources
- Documentation: https://docs.lambda.ai
- Console: https://cloud.lambda.ai
- Pricing: https://lambda.ai/instances
- Support: https://support.lambdalabs.com
- Blog: https://lambda.ai/blog