Amazon SageMaker HyperPod Expert
You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.
When This Skill Activates
- Creating HyperPod clusters (EKS or Slurm)
- Running distributed ML training jobs
- Troubleshooting cluster issues
- Checking quotas or instance availability
- User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"
Detailed Guides
| Guide | Use When |
|---|---|
| reference/eks-guide.md | EKS orchestration, hyp CLI, add-ons, Pod Identity |
| reference/slurm-guide.md | Slurm orchestration, lifecycle scripts, SBATCH |
| reference/troubleshooting.md | Error diagnosis and solutions |
Orchestrator Selection
| Aspect | EKS | Slurm |
|---|---|---|
| AZ Requirement | 2+ AZs required | Single AZ OK |
| Primary Tool | hyp CLI | AWS CLI |
| Job Submission | PyTorchJob via hyp create | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |
Instance Types
| Instance Type | Accelerator | Count | Use Case |
|---|---|---|---|
| ml.p4d.24xlarge | A100 | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | Cost-effective |
| ml.trn1n.32xlarge | Trainium | 16 | Higher network |
IMPORTANT: ml.trn1.2xlarge is NOT supported for HyperPod. Use ml.trn1.32xlarge (or ml.trn1n.32xlarge) instead.
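The table and the note above can be turned into a quick guard that runs before any cluster request. This is a minimal sketch: `is_supported_hyperpod_type` is a hypothetical helper name, and its list mirrors only the types in this table, not the full set HyperPod may support.

```shell
# Hypothetical helper: succeeds only for instance types listed in the table above.
# The list mirrors this document, not the full set HyperPod may support.
is_supported_hyperpod_type() {
  case "$1" in
    ml.p4d.24xlarge|ml.p4de.24xlarge|ml.p5.48xlarge|ml.trn1.32xlarge|ml.trn1n.32xlarge)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage: fail early instead of waiting for the API to reject the request.
if ! is_supported_hyperpod_type "ml.trn1.2xlarge"; then
  echo "ml.trn1.2xlarge is not in the supported list; use ml.trn1.32xlarge" >&2
fi
```

A local check like this only catches typos and known-bad sizes; the quota query in step 1 remains the authoritative test.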
CRITICAL: Pre-Creation Validation
ALWAYS perform these checks BEFORE creating a cluster:
1. Verify Instance Type Support
    # The quota name must contain "for cluster usage"
    aws service-quotas list-service-quotas \
      --service-code sagemaker --region us-east-1 \
      --query 'Quotas[?contains(QuotaName, `<INSTANCE_TYPE>`) && contains(QuotaName, `cluster`)].[QuotaName,Value]' \
      --output table
2. Check AZ Availability
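    # Returns AZ names (e.g. us-east-1a). Use --location-type availability-zone-id
    # instead to get AZ IDs (e.g. use1-az6), which is the form config.yaml expects.
    aws ec2 describe-instance-type-offerings \
      --location-type availability-zone \
      --filters Name=instance-type,Values=trn1.32xlarge \
      --region us-east-1 \
      --query 'InstanceTypeOfferings[*].Location' --output text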
3. For EKS: Ensure 2+ AZs in config.yaml
    availability_zone_ids:
      - use1-az6  # Primary for workers
      - use1-az4  # Secondary for EKS HA
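The two-AZ rule is easy to miss when editing config.yaml by hand, so it can help to check it mechanically before creating the cluster. The sketch below is one possible approach, not part of the `hyp` CLI: `count_config_azs` and `require_two_azs` are hypothetical helper names, and the parser assumes `availability_zone_ids` is written as a block-style YAML list with one `- value` per line.

```shell
# Hypothetical helper: count entries under availability_zone_ids in a config file.
# Assumes a block-style YAML list (one "- value" per line).
count_config_azs() {
  awk '
    /^[ \t]*availability_zone_ids:/ { in_list = 1; next }
    in_list && /^[ \t]*-[ \t]/      { n++; next }
    in_list && /^[ \t]*(#.*)?$/     { next }          # skip blank and comment-only lines
    in_list                         { in_list = 0 }   # any other key ends the list
    END { print n + 0 }
  ' "$1"
}

# Hypothetical guard: fail before cluster creation if fewer than two AZs are configured.
require_two_azs() {
  az_count=$(count_config_azs "$1")
  if [ "$az_count" -lt 2 ]; then
    echo "config lists $az_count AZ(s); EKS orchestration needs at least 2" >&2
    return 1
  fi
}
```

A typical call would be `require_two_azs config.yaml && hyp validate`, so validation is skipped when the AZ list is too short.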
4. Check K8s Version (EKS Only)
    WebFetch: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar
    Prompt: What is the latest Kubernetes version in standard support?
5. Check Add-on Compatibility (EKS Only)
Before upgrading K8s versions, verify HyperPod add-ons support the target version:
    aws eks describe-addon-versions \
      --addon-name amazon-sagemaker-hyperpod-taskgovernance \
      --query 'addons[0].addonVersions[*].compatibilities[*].clusterVersion' \
      --output text
WARNING: EKS does NOT support downgrading a cluster's Kubernetes version. If you need HyperPod add-ons, stay on a version they support.
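Because an upgrade cannot be rolled back, the compatibility query is worth turning into a pass/fail check. The following is a sketch under stated assumptions: `addon_supports_version` and `check_taskgovernance_addon` are hypothetical names, and the wrapper assumes the query prints a whitespace-separated list of versions as it does with `--output text`.

```shell
# Hypothetical helper: succeeds if Kubernetes version $1 appears in the
# whitespace-separated list $2. Exact string match, so "1.3" does not match "1.30".
addon_supports_version() {
  for candidate in $2; do
    if [ "$candidate" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper around the query shown above; requires AWS credentials.
# Returns 2 if the AWS call itself fails, 1 if the version is not listed.
check_taskgovernance_addon() {
  compatible=$(aws eks describe-addon-versions \
    --addon-name amazon-sagemaker-hyperpod-taskgovernance \
    --query 'addons[0].addonVersions[*].compatibilities[*].clusterVersion' \
    --output text) || return 2
  addon_supports_version "$1" "$compatible"
}
```

For example, `check_taskgovernance_addon 1.31 || echo "do not upgrade yet"` would gate an upgrade on the add-on listing that version.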
EKS Quick Start
    # 1. Install CLI
    pip install sagemaker-hyperpod
    # 2. Initialize cluster stack
    hyp init cluster-stack my-cluster
    cd my-cluster
    # 3. Edit config.yaml (ensure 2+ AZs!)
    # 4. Validate and create
    hyp validate && hyp create cluster-stack --region us-east-1
    # 5. Set context
    hyp set-cluster-context --cluster-name <NAME> --region us-east-1
Submit Training Job (EKS)
    # Option 1: Using config file (recommended)
    hyp init hyp-pytorch-job my-job
    cd my-job
    # Edit config.yaml
    hyp validate
    hyp create hyp-pytorch-job
    # Option 2: Command line
    hyp create hyp-pytorch-job \
      --job-name my-job \
      --image <ECR-IMAGE> \
      --instance-type ml.trn1.32xlarge \
      --node-count 1 \
      --accelerators 16 \
      --accelerators-limit 16
Monitor Training Job (EKS)
    # List jobs
    hyp list hyp-pytorch-job
    # Job details
    hyp describe hyp-pytorch-job --job-name <NAME>
    # View logs
    hyp get-logs hyp-pytorch-job --job-name <NAME> --follow
    # List pods
    hyp list-pods hyp-pytorch-job --job-name <NAME>
    # Delete job
    hyp delete hyp-pytorch-job --job-name <NAME>
Full guide: See orchestrators/eks/job-submission.md
Slurm Quick Start
    # 1. Prepare lifecycle scripts (use AWS samples)
    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/
    # 2. Upload to S3
    aws s3 cp . s3://my-bucket/lifecycle-scripts/ --recursive
    # 3. Create cluster
    aws sagemaker create-cluster --cluster-name my-cluster \
      --instance-groups '[...]' --vpc-config "..."
    # 4. Connect via SSM
    aws ssm start-session --target <instance-id>
Full workflow: See reference/slurm-guide.md
Model Compatibility (Trainium/Inferentia)
CRITICAL: Verify model support before configuring Trainium jobs.
Check Support
    WebFetch: https://huggingface.co/docs/optimum-neuron/en/supported_architectures
    Prompt: List supported model architectures for training on Trainium
Currently Supported (Training)
| Architecture | Tensor Parallelism | Pipeline Parallelism |
|---|---|---|
| Llama, Llama 2, Llama 3 | Yes | Yes |
| Qwen3 | Yes | Yes |
| Granite | Yes | No |
Common Errors (Quick Reference)
| Error | Cause | Solution |
|---|---|---|
| InvalidParameterException (EKS) | Single AZ | Add 2+ AZs to config |
| ml.trn1.2xlarge not found | Unsupported type | Use ml.trn1.32xlarge |
| Training Operator pod fails | Missing Pod Identity | See EKS guide |
| Insufficient cpu | Full node request | Use partial resources |
| Accelerator request != limit | Limits mismatch | Set accelerators_limit = accelerators |
| EFA health check failed | Multi-AZ | Use single subnet with OverrideVpcConfig |
| Add-on not supported | K8s version | Check add-on compatibility before upgrade |
Full troubleshooting: See reference/troubleshooting.md
Infrastructure Requirements
EFA Single-AZ Requirement
For EFA-enabled instances (trn1, p4d, p5), ALL instances MUST be in the SAME AZ.
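A quick way to enforce this is to collect the AZ of every worker subnet and confirm they are identical. The sketch below shows only the comparison step; `all_in_same_az` is a hypothetical helper, and how the AZ IDs are gathered (for example from subnet metadata) is left to the caller.

```shell
# Hypothetical helper: succeeds only if every argument is the same AZ ID.
# Fails when called with no arguments, since there is nothing to confirm.
all_in_same_az() {
  first=""
  for az in "$@"; do
    if [ -z "$first" ]; then
      first="$az"
    elif [ "$az" != "$first" ]; then
      return 1
    fi
  done
  [ -n "$first" ]
}

# Usage with illustrative AZ IDs:
if ! all_in_same_az use1-az6 use1-az6 use1-az4; then
  echo "worker subnets span multiple AZs; EFA instances need a single AZ" >&2
fi
```

This applies to the worker instance groups only. On EKS, the second AZ in config.yaml serves the control plane and does not conflict with this rule.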
Security Group
Must allow ALL traffic within itself:
    aws ec2 authorize-security-group-ingress \
      --group-id sg-xxx --protocol all --port -1 --source-group sg-xxx
CIDR Sizing
| Orchestrator | IPs per P5 |
|---|---|
| Slurm | 32 |
| EKS | 81 (includes pods) |
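The per-node figures translate into a subnet size with a little arithmetic. The sketch below multiplies the table's per-P5 values by the node count and finds the smallest IPv4 subnet that fits. It assumes the 5 addresses AWS reserves in every subnet and the /28 to /16 subnet size limits; `required_ips` and `smallest_subnet_prefix` are hypothetical helper names, and the figures apply to P5 nodes only.

```shell
# Hypothetical helper: IPs needed for N P5 nodes, using the per-node values
# from the table above (32 for Slurm, 81 for EKS).
required_ips() {
  case "$1" in
    slurm) per_node=32 ;;
    eks)   per_node=81 ;;
    *) echo "unknown orchestrator: $1" >&2; return 1 ;;
  esac
  echo $(( per_node * $2 ))
}

# Hypothetical helper: smallest subnet (largest prefix length) whose usable
# address count covers the requirement. AWS reserves 5 addresses per subnet.
smallest_subnet_prefix() {
  needed=$1
  prefix=28
  while [ "$prefix" -ge 16 ]; do
    usable=$(( (1 << (32 - prefix)) - 5 ))
    if [ "$usable" -ge "$needed" ]; then
      echo "$prefix"
      return 0
    fi
    prefix=$(( prefix - 1 ))
  done
  return 1  # does not fit in a single /16 subnet
}
```

For 4 P5 nodes on EKS, `required_ips eks 4` gives 324, and `smallest_subnet_prefix 324` gives 23, since a /24 offers only 251 usable addresses. Leave extra headroom if the subnet is shared with other resources.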
Quota Management
    # Check quota
    aws service-quotas get-service-quota \
      --service-code sagemaker --quota-code L-6865522E --region us-east-1
    # Request increase (quotas are per region, so use the same region as the check)
    aws service-quotas request-service-quota-increase \
      --service-code sagemaker --quota-code L-6865522E --desired-value 4 \
      --region us-east-1
Common codes:
- L-6865522E: ml.trn1.32xlarge for cluster usage
- L-5C4CD236: ml.p5.48xlarge for cluster usage
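The quota check and the planned node count can be compared in a script rather than by eye. This is a sketch, not an official workflow: `quota_covers` and `check_cluster_quota` are hypothetical helpers, and the comparison goes through `awk` on the assumption that Service Quotas reports the value as a decimal such as `4.0`.

```shell
# Hypothetical helper: succeeds if quota value $1 is at least the needed count $2.
# Uses awk so that decimal values such as "4.0" compare correctly.
quota_covers() {
  awk -v quota="$1" -v needed="$2" 'BEGIN { exit !(quota + 0 >= needed + 0) }'
}

# Hypothetical wrapper: $1 = quota code, $2 = nodes needed, $3 = region.
# Requires AWS credentials. Returns 2 if the AWS call itself fails.
check_cluster_quota() {
  value=$(aws service-quotas get-service-quota \
    --service-code sagemaker --quota-code "$1" --region "$3" \
    --query 'Quota.Value' --output text) || return 2
  quota_covers "$value" "$2"
}
```

For example, `check_cluster_quota L-6865522E 4 us-east-1` would succeed only if the ml.trn1.32xlarge cluster quota in that region is 4 or more.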
Diagnostic Commands
    # Cluster status
    aws sagemaker describe-cluster --cluster-name NAME
    # List nodes
    aws sagemaker list-cluster-nodes --cluster-name NAME
    # CloudWatch logs
    aws logs get-log-events \
      --log-group-name /aws/sagemaker/Clusters/NAME/ID \
      --log-stream-name LifecycleConfig/GROUP/INSTANCE
    # EKS nodes/pods
    kubectl get nodes && kubectl get pods -A
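Cluster creation takes a while, so the status command is often run in a loop. The sketch below is one way to poll it. `get_cluster_status` and `wait_for_cluster` are hypothetical helpers, and the loop assumes `describe-cluster` exposes a `ClusterStatus` field that reaches `InService` on success or `Failed` on error.

```shell
# Hypothetical helper: print the cluster status for cluster name $1.
# Assumes the response carries a ClusterStatus field; requires AWS credentials.
get_cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

# Hypothetical poller: $1 = cluster name, $2 = max attempts, $3 = seconds between attempts.
# Returns 0 when InService, 1 on Failed, 2 if the status call fails, 3 on timeout.
wait_for_cluster() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    status=$(get_cluster_status "$1") || return 2
    case "$status" in
      InService) return 0 ;;
      Failed)
        echo "cluster $1 failed; check the lifecycle logs in CloudWatch" >&2
        return 1 ;;
    esac
    attempt=$(( attempt + 1 ))
    sleep "$3"
  done
  return 3
}
```

A call such as `wait_for_cluster my-cluster 60 30` would check every 30 seconds for up to 30 minutes. On failure, the CloudWatch log command above is the next place to look.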