GPU Provisioner
Use this skill to source and manage training compute across cloud GPU providers. The default behavior is to select the cheapest viable hardware that satisfies the training requirements.
Providers
Support provider CLI/API workflows for:
- RunPod
- Vast.ai
- Lambda
- Kaggle
Core Workflow
- Gather workload requirements (model size, seq length, batch target, precision, wall-clock goal).
- Query available instances from supported providers.
- Rank by total expected run cost, not hourly price only.
- Present top options with tradeoffs and recommendation.
- Wait for explicit user approval before provisioning.
- Provision selected instance, configure access, and set auto-termination.
Cheapest Instance Finder Logic
Evaluate each candidate using:
- VRAM fit and expected utilization
- Expected throughput for target training stack
- Hourly rate + storage + egress + startup overhead
- Reliability signals (availability, preemption risk, region constraints)
Report at least two options when available:
- best cost-efficiency option
- best stability option
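The ranking above can be sketched as follows. This is a minimal illustration, not a provider integration: the `Candidate` fields, the `preemption_risk` score (0.0 to 1.0), and all sample numbers are hypothetical, and real per-GPU runtime estimates would have to come from throughput measurements for the target training stack.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one offer; field names are assumptions."""
    name: str
    hourly_usd: float       # advertised hourly rate
    est_hours: float        # expected training hours on this hardware
    startup_hours: float    # billed time before training starts
    storage_usd: float      # storage cost for the run
    egress_usd: float       # data transfer cost for the run
    vram_gb: float
    preemption_risk: float  # 0.0 (stable) to 1.0 (likely preempted)


def total_cost(c: Candidate) -> float:
    """Total expected run cost, not just the hourly price."""
    return c.hourly_usd * (c.est_hours + c.startup_hours) + c.storage_usd + c.egress_usd


def pick_options(candidates, required_vram_gb):
    """Return (best cost-efficiency, best stability) among offers that fit in VRAM."""
    viable = [c for c in candidates if c.vram_gb >= required_vram_gb]
    if not viable:
        return None, None
    best_cost = min(viable, key=total_cost)
    # Ties on risk are broken by the lower total cost.
    best_stability = min(viable, key=lambda c: (c.preemption_risk, total_cost(c)))
    return best_cost, best_stability


offers = [
    Candidate("slow-cheap", 0.40, 20.0, 0.25, 1.0, 0.5, 24, 0.30),
    Candidate("fast-pricey", 1.10, 4.0, 0.10, 1.0, 0.5, 80, 0.10),
    Candidate("stable", 1.50, 4.0, 0.10, 1.0, 0.5, 80, 0.02),
    Candidate("too-small", 0.20, 30.0, 0.25, 1.0, 0.5, 16, 0.05),
]
cheapest, steadiest = pick_options(offers, required_vram_gb=24)
```

In this sample, the offer with the lowest viable hourly rate is not the cheapest overall because its longer runtime dominates the total.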
Cost Estimation
Always estimate before provisioning:
- hourly cost
- expected total run cost
- checkpoint/storage cost
- safety buffer for retries and restarts
If uncertainty is material, provide best-case / expected / worst-case range.
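One way to produce such a range is sketched below. The `retry_buffer` and `spread` defaults are illustrative assumptions, not values prescribed by this skill: the best case assumes a shorter run with no retries, while the expected and worst cases both include the safety buffer.

```python
def estimate_range(hourly_usd, est_hours, storage_usd, retry_buffer=0.15, spread=0.25):
    """Return a best / expected / worst cost range in USD.

    retry_buffer: fractional safety buffer for retries and restarts (assumed).
    spread: fractional uncertainty on the runtime estimate (assumed).
    """
    best = hourly_usd * est_hours * (1 - spread) + storage_usd
    expected = (hourly_usd * est_hours + storage_usd) * (1 + retry_buffer)
    worst = (hourly_usd * est_hours * (1 + spread) + storage_usd) * (1 + retry_buffer)
    return {
        "hourly_usd": hourly_usd,
        "storage_usd": storage_usd,
        "best_case_usd": round(best, 2),
        "expected_usd": round(expected, 2),
        "worst_case_usd": round(worst, 2),
    }


# Example: a $2.00/hour instance, 10 estimated hours, $5.00 of checkpoint storage.
estimate = estimate_range(hourly_usd=2.0, est_hours=10.0, storage_usd=5.0)
```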
Spend Confirmation Gate (Mandatory)
- Never provision paid resources without explicit user confirmation.
- Always request explicit confirmation for any spend action.
- If estimated spend for a single action exceeds $100 USD, require a clear confirmation message before continuing.
- If projected spend drifts above estimate during runtime, pause and re-confirm.
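The gate can be sketched as two small checks. How a "clear confirmation message" is recognized is an assumption here: this sketch treats it as a reply that restates the amount (`confirm 250.00`), and the accepted short replies for smaller amounts are likewise hypothetical.

```python
CONFIRM_THRESHOLD_USD = 100.0  # threshold stated in the gate above


def is_approved(estimated_usd, reply):
    """Return True only when the user's reply explicitly approves the spend."""
    text = (reply or "").strip().lower()
    if estimated_usd > CONFIRM_THRESHOLD_USD:
        # Assumed convention: larger spends must restate the amount.
        return text == f"confirm {estimated_usd:.2f}"
    # Assumed set of explicit approvals; silence or anything else is a refusal.
    return text in {"yes", "y", "confirm", "approve"}


def needs_reconfirmation(estimate_usd, projected_usd):
    """Return True when runtime spend is projected to exceed the approved estimate."""
    return projected_usd > estimate_usd
```

An empty or ambiguous reply is treated as a refusal, so the default outcome is always to spend nothing.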
SSH + Instance Management
- Configure SSH keys for secure access to new instances.
- Validate login and GPU visibility after provisioning.
- Provide lifecycle actions: start, stop, restart, terminate.
- Keep a project-local inventory of active instances and endpoints.
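A project-local inventory can be as simple as a JSON file keyed by instance ID. The sketch below assumes a hypothetical file layout and field names (`provider`, `ssh_endpoint`, `state`); none of them are dictated by any provider.

```python
import json
from pathlib import Path


def load_inventory(path):
    """Read the inventory file, or return an empty inventory if it does not exist."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def upsert_instance(path, instance_id, provider, ssh_endpoint, state):
    """Add or update one instance record and write the file back."""
    inventory = load_inventory(path)
    inventory[instance_id] = {
        "provider": provider,
        "ssh_endpoint": ssh_endpoint,
        "state": state,  # e.g. "running" or "stopped" (assumed state names)
    }
    Path(path).write_text(json.dumps(inventory, indent=2, sort_keys=True))
    return inventory


def drop_instance(path, instance_id):
    """Remove a terminated instance from the inventory."""
    inventory = load_inventory(path)
    inventory.pop(instance_id, None)
    Path(path).write_text(json.dumps(inventory, indent=2, sort_keys=True))
    return inventory
```

Dropping a record only after termination has been verified keeps the file an accurate list of what is still billable.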
Auto-Termination Safety
Set idle/timeout-based auto-shutdown by default. Terminate the instance when any of these conditions is met:
- training completed
- no heartbeat for defined interval
- explicit budget cap reached
Surface countdown and termination policy in status updates.
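The termination policy can be expressed as a single check that a watchdog loop calls periodically. This is a sketch under assumptions: the function names, the heartbeat timeout, and the status wording are all hypothetical.

```python
def termination_reason(training_done, seconds_since_heartbeat, heartbeat_timeout_s,
                       spent_usd, budget_cap_usd=None):
    """Return the reason to terminate now, or None to keep the instance running."""
    if training_done:
        return "training completed"
    if seconds_since_heartbeat >= heartbeat_timeout_s:
        return "no heartbeat for defined interval"
    if budget_cap_usd is not None and spent_usd >= budget_cap_usd:
        return "budget cap reached"
    return None


def status_line(seconds_since_heartbeat, heartbeat_timeout_s):
    """Countdown text for status updates."""
    remaining = max(0, heartbeat_timeout_s - seconds_since_heartbeat)
    return (f"auto-terminate in {remaining}s without a heartbeat "
            f"(timeout {heartbeat_timeout_s}s)")
```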
Kaggle Provisioning
For Kaggle, "provisioning" means verifying quota and kernel availability:
- Run `kaggle kernels list` to check for active kernels.
- Run `kaggle hardware` (if available); otherwise assume P100/T4 GPUs are available as long as the weekly quota has not been exceeded.
- Warning: Kaggle kernels have a 30h weekly limit for GPU use. Run `kaggle competitions list` to confirm that the API is working.
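These checks can be wrapped in a small helper. The sketch below only calls the two commands named above (`kaggle competitions list` and `kaggle kernels list`) and treats a zero exit code as success. It assumes the GPU hours already used this week are supplied by the caller, since this sketch does not rely on the CLI reporting quota. The `run` parameter is a hypothetical hook for substituting a fake runner.

```python
import subprocess

WEEKLY_GPU_HOURS = 30.0  # weekly GPU limit stated above


def cli_ok(args, run=subprocess.run):
    """Return True if the command runs and exits with code 0."""
    try:
        result = run(args, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        # CLI not installed, not executable, or timed out.
        return False
    return result.returncode == 0


def kaggle_ready(hours_used, hours_needed, run=subprocess.run):
    """Check API access, kernel listing, and whether the job fits the weekly quota."""
    api_ok = cli_ok(["kaggle", "competitions", "list"], run)
    kernels_ok = api_ok and cli_ok(["kaggle", "kernels", "list"], run)
    remaining = max(0.0, WEEKLY_GPU_HOURS - hours_used)
    return {
        "api_ok": api_ok,
        "kernels_ok": kernels_ok,
        "gpu_hours_remaining": remaining,
        "fits_quota": hours_needed <= remaining,
    }
```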
Required Environment Variables
- RUNPOD_API_KEY
- VAST_API_KEY
- KAGGLE_USERNAME
- KAGGLE_KEY
Optional provider credentials may be added per user environment when needed.
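A preflight check for these variables might look like the following sketch. The helper name `missing_env` is hypothetical, and empty values are treated as missing.

```python
import os

REQUIRED_VARS = ("RUNPOD_API_KEY", "VAST_API_KEY", "KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_env(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_missing = missing_env({"RUNPOD_API_KEY": "x", "KAGGLE_USERNAME": "me", "KAGGLE_KEY": ""})
```

Reporting only the variable names, never their values, keeps credentials out of logs and status updates.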
Deliverables
- `gpu_plan.md` (ranked options + recommendation)
- `cost_estimate.json` (assumptions + range)
- `instance_manifest.json` (allocated resources + lifecycle settings)