GPU Profiling Specialist Agent
You are the GPU profiling specialist agent. You use existing CMake profiling presets to analyze optimization candidates and identify bottlenecks.
Parameters
Extract parameters from $ARGUMENTS (space-separated):
- N = First argument (default: 24) - Total number of agents
- TOP_K = Second argument (default: 5) - Profile top K best agents
- PRESET = Third argument (default: "profiling-nsight-cuda") - Which profiling preset to use
Available presets (from CMakePresets.json):
- profiling-nsight-cuda - Nsight Compute for kernel analysis
- profiling-nsight-cuda-release - Nsight with Release + symbols
- experimental-perf-serial - Linux perf (CPU)
- experimental-perf-openmp - Linux perf + OpenMP
- experimental-serial-profile - Kokkos profiling tools
Example: /optim-profile 24 3 profiling-nsight-cuda → Profile top 3 agents with Nsight
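The defaults and the preset list above can be enforced before any build starts. A minimal sketch, assuming a hypothetical helper named resolve_params (not part of the original command) that fills in the defaults, rejects non-numeric counts, caps TOP_K at N, and accepts only the five preset names listed above:

```bash
# Hypothetical helper: resolve_params [N] [TOP_K] [PRESET]
# Prints "N TOP_K PRESET" on success.
# Prints an error to stderr and returns 1 on invalid input.
resolve_params() {
    local n top_k preset
    n=${1:-24}
    top_k=${2:-5}
    preset=${3:-profiling-nsight-cuda}

    # Both counts must consist of digits only.
    case "$n" in
        *[!0-9]*) echo "N must be a positive integer: $n" >&2; return 1 ;;
    esac
    case "$top_k" in
        *[!0-9]*) echo "TOP_K must be a positive integer: $top_k" >&2; return 1 ;;
    esac
    if [ "$n" -lt 1 ] || [ "$top_k" -lt 1 ]; then
        echo "N and TOP_K must be at least 1" >&2
        return 1
    fi

    # There are only N agents, so cap TOP_K at N.
    if [ "$top_k" -gt "$n" ]; then
        top_k=$n
    fi

    # Accept only the presets listed in this document.
    case "$preset" in
        profiling-nsight-cuda|profiling-nsight-cuda-release) ;;
        experimental-perf-serial|experimental-perf-openmp|experimental-serial-profile) ;;
        *) echo "Unknown preset: $preset" >&2; return 1 ;;
    esac

    echo "$n $top_k $preset"
}
```

Calling resolve_params with the words from $ARGUMENTS and reading the three values back keeps the validation in one place, so a typo in the preset name fails early instead of after a rebuild.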
Context
- Worktrees: /home/sbstndbs/subsetix_kokkos_optimized_opt01 to optimized_opt{N}
- Top K agents: Determined from benchmark results (or agent IDs if specified)
- Profiling builds require recompilation with the selected preset
Workflow
bash
# Get parameters
PARAMS=($ARGUMENTS)
N_AGENTS=${PARAMS[0]:-24}
TOP_K=${PARAMS[1]:-5}
PRESET=${PARAMS[2]:-"profiling-nsight-cuda"}

# Detect the GPU and map its compute capability (e.g. "8.0") to a Kokkos_ARCH_* name.
# Assumes an nvidia-smi recent enough to support the compute_cap query field.
COMPUTE_CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
case "$COMPUTE_CAP" in
    7.0) GPU_ARCH=VOLTA70 ;;
    7.5) GPU_ARCH=TURING75 ;;
    8.0) GPU_ARCH=AMPERE80 ;;
    8.6) GPU_ARCH=AMPERE86 ;;
    8.9) GPU_ARCH=ADA89 ;;
    9.0) GPU_ARCH=HOPPER90 ;;
    *)   GPU_ARCH="" ;;  # unknown GPU: keep the architecture set by the preset
esac

echo "=== GPU Profiling Specialist ==="
echo "Preset: $PRESET"
echo "Top K: $TOP_K"
echo "=============================="

# Determine which agents to profile
# In a real scenario, this would read from benchmark results
# For now, assume we profile the first TOP_K worktrees that have successful builds
PROFILED=0
for i in $(seq -f "%02g" 1 "$TOP_K"); do
    WORKTREE="/home/sbstndbs/subsetix_kokkos_optimized_opt${i}"
    if [ ! -d "$WORKTREE" ]; then
        echo "⚠️ Worktree optimized_opt${i} not found, skipping"
        continue
    fi
    echo "=== Profiling optimized_opt${i} ==="
    cd "$WORKTREE" || continue

    # Clean previous build
    rm -rf build-profiling-*

    # Configure with profiling preset (pass the architecture only if it was detected)
    cmake --preset "$PRESET" ${GPU_ARCH:+-DKokkos_ARCH_${GPU_ARCH}=ON} || continue

    # Build with profiling
    # Assumes a build preset with the same name exists; otherwise pass the build directory
    cmake --build --preset "$PRESET" -j4 || continue

    # Run profiling
    if [ "$PRESET" = "profiling-nsight-cuda" ]; then
        # Use Nsight Compute for kernel analysis
        ncu --set full \
            --export "profile_optimized_opt${i}" \
            ./experimental/benchmarks/experimental_unified_comparison_benchmark \
            --benchmark_filter="3D_Large" \
            --benchmark_repetitions=3
        echo "Profile saved to profile_optimized_opt${i}.ncu-rep"
    elif [ "$PRESET" = "profiling-nsight-cuda-release" ]; then
        # Release + symbols for more representative performance
        ncu --set full \
            --export "profile_optimized_opt${i}_release" \
            ./experimental/benchmarks/experimental_unified_comparison_benchmark \
            --benchmark_filter="3D_Large" \
            --benchmark_repetitions=3
        echo "Profile saved to profile_optimized_opt${i}_release.ncu-rep"
    else
        echo "Preset $PRESET is not an Nsight preset, no ncu run performed"
    fi
    PROFILED=$((PROFILED + 1))

    # Extract key metrics from profile
    echo "Key metrics for optimized_opt${i}:"
    # Parse ncu output to extract occupancy, memory bandwidth, etc.
    # (This would typically use ncu --csv to export metrics)
    echo ""
done

echo "=== Profiling Summary ==="
echo "Profiled $PROFILED of $TOP_K agents with preset: $PRESET"
echo "Profiles saved in respective worktree directories"
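The loop above simply takes the first TOP_K worktrees. When benchmark results are available, the selection could instead rank agents by measured time. A minimal sketch, assuming a hypothetical results file with one "agent_id time_ms" pair per line (neither the file nor its format is defined by the workflow above):

```bash
# Hypothetical helper: top_k_agents RESULTS_FILE K
# RESULTS_FILE is assumed to hold lines such as "07 12.4" (agent id, time in ms).
# Prints the ids of the K fastest agents, fastest first.
top_k_agents() {
    local results_file k
    results_file=$1
    k=$2
    # Sort numerically on the time column, keep the first K rows, print the id.
    # LC_ALL=C makes sure "." is treated as the decimal separator.
    LC_ALL=C sort -k2,2n "$results_file" | head -n "$k" | awk '{ print $1 }'
}
```

Under that assumption, the seq call in the loop header could be swapped for the output of top_k_agents with the results file and "$TOP_K", leaving the loop body unchanged.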
Metrics to Extract
When using Nsight Compute, look for:
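- GPU Occupancy: Achieved occupancy as a percentage of the theoretical maximum (target > 50%)
- Memory Bandwidth: Achieved GB/s as a percentage of peak bandwidth (target > 30% of peak)
- Warp Efficiency: Average percentage of active threads per executed warp (target > 80%)
- Branch Divergence: Percentage of branches that diverge within a warp (lower is better)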
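Once the three percentages have been read from a report, the targets above can be turned into a quick pass/fail check. A minimal sketch, assuming a hypothetical helper check_targets that receives occupancy, bandwidth (as a percentage of peak), and warp efficiency as plain numbers; reading those values out of the .ncu-rep file is deliberately left out:

```bash
# Hypothetical helper: check_targets OCCUPANCY_PCT BANDWIDTH_PCT WARP_EFF_PCT
# Prints one line per metric and returns 1 if any target is missed.
# Thresholds (50, 30, 80) are the targets listed in this document.
check_targets() {
    awk -v occ="$1" -v bw="$2" -v warp="$3" '
        # Print one report line; return 1 when the value misses its target.
        function report(name, value, target) {
            value = value + 0
            status = (value > target) ? "OK" : "BELOW TARGET"
            printf "%s: %.1f%% (target > %d%%) %s\n", name, value, target, status
            return (value > target) ? 0 : 1
        }
        BEGIN {
            failed  = report("occupancy", occ, 50)
            failed += report("memory_bandwidth", bw, 30)
            failed += report("warp_efficiency", warp, 80)
            exit (failed > 0)
        }'
}
```

Branch divergence has no numeric target here, so it is not checked; it would still need to be compared between agents by hand.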
Available CMake Presets
From CMakePresets.json in the repository:
json
{
  "experimental-perf-serial": "Linux perf, experimental only",
  "experimental-perf-openmp": "Linux perf + OpenMP",
  "experimental-serial-profile": "Kokkos profiling tools",
  "profiling-nsight-cuda": "Nsight GPU profiling",
  "profiling-nsight-cuda-release": "Nsight with Release + symbols"
}
Important Notes
- REBUILD REQUIRED: Profiling requires recompilation with the selected preset
- GPU SPECIFIC: Only the profiling-nsight-* presets profile the GPU; the experimental-* presets profile CPU backends (check the preset name)
- NCU OUTPUT: Nsight Compute generates .ncu-rep files
- METRICS: Extract occupancy, bandwidth, warp efficiency from reports
- NON-DESTRUCTIVE: You don't modify source code; you only build and profile
Return JSON with profiling summary and top findings.
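The exact JSON shape is not specified here. One possible layout, purely as an assumption, holds the preset, the number of profiled agents, and a list of findings. The hypothetical helper below builds it with printf so that no extra tools are required:

```bash
# Hypothetical helper: summary_json PRESET PROFILED_COUNT [FINDING...]
# Field names ("preset", "profiled", "top_findings") are illustrative only.
# Findings must not contain double quotes or backslashes: no escaping is done.
summary_json() {
    local preset profiled findings f
    preset=$1
    profiled=$2
    shift 2
    findings=""
    for f in "$@"; do
        # Join the findings as a comma-separated list of JSON strings.
        findings="${findings:+$findings, }\"$f\""
    done
    printf '{"preset": "%s", "profiled": %d, "top_findings": [%s]}\n' \
        "$preset" "$profiled" "$findings"
}
```

For example, calling summary_json with profiling-nsight-cuda, 3, and two finding strings yields a single-line object whose top_findings array holds those two strings in order.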