Kernel Performance Testing
Never run performance tests unless the user explicitly asks.
GPU selection protocol
- Run nvidia-smi to check GPU occupancy.
- Pick the GPU with the lowest memory usage.
- Set CUDA_VISIBLE_DEVICES to that GPU.
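The selection steps above can be sketched as a small shell helper. This is an illustrative sketch, not part of the repository: the function name pick_least_used_gpu is hypothetical, and it assumes nvidia-smi's CSV query output of the form "index, memory.used", one GPU per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "index, memory.used" CSV rows on stdin and
# prints the index of the GPU with the lowest memory usage.
pick_least_used_gpu() {
  # Sort numerically on the second CSV field, keep the first row,
  # then extract the index and strip any surrounding spaces.
  sort -t, -k2,2n | head -n 1 | cut -d, -f1 | tr -d ' '
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_id="$(nvidia-smi --query-gpu=index,memory.used \
              --format=csv,noheader,nounits | pick_least_used_gpu)"
  if [ -n "$gpu_id" ]; then
    export CUDA_VISIBLE_DEVICES="$gpu_id"
    echo "Using GPU ${gpu_id}"
  fi
fi
```

Memory usage is only a proxy for occupancy, so it is still worth checking the nvidia-smi process list before starting a long run.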
Benchmark commands
All benchmarks must be wrapped with denoise.sh for stable results.
Hopper GPU
bash
CUDA_VISIBLE_DEVICES=<gpu_id> third_party/tlx/denoise.sh python third_party/tlx/tutorials/testing/test_hopper_gemm_perf.py [--version {ws|pipelined}]
CUDA_VISIBLE_DEVICES=<gpu_id> third_party/tlx/denoise.sh python third_party/tlx/tutorials/testing/test_hopper_fa_perf.py [--version {ws|ws_pipelined|ws_pipelined_pingpong|ws_pipelined_pingpong_persistent}]
Blackwell GPU
bash
CUDA_VISIBLE_DEVICES=<gpu_id> third_party/tlx/denoise.sh python third_party/tlx/tutorials/testing/test_blackwell_gemm_perf.py [--version {ws|pipelined|clc|2cta}]
CUDA_VISIBLE_DEVICES=<gpu_id> third_party/tlx/denoise.sh python third_party/tlx/tutorials/testing/test_blackwell_fa_perf.py [--version {ws|ws_pipelined|ws_pipelined_pingpong|ws_pipelined_pingpong_persistent}]
Other kernels
bash
CUDA_VISIBLE_DEVICES=<gpu_id> third_party/tlx/denoise.sh python third_party/tlx/tutorials/<KERNEL.py>
If tests hang
Run third_party/tlx/killgpu.sh to kill GPU processes that have been running too long.
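One way to notice a hang early is to run the benchmark under a time limit. The sketch below assumes GNU coreutils timeout is available; the wrapper name run_with_limit and the 30m limit in the usage note are illustrative choices, not repository conventions.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under a time limit and point at
# killgpu.sh if the limit is hit. GNU timeout exits with status 124
# when it has to terminate the command.
run_with_limit() {
  limit="$1"
  shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "Command exceeded ${limit}; consider running third_party/tlx/killgpu.sh" >&2
  fi
  return "$status"
}
```

For example, `run_with_limit 30m third_party/tlx/denoise.sh python third_party/tlx/tutorials/<KERNEL.py>` would stop waiting after 30 minutes and report the timeout.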
Interpreting results
- Output reports TFLOPS for each problem size and configuration.
- Compare against cuBLAS baselines when available (printed alongside Triton results).
- Higher TFLOPS = better. Look for regressions relative to previous runs.
- Check for consistency across runs; high variance suggests noisy measurements (ensure denoise.sh is being used).
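To put a number on run-to-run consistency, one option is the coefficient of variation (standard deviation divided by mean) of the TFLOPS values for a single configuration across repeated runs. The sketch below assumes the values have already been extracted from the benchmark output, one per line; the function name cv_percent is hypothetical, and what counts as "too noisy" is left to the reader.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads one TFLOPS value per line on stdin and prints
# the coefficient of variation (population stddev / mean) as a percentage.
cv_percent() {
  awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      if (n == 0 || sum == 0) { print "nan"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against tiny negative rounding error
      printf "%.2f\n", 100 * sqrt(var) / mean
    }'
}

# Example: two runs of the same configuration at 90 and 110 TFLOPS.
printf '90\n110\n' | cv_percent   # prints 10.00
```

A value near zero means the runs agree closely. A large value is a cue to confirm the benchmark was launched through denoise.sh before reading anything into the TFLOPS differences.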