qwen3-tts-rs Profiling & Benchmarking
Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.
Prerequisites
- Docker with --gpus all support
- qwen3-tts:latest Docker image (has Rust toolchain + CUDA)
- Model weights in test_data/models/ (1.7B-CustomVoice is the default)
- tokenizer.json must be in the model directory
Docker Execution Pattern
The CUDA toolchain lives inside the Docker container. All cargo commands must
run there. The workspace is bind-mounted at /workspace:
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
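Since every mode below repeats the same docker invocation, it may be convenient to wrap it once. The following is a minimal sketch, not part of the repository: the function names in_container and container_cmd and the TOOLCHAIN_BIN variable are hypothetical, while the image name, mount point, and toolchain path are copied from the pattern above.

```shell
# Hypothetical helpers (not shipped with the repo). The image name, mount point,
# and toolchain path are copied from the Docker execution pattern.
TOOLCHAIN_BIN=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin

# Build the string handed to `bash -c` inside the container.
# $PATH stays literal here so that it is expanded inside the container.
container_cmd() {
  printf 'export PATH=%s:$PATH && %s' "$TOOLCHAIN_BIN" "$*"
}

# Run one command string inside the container with the workspace bind-mounted.
in_container() {
  docker run --rm --gpus all --entrypoint /bin/bash \
    -v "$(pwd):/workspace" -w /workspace \
    qwen3-tts:latest \
    -c "$(container_cmd "$@")"
}

# Example (requires Docker and the image):
# in_container 'cargo run --release --features=cuda,cli --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```

Keeping the string construction in its own function makes it possible to inspect the exact command before anything is launched.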
Profiling Modes
1. Chrome Trace (default — best for span hierarchy)
Produces trace.json for viewing in chrome://tracing or https://ui.perfetto.dev.
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
--model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
Output: trace.json (~12MB for 3 sentences). Contains spans:
- generate_frames: full generation loop
- code_predictor / code_predictor_inner: per-frame acoustic code generation
- talker_step: per-frame transformer forward pass
- sampling / top_k / top_p: per-frame token sampling
- gpu_sync trace events: mark every to_vec1() GPU→CPU sync
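Before opening a 12MB trace in a viewer, a quick count of events per name can confirm that the expected spans were recorded. This is a rough sketch that assumes each event in trace.json carries a "name" string field, as in the Chrome Trace Event format; the function name trace_event_counts is hypothetical. If spans are written as begin/end pairs, each span contributes two events, and a "name" key nested inside event arguments would also be counted.

```shell
# Count trace events per name, most frequent first.
# Assumes events carry a "name" string field; whitespace around the colon is tolerated.
trace_event_counts() {
  grep -oE '"name"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | sed -E 's/^"name"[[:space:]]*:[[:space:]]*"//; s/"$//' \
    | sort | uniq -c | sort -rn
}

# Example: trace_event_counts trace.json
```

Dividing the gpu_sync count by the number of frames gives a direct check on the syncs-per-frame figure.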
2. Per-Stage Timing (no profiling feature needed)
The e2e_bench binary reports stage breakdowns (prefill / generation / decode)
even without the profiling feature:
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
cargo run --release --features=cuda,cli --bin e2e_bench -- \
--model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
3. Streaming TTFA (Time to First Audio)
# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
--iterations 3 --warmup 1 --streaming
4. JSON Output
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
--json-output results.json --iterations 3
GPU Sync Audit
List all to_vec1() GPU→CPU synchronization points:
bash scripts/audit-gpu-syncs.sh
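The contents of scripts/audit-gpu-syncs.sh are not shown here. As a rough stand-in for quick checks, a plain recursive grep over the Rust sources lists the same kind of call sites. The function name list_gpu_syncs and the default src directory are assumptions, and this sketch is not claimed to match what the real script does.

```shell
# Hypothetical stand-in, not the repository's audit script:
# list every to_vec1 call site in Rust files under a directory (default: src).
list_gpu_syncs() {
  grep -rn --include='*.rs' 'to_vec1' "${1:-src}"
}

# Example: list_gpu_syncs src | wc -l
```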
Interpreting Results
Stage Breakdown Table
| Label | Words | Wall (ms) | Audio (s) | RTF | Tok/s | Mem (MB) | Prefill | Generate | Decode |
|---|---|---|---|---|---|---|---|---|---|
| short | 13 | 5235.2 | 3.68 | 1.423 | 8.8 | 858 | 21ms (1%) | 2724ms (71%) | 1109ms (29%) |
| medium | 53 | 23786.3 | 34.00 | 0.700 | 17.9 | 859 | 20ms (0%) | 22694ms (95%) | 1057ms (4%) |
| long | 115 | 43797.4 | 60.96 | 0.718 | 17.4 | 864 | 19ms (0%) | 41861ms (96%) | 1886ms (4%) |
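The percentages in the last three columns appear to be each stage's share of the summed stage time (prefill + generate + decode) rather than of wall time; the rows above are consistent with that reading. A small sketch to reproduce them, with the hypothetical function name stage_share:

```shell
# stage_share PREFILL_MS GENERATE_MS DECODE_MS
# Prints each stage as a rounded percentage of the three stages' total.
stage_share() {
  awk -v p="$1" -v g="$2" -v d="$3" 'BEGIN {
    total = p + g + d
    printf "prefill %.0f%% generate %.0f%% decode %.0f%%\n",
           100 * p / total, 100 * g / total, 100 * d / total
  }'
}

stage_share 21 2724 1109    # short row: prefill 1% generate 71% decode 29%
```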
Key metrics:
- RTF (real-time factor, wall time divided by audio duration): < 1.0 = faster than real-time
- Prefill: Should be <50ms on GPU. If high, check embedding/attention.
- Generation: Dominates. ~18 GPU→CPU syncs per frame (16 code_predictor + 2 sampling).
- Decode: ConvNeXt decoder. Scales with frame count. ~4% for long text.
- Tok/s: Semantic tokens per second. Higher = better.
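As a worked check of the RTF column, the value is wall time divided by audio duration, with wall time converted from milliseconds to seconds. The helper name rtf is hypothetical; the inputs are taken from the table above.

```shell
# rtf WALL_MS AUDIO_S -> real-time factor, printed to three decimals.
rtf() {
  awk -v wall_ms="$1" -v audio_s="$2" \
    'BEGIN { printf "%.3f\n", (wall_ms / 1000) / audio_s }'
}

rtf 5235.2 3.68      # short:  1.423 (slower than real-time)
rtf 23786.3 34.00    # medium: 0.700
rtf 43797.4 60.96    # long:   0.718
```

The short sentence lands above 1.0 even though its Prefill time matches the longer runs, so with so little audio to amortize over, fixed per-run costs weigh more heavily.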
Chrome Trace Analysis
In Perfetto/chrome://tracing:
- Look for gaps between talker_step and code_predictor: that's CPU overhead
- Check if sampling (top_k + top_p) is significant vs model forward passes
- The gpu_sync events mark where the CPU blocks on a GPU→CPU readback; the GPU then sits idle until the CPU queues more work
Optimization Targets
The ~18 to_vec1() calls per frame are the main bottleneck:
- 16 in code_predictor (argmax per acoustic code group)
- 2 in sampling (read sampled token)
Batch these to reduce GPU→CPU round-trips.
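To get a feel for the scale, the per-frame count can be multiplied out over one utterance. In the sketch below, the function name syncs_total is hypothetical, and the frame count of 762 is an estimate rather than a measured value: it is the long row's Tok/s multiplied by its wall time (17.4 × 43.8 s), which assumes Tok/s is computed over wall time and that one semantic token corresponds to one frame.

```shell
# syncs_total FRAMES [PER_FRAME] -> blocking GPU→CPU readbacks for one utterance.
# PER_FRAME defaults to 18 (16 code_predictor + 2 sampling).
syncs_total() {
  echo $(( $1 * ${2:-18} ))
}

syncs_total 762      # 13716 readbacks at ~18 per frame
syncs_total 762 1    # 762 if each frame needed only one batched readback
```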
Model Variants
| Model | Dir | Notes |
|---|---|---|
| 1.7B-CustomVoice | test_data/models/1.7B-CustomVoice | Default benchmark target |
| 1.7B-Base | test_data/models/1.7B-Base | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | test_data/models/1.7B-VoiceDesign | Text-described voices |
Reference Baseline (1.7B-CustomVoice, CUDA)
From January 2025 on DGX (A100):
- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill: ~20ms, Decode: ~1-2s, Generation: 71-96% of stage time
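One way to use this baseline is as a regression gate on RTF. The sketch below is an assumption-laden example, not a project convention: the function name rtf_regressed and the 10% default tolerance are hypothetical and should be tuned to the run-to-run noise seen on the target machine.

```shell
# rtf_regressed MEASURED BASELINE [TOLERANCE]
# Exit status 0 when MEASURED exceeds BASELINE by more than TOLERANCE (a fraction).
# The default TOLERANCE of 0.10 is an arbitrary illustrative value.
rtf_regressed() {
  awk -v m="$1" -v b="$2" -v tol="${3:-0.10}" \
    'BEGIN { exit !((m + 0) > (b + 0) * (1 + tol)) }'
}

# Example against the medium baseline (RTF 0.70):
if rtf_regressed 0.80 0.70; then
  echo "RTF regression: 0.80 vs baseline 0.70"
fi
```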