SIMD Optimization Skill
Patterns and learnings from SIMD optimization in this codebase.
Comprehensive documentation: See docs/optimizations/simd.md for full details on SIMD techniques.
Key Insight: Wider SIMD != Automatically Faster
Two AVX-512 optimizations implemented with dramatically different results:
AVX512-VPOPCNTDQ: 5.2x Speedup (Compute-Bound)
Implementation: src/bits/popcount.rs
- •Processes 8 u64 words (512 bits) in parallel
- •Hardware
_mm512_popcnt_epi64instruction - •Result: 96.8 GiB/s vs 18.5 GiB/s (scalar) = 5.2x faster
Why it wins: Pure compute-bound, embarrassingly parallel, no dependencies
AVX-512 JSON Parser: 7-17% Slower (Memory-Bound) - REMOVED
- •Processed 64 bytes/iteration (vs 32 for AVX2)
- •Result: 672 MiB/s vs 732 MiB/s (AVX2) = 8.9% slower
Why AVX2 won:
- •Memory-bound workload: Waiting for data from memory, not compute
- •AMD Zen 4 splits AVX-512 into two 256-bit micro-ops
- •State machine overhead: Wider SIMD = more bytes to process sequentially afterward
- •Cache alignment: 32-byte chunks fit cache lines better
When to Use AVX-512
- •Pure compute: math, crypto, compression
- •No memory bottlenecks
- •No sequential dependencies
- •Data-parallel algorithms
When NOT to Use AVX-512
- •Memory-bound workloads
- •Sequential state machines
- •Complex control flow
SIMD Instruction Set Hierarchy
| Level | Width | Bytes/Iter | Availability | Notes |
|---|---|---|---|---|
| SSE2 | 128bit | 16 | 100% | Universal baseline on x86_64 |
| SSE4.2 | 128bit | 16 | ~90% | PCMPISTRI string instructions |
| AVX2 | 256bit | 32 | ~95% | 2x width, best price/performance |
| BMI2 | N/A | N/A | ~95% | PDEP/PEXT, but AMD Zen 1/2 slow |
Compilation Model
Key insight: #[target_feature] is a compiler directive, not a runtime gate.
// All these compile on any x86_64:
#[target_feature(enable = "sse2")]
unsafe fn process_sse2(data: &[u8]) { ... }
#[target_feature(enable = "avx2")]
unsafe fn process_avx2(data: &[u8]) { ... }
// Runtime dispatch (requires std)
fn process(data: &[u8]) {
if is_x86_feature_detected!("avx2") {
unsafe { process_avx2(data) }
} else {
unsafe { process_sse2(data) }
}
}
ARM NEON Movemask
Problem: NEON lacks x86's _mm_movemask_epi8. Variable shifts are slow on M1.
Solution: Multiplication trick to pack bits:
#[inline]
#[target_feature(enable = "neon")]
unsafe fn neon_movemask(v: uint8x16_t) -> u16 {
let high_bits = vshrq_n_u8::<7>(v);
let low_u64 = vgetq_lane_u64::<0>(vreinterpretq_u64_u8(high_bits));
let high_u64 = vgetq_lane_u64::<1>(vreinterpretq_u64_u8(high_bits));
const MAGIC: u64 = 0x0102040810204080;
let low_packed = (low_u64.wrapping_mul(MAGIC) >> 56) as u8;
let high_packed = (high_u64.wrapping_mul(MAGIC) >> 56) as u8;
(low_packed as u16) | ((high_packed as u16) << 8)
}
Results: 10-18% improvement on string-heavy and nested JSON patterns.
SSE2 Unsigned Comparison
SSE2 lacks unsigned byte comparison. Use min trick:
unsafe fn unsigned_le(a: __m128i, b: __m128i) -> __m128i {
let min_ab = _mm_min_epu8(a, b);
_mm_cmpeq_epi8(min_ab, a) // a <= b iff min(a,b) == a
}
Testing Strategy
Problem: Runtime dispatch only tests highest available SIMD level.
Solution: Explicitly call each implementation in tests:
#[test]
fn test_all_simd_levels() {
let input = b"test input";
let expected = scalar_impl(input);
let sse2_result = unsafe { sse2::process(input) };
let avx2_result = unsafe { avx2::process(input) };
assert_eq!(sse2_result, expected);
assert_eq!(avx2_result, expected);
}
no_std Constraints
is_x86_feature_detected! requires std
Alternatives:
- •Use
count_ones()- LLVM optimizes to POPCNT with-C target-feature=+popcnt - •Use
#[target_feature(enable = "popcnt")]on functions - •Keep runtime detection only in
#[cfg(test)]blocks
ARM NEON is always available on aarch64
No runtime detection needed:
#[cfg(target_arch = "aarch64")]
{
// NEON intrinsics work without feature detection
unsafe { neon::process(data) }
}
Benchmark Commands
# Test AVX-512 popcount implementation cargo test --lib --features simd popcount # Benchmark popcount strategies cargo bench --bench popcount_strategies --features simd # Run comprehensive JSON benchmarks cargo bench --bench json_simd
Key Takeaways
- •Profile first, optimize second - Don't assume wider is better
- •Understand bottlenecks - Memory-bound vs compute-bound matters
- •Measure end-to-end - Micro-benchmarks can be misleading
- •Consider architecture - Zen 4 splits AVX-512, future Zen 5 may not
- •Amdahl's Law always wins - Optimize what matters (the slow 80%)
- •Remove failed optimizations - Slower code creates technical debt
See Also
- •docs/optimizations/simd.md - Comprehensive SIMD techniques reference
- •docs/optimizations/cache-memory.md - Memory-bound vs compute-bound analysis
- •docs/optimizations/branchless.md - SIMD masking techniques
- •docs/optimizations/history.md - Historical optimization record