Bit-Level Optimization Patterns
Learnings from optimizing succinct data structure operations.
Comprehensive documentation: See docs/optimizations/ for full technique reference:
- •bit-manipulation.md - Popcount, CTZ, PDEP/PEXT
- •lookup-tables.md - Byte-level lookup tables
- •hierarchical-structures.md - Rank/select indices
Byte-Level Lookup Tables
When to use: Scanning bits one-at-a-time is slow (64 iterations per word). Use 256-entry lookup tables to process 8 bits at once.
Pattern: Two-Level Lookup
// Precompute at compile time
const BYTE_MIN_EXCESS: [i8; 256] = { /* min prefix sum for each byte value */ };
const BYTE_TOTAL_EXCESS: [i8; 256] = { /* total excess (popcount*2 - 8) */ };
const BYTE_FIND_CLOSE: [[u8; 16]; 256] = { /* 2D: byte_value x initial_excess */ };
fn find_close_in_word_fast(word: u64, initial_excess: i32) -> Option<usize> {
let bytes = word.to_le_bytes();
let mut excess = initial_excess;
for (i, &byte_val) in bytes.iter().enumerate() {
// Level 1: Can the result be in this byte?
let min_excess_in_byte = BYTE_MIN_EXCESS[byte_val as usize] as i32;
if excess + min_excess_in_byte <= 0 {
// Level 2: Find exact position within byte
if excess <= 16 {
let match_pos = BYTE_FIND_CLOSE[byte_val as usize][(excess - 1) as usize];
return Some(i * 8 + match_pos as usize);
}
// Fallback for excess > 16: bit-by-bit scan of this byte only
}
excess += BYTE_TOTAL_EXCESS[byte_val as usize] as i32;
}
None
}
Key Insights
- •Two-level lookup: First check if result is in this byte (fast), then find exact position
- •Compile-time computation: Use
constblocks for zero runtime overhead - •Hierarchical skip: Skip entire bytes/words when result can't be there
- •Fallback for edge cases: Handle out-of-range lookup indices with bit scan
Results
- •50-90% speedup for BP operations
- •~90% speedup for construction
- •Deep nesting: 221ns vs 450ns (51% improvement)
Cumulative Index Pattern
When to use: O(n) operation called O(n) times = O(n^2). Add index for O(log n).
// Build cumulative popcount index
fn build_rank_index(words: &[u64]) -> Vec<u32> {
let mut rank = Vec::with_capacity(words.len() + 1);
let mut cumulative: u32 = 0;
rank.push(0);
for &word in words {
cumulative += word.count_ones();
rank.push(cumulative);
}
rank
}
// Binary search for select
fn select1(&self, k: usize) -> Option<usize> {
let k32 = k as u32;
let mut lo = 0;
let mut hi = self.words.len();
while lo < hi {
let mid = lo + (hi - lo) / 2;
if self.rank[mid + 1] <= k32 { lo = mid + 1; }
else { hi = mid; }
}
// Scan within word for exact bit position
let word_rank = self.rank[lo];
let remaining = k32 - word_rank;
select_in_word(self.words[lo], remaining as u32).map(|b| lo * 64 + b as usize)
}
Result: 627x speedup (2.76s to 4.4ms) when select is called per result.
Exponential Search for Sequential Select
When to use: Select queries during iteration (.users[]) access sequential positions.
Pattern: Galloping from Hint
fn select_with_hint(ib_rank: &[u32], k: usize, hint: usize) -> Option<usize> {
let k32 = k as u32;
let n = ib_rank.len() - 1;
let hint = hint.min(n.saturating_sub(1));
// Check direction from hint
let (lo, hi) = if ib_rank[hint + 1] <= k32 {
// Gallop forward: 1, 2, 4, 8, ... until overshoot
let mut bound = 1;
let mut prev = hint;
loop {
let next = (hint + bound).min(n);
if next >= n || ib_rank[next + 1] > k32 {
break (prev, next);
}
prev = next;
bound *= 2;
}
} else {
// Gallop backward
// ...
};
// Binary search within narrowed range [lo, hi]
}
Results
| Access Pattern | Binary Search | Exponential Search | Speedup |
|---|---|---|---|
| Sequential | 335 us | 102 us | 3.3x |
| Random | 780 us | 1.07 ms | 0.7x (regression) |
Trade-off
Exponential search adds ~37% overhead for random access. Consider providing two methods:
- •
select1_from(k, hint)- Fast for sequential (iteration) - •
select1_binary(k)- Fast for random (indexed lookup)
Hierarchical Skip Pattern
When searching for a position in a bitvector:
- •L2 blocks (1024 words): Skip if block's min-excess can't contain result
- •L1 blocks (32 words): Skip if block's min-excess can't contain result
- •L0 words (1 word): Skip if word's min-excess can't contain result
- •Bytes (8 bits): Skip if byte's min-excess can't contain result
- •Bits: Final scan within byte
Each level has precomputed statistics enabling O(1) skip decision.
Critical Lesson: Benchmark Against Production
Problem: PFSM batched processing appeared to be 40% improvement, but was actually 25% slower than production.
What Happened
| Implementation | Throughput | Role |
|---|---|---|
pfsm_optimized.rs | 516-578 MiB/s | Production (deployed) |
pfsm_simd.rs (batched) | 398-461 MiB/s | New attempt |
pfsm.rs (basic) | 282-344 MiB/s | Reference implementation |
The batched approach was benchmarked against pfsm.rs (basic reference), showing 40% improvement. But pfsm_optimized.rs was already deployed and was 25% faster.
The Rule
ALWAYS benchmark new optimizations against ALL existing implementations:
// RIGHT: Compare against everything, including production
group.bench_function("basic", |b| b.iter(|| pfsm::process(json)));
group.bench_function("batched", |b| b.iter(|| pfsm_simd::process(json)));
group.bench_function("production", |b| b.iter(|| pfsm_optimized::process(json)));
Checklist Before Declaring "Optimization Success"
- • Benchmark includes current production implementation
- • New code is faster than production, not just faster than a reference
- • Multiple input sizes tested (small, medium, large)
- • Multiple patterns tested (if applicable)
- • End-to-end benchmark confirms improvement carries through
Memory-Efficient Construction
Peak vs Steady-State Memory
Key insight: Optimizing construction reduces peak memory but may not affect steady-state memory (what benchmarks measure).
// BAD: Double allocation during construction
fn from_vec(data: Vec<u128>) -> Self {
let layout = Layout::from_size_align(len * 16, 64).unwrap();
let ptr = alloc::alloc(layout); // New allocation
copy_nonoverlapping(data.as_ptr(), ptr, len); // Copy
// data dropped here - peak was 2x
}
// GOOD: Builder pattern - single allocation
struct CacheAlignedBuilder {
ptr: NonNull<u128>,
len: usize,
capacity: usize,
}
impl CacheAlignedBuilder {
fn with_capacity(cap: usize) -> Self {
let layout = Layout::from_size_align(cap * 16, 64).unwrap();
let ptr = alloc::alloc(layout); // Allocate aligned immediately
Self { ptr, len: 0, capacity: cap }
}
fn push(&mut self, value: u128) {
self.ptr.as_ptr().add(self.len).write(value);
self.len += 1;
}
fn build(self) -> CacheAligned {
// Transfer ownership, no copy needed
let result = CacheAligned { ptr: self.ptr, len: self.len };
mem::forget(self); // Prevent Drop
result
}
}
Benefits:
- •Single allocation (not Vec + aligned copy)
- •50% peak memory reduction during construction
- •Cache-aligned from the start
"Succinct" Doesn't Mean Everything Is Compact
In this codebase:
- •Succinct operations: rank/select are O(1) with o(n) auxiliary space
- •Non-succinct storage:
bp_to_text: Vec<u32>uses 4 bytes per node (not compressed)
Memory breakdown for YamlIndex (~5x input):
| Component | Size (100MB input) | Notes |
|---|---|---|
bp_to_text + bp_to_text_end | ~80 MB | 4 bytes × 2 × 10M nodes - biggest target |
| BP bitvector + rank/select | ~20 MB | Actually succinct |
| Input text | 100 MB | Retained for value extraction |
| Other metadata | ~30 MB | Type info, containers bitvec |
Optimization opportunity: Position arrays are monotonic for scalars. Elias-Fano encoding could reduce 80 MB → ~6.6 MB (12× reduction). See compact-index-investigation.md.
See Also
- •docs/optimizations/bit-manipulation.md - Bit manipulation techniques
- •docs/optimizations/lookup-tables.md - Lookup table patterns
- •docs/optimizations/hierarchical-structures.md - Rank/select structures
- •docs/optimizations/access-patterns.md - Sequential vs random access
- •docs/optimizations/history.md - Historical optimization record
- •docs/plan/compact-index-investigation.md - Elias-Fano for position arrays