Performance Optimization Skill
Apply these principles when optimizing code for performance. Focus on the critical 3% where performance truly matters - a 12% improvement is never marginal in engineering.
Core Philosophy
- •Measure First: Never optimize without profiling data
- •Estimate Costs: Back-of-envelope calculations before implementation
- •Avoid Work: The fastest code is code that doesn't run
- •Reduce Allocations: Memory allocation is often the hidden bottleneck
- •Cache Locality: Memory access patterns dominate modern performance
Reference Latency Numbers
Use these for back-of-envelope calculations:
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1KB (Snappy) | 3 us |
| SSD random read (4KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |
Optimization Techniques (Priority Order)
1. Algorithmic Improvements (Highest Impact)
Always check algorithm complexity first:
O(N^2) → O(N log N) = 1000x faster for N=1M O(N) → O(1) = unbounded improvement
Common patterns:
- •Replace nested loops with hash lookups
- •Use sorted-list intersection → hash table lookups
- •Process in reverse post-order to eliminate per-element checks
- •Replace interval trees (O(log N)) with hash maps (O(1)) when ranges aren't needed
2. Reduce Memory Allocations
Allocation is expensive (~25-100ns + GC pressure)
# BAD: Allocates on every call
def process(items):
result = [] # New allocation
for item in items:
result.append(transform(item))
return result
# GOOD: Pre-allocate or reuse
def process(items, out=None):
if out is None:
out = [None] * len(items)
for i, item in enumerate(items):
out[i] = transform(item)
return out
Techniques:
- •Pre-size containers with
reserve()or known capacity - •Hoist temporary containers outside loops
- •Reuse buffers across iterations (clear instead of recreate)
- •Move instead of copy large structures
- •Use stack allocation for bounded-lifetime objects
3. Compact Data Structures
Minimize memory footprint and cache lines touched:
// BAD: 24 bytes due to padding
struct Item {
flag: bool, // 1 byte + 7 padding
value: i64, // 8 bytes
count: i32, // 4 bytes + 4 padding
}
// GOOD: 16 bytes with reordering
struct Item {
value: i64, // 8 bytes
count: i32, // 4 bytes
flag: bool, // 1 byte + 3 padding
}
Techniques:
- •Reorder struct fields by size (largest first)
- •Use smaller numeric types when range permits (u8 vs i32)
- •Replace 64-bit pointers with 32-bit indices
- •Keep hot fields together, move cold fields to separate struct
- •Use bit vectors for small-domain sets
- •Flatten nested maps:
Map<A, Map<B, C>>→Map<(A,B), C>
4. Fast Paths for Common Cases
Optimize the common case without hurting rare cases:
# BAD: Always takes slow path
def parse_varint(data):
return generic_varint_parser(data)
# GOOD: Fast path for common 1-byte case
def parse_varint(data):
if data[0] < 128: # Single byte - 90% of cases
return data[0], 1
return generic_varint_parser(data) # Rare multi-byte
Techniques:
- •Handle common dimensions inline (1-D, 2-D tensors)
- •Check for empty/trivial inputs early
- •Specialize for common sizes (small strings, few elements)
- •Process trailing elements separately to avoid slow generic code
5. Precompute Expensive Information
Trade memory for compute when beneficial:
# BAD: Recomputes on every access
def is_vowel(char):
return char.lower() in 'aeiou'
# GOOD: Lookup table
VOWEL_TABLE = [c.lower() in 'aeiou' for c in (chr(i) for i in range(256))]
def is_vowel(char):
return VOWEL_TABLE[ord(char)]
Techniques:
- •Precompute flags/properties at construction time
- •Build lookup tables for character classification
- •Cache expensive computed properties
- •Validate at boundaries once, not repeatedly inside
6. Bulk/Batch APIs
Amortize fixed costs across multiple operations:
# BAD: N round trips
for item in items:
result = db.lookup(item)
# GOOD: 1 round trip
results = db.lookup_many(items)
Design APIs that support:
- •Batch lookups instead of individual fetches
- •Vectorized operations over loops
- •Streaming interfaces for large datasets
7. Avoid Unnecessary Work
# BAD: Always computes expensive value
def process(data, config):
expensive = compute_expensive(data) # Always runs
if config.needs_expensive:
use(expensive)
# GOOD: Defer until needed
def process(data, config):
if config.needs_expensive:
expensive = compute_expensive(data) # Only when needed
use(expensive)
Techniques:
- •Lazy evaluation for expensive operations
- •Short-circuit evaluation in conditions
- •Move loop-invariant code outside loops
- •Specialize instead of using general-purpose libraries in hot paths
8. Help the Compiler/Runtime
Lower-level optimizations when profiling shows need:
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 { x * 2 }
// Copy to local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
let f = *factor; // Compiler knows this won't change
for x in data {
*x *= f;
}
}
Techniques:
- •Use raw pointers/indices instead of iterators in critical loops
- •Hand-unroll very hot loops (4+ iterations)
- •Move slow-path code to separate non-inlined functions
- •Avoid abstractions that hide costs in hot paths
9. Reduce Synchronization
Minimize lock contention and atomic operations:
- •Default to thread-compatible (external sync) not thread-safe
- •Sample statistics (1-in-32) instead of tracking everything
- •Batch updates to shared state
- •Use thread-local storage for per-thread data
Profiling Workflow
- •Identify hotspots: Use profiler (pprof, perf, py-spy)
- •Measure baseline: Write benchmark before optimizing
- •Estimate improvement: Calculate expected gain
- •Implement change: Focus on one optimization at a time
- •Verify improvement: Run benchmark, confirm gain
- •Check for regressions: Ensure no other code paths slowed down
When Profile is Flat (No Clear Hotspots)
- •Pursue multiple small 1-2% improvements collectively
- •Look for restructuring opportunities higher in call stack
- •Profile allocations specifically (often hidden cost)
- •Gather hardware performance counters (cache misses, branch mispredicts)
- •Consider if the code is already well-optimized
Anti-Patterns to Avoid
- •Premature optimization in non-critical code
- •Optimizing without measurement - changes based on intuition
- •Micro-optimizing while ignoring algorithmic complexity
- •Copying when moving would suffice
- •Growing containers one element at a time (quadratic)
- •Allocating in loops when reuse is possible
- •String formatting in hot paths (use pre-built templates)
- •Regex when simple string matching suffices
Estimation Template
Before optimizing, estimate:
Operation cost: ___ ns/us/ms Frequency: ___ times per second/request Total time: cost × frequency = ___ Improvement target: ___% reduction Expected new time: ___ Is this worth it? [ ] Yes [ ] No