Rust Performance Best Practices
Expert-level performance optimization guide for Rust. Contains 41 rules across 8 categories with real benchmarks, failure modes, and profiling workflows.
The Optimization Workflow
CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.
1. MEASURE FIRST
   └── Profile before changing anything
   └── Use cargo flamegraph, perf, or heaptrack
   └── Identify actual bottlenecks (don't guess!)

2. CHECK BUILD SETTINGS
   └── Release mode? (10-100x vs debug)
   └── LTO enabled? (5-20% improvement)
   └── Target CPU? (10-30% for SIMD)

3. FIX ALGORITHMIC ISSUES
   └── O(n²) → O(n log n) matters more than micro-opts
   └── Check data structure choices
   └── Avoid unnecessary work

4. REDUCE ALLOCATIONS
   └── Pre-size collections (with_capacity)
   └── Reuse buffers (clear + reuse)
   └── Avoid cloning (borrow instead)

5. OPTIMIZE HOT LOOPS
   └── Iterators over indices
   └── Reduce lock scope
   └── Batch I/O operations

6. MEASURE AGAIN
   └── Verify improvement with benchmarks
   └── Check for regressions elsewhere
   └── Document the optimization
Quick Profiling Commands
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp   # writes dhat.out.<pid>; open it in Valgrind's dh_view.html

# Benchmark
cargo bench                  # All benchmarks
cargo bench hot_function     # Specific benchmark

# Check allocations (glibc only; the program must call mtrace() for the log to be written)
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# Syscall count
strace -c ./target/release/myapp 2>&1 | head -20
Common Scenarios → Rules
"My Rust program is slow"
Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements
"My binary is too large"
1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
"High memory usage"
1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate
"Lock contention / thread scaling"
1. Profile: confirm with perf or a flamegraph that threads spend their time waiting in lock acquisition (e.g. Mutex::lock)
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats
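
Item 6 above (thread-local counting with a periodic flush) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the names `record_event`, `flush_local`, `total_events`, and the `FLUSH_EVERY` threshold of 64 are hypothetical choices made for the example.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared total, touched only when a thread flushes (hypothetical name).
static TOTAL_EVENTS: AtomicU64 = AtomicU64::new(0);

/// Assumed flush threshold; tune it for your workload.
const FLUSH_EVERY: u64 = 64;

thread_local! {
    /// Per-thread pending count: no locking or atomics on the hot path.
    static LOCAL_EVENTS: Cell<u64> = Cell::new(0);
}

/// Records one event, touching the shared atomic once per FLUSH_EVERY calls.
pub fn record_event() {
    LOCAL_EVENTS.with(|local| {
        let pending = local.get() + 1;
        if pending >= FLUSH_EVERY {
            TOTAL_EVENTS.fetch_add(pending, Ordering::Relaxed);
            local.set(0);
        } else {
            local.set(pending);
        }
    });
}

/// Pushes whatever is still pending on this thread into the shared total.
/// Call it before the thread exits, or pending events are never counted.
pub fn flush_local() {
    LOCAL_EVENTS.with(|local| {
        let pending = local.replace(0);
        if pending > 0 {
            TOTAL_EVENTS.fetch_add(pending, Ordering::Relaxed);
        }
    });
}

/// Reads the shared total; it lags behind by any unflushed per-thread counts.
pub fn total_events() -> u64 {
    TOTAL_EVENTS.load(Ordering::Relaxed)
}
```

The trade-off is accuracy for throughput: the shared total is only approximate until every thread has called `flush_local`.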
"Slow file I/O"
1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate
Rule Categories
| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug→release) | build- |
| 2 | Benchmarking | Enables measurement | bench- |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Unsafe | 5-30% in tight loops (experts only) | unsafe- |
1. Build Profiles (CRITICAL)
These apply to ALL Rust code. Check these first.
| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level=3 for speed, 'z' for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| build-codegen-units | 5-15% | codegen-units=1 for max optimization |
| build-panic-abort | Binary size | panic='abort' removes unwinding |
| build-target-cpu | 10-30% | target-cpu=native for SIMD |
| build-pgo | 5-20% | Profile-guided optimization |
| build-incremental-off | 5-10% | Disable for release builds |
2. Benchmarking (REQUIRED)
You can't optimize what you don't measure.
| Rule | Purpose |
|---|---|
| bench-cargo-bench | Use cargo bench with criterion |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box | Prevent dead code elimination |
| bench-avoid-io | I/O variance destroys measurements |
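
As a rough illustration of bench-black-box, the sketch below uses the standard library's `std::hint::black_box` with a hand-rolled timing loop. The function `sum_of_squares` and the helper `time_sum_of_squares` are hypothetical; a real benchmark would normally go through a harness such as criterion rather than `Instant` alone.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical function under test: sums the squares of 0..n.
pub fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum()
}

/// Times `iters` calls of `sum_of_squares`.
/// `black_box` hides the input and the result from the optimizer, so the
/// call cannot be constant-folded or removed as dead code.
pub fn time_sum_of_squares(n: u64, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(sum_of_squares(black_box(n)));
    }
    start.elapsed()
}
```

Without the two `black_box` calls, an optimized build is free to compute the sum once (or not at all), and the measured time would say nothing about the function.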
3. Allocation
Every heap allocation goes through the allocator, and some of them end in a syscall. Reduce them.
| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n) not Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-50x | .clear() and reuse, don't reallocate |
| alloc-use-slices-in-apis | Flexibility | &[T] not Vec<T> in parameters |
| alloc-avoid-clone | 2-10x | Borrow &T instead of clone() |
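
The patterns in the table above might look like this in practice. It is a small sketch with hypothetical function names (`squares`, `longest_upper_len`), showing pre-sizing, buffer reuse, and a slice parameter together.

```rust
/// Builds the squares of 0..n with a single up-front allocation
/// instead of letting the vector grow and reallocate as it fills.
pub fn squares(n: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        out.push(i * i);
    }
    out
}

/// Returns the byte length of the longest line after ASCII-uppercasing it.
/// Takes a slice (`&[&str]`) so callers can pass a Vec, an array, or a
/// sub-slice, and reuses one String buffer across all iterations.
pub fn longest_upper_len(lines: &[&str]) -> usize {
    let mut buf = String::new();
    let mut longest = 0;
    for line in lines {
        buf.clear(); // resets the length but keeps the allocated capacity
        buf.push_str(line);
        buf.make_ascii_uppercase();
        longest = longest.max(buf.len());
    }
    longest
}
```

After the first few iterations `buf` has grown to fit the longest line seen so far, so later iterations typically allocate nothing.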
4. Data Structures
The right data structure beats micro-optimization.
| Rule | When |
|---|---|
| data-avoid-linkedlist | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| data-choose-map-type | HashMap=O(1), BTreeMap=sorted |
| data-use-entry-api | Insert-or-update patterns |
| data-repr-transparent | FFI newtypes |
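
For data-use-entry-api, a minimal sketch of the insert-or-update pattern is shown below. The function name `word_counts` is hypothetical; the point is that `entry` does one hash lookup where a `contains_key` check followed by `insert` would do two.

```rust
use std::collections::HashMap;

/// Counts whitespace-separated words, borrowing the keys from `text`
/// so no per-word String is allocated.
pub fn word_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Inserts 0 if the key is missing, then increments in place.
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}
```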
5. Iteration
Iterators are as fast as loops and safer.
| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| iter-use-lazy-iterators | 2-3x | .filter().map() not intermediate vecs |
| iter-use-any-find | Short-circuit | .any() not .filter().count() > 0 |
| iter-use-retain | In-place | .retain() not .filter().collect() |
| iter-use-binary-search | O(log n) | .binary_search() on sorted data |
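
The iteration rules above can be illustrated with a few short functions. The names are hypothetical and the bodies are only sketches of each pattern, using standard library methods.

```rust
/// iter-use-any-find: stops at the first negative value instead of
/// counting every match with `.filter().count() > 0`.
pub fn has_negative(values: &[i32]) -> bool {
    values.iter().any(|&v| v < 0)
}

/// iter-use-retain: removes odd values in place, with no second Vec.
pub fn keep_even(values: &mut Vec<i32>) {
    values.retain(|v| v % 2 == 0);
}

/// iter-use-lazy-iterators: filter and map are fused into one pass,
/// with no intermediate collection between the steps.
pub fn sum_of_even_squares(values: &[i32]) -> i32 {
    values
        .iter()
        .filter(|&&v| v % 2 == 0)
        .map(|&v| v * v)
        .sum()
}

/// iter-use-binary-search: O(log n), but only valid on sorted input.
pub fn contains_sorted(sorted: &[i32], target: i32) -> bool {
    sorted.binary_search(&target).is_ok()
}
```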
6. Synchronization
Locks are expensive. Minimize contention.
| Rule | Impact | When |
|---|---|---|
| sync-share-with-arc | 350x vs clone | Share large data across threads |
| sync-use-rwlock | 7x for reads | 90%+ reads, few writes |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
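
As a sketch of sync-use-atomics combined with sync-share-with-arc, the example below increments one shared counter from several threads without a lock. The function name `count_with_atomics` is hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter
/// `per_thread` times, then returns the final value.
pub fn count_with_atomics(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));

    // The handles are collected first so that every thread is running
    // before any of them is joined.
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter); // clones the pointer, not the data
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed suffices for a pure counter: only the total matters.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    // join() synchronizes with each worker, so every increment is visible here.
    counter.load(Ordering::Relaxed)
}
```

If the counter guarded other data (for example, a flag that publishes a buffer), `Relaxed` would no longer be appropriate and acquire/release ordering would be needed.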
7. I/O
Every syscall costs. Buffer them.
| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Flush explicitly; write errors during drop are ignored, so data can be lost silently |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
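
The four I/O rules can be sketched with generic readers and writers, which keeps the example independent of the filesystem. The function names are hypothetical; in real code the `sink` and `source` arguments would usually be a `File`.

```rust
use std::io::{self, BufRead, BufReader, BufWriter, Read, Write};

/// Writes each line through a BufWriter so many small writes become
/// a few large ones, then flushes explicitly.
pub fn write_lines<W: Write>(sink: W, lines: &[&str]) -> io::Result<()> {
    let mut out = BufWriter::new(sink);
    for line in lines {
        writeln!(out, "{line}")?;
    }
    // Dropping a BufWriter also tries to flush, but any error is discarded.
    // Flushing here lets the caller see the failure.
    out.flush()
}

/// Counts non-blank lines, reusing a single String for every read.
pub fn count_nonempty_lines<R: Read>(source: R) -> io::Result<usize> {
    let mut reader = BufReader::new(source);
    let mut line = String::new();
    let mut count = 0;
    loop {
        line.clear(); // read_line appends, so reset the buffer first
        if reader.read_line(&mut line)? == 0 {
            break; // 0 bytes read means end of input
        }
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```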
8. Unsafe (Expert Only)
Only after profiling proves these matter.
| Rule | Impact | Risk |
|---|---|---|
| unsafe-get-unchecked | 5-30% | UB if bounds wrong |
| unsafe-use-maybeuninit | 20-100x alloc | UB if read before write |
| unsafe-avoid-transmute | Correctness | Prefer safe alternatives |
| unsafe-repr-transparent | Zero-cost | Required for FFI newtypes |
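
Two of these rules need no unsafe code at all to demonstrate. The sketch below uses a hypothetical `Meters` newtype for unsafe-repr-transparent and the standard `f32::to_bits` as one safe alternative to `transmute` for unsafe-avoid-transmute.

```rust
/// `repr(transparent)` guarantees this newtype has the same layout and
/// ABI as its single field, which is what makes it usable across FFI.
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Meters(pub f64);

/// Reinterprets a float's bits without `mem::transmute`.
/// `to_bits` is safe and states the intent directly.
pub fn float_bits(x: f32) -> u32 {
    x.to_bits()
}
```

Many common `transmute` uses have a safe counterpart like this (`to_bits`/`from_bits`, `to_ne_bytes`/`from_ne_bytes`, `as` casts), so it is worth checking for one first.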
Decision Trees
When to use with_capacity?
Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine
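
For the "use reserve()" branch, a minimal sketch follows: the final size is unknown when the vector is created, but each incoming batch has a known length. The function name `append_doubled` is hypothetical.

```rust
/// Appends the doubled values of `batch` to `out`.
/// `reserve` grows the buffer at most once per batch instead of
/// possibly several times while pushing.
pub fn append_doubled(out: &mut Vec<u32>, batch: &[u32]) {
    out.reserve(batch.len());
    for &x in batch {
        out.push(x * 2);
    }
}
```

`reserve(n)` asks for room for `n` additional elements beyond the current length, not a total capacity of `n`.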
Mutex vs RwLock vs Atomics?
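Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)
Consider: the parking_lot crate provides drop-in Mutex and RwLock replacements that are often faster than std under contention; benchmark both on your workload.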
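
The "mostly reads" branch might look like the sketch below: a hypothetical `Cache` type built on `std::sync::RwLock`, with the lock held only for the map operation itself.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// A read-mostly string cache (hypothetical type for illustration).
pub struct Cache {
    inner: RwLock<HashMap<String, String>>,
}

impl Cache {
    pub fn new() -> Self {
        Cache {
            inner: RwLock::new(HashMap::new()),
        }
    }

    /// Any number of readers can hold the read lock at the same time.
    /// The value is cloned out so the guard is released on return.
    pub fn get(&self, key: &str) -> Option<String> {
        let guard = self.inner.read().expect("lock poisoned");
        guard.get(key).cloned()
    }

    /// Allocates the owned key and value before taking the write lock,
    /// so the exclusive section covers only the insert.
    pub fn insert(&self, key: &str, value: &str) {
        let (key, value) = (key.to_owned(), value.to_owned());
        self.inner.write().expect("lock poisoned").insert(key, value);
    }
}
```

Cloning on read trades an allocation for a shorter lock hold; for large values, storing `Arc<String>` in the map would make that clone cheap.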
When is unsafe get_unchecked worth it?
Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
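
The last two branches of this tree can be compared side by side. Both functions below are hypothetical sketches of a dot product: the first is the iterator form to try first, the second shows what "documented invariants" could look like if `get_unchecked` turns out to be necessary.

```rust
/// Safe version: `zip` stops at the shorter slice, so no index is ever
/// out of range and there is no per-element bounds check to remove.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Unchecked version, only for when profiling and the assembly both show
/// that bounds checks remain and matter.
///
/// Panics if the slices differ in length; that check is what makes the
/// unsafe block below sound.
pub fn dot_unchecked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    let mut acc = 0.0;
    for i in 0..a.len() {
        // SAFETY: i < a.len() by the loop bound, and a.len() == b.len()
        // by the assertion above, so both indices are in range.
        acc += unsafe { a.get_unchecked(i) * b.get_unchecked(i) };
    }
    acc
}
```

The up-front `assert_eq!` is a single check outside the loop, and it is often enough on its own to let the compiler drop the per-element checks from ordinary indexing as well.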
Reading Rules
Each rule file in rules/ contains:
- Quantified impact with real benchmark numbers
- Visual explanations of how the optimization works
- Incorrect examples showing common mistakes
- Correct examples with best practices
- When NOT to apply - trade-offs and edge cases
- Common mistakes to avoid
- Profiling commands to identify the issue
- References to official docs
Full Compiled Document
For all rules in a single file: AGENTS.md