rust-performance-best-practices

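A Rust performance optimization guide covering build configuration, memory allocation, synchronization, and I/O. Use this skill when writing, reviewing, or optimizing Rust code for performance. It triggers automatically on slow Rust code, oversized binaries, long compile times, LTO configuration, release profile tuning, allocation optimization, avoiding unnecessary clones, lock contention, BufReader/BufWriter, flamegraph analysis, or any Rust performance investigation.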

SKILL.md
--- frontmatter
name: rust-performance-best-practices
description: Expert-level Rust performance optimization guidelines for build profiles, allocation, synchronization, and I/O. This skill should be used when writing, reviewing, or optimizing Rust code for performance. Triggers on tasks involving slow Rust code, large binary size, long compile times, LTO configuration, release profile tuning, allocation reduction, clone avoidance, lock contention, BufReader/BufWriter, flamegraph analysis, or any Rust performance investigation.

Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 41 rules across 8 categories with real benchmarks, failure modes, and profiling workflows.

The Optimization Workflow

CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.

code
┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp   # writes dhat.out.<pid>; open it in Valgrind's dh_view.html

# Benchmark
cargo bench                          # All benchmarks
cargo bench hot_function             # Specific benchmark

# Check allocations (glibc only; a trace is written only if the program calls mtrace())
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# syscall count
strace -c ./target/release/myapp 2>&1 | head -20

Common Scenarios → Rules

"My Rust program is slow"

code
Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

"My binary is too large"

code
1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
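
The five steps above map onto a handful of `[profile.release]` keys. The fragment below is a sketch of a size-focused profile, not a recommended default: `codegen-units = 1` is taken from the build rules rather than this list, and `panic = "abort"` assumes the program never relies on unwinding (for example `std::panic::catch_unwind`).

```toml
# Cargo.toml: size-focused release profile (illustrative sketch)
[profile.release]
opt-level = "z"     # optimize for size; "s" is a less aggressive alternative
lto = true          # whole-program ("fat") LTO across crates
codegen-units = 1   # one unit gives the optimizer the widest view, at the cost of build time
panic = "abort"     # drop unwinding support; a panic terminates the process
strip = true        # strip symbols from the final binary
debug = 0           # emit no debug info
```

Measure the binary before and after each change; the size and speed effects vary by project.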

"High memory usage"

code
1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate

"Lock contention / thread scaling"

code
1. Profile: confirm contention first (perf or a flamegraph showing time in lock acquisition)
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats
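
Items 5 and 6 can be combined: each worker accumulates into a local variable and sends a single message when it finishes, so no lock is shared on the hot path. The sketch below uses only `std::sync::mpsc`; the function name `sum_via_channel` and the workload are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Each worker sums 0..per_worker locally, then sends one result over the channel.
fn sum_via_channel(workers: u32, per_worker: u32) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Thread-local accumulation: no synchronization inside the loop.
                let local: u64 = (0..per_worker).map(u64::from).sum();
                // One send per worker instead of one per item.
                tx.send(local).expect("receiver dropped");
            })
        })
        .collect();

    // Drop the original sender so the receiver's iterator ends once all workers finish.
    drop(tx);

    let total: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    total
}
```

Whether this beats a shared `Mutex` depends on how often workers would otherwise contend, so benchmark both variants.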

"Slow file I/O"

code
1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate

Rule Categories

| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug→release) | build- |
| 2 | Benchmarking | Enables measurement | bench- |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Unsafe | 5-30% in tight loops (experts only) | unsafe- |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level=3 for speed, 'z' for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| build-codegen-units | 5-15% | codegen-units=1 for max optimization |
| build-panic-abort | Binary size | panic='abort' removes unwinding |
| build-target-cpu | 10-30% | target-cpu=native for SIMD |
| build-pgo | 5-20% | Profile-guided optimization |
| build-incremental-off | 5-10% | Disable for release builds |

2. Benchmarking (REQUIRED)

You can't optimize what you don't measure.

| Rule | Purpose |
|---|---|
| bench-cargo-bench | Use cargo bench with criterion |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box | Prevent dead code elimination |
| bench-avoid-io | I/O variance destroys measurements |
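
To illustrate bench-black-box without external crates, the sketch below times a function with `std::time::Instant` and wraps both the input and the result in `std::hint::black_box`, so the optimizer cannot constant-fold the call or discard it as dead code. The names `sum_of_squares` and `time_it` are hypothetical. This is a rough stand-in only; a harness such as criterion adds warm-up and statistical analysis.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// The function under measurement (an arbitrary CPU-bound example).
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i.wrapping_mul(i)).fold(0, u64::wrapping_add)
}

/// Runs the function `iterations` times and returns the total elapsed time.
fn time_it(iterations: u32, input: u64) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box on the input hides the constant from the optimizer;
        // black_box on the output keeps the call from being eliminated.
        black_box(sum_of_squares(black_box(input)));
    }
    start.elapsed()
}
```

Build with optimizations before timing anything; numbers from a debug build say little about release performance.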

3. Allocation

Heap allocations are not free: each one goes through the allocator, and some reach the OS. Reduce them.

| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n) not Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-50x | .clear() and reuse, don't reallocate |
| alloc-use-slices-in-apis | Flexibility | &[T] not Vec<T> in parameters |
| alloc-avoid-clone | 2-10x | Borrow &T instead of clone() |
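
Three of these patterns (slices in APIs, buffer reuse, and borrowing instead of cloning) can be sketched in a few lines. The function names below are hypothetical and the workloads are deliberately trivial.

```rust
use std::fmt::Write;

/// Takes `&[u32]` rather than `&Vec<u32>`, so arrays, Vecs, and sub-slices all work.
fn sum_even(values: &[u32]) -> u32 {
    values.iter().copied().filter(|v| v % 2 == 0).sum()
}

/// Formats every record into one reusable String instead of allocating per item.
fn total_formatted_len(records: &[u32]) -> usize {
    let mut buf = String::with_capacity(16); // sized once, up front
    let mut total = 0;
    for r in records {
        buf.clear(); // drops the contents but keeps the capacity
        write!(buf, "id={r}").expect("writing to a String cannot fail");
        total += buf.len();
    }
    total
}

/// Returns a borrowed `&str` into the input instead of cloning a String.
fn longest(names: &[String]) -> Option<&str> {
    names.iter().map(String::as_str).max_by_key(|s| s.len())
}
```

The speedups quoted in the table depend on workload; these functions only show the shape of each pattern.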

4. Data Structures

The right data structure beats micro-optimization.

| Rule | When |
|---|---|
| data-avoid-linkedlist | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| data-choose-map-type | HashMap=O(1), BTreeMap=sorted |
| data-use-entry-api | Insert-or-update patterns |
| data-repr-transparent | FFI newtypes |
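
As a sketch of data-use-entry-api and data-choose-vecdeque-for-queue, the example below counts words with a single hash lookup per word and drains a FIFO queue from the front. The function names are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};

/// Insert-or-update with one lookup per word via the entry API.
fn word_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// VecDeque pops from the front in O(1); `Vec::remove(0)` would shift every element.
fn drain_fifo(items: &[u32]) -> Vec<u32> {
    let mut queue: VecDeque<u32> = items.iter().copied().collect();
    let mut order = Vec::with_capacity(queue.len());
    while let Some(item) = queue.pop_front() {
        order.push(item);
    }
    order
}
```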

5. Iteration

Iterators are as fast as loops and safer.

| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| iter-use-lazy-iterators | 2-3x | .filter().map() not intermediate vecs |
| iter-use-any-find | Short-circuit | .any() not .filter().count() > 0 |
| iter-use-retain | In-place | .retain() not .filter().collect() |
| iter-use-binary-search | O(log n) | .binary_search() on sorted data |
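
Each pattern in the table is a one-liner in practice. The following sketch, with hypothetical function names, shows the preferred form of each.

```rust
/// Short-circuits on the first match instead of counting every match.
fn has_negative(xs: &[i32]) -> bool {
    xs.iter().any(|&x| x < 0)
}

/// Filters in place; no second Vec is allocated.
fn keep_even(xs: &mut Vec<i32>) {
    xs.retain(|x| x % 2 == 0);
}

/// O(log n) lookup; the slice must already be sorted for the result to be meaningful.
fn contains_sorted(sorted: &[i32], target: i32) -> bool {
    sorted.binary_search(&target).is_ok()
}

/// A lazy chain: filter and map run in one pass with no intermediate collection.
fn sum_of_doubled_odds(xs: &[i32]) -> i32 {
    xs.iter().filter(|&&x| x % 2 != 0).map(|&x| x * 2).sum()
}
```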

6. Synchronization

Locks are expensive. Minimize contention.

| Rule | Impact | When |
|---|---|---|
| sync-share-with-arc | 350x vs clone | Share large data across threads |
| sync-use-rwlock | 7x for reads | 90%+ reads, few writes |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
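
Two of these rules are easy to show with the standard library alone: an atomic counter shared through `Arc`, and a `Mutex` whose guard lives for a single statement. The function names are hypothetical, and `Ordering::Relaxed` is assumed to be sufficient because the counter guards no other data.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// A shared counter using an atomic instead of Mutex<u64>.
fn count_with_atomic(threads: u64, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter); // clones the pointer, not the data
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    counter.load(Ordering::Relaxed)
}

/// Does the work outside the lock and holds the guard only for the push.
fn record_lengths(log: &Mutex<Vec<usize>>, lines: &[&str]) {
    for line in lines {
        let len = line.trim().len(); // computed without holding the lock
        // The temporary guard is dropped at the end of this statement.
        log.lock().expect("mutex poisoned").push(len);
    }
}
```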

7. I/O

Every syscall costs. Buffer them.

| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
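
The four I/O rules fit in one short round trip: write through a `BufWriter`, flush explicitly, then read back through a `BufReader` while reusing a single `String`. The sketch below uses hypothetical function names. The explicit `flush()` matters because `BufWriter` ignores write errors when it is dropped.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::Path;

/// Writes `n` lines through a buffer, then flushes so errors are reported.
fn write_lines(path: &Path, n: u32) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for i in 0..n {
        writeln!(out, "line {i}")?; // lands in the buffer, not a syscall per line
    }
    out.flush()?; // dropping the writer would discard any flush error
    Ok(())
}

/// Returns (line count, total bytes) while reusing one String for every line.
fn count_lines_and_bytes(path: &Path) -> io::Result<(usize, usize)> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = String::new();
    let mut lines = 0;
    let mut bytes = 0;
    loop {
        line.clear(); // read_line appends, so clear before each call
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // end of file
        }
        lines += 1;
        bytes += n;
    }
    Ok((lines, bytes))
}
```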

8. Unsafe (Expert Only)

Only after profiling proves these matter.

| Rule | Impact | Risk |
|---|---|---|
| unsafe-get-unchecked | 5-30% | UB if bounds wrong |
| unsafe-use-maybeuninit | 20-100x alloc | UB if read before write |
| unsafe-avoid-transmute | Correctness | Prefer safe alternatives |
| unsafe-repr-transparent | Zero-cost | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

code
Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine
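
The two "known size" branches of this tree look like the sketch below: `with_capacity` when the size is known at construction, `reserve` when a batch of known size is about to be appended to an existing Vec. The function names are hypothetical.

```rust
/// Size known up front: allocate once, then push without reallocating.
fn squares(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n as u64 {
        out.push(i * i);
    }
    out
}

/// Size known just before growing: reserve once for the whole batch.
fn append_batch(dst: &mut Vec<u64>, batch: &[u64]) {
    dst.reserve(batch.len());
    dst.extend_from_slice(batch);
}
```

`extend_from_slice` already reserves for the slice it receives, so the explicit `reserve` here is for illustration; it pays off when the elements arrive one at a time.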

Mutex vs RwLock vs Atomics?

code
Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot as an alternative to std for all of these (benchmark both)
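
For the "mostly reads" branch, a read-mostly lookup table behind `std::sync::RwLock` might look like the sketch below. The `Config` type and its methods are hypothetical. Many readers can hold the read lock at once, while a writer takes exclusive access.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// A read-mostly key/value store.
struct Config {
    values: RwLock<HashMap<String, String>>,
}

impl Config {
    fn new() -> Self {
        Config { values: RwLock::new(HashMap::new()) }
    }

    /// Shared access: concurrent readers do not block each other.
    fn get(&self, key: &str) -> Option<String> {
        self.values.read().expect("lock poisoned").get(key).cloned()
    }

    /// Exclusive access: blocks readers and other writers for the insert only.
    fn set(&self, key: &str, value: &str) {
        self.values
            .write()
            .expect("lock poisoned")
            .insert(key.to_string(), value.to_string());
    }
}
```

If writes turn out to be frequent, the tree above points back to a plain `Mutex`.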

When is unsafe get_unchecked worth it?

code
Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
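
The last two branches can be compared side by side. The sketch below computes a dot product first with iterators, then with `get_unchecked` plus a written-down safety invariant. The function names are hypothetical. Whether the unchecked version is faster at all is something only the generated assembly and a benchmark can tell.

```rust
/// Safe version: zip stops at the shorter slice, and no indexing is involved.
fn dot_safe(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Unchecked version: only for when profiling shows the bounds checks remain and matter.
fn dot_unchecked(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = 0.0;
    for i in 0..n {
        // SAFETY: i < n, and n <= a.len() and n <= b.len() by construction,
        // so both indices are in bounds.
        acc += unsafe { *a.get_unchecked(i) * *b.get_unchecked(i) };
    }
    acc
}
```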

Reading Rules

Each rule file in rules/ contains:

  • Quantified impact with real benchmark numbers
  • Visual explanations of how the optimization works
  • Incorrect examples showing common mistakes
  • Correct examples with best practices
  • When NOT to apply - trade-offs and edge cases
  • Common mistakes to avoid
  • Profiling commands to identify the issue
  • References to official docs

Full Compiled Document

For all rules in a single file: AGENTS.md