perf-rust

Rust-specific profiling, benchmarking, and optimization patterns.

SKILL.md
--- frontmatter
name: perf-rust
description: Rust-specific profiling, benchmarking, and optimization patterns

What I do

  • Guide Rust profiling with cargo flamegraph, samply, and platform-specific tools
  • Document benchmarking patterns with criterion, divan, and iai
  • Cover LLVM compiler internals, codegen flags, and assembly verification
  • Provide memory analysis and zero-allocation patterns

When to use me

Use this skill when profiling, benchmarking, or optimizing Rust code. Pair with perf-core for universal methodology and rust-pro for ownership, error handling, and unsafe patterns.

Profiling

bash
cargo flamegraph --bin my-app -- --input data.json  # flame graph (perf/dtrace)
samply record ./target/release/my-app               # lightweight, opens Firefox Profiler

| Platform | Primary | Alternative | CI |
|---|---|---|---|
| Linux | perf + flamegraph | samply | cargo flamegraph |
| macOS | cargo-instruments | samply | samply |
| Windows | samply | ETW + WPA | samply |

For heap profiling, use dhat-rs (see Memory section).

Benchmarking

criterion

rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn bench_sorting(c: &mut Criterion) {
    let mut group = c.benchmark_group("sorting");
    let data: Vec<u64> = (0..1000).rev().collect();
    group.bench_function("std_sort", |b| b.iter(|| {
        let mut v = black_box(data.clone()); v.sort(); v
    }));
    group.bench_function("unstable_sort", |b| b.iter(|| {
        let mut v = black_box(data.clone()); v.sort_unstable(); v
    }));
    group.finish();
}
criterion_group!(benches, bench_sorting);
criterion_main!(benches);
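
The role of black_box in the benchmark above can be illustrated without criterion. The following is a minimal sketch using std::hint::black_box from the standard library and a hand-rolled timing loop; `sum_to` and `time_it` are hypothetical names chosen for illustration, and black_box is only a best-effort hint to the optimizer. A real benchmark should still go through a harness such as criterion for warm-up and statistics.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: the sum of 0..n.
pub fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

/// Runs the workload `iters` times; returns the elapsed time and the last result.
pub fn time_it(iters: u32, n: u64) -> (Duration, u64) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // black_box on the input hides the constant from the optimizer;
        // black_box on the output discourages it from discarding the work.
        last = black_box(sum_to(black_box(n)));
    }
    (start.elapsed(), last)
}

fn main() {
    let (elapsed, last) = time_it(1_000, 10_000);
    println!("1000 iterations took {elapsed:?}; last result = {last}");
}
```

Without the two black_box calls, an optimizing build is free to fold the whole loop into a constant, so the measured time would say little about `sum_to`.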

divan

rust
fn main() { divan::main(); }
#[divan::bench(args = [100, 1000, 10_000])]
fn sort_vec(n: usize) -> Vec<u64> {
    let mut v: Vec<u64> = (0..n as u64).rev().collect(); v.sort_unstable(); v
}

iai / iai-callgrind

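Instruction-count based (measured under Valgrind), so results are close to deterministic and well suited to CI. For criterion's wall-clock results, use critcmp for cross-branch comparison: save a baseline on each branch with --save-baseline (for example --save-baseline main), then compare the saved baselines with critcmp, or rerun criterion against one with --baseline main.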

Compiler Internals

| Flag | Values | Effect |
|---|---|---|
| opt-level | 0-3, s, z | 3 = max speed, z = min size |
| lto | false, thin, fat | Link-time optimization across crates |
| codegen-units | 1-256 | Lower = better optimization, slower compile |
| target-cpu | native, specific | Enable CPU-specific instructions |
| panic | unwind, abort | abort reduces binary size |

Run cargo llvm-lines | head -20 to find the generic functions that generate the most LLVM IR (the monomorphization cost). To reduce it, extract non-generic inner functions or use dyn Trait for cold paths.
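
The non-generic inner function pattern can be sketched as follows. `word_count` is a hypothetical example, not taken from any particular crate: the generic outer function only converts its argument, so each instantiation stays tiny while the real work is compiled once.

```rust
/// Generic shim: monomorphized once per caller type, but only a few instructions long.
pub fn word_count<S: AsRef<str>>(text: S) -> usize {
    // Non-generic body: compiled once and shared by every instantiation.
    fn inner(text: &str) -> usize {
        text.split_whitespace().count()
    }
    inner(text.as_ref())
}

fn main() {
    // Two different argument types, one copy of `inner`.
    println!("{}", word_count("alpha beta gamma"));
    println!("{}", word_count(String::from("one two")));
}
```

The payoff grows with the size of `inner` and the number of distinct types it is called with; for a one-line body like this one the effect is negligible and the example is purely illustrative.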

  • #[inline] for small hot functions across crate boundaries only
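  • #[inline(never)] to isolate a function in profiles; #[inline(always)] is almost never the right choice
  • LTO: lto = "thin" for fast builds, "fat" for max optimization
  • PGO: build with -Cprofile-generate → run a representative workload → merge the profiles with llvm-profdata → rebuild with -Cprofile-use. Typical: 10-20% gain, depending on the workload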
  • SIMD: std::arch (stable), std::simd (nightly); verify with cargo-show-asm
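
As an illustration of the SIMD bullet, here is a minimal sketch of the stable std::arch route with runtime feature detection. The function names are hypothetical, the choice of AVX2 and of wrapping addition are assumptions made for the example, and on non-x86_64 targets the code simply falls back to the scalar path.

```rust
/// Scalar reference: wrapping sum of all elements.
pub fn sum_scalar(xs: &[i32]) -> i32 {
    xs.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

/// AVX2 path: adds eight i32 lanes per step, then reduces the lanes and the tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[i32]) -> i32 {
    use std::arch::x86_64::*;

    let chunks = xs.chunks_exact(8);
    let tail = chunks.remainder();
    let mut acc = _mm256_setzero_si256();
    for chunk in chunks {
        // SAFETY: `chunk` holds exactly 8 i32 values (32 bytes); loadu allows unaligned reads.
        let v = unsafe { _mm256_loadu_si256(chunk.as_ptr() as *const __m256i) };
        acc = _mm256_add_epi32(acc, v);
    }
    let mut lanes = [0i32; 8];
    // SAFETY: `lanes` is 32 bytes of writable memory; storeu allows unaligned writes.
    unsafe { _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc) };
    sum_scalar(&lanes).wrapping_add(sum_scalar(tail))
}

/// Dispatches to the AVX2 path when the CPU supports it, otherwise to the scalar path.
pub fn sum(xs: &[i32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}

fn main() {
    let data: Vec<i32> = (1..=100).collect();
    println!("sum = {}", sum(&data));
}
```

The compiler often auto-vectorizes a loop this simple on its own, so a hand-written path like this is worth keeping only if cargo-show-asm shows that the scalar version was not vectorized.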

Reading Compiler Output

bash
cargo asm my_crate::hot_function        # view assembly
cargo asm --llvm my_crate::hot_function # view LLVM IR

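Godbolt: paste the code at godbolt.org, compile with -O, and compare the output with and without the abstraction. Identical assembly confirms the abstraction is zero-cost. Look for unexpected call instructions (missing inlining), calls into panic paths (surviving bounds checks), and missing SIMD instructions.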
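
A pair of functions like the following is a convenient input for that comparison. It is a sketch with hypothetical names: both compute the same result, but the indexed version may keep a bounds check and panic path inside the loop, while the sliced version checks the range once up front. Whether the optimizer removes the checks in the first version depends on the compiler version and flags, so treat the assembly as the source of truth.

```rust
/// Indexing inside the loop: each `xs[i]` may keep a bounds check and a panic path.
pub fn sum_indexed(xs: &[u32], n: usize) -> u32 {
    let mut total = 0u32;
    for i in 0..n {
        total = total.wrapping_add(xs[i]);
    }
    total
}

/// One up-front slice (a single range check), then iteration without per-element checks.
pub fn sum_sliced(xs: &[u32], n: usize) -> u32 {
    xs[..n].iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

fn main() {
    let data = [1u32, 2, 3, 4, 5];
    println!("{} {}", sum_indexed(&data, 3), sum_sliced(&data, 3));
}
```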

Memory & Allocation

| Allocator | Crate | Best For |
|---|---|---|
| jemalloc | tikv-jemallocator | Multi-threaded, reduced fragmentation |
| mimalloc | mimalloc | General-purpose, consistent perf |
| System | (default) | Small binaries, minimal deps |

rust
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

dhat-rs for heap profiling

rust
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
#[test]
fn test_allocations() {
    let _profiler = dhat::Profiler::new_heap();
    // run the code under test; when the profiler is dropped it writes dhat-heap.json,
    // which can be opened in DHAT's viewer (dh_view.html)
}

| Source | Why | Fix |
|---|---|---|
| format! | New String each call | write! to reusable buffer |
| to_string() | Allocates from &str | Keep as &str |
| Vec growth | Doubles capacity | Vec::with_capacity |
| Box::new in loops | Heap alloc per iter | Stack or arena |
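
The format! row can be sketched as follows: instead of building a fresh String per iteration, one buffer is cleared and rewritten with write!. `total_label_len` and the "item-" prefix are hypothetical, chosen only to give the loop something to do.

```rust
use std::fmt::Write;

/// Formats a label for every id into one reusable buffer and sums the label lengths.
pub fn total_label_len(ids: &[u32]) -> usize {
    // Allocated once; clear() keeps the capacity, so later writes reuse the same memory.
    let mut buf = String::with_capacity(32);
    let mut total = 0;
    for id in ids {
        buf.clear();
        write!(buf, "item-{id}").expect("writing to a String cannot fail");
        total += buf.len();
    }
    total
}

fn main() {
    println!("{}", total_label_len(&[1, 22, 333]));
}
```

The equivalent loop with format!("item-{id}") would allocate and free a String on every iteration.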

Zero-Allocation Patterns

| Before | After | Why |
|---|---|---|
| Vec<T> | SmallVec<[T; N]> / ArrayVec<T, N> | Stack-allocated small collections |
| String parameter | &str or Cow<'_, str> | Avoid forced allocation |
| format!("{}", num) | itoa::Buffer / ryu::Buffer | Stack-allocated formatting |
| Vec::push in loop | Vec::with_capacity | Pre-allocate when size known |
| Box<dyn Trait> | generic T: Trait | Avoid heap for single type |
| to_owned() | Extend borrow lifetime | Eliminate allocation |
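
The Cow<'_, str> row can be illustrated with a function that allocates only when it has to change its input. This is a minimal sketch; `escape_spaces` is a hypothetical helper, not part of any library.

```rust
use std::borrow::Cow;

/// Replaces spaces with underscores, borrowing the input when nothing needs to change.
pub fn escape_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // Slow path: a new String is allocated only when a replacement is needed.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // Fast path: no allocation, the caller gets the original slice back.
        Cow::Borrowed(input)
    }
}

fn main() {
    println!("{}", escape_spaces("no_spaces_here"));
    println!("{}", escape_spaces("has some spaces"));
}
```

Callers that only read the result can treat it as a &str through deref, so the common no-change case costs no heap allocation.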

Build Performance

cargo build --timings produces an HTML report of per-crate compile times.

  • sccache -- RUSTC_WRAPPER=sccache for shared compilation cache
  • Workspace splitting -- isolate slow proc macros into separate members
  • Reduce generics -- use dyn Trait or non-generic inner functions internally to limit monomorphization (impl Trait in argument position is still monomorphized)
  • cargo-udeps -- cargo +nightly udeps to find unused dependencies

Anti-Patterns

| Anti-Pattern | Measurable Impact |
|---|---|
| Vec::push without with_capacity | O(log n) reallocations when size is known |
| format! in hot loops | Heap allocation per iteration; use write! to buffer |
| Missing --release in benchmarks | Debug builds disable optimizations; results meaningless |
| HashMap for small lookups | Sorted array + binary search often faster for <20 elements |
| Excessive monomorphization | 50 instantiations; use dyn Trait or non-generic inner |
| Missing LTO for release | 10-20% speedup left on the table |
| String::from + push_str | Repeated reallocation as the string grows; use format! or with_capacity + single alloc |
| Benchmarking without black_box | Compiler eliminates dead code; measures nothing |
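
The small-lookup row can be sketched with a sorted static array and binary search from the standard library. The table contents and the name `status_text` are hypothetical; whether this beats a HashMap for a given key type and size is something to confirm with a benchmark.

```rust
/// Hypothetical lookup table. It must stay sorted by key for binary search to work.
static STATUS_TEXT: [(u16, &str); 5] = [
    (200, "OK"),
    (301, "Moved Permanently"),
    (404, "Not Found"),
    (418, "I'm a teapot"),
    (500, "Internal Server Error"),
];

/// Looks up a status code without hashing or heap allocation.
pub fn status_text(code: u16) -> Option<&'static str> {
    STATUS_TEXT
        .binary_search_by_key(&code, |&(key, _)| key)
        .ok()
        .map(|index| STATUS_TEXT[index].1)
}

fn main() {
    println!("{:?}", status_text(404));
    println!("{:?}", status_text(999));
}
```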

Companion Skills

| Domain | Skill | Coverage |
|---|---|---|
| Performance analysis | perf-core | Profiling methodology, flame graphs, benchmarking workflow |
| Rust engineering | rust-pro | Ownership, error handling, unsafe audit, async patterns |