What I do
- Guide Rust profiling with cargo flamegraph, samply, and platform-specific tools
- Document benchmarking patterns with criterion, divan, and iai
- Cover LLVM compiler internals, codegen flags, and assembly verification
- Provide memory analysis and zero-allocation patterns
When to use me
Use this skill when profiling, benchmarking, or optimizing Rust code. Pair
with perf-core for universal methodology and rust-pro for ownership,
error handling, and unsafe patterns.
Profiling
cargo flamegraph --bin my-app -- --input data.json # flame graph (perf/dtrace)
samply record ./target/release/my-app # lightweight, opens Firefox Profiler
| Platform | Primary | Alternative | CI |
|---|---|---|---|
| Linux | perf + flamegraph | samply | cargo flamegraph |
| macOS | cargo-instruments | samply | samply |
| Windows | samply | ETW + WPA | samply |
For heap profiling, use dhat-rs (see Memory section).
Benchmarking
criterion
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sorting(c: &mut Criterion) {
    let mut group = c.benchmark_group("sorting");
    // Reverse-sorted input shared by both benchmarks.
    let data: Vec<u64> = (0..1000).rev().collect();
    // data.clone() runs inside the timed closure in both cases, so the
    // comparison stays fair but the absolute numbers include the copy.
    group.bench_function("std_sort", |b| b.iter(|| {
        let mut v = black_box(data.clone());
        v.sort();
        v
    }));
    group.bench_function("unstable_sort", |b| b.iter(|| {
        let mut v = black_box(data.clone());
        v.sort_unstable();
        v
    }));
    group.finish();
}

criterion_group!(benches, bench_sorting);
criterion_main!(benches);
divan
fn main() {
    divan::main();
}
#[divan::bench(args = [100, 1000, 10_000])]
fn sort_vec(n: usize) -> Vec<u64> {
    let mut v: Vec<u64> = (0..n as u64).rev().collect();
    v.sort_unstable();
    v
}
iai / iai-callgrind
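Instruction-count based (runs under Valgrind) -- deterministic, so results are far less noisy than wall-clock timings; well suited to CI.
For criterion results, use critcmp for cross-branch comparison: save a baseline with --save-baseline main, then compare against it with --baseline main or with critcmp.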
Compiler Internals
| Flag | Values | Effect |
|---|---|---|
| opt-level | 0-3, s, z | 3 = max speed, z = min size |
| lto | false, thin, fat | Link-time optimization across crates |
| codegen-units | 1-256 | Lower = better optimization, slower compile |
| target-cpu | native, specific | Enable CPU-specific instructions |
| panic | unwind, abort | abort reduces binary size |
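As a rough illustration of how the flags in the table map onto configuration, the fragment below shows one possible release profile. The specific values are assumptions chosen for the example, not a recommendation for every project; target-cpu is not a profile key, so it is passed through rustflags instead.

```toml
# Cargo.toml -- illustrative release profile (values are assumptions)
[profile.release]
opt-level = 3        # max speed; use "z" or "s" to favor size
lto = "thin"         # "fat" for maximum cross-crate optimization
codegen-units = 1    # fewer units: better optimization, slower compile
panic = "abort"      # smaller binary, no unwinding

# .cargo/config.toml -- target-cpu goes through rustflags
[build]
rustflags = ["-C", "target-cpu=native"]
```

Binaries built with target-cpu=native may not run on older CPUs, so this setting is usually limited to builds that run on the machine that compiled them.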
cargo llvm-lines | head -20 -- find monomorphization cost. Extract non-generic inner functions or use dyn Trait for cold paths.
- #[inline] for small hot functions across crate boundaries only
- #[inline(never)] to isolate a function in profiles; #[inline(always)] is almost never correct
- LTO: lto = "thin" for fast builds, "fat" for max optimization
- PGO: -Cprofile-generate → run workload → llvm-profdata merge → -Cprofile-use. Gains of 10-20% are commonly cited but depend on the workload
- SIMD: std::arch (stable), std::simd (nightly); verify with cargo-show-asm
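The advice to extract non-generic inner functions can be sketched as follows. The function name count_words is hypothetical and chosen only for illustration; the point is that the generic wrapper stays tiny while the body is compiled once.

```rust
/// Generic entry point: a thin copy is instantiated for each caller type `S`.
pub fn count_words<S: AsRef<str>>(text: S) -> usize {
    // The real work lives in a non-generic function, so its body is
    // compiled once no matter how many types `S` is instantiated with.
    fn inner(text: &str) -> usize {
        text.split_whitespace().count()
    }
    inner(text.as_ref())
}
```

Callers can pass &str, String, or anything else implementing AsRef<str>. Whether this measurably reduces code size in a given crate is best confirmed with cargo llvm-lines before and after the change.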
Reading Compiler Output
cargo asm my_crate::hot_function # view assembly
cargo asm --llvm my_crate::hot_function # view LLVM IR
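Godbolt: paste the function at godbolt.org, compile with -O, and compare the output with and without the abstraction. Identical assembly confirms the abstraction is zero-cost for that case.
Look for: unexpected call instructions (missing inlining), calls into panic paths such as core::panicking::panic_bounds_check (bounds checks that were not elided), and scalar loops where SIMD was expected.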
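A small pair of functions that can be compared in assembly output is sketched below. The names dot_indexed and dot_zipped are hypothetical. The indexed version can panic on b[i] when b is shorter than a, so a bounds check is likely to remain in the loop, while the iterator version has no panic path to check for. What the compiler actually emits depends on the Rust version and flags, so treat this as something to inspect, not a guarantee.

```rust
/// Indexed loop: `b[i]` can panic if `b` is shorter than `a`, so the
/// compiler generally has to keep a bounds check for it.
pub fn dot_indexed(a: &[u32], b: &[u32]) -> u32 {
    let mut total = 0u32;
    for i in 0..a.len() {
        total = total.wrapping_add(a[i].wrapping_mul(b[i]));
    }
    total
}

/// Iterator version: `zip` stops at the shorter slice, so there is no
/// out-of-bounds access to guard against.
pub fn dot_zipped(a: &[u32], b: &[u32]) -> u32 {
    a.iter()
        .zip(b)
        .fold(0u32, |acc, (x, y)| acc.wrapping_add(x.wrapping_mul(*y)))
}
```

Running cargo asm on each function and searching for a call into a panic path is one way to see whether the check survived optimization.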
Memory & Allocation
| Allocator | Crate | Best For |
|---|---|---|
| jemalloc | tikv-jemallocator | Multi-threaded, reduced fragmentation |
| mimalloc | mimalloc | General-purpose, consistent perf |
| System | (default) | Small binaries, minimal deps |
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
dhat-rs for heap profiling
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
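#[test]
fn test_allocations() {
    // Starts heap profiling; when the profiler is dropped it writes
    // dhat-heap.json, which can be opened in DHAT's viewer (dh_view.html).
    let _profiler = dhat::Profiler::new_heap();
    // run the code under test here
}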
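Where pulling in dhat is not an option, a similar allocation check can be approximated with the standard library alone. The sketch below is not dhat and is not how dhat works internally; it is a minimal counting allocator, and the names CountingAlloc and allocations_during are hypothetical. It assumes a platform with native thread-local storage, where touching a const-initialized thread-local does not itself allocate.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread allocation counter; const-initialized and without a destructor.
    static ALLOCS: Cell<usize> = const { Cell::new(0) };
}

/// Wraps the system allocator and counts allocation calls on each thread.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // try_with avoids a panic if the thread-local is unavailable.
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // The default realloc and alloc_zeroed call alloc, so they are counted too.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result with the number of allocation calls
/// made on the current thread while it ran.
pub fn allocations_during<R>(f: impl FnOnce() -> R) -> (R, usize) {
    let before = ALLOCS.with(|c| c.get());
    let result = f();
    let after = ALLOCS.with(|c| c.get());
    (result, after - before)
}
```

A test can then wrap a hot path in allocations_during and assert on the returned count, for example to check that a pre-sized Vec is not reallocated while it is filled.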
| Source | Why | Fix |
|---|---|---|
| format! | New String each call | write! to reusable buffer |
| to_string() | Allocates from &str | Keep as &str |
| Vec growth | Reallocates as capacity doubles | Vec::with_capacity |
| Box::new in loops | Heap alloc per iter | Stack or arena |
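The write!-to-a-reusable-buffer fix from the table might look like the sketch below. The helper name format_into and the "id:name" layout are assumptions made for the example.

```rust
use std::fmt::Write;

/// Formats `id:name` into a caller-owned buffer instead of returning a
/// freshly allocated String on every call.
pub fn format_into(buf: &mut String, id: u32, name: &str) {
    buf.clear(); // drops the old contents but keeps the capacity
    write!(buf, "{id}:{name}").expect("writing to a String does not fail");
}
```

Once the buffer has grown to fit the longest line, later calls reuse that capacity, so a hot loop that calls format_into with the same buffer stops allocating after the first few iterations.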
Zero-Allocation Patterns
| Before | After | Why |
|---|---|---|
| Vec<T> | SmallVec<[T; N]> / ArrayVec<T, N> | Stack-allocated small collections |
| String parameter | &str or Cow<'_, str> | Avoid forced allocation |
| format!("{}", num) | itoa::Buffer / ryu::Buffer | Stack-allocated formatting |
| Vec::push in loop | Vec::with_capacity | Pre-allocate when size known |
| Box<dyn Trait> | generic T: Trait | Avoid heap for single type |
| to_owned() | Extend borrow lifetime | Eliminate allocation |
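The Cow<'_, str> row can be illustrated with a function that only allocates when it has to change its input. The name escape_spaces and the replacement rule are hypothetical.

```rust
use std::borrow::Cow;

/// Replaces spaces with underscores, borrowing the input unchanged when
/// there is nothing to replace.
pub fn escape_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // Only this branch allocates a new String.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}
```

Callers that only read the result can treat it as a &str through deref, and the common no-change case costs no allocation.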
Build Performance
cargo build --timings -- HTML report of per-crate compile times.
- sccache -- RUSTC_WRAPPER=sccache for shared compilation cache
- Workspace splitting -- isolate slow proc macros into separate members
- Reduce generics -- use dyn Trait or non-generic inner functions internally to limit monomorphization (argument-position impl Trait is still monomorphized)
- cargo-udeps -- cargo +nightly udeps to find unused dependencies
Anti-Patterns
| Anti-Pattern | Measurable Impact |
|---|---|
| Vec::push without with_capacity | O(log n) reallocations when size is known |
| format! in hot loops | Heap allocation per iteration; use write! to buffer |
| Missing --release in benchmarks | Optimizations disabled in debug builds; results meaningless |
| HashMap for small lookups | Sorted array + binary search often faster for <20 elements |
| Excessive monomorphization | 50 instantiations; use dyn Trait or a non-generic inner function |
| Missing LTO for release | Up to 10-20% speedup left on the table |
| String::from + push_str | Repeated reallocation; use format! or with_capacity for a single alloc |
| Benchmarking without black_box | Compiler eliminates dead code; measures nothing |
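The sorted array plus binary search alternative to a small HashMap might be sketched as below. The table contents and the function name reason are assumptions for the example, and whether it beats a HashMap at a given size should be confirmed with a benchmark.

```rust
/// Lookup table sorted by key; binary search requires this ordering.
const REASONS: [(u16, &str); 5] = [
    (200, "OK"),
    (301, "Moved Permanently"),
    (404, "Not Found"),
    (418, "I'm a teapot"),
    (500, "Internal Server Error"),
];

/// Looks up a status code without hashing or heap allocation.
pub fn reason(code: u16) -> Option<&'static str> {
    REASONS
        .binary_search_by_key(&code, |&(c, _)| c)
        .ok()
        .map(|i| REASONS[i].1)
}
```

The table lives in static data, so there is no construction cost at startup and no allocation at lookup time.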
Companion Skills
| Domain | Skill | Coverage |
|---|---|---|
| Performance analysis | perf-core | Profiling methodology, flame graphs, benchmarking workflow |
| Rust engineering | rust-pro | Ownership, error handling, unsafe audit, async patterns |