Performance Profiling in Rust
Set up comprehensive performance profiling using Criterion benchmarks, flamegraphs, and custom TSC-based measurements to identify bottlenecks and validate optimizations in latency-sensitive Rust code.
When to use
- Performance-critical applications (HFT, real-time systems, hot paths)
- Sub-microsecond latency requirements
- Lock-free data structures where contention matters
- Before/after optimization validation (prove improvements)
- Systems where allocation or syscalls are too expensive
- Concurrent code with cache coherency concerns
- When P99/P999 latencies matter more than averages
When NOT to use
- I/O-bound applications (network, disk wait time dominates)
- Simple CRUD apps where milliseconds are acceptable
- Prototypes where correctness matters more than speed
- Single-threaded code without hot paths
- Production systems where profiling overhead is unacceptable
Instructions
Step 1: Set Up Criterion Benchmarking Framework
Add dependencies to Cargo.toml:
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
[[bench]]
name = "my_bench"
harness = false
Create benches/my_bench.rs:
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

// setup_test_data and expensive_operation stand in for your own code.
fn bench_operation(c: &mut Criterion) {
    c.bench_function("operation_name", |b| {
        // Setup (outside timing loop)
        let data = setup_test_data();
        b.iter(|| {
            // Code to benchmark (inside timing loop)
            black_box(expensive_operation(black_box(&data)))
        });
    });
}

criterion_group!(benches, bench_operation);
criterion_main!(benches);
Key points:
- black_box() prevents the compiler from optimizing away the benchmarked code
- Setup runs outside b.iter(), so it is not included in the measured time
- harness = false uses Criterion instead of the built-in test harness
Step 2: Add Parametric Benchmarks
For testing across different inputs:
fn bench_with_params(c: &mut Criterion) {
    let mut group = c.benchmark_group("operation_by_size");
    for size in [64, 256, 1024, 4096].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            let buffer = vec![0u8; size];
            b.iter(|| {
                black_box(process_buffer(black_box(&buffer)))
            });
        });
    }
    group.finish();
}
When to use:
- Testing scalability (does N=1000 perform 10x worse than N=100?)
- Identifying threshold where performance degrades
- Comparing different algorithms across input sizes
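To make the scalability question concrete, the timings from two input sizes can be turned into an estimated growth exponent. The following is a minimal sketch using only the standard library; the function name scaling_exponent is hypothetical and it assumes the cost roughly follows time ≈ c * n^k, which real workloads only approximate.

```rust
/// Estimates the exponent k in `time ≈ c * n^k` from two (size, time) pairs.
/// Returns None when a size or time is not positive, or the sizes are equal.
pub fn scaling_exponent(n1: f64, t1: f64, n2: f64, t2: f64) -> Option<f64> {
    if n1 <= 0.0 || n2 <= 0.0 || t1 <= 0.0 || t2 <= 0.0 || n1 == n2 {
        return None;
    }
    // log(t2 / t1) / log(n2 / n1) is the slope on a log-log plot.
    Some((t2 / t1).ln() / (n2 / n1).ln())
}
```

An exponent near 1 suggests linear scaling (10x the input costs about 10x the time), while a value near 2 suggests quadratic behavior. Feed it the mean or median times that Criterion reports for two group members.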
Step 3: Implement TSC-Based Latency Measurement
For sub-microsecond precision where Criterion's overhead matters:
// src/tsc.rs
use std::sync::OnceLock;
use std::time::{Duration, Instant};

// Counter ticks per nanosecond, computed once by init_tsc().
static TSC_PER_NS: OnceLock<f64> = OnceLock::new();

// On aarch64 there is no TSC; this reads the generic timer's virtual counter.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
pub fn rdtsc() -> u64 {
    let tsc: u64;
    unsafe {
        core::arch::asm!("mrs {}, cntvct_el0", out(reg) tsc, options(nomem, nostack));
    }
    tsc
}

// Note: RDTSC is not a serializing instruction, so the CPU may reorder it
// relative to nearby instructions.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn rdtsc() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}

// Compares the counter against Instant over a 100 ms sleep.
pub fn calibrate_tsc() -> f64 {
    let start_tsc = rdtsc();
    let start = Instant::now();
    std::thread::sleep(Duration::from_millis(100));
    let end_tsc = rdtsc();
    let elapsed_ns = start.elapsed().as_nanos() as f64;
    // wrapping_sub avoids an overflow panic in debug builds.
    end_tsc.wrapping_sub(start_tsc) as f64 / elapsed_ns
}

pub fn init_tsc() {
    TSC_PER_NS.get_or_init(calibrate_tsc);
}

#[inline(always)]
pub fn tsc_to_ns(tsc: u64) -> u64 {
    let factor = TSC_PER_NS.get().expect("TSC not calibrated - call init_tsc()");
    (tsc as f64 / factor) as u64
}
Usage in benchmarks:
fn bench_with_tsc(c: &mut Criterion) {
    // Calibrate once before any measurement.
    init_tsc();
    c.bench_function("hot_path_with_tsc", |b| {
        b.iter(|| {
            let start = rdtsc();
            let result = hot_path_operation();
            let latency_ns = tsc_to_ns(rdtsc().wrapping_sub(start));
            black_box((result, latency_ns))
        });
    });
}
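Reading the counter has a cost of its own, which matters when the measured operation takes only tens of nanoseconds. The sketch below is one way to estimate that fixed cost and subtract it. The names min_read_overhead and corrected are hypothetical, and taking the minimum of back-to-back reads is an assumption about what counts as overhead, not something the TSC code above prescribes.

```rust
/// Returns the smallest difference between two back-to-back clock reads,
/// which approximates the fixed cost of taking one measurement.
/// Returns None when `samples` is zero.
pub fn min_read_overhead(mut read_clock: impl FnMut() -> u64, samples: usize) -> Option<u64> {
    (0..samples)
        .map(|_| {
            let start = read_clock();
            let end = read_clock();
            end.wrapping_sub(start)
        })
        .min()
}

/// Subtracts the estimated overhead from a raw reading without underflowing.
pub fn corrected(raw: u64, overhead: u64) -> u64 {
    raw.saturating_sub(overhead)
}
```

With the rdtsc function from this step in scope, a call such as min_read_overhead(rdtsc, 10_000) would yield the overhead in ticks, which can then be subtracted from each raw tick delta before converting to nanoseconds.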
Step 4: Build Latency Histogram
Track distribution of latencies (P50, P95, P99):
use core::sync::atomic::{AtomicU64, Ordering};

// Each counter gets its own 64-byte cache line to avoid false sharing.
#[repr(C, align(64))]
struct CachePadded<T> {
    value: T,
}

pub struct LatencyHistogram {
    buckets: [CachePadded<AtomicU64>; 13],
    total_samples: CachePadded<AtomicU64>,
    total_latency_ns: CachePadded<AtomicU64>,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        // 13 buckets: 0-100ns, 100-200ns, 200-500ns, 500ns-1μs, 1-2μs, 2-5μs,
        // 5-10μs, 10-20μs, 20-50μs, 50-100μs, 100-200μs, 200-500μs, 500+μs
        Self {
            buckets: core::array::from_fn(|_| CachePadded { value: AtomicU64::new(0) }),
            total_samples: CachePadded { value: AtomicU64::new(0) },
            total_latency_ns: CachePadded { value: AtomicU64::new(0) },
        }
    }

    fn bucket_index(latency_ns: u64) -> usize {
        match latency_ns {
            0..=99 => 0,
            100..=199 => 1,
            200..=499 => 2,
            500..=999 => 3,
            1_000..=1_999 => 4,
            2_000..=4_999 => 5,
            5_000..=9_999 => 6,
            10_000..=19_999 => 7,
            20_000..=49_999 => 8,
            50_000..=99_999 => 9,
            100_000..=199_999 => 10,
            200_000..=499_999 => 11,
            _ => 12,
        }
    }

    pub fn record(&self, latency_ns: u64) {
        let bucket = Self::bucket_index(latency_ns);
        self.buckets[bucket].value.fetch_add(1, Ordering::Relaxed);
        self.total_samples.value.fetch_add(1, Ordering::Relaxed);
        self.total_latency_ns.value.fetch_add(latency_ns, Ordering::Relaxed);
    }

    /// `p` is a fraction in 0.0..=1.0 (0.99 for P99). Returns the midpoint
    /// of the bucket containing that percentile, in nanoseconds.
    pub fn percentile(&self, p: f64) -> u64 {
        let total = self.total_samples.value.load(Ordering::Relaxed);
        if total == 0 {
            return 0;
        }
        // Round up and require at least one sample; a target of 0 would
        // match the first bucket even when it is empty.
        let target_count = ((total as f64 * p).ceil() as u64).clamp(1, total);
        let mut cumulative = 0u64;
        for (i, bucket) in self.buckets.iter().enumerate() {
            cumulative += bucket.value.load(Ordering::Relaxed);
            if cumulative >= target_count {
                return match i {
                    0 => 50, 1 => 150, 2 => 350, 3 => 750,
                    4 => 1_500, 5 => 3_500, 6 => 7_500, 7 => 15_000,
                    8 => 35_000, 9 => 75_000, 10 => 150_000,
                    11 => 350_000,
                    // The last bucket is open-ended; 750_000 is a nominal value.
                    _ => 750_000,
                };
            }
        }
        750_000
    }
}
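Because the histogram reports bucket midpoints, its percentiles are only as precise as the bucket widths. On small runs it can help to cross-check against an exact value computed from the raw samples. The following is a minimal sketch of the nearest-rank method; exact_percentile is a hypothetical helper, not part of the histogram above, and it assumes the samples fit in memory.

```rust
/// Exact nearest-rank percentile over raw samples.
/// `p` is a fraction in 0.0..=1.0; returns None for an empty slice.
pub fn exact_percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest rank: the smallest rank covering at least p * n samples.
    let rank = ((n as f64 * p).ceil() as usize).clamp(1, n);
    Some(sorted[rank - 1])
}
```

If the exact P99 and the histogram's P99 fall in different buckets, the bucket boundaries are probably too coarse for the latency range being measured.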
Step 5: Compare Baseline vs Optimized
Always benchmark before and after optimization:
fn bench_comparison(c: &mut Criterion) {
    let mut group = c.benchmark_group("optimization_comparison");
    group.bench_function("baseline", |b| {
        let data = setup();
        b.iter(|| baseline_implementation(black_box(&data)))
    });
    group.bench_function("optimized", |b| {
        let data = setup();
        b.iter(|| optimized_implementation(black_box(&data)))
    });
    group.finish();
}
Run benchmarks:
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench --bench my_bench
# Save baseline for comparison
cargo bench -- --save-baseline before_opt
# Compare against baseline
cargo bench -- --baseline before_opt
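Criterion prints the change against a saved baseline itself, but the arithmetic is worth spelling out when comparing numbers by hand or in a script. This is a minimal sketch; percent_change, classify, and the noise_pct threshold are hypothetical names and parameters chosen for illustration, not Criterion's API.

```rust
/// Relative change of `current_ns` against `baseline_ns`, in percent.
/// Negative values mean the current run is faster.
pub fn percent_change(baseline_ns: f64, current_ns: f64) -> Option<f64> {
    if baseline_ns <= 0.0 {
        return None;
    }
    Some((current_ns - baseline_ns) / baseline_ns * 100.0)
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Improved,
    Regressed,
    WithinNoise,
}

/// Treats changes smaller than `noise_pct` (an assumed tolerance) as noise.
pub fn classify(change_pct: f64, noise_pct: f64) -> Verdict {
    if change_pct <= -noise_pct {
        Verdict::Improved
    } else if change_pct >= noise_pct {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

For example, going from 200 ns to 150 ns is a change of -25%, which counts as an improvement under any reasonable noise tolerance.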
Step 6: Generate Flamegraphs (Optional)
For CPU profiling to find hot spots:
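# Install cargo-flamegraph
cargo install flamegraph
# Generate flamegraph (uses perf on Linux and may require sudo);
# the trailing --bench makes Criterion run in benchmark mode rather than test mode
cargo flamegraph --bench my_bench -- --bench
# View flamegraph.svg in browser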
Best practices
✅ DO:
- Always use black_box() to prevent dead code elimination
- Benchmark before and after every optimization
- Use --save-baseline to track regressions over time
- Run benchmarks on idle system (close browsers, etc.)
- Calibrate TSC once at startup, reuse factor
- Use logarithmic histogram buckets for wide latency ranges
- Measure P99/P999, not just averages (tail latencies matter)
- Cache-pad histogram buckets to prevent false sharing
- Use iter_batched for setup that shouldn't be timed
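Criterion's Bencher::iter_batched takes a setup closure, a routine closure, and a BatchSize, and excludes the setup from the measurement. The idea behind it can be sketched with the standard library alone. The helper time_batched below is hypothetical and far simpler than what Criterion does (no warm-up, no statistics); it only illustrates which work falls inside the timed region.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Builds `batch` inputs up front, then times only the routine calls.
/// Outputs are dropped after the clock stops so destructors are not timed.
pub fn time_batched<I, O>(
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
    batch: usize,
) -> Duration {
    // Untimed: prepare a fresh input for every routine call.
    let inputs: Vec<I> = (0..batch).map(|_| setup()).collect();
    let mut outputs = Vec::with_capacity(batch);

    let start = Instant::now();
    for input in inputs {
        outputs.push(black_box(routine(black_box(input))));
    }
    let elapsed = start.elapsed();

    drop(outputs);
    elapsed
}
```

This pattern fits routines that consume or mutate their input, such as sorting a vector in place, where reusing one input across iterations would measure the already-sorted case.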
❌ DON'T:
- Don't benchmark on battery power (CPU throttling)
- Don't trust single runs (Criterion does statistics for you)
- Don't optimize without profiling first (premature optimization)
- Don't rely on std::time::Instant alone for sub-microsecond timing (the cost of each call can be comparable to the operation being measured)
- Don't allocate in hot path during measurement
- Don't forget --release when timing outside cargo bench (cargo bench already builds with the optimized bench profile)
- Don't benchmark with debug assertions enabled
Common pitfalls
- Pitfall: Compiler optimizes away entire benchmark
  - Symptom: Unrealistically fast results (picoseconds)
  - Fix: Wrap all inputs and outputs with black_box()
- Pitfall: Setup code is included in timing
  - Symptom: Benchmark measures allocation/initialization instead of operation
  - Fix: Use iter_batched or move setup outside b.iter()
- Pitfall: TSC not calibrated
  - Symptom: Panic with "TSC not calibrated" or garbage nanosecond values
  - Fix: Call init_tsc() once at benchmark start
- Pitfall: Histogram buckets cause false sharing
  - Symptom: Concurrent benchmark is slower than expected
  - Fix: Wrap buckets in a #[repr(C, align(64))] CachePadded struct
- Pitfall: Measuring wrong thing
  - Symptom: Optimizations don't show improvement in benchmark
  - Fix: Profile with flamegraph to confirm hot path is being measured
- Pitfall: TSC frequency scaling on laptops
  - Symptom: Calibration variance >10% across runs
  - Fix: Disable CPU frequency scaling or accept wider tolerance
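The calibration-variance symptom can be checked mechanically by calibrating several times and comparing the resulting factors. The sketch below assumes the factors are collected by the caller, for instance from repeated calls to calibrate_tsc(). The names relative_spread and calibration_is_stable are hypothetical, and using (max - min) / mean as the variance measure is one simple choice among several.

```rust
/// Spread of calibration factors relative to their mean: (max - min) / mean.
/// Returns None for an empty slice or a non-positive mean.
pub fn relative_spread(factors: &[f64]) -> Option<f64> {
    if factors.is_empty() {
        return None;
    }
    let min = factors.iter().copied().fold(f64::INFINITY, f64::min);
    let max = factors.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = factors.iter().sum::<f64>() / factors.len() as f64;
    if mean <= 0.0 {
        return None;
    }
    Some((max - min) / mean)
}

/// True when the spread stays within `tolerance` (0.10 for 10%).
pub fn calibration_is_stable(factors: &[f64], tolerance: f64) -> bool {
    relative_spread(factors).map_or(false, |spread| spread <= tolerance)
}
```

A check like calibration_is_stable(&factors, 0.10) at benchmark start gives an early warning that frequency scaling is distorting the tick-to-nanosecond conversion.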
Related skills
- benchmark-driven-development - Measure-optimize-measure workflow
- cache-line-optimization - Prevent false sharing in concurrent code
- latency-measurement - Sub-microsecond timing techniques
- multi-level-testing - Complement benchmarks with correctness tests
Skill Version: 1.0 Last Updated: 2025-01-06