Benchmarking & performance skill (eo-processor)
Use this skill to make performance work repeatable, measurable, and safe in a hybrid Rust+PyO3+Python Earth Observation library.
The core objective is to answer, defensibly:
- •Did performance change (time, memory, allocations)?
- •Did correctness change (numerics, dtype/shape contract)?
- •Is the change worth the complexity?
When to activate
Activate this skill when you:
- •add a new Rust kernel or change an existing one
- •refactor any hot loop / ndarray expression in Rust
- •adjust parallelism, chunking, or algorithmic complexity
- •change dtype handling (float32/float64) or NaN/Inf behavior (performance often couples to semantics)
- •plan to claim speedups in docs/CHANGELOG/PR
Do not activate for trivial doc updates or purely cosmetic refactors.
Principles (performance without breaking trust)
- •Correctness is a gate: benchmarking without validating outputs is not acceptable.
- •Measure the right thing: avoid benchmarking builds that differ in optimization flags; compare like-for-like.
- •Warm up: compilation and caching effects can dominate first-run timings.
- •Use representative shapes: EO workloads are usually large (e.g., 1k²–10k² rasters), not just toy arrays.
- •Avoid accidental regressions: track both speed and memory; faster-but-alloc-heavy can be worse in real pipelines.
- •No unverified claims: do not claim performance improvements without before/after numbers and parameters.
Step 1: Define the benchmark question
Write down:
- •What operation is being measured? (function name and version)
- •What data sizes/shapes? (e.g., 1000x1000, 5000x5000)
- •What data distribution? (random uniform, realistic reflectance range, with/without NaNs)
- •What environment? (CPU model optional, OS, Python version, Rust release build, number of threads)
- •What metric? (wall time, throughput in pixels/s, peak RSS if available)
Acceptance criteria examples
- •“New implementation must be ≥ 1.2× faster than baseline for 5000x5000 float64 arrays”
- •“No more than +5% memory overhead compared to baseline”
- •“Numerical difference within tolerance (float64: 1e-12; float32: 1e-5)”
Step 2: Always pin build mode & runtime settings
Build mode
- •Benchmark only release builds for the Rust extension.
- •Ensure you’re not comparing a debug build vs release build.
- •If you compare against NumPy, ensure BLAS settings are stable.
Threading
Performance changes can be dominated by threading differences:
- •Pin thread counts consistently across runs.
- •If the repo uses Rayon or similar internally, control its thread count via environment (document what you used).
- •For NumPy baselines, ensure you understand whether BLAS threads are involved.
Stability
- •Close other heavy processes.
- •Prefer running multiple trials and report median (and optionally p95).
Step 3: Correctness check BEFORE timing
Before timing, validate:
- •output shape matches contract
- •dtype matches contract
- •outputs are finite/NaN per contract
- •values match baseline within tolerance
Baseline options
Pick the best available baseline:
- •an existing eo-processor implementation (before refactor)
- •a pure NumPy reference implementation of the formula
- •a known-correct small example with asserted values
If you can’t produce a correctness check, stop and add one first (usually a unit test in tests/).
Step 4: Benchmark protocol (repeatable)
Use this protocol for each benchmark you report:
- •
Generate inputs
- •Use seeded random generation for reproducibility.
- •Use realistic ranges (e.g., reflectance in [0, 1]) unless the function expects something else.
- •
Warm-up
- •Call the function at least once to warm caches and ensure the extension is loaded.
- •
Time multiple trials
- •Run N trials (e.g., 5–20).
- •Record the median and a dispersion metric (min/max or p95).
- •
Report
- •Provide: shape, dtype, trials, median time, throughput, build mode, thread count.
Example Python timing skeleton (adapt to repo norms)
- •Use
time.perf_counter()nottime.time() - •Avoid counting array creation time in the timed region
- •Ensure outputs are consumed to avoid lazy evaluation traps (especially if using Dask wrappers)
Step 5: Evaluate results (interpretation)
Speedup thresholds
- •< 1.05×: likely noise or not worth complexity unless it also reduces memory or fixes bugs.
- •1.05×–1.2×: consider whether it’s worth it; check memory/allocations and sustained performance.
- •≥ 1.2×: usually meaningful for EO rasters; still verify no semantic drift.
Check for regressions
- •Small arrays sometimes get slower while large arrays get faster. That’s acceptable if the library targets large rasters—just document it.
- •Watch for “fast median, slow tail”: if p95 worsened, investigate allocation spikes or scheduling.
Memory and allocations
If possible, assess:
- •intermediate allocations (ndarray expression temporaries are common)
- •peak memory usage (large rasters can OOM quickly)
If you can’t measure memory precisely, at least reason about allocations:
- •Did you add temporaries?
- •Did you switch from in-place fill to multiple intermediate arrays?
Step 6: Profiling & root-cause techniques (use selectively)
Use profiling only when the benchmark indicates a meaningful issue.
Common performance traps in this repo’s domain
- •Extra temporaries from chained ndarray arithmetic
- •Unintended dtype conversions / casts
- •Poor cache locality due to iteration order
- •Branch-heavy inner loops (NaN handling, conditional masking)
- •Parallel overhead dominating small inputs
Suggested methods (choose what applies)
- •Add lightweight internal instrumentation (counts, timing spans) temporarily; remove before commit.
- •Compare “fused loop” vs “expression-based” implementations.
- •Ensure the hottest path is in Rust (not Python glue).
If you can’t profile locally, you can still:
- •reduce the workload to isolate which step dominates
- •compare variants with minimal changes to infer the cause
Step 7: How to make performance changes safely
Preferred sequence:
- •Implement change with correctness tests.
- •Benchmark before/after with pinned settings.
- •If speedup is real, clean up code and document results.
- •If speedup is not real, revert or open a performance issue with data.
Safe optimizations (common wins)
- •Fuse computations into a single pass over the array data.
- •Precompute invariants outside inner loops.
- •Reduce allocations: write into a preallocated output array.
- •Avoid repeated bounds checks where safe and idiomatic (without
unsafeunless justified).
Risky optimizations (require stronger evidence)
- •Introducing
unsafe - •Changing NaN/Inf behavior for speed
- •Changing dtype semantics (float32 vs float64)
- •Altering parallel thresholds or scheduling without benchmarks on multiple shapes
Reporting template (use in your response/PR)
When you use this skill, report results in this format:
Benchmark setup
- •Function(s):
... - •Build: release (how built)
- •Hardware/OS: (if known)
- •Threads: (Rayon/BLAS/Python settings)
- •Input: shape(s), dtype(s), distribution (seeded)
Correctness
- •Baseline: (existing impl / NumPy reference)
- •Tolerance: (e.g., rtol/atol)
- •Result: pass/fail (and any notes about NaN behavior)
Performance results
For each shape/dtype:
- •Before: median X ms (N trials)
- •After: median Y ms (N trials)
- •Speedup: X/Y (or %)
- •Throughput: pixels/s (optional)
- •Notes: memory/allocations qualitative notes
Decision
- •Keep / revise / revert
- •Follow-ups (tests, docs, perf issue)
Definition of done (for perf work)
You’re done when:
- • Correctness is validated against a baseline
- • Benchmarks are repeatable and documented (inputs + settings)
- • Any performance claim has before/after numbers
- • No unacceptable memory regression is introduced (or it’s explicitly justified)
- • Tests cover at least one representative correctness case and key edge cases
- • The change doesn’t silently alter public semantics (or docs/versioning reflect it)
Local references (repo)
- •Engineering rules & quality gates:
AGENTS.md - •User-facing docs and performance notes:
README.md,QUICKSTART.md - •Complex workflows:
WORKFLOWS.md - •Benchmark artifacts and harnesses:
benchmarking/,dist_bench.json,dist_bench.md,benchmark-*.json - •Scripts and maintenance tooling:
scripts/ - •Tests:
tests/