Benchmarking & performance skill (eo-processor)

Use this skill to make performance work repeatable, measurable, and safe in a hybrid Rust+PyO3+Python Earth Observation library.

The core objective is to answer, defensibly:

•Did performance change (time, memory, allocations)?
•Did correctness change (numerics, dtype/shape contract)?
•Is the change worth the complexity?

When to activate

Activate this skill when you:

•add a new Rust kernel or change an existing one
•refactor any hot loop / ndarray expression in Rust
•adjust parallelism, chunking, or algorithmic complexity
•change dtype handling (float32/float64) or NaN/Inf behavior (performance often couples to semantics)
•plan to claim speedups in docs/CHANGELOG/PR

Do not activate for trivial doc updates or purely cosmetic refactors.

Principles (performance without breaking trust)

•Correctness is a gate: benchmarking without validating outputs is not acceptable.
•Measure the right thing: avoid benchmarking builds that differ in optimization flags; compare like-for-like.
•Warm up: compilation and caching effects can dominate first-run timings.
•Use representative shapes: EO workloads are usually large (e.g., 1k²–10k² rasters), not just toy arrays.
•Avoid accidental regressions: track both speed and memory; faster-but-alloc-heavy can be worse in real pipelines.
•No unverified claims: do not claim performance improvements without before/after numbers and parameters.

Step 1: Define the benchmark question

Write down:

•What operation is being measured? (function name and version)
•What data sizes/shapes? (e.g., 1000x1000, 5000x5000)
•What data distribution? (random uniform, realistic reflectance range, with/without NaNs)
•What environment? (CPU model optional, OS, Python version, Rust release build, number of threads)
•What metric? (wall time, throughput in pixels/s, peak RSS if available)

Acceptance criteria examples

•“New implementation must be ≥ 1.2× faster than baseline for 5000x5000 float64 arrays”
•“No more than +5% memory overhead compared to baseline”
•“Numerical difference within tolerance (float64: 1e-12; float32: 1e-5)”

Step 2: Always pin build mode & runtime settings

Build mode

•Benchmark only release builds for the Rust extension.
•Ensure you’re not comparing a debug build vs release build.
•If you compare against NumPy, ensure BLAS settings are stable.

Threading

Performance changes can be dominated by threading differences:

•Pin thread counts consistently across runs.
•If the repo uses Rayon or similar internally, control its thread count via environment (document what you used).
•For NumPy baselines, ensure you understand whether BLAS threads are involved.

Stability

•Close other heavy processes.
•Prefer running multiple trials and report median (and optionally p95).

Step 3: Correctness check BEFORE timing

Before timing, validate:

•output shape matches contract
•dtype matches contract
•outputs are finite/NaN per contract
•values match baseline within tolerance

Baseline options

Pick the best available baseline:

•an existing eo-processor implementation (before refactor)
•a pure NumPy reference implementation of the formula
•a known-correct small example with asserted values

If you can’t produce a correctness check, stop and add one first (usually a unit test in tests/).

Step 4: Benchmark protocol (repeatable)

Use this protocol for each benchmark you report:

•
Generate inputs
- •Use seeded random generation for reproducibility.
- •Use realistic ranges (e.g., reflectance in [0, 1]) unless the function expects something else.
•
Warm-up
- •Call the function at least once to warm caches and ensure the extension is loaded.
•
Time multiple trials
- •Run N trials (e.g., 5–20).
- •Record the median and a dispersion metric (min/max or p95).
•
Report
- •Provide: shape, dtype, trials, median time, throughput, build mode, thread count.

Example Python timing skeleton (adapt to repo norms)

•Use time.perf_counter() not time.time()
•Avoid counting array creation time in the timed region
•Ensure outputs are consumed to avoid lazy evaluation traps (especially if using Dask wrappers)

Step 5: Evaluate results (interpretation)

Speedup thresholds

•< 1.05×: likely noise or not worth complexity unless it also reduces memory or fixes bugs.
•1.05×–1.2×: consider whether it’s worth it; check memory/allocations and sustained performance.
•≥ 1.2×: usually meaningful for EO rasters; still verify no semantic drift.

Check for regressions

•Small arrays sometimes get slower while large arrays get faster. That’s acceptable if the library targets large rasters—just document it.
•Watch for “fast median, slow tail”: if p95 worsened, investigate allocation spikes or scheduling.

Memory and allocations

If possible, assess:

•intermediate allocations (ndarray expression temporaries are common)
•peak memory usage (large rasters can OOM quickly)

If you can’t measure memory precisely, at least reason about allocations:

•Did you add temporaries?
•Did you switch from in-place fill to multiple intermediate arrays?

Step 6: Profiling & root-cause techniques (use selectively)

Use profiling only when the benchmark indicates a meaningful issue.

Common performance traps in this repo’s domain

•Extra temporaries from chained ndarray arithmetic
•Unintended dtype conversions / casts
•Poor cache locality due to iteration order
•Branch-heavy inner loops (NaN handling, conditional masking)
•Parallel overhead dominating small inputs

Suggested methods (choose what applies)

•Add lightweight internal instrumentation (counts, timing spans) temporarily; remove before commit.
•Compare “fused loop” vs “expression-based” implementations.
•Ensure the hottest path is in Rust (not Python glue).

If you can’t profile locally, you can still:

•reduce the workload to isolate which step dominates
•compare variants with minimal changes to infer the cause

Step 7: How to make performance changes safely

Preferred sequence:

•Implement change with correctness tests.
•Benchmark before/after with pinned settings.
•If speedup is real, clean up code and document results.
•If speedup is not real, revert or open a performance issue with data.

Safe optimizations (common wins)

•Fuse computations into a single pass over the array data.
•Precompute invariants outside inner loops.
•Reduce allocations: write into a preallocated output array.
•Avoid repeated bounds checks where safe and idiomatic (without unsafe unless justified).

Risky optimizations (require stronger evidence)

•Introducing unsafe
•Changing NaN/Inf behavior for speed
•Changing dtype semantics (float32 vs float64)
•Altering parallel thresholds or scheduling without benchmarks on multiple shapes

Reporting template (use in your response/PR)

When you use this skill, report results in this format:

Benchmark setup

•Function(s): ...
•Build: release (how built)
•Hardware/OS: (if known)
•Threads: (Rayon/BLAS/Python settings)
•Input: shape(s), dtype(s), distribution (seeded)

Correctness

•Baseline: (existing impl / NumPy reference)
•Tolerance: (e.g., rtol/atol)
•Result: pass/fail (and any notes about NaN behavior)

Performance results

For each shape/dtype:

•Before: median X ms (N trials)
•After: median Y ms (N trials)
•Speedup: X/Y (or %)
•Throughput: pixels/s (optional)
•Notes: memory/allocations qualitative notes

Decision

•Keep / revise / revert
•Follow-ups (tests, docs, perf issue)

Definition of done (for perf work)

You’re done when:

• Correctness is validated against a baseline
• Benchmarks are repeatable and documented (inputs + settings)
• Any performance claim has before/after numbers
• No unacceptable memory regression is introduced (or it’s explicitly justified)
• Tests cover at least one representative correctness case and key edge cases
• The change doesn’t silently alter public semantics (or docs/versioning reflect it)

Local references (repo)

•Engineering rules & quality gates: AGENTS.md
•User-facing docs and performance notes: README.md, QUICKSTART.md
•Complex workflows: WORKFLOWS.md
•Benchmark artifacts and harnesses: benchmarking/, dist_bench.json, dist_bench.md, benchmark-*.json
•Scripts and maintenance tooling: scripts/
•Tests: tests/