AgentSkillsCN

analyze-report

Analyze benchmark and test reports to extract insights, flag anomalies, and verify performance claims against spec targets. Reads raw benchmark JSON/output data, computes derived metrics (scaling ratios, contention rates, degradation), and produces actionable findings. Use this skill after running benchmarks or tests to interpret results.

SKILL.md
---
name: analyze-report
description: >
  Analyze benchmark and test reports to extract insights, flag anomalies, and verify
  performance claims against spec targets. Reads raw benchmark JSON/output, computes
  derived metrics (scaling ratios, contention rates, degradation), and produces
  actionable findings. Use after running benchmarks or tests to interpret results.
---

# Analyze Report

Interpret benchmark and test results. Turn raw data into actionable findings.

## When to Use

- After running benchmarks — interpret results, flag anomalies
- After ctest — analyze failures, identify patterns
- Before spec updates — verify performance claims with measured data
- Periodically — track performance trends across runs

## When NOT to Use

- Running benchmarks — run them first, then use this skill to analyze
- Spec quality review — use /spec-review instead
- Test gap analysis — use /test-plan instead

## Inputs

- No argument: find and analyze all reports in the project (benchmark JSON, test output, coverage reports)
- File path: analyze a specific report file
- `--compare A B`: compare two report files (e.g., before/after optimization)

## Workflow

### Phase 1: Discovery

1. Read CLAUDE.md for project structure — find the benchmark output directory and test report directory
2. Glob for benchmark results: `**/*_bench*.json`, `**/benchmark_*.json`
3. Glob for test reports: `**/test-report*`, `**/coverage*`
4. Check for hardware baseline data (if the project has hw-baseline or similar)
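The discovery step can be sketched as a small Python helper. The glob patterns come from the list above; the function name and the grouping by kind are illustrative assumptions:

```python
# Hypothetical sketch: discover report files under a project root.
# The glob patterns mirror the discovery list; names are assumptions.
from pathlib import Path


def find_reports(root: str) -> dict[str, list[str]]:
    """Group report files by kind using recursive glob patterns."""
    patterns = {
        "benchmark": ["**/*_bench*.json", "**/benchmark_*.json"],
        "test": ["**/test-report*", "**/coverage*"],
    }
    base = Path(root)
    found: dict[str, list[str]] = {}
    for kind, pats in patterns.items():
        hits: set[str] = set()
        for pat in pats:
            # "**/" also matches files directly under root
            hits.update(str(p) for p in base.glob(pat))
        found[kind] = sorted(hits)
    return found
```

Sorting the hits keeps the report order stable across runs, which matters when diffing two analysis reports.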

### Phase 2: Parse Raw Data

For each benchmark result:

- Extract scenario names, iterations, timing data (mean, median, P50/P99, min/max)
- Identify the unit (ns, us, ms, ops/sec)
- Group by benchmark suite
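As a sketch of this parsing step, assuming Google Benchmark-style JSON (a top-level `benchmarks` array whose entries carry `name`, `real_time`, and `time_unit`) — the field names are assumptions, so adapt them to the actual tool's output:

```python
# Minimal parser sketch for benchmark JSON; field names assume a
# Google Benchmark-like layout and may need adjusting.
import json
from collections import defaultdict


def parse_benchmark_json(text: str) -> dict[str, list[dict]]:
    """Group benchmark entries by suite (the part of the name before '/')."""
    data = json.loads(text)
    suites: dict[str, list[dict]] = defaultdict(list)
    for bench in data.get("benchmarks", []):
        suite = bench["name"].split("/")[0]
        suites[suite].append({
            "name": bench["name"],
            "time": bench.get("real_time"),
            "unit": bench.get("time_unit", "ns"),
        })
    return dict(suites)
```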

For each test report:

  • Extract pass/fail counts, failure messages
  • Extract coverage data if available

### Phase 3: Compute Derived Metrics

**Scaling analysis**: Compare scenarios that differ by one parameter (e.g., 1 reader vs 4 readers).

```
Scaling ratio = time(N) / time(1)
Ideal: ratio ≈ 1.0 (independent)
Degraded: ratio > 2.0 (contention)
```
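The ratio rule can be expressed as a small helper. The 1.0 and 2.0 thresholds come from the rule above; the function name and the middle "acceptable" band are illustrative assumptions:

```python
# Sketch of the scaling-ratio assessment; thresholds from the rule above,
# everything else is an assumption.
def assess_scaling(time_1: float, time_n: float) -> tuple[float, str]:
    """Return (ratio, assessment) for an N-way scenario vs. the 1-way baseline."""
    ratio = time_n / time_1
    if ratio <= 1.0:
        return ratio, "independent"   # scales ideally
    if ratio > 2.0:
        return ratio, "contention"    # degraded scaling
    return ratio, "acceptable"        # between ideal and degraded
```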

**Contention indicators**: Retry rates, CAS failures, cache miss ratios.

**Regression detection** (when comparing two runs):

```
Regression threshold: > 10% slower on same hardware
Improvement threshold: > 10% faster
Noise band: within 10%
```
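The 10% classification rule can be sketched as a single function; the threshold comes from the rule above, while the function and label names are assumptions:

```python
# Sketch of the regression/improvement/noise classification; the 10%
# threshold is from the rule above, names are assumptions.
def classify_change(before: float, after: float, threshold: float = 0.10) -> str:
    """Classify an after-vs-before timing change (lower time is better)."""
    delta = (after - before) / before
    if delta > threshold:
        return "REGRESSION"   # more than 10% slower
    if delta < -threshold:
        return "IMPROVEMENT"  # more than 10% faster
    return "NOISE"            # within the noise band
```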

### Phase 4: Cross-Reference with Specs

For each performance claim in the specs:

- Find the corresponding benchmark result
- Compare the measured value against the spec target
- Classify: MEETS | EXCEEDS | MISSES | NO_DATA
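A minimal sketch of this classification, assuming lower-is-better timings and reusing the 10% noise band as the tolerance — that reuse is an assumption; a spec may define its own margin:

```python
# Illustrative MEETS/EXCEEDS/MISSES/NO_DATA classifier; the 10% tolerance
# is an assumption borrowed from the noise band.
from typing import Optional


def classify_vs_spec(measured: Optional[float], target: float,
                     tolerance: float = 0.10) -> str:
    """Compare a measured timing against a spec target (lower is better)."""
    if measured is None:
        return "NO_DATA"   # spec claim with no backing benchmark
    if measured <= target * (1 - tolerance):
        return "EXCEEDS"   # clearly faster than the target
    if measured <= target * (1 + tolerance):
        return "MEETS"     # within tolerance of the target
    return "MISSES"
```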

### Phase 5: Report

```markdown
# Benchmark Analysis Report

**Date**: YYYY-MM-DD HH:MM
**Hardware**: {from baseline if available}

## Summary

| Metric | Value |
|--------|-------|
| Benchmarks analyzed | N |
| Spec claims verified | N / M |
| Anomalies found | N |

## Findings

### Performance vs Spec Targets

| Component | Spec Target | Measured | Status |
|-----------|-------------|----------|--------|
| Reactor loop | ~1.6ns (design est.) | P50=4.7ns | MEETS (within order) |
| SeqLock read | ~50ns | — | NO_DATA |

### Scaling Analysis

| Benchmark | 1→N Scaling | Ratio | Assessment |
|-----------|------------|-------|------------|
| DeltaRing 1→4 readers | 2.3ns→845ns | 367x | Expected (SPMC contention) |

### Anomalies

1. **{benchmark}:{scenario}** — {description}
   Data: {numbers}
   Possible cause: {analysis}
   Action: {recommendation}

### Trends (if comparing runs)

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| ... | ... | ... | +/-% |
```

## Principles

- **Data first, interpretation second**: Present raw numbers before drawing conclusions.
- **Hardware context matters**: The same benchmark on different CPUs means different things. Always note the hardware.
- **Noise awareness**: Single-digit percent differences are noise, not signal. Only flag > 10% changes.
- **No spec numbers from thin air**: If a spec claims X ns but no benchmark measures it, report NO_DATA; don't estimate.
- **Actionable output**: Every anomaly should come with a "what to do next" recommendation.