benchmark-logging

Define benchmark runs and log outcomes with consistent metrics, acceptance criteria, and reproducible artifact references.

SKILL.md

--- frontmatter

name: benchmark-logging
description: Define benchmark runs and log outcomes with consistent metrics, acceptance criteria, and reproducible artifact references.

Benchmark Logging

Use this skill to run and document benchmark comparisons between sciClaw and baseline workflows.

When to use

•"run benchmark"
•"compare baseline vs sciclaw"
•"log benchmark outcomes"
•"add acceptance criteria"

Minimum benchmark record

•Benchmark ID and date.
•Task category and scenario definition.
•Baseline command sequence.
•sciClaw command sequence.
•Metrics: task success, reproducibility, latency, and resource usage.
•Acceptance decision (pass/fail) with rationale.
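The fields above map naturally onto a small record type. The following is a minimal sketch, not a format defined by this skill: the class name `BenchmarkRecord`, every field name, the `artifact_paths` field, and the JSON-line serialization are assumptions chosen for illustration, and all values in the example are placeholders rather than real results.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkRecord:
    """One benchmark comparison (hypothetical schema, for illustration only)."""

    benchmark_id: str
    date: str                      # ISO 8601 date string
    task_category: str
    scenario: str                  # frozen scenario definition or a reference to it
    baseline_commands: list[str]
    sciclaw_commands: list[str]
    # Per-workflow metrics: task success, reproducibility, latency, resource usage.
    metrics: dict[str, dict[str, object]]
    accepted: bool                 # acceptance decision: True = pass, False = fail
    rationale: str
    artifact_paths: list[str] = field(default_factory=list)

    def to_json_line(self) -> str:
        """Serialize to one JSON line with sorted keys so logs diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values only; none of these are measured results or real commands.
record = BenchmarkRecord(
    benchmark_id="BENCH-PLACEHOLDER",
    date="1970-01-01",
    task_category="placeholder-category",
    scenario="placeholder scenario definition",
    baseline_commands=["<baseline command>"],
    sciclaw_commands=["<sciClaw command>"],
    metrics={
        "baseline": {"task_success": False, "latency_s": 0.0},
        "sciclaw": {"task_success": False, "latency_s": 0.0},
    },
    accepted=False,
    rationale="placeholder rationale",
)
print(record.to_json_line())
```

Writing one JSON line per record keeps the log append-only and easy to compare across runs; any other structured format would serve the same purpose.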

Workflow

•Freeze scenario definitions before running.
•Execute baseline and sciClaw runs with the same inputs.
•Record metric values and artifact paths.
•Log each failure with root-cause notes and the retry policy that was applied.
•Add manuscript-ready summary sentences only after data is logged.
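The workflow steps can be sketched in code. This is an illustrative outline under stated assumptions, not part of the skill itself: `scenario_fingerprint` and `timed_run` are hypothetical helpers, the retry policy is reduced to a simple attempt limit, and the commands are stand-ins that invoke the Python interpreter, since the actual baseline and sciClaw command sequences are not specified here.

```python
import hashlib
import json
import subprocess
import sys
import time


def scenario_fingerprint(scenario: dict) -> str:
    """Hash a canonical JSON form of the scenario so later edits are detectable."""
    canonical = json.dumps(scenario, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def timed_run(label: str, command: list[str], max_attempts: int = 1) -> dict:
    """Run a command, recording latency and exit code for every attempt."""
    attempts = []
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        proc = subprocess.run(command, capture_output=True, text=True)
        attempts.append({
            "attempt": attempt,
            "returncode": proc.returncode,
            "latency_s": time.perf_counter() - start,
            # Keep the end of stderr as raw material for root-cause notes.
            "stderr_tail": proc.stderr[-500:],
        })
        if proc.returncode == 0:
            break
    return {
        "label": label,
        "command": command,
        "success": attempts[-1]["returncode"] == 0,
        "attempts": attempts,
    }


# Freeze the scenario before running anything (placeholder scenario).
scenario = {"task_category": "placeholder", "inputs": ["placeholder-input"]}
frozen = scenario_fingerprint(scenario)

# Stand-in command; substitute the real baseline and sciClaw sequences.
stand_in = [sys.executable, "-c", "print('ok')"]
results = [
    timed_run("baseline", stand_in, max_attempts=2),
    timed_run("sciclaw", stand_in, max_attempts=2),
]

# Confirm the scenario was not modified between freezing and logging.
assert scenario_fingerprint(scenario) == frozen
```

Storing the fingerprint alongside the results makes it possible to verify later that both runs used the same frozen scenario. Resource usage is not measured in this sketch and would need a separate mechanism.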