Eval Bench
Use this skill to evaluate newly trained models before release. Evaluation must be repeatable, comparable across runs, and easy for humans to interpret.
Core Benchmarking (lm-evaluation-harness)
Use lm-evaluation-harness as the default benchmark runner.
Include common suites such as:
- MMLU
- HellaSwag
- ARC (Easy/Challenge)
- other tasks relevant to the model domain
Benchmark requirements:
- pin benchmark and task versions when possible
- record the prompt/eval configuration and decoding settings used for each run
- keep scores comparable across model revisions by reusing the same tasks, few-shot counts, and decoding settings
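One way to keep runs repeatable is to build the harness invocation from a single recorded config rather than typing it by hand. The sketch below assumes the `lm_eval` command-line entry point of lm-evaluation-harness and its `--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, and `--output_path` flags. The config keys, the checkpoint path, and the few-shot and batch values are hypothetical placeholders, not settings prescribed by this skill.

```python
import shlex

# Hypothetical config; in practice this would be loaded from eval_config.yaml.
EXAMPLE_CONFIG = {
    "model_backend": "hf",
    "model_path": "path/to/checkpoint",  # placeholder, not a real path
    "tasks": ["mmlu", "hellaswag", "arc_easy", "arc_challenge"],
    "num_fewshot": 5,   # illustrative value only
    "batch_size": 8,    # illustrative value only
    "output_path": "benchmark_results",
}


def build_lm_eval_command(config):
    """Return the lm_eval CLI call as an argument list (nothing is executed)."""
    return [
        "lm_eval",
        "--model", config["model_backend"],
        "--model_args", f"pretrained={config['model_path']}",
        "--tasks", ",".join(config["tasks"]),
        "--num_fewshot", str(config["num_fewshot"]),
        "--batch_size", str(config["batch_size"]),
        "--output_path", config["output_path"],
    ]


if __name__ == "__main__":
    # Print a shell-safe command line that can be logged next to the results.
    print(shlex.join(build_lm_eval_command(EXAMPLE_CONFIG)))
```

Logging the printed command together with the installed harness version gives later runs something concrete to compare against.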
RAG Evaluation (Ragas)
When the target use case includes retrieval-augmented generation, run Ragas evaluation with documented settings.
Track both retrieval and generation quality signals (for example, context precision and recall, faithfulness, and answer relevancy) and summarize recurring failure patterns.
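Summarizing failure patterns can start from per-sample scores. The sketch below does not call Ragas itself; it assumes the per-sample scores from a Ragas run have already been exported as a list of dictionaries keyed by Ragas metric names. The `id` field, the `summarize_rag_scores` helper, and the 0.5 threshold are hypothetical choices for illustration.

```python
from statistics import mean

# Metric names as used by Ragas; the row layout itself is an assumption.
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def summarize_rag_scores(rows, threshold=0.5):
    """Aggregate per-sample scores and list the samples that fall below threshold."""
    summary = {}
    for metric in METRICS:
        # Skip samples where the metric is missing or could not be computed.
        scored = [(row["id"], row[metric]) for row in rows if row.get(metric) is not None]
        values = [score for _, score in scored]
        failing = [sample_id for sample_id, score in scored if score < threshold]
        summary[metric] = {
            "mean": mean(values) if values else None,
            "failure_count": len(failing),
            "failing_ids": failing,
        }
    return summary
```

Samples that fail retrieval metrics but pass generation metrics (or the reverse) are usually the most informative ones to read by hand when writing up failure patterns.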
Perplexity Measurement
Compute perplexity on representative held-out datasets. Report:
- overall perplexity
- perplexity by domain/slice
- notable regressions vs prior checkpoints
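The three reported quantities can be derived from per-token log-probabilities. The sketch below uses the standard definition of perplexity, the exponential of the negative mean natural-log probability per token, and assumes the model's token log-probabilities have already been collected per slice. The function names, the `previous` mapping of prior-checkpoint values, and the 2% tolerance are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_report(logprobs_by_slice, previous=None, tolerance=0.02):
    """Compute per-slice and overall perplexity and flag regressions.

    `previous` maps slice names (and "overall") to the prior checkpoint's
    perplexity. A slice counts as regressed when its perplexity rises by
    more than `tolerance` (relative). Avoid naming a slice "overall".
    """
    report = {}
    all_logprobs = []
    for name, logprobs in logprobs_by_slice.items():
        report[name] = perplexity(logprobs)
        all_logprobs.extend(logprobs)
    # Overall perplexity is token-weighted, not an average of slice values.
    report["overall"] = perplexity(all_logprobs)

    regressions = {}
    for name, value in report.items():
        prior = (previous or {}).get(name)
        if prior is not None and value > prior * (1 + tolerance):
            regressions[name] = (prior, value)
    return report, regressions
```

Pooling tokens for the overall figure keeps long slices from being under-weighted, which is why it is not computed as the mean of the per-slice perplexities.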
Reporting
Generate a human-readable evaluation report that includes:
- benchmark score table
- perplexity summary
- RAG metrics (if applicable)
- strengths, weaknesses, and release-readiness recommendation
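The report sections listed above can be assembled from already-computed results with a small renderer. This is a minimal sketch, assuming scores arrive as plain dictionaries; the `render_report` function, its parameters, and the section titles are hypothetical and can be adapted to whatever layout the team prefers.

```python
def render_report(model_name, benchmark_scores, perplexity_by_slice,
                  strengths, weaknesses, recommendation, rag_metrics=None):
    """Render a Markdown evaluation report as a single string."""
    lines = [f"# Evaluation report: {model_name}", ""]

    # Benchmark score table, sorted by task name for stable diffs.
    lines += ["## Benchmark scores", "", "| Task | Score |", "| --- | --- |"]
    for task, score in sorted(benchmark_scores.items()):
        lines.append(f"| {task} | {score:.4f} |")

    lines += ["", "## Perplexity", ""]
    for slice_name, value in sorted(perplexity_by_slice.items()):
        lines.append(f"- {slice_name}: {value:.2f}")

    # The RAG section is emitted only when RAG metrics were collected.
    if rag_metrics:
        lines += ["", "## RAG metrics", ""]
        for metric, value in sorted(rag_metrics.items()):
            lines.append(f"- {metric}: {value:.3f}")

    lines += ["", "## Strengths", ""] + [f"- {item}" for item in strengths]
    lines += ["", "## Weaknesses", ""] + [f"- {item}" for item in weaknesses]
    lines += ["", "## Release-readiness recommendation", "", recommendation, ""]
    return "\n".join(lines)
```

The returned string can be written directly to `evaluation_report.md`; sorting keys keeps the output stable so reports from different checkpoints diff cleanly.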
Deliverables
- eval_config.yaml (tasks + settings)
- benchmark_results.json (raw machine-readable outputs)
- evaluation_report.md (human-readable summary)
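Before handing off a run, it can help to confirm that all three deliverables exist and are non-empty. The check below is a minimal sketch; the `missing_deliverables` helper and the idea of a single output directory holding all three files are assumptions, not requirements of this skill.

```python
from pathlib import Path

# File names taken from the deliverables list above.
REQUIRED_DELIVERABLES = (
    "eval_config.yaml",
    "benchmark_results.json",
    "evaluation_report.md",
)


def missing_deliverables(output_dir):
    """Return the deliverable names that are absent or empty in output_dir."""
    root = Path(output_dir)
    missing = []
    for name in REQUIRED_DELIVERABLES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty return value means the run produced every expected file; anything else names what still needs to be generated.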