Reproducibility Report Generator
Generate comprehensive reproducibility reports for ML/LLM experiments, based on the requirements of the NeurIPS Paper Checklist, the ACL Responsible NLP Research checklist, and the ML Reproducibility Checklist.
References:
- •~/.claude/ai_docs/reproducibility-checklist.md - Full NeurIPS 16-question checklist
- •~/.claude/ai_docs/ci-standards.md - CI computation methodology
When to Use This Skill
Use this skill:
- •After completing an experiment or series of related experiments
- •Before sharing results with collaborators
- •Before writing up findings for publication
- •When archiving experiment runs
Fundamental Principles
Completeness Over Convenience: A reproducibility report should contain everything needed to replicate results, even if some information seems obvious. Future you (or a collaborator) will thank present you.
Extract, Don't Fabricate: Auto-extract from code, configs, and logs. For anything missing, ask explicitly rather than assuming or inventing values. Leave {placeholder} markers for unknown items.
Prioritize Critical Information: Not all reproducibility concerns are equal. Model identity and prompts are critical; compute costs are helpful but not essential. Focus effort on what would block replication.
Document the Delta: If this experiment differs from a baseline or prior run, explicitly note what changed and why.
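One way to document the delta is to diff the baseline and current configs key by key. The sketch below assumes both configs are already loaded as nested Python dicts; the helper names `flatten` and `config_delta` and the example keys are illustrative, not part of this skill.

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"model": {"id": 1}} -> {"model.id": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def config_delta(baseline: dict[str, Any], current: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, current_value)} for every key that differs."""
    missing = "<absent>"  # marker for keys present on only one side
    old, new = flatten(baseline), flatten(current)
    return {
        key: (old.get(key, missing), new.get(key, missing))
        for key in sorted(old.keys() | new.keys())
        if old.get(key, missing) != new.get(key, missing)
    }


# Hypothetical configs, for illustration only.
baseline = {"model": {"temperature": 0.0, "max_tokens": 512}, "seed": 1}
current = {"model": {"temperature": 0.7, "max_tokens": 512}, "seed": 1, "n_samples": 5}
for key, (before, after) in config_delta(baseline, current).items():
    print(f"{key}: {before} -> {after}")
```

The diff only shows what changed; the reason for each change still has to come from the user.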
Workflow
1. Auto-Extract Information
Automatically gather reproducibility information from:
Code & Configs:
- •src/configs/ - Hydra config files (model, task, prompts)
- •src/configs/prompts/ - Full prompt templates
- •.hydra/config.yaml and .hydra/overrides.yaml in output dirs
- •Python files for sampling parameters, API calls, caching logic
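As a rough sketch of the extraction step, the overrides recorded for a run can be read back from its output directory. This assumes `.hydra/overrides.yaml` is a flat YAML list of `key=value` strings and parses it line by line with the standard library only; a real YAML parser would be more robust for quoted or multi-line values. The function name `read_overrides` and the example path are hypothetical.

```python
from pathlib import Path


def read_overrides(run_dir: Path) -> dict[str, str]:
    """Parse .hydra/overrides.yaml, assumed to be a flat list of 'key=value' entries."""
    path = run_dir / ".hydra" / "overrides.yaml"
    if not path.exists():
        return {}
    overrides: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip blank lines and anything that is not a list entry
        entry = line[2:].strip().strip("'\"")  # drop optional surrounding quotes
        key, sep, value = entry.partition("=")
        if sep:  # entries without '=' are ignored in this sketch
            overrides[key] = value
    return overrides


if __name__ == "__main__":
    # Hypothetical run directory; prints {} when no overrides file is found.
    print(read_overrides(Path("out/example_run")))
```

Values are kept as raw strings so the report shows exactly what was passed on the command line.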
Experiment Outputs:
- •out/*/ directories - Hydra logs, results.jsonl
- •.cache/ directory structure - Caching strategy evidence
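A quick inventory of `results.jsonl` helps confirm what a run actually produced before it is described in the report. The sketch below makes no assumption about the record schema: it only counts records and collects the field names it sees. The function name `summarize_jsonl` is illustrative.

```python
import json
from pathlib import Path


def summarize_jsonl(path: Path) -> dict:
    """Count records and list the field names seen in a JSON Lines file."""
    n_records, n_malformed, fields = 0, 0, set()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                n_malformed += 1  # report malformed lines rather than hiding them
                continue
            n_records += 1
            if isinstance(record, dict):
                fields.update(record)  # adds the record's keys
    return {"records": n_records, "malformed": n_malformed, "fields": sorted(fields)}
```

A nonzero malformed count is worth recording in the report, since it may indicate an interrupted run.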
Environment:
- •pyproject.toml or requirements.txt - Dependencies
- •Git commit hash and dirty state
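The Git commit hash and dirty state can be captured with two standard Git commands, `git rev-parse HEAD` and `git status --porcelain`. This is a minimal sketch: the function name `git_state` is illustrative, and it returns `None` values instead of raising when Git is unavailable or the directory is not a repository, so the report can leave a placeholder.

```python
import subprocess
from pathlib import Path
from typing import Optional


def _git(args: list[str], cwd: Path) -> Optional[str]:
    """Run a git command and return its stripped stdout, or None on any failure."""
    try:
        done = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


def git_state(repo: Path = Path(".")) -> dict:
    """Return the current commit hash and whether the working tree has changes."""
    commit = _git(["rev-parse", "HEAD"], repo)
    status = _git(["status", "--porcelain"], repo)
    return {
        "commit": commit,
        # Any porcelain output means uncommitted or untracked changes.
        "dirty": None if status is None else bool(status),
    }


if __name__ == "__main__":
    print(git_state())
```

If the tree is dirty, the commit hash alone does not identify the code that ran, so the report should say so explicitly.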
2. Ask Targeted Questions for Gaps
For any missing critical information, ask the user directly. See references/checklist.md for common gaps to ask about.
3. Generate Report
Write to: out/{experiment_dir}/reproducibility.md
Use the template in references/template.md.
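Filling the template can follow the "Extract, Don't Fabricate" rule mechanically: known values are substituted and unknown ones stay as `{placeholder}` markers. The sketch below assumes the template uses Python-style `{field}` markers, which may not match the actual format of references/template.md; the names `KeepMissing`, `render_report`, and `write_report` and the field names are hypothetical.

```python
from pathlib import Path


class KeepMissing(dict):
    """dict that renders unknown fields back as their own {placeholder} marker."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_report(template: str, known: dict) -> str:
    """Substitute known values; fields that are None or absent stay as {field}."""
    values = {k: v for k, v in known.items() if v is not None}
    return template.format_map(KeepMissing(values))


def write_report(out_root: Path, experiment_dir: str, text: str) -> Path:
    """Write the report to <out_root>/<experiment_dir>/reproducibility.md."""
    target = out_root / experiment_dir / "reproducibility.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Hypothetical template and values, for illustration only.
template = "Model: {model_id}\nCommit: {git_commit}\nAPI cost: {api_cost}\n"
print(render_report(template, {"model_id": "example-model-2024-01-01", "git_commit": "abc1234"}))
```

Here `api_cost` is not supplied, so it stays in the output as `{api_cost}` and marks a question to ask the user. Literal braces in a template (for example, JSON inside a prompt) would need to be escaped as `{{` and `}}` for this approach to work.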
Reference Files
- •references/template.md - Full report template with all sections
- •references/checklist.md - Validation checklists (Critical/Important/Helpful) and common gaps
- •references/llm-specific.md - LLM reproducibility concerns, provider metadata tables
Quick Reference
When generating a report, prioritize:
- •Model identity - Exact ID with version suffix, not aliases like "gpt-4" or "claude-3-sonnet"
- •Full prompts - Verbatim, including whitespace and special characters
- •Data paths and versions - Exact file paths, dataset versions, commit hashes
- •Exact reproduction command - Copy-pasteable, including all flags and overrides
- •Known limitations - What assumptions might not hold in practice?
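The model identity check can be partly automated with a heuristic that flags IDs lacking a version suffix. The sketch below assumes pinned IDs end in a date or a four-digit snapshot number, which holds for IDs such as `gpt-4-0613` or `claude-3-sonnet-20240229` but is not a rule every provider follows, so treat a flag as a prompt to ask the user, not as proof. The function name `looks_like_alias` is illustrative.

```python
import re

# Heuristic (an assumption): a pinned ID ends in YYYY-MM-DD, YYYYMMDD, or a
# four-digit snapshot number, separated from the name by '-', '_' or '@'.
_VERSION_SUFFIX = re.compile(r"[-@_](\d{4}-\d{2}-\d{2}|\d{8}|\d{4})$")


def looks_like_alias(model_id: str) -> bool:
    """Return True when the ID has no recognizable version suffix."""
    return _VERSION_SUFFIX.search(model_id.strip()) is None


for model_id in ["gpt-4", "claude-3-sonnet", "gpt-4-0613", "claude-3-sonnet-20240229"]:
    flag = "ask for exact ID" if looks_like_alias(model_id) else "ok"
    print(f"{model_id}: {flag}")
```

Open-weight models often have no date suffix at all; for those, a checkpoint revision or commit hash is the more useful identifier to record.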
Always ask for:
- •API costs (approximate is fine)
- •Unstated assumptions
- •Failed approaches that were tried
- •Prompt iteration history (if relevant)