Paper Analysis Methodology
Analyze ML/AI research papers using S. Keshav's three-pass reading method. Produces structured, self-contained summaries that capture all key information.
Overview
This skill provides a systematic framework for reading and analyzing research papers. Use this skill when you need to:
- •Analyze a research paper thoroughly
- •Extract implementation-relevant details
- •Produce a structured summary for future reference
- •Evaluate paper quality and contributions
- •Identify follow-up work and open questions
Three-Pass Method
Pass 1: Bird's-Eye View
Quick scan to understand what the paper is about:
- •Read the title, abstract, and introduction carefully
- •Read all section and sub-section headings (ignore body text)
- •Glance at mathematical content to identify theoretical foundations
- •Read the conclusions
- •Scan the references, noting which ones you recognize
After this pass, answer the Five Cs:
| C | Question |
|---|---|
| Category | What type of paper? (empirical study, new architecture, theoretical analysis, benchmark, survey, system description, method/technique) |
| Context | What prior work does it build on? What theoretical bases are used? |
| Correctness | Do the assumptions appear valid? |
| Contributions | What are the main contributions claimed? |
| Clarity | Is it well written? Clear structure? |
Resource Discovery
After Pass 1, search for supporting resources:
- •Code: Check paper footer/abstract for repo links, search GitHub by title, check PapersWithCode
- •Community implementations: Search GitHub for reimplementations; note framework (PyTorch, JAX, TF, etc.)
- •Presentations: Look for conference talks, author walkthroughs, or video explainers
- •Blog posts: Check author blogs, Distill, and popular ML blogs for write-ups
- •Supplementary materials: Project pages, appendices, datasets, interactive demos
- •Citation: Retrieve BibTeX from arxiv, Semantic Scholar, or the publisher
Pass 2: Content Grasp
Read with greater care. Ignore proofs and dense math derivations for now.
- •Figures & diagrams: Examine every figure, table, and diagram carefully
- •Are axes labeled? Error bars present? Results statistically significant?
- •What do the architecture diagrams reveal about the approach?
- •Key claims: Note every major claim and its supporting evidence
- •Method details: Understand the proposed method at a high level
- •What is the input/output?
- •What are the key components?
- •How does training work?
- •Experimental setup: Datasets, baselines, metrics, hardware
- •Computational cost: Note parameter counts, FLOPs, GPU hours, memory requirements
- •Results: Main results tables, ablation studies, comparisons
- •Are confidence intervals or error bars reported? How many runs/seeds?
- •Terminology: Note unfamiliar terms, acronyms, or concepts
- •Unread references: Mark important cited papers for follow-up
Pass 3: Deep Understanding
Virtually re-implement the paper mentally. Challenge everything.
- •Assumptions: Identify and challenge every assumption
- •Are they stated explicitly or implicit?
- •Are they reasonable for the problem domain?
- •Methodology critique: Could this be done differently or better?
- •What are the hidden failings?
- •What design choices are not well justified?
- •Mathematical rigor: Verify key equations and derivations
- •Experimental validity: Scrutinize the evaluation
- •Are baselines fair and up-to-date?
- •Is the evaluation protocol standard for the field?
- •Could the results be explained by confounding factors?
- •Reproducibility: Could you reimplement this?
- •Are hyperparameters fully specified?
- •Is the data pipeline described?
- •Is code available?
- •Statistical rigor: Multiple seeds/runs? Confidence intervals? Significance tests?
- •Comparison fairness: Do baselines get equal compute, tuning, and data access?
- •Failure modes: Where would this approach break? Edge cases, distribution shifts, adversarial inputs?
- •Ethical considerations: Bias, fairness, environmental cost, dual-use potential
- •Future work: Note ideas for extensions, improvements, or follow-up experiments
- •Strong and weak points: Identify what works well and what doesn't
Output Template
Create a file named {paper-short-title}.md. Use this template:
markdown
# {Full Paper Title}
**Authors:** {Author list}
**Published:** {Venue, year}
**DOI:** {DOI if available}
> **TL;DR:** {One-sentence summary of the paper's core contribution and result.}
---
## Resources & Links
| Resource | Link |
|----------|------|
| Paper page | {arxiv abs, conference page, or publisher URL} |
| PDF | {direct PDF link} |
| Official code | {repository URL — or "Not available"} |
| PapersWithCode | {PapersWithCode URL — or "None found"} |
| Community implementations | {URLs with framework noted — or "None found"} |
| Video / Talk | {URL — or "None found"} |
| Blog post / Explainer | {URL — or "None found"} |
| Supplementary materials | {URL — or "None found"} |
### Citation
```bibtex
{BibTeX entry}
```
---
## Five Cs (First-Pass Assessment)
| Dimension | Assessment |
|-----------|------------|
| **Category** | {Paper type} |
| **Context** | {Key prior work and foundations} |
| **Correctness** | {Validity of assumptions} |
| **Contributions** | {Numbered list of contributions} |
| **Clarity** | {Writing quality assessment} |
---
## Problem Statement
{What problem does this paper address? Why does it matter?}
## Motivation & Gap
{What gap in existing work does this paper fill?}
---
## Proposed Method
### Overview
{High-level description in 3-5 sentences.}
### Architecture / Algorithm
- {Component 1}: {description}
- {Component N}: {description}
### Key Equations
1. {Equation name}: `{equation}` — {what it computes}
### Training / Optimization
- **Objective function:** {loss}
- **Optimizer:** {optimizer and hyperparameters}
- **Schedule:** {learning rate schedule}
- **Key hyperparameters:** {list with values}
### Computational Cost
- **Parameters:** {total parameter count}
- **FLOPs:** {training/inference FLOPs if reported}
- **Training cost:** {GPU hours, hardware, estimated cost}
- **Inference time:** {latency per sample/batch}
- **Memory:** {peak GPU memory}
- **Scalability notes:** {how cost scales with data/model size}
---
## Experimental Setup
### Datasets
| Dataset | Size | Task | Split |
|---------|------|------|-------|
### Baselines
{Comparison methods}
### Metrics
{Evaluation metrics}
### Hardware & Budget
- **Hardware:** {GPUs/TPUs, count, type}
- **Training time:** {wall-clock time}
- **Comparison fairness:** {Do baselines get equal compute/tuning/data?}
---
## Key Results
### Main Findings
| Method | {Metric 1} | {Metric 2} |
|--------|-----------|-----------|
### Ablation Studies
- {Component}: {effect on performance}
### Statistical Rigor
- **Runs/seeds:** {number of independent runs}
- **Variance reporting:** {std dev, CI, IQR — what's reported?}
- **Significance tests:** {statistical tests used, if any}
---
## Critical Analysis
### Novelty Assessment
- **What is genuinely new:** {novel contributions vs. incremental improvements}
- **Closest prior work:** {most similar existing method and key differences}
### Strengths
1. {Strength with reasoning}
### Weaknesses
1. {Weakness with reasoning}
### Limitations
- **Acknowledged by authors:** {limitations the authors explicitly discuss}
- **Unacknowledged:** {limitations not discussed but apparent from analysis}
### Failure Modes & Edge Cases
{Where would this approach break? Distribution shifts, adversarial inputs, scaling limits, etc.}
### Ethical Considerations & Broader Impact
{Bias, fairness, environmental cost, dual-use potential, societal implications. Omit this section entirely if genuinely N/A.}
### Missing References
{Important related work not cited by the paper.}
### Reproducibility Assessment
- **Code available:** {Yes/No — see Resources & Links}
- **Data available:** {Yes/No — public datasets vs. proprietary}
- **Hyperparameters specified:** {Yes/Partially/No}
- **Implementation complexity:** {Low/Medium/High — effort to reimplement}
- **Overall reproducibility:** {High/Medium/Low}
---
## Connections & Context
### Builds On
- [{Paper}]({url}): {relationship}
### Potential Impact
{How might this work influence the field?}
---
## Future Work & Open Questions
{Extensions, improvements, unresolved questions}
---
## Reviewer Assessment
### Overall Score
| Score | Meaning |
|-------|---------|
| 1-3 | Serious flaws, not suitable for publication |
| 4-5 | Below average; significant weaknesses outweigh contributions |
| 6 | Marginally above acceptance threshold |
| 7 | Good paper; solid contribution with minor issues |
| 8 | Strong paper; clear contribution, well-executed |
| 9-10 | Exceptional; significant advance for the field |
**Score: {X}/10**
**Justification:** {2-3 sentences explaining the score}
### Confidence
| Score | Meaning |
|-------|---------|
| 1 | Low — outside area of expertise |
| 2 | Willing to defend but not certain |
| 3 | Fairly confident |
| 4 | Confident — checked key details |
| 5 | Very confident — deeply familiar with area |
**Confidence: {X}/5**
### Recommendation
**{Accept / Weak Accept / Borderline / Weak Reject / Reject}**
### Questions for Authors
1. {Key question that would affect the assessment}
---
## Key Takeaways
- {3-5 bullet points}
---
## Glossary
| Term | Definition |
|------|------------|
*Analysis generated using the three-pass method (Keshav, 2016).*
Process Guidelines
- •Read the full paper across all three passes before writing the summary
- •Be precise — use exact numbers from the paper
- •Distinguish between what the paper claims and what the evidence supports
- •For critical analysis, be honest and constructive — identify real issues, not nitpicks
- •The summary should be self-contained: someone reading it should understand the paper without reading the original
- •Score calibration: 6-7 = good paper with solid contribution; 8+ = genuinely strong/exceptional; don't grade-inflate
- •Omit N/A sections rather than filling them with "Not applicable" placeholders
- •Novelty assessment: compare against the closest specific prior work, not the field in general
- •TL;DR: draft after Pass 1, refine after Pass 3
- •Resource links: require genuine search effort — use "Not available" for expected resources (official code) and "None found" for optional ones (blog posts, videos)