Literature Review for ML Research

Search Strategy by Research Phase

Decision Table: Search Approach

Phase	Goal	Primary Source	Strategy
Scoping	Understand landscape	Survey papers, textbooks	Start broad, read abstracts only
Deep dive	Core related work	Semantic Scholar, Google Scholar	Citation graph: forward + backward
Gap finding	Identify what is missing	Recent venue proceedings	Filter by venue + year, read intros
Positioning	Place your contribution	Top-cited papers in subfield	Compare methods, build taxonomy
Camera-ready	Complete related work	arxiv, workshop papers	Fill gaps reviewers might flag

Decision Table: Source Selection

Source	Best For	Limitations
Google Scholar	Broad search, citation counts	Noisy results, includes predatory
Semantic Scholar	Citation graph, API access	Smaller index than GS
arxiv	Preprints, latest work	No peer review, variable quality
DBLP	Venue-specific search	Metadata only, no full text
Connected Papers	Visual citation graph	Limited to seed paper quality
Papers With Code	SOTA tables, code links	Benchmark-centric bias
ACL Anthology	NLP-specific	Single domain

Systematic Search Methodology

Query Construction

code

# Start with core concept combinations
query_templates = [
    # Broad
    '"{method}" AND "{task}"',
    # Venue-scoped
    '"{method}" AND "{task}" site:arxiv.org',
    # Recent
    '"{method}" AND "{task}" after:2023',
    # Author-scoped
    'author:"{known_author}" "{topic}"',
]

# Example for a transformer efficiency paper:
queries = [
    '"efficient transformer" AND "long sequence"',
    '"linear attention" OR "sparse attention"',
    '"subquadratic" AND "self-attention"',
    '"flash attention" OR "memory efficient attention"',
]

Semantic Scholar API Usage

python

import requests
from typing import Optional

S2_API = "https://api.semanticscholar.org/graph/v1"

def search_papers(
    query: str,
    year_range: Optional[str] = "2020-2025",
    limit: int = 20,
    fields: str = "title,year,citationCount,abstract,authors,venue",
) -> list[dict]:
    """Search Semantic Scholar for papers."""
    params = {
        "query": query,
        "limit": limit,
        "fields": fields,
    }
    if year_range:
        params["year"] = year_range
    resp = requests.get(f"{S2_API}/paper/search", params=params)
    resp.raise_for_status()
    return resp.json().get("data", [])

def get_citations(paper_id: str, direction: str = "citations") -> list[dict]:
    """Get forward (citations) or backward (references) links.

    direction: 'citations' (who cites this) or 'references' (who this cites)
    """
    fields = "title,year,citationCount,authors,venue"
    resp = requests.get(
        f"{S2_API}/paper/{paper_id}/{direction}",
        params={"fields": fields, "limit": 100},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

# Usage
papers = search_papers("flash attention efficient transformer")
for p in papers[:5]:
    print(f"[{p['year']}] {p['title']} (cited: {p['citationCount']})")

Paper Summary Template

markdown

### Paper: {Title}
- **Authors**: {First Author} et al., {Year}
- **Venue**: {Conference/Journal}
- **Citations**: {count} (as of {date})

#### Problem
{One sentence: what problem does this solve?}

#### Method
{2-3 sentences: key technical contribution}

#### Results
{Key numbers: which benchmarks, what improvement over what baseline}

#### Relevance to Our Work
{How it relates: extends/contradicts/enables/competes with our approach}

#### Limitations
{What they don't address that we do, or known weaknesses}

Comparison Table Template

markdown

| Method | Year | Task | Key Idea | Complexity | Acc. | Code |
|--------|------|------|----------|------------|------|------|
| Method A | 2023 | X | Does Y via Z | O(n log n) | 92.1 | Yes |
| Method B | 2024 | X | Does Y via W | O(n) | 93.4 | No |
| Ours | 2025 | X | Does Y via V | O(n) | 94.2 | Yes |

Taxonomy Construction

Building a Taxonomy

code

Step 1: Collect 30-50 papers in the space
Step 2: Tag each paper along orthogonal axes:
  - Approach type (e.g., attention-based, convolution-based, hybrid)
  - Training paradigm (supervised, self-supervised, few-shot)
  - Scale (small model, foundation model)
  - Application domain (vision, language, multimodal)
Step 3: Group papers that share tags -> these form taxonomy branches
Step 4: Identify sparse cells -> these are research gaps

Related Work Section Structure

latex

\section{Related Work}
% Organize by conceptual axis, not chronologically

\paragraph{Efficient Attention Mechanisms.}
Linear attention~\citep{katharopoulos2020} replaces softmax with...
Sparse attention patterns~\citep{child2019,beltagy2020} reduce complexity by...
Our work differs in that...

\paragraph{Knowledge Distillation for Transformers.}
\citet{sanh2019} showed that... \citet{jiao2020} extended this to...
Unlike these approaches, we...

\paragraph{Dynamic Computation.}
Early exit~\citep{schwartz2020} and token pruning~\citep{goyal2020} adaptively...
Our method is complementary to these techniques and can be combined with...

Staying Current

Monitoring Strategy

Channel	Frequency	Action
arxiv-sanity / Hugging Face Daily Papers	Daily	Skim titles, star relevant
Key author Twitter/X feeds	Weekly	Note new preprints
Top venue proceedings (NeurIPS, ICML, ICLR)	Per cycle	Read all accepted in subfield
Google Scholar alerts	As notified	Check forward citations of key papers
ML subreddits, Discord servers	Weekly	Track community reception

arxiv Monitoring Script

python

import arxiv

def search_recent(
    query: str,
    max_results: int = 20,
    sort_by: arxiv.SortCriterion = arxiv.SortCriterion.SubmittedDate,
) -> list[dict]:
    """Fetch recent arxiv papers matching query."""
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=sort_by,
    )
    results = []
    for paper in search.results():
        results.append({
            "title": paper.title,
            "authors": [a.name for a in paper.authors[:3]],
            "abstract": paper.summary[:200],
            "url": paper.entry_id,
            "published": paper.published.strftime("%Y-%m-%d"),
        })
    return results

papers = search_recent("cat:cs.LG AND ti:efficient AND ti:attention")

Gotchas and Anti-Patterns

Citation Bias

•Over-citing top labs / top venues while ignoring work from smaller groups.
•Citing only papers that support your narrative. Reviewers will flag missing contradicting work.
•Fix: Explicitly search for papers that contradict or weaken your claims. Include them and explain differences.

Missing Non-Arxiv Work

•Many fields publish primarily in journals (medical imaging, robotics). Arxiv-only search misses these.
•Workshop papers often contain early versions of important ideas.
•Fix: Search DBLP and Google Scholar in addition to arxiv. Check domain-specific repositories (e.g., PubMed for medical ML).

Conflating Citation Count with Impact

•Recent papers have low citation counts regardless of quality.
•Some high-citation papers are cited mainly for datasets, not methods.
•Fix: Weight recent papers by venue and author track record, not citations alone. Read the paper before citing.

Recency Bias

•Ignoring foundational older work that established the concepts you build on.
•Citing the latest version of an idea without crediting the original.
•Fix: Trace ideas back to their origin. Cite both the original and the most relevant recent extension.

Superficial Reading

•Citing based on abstract only leads to mischaracterization.
•Fix: For any paper in your related work section, read at minimum: abstract, intro, method section, and main results table. For direct competitors, read the full paper.

Incomplete Search Termination

•Stopping search when you have "enough" references without systematic coverage.
•Fix: Define search scope upfront (queries, venues, year range). Track coverage in a spreadsheet. Stop when new queries return only already-seen papers.

Agent Team Mode

For comprehensive literature reviews covering 30+ papers across multiple sources and subfields.

Team Configuration

yaml

team:
  recommended_size: 4
  agent_roles:
    - name: searcher-arxiv
      type: Explore
      focus: "Search arxiv for preprints, recent submissions, related work"
      skills_loaded: ["research:literature-review"]
    - name: searcher-scholar
      type: Explore
      focus: "Search Semantic Scholar and Google Scholar for peer-reviewed work"
      skills_loaded: ["research:literature-review"]
    - name: searcher-venues
      type: Explore
      focus: "Search specific top venues (NeurIPS, ICML, ICLR, ACL) proceedings"
      skills_loaded: ["research:literature-review"]
    - name: citation-explorer
      type: Explore
      focus: "Forward + backward citation graph exploration from seed papers"
      skills_loaded: ["research:literature-review"]
  file_ownership: "shared-read-only"
  lead_mode: "hands-on"

Team Workflow

•Lead defines search scope (queries, year range, target venues) and distributes query sets
•All searchers execute their queries in parallel across different sources
•citation-explorer starts from known seed papers, explores citation graph
•Lead deduplicates results, builds comparison table, constructs taxonomy
•Lead identifies sparse taxonomy cells (research gaps) and writes related work synthesis

Single-Agent Fallback

Without team mode, execute all phases sequentially (default behavior). Team mode is an optional enhancement.