Literature Review

This skill enables an AI agent to conduct a rigorous, structured literature review following established academic methodology. The agent defines a search strategy with targeted keywords, applies explicit inclusion and exclusion criteria to filter results, extracts key data from selected papers, and synthesizes the findings into a thematic narrative with a summary table and reference list. The workflow is inspired by systematic review practices (including PRISMA-style reporting) and is suitable for academic research, technology landscape analysis, and evidence-based decision making.

Workflow

•
Define the Research Question and Scope: Work with the user to formulate a precise research question using a framework such as PICO (Population, Intervention, Comparison, Outcome) or a domain-appropriate equivalent. Establish the review's scope: time range, languages, source types (journal articles, conference papers, preprints), and any domain constraints.
•
Develop the Search Strategy: Generate a set of search queries using combinations of primary keywords, synonyms, and Boolean operators. Identify the databases and sources to search (e.g., Google Scholar, Semantic Scholar, arXiv, PubMed, ACM Digital Library, IEEE Xplore). Document the complete search strategy for reproducibility.
•
Screen and Filter Results: Apply predefined inclusion and exclusion criteria to the search results. Inclusion criteria typically cover topic relevance, publication date range, study type, and language. Exclusion criteria filter out duplicates, non-peer-reviewed opinion pieces, retracted papers, and off-topic results. Record the number of papers at each stage for a PRISMA-style flow.
•
Extract Key Data: For each included paper, extract structured information: title, authors, year, venue, research question, methodology, key findings, limitations, and relevance to the review question. Store this data in a consistent format (table or structured notes) for cross-paper comparison.
•
Synthesize Themes and Findings: Organize extracted data into themes or categories that emerge across papers. Identify areas of consensus, debate, and gaps in the literature. Write a narrative synthesis that connects individual findings into a coherent story, supported by a summary comparison table.
•
Write the Review Document: Produce the final literature review with these sections: introduction and research question, methodology (search strategy, criteria, PRISMA flow), thematic synthesis, discussion of gaps and future directions, and a complete reference list in a standard citation format.

Usage

Provide the agent with a research topic or question. Optionally specify the desired scope (time range, source types), the number of papers to include, or a particular synthesis format.

code

Conduct a literature review on LLM evaluation benchmarks published between 2022-2025.

Focus: What benchmarks exist, what do they measure, and what gaps remain
in evaluating reasoning, safety, and real-world task completion?

Examples

Example 1: Literature Review on LLM Evaluation Benchmarks

User Request:

Review the literature on LLM evaluation benchmarks from 2022-2025, focusing on reasoning, safety, and task completion.

Search Strategy:

Database	Query
Semantic Scholar	"large language model evaluation benchmark" AND (reasoning OR safety OR "task completion")
arXiv	"LLM benchmark" AND ("2023" OR "2024" OR "2025")
ACM DL	"language model assessment" AND "benchmark suite"
Google Scholar	"LLM evaluation" survey OR "systematic review" 2023..2025

Inclusion/Exclusion Criteria:

Criteria	Type	Rule
Published 2022-2025	Inclusion	Must be within date range
Peer-reviewed or major preprint	Inclusion	Accepted at top venues or arXiv with 10+ citations
Proposes or surveys benchmarks	Inclusion	Must discuss specific evaluation frameworks
Blog posts / opinion pieces	Exclusion	No non-academic sources
Non-English	Exclusion	English-language only
Duplicates / superseded versions	Exclusion	Keep most recent version only

PRISMA-Style Flow:

code

Records identified through search: 847
After duplicate removal:           612
After title/abstract screening:    148
After full-text assessment:         42
Final papers included:              42

Synthesis Table (excerpt):

Benchmark	Year	Focus Area	Key Metric	Limitations
MMLU	2023	Knowledge & reasoning	Accuracy across 57 tasks	Static; no multi-step reasoning
HumanEval+	2023	Code generation	pass@k	Narrow scope (Python functions)
AgentBench	2023	Real-world task completion	Success rate across 8 environments	High cost to run; environment-specific
TrustLLM	2024	Safety & trustworthiness	6 dimensions including fairness	Self-reported; needs human validation
GPQA	2024	Graduate-level reasoning	Accuracy on expert-written questions	Small dataset; domain-specific
SWE-bench	2024	Software engineering tasks	Resolved rate on real GitHub issues	Requires execution infrastructure

Synthesized Finding (excerpt):

The literature reveals a clear trajectory from static knowledge tests (MMLU) toward dynamic, agentic evaluations (AgentBench, SWE-bench) that measure an LLM's ability to act in realistic environments. However, a significant gap persists: no single benchmark suite comprehensively evaluates reasoning, safety, and task completion together. Most benchmarks optimize for one dimension, creating a fragmented evaluation landscape where models can appear strong on reasoning benchmarks while performing poorly on safety metrics.

Example 2: Systematic Review Protocol with PRISMA-Style Flow

User Request:

Create a systematic review protocol for studying the effectiveness of retrieval-augmented generation (RAG) in reducing LLM hallucinations.

Research Question (PICO format):

•Population: Large language models (GPT-4 class and above)
•Intervention: Retrieval-augmented generation (RAG)
•Comparison: Non-RAG baseline (standard prompting)
•Outcome: Hallucination rate reduction (measured by factual accuracy metrics)

Search Strategy:

code

("retrieval-augmented generation" OR "RAG") AND
("hallucination" OR "factual accuracy" OR "faithfulness") AND
("large language model" OR "LLM" OR "GPT" OR "Claude")

Databases: Semantic Scholar, arXiv, ACM Digital Library, Google Scholar Date range: January 2023 to December 2025

Screening Protocol:

code

Phase 1 — Title/Abstract Screening:
  Include if: Empirically measures hallucination with and without RAG
  Exclude if: Theoretical only, no quantitative results, not LLM-focused

Phase 2 — Full-Text Review:
  Include if: Reports specific hallucination metrics (FActScore, ROUGE-L
              against ground truth, human evaluation scores)
  Exclude if: RAG used for non-factual tasks (creative writing, code gen)

Data Extraction Template:

Field	Description
Paper ID	Unique identifier
Model(s) tested	Which LLMs were evaluated
RAG architecture	Retrieval method, chunk size, top-k
Baseline	What non-RAG setup was compared
Hallucination metric	FActScore, human eval, accuracy, etc.
Result	Percentage change in hallucination rate
Domain	General knowledge, medical, legal, etc.

Expected PRISMA Diagram:

code

Identification:  ~1,200 records from 4 databases
Screening:       ~400 after title/abstract review
Eligibility:     ~80 after full-text assessment
Included:        ~35 meeting all criteria

Best Practices

•Document every decision for reproducibility. Record search queries, database dates, and the exact inclusion/exclusion criteria so another researcher could replicate the review.
•Use structured data extraction. A consistent extraction template ensures you capture the same fields from every paper, making cross-study comparison possible.
•Distinguish between quantity and quality of evidence. A theme supported by 20 blog posts is weaker than one supported by 3 rigorous RCTs. Weight your synthesis accordingly.
•Report what you did not find. Gaps in the literature are as important as the findings. Explicitly note underexplored areas, missing populations, or untested conditions.
•Keep the synthesis thematic, not paper-by-paper. A literature review that reads as a list of paper summaries is less useful than one organized around themes that weave findings together.
•Update the search before finalizing. If the review takes weeks to write, re-run the search before submitting to catch newly published work.

Edge Cases

•Insufficient literature: If fewer than 5 relevant papers exist, acknowledge this and reframe the output as a "scoping review" or "research gap analysis" rather than a comprehensive literature review.
•Preprint-heavy fields: In fast-moving areas like AI/ML, much of the relevant work may be on arXiv without peer review. Include preprints but flag their review status and note this as a limitation of the evidence base.
•Conflicting study results: When studies reach opposite conclusions, compare their methodologies, sample sizes, and conditions. Present the conflict transparently rather than arbitrarily siding with one study.
•Interdisciplinary topics: Reviews spanning multiple fields (e.g., AI + healthcare) may require searching domain-specific databases (PubMed) in addition to CS databases. Adjust the search strategy accordingly and note where terminology differs between fields.
•Non-English sources: If the topic has significant literature in other languages, note this limitation if you restrict to English and flag key non-English works that appear in citation lists.