Literature Engineer (evidence collector)

Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.

This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports or for network access, not to fabricate records.

Inputs

•queries.md
  •keywords, exclude, max_results, time window
•Optional offline sources (any combination; all are merged):
  •papers/import.(csv|json|jsonl|bib)
  •papers/arxiv_export.(csv|json|jsonl|bib)
  •papers/imports/*.(csv|json|jsonl|bib)
•Optional snowball exports (offline):
  •papers/snowball/*.(csv|json|jsonl|bib)
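
The input locations listed above can be collected with a few lines of Python. This is a minimal sketch, not the skill's actual loader: the function name find_offline_inputs and the ordering of results are assumptions made for illustration.

```python
from pathlib import Path

# Extensions named in the Inputs list.
EXTENSIONS = {".csv", ".json", ".jsonl", ".bib"}


def find_offline_inputs(workspace):
    """Return existing offline export files under <workspace>/papers.

    Hypothetical helper: fixed-name exports come first, then the contents
    of papers/imports/ and papers/snowball/ in sorted order.
    """
    papers = Path(workspace) / "papers"
    candidates = []
    for stem in ("import", "arxiv_export"):
        candidates += [papers / f"{stem}{ext}" for ext in sorted(EXTENSIONS)]
    for folder in ("imports", "snowball"):
        directory = papers / folder
        if directory.is_dir():
            candidates += sorted(directory.glob("*"))
    # Keep only real files with a supported extension.
    return [p for p in candidates if p.is_file() and p.suffix.lower() in EXTENSIONS]
```

Calling find_offline_inputs("<ws>") on a workspace with no exports returns an empty list, which is the case where the skill should block rather than proceed.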

Outputs

•papers/papers_raw.jsonl
  •One record per line; minimum fields:
    •title (str), authors (list[str]), year (int|""), url (str)
    •stable identifier(s): arxiv_id and/or doi
    •abstract (str; may be empty in offline mode)
    •source (str) + provenance (list[dict])
•papers/papers_raw.csv (human scan)
•papers/retrieval_report.md (route counts, missing-meta stats, next actions)
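
To make the minimum record shape concrete, here is a small sketch of one papers_raw.jsonl line and a field check. The record values are placeholders, not a real paper, and the keys inside each provenance entry ("route", "file") are assumptions; the skill only states that provenance is a list of dicts.

```python
import json

# Minimum fields from the Outputs list; arxiv_id/doi are checked separately.
REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract", "source", "provenance")


def missing_fields(record):
    """Return the minimum fields absent from a record (hypothetical helper)."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if not (record.get("arxiv_id") or record.get("doi")):
        missing.append("arxiv_id|doi")
    return missing


# Placeholder record: every value here is illustrative only.
example = {
    "title": "Placeholder Title",
    "authors": ["A. Author", "B. Author"],
    "year": 2023,
    "url": "https://arxiv.org/abs/0000.00000",
    "arxiv_id": "0000.00000",
    "abstract": "",  # may be empty in offline mode
    "source": "offline_import",
    "provenance": [{"route": "offline_import", "file": "papers/imports/example.bib"}],
}

line = json.dumps(example)  # one record per line in papers_raw.jsonl
print(missing_fields(json.loads(line)))
```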

Workflow (multi-route)

•Offline-first merge: ingest all available offline exports (and label provenance per file).
•Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
•Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
•Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
•Report: write a concise retrieval report with coverage buckets and missing-meta counts.
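
The normalize + dedupe step above can be sketched as follows. This is an illustration under stated assumptions rather than the script's implementation: the key format, the choice to strip arXiv version suffixes, and the rule of skipping records without a stable identifier are all hypothetical.

```python
import re

EMPTY = ("", None, [])


def canonical_key(record):
    """Build a dedupe key from the first stable identifier available."""
    arxiv_id = str(record.get("arxiv_id") or "").strip().lower()
    if arxiv_id:
        arxiv_id = re.sub(r"^arxiv:", "", arxiv_id)  # drop an "arXiv:" prefix
        arxiv_id = re.sub(r"v\d+$", "", arxiv_id)    # drop a version suffix such as "v2"
        return "arxiv:" + arxiv_id
    doi = str(record.get("doi") or "").strip().lower()
    if doi:
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
        return "doi:" + doi
    return ""


def merge_records(records):
    """Merge duplicates, unioning provenance and filling empty fields."""
    merged = {}
    for rec in records:
        key = canonical_key(rec)
        if not key:
            continue  # assumption: records without a stable ID are skipped here
        if key not in merged:
            merged[key] = {**rec, "provenance": list(rec.get("provenance", []))}
            continue
        kept = merged[key]
        for entry in rec.get("provenance", []):
            if entry not in kept["provenance"]:
                kept["provenance"].append(entry)
        for field, value in rec.items():
            if field != "provenance" and kept.get(field) in EMPTY and value not in EMPTY:
                kept[field] = value
    return list(merged.values())
```

Unioning provenance instead of keeping only the first record preserves the information that a paper was found by more than one route.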

Quality checklist

• Candidate pool size target met (A150++: ≥1200) without fabrication.
• Each record has a stable identifier (arxiv_id or doi, plus url).
• Each record has provenance: which route/file/API produced it.
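
A quick way to apply this checklist to an existing papers/papers_raw.jsonl is sketched below. The helper names load_jsonl and check_pool and the message wording are assumptions; the default of 1200 mirrors the target stated above.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_pool(records, target_size=1200):
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    if len(records) < target_size:
        problems.append(f"pool size {len(records)} is below target {target_size}")
    missing_id = sum(1 for r in records if not (r.get("arxiv_id") or r.get("doi")))
    missing_url = sum(1 for r in records if not r.get("url"))
    missing_prov = sum(1 for r in records if not r.get("provenance"))
    for label, count in (
        ("stable identifier", missing_id),
        ("url", missing_url),
        ("provenance", missing_prov),
    ):
        if count:
            problems.append(f"{count} record(s) missing {label}")
    return problems
```

For example, check_pool(load_jsonl("papers/papers_raw.jsonl")) returns an empty list only when all three checklist items hold.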

Script

Quick Start

•python .codex/skills/literature-engineer/scripts/run.py --help

All Options

•See python .codex/skills/literature-engineer/scripts/run.py --help.
•Reads retrieval config from queries.md.
•Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
•Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
•Online expansion requires network: use --online and/or --snowball.
•Online retrieval is best-effort: the arXiv API can be flaky in some environments, so the script also attempts a Semantic Scholar route when needed.
•For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
•If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still bootstrap itself without manual exports.
•When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).

Examples

•Offline imports only:
  •Put exports under papers/imports/ then run:
    •python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
•Explicit offline inputs (multi-route):
  •python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
•Online arXiv retrieval (needs network):
  •python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
•Snowballing (needs network unless you provide offline snowball exports):
  •python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:

•papers/papers_raw.jsonl contains far fewer records than the target; later stages will then fail on mapping/bindings and citation density.

Causes:

•Only a small offline export was provided.
•Network is blocked so online retrieval/snowballing can't run.

Solutions:

•Provide additional exports under papers/imports/ (multiple routes/queries).
•Provide snowball exports under papers/snowball/.
•Enable network and rerun with --online --snowball.