arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Input

•queries.md (keywords, excludes, time window)

Outputs

•papers/papers_raw.jsonl (JSONL; 1 paper per line)
- •Each record includes at least: title, authors, year, url, abstract
- •When using the arXiv API online mode, records also include helpful metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
•Convenience index (optional but generated by the script):
- •papers/papers_raw.csv
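
As an illustration of the record shape described above, here is a minimal sketch of one record serialized as a JSONL line. All values are placeholders rather than a real paper, and the exact types the script emits (for example, whether categories is a list or a string) are assumptions.

```python
import json

# Placeholder values for illustration only; this is not a real paper.
record = {
    # Fields every record is expected to carry:
    "title": "Example Paper Title",
    "authors": ["First Author", "Second Author"],
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract": "One-paragraph abstract text.",
    # Extra metadata that is typically available only in online mode
    # (types here are assumed, not confirmed by the script):
    "arxiv_id": "0000.00000",
    "pdf_url": "https://arxiv.org/pdf/0000.00000",
    "categories": ["cs.CL", "cs.AI"],
    "primary_category": "cs.CL",
    "published": "2024-01-01T00:00:00Z",
    "updated": "2024-01-02T00:00:00Z",
}


def to_jsonl_line(rec: dict) -> str:
    """Serialize one record as a single line; json.dumps escapes newlines."""
    return json.dumps(rec, ensure_ascii=False)


line = to_jsonl_line(record)
```

Writing one such line per paper, each terminated by a newline, yields a file that downstream steps can stream record by record.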

Decision: online vs offline

•If you have network access: run arXiv API retrieval.
•If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
•Hybrid: if you import offline and gain network access later, you can fill in missing fields (abstract/authors/categories) via an arXiv id_list lookup, using --enrich-metadata or enrich_metadata: true in queries.md.
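
The hybrid path can be sketched roughly as follows. The endpoint and the id_list parameter belong to the public arXiv API; the helper names build_id_list_url and fill_missing are hypothetical, and the rule of never overwriting fields the import already supplied is an assumption about what "best-effort" enrichment means here.

```python
from urllib.parse import urlencode

# Public arXiv API query endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_id_list_url(arxiv_ids, base=ARXIV_API):
    """Build one API URL that requests the given arXiv IDs via id_list."""
    ids = [i.strip() for i in arxiv_ids if i and i.strip()]
    params = {"id_list": ",".join(ids), "max_results": len(ids)}
    # Keep commas literal so the id_list stays readable in logs.
    return f"{base}?{urlencode(params, safe=',')}"


def fill_missing(record, fetched, fields=("abstract", "authors", "categories")):
    """Copy a field from `fetched` only when `record` lacks a value for it."""
    enriched = dict(record)  # do not mutate the imported record
    for field in fields:
        if not enriched.get(field) and fetched.get(field):
            enriched[field] = fetched[field]
    return enriched
```

Treating empty strings and empty lists as "missing" means a blank abstract column in an export is still eligible for enrichment.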

Workflow (heuristic)

•Read queries.md and expand into concrete query strings.
•Retrieve results (online) or import an export (offline).
•Normalize every record to include at least:
- •title, authors (array), year, url, abstract
•Keep the set broad at this stage; dedupe/ranking comes next.
•Apply time window and max_results if specified.
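
The normalization and time-window steps above could look something like this sketch. The alternate input keys (author, link, summary), the author-splitting rule, and the choice to keep records with an unknown year are all assumptions made for illustration, not the script's confirmed behavior.

```python
import re


def normalize_record(raw: dict) -> dict:
    """Map a loosely structured export row onto the minimum record shape."""
    authors = raw.get("authors") or raw.get("author") or []
    if isinstance(authors, str):
        # Assumed convention: ';' separates authors when present, else ','.
        sep = ";" if ";" in authors else ","
        authors = [a.strip() for a in authors.split(sep) if a.strip()]

    # Take the first 4-digit run from `year`, falling back to `published`.
    match = re.search(r"\d{4}", str(raw.get("year") or raw.get("published") or ""))
    year = int(match.group(0)) if match else None

    return {
        "title": " ".join(str(raw.get("title") or "").split()),  # collapse whitespace
        "authors": list(authors),
        "year": year,
        "url": raw.get("url") or raw.get("link") or "",
        "abstract": raw.get("abstract") or raw.get("summary") or "",
    }


def in_window(record: dict, year_from=None, year_to=None) -> bool:
    """Return True if the record's year falls inside the inclusive window."""
    year = record.get("year")
    if year is None:
        return True  # assumption: keep the set broad when the year is unknown
    if year_from is not None and year < year_from:
        return False
    if year_to is not None and year > year_to:
        return False
    return True
```

A caller would normalize each row first, then keep only rows where in_window is true, and finally truncate to max_results.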

Quality checklist

• papers/papers_raw.jsonl exists.
• Each line is valid JSON and contains title, authors, year, url.
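
The checklist can be automated with a small validator along these lines. This is a sketch, not part of the skill's script; check_jsonl_lines is a hypothetical helper name.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means all good."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((line_no, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            problems.append((line_no, "missing: " + ", ".join(missing)))
    return problems
```

Because the function accepts any iterable of strings, an open file handle for papers/papers_raw.jsonl can be passed directly.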

Side effects

•Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
•Not allowed: write prose sections in output/ before writing is approved.

Script

Quick Start

•python .codex/skills/arxiv-search/scripts/run.py --help
•Online: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200
•Offline import: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>

All Options

•--query <q>: repeatable; multiple queries are unioned
•--exclude <term>: repeatable; excludes applied after retrieval
•--max-results <n>: cap total retrieved
•--input <export.*>: offline mode (CSV/JSON/JSONL)
•--enrich-metadata: best-effort enrich via arXiv id_list (needs network)
•queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
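
To make the option semantics concrete, here is a rough sketch of how unioned queries, post-retrieval excludes, and a result cap might compose. The identity key (arxiv_id, falling back to url), the case-insensitive substring match on title and abstract, and applying the cap last are assumptions; the script's actual rules may differ.

```python
def union_and_filter(result_sets, excludes=(), max_results=None):
    """Union several result lists, drop excluded records, then cap the total."""
    seen = set()
    kept = []
    lowered = [term.lower() for term in excludes]
    for results in result_sets:
        for rec in results:
            # Assumed identity key for the union.
            key = rec.get("arxiv_id") or rec.get("url")
            if key is not None:
                if key in seen:
                    continue  # already returned by an earlier query
                seen.add(key)
            # Excludes run after retrieval, against title and abstract text.
            text = f"{rec.get('title', '')} {rec.get('abstract', '')}".lower()
            if any(term in text for term in lowered):
                continue
            kept.append(rec)
    return kept if max_results is None else kept[:max_results]
```

This only collapses exact repeats of the same identifier across queries; fuzzier deduplication is left to the later dedupe/ranking stage.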

Examples

•Online (multi-query + excludes):
- •python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300
•Fetch a single paper by arXiv ID (direct id_list fetch):
- •python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1
•Offline auto-detect (no --input needed):
- •Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python .codex/skills/arxiv-search/scripts/run.py --workspace <ws>
•Offline import + time window (via queries.md):
- •In queries.md, set "- time window: { from: 2022, to: 2025 }", then run the offline import as usual.
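
The single-paper example above implies that a query string can be recognized as an arXiv ID and routed to an id_list fetch. One plausible way to make that decision is sketched below; the patterns follow the published arXiv identifier formats, but whether the script uses this exact check is an assumption, and looks_like_arxiv_id is a hypothetical name.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive[.SC]/YYMMNNN, with an optional version suffix.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def looks_like_arxiv_id(query: str) -> bool:
    """Return True when the whole query string has the shape of an arXiv ID."""
    q = query.strip()
    return bool(NEW_STYLE.match(q) or OLD_STYLE.match(q))
```

Anchoring both patterns to the full string keeps keyword queries that merely contain digits from being mistaken for identifiers.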

Troubleshooting

Common Issues

Issue: `papers/papers_raw.jsonl` is empty

Symptom:

•Script exits with “No results returned …” or output file is empty.

Causes:

•Network is blocked (online mode).
•Queries are too narrow or queries.md is empty.

Solutions:

•Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
•Broaden keywords and reduce excludes in queries.md.
•Run with explicit --query to sanity-check the parser.

Issue: Offline import records are missing fields