ArXiv Summarizer Orchestrator
Run the full pipeline by composing three sub-skills.
Sub-skill Order
- arxiv-search-collector
- arxiv-paper-processor
- arxiv-batch-reporter
Workflow Parameters
- language: manually specified language parameter used by all stages. Defaults to English when omitted.
- paper_processing_mode: subagent_parallel or serial.
- max_parallel_papers: default 5 when paper_processing_mode=subagent_parallel.
Workflow
Stage A: Collection Setup + Query Retrieval
- Initialize one run with arxiv-search-collector/scripts/init_collection_run.py.
- The model generates multiple focused queries from the original topic and writes a minimal query_plan.json (label + query only).
- Run arxiv-search-collector/scripts/fetch_queries_batch.py with the plan file (recommended).
- (Optional fallback) Call arxiv-search-collector/scripts/fetch_query_metadata.py manually to fetch queries one by one.
- The model reads each indexed query list and decides which indexes to keep.
- Merge the selected items with arxiv-search-collector/scripts/merge_selected_papers.py.
- If relevance or coverage is still insufficient, iterate Stage A:
  - generate another query plan with new labels,
  - fetch again,
  - re-merge with --incremental and an updated selection-json,
  - set weak labels to an empty keep list ([]) to explicitly drop them.
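As an illustration of the "label + query only" plan mentioned above, the following minimal sketch writes a query_plan.json. The top-level list layout, the unique-label check, the helper name write_query_plan, and the example queries are all assumptions made for illustration; the schema actually accepted by fetch_queries_batch.py may differ.

```python
import json
import tempfile
from pathlib import Path


def write_query_plan(path, queries):
    """Write a minimal plan with one {label, query} object per focused query.

    ASSUMPTION: the plan is a JSON list of objects; the real collector
    script may expect a different layout.
    """
    plan = []
    seen = set()
    for label, query in queries:
        # ASSUMPTION: labels are unique so each query list can be kept or dropped on its own.
        if label in seen:
            raise ValueError(f"duplicate label: {label}")
        seen.add(label)
        plan.append({"label": label, "query": query})
    text = json.dumps(plan, indent=2, ensure_ascii=False) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return plan


# Hypothetical topic split into two focused queries.
with tempfile.TemporaryDirectory() as tmp:
    plan_path = Path(tmp) / "query_plan.json"
    plan = write_query_plan(plan_path, [
        ("diffusion-sampling", "diffusion model fast sampling"),
        ("diffusion-distillation", "diffusion model distillation"),
    ])
    reloaded = json.loads(plan_path.read_text(encoding="utf-8"))
```

Keeping the plan this small leaves fetch controls such as throttling and retries to the collector scripts rather than to the plan file.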
Pass --language <LANG> to collector scripts so all generated markdown files in Stage A follow the selected language.
Use serial query fetch in Stage A with conservative controls (for example --min-interval-sec 5, --retry-max 4).
Default collector settings already include retries/backoff and run-local throttle state (<run_dir>/.runtime/arxiv_api_state.json), so manual tuning is usually unnecessary.
Prefer cache reuse (no --force) unless query parameters changed or data refresh is required.
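The serial fetch with a minimum interval and bounded retries described above can be pictured with the sketch below. It is not the collector's implementation: the function name fetch_with_throttle, the exponential backoff formula, and the reading of --retry-max as "retries after the first attempt" are assumptions.

```python
import time


def fetch_with_throttle(fetch, queries, min_interval_sec=5.0, retry_max=4,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch(query) one query at a time, spacing calls and retrying failures.

    ASSUMPTION: retry_max counts retries after the first attempt, and the
    backoff doubles on each failure; the real scripts may behave differently.
    """
    results = {}
    last_call = None
    for query in queries:
        for attempt in range(retry_max + 1):
            # Keep at least min_interval_sec between consecutive calls.
            if last_call is not None:
                wait = min_interval_sec - (clock() - last_call)
                if wait > 0:
                    sleep(wait)
            last_call = clock()
            try:
                results[query] = fetch(query)
                break
            except Exception:
                if attempt == retry_max:
                    raise  # retries exhausted, surface the error
                sleep(min_interval_sec * (2 ** attempt))  # exponential backoff
    return results


# Usage with a stand-in fetch function and no spacing, so nothing sleeps.
results = fetch_with_throttle(lambda q: f"results for {q}", ["q1", "q2"],
                              min_interval_sec=0.0)
```

The sleep and clock parameters are injectable only so the behavior can be checked without real waiting.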
Output: one run directory with per-paper metadata subdirectories.
Stage B: Per-paper Artifact Download + Manual Summary
For each paper directory, invoke sub-skill arxiv-paper-processor once and let that skill produce <paper_dir>/summary.md.
Recommended pre-step for many papers:
- Run one batch artifact download before per-paper reading:

    python3 arxiv-paper-processor/scripts/download_papers_batch.py \
      --run-dir /path/to/run \
      --artifact source_then_pdf \
      --max-workers 3 \
      --min-interval-sec 5 \
      --language <LANG>
Per-paper execution steps (inside arxiv-paper-processor):
- If <paper_dir>/summary.md already exists and is complete, skip this paper.
- If usable source (source/source_extract/*.tex) or a PDF (source/paper.pdf) already exists, skip the download.
- If artifacts are missing, download the source with arxiv-paper-processor/scripts/download_arxiv_source.py.
- If the source is unusable, download the PDF with arxiv-paper-processor/scripts/download_arxiv_pdf.py.
- The model reads the content and manually writes <paper_dir>/summary.md following the reference format, in language.
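The skip and fallback rules above amount to a small decision function, sketched here. The function name next_action, the returned action strings, the source_failed flag, and the "non-empty file" test for a complete summary are assumptions; the artifact paths are taken from the steps above and assumed to be relative to the paper directory.

```python
from pathlib import Path


def next_action(paper_dir, source_failed=False):
    """Return the next step for one paper directory.

    ASSUMPTION: a summary counts as complete when summary.md is non-empty;
    the real completeness check is not specified here.
    """
    paper_dir = Path(paper_dir)
    summary = paper_dir / "summary.md"
    if summary.is_file() and summary.read_text(encoding="utf-8").strip():
        return "skip_paper"

    extract_dir = paper_dir / "source" / "source_extract"
    has_tex = extract_dir.is_dir() and any(extract_dir.glob("*.tex"))
    has_pdf = (paper_dir / "source" / "paper.pdf").is_file()
    if has_tex or has_pdf:
        return "read_and_summarize"  # artifacts exist, no download needed

    # Source is tried first; the PDF is the fallback when the source is unusable.
    return "download_pdf" if source_failed else "download_source"
```

Writing the summary itself stays a manual model step; the function only decides whether anything needs downloading first.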
Parallel strategy for many papers:
- Default: paper_processing_mode=subagent_parallel with max_parallel_papers=5.
- Optional: paper_processing_mode=serial to process one paper at a time.
- In parallel mode, run multiple arxiv-paper-processor instances in batches; concurrent papers must not exceed max_parallel_papers.
- Wait for one batch to finish before starting the next batch.
- In serial mode, run exactly one arxiv-paper-processor instance at a time.
- Each subagent worker should own exactly one paper directory to avoid file conflicts.
- Do not use scripts to auto-compose summary text; scripts are download-only tools.
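The batching rule above can be sketched as a planner that splits paper directories into batches no larger than max_parallel_papers, with serial mode as batches of one. The function name plan_batches and the de-duplication step are assumptions added for illustration; how subagents are actually launched is outside this sketch.

```python
def plan_batches(paper_dirs, paper_processing_mode="subagent_parallel",
                 max_parallel_papers=5):
    """Split paper directories into batches that are processed one after another."""
    if paper_processing_mode == "serial":
        size = 1
    elif paper_processing_mode == "subagent_parallel":
        if max_parallel_papers < 1:
            raise ValueError("max_parallel_papers must be at least 1")
        size = max_parallel_papers
    else:
        raise ValueError(f"unknown mode: {paper_processing_mode}")

    # ASSUMPTION: drop duplicates so no two workers ever own the same directory.
    unique = list(dict.fromkeys(paper_dirs))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


# Hypothetical run with 12 paper directories.
papers = [f"paper_{n:02d}" for n in range(12)]
parallel_batches = plan_batches(papers)
serial_batches = plan_batches(papers, paper_processing_mode="serial")
```

Each inner list is one batch: every paper in it may run concurrently, and the next batch starts only after the current one has finished.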
Output: all paper directories contain summary.md.
Stage C: Bundle + Final Hierarchical Report
- Run arxiv-batch-reporter/scripts/collect_summaries_bundle.py --language <LANG>.
- The model reads summaries_bundle.md and writes collection_report_template.md in the base directory.
- In the template, each paper leaf entry must include one standalone placeholder line: {{ARXIV_BRIEF:<arxiv_id>}}.
- Run arxiv-batch-reporter/scripts/render_collection_report.py to generate the final collection_report.md.
- Do not manually paraphrase per-paper conclusion lines in the final report; they must come from section 10 of each paper's summary.md via script injection.
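To show what the placeholder injection amounts to, here is a minimal sketch that swaps each standalone {{ARXIV_BRIEF:<arxiv_id>}} line for a prepared brief. It is not the code of render_collection_report.py: the function name render_report and the briefs mapping are assumptions, and extracting section 10 from each summary.md is deliberately left out.

```python
import re

# Matches a line that consists only of the placeholder.
PLACEHOLDER = re.compile(r"^\{\{ARXIV_BRIEF:([^}]+)\}\}$")


def render_report(template_text, briefs):
    """Replace each standalone placeholder line with the brief for that arXiv ID.

    ASSUMPTION: briefs maps arxiv_id -> text already taken from section 10
    of that paper's summary.md.
    """
    rendered = []
    for line in template_text.splitlines():
        match = PLACEHOLDER.match(line.strip())
        if match is None:
            rendered.append(line)  # ordinary template line, keep as written
            continue
        arxiv_id = match.group(1)
        if arxiv_id not in briefs:
            raise KeyError(f"no brief available for {arxiv_id}")
        rendered.append(briefs[arxiv_id])
    return "\n".join(rendered)


# Hypothetical template and brief.
template = "Topic A\n{{ARXIV_BRIEF:2401.00001}}\nEnd of section"
report = render_report(template, {"2401.00001": "Brief text from section 10."})
```

Failing on a missing brief, rather than leaving the placeholder in place, keeps an incomplete run from producing a report that looks finished.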
If language is non-English (for example Chinese), all intermediate markdown files and final reports should follow that language.
Periodic Scheduling
This orchestrator is suitable for cron/scheduled execution in OpenClaw:
- Frequency examples: daily, weekly, monthly.
- For rolling windows, use a lookback (1d, 7d, 30d) when initializing runs.
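One way to read a lookback value such as 7d is as a window ending at the time of the run. The sketch below makes that interpretation explicit; the function name lookback_window and the "N days before now, in UTC" semantics are assumptions, since how init_collection_run.py interprets the value is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone


def lookback_window(lookback, now=None):
    """Turn a lookback like '7d' into a (start, end) datetime pair.

    ASSUMPTION: only the '<N>d' form is handled, and the window ends at `now`.
    """
    match = re.fullmatch(r"(\d+)d", lookback)
    if match is None:
        raise ValueError(f"unsupported lookback: {lookback!r}")
    end = now if now is not None else datetime.now(timezone.utc)
    start = end - timedelta(days=int(match.group(1)))
    return start, end


# A weekly schedule with a 7-day rolling window, using a fixed reference time.
reference = datetime(2024, 1, 8, tzinfo=timezone.utc)
start, end = lookback_window("7d", now=reference)
```

Matching the lookback to the schedule frequency (1d for daily, 7d for weekly, 30d for monthly) keeps consecutive runs from leaving gaps between windows.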
Output Layout
<output-root>/<topic>-<timestamp>-<range>/
- task_meta.json, task_meta.md
- query_results/, query_selection/
- <arxiv_id>/metadata.md + downloaded source/pdf + summary.md
- summaries_bundle.md
- collection_report_template.md
- final rendered collection report (e.g. collection_report.md)
Use references/workflow-checklist.md as execution checklist.
Related Skills
This is the top-level orchestration skill.
Before using it, install and enable these three sub-skills:
- arxiv-search-collector
- arxiv-paper-processor
- arxiv-batch-reporter
Execution order inside this orchestrator:
- arxiv-search-collector (Stage A)
- arxiv-paper-processor (Stage B)
- arxiv-batch-reporter (Stage C)