Dedupe + Rank
Turn a broad retrieved set into a smaller core set for taxonomy/outline building.
This is a deterministic “curation” step: it should be stable and repeatable.
Input
- `papers/papers_raw.jsonl`
Outputs
- `papers/papers_dedup.jsonl`
- `papers/core_set.csv`
Workflow (high level)
- Dedupe by normalized `(title, year)` and keep the richest metadata per duplicate cluster.
- Rank by relevance/recency signals (and optionally pin known classics for certain topics). For LLM-agent topics, also ensure a small quota of prior surveys/reviews is present to support a paper-like Related Work section.
- Write `papers/core_set.csv` with stable `paper_id` values and useful metadata columns (`arxiv_id`, `pdf_url`, categories).
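The dedupe step above can be sketched as follows. This is a minimal illustration, not the script's actual code: `normalize_title`, `richness`, and `dedupe` are hypothetical helpers, and "richest metadata" is assumed here to mean "most non-empty fields".

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def richness(record: dict) -> int:
    """Count fields that carry a non-empty value (assumed richness measure)."""
    return sum(1 for value in record.values() if value not in (None, "", [], {}))


def dedupe(records: list[dict]) -> list[dict]:
    """Keep one record per (normalized title, year), preferring richer metadata."""
    best: dict[tuple[str, str], dict] = {}
    for record in records:
        key = (
            normalize_title(str(record.get("title", ""))),
            str(record.get("year", "")),
        )
        current = best.get(key)
        # Strict ">" keeps the earliest record on ties, so reruns are stable.
        if current is None or richness(record) > richness(current):
            best[key] = record
    # Dicts preserve insertion order, so output follows first-seen order.
    return list(best.values())
```

Because ties are broken by input order and no randomness is involved, the same input always produces the same output, which matches the determinism requirement stated at the top of this page.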
Quality checklist
- `papers/papers_dedup.jsonl` exists and is valid JSONL.
- `papers/core_set.csv` exists and has a header row.
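Both checklist items can be verified with a few lines of standard-library Python. The helpers below (`check_jsonl`, `read_csv_header`) are hypothetical and not part of the skill's script; they only show one way to run the checks by hand.

```python
import csv
import json
from pathlib import Path


def check_jsonl(path: Path) -> int:
    """Return the record count; raise ValueError on the first invalid line."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


def read_csv_header(path: Path) -> list[str]:
    """Return the first CSV row; raise ValueError if the file is empty."""
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), None)
    if not header:
        raise ValueError(f"{path}: missing header row")
    return header
```

A missing file surfaces as `FileNotFoundError` from `path.open`, which covers the "exists" half of each item. Since the workflow says the CSV carries `paper_id` values, checking `"paper_id" in read_csv_header(path)` is a reasonable extra guard.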
Script
Quick Start
- `python .codex/skills/dedupe-rank/scripts/run.py --help`
- `python .codex/skills/dedupe-rank/scripts/run.py --workspace <workspace_dir> --core-size 300`
All Options
- `--core-size <n>`: target size for `papers/core_set.csv`
- `queries.md` also supports `core_size` / `core_set_size` / `dedupe_core_size` (overrides default when present)
Examples
- Smaller core set for fast iteration (non-A150++):
  - `python .codex/skills/dedupe-rank/scripts/run.py --workspace <ws> --core-size 25`
Notes
- This step may annotate `papers/core_set.csv:reason` with tags such as `pinned_classic` and `prior_survey` (deterministic, topic-aware guards for survey writing).
- Systematic-review default: if the active pipeline is `systematic-review` and `core_size` is not specified, the script keeps the full deduped pool in `papers/core_set.csv` (so screening does not silently drop candidates).
- This step is deterministic; reruns should be stable for the same inputs.
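The size rules described in All Options and in the notes above could be resolved roughly as below. This is a sketch under stated assumptions, not the script's logic: `resolve_core_size` is a hypothetical function, the precedence of the CLI flag over `queries.md` is assumed, and the fallback of 300 is borrowed from the Quick Start example rather than a confirmed default.

```python
from typing import Optional


def resolve_core_size(
    pool_size: int,
    cli_value: Optional[int] = None,
    queries_value: Optional[int] = None,
    pipeline: Optional[str] = None,
    default: int = 300,  # assumption: taken from the Quick Start example
) -> int:
    """Pick the core-set size for a deduped pool of `pool_size` records."""
    if cli_value is not None:
        requested = cli_value  # assumed: explicit flag wins
    elif queries_value is not None:
        requested = queries_value  # queries.md overrides the default
    elif pipeline == "systematic-review":
        requested = pool_size  # keep the full pool so screening sees everything
    else:
        requested = default
    # The core set can never be larger than the deduped pool.
    return max(0, min(requested, pool_size))
```

For example, `resolve_core_size(1000, pipeline="systematic-review")` returns `1000`, while `resolve_core_size(1000)` returns `300` under the assumed default.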
Troubleshooting
Common Issues
Issue: papers/core_set.csv is too small / empty
Symptom:
- Core set has very few rows.
Causes:
- Input `papers/papers_raw.jsonl` is small, or many rows are missing required fields.
Solutions:
- Broaden retrieval (or provide a richer offline export) and rerun.
- Lower `--core-size` only if you intentionally want a small core set.
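Before broadening retrieval, it can help to measure how many raw rows are actually missing fields. The sketch below is a hypothetical diagnostic, not part of the skill; it assumes the required fields are the `title`, `year`, and `url` keys named in the Recovery Checklist.

```python
import json
from collections import Counter
from pathlib import Path

# Assumption: required fields as listed in the Recovery Checklist.
REQUIRED_FIELDS = ("title", "year", "url")


def missing_field_counts(path: Path) -> tuple[int, Counter]:
    """Return (row count, Counter of missing or empty fields) for a JSONL file."""
    total = 0
    missing: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                record = None
            if not isinstance(record, dict):
                missing["<not a JSON object>"] += 1
                continue
            for field in REQUIRED_FIELDS:
                if record.get(field) in (None, ""):
                    missing[field] += 1
    return total, missing
```

If one field accounts for most of the misses, fixing the export for that field is usually cheaper than rerunning retrieval.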
Issue: Duplicates still appear after dedupe
Symptom:
- Near-identical titles remain.
Causes:
- Title normalization is defeated by noisy exports.
Solutions:
- Clean title fields in the export (strip prefixes/suffixes, fix encoding) and rerun.
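To find which titles slipped through, a fuzzy comparison over the deduped titles can list the suspicious pairs. This is a hypothetical diagnostic built on the standard library's `difflib`; the skill itself is not known to use fuzzy matching, and the `0.9` threshold is an arbitrary starting point.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_titles(
    titles: list[str], threshold: float = 0.9
) -> list[tuple[float, str, str]]:
    """Return (ratio, title_a, title_b) for pairs at or above the threshold."""
    # Light cleanup only: lowercase and collapse whitespace.
    cleaned = [" ".join(title.lower().split()) for title in titles]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((round(ratio, 3), titles[i], titles[j]))
    # Most similar pairs first.
    return sorted(pairs, reverse=True)
```

The comparison is quadratic in the number of titles, so it suits a spot check on a few thousand rows rather than a very large pool. The reported pairs show which prefixes, suffixes, or encoding problems to clean in the export.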
Recovery Checklist
- `papers/papers_raw.jsonl` lines contain `title` / `year` / `url`.
- `papers/core_set.csv` has stable `paper_id` values.