AgentSkillsCN

dedupe-rank

对原始论文集(`papers/papers_raw.jsonl`)进行去重与排序,生成 `papers/papers_dedup.jsonl` 与 `papers/core_set.csv` 文件。 **触发条件**:去重、排序、核心集、去重、排序、精选论文、核心集合。 **适用场景**:在检索后,你需要将覆盖广泛的论文集合收敛为可管理的核心集(用于分类/大纲/映射工作)。 **跳过条件**:若已有稳定的人工整理版本 `papers/core_set.csv`(无需再次反复调整)。 **网络要求**:无需联网。 **防护措施**:偏向确定性;输出结果应具备可重复性(稳定的 paper_id、统一的字段规范)。

SKILL.md
--- frontmatter
name: dedupe-rank
description: |
  Dedupe and rank a raw paper set (`papers/papers_raw.jsonl`) to produce `papers/papers_dedup.jsonl` and `papers/core_set.csv`.
  **Trigger**: dedupe, rank, core set, 去重, 排序, 精选论文, 核心集合.
  **Use when**: 检索后需要把广覆盖集合收敛成可管理的 core set(用于 taxonomy/outline/mapping)。
  **Skip if**: 已经有人手工整理了稳定的 `papers/core_set.csv`(无需再次 churn)。
  **Network**: none.
  **Guardrail**: 偏 deterministic;输出应可重复(稳定 paper_id、字段规范)。

Dedupe + Rank

Turn a broad retrieved set into a smaller core set for taxonomy/outline building.

This is a deterministic “curation” step: it should be stable and repeatable.

Input

  • papers/papers_raw.jsonl

Outputs

  • papers/papers_dedup.jsonl
  • papers/core_set.csv

Workflow (high level)

  1. Dedupe by normalized (title, year) and keep the richest metadata per duplicate cluster.
  2. Rank by relevance/recency signals (and optionally pin known classics for certain topics). For LLM-agent topics, also ensure a small quota of prior surveys/reviews is present to support a paper-like Related Work section.
  3. Write papers/core_set.csv with stable paper_id values and useful metadata columns (arxiv_id, pdf_url, categories).

Quality checklist

  • papers/papers_dedup.jsonl exists and is valid JSONL.
  • papers/core_set.csv exists and has a header row.

Script

Quick Start

  • python .codex/skills/dedupe-rank/scripts/run.py --help
  • python .codex/skills/dedupe-rank/scripts/run.py --workspace <workspace_dir> --core-size 300

All Options

  • --core-size <n>: target size for papers/core_set.csv
  • queries.md also supports core_size / core_set_size / dedupe_core_size (overrides default when present)

Examples

  • Smaller core set for fast iteration (non-A150++):
    • python .codex/skills/dedupe-rank/scripts/run.py --workspace <ws> --core-size 25

Notes

  • This step may annotate papers/core_set.csv:reason with tags such as pinned_classic and prior_survey (deterministic, topic-aware guards for survey writing).
  • Systematic-review default: if the active pipeline is systematic-review and core_size is not specified, the script keeps the full deduped pool in papers/core_set.csv (so screening does not silently drop candidates).
  • This step is deterministic; reruns should be stable for the same inputs.

Troubleshooting

Common Issues

Issue: papers/core_set.csv is too small / empty

Symptom:

  • Core set has very few rows.

Causes:

  • Input papers/papers_raw.jsonl is small, or many rows are missing required fields.

Solutions:

  • Broaden retrieval (or provide a richer offline export) and rerun.
  • Lower --core-size only if you intentionally want a small core set.

Issue: Duplicates still appear after dedupe

Symptom:

  • Near-identical titles remain.

Causes:

  • Title normalization is defeated by noisy exports.

Solutions:

  • Clean title fields in the export (strip prefixes/suffixes, fix encoding) and rerun.

Recovery Checklist

  • papers/papers_raw.jsonl lines contain title/year/url.
  • papers/core_set.csv has stable paper_id values.