
paper-ingestion

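Imports PDF research papers and converts them to Markdown for AI-native analysis. Use when the user wants to read, analyze, or process a PDF paper, or has provided a URL or path to a PDF file. Uses MinerU (GPU) by default and falls back to Docling if that fails.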

SKILL.md
--- frontmatter
name: paper-ingestion
description: Ingest PDF research papers and convert to Markdown for AI-native analysis. Use when user wants to read, analyze, or process a PDF paper, or provides a PDF URL/path. Uses MinerU (GPU) by default, docling as fallback.

Paper Ingestion Tool

Convert PDF research papers to Markdown with image extraction, organized for AI-native analysis.

Quick Reference

bash
# From local file (default: mineru engine)
uv run scripts/ingest_paper.py /path/to/paper.pdf

# From URL
uv run scripts/ingest_paper.py "https://arxiv.org/pdf/2401.12345.pdf"

# Fallback engine (docling, fast but lower quality)
uv run scripts/ingest_paper.py paper.pdf --engine docling

# Custom output directory
uv run scripts/ingest_paper.py paper.pdf --output-dir /path/to/readings

Engine Selection

| Scenario | Engine | Notes |
| --- | --- | --- |
| Default (highest quality) | mineru | GPU-accelerated, excellent math/tables |
| Fallback (fast, no GPU) | docling | Lower quality, good for quick previews |

Output Structure

Files are organized under {cwd}/{YYYYMMDD}-{Sanitized_Title}/:

code
20260131-DeepSeek_V3_Technical_Report/
  reference.pdf    # Original PDF
  full_text.md     # Markdown with YAML frontmatter
  notes.md         # Empty notes file
  assets/          # Extracted images
    image_001.png
    image_002.png

Naming rules:

  • Timestamped prefix: YYYYMMDD-
  • Title source: Use detected paper title after conversion (not URL string)
  • Windows-safe: No :?/\*<>|" characters
  • Duplicate check: Aborts if same title exists (ignoring date)
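The naming rules above could be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper names `sanitize_title`, `paper_dir_name`, and `is_duplicate` are hypothetical, and replacing unsafe characters with a space before joining words with underscores is an assumption.

```python
import re
from datetime import date

# Characters the naming rules list as unsafe on Windows.
_UNSAFE = re.compile(r'[:?/\\*<>|"]')


def sanitize_title(title: str) -> str:
    """Replace unsafe characters with spaces, then join words with underscores (assumed scheme)."""
    cleaned = _UNSAFE.sub(" ", title)
    return "_".join(cleaned.split())


def paper_dir_name(title: str, day: date) -> str:
    """Build the {YYYYMMDD}-{Sanitized_Title} folder name."""
    return f"{day:%Y%m%d}-{sanitize_title(title)}"


def is_duplicate(title: str, existing_dirs: list[str]) -> bool:
    """Compare sanitized titles while ignoring the YYYYMMDD- prefix."""
    target = sanitize_title(title)
    return any(name.split("-", 1)[-1] == target for name in existing_dirs)
```

For example, `paper_dir_name("DeepSeek V3 Technical Report", date(2026, 1, 31))` yields the folder name shown in the tree above.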

YAML Frontmatter

yaml
---
title: "Paper Title"
date_ingested: 2026-01-31
source_pdf: reference.pdf
conversion_engine: mineru
tags:
  - paper
  - inbox
aliases: []
---
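A minimal sketch of how a frontmatter block with these fields might be generated. The function `render_frontmatter` is hypothetical, and the quote-escaping of the title is an assumption rather than documented behavior.

```python
from datetime import date


def render_frontmatter(title: str, engine: str, day: date) -> str:
    """Render the YAML frontmatter fields listed above as a string."""
    # Escape backslashes and double quotes so the title stays a valid YAML string.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        f'title: "{escaped}"',
        f"date_ingested: {day.isoformat()}",
        "source_pdf: reference.pdf",
        f"conversion_engine: {engine}",
        "tags:",
        "  - paper",
        "  - inbox",
        "aliases: []",
        "---",
    ]
    return "\n".join(lines) + "\n"
```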

JSON Output

Success:

json
{"status": "success", "markdown_path": "...", "title": "...", "date": "2026-01-31", "paper_dir": "...", "engine_used": "mineru"}

Error:

json
{"status": "error", "message": "...", "suggestion": "..."}
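A caller might consume these two payload shapes along the following lines. This sketch assumes the script emits a single JSON object as text (for instance on stdout, which the document does not state); `parse_result` is a hypothetical helper name.

```python
import json


def parse_result(payload: str) -> dict:
    """Return the success payload, or raise with the reported message and suggestion."""
    result = json.loads(payload)
    if result.get("status") != "success":
        message = result.get("message", "unknown error")
        suggestion = result.get("suggestion", "")
        raise RuntimeError(f"{message} ({suggestion})" if suggestion else message)
    return result
```

On success the returned dict exposes fields such as `markdown_path` and `engine_used`; on error the suggestion is carried in the exception text so it can be shown to the user.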

Error Handling

| Error | Action |
| --- | --- |
| Duplicate detected | Remove existing folder or use --force |
| MinerU timeout | Try --engine docling |
| Download failed | Check URL is accessible |

Image Handling

  • Both engines: Extract images to assets/ folder
  • Markdown references: ![Fig1](./assets/image_001.png) (relative paths)
  • Syncthing compatible: Small image files sync across devices
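One way the relative image references might be produced is sketched below. The sequential `image_NNN` numbering follows the example tree, but the regex, the `relink_images` name, and the choice to keep each file's original extension are assumptions.

```python
import re
from pathlib import PurePosixPath

# Matches Markdown image links of the form ![alt](path).
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def relink_images(markdown: str) -> tuple[str, dict[str, str]]:
    """Rewrite image links to ./assets/image_NNN.<ext> and return the old-to-new mapping."""
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        original = match.group(2)
        if original not in mapping:
            suffix = PurePosixPath(original).suffix or ".png"
            mapping[original] = f"./assets/image_{len(mapping) + 1:03d}{suffix}"
        return f"![{match.group(1)}]({mapping[original]})"

    return IMAGE_LINK.sub(repl, markdown), mapping
```

The returned mapping tells the caller which extracted files to copy into assets/ under their new names.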

Math Formatting

  • Inline and display math are normalized to LaTeX using $...$ / $$...$$
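As an illustration of this kind of normalization, the sketch below converts `\(...\)` and `\[...\]` delimiters to dollar-sign form. Whether the tool handles these particular input delimiters is an assumption, and `normalize_math` is a hypothetical name.

```python
import re

# The lookbehind skips LaTeX line breaks such as "\\[2pt]", which also contain "\[".
_DISPLAY = re.compile(r"(?<!\\)\\\[(.+?)\\\]", re.DOTALL)
_INLINE = re.compile(r"(?<!\\)\\\((.+?)\\\)", re.DOTALL)


def normalize_math(text: str) -> str:
    """Rewrite \\[...\\] as $$...$$ and \\(...\\) as $...$."""
    text = _DISPLAY.sub(lambda m: f"$${m.group(1).strip()}$$", text)
    text = _INLINE.sub(lambda m: f"${m.group(1).strip()}$", text)
    return text
```

A real converter would also need to leave code blocks untouched; this sketch does not attempt that.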