Skill: arxiv-to-md

Version: 1.0.0

Status: Active

required_canon_version: ">=1.0.0"

Trigger

When the user asks to download, convert, or fetch an arXiv paper as markdown.

Inputs

| Input | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| arxiv_id | string | Yes | - | arXiv paper ID (e.g., 1706.03762) or full URL |
| output_path | string | No | THOUGHT/LAB/FERAL_RESIDENT/research/papers/markdown/{id}.md | Where to save the markdown file |
| method | string | No | auto | Conversion method: html, latex, or auto |
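
The three inputs and their defaults can be sketched as a small argument parser. This is an illustration only: the source does not show how pdf_converter.py reads its arguments, so `build_parser` and `resolve_output_path` are hypothetical names, and the use of argparse is an assumption.

```python
import argparse
from pathlib import Path
from typing import Optional

# Default location taken from the Inputs table; {id} is replaced by the paper ID.
DEFAULT_OUTPUT_TEMPLATE = "THOUGHT/LAB/FERAL_RESIDENT/research/papers/markdown/{id}.md"
METHODS = ("auto", "html", "latex")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three documented inputs."""
    parser = argparse.ArgumentParser(prog="pdf_converter.py")
    parser.add_argument("arxiv_id", help="arXiv paper ID or full URL")
    parser.add_argument("output_path", nargs="?", default=None,
                        help="where to save the markdown file")
    parser.add_argument("method", nargs="?", default="auto", choices=METHODS,
                        help="conversion method")
    return parser


def resolve_output_path(arxiv_id: str, output_path: Optional[str]) -> Path:
    """Use the documented default path when no output path is given."""
    if output_path:
        return Path(output_path)
    return Path(DEFAULT_OUTPUT_TEMPLATE.format(id=arxiv_id))
```

With only the paper ID supplied, `method` resolves to `auto` and the output path falls back to the documented template.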

Outputs

  • Markdown file with proper #, ##, ### heading structure
  • Prints confirmation with output path

Methods

| Method | Description | Requirements |
| --- | --- | --- |
| latex | Downloads LaTeX source, converts via pandoc. Best heading structure. | pandoc installed |
| html | Fetches ar5iv.org HTML, converts to markdown. Fast fallback. | None (requests, markdownify, bs4) |
| auto | Tries latex first, falls back to html on failure. | Best of both |
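
The `auto` behaviour in the table amounts to a try-then-fall-back dispatch. A minimal sketch follows, assuming each converter takes a paper ID and returns markdown text. The name `convert_with_fallback` is hypothetical, and the converters are passed in as parameters so the logic can be exercised without network access. Treating every exception from the LaTeX path as a reason to fall back is also an assumption.

```python
from typing import Callable

Converter = Callable[[str], str]


def convert_with_fallback(
    arxiv_id: str,
    method: str,
    latex_converter: Converter,
    html_converter: Converter,
) -> str:
    """Return markdown for arxiv_id using the requested method."""
    if method == "latex":
        return latex_converter(arxiv_id)
    if method == "html":
        return html_converter(arxiv_id)
    if method != "auto":
        raise ValueError(f"Unknown method: {method!r}")
    try:
        # auto: prefer LaTeX because it gives the best heading structure.
        return latex_converter(arxiv_id)
    except Exception as exc:  # assumption: any LaTeX failure triggers the fallback
        print(f"LaTeX conversion failed ({exc}); falling back to HTML")
        return html_converter(arxiv_id)
```

Forcing `latex` or `html` skips the fallback entirely, so a failure in a forced method surfaces to the caller.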

Execution

```bash
# Run via venv python
.venv/Scripts/python.exe CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py <arxiv_id> <output_path> [method]
```

Examples

```bash
# Auto method (recommended)
.venv/Scripts/python.exe CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py 1706.03762 output.md

# Force HTML (no pandoc needed)
.venv/Scripts/python.exe CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py 1706.03762 output.md html

# Force LaTeX (best quality)
.venv/Scripts/python.exe CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py 1706.03762 output.md latex

# From URL
.venv/Scripts/python.exe CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py https://arxiv.org/abs/1706.03762 output.md
```

Constraints

  • Network access required to fetch from arxiv.org / ar5iv.labs.arxiv.org
  • LaTeX method requires pandoc (C:\Users\<user>\AppData\Local\Pandoc\pandoc.exe)
  • Some papers have non-standard LaTeX that pandoc can't parse (auto falls back to HTML)
  • Output is UTF-8 encoded markdown
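
Two of these constraints are easy to check or enforce in code. The sketch below uses hypothetical helper names (`pandoc_available`, `write_markdown`) and assumes pandoc is discoverable on PATH. If it is installed only at the path listed above and that directory is not on PATH, the lookup reports it as missing.

```python
import shutil
from pathlib import Path


def pandoc_available() -> bool:
    """Report whether a pandoc executable can be found on PATH."""
    return shutil.which("pandoc") is not None


def write_markdown(output_path: Path, markdown: str) -> Path:
    """Write markdown as UTF-8, creating parent directories if needed."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="\n" keeps line endings identical across platforms; this is an
    # assumption of the sketch, not something the skill documents.
    with open(output_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(markdown)
    return output_path
```

A caller could consult `pandoc_available()` before attempting the latex method and go straight to html when it returns False.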

Dependencies

Python packages (in .venv):

  • requests
  • markdownify
  • beautifulsoup4

System (for latex method):

  • pandoc - installed via winget install JohnMacFarlane.Pandoc

Error Handling

| Error | Cause | Resolution |
| --- | --- | --- |
| Could not parse arXiv ID | Invalid ID format | Use format YYMM.NNNNN or full arxiv.org URL |
| Pandoc failed | Non-standard LaTeX | Use html method instead |
| 404 Not Found | Paper doesn't exist on ar5iv | Try latex method or verify paper ID |
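
The first row depends on how an ID is pulled out of the input. One way this could work is sketched below; it is not the script's actual parser. The function name `parse_arxiv_id_sketch`, the regular expression, and the decision to keep a trailing version suffix such as `v5` are all assumptions. Only the YYMM.NNNN and YYMM.NNNNN forms are handled.

```python
import re

# Four digits, a dot, then four or five digits, optionally followed by "vN".
# The lookarounds stop the pattern from matching inside a longer number.
_ID_PATTERN = re.compile(r"(?<![\d.])(\d{4}\.\d{4,5})(v\d+)?(?!\d)")


def parse_arxiv_id_sketch(arxiv_input: str) -> str:
    """Extract an arXiv ID from a raw ID or an arxiv.org URL."""
    match = _ID_PATTERN.search(arxiv_input.strip())
    if match is None:
        raise ValueError(f"Could not parse arXiv ID: {arxiv_input!r}")
    return match.group(1) + (match.group(2) or "")
```

Raising `ValueError` with the same wording as the table keeps the failure easy to map back to its documented resolution.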

Implementation

Script: CAPABILITY/SKILLS/utilities/arxiv-to-md/pdf_converter.py

Key functions:

  • convert_arxiv(arxiv_input, output_path, method) - Main entry point
  • convert_arxiv_latex(arxiv_id) - LaTeX + pandoc method
  • convert_arxiv_html(arxiv_id) - ar5iv HTML method
  • parse_arxiv_id(arxiv_input) - Extract ID from URL or raw input
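
For orientation, the pandoc step inside the LaTeX method might look roughly like the following. The source does not show which flags convert_arxiv_latex passes, so the command here is an assumption using pandoc's standard `-f`, `-t`, and `-o` options. `build_pandoc_command` and `run_pandoc` are hypothetical helpers.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pandoc_command(tex_path: Path, md_path: Path, pandoc: str = "pandoc") -> List[str]:
    """Build a pandoc invocation that converts a LaTeX file to markdown."""
    return [pandoc, "-f", "latex", "-t", "markdown", "-o", str(md_path), str(tex_path)]


def run_pandoc(tex_path: Path, md_path: Path, pandoc: str = "pandoc") -> None:
    """Run pandoc; raises CalledProcessError when pandoc exits non-zero."""
    subprocess.run(build_pandoc_command(tex_path, md_path, pandoc), check=True)
```

Keeping the command builder separate from the call makes the invocation easy to inspect, and the `pandoc` parameter allows passing a full executable path when pandoc is not on PATH.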