
pdf-to-markdown

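Convert PDF documents to Markdown with Docling, with flexible agent-driven cleanup. The workflow is: (1) load the PDF from a file or URL; (2) convert it to Markdown; (3) the agent reviews document samples and, in small iterations, builds custom regular-expression substitutions for issues such as wrong heading depths, headers and footers, page numbers, and boilerplate; (4) optionally, split the document at agent-proposed delimiters while reviewing filenames and document structure. Every PDF is unique: the agent adapts its cleanup strategy to the actual content through repeated check-correct-verify cycles rather than hardcoded rules. Important: use the --no-ocr flag for faster processing.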

SKILL.md
--- frontmatter
name: pdf-to-markdown
description: Convert PDF documents to markdown format using docling, with flexible agent-driven cleanup. Workflow - (1) Load PDF from file or URL, (2) Convert to markdown, (3) Agent reviews document samples and creates custom regexp substitutions in small iterations for issues like wrong heading depths, headers/footers, page numbers, and boilerplate, (4) Optionally split with agent-proposed delimiters, reviewing filenames and structure. Each PDF is unique - agents adapt cleanup to actual document content through iterative check-correct-verify cycles, not hardcoded rules. IMPORTANT - Use --no-ocr flag for faster processing.

PDF to Markdown Converter

Convert PDF documents to markdown using docling with intelligent cleanup and splitting.

Quick Start

bash
# 1. Setup (one-time)
bash scripts/setup_venv.sh

# 2. Convert PDF (fast mode recommended)
# Local file:
python scripts/convert_full.py document.pdf --no-ocr
# Or URL (docling downloads automatically):
python scripts/convert_full.py https://example.com/document.pdf --no-ocr

# 3. Iterative cleanup - fix ONE issue at a time:
python scripts/extract_samples.py document.md                    # Review
python scripts/apply_substitutions.py document.md -s 'pattern' --dry-run  # Test  
python scripts/apply_substitutions.py document.md -s 'pattern'             # Apply
python scripts/extract_samples.py document.md                    # Verify

# 3b. MANDATORY: Check for subordinate elements after Phase 3
grep "^## Table\|^## Figure\|^## [a-z])\|^## [A-Z][a-z]*[^:]\{0,20\}$" document.md           # Detect Phase 4 issues

# 4. Optional splitting
python scripts/analyze_split_points.py document.md               # Analyze
python scripts/split_markdown.py document.md --heading-level 2 --dry-run   # Test
python scripts/split_markdown.py document.md --heading-level 2             # Apply

5-Phase Cleanup Approach

Work iteratively, running dry-run → apply → verify for each phase:

  1. Document Denoising (cleanup-phase1-denoising.md)
    Remove image placeholders, boilerplate, conversion artifacts

  2. Headers/Footers (cleanup-phase2-headers-footers.md)
    Remove page numbers, copyright footers, repeated organizational content

  3. Basic Numbered Sections (cleanup-phase3-basic-numbered-sections.md)
    Analyze numbering patterns and fix heading depths to match logical document hierarchy

  4. ⚠️ Context-Aware Subordinates (cleanup-phase4-context-aware-subordinates.md)
    MANDATORY CHECK: Fix tables, figures, and list items incorrectly promoted to H2 headings

    bash
    # Always run these detection commands after Phase 3:
    grep "^## Table\|^## Figure\|^## [a-z])\|^## [A-Z][a-z]*[^:]\{0,20\}$" document.md
    
  5. Spacing/Formatting (cleanup-phase5-spacing-formatting.md)
    Clean up excessive blank lines, list formatting, table issues
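
Phases 3 and 4 can be sketched in Python as follows. This is only an illustration, not the method used by the skill's scripts. `SUSPECT_H2` is a direct translation of the grep pattern in the detection command above. The helper names, and the rule that a section numbered "1" becomes H2 and "1.2" becomes H3, are assumptions made for this example. Real documents may need a different mapping.

```python
import re

# Python translation of the grep pattern from the detection command.
SUSPECT_H2 = re.compile(
    r"^## Table|^## Figure|^## [a-z]\)|^## [A-Z][a-z]*[^:]{0,20}$"
)
# Heading whose title starts with a section number such as "1", "1." or "3.1.4".
NUMBERED_HEADING = re.compile(r"^#+\s+(\d+(?:\.\d+)*)(\.?)\s+(.+)$")


def find_suspect_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, text) for H2 lines that may be subordinate elements."""
    return [
        (number, line)
        for number, line in enumerate(markdown.splitlines(), start=1)
        if SUSPECT_H2.search(line)
    ]


def renumber_heading(line: str, top_level: int = 2) -> str:
    """Set heading depth from the section number (assumed rule: '1' -> H2, '1.2' -> H3)."""
    match = NUMBERED_HEADING.match(line)
    if not match:
        return line  # not a numbered heading; leave untouched
    number, dot, title = match.groups()
    depth = top_level + number.count(".")
    return f"{'#' * depth} {number}{dot} {title}"
```

The last alternative in the pattern also flags short, legitimate headings such as "## Introduction". Treat its matches as candidates to review, not as confirmed errors.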

Critical Principles

  • Small iterations: Fix 1-2 issues per iteration, not everything at once
  • Always verify: Extract samples after EACH change to confirm it worked
  • Test first: ALWAYS use --dry-run before applying
  • Document-specific patterns: Each PDF is unique - adapt to actual content
  • Safe recovery: Use backup files if something goes wrong

Performance Notes

⚠️ Use the --no-ocr flag unless you are processing scanned documents:

  • Without OCR: 5-10 minutes for a 180-page PDF ✅
  • With OCR: 60+ minutes for the same PDF ❌

See performance-guide.md for optimization details.

Core Scripts

convert_full.py

bash
python scripts/convert_full.py <pdf_source> [--no-ocr] [-o output.md]

Supports both local files and URLs. Examples:

bash
python scripts/convert_full.py document.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf --no-ocr  
python scripts/convert_full.py https://example.com/doc.pdf -o custom.md --no-ocr
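
For readers curious about what such a conversion involves, here is a minimal sketch using docling's Python API. It is not the contents of convert_full.py, which this page does not show. The function names `convert_pdf` and `default_output_path`, and the rule for deriving the output name, are assumptions. The docling imports reflect recent docling 2.x releases and should be checked against the installed version.

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_path(source: str) -> Path:
    """Derive a name such as document.md from a path or URL (assumed convention)."""
    # For URLs, use only the path component so query strings are ignored.
    name = Path(urlparse(source).path).name if "://" in source else Path(source).name
    return Path(name).with_suffix(".md")


def convert_pdf(source: str, use_ocr: bool = False) -> str:
    """Convert a local PDF or a URL to markdown text with docling."""
    # Imported lazily so the helper above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # do_ocr=False plays the role of the --no-ocr flag.
    options = PdfPipelineOptions(do_ocr=use_ocr)
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

With docling installed, `default_output_path(src).write_text(convert_pdf(src), encoding="utf-8")` would write the result next to the working directory.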

extract_samples.py

bash
python scripts/extract_samples.py <markdown_file> [--min-repeats N]

Shows document structure and repeated patterns for cleanup planning.
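
One way to find repeated patterns such as running headers or page-number lines is to count normalized lines. The sketch below is a guess at the idea, not the script's actual logic. The function name, the digit normalization, and the default threshold of 3 are all assumptions.

```python
import re
from collections import Counter


def repeated_lines(markdown: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (normalized line, count) pairs seen at least min_repeats times."""
    counts: Counter = Counter()
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are not interesting as patterns
        # Replace digit runs so "Page 3" and "Page 12" count as one pattern.
        counts[re.sub(r"\d+", "<N>", stripped)] += 1
    return [(text, n) for text, n in counts.most_common() if n >= min_repeats]
```

Lines that survive the threshold are candidates for a Phase 2 substitution. They still need a manual look, since legitimate repeated content (for example table rows) is counted as well.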

apply_substitutions.py

bash
python scripts/apply_substitutions.py <markdown_file> -s 'pattern' [--dry-run]

Applies sed-style regex substitutions with automatic backup.

Always use --dry-run first to test patterns before applying.

Example usage:

bash
# Test a pattern first
python scripts/apply_substitutions.py document.md -s 's/old/new/g' --dry-run

# Apply if satisfied with dry-run results
python scripts/apply_substitutions.py document.md -s 's/old/new/g'
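
The dry-run, apply, and backup behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code of apply_substitutions.py. The sketch applies the expression line by line as sed does, does not support escaped delimiters, and writes the backup to a hypothetical `<name>.bak` file. The real script may differ on all three points.

```python
import re
import shutil
from pathlib import Path


def parse_substitution(expr: str) -> tuple[re.Pattern, str, int]:
    """Parse 's/pattern/replacement/flags' into (regex, replacement, count)."""
    if len(expr) < 2 or expr[0] != "s":
        raise ValueError(f"not a substitution: {expr!r}")
    delim = expr[1]  # any character may serve as delimiter, e.g. s|a/b|c|g
    parts = expr[2:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected exactly three delimiters (escaping is unsupported)")
    pattern, replacement, flags = parts
    count = 0 if "g" in flags else 1  # for re.sub, count=0 means "replace all"
    return re.compile(pattern), replacement, count


def substitute_file(path: Path, expr: str, dry_run: bool = False) -> int:
    """Apply expr to each line of the file; return the number of changed lines."""
    regex, replacement, count = parse_substitution(expr)
    old_lines = path.read_text(encoding="utf-8").split("\n")
    new_lines = [regex.sub(replacement, line, count=count) for line in old_lines]
    changed = [(old, new) for old, new in zip(old_lines, new_lines) if old != new]
    if dry_run:
        for old, new in changed:
            print(f"- {old}\n+ {new}")  # preview only; the file is untouched
        return len(changed)
    if changed:
        # Keep a copy of the original next to the file before overwriting it.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text("\n".join(new_lines), encoding="utf-8")
    return len(changed)
```

Because the sketch works per line, a pattern such as `s/^Page \d+$//` empties matching lines without deleting them. Collapsing the resulting blank lines would be a separate step.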

Pattern Development Strategy:

  1. Use extract_samples.py to identify issues
  2. Develop patterns specific to your document
  3. Test with --dry-run
  4. Apply and verify with extract_samples.py

See phase-specific documentation for detailed patterns and examples.

analyze_split_points.py / split_markdown.py

bash
python scripts/analyze_split_points.py <markdown_file>
python scripts/split_markdown.py <markdown_file> --heading-level 2 [--dry-run]
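
As a rough illustration of splitting at a heading level, the sketch below cuts a document at every heading of the chosen depth and proposes a filename for each chunk. It is not the implementation of split_markdown.py. The function name, the "preamble" chunk, and the `NN-slug.md` naming scheme are assumptions.

```python
import re


def split_by_heading(markdown: str, level: int = 2) -> list[tuple[str, str]]:
    """Split markdown into (filename, content) chunks at headings of the given level."""
    marker = "#" * level + " "
    chunks: list[tuple[str, list[str]]] = [("preamble", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # '#' lines inside fenced code are not headings
        if not in_fence and line.startswith(marker):
            chunks.append((line[len(marker):].strip(), []))
        chunks[-1][1].append(line)
    result = []
    # Skip the preamble when the document starts directly with a heading.
    for index, (title, lines) in enumerate(c for c in chunks if c[1]):
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"
        result.append((f"{index:02d}-{slug}.md", "\n".join(lines) + "\n"))
    return result
```

Printing only the proposed filenames corresponds to a dry run. The names can then be reviewed before anything is written to disk.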

Detailed Guides