AgentSkillsCN

pdftext

利用AI驱动或传统工具,从PDF中提取文本,以供大语言模型使用。当您需要将学术PDF转换为Markdown格式、提取结构化内容(标题/表格/列表)、批量处理科研论文、为RAG系统准备PDF文件,或在“PDF提取”“PDF转文本”“PDF转Markdown”“Docling”“Pymupdf”“Pdfplumber”等关键词出现时,可使用此功能。系统提供Docling(AI驱动、保留原文结构、表格准确率达97.9%)以及传统工具(PyMuPDF以速度见长,Pdfplumber则以高质量著称)。所有处理均在设备本地完成,无需调用API。

SKILL.md
--- frontmatter
name: pdftext
description: Extract text from PDFs for LLM consumption using AI-powered or traditional tools. Use when converting academic PDFs to markdown, extracting structured content (headers/tables/lists), batch processing research papers, preparing PDFs for RAG systems, or when mentions of "pdf extraction", "pdf to text", "pdf to markdown", "docling", "pymupdf", "pdfplumber" appear. Provides Docling (AI-powered, structure-preserving, 97.9% table accuracy) and traditional tools (PyMuPDF for speed, pdfplumber for quality). All processing is on-device with no API calls.
license: Apache 2.0 (see LICENSE.txt)

PDF Text Extraction

Tool Selection

ToolSpeedQualityStructureUse When
Docling0.43s/pageGood✓ YesNeed headers/tables/lists, academic PDFs, LLM consumption
PyMuPDF0.01s/pageExcellent✗ NoSpeed critical, simple text extraction, good enough quality
pdfplumber0.44s/pagePerfect✗ NoMaximum fidelity needed, slow acceptable

Decision:

  • Academic research → Docling (structure preservation)
  • Batch processing → PyMuPDF (60x faster)
  • Critical accuracy → pdfplumber (0 quality issues)

Installation

bash
# Create virtual environment
python3 -m venv pdf_env
source pdf_env/bin/activate

# Install Docling (AI-powered, recommended)
pip install docling

# Install traditional tools
pip install pymupdf pdfplumber

First run downloads ML models (~500MB-1GB, cached locally, no API calls).

Basic Usage

Docling (Structure-Preserving)

python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # Reuse for multiple PDFs
result = converter.convert("paper.pdf")
markdown = result.document.export_to_markdown()

# Save output
with open("paper.md", "w") as f:
    f.write(markdown)

Output includes: Headers (##), tables (|...|), lists (- ...), image markers.

PyMuPDF (Fast)

python
import fitz

doc = fitz.open("paper.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()

with open("paper.txt", "w") as f:
    f.write(text)

pdfplumber (Highest Quality)

python
import pdfplumber

with pdfplumber.open("paper.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

with open("paper.txt", "w") as f:
    f.write(text)

Batch Processing

See examples/batch_convert.py for ready-to-use script.

Pattern:

python
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # Initialize once
for pdf in Path("./pdfs").glob("*.pdf"):
    result = converter.convert(str(pdf))
    markdown = result.document.export_to_markdown()
    Path(f"./output/{pdf.stem}.md").write_text(markdown)

Performance tip: Reuse converter instance. Reinitializing wastes time.

Quality Considerations

Common issues:

  • Ligatures: /uniFB03 → "ffi" (post-process with regex)
  • Excessive whitespace: 50-90 instances (Docling has fewer)
  • Hyphenation breaks: End-of-line hyphens may remain

Quality metrics script: See examples/quality_analysis.py

Benchmarks: See references/benchmarks.md for enterprise production data.

Troubleshooting

Slow first run: ML models downloading (15-30s). Subsequent runs fast.

Out of memory: Reduce concurrent conversions, process large PDFs individually.

Missing tables: Ensure do_table_structure=True in Docling options.

Garbled text: PDF encoding issue. Apply ligature fixes post-processing.

Privacy

All tools run on-device. No API calls, no data sent externally. Docling downloads models once, caches locally (~500MB-1GB).

References

  • Tool comparison: references/tool-comparison.md
  • Quality metrics: references/quality-metrics.md
  • Production benchmarks: references/benchmarks.md