extracting-pdf-text

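Extracts text from PDFs for use by large language models. Suited to processing PDFs for retrieval-augmented generation (RAG), document analysis, or text extraction tasks. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents via OCR.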

SKILL.md
--- frontmatter
name: extracting-pdf-text
description: Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |

Recommended Workflow

  1. Try PyMuPDF first - fastest, handles most text-based PDFs well
  2. If tables are mangled - switch to pdfplumber
  3. If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
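The workflow above can be sketched as a small decision function. This is an illustrative sketch, not part of the skill's scripts: the function name `choose_extractor`, the returned labels, and the 50-characters-per-page threshold for treating a PDF as scanned are all assumptions chosen for the example.

```python
def choose_extractor(
    avg_chars_per_page: float,
    tables_mangled: bool = False,
    api_available: bool = False,
    min_chars: int = 50,  # assumed threshold, tune for your documents
) -> str:
    """Suggest an extraction approach following the recommended workflow."""
    if avg_chars_per_page < min_chars:
        # Almost no text layer: treat the PDF as scanned/image-based.
        return "mistral-ocr" if api_available else "local-ocr"
    if tables_mangled:
        # A text layer exists, but the first pass garbled the tables.
        return "pdfplumber"
    # Default: the fastest option for ordinary text-based PDFs.
    return "pymupdf"
```

In practice `avg_chars_per_page` would come from a first PyMuPDF pass, and `tables_mangled` from inspecting that output by hand or with a heuristic of your own.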

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```

The script outputs markdown that preserves headings and paragraphs. For LLM-optimized output, it uses pymupdf4llm, which formats the text for RAG systems.
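The contents of `scripts/extract_pymupdf.py` are not shown here, so the following is only a minimal sketch of what a pymupdf4llm-based extraction might look like. It relies on `pymupdf4llm.to_markdown`, which takes a PDF path and returns a markdown string; the helper names `normalize_markdown` and `pdf_to_markdown` are hypothetical.

```python
import re


def normalize_markdown(markdown: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines to one."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip() + "\n"


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a text-based PDF to tidied markdown."""
    # Imported lazily so the helper above works without the dependency.
    import pymupdf4llm

    return normalize_markdown(pymupdf4llm.to_markdown(pdf_path))
```

The normalization step is optional; it is included only because extracted markdown often carries extra blank lines that waste tokens.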

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
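To illustrate the table-to-markdown conversion, here is a hedged sketch built on pdfplumber's `page.extract_tables()`, which returns each table as a list of rows with `None` for empty cells. The helper names are hypothetical, the first row is assumed to be the header, and the actual script may work differently.

```python
def table_to_markdown(rows):
    """Render one extracted table (list of rows) as a markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells arrive as None; multi-line cells are flattened.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    lines = ["| " + " | ".join(padded[0]) + " |"]      # assumed header row
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path):
    """Return every table in the PDF as a markdown string."""
    import pdfplumber  # lazy import: only needed when reading a real PDF

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_markdown(table))
    return tables
```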

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```

Requires: the pytesseract and pdf2image Python packages, plus a Tesseract installation (brew install tesseract on macOS). pdf2image also depends on Poppler (brew install poppler on macOS).
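A local OCR pass with these dependencies might look roughly like the sketch below. It renders each page with `pdf2image.convert_from_path` and recognizes it with `pytesseract.image_to_string`. The page-separator format, the 300 DPI default, and the function names are assumptions for illustration, not the behavior of `scripts/extract_with_ocr.py`.

```python
def join_pages(page_texts):
    """Join per-page OCR text with a visible page marker (assumed format)."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each PDF page and run Tesseract on it."""
    # Lazy imports: both packages need system binaries (Poppler, Tesseract).
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(img, lang=lang) for img in images)
```

Higher DPI generally improves recognition of small print at the cost of speed and memory, which is one reason local OCR is the slower option.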

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```

Features:

  • Outputs clean markdown
  • Preserves document structure (headings, lists, tables)
  • Handles images, math equations, multilingual text
  • 95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.
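For orientation, a direct call to the OCR endpoint could be sketched as follows. This follows the upload, signed-URL, and `ocr.process` flow from Mistral's published Python SDK documentation as best understood here; the model name `mistral-ocr-latest` and the helper names are assumptions, and the bundled script may be implemented differently. Verify against the current SDK before relying on it.

```python
import os


def pages_to_markdown(pages_markdown):
    """Concatenate per-page markdown, skipping empty pages."""
    kept = [page.strip() for page in pages_markdown if page.strip()]
    return "\n\n".join(kept) + "\n"


def ocr_with_mistral(pdf_path):
    """Upload a local PDF and return the OCR result as one markdown string."""
    from mistralai import Mistral  # lazy import: requires the mistralai package

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    with open(pdf_path, "rb") as handle:
        uploaded = client.files.upload(
            file={"file_name": os.path.basename(pdf_path), "content": handle},
            purpose="ocr",
        )
    signed = client.files.get_signed_url(file_id=uploaded.id)
    response = client.ocr.process(
        model="mistral-ocr-latest",  # assumed model alias
        document={"type": "document_url", "document_url": signed.url},
    )
    return pages_to_markdown(page.markdown for page in response.pages)
```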

Output Format Recommendations

For LLM consumption, markdown is preferred:

  • Preserves semantic structure (headings become context boundaries)
  • Tables remain readable
  • Compatible with most RAG chunking strategies
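As an example of headings acting as context boundaries, the sketch below splits extracted markdown into one chunk per heading. It is a deliberately simple illustration with a hypothetical function name: it only recognizes ATX headings (`#` to `######`) and does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split markdown into chunks, starting a new chunk at each heading."""
    chunks = []
    title, lines = "", []

    def flush():
        # Skip the empty chunk that precedes a document's first heading.
        if title or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading alongside the body text, so a retriever can embed or display the heading as context for the passage.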

For detailed comparisons of local tools, see references/local-tools.md.