pdf-reader

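Extracts text from PDF files so it can be edited, searched, and referenced. Use it when you need to read PDF content, extract text from documents, search within a PDF, or convert a PDF to text for further processing. Supports several extraction methods (pdfplumber, PyMuPDF, pdfminer) with automatic fallback.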

SKILL.md
--- frontmatter
name: pdf-reader
description: Extract text from PDF files for manipulation, search, and reference. Use when needing to read PDF content, extract text from documents, search within PDFs, or convert PDF to text for further processing. Supports multiple extraction methods (pdfplumber, PyMuPDF, pdfminer) with automatic fallback.

PDF Reader

Extract text from PDF files for text manipulation, search, and reference.

Quick Start

Extract all text from a PDF:

bash
python scripts/extract_text.py document.pdf

Save to file:

bash
python scripts/extract_text.py document.pdf -o .tmp/output.txt

Extract specific pages:

bash
python scripts/extract_text.py document.pdf -p 1-10 -o .tmp/pages.txt

Workflow

  1. Extract text → Run scripts/extract_text.py
  2. Process output → Text is now searchable, editable, quotable
  3. Reference content → Use extracted text for analysis or response
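The three steps above can also be driven from Python. The sketch below is illustrative and not part of the skill: `build_command`, `run_extract`, and `find_lines` are hypothetical helper names, and it assumes the script is invoked as shown in Quick Start and prints the extracted text to stdout when `-o` is omitted.

```python
import subprocess
import sys


def build_command(pdf_path, output=None, pages=None, method=None):
    """Assemble the argument list for scripts/extract_text.py (hypothetical helper)."""
    cmd = [sys.executable, "scripts/extract_text.py", pdf_path]
    if output:
        cmd += ["-o", output]
    if pages:
        cmd += ["-p", pages]
    if method:
        cmd += ["-m", method]
    return cmd


def run_extract(pdf_path, **options):
    """Step 1: run the script and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_lines(text, keyword):
    """Steps 2-3: case-insensitive search, returning (line_number, line) pairs."""
    needle = keyword.lower()
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if needle in line.lower()]
```

For instance, `find_lines(run_extract("report.pdf"), "revenue")` would return the matching lines together with their line numbers, which makes them easy to quote.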

Script Options

code
extract_text.py <pdf_path> [options]

Options:
  -o, --output FILE      Save to file (default: print to stdout)
  -m, --method METHOD    auto|pdfplumber|pymupdf|pdfminer (default: auto)
  -p, --pages RANGE      Page range: "1-5" or "1,3,5" (default: all)
  --preserve-layout      Keep spatial arrangement of text
  --json                 Output with metadata (page sizes, method used)
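To make the `-p` formats concrete, here is a minimal sketch of how a value such as "1-5" or "1,3,5" could be expanded into page numbers. `parse_page_range` is a hypothetical function, not the script's actual parser; it assumes pages are 1-based and that the two forms may be mixed (e.g. "1-3,7"), neither of which the option text confirms.

```python
def parse_page_range(spec):
    """Expand a spec like "1-5" or "1,3,5" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A span such as "50-55": both ends are inclusive.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Using a set means overlapping entries such as "1-3,2" collapse to a single occurrence of each page.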

Method Selection

| Scenario | Recommended Method |
| --- | --- |
| General use | auto (default) |
| Documents with tables | pdfplumber |
| Large PDFs, speed needed | pymupdf |
| Maximum text accuracy | pdfminer |
| Scanned/image PDFs | pymupdf (has OCR) |
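The `auto` method, together with the "automatic fallback" mentioned in the description, can be pictured as trying each library in turn until one yields text. The following is a sketch under assumptions, not the script's implementation: the order (pdfplumber, then PyMuPDF, then pdfminer) and all helper names are hypothetical. The library calls themselves (`pdfplumber.open`, `fitz.open`, `pdfminer.high_level.extract_text`) are public APIs of those packages.

```python
def _with_pdfplumber(path):
    import pdfplumber  # imported lazily so a missing library only fails this step
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def _with_pymupdf(path):
    import fitz  # PyMuPDF's import name
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _with_pdfminer(path):
    from pdfminer.high_level import extract_text
    return extract_text(path)


# Assumed order; the real script may try the libraries differently.
DEFAULT_EXTRACTORS = [
    ("pdfplumber", _with_pdfplumber),
    ("pymupdf", _with_pymupdf),
    ("pdfminer", _with_pdfminer),
]


def extract_with_fallback(path, extractors=DEFAULT_EXTRACTORS):
    """Return (method_name, text) from the first extractor that yields non-empty text."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # missing library, unreadable file, ...
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: no text found")
    raise RuntimeError("all extraction methods failed: " + "; ".join(errors))
```

Returning the method name alongside the text makes it possible to report which library actually produced the output.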

Examples

Extract and search

bash
python scripts/extract_text.py report.pdf | grep -i "revenue"

Extract tables (use pdfplumber)

bash
python scripts/extract_text.py data.pdf -m pdfplumber --json -o .tmp/data.json

Specific pages with layout

bash
python scripts/extract_text.py book.pdf -p 50-55 --preserve-layout -o .tmp/chapter.txt

Dependencies

At least one of the following libraries is required; the command below installs all three:

bash
pip install pdfplumber pymupdf pdfminer.six
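Since only one of the three libraries has to be present, a quick check of what is installed can save a failed run. Below is a minimal sketch using only the standard library. `available_methods` is a hypothetical helper; the import names reflect how these packages are imported (`pymupdf` as `fitz`, `pdfminer.six` as `pdfminer`).

```python
import importlib.util

# Method name used by -m  ->  module name to import
IMPORT_NAMES = {
    "pdfplumber": "pdfplumber",
    "pymupdf": "fitz",
    "pdfminer": "pdfminer",
}


def available_methods(import_names=IMPORT_NAMES):
    """Return the method names whose backing module can be found."""
    return [method for method, module in import_names.items()
            if importlib.util.find_spec(module) is not None]


if __name__ == "__main__":
    found = available_methods()
    print("available:", ", ".join(found) if found else "none (install at least one)")
```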

For detailed library comparison, see references/pdf_libraries.md.

Troubleshooting

Empty output?

  • The PDF may be scanned or image-based → try --method pymupdf (PyMuPDF's OCR support relies on a Tesseract installation)
  • Check whether the PDF is password-protected
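Both causes can be checked programmatically. The sketch below is a heuristic, not something the script provides: `likely_scanned_pages` assumes you already have one extracted string per page, and its 20-character threshold is an arbitrary choice. `is_password_protected` requires PyMuPDF and uses its `Document.needs_pass` property.

```python
def likely_scanned_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    return [number for number, text in enumerate(page_texts, start=1)
            if len(text.strip()) < min_chars]


def is_password_protected(path):
    """True if the PDF cannot be read without a password (needs PyMuPDF)."""
    import fitz  # lazy import: only required for this check
    with fitz.open(path) as doc:
        return bool(doc.needs_pass)
```

If every page is flagged by `likely_scanned_pages`, the document is probably image-only and needs OCR rather than a different text extractor.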

Garbled text?

  • Try a different method, e.g. -m pdfminer
  • The PDF may use a non-standard font encoding
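Font-encoding problems often surface as Unicode replacement characters (U+FFFD) or as `(cid:NNN)` placeholders, which pdfminer emits for glyphs it cannot map to Unicode. A rough heuristic for deciding when to retry with another method might look like this; the function names and the 10% threshold are illustrative assumptions.

```python
import re

CID_MARKER = re.compile(r"\(cid:\d+\)")


def garbled_ratio(text):
    """Fraction of non-whitespace characters that look like unmapped glyphs."""
    visible = sum(1 for ch in text if not ch.isspace())
    if visible == 0:
        return 0.0
    cid_chars = sum(len(match.group()) for match in CID_MARKER.finditer(text))
    replacement_chars = text.count("\ufffd")
    return (cid_chars + replacement_chars) / visible


def looks_garbled(text, threshold=0.1):
    """True when more than `threshold` of the visible text is unmapped glyphs."""
    return garbled_ratio(text) > threshold
```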

Tables not formatted?

  • Use -m pdfplumber --json for structured output
  • Consider the --preserve-layout flag to keep columns aligned
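When the table cells themselves are needed rather than laid-out text, pdfplumber can also be called directly. This is a sketch, assuming pdfplumber is installed: `page.extract_tables()` is pdfplumber's own API and returns each table as a list of rows, while `table_to_csv` and `tables_from_page` are hypothetical helpers. It is independent of whatever `extract_text.py --json` emits.

```python
import csv
import io


def table_to_csv(rows):
    """Render one table (list of rows; cells may be None) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # pdfplumber reports empty cells as None; write them as empty strings.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


def tables_from_page(pdf_path, page_number):
    """Return CSV text for every table pdfplumber finds on a 1-based page."""
    import pdfplumber  # lazy import: only needed when reading a real PDF
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        return [table_to_csv(table) for table in page.extract_tables()]
```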