pdf-processing-pro

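Production-ready PDF processing with support for forms, tables, OCR, validation, and batch operations. Use it for complex PDF workflows in production environments, for processing large volumes of PDFs, or when robust error handling and validation are required. Do not use it for simple text extraction; use pdf-extract for quick reads instead.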

SKILL.md
---
name: pdf-processing-pro
description: Production-ready PDF processing with forms, tables, OCR, validation, and batch operations. Use when working with complex PDF workflows in production environments, processing large volumes of PDFs, or requiring robust error handling and validation. Do NOT use for simple text extraction - use pdf-extract for quick reads.
---

PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

Quick start

Extract text from PDF

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
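
The snippet above reads only the first page. A minimal sketch for a whole document follows. It assumes that pages without a text layer may yield `None` or an empty string from `extract_text()`; the helper names `join_page_texts` and `extract_all_text` are illustrative and are not part of the included scripts.

```python
from typing import Iterable, Optional


def join_page_texts(texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, treating pages with no extractable text as empty."""
    return "\n\n".join((text or "").strip() for text in texts)


def extract_all_text(path: str) -> str:
    """Extract text from every page of the PDF at `path` (hypothetical helper)."""
    import pdfplumber  # imported lazily so the pure helper above has no dependency

    with pdfplumber.open(path) as pdf:
        # The generator is consumed inside the `with` block, while the file is open.
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Keeping the joining logic separate from the file access makes it easy to test without a real PDF.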

Analyze PDF form (using included script)

bash
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
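
For orientation, here is a rough sketch of what a field analysis of this kind might do with pypdf. It is not the source of `analyze_form.py`, and it reports only field type and current value, not positions. The names `summarize_fields` and `analyze_form` are assumptions made for this example.

```python
from typing import Any, Dict, Mapping, Optional

# Values of the /FT (field type) entry defined by the PDF specification.
FIELD_TYPES = {"/Tx": "text", "/Btn": "button", "/Ch": "choice", "/Sig": "signature"}


def summarize_fields(
    fields: Optional[Mapping[str, Mapping[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Reduce raw field dictionaries to a JSON-friendly {name: {type, value}} map."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, field in (fields or {}).items():
        value = field.get("/V")  # /V holds the current value, if any
        summary[name] = {
            "type": FIELD_TYPES.get(str(field.get("/FT")), "unknown"),
            "value": None if value is None else str(value),
        }
    return summary


def analyze_form(path: str) -> Dict[str, Dict[str, Any]]:
    """Read the form fields of the PDF at `path` (hypothetical helper)."""
    from pypdf import PdfReader  # lazy import: only needed for a real file

    # get_fields() returns None when the document has no form fields.
    return summarize_fields(PdfReader(path).get_fields())
```

The result can be passed to `json.dumps(..., indent=2)` to produce a file similar in spirit to `fields.json`.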

Fill PDF form with validation

bash
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
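
The validate-then-fill idea can be sketched with pypdf as follows. This is an assumption about the general approach, not the implementation of `fill_form.py`: it only checks that every key in the data names a field that exists in the form, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def find_unknown_fields(data: Dict[str, Any], field_names: Iterable[str]) -> List[str]:
    """Return the keys in `data` that do not correspond to a form field."""
    known = set(field_names)
    return sorted(name for name in data if name not in known)


def fill_form(src: str, data: Dict[str, Any], dst: str) -> None:
    """Fill the form in `src` with `data` and write the result to `dst`."""
    from pypdf import PdfReader, PdfWriter  # lazy import: only needed for real files

    reader = PdfReader(src)
    unknown = find_unknown_fields(data, (reader.get_fields() or {}).keys())
    if unknown:
        # Fail before writing anything, so a bad input never yields a partial output.
        raise ValueError(f"Fields not present in the form: {', '.join(unknown)}")

    writer = PdfWriter()
    writer.append(reader)  # copies the pages together with the form definition
    for page in writer.pages:
        if "/Annots" in page:  # skip pages that carry no form widgets
            writer.update_page_form_field_values(page, data, auto_regenerate=False)
    with open(dst, "wb") as fh:
        writer.write(fh)
```

Text fields generally expect string values, so numeric data may need converting before the call.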

Extract tables from PDF

bash
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
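
As an illustration of the underlying library call, the sketch below collects tables with pdfplumber's `page.extract_tables()` and serializes them as CSV. It is a simplified stand-in, not the code of `extract_tables.py`; the function names and the blank line between tables are choices made for this example.

```python
import csv
import io
from typing import List, Optional, Sequence

Table = Sequence[Sequence[Optional[str]]]


def tables_to_csv_text(tables: Sequence[Table]) -> str:
    """Serialize tables as CSV text, with a blank line between tables."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for index, table in enumerate(tables):
        if index:
            writer.writerow([])  # separator between consecutive tables
        for row in table:
            # pdfplumber reports empty cells as None; write them as empty strings.
            writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return buffer.getvalue()


def extract_tables(path: str) -> List[Table]:
    """Collect every table pdfplumber detects in the PDF at `path`."""
    import pdfplumber  # lazy import: only needed for a real file

    tables: List[Table] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

When writing the text to disk, open the file with `newline=""` so the CSV line endings are not translated a second time.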

Features

Production-ready scripts

  • Error handling with detailed messages and proper exit codes
  • Input validation, type checking, and configurable logging
  • Full type annotations and a command-line interface (--help on all scripts)
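
A skeleton showing how these properties typically fit together in one script is sketched below. The exit-code values, the logger name, and the argument set are assumptions for illustration; the included scripts may be organized differently.

```python
import argparse
import logging
from pathlib import Path
from typing import List, Optional

log = logging.getLogger("pdf_tool")  # hypothetical logger name

# Hypothetical exit codes; argparse itself exits with status 2 on bad usage.
EXIT_OK, EXIT_ERROR = 0, 1


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Example PDF tool skeleton")
    parser.add_argument("input", type=Path, help="path to the input PDF")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Validate the input, log what happens, and return a process exit code."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)

    if not args.input.is_file():
        log.error("Input file not found: %s", args.input)
        return EXIT_ERROR
    if args.input.suffix.lower() != ".pdf":
        log.error("Expected a .pdf file, got: %s", args.input.name)
        return EXIT_ERROR

    log.info("Would process %s", args.input)  # real work would go here
    return EXIT_OK

# In a script, finish with: raise SystemExit(main())
```

Returning the code from `main` instead of calling `sys.exit` inside it keeps the function easy to call from tests and from batch drivers.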

Comprehensive workflows

  • PDF forms, table extraction, OCR processing
  • Batch operations, pre/post-processing validation
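
One common shape for batch operations is a driver that records a per-file outcome instead of stopping at the first failure. The sketch below assumes that pattern; `run_batch` and its result format are hypothetical and not taken from the toolkit.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_batch(
    paths: Iterable[str], process: Callable[[str], Any]
) -> Dict[str, Tuple[str, Any]]:
    """Apply `process` to each path and record ("ok", result) or ("error", message)."""
    results: Dict[str, Tuple[str, Any]] = {}
    for path in paths:
        try:
            results[path] = ("ok", process(path))
        except Exception as exc:  # one bad file must not abort the whole batch
            results[path] = ("error", str(exc))
    return results


def failed(results: Dict[str, Tuple[str, Any]]) -> Dict[str, Any]:
    """Return only the entries that ended in an error, keyed by path."""
    return {path: info for path, (status, info) in results.items() if status == "error"}
```

A caller can use `failed(...)` to decide on a non-zero exit code or to retry only the files that did not succeed.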

Advanced topics

PDF form processing

Covers complete form workflows: field analysis, dynamic filling, validation rules, multi-page forms, and checkbox/radio handling. See references/forms.md.
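
To make the idea of validation rules concrete, here is a small sketch that checks form data against a schema. The schema vocabulary (`required`, `type`, `pattern`) is invented for this example; the format actually expected by `validate_form.py` is described in references/forms.md and may differ.

```python
import re
from typing import Any, Dict, List

# Hypothetical schema: {"field": {"required": bool, "type": str, "pattern": str}}
TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validate_data(data: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data is valid."""
    errors: List[str] = [f"{name}: not defined in schema" for name in data if name not in schema]
    for name, rule in schema.items():
        value = data.get(name)
        if value is None or value == "":
            if rule.get("required", False):
                errors.append(f"{name}: required value is missing")
            continue
        type_name = rule.get("type", "string")
        expected = TYPES.get(type_name, str)
        # bool is a subclass of int, so reject it explicitly for non-boolean fields.
        if not isinstance(value, expected) or (isinstance(value, bool) and expected is not bool):
            errors.append(f"{name}: expected {type_name}")
            continue
        pattern = rule.get("pattern")
        if pattern and isinstance(value, str) and not re.fullmatch(pattern, value):
            errors.append(f"{name}: does not match pattern {pattern}")
    return errors
```

Collecting all problems in one pass, instead of raising on the first, lets a user fix the data file in a single round.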

Table extraction

Covers complex table extraction: multi-page tables, merged cells, nested tables, custom detection, and CSV/Excel export. See references/tables.md.
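
Multi-page tables usually arrive as one fragment per page, often with the header row repeated. A minimal sketch of stitching such fragments is shown below; it assumes the first row of the first fragment is the header and that continuation pages either repeat it exactly or omit it. The function name `stitch_tables` is hypothetical.

```python
from typing import List, Optional, Sequence

Row = List[Optional[str]]


def stitch_tables(page_tables: Sequence[Sequence[Row]]) -> List[Row]:
    """Merge per-page fragments of one table, dropping repeated header rows."""
    merged: List[Row] = []
    header: Optional[Row] = None
    for table in page_tables:
        if not table:
            continue  # a page on which no rows were detected
        if header is None:
            header = table[0]
            merged.append(header)
            rows = table[1:]
        else:
            # Skip the first row only when it repeats the header exactly.
            rows = table[1:] if table[0] == header else table
        merged.extend(rows)
    return merged
```

Real documents often need fuzzier header matching (whitespace or line-break differences), which this sketch deliberately leaves out.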

OCR processing

Covers scanned PDFs and image-based documents: Tesseract integration, language support, image preprocessing, and confidence scoring. See references/ocr.md.
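
Confidence scoring can be illustrated with pytesseract's `image_to_data`, which reports a confidence value per recognized word and `-1` for boxes that are not words. The sketch below is one possible way to use it; the threshold of 60 and the helper names are assumptions, not values taken from the toolkit.

```python
from typing import List, Sequence, Tuple, Union


def summarize_ocr(
    texts: Sequence[str],
    confs: Sequence[Union[str, float]],
    min_conf: float = 60.0,
) -> Tuple[List[str], float]:
    """Return (words at or above min_conf, mean confidence over all recognized words)."""
    scored = [
        (text.strip(), float(conf))
        for text, conf in zip(texts, confs)
        if text.strip() and float(conf) >= 0  # conf is -1 for non-word boxes
    ]
    words = [text for text, conf in scored if conf >= min_conf]
    mean = sum(conf for _, conf in scored) / len(scored) if scored else 0.0
    return words, mean


def ocr_image(path: str, lang: str = "eng", min_conf: float = 60.0) -> Tuple[List[str], float]:
    """Run Tesseract on the image at `path` (requires the tesseract binary)."""
    from PIL import Image  # lazy imports: only needed for a real image
    import pytesseract

    with Image.open(path) as image:
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
    return summarize_ocr(data["text"], data["conf"], min_conf)
```

For a scanned PDF, each page has to be rasterized first; one option is pdfplumber's `page.to_image(resolution=300).original`, which yields a PIL image.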

Included scripts

| Script | Purpose | Usage |
|---|---|---|
| analyze_form.py | Extract form field info | python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose] |
| fill_form.py | Fill PDF forms with data | python scripts/fill_form.py input.pdf data.json output.pdf [--validate] |
| validate_form.py | Validate form data before filling | python scripts/validate_form.py data.json schema.json |
| extract_tables.py | Extract tables to CSV/Excel | python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv\|excel] |
| extract_text.py | Extract text with formatting | python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting] |
| merge_pdfs.py | Merge multiple PDFs | python scripts/merge_pdfs.py file1.pdf file2.pdf --output merged.pdf |
| split_pdf.py | Split PDF into pages | python scripts/split_pdf.py input.pdf --output-dir pages/ |
| validate_pdf.py | Validate PDF integrity | python scripts/validate_pdf.py input.pdf |
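
As a rough idea of what an integrity check can start with, the sketch below looks for the two structural markers every PDF is expected to carry: a `%PDF-` header near the start and an `%%EOF` marker near the end. This is a shallow pre-check written for illustration and says nothing about what `validate_pdf.py` actually verifies; the 1024-byte windows are a lenient assumption.

```python
from typing import List


def check_pdf_bytes(data: bytes) -> List[str]:
    """Return a list of structural problems; an empty list means both markers were found."""
    problems: List[str] = []
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker")
    return problems


def check_pdf_file(path: str) -> List[str]:
    """Run the marker check on the file at `path`."""
    with open(path, "rb") as fh:
        return check_pdf_bytes(fh.read())
```

A file that passes can still be corrupt internally, so a full check would also try to parse it, for example by opening it with pypdf and iterating over the pages.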

Dependencies

All scripts require:

bash
pip install pdfplumber pypdf pillow pytesseract pandas

Optional, for OCR (the Tesseract engine that pytesseract calls):

bash
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
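
Before running the scripts, it can help to confirm that the packages and the Tesseract binary are actually present. The following sketch uses only the standard library; the function names are hypothetical, and the module list simply mirrors the pip command above (note that pillow is imported as `PIL`).

```python
import importlib.util
import shutil
from typing import Iterable, List

# Import names for the packages listed above; pillow is imported as PIL.
REQUIRED_MODULES = ["pdfplumber", "pypdf", "PIL", "pytesseract", "pandas"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_available() -> bool:
    """Report whether a `tesseract` executable is on PATH (needed only for OCR)."""
    return shutil.which("tesseract") is not None
```

Calling `missing_modules(REQUIRED_MODULES)` returns an empty list when everything is installed.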

References

| File | Contents |
|---|---|
| references/forms.md | Complete form processing guide |
| references/tables.md | Advanced table extraction |
| references/ocr.md | Scanned PDF processing |
| references/workflows.md | Common workflows, error handling, performance tips, best practices |
| references/troubleshooting.md | Troubleshooting common issues and getting help |