AgentSkillsCN

pdf-data-extract

使用 Python(pymupdf、tabula-py)从 PDF 文件中提取结构化数据、表格与表单字段。当用户需要将 PDF 中的表格、表单数据,或特定字段以 JSON、CSV 或 Markdown 格式导出时,这套工具将精准高效地完成任务。

SKILL.md
--- frontmatter
name: pdf-data-extract
description: Extract structured data, tables, and form fields from PDF files using Python (pymupdf, tabula-py). Use when the user needs tables, form data, or specific fields extracted from a PDF into JSON, CSV, or Markdown format.
compatibility: Requires Python 3, pymupdf, and optionally tabula-py with Java Runtime

PDF Data Extraction

Extract tables, form fields, and structured data from PDF files.

When to use this skill

  • User wants to extract tables from a PDF
  • User needs form field values from a fillable PDF
  • User asks to convert PDF data into JSON, CSV, or Markdown

Steps

Extract tables

  1. Ensure dependencies are installed:

    bash
    pip install pymupdf tabula-py pandas
    

    tabula-py requires Java Runtime (JRE). If unavailable, fall back to raw text extraction.

  2. Run the table extraction script.

    bash
    python scripts/extract_tables.py "INPUT_FILE_PATH"
    

    For CSV output instead of Markdown, add --csv:

    bash
    python scripts/extract_tables.py "INPUT_FILE_PATH" --csv
    

Extract form fields

For fillable PDF forms:

bash
python scripts/extract_form_fields.py "INPUT_FILE_PATH"

For JSON output, add --json:

bash
python scripts/extract_form_fields.py "INPUT_FILE_PATH" --json

Fallback: raw text extraction

If tabula is unavailable or the data is not in table format:

bash
python scripts/extract_text.py "INPUT_FILE_PATH"

Then parse the text output to identify patterns (key-value pairs, repeated structures) and convert to the requested format.

Output formatting

Present extracted data in the user's requested format:

  • JSON: array of objects in a fenced code block
  • CSV: comma-separated values in a fenced code block
  • Markdown table: pipe-delimited table

Edge cases

  • No tables found: Fall back to text extraction and manual parsing.
  • Merged cells or complex layouts: tabula may produce messy output — clean up column alignment manually.
  • Ambiguous values: Output null or [unclear] instead of guessing.
  • Java not installed: Skip tabula, use pymupdf text extraction as fallback.

Scripts