PDF Data Extraction

Extract tables, form fields, and structured data from PDF files.

When to use this skill

•The user asks to extract tables, form fields, or other structured data from a PDF file.

For tables, run the table extraction script (Markdown output by default):

bash

python scripts/extract_tables.py "INPUT_FILE_PATH"

For CSV output instead of Markdown, add --csv:

bash

python scripts/extract_tables.py "INPUT_FILE_PATH" --csv

For fillable PDF forms:

bash

python scripts/extract_form_fields.py "INPUT_FILE_PATH"

For JSON output, add --json:

bash

python scripts/extract_form_fields.py "INPUT_FILE_PATH" --json

If tabula is unavailable or the data is not in table format, extract the raw text instead:

bash

python scripts/extract_text.py "INPUT_FILE_PATH"

Then parse the text output to identify patterns (key-value pairs, repeated structures) and convert to the requested format.
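The parsing step can be sketched as follows. This is a minimal example, assuming the text output contains one "Label: value" pair per line; the `KEY_VALUE` pattern, the `parse_key_values` helper, and the sample text are hypothetical and not part of the scripts above. Real PDFs will usually need a pattern tuned to their layout.

```python
import json
import re

# Hypothetical pattern: a short label, a colon, then the value on the same line.
KEY_VALUE = re.compile(r"^\s*([^:\n]{1,60}?)\s*:\s*(.*?)\s*$")


def parse_key_values(text):
    """Collect 'Label: value' lines into a dict; empty values become None."""
    record = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if not match:
            continue  # not a key-value line, skip it
        key, value = match.groups()
        key = key.strip()
        if not key:
            continue  # a bare colon with no label
        # An empty value is stored as None so it serializes as JSON null.
        record[key] = value or None
    return record


# Illustrative input only; not taken from a real document.
sample = "Invoice Number: 12345\nDate: 2024-01-15\nNotes:\n"
print(json.dumps(parse_key_values(sample), indent=2))
```

Only the first colon on a line separates the label from the value, so values such as times keep their own colons.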

Present extracted data in the user's requested format. Handle edge cases as follows:

•No tables found: Fall back to text extraction and manual parsing.
•Merged cells or complex layouts: tabula may produce misaligned output; fix column alignment manually.
•Ambiguous values: Output null or [unclear] instead of guessing.
•Java not installed: tabula requires Java, so skip it and fall back to pymupdf text extraction.
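
The Java check in the last item can be automated. The sketch below is one possible approach, not part of the skill's scripts: `choose_command` is a hypothetical helper, it treats a `java` executable on PATH as a rough proxy for tabula being usable, and it assumes scripts/extract_text.py is the text-extraction fallback.

```python
import shutil
import sys


def choose_command(pdf_path, java_available=None):
    """Build the argv for table extraction, or for text extraction without Java."""
    if java_available is None:
        # Assumption: finding `java` on PATH is enough for tabula to run.
        java_available = shutil.which("java") is not None
    if java_available:
        script = "scripts/extract_tables.py"
    else:
        script = "scripts/extract_text.py"
    return [sys.executable, script, pdf_path]


# With Java missing, the text extraction script is selected.
print(choose_command("INPUT_FILE_PATH", java_available=False)[1])
```

The returned list can be passed to `subprocess.run(..., check=True)`. If the table script exits with an error or finds no tables, run the text extraction command as the next step.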