PDF Data Extraction
Extract tables, form fields, and structured data from PDF files.
When to use this skill
- •User wants to extract tables from a PDF
- •User needs form field values from a fillable PDF
- •User asks to convert PDF data into JSON, CSV, or Markdown
Steps
Extract tables
- •
Ensure dependencies are installed:
bashpip install pymupdf tabula-py pandas
tabula-pyrequires Java Runtime (JRE). If unavailable, fall back to raw text extraction. - •
Run the table extraction script.
bashpython scripts/extract_tables.py "INPUT_FILE_PATH"
For CSV output instead of Markdown, add
--csv:bashpython scripts/extract_tables.py "INPUT_FILE_PATH" --csv
Extract form fields
For fillable PDF forms:
bash
python scripts/extract_form_fields.py "INPUT_FILE_PATH"
For JSON output, add --json:
bash
python scripts/extract_form_fields.py "INPUT_FILE_PATH" --json
Fallback: raw text extraction
If tabula is unavailable or the data is not in table format:
bash
python scripts/extract_text.py "INPUT_FILE_PATH"
Then parse the text output to identify patterns (key-value pairs, repeated structures) and convert to the requested format.
Output formatting
Present extracted data in the user's requested format:
- •JSON: array of objects in a fenced code block
- •CSV: comma-separated values in a fenced code block
- •Markdown table: pipe-delimited table
Edge cases
- •No tables found: Fall back to text extraction and manual parsing.
- •Merged cells or complex layouts: tabula may produce messy output — clean up column alignment manually.
- •Ambiguous values: Output
nullor[unclear]instead of guessing. - •Java not installed: Skip tabula, use pymupdf text extraction as fallback.
Scripts
- •extract_tables.py — Extract tables via tabula-py
- •extract_form_fields.py — Extract fillable form fields
- •extract_text.py — Raw text extraction fallback