PDF Extractor

Name: pdf-extractor
Rating: 92
Author: Greenmamba29

Overview

Extract and structure content from PDFs, including scanned documents via OCR, and deliver clean text, parsed tables, document metadata, and form data ready for downstream processing.

When to Use

•Extracting invoice data from supplier PDF invoices
•Parsing technical datasheets for battery specifications
•Processing government compliance documents and permits
•Converting scanned contracts into editable text
•Batch-extracting data from hundreds of procurement PDFs

Instructions

•Accept inputs: PDF file path or URL, extraction mode (text/tables/images/forms/all), OCR flag, output format (json/csv/markdown/text).
•Load the PDF and detect document type: digital native or scanned.
•If the document is scanned and OCR=true, run the OCR engine (Tesseract or the Google Cloud Vision API) to convert page images to text.
•Extract content based on mode:
- Text: parse all text with page/paragraph structure preserved.
- Tables: detect table regions, extract rows/columns into structured data.
- Images: extract embedded images with position metadata.
- Forms: identify form fields and their values.
•Clean extracted text: remove headers/footers, fix encoding issues, normalize whitespace.
•Structure output in requested format with source PDF reference and page numbers.
•Return extraction summary: pages processed, tables found, confidence score (OCR), output file path.
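
The detection and cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the skill's actual implementation: it assumes per-page text has already been pulled out by some PDF library, and the function names (looks_scanned, clean_pages) and thresholds are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess that a PDF is scanned when most pages carry almost no embedded text.

    The 20-character cutoff and the "more than half" rule are assumed values.
    """
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


def clean_pages(page_texts, repeat_ratio=0.6):
    """Drop header/footer lines repeated on most pages, then normalize whitespace."""
    edge_counts = Counter()
    split_pages = []
    for text in page_texts:
        # NFKC folds ligatures such as U+FB01 into plain letters and
        # turns non-breaking spaces into ordinary spaces.
        lines = [unicodedata.normalize("NFKC", ln).strip() for ln in text.splitlines()]
        lines = [ln for ln in lines if ln]
        split_pages.append(lines)
        if lines:
            # Count the first and last line once per page.
            edge_counts.update({lines[0], lines[-1]})

    threshold = max(2, int(len(page_texts) * repeat_ratio))
    repeated = {line for line, n in edge_counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0] in repeated:
            lines = lines[1:]
        if lines and lines[-1] in repeated:
            lines = lines[:-1]
        # Collapse runs of spaces and tabs inside each remaining line.
        cleaned.append("\n".join(re.sub(r"[ \t]+", " ", ln) for ln in lines))
    return cleaned


pages = [
    "ACME Corp\nInvoice   42\nTotal:\t100\nConfidential",
    "ACME Corp\nLine items\nConfidential",
    "ACME Corp\nTerms\nConfidential",
]
print(looks_scanned(pages))
print(clean_pages(pages))
```

Only exact repeats are treated as headers or footers here, so a footer that changes per page (for example a running page number) would need an extra pattern-based rule.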

Environment

OCR_ENGINE=tesseract
GOOGLE_VISION_API_KEY=your_key_optional
OUTPUT_FORMAT=json
BATCH_SIZE=10
TEMP_DIR=./pdf_processing
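
One way these variables might be read at startup is sketched below. The ExtractorConfig class and load_config function are hypothetical names, and using the values listed above as defaults is an assumption rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Output formats named in the Instructions section.
ALLOWED_FORMATS = {"json", "csv", "markdown", "text"}


@dataclass(frozen=True)
class ExtractorConfig:
    ocr_engine: str
    vision_api_key: Optional[str]
    output_format: str
    batch_size: int
    temp_dir: str


def load_config(env=None):
    """Build a config from environment variables (or any mapping, for testing)."""
    env = os.environ if env is None else env

    output_format = env.get("OUTPUT_FORMAT", "json")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported OUTPUT_FORMAT: {output_format!r}")

    batch_size = int(env.get("BATCH_SIZE", "10"))
    if batch_size < 1:
        raise ValueError("BATCH_SIZE must be at least 1")

    return ExtractorConfig(
        ocr_engine=env.get("OCR_ENGINE", "tesseract"),
        # The key is optional; an empty or missing value becomes None.
        vision_api_key=env.get("GOOGLE_VISION_API_KEY") or None,
        output_format=output_format,
        batch_size=batch_size,
        temp_dir=env.get("TEMP_DIR", "./pdf_processing"),
    )


config = load_config({"BATCH_SIZE": "25", "OUTPUT_FORMAT": "csv"})
print(config)
```

Passing a plain dict instead of os.environ keeps the loader easy to test without touching the real environment.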

Examples

Input:

file: ./invoices/lithium_supplier_inv_2026.pdf
mode: tables
ocr: false
output_format: csv

Output:

Extraction complete.
Pages processed: 3
Tables found: 2
Rows extracted: 47
Output: ./output/lithium_supplier_inv_2026_tables.csv
Confidence: 99.2% (digital PDF)
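
As a rough illustration of how an input like the one in this example could be validated before extraction starts, the sketch below parses the "key: value" lines into a typed request. The ExtractionRequest class, the parse_request function, and the defaults for omitted fields are all hypothetical; only the field names and allowed values come from the Instructions section.

```python
from dataclasses import dataclass

MODES = {"text", "tables", "images", "forms", "all"}
FORMATS = {"json", "csv", "markdown", "text"}


@dataclass(frozen=True)
class ExtractionRequest:
    file: str                    # PDF file path or URL
    mode: str = "all"            # assumed default
    ocr: bool = False            # assumed default
    output_format: str = "json"  # assumed default


def parse_request(raw: str) -> ExtractionRequest:
    """Parse 'key: value' lines and reject unknown modes or formats."""
    fields = {}
    for line in raw.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        fields[key.strip()] = value.strip()

    if not fields.get("file"):
        raise ValueError("'file' is required")
    mode = fields.get("mode", "all")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    output_format = fields.get("output_format", "json")
    if output_format not in FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    ocr_raw = fields.get("ocr", "false").lower()
    if ocr_raw not in {"true", "false"}:
        raise ValueError(f"ocr must be true or false, got {ocr_raw!r}")

    return ExtractionRequest(fields["file"], mode, ocr_raw == "true", output_format)


request = parse_request("""
file: ./invoices/lithium_supplier_inv_2026.pdf
mode: tables
ocr: false
output_format: csv
""")
print(request)
```

Rejecting bad values up front means a batch run fails on the malformed request rather than partway through processing.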