PDF Extractor
Overview
Extract and structure content from PDFs, including scanned documents via OCR, and deliver clean text, parsed tables, metadata, and form data ready for downstream processing.
When to Use
- Extracting invoice data from supplier PDF invoices
- Parsing technical datasheets for battery specifications
- Processing government compliance documents and permits
- Converting scanned contracts into editable text
- Batch-extracting data from hundreds of procurement PDFs
Instructions
- Accept inputs: PDF file path or URL, extraction mode (text/tables/images/forms/all), OCR flag, output format (json/csv/markdown/text).
- Load the PDF and detect document type: digital native or scanned.
- If scanned and OCR=true, run OCR engine (Tesseract or cloud Vision API) to convert to text.
- Extract content based on mode:
  - Text: parse all text with page/paragraph structure preserved.
  - Tables: detect table regions, extract rows/columns into structured data.
  - Images: extract embedded images with position metadata.
  - Forms: identify form fields and their values.
- Clean extracted text: remove headers/footers, fix encoding issues, normalize whitespace.
- Structure output in requested format with source PDF reference and page numbers.
- Return extraction summary: pages processed, tables found, confidence score (OCR), output file path.
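The cleaning step is the easiest one to illustrate without a PDF library. The following is a minimal sketch in Python, not the extractor's actual implementation: it assumes each page has already been extracted to a string, and the function names `clean_pages` and `_edge_key`, the `min_repeat_ratio` parameter, and the "first or last line repeated on most pages" heuristic are all illustrative assumptions.

```python
import re
import unicodedata
from collections import Counter


def _edge_key(line):
    """Mask digits so 'Page 1 of 3' and 'Page 2 of 3' compare equal."""
    return re.sub(r"\d+", "#", line)


def clean_pages(pages, min_repeat_ratio=0.6):
    """Drop header/footer lines repeated across pages and normalize whitespace.

    Assumption: a header or footer is the first or last non-empty line of a
    page and recurs (ignoring digits) on at least `min_repeat_ratio` of pages.
    """
    # NFKC folds ligatures and non-breaking spaces into plain characters.
    # Note that it also folds characters such as superscripts, which may not
    # be desirable for every document.
    split_pages = [
        [ln.strip() for ln in unicodedata.normalize("NFKC", page).splitlines() if ln.strip()]
        for page in pages
    ]

    # Count, per page, the masked first and last lines.
    edge_counts = Counter()
    for lines in split_pages:
        if lines:
            edge_counts.update({_edge_key(lines[0]), _edge_key(lines[-1])})

    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, int(len(pages) * min_repeat_ratio))
    repeated = {key for key, count in edge_counts.items() if count >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and _edge_key(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _edge_key(lines[-1]) in repeated:
            lines = lines[:-1]
        # Collapse runs of spaces and tabs inside each remaining line.
        cleaned.append("\n".join(re.sub(r"[ \t]+", " ", ln) for ln in lines))
    return cleaned
```

Calling `clean_pages` on a list of page strings returns a list of the same length, so page numbers can still be attached to the structured output afterwards.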
Environment
    OCR_ENGINE=tesseract
    GOOGLE_VISION_API_KEY=your_key_optional
    OUTPUT_FORMAT=json
    BATCH_SIZE=10
    TEMP_DIR=./pdf_processing
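One way these variables might be read at startup is sketched below in Python. The `ExtractorConfig` class and `load_config` function are hypothetical names, and treating the values shown above as fallback defaults is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ExtractorConfig:
    ocr_engine: str
    vision_api_key: Optional[str]
    output_format: str
    batch_size: int
    temp_dir: Path


def load_config(env=None):
    """Build a config from environment variables (or any mapping, for tests)."""
    env = os.environ if env is None else env

    batch_size = int(env.get("BATCH_SIZE", "10"))
    if batch_size < 1:
        raise ValueError("BATCH_SIZE must be a positive integer")

    return ExtractorConfig(
        # Assumed defaults: the sample values from the Environment section.
        ocr_engine=env.get("OCR_ENGINE", "tesseract"),
        # The key is optional; an empty string is treated as "not set".
        vision_api_key=env.get("GOOGLE_VISION_API_KEY") or None,
        output_format=env.get("OUTPUT_FORMAT", "json"),
        batch_size=batch_size,
        temp_dir=Path(env.get("TEMP_DIR", "./pdf_processing")),
    )
```

Accepting a plain mapping in place of `os.environ` keeps the loader easy to test without touching the real process environment.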
Examples
Input:
    file: ./invoices/lithium_supplier_inv_2026.pdf
    mode: tables
    ocr: false
    output_format: csv
Output:
    Extraction complete.
    Pages processed: 3
    Tables found: 2
    Rows extracted: 47
    Output: ./output/lithium_supplier_inv_2026_tables.csv
    Confidence: 99.2% (digital PDF)
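Before any extraction runs, a request like the example input could be parsed and checked against the allowed modes and output formats listed under Instructions. The Python sketch below is illustrative only: `ExtractionRequest` and `parse_request` are hypothetical names, and the `key: value` line format is assumed from the example rather than specified anywhere.

```python
from dataclasses import dataclass

# Allowed values, taken from the Instructions section.
MODES = {"text", "tables", "images", "forms", "all"}
FORMATS = {"json", "csv", "markdown", "text"}


@dataclass(frozen=True)
class ExtractionRequest:
    file: str
    mode: str
    ocr: bool
    output_format: str

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.output_format not in FORMATS:
            raise ValueError(f"unknown output format: {self.output_format!r}")


def parse_request(text):
    """Parse 'key: value' lines into a validated request."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    ocr_raw = fields["ocr"].lower()
    if ocr_raw not in {"true", "false"}:
        raise ValueError(f"ocr must be true or false, got {fields['ocr']!r}")

    return ExtractionRequest(
        file=fields["file"],
        mode=fields["mode"],
        ocr=ocr_raw == "true",
        output_format=fields["output_format"],
    )
```

Rejecting an unknown mode or format up front avoids loading a large PDF only to fail at the output stage.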