Processing PDFs
Extract text and tables from PDF files, fill forms, and merge documents.
Quick Start
Extract text with pdfplumber:
python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
Features
- •Text extraction: Extract text from any page
- •Table extraction: Extract tables as structured data
- •Form filling: Fill PDF forms programmatically
- •Document merging: Combine multiple PDFs
- •Page manipulation: Split, rotate, and reorder pages
- •Metadata access: Read and modify PDF metadata
Common Operations
Extract All Text
python
def extract_all_text(pdf_path):
"""Extract all text from a PDF file."""
import pdfplumber
text_parts = []
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
text = page.extract_text()
if text:
text_parts.append(text)
return "\n\n".join(text_parts)
Extract Tables
python
def extract_tables(pdf_path):
"""Extract all tables from a PDF file."""
import pdfplumber
all_tables = []
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
tables = page.extract_tables()
all_tables.extend(tables)
return all_tables
Merge PDFs
python
def merge_pdfs(pdf_paths, output_path):
"""Merge multiple PDFs into one."""
from PyPDF2 import PdfMerger
merger = PdfMerger()
for path in pdf_paths:
merger.append(path)
merger.write(output_path)
merger.close()
Configuration
| Option | Type | Default | Description |
|---|---|---|---|
extract_images | bool | false | Also extract images from PDFs |
ocr_enabled | bool | false | Use OCR for scanned documents |
table_strategy | string | "lines" | Table detection strategy |
Dependencies
- •
pdfplumber- Text and table extraction - •
PyPDF2- PDF manipulation - •
pdf2image- Convert PDF to images (optional) - •
pytesseract- OCR support (optional)
API Reference
See REFERENCE.md for detailed API documentation.
Examples
See EXAMPLES.md for more usage examples.
Troubleshooting
Text extraction returns empty
- •Check if the PDF contains actual text (not scanned images)
- •Enable OCR for scanned documents:
ocr_enabled: true - •Try different extraction settings
Tables not detected correctly
- •Use
table_strategy: "text"for text-based tables - •Adjust table detection settings
- •Consider manual boundary specification