PDF Processing Skill
Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms.
Text Extraction
Using pdfplumber (recommended)
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
Using command line
```bash
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # Preserve layout
```
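When the command-line tool is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch, assuming `pdftotext` from poppler-utils is on `PATH`; the helper names `pdftotext_command` and `run_pdftotext` are hypothetical, while `-layout`, `-f`, and `-l` are standard `pdftotext` options.

```python
import subprocess


def pdftotext_command(src, dst, layout=False, first_page=None, last_page=None):
    """Build the argument list for pdftotext (poppler-utils)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")            # keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert (1-based)
    cmd += [src, dst]
    return cmd


def run_pdftotext(src, dst, **options):
    """Run pdftotext; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(pdftotext_command(src, dst, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.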
Table Extraction
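```python
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for i, table in enumerate(tables, start=1):
        # Treat the first row of each table as the header
        df = pd.DataFrame(table[1:], columns=table[0])
        # One file per table, so later tables do not overwrite earlier ones
        df.to_excel(f"table_{i}.xlsx", index=False)
```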
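Header rows extracted from PDFs often contain empty cells, repeated names, or embedded line breaks, which makes them awkward as DataFrame columns. The sketch below shows one possible cleanup step; `clean_header` and `clean_rows` are hypothetical helpers, not part of pdfplumber, and the `column_N` / `_N` naming scheme is an arbitrary choice.

```python
def clean_header(header):
    """Return unique, non-empty column names for an extracted header row."""
    names, seen = [], {}
    for i, cell in enumerate(header):
        # Collapse whitespace and line breaks; fall back to a positional name
        name = " ".join((cell or "").split()) or f"column_{i + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Suffix repeated names so every column label is distinct
        names.append(name if count == 0 else f"{name}_{count + 1}")
    return names


def clean_rows(rows):
    """Drop rows whose cells are all empty strings or None."""
    return [row for row in rows if any((cell or "").strip() for cell in row)]
```

With these in place, a table could be loaded as `pd.DataFrame(clean_rows(table[1:]), columns=clean_header(table[0]))`.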
PDF to Images
Using pdftoppm (poppler-utils)
```bash
# High quality JPEG
pdftoppm -jpeg -r 300 input.pdf output

# PNG format
pdftoppm -png -r 150 input.pdf output
```
Using Python
```python
from pdf2image import convert_from_path

images = convert_from_path('document.pdf', dpi=300)
for i, image in enumerate(images):
    image.save(f'page_{i+1}.png', 'PNG')
```
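Rendering every page of a long PDF at 300 dpi in one call can use a lot of memory. One way around this, sketched below, is to convert a few pages at a time with the `first_page` and `last_page` arguments of `convert_from_path`. The helpers `page_batches` and `convert_in_batches` are hypothetical, and the caller is assumed to already know the page count (for example from `len(PdfReader(path).pages)`).

```python
def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, total_pages, batch_size=10, dpi=300):
    """Convert a PDF to PNG files a few pages at a time."""
    # Imported here so page_batches stays usable without pdf2image installed
    from pdf2image import convert_from_path

    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(pdf_path, dpi=dpi,
                                   first_page=first, last_page=last)
        for offset, image in enumerate(images):
            image.save(f"page_{first + offset}.png", "PNG")
```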
Merging PDFs
Using pypdf
```python
from pypdf import PdfWriter

# PdfWriter.append replaces PdfMerger, which was removed in pypdf 5
writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.append("file3.pdf", pages=(0, 5))  # Only first 5 pages
writer.write("merged.pdf")
writer.close()
```
Using qpdf (command line)
```bash
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf
```
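When merging a whole folder, plain alphabetical sorting puts `file10.pdf` before `file2.pdf`. The sketch below merges in natural (numeric-aware) order instead. `natural_key` and `merge_directory` are hypothetical helper names; the merge itself relies only on `PdfWriter.append`.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs numerically, so file2 < file10."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text (even index) and digits (odd)
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


def merge_directory(folder, output_path):
    """Append every PDF in `folder` to one output file, in natural order."""
    from pypdf import PdfWriter  # third-party, imported lazily

    writer = PdfWriter()
    for path in sorted(Path(folder).glob("*.pdf"), key=lambda p: natural_key(p.name)):
        writer.append(str(path))
    writer.write(output_path)
    writer.close()
```

Writing the output outside `folder` keeps a previous merge result from being picked up as an input on the next run.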
Splitting PDFs
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i+1}.pdf")
```
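Splitting into single pages is not always what is wanted; often a specific set of pages should go into one file. The following sketch parses a 1-based page specification such as `"1-3,5,8-"` into the zero-based indices that pypdf uses. The spec syntax and both helper names are assumptions made for illustration.

```python
def parse_page_ranges(spec, total_pages):
    """Turn a 1-based spec such as "1-3,5,8-" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        first = int(start) if start else 1
        # "8-" runs to the last page; a bare "5" is a single page
        last = int(end) if end else (total_pages if sep else first)
        if not 1 <= first <= last <= total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        indices.extend(range(first - 1, last))
    return indices


def extract_pages(src, spec, dst):
    """Write the pages selected by `spec` from `src` into `dst`."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    writer.write(dst)
```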
Rotating Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)  # 90, 180, 270
    writer.add_page(page)
writer.write("rotated.pdf")
```
Creating PDFs with ReportLab
Simple Document
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World")
c.save()
```
Complex Document with Platypus
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Document Title", styles['Heading1']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body text here.", styles['Normal']))

# Table
data = [
    ['Column 1', 'Column 2', 'Column 3'],
    ['Row 1', 'Data', 'Data'],
    ['Row 2', 'Data', 'Data'],
]
table = Table(data)
story.append(table)

doc.build(story)
```
OCR for Scanned PDFs
```python
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page (each language in `lang` needs its Tesseract data installed)
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image, lang='eng+ara')
    print(f"--- Page {i+1} ---")
    print(text)
```
Command line
```bash
# Extract images from PDF
pdfimages -png input.pdf output

# OCR an image
tesseract output-000.png output -l eng+ara
```
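Many PDFs mix digital pages with scanned ones, and OCR is slow, so it can be worth running it only where direct extraction finds little text. The sketch below combines the pdfplumber and pytesseract snippets above under that assumption. The 20-character threshold and the helper names `needs_ocr` and `extract_with_ocr_fallback` are illustrative choices, not library features.

```python
def needs_ocr(text, min_chars=20):
    """Guess whether a page is scanned: little or no extractable text."""
    return len((text or "").strip()) < min_chars


def extract_with_ocr_fallback(pdf_path, lang="eng", dpi=300):
    """Return one string per page, using OCR only for text-poor pages."""
    # Third-party imports kept local so needs_ocr works on its own
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text):
                # Render just this page and run Tesseract on it
                image = convert_from_path(pdf_path, dpi=dpi,
                                          first_page=number, last_page=number)[0]
                text = pytesseract.image_to_string(image, lang=lang)
            pages.append(text or "")
    return pages
```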
Reading Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # None when the file has no document information
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Creator: {meta.creator}")
print(f"Pages: {len(reader.pages)}")
```
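For logging or JSON export, a plain dictionary is often easier to work with than the metadata object. A minimal sketch, assuming the object exposes the attribute names used above plus `subject` and `producer`; `metadata_to_dict` is a hypothetical helper, and the `SimpleNamespace` is only a stand-in for `reader.metadata` so the example runs without a PDF.

```python
from types import SimpleNamespace

METADATA_FIELDS = ("title", "author", "subject", "creator", "producer")


def metadata_to_dict(meta):
    """Copy common metadata attributes into a dict; missing values become None."""
    # getattr with a default also covers meta being None
    return {name: getattr(meta, name, None) for name in METADATA_FIELDS}


# Stand-in for reader.metadata (hypothetical values)
sample = SimpleNamespace(title="Report", author=None)
summary = metadata_to_dict(sample)
```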
Adding Watermarks
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
watermark = PdfReader("watermark.pdf").pages[0]
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
writer.write("watermarked.pdf")
```
Password Protection
Encrypt
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
writer.write("encrypted.pdf")
```
Decrypt
```python
from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
if reader.is_encrypted:
    reader.decrypt("password")
```
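A wrong password does not necessarily raise an error at the `decrypt` call, so the failure may only surface later when pages are read. The sketch below checks the return value instead, assuming `decrypt` returns a falsy value on failure (as pypdf's `PasswordType.NOT_DECRYPTED`, which equals 0, does). `decrypt_or_raise` is a hypothetical helper and `FakeReader` is a stand-in used only to exercise it.

```python
def decrypt_or_raise(reader, password):
    """Decrypt `reader` in place; raise ValueError if the password is rejected."""
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("wrong password")
    return reader


class FakeReader:
    """Hypothetical stand-in for PdfReader, for demonstration only."""
    is_encrypted = True

    def decrypt(self, password):
        # Mimic a non-zero result on success and 0 on failure
        return 1 if password == "secret" else 0


reader = decrypt_or_raise(FakeReader(), "secret")
```

With a real file, the call would look like `decrypt_or_raise(PdfReader("encrypted.pdf"), "password")`.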
Form Filling
Check for fillable fields
```python
from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields() or {}  # get_fields() returns None when there are no fields
for field_name, field_data in fields.items():
    print(f"{field_name}: {field_data}")
```
Fill fields
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("form.pdf")
writer = PdfWriter()
writer.append(reader)
writer.update_page_form_field_values(
    writer.pages[0],
    {"field_name": "value"}
)
writer.write("filled_form.pdf")
```
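A misspelled field name is easy to miss, because the output file is still written and the value simply does not appear. One way to catch this, sketched below, is to compare the values against the names returned by `get_fields()` before filling. `split_known_fields` is a hypothetical helper and the field names in the example are made up.

```python
def split_known_fields(available, values):
    """Separate `values` into entries the form has and names it lacks."""
    known = {name: value for name, value in values.items() if name in available}
    unknown = sorted(set(values) - set(available))
    return known, unknown


# `available` would normally be `reader.get_fields() or {}`
known, unknown = split_known_fields(
    {"name": {}, "email": {}},           # hypothetical form fields
    {"name": "Ada", "phone": "555-0100"},
)
```

Here `known` can be passed to `update_page_form_field_values`, and a non-empty `unknown` list can be reported to the user.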
Dependencies
| Library | Purpose | Install |
|---|---|---|
| pypdf | Basic operations | pip install pypdf |
| pdfplumber | Text/table extraction | pip install pdfplumber |
| pandas, openpyxl | Exporting tables to Excel | pip install pandas openpyxl |
| reportlab | PDF creation | pip install reportlab |
| pdf2image | PDF to images (requires poppler) | pip install pdf2image |
| pytesseract | OCR (requires the Tesseract binary) | pip install pytesseract |
| poppler-utils | CLI tools (pdftotext, pdftoppm, pdfimages) | System package |
| qpdf | Advanced CLI | System package |
Command Line Quick Reference
```bash
# Extract text
pdftotext input.pdf output.txt

# Convert to images
pdftoppm -jpeg -r 300 input.pdf output

# Extract images
pdfimages -png input.pdf output

# Merge files
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split by pages
qpdf input.pdf --pages . 1-5 -- first5.pdf

# Rotate
qpdf input.pdf rotated.pdf --rotate=90

# Decrypt
qpdf --decrypt --password=pass encrypted.pdf decrypted.pdf

# Linearize (optimize for web viewing)
qpdf --linearize input.pdf output.pdf
```
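The commands above depend on system packages that may not be installed. A small preflight check, sketched below with the standard library only, reports which tools are missing from `PATH` before a script starts work. `missing_tools` and the `CLI_TOOLS` tuple are hypothetical names; the tool list simply mirrors the commands used in this document.

```python
import shutil

CLI_TOOLS = ("pdftotext", "pdftoppm", "pdfimages", "qpdf", "tesseract")


def missing_tools(tools=CLI_TOOLS):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` at startup and printing the result gives a clearer failure than a later "command not found" in the middle of a batch.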