PDF Toolkit
Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms.
Prerequisites
- •Python 3.8+
- •Libraries: pypdf, pdfplumber, reportlab
- •Optional: pytesseract (for OCR), pdf2image, poppler-utils
Instructions
- •
Read and extract text from PDFs
pythonfrom pypdf import PdfReader reader = PdfReader("document.pdf") text = "" for page in reader.pages: text += page.extract_text() - •
Merge multiple PDFs
pythonfrom pypdf import PdfWriter, PdfReader writer = PdfWriter() for pdf_file in ["doc1.pdf", "doc2.pdf"]: reader = PdfReader(pdf_file) for page in reader.pages: writer.add_page(page) with open("merged.pdf", "wb") as output: writer.write(output) - •
Split PDF into individual pages
pythonreader = PdfReader("input.pdf") for i, page in enumerate(reader.pages): writer = PdfWriter() writer.add_page(page) with open(f"page_{i+1}.pdf", "wb") as output: writer.write(output) - •
Extract tables with pdfplumber
pythonimport pdfplumber with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: tables = page.extract_tables() for table in tables: for row in table: print(row) - •
Create new PDFs with reportlab
pythonfrom reportlab.lib.pagesizes import letter from reportlab.pdfgen import canvas c = canvas.Canvas("new.pdf", pagesize=letter) c.drawString(100, 750, "Hello World!") c.save() - •
OCR scanned PDFs
pythonimport pytesseract from pdf2image import convert_from_path images = convert_from_path('scanned.pdf') for image in images: text = pytesseract.image_to_string(image)
Error Handling
- •If PDF is password-protected, use
PdfReader("file.pdf", password="pass") - •If text extraction fails, try pdfplumber or OCR approach
- •For corrupted PDFs, use qpdf to attempt repair
Notes
- •Use pypdf for basic operations (merge, split, rotate)
- •Use pdfplumber for text/table extraction with layout
- •Use reportlab for creating new PDFs programmatically
- •Command-line tools: pdftotext, qpdf, pdftk for batch operations
Source: anthropics/skills