PDF Processing Skill Guide

Description

The PDF skill bundles tools and examples for common PDF processing tasks: extracting text and tables, merging and splitting files, adding watermarks, applying password protection, and creating new documents. It combines Python libraries (pypdf, pdfplumber, pytesseract, pdf2image, ReportLab) with command-line utilities (pdftotext, qpdf).

When to Use the Skill

•Extracting tables or text from a PDF for data analysis.
•Merging or splitting PDF documents for organization.
•Adding encryption, watermarks, or metadata to existing documents.
•Handling scanned PDFs via OCR to make them searchable or extract readable content.
•Creating brand-new PDF documents with dynamic content.

Usage Guide

Code Examples

Merge PDFs

python

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
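
Splitting can reuse the same reader/writer pair as the merge above, with one writer per output file. The following is a minimal sketch rather than part of the skill itself: the helper names `chunk_ranges` and `split_pdf`, the `chunk_size` parameter, and the `_partN.pdf` naming scheme are all assumptions made for illustration.

```python
from pathlib import Path


def chunk_ranges(n_pages: int, chunk_size: int) -> list[range]:
    """Return zero-based page index ranges, each covering at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, n_pages))
            for start in range(0, n_pages, chunk_size)]


def split_pdf(src: str, out_dir: str, chunk_size: int = 1) -> list[Path]:
    """Write src as consecutive files of chunk_size pages; return the paths written."""
    # Imported here so chunk_ranges stays usable where pypdf is not installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, pages in enumerate(chunk_ranges(len(reader.pages), chunk_size), start=1):
        writer = PdfWriter()
        for i in pages:
            writer.add_page(reader.pages[i])
        # Hypothetical naming scheme: <source stem>_part<N>.pdf
        target = out / f"{Path(src).stem}_part{n}.pdf"
        with open(target, "wb") as fh:
            writer.write(fh)
        written.append(target)
    return written
```

Calling `split_pdf("merged.pdf", "parts", chunk_size=2)` would write two-page files into a `parts` directory, with a shorter final file if the page count is odd.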

Extract Text from Scanned PDFs

python

import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

Command-Line Usage

Extract Text with `pdftotext`

bash

pdftotext input.pdf output.txt

Merge PDFs with `qpdf`

bash

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
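
When the command-line tools are driven from a Python workflow, it can help to build the argument list separately from running it, so the command can be inspected or logged first. The sketch below assumes that approach; the function names are hypothetical, and the only flags used are the ones shown above plus `-layout`, `-f`, and `-l` from pdftotext.

```python
import shutil
import subprocess


def pdftotext_command(src, dst, first_page=None, last_page=None, layout=False):
    """Build the argument list for pdftotext without running it."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert (1-based)
    return cmd + [src, dst]


def qpdf_merge_command(sources, dst):
    """Build the qpdf call that concatenates sources into dst."""
    if not sources:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *sources, "--", dst]


def run_tool(cmd):
    """Run a prepared command, failing early if the tool is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_tool(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))` runs the same merge as the qpdf command above. Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.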

Inputs and Outputs

Inputs

•File path: Path to the input PDF file(s) (e.g., input.pdf).
•Operation: The specific processing task (e.g., merge, split, extract).
•Additional arguments: Optional arguments such as passwords, metadata, or specific pages.
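
The "specific pages" argument needs some textual form before it can be applied to a reader. One possible convention, assumed here and not defined by the skill, is a 1-based spec such as "1-3,5". The hypothetical helper `parse_page_spec` below turns such a spec into the zero-based indices that `reader.pages[i]` expects.

```python
def parse_page_spec(spec: str, total_pages: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"page range {part!r} is outside 1-{total_pages}")
        # Convert the inclusive 1-based range to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)
```

Duplicate and overlapping ranges collapse into a single sorted list, so each selected page is emitted once.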

Outputs

•Result file: Path to the output PDF file (e.g., output.pdf).
•Extracted data: Text, tables, or other extracted formats depending on the task.
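
For the table case, pdfplumber returns each table as a list of rows, with `None` for empty cells. The sketch below shows one way to shape that output into records for analysis. It assumes the first row of every table is a header row, which will not hold for every document, and the function names are illustrative.

```python
def rows_to_records(table):
    """Convert one extracted table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells arrive as None; normalise them to empty strings.
        values = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, values)))
    return records


def extract_records(pdf_path):
    """Collect every table on every page of pdf_path as a flat list of records."""
    import pdfplumber  # imported lazily so rows_to_records works without it

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(rows_to_records(table))
    return records
```

The resulting list of dicts can be passed to `csv.DictWriter` or a DataFrame constructor for further analysis.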

Best Practices and Known Limitations

•Always ensure that Python libraries (pypdf, pdfplumber) and external tools (pdftotext, qpdf) are correctly installed in your environment.
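•OCR accuracy (e.g., with pytesseract) depends heavily on image resolution. Around 300 DPI is a common target; pdf2image's convert_from_path accepts a dpi argument to render pages at a higher resolution before OCR.
•Avoid Unicode subscript and superscript characters (such as ₂ or ²) in ReportLab text: the built-in fonts lack these glyphs, so they do not render correctly. Use the <sub> and <super> markup tags inside a Paragraph instead.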
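
The installation point above can be checked programmatically before any processing starts. This is a minimal sketch using only the standard library. The function name `missing_dependencies` and the two lists are illustrative; the import names are assumed to match the installed packages, and the `tesseract` and `pdftoppm` entries reflect the external programs that pytesseract and pdf2image normally call.

```python
import importlib.util
import shutil


def missing_dependencies(modules, tools):
    """Return the names of Python modules and executables that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


# Names taken from this guide; adjust to the operations you actually use.
REQUIRED_MODULES = ["pypdf", "pdfplumber", "pytesseract", "pdf2image", "reportlab"]
# pdftotext and qpdf appear in this guide; tesseract and pdftoppm are assumed
# back ends for pytesseract and pdf2image respectively.
REQUIRED_TOOLS = ["pdftotext", "qpdf", "tesseract", "pdftoppm"]
```

`missing_dependencies(REQUIRED_MODULES, REQUIRED_TOOLS)` returns an empty list when everything is available, and otherwise the names to install.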

Example Workflows

Creating PDFs from Scratch

python

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello_world.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, World!")
c.save()
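
Text that contains subscripts or superscripts is better written through a Paragraph with markup than with `drawString`, given the font limitation noted under Best Practices. The sketch below is one way to do that, under stated assumptions: `to_markup` and `write_formula_pdf` are hypothetical helpers, and only the digit characters are converted.

```python
from xml.sax.saxutils import escape

# Unicode subscript digits are contiguous (U+2080..U+2089).
SUBSCRIPTS = {chr(0x2080 + d): str(d) for d in range(10)}
# Superscript digits are not contiguous, so they are listed explicitly (0..9).
SUPERSCRIPTS = dict(zip(
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
    "0123456789",
))


def to_markup(text: str) -> str:
    """Replace Unicode sub/superscript digits with <sub>/<super> markup."""
    out = []
    for ch in text:
        if ch in SUBSCRIPTS:
            out.append(f"<sub>{SUBSCRIPTS[ch]}</sub>")
        elif ch in SUPERSCRIPTS:
            out.append(f"<super>{SUPERSCRIPTS[ch]}</super>")
        else:
            # Paragraph text is XML-like, so escape &, < and > in ordinary text.
            out.append(escape(ch))
    return "".join(out)


def write_formula_pdf(path: str, lines: list[str]) -> None:
    """Write each line as a Paragraph so the markup is rendered by ReportLab."""
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    style = getSampleStyleSheet()["Normal"]
    doc = SimpleDocTemplate(path, pagesize=letter)
    doc.build([Paragraph(to_markup(line), style) for line in lines])
```

With this, `to_markup("H₂O")` yields `H<sub>2</sub>O`, which a Paragraph renders as a real subscript using the standard fonts.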

Adding Watermarks to Pages

python

from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
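
Password protection and metadata follow the same reader-to-writer pattern as the watermark example. The sketch below assumes a recent pypdf release, where `PdfWriter.encrypt` takes `user_password` and `owner_password`; the wrapper name `protect_pdf` and its parameters are hypothetical.

```python
def protect_pdf(src, dst, user_password, owner_password=None, title=None):
    """Copy src to dst with password protection and, optionally, a document title."""
    # Imported inside the function so this module loads even where pypdf is absent.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    if title is not None:
        # Metadata keys use PDF name syntax, hence the leading slash.
        writer.add_metadata({"/Title": title})
    # Without an owner password, pypdf reuses the user password for both roles.
    writer.encrypt(user_password=user_password, owner_password=owner_password)
    with open(dst, "wb") as output:
        writer.write(output)
```

A call such as `protect_pdf("document.pdf", "protected.pdf", "secret", title="Report")` writes an encrypted copy that prompts for the password when opened. Recent pypdf versions also appear to accept an `algorithm` argument for stronger ciphers, which may need an additional cryptography package; check the documentation for the version you have installed.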

Version History and Changelog

Version	Date	Description
1.0.0	2022-02-09	Initial release with examples, usage guides, and best practices
