PDF Processing Skill

Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms.


Text Extraction

Using pdfplumber (recommended)

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Using command line

bash
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # Preserve layout

Table Extraction

python
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

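    # Write each table to its own file so later tables do not overwrite earlier ones
    for i, table in enumerate(tables, start=1):
        df = pd.DataFrame(table[1:], columns=table[0])
        df.to_excel(f"table_{i}.xlsx", index=False)  # to_excel requires openpyxl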
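
Tables extracted from real documents often contain empty cells (returned as `None`) and blank or repeated header names, which make awkward DataFrame columns. The sketch below shows one way to tidy a table before building the DataFrame. `clean_table` and the `column_N` naming scheme are hypothetical helpers for illustration, not part of pdfplumber.

```python
def clean_table(table):
    """Normalize one table as returned by page.extract_tables().

    Returns (header, rows). Empty cells (None) become "", cell text is
    stripped, and blank or repeated header names are made unique.
    """
    if not table:
        return [], []

    def norm(cell):
        return "" if cell is None else str(cell).strip()

    header, seen = [], {}
    for idx, cell in enumerate(table[0]):
        # Blank header cells get a positional placeholder name.
        name = norm(cell) or f"column_{idx + 1}"
        seen[name] = seen.get(name, 0) + 1
        # Second and later occurrences get a numeric suffix.
        header.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

    rows = [[norm(cell) for cell in row] for row in table[1:]]
    return header, rows
```

The result can be passed on as `pd.DataFrame(rows, columns=header)`.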

PDF to Images

Using pdftoppm (poppler-utils)

bash
# High quality JPEG
pdftoppm -jpeg -r 300 input.pdf output

# PNG format
pdftoppm -png -r 150 input.pdf output
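
When these conversions are driven from a script, it can help to build the `pdftoppm` argument list in one place and fail early if the tool is missing. The following is a minimal sketch under that assumption. The function names are hypothetical, and only the `-png`, `-jpeg`, `-r`, `-f` and `-l` options are covered.

```python
import shutil
import subprocess


def build_pdftoppm_cmd(pdf_path, prefix, fmt="png", dpi=150, first=None, last=None):
    """Build the argument list for a pdftoppm call (does not run it)."""
    if fmt not in ("png", "jpeg"):
        raise ValueError(f"unsupported format: {fmt}")
    cmd = ["pdftoppm", f"-{fmt}", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert (1-based)
    if last is not None:
        cmd += ["-l", str(last)]   # last page to convert (1-based)
    return cmd + [str(pdf_path), str(prefix)]


def run_pdftoppm(pdf_path, prefix, **options):
    """Run pdftoppm, raising if the tool is missing or exits non-zero."""
    if shutil.which("pdftoppm") is None:
        raise FileNotFoundError("pdftoppm not found; install poppler-utils")
    subprocess.run(build_pdftoppm_cmd(pdf_path, prefix, **options), check=True)
```

For example, `run_pdftoppm("input.pdf", "output", fmt="jpeg", dpi=300)` issues the same command as the high quality JPEG line above.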

Using Python

python
from pdf2image import convert_from_path

images = convert_from_path('document.pdf', dpi=300)
for i, image in enumerate(images):
    image.save(f'page_{i+1}.png', 'PNG')

Merging PDFs

Using pypdf

python
from pypdf import PdfWriter  # PdfMerger is deprecated; PdfWriter.append replaces it

writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.append("file3.pdf", pages=(0, 5))  # Only first 5 pages
writer.write("merged.pdf")
writer.close()
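
The `pages=(0, 5)` tuple is 0-based and excludes the stop index, like Python's `range`, while people usually talk about pages as 1-based and inclusive ("pages 1 to 5"). A small conversion helper avoids off-by-one mistakes. `to_pypdf_range` is a hypothetical name used here for illustration.

```python
def to_pypdf_range(first_page, last_page):
    """Convert a 1-based inclusive page range to a 0-based (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"invalid page range: {first_page}-{last_page}")
    # The stop index is exclusive, so the 1-based last page is used unchanged.
    return (first_page - 1, last_page)
```

The result can be passed as the `pages=` argument of `append`, for example `pages=to_pypdf_range(1, 5)` for the first five pages.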

Using qpdf (command line)

bash
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf

Splitting PDFs

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")

for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i+1}.pdf")
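
The loop above writes one file per page. To split into fixed-size parts instead, the page indices can be grouped first. This is a sketch only: `chunk_ranges` is a hypothetical helper, and the commented usage assumes the `reader` object from the example above.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return 0-based (start, stop) index pairs covering total_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# Usage with pypdf (assumes `reader` and `PdfWriter` from the example above):
# for n, (start, stop) in enumerate(chunk_ranges(len(reader.pages), 10), 1):
#     writer = PdfWriter()
#     for index in range(start, stop):
#         writer.add_page(reader.pages[index])
#     writer.write(f"part_{n}.pdf")
```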

Rotating Pages

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)  # Clockwise; the angle must be a multiple of 90
    writer.add_page(page)

writer.write("rotated.pdf")
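
If the angle comes from user input, it may be worth validating it before calling `rotate`, since only multiples of 90 are meaningful here. A minimal sketch, with `normalize_rotation` as a hypothetical helper name:

```python
def normalize_rotation(angle):
    """Map any multiple of 90 (including negatives) to 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360
```

A result of 0 means the page can be added to the writer unchanged.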

Creating PDFs with ReportLab

Simple Document

python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World")
c.save()

Complex Document with Platypus

python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Document Title", styles['Heading1']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body text here.", styles['Normal']))

# Table
data = [
    ['Column 1', 'Column 2', 'Column 3'],
    ['Row 1', 'Data', 'Data'],
    ['Row 2', 'Data', 'Data'],
]
table = Table(data)
story.append(table)

doc.build(story)
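
In practice the table data often starts out as a list of records rather than a hand-written list of lists. The sketch below converts a list of dicts into the header-plus-rows shape that `Table(data)` is given above. `records_to_table_data` is a hypothetical helper, and converting every value with `str` is an assumption made for simplicity.

```python
def records_to_table_data(records, columns=None):
    """Turn a list of dicts into [header, row, row, ...] for Table(data)."""
    if not records:
        return []
    if columns is None:
        # Default to the key order of the first record.
        columns = list(records[0].keys())
    data = [list(columns)]
    for record in records:
        # Missing keys become empty cells.
        data.append([str(record.get(col, "")) for col in columns])
    return data
```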

OCR for Scanned PDFs

python
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
for i, image in enumerate(images):
    # lang takes Tesseract language codes; each language pack (here eng and ara) must be installed
    text = pytesseract.image_to_string(image, lang='eng+ara')
    print(f"--- Page {i+1} ---")
    print(text)
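
OCR is slow compared with reading an existing text layer, so it can be worth trying normal text extraction first and falling back to OCR only for pages that yield little or no text. The check below is a rough heuristic sketch. `needs_ocr` and the 20-character threshold are assumptions to tune per document set.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is scanned, based on how little text was extracted."""
    if extracted_text is None:
        return True
    # Ignore whitespace so pages with only blank lines still count as empty.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

With pdfplumber, this could be applied to the result of `page.extract_text()` for each page before deciding whether to run pytesseract on it.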

Command line

bash
# Extract images from PDF
pdfimages -png input.pdf output

# OCR an image
tesseract output-000.png output -l eng+ara

Reading Metadata

python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata

print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Creator: {meta.creator}")
print(f"Pages: {len(reader.pages)}")
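
Not every PDF carries a document information dictionary, and `reader.metadata` can be `None` in that case, which would make the attribute access above fail. A defensive sketch, with `metadata_summary` as a hypothetical helper and the field list chosen for illustration:

```python
def metadata_summary(meta, fields=("title", "author", "creator", "producer")):
    """Collect metadata attributes into a dict, tolerating a missing object."""
    if meta is None:
        return {name: None for name in fields}
    # getattr with a default also covers attributes the object does not define.
    return {name: getattr(meta, name, None) for name in fields}
```

Calling `metadata_summary(reader.metadata)` then always returns a dict with the same keys.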

Adding Watermarks

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
watermark = PdfReader("watermark.pdf").pages[0]

writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

writer.write("watermarked.pdf")

Password Protection

Encrypt

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")
writer.write("encrypted.pdf")

Decrypt

python
from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
if reader.is_encrypted:
    reader.decrypt("password")
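
The snippet above does not check whether the password was accepted. In pypdf, `decrypt` reports the outcome through its return value, so a wrapper can turn a rejected password into an explicit error. This is a sketch: `decrypt_or_raise` is a hypothetical name, and it assumes the return value converts to the integer 0 when decryption fails.

```python
def decrypt_or_raise(reader, password):
    """Decrypt `reader` in place if needed; raise if the password is rejected."""
    if not reader.is_encrypted:
        return reader
    result = reader.decrypt(password)
    # Assumption: the returned value is 0 when the password did not match.
    if int(result) == 0:
        raise ValueError("wrong password: PDF could not be decrypted")
    return reader
```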

Form Filling

Check for fillable fields

python
from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields() or {}  # get_fields() returns None when the PDF has no form fields

for field_name, field_data in fields.items():
    print(f"{field_name}: {field_data}")

Fill fields

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("form.pdf")
writer = PdfWriter()

writer.append(reader)
writer.update_page_form_field_values(
    writer.pages[0],
    {"field_name": "value"}
)
writer.write("filled_form.pdf")
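
A misspelled field name is easy to miss, because a value for a field that does not exist has nothing to update. Comparing the values against the names reported by `get_fields()` first makes such mistakes visible. A minimal sketch, with `split_known_fields` as a hypothetical helper:

```python
def split_known_fields(available_names, values):
    """Split `values` into fields that exist in the form and names that do not."""
    available = set(available_names)
    known = {name: value for name, value in values.items() if name in available}
    unknown = sorted(name for name in values if name not in available)
    return known, unknown
```

The `known` dict can then be passed to `update_page_form_field_values`, and a non-empty `unknown` list can be logged or treated as an error.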

Dependencies

| Library | Purpose | Install |
|---|---|---|
| pypdf | Basic operations | pip install pypdf |
| pdfplumber | Text/table extraction | pip install pdfplumber |
| reportlab | PDF creation | pip install reportlab |
| pdf2image | PDF to images | pip install pdf2image |
| pytesseract | OCR | pip install pytesseract |
| poppler-utils | CLI tools | System package |
| qpdf | Advanced CLI | System package |
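
Because the toolkit mixes Python packages with system binaries, a quick availability check can save time before running any of the examples. The sketch below uses only the standard library. The module and tool lists are assumptions based on the names used in this document, including the `tesseract` binary that pytesseract calls.

```python
import importlib.util
import shutil

PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "pdf2image", "pytesseract"]
CLI_TOOLS = ["pdftotext", "pdftoppm", "pdfimages", "qpdf", "tesseract"]


def check_dependencies():
    """Return a dict mapping each dependency name to True if it is available."""
    status = {}
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    for tool in CLI_TOOLS:
        # which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name:12} {'ok' if available else 'MISSING'}")
```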

Command Line Quick Reference

bash
# Extract text
pdftotext input.pdf output.txt

# Convert to images
pdftoppm -jpeg -r 300 input.pdf output

# Extract images
pdfimages -png input.pdf output

# Merge files
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split by pages
qpdf input.pdf --pages . 1-5 -- first5.pdf

# Rotate all pages 90 degrees clockwise
qpdf input.pdf --rotate=+90 rotated.pdf

# Decrypt
qpdf --decrypt --password=pass encrypted.pdf decrypted.pdf

# Linearize (optimize for fast web viewing)
qpdf --linearize input.pdf output.pdf