Simple PDF Skill
Quick guide for PDF processing using Python libraries.
Library Selection Guide
Choose the right library based on your task:
| Task | Library | Guide |
|---|---|---|
| Create new PDFs | reportlab | reportlab-guide.md |
| Edit existing PDFs | PyMuPDF (fitz) | pymupdf-guide.md |
| Extract text/tables | pdfplumber or PyMuPDF | pdfplumber-guide.md |
| Merge/split PDFs | PyMuPDF or pypdf | pymupdf-guide.md |
| Add annotations | PyMuPDF | pymupdf-guide.md |
| Extract images | PyMuPDF | pymupdf-guide.md |
| Render to images | pypdfium2 | pypdfium2-guide.md |
| Password protection | pypdf | pypdf-guide.md |
| Generate charts | matplotlib + reportlab | chart-guide.md |
Quick Start Workflow
1. Identify the Task Type
Creating PDFs:
- Use `reportlab` for new documents from scratch
- See reportlab-guide.md for the complete API
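As a minimal sketch of creating a document from scratch, the following draws one line of text on an A4 page with reportlab's canvas API. The function names `points_from_inches` and `create_hello_pdf` and the output path are illustrative assumptions, not part of reportlab-guide.md.

```python
def points_from_inches(inches: float) -> float:
    """Convert inches to PDF points (1 inch = 72 points)."""
    return inches * 72.0


def create_hello_pdf(path: str) -> None:
    """Write a one-page A4 PDF containing a single line of text."""
    # Imported lazily so the helper above works without reportlab installed.
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    width, height = A4  # page size in points
    c = canvas.Canvas(path, pagesize=A4)
    c.setFont("Helvetica", 14)  # font size is a plain number in points
    # The origin is the bottom-left corner, so subtract from the page
    # height to place text one inch below the top edge.
    c.drawString(points_from_inches(1), height - points_from_inches(1), "Hello, PDF")
    c.showPage()
    c.save()  # nothing is written to disk until save() is called
```

Calling `create_hello_pdf("hello.pdf")` should produce a single-page file; the path is a placeholder.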
Editing Existing PDFs:
- Use `PyMuPDF (fitz)` for any modifications
- Common edits: highlights, annotations, watermarks, merging, splitting
- See pymupdf-guide.md
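Two of the common edits listed above, highlighting and merging, might look roughly like this with PyMuPDF. The helper names `highlight_text` and `merge_pdfs` are hypothetical, and the sketch assumes the output path differs from the input path.

```python
def highlight_text(src_path: str, dst_path: str, needle: str) -> int:
    """Highlight every occurrence of `needle`; return how many were marked."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    doc = fitz.open(src_path)
    try:
        count = 0
        for page in doc:
            # search_for returns one rectangle per match on the page
            for rect in page.search_for(needle):
                page.add_highlight_annot(rect)
                count += 1
        doc.save(dst_path)  # assumes dst_path is not the same file as src_path
        return count
    finally:
        doc.close()


def merge_pdfs(paths: list[str], dst_path: str) -> None:
    """Append all pages of each input PDF, in order, into one output file."""
    import fitz

    merged = fitz.open()  # a new, empty PDF
    try:
        for path in paths:
            src = fitz.open(path)
            try:
                merged.insert_pdf(src)
            finally:
                src.close()
        merged.save(dst_path)
    finally:
        merged.close()
```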
Extracting Content:
- Text extraction: `pdfplumber` or `PyMuPDF`
- Table extraction: `pdfplumber` (better for tables)
- Image extraction: `PyMuPDF`
- See pdfplumber-guide.md or pymupdf-guide.md
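For text and table extraction, a sketch along these lines may be a reasonable starting point. `extract_text_and_tables` and `table_to_dicts` are illustrative names, and treating the first table row as a header is an assumption that will not hold for every PDF.

```python
def table_to_dicts(table):
    """Turn a table (list of rows) into dicts keyed by the first row.

    Assumes the first row is a header, which is not true of every table.
    """
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def extract_text_and_tables(path):
    """Return (list of page texts, list of tables) for the whole document."""
    import pdfplumber  # imported lazily so table_to_dicts works without it

    texts, tables = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            texts.append(page.extract_text() or "")
            tables.extend(page.extract_tables())
    return texts, tables
```

Pages that come back empty are often scanned images, in which case OCR is the usual fallback.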
2. Special Considerations
Chinese Text Support:
- CRITICAL: Default fonts do not support Chinese
- Must register a Chinese font before use in reportlab
- See reportlab-guide.md → Chinese Font Support section
- Recommended fonts: WQY Microhei (4.4MB), Noto Sans SC (15MB)
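Registering a font before drawing Chinese text could be sketched as follows. The font path is whatever TrueType file is available on the system, and `contains_cjk`, `register_chinese_font`, and the default name `"ChineseFont"` are hypothetical. The CJK check covers only the main Unified Ideographs block, so it is a rough heuristic rather than a complete test.

```python
def contains_cjk(text: str) -> bool:
    """True if any character falls in CJK Unified Ideographs (U+4E00..U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def register_chinese_font(font_path: str, font_name: str = "ChineseFont") -> str:
    """Register a TrueType font with reportlab and return the registered name.

    font_path must point to a real font file; no path is assumed here.
    """
    # Imported lazily so contains_cjk works without reportlab installed.
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    pdfmetrics.registerFont(TTFont(font_name, font_path))
    return font_name
```

After registration, the returned name can be passed to `canvas.setFont(name, size)` before any call that draws Chinese text.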
Performance:
- For large PDFs, process in chunks
- Use `fitz` (PyMuPDF) for best performance on editing tasks
- Use `pdfplumber` for reliable text extraction
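One way to read "process in chunks" is to walk the document in fixed-size page ranges and yield results as they are produced, rather than holding every page's output in memory. The sketch below assumes that reading; `page_chunks`, `extract_text_in_chunks`, and the default chunk size of 50 are illustrative choices.

```python
def page_chunks(total_pages: int, chunk_size: int):
    """Yield (start, stop) page index pairs covering 0..total_pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def extract_text_in_chunks(path: str, chunk_size: int = 50):
    """Yield the combined text of each page chunk, one chunk at a time."""
    import fitz  # PyMuPDF; imported lazily so page_chunks works without it

    doc = fitz.open(path)
    try:
        for start, stop in page_chunks(doc.page_count, chunk_size):
            yield "\n".join(doc.load_page(i).get_text() for i in range(start, stop))
    finally:
        doc.close()  # runs when the generator is exhausted or closed
```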
3. Implementation Reference
For implementation patterns and examples:
- Code patterns: PATTERNS.md
- Complete examples: EXAMPLES.md
- Real-world scenarios: SCENARIOS.md
- Workflow details: WORKFLOWS.md
Installation
Install required libraries:
```bash
pip install reportlab
pip install pymupdf
pip install pdfplumber
pip install pypdf
pip install pypdfium2
pip install fonttools  # For TTF font extraction
```
For advanced features (OCR, CLI tools):
```bash
# For OCR (scanned PDFs)
pip install pytesseract pdf2image
sudo apt-get install tesseract-ocr

# For command-line tools
sudo apt-get install poppler-utils
sudo apt-get install qpdf
```
Key Rules
- reportlab: Canvas coordinates start at (0,0) in the bottom-left corner; font sizes are plain numbers in points, and `inch` from `reportlab.lib.units` is used for positioning
- PyMuPDF: Uses RGB tuples (0-1 range), not 0-255
- Always: Call `save()` to finalize documents, and close documents to free resources
- Chinese fonts: ALWAYS register Chinese fonts before using Chinese text in reportlab
- Large PDFs: Process in chunks to avoid memory issues
- Encrypted PDFs: Handle gracefully with proper password management
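Two of these rules lend themselves to small helpers: converting 0-255 colors to the 0-1 range PyMuPDF expects, and opening a possibly encrypted file with pypdf. This is a sketch under assumptions: `rgb255_to_unit` and `open_maybe_encrypted` are hypothetical names, and raising `ValueError` on a missing or wrong password is just one possible way to "handle gracefully".

```python
def rgb255_to_unit(rgb):
    """Convert an (R, G, B) tuple in 0-255 to the 0-1 floats PyMuPDF uses."""
    r, g, b = rgb
    return (r / 255, g / 255, b / 255)


def open_maybe_encrypted(path: str, password=None):
    """Return a pypdf PdfReader, decrypting first if the file is encrypted."""
    from pypdf import PdfReader  # imported lazily so the helper above has no deps

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password does not match
        if password is None or not reader.decrypt(password):
            raise ValueError(f"Cannot open {path}: missing or incorrect password")
    return reader
```

A caller would typically catch the `ValueError`, prompt for or look up a password, and retry, instead of letting the failure surface deep inside page processing.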