PDF Tools Guide
Rules and usage for PDF tools in [WORKSPACE_FOLDER]/.tools/.
MUST-NOT-FORGET
- Check existing conversions before converting
- Default output: .tools/_pdf_to_jpg_converted/[PDF_FILENAME]/
- Use 150 DPI for screen, 300 DPI for OCR
- Two-pass downsizing: Ghostscript (images) then QPDF (structure)
PDF to JPG Conversion
Script: DevSystemV2/skills/pdf-tools/convert-pdf-to-jpg.py
Converts PDF pages to JPG images for vision analysis.
Usage:
python DevSystemV2/skills/pdf-tools/convert-pdf-to-jpg.py <input.pdf> [--output <dir>] [--dpi <dpi>] [--pages <range>]
Examples:
python DevSystemV2/skills/pdf-tools/convert-pdf-to-jpg.py invoice.pdf
python DevSystemV2/skills/pdf-tools/convert-pdf-to-jpg.py invoice.pdf --dpi 200 --pages 1-2
python DevSystemV2/skills/pdf-tools/convert-pdf-to-jpg.py invoice.pdf --pages 1
Output Convention:
- Default output: .tools/_pdf_to_jpg_converted/[PDF_FILENAME]/
- Each PDF gets its own subfolder named after the PDF file (without extension)
- Files named: [PDF_FILENAME]_page001.jpg, [PDF_FILENAME]_page002.jpg, etc.
- Check for existing subfolder to skip re-conversion
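The output convention can be sketched in a few lines of Python. This is only an illustration, not code taken from convert-pdf-to-jpg.py: the helper names (conversion_dir, page_image_name, already_converted) are hypothetical, and treating "at least one JPG in the subfolder" as proof of an existing conversion is an assumption.

```python
from pathlib import Path

# Default root, taken from the output convention above.
DEFAULT_ROOT = Path(".tools/_pdf_to_jpg_converted")


def conversion_dir(pdf_path, root=DEFAULT_ROOT):
    """Return the per-PDF subfolder: root / PDF file name without extension."""
    return Path(root) / Path(pdf_path).stem


def page_image_name(pdf_path, page_number):
    """Return the expected file name for one page, e.g. invoice_page001.jpg."""
    return f"{Path(pdf_path).stem}_page{page_number:03d}.jpg"


def already_converted(pdf_path, root=DEFAULT_ROOT):
    """Assumption: a subfolder holding at least one JPG counts as converted."""
    folder = conversion_dir(pdf_path, root)
    return folder.is_dir() and any(folder.glob("*.jpg"))
```

Calling already_converted("invoice.pdf") before starting a conversion is one way to honor the "check existing conversions" rule.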
Parameters:
- --output: Output directory (default: .tools/_pdf_to_jpg_converted/)
- --dpi: Resolution (default: 150)
- --pages: Page range - "1", "1-3", or "all" (default: all)
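A minimal sketch of how the three --pages forms could be interpreted. The function parse_pages is hypothetical and is not taken from the script; rejecting out-of-bounds ranges with a ValueError is an assumption about sensible behavior, not documented behavior.

```python
def parse_pages(spec, page_count):
    """Turn "1", "1-3" or "all" into a list of 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))
    # "1-3" splits into ("1", "-", "3"); a single page "1" gives ("1", "", "").
    first, sep, last = spec.partition("-")
    start = int(first)
    end = int(last) if sep else start
    # Assumption: ranges outside the document are rejected, not clipped.
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range {spec!r} is outside 1-{page_count}")
    return list(range(start, end + 1))
```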
7-Zip CLI Tools
Location: .tools/7z/
7-Zip is required to extract Ghostscript from its NSIS installer. The standalone 7za.exe cannot extract NSIS archives.
Extract archive
& ".tools/7z/7z.exe" x -y -o"output_folder" "archive.zip"
Extract NSIS installer (like Ghostscript)
& ".tools/7z/7z.exe" x -y -o"output_folder" "installer.exe"
List archive contents
& ".tools/7z/7z.exe" l "archive.zip"
Poppler CLI Tools
Location: .tools/poppler/Library/bin/
pdftoppm - PDF to Image
& ".tools/poppler/Library/bin/pdftoppm.exe" -jpeg -r 150 "input.pdf" "output_prefix"
pdftotext - Extract Text
& ".tools/poppler/Library/bin/pdftotext.exe" "input.pdf" "output.txt"
pdfinfo - Get PDF Metadata
& ".tools/poppler/Library/bin/pdfinfo.exe" "input.pdf"
pdfseparate - Split PDF Pages
& ".tools/poppler/Library/bin/pdfseparate.exe" "input.pdf" "output_%d.pdf"
pdfunite - Merge PDFs
& ".tools/poppler/Library/bin/pdfunite.exe" "page1.pdf" "page2.pdf" "merged.pdf"
QPDF CLI Tools
Location: .tools/qpdf/bin/
Merge PDFs
& ".tools/qpdf/bin/qpdf.exe" --empty --pages file1.pdf file2.pdf -- merged.pdf
Split PDF (extract pages)
& ".tools/qpdf/bin/qpdf.exe" input.pdf --pages . 1-5 -- output.pdf
Decrypt PDF
& ".tools/qpdf/bin/qpdf.exe" --decrypt --password=secret encrypted.pdf decrypted.pdf
Repair PDF
& ".tools/qpdf/bin/qpdf.exe" --replace-input damaged.pdf
Linearize (optimize for web)
& ".tools/qpdf/bin/qpdf.exe" --linearize input.pdf output.pdf
Smart PDF Compression Script
Script: DevSystemV2/skills/pdf-tools/compress-pdf.py
Intelligent PDF compression that analyzes the content and applies the best-suited strategy.
Usage:
python DevSystemV2/skills/pdf-tools/compress-pdf.py <input.pdf> [--compression high|medium|low] [--output output.pdf]
Examples:
python DevSystemV2/skills/pdf-tools/compress-pdf.py report.pdf
python DevSystemV2/skills/pdf-tools/compress-pdf.py report.pdf --compression high
python DevSystemV2/skills/pdf-tools/compress-pdf.py report.pdf --compression low --output archive.pdf
Compression levels:
- high: Target 50%+ reduction, aggressive (72 DPI)
- medium: Target 25%+ reduction, balanced (150 DPI)
- low: Target 10%+ reduction, preserve quality (300 DPI)
Features:
- Analyzes PDF structure (images, DPI, format, optimization status)
- Predicts compression potential before processing
- Escalates to more aggressive strategies if target not reached
- Reverts to previous result if aggressive approach shows insufficient improvement
Output: .tools/_pdf_output/[PDF_FILENAME]_compressed.pdf
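The escalate-then-revert behavior listed under Features might look roughly like the sketch below. It is not the logic of compress-pdf.py. The targets come from the compression levels above; the function names, the strategy list, and the min_gain threshold for "insufficient improvement" are all hypothetical.

```python
# Target reductions per level, as listed under "Compression levels".
TARGETS = {"high": 0.50, "medium": 0.25, "low": 0.10}


def reduction(original_size, new_size):
    """Fraction of the original size that was saved (0.0 to 1.0)."""
    return 1 - new_size / original_size


def compress_with_escalation(original_size, level, strategies, min_gain=0.05):
    """Try strategies from gentle to aggressive; return (name, size).

    `strategies` is a list of (name, callable) pairs; each callable returns
    the resulting file size. `min_gain` is a hypothetical threshold: a step
    that shrinks the current best by less than this fraction is discarded.
    """
    target = TARGETS[level]
    best_name, best_size = "original", original_size
    for name, run in strategies:
        size = run()
        if reduction(best_size, size) < min_gain:
            continue  # insufficient improvement: keep the previous result
        best_name, best_size = name, size
        if reduction(original_size, best_size) >= target:
            break  # target reached: no need to escalate further
    return best_name, best_size
```

With stand-in strategies returning sizes 80 and then 40 for a 100-unit input, level "high" escalates to the second strategy, while level "low" stops after the first.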
Simple PDF Downsizing Script
Script: DevSystemV2/skills/pdf-tools/downsize-pdf-images.py
Direct Ghostscript wrapper for manual DPI control.
Usage:
python DevSystemV2/skills/pdf-tools/downsize-pdf-images.py <input.pdf> [--output <dir>] [--dpi <dpi>] [--preset <preset>]
Parameters:
- --output: Output directory (default: .tools/_pdf_output/)
- --dpi: Resolution (default: 150)
- --preset: Quality preset - screen (72), ebook (150), printer (300), prepress (300)
Output: .tools/_pdf_output/[PDF_FILENAME]_[DPI]dpi.pdf
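The parameters and output naming can be illustrated as follows. The preset resolutions and the file name pattern come from the text above. The helper names are hypothetical, and the rule that a preset, when given, takes precedence over --dpi is an assumption; the guide does not say how the two interact.

```python
from pathlib import Path

# Resolutions as listed for --preset above.
PRESET_DPI = {"screen": 72, "ebook": 150, "printer": 300, "prepress": 300}


def effective_dpi(dpi=150, preset=None):
    """Assumption: a preset, when given, decides the DPI; else --dpi applies."""
    if preset is None:
        return dpi
    return PRESET_DPI[preset]


def downsized_path(pdf_path, dpi, out_dir=".tools/_pdf_output"):
    """Build [PDF_FILENAME]_[DPI]dpi.pdf inside the output directory."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_{dpi}dpi.pdf"
```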
Ghostscript CLI Tools
Location: .tools/gs/bin/
Compress images (downsize PDF)
& ".tools/gs/bin/gswin64c.exe" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dDownsampleColorImages=true -dColorImageResolution=72 -dDownsampleGrayImages=true -dGrayImageResolution=72 -dDownsampleMonoImages=true -dMonoImageResolution=72 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
Quality presets (-dPDFSETTINGS)
- /screen - 72 DPI, smallest file size
- /ebook - 150 DPI, medium quality
- /printer - 300 DPI, high quality
- /prepress - 300 DPI, color preserving
Remove all images (text only)
& ".tools/gs/bin/gswin64c.exe" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dFILTERIMAGE=true -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
PDF Downsizing Workflow
Two-pass workflow for maximum compression:
Pass 1: Ghostscript (image compression)
& ".tools/gs/bin/gswin64c.exe" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dDownsampleColorImages=true -dColorImageResolution=72 -dDownsampleGrayImages=true -dGrayImageResolution=72 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=temp.pdf input.pdf
Pass 2: QPDF (structure optimization)
& ".tools/qpdf/bin/qpdf.exe" --linearize --object-streams=generate --stream-data=compress --compress-streams=y --optimize-images --flatten-annotations=screen temp.pdf output.pdf
Remove-Item temp.pdf
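For callers that drive the tools from Python instead of PowerShell, the two passes could be wrapped as below. This is a sketch under assumptions. The flags are copied from the commands above. The function names are hypothetical, and the code assumes it runs from the workspace root so that the relative tool paths resolve.

```python
import subprocess
from pathlib import Path

# Tool locations as given in this guide (relative to the workspace root).
GS = ".tools/gs/bin/gswin64c.exe"
QPDF = ".tools/qpdf/bin/qpdf.exe"


def ghostscript_args(src, dst, preset="screen", dpi=72):
    """Pass 1: downsample color and grayscale images."""
    return [
        GS, "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",
        "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}", str(src),
    ]


def qpdf_args(src, dst):
    """Pass 2: structure optimization with the same flags as above."""
    return [
        QPDF, "--linearize", "--object-streams=generate",
        "--stream-data=compress", "--compress-streams=y",
        "--optimize-images", "--flatten-annotations=screen",
        str(src), str(dst),
    ]


def downsize(src, dst, temp="temp.pdf"):
    """Run both passes; the temp file is removed even if a pass fails."""
    try:
        subprocess.run(ghostscript_args(src, temp), check=True)
        subprocess.run(qpdf_args(temp, dst), check=True)
    finally:
        Path(temp).unlink(missing_ok=True)
```

Passing the arguments as a list means no shell is involved, so values such as -dCompatibilityLevel=1.4 reach the tool exactly as written.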
PDF Analysis
Before optimizing, analyze the PDF to understand what's consuming space.
Get PDF info
& ".tools/poppler/Library/bin/pdfinfo.exe" "input.pdf"
List all images with details
& ".tools/poppler/Library/bin/pdfimages.exe" -list "input.pdf"
Shows: page, dimensions, color space, compression, DPI, size. Use to determine if images can be further compressed.
Count images
& ".tools/poppler/Library/bin/pdfimages.exe" -list "input.pdf" | Select-Object -Skip 2 | Measure-Object
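To go beyond a plain count, the listing can be parsed into records. The sketch below assumes the usual `pdfimages -list` layout: two header lines followed by one 16-column row per image (page, num, type, width, height, color, comp, bpc, enc, interp, object, ID, x-ppi, y-ppi, size, ratio). Column order may differ between Poppler versions. The SAMPLE text is made up for illustration and is not output from a real file.

```python
# Illustrative sample only; real output comes from `pdfimages -list input.pdf`.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1654  2339  rgb     3   8  jpx    no         7  0   200   200  310K 2.7%
   2     1 image      32    32  rgb     3   8  image  no        12  0    72    72  412B  13%
"""


def parse_image_list(text):
    """Parse `pdfimages -list` output into one dict per image row."""
    rows = []
    for line in text.splitlines()[2:]:  # skip the header and separator lines
        fields = line.split()
        if len(fields) < 16:
            continue  # ignore blank or unexpected lines
        rows.append({
            "page": int(fields[0]),
            "width": int(fields[3]),
            "height": int(fields[4]),
            "color": fields[5],
            "enc": fields[8],
            "x_ppi": int(fields[12]),
            "y_ppi": int(fields[13]),
            "size": fields[14],
        })
    return rows


images = parse_image_list(SAMPLE)
high_dpi = [img for img in images if img["x_ppi"] >= 200]
```

Filtering on the encoding (for example jpx) or on x_ppi gives a quick view of which images are likely to shrink when downsampled.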
Optimization Strategies (by effectiveness)
Tested on 6 real-world annual reports (50-100 MB each):
High compression (>50% reduction):
- JPEG2000 at 200 DPI: 77 MB → 4.5 MB (94% with screen, 86% with ebook)
- Mixed content, not optimized: 84 MB → 34 MB (59%)
- Mixed content: 49 MB → 21 MB (57%)
Moderate compression (10-50%):
- Already optimized PDF: 47 MB → 42 MB (10%)
Low compression (<10%):
- PDF with 10552 tiny images: 58 MB → 56 MB (4%) - many small UI elements
- Image-only PDF at 115 DPI: 101 MB → 97 MB (4% at 100 DPI, 77% at 72 DPI)
Key factors affecting compression:
- Image format: JPEG2000 (jpx) compresses dramatically when converted to JPEG
- Image DPI: High DPI (200+) benefits most from downsampling
- Already optimized: PDFs marked "Optimized: yes" compress less
- Many small images: Thousands of tiny icons/UI elements resist compression
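The reduction percentages above follow from a simple ratio, shown here as a small sketch. The function names are hypothetical, and the band boundaries mirror the three headings above. The MB figures in the list are rounded, so recomputing from them can differ from the quoted percentage by about a point.

```python
def reduction_percent(before_mb, after_mb):
    """Share of the original size that was removed, in percent."""
    return (before_mb - after_mb) / before_mb * 100


def classify(percent):
    """Map a reduction to the bands used above: high, moderate, low."""
    if percent > 50:
        return "high"
    if percent >= 10:
        return "moderate"
    return "low"


# 77 MB -> 4.5 MB, the JPEG2000 case from the list above
jpx_case = reduction_percent(77, 4.5)
jpx_band = classify(jpx_case)
```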
Strategy 1: Aggressive (best for large reductions)
cmd /c ".tools\gs\bin\gswin64c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dDetectDuplicateImages=true -dCompressFonts=true -dSubsetFonts=true -dConvertCMYKImagesToRGB=true -dColorImageDownsampleType=/Bicubic -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf"
Strategy 2: Balanced (preserves more quality)
cmd /c ".tools\gs\bin\gswin64c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dDetectDuplicateImages=true -dCompressFonts=true -dSubsetFonts=true -dConvertCMYKImagesToRGB=true -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf"
Strategy 3: Structure only (no quality loss)
& ".tools/qpdf/bin/qpdf.exe" --linearize --object-streams=generate --compress-streams=y --recompress-flate input.pdf output.pdf
Key Optimization Flags
- -dDetectDuplicateImages=true - Replace duplicate images with references
- -dCompressFonts=true - Compress font data
- -dSubsetFonts=true - Include only used glyphs
- -dConvertCMYKImagesToRGB=true - Convert CMYK to RGB (smaller)
- -dColorImageDownsampleType=/Bicubic - Better quality downsampling
Best Practices
- Analyze first: Use pdfimages -list to understand image content before optimizing
- Check existing conversions: Before converting, check if subfolder exists in _pdf_to_jpg_converted/
- Use appropriate DPI: 72 for web/archive, 150 for screen, 300 for print/OCR
- Convert specific pages: Use --pages to avoid converting entire large PDFs
- Clean up: Delete old conversions when no longer needed
- Two-pass for maximum compression: Ghostscript (images) then QPDF (structure)
Setup
For initial installation, see SETUP.md in this skill folder.
Tool locations:
- 7-Zip: .tools/7z/
- Poppler: .tools/poppler/
- QPDF: .tools/qpdf/
- Ghostscript: .tools/gs/
- Installers: .tools/_installer/
- JPG output: .tools/_pdf_to_jpg_converted/
- PDF output: .tools/_pdf_output/