PDF Tools

Search and extract content from PDFs without loading entire files into context.

Installation

```bash
# macOS
brew install pdfgrep poppler

# Ubuntu/Debian
sudo apt install pdfgrep poppler-utils
```

Quick Reference

| Task | Command |
|------|---------|
| Search | `pdfgrep "term" file.pdf` |
| Search with page numbers | `pdfgrep -n "term" file.pdf` |
| Search with context | `pdfgrep -n -C 2 "term" file.pdf` |
| Get page count | `pdfinfo file.pdf \| grep Pages` |
| Extract pages 5-10 | `pdftotext -f 5 -l 10 file.pdf -` |

Core Workflow

Step 1: Search - Find where content lives

```bash
pdfgrep -n "authentication" large-manual.pdf
# Output: 42: User authentication requires...
#         45: Authentication tokens expire...
```

Step 2: Extract - Get just those pages, plus one page of padding on each side

```bash
pdftotext -f 41 -l 46 large-manual.pdf -
```
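The two steps can be chained in a small script. The sketch below is one possible way to do that and is not part of either tool: `page_window` and `extract_around` are hypothetical helper names, and the default padding of one page simply mirrors the example above (hits on pages 42 and 45, extraction from 41 to 46).

```bash
# Hypothetical helper: print "first last" for a window of PAD pages
# around page HIT, clamped to the valid range 1..TOTAL.
page_window() {
  local hit="$1" pad="$2" total="$3" first last
  first=$(( hit - pad ))
  last=$(( hit + pad ))
  if [ "$first" -lt 1 ]; then first=1; fi
  if [ "$last" -gt "$total" ]; then last=$total; fi
  echo "$first $last"
}

# Hypothetical helper: extract the text around one hit page.
# Assumes pdfinfo and pdftotext (poppler) are installed.
extract_around() {
  local file="$1" hit="$2" pad="${3:-1}" total window
  total=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
  [ -n "$total" ] || return 1          # pdfinfo failed or printed no page count
  window=$(page_window "$hit" "$pad" "$total")
  pdftotext -f "${window% *}" -l "${window#* }" "$file" -
}

# Usage (not run here):
#   extract_around large-manual.pdf 42
```

Clamping matters at the edges of the document: a hit on page 1 with padding would otherwise request page 0.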

Search Commands

```bash
# Basic search
pdfgrep "search term" document.pdf

# Case-insensitive
pdfgrep -i "search term" document.pdf

# With page numbers
pdfgrep -n "search term" document.pdf

# With context (2 lines before/after)
pdfgrep -n -C 2 "search term" document.pdf

# Count occurrences
pdfgrep -c "search term" document.pdf

# Search all PDFs in directory
pdfgrep -r "term" /path/to/pdfs/
```
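When the goal is to decide which pages to extract, the matched text matters less than the page numbers. A minimal sketch, assuming a single input file so that each `pdfgrep -n` line starts with `page:` (with several files or `-r`, a filename comes first and the field index would need to change). `hit_pages` is a hypothetical name.

```bash
# Hypothetical helper: reduce `pdfgrep -n` output (read from stdin)
# to a sorted list of unique page numbers, one per line.
hit_pages() {
  cut -d: -f1 | sort -nu
}

# Usage (not run here):
#   pdfgrep -n "authentication" large-manual.pdf | hit_pages
```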

Extract Commands

```bash
# Extract specific page range
pdftotext -f 10 -l 15 document.pdf -

# Extract single page
pdftotext -f 42 -l 42 document.pdf -

# Preserve layout (for tables)
pdftotext -layout -f 10 -l 10 document.pdf -

# Extract and limit output
pdftotext -f 10 -l 15 document.pdf - | head -50
```
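If a search hits many pages, extracting each page separately means one `pdftotext` call per page. The sketch below shows one way to merge a sorted list of page numbers into contiguous ranges first, so each range needs only one call. `pages_to_ranges` is a hypothetical helper, and it assumes its input is already sorted numerically with no duplicates.

```bash
# Hypothetical helper: read sorted, unique page numbers from stdin and
# print one "first last" pair per run of consecutive pages.
pages_to_ranges() {
  awk '
    NR == 1        { start = $1; prev = $1; next }   # first page opens a run
    $1 == prev + 1 { prev = $1; next }               # consecutive: extend the run
                   { print start, prev; start = $1; prev = $1 }  # gap: close run, open new one
    END            { if (NR > 0) print start, prev } # close the final run
  '
}

# Usage (not run here):
#   printf '%s\n' 41 42 43 45 46 50 | pages_to_ranges |
#     while read -r first last; do
#       pdftotext -f "$first" -l "$last" document.pdf -
#     done
```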

Metadata

```bash
# Get page count
pdfinfo document.pdf | grep Pages

# Full metadata
pdfinfo document.pdf
```
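`grep Pages` prints the whole `Pages:` line, and it would also match any other metadata line that happens to contain the word "Pages". For scripting, a bare number is easier to work with. A minimal sketch, with `parse_page_count` as a hypothetical helper name:

```bash
# Hypothetical helper: read `pdfinfo` output from stdin and print only
# the number from the line that starts with "Pages:".
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Usage (not run here):
#   pdfinfo document.pdf | parse_page_count
```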

Troubleshooting

Empty output from pdftotext: The PDF is probably image-based (scanned) and has no text layer. These tools read embedded text only, so a scanned PDF needs OCR before it can be searched or extracted.

pdfgrep missing matches: Try a case-insensitive search (-i), and check that the PDF has selectable text.
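Both problems often come down to whether the PDF has a text layer at all, which can be checked before searching. The sketch below is a rough heuristic rather than a definitive test: it extracts the first few pages and counts the non-whitespace characters. `count_nonspace` and `has_text_layer` are hypothetical names, and the five-page sample size is an arbitrary assumption.

```bash
# Hypothetical helper: count non-whitespace characters read from stdin.
# The trailing tr strips the padding some wc implementations add.
count_nonspace() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: succeed if the first pages of the PDF contain any
# extractable text. Assumes pdftotext (poppler) is installed.
has_text_layer() {
  local file="$1" sample="${2:-5}" chars
  chars=$(pdftotext -f 1 -l "$sample" "$file" - | count_nonspace)
  [ "${chars:-0}" -gt 0 ]
}

# Usage (not run here):
#   has_text_layer document.pdf || echo "no text layer found; OCR is probably needed"
```

A document whose first pages are all images (for example a scanned cover followed by text) would be misjudged, so a larger sample may be worth the extra time for such files.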