PDF Reader
Extract text from PDF files for text manipulation, search, and reference.
Quick Start
Extract all text from a PDF:
bash
python scripts/extract_text.py document.pdf
Save to file:
bash
python scripts/extract_text.py document.pdf -o .tmp/output.txt
Extract specific pages:
bash
python scripts/extract_text.py document.pdf -p 1-10 -o .tmp/pages.txt
Workflow
- Extract text → Run scripts/extract_text.py
- Process output → Text is now searchable, editable, quotable
- Reference content → Use extracted text for analysis or response
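The same workflow can also be driven from Python rather than the shell. The following is a minimal sketch, not part of the skill itself: `build_extract_command` is a hypothetical helper that assembles an argument list from the options documented in this file, and it assumes the script lives at `scripts/extract_text.py` relative to the working directory.

```python
import sys


def build_extract_command(pdf_path, pages=None, output=None, method=None):
    """Assemble an argv list for scripts/extract_text.py (hypothetical helper)."""
    # Use the current interpreter so the script runs in the same environment.
    cmd = [sys.executable, "scripts/extract_text.py", pdf_path]
    if method:
        cmd += ["-m", method]   # -m / --method, as documented under Script Options
    if pages:
        cmd += ["-p", pages]    # -p / --pages
    if output:
        cmd += ["-o", output]   # -o / --output
    return cmd


cmd = build_extract_command("document.pdf", pages="1-10", output=".tmp/pages.txt")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would run the extraction. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.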
Script Options
code
extract_text.py <pdf_path> [options]

Options:
  -o, --output FILE      Save to file (default: print to stdout)
  -m, --method METHOD    auto|pdfplumber|pymupdf|pdfminer (default: auto)
  -p, --pages RANGE      Page range: "1-5" or "1,3,5" (default: all)
  --preserve-layout      Keep spatial arrangement of text
  --json                 Output with metadata (page sizes, method used)
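To illustrate the two `--pages` forms, here is a rough sketch of how such a range string could be expanded into page numbers. It is not taken from the script: `parse_page_range` is a hypothetical name, and the sketch assumes that pages are 1-based and that the two forms may be mixed (for example "1-3,7"), neither of which this document confirms.

```python
def parse_page_range(spec):
    """Expand a spec such as "1-5" or "1,3,5" into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_range("1-5"))    # [1, 2, 3, 4, 5]
print(parse_page_range("1,3,5"))  # [1, 3, 5]
```

Collecting pages in a set means overlapping entries such as "1-3,2" produce each page only once.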
Method Selection
| Scenario | Recommended Method |
|---|---|
| General use | auto (default) |
| Documents with tables | pdfplumber |
| Large PDFs, speed needed | pymupdf |
| Maximum text accuracy | pdfminer |
| Scanned/image PDFs | pymupdf (has OCR; PyMuPDF's OCR relies on a Tesseract installation) |
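How `auto` chooses a library is not described here. One plausible approach, sketched below under stated assumptions, is to use the first library that is actually installed. The preference order and the names `available_methods` and `resolve_method` are hypothetical, not the script's confirmed behavior.

```python
import importlib.util

# Method name -> importable module name.
# PyMuPDF has historically been imported as "fitz"; pdfminer.six as "pdfminer".
# The ordering below is an assumed preference, not the script's documented one.
METHOD_MODULES = {
    "pdfplumber": "pdfplumber",
    "pymupdf": "fitz",
    "pdfminer": "pdfminer",
}


def available_methods():
    """Return the method names whose backing module can be imported."""
    return [
        name
        for name, module in METHOD_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


def resolve_method(requested="auto", installed=None):
    """Map a requested method ("auto" or a library name) to an installed one."""
    installed = available_methods() if installed is None else installed
    if requested == "auto":
        if not installed:
            raise RuntimeError("no PDF library installed")
        return installed[0]
    if requested not in installed:
        raise RuntimeError(f"method {requested!r} is not installed")
    return requested


print(available_methods())
```

Accepting `installed` as a parameter keeps the selection logic testable without requiring any of the three libraries to be present.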
Examples
Extract and search
bash
python scripts/extract_text.py report.pdf | grep -i "revenue"
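When the extracted text is already held in a Python string, the same case-insensitive search can be done without `grep`. This is a small sketch with a hypothetical helper name, `find_matches`. Unlike `grep`, it treats the search term as literal text rather than a regular expression.

```python
import re


def find_matches(text, term):
    """Return (line_number, line) pairs for lines containing term, ignoring case."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


sample = "Total Revenue: 10\nCosts: 4\nnet revenue up"
for number, line in find_matches(sample, "revenue"):
    print(f"{number}: {line}")
```

Returning line numbers alongside the matching lines makes it easier to quote or cross-reference the surrounding text.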
Extract tables (use pdfplumber)
bash
python scripts/extract_text.py data.pdf -m pdfplumber --json -o .tmp/data.json
Specific pages with layout
bash
python scripts/extract_text.py book.pdf -p 50-55 --preserve-layout -o .tmp/chapter.txt
Dependencies
At least one of the following libraries is required (the command below installs all three):
bash
pip install pdfplumber pymupdf pdfminer.six
For detailed library comparison, see references/pdf_libraries.md.
Troubleshooting
Empty output?
- PDF may be scanned/image-based → try --method pymupdf (has OCR)
- Check if PDF is password-protected
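For the password-protection check, a quick standard-library heuristic is to look for an `/Encrypt` entry in the raw file bytes, since encrypted PDFs reference an encryption dictionary under that key. The sketch below is only a rough first check and the helper names are hypothetical. It can give false positives, and a PDF encrypted with an empty user password may still open without a prompt.

```python
from pathlib import Path


def has_encrypt_marker(data: bytes) -> bool:
    """Heuristic: encrypted PDFs carry an /Encrypt key in the trailer or xref stream."""
    return b"/Encrypt" in data


def looks_password_protected(pdf_path) -> bool:
    """Read the whole file and apply the /Encrypt heuristic (not a definitive test)."""
    return has_encrypt_marker(Path(pdf_path).read_bytes())


print(has_encrypt_marker(b"trailer << /Root 1 0 R /Encrypt 5 0 R >>"))
print(has_encrypt_marker(b"trailer << /Root 1 0 R >>"))
```

If the marker is absent and the output is still empty, the scanned/image-based explanation becomes the more likely one.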
Garbled text?
- Try a different method: -m pdfminer
- PDF may have non-standard font encoding
Tables not formatted?
- Use -m pdfplumber --json for structured output
- Consider the --preserve-layout flag