pdf-tools

Search and extract content from PDF files. Use it to search within PDFs, find text in documents, or extract specific pages without reading the whole file.

SKILL.md
--- frontmatter
name: pdf-tools
description: Search and extract content from PDF files. Use when searching PDFs, finding text in documents, or extracting specific pages without reading the entire file.
allowed-tools: Bash, Read, Glob

PDF Tools

Search and extract content from PDFs without loading entire files into context.

Installation

bash
# macOS
brew install pdfgrep poppler

# Ubuntu/Debian
sudo apt install pdfgrep poppler-utils
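
Before running the commands in the rest of this document, it can help to confirm that the tools are actually on PATH. The sketch below is one way to do that; `missing_tools` is a hypothetical helper name chosen for illustration, and it relies only on the standard `command -v` builtin.

```bash
# Print the name of each listed command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of pdfgrep or poppler.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: list anything this skill needs that is not installed yet.
# missing_tools pdfgrep pdftotext pdfinfo
```

Empty output means all three tools were found.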

Quick Reference

Task                        Command
Search                      pdfgrep "term" file.pdf
Search with page numbers    pdfgrep -n "term" file.pdf
Search with context         pdfgrep -n -C 2 "term" file.pdf
Get page count              pdfinfo file.pdf | grep Pages
Extract pages 5-10          pdftotext -f 5 -l 10 file.pdf -

Core Workflow

Step 1: Search - Find where content lives

bash
pdfgrep -n "authentication" large-manual.pdf
# Output: 42: User authentication requires...
#         45: Authentication tokens expire...

Step 2: Extract - Get just those pages, plus one page of context on either side

bash
pdftotext -f 41 -l 46 large-manual.pdf -
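
The two steps can be chained so the page range is computed from the search results instead of being typed by hand. The following is a minimal sketch, assuming the search runs against a single PDF so that each `pdfgrep -n` line begins with the page number. `page_range` is a hypothetical helper name, and its margin argument (default 1) is an illustrative choice.

```bash
# Read `pdfgrep -n` output on stdin and print "FIRST LAST": the lowest and
# highest matching page, widened by a margin (default 1) and never below 1.
# page_range is a hypothetical helper name.
page_range() {
    awk -F: -v margin="${1:-1}" '
        $1 ~ /^[0-9]+$/ {
            page = $1 + 0
            if (!seen || page < min) min = page
            if (!seen || page > max) max = page
            seen = 1
        }
        END {
            if (!seen) exit 1          # no matches: print nothing and fail
            first = min - margin
            if (first < 1) first = 1
            print first, max + margin
        }'
}

# Usage (single PDF, so each line starts with the page number):
# range=$(pdfgrep -n "authentication" large-manual.pdf | page_range) &&
#     pdftotext -f "${range% *}" -l "${range#* }" large-manual.pdf -
```

With the matches on pages 42 and 45 shown in Step 1, this yields the same range as the manual command, 41 to 46. The upper bound is not clamped to the document's page count in this sketch.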

Search Commands

bash
# Basic search
pdfgrep "search term" document.pdf

# Case-insensitive
pdfgrep -i "search term" document.pdf

# With page numbers
pdfgrep -n "search term" document.pdf

# With context (2 lines before/after)
pdfgrep -n -C 2 "search term" document.pdf

# Count occurrences
pdfgrep -c "search term" document.pdf

# Search all PDFs in directory
pdfgrep -r "term" /path/to/pdfs/
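
When a term appears many times, a per-page tally can show which pages are worth extracting. The sketch below post-processes `pdfgrep -n` output for a single PDF; `hits_per_page` is a hypothetical helper name, and it counts matching lines per page rather than individual occurrences.

```bash
# Read `pdfgrep -n` output for a single PDF on stdin and print
# "PAGE COUNT" pairs, one per page, sorted by page number.
# hits_per_page is a hypothetical helper name.
hits_per_page() {
    awk -F: '$1 ~ /^[0-9]+$/ { n[$1]++ }
             END { for (p in n) print p, n[p] }' | sort -n
}

# Usage:
# pdfgrep -n -i "search term" document.pdf | hits_per_page
```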

Extract Commands

bash
# Extract specific page range
pdftotext -f 10 -l 15 document.pdf -

# Extract single page
pdftotext -f 42 -l 42 document.pdf -

# Preserve layout (for tables)
pdftotext -layout -f 10 -l 10 document.pdf -

# Extract and limit output to the first 50 lines
pdftotext -f 10 -l 15 document.pdf - | head -50

Metadata

bash
# Get page count
pdfinfo document.pdf | grep Pages

# Full metadata
pdfinfo document.pdf
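
The `grep Pages` form prints the whole line, label included. If a script needs the bare number, for example to bound a page range, something like the following sketch could be used. `page_count` is a hypothetical helper name; it assumes the `Pages:` label that pdfinfo prints before the count.

```bash
# Read `pdfinfo` output on stdin and print only the page count.
# Fails (non-zero exit) if no "Pages:" line is present.
# page_count is a hypothetical helper name.
page_count() {
    awk '$1 == "Pages:" { print $2; found = 1 }
         END { exit !found }'
}

# Usage:
# total=$(pdfinfo document.pdf | page_count)
```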

Troubleshooting

Empty output from pdftotext: The PDF is probably image-based (scanned). These tools work only with text-based PDFs.

pdfgrep missing matches: Try a case-insensitive search (-i), and check that the PDF has selectable text.
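
Both symptoms can be checked up front by sampling a few pages and testing whether any text comes out. This is a rough heuristic rather than a definitive test, since a document may open with image-only pages. `has_text` is a hypothetical helper name.

```bash
# Read extracted text on stdin; succeed only if it contains at least one
# non-whitespace character. has_text is a hypothetical helper name.
has_text() {
    [ -n "$(tr -d '[:space:]')" ]
}

# Usage: sample the first three pages before searching.
# pdftotext -f 1 -l 3 document.pdf - | has_text ||
#     echo "no text found; the PDF is probably scanned" >&2
```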