AgentSkillsCN

pdf-extractor

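Extracts text, tables, and images from PDF files. Use cases: extracting data from reports; converting PDF tables to CSV; pulling images from presentations; processing research papers; batch-converting PDFs to text.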

SKILL.md
--- frontmatter
name: pdf-extractor
description: "Extract text, tables, and images from PDFs. Use when: extracting data from reports; converting PDF tables to CSV; pulling images from presentations; processing research papers; batch converting PDFs to text"
license: MIT
metadata:
  author: ClawFu
  version: 1.0.0
  mcp-server: "@clawfu/mcp-skills"

PDF Extractor

Extract text, tables, and images from PDF files using pdfplumber, turning static PDFs into usable data.

When to Use This Skill

  • Report processing - Extract data from PDF reports
  • Table extraction - Convert PDF tables to CSV
  • Image collection - Pull images from presentations
  • Text mining - Bulk convert PDFs to searchable text
  • Research - Process academic papers and whitepapers

What Claude Does vs What You Decide

Claude Does                       | You Decide
Structures analysis frameworks    | Metric definitions
Identifies patterns in data       | Business interpretation
Creates visualization templates   | Dashboard design
Suggests optimization areas       | Action priorities
Calculates statistical measures   | Decision thresholds

Dependencies

bash
pip install pdfplumber pypdf click pandas
# For image extraction:
pip install Pillow

Commands

Extract Text

bash
python scripts/main.py text document.pdf
python scripts/main.py text document.pdf --pages 1-5
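
The contents of scripts/main.py are not shown here, so the following is only a minimal sketch of how a text command like this might work, assuming it wraps pdfplumber's page-level `extract_text()`. The names `join_page_text` and `extract_text` are hypothetical and not taken from the skill itself.

```python
from typing import Iterable, Optional, Sequence


def join_page_text(pages: Sequence, page_numbers: Optional[Iterable[int]] = None) -> str:
    """Concatenate text from page objects that expose extract_text().

    page_numbers are 1-based; None means every page.
    """
    if page_numbers is None:
        selected = list(range(1, len(pages) + 1))
    else:
        # Skip page numbers that fall outside the document.
        selected = [n for n in page_numbers if 1 <= n <= len(pages)]
    chunks = []
    for n in selected:
        # Pages without a text layer may yield None or an empty string.
        chunks.append(pages[n - 1].extract_text() or "")
    return "\n\n".join(chunks)


def extract_text(pdf_path: str, page_numbers: Optional[Iterable[int]] = None) -> str:
    """Open a PDF with pdfplumber and return its text (hypothetical helper)."""
    import pdfplumber  # imported lazily so join_page_text works without it

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages, page_numbers)
```

Under these assumptions, `extract_text("document.pdf", [1, 2, 3, 4, 5])` would roughly correspond to the `--pages 1-5` call above. Scanned PDFs without a text layer would come back empty, since this sketch does no OCR.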

Extract Tables

bash
python scripts/main.py tables report.pdf --output tables.csv
python scripts/main.py tables financial.pdf --page 3
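
As an illustration of the table-to-CSV step, here is a sketch that assumes the command relies on pdfplumber's `page.extract_tables()`, which returns each table as a list of rows whose empty cells are `None`. The helper names `table_to_csv` and `extract_tables` are hypothetical; the actual script may be organized differently.

```python
import csv
import io
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def table_to_csv(rows: Sequence[Row]) -> str:
    """Serialize one extracted table (rows of cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells arrive as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return buffer.getvalue()


def extract_tables(pdf_path: str, page_number: Optional[int] = None) -> List[str]:
    """Return CSV text for each table found (hypothetical helper).

    page_number is 1-based; None scans every page.
    """
    import pdfplumber  # imported lazily so table_to_csv works without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages if page_number is None else [pdf.pages[page_number - 1]]
        return [table_to_csv(table) for page in pages for table in page.extract_tables()]
```

Going through the `csv` module rather than joining cells with commas keeps values such as `1,200` intact, because the writer quotes any cell that contains the delimiter.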

Extract Images

bash
python scripts/main.py images presentation.pdf --output ./images/

Merge PDFs

bash
python scripts/main.py merge doc1.pdf doc2.pdf --output combined.pdf
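
A merge along these lines can be sketched with pypdf's `PdfWriter.append()`, assuming a pypdf release of 3.0 or later. Whether scripts/main.py does it this way is not stated, and `merge_pdfs` is a hypothetical name.

```python
from typing import Sequence


def merge_pdfs(input_paths: Sequence[str], output_path: str) -> None:
    """Concatenate the given PDFs, in order, into output_path."""
    if len(input_paths) < 2:
        raise ValueError("merging needs at least two input PDFs")

    from pypdf import PdfWriter  # imported lazily; requires pypdf >= 3.0

    writer = PdfWriter()
    for path in input_paths:
        # append() copies every page of the source file onto the end.
        writer.append(path)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

Pages appear in the output in the same order as the input paths, so `merge_pdfs(["doc1.pdf", "doc2.pdf"], "combined.pdf")` would mirror the command above.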

PDF Info

bash
python scripts/main.py info document.pdf

Examples

Example 1: Extract Financial Tables

bash
python scripts/main.py tables annual-report.pdf --output financials.csv

# Output: financials.csv containing every table found
# Also writes one CSV per table, e.g. table_page3_1.csv, table_page5_1.csv

Example 2: Batch Convert to Text

bash
python scripts/main.py batch ./pdfs/ --output ./text/

# Converts all PDFs in folder to .txt files
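
The folder-to-folder mapping behind a batch run can be sketched as below. This assumes each `name.pdf` becomes `name.txt` in the output folder and that subfolders are not scanned; neither detail is confirmed by the skill, and `plan_batch` is a hypothetical name.

```python
from pathlib import Path
from typing import List, Tuple


def plan_batch(input_dir: str, output_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every PDF in input_dir with its target .txt path in output_dir."""
    src_root = Path(input_dir)
    dst_root = Path(output_dir)
    jobs = []
    for src in sorted(src_root.iterdir()):
        # Match the extension case-insensitively; ignore folders and other files.
        if src.is_file() and src.suffix.lower() == ".pdf":
            jobs.append((src, dst_root / (src.stem + ".txt")))
    return jobs
```

Planning the (source, target) pairs before extracting anything makes it easy to report what will be written, or to skip targets that already exist, before any PDF is opened.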

Example 3: Extract Specific Pages

bash
python scripts/main.py text whitepaper.pdf --pages 1,5-10,15

# Extracts only pages 1, 5-10, and 15
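
A page specification such as `1,5-10,15` has to be expanded into individual page numbers before extraction. The parser below is one possible sketch; `parse_page_spec` is a hypothetical name, and the behavior for duplicates and invalid input is an assumption rather than documented behavior of the script.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec like "1,5-10,15" into a list of 1-based page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.append(page)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(pages))
```

For the example above, `parse_page_spec("1,5-10,15")` yields `[1, 5, 6, 7, 8, 9, 10, 15]`.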

Skill Boundaries

What This Skill Does Well

  • Structuring data analysis
  • Identifying patterns and trends
  • Creating visualization frameworks
  • Calculating statistical measures

What This Skill Cannot Do

  • Access your actual data
  • Replace statistical expertise
  • Make business decisions
  • Guarantee prediction accuracy

Related Skills

Skill Metadata

  • Mode: centaur
yaml
category: automation
subcategory: document-processing
dependencies: [pdfplumber, pypdf, pandas]
difficulty: beginner
time_saved: 4+ hours/week