PDF Processing Skill
This skill provides guidance and tools for processing PDF files, extracting text, and analyzing document content.
Overview
This skill includes Python scripts for common PDF processing tasks:
- Extracting text from PDF files
- Analyzing PDF structure
- Extracting tables from PDFs
- Converting PDF pages to images
Available Scripts
1. Extract Text from PDF
Use execute_script to run the text extraction script:
python
python /nfs/FM/gongoubo/new_project/Agent-Handbook/mini-agents/Mini_Agents/skills/document-skills/pdf/scripts/extract_text.py /path/to/your/file.pdf
2. Analyze PDF Structure
Use execute_script to run the structure analysis script:
python
python /nfs/FM/gongoubo/new_project/Agent-Handbook/mini-agents/Mini_Agents/skills/document-skills/pdf/scripts/analyze_structure.py /path/to/your/file.pdf
3. Extract Tables from PDF
Use execute_script to run the table extraction script:
python
python /nfs/FM/gongoubo/new_project/Agent-Handbook/mini-agents/Mini_Agents/skills/document-skills/pdf/scripts/extract_tables.py /path/to/your/file.pdf
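When running the script is not an option, table extraction can also be sketched inline with pdfplumber, one of the libraries this skill already requires. This is a minimal sketch and not the contents of extract_tables.py; the helper names `clean_tables` and `extract_tables_from_pdf` are hypothetical, and the only pdfplumber calls assumed are `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()`.

```python
def clean_tables(pages):
    """Collect non-empty tables from objects that expose extract_tables()."""
    tables = []
    for page_index, page in enumerate(pages):
        for table in page.extract_tables():
            # Drop rows in which every cell is None or an empty string.
            rows = [
                row for row in table
                if any(cell not in (None, "") for cell in row)
            ]
            if rows:
                # Store a 1-based page number alongside the table rows.
                tables.append((page_index + 1, rows))
    return tables


def extract_tables_from_pdf(pdf_path):
    """Open a PDF with pdfplumber and return (page_number, rows) pairs."""
    import pdfplumber  # imported on use so clean_tables has no hard dependency

    with pdfplumber.open(pdf_path) as pdf:
        return clean_tables(pdf.pages)
```

Calling `extract_tables_from_pdf('/path/to/your/file.pdf')` would return a list of `(page_number, rows)` pairs, where each row is a list of cell values.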
4. Quick Code Execution
For quick PDF processing, use execute_code with inline Python code:
python
import PyPDF2
# Open PDF file
pdf_path = '/path/to/your/file.pdf'
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Get basic info
    num_pages = len(reader.pages)
    print(f"Number of pages: {num_pages}")
    # Extract text from first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print(f"First page text preview: {text[:200]}...")
Usage Examples
Example 1: Extract All Text from PDF
Use execute_code tool:
python
import PyPDF2
pdf_path = '/path/to/your/file.pdf'
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    all_text = ""
    # Iterate over the pages directly instead of indexing by number
    for page in reader.pages:
        all_text += page.extract_text()
print(all_text)
Example 2: Extract Text from Specific Page
Use execute_code tool:
python
import PyPDF2
pdf_path = '/path/to/your/file.pdf'
page_number = 2  # Zero-based index: 2 selects the third page
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    if page_number < len(reader.pages):
        page = reader.pages[page_number]
        text = page.extract_text()
        print(f"Text from page {page_number + 1}:")
        print(text)
    else:
        print(f"Page {page_number + 1} does not exist. Total pages: {len(reader.pages)}")
Example 3: Get PDF Metadata
Use execute_code tool:
python
import PyPDF2
pdf_path = '/path/to/your/file.pdf'
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    metadata = reader.metadata
    print("PDF Metadata:")
    if metadata:
        for key, value in metadata.items():
            print(f"  {key}: {value}")
    else:
        print("  No metadata available")
Best Practices
- Always check that the PDF file exists before processing
- Handle encoding issues gracefully (some PDFs may have special characters)
- Use execute_code for simple text extraction tasks
- Use execute_script for complex PDF processing workflows
- Be aware that some PDFs may be password-protected or scanned images
- For scanned PDFs, you may need OCR (Optical Character Recognition)
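The first practice, checking the file before processing, can be sketched with the standard library alone. The helper name `check_pdf_path` is hypothetical, and the header test is deliberately strict: it assumes the file begins with the `%PDF-` marker, so a file with stray bytes before the marker would be rejected here even if a tolerant reader could open it.

```python
from pathlib import Path


def check_pdf_path(pdf_path):
    """Return the resolved Path if pdf_path looks like a PDF, else raise."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    with path.open("rb") as handle:
        header = handle.read(5)
    # PDF files start with the marker "%PDF-" followed by a version number.
    if header != b"%PDF-":
        raise ValueError(f"Not a PDF file (header {header!r}): {path}")
    return path.resolve()
```

Running this check first turns a missing or mistyped path into a clear error before any PDF library is involved.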
Required Libraries
- PyPDF2
- pdfplumber (for better text extraction and table extraction)
- tabula-py (for advanced table extraction)
Install with: pip install PyPDF2 pdfplumber tabula-py
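Before running any of the examples, it can help to confirm that the libraries are importable. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical. Note that the import name can differ from the pip package name: tabula-py is imported as `tabula`.

```python
import importlib.util

# Maps pip package name -> module name used in "import ...".
REQUIRED_MODULES = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "tabula-py": "tabula",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose modules cannot be found."""
    return [
        package for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing packages:", missing_packages() or "none")
```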
Troubleshooting
Issue: Cannot extract text from PDF
- The PDF might be a scanned image. Use OCR tools like Tesseract.
- The PDF might be password-protected. Provide the password when opening.
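Both causes can be checked in code. The sketch below assumes `reader` is a `PyPDF2.PdfReader` from a version that exposes `is_encrypted` and `decrypt()`; the helper names `unlock` and `probably_scanned` are hypothetical, and the scanned-page test is only a heuristic based on how little text comes back.

```python
def unlock(reader, password):
    """Decrypt reader in place if it is encrypted; raise on a wrong password."""
    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password is rejected.
        if not reader.decrypt(password):
            raise ValueError("Wrong password for encrypted PDF")
    return reader


def probably_scanned(page_texts, min_chars=10):
    """Heuristic: True if no page yields at least min_chars of text."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)
```

A typical use would be to call `unlock(reader, password)` right after creating the reader, then pass `[page.extract_text() for page in reader.pages]` to `probably_scanned` to decide whether OCR is worth trying. The `min_chars` threshold is an arbitrary default and may need tuning.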
Issue: Text extraction is garbled
- Try different PDF libraries (PyPDF2, pdfplumber, pdftotext)
- Some PDFs have complex layouts that are hard to parse automatically
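Trying several libraries can be automated with a simple fallback chain. This is a sketch under stated assumptions: the function names are hypothetical, only PyPDF2 and pdfplumber are covered, and "usable" is judged by text length alone, which is a crude proxy (garbled output that is long enough would still pass).

```python
def extract_with_pypdf2(pdf_path):
    import PyPDF2  # optional dependency, imported on use

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfplumber(pdf_path):
    import pdfplumber  # optional dependency, imported on use

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def first_usable_text(pdf_path, extractors, min_chars=20):
    """Try each extractor in order; return (name, text) for the first usable result."""
    for extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # one failing library should not stop the chain
            print(f"{extractor.__name__} failed: {exc}")
            continue
        if len(text.strip()) >= min_chars:
            return extractor.__name__, text
    return None, ""
```

For example, `first_usable_text(pdf_path, [extract_with_pypdf2, extract_with_pdfplumber])` returns the name of the extractor that produced the text, which makes it easy to compare results by eye.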
Issue: Large PDF processing is slow
- Process pages in batches
- Consider using multiprocessing for parallel processing
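The two suggestions above can be combined: split the page indices into batches and hand each batch to a worker process. This is a minimal sketch, not a tuned implementation; the function names, the default batch size, and the worker count are all assumptions. Each worker reopens the PDF itself, so the parsing overhead is paid once per batch.

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(num_pages, batch_size):
    """Yield consecutive ranges of zero-based page indices."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


def extract_batch(job):
    """Worker: open the PDF independently and extract text for one batch."""
    pdf_path, indices = job
    import PyPDF2  # imported inside the worker process

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        return [reader.pages[i].extract_text() or "" for i in indices]


def extract_parallel(pdf_path, num_pages, batch_size=50, workers=4):
    """Extract text for all pages, one batch per task, preserving page order."""
    jobs = [(pdf_path, batch) for batch in page_batches(num_pages, batch_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the submitted jobs.
        results = pool.map(extract_batch, jobs)
        return [text for batch in results for text in batch]
```

On platforms that start worker processes by spawning a fresh interpreter, call `extract_parallel` from under an `if __name__ == "__main__":` guard. For small PDFs the process startup cost may outweigh any gain, so the sequential examples are likely the better choice there.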