PDF Extraction

Description

Extracts structured content (text, tables, and section headers) from regulatory PDF documents. Designed for parsing Forest Service Handbooks (FSH), Forest Service Manuals (FSM), and Code of Federal Regulations (CFR) documents. Supports targeted extraction by section number, page range, or content pattern.

Triggers

When should the agent invoke this skill?

•User uploads or references a PDF document for analysis
•Query requests specific sections from regulatory documents
•User needs tables extracted from FSH/FSM documents (e.g., "extract Table 10-1")
•User requests text from specific pages or sections
•User asks to "read" or "extract" content from a PDF file
•Task requires cross-referencing regulatory requirements from PDF sources

Instructions

Step-by-step reasoning for the agent:

•Validate Input: Verify the PDF path exists and is readable
  - Check file exists at specified path
  - Verify it's a valid PDF file
  - Note file size and page count
•Determine Extraction Mode: Based on user request
  - Full document: Extract all text (use for small documents)
  - By section: Extract specific FSH/FSM section (e.g., "Section 10.3")
  - By page: Extract specific page range
  - Tables only: Extract and format tabular data
•Extract Content: Apply appropriate extraction method
  - Text extraction preserves section structure
  - Table extraction identifies and formats tables as markdown
  - Section extraction uses regex patterns for FSH/FSM formats
•Post-Process: Clean and structure output
  - Remove excessive whitespace
  - Preserve paragraph structure
  - Format tables as markdown
  - Add page number citations
•Return Results: Provide extracted content with metadata
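
The Validate Input step above could be sketched as follows, using only the standard library. The function name validate_pdf and the size_bytes field are illustrative assumptions, not part of the skill's documented outputs. Page count is left out because it needs a PDF library.

```python
from pathlib import Path


def validate_pdf(file_path: str) -> dict:
    """Check that file_path is a readable file carrying the PDF signature.

    Hypothetical helper: the name and return shape are assumptions.
    """
    path = Path(file_path)
    if not path.is_file():
        return {"success": False, "error": f"File not found: {file_path}"}
    try:
        with path.open("rb") as handle:
            header = handle.read(1024)
    except OSError as exc:
        return {"success": False, "error": f"Cannot read file: {exc}"}
    # PDF files start with "%PDF-"; some producers prepend a few stray bytes,
    # so the marker is searched for within the first 1024 bytes.
    if b"%PDF-" not in header:
        return {"success": False, "error": "Not a valid PDF file"}
    return {
        "success": True,
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "error": None,
    }
```

Checking the signature only rules out obviously wrong files; a truncated or corrupt PDF would still pass and fail later during extraction.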

Inputs

Input	Type	Required	Description
file_path	string	Yes	Path to PDF file (absolute or relative to agent)
extraction_mode	string	No	Mode: "full", "section", "pages", "tables" (default: "full")
section_number	string	No	FSH/FSM section to extract (e.g., "10.3", "Chapter 30")
start_page	integer	No	Starting page for page-based extraction (1-indexed)
end_page	integer	No	Ending page for page-based extraction (inclusive)
search_pattern	string	No	Regex pattern to find specific content
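
A minimal sketch of how these inputs might be normalized before extraction. The function name normalize_inputs is hypothetical. Two rules here are assumptions that the table does not state: section mode requires section_number, and in pages mode start_page defaults to 1 and end_page defaults to start_page.

```python
import re

VALID_MODES = {"full", "section", "pages", "tables"}


def normalize_inputs(inputs: dict) -> dict:
    """Apply defaults and check mode-specific fields; raise ValueError on bad input."""
    if not inputs.get("file_path"):
        raise ValueError("file_path is required")
    mode = inputs.get("extraction_mode", "full")  # documented default
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown extraction_mode: {mode!r}")
    normalized = {"file_path": inputs["file_path"], "extraction_mode": mode}
    if mode == "section":
        # Assumption: a section number is mandatory in section mode.
        if not inputs.get("section_number"):
            raise ValueError("section_number is required in 'section' mode")
        normalized["section_number"] = str(inputs["section_number"])
    if mode == "pages":
        # Assumption: missing bounds default to a single page.
        start = inputs.get("start_page", 1)
        end = inputs.get("end_page", start)
        if start < 1 or end < start:
            raise ValueError("Page range must satisfy 1 <= start_page <= end_page")
        normalized["start_page"] = start
        normalized["end_page"] = end
    pattern = inputs.get("search_pattern")
    if pattern:
        try:
            re.compile(pattern)  # reject invalid regex early
        except re.error as exc:
            raise ValueError(f"Invalid search_pattern: {exc}") from exc
        normalized["search_pattern"] = pattern
    return normalized
```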

Outputs

Output	Type	Description
success	boolean	Whether extraction succeeded
file_name	string	Name of the processed PDF file
page_count	integer	Total pages in the document
extracted_text	string	The extracted content (markdown formatted)
tables	array	List of extracted tables (if applicable)
sections_found	array	Section headers discovered in document
citations	array	Page/section references for extracted content
error	string	Error message if extraction failed

Reasoning Chain

Step-by-step reasoning for the agent:

•First, validate the PDF file exists and is accessible
•Then, determine the extraction scope based on user request
•Next, apply the appropriate extraction method (full, section, pages, tables)
•Then, clean and format the extracted content as markdown
•Finally, return results with page citations for traceability
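
The cleaning and citation steps could look roughly like the sketch below. It assumes the extractor hands back one plain-text string per page. The helper names clean_text and cite_pages and the "[p. N]" marker format are illustrative choices, not something the skill specifies.

```python
import re
from typing import Dict, List, Tuple


def clean_text(raw: str) -> str:
    """Collapse excess whitespace while keeping blank-line paragraph breaks."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is a newline followed by one or more blank lines.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text)
    cleaned = []
    for paragraph in paragraphs:
        # Join wrapped lines and squeeze whitespace runs to a single space.
        squeezed = re.sub(r"\s+", " ", paragraph).strip()
        if squeezed:
            cleaned.append(squeezed)
    return "\n\n".join(cleaned)


def cite_pages(page_texts: List[str], start_page: int = 1) -> Tuple[str, List[Dict]]:
    """Clean each page and append a page marker; page numbers are 1-indexed."""
    parts, citations = [], []
    for offset, raw in enumerate(page_texts):
        page = start_page + offset
        body = clean_text(raw)
        if not body:
            continue  # skip blank pages but keep numbering aligned
        parts.append(f"{body}\n\n[p. {page}]")
        citations.append({"content": body[:40], "page": page})
    return "\n\n".join(parts), citations
```

Because clean_text joins wrapped lines within a paragraph, it suits running prose; markdown tables would need to bypass it so their row breaks survive.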

Resources

•resources/fsh-section-patterns.json - Regex patterns for FSH section headers
•resources/fsm-section-patterns.json - Regex patterns for FSM section headers
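
The contents of these pattern files are not shown here, so the following is only a sketch of what regex-based section matching might look like. The SECTION_HEADER pattern is hypothetical: it assumes headers of the form "10.3 Policy" or "31.2 - Title" at the start of a line. The slicing rule also assumes that a header such as 10.21 is a subsection of 10.2 and belongs inside it.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern: optional markdown hashes, a dotted section number,
# an optional dash, then a capitalized title on the same line.
SECTION_HEADER = re.compile(
    r"^(?:#+[ \t]*)?(?P<number>\d+\.\d+[a-z]?)[ \t]+(?:-[ \t]+)?(?P<title>[A-Z][^\n]*)$",
    re.MULTILINE,
)


def find_sections(text: str) -> List[Dict]:
    """Return every header match with its number, title, and character offset."""
    return [
        {"number": m.group("number"), "title": m.group("title").strip(), "start": m.start()}
        for m in SECTION_HEADER.finditer(text)
    ]


def slice_section(text: str, section_number: str) -> Optional[str]:
    """Return the text from the requested header up to the next sibling header."""
    headers = find_sections(text)
    for index, header in enumerate(headers):
        if header["number"] != section_number:
            continue
        end = len(text)
        for later in headers[index + 1:]:
            # Headers sharing the prefix are treated as subsections (assumption).
            if not later["number"].startswith(section_number):
                end = later["start"]
                break
        return text[header["start"]:end].strip()
    return None
```

A pattern this loose will also match body lines that happen to start with a dotted number and a capitalized word, which is one reason to keep document-specific patterns in separate resource files.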

Scripts

•scripts/extract_pdf.py - Python implementation of PDF extraction
  Functions:
  - extract_full_text(file_path: str) -> dict - Extract all text from PDF
  - extract_section(file_path: str, section: str) -> dict - Extract specific section
  - extract_pages(file_path: str, start: int, end: int) -> dict - Extract page range
  - extract_tables(file_path: str) -> dict - Extract all tables as markdown
  - execute(inputs: dict) -> dict - Main entry point
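
How the entry point routes a request to the mode-specific functions is not documented, so the sketch below is one plausible shape rather than the script's actual code. It is named dispatch to avoid confusion with the real execute, and it takes the handler functions as a parameter (an assumption) so that it runs without a PDF library.

```python
from typing import Callable, Dict

# Extra inputs each mode passes to its handler, in positional order.
MODE_ARGS = {
    "full": [],
    "section": ["section_number"],
    "pages": ["start_page", "end_page"],
    "tables": [],
}


def failure(message: str) -> dict:
    """Build an error result using the documented success/error fields."""
    return {"success": False, "error": message}


def dispatch(inputs: dict, handlers: Dict[str, Callable[..., dict]]) -> dict:
    """Route inputs to the handler registered for the requested mode."""
    mode = inputs.get("extraction_mode", "full")
    if mode not in MODE_ARGS or mode not in handlers:
        return failure(f"Unsupported extraction_mode: {mode!r}")
    required = ["file_path", *MODE_ARGS[mode]]
    missing = [name for name in required if inputs.get(name) is None]
    if missing:
        return failure("Missing input(s): " + ", ".join(missing))
    args = [inputs[name] for name in MODE_ARGS[mode]]
    try:
        return handlers[mode](inputs["file_path"], *args)
    except Exception as exc:  # report extraction errors instead of raising
        return failure(str(exc))
```

With the functions listed above, the handlers mapping would presumably pair "full" with extract_full_text, "section" with extract_section, "pages" with extract_pages, and "tables" with extract_tables.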

Examples

Example 1: Full Document Extraction

Input:

json

{
  "file_path": "/path/to/FSH-1909.15-Chapter10.pdf",
  "extraction_mode": "full"
}

Output:

json

{
  "success": true,
  "file_name": "FSH-1909.15-Chapter10.pdf",
  "page_count": 45,
  "extracted_text": "# FSH 1909.15 - Chapter 10\n\n## 10.1 Authority\n\nThe National Environmental Policy Act...",
  "tables": [],
  "sections_found": ["10.1 Authority", "10.2 Objective", "10.3 Policy"],
  "citations": [
    {"content": "Chapter 10 header", "page": 1},
    {"content": "Section 10.1", "page": 1}
  ],
  "error": null
}

Example 2: Section-Specific Extraction

Input:

json

{
  "file_path": "/path/to/FSH-1909.15-Chapter30.pdf",
  "extraction_mode": "section",
  "section_number": "31.2"
}

Output:

json

{
  "success": true,
  "file_name": "FSH-1909.15-Chapter30.pdf",
  "page_count": 120,
  "extracted_text": "## 31.2 Categorical Exclusions Established by the Chief\n\nThe following categories of action...",
  "tables": [],
  "sections_found": ["31.2 Categorical Exclusions Established by the Chief"],
  "citations": [
    {"content": "Section 31.2", "page": 42, "source": "FSH 1909.15"}
  ],
  "error": null
}

Example 3: Table Extraction

Input:

json

{
  "file_path": "/path/to/FSH-2409.18-Chapter40.pdf",
  "extraction_mode": "tables"
}

Output:

json

{
  "success": true,
  "file_name": "FSH-2409.18-Chapter40.pdf",
  "page_count": 85,
  "extracted_text": "",
  "tables": [
    {
      "page": 12,
      "caption": "Table 40-1: Cruise Method Selection",
      "markdown": "| Method | Use Case | Accuracy |\n|--------|----------|----------|\n| 100% | High-value timber | ±5% |\n| Variable Plot | Mixed stands | ±10% |"
    }
  ],
  "sections_found": [],
  "citations": [
    {"content": "Table 40-1", "page": 12}
  ],
  "error": null
}
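
The markdown field in the output above can be produced from row data with a small formatter. This sketch assumes the table arrives as a list of rows with the header first and None for empty cells, which is the shape libraries such as pdfplumber return. The function name rows_to_markdown is illustrative.

```python
from typing import List, Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[object]]]) -> str:
    """Render a header row plus data rows as a markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def cell(value: Optional[object]) -> str:
        # None becomes an empty cell; pipes and newlines would break the row.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    def line(row: Sequence[Optional[object]]) -> str:
        # Pad short rows so every line has the same number of columns.
        padded: List[Optional[object]] = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(cell(value) for value in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in range(width)) + " |"
    return "\n".join([line(header), separator, *(line(row) for row in body)])
```

The separator here uses three dashes per column rather than the longer runs shown in the example output; both render the same way in markdown.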

References

•Forest Service Handbook 1909.15 - NEPA Handbook
•Forest Service Manual 1950 - Environmental Policy and Procedures
•36 CFR Part 220 - Forest Service NEPA Procedures
•PyMuPDF Documentation: https://pymupdf.readthedocs.io/
•pdfplumber Documentation: https://github.com/jsvine/pdfplumber