required_canon_version: >=3.0.0

Skill: pdf-to-markdown

Version: 0.1.0

Status: Draft

Trigger

Use when converting PDF documents to Markdown format, typically for documentation purposes or to make PDF content more accessible and editable.

Inputs

•input.json with the following structure:

json

{
  "pdf_path": "path/to/document.pdf",
  "output_path": "path/to/output.md",
  "options": {
    "extract_images": false,
    "preserve_formatting": true,
    "page_breaks": "---"
  }
}

Fields:

•pdf_path (required, string): Absolute or relative path to input PDF file
•output_path (required, string): Path where Markdown file will be written
•options.extract_images (optional, boolean): Whether to extract embedded images (default: false)
•options.preserve_formatting (optional, boolean): Attempt to preserve text formatting (default: true)
•options.page_breaks (optional, string): String to insert between pages (default: "---")

Outputs

•
Creates a Markdown file at the specified output_path containing:
- •Extracted text from the PDF
- •Headers converted from document structure
- •Tables converted to Markdown tables
- •Optional page break markers between pages
- •Preserved whitespace and basic formatting

Output Format:

markdown

# Document Title

Section header

Paragraph text with **bold** and *italic* formatting.

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

---

Page 2 content continues...

Constraints

•Input PDF must be readable and not password-protected
•Output path must be within project root (enforced by GuardedWriter)
•Cannot write outside allowed locations (BUILD/, CONTRACTS/_runs/, etc.)
•Deterministic output: same input PDF always produces same Markdown
•Must use GuardedWriter for all file writes (write firewall enforcement)
•Images are extracted only when explicitly requested

Dependencies

•pdfplumber>=0.9.0 - PDF text and structure extraction
•Standard library only (no additional dependencies for basic operation)

Fixtures

•fixtures/basic/ - Simple PDF conversion test
•fixtures/multi-page/ - Multi-page document with page breaks
•fixtures/tables/ - PDF containing tables for table extraction

Error Handling

•Returns exit code 1 on errors with descriptive message
•
Handles common PDF errors:
- •File not found
- •Invalid PDF format
- •Password-protected PDF (not supported)
- •Encoding issues in text extraction

required_canon_version: >=3.0.0