PDF Structure OCR (PP-StructureV3)

This skill uses PaddleOCR's PP-StructureV3 engine to convert PDF files into structured Markdown. It performs layout analysis, OCR, and table recognition, then post-processes the extracted images by renaming them semantically and correcting their paths in the Markdown.

Usage

This skill is self-contained. Run it with the Python interpreter bundled in the skill's directory (env/bin/python), which has the required dependencies, such as paddlepaddle-gpu and paddleocr, pre-installed.

[!WARNING] High GPU Resource Usage: This task is extremely GPU-intensive.

•Run only one task at a time.
•Ensure sufficient VRAM is available before execution.
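
Both rules in the warning can be checked before a job starts. The following is a minimal sketch, assuming a Unix host with `nvidia-smi` on the PATH. The lock path and the 8192 MiB threshold are hypothetical values chosen for illustration; the skill does not state a specific VRAM requirement.

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/pdf_structure_ocr.lock"  # hypothetical lock location
MIN_FREE_MIB = 8192                        # hypothetical threshold; tune for your GPU


def parse_free_mib(nvidia_smi_output):
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_free_mib():
    """Ask nvidia-smi for the free memory of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_free_mib(out)


def acquire_single_task_lock(path=LOCK_PATH):
    """Return an exclusively locked file handle, or raise if another task holds it."""
    handle = open(path, "w")
    try:
        # Non-blocking: fail immediately instead of queueing a second GPU job.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        raise RuntimeError("another OCR task is already running")
    return handle  # keep the handle open for the lifetime of the task


def preflight():
    """Take the lock, then confirm at least one GPU has enough free memory."""
    lock = acquire_single_task_lock()
    if max(query_free_mib(), default=0) < MIN_FREE_MIB:
        lock.close()
        raise RuntimeError("not enough free VRAM")
    return lock
```

Calling `preflight()` first and closing the returned handle after the OCR command exits keeps concurrent launches from competing for the same GPU.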

Execution Command

The primary script is scripts/process_pdf.py. Execute it using the internal environment:

```bash
@path/env/bin/python @path/scripts/process_pdf.py <input_pdf_path> <output_directory> [--output_md <filename.md>]
```

Parameters:

•<input_pdf_path>: Path to the source PDF file.
•<output_directory>: Directory where the Markdown file and the imgs/ folder will be created.
•--output_md: (Optional) Custom name for the generated Markdown file. Defaults to final_structured_result.md.
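
When the command is launched from another Python program, the parameters above map directly onto an argument list. This is a sketch only: `skill_dir` is a hypothetical argument standing in for the `@path` placeholder, and it assumes the script signals failure with a non-zero exit code, which the text above does not confirm.

```python
import subprocess
from pathlib import Path


def build_command(skill_dir, input_pdf, output_dir, output_md=None):
    """Assemble the argument list for scripts/process_pdf.py."""
    skill_dir = Path(skill_dir)  # stands in for the @path placeholder
    cmd = [
        str(skill_dir / "env" / "bin" / "python"),       # bundled interpreter
        str(skill_dir / "scripts" / "process_pdf.py"),   # primary script
        str(input_pdf),
        str(output_dir),
    ]
    if output_md is not None:
        # Optional; the script falls back to final_structured_result.md.
        cmd += ["--output_md", output_md]
    return cmd


def run_ocr(skill_dir, input_pdf, output_dir, output_md=None):
    """Run the script; raises CalledProcessError on a non-zero exit (assumed)."""
    subprocess.run(build_command(skill_dir, input_pdf, output_dir, output_md), check=True)
```

Passing the arguments as a list avoids shell quoting problems with PDF paths that contain spaces.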

Key Features from the Script

•Semantic Image Renaming: Automatically searches for the nearest heading or paragraph title to name extracted images (e.g., P1_FinancialChart_0_898_71.jpg), making the assets human-readable.
•Hierarchy & Layout Cleanup: Fixes common OCR issues such as broken heading levels and redundant empty lines.
•Coordinate Tracking: Retains the original image coordinates in the filename for traceability.
•Page Identification: Injects # Page N markers at the start of each page's content for easier navigation.
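
Because the page number, title, and coordinates are all encoded in the image filename, downstream tools can recover them without opening the Markdown. The sketch below infers the pattern from the single example name above (`P<page>_<title>_<integers>.<ext>`); the real script may use a different layout, and the meaning of each integer is not documented here, so the numbers are returned as an uninterpreted list.

```python
import re

# Assumed layout, inferred from P1_FinancialChart_0_898_71.jpg:
#   P<page>_<title>_<int>_<int>... .<ext>
_NAME_RE = re.compile(r"^P(\d+)_(.+?)((?:_\d+)+)\.(\w+)$")


def parse_image_name(filename):
    """Split an extracted-image filename into page, title, and trailing integers.

    Returns None when the name does not follow the assumed layout.
    """
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    page, title, coord_part, ext = match.groups()
    # coord_part looks like "_0_898_71"; drop the empty piece before the first "_".
    coords = [int(piece) for piece in coord_part.split("_") if piece]
    return {"page": int(page), "title": title, "coords": coords, "ext": ext}
```

One caveat of this pattern: a title that itself ends in `_<digits>` is indistinguishable from a coordinate, so those digits end up in `coords`.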

Output Structure

•<output_directory>/<output_md>: The finalized Markdown file with corrected paths for images.
•<output_directory>/imgs/: A sub-folder containing all extracted figures, charts, and tables, named with semantic context.
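
A quick way to confirm that the image paths in the Markdown resolve is to scan the file and look each reference up on disk. This is a hedged sketch: it assumes references are relative to the output directory and appear either as Markdown images or as HTML `<img>` tags, neither of which is confirmed by the description above. `find_missing_images` is a hypothetical helper, not part of the skill.

```python
import re
from pathlib import Path

# Matches Markdown images ![alt](path) and HTML <img ... src="path">.
_IMG_RE = re.compile(
    r'!\[[^\]]*\]\(([^)\s]+)[^)]*\)|<img[^>]*\ssrc=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_missing_images(output_dir, md_name="final_structured_result.md"):
    """Return image references in the Markdown that do not exist on disk."""
    output_dir = Path(output_dir)
    text = (output_dir / md_name).read_text(encoding="utf-8")
    missing = []
    for md_path, html_path in _IMG_RE.findall(text):
        ref = md_path or html_path
        if ref.startswith(("http://", "https://")):
            continue  # remote images are out of scope for this check
        if not (output_dir / ref).is_file():
            missing.append(ref)
    return missing
```

An empty list means every local reference resolved; anything else names a path to inspect.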

Example

To process report.pdf and save results to an out folder with a custom name:

```bash
@path/env/bin/python @path/scripts/process_pdf.py report.pdf ./out --output_md digitized_report.md
```

Constraints

•DO NOT use the system python3 or global pip.
•DO NOT attempt to install additional packages.
•ALWAYS reference the interpreter as @path/env/bin/python.
•If the env/ directory is missing, the skill is improperly installed and will fail.
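
The last constraint can be turned into an explicit check that fails early with a clear message instead of a confusing launch error. A minimal sketch, where `skill_dir` is again a hypothetical stand-in for the `@path` placeholder:

```python
from pathlib import Path


def resolve_interpreter(skill_dir):
    """Return the bundled interpreter path, or raise if the install is incomplete."""
    env_dir = Path(skill_dir) / "env"
    if not env_dir.is_dir():
        raise FileNotFoundError(
            f"{env_dir} is missing; the skill is not installed correctly"
        )
    python = env_dir / "bin" / "python"
    if not python.is_file():
        raise FileNotFoundError(f"{python} is missing from the bundled environment")
    # Deliberately no fallback to the system python3 and no pip install.
    return python
```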