PDF Structure OCR (PP-StructureV3)
This skill uses PaddleOCR's PP-StructureV3 engine to convert PDF files into structured Markdown. It performs layout analysis, OCR, table recognition, and image extraction with context-based file naming.
Usage
This skill is self-contained. Run it with the Python interpreter bundled in the skill's directory, which provides the pre-installed dependencies such as paddlepaddle-gpu and paddleocr.
[!WARNING] High GPU Resource Usage: This task is extremely GPU-intensive.
- Run only one task at a time.
- Ensure sufficient VRAM is available before execution.
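One way to check free VRAM before starting a task is to query nvidia-smi. The sketch below is not part of the skill: the helper names and the `required_mib` threshold are hypothetical, and this document does not state how much memory PP-StructureV3 needs, so choose the threshold for your own hardware and documents.

```python
import shutil
import subprocess
from typing import List


def parse_free_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (free memory in MiB) per line, skipping blank or non-numeric lines."""
    values = []
    for line in nvidia_smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def free_vram_mib() -> List[int]:
    """Return free memory per GPU in MiB, or an empty list if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_free_mib(result.stdout)


def has_enough_vram(free_mib: List[int], required_mib: int) -> bool:
    """True if at least one GPU reports required_mib or more of free memory."""
    return any(value >= required_mib for value in free_mib)
```

Calling `has_enough_vram(free_vram_mib(), required_mib=...)` before launching the script gives a simple go/no-go check. It does not prevent two tasks from starting at the same moment.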
Execution Command
The primary script is scripts/process_pdf.py. Execute it using the internal environment:
@path/env/bin/python @path/scripts/process_pdf.py <input_pdf_path> <output_directory> [--output_md <filename.md>]
Parameters:
- <input_pdf_path>: Path to the source PDF file.
- <output_directory>: Directory where the Markdown file and the images folder will be created.
- --output_md: (Optional) Custom name for the generated Markdown file. Defaults to final_structured_result.md.
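The parameters above map naturally onto an argparse interface. The following is a minimal sketch of how such a command line could be declared. It assumes nothing about the actual contents of scripts/process_pdf.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare two positional arguments and one optional flag, mirroring the documented CLI."""
    parser = argparse.ArgumentParser(
        description="Convert a PDF into structured Markdown."
    )
    parser.add_argument("input_pdf_path", help="Path to the source PDF file.")
    parser.add_argument(
        "output_directory",
        help="Directory where the Markdown file and images folder are created.",
    )
    parser.add_argument(
        "--output_md",
        default="final_structured_result.md",
        help="Custom name for the generated Markdown file.",
    )
    return parser


# Parsing an explicit argument list shows the default being applied.
args = build_parser().parse_args(["report.pdf", "./out"])
print(args.output_md)  # final_structured_result.md
```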
Key Features from the Script
- Semantic Image Renaming: Automatically searches for the nearest heading or paragraph title to name extracted images (e.g., P1_FinancialChart_0_898_71.jpg), making the assets human-readable.
- Hierarchy & Layout Cleanup: Fixes common OCR issues such as broken heading levels and redundant empty lines.
- Coordinate Tracking: Retains the original image coordinates in the filename for traceability.
- Page Identification: Injects # Page N markers at the start of each page's content for easier navigation.
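To make the naming and page-marker features concrete, here is an illustrative sketch. It is not the script's actual code. All function names are hypothetical, and the filename layout (page number, title slug, then the coordinate values joined by underscores) is inferred only from the single example filename above.

```python
import re
from typing import Optional, Sequence


def slugify_title(title: str, max_len: int = 40) -> str:
    """Join the ASCII alphanumeric words of a title in CamelCase; fall back to 'Untitled'."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    slug = "".join(word[:1].upper() + word[1:] for word in words)
    return slug[:max_len] or "Untitled"


def semantic_image_name(
    page: int, title: Optional[str], coords: Sequence[int], ext: str = "jpg"
) -> str:
    """Build a filename from the page number, nearest title, and coordinate values."""
    slug = slugify_title(title or "")
    coord_part = "_".join(str(int(value)) for value in coords)
    return f"P{page}_{slug}_{coord_part}.{ext}"


def join_pages(pages: Sequence[str]) -> str:
    """Prefix each page with a '# Page N' marker and collapse runs of empty lines."""
    parts = []
    for number, text in enumerate(pages, start=1):
        cleaned = re.sub(r"\n{3,}", "\n\n", text.strip())
        parts.append(f"# Page {number}\n\n{cleaned}")
    return "\n\n".join(parts) + "\n"


print(semantic_image_name(1, "Financial Chart", (0, 898, 71)))
# P1_FinancialChart_0_898_71.jpg
```

This simplified slug keeps only ASCII letters and digits, so a title made up entirely of other characters falls back to Untitled. The real script may handle such titles differently.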
Output Structure
- <output_directory>/<output_md>: The finalized Markdown file with corrected paths for images.
- <output_directory>/imgs/: A sub-folder containing all extracted figures, charts, and tables, named with semantic context.
Example
To process report.pdf and save the results to ./out under a custom Markdown filename:
@path/env/bin/python @path/scripts/process_pdf.py report.pdf ./out --output_md digitized_report.md
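When the skill is driven from another Python program, the same command can be assembled and run with subprocess. This is a sketch under assumptions: `skill_root` stands in for the `@path` placeholder, and `build_command` and `run_conversion` are hypothetical helper names, not part of the skill.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

DEFAULT_OUTPUT_MD = "final_structured_result.md"


def build_command(
    skill_root: str, input_pdf: str, output_dir: str, output_md: Optional[str] = None
) -> List[str]:
    """Assemble the argument list using the skill's bundled interpreter and script."""
    root = Path(skill_root)
    command = [
        str(root / "env" / "bin" / "python"),
        str(root / "scripts" / "process_pdf.py"),
        str(input_pdf),
        str(output_dir),
    ]
    if output_md:
        command += ["--output_md", output_md]
    return command


def run_conversion(
    skill_root: str, input_pdf: str, output_dir: str, output_md: Optional[str] = None
) -> Path:
    """Run one conversion, wait for it to finish, and return the expected Markdown path."""
    command = build_command(skill_root, input_pdf, output_dir, output_md)
    subprocess.run(command, check=True)  # raises CalledProcessError on a non-zero exit
    return Path(output_dir) / (output_md or DEFAULT_OUTPUT_MD)
```

Because `subprocess.run` blocks until the process exits, calling `run_conversion` sequentially from a single process keeps that caller to one task at a time.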
Constraints
- DO NOT use the system python3 or global pip.
- DO NOT attempt to install additional packages.
- ALWAYS reference the interpreter as @path/env/bin/python.
- If the env/ directory is missing, the skill is improperly installed and will fail.
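A caller can fail early with a clear message when the bundled environment is missing, rather than falling back to the system interpreter. The helper below is a hypothetical sketch, and `skill_root` again stands in for `@path`.

```python
from pathlib import Path


def find_skill_python(skill_root: str) -> Path:
    """Return the bundled interpreter path, or raise if the env/ directory is incomplete."""
    interpreter = Path(skill_root) / "env" / "bin" / "python"
    if not interpreter.is_file():
        raise FileNotFoundError(
            f"{interpreter} not found: the skill appears to be improperly installed."
        )
    return interpreter
```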