AgentSkillsCN

skill-ocr

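Uses PaddleOCR (PP-StructureV3) for semantic image extraction and layout analysis, converting complex PDF documents into structured Markdown. Use it to digitize PDFs while preserving: 1. Document hierarchy (headings, numbering, and sections). 2. Tables (automatically converted to clean Markdown tables). 3. Images (extracted, semantically renamed based on nearby titles or text, and referenced). 4. Reading order recovery (fixing multi-column or complex layouts). **Important**: This skill must run inside its own bundled virtual environment.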

SKILL.md
--- frontmatter
name: skill-ocr
description: >
  Converts complex PDF documents into structured Markdown with semantic image extraction and layout analysis using PaddleOCR (PP-StructureV3). 
  Use when you need to digitize PDFs while preserving:
  1. Document hierarchy (headings, numbering, and sections).
  2. Tables (automatically converted to clean Markdown tables).
  3. Images (extracted, semantically renamed based on nearby titles/text, and referenced).
  4. Reading order recovery (fixing multi-column or complex layouts).
  
  **CRITICAL**: This skill MUST be executed using its own internal virtual environment.

PDF Structure OCR (PP-StructureV3)

This skill uses PaddleOCR's PP-StructureV3 engine to convert PDF files into structured Markdown. It performs layout analysis, OCR, table recognition, and semantic renaming of extracted images.

Usage

This skill is self-contained. Run it with the Python interpreter inside the skill's own directory, which has the required dependencies (such as paddlepaddle-gpu and paddleocr) pre-installed.

[!WARNING] High GPU Resource Usage: This task is extremely GPU-intensive.

  • Run only one task at a time.
  • Ensure sufficient VRAM is available before execution.
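
One way to enforce the one-task-at-a-time rule is a simple lock file. The sketch below is not part of the skill; `single_task_lock` and the lock path are hypothetical names, and it assumes all callers agree on the same lock location.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_task_lock(lock_path):
    """Hold an exclusive lock file for the duration of one OCR run.

    Hypothetical helper: the skill itself does not ship this function.
    """
    # O_CREAT | O_EXCL makes creation atomic: if the file already exists,
    # os.open raises FileExistsError, meaning another task holds the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Wrap the call that launches the script in `with single_task_lock(...)`, and treat `FileExistsError` as "another task is running". A crashed process can leave a stale lock behind, so a real setup would need a cleanup policy.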

Execution Command

The primary script is scripts/process_pdf.py. Execute it using the internal environment:

bash
@path/env/bin/python @path/scripts/process_pdf.py <input_pdf_path> <output_directory> [--output_md <filename.md>]

Parameters:

  • <input_pdf_path>: Path to the source PDF file.
  • <output_directory>: Directory where the Markdown and images folder will be created.
  • --output_md: (Optional) Custom name for the generated Markdown file. Defaults to final_structured_result.md.
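
The parameters above map directly onto an argument list. The following is a minimal sketch for building it from Python, assuming `@path` stands for the skill's installation directory (passed here as the hypothetical `skill_dir`); `build_command` is an illustrative name, not something the skill provides.

```python
from pathlib import Path


def build_command(skill_dir, input_pdf, output_dir, output_md=None):
    """Build the argv list for scripts/process_pdf.py.

    skill_dir is an assumed stand-in for the @path placeholder.
    """
    skill_dir = Path(skill_dir)
    cmd = [
        str(skill_dir / "env" / "bin" / "python"),       # internal interpreter
        str(skill_dir / "scripts" / "process_pdf.py"),   # primary script
        str(input_pdf),
        str(output_dir),
    ]
    # --output_md is optional; the script falls back to its default name.
    if output_md is not None:
        cmd += ["--output_md", output_md]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.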

Key Features from the Script

  • Semantic Image Renaming: Automatically searches for the nearest heading or paragraph title to name extracted images (e.g., P1_FinancialChart_0_898_71.jpg), making the assets human-readable.
  • Hierarchy & Layout Cleanup: Fixes common OCR issues such as broken heading levels and redundant empty lines.
  • Coordinate Tracking: Retains the original image coordinates in the filename for traceability.
  • Page Identification: Injects # Page N markers at the start of each page's content for easier navigation.
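
The page markers make it straightforward to split the generated Markdown back into per-page chunks. The sketch below assumes each marker is a line of exactly the form `# Page N`; `split_pages` is a hypothetical helper, not part of the script.

```python
import re

# Assumed marker format: a line consisting only of "# Page <number>".
PAGE_MARKER = re.compile(r"^# Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown):
    """Return {page_number: page_text} using the '# Page N' markers."""
    pages = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for i, m in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(m.group(1))] = markdown[m.end():end].strip()
    return pages
```

Text that appears before the first marker is ignored by this sketch.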

Output Structure

  • <output_directory>/<output_md>: The finalized Markdown file with corrected paths for images.
  • <output_directory>/imgs/: A sub-folder containing all extracted figures, charts, and tables, named with semantic context.
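
After a run, it can be useful to confirm that every image referenced in the Markdown actually exists on disk. The sketch below is a hypothetical check, assuming images are referenced either with Markdown image syntax or with HTML `<img src="...">` tags, and that paths are relative to the output directory.

```python
import re
from pathlib import Path

# Matches ![alt](path) or <img ... src="path">; both forms are assumptions
# about how the generated Markdown references images.
IMG_REF = re.compile(r'!\[[^\]]*\]\(([^)\s]+)\)|<img[^>]*\bsrc="([^"]+)"')


def missing_images(output_dir, output_md="final_structured_result.md"):
    """Return referenced image paths that do not exist under output_dir."""
    out = Path(output_dir)
    text = (out / output_md).read_text(encoding="utf-8")
    # Each match fills exactly one of the two groups.
    refs = [md_path or html_path for md_path, html_path in IMG_REF.findall(text)]
    return [ref for ref in refs if not (out / ref).is_file()]
```

An empty list means all references resolve. Absolute URLs, if any appear, would be reported as missing by this simple check.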

Example

To process report.pdf and save results to an out folder with a custom name:

bash
@path/env/bin/python @path/scripts/process_pdf.py report.pdf ./out --output_md digitized_report.md

Constraints

  • DO NOT use the system python3 or global pip.
  • DO NOT attempt to install additional packages.
  • ALWAYS reference the interpreter as @path/env/bin/python.
  • If the env/ directory is missing, the skill is improperly installed and will fail.
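
A caller can check for the internal interpreter up front rather than discovering a broken install partway through a run. This is a minimal sketch; `resolve_interpreter` and `skill_dir` (an assumed stand-in for `@path`) are hypothetical names.

```python
import os
from pathlib import Path


def resolve_interpreter(skill_dir):
    """Return the skill's internal Python interpreter, or raise if absent."""
    python = Path(skill_dir) / "env" / "bin" / "python"
    # Never fall back to the system python3: a missing env/ means the
    # skill is improperly installed.
    if not python.is_file() or not os.access(python, os.X_OK):
        raise FileNotFoundError(f"skill environment not found: {python}")
    return python
```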