docling

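Read and convert documents with Docling. Use this skill when the user asks to read, open, or process a document file in one of these formats: PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, or an image file (PNG, JPG, TIFF). OCR is supported for scanned documents. Trigger conditions: (1) the user asks to read or open a document file (e.g., "read this PDF", "read through this document", "check the contents of this file"); (2) the file extension is .pdf, .docx, .pptx, .xlsx, .html, .md, .adoc, .png, .jpg, or .tiff; (3) the user wants to extract text from a scanned document with OCR; (4) the user wants to convert a document to Markdown/JSON/HTML; (5) the user wants to process a document that contains tables, figures, or photos; (6) the user wants to extract images or figures from a document.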

SKILL.md
--- frontmatter
name: docling
description: |
  Document reading and conversion using Docling. Use this skill when user asks to read, open, or process document files in these formats: PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, or images (PNG, JPG, TIFF). Supports OCR for scanned documents. Trigger when:
  (1) User asks to read/open a document file (e.g., "このPDFを読んで", "read this document", "ファイルの内容を確認して")
  (2) File extension is .pdf, .docx, .pptx, .xlsx, .html, .md, .adoc, .png, .jpg, .tiff
  (3) User wants to extract text from scanned documents with OCR
  (4) User wants to convert documents to Markdown/JSON/HTML
  (5) User wants to process documents with tables, figures, or photos
  (6) User wants to extract images/figures from documents

Docling Document Conversion

Convert documents (PDF, DOCX, PPTX, HTML, Markdown, etc.) to structured formats with image extraction.

Installation

bash
pip install docling

# For page range processing (optional but recommended)
pip install pymupdf

# For Tesseract OCR (optional):
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr

Quick Start

CLI

bash
# Basic conversion (writes document.md to the current directory by default)
docling document.pdf --to md

# With OCR for scanned documents
docling scanned.pdf --ocr --ocr-engine easyocr --to md

# Batch conversion
docling file1.pdf file2.docx --output ./converted
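
When the CLI is driven from another program, it can be safer to assemble the argument list explicitly than to build a shell string. The sketch below is one way to do that. `build_docling_command` is a hypothetical helper, not part of Docling; it assumes the `--to`, `--output`, `--ocr`, and `--ocr-engine` flags used above and the CLI's `md` value for Markdown output. It only builds the command and leaves execution to the caller (for example via `subprocess.run`).

```python
from pathlib import Path
from typing import Iterable, List


def build_docling_command(
    inputs: Iterable[str],
    output_dir: str = ".",
    to: str = "md",
    ocr: bool = False,
    ocr_engine: str = "easyocr",
) -> List[str]:
    """Assemble an argv list for the docling CLI (hypothetical helper)."""
    sources = [str(Path(p)) for p in inputs]
    if not sources:
        raise ValueError("at least one input file is required")
    # Input files first, then the output format and target directory.
    cmd = ["docling", *sources, "--to", to, "--output", str(Path(output_dir))]
    if ocr:
        # Only request OCR explicitly when the caller asks for it.
        cmd += ["--ocr", "--ocr-engine", ocr_engine]
    return cmd


if __name__ == "__main__":
    print(build_docling_command(["file1.pdf", "file2.docx"], output_dir="converted"))
```

Passing the result to `subprocess.run(cmd, check=True)` avoids shell quoting problems with file names that contain spaces.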

Python

python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())

Using the Enhanced Conversion Script

Execute scripts/convert_document.py for advanced conversions with image extraction:

bash
# Basic PDF to Markdown with image extraction
python scripts/convert_document.py document.pdf -o ./output

# All formats (Markdown, JSON, HTML) with images
python scripts/convert_document.py document.pdf -o ./output -f all

# With OCR (Japanese + English)
python scripts/convert_document.py scanned.pdf -o ./output --ocr --languages ja en

# High accuracy table extraction
python scripts/convert_document.py document.pdf -o ./output --table-mode accurate

# Process specific page range
python scripts/convert_document.py large.pdf -o ./output --pages 1-20

# Generate batch script for large files (50+ pages)
python scripts/convert_document.py large.pdf -o ./output --generate-script

Output Structure

code
output/
├── document.md       # Markdown with embedded image links
├── document.json     # Structured JSON data
├── document.html     # HTML output (optional)
└── images/
    ├── figure_001.png   # Diagrams, charts as PNG
    ├── figure_002.png
    ├── photo_001.jpg    # Photos as JPEG
    └── photo_002.jpg
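
The naming scheme in the tree above (a per-kind counter, PNG for figures, JPEG for photos) can be sketched as follows. This is an illustration only: `image_filenames` and the `EXTENSIONS` mapping are hypothetical names, and how the script actually decides whether an image is a figure or a photo is not described here.

```python
from collections import Counter
from typing import Iterable, List

# Assumed mapping from image kind to file extension, read off the tree above.
EXTENSIONS = {"figure": "png", "photo": "jpg"}


def image_filenames(kinds: Iterable[str]) -> List[str]:
    """Return names like figure_001.png / photo_001.jpg, numbered per kind."""
    counters: Counter = Counter()
    names = []
    for kind in kinds:
        if kind not in EXTENSIONS:
            raise ValueError(f"unknown image kind: {kind!r}")
        counters[kind] += 1  # each kind keeps its own running number
        names.append(f"{kind}_{counters[kind]:03d}.{EXTENSIONS[kind]}")
    return names


if __name__ == "__main__":
    print(image_filenames(["figure", "photo", "figure"]))
```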

Script Options

| Option | Description | Default |
| --- | --- | --- |
| -o, --output | Output directory (required) | - |
| -f, --format | Output format (markdown, json, html, all) | markdown |
| --ocr | Enable OCR for scanned documents | disabled |
| --ocr-engine | OCR engine (easyocr, tesseract) | easyocr |
| --languages | OCR languages | en ja |
| --table-mode | Table extraction (fast, accurate) | fast |
| --pages | Page range (e.g., 1-20) | all pages |
| --generate-script | Generate batch script for large files | - |
| --batch-size | Pages per batch | 20 |
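
A value such as `1-20` for `--pages` has to be turned into a start and end page before it can be used. The following is a minimal sketch of such parsing, not the script's actual implementation. `parse_page_range` is a hypothetical name, and accepting a single page number like `7` is an assumption that goes beyond the `1-20` form documented above.

```python
from typing import Tuple


def parse_page_range(spec: str) -> Tuple[int, int]:
    """Parse '1-20' (or a single page like '7') into a 1-based inclusive range."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)  # raises ValueError on empty or non-numeric input
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end


if __name__ == "__main__":
    print(parse_page_range("1-20"))
```

Rejecting reversed ranges such as `20-1` up front keeps the error close to the user's input instead of surfacing later during conversion.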

Large File Handling

For files with 50+ pages or 50+ MB:

Option 1: Page Range Processing

bash
# Process pages 1-20
python scripts/convert_document.py large.pdf -o ./output --pages 1-20

# Then process pages 21-40
python scripts/convert_document.py large.pdf -o ./output2 --pages 21-40
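
Working out the page ranges by hand becomes tedious for long documents. As a rough sketch, the ranges can be computed from the total page count and a batch size. `page_batches` is a hypothetical helper; the total page count is assumed to be known already, and the per-batch output directory names printed below are an assumption rather than something the script requires.

```python
from typing import List, Tuple


def page_batches(total_pages: int, batch_size: int = 20) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive 1-based inclusive ranges."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


if __name__ == "__main__":
    # Print one command per batch; the output directory naming is an assumption.
    for start, end in page_batches(total_pages=50, batch_size=20):
        print(
            "python scripts/convert_document.py large.pdf "
            f"-o ./output_{start}-{end} --pages {start}-{end}"
        )
```

The last batch is clipped to the final page, so a 50-page file with a batch size of 20 ends with a 10-page batch.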

Option 2: Generate Batch Script

bash
# Generate a batch processing script
python scripts/convert_document.py large.pdf -o ./output --generate-script --batch-size 20

# Run the generated script
python ./output/batch_process.py

Advanced Configuration

OCR Setup

python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions

pipeline = PdfPipelineOptions()
pipeline.do_ocr = True
pipeline.ocr_options = EasyOcrOptions(
    lang=["ja", "en"],  # Languages for OCR
    confidence_threshold=0.5
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)

Table Extraction

python
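from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

# Reuse the `pipeline` object from OCR Setup, or start from fresh options
pipeline = PdfPipelineOptions()
pipeline.do_table_structure = True
pipeline.table_structure_options.mode = TableFormerMode.ACCURATE  # or TableFormerMode.FAST
pipeline.table_structure_options.do_cell_matching = True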

Image Extraction (Python API)

python
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable image generation
pipeline = PdfPipelineOptions()
pipeline.generate_picture_images = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
result = converter.convert("document.pdf")

# Save images
images_dir = Path("./output/images")
images_dir.mkdir(parents=True, exist_ok=True)

for idx, picture in enumerate(result.document.pictures):
    if picture.image and picture.image.pil_image:
        pil_img = picture.image.pil_image
        pil_img.save(images_dir / f"image_{idx:03d}.png", "PNG")

Export Options

python
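# `doc` is the converted document from an earlier converter.convert(...) call
doc = result.document

# Markdown
markdown = doc.export_to_markdown()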

# JSON (dict)
data = doc.export_to_dict()

# HTML
html = doc.export_to_html()

# Save with different image modes
from docling_core.types.doc import ImageRefMode
doc.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)

Supported Formats

| Input | Output |
| --- | --- |
| PDF, DOCX, PPTX, XLSX | Markdown |
| HTML, Markdown, AsciiDoc | JSON |
| Images (PNG, JPG, TIFF) | HTML |
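
Before handing a file to the converter, a caller may want to check that its extension is one of the supported inputs. The sketch below uses the extension list from the skill description's trigger conditions. `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and variants that the description does not list (such as `.jpeg` or `.tif`) are deliberately left out.

```python
from pathlib import Path

# Extensions listed in the skill description's trigger conditions.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx",
    ".html", ".md", ".adoc",
    ".png", ".jpg", ".tiff",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported input list."""
    # Compare case-insensitively so "Report.PDF" is treated like "report.pdf".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["Report.PDF", "slides.pptx", "notes.txt"]:
        print(name, is_supported(name))
```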

OCR Engines

| Engine | Install | Languages |
| --- | --- | --- |
| EasyOCR | pip install easyocr | 80+ languages |
| Tesseract | System package | 100+ languages |

Common Patterns

Batch Processing

python
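from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # reuse one converter for all files
for pdf in Path("docs").glob("*.pdf"):
    result = converter.convert(str(pdf))
    output = pdf.with_suffix(".md")  # docs/report.pdf -> docs/report.md
    output.write_text(result.document.export_to_markdown(), encoding="utf-8")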

RAG Chunking

python
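from docling.chunking import HierarchicalChunker
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("document.pdf")

chunker = HierarchicalChunker()  # splits along the document's structure
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
    print(chunk.text)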

Extract Tables to CSV

python
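# `result` comes from an earlier converter.convert(...) call
for idx, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()  # pandas DataFrame
    df.to_csv(f"table_{idx}.csv", index=False)
    print(df.to_markdown())  # Markdown preview; requires the `tabulate` package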