Docling Document Conversion
Convert documents (PDF, DOCX, PPTX, HTML, Markdown, etc.) to structured formats with image extraction.
Installation
bash
pip install docling

# For page range processing (optional but recommended)
pip install pymupdf

# For Tesseract OCR (optional):
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
Quick Start
CLI
bash
# Basic conversion (output to directory)
docling document.pdf --to md

# With OCR for scanned documents
docling scanned.pdf --ocr --ocr-engine easyocr --to md

# Batch conversion
docling file1.pdf file2.docx --output ./converted
Python
python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())
Using the Enhanced Conversion Script
Execute scripts/convert_document.py for advanced conversions with image extraction:
bash
# Basic PDF to Markdown with image extraction
python scripts/convert_document.py document.pdf -o ./output

# All formats (Markdown, JSON, HTML) with images
python scripts/convert_document.py document.pdf -o ./output -f all

# With OCR (Japanese + English)
python scripts/convert_document.py scanned.pdf -o ./output --ocr --languages ja en

# High accuracy table extraction
python scripts/convert_document.py document.pdf -o ./output --table-mode accurate

# Process specific page range
python scripts/convert_document.py large.pdf -o ./output --pages 1-20

# Generate batch script for large files (50+ pages)
python scripts/convert_document.py large.pdf -o ./output --generate-script
Output Structure
code
output/
├── document.md # Markdown with embedded image links
├── document.json # Structured JSON data
├── document.html # HTML output (optional)
└── images/
    ├── figure_001.png # Diagrams, charts as PNG
    ├── figure_002.png
    ├── photo_001.jpg # Photos as JPEG
    └── photo_002.jpg
Script Options
| Option | Description | Default |
|---|---|---|
| -o, --output | Output directory (required) | - |
| -f, --format | Output format (markdown, json, html, all) | markdown |
| --ocr | Enable OCR for scanned documents | disabled |
| --ocr-engine | OCR engine (easyocr, tesseract) | easyocr |
| --languages | OCR languages | en ja |
| --table-mode | Table extraction (fast, accurate) | fast |
| --pages | Page range (e.g., 1-20) | all pages |
| --generate-script | Generate batch script for large files | - |
| --batch-size | Pages per batch | 20 |
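To make the option table concrete, here is a minimal sketch of an argparse parser with the same flags and defaults. It is an illustration only: `build_parser` and the positional `input` argument are hypothetical names, and the actual parser inside scripts/convert_document.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the option table (not the script's real code)."""
    parser = argparse.ArgumentParser(prog="convert_document.py")
    parser.add_argument("input", help="Path to the source document")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("-f", "--format", default="markdown",
                        choices=["markdown", "json", "html", "all"])
    parser.add_argument("--ocr", action="store_true",
                        help="Enable OCR for scanned documents")
    parser.add_argument("--ocr-engine", default="easyocr",
                        choices=["easyocr", "tesseract"])
    parser.add_argument("--languages", nargs="+", default=["en", "ja"])
    parser.add_argument("--table-mode", default="fast",
                        choices=["fast", "accurate"])
    parser.add_argument("--pages", default=None,
                        help="Page range such as 1-20 (default: all pages)")
    parser.add_argument("--generate-script", action="store_true")
    parser.add_argument("--batch-size", type=int, default=20)
    return parser


# Options that are not given fall back to the defaults listed in the table.
args = build_parser().parse_args(
    ["scanned.pdf", "-o", "./output", "--ocr", "--languages", "ja", "en"]
)
print(args.format, args.ocr_engine, args.languages, args.batch_size)
```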
Large File Handling
For files with 50+ pages or 50+ MB:
Option 1: Page Range Processing
bash
# Process pages 1-20
python scripts/convert_document.py large.pdf -o ./output --pages 1-20

# Then process pages 21-40
python scripts/convert_document.py large.pdf -o ./output2 --pages 21-40
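Writing out the ranges by hand gets tedious for long documents. The sketch below computes consecutive `--pages` values for a given page count and prints one command per batch. `page_ranges` is a hypothetical helper, and the numbered output directory names are an assumption made for illustration.

```python
def page_ranges(total_pages: int, batch_size: int = 20) -> list[str]:
    """Return 1-based, inclusive ranges such as '1-20' that cover every page."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


# Print one command per batch for a 45-page document.
for index, pages in enumerate(page_ranges(45), start=1):
    print(
        "python scripts/convert_document.py large.pdf "
        f"-o ./output_{index:02d} --pages {pages}"
    )
```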
Option 2: Generate Batch Script
bash
# Generate a batch processing script
python scripts/convert_document.py large.pdf -o ./output --generate-script --batch-size 20

# Run the generated script
python ./output/batch_process.py
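The thresholds at the start of this section (50+ pages or 50+ MB) can be checked before choosing a strategy. This is a rough sketch with hypothetical helper names. The page count is passed in by the caller, for example from PyMuPDF if it is installed, and interpreting MB as 1024 * 1024 bytes is an assumption.

```python
import os
from typing import Optional

MAX_PAGES = 50                 # "50+ pages"
MAX_BYTES = 50 * 1024 * 1024   # "50+ MB", assuming 1 MB = 1024 * 1024 bytes


def exceeds_thresholds(size_bytes: int, page_count: Optional[int] = None) -> bool:
    """True when either the size or the (optional) page count reaches its limit."""
    if size_bytes >= MAX_BYTES:
        return True
    return page_count is not None and page_count >= MAX_PAGES


def is_large_file(path: str, page_count: Optional[int] = None) -> bool:
    """Convenience wrapper that reads the file size from disk."""
    return exceeds_thresholds(os.path.getsize(path), page_count)
```

If `is_large_file` returns True, page range processing or the generated batch script is probably the safer choice.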
Advanced Configuration
OCR Setup
python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
pipeline = PdfPipelineOptions()
pipeline.do_ocr = True
pipeline.ocr_options = EasyOcrOptions(
    lang=["ja", "en"],  # Languages for OCR
    confidence_threshold=0.5,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
Table Extraction
python
from docling.datamodel.pipeline_options import TableFormerMode

# `pipeline` is the PdfPipelineOptions instance created under OCR Setup
pipeline.do_table_structure = True
pipeline.table_structure_options.mode = TableFormerMode.ACCURATE  # or FAST
pipeline.table_structure_options.do_cell_matching = True
Image Extraction (Python API)
python
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Enable image generation
pipeline = PdfPipelineOptions()
pipeline.generate_picture_images = True
converter = DocumentConverter(
format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
result = converter.convert("document.pdf")
# Save images
images_dir = Path("./output/images")
images_dir.mkdir(parents=True, exist_ok=True)
for idx, picture in enumerate(result.document.pictures):
    if picture.image and picture.image.pil_image:
        pil_img = picture.image.pil_image
        pil_img.save(images_dir / f"image_{idx:03d}.png", "PNG")
Export Options
python
from docling_core.types.doc import ImageRefMode

# `result` is the return value of converter.convert(...)
doc = result.document
# Markdown
markdown = doc.export_to_markdown()
# JSON (dict)
data = doc.export_to_dict()
# HTML
html = doc.export_to_html()
# Save with different image modes
doc.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)
Supported Formats
| Input | Output |
|---|---|
| PDF, DOCX, PPTX, XLSX | Markdown |
| HTML, Markdown, AsciiDoc | JSON |
| Images (PNG, JPG, TIFF) | HTML |
OCR Engines
| Engine | Install | Languages |
|---|---|---|
| EasyOCR | pip install easyocr | 80+ languages |
| Tesseract | System package | 100+ languages |
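Before enabling OCR it can help to check which engines are actually present. The sketch below uses only the standard library. `available_ocr_engines` is a hypothetical helper, and it only tests whether the `easyocr` package is importable and whether a `tesseract` binary is on PATH. It does not verify language data or any Python bindings that a Tesseract integration may additionally need.

```python
import importlib.util
import shutil


def available_ocr_engines() -> dict[str, bool]:
    """Report which OCR engines from the table appear to be installed."""
    return {
        # EasyOCR is a Python package, so look for an importable module.
        "easyocr": importlib.util.find_spec("easyocr") is not None,
        # Tesseract is a system package, so look for the executable on PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


print(available_ocr_engines())
```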
Common Patterns
Batch Processing
python
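from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
for pdf in Path("docs").glob("*.pdf"):
    result = converter.convert(str(pdf))
    # Write the Markdown next to the source file (docs/name.pdf -> docs/name.md)
    output = pdf.with_suffix(".md")
    output.write_text(result.document.export_to_markdown(), encoding="utf-8")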
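In a long batch, one unreadable file should not abort the whole run. The following sketch wraps the loop with per-file error handling. `convert_all` and `fake_to_markdown` are hypothetical names, and the conversion step is passed in as a function so the example runs without docling installed.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    paths: Iterable[Path], to_markdown: Callable[[Path], str]
) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Convert each path, returning (written files, failures with messages)."""
    written, failed = [], []
    for path in paths:
        try:
            text = to_markdown(path)
        except Exception as exc:  # keep going when a single file fails
            failed.append((path, str(exc)))
            continue
        output = path.with_suffix(".md")
        output.write_text(text, encoding="utf-8")
        written.append(output)
    return written, failed


# Stand-in converter used only for this demo.
def fake_to_markdown(path: Path) -> str:
    if path.stem == "broken":
        raise ValueError("cannot parse")
    return f"# {path.stem}\n"


with tempfile.TemporaryDirectory() as tmp:
    inputs = [Path(tmp) / "report.pdf", Path(tmp) / "broken.pdf"]
    written, failed = convert_all(inputs, fake_to_markdown)
    print([p.name for p in written])
    print([(p.name, message) for p, message in failed])
```

With docling, the second argument could be `lambda p: converter.convert(str(p)).document.export_to_markdown()`, reusing the `converter` created above.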
RAG Chunking
python
from docling.chunking import HierarchicalChunker
chunker = HierarchicalChunker()
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
    print(chunk.text)
Extract Tables to CSV
python
for idx, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    df.to_csv(f"table_{idx}.csv", index=False)
    print(df.to_markdown())  # Print as Markdown (requires the tabulate package)