PDF to Markdown Converter
Convert PDF documents to markdown using docling with intelligent cleanup and splitting.
Quick Start
# 1. Setup (one-time)
bash scripts/setup_venv.sh
# 2. Convert PDF (fast mode recommended)
# Local file:
python scripts/convert_full.py document.pdf --no-ocr
# Or URL (docling downloads automatically):
python scripts/convert_full.py https://example.com/document.pdf --no-ocr
# 3. Iterative cleanup - fix ONE issue at a time:
python scripts/extract_samples.py document.md # Review
python scripts/apply_substitutions.py document.md -s 'pattern' --dry-run # Test
python scripts/apply_substitutions.py document.md -s 'pattern' # Apply
python scripts/extract_samples.py document.md # Verify
# 3b. MANDATORY: Check for subordinate elements after Phase 3
grep "^## Table\|^## Figure\|^## [a-z])\|^## [A-Z][a-z]*[^:]\{0,20\}$" document.md # Detect Phase 4 issues
# 4. Optional splitting
python scripts/analyze_split_points.py document.md # Analyze
python scripts/split_markdown.py document.md --heading-level 2 --dry-run # Test
python scripts/split_markdown.py document.md --heading-level 2 # Apply
5-Phase Cleanup Approach
Work iteratively: dry-run → apply → verify each phase:
- Document Denoising → cleanup-phase1-denoising.md
  Remove image placeholders, boilerplate, conversion artifacts
- Headers/Footers → cleanup-phase2-headers-footers.md
  Remove page numbers, copyright footers, repeated organizational content
- Basic Numbered Sections → cleanup-phase3-basic-numbered-sections.md
  Analyze numbering patterns and fix heading depths to match logical document hierarchy
- ⚠️ Context-Aware Subordinates → cleanup-phase4-context-aware-subordinates.md
  MANDATORY CHECK: Fix tables, figures, and list items incorrectly promoted to H2 headings
  # Always run these detection commands after Phase 3:
  grep "^## Table\|^## Figure\|^## [a-z])\|^## [A-Z][a-z]*[^:]\{0,20\}$" document.md
- Spacing/Formatting → cleanup-phase5-spacing-formatting.md
  Clean up excessive blank lines, list formatting, table issues
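For readers who prefer to run the Phase 4 detection from Python, the grep check can be sketched roughly as follows. This is a minimal translation of the grep pattern into Python regex syntax, not part of the shipped scripts; the function name `find_suspect_headings` and the sample text are hypothetical.

```python
import re

# The grep pattern from the Phase 4 check, converted from BRE to Python
# regex syntax: \| becomes |, \{0,20\} becomes {0,20}, and ) is escaped.
SUSPECT_H2 = re.compile(
    r"^## Table|^## Figure|^## [a-z]\)|^## [A-Z][a-z]*[^:]{0,20}$"
)


def find_suspect_headings(markdown_text):
    """Return (line_number, line) pairs for H2 headings the pattern flags."""
    hits = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if SUSPECT_H2.search(line):
            hits.append((number, line))
    return hits


# Hypothetical sample: a real section heading plus two promoted subordinates.
sample = "\n".join([
    "## 1. Introduction",
    "## Table 3: Results",
    "## a) first item",
    "Body text",
])

for number, line in find_suspect_headings(sample):
    print(f"{number}: {line}")
```

Like the grep command, the last alternative is deliberately broad, so expect some legitimate short headings among the hits and review each one before demoting it.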
Critical Principles
- Small iterations: Fix 1-2 issues per iteration, not everything at once
- Always verify: Extract samples after EACH change to confirm it worked
- Test first: ALWAYS use --dry-run before applying
- Document-specific patterns: Each PDF is unique - adapt to actual content
- Safe recovery: Use backup files if something goes wrong
Performance Notes
⚠️ Use --no-ocr flag unless processing scanned documents:
- Without OCR: 5-10 minutes for 180-page PDF ✅
- With OCR: 60+ minutes for same PDF ❌
See performance-guide.md for optimization details.
Core Scripts
convert_full.py
python scripts/convert_full.py <pdf_source> [--no-ocr] [-o output.md]
Supports both local files and URLs. Examples:
python scripts/convert_full.py document.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf -o custom.md --no-ocr
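To make the effect of `--no-ocr` more concrete, here is a rough sketch of a docling conversion with OCR switched off. It is an assumption about how such a flag could map onto docling's pipeline options, not a copy of `convert_full.py`; the function name `convert_pdf_to_markdown` and its `use_ocr` parameter are hypothetical.

```python
def convert_pdf_to_markdown(source, use_ocr=False):
    """Convert a local path or URL to a markdown string with docling.

    Hypothetical helper: shows one way to disable OCR, which may differ
    from what convert_full.py actually does.
    """
    # Imported lazily so this module still loads where docling is missing.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = use_ocr  # False skips OCR, the fast path for text PDFs

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(source)  # accepts a file path or a URL
    return result.document.export_to_markdown()
```

Calling `convert_pdf_to_markdown("document.pdf")` inside the virtual environment created by the setup script should return the markdown text, which can then be written to a file of your choice.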
extract_samples.py
python scripts/extract_samples.py <markdown_file> [--min-repeats N]
Shows document structure and repeated patterns for cleanup planning.
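The idea of surfacing repeated patterns can be illustrated with a small sketch that counts identical lines. It is a simplified stand-in, not the logic of `extract_samples.py`; the function name `repeated_lines` and the default threshold of 3 are assumptions made for illustration.

```python
from collections import Counter


def repeated_lines(markdown_text, min_repeats=3):
    """Return (line, count) pairs for non-blank lines seen at least min_repeats times.

    Lines that repeat verbatim are typical header, footer, or boilerplate
    candidates for Phase 1 and Phase 2 cleanup.
    """
    counts = Counter(
        line.strip() for line in markdown_text.splitlines() if line.strip()
    )
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


# Hypothetical sample with a footer repeated on every page.
sample = "Intro\nCopyright ACME\nBody\nCopyright ACME\n\nMore\nCopyright ACME\n"
for line, n in repeated_lines(sample):
    print(f"{n}x {line}")
```

Exact-match counting misses footers that vary by page number, so a pattern such as a regex over the line is usually the next step once a candidate is spotted.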
apply_substitutions.py
python scripts/apply_substitutions.py <markdown_file> -s 'pattern' [--dry-run]
Applies sed-style regex substitutions with automatic backup.
Always use --dry-run first to test patterns before applying.
Example usage:
# Test a pattern first
python scripts/apply_substitutions.py document.md -s 's/old/new/g' --dry-run
# Apply if satisfied with dry-run results
python scripts/apply_substitutions.py document.md -s 's/old/new/g'
Pattern Development Strategy:
- Use extract_samples.py to identify issues
- Develop patterns specific to your document
- Test with --dry-run
- Apply and verify with extract_samples.py
See phase-specific documentation for detailed patterns and examples.
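The dry-run, backup, and apply behavior described for `apply_substitutions.py` can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the script's real implementation: the helper names, the `.bak` backup suffix, and the naive `/` splitting (which does not handle escaped slashes) are all hypothetical.

```python
import re
import shutil
from pathlib import Path


def parse_expression(expression):
    """Split 's/old/new/flags' into (pattern, replacement, flags)."""
    parts = expression.split("/")  # naive: escaped slashes are not supported
    if len(parts) != 4 or parts[0] != "s":
        raise ValueError(f"expected s/old/new/flags, got {expression!r}")
    _, pattern, replacement, flags = parts
    return pattern, replacement, flags


def substitute(text, expression):
    """Apply one sed-style substitution to the whole text."""
    pattern, replacement, flags = parse_expression(expression)
    count = 0 if "g" in flags else 1  # for re.sub, 0 means replace every match
    return re.sub(pattern, replacement, text, count=count, flags=re.MULTILINE)


def apply_to_file(path, expression, dry_run=True):
    """Return True if the file would change; write only when dry_run is False."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    updated = substitute(original, expression)
    changed = updated != original
    if changed and not dry_run:
        # Assumed backup naming: document.md -> document.md.bak
        shutil.copy2(path, path.with_suffix(path.suffix + ".bak"))
        path.write_text(updated, encoding="utf-8")
    return changed
```

Note that this sketch treats the file as one string, so without the `g` flag only the first match in the file is replaced, whereas sed would replace the first match on each line.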
analyze_split_points.py / split_markdown.py
python scripts/analyze_split_points.py <markdown_file>
python scripts/split_markdown.py <markdown_file> --heading-level 2 [--dry-run]
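Splitting at a heading level amounts to starting a new section whenever a line opens with the matching number of `#` characters. The sketch below shows that idea only; it is not the code of `split_markdown.py`, the function name `split_by_heading` is hypothetical, and it does not skip heading-like lines inside fenced code blocks.

```python
def split_by_heading(markdown_text, heading_level=2):
    """Split markdown into (title, body) sections at the given heading level.

    Content before the first matching heading is returned with title None.
    Deeper headings (for example ### under ##) stay inside their section.
    """
    marker = "#" * heading_level + " "
    sections = []
    title, lines = None, []
    for line in markdown_text.splitlines():
        if line.startswith(marker):
            # Close the section collected so far before starting a new one.
            if title is not None or lines:
                sections.append((title, "\n".join(lines)))
            title, lines = line[len(marker):].strip(), [line]
        else:
            lines.append(line)
    if title is not None or lines:
        sections.append((title, "\n".join(lines)))
    return sections


# Hypothetical sample document.
doc = "# Title\nintro\n## A\ntext a\n### A.1\nmore\n## B\ntext b"
for title, body in split_by_heading(doc):
    print(title, len(body.splitlines()))
```

Printing the titles and line counts first, as above, plays the same role as `--dry-run`: it shows where the cuts would fall before any files are written.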
Detailed Guides
- Advanced Usage → docling-usage.md
- Performance Optimization → performance-guide.md
- Workflow Details → workflow-guide.md
- Script Reference → script-reference.md