Document Converter
Purpose
Converts documents in a range of formats to pure Markdown. Every table is rewritten as a hierarchical list, which is easier to read and easier to chunk for RAG (Retrieval-Augmented Generation) knowledge bases.
Key Features
- ✅ Universal Format Support - PDF, DOCX, DOC, PPTX, PPT, TXT, HTML, ODT
- ✅ Intelligent Table Recognition - Automatically detects and converts different table types
- ✅ Structured Output - Tables converted to hierarchical lists (## → ### → *)
- ✅ RAG-Optimized - Clear semantic boundaries for easy chunking
- ✅ Consistent Formatting - All formats produce identical output structure
- ✅ No HTML Artifacts - Pure Markdown, no mixed formats
- ✅ Batch Processing - Convert entire directories at once
Installation
Required Dependencies
    # macOS
    brew install pandoc

    # Ubuntu/Debian
    sudo apt-get install pandoc

    # Python dependencies
    pip install pdfplumber beautifulsoup4
Optional (for .doc files)
    # macOS
    brew install libreoffice

    # Ubuntu/Debian
    sudo apt-get install libreoffice
Usage
Single File Conversion
    cd /Users/tvwoo/.claude/skills/document-converter

    # Convert any supported format
    python main.py document.pdf
    python main.py report.docx
    python main.py presentation.pptx

    # Specify output file
    python main.py input.pdf output.md
Batch Conversion (All Formats)
    cd /Users/tvwoo/.claude/skills/document-converter/scripts

    # Convert entire directory (auto-detects all supported formats)
    python batch_convert_all.py /path/to/documents

    # Specify output directory
    python batch_convert_all.py /path/to/input /path/to/output
Supported Formats
| Format | Support | Conversion Method | Notes |
|---|---|---|---|
| PDF | ✅ Full | pdfplumber + custom parser | Best table extraction |
| DOCX | ✅ Full | Pandoc + table optimizer | Native support |
| DOC | ✅ Full | LibreOffice → Pandoc | Auto-converts to DOCX |
| PPTX/PPT | ✅ Full | Pandoc + table optimizer | Slide-by-slide |
| TXT | ✅ Full | Pandoc | Direct conversion |
| HTML | ✅ Full | Pandoc + cleanup | Web content |
| ODT | ✅ Full | Pandoc | OpenDocument |
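The routing implied by the table above can be sketched as a lookup on the file extension. This is a minimal illustration, not the code in `universal_converter.py`: the names `ROUTES`, `pick_route`, and `find_convertible` are hypothetical, and the route labels are shorthand for the "Conversion Method" column.

```python
from pathlib import Path

# Hypothetical routing table mirroring the "Conversion Method" column.
ROUTES = {
    ".pdf": "pdfplumber",
    ".docx": "pandoc",
    ".doc": "libreoffice+pandoc",
    ".pptx": "pandoc",
    ".ppt": "pandoc",
    ".txt": "pandoc",
    ".html": "pandoc",
    ".odt": "pandoc",
}


def pick_route(path):
    """Return the conversion route for a file, or None if the format is unsupported."""
    return ROUTES.get(Path(path).suffix.lower())


def find_convertible(root):
    """List every file under root whose extension has a route (recursive)."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and pick_route(p))
```

Matching on the lower-cased suffix means `Report.PDF` and `report.pdf` take the same path.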
Output Format Examples
Simple List Table
Input: Equipment list with quantities
Output:
    ## Tools and Instruments List
    * **Multimeter:** 1
    * **Infrared thermometer gun:** 1
    * **Tape measure:** 1
    * **Voltage tester pen:** 1
Configuration Table
Input: Furniture allocation table
Output:
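    ## Office Area Furniture Category Allocation Standard

    ### Class C Desk
    * No.: 1
    * Spec (W*D*H): 2000*10000
    * Unit: piece
    * Department director: ✓
    * Center director: ✓

    ### Class C Office Chair
    * No.: 2
    * Spec (W*D*H): five-star base
    * Unit: piece
    * Department director: ✓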
Inspection Standards Table
Input: Equipment inspection checklist
Output:
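    ## High-Voltage Power Supply Equipment Inspection Standards

    ### 35 kV Switchgear

    **Gas compartment pressure gauges (each bay)**
    * Standard: the pressure gauge of each bay's gas compartment reads correctly (0.05-0.08 MPa), with no abnormal-pressure messages
    * Frequency: twice daily

    **Circuit breakers and three-position disconnectors**
    * Standard: the mechanical open/close indication of circuit breakers, three-position disconnectors, etc. matches the electrical indication
    * Frequency: twice daily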
Form/Template Table
Input: Application form with fields
Output:
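    ## Petition Registration Form

    **Basic information:**
    * Visitor name: ___________
    * Gender: ___________
    * Contact phone: ___________

    **Matter details:**
    * Reason for visit: ___________

    **Handling information:**
    * Receiving department: ___________
    * General Affairs Department opinion: ___________
    * Instruction from the responsible leader: ___________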
Why This Format Works Well for RAG
1. Clear Hierarchical Structure
    ## Chapter Level      ← Large semantic unit
    ### Section Level     ← Medium semantic unit
    * Item Level          ← Small semantic unit
2. Easy Chunking
Each ### section is a complete, independent knowledge unit that can be:
- Indexed separately
- Retrieved independently
- Understood without context
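As a rough illustration of chunking on `###` boundaries, the sketch below splits converter output into one record per section and carries the enclosing `##` title along, so each chunk can be read on its own. The function name `split_into_chunks` and the record layout are assumptions made for this example and are not part of the tool.

```python
def split_into_chunks(markdown):
    """Split Markdown into one chunk per '###' section.

    Each chunk records the enclosing '##' title, the '###' title,
    and the '* ' items that belong to the section.
    """
    chunks = []
    chapter = ""
    current = None
    for line in markdown.splitlines():
        if line.startswith("### "):
            if current is not None:
                chunks.append(current)
            current = {"chapter": chapter, "section": line[4:].strip(), "items": []}
        elif line.startswith("## "):
            # A new chapter closes any open section.
            if current is not None:
                chunks.append(current)
                current = None
            chapter = line[3:].strip()
        elif line.startswith("* ") and current is not None:
            current["items"].append(line[2:].strip())
    if current is not None:
        chunks.append(current)
    return chunks
```

Each returned record can then be embedded and indexed as a separate document in whatever vector store the knowledge base uses.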
3. LLM-Friendly
- Standard Markdown syntax
- No HTML noise
- Clear semantic boundaries
- Natural language structure
4. Example Query
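User asks: "What furniture should a department director's office be equipped with?"
RAG retrieves:
    ### Class C Desk
    * Department director: ✓

    ### Class C Office Chair
    * Department director: ✓

    ### Tea Cabinet
    * Department director: ✓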
Architecture
    document-converter/
    ├── main.py                          # Main entry point
    ├── SKILL.md                         # This documentation
    └── scripts/
        ├── universal_converter.py       # Universal format handler
        ├── pdf_table_converter_v2.py    # PDF-specific converter
        ├── batch_convert_all.py         # Batch processing
        └── convert.py                   # Legacy Pandoc wrapper
How It Works
For PDF Files
- Extract tables using pdfplumber
- Detect table type (list/inspection/config/form)
- Convert to structured format based on type
- Group fields intelligently (for forms)
- Output clean Markdown
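The PDF path could be sketched roughly as follows. `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()` are real pdfplumber calls; the helper names `extract_pdf_tables` and `rows_to_sections` are hypothetical, and the rendering covers only a configuration-style table (first column becomes the `###` title), not every table type the converter handles.

```python
def extract_pdf_tables(pdf_path):
    """Collect every table on every page as a list of rows (lists of cells)."""
    import pdfplumber  # third-party; imported lazily so the renderer works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def rows_to_sections(title, rows):
    """Render header + body rows as '## title' / '### first cell' / '* header: value'."""
    header, *body = rows
    lines = [f"## {title}"]
    for row in body:
        lines.append(f"### {row[0]}")
        for name, value in zip(header[1:], row[1:]):
            if value:  # empty cells may come back as None or ''
                lines.append(f"* {name}: {value}")
    return "\n".join(lines)
```

Keeping extraction and rendering separate lets the same renderer be fed rows from any source.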
For DOCX/DOC/PPTX/PPT Files
- Convert with Pandoc to GFM Markdown
- Parse Markdown tables from output
- Apply same table detection as PDF
- Convert to structured format
- Clean HTML artifacts
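A minimal sketch of the first two steps, assuming Pandoc is on `PATH`: `pandoc -t gfm` is a real invocation, while `to_gfm` and `parse_pipe_table` are hypothetical helper names. The parser handles plain pipe tables only and does not deal with escaped `\|` inside cells.

```python
import subprocess


def to_gfm(path):
    """Run Pandoc and return the document as GitHub-Flavored Markdown text."""
    result = subprocess.run(
        ["pandoc", "-t", "gfm", str(path)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_pipe_table(lines):
    """Turn pipe-table lines into rows of stripped cell strings."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue
        inner = stripped[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the delimiter row, e.g. |---|:--:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

The rows returned here have the same shape as rows extracted from a PDF, which is what allows one table converter to serve both paths.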
Consistency Guarantee
Both paths use the same table conversion logic, ensuring identical output regardless of input format.
Advanced Features
Intelligent Table Type Detection
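- Simple List - detected by 名称 (name) + 数量 (quantity) columns
- Inspection Standards - detected by 巡视项目 (inspection item) / 检查内容 (check content) columns
- Configuration - detected by 部长 (department director) / 主任 (director) / 规格 (specification) / √ markers
- Forms - detected by >60% empty cells plus form keywords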
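These rules can be illustrated with a small classifier. The keyword tuples come from the rules above, except `FORM_KEYS`, which is an assumption (the form keywords are not listed here). The order in which the rules are tried, the acceptance of both √ and ✓ as markers, and the name `detect_table_type` are also assumptions of this sketch.

```python
LIST_KEYS = ("名称", "数量")            # both must appear in the header
INSPECTION_KEYS = ("巡视项目", "检查内容")
CONFIG_KEYS = ("部长", "主任", "规格")
CONFIG_MARKS = ("√", "✓")
FORM_KEYS = ("姓名", "事由", "意见")     # assumed form keywords


def detect_table_type(rows):
    """Classify a table (list of rows of str/None cells) by header keywords and emptiness."""
    if not rows:
        return "unknown"
    cells = [cell or "" for row in rows for cell in row]
    header = "".join(cell or "" for cell in rows[0])
    all_text = "".join(cells)
    empty_ratio = sum(1 for cell in cells if not cell.strip()) / max(len(cells), 1)

    if empty_ratio > 0.6 and any(key in all_text for key in FORM_KEYS):
        return "form"
    if any(key in header for key in INSPECTION_KEYS):
        return "inspection"
    if any(key in header for key in CONFIG_KEYS) or any(m in all_text for m in CONFIG_MARKS):
        return "config"
    if all(key in header for key in LIST_KEYS):
        return "list"
    return "unknown"
```

Forms are tested first in this sketch because a mostly empty table would otherwise be matched by a stray header keyword.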
Automatic Field Grouping (Forms)
Fields are automatically categorized:
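- 基本信息 (basic information) - 姓名 (name), 性别 (gender), 地址 (address), 电话 (phone)
- 事项内容 (matter details) - 事由 (reason), 内容 (content), 情况 (situation), 原因 (cause)
- 处理信息 (handling information) - 意见 (opinion), 批示 (instruction), 处理 (handling), 办理 (processing), 结果 (result)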
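One way to picture this grouping is a substring match of each field label against the keyword lists above. The sketch below is illustrative only: `group_fields` is a hypothetical name, and the fallback group 其他 ("other") for labels that match no keyword is an assumption, since the converter may place such fields by other means.

```python
FIELD_GROUPS = {
    "基本信息": ("姓名", "性别", "地址", "电话"),
    "事项内容": ("事由", "内容", "情况", "原因"),
    "处理信息": ("意见", "批示", "处理", "办理", "结果"),
}


def group_fields(labels, default="其他"):
    """Assign each form label to the first group whose keyword it contains."""
    grouped = {}
    for label in labels:
        group = next(
            (name for name, keys in FIELD_GROUPS.items() if any(k in label for k in keys)),
            default,  # assumed fallback for labels with no keyword match
        )
        grouped.setdefault(group, []).append(label)
    return grouped
```

Because dictionaries preserve insertion order, groups appear in the output in the order their first field was encountered.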
Text Cleaning
- Removes line breaks within cells
- Normalizes whitespace
- Removes table of contents dots (.....)
- Cleans HTML artifacts
- Removes Pandoc anchors
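A hedged sketch of these cleaning passes using regular expressions. The exact patterns the converter uses are not shown in this document, so the ones below are assumptions: "Pandoc anchors" is taken to mean attribute markers such as `{#section-1}` or `[]{#bookmark}`, and a run of four or more dots is treated as a table-of-contents leader.

```python
import re

HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")        # e.g. <span id="x">, </div>, <br/>
PANDOC_ANCHOR = re.compile(r"(\[\])?\{#[^}]*\}")   # assumed anchor forms
TOC_DOTS = re.compile(r"\.{4,}")                   # dot leaders such as "......."
WHITESPACE = re.compile(r"\s+")                    # includes line breaks inside cells


def clean_cell(text):
    """Apply the cleaning passes to one cell or line of text."""
    text = HTML_TAG.sub("", text)
    text = PANDOC_ANCHOR.sub("", text)
    text = TOC_DOTS.sub(" ", text)
    text = WHITESPACE.sub(" ", text)
    return text.strip()
```

Whitespace normalization runs last so that gaps left by the earlier removals collapse to single spaces.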
Troubleshooting
Error: "pandoc not found"
    brew install pandoc            # macOS
    sudo apt-get install pandoc    # Ubuntu
Error: "Cannot convert .doc files"
    brew install libreoffice            # macOS
    sudo apt-get install libreoffice    # Ubuntu
Error: "pdfplumber not found"
pip install pdfplumber beautifulsoup4
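To see which of the three dependency errors above applies before running a conversion, a quick check along these lines may help. It uses only the standard library; the function name `check_dependencies` is hypothetical, and looking for the LibreOffice binary under the names `soffice` or `libreoffice` is an assumption about how it is installed.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which external tools and Python packages are available."""
    return {
        "pandoc": shutil.which("pandoc") is not None,
        # LibreOffice is only needed for .doc input
        "libreoffice": any(shutil.which(name) is not None for name in ("soffice", "libreoffice")),
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
        "beautifulsoup4": importlib.util.find_spec("bs4") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```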
Tables not converting properly
- Check if PDF is text-based (not scanned image)
- For DOCX, ensure tables are actual table objects (not text with tabs)
- Complex merged cells may be simplified
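For the first check, one rough way to tell a text-based PDF from a scanned one is to see whether pdfplumber can extract any text at all. `page.extract_text()` is a real pdfplumber call; the helper names and the `min_chars` threshold of 20 are arbitrary assumptions for this sketch.

```python
def read_page_texts(pdf_path):
    """Return the extracted text of every page (may be '' or None for image-only pages)."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]


def has_text_layer(page_texts, min_chars=20):
    """Heuristic: the PDF is text-based if its pages yield at least min_chars characters."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars
```

A PDF that fails this check most likely needs OCR before its tables can be extracted.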
Performance
Single File
- PDF (10 pages): ~2-5 seconds
- DOCX (10 pages): ~1-3 seconds
- PPT (20 slides): ~3-6 seconds
Batch Processing
- 100 PDFs: ~5-10 minutes
- 429 files (mixed): ~20-30 minutes
- No token usage - all processing is local
Comparison with Other Tools
| Feature | This SKILL | Pandoc Only | Other Converters |
|---|---|---|---|
| Table Format | Structured lists | Markdown tables | HTML/Tables |
| RAG-Optimized | ✅ Yes | ❌ No | ❌ No |
| Consistent Output | ✅ Yes | ❌ No | ❌ No |
| All Formats | ✅ Yes | ⚠️ No PDF or DOC input | ⚠️ Limited |
| Batch Processing | ✅ Yes | ❌ No | ⚠️ Varies |
| Form Recognition | ✅ Yes | ❌ No | ❌ No |
Examples
Convert Single PDF
    python main.py "/path/to/document.pdf"
    # Output: /path/to/document.md
Convert with Custom Output
python main.py "input.docx" "output/result.md"
Batch Convert Directory
    python scripts/batch_convert_all.py "/path/to/documents"
    # Output: /path/to/documents/converted_md/
Batch Convert with Custom Output
python scripts/batch_convert_all.py "/input" "/output"
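The default output locations shown in these examples can be expressed with `pathlib`. This is a sketch of the naming convention only: the function names are hypothetical, and writing every batch result directly into one output folder (rather than mirroring subdirectories) is an assumption.

```python
from pathlib import Path


def single_output_path(src):
    """Default for main.py: same folder, extension replaced by .md."""
    return Path(src).with_suffix(".md")


def batch_output_path(input_dir, src, output_dir=None):
    """Default for batch mode: <input_dir>/converted_md/<name>.md unless output_dir is given."""
    target_dir = Path(output_dir) if output_dir else Path(input_dir) / "converted_md"
    return target_dir / Path(src).with_suffix(".md").name
```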
Version History
v2.0 (Current)
- ✅ Universal format support
- ✅ Intelligent table type detection
- ✅ RAG-optimized output
- ✅ Form field grouping
- ✅ Consistent formatting across all formats
- ✅ Batch processing for all formats
v1.0 (Legacy)
- Basic Pandoc conversion
- Markdown table output
- PDF text extraction
- Limited table handling
License
MIT License - Free to use and modify
Support
For issues or questions:
- Check the troubleshooting section above
- Review the examples in this documentation
- Test with sample files first
Credits
Built with:
- Pandoc - Universal document converter
- pdfplumber - PDF table extraction
- BeautifulSoup - HTML parsing