Document Converter

Purpose

Converts common document formats to pure Markdown and rewrites every table as a hierarchical list. The list form is easier to read and easier to chunk for a RAG (Retrieval-Augmented Generation) knowledge base.

Key Features

•✅ Universal Format Support - PDF, DOCX, DOC, PPTX, PPT, TXT, HTML, ODT
•✅ Intelligent Table Recognition - Automatically detects and converts different table types
•✅ Structured Output - Tables converted to hierarchical lists (## → ### → *)
•✅ RAG-Optimized - Clear semantic boundaries for easy chunking
•✅ Consistent Formatting - All formats produce identical output structure
•✅ No HTML Artifacts - Pure Markdown, no mixed formats
•✅ Batch Processing - Convert entire directories at once

Installation

Required Dependencies

bash

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt-get install pandoc

# Python dependencies
pip install pdfplumber beautifulsoup4

Optional (for .doc files)

bash

# macOS
brew install libreoffice

# Ubuntu/Debian
sudo apt-get install libreoffice

Usage

Single File Conversion

bash

cd /Users/tvwoo/.claude/skills/document-converter

# Convert any supported format
python main.py document.pdf
python main.py report.docx
python main.py presentation.pptx

# Specify output file
python main.py input.pdf output.md

Batch Conversion (All Formats)

bash

cd /Users/tvwoo/.claude/skills/document-converter/scripts

# Convert entire directory (auto-detects all supported formats)
python batch_convert_all.py /path/to/documents

# Specify output directory
python batch_convert_all.py /path/to/input /path/to/output
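
The file discovery behind a batch run could look roughly like the sketch below. It is not the actual code of `batch_convert_all.py`: the names `SUPPORTED_EXTENSIONS`, `find_convertible_files`, and `output_path_for` are hypothetical, and recursing into subdirectories is an assumption. The `converted_md` default folder name is taken from the batch example later in this document.

```python
from pathlib import Path

# Extensions from the Key Features list; the constant name is hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt", ".html", ".odt"}


def find_convertible_files(input_dir):
    """Return every supported file under input_dir (recursion is an assumption)."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def output_path_for(source, input_dir, output_dir=None):
    """Mirror the source's relative path under the output directory as a .md file."""
    if output_dir is None:
        # Default folder name as shown in the batch example of this document.
        output_dir = Path(input_dir) / "converted_md"
    relative = Path(source).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".md")
```

Matching on the lowercased suffix keeps files such as `REPORT.PDF` from being skipped.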

Supported Formats

| Format | Support | Conversion Method | Notes |
|--------|---------|-------------------|-------|
| PDF | ✅ Full | pdfplumber + custom parser | Best table extraction |
| DOCX | ✅ Full | Pandoc + table optimizer | Native support |
| DOC | ✅ Full | LibreOffice → Pandoc | Auto-converts to DOCX |
| PPTX/PPT | ✅ Full | Pandoc + table optimizer | Slide-by-slide |
| TXT | ✅ Full | Pandoc | Direct conversion |
| HTML | ✅ Full | Pandoc + cleanup | Web content |
| ODT | ✅ Full | Pandoc | OpenDocument |
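
As a rough illustration of the routing in the table above, the sketch below maps a file extension to a conversion method and builds the external commands involved. The names `CONVERSION_METHODS`, `conversion_method`, `libreoffice_command`, and `pandoc_command` are hypothetical and not taken from `universal_converter.py`. The `soffice --headless --convert-to` and `pandoc -t gfm` invocations are standard options of those tools, but whether the converter calls them exactly this way is an assumption.

```python
from pathlib import Path

# Labels mirror the "Conversion Method" column; the dict itself is hypothetical.
CONVERSION_METHODS = {
    ".pdf": "pdfplumber",
    ".docx": "pandoc",
    ".doc": "libreoffice+pandoc",
    ".pptx": "pandoc",
    ".ppt": "pandoc",
    ".txt": "pandoc",
    ".html": "pandoc",
    ".odt": "pandoc",
}


def conversion_method(source):
    """Pick a conversion method from the file extension (case-insensitive)."""
    ext = Path(source).suffix.lower()
    if ext not in CONVERSION_METHODS:
        raise ValueError(f"Unsupported format: {ext or '(no extension)'}")
    return CONVERSION_METHODS[ext]


def libreoffice_command(source, out_dir):
    """argv that asks headless LibreOffice to convert a legacy .doc to .docx."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(source)]


def pandoc_command(source):
    """argv that asks Pandoc for GitHub-Flavored Markdown on stdout."""
    return ["pandoc", str(source), "-t", "gfm"]
```

Each argv list can be passed to `subprocess.run(..., check=True)`; building the list separately makes the command easy to log and to test without the tools installed.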

Output Format Examples

Simple List Table

Input: Equipment list with quantities

Output:

markdown

## Tools and Instruments List

* **Multimeter:** 1 unit
* **Infrared thermometer gun:** 1 unit
* **Tape measure:** 1 unit
* **Voltage tester pen:** 1 unit

Configuration Table

Input: Furniture allocation table

Output:

markdown

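## Office Area Furniture Category Allocation Standard

### Class C Office Desk
* No.: 1
* Specification (W*D*H): 2000*10000
* Unit: piece
* Department Head: ✓
* Center Director: ✓

### Class C Office Chair
* No.: 2
* Specification (W*D*H): five-star base
* Unit: piece
* Department Head: ✓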

Inspection Standards Table

Input: Equipment inspection checklist

Output:

markdown

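## High-Voltage Power Supply Equipment Inspection Standard

### 35 kV Switchgear

**Gas compartment pressure gauge of each bay**
* Standard: the pressure gauge of each bay's gas compartment reads correctly (0.05-0.08 MPa), with no abnormal pressure messages
* Frequency: twice daily

**Circuit breakers and three-position disconnectors**
* Standard: the mechanical and electrical indications of the open/closed state of circuit breakers, three-position disconnectors, and similar devices agree
* Frequency: twice daily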

Form/Template Table

Input: Application form with fields

Output:

markdown

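## Petition Registration Form

**Basic Information:**
* Visitor name: ___________
* Gender: ___________
* Contact phone: ___________

**Matter Details:**
* Reason for visit: ___________

**Handling Information:**
* Receiving department: ___________
* General Affairs Department opinion: ___________
* Instruction from the leader in charge: ___________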

Why This Format is Perfect for RAG

1. Clear Hierarchical Structure

markdown

## Chapter Level          ← Large semantic unit
### Section Level         ← Medium semantic unit
* Item Level             ← Small semantic unit

2. Easy Chunking

Each ### section is a complete, independent knowledge unit that can be:

•Indexed separately
•Retrieved independently
•Understood without context
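
One way to turn the converted Markdown into retrieval chunks is sketched below. The function name `split_into_chunks` and the chunk dictionary layout are hypothetical and not part of this skill. Carrying the parent `##` title on each chunk is a design choice of the sketch, so that a retrieved section still says which table it came from.

```python
def split_into_chunks(markdown_text):
    """Split converter output into one chunk per '###' section.

    Each chunk records its parent '##' title, its own '###' title,
    and the non-empty lines that belong to the section.
    """
    chunks = []
    chapter = None
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("### "):
            current = {"chapter": chapter, "section": line[4:].strip(), "lines": []}
            chunks.append(current)
        elif line.startswith("## "):
            chapter = line[3:].strip()
            current = None  # text before the next '###' belongs to no section
        elif current is not None and line.strip():
            current["lines"].append(line.strip())
    return chunks
```

Each returned dictionary can then be embedded and indexed as a separate document.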

3. LLM-Friendly

•Standard Markdown syntax
•No HTML noise
•Clear semantic boundaries
•Natural language structure

4. Example Query

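User asks: "What furniture does a department head's office need?"

RAG retrieves:

markdown

### Class C Office Desk
* Department Head: ✓

### Class C Office Chair
* Department Head: ✓

### Tea Cabinet
* Department Head: ✓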

Architecture

code

document-converter/
├── main.py                          # Main entry point
├── SKILL.md                         # This documentation
└── scripts/
    ├── universal_converter.py       # Universal format handler
    ├── pdf_table_converter_v2.py    # PDF-specific converter
    ├── batch_convert_all.py         # Batch processing
    └── convert.py                   # Legacy Pandoc wrapper

How It Works

For PDF Files

•Extract tables using pdfplumber
•Detect table type (list/inspection/config/form)
•Convert to structured format based on type
•Group fields intelligently (for forms)
•Output clean Markdown

For DOCX/DOC/PPT Files

•Convert with Pandoc to GFM Markdown
•Parse Markdown tables from output
•Apply same table detection as PDF
•Convert to structured format
•Clean HTML artifacts
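
The "parse Markdown tables" step can be approximated as follows. This is a minimal sketch that only handles GFM pipe tables and does not handle escaped pipes inside cells; `parse_pipe_tables` and its helpers are hypothetical names, not functions from `universal_converter.py`.

```python
import re


def _split_row(line):
    """Split one pipe-table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def _is_separator(line):
    """True for a delimiter row such as |------|:----:|."""
    cells = _split_row(line)
    return all(re.fullmatch(r":?-+:?", cell) for cell in cells)


def parse_pipe_tables(markdown_text):
    """Return each pipe table as {'header': [...], 'rows': [[...], ...]}."""
    tables = []
    lines = markdown_text.splitlines()
    i = 0
    while i + 1 < len(lines):
        if "|" in lines[i] and _is_separator(lines[i + 1]):
            header = _split_row(lines[i])
            rows = []
            i += 2  # skip the header and the delimiter row
            while i < len(lines) and "|" in lines[i]:
                rows.append(_split_row(lines[i]))
                i += 1
            tables.append({"header": header, "rows": rows})
        else:
            i += 1
    return tables
```

The header and rows come out as plain lists of strings, which is the same shape a PDF extractor can produce, so later steps do not need to know where a table came from.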

Consistency Guarantee

Both paths use the same table conversion logic, ensuring identical output regardless of input format.
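
A shared conversion step of this kind might look like the sketch below, which renders a configuration-style table as `##`, `###`, and `*` levels. The function `table_to_sections` and its `name_column` parameter are hypothetical; the real logic lives in the scripts listed under Architecture and may differ. The sketch accepts `None` cells because PDF table extractors commonly return `None` for empty cells.

```python
def table_to_sections(title, header, rows, name_column):
    """Render one table as '## title' plus a '###' section per row.

    The cell in name_column becomes the section heading; every other
    non-empty cell becomes a '* column: value' bullet.
    """
    name_index = header.index(name_column)
    lines = [f"## {title}", ""]
    for row in rows:
        lines.append(f"### {(row[name_index] or '').strip()}")
        for column, value in zip(header, row):
            text = (value or "").strip()
            if column == name_column or not text:
                continue  # skip the heading column and empty cells
            lines.append(f"* {column}: {text}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

For example, `table_to_sections("Furniture", ["No.", "Name", "Unit"], [["1", "Desk", "piece"]], "Name")` produces a `## Furniture` heading followed by a `### Desk` section with two bullets.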

Advanced Features

Intelligent Table Type Detection

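•Simple List - Detected by: 名称 (name) + 数量 (quantity) columns
•Inspection Standards - Detected by: 巡视项目 (inspection item) / 检查内容 (check content) columns
•Configuration - Detected by: 部长 (department head) / 主任 (director) / 规格 (specification) / √ markers
•Forms - Detected by: more than 60% empty cells + form keywords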
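
The detection rules above can be sketched as a small classifier. Several parts are assumptions: the function name `detect_table_type`, the order in which the rules are tried, and the `FORM_KEYWORDS` tuple (this document does not list the form keywords, so a few field words are borrowed for illustration).

```python
LIST_KEYWORDS = ("名称", "数量")             # both must appear in the header
INSPECTION_KEYWORDS = ("巡视项目", "检查内容")  # either one in the header
CONFIG_KEYWORDS = ("部长", "主任", "规格", "√")  # any one, anywhere in the table
FORM_KEYWORDS = ("姓名", "意见", "批示")       # hypothetical: not specified in this document


def detect_table_type(rows):
    """Classify a table given as a list of rows, with rows[0] as the header."""
    cells = [(cell or "").strip() for row in rows for cell in row]
    if not cells:
        return "unknown"
    header_text = " ".join((cell or "") for cell in rows[0])
    all_text = " ".join(cells)
    empty_ratio = sum(1 for cell in cells if not cell) / len(cells)

    # Rule order is an assumption; forms are checked first because they are mostly empty.
    if empty_ratio > 0.6 and any(k in all_text for k in FORM_KEYWORDS):
        return "form"
    if any(k in header_text for k in INSPECTION_KEYWORDS):
        return "inspection"
    if all(k in header_text for k in LIST_KEYWORDS):
        return "list"
    if any(k in all_text for k in CONFIG_KEYWORDS):
        return "config"
    return "unknown"
```

Returning `"unknown"` gives the caller a place to fall back to a generic rendering instead of guessing a type.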

Automatic Field Grouping (Forms)

Fields are automatically categorized:

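•基本信息 (Basic Information) - 姓名 (name), 性别 (gender), 地址 (address), 电话 (phone)
•事项内容 (Matter Details) - 事由 (reason), 内容 (content), 情况 (situation), 原因 (cause)
•处理信息 (Handling Information) - 意见 (opinion), 批示 (instruction), 处理 (handling), 办理 (processing), 结果 (result)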
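
A keyword-based grouping along these lines is sketched below. The keyword lists come from the categories above; the names `FIELD_GROUPS`, `DEFAULT_GROUP`, and `group_form_fields`, and the fallback group for unmatched fields, are hypothetical. The actual converter may use extra rules, since some fields in the earlier form example match none of the listed keywords.

```python
# Group label -> keywords, in the order the groups should be tried.
FIELD_GROUPS = [
    ("基本信息", ("姓名", "性别", "地址", "电话")),
    ("事项内容", ("事由", "内容", "情况", "原因")),
    ("处理信息", ("意见", "批示", "处理", "办理", "结果")),
]
DEFAULT_GROUP = "其他"  # hypothetical fallback ("Other") for unmatched fields


def group_form_fields(field_names):
    """Assign each field name to the first group whose keyword it contains."""
    groups = {}
    for name in field_names:
        label = next(
            (group for group, keywords in FIELD_GROUPS
             if any(keyword in name for keyword in keywords)),
            DEFAULT_GROUP,
        )
        groups.setdefault(label, []).append(name)
    return groups
```

Substring matching lets a field such as 来访人姓名 land in 基本信息 through the keyword 姓名 without listing every possible field name.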

Text Cleaning

•Removes line breaks within cells
•Normalizes whitespace
•Removes table-of-contents dot leaders (.....)
•Cleans HTML artifacts
•Removes Pandoc anchors
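
The cleaning steps above could be combined into one helper, sketched here with regular expressions. The function name `clean_cell_text` is hypothetical, and the patterns are assumptions about what the artifacts look like: HTML tags such as `<br>`, Pandoc anchors written as `{#some-id}` or `[]{#some-id}`, and runs of four or more dots.

```python
import re


def clean_cell_text(text):
    """Normalize the text of one table cell (a sketch, not the converter's code)."""
    if text is None:
        return ""
    # Replace HTML tags with a space so "<br>" does not glue two words together.
    text = re.sub(r"</?[A-Za-z][^>]*>", " ", text)
    # Drop Pandoc-style anchors such as {#section-id} or []{#section-id}.
    text = re.sub(r"(\[\])?\{#[^}]*\}", "", text)
    # Drop table-of-contents dot leaders (four or more dots in a row).
    text = re.sub(r"\.{4,}", " ", text)
    # Collapse line breaks and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Whitespace is collapsed last so that any gaps left by the earlier substitutions are cleaned up in the same pass.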

Troubleshooting

Error: "pandoc not found"

bash

brew install pandoc  # macOS
sudo apt-get install pandoc  # Ubuntu

Error: "Cannot convert .doc files"

bash

brew install libreoffice  # macOS
sudo apt-get install libreoffice  # Ubuntu

Error: "pdfplumber not found"

bash

pip install pdfplumber beautifulsoup4
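
Before digging into a specific error, it can help to check all dependencies in one go. The sketch below is not part of this skill and `check_dependencies` is a hypothetical name. It assumes the LibreOffice executable is called `soffice` or `libreoffice`; on macOS the binary may live inside the application bundle and not be on `PATH`, in which case the check reports it as missing.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which external tools and Python packages can be found."""
    return {
        # Command-line tools are looked up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
        "libreoffice": any(
            shutil.which(name) is not None for name in ("soffice", "libreoffice")
        ),
        # Python packages are looked up by their import name.
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
        "beautifulsoup4": importlib.util.find_spec("bs4") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Note that `beautifulsoup4` is the package name used with pip, while the module it installs is imported as `bs4`.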

Tables not converting properly

•Check if PDF is text-based (not scanned image)
•For DOCX, ensure tables are actual table objects (not text with tabs)
•Complex merged cells may be simplified

Performance

Single File

•PDF (10 pages): ~2-5 seconds
•DOCX (10 pages): ~1-3 seconds
•PPT (20 slides): ~3-6 seconds

Batch Processing

•100 PDFs: ~5-10 minutes
•429 files (mixed): ~20-30 minutes
•No token usage - all processing is local

Comparison with Other Tools

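| Feature | This SKILL | Pandoc Only | Other Converters |
|---------|------------|-------------|------------------|
| Table Format | Structured lists | Markdown tables | HTML/Tables |
| RAG-Optimized | ✅ Yes | ❌ No | ❌ No |
| Consistent Output | ✅ Yes | ❌ No | ❌ No |
| All Formats | ✅ Yes | ⚠️ No PDF or DOC input | ⚠️ Limited |
| Batch Processing | ✅ Yes | ❌ No | ⚠️ Varies |
| Form Recognition | ✅ Yes | ❌ No | ❌ No |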

Examples

Convert Single PDF

bash

python main.py "/path/to/document.pdf"
# Output: /path/to/document.md

Convert with Custom Output

bash

python main.py "input.docx" "output/result.md"

Batch Convert Directory

bash

python scripts/batch_convert_all.py "/path/to/documents"
# Output: /path/to/documents/converted_md/

Batch Convert with Custom Output

bash

python scripts/batch_convert_all.py "/input" "/output"

Version History

v2.0 (Current)

•✅ Universal format support
•✅ Intelligent table type detection
•✅ RAG-optimized output
•✅ Form field grouping
•✅ Consistent formatting across all formats
•✅ Batch processing for all formats

v1.0 (Legacy)

•Basic Pandoc conversion
•Markdown table output
•PDF text extraction
•Limited table handling

License

MIT License - Free to use and modify

Support

For issues or questions:

•Check troubleshooting section above
•Review examples in this documentation
•Test with sample files first

Credits

Built with:

•Pandoc - Universal document converter
•pdfplumber - PDF table extraction
•BeautifulSoup - HTML parsing