Document Format Conversion
Convert various document formats to Markdown for knowledge base onboarding.
Supported Formats
| Format | Processing Method |
|---|---|
| DOCX | Pandoc conversion, preserve formatting and images |
| DOC | LibreOffice → DOCX → Pandoc |
| PDF (digital, text-based) | PyMuPDF4LLM fast conversion |
| PDF (scanned) | PaddleOCR-VL online OCR |
| PPTX | pptx2md professional conversion |
| PPT | LibreOffice → PPTX → pptx2md |
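
The routing in the table above can be pictured as a lookup from file extension to an ordered list of tools. The sketch below is only an illustration, not the actual logic of `smart_convert.py`: the `PIPELINES` mapping, the `pipeline_for` function, and the `pdf_is_scanned` flag are all hypothetical names, and the tool names are plain labels rather than real calls.

```python
from pathlib import Path

# Hypothetical routing table mirroring the rows of the Supported Formats table.
# The strings are labels for the tools involved, not callable APIs.
PIPELINES = {
    ".docx": ["pandoc"],
    ".doc": ["libreoffice", "pandoc"],
    ".pptx": ["pptx2md"],
    ".ppt": ["libreoffice", "pptx2md"],
}


def pipeline_for(filename: str, pdf_is_scanned: bool = False) -> list[str]:
    """Return the ordered tool chain for a file, chosen by its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # The real script detects the PDF type itself; here it is a parameter.
        return ["paddleocr-vl"] if pdf_is_scanned else ["pymupdf4llm"]
    if suffix not in PIPELINES:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return list(PIPELINES[suffix])
```

Legacy formats (DOC, PPT) simply get one extra LibreOffice step in front of the pipeline used for their modern counterparts.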
Usage
```bash
python .claude/skills/document-conversion/scripts/smart_convert.py \
  <temp_path> \
  --original-name "<original_filename>" \
  --json-output
```
Parameters:
- `<temp_path>`: Temporary file path (e.g. `/tmp/kb_upload_xxx.pptx`)
- `--original-name`: Required. The original filename, used to generate the correct image directory name
- `--json-output`: Output the result as JSON
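
When the command is launched from Python rather than a shell, one option is to build it as an argument list, so that original filenames containing spaces or non-ASCII characters need no quoting. This is a minimal sketch; `SCRIPT` and `build_command` are hypothetical names introduced for illustration, and only the flags documented above are used.

```python
# Path of the conversion script as given in the Usage section.
SCRIPT = ".claude/skills/document-conversion/scripts/smart_convert.py"


def build_command(temp_path: str, original_name: str) -> list[str]:
    """Assemble the argv list for smart_convert.py with both required flags."""
    return [
        "python",
        SCRIPT,
        temp_path,
        "--original-name",
        original_name,  # passed as one argument, so no shell quoting is needed
        "--json-output",
    ]
```

For example, `build_command("/tmp/kb_upload_xxx.pptx", "培训资料.pptx")` yields a list that can be handed directly to `subprocess.run`.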
Output Format
```json
{
  "success": true,
  "markdown_file": "/path/to/output.md",
  "images_dir": "original_filename_images",
  "image_count": 5,
  "input_file": "/path/to/input.pptx"
}
```
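
A caller might load this result into a small typed structure before acting on it. The sketch below assumes only the five fields shown in the example; `ConversionResult` and `parse_result` are hypothetical names, and every field other than `success` is treated as optional because the shape of a failure result is not specified here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversionResult:
    """Fields taken from the example output; absent fields fall back to defaults."""
    success: bool
    markdown_file: Optional[str] = None
    images_dir: Optional[str] = None
    image_count: int = 0
    input_file: Optional[str] = None


def parse_result(stdout_text: str) -> ConversionResult:
    """Parse the JSON text printed by the conversion script."""
    data = json.loads(stdout_text)
    return ConversionResult(
        success=bool(data.get("success", False)),
        markdown_file=data.get("markdown_file"),
        images_dir=data.get("images_dir"),
        image_count=int(data.get("image_count", 0)),
        input_file=data.get("input_file"),
    )
```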
Processing Flow
- Execute the conversion command (must use `--original-name` and `--json-output`)
- Parse the JSON output and check the `success` field
- If `success: false`, report the error and stop
- If `success: true`, record the generated file path and image directory
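
The steps above could be wired together roughly as follows. This is a sketch under assumptions, not the reference implementation: it assumes the script prints nothing but the JSON object to stdout when `--json-output` is set, and `convert`, `ConversionError`, and the injectable `run` parameter are hypothetical names. Because the fields of a failure result are not documented here, the whole payload is reported on failure.

```python
import json
import subprocess


class ConversionError(RuntimeError):
    """Raised when the script output is unparseable or reports success: false."""


def convert(temp_path: str, original_name: str, run=subprocess.run):
    """Run the conversion and return (markdown_file, images_dir) on success."""
    cmd = [
        "python",
        ".claude/skills/document-conversion/scripts/smart_convert.py",
        temp_path,
        "--original-name", original_name,
        "--json-output",
    ]
    # Step 1: execute the conversion command and capture its output.
    proc = run(cmd, capture_output=True, text=True)
    # Step 2: parse the JSON output (assumed to be the only content on stdout).
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise ConversionError(f"could not parse JSON output: {proc.stdout!r}") from exc
    # Step 3: on failure, report the error and stop.
    if not result.get("success"):
        raise ConversionError(f"conversion failed: {result}")
    # Step 4: on success, record the generated file path and image directory.
    return result["markdown_file"], result["images_dir"]
```

Passing `run` as a parameter keeps the flow testable without invoking the real script.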
Important Notes
- The image directory is named after the original filename (e.g. `培训资料_images/`)
- Omitting `--original-name` results in incorrect image reference paths
- The PDF type is detected automatically; scanned PDFs take longer to process (tens of seconds to minutes)
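
To see why the original filename matters, consider how the image directory name might be derived. The rule below (filename stem plus `_images`) is inferred from the `培训资料_images/` example and is an assumption about the script's behavior; `expected_images_dir` is a hypothetical helper.

```python
from pathlib import Path


def expected_images_dir(filename: str) -> str:
    """Assumed naming rule: filename without its extension, plus '_images'."""
    return f"{Path(filename).stem}_images"


# With the original name, Markdown image links point at a meaningful directory.
print(expected_images_dir("培训资料.pptx"))
# With only the temporary path, the name would follow the upload's temp file.
print(expected_images_dir("/tmp/kb_upload_xxx.pptx"))
```

A caller could compare this value with the `images_dir` field of the JSON result as a sanity check.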
Format Details
For detailed processing instructions for each format, see FORMATS.md.