AgentSkillsCN

markitdown

借助微软的 MarkItDown 工具,为文件转换为 Markdown 提供专业指导。将 PDF、Word、PowerPoint、Excel、图片、音频、HTML、CSV、JSON、XML、ZIP 以及 EPub 文件转换为 LLM 友好的 Markdown 格式。适用于为 AI 分析处理文档、从文件中提取内容,或为语言模型准备数据时使用。可在 markitdown、文档转换、pdf 转 markdown、docx 转 markdown、文件提取、文档处理等场景下触发。

SKILL.md
--- frontmatter
name: markitdown
description: Expert guidance for converting files to Markdown using Microsoft's MarkItDown utility. Convert PDF, Word, PowerPoint, Excel, images, audio, HTML, CSV, JSON, XML, ZIP, and EPub files to LLM-friendly Markdown format. Use when processing documents for AI analysis, extracting content from files, or preparing data for language models. Triggers on markitdown, document conversion, pdf to markdown, docx to markdown, file extraction, document processing.

MarkItDown Document Conversion

Convert files to Markdown using Microsoft's MarkItDown utility.

Installation

Full Installation

bash
pip install 'markitdown[all]'

Selective Installation

bash
pip install 'markitdown[pdf]'                    # PDF only
pip install 'markitdown[docx]'                   # Word documents
pip install 'markitdown[pptx]'                   # PowerPoint
pip install 'markitdown[xlsx]'                   # Excel
pip install 'markitdown[audio]'                  # Audio transcription
pip install 'markitdown[image]'                  # Image OCR
pip install 'markitdown[azure-doc-intelligence]' # Azure AI PDF
pip install 'markitdown[llm]'                    # LLM image descriptions

Command-Line Usage

bash
# Basic conversion
markitdown file.pdf

# Save to file
markitdown file.pdf > output.md
markitdown file.pdf -o output.md

# Batch conversion
for file in *.pdf; do markitdown "$file" > "${file%.pdf}.md"; done

Python API

Basic Usage

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Stream Processing

python
with open("file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

With Azure Document Intelligence

python
md = MarkItDown(
    azure_doc_intelligence_endpoint="https://your-resource.cognitiveservices.azure.com",
    azure_doc_intelligence_key="your-key"
)

With LLM Image Descriptions

python
md = MarkItDown(
    llm_model="gpt-4o",
    llm_client=None  # Uses default client
)

Supported Formats

FormatExtensionsFeatures
PDF.pdfText, tables, links, structure
Word.docxHeadings, lists, tables, images, links
PowerPoint.pptxSlides, titles, content, images
Excel.xlsx, .xlsSheets, tables, headers
Images.png, .jpg, .gifEXIF, OCR, LLM descriptions
Audio.wav, .mp3Transcription, timestamps
HTML.htmlContent, links, tables
CSV.csvData tables
JSON.jsonStructure preservation
XML.xmlData extraction
ZIP.zipArchive processing
EPub.epubE-book content
YouTubeURLsMetadata, transcripts

Common Patterns

Batch Processing

python
import os
from markitdown import MarkItDown

md = MarkItDown()

for filename in os.listdir("input/"):
    if filename.endswith(('.pdf', '.docx', '.pptx')):
        result = md.convert(f"input/{filename}")
        base = os.path.splitext(filename)[0]
        with open(f"output/{base}.md", "w") as f:
            f.write(result.text_content)

Error Handling

python
try:
    result = md.convert("file.pdf")
    markdown = result.text_content
except Exception as e:
    print(f"Conversion failed: {e}")

Memory-Efficient Processing

python
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

Docker Usage

bash
# Build
docker build -t markitdown:latest .

# Run
docker run --rm -i markitdown:latest < input.pdf > output.md

# With volume
docker run --rm -v $(pwd):/data markitdown:latest /data/file.pdf

Output Format

MarkItDown produces clean, structured Markdown:

markdown
# Document Title

## Section Heading

Content with **bold** and *italic* formatting.

- Bullet lists
- Preserved from source

| Table | Headers |
|-------|---------|
| Data  | Values  |

[Links](https://example.com) maintained.

Best Practices

Performance

  • Use streams for files >10MB
  • Batch process multiple files
  • Cache converted results
  • Use selective dependencies

Quality

  • High-resolution images for OCR
  • Well-formatted source documents
  • Azure Document Intelligence for complex PDFs
  • LLM descriptions for important images

Integration

  • Check token counts for LLM limits
  • Chunk long documents
  • Preserve metadata in context
  • Validate output structure

Troubleshooting

IssueSolution
Import errorspip install --upgrade 'markitdown[all]'
Memory errorsUse convert_stream() instead of convert()
Poor OCRIncrease image resolution, use Azure
Missing contentCheck source file quality

Requirements

  • Python 3.10+
  • Virtual environment recommended
  • Optional: Azure subscription for enhanced features
  • Optional: OpenAI API for image descriptions

When to Use This Skill

  • Converting documents for AI analysis
  • Extracting content from PDFs
  • Processing Word/PowerPoint files
  • Preparing data for language models
  • Batch document conversion
  • Building document pipelines