AgentSkillsCN

read-document

从各类文档文件(.docx、.pdf、.xlsx、.pptx、.csv、.rtf、.odt、.epub、.html)中读取并提取文本内容,将其转换为纯文本或 Markdown 格式,以便编辑与修改。

SKILL.md
--- frontmatter
name: read-document
description: Read and extract text from document files (.docx, .pdf, .xlsx, .pptx, .csv, .rtf, .odt, .epub, .html) into plain text or markdown for editing

What I do

Convert binary and rich document formats into plain text or markdown that agents can read, edit, and work with. I use Python libraries already installed on this system.

Supported formats

FormatExtensionPython library
Word.docxmammoth (to markdown) or python-docx (raw text)
PDF.pdfpdfplumber (with layout) or pypdf (fast text)
Excel.xlsx, .xlsopenpyxl
PowerPoint.pptxpython-pptx
CSV.csvbuilt-in csv module
HTML.html, .htmbeautifulsoup4 + markdownify
Rich Text.rtfstriprtf (if installed)
OpenDocument.odtodfpy (if installed), otherwise unzip + xml parse
EPUB.epubebooklib (if installed), otherwise unzip + html parse

How to use me

Run the appropriate Python one-liner or short script via the Bash tool. Always quote paths with spaces.

Word (.docx) to Markdown

bash
python -c "
import mammoth, sys
with open(sys.argv[1], 'rb') as f:
    result = mammoth.convert_to_markdown(f)
    print(result.value)
" "path/to/document.docx"

Word (.docx) to plain text (preserves structure)

bash
python -c "
from docx import Document
import sys
doc = Document(sys.argv[1])
for p in doc.paragraphs:
    print(p.text)
for table in doc.tables:
    for row in table.rows:
        print(' | '.join(cell.text for cell in row.cells))
" "path/to/document.docx"

PDF to text (with layout preservation)

bash
python -c "
import pdfplumber, sys
with pdfplumber.open(sys.argv[1]) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            print(text)
        print('---')
" "path/to/document.pdf"

PDF to text (fast, less formatting)

bash
python -c "
from pypdf import PdfReader
import sys
reader = PdfReader(sys.argv[1])
for page in reader.pages:
    print(page.extract_text())
" "path/to/document.pdf"

Excel (.xlsx) to text table

bash
python -c "
from openpyxl import load_workbook
import sys
wb = load_workbook(sys.argv[1], read_only=True, data_only=True)
for sheet in wb.sheetnames:
    ws = wb[sheet]
    print(f'## Sheet: {sheet}')
    for row in ws.iter_rows(values_only=True):
        print(' | '.join(str(c) if c is not None else '' for c in row))
    print()
" "path/to/spreadsheet.xlsx"

PowerPoint (.pptx) to text

bash
python -c "
from pptx import Presentation
import sys
prs = Presentation(sys.argv[1])
for i, slide in enumerate(prs.slides, 1):
    print(f'## Slide {i}')
    for shape in slide.shapes:
        if shape.has_text_frame:
            for para in shape.text_frame.paragraphs:
                print(para.text)
    print()
" "path/to/presentation.pptx"

CSV to markdown table

bash
python -c "
import csv, sys
with open(sys.argv[1], newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    rows = list(reader)
    if rows:
        print(' | '.join(rows[0]))
        print(' | '.join('---' for _ in rows[0]))
        for row in rows[1:]:
            print(' | '.join(row))
" "path/to/data.csv"

HTML to Markdown

bash
python -c "
from markdownify import markdownify
import sys
with open(sys.argv[1], encoding='utf-8') as f:
    print(markdownify(f.read(), heading_style='ATX'))
" "path/to/page.html"

When to use me

  • User asks to read, import, review, or edit a .docx, .pdf, .xlsx, .pptx, .csv, or .html file
  • User drops a document file into the project and wants it analyzed
  • User wants to convert a document to markdown for editing
  • Any time you encounter a binary file format that the Read tool cannot handle

Important notes

  • Always save extracted text to a .md file in the project so other agents can work with it
  • For large PDFs, consider extracting specific pages: add page range logic
  • For large Excel files, consider extracting specific sheets only
  • If a Python library is missing, install it with: pip install <package>
  • Always use sys.argv[1] for the file path so paths with spaces work correctly
  • When editing documents, extract to markdown first, make edits, then inform the user the edited markdown is ready (re-encoding back to .docx requires additional steps)