AgentSkillsCN

word-docx-splitter

将大型 Word 文档(.docx)拆分为多个小块,并重新组合。当您处理大型法律文件、合同或协议时,若需要将其分段处理——尤其是在 AI 工具难以应对文档尺寸限制的情况下,此技能便能大显身手。它支持按标题样式(例如“标题 1”)或文本模式(例如“条款”)进行拆分,并在编辑完成后将各部分重新合并。

SKILL.md
--- frontmatter
name: word-docx-splitter
description: Split large Word documents (.docx) into smaller chunks and reassemble them. Use when working with large legal documents, contracts, or agreements that need to be processed in smaller pieces—particularly useful when AI tools struggle with document size limits. Supports splitting by heading style (e.g., "Heading 1") or text pattern (e.g., "ARTICLE"), and merging edited chunks back together.

Word Document Splitter & Merger

Split large .docx files into smaller documents by article/section, then reassemble after editing.

Dependencies

bash
pip install python-docx lxml --break-system-packages

Quick Reference

TaskCommand
List stylespython scripts/docx_splitter.py doc.docx --list-styles
Split by stylepython scripts/docx_splitter.py doc.docx --split-on "Heading 1"
Split by patternpython scripts/docx_splitter.py doc.docx --pattern "ARTICLE"
Merge chunkspython scripts/docx_merger.py ./chunks/ -o merged.docx

Splitting Documents

Step 1: Identify Split Points

Always check available styles first:

bash
python scripts/docx_splitter.py document.docx --list-styles

Step 2: Split

By heading style (preferred for well-formatted documents):

bash
python scripts/docx_splitter.py document.docx --split-on "Heading 1" --output-dir ./chunks

By text pattern (for inconsistent styling):

bash
python scripts/docx_splitter.py document.docx --pattern "ARTICLE" --output-dir ./chunks
python scripts/docx_splitter.py document.docx --pattern "^SECTION \d+" --output-dir ./chunks

Split Options

OptionDescription
--split-on, -sStyle name (e.g., "Heading 1", "Heading 2")
--pattern, -pRegex pattern (e.g., "ARTICLE", "^SECTION \d+")
--output-dir, -oOutput directory (default: <input>_split/)
--no-headerExclude preamble content from each chunk

Output Structure

Files are numbered for proper ordering:

code
document_split/
├── 00_Preamble.docx          # Content before first split point
├── 01_ARTICLE_I.docx
├── 02_ARTICLE_II.docx
└── ...

Merging Documents

Reassemble edited chunks:

bash
python scripts/docx_merger.py ./chunks/ -o final_document.docx

Merge specific files in order:

bash
python scripts/docx_merger.py file1.docx file2.docx file3.docx -o merged.docx

Merge Options

OptionDescription
-o, --outputOutput path (required)
--page-breaksAdd page breaks between sections
--include-duplicatesKeep duplicate preamble content

Typical Workflow

bash
# 1. Check document structure
python scripts/docx_splitter.py BigContract.docx --list-styles

# 2. Split by articles
python scripts/docx_splitter.py BigContract.docx --split-on "Heading 1" -o ./chunks

# 3. Edit individual chunks (with Harvey, Claude, or manually)

# 4. Reassemble
python scripts/docx_merger.py ./chunks/ -o BigContract_Final.docx

Limitations

  • Cross-references: Internal references preserved but not auto-updated if sections renumber
  • Tables: Tables spanning split boundaries may break
  • Numbering: Automatic numbering restarts per chunk
  • Track changes: Preserved but may need reconciliation after merge

Tips for Legal Documents

  • Split at ARTICLE level for credit agreements and loan documents
  • Use --pattern when heading styles are inconsistent (common in opposing counsel docs)
  • For very large definition sections, re-split that chunk by "Heading 2"
  • Review merged document at section boundaries for formatting issues