DOCX Creation, Editing, and Analysis
Work with Word documents: create from scratch, edit existing content, add tracked changes for professional review, extract text, and convert to PDF or images.
Overview
A .docx file is a ZIP archive of XML files and resources. Different tasks require different tools:
| Task | Tool | Workflow |
|---|---|---|
| Create new document | docx-js (JavaScript) | Write JS/TS, export with Packer.toBuffer() |
| Edit existing document | Document library (Python) | Unpack, manipulate XML, repack |
| Redline / tracked changes | Document library (Python) | Unpack, apply <w:ins>/<w:del> tags, repack |
| Extract text | pandoc | pandoc --track-changes=all file.docx -o out.md |
| Convert to images | LibreOffice + pdftoppm | DOCX to PDF to JPEG |
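Because a .docx is an ordinary ZIP archive, Python's standard zipfile module can inspect one directly. The sketch below builds a small stand-in archive in memory and lists its members. The helper names list_parts and make_stub_docx are hypothetical, and the stub's part contents are placeholders rather than a valid Word document.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the sorted member names of a .docx given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def make_stub_docx():
    """Build a placeholder archive laid out like a .docx (not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


print(list_parts(make_stub_docx()))
# ['[Content_Types].xml', 'word/document.xml']
```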
When to Use
- •User asks to create a new Word document from content or data
- •User asks to edit, modify, or update an existing .docx file
- •User asks to review a document with tracked changes (redlining)
- •User asks to add comments to a Word document
- •User asks to extract or read text from a .docx file
- •User asks to convert a .docx to PDF or images for visual inspection
Workflow Decision Tree
User request
|- "Create a new document"
| -> Creating workflow (docx-js)
|- "Edit my own document" + simple changes
| -> Basic OOXML editing workflow
|- "Review someone else's document"
| -> Redlining workflow (recommended default)
|- "Legal, academic, business, or government docs"
| -> Redlining workflow (required)
|- "Read/analyze document content"
| -> Text extraction with pandoc
|- "Show me what the document looks like"
-> Convert to images workflow
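The decision tree can be read as a small function. This is only an illustrative sketch: choose_workflow, its task labels, and the returned strings are hypothetical names, not part of any library.

```python
def choose_workflow(task, own_document=True, formal=False):
    """Map a request to a workflow name following the decision tree.

    task: one of "create", "edit", "read", "view" (hypothetical labels).
    own_document: True when the user is editing their own file.
    formal: True for legal, academic, business, or government documents.
    """
    if task == "create":
        return "creating (docx-js)"
    if task == "read":
        return "text extraction (pandoc)"
    if task == "view":
        return "convert to images"
    if task == "edit":
        # Redlining is required for formal documents and the default for
        # someone else's document; simple edits to one's own file skip it.
        if formal or not own_document:
            return "redlining"
        return "basic OOXML editing"
    raise ValueError(f"unknown task: {task!r}")
```

For instance, choose_workflow("edit", own_document=True, formal=True) returns "redlining", because the formal-document rule takes priority over ownership.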
Creating a New Document (docx-js)
Use the docx-js library to create Word documents in JavaScript/TypeScript.
Steps
- •Read the docx-js reference: load references/docx-js-patterns.md for syntax, formatting rules, and common pitfalls
- •Write a JS/TS file using Document, Paragraph, TextRun, Table, and other components
- •Export with Packer.toBuffer() and write to disk
Key Rules
- •Never use \n for line breaks; use separate Paragraph elements
- •Always use the LevelFormat.BULLET constant for bullet lists (not unicode symbols or the string "bullet")
- •Always specify the type parameter for ImageRun (png, jpg, gif, bmp, svg)
- •Set the columnWidths array AND individual cell widths on tables
- •Use ShadingType.CLEAR for table cell shading (never SOLID)
- •PageBreak must always be inside a Paragraph; a standalone PageBreak creates invalid XML
- •Override built-in heading styles using exact IDs: "Heading1", "Heading2", "Heading3"
- •Include outlineLevel on heading styles for Table of Contents compatibility
- •Set a default font via styles.default.document.run.font (Arial recommended)
const { Document, HeadingLevel, Packer, Paragraph, TextRun } = require('docx');
const fs = require('fs');

const doc = new Document({
  styles: {
    // Default run font; size is in half-points, so 24 = 12pt
    default: { document: { run: { font: "Arial", size: 24 } } }
  },
  sections: [{
    properties: {
      // Margins are in twentieths of a point: 1440 = 1 inch
      page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } }
    },
    children: [
      new Paragraph({
        heading: HeadingLevel.HEADING_1,
        children: [new TextRun("Section Title")]
      }),
      new Paragraph({
        children: [new TextRun("Body text in Arial 12pt.")]
      })
    ]
  }]
});

// Serialize the document and write it to disk
Packer.toBuffer(doc).then(buf => fs.writeFileSync("output.docx", buf));
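The numbers in the example use Word's internal units: run sizes are given in half-points (24 is 12pt) and page margins in twentieths of a point (1440 is one inch). A small conversion sketch in Python, with hypothetical helper names, makes the arithmetic explicit:

```python
def points_to_half_points(points):
    """Font size in points -> the half-point value used for run sizes."""
    return round(points * 2)


def inches_to_twips(inches):
    """Length in inches -> twentieths of a point (72 points * 20 per inch)."""
    return round(inches * 1440)


print(points_to_half_points(12))  # 24
print(inches_to_twips(1))         # 1440
```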
Editing an Existing Document (OOXML)
Use the Document library (Python) for editing existing Word documents via XML manipulation.
Steps
- •Read the OOXML reference: load references/ooxml-patterns.md for the Document library API and XML patterns
- •Unpack: python ooxml/scripts/unpack.py <file.docx> <output_dir>
- •Write a Python script using the Document class for manipulation
- •Repack: python ooxml/scripts/pack.py <dir> <output.docx>
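Conceptually, the unpack and repack steps are a ZIP round trip. The sketch below shows that idea with the standard library only. It is not the skill's unpack.py or pack.py, which may do more (for example formatting or validating the XML), and the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack(docx_path, out_dir):
    """Extract every part of the archive into out_dir."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(out_dir)


def pack(src_dir, docx_path):
    """Zip every file under src_dir into docx_path using archive-relative names."""
    src = Path(src_dir)
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Member names use forward slashes regardless of platform.
                archive.write(path, path.relative_to(src).as_posix())
```

Editing then happens on the files in the unpacked directory between the two calls.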
Document Library Basics
from scripts.document import Document
doc = Document('unpacked')
# Find nodes by text, line number, or attributes
node = doc["word/document.xml"].get_node(tag="w:r", contains="target text")
# Replace content
doc["word/document.xml"].replace_node(node, "<w:r><w:t>new text</w:t></w:r>")
doc.save()
Set PYTHONPATH to the skill root directory before running:
PYTHONPATH=/path/to/docx-skill python your_script.py
Redlining Workflow (Tracked Changes)
For professional document review with tracked changes. This is the recommended default for editing another person's document and required for legal, academic, business, or government documents.
Principle: Minimal, Precise Edits
Only mark text that actually changes. Keep unchanged text outside <w:del>/<w:ins> tags. Preserve the original run's RSID for unchanged text.
# BAD - replaces entire sentence
'<w:del>..entire sentence..</w:del><w:ins>..entire sentence..</w:ins>'

# GOOD - only marks what changed: "30 days" -> "60 days"
'<w:r w:rsidR="00AB12CD"><w:t>within </w:t></w:r>'
'<w:del><w:r><w:delText>30</w:delText></w:r></w:del>'
'<w:ins><w:r><w:t>60</w:t></w:r></w:ins>'
'<w:r w:rsidR="00AB12CD"><w:t> days</w:t></w:r>'
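One way to arrive at a minimal edit is to strip the words that the old and new text share at both ends and mark only what remains. The sketch below does this at word granularity. minimal_redline and _run are hypothetical helpers, not Document library methods. The output adds xml:space="preserve" so that leading and trailing spaces in runs survive, and it omits revision attributes such as w:id, w:author, and w:date, as the example above does.

```python
import re
from xml.sax.saxutils import escape


def _run(text, tag="w:t", rsid=None):
    """Wrap text in a single run; returns '' for empty text."""
    if not text:
        return ""
    attr = f' w:rsidR="{rsid}"' if rsid else ""
    return f'<w:r{attr}><{tag} xml:space="preserve">{escape(text)}</{tag}></w:r>'


def minimal_redline(old, new, rsid=None):
    """Return run XML that marks only the words differing between old and new."""
    a = re.findall(r"\s+|\S+", old)  # tokens: words and whitespace runs
    b = re.findall(r"\s+|\S+", new)
    # Count shared leading tokens.
    start = 0
    while start < min(len(a), len(b)) and a[start] == b[start]:
        start += 1
    # Count shared trailing tokens without overlapping the shared prefix.
    end = 0
    while end < min(len(a), len(b)) - start and a[-1 - end] == b[-1 - end]:
        end += 1
    prefix = "".join(a[:start])
    suffix = "".join(a[len(a) - end:])
    deleted = "".join(a[start:len(a) - end])
    inserted = "".join(b[start:len(b) - end])
    parts = [_run(prefix, rsid=rsid)]
    if deleted:
        parts.append(f"<w:del>{_run(deleted, tag='w:delText')}</w:del>")
    if inserted:
        parts.append(f"<w:ins>{_run(inserted)}</w:ins>")
    parts.append(_run(suffix, rsid=rsid))
    return "".join(parts)
```

Calling minimal_redline("within 30 days", "within 60 days", rsid="00AB12CD") keeps "within " and " days" in unchanged runs and wraps only "30" and "60" in deletion and insertion markup.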
Steps
- •Convert to markdown: pandoc --track-changes=all file.docx -o current.md
- •Identify and group changes into batches of 3-10 related changes (by section, type, or proximity)
- •Read OOXML reference and unpack: python ooxml/scripts/unpack.py file.docx unpacked
- •Implement each batch: use get_node() to find nodes, apply tracked changes with replace_node(), suggest_deletion(), or insert_after()
- •Repack: python ooxml/scripts/pack.py unpacked reviewed.docx
- •Verify: convert back to markdown and grep for expected changes
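The verification step can be scripted once the reviewed file has been converted back to markdown. The sketch below assumes pandoc writes tracked changes as bracketed spans such as [60]{.insertion author="..." date="..."}. Treat that span syntax as an assumption and check it against real output before relying on the pattern. The function names are hypothetical.

```python
import re

# Assumed pandoc markdown span form: [text]{.insertion author="..." date="..."}
SPAN = re.compile(r"\[([^\]]*)\]\{\.(insertion|deletion)\b[^}]*\}")


def tracked_changes(markdown):
    """Return (kind, text) pairs for every tracked-change span found."""
    return [(kind, text) for text, kind in SPAN.findall(markdown)]


def missing_changes(markdown, expected):
    """Return the expected (kind, text) pairs that the markdown does not contain."""
    found = set(tracked_changes(markdown))
    return [change for change in expected if change not in found]
```

An empty list from missing_changes means every expected insertion and deletion was found in the converted markdown.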
Method Selection
| Scenario | Method |
|---|---|
| Change regular text | replace_node() with <w:del>/<w:ins> |
| Delete entire run or paragraph | suggest_deletion(node) |
| Reject another author's insertion | revert_insertion(ins_node) |
| Restore another author's deletion | revert_deletion(del_node) |
| Add comment | doc.add_comment(start, end, text) |
| Reply to comment | doc.reply_to_comment(parent_id, text) |
Reading and Analyzing Content
Text Extraction
# Convert to markdown preserving tracked changes
pandoc --track-changes=all document.docx -o output.md
# Options: --track-changes=accept/reject/all
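When extraction is driven from a script, the same pandoc call can be wrapped in Python. This is a minimal sketch: pandoc_command and extract_text are hypothetical names, and extract_text assumes pandoc is installed and on PATH.

```python
import subprocess

TRACK_MODES = ("accept", "reject", "all")


def pandoc_command(docx_path, out_path, track_changes="all"):
    """Build the pandoc argument list for DOCX -> markdown conversion."""
    if track_changes not in TRACK_MODES:
        raise ValueError(f"track_changes must be one of {TRACK_MODES}")
    return ["pandoc", f"--track-changes={track_changes}", docx_path, "-o", out_path]


def extract_text(docx_path, out_path, track_changes="all"):
    """Run pandoc; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(docx_path, out_path, track_changes), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect and avoids shell quoting issues with file names.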
Raw XML Access
For comments, complex formatting, metadata, or embedded media:
python ooxml/scripts/unpack.py document.docx unpacked/
Key files inside the unpacked archive:
| Path | Content |
|---|---|
| word/document.xml | Main document body |
| word/comments.xml | Comments referenced in document.xml |
| word/media/ | Embedded images and media |
| word/styles.xml | Document styles |
| word/settings.xml | Document settings |
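As an illustration of raw XML access, the sketch below reads word/comments.xml straight from the archive and returns each comment's id, author, and text. read_comments is a hypothetical helper. The code prefers defusedxml when it is installed and falls back to the standard library parser only so the sketch runs without extra packages.

```python
import io
import zipfile

try:  # Prefer defusedxml when installed.
    from defusedxml.ElementTree import fromstring
except ImportError:  # Fallback for this sketch only; not hardened against hostile XML.
    from xml.etree.ElementTree import fromstring

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_comments(docx_bytes):
    """Return (id, author, text) per comment, or [] when the part is absent."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []
        root = fromstring(archive.read("word/comments.xml"))
    comments = []
    for comment in root.iter(f"{{{W}}}comment"):
        # Concatenate the text of every run inside the comment.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W}}}t"))
        comments.append((comment.get(f"{{{W}}}id"), comment.get(f"{{{W}}}author"), text))
    return comments
```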
Converting to Images
Two-step process for visual inspection:
# Step 1: DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Step 2: PDF to JPEG images
pdftoppm -jpeg -r 150 document.pdf page
# Creates page-1.jpg, page-2.jpg, etc.
# Specific page range:
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page
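The two steps can also be assembled in Python before being handed to subprocess. In this sketch conversion_commands is a hypothetical helper that only builds the argument lists. It assumes soffice writes the PDF into the current working directory, which is its behavior when no output directory is given.

```python
from pathlib import Path


def conversion_commands(docx_path, prefix="page", dpi=150, first=None, last=None):
    """Return [soffice_args, pdftoppm_args] for DOCX -> PDF -> JPEG."""
    # Assumption: the PDF lands in the current directory with the same stem.
    pdf_name = Path(docx_path).with_suffix(".pdf").name
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", str(docx_path)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first is not None:
        to_jpeg += ["-f", str(first)]  # first page to render
    if last is not None:
        to_jpeg += ["-l", str(last)]   # last page to render
    to_jpeg += [pdf_name, prefix]
    return [to_pdf, to_jpeg]
```

Each list can be passed to subprocess.run(args, check=True) in order, so a failed PDF conversion stops the script before pdftoppm runs.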
Best Practices
Do
- •Read the appropriate reference file before starting any document task
- •Use separate Paragraph elements for each line (never \n)
- •Set a default font and establish visual hierarchy with styles
- •Group tracked changes into batches of 3-10 for manageable debugging
- •Grep word/document.xml before each script to get current line numbers
- •Preserve original run formatting (<w:rPr>) when making tracked changes
- •Use defusedxml for secure XML parsing
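The grep step in the list above can be done from Python when a script needs the line numbers directly. A minimal sketch, with find_lines as a hypothetical helper operating on the text of the unpacked word/document.xml:

```python
def find_lines(xml_text, needle):
    """Return 1-indexed (line_number, stripped_line) pairs for lines containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(xml_text.splitlines(), start=1)
        if needle in line
    ]
```

Line numbers shift after every edit, so the lookup is worth repeating before each batch rather than reusing earlier results.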
Do Not
- •Use unicode bullets, the "bullet" string, or SymbolRun for lists; use LevelFormat.BULLET
- •Mix up <w:ins>/<w:del> closing tags
- •Use markdown line numbers to locate content in XML (they do not map)
- •Modify text inside another author's <w:ins> or <w:del> directly; use nested deletions
- •Skip validation after repacking (doc.save() validates by default)
Dependencies
| Package | Install | Purpose |
|---|---|---|
| pandoc | sudo apt-get install pandoc | Text extraction |
| docx | npm install -g docx | Creating new documents |
| defusedxml | pip install defusedxml | Secure XML parsing |
| LibreOffice | sudo apt-get install libreoffice | PDF conversion |
| poppler-utils | sudo apt-get install poppler-utils | PDF to images (pdftoppm) |
Examples
Example 1: Create a Simple Report
User: Create a Word document with a title "Q4 Report", two sections with
headings, and a bullet list of key metrics.
Action: Read references/docx-js-patterns.md, then write a JS file using
Document with Heading1 style override, numbered sections, and a
bullet list via numbering config with LevelFormat.BULLET. Export
with Packer.toBuffer().
Example 2: Redline a Contract
User: Review this NDA and change the confidentiality period from 2 years
to 3 years, and update the governing law from California to Delaware.
Action: Use the redlining workflow. Convert to markdown first to understand
the document, then unpack, read references/ooxml-patterns.md, find
the target text with get_node(), apply minimal tracked changes
(only mark "2 years" -> "3 years" and "California" -> "Delaware"),
repack, and verify with pandoc.
Example 3: Extract Text from a Document
User: What does this contract say about termination clauses?
Action: Run pandoc --track-changes=all contract.docx -o contract.md,
then read the markdown file and search for termination-related
sections.
Output Checklist
- • Correct workflow selected based on decision tree
- • Reference file read before starting implementation
- • Document opens without errors in Word or LibreOffice
- • Tracked changes display correctly (for redlining tasks)
- • No unintended formatting changes introduced
- • All changes verified via markdown conversion
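Before opening the result in Word or LibreOffice, a quick structural check can catch obvious packaging mistakes. The sketch below is an assumption-laden pre-check, not a replacement for the checklist: sanity_check is a hypothetical helper that only confirms the file is a ZIP archive, contains two expected parts, and that its XML parts are well-formed.

```python
import io
import zipfile
from xml.etree.ElementTree import ParseError

try:  # Prefer defusedxml when installed.
    from defusedxml.ElementTree import fromstring
except ImportError:  # Fallback for this sketch only; not hardened against hostile XML.
    from xml.etree.ElementTree import fromstring

REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def sanity_check(docx_bytes):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(docx_bytes))
    except zipfile.BadZipFile:
        return ["not a ZIP archive"]
    problems = []
    with archive:
        names = set(archive.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
        for name in sorted(names):
            if name.endswith((".xml", ".rels")):
                try:
                    fromstring(archive.read(name))
                except (ParseError, ValueError):
                    problems.append(f"malformed XML in {name}")
    return problems
```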