Working with Documents

Quick Reference

Format	Read	Create	Edit
DOCX	pandoc, python-docx	docx-js	OOXML (unpack/edit/pack)
PDF	pdfplumber, pypdf	reportlab	pypdf (merge/split)
PPTX	markitdown	html2pptx	OOXML (unpack/edit/pack)

Word Documents (.docx)

Reading Content

bash

# Convert to markdown (preserves structure)
pandoc document.docx -o output.md

# With tracked changes visible
pandoc --track-changes=all document.docx -o output.md

Creating New Documents

Use docx-js (JavaScript):

javascript

const fs = require('fs');
const { Document, Packer, Paragraph, TextRun } = require('docx');

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({
        children: [
          new TextRun({ text: "Hello World", bold: true }),
        ],
      }),
    ],
  }],
});

Packer.toBuffer(doc).then(buffer => {
  fs.writeFileSync("output.docx", buffer);
});

Editing Existing Documents (Tracked Changes)

bash

# 1. Unpack
python ooxml/scripts/unpack.py document.docx unpacked/

# 2. Edit the XML files under unpacked/
# Key files:
#   - word/document.xml (main content)
#   - word/comments.xml (comments)
#   - word/media/ (images)

# 3. Pack
python ooxml/scripts/pack.py unpacked/ edited.docx
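The unpack/edit/pack cycle works because a .docx file is a ZIP archive of XML parts. The sketch below shows that idea using only Python's standard library. `unpack_ooxml` and `pack_ooxml` are hypothetical names used for illustration; they are not the contents of `ooxml/scripts/unpack.py` or `pack.py`, which may do additional work.

```python
import zipfile
from pathlib import Path


def unpack_ooxml(src: str, dest_dir: str) -> list[str]:
    """Extract every part of an OOXML file (.docx/.pptx) into dest_dir."""
    with zipfile.ZipFile(src) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


def pack_ooxml(src_dir: str, dest: str) -> None:
    """Zip a directory back into a single OOXML file."""
    root = Path(src_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    # Assumption: placing [Content_Types].xml first mirrors what Office writes.
    files.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Part names inside the archive always use forward slashes.
            archive.write(path, path.relative_to(root).as_posix())
```

Only unpack files you trust: `extractall` writes whatever paths the archive contains.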

Tracked changes XML pattern:

xml

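<!-- Deletion (w:id, w:author, and w:date values are placeholders) -->
<w:del w:id="1" w:author="Editor" w:date="2024-01-01T00:00:00Z"><w:r><w:delText>old text</w:delText></w:r></w:del>

<!-- Insertion -->
<w:ins w:id="2" w:author="Editor" w:date="2024-01-01T00:00:00Z"><w:r><w:t>new text</w:t></w:r></w:ins>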
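When many replacements are needed, generating this markup programmatically avoids XML-escaping mistakes. A minimal sketch, assuming each revision element carries `w:id`, `w:author`, and `w:date` attributes; `tracked_replacement` is a hypothetical helper, and the default author and date are placeholders.

```python
from xml.sax.saxutils import escape, quoteattr


def tracked_replacement(old: str, new: str, change_id: int,
                        author: str = "Editor",
                        date: str = "2024-01-01T00:00:00Z") -> str:
    """Return XML that deletes `old` and inserts `new` as tracked changes."""
    attrs = f"w:author={quoteattr(author)} w:date={quoteattr(date)}"
    # xml:space="preserve" keeps leading and trailing spaces in the text.
    deletion = (
        f'<w:del w:id="{change_id}" {attrs}>'
        f'<w:r><w:delText xml:space="preserve">{escape(old)}</w:delText></w:r>'
        f"</w:del>"
    )
    insertion = (
        f'<w:ins w:id="{change_id + 1}" {attrs}>'
        f'<w:r><w:t xml:space="preserve">{escape(new)}</w:t></w:r>'
        f"</w:ins>"
    )
    return deletion + insertion


print(tracked_replacement("old text", "new text", change_id=1))
```

The caller is responsible for choosing `change_id` values that do not collide with revision ids already present in the document.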

PDF Documents

Reading PDFs

python

import pdfplumber

# Extract text
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

# Extract tables
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
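Extracted tables arrive as lists of rows whose cells are strings or `None`, so a common next step is writing them out as CSV. A sketch using only the standard library; `table_to_csv` is a hypothetical helper and `sample` is made-up data shaped like one entry of `page.extract_tables()`.

```python
import csv
import io
from typing import Optional


def table_to_csv(rows: list[list[Optional[str]]]) -> str:
    """Serialize one extracted table to CSV text, treating None cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Collapse line breaks inside cells so each table row stays on one line.
        writer.writerow([(cell or "").replace("\n", " ") for cell in row])
    return buffer.getvalue()


# Illustrative data only; real rows come from page.extract_tables().
sample = [["Item", "Qty"], ["Widget,\nlarge", "2"], ["Gasket", None]]
print(table_to_csv(sample))
```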

Creating PDFs

python

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [
    Paragraph("Report Title", styles['Title']),
    Paragraph("Body text goes here.", styles['Normal']),
]
doc.build(story)

Merging/Splitting PDFs

python

from pypdf import PdfReader, PdfWriter

# Merge
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as f:
    writer.write(f)

# Split
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        writer.write(f)

Command-Line Tools

bash

# Extract text
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # Preserve layout

# Merge with qpdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
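To drive these qpdf commands from a Python script, building the argument list explicitly avoids shell-quoting problems with file names. A sketch, assuming `qpdf` is on the PATH; `qpdf_merge_args`, `qpdf_extract_args`, and `run_qpdf` are hypothetical helper names.

```python
import shutil
import subprocess


def qpdf_merge_args(inputs: list[str], output: str) -> list[str]:
    """Argument list equivalent to: qpdf --empty --pages in1 in2 ... -- output"""
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_extract_args(source: str, page_range: str, output: str) -> list[str]:
    """Argument list equivalent to: qpdf source --pages . RANGE -- output"""
    return ["qpdf", source, "--pages", ".", page_range, "--", output]


def run_qpdf(args: list[str]) -> None:
    """Run a qpdf command, failing early if the tool is not installed."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("qpdf not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(args, check=True)


print(qpdf_merge_args(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Calling `run_qpdf(qpdf_extract_args("input.pdf", "1-5", "pages1-5.pdf"))` would then perform the split shown above.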

PowerPoint Presentations (.pptx)

Reading Content

bash

# Convert to markdown
python -m markitdown presentation.pptx

Creating New Presentations

Use the html2pptx workflow:

•Create HTML slides (720pt × 405pt for 16:9)
•Convert with the html2pptx.js library
•Validate with a thumbnail grid

bash

# Create thumbnails for validation
python scripts/thumbnail.py output.pptx --cols 4
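The 720pt × 405pt slide size follows from a 16:9 aspect ratio at 72 points per inch, assuming a 10-inch-wide slide. The arithmetic, as a quick check (`slide_size_pt` is a hypothetical helper):

```python
from fractions import Fraction

POINTS_PER_INCH = 72


def slide_size_pt(width_in: Fraction, aspect: Fraction) -> tuple[Fraction, Fraction]:
    """Return (width, height) in points for a given width in inches and aspect ratio."""
    width_pt = width_in * POINTS_PER_INCH
    return width_pt, width_pt / aspect


width, height = slide_size_pt(Fraction(10), Fraction(16, 9))
print(float(width), float(height))  # 720.0 405.0
```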

Editing Existing Presentations

bash

# 1. Unpack
python ooxml/scripts/unpack.py presentation.pptx unpacked/

# Key files:
#   - ppt/slides/slide1.xml, slide2.xml, etc.
#   - ppt/notesSlides/ (speaker notes)
#   - ppt/media/ (images)

# 2. Edit XML

# 3. Validate
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx

# 4. Pack
python ooxml/scripts/pack.py unpacked/ edited.pptx
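Before editing, it helps to list the slide parts in numeric order, since a plain alphabetical sort puts slide10.xml ahead of slide2.xml. A sketch using only the standard library; `slide_files` is a hypothetical helper and is not part of the `ooxml/scripts` tools.

```python
import re
from pathlib import Path


def slide_files(unpacked_dir: str) -> list[Path]:
    """Return ppt/slides/slideN.xml paths sorted by N (so slide10 follows slide9)."""
    pattern = re.compile(r"slide(\d+)\.xml")
    numbered = []
    for path in (Path(unpacked_dir) / "ppt" / "slides").glob("slide*.xml"):
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]
```

File numbers do not necessarily match the on-screen order: the presentation order is recorded in ppt/presentation.xml, so check there when the sequence matters.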

Rearranging Slides

bash

# Duplicate, reorder, delete slides
python scripts/rearrange.py template.pptx output.pptx 0,3,3,5,7
# Creates: slide 0, slide 3 (twice), slide 5, slide 7
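The comma-separated argument reads as a sequence of source slide indices: repeated values duplicate a slide, omitted values delete it, and position sets the new order. A sketch of that interpretation, assuming 0-based indices (as "slide 0" suggests) and a hypothetical 8-slide template; `parse_slide_order` is an illustrative helper, not part of `scripts/rearrange.py`.

```python
def parse_slide_order(spec: str, slide_count: int) -> list[int]:
    """Parse '0,3,3,5,7' into 0-based indices, rejecting out-of-range values."""
    order = [int(part) for part in spec.split(",") if part.strip()]
    for index in order:
        if not 0 <= index < slide_count:
            raise ValueError(f"slide index {index} outside 0..{slide_count - 1}")
    return order


order = parse_slide_order("0,3,3,5,7", slide_count=8)
print(order)  # [0, 3, 3, 5, 7]

# Source slides that never appear in the order are dropped from the output.
dropped = sorted(set(range(8)) - set(order))
print(dropped)  # [1, 2, 4, 6]
```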

Converting Between Formats

bash

# DOCX/PPTX to PDF
soffice --headless --convert-to pdf document.docx

# PDF to images
pdftoppm -jpeg -r 150 document.pdf page
# Creates: page-1.jpg, page-2.jpg, etc. (numbers are zero-padded for PDFs with 10+ pages)

# DOCX to Markdown
pandoc document.docx -o output.md

OCR for Scanned Documents

python

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then OCR the pages in order
images = convert_from_path('scanned.pdf')
text = "\n".join(pytesseract.image_to_string(image) for image in images)

Design Guidelines (Presentations)

Color Palettes

Pick 3-5 colors that work together:

Palette	Colors
Classic Blue	Navy #1C2833, Slate #2E4053, Silver #AAB7B8
Teal & Coral	Teal #5EA8A7, Coral #FE4447, White #FFFFFF
Black & Gold	Gold #BF9A4A, Black #000000, Cream #F4F6F6

Web-Safe Fonts Only

Arial, Helvetica, Times New Roman, Georgia, Verdana, Tahoma, Trebuchet MS, Courier New, Impact

Layout Rules

•Two-column: Use for exactly 2 distinct items
•Three-column: Use for exactly 3 items
•Never vertically stack charts below text
•Full-bleed images with text overlays work well

Dependencies

bash

# Python
pip install pypdf pdfplumber reportlab python-docx openpyxl "markitdown[pptx]" pytesseract pdf2image

# System tools
apt-get install pandoc poppler-utils libreoffice qpdf tesseract-ocr

# Node.js (for docx-js)
npm install docx

Verification

Run: python scripts/verify.py
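For a quick manual check of the environment, the sketch below tests whether the main Python modules and command-line tools used in this document can be found. It is an illustration only and not the contents of `scripts/verify.py`; `check_dependencies` and the two lists are assumptions.

```python
import importlib.util
import shutil

# Assumption: import names and executables for tools used in this document.
# python-docx is imported as "docx".
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "docx"]
SYSTEM_TOOLS = ["pandoc", "pdftotext", "pdftoppm", "soffice"]


def check_dependencies() -> dict[str, bool]:
    """Map each dependency name to True if it can be found on this machine."""
    status = {}
    for module in PYTHON_MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    for tool in SYSTEM_TOOLS:
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{'ok' if found else 'MISSING':8} {name}")
```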

Related Skills

•working-with-spreadsheets - Excel file handling
•building-nextjs-apps - Frontend for document uploads