Document Parser

Overview

This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.

Core principle: Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.

When to Use This Skill

Use this skill when:

•Document exceeds 25k+ tokens and can't fit in context
•User explicitly requests document parsing or structure extraction
•Building RAG systems that need semantically coherent chunks
•Analyzing research papers, technical docs, or long-form content
•Need to extract tables, code blocks, benchmarks, or key terms
•Want progressive reading (abstracts first, then deep-dives)
•Comparing multiple large documents

Don't use for:

•❌ Documents under 10k tokens (read directly instead)
•❌ Binary file formats (PDFs, Word docs) - convert to markdown first
•❌ Simple text extraction (use grep/awk instead)

Core Capabilities

The document-parser skill provides four main capabilities:

•
Structure Analysis
- •Extract markdown headers (H1-H6)
- •Build hierarchical section tree
- •Count tokens per section (target: 400-900)
- •Generate section maps for navigation
•
Abstract Generation
- •Create 100-200 token summaries for major sections
- •Preserve key concepts and relationships
- •Enable progressive reading workflows
•
Metadata Extraction
- •Extract tables with structure preservation
- •Capture code blocks with language tags
- •Identify benchmarks (percentages, metrics)
- •Extract key terms (techniques, models, acronyms)
•
Output Generation
- •Machine-readable JSON (structure.json, metadata.json)
- •Human-readable markdown (section_map.md)
- •Full section content with metadata

Quick Reference

Task	Command	Output
Parse structure	`python3 scripts/parse_document_structure.py <file.md>`	structure.json, section_map.md
Extract metadata	`python3 scripts/extract_metadata.py <file.md>`	metadata.json
Custom output path	`--output <path>`	Specify output file
Section map	`--map <path>`	Human-readable navigation

Chunking Principles Reference

The skill implements RAG-optimized chunking principles:

The 400-900 Token Sweet Spot

•Too small (<400): Fragments semantic meaning, loses context
•Sweet spot (400-900): Complete thoughts, searchable, coherent
•Too large (>900): Dilutes relevance, adds noise

Layout-Aware Hierarchical Chunking

•Respect document structure (headers, sections)
•Never split mid-paragraph or mid-code-block
•Preserve parent-child relationships
•Include breadcrumb context (section path)

Dual-Storage Pattern

•Abstracts: Quick navigation, relevance filtering
•Full sections: Deep-dive when needed
•Metadata: Tables, benchmarks, key terms for targeted search

See references/chunking_principles.md for complete details.

Sandbox Configuration

IMPORTANT: This skill requires executing Python scripts. In read-only sandbox mode, you need to either:

•
Recommended: Configure sandbox allowlist in ~/.codex/config.toml:
toml
```
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
```
•
Alternative: Use dangerouslyDisableSandbox: true when calling Bash tool

See README.md in this skill directory for complete sandbox setup instructions.

Implementation Workflows

Workflow 1: Parse Single Large Document

Use case: User has a 47k token research paper

bash

# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
  --output structure.json \
  --map section_map.md

# Step 2: Review section map
cat section_map.md
# Shows hierarchical outline with token counts

# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
  --output metadata.json

# Step 4: Review extracted metadata
cat metadata.json | jq '.tables | length'
cat metadata.json | jq '.benchmarks | length'
cat metadata.json | jq '.key_terms | keys'

Expected output:

•structure.json: Hierarchical section tree with token counts
•section_map.md: Human-readable outline for navigation
•metadata.json: Tables, code blocks, benchmarks, key terms

Workflow 2: Comparative Analysis

Use case: Compare two research papers on similar topics

bash

# Parse both documents
for doc in paper1.md paper2.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare structures
diff -u \
  <(jq '.sections[] | .title' paper1_structure.json) \
  <(jq '.sections[] | .title' paper2_structure.json)

# Compare key terms
diff -u \
  <(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
  <(jq '.key_terms.techniques[]' paper2_metadata.json | sort)

Workflow 3: Progressive Document Reading

Use case: Understand document before deep-dive

bash

# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md
# Review: What are the main sections?

# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json

# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json

# Step 4: Review key terms to understand domain
jq '.key_terms' metadata.json

# Step 5: Deep-dive into specific sections
# Read full sections from original document based on structure

Script Documentation

parse_document_structure.py

Extracts markdown headers, builds hierarchical section tree, counts tokens.

Usage:

bash

python3 scripts/parse_document_structure.py <file.md> [OPTIONS]

Options:

•--output FILEPATH - Output JSON file (default: structure.json)
•--map FILEPATH - Output markdown section map (default: section_map.md)

Output structure.json format:

json

{
  "sections": [
    {
      "id": "section-1",
      "title": "Introduction",
      "level": 1,
      "token_count": 450,
      "children": [
        {
          "id": "section-1.1",
          "title": "Background",
          "level": 2,
          "token_count": 320,
          "children": []
        }
      ]
    }
  ],
  "total_sections": 56,
  "total_tokens": 47000
}

Output section_map.md format:

markdown

# Document Structure

- Introduction (450 tokens)
  - Background (320 tokens)
  - Motivation (280 tokens)
- Methods (650 tokens)
  - Data Collection (520 tokens)
  - Analysis (580 tokens)

extract_metadata.py

Extracts tables, code blocks, benchmarks, and key terms.

Usage:

bash

python3 scripts/extract_metadata.py <file.md> [OPTIONS]

Options:

•--output FILEPATH - Output JSON file (default: metadata.json)

Output metadata.json format:

json

{
  "tables": [
    {
      "id": "table-1",
      "section": "Results",
      "headers": ["Model", "Accuracy", "F1"],
      "rows": [
        ["GPT-4", "95.2%", "0.94"],
        ["Claude", "94.8%", "0.93"]
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "section": "Implementation",
      "language": "python",
      "content": "def parse_document(text):\n    ..."
    }
  ],
  "benchmarks": [
    {
      "metric": "Accuracy",
      "value": "95.2%",
      "context": "GPT-4 on MMLU benchmark"
    }
  ],
  "key_terms": {
    "techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
    "models": ["GPT-4", "Claude", "Llama-2"],
    "acronyms": ["MMLU", "RAG", "NLP"]
  }
}

Common Mistakes

❌ Sandbox permission errors when running scripts

Problem: Permission denied or scripts won't execute in read-only sandbox mode Fix: Configure sandbox allowlist in ~/.codex/config.toml:

toml

[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]

Or use dangerouslyDisableSandbox: true flag when calling Bash tool (development only). See README.md for complete setup instructions.

❌ Parsing non-markdown files

Problem: Scripts expect markdown format Fix: Convert PDFs/Word docs to markdown first using pandoc:

bash

pandoc document.pdf -o document.md

❌ Ignoring token counts

Problem: Sections too large for embedding models Fix: Review section_map.md token counts, split sections >900 tokens manually

❌ Missing Python dependencies

Problem: Scripts require specific libraries Fix: Install dependencies:

bash

pip install tiktoken markdown beautifulsoup4

❌ Not preserving structure

Problem: Flat extraction loses context Fix: Always use hierarchical parsing, maintain parent-child relationships

❌ Skipping metadata extraction

Problem: Lose valuable structured data Fix: Always run both scripts for complete analysis

Examples

Example 1: Research Paper (47k tokens)

Input: 47k token research paper on RAG systems

Commands:

bash

python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md

Results:

•56 sections extracted
•54 tables identified
•145 benchmarks found
•71 techniques cataloged
•Section map showing 3-level hierarchy
•Average section size: 839 tokens (within target range)

Example 2: Technical Documentation

Input: API documentation with code examples

Commands:

bash

python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md

Use results to:

•Navigate API structure via outline
•Extract all code examples for testing
•Catalog all endpoints from tables
•Build searchable knowledge base

Example 3: Multi-Document Comparison

Input: 3 papers on LLM evaluation

Workflow:

bash

# Parse all documents
for doc in paper*.md; do
  python3 scripts/parse_document_structure.py "$doc"
  python3 scripts/extract_metadata.py "$doc"
done

# Compare methodologies
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json

# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json

Testing Your Parsing

After parsing a document, verify quality:

Structure Checklist:

• All major sections captured
• Hierarchy preserved (H1 > H2 > H3)
• Token counts reasonable (400-900 target)
• Section map is human-readable
• JSON is valid (jq . structure.json)

Metadata Checklist:

• Tables extracted with structure
• Code blocks include language tags
• Benchmarks capture value + context
• Key terms are domain-relevant
• JSON is valid (jq . metadata.json)

Advanced Usage

Custom Section Splitting

If sections are too large (>900 tokens), split manually:

python

# In parse_document_structure.py, add target_size parameter
python3 scripts/parse_document_structure.py document.md \
  --target-size 600 \
  --max-size 900

Filtering by Section Level

Extract only top-level sections:

bash

jq '.sections[] | select(.level == 1)' structure.json

Building RAG Index

Use parsed output for RAG system:

python

import json

# Load structure
with open('structure.json') as f:
    structure = json.load(f)

# Load metadata
with open('metadata.json') as f:
    metadata = json.load(f)

# Build embeddings for each section
for section in structure['sections']:
    if 400 <= section['token_count'] <= 900:
        # Optimal chunk size
        embed_and_index(section)

Integration with Other Skills

This skill complements:

•skill-builder: Create new parsing strategies as skills
•time-awareness: Track document parsing timestamps

Proven Success

Tested successfully on:

•✅ 47K token research document
•✅ 56 sections extracted
•✅ 54 tables preserved
•✅ 145 benchmarks identified
•✅ 71 techniques cataloged
•✅ Hierarchical section maps generated
•✅ Metadata JSON validated

References

•references/chunking_principles.md - Complete RAG chunking methodology
•Scripts in scripts/ directory
•See skill-builder for creating document-specific parsing skills

Remember: Large documents are structured data. Parse the structure first, then read strategically.