Document Parser
Overview
This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.
Core principle: Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.
When to Use This Skill
Use this skill when:
- •Document exceeds 25k tokens and can't fit in context
- •User explicitly requests document parsing or structure extraction
- •Building RAG systems that need semantically coherent chunks
- •Analyzing research papers, technical docs, or long-form content
- •Need to extract tables, code blocks, benchmarks, or key terms
- •Want progressive reading (abstracts first, then deep-dives)
- •Comparing multiple large documents
Don't use for:
- •❌ Documents under 10k tokens (read directly instead)
- •❌ Binary file formats (PDFs, Word docs) - convert to markdown first
- •❌ Simple text extraction (use grep/awk instead)
Core Capabilities
The document-parser skill provides four main capabilities:
- •Structure Analysis
  - •Extract markdown headers (H1-H6)
  - •Build hierarchical section tree
  - •Count tokens per section (target: 400-900)
  - •Generate section maps for navigation
- •Abstract Generation
  - •Create 100-200 token summaries for major sections
  - •Preserve key concepts and relationships
  - •Enable progressive reading workflows
- •Metadata Extraction
  - •Extract tables with structure preservation
  - •Capture code blocks with language tags
  - •Identify benchmarks (percentages, metrics)
  - •Extract key terms (techniques, models, acronyms)
- •Output Generation
  - •Machine-readable JSON (structure.json, metadata.json)
  - •Human-readable markdown (section_map.md)
  - •Full section content with metadata
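Benchmark identification can be illustrated with a small regular-expression pass. The sketch below is an assumption about one workable approach, not the logic inside extract_metadata.py; `find_percentages` and its `window` parameter are hypothetical names introduced for illustration.

```python
import re

# Matches values such as "95%" or "95.2%".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")


def find_percentages(text: str, window: int = 30) -> list[dict]:
    """Return each percentage with up to `window` characters of context per side."""
    results = []
    for match in PERCENT_RE.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        results.append({
            "value": match.group(),
            "context": text[start:end].strip(),
        })
    return results


if __name__ == "__main__":
    for hit in find_percentages("GPT-4 reaches 95.2% accuracy on MMLU."):
        print(hit)
```

A character window is a crude notion of context; a sentence-level window would likely be more useful on real documents.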
Quick Reference
| Task | Command | Output |
|---|---|---|
| Parse structure | python3 scripts/parse_document_structure.py <file.md> | structure.json, section_map.md |
| Extract metadata | python3 scripts/extract_metadata.py <file.md> | metadata.json |
| Custom output path | --output <path> | Specify output file |
| Section map | --map <path> | Human-readable navigation |
Chunking Principles Reference
The skill implements RAG-optimized chunking principles:
The 400-900 Token Sweet Spot
- •Too small (<400): Fragments semantic meaning, loses context
- •Sweet spot (400-900): Complete thoughts, searchable, coherent
- •Too large (>900): Dilutes relevance, adds noise
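The three ranges above amount to a simple bucketing rule. A minimal sketch, assuming token counts are already available (for example from structure.json); `classify_chunk` is a hypothetical helper, not part of the skill's scripts.

```python
def classify_chunk(token_count: int, low: int = 400, high: int = 900) -> str:
    """Bucket a section by token count using the 400-900 target range."""
    if token_count < low:
        return "too_small"    # fragments meaning, loses context
    if token_count > high:
        return "too_large"    # dilutes relevance, adds noise
    return "sweet_spot"       # complete thought, searchable


if __name__ == "__main__":
    for count in (320, 650, 1200):
        print(count, classify_chunk(count))
```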
Layout-Aware Hierarchical Chunking
- •Respect document structure (headers, sections)
- •Never split mid-paragraph or mid-code-block
- •Preserve parent-child relationships
- •Include breadcrumb context (section path)
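The rules above can be sketched as a header-driven splitter that tracks a breadcrumb and refuses to treat lines inside fenced code blocks as headers. This is an illustrative sketch under simplifying assumptions (ATX-style `#` headers only, triple-backtick fences only), not the implementation in parse_document_structure.py; `chunk_by_headers` is a hypothetical name.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split markdown at headers, keeping fenced code blocks intact."""
    chunks = []
    path = []        # stack of (level, title) pairs forming the breadcrumb
    current = None   # chunk currently receiving lines
    in_fence = False

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Pop siblings and deeper headers, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
            current = {
                "title": title,
                "level": level,
                "breadcrumb": " > ".join(t for _, t in path),
                "lines": [],
            }
            chunks.append(current)
        elif current is not None:
            # Text before the first header is dropped in this sketch.
            current["lines"].append(line)
    return chunks
```

Each chunk carries its own breadcrumb, so a retrieved section can be shown with its position in the document even when its parent is not retrieved.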
Dual-Storage Pattern
- •Abstracts: Quick navigation, relevance filtering
- •Full sections: Deep-dive when needed
- •Metadata: Tables, benchmarks, key terms for targeted search
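One way to picture the dual-storage pattern is a record that holds the abstract, the full text, and the metadata side by side, with retrieval done in two passes. The sketch below is hypothetical: `SectionRecord`, `find_relevant`, and `load_full` are illustrative names, and the keyword match stands in for whatever relevance scoring (for example embedding similarity) a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class SectionRecord:
    section_id: str
    abstract: str    # short summary used for navigation and filtering
    content: str     # full section text for deep-dives
    metadata: dict = field(default_factory=dict)  # tables, benchmarks, key terms


def find_relevant(records: list[SectionRecord], keyword: str) -> list[str]:
    """First pass: scan abstracts only and return matching section ids."""
    needle = keyword.lower()
    return [r.section_id for r in records if needle in r.abstract.lower()]


def load_full(records: list[SectionRecord], section_id: str) -> str:
    """Second pass: fetch full content for a section chosen in the first pass."""
    by_id = {r.section_id: r for r in records}
    return by_id[section_id].content
```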
See references/chunking_principles.md for complete details.
Sandbox Configuration
IMPORTANT: This skill requires executing Python scripts. In read-only sandbox mode, you need to either:
- •Recommended: Configure sandbox allowlist in ~/.codex/config.toml:
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
- •Alternative: Use dangerouslyDisableSandbox: true when calling Bash tool
See README.md in this skill directory for complete sandbox setup instructions.
Implementation Workflows
Workflow 1: Parse Single Large Document
Use case: User has a 47k token research paper
# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
  --output structure.json \
  --map section_map.md
# Step 2: Review section map
cat section_map.md
# Shows hierarchical outline with token counts
# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
  --output metadata.json
# Step 4: Review extracted metadata
cat metadata.json | jq '.tables | length'
cat metadata.json | jq '.benchmarks | length'
cat metadata.json | jq '.key_terms | keys'
Expected output:
- •structure.json: Hierarchical section tree with token counts
- •section_map.md: Human-readable outline for navigation
- •metadata.json: Tables, code blocks, benchmarks, key terms
Workflow 2: Comparative Analysis
Use case: Compare two research papers on similar topics
# Parse both documents
for doc in paper1.md paper2.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done
# Compare structures
diff -u \
  <(jq '.sections[] | .title' paper1_structure.json) \
  <(jq '.sections[] | .title' paper2_structure.json)
# Compare key terms
diff -u \
  <(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
  <(jq '.key_terms.techniques[]' paper2_metadata.json | sort)
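The key-term comparison can also be done in Python when a structured result is more convenient than a diff. A minimal sketch, assuming both files follow the metadata.json layout documented in this skill; `compare_terms` is a hypothetical helper.

```python
def compare_terms(meta_a: dict, meta_b: dict, category: str = "techniques") -> dict:
    """Compare one key_terms category between two metadata documents."""
    terms_a = set(meta_a.get("key_terms", {}).get(category, []))
    terms_b = set(meta_b.get("key_terms", {}).get(category, []))
    return {
        "shared": sorted(terms_a & terms_b),
        "only_first": sorted(terms_a - terms_b),
        "only_second": sorted(terms_b - terms_a),
    }


if __name__ == "__main__":
    paper1 = {"key_terms": {"techniques": ["RAG", "Fine-tuning"]}}
    paper2 = {"key_terms": {"techniques": ["RAG", "Few-shot learning"]}}
    print(compare_terms(paper1, paper2))
```

In practice the two dictionaries would come from `json.load` on paper1_metadata.json and paper2_metadata.json.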
Workflow 3: Progressive Document Reading
Use case: Understand document before deep-dive
# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md
# Review: What are the main sections?
# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json
# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json
# Step 4: Review key terms to understand domain
jq '.key_terms' metadata.json
# Step 5: Deep-dive into specific sections
# Read full sections from original document based on structure
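For Step 5, a section's full text can be pulled from the original markdown by header title. The sketch below is one possible approach, not something the skill's scripts provide; `extract_section` is a hypothetical helper, and it assumes ATX-style `#` headers and does not account for `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_section(markdown: str, title: str) -> str:
    """Return the text under `title`, up to the next header of equal or higher level."""
    lines = markdown.splitlines()
    start = None
    level = None
    for index, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if not match:
            continue
        if start is None and match.group(2) == title:
            # Found the requested header; the body starts on the next line.
            start, level = index + 1, len(match.group(1))
        elif start is not None and len(match.group(1)) <= level:
            # A sibling or ancestor header ends the section.
            return "\n".join(lines[start:index]).strip()
    if start is None:
        return ""
    return "\n".join(lines[start:]).strip()
```

Subsections are included in the returned text, so asking for a top-level title yields the whole branch.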
Script Documentation
parse_document_structure.py
Extracts markdown headers, builds hierarchical section tree, counts tokens.
Usage:
python3 scripts/parse_document_structure.py <file.md> [OPTIONS]
Options:
- •--output FILEPATH - Output JSON file (default: structure.json)
- •--map FILEPATH - Output markdown section map (default: section_map.md)
Output structure.json format:
{
  "sections": [
    {
      "id": "section-1",
      "title": "Introduction",
      "level": 1,
      "token_count": 450,
      "children": [
        {
          "id": "section-1.1",
          "title": "Background",
          "level": 2,
          "token_count": 320,
          "children": []
        }
      ]
    }
  ],
  "total_sections": 56,
  "total_tokens": 47000
}
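Because sections nest through `children`, consumers of structure.json usually need a recursive walk. A minimal sketch, assuming the field names shown in the format above; `walk_sections` is a hypothetical helper and the inline `structure` dictionary stands in for the loaded file.

```python
def walk_sections(sections, ancestors=()):
    """Yield (breadcrumb, token_count) for every section, depth-first."""
    for section in sections:
        trail = ancestors + (section["title"],)
        yield " > ".join(trail), section["token_count"]
        yield from walk_sections(section.get("children", []), trail)


# Stand-in for json.load(open("structure.json")).
structure = {
    "sections": [
        {
            "id": "section-1", "title": "Introduction", "level": 1,
            "token_count": 450,
            "children": [
                {"id": "section-1.1", "title": "Background", "level": 2,
                 "token_count": 320, "children": []},
            ],
        }
    ]
}

for breadcrumb, tokens in walk_sections(structure["sections"]):
    note = "" if 400 <= tokens <= 900 else "  (outside 400-900 target)"
    print(f"{breadcrumb}: {tokens} tokens{note}")
```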
Output section_map.md format:
# Document Structure
- Introduction (450 tokens)
  - Background (320 tokens)
  - Motivation (280 tokens)
- Methods (650 tokens)
  - Data Collection (520 tokens)
  - Analysis (580 tokens)
extract_metadata.py
Extracts tables, code blocks, benchmarks, and key terms.
Usage:
python3 scripts/extract_metadata.py <file.md> [OPTIONS]
Options:
- •--output FILEPATH - Output JSON file (default: metadata.json)
Output metadata.json format:
{
  "tables": [
    {
      "id": "table-1",
      "section": "Results",
      "headers": ["Model", "Accuracy", "F1"],
      "rows": [
        ["GPT-4", "95.2%", "0.94"],
        ["Claude", "94.8%", "0.93"]
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "section": "Implementation",
      "language": "python",
      "content": "def parse_document(text):\n ..."
    }
  ],
  "benchmarks": [
    {
      "metric": "Accuracy",
      "value": "95.2%",
      "context": "GPT-4 on MMLU benchmark"
    }
  ],
  "key_terms": {
    "techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
    "models": ["GPT-4", "Claude", "Llama-2"],
    "acronyms": ["MMLU", "RAG", "NLP"]
  }
}
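The metadata layout lends itself to small lookup helpers. The sketch below assumes the field names shown in the format above; `benchmarks_for` and `table_as_records` are hypothetical helpers, and `sample` stands in for the loaded metadata.json.

```python
def benchmarks_for(metadata: dict, metric: str) -> list[tuple[str, str]]:
    """Return (value, context) pairs for every benchmark with the given metric."""
    return [
        (b["value"], b["context"])
        for b in metadata.get("benchmarks", [])
        if b.get("metric") == metric
    ]


def table_as_records(table: dict) -> list[dict]:
    """Turn a headers/rows table into a list of {header: cell} dictionaries."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


# Stand-in for json.load(open("metadata.json")).
sample = {
    "tables": [
        {
            "id": "table-1",
            "section": "Results",
            "headers": ["Model", "Accuracy", "F1"],
            "rows": [["GPT-4", "95.2%", "0.94"], ["Claude", "94.8%", "0.93"]],
        }
    ],
    "benchmarks": [
        {"metric": "Accuracy", "value": "95.2%", "context": "GPT-4 on MMLU benchmark"}
    ],
}

print(benchmarks_for(sample, "Accuracy"))
print(table_as_records(sample["tables"][0]))
```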
Common Mistakes
❌ Sandbox permission errors when running scripts
Problem: Permission denied or scripts won't execute in read-only sandbox mode
Fix: Configure sandbox allowlist in ~/.codex/config.toml:
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
Or use the dangerouslyDisableSandbox: true flag when calling the Bash tool (development only).
See README.md for complete setup instructions.
❌ Parsing non-markdown files
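Problem: Scripts expect markdown format
Fix: Convert to markdown before parsing. pandoc reads Word documents directly:
pandoc document.docx -o document.md
pandoc does not accept PDF as an input format, so extract PDF text with a separate tool (for example pdftotext) and clean it up into markdown first.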
❌ Ignoring token counts
Problem: Sections too large for embedding models
Fix: Review section_map.md token counts and manually split sections over 900 tokens
❌ Missing Python dependencies
Problem: Scripts require specific libraries
Fix: Install dependencies:
pip install tiktoken markdown beautifulsoup4
❌ Not preserving structure
Problem: Flat extraction loses context
Fix: Always use hierarchical parsing and maintain parent-child relationships
❌ Skipping metadata extraction
Problem: Valuable structured data is lost
Fix: Always run both scripts for complete analysis
Examples
Example 1: Research Paper (47k tokens)
Input: 47k token research paper on RAG systems
Commands:
python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md
Results:
- •56 sections extracted
- •54 tables identified
- •145 benchmarks found
- •71 techniques cataloged
- •Section map showing 3-level hierarchy
- •Average section size: 839 tokens (within target range)
Example 2: Technical Documentation
Input: API documentation with code examples
Commands:
python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md
Use results to:
- •Navigate API structure via outline
- •Extract all code examples for testing
- •Catalog all endpoints from tables
- •Build searchable knowledge base
Example 3: Multi-Document Comparison
Input: 3 papers on LLM evaluation
Workflow:
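# Parse all documents, writing per-document outputs so later runs
# do not overwrite the default structure.json / metadata.json
for doc in paper*.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done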
# Compare methodologies
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json
# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json
Testing Your Parsing
After parsing a document, verify quality:
Structure Checklist:
- • All major sections captured
- • Hierarchy preserved (H1 > H2 > H3)
- • Token counts reasonable (400-900 target)
- • Section map is human-readable
- • JSON is valid (jq . structure.json)
Metadata Checklist:
- • Tables extracted with structure
- • Code blocks include language tags
- • Benchmarks capture value + context
- • Key terms are domain-relevant
- • JSON is valid (jq . metadata.json)
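Parts of the structure checklist can be automated. The sketch below is an illustrative check, not a script shipped with the skill: `check_structure` is a hypothetical helper that flags sections whose level is not deeper than their parent's and sections outside the 400-900 token target, assuming the structure.json fields documented above.

```python
def check_structure(structure: dict, low: int = 400, high: int = 900) -> list[str]:
    """Return human-readable problems found in a parsed structure tree."""
    problems = []

    def visit(section: dict, parent_level: int) -> None:
        title = section.get("title", "<untitled>")
        level = section.get("level", 0)
        if level <= parent_level:
            problems.append(
                f"{title!r}: level {level} is not below parent level {parent_level}"
            )
        tokens = section.get("token_count", 0)
        if not low <= tokens <= high:
            problems.append(f"{title!r}: {tokens} tokens outside {low}-{high}")
        for child in section.get("children", []):
            visit(child, level)

    for top in structure.get("sections", []):
        visit(top, 0)
    return problems
```

An empty list means both checks passed; the remaining checklist items (readability, domain relevance) still need a human look.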
Advanced Usage
Custom Section Splitting
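If sections are too large (>900 tokens), split them manually or extend the script. The --target-size and --max-size flags below are not among the documented options; the command assumes you have added them to parse_document_structure.py:
python3 scripts/parse_document_structure.py document.md \
  --target-size 600 \
  --max-size 900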
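Splitting an oversized section without cutting a paragraph in half can be sketched as greedy paragraph packing. This is a hypothetical illustration, not the script's behavior: `split_section` is an invented name, whitespace-separated words stand in for real token counts, and blank lines are treated as paragraph boundaries (so a code block containing blank lines would need extra handling).

```python
def split_section(text: str, target_size: int = 600, max_size: int = 900) -> list[str]:
    """Pack whole paragraphs into chunks near target_size, never above max_size
    unless a single paragraph is itself larger than max_size."""

    def count_tokens(piece: str) -> int:
        # Crude approximation: one token per whitespace-separated word.
        return len(piece.split())

    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        size = count_tokens(paragraph)
        # Close the running chunk if this paragraph would push it past max_size.
        if current and current_tokens + size > max_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += size
        # Close the chunk once it reaches the target size.
        if current_tokens >= target_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `max_size` is kept whole on purpose, since the chunking principles favor intact paragraphs over strict size limits.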
Filtering by Section Level
Extract only top-level sections:
jq '.sections[] | select(.level == 1)' structure.json
Building RAG Index
Use parsed output for RAG system:
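import json

def embed_and_index(section):
    # Placeholder: replace with your embedding and vector-store calls
    print(f"indexing {section['id']}: {section['title']}")

def iter_sections(sections):
    # Walk the tree depth-first so nested sections are indexed too
    for section in sections:
        yield section
        yield from iter_sections(section.get('children', []))

# Load structure
with open('structure.json') as f:
    structure = json.load(f)
# Load metadata
with open('metadata.json') as f:
    metadata = json.load(f)
# Build embeddings for each section in the optimal 400-900 token range
for section in iter_sections(structure['sections']):
    if 400 <= section['token_count'] <= 900:
        embed_and_index(section)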
Integration with Other Skills
This skill complements:
- •skill-builder: Create new parsing strategies as skills
- •time-awareness: Track document parsing timestamps
Proven Success
Tested successfully on:
- •✅ 47K token research document
- •✅ 56 sections extracted
- •✅ 54 tables preserved
- •✅ 145 benchmarks identified
- •✅ 71 techniques cataloged
- •✅ Hierarchical section maps generated
- •✅ Metadata JSON validated
References
- •references/chunking_principles.md - Complete RAG chunking methodology
- •Scripts in scripts/ directory
- •See skill-builder for creating document-specific parsing skills
Remember: Large documents are structured data. Parse the structure first, then read strategically.