document-enricher

Version: 1.9.0
Status: Active
Owner: SDLC Agêntico Core Team

Purpose

Automatically enriches existing reference documents with research findings from any SDLC phase. When agents perform research, the system:

•Detects documents related to the research topic
•Extracts content from original documents
•Merges original content with research findings
•Creates versioned enriched documents
•Updates corpus and knowledge graph

When to Use

The skill activates automatically when all of the following hold:

•A research agent (domain-researcher, doc-crawler, requirements-analyst, adr-author, threat-modeler) receives a prompt
•The similarity score between the prompt and an existing document exceeds 0.6 (the default threshold)
•Documents exist in .project/references/

Manual activation:

bash

/doc-search <keywords>              # Search for related documents
/doc-enrich <doc-id> <research>     # Manually enrich document
/doc-view <enrichment-id>           # View enriched version
/doc-diff <doc-id> <enrich-id>      # Compare versions

Architecture

Workflow

code

Agent Receives Prompt
        ↓
Step 0: find_related.py
   ├─ Extract keywords (TF-IDF)
   ├─ Query _index.yml
   ├─ Hybrid search (text + semantic)
   └─ Filter by similarity > 0.6
        ↓
   Documents Found?
   ├─ YES → Extract content (document-processor)
   └─ NO  → Continue normal research
        ↓
Execute Research (web, academic, community)
        ↓
enrich.py
   ├─ Merge: original + research
   ├─ Generate synthesis
   ├─ Create ENRICH-{id}.yml
   ├─ Generate .enriched.vN.md
   └─ Update metadata
        ↓
update_index.py
   ├─ Update _index.yml
   ├─ Update graph.json
   └─ Add 'enriches' relation
        ↓
Notify User
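
The control flow above can be sketched as a small orchestrator. This is a minimal sketch, not the skill's actual code: the function name `run_enrichment_flow` and the four callables it receives are hypothetical stand-ins for find_related.py, the agent's research step, enrich.py, and update_index.py.

```python
from typing import Callable

def run_enrichment_flow(
    prompt: str,
    find_related: Callable[[str], list],    # stand-in for find_related.py (hypothetical signature)
    research: Callable[[str], str],         # stand-in for the agent's research step
    enrich: Callable[[dict, str], dict],    # stand-in for enrich.py
    update_index: Callable[[dict], None],   # stand-in for update_index.py
) -> list:
    """Run Step 0, then research, then enrich every related document found."""
    related = find_related(prompt)   # Step 0: assumed to return only documents above the threshold
    findings = research(prompt)      # research runs whether or not documents were found
    enrichments = []
    for doc in related:
        node = enrich(doc, findings)  # merge original content with the findings
        update_index(node)            # record the enrichment in the index and graph
        enrichments.append(node)
    return enrichments                # empty list means "normal research only"
```

Passing the steps in as callables keeps the sketch independent of how each script is actually invoked.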

Similarity Scoring

Hybrid score formula:

code

similarity = 0.40 * keyword_overlap
           + 0.30 * title_similarity
           + 0.20 * summary_similarity
           + 0.10 * category_match

Threshold: 0.6 (configurable via ENRICHMENT_MIN_SIMILARITY env var)
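
As an illustration, the weighted sum can be written as a short Python function. This is a sketch under assumptions: the names `WEIGHTS`, `hybrid_similarity`, and `is_related` are hypothetical, and each component score is assumed to be already normalized to the range 0 to 1.

```python
# Weights taken from the hybrid score formula; they sum to 1.0.
WEIGHTS = {
    "keyword_overlap": 0.40,
    "title_similarity": 0.30,
    "summary_similarity": 0.20,
    "category_match": 0.10,
}

def hybrid_similarity(components: dict) -> float:
    """Weighted sum of the component scores; missing components count as 0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())

def is_related(components: dict, threshold: float = 0.6) -> bool:
    """A document matches only when its score is strictly above the threshold."""
    return hybrid_similarity(components) > threshold
```

With component scores of 0.9, 0.8, 0.7, and 1.0 (in the order listed), the weighted sum is roughly 0.84, which clears the default threshold.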

Components

Scripts

Script	Purpose
`find_related.py`	Finds documents related to research topic
`enrich.py`	Merges original content with research findings
`render_markdown.py`	Generates enriched Markdown files
`update_index.py`	Updates _index.yml and graph.json

Templates

Template	Purpose
`enrichment_node.yml.template`	Corpus node structure for enrichments
`enriched_markdown.md.template`	Markdown format for enriched documents

Tests

•test_find_related.py - Unit tests for document search
•test_enrich.py - Unit tests for enrichment logic
•test_render_markdown.py - Unit tests for Markdown generation
•test_integration.py - End-to-end enrichment flow

Data Structures

Enrichment Metadata (_index.yml)

yaml

documents:
  - id: DOC-001
    path: "references/technical/oauth2-spec.pdf"
    title: "OAuth 2.0 Specification"
    keywords: ["oauth", "authentication"]

    enrichments:
      - enrichment_id: ENRICH-001
        enriched_at: "2026-01-22T14:30:00Z"
        research_topic: "OAuth 2.1 migration"
        agent: "domain-researcher"
        phase: 1
        corpus_node: "corpus/nodes/learnings/ENRICH-001.yml"
        enriched_file: "references/technical/oauth2-spec.enriched.v1.md"
        version: 1
        similarity: 0.85
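
The `version` and `enriched_file` fields follow a pattern that can be derived from the document entry. The helper below is a hypothetical sketch (`next_enrichment` is not one of the listed scripts) and assumes versions are numbered from 1 and that the enriched file sits next to the original.

```python
from pathlib import PurePosixPath

def next_enrichment(doc: dict) -> tuple:
    """Return (version, enriched_file) for the next enrichment of an _index.yml entry."""
    versions = [entry["version"] for entry in doc.get("enrichments", [])]
    version = max(versions, default=0) + 1   # first enrichment becomes v1
    source = PurePosixPath(doc["path"])
    # Replace the original extension with ".enriched.vN.md" in the same directory.
    enriched = source.with_name(f"{source.stem}.enriched.v{version}.md")
    return version, str(enriched)
```

For the DOC-001 entry above, which already has one enrichment, this returns version 2 and `references/technical/oauth2-spec.enriched.v2.md`.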

Enrichment Corpus Node

yaml

id: ENRICH-001
type: enrichment
title: "OAuth 2.0 Specification - Enhanced with OAuth 2.1 migration"
created_at: "2026-01-22T14:30:00Z"
agent: "domain-researcher"

source_document:
  id: DOC-001
  path: "references/technical/oauth2-spec.pdf"

research_context:
  prompt: "Research OAuth 2.1 migration best practices"
  phase: 1
  similarity: 0.85

content:
  original_summary: |
    Summary of original document content
  research_findings: |
    New research results from web, academic sources
  synthesis: |
    Combined analysis merging original + research
  sources:
    - url: "https://oauth.net/2.1/"
      title: "OAuth 2.1 Draft"
      accessed_at: "2026-01-22T14:30:00Z"

relations:
  - type: enriches
    target: DOC-001

decay_metadata:
  last_validated_at: "2026-01-22T14:30:00Z"
  decay_score: 1.0
  decay_status: fresh

tags: ["oauth", "authentication", "migration", "oauth2.1"]
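
To show how the node's fields relate to the source document entry, here is a minimal sketch that assembles the same structure as a Python dict. The function name `build_enrichment_node` and its parameters are hypothetical, and seeding `tags` from the document's keywords is an assumption; the real enrich.py may populate these fields differently.

```python
from datetime import datetime, timezone
from typing import Optional

def build_enrichment_node(
    enrichment_id: str,
    doc: dict,
    topic: str,
    prompt: str,
    agent: str,
    phase: int,
    similarity: float,
    content: dict,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble a corpus node dict that mirrors the YAML layout shown above."""
    moment = now or datetime.now(timezone.utc)
    # Assumes `now`, when given, is timezone-aware; it is normalized to UTC.
    timestamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": enrichment_id,
        "type": "enrichment",
        "title": f"{doc['title']} - Enhanced with {topic}",
        "created_at": timestamp,
        "agent": agent,
        "source_document": {"id": doc["id"], "path": doc["path"]},
        "research_context": {"prompt": prompt, "phase": phase, "similarity": similarity},
        "content": content,
        # Every enrichment points back at the document it enriches.
        "relations": [{"type": "enriches", "target": doc["id"]}],
        # A new enrichment starts fully fresh.
        "decay_metadata": {
            "last_validated_at": timestamp,
            "decay_score": 1.0,
            "decay_status": "fresh",
        },
        "tags": list(doc.get("keywords", [])),  # assumption: tags start from the doc keywords
    }
```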

Enriched Markdown Structure

markdown

# {Document Title} - Enhanced Research Edition

**Original Document**: `{path}`
**Enriched**: {date}
**Research Topic**: {topic}
**Agent**: {agent}
**Phase**: {phase}
**Version**: v{n}

---

## Original Content Summary

{extracted_summary}

---

## Research Findings

{research_data}

### Sources
- [{title}]({url}) - Accessed {date}

---

## Synthesis

{combined_analysis}

---

**🤖 Generated with SDLC Agêntico by @arbgjr**
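
One way to fill a template of this shape is Python's built-in `str.format`. The sketch below is illustrative only: it uses its own copy of the layout with hypothetical placeholder names (for example `{title}` and `{version}`), omits the footer line, and is not the contents of enriched_markdown.md.template or render_markdown.py.

```python
TEMPLATE = """# {title} - Enhanced Research Edition

**Original Document**: `{path}`
**Enriched**: {date}
**Research Topic**: {topic}
**Agent**: {agent}
**Phase**: {phase}
**Version**: v{version}

---

## Original Content Summary

{original_summary}

---

## Research Findings

{research_findings}

### Sources
{sources}

---

## Synthesis

{synthesis}
"""

def render_enriched_markdown(meta: dict, content: dict) -> str:
    """Fill the template from header metadata and a corpus node's content section."""
    # One bullet per cited source, in the "[title](url) - Accessed date" form.
    sources = "\n".join(
        f"- [{s['title']}]({s['url']}) - Accessed {s['accessed_at']}"
        for s in content["sources"]
    )
    return TEMPLATE.format(
        **meta,  # expects title, path, date, topic, agent, phase, version
        original_summary=content["original_summary"].strip(),
        research_findings=content["research_findings"].strip(),
        synthesis=content["synthesis"].strip(),
        sources=sources,
    )
```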

Environment Variables

Variable	Default	Description
`ENRICHMENT_MIN_SIMILARITY`	0.6	Minimum similarity score for document matching
`ENRICHMENT_MAX_VERSIONS`	10	Max enrichment versions per document
`ENRICHMENT_AUTO_ARCHIVE`	true	Auto-archive old enrichments (> 1 year)
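
A minimal sketch of reading these variables with their documented defaults follows. The `EnrichmentConfig` class and `load_config` function are hypothetical names, and the set of strings accepted as "true" is an assumption, since the source does not say how the boolean is parsed.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class EnrichmentConfig:
    min_similarity: float = 0.6
    max_versions: int = 10
    auto_archive: bool = True

def load_config(env=None) -> EnrichmentConfig:
    """Read the three ENRICHMENT_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_archive = env.get("ENRICHMENT_AUTO_ARCHIVE", "true")
    return EnrichmentConfig(
        min_similarity=float(env.get("ENRICHMENT_MIN_SIMILARITY", "0.6")),
        max_versions=int(env.get("ENRICHMENT_MAX_VERSIONS", "10")),
        # Assumption: "1", "true", and "yes" (any case) enable auto-archiving.
        auto_archive=raw_archive.strip().lower() in {"1", "true", "yes"},
    )
```

Accepting an explicit mapping makes the loader easy to exercise without touching the real process environment, for example `load_config({"ENRICHMENT_MIN_SIMILARITY": "0.75"})`.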

Integration with Agents

Each of the following agents runs a "Step 0" document check before starting research:

Phase 1 (Discovery):

•domain-researcher - Research academic/web sources
•doc-crawler - Extract and index documentation

Phase 2 (Requirements):

•requirements-analyst - Analyze requirements

Phase 3 (Architecture):

•adr-author - Document architecture decisions
•threat-modeler - Model security threats

Example Agent Modification

markdown

# domain-researcher

## Your Task

### Step 0: Check for Related Documents (NEW)

Before starting research, check if existing documents relate to this topic:

1. Use `/doc-search` with extracted keywords from prompt
2. If similarity > 0.6:
   - Extract content from original document
   - Note key points to complement (not duplicate) in research
3. If no documents found:
   - Proceed with standard research

### Step 1: Execute Research

[... existing research steps ...]

### Final Step: Enrich Documents

If related documents were found in Step 0:
1. Use `/doc-enrich` to merge original + research findings
2. Verify enriched version was created
3. Notify user with enrichment details

Quality Gates

enrichment-quality.yml

yaml

gate_id: enrichment-quality
name: "Enrichment Quality Gate"
applies_to:
  - phase: [1, 2, 3]
    condition: "enrichments_created > 0"

checks:
  - name: enrichment_has_sources
    severity: critical
    description: "Research findings must cite sources"

  - name: original_preserved
    severity: critical
    description: "Original document unchanged (SHA256 hash check)"

  - name: graph_relation_created
    severity: critical
    description: "Graph contains 'enriches' relation"

  - name: enrichment_version_incremented
    severity: warning
    description: "Version incremented correctly (v1 → v2 → ...)"

  - name: synthesis_quality
    severity: warning
    description: "Synthesis combines original + research coherently"
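
The `original_preserved` check can be illustrated with Python's standard `hashlib` module. This is a sketch of the idea only: the function names are hypothetical, and it assumes the hash of the original is recorded before enrichment and compared afterwards.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large documents need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def original_preserved(path: Path, hash_before: str) -> bool:
    """True when the file's current SHA-256 equals the hash taken before enrichment."""
    return sha256_of(path) == hash_before
```

A caller would record `sha256_of(original)` before running the enrichment and pass that value to `original_preserved` once the enriched version has been written.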

Success Metrics

Metric	Target	Measurement
Document Discovery Rate	> 70%	% of prompts that find related docs
Enrichment Quality	> 80%	% passing quality gate
Processing Time	< 30s	Time from research → enrichment
Graph Integrity	100%	% of enrichments with valid relation

Dependencies

•document-processor (v1.3.0+) - Document extraction
•rag-query (v1.4.0+) - Hybrid search
•graph-navigator (v1.4.0+) - Graph management
•decay-scoring (v1.5.0+) - Freshness tracking

Error Handling

Error	Mitigation
Document extraction fails	Log warning, continue with research only
Similarity computation timeout	Use cached results or skip enrichment
Graph update fails	Retry 3x, then create orphan enrichment
Markdown generation fails	Save raw YAML node, skip Markdown
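
The "retry, then create orphan enrichment" mitigation could look roughly like the sketch below. It makes several assumptions: "Retry 3x" is read as three attempts in total, the `orphan` flag is a hypothetical way to mark an enrichment with no graph relation, and `update_graph` stands in for whatever call update_index.py makes.

```python
import logging

logger = logging.getLogger("document-enricher")

def update_graph_with_retry(update_graph, node: dict, attempts: int = 3) -> dict:
    """Try the graph update up to `attempts` times; mark the node as orphaned if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            update_graph(node)
            return {**node, "orphan": False}
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            logger.warning("Graph update failed (attempt %d/%d): %s", attempt, attempts, exc)
    # All attempts failed: keep the enrichment, but flag that it has no graph relation.
    return {**node, "orphan": True}
```

Returning a copy with the flag set, instead of raising, keeps the enrichment itself usable while leaving a marker that the graph relation still needs to be created.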

Rollback Strategy

If enrichment causes issues:

bash

# Revert to original document state
python3 .claude/skills/document-enricher/scripts/rollback.py --enrichment-id ENRICH-001

# Removes:
# - .enriched.vN.md file
# - ENRICH-{id}.yml corpus node
# - Graph relation
# - _index.yml entry

Future Enhancements

•v2.0: Embeddings-based semantic search
•v2.1: Multi-document synthesis (combine 2+ docs)
•v2.2: Automatic re-enrichment on document updates
•v2.3: LLM-powered synthesis generation

Related ADRs:

•ADR-document-enrichment-architecture.yml (v1.9.0)

Related Learnings:

•LEARN-research-agent-patterns.yml