## Overview
Automates the extraction, chunking, and metadata tagging of textbook content for the Physical AI & Humanoid Robotics RAG chatbot. It scans Docusaurus .mdx files and generates a structured JSON payload ready for Qdrant vector database ingestion.
## Activation Triggers

### File Pattern Triggers

- Saving files in `docs/**/*.mdx` (if configured for auto-watch)
### Keyword Triggers

- "update rag index"
- "prepare embeddings"
- "index chapter"
- "scan docs for chatbot"
- "generate vector payload"
### Command Triggers

```bash
# Explicit activation
index_rag_content --week 3
prepare_embeddings --all
```
## Core Functionality

### 1. Semantic Chunking

- Parses the Markdown structure to split content by logical sections (headers).
- Preserves code blocks within their explanatory context.
- Cleans MDX-specific syntax (imports, tabs) that adds noise to LLM context.
- Respects token limits (default ~500 words per chunk).
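The header-based splitting described above can be sketched as follows. This is illustrative only (function and variable names are not the skill's actual implementation), and it omits the code-block preservation and MDX cleanup steps:

```python
import re


def chunk_by_headers(markdown: str, max_words: int = 500) -> list[dict]:
    """Split Markdown into chunks at headers, capping each at ~max_words.

    Illustrative sketch: the real skill also keeps code blocks intact
    and strips MDX imports before splitting.
    """
    chunks = []
    current_header, current_lines = "", []

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"header": current_header, "text": text})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a header starts a new logical section
            flush()
            current_header, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
            # Start a new chunk under the same header once the budget is exceeded
            if len(" ".join(current_lines).split()) > max_words:
                flush()
                current_lines = []
    flush()
    return chunks
```

Splitting at headers keeps each chunk semantically self-contained, which tends to improve retrieval quality over fixed-size windows.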
### 2. Metadata Extraction

- **Frontmatter Parsing:** Extracts `title`, `week`, `difficulty`, and `tags`.
- **Context Awareness:** Appends the parent hierarchy (Module -> Chapter -> Section) to every chunk.
- **Hardware Tagging:** Identifies whether a chunk requires specific hardware (e.g., "Requires: Jetson Orin").
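A minimal sketch of the frontmatter extraction, assuming simple `key: value` pairs between `---` delimiters (a real implementation would use a proper YAML parser; the function name is hypothetical):

```python
def parse_frontmatter(mdx: str) -> dict:
    """Extract simple key: value pairs from a ----delimited frontmatter block.

    Illustrative only; nested YAML (e.g. tag lists) needs a real parser.
    """
    meta = {}
    lines = mdx.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta  # no frontmatter block present
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta
```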
### 3. Payload Generation

- Validates data against `metadata_schema.json`.
- Generates a unique ID for every text chunk.
- Outputs a single `qdrant_payload.json` file used by the backend ingestion script.
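One way to generate unique chunk IDs is to derive them deterministically from the chunk's location, so re-indexing unchanged content yields the same IDs (useful for idempotent upserts into the vector DB). This scheme is an assumption, not necessarily what the skill does:

```python
import uuid


def chunk_id(file_path: str, header: str, index: int) -> str:
    """Derive a stable, unique ID for a chunk from its source location.

    uuid5 is deterministic: the same (path, header, index) always maps
    to the same ID. The naming scheme here is illustrative.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_path}#{header}:{index}"))
```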
## Inputs

### Required Parameters

None (defaults to scanning `docs/` recursively).
### Optional Parameters

```python
{
    "target_week": int,    # Only index a specific week
    "output_file": str,    # Custom output path
    "force_reindex": bool  # Ignore cache/checksums
}
```
## Outputs

### Generated Files

```
rag_data/
└── qdrant_payload.json  # The master dataset for the vector DB
```
### Console Output

- Statistics on processed files.
- Number of chunks generated.
- Warnings for missing metadata or empty sections.
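The console report above could be produced by a summary pass over the generated payload. This sketch assumes each payload entry has `id`, `text`, and `metadata` keys; the actual schema is defined by `metadata_schema.json`:

```python
import json


def report_stats(payload_path: str) -> dict:
    """Summarize a generated payload: chunk count plus warnings for
    chunks missing metadata or with empty text.

    The entry shape ("id" / "text" / "metadata") is an assumption here.
    """
    with open(payload_path, encoding="utf-8") as f:
        chunks = json.load(f)
    warnings = []
    for chunk in chunks:
        if not chunk.get("text", "").strip():
            warnings.append(f"Empty section in chunk {chunk.get('id')}")
        if not chunk.get("metadata"):
            warnings.append(f"Missing metadata in chunk {chunk.get('id')}")
    return {"chunks": len(chunks), "warnings": warnings}
```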
## Integration Points

- **Input:** Reads from `docs/` (generated by `docusaurus-chapter-builder`).
- **Output:** Feeds into the FastAPI/Qdrant backend (Course Requirement #2).