## Overview
Automates the extraction, chunking, and metadata tagging of textbook content for the Physical AI & Humanoid Robotics RAG chatbot. It scans Docusaurus .mdx files and generates a structured JSON payload ready for Qdrant vector database ingestion.
## Activation Triggers

### File Pattern Triggers

- Saving files in `docs/**/*.mdx` (if configured for auto-watch)
### Keyword Triggers

- "update rag index"
- "prepare embeddings"
- "index chapter"
- "scan docs for chatbot"
- "generate vector payload"
### Command Triggers

```bash
# Explicit activation
index_rag_content --week 3
prepare_embeddings --all
```
## Core Functionality

### 1. Semantic Chunking

- Parses the Markdown structure to split content by logical sections (headers).
- Preserves code blocks within their explanatory context.
- Cleans MDX-specific syntax (imports, tabs) that adds noise to LLM context.
- Respects token limits (default ~500 words per chunk).
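The header-based splitting described above can be sketched as follows. This is illustrative only (function and variable names are not the skill's actual implementation), and it omits the code-block preservation and MDX cleanup steps:

```python
import re


def chunk_by_headers(markdown: str, max_words: int = 500) -> list[dict]:
    """Split Markdown into chunks at headers, capping each at ~max_words.

    Illustrative sketch: the real skill also keeps code blocks intact
    and strips MDX imports before splitting.
    """
    chunks = []
    current_header, current_lines = "", []

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"header": current_header, "text": text})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a header starts a new logical section
            flush()
            current_header, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
            # Start a new chunk under the same header once the budget is exceeded
            if len(" ".join(current_lines).split()) > max_words:
                flush()
                current_lines = []
    flush()
    return chunks
```

Splitting at headers keeps each chunk semantically self-contained, which tends to improve retrieval quality over fixed-size windows.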
### 2. Metadata Extraction

- **Frontmatter Parsing:** Extracts `title`, `week`, `difficulty`, and `tags`.
- **Context Awareness:** Appends the parent hierarchy (Module -> Chapter -> Section) to every chunk.
- **Hardware Tagging:** Identifies whether a chunk requires specific hardware (e.g., "Requires: Jetson Orin").
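A minimal sketch of the frontmatter extraction, assuming simple `key: value` pairs between `---` delimiters (a real implementation would use a proper YAML parser; the function name is hypothetical):

```python
def parse_frontmatter(mdx: str) -> dict:
    """Extract simple key: value pairs from a ----delimited frontmatter block.

    Illustrative only; nested YAML (e.g. tag lists) needs a real parser.
    """
    meta = {}
    lines = mdx.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta  # no frontmatter block present
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta
```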
### 3. Payload Generation

- Validates data against `metadata_schema.json`.
- Generates a unique ID for every text chunk.
- Outputs a single `qdrant_payload.json` file used by the backend ingestion script.
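One way to generate unique chunk IDs is to derive them deterministically from the chunk's location, so re-indexing unchanged content yields the same IDs (useful for idempotent upserts into the vector DB). This scheme is an assumption, not necessarily what the skill does:

```python
import uuid


def chunk_id(file_path: str, header: str, index: int) -> str:
    """Derive a stable, unique ID for a chunk from its source location.

    uuid5 is deterministic: the same (path, header, index) always maps
    to the same ID. The naming scheme here is illustrative.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_path}#{header}:{index}"))
```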
## Inputs

### Required Parameters

None (defaults to scanning `docs/` recursively).
### Optional Parameters

```python
{
    "target_week": int,    # Only index a specific week
    "output_file": str,    # Custom output path
    "force_reindex": bool  # Ignore cache/checksums
}
```
## Outputs

### Generated Files

```
rag_data/
└── qdrant_payload.json  # The master dataset for the vector DB
```
### Console Output

- Statistics on processed files.
- Number of chunks generated.
- Warnings for missing metadata or empty sections.
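The console report above could be produced by a summary pass over the generated payload. This sketch assumes each payload entry has `id`, `text`, and `metadata` keys; the actual schema is defined by `metadata_schema.json`:

```python
import json


def report_stats(payload_path: str) -> dict:
    """Summarize a generated payload: chunk count plus warnings for
    chunks missing metadata or with empty text.

    The entry shape ("id" / "text" / "metadata") is an assumption here.
    """
    with open(payload_path, encoding="utf-8") as f:
        chunks = json.load(f)
    warnings = []
    for chunk in chunks:
        if not chunk.get("text", "").strip():
            warnings.append(f"Empty section in chunk {chunk.get('id')}")
        if not chunk.get("metadata"):
            warnings.append(f"Missing metadata in chunk {chunk.get('id')}")
    return {"chunks": len(chunks), "warnings": warnings}
```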
## Integration Points

- **Input:** Reads from `docs/` (generated by `docusaurus-chapter-builder`).
- **Output:** Feeds into the FastAPI/Qdrant backend (Course Requirement #2).