RAG Implementation Guide
Overview
This skill provides Retrieval-Augmented Generation (RAG) capabilities for semantic search and knowledge retrieval. It supports vector-based similarity search using FAISS and transformer embeddings, compatible with the agentic4mle project's RAG implementation.
Quick Start
In this project, Operator skill scripts are executed through the use_skill tool's run_script action (see evomaster/agent/tools/skill.py). Therefore, this guide focuses on script invocation as the primary usage pattern.
1) Semantic Search (Recommended Entry: scripts/search.py)
Running Scripts via use_skill
Arguments are passed through script_args as space-separated strings (internally executed as python /abs/path/to/script.py {script_args}).
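Because script_args is a single string, values that contain spaces (such as the query) need quoting. The following is a minimal sketch, assuming the string is parsed with shell-style rules; the guide only states that it is interpolated into the python command. build_script_args is a hypothetical helper, not part of the skill.

```python
import shlex


def build_script_args(options: dict) -> str:
    """Join --flag value pairs into one shell-quoted string (hypothetical helper)."""
    parts = []
    for flag, value in options.items():
        parts.append(f"--{flag}")
        parts.append(str(value))
    # shlex.join quotes every token that contains spaces or shell metacharacters
    return shlex.join(parts)


script_args = build_script_args({
    "vec_dir": "/path/to/vectorstore",
    "query": "What is the main topic?",
    "top_k": 5,
})
print(script_args)
```

The resulting string can be passed as script_args to use_skill. If the tool splits arguments differently, the quoting step would need to change accordingly.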
Example (retrieve top 5 results with optional knowledge field output):
use_skill(
    skill_name="rag",
    action="run_script",
    script_name="search.py",
    script_args="--vec_dir /path/to/vectorstore --query \"What is the main topic?\" --top_k 5 --threshold 1.5 --nodes_data /path/to/nodes_data.json --content_path content.text --output json"
)
Direct Command Line Execution (Equivalent)
python evomaster/skills/rag/scripts/search.py \
    --vec_dir /path/to/vectorstore \
    --query "What is the main topic?" \
    --top_k 5 \
    --threshold 1.5 \
    --nodes_data /path/to/nodes_data.json \
    --content_path content.text \
    --output json
2) Encoding Only (scripts/encode.py)
use_skill(
    skill_name="rag",
    action="run_script",
    script_name="encode.py",
    script_args="--text \"What is the main topic?\" --model evomaster/skills/rag/local_models/all-mpnet-base-v2"
)
3) Task Knowledge (basic tools + search.py)
To get data_knowledge and model_knowledge for a task:
- Get the query text: use execute_bash to cat <simple_instructions_dir>/<task_name>/simple_instructions.txt from the project root, or use str_replace_editor with command view and the absolute path to that file.
- Run search.py with --query "<query from step 1>", --top_k 1, --threshold 1.5, --nodes_data pointing to the draft DB (e.g. draft_407_75_db.json), and --output json.
- From the JSON result, take the first item in results; its content (or the node in nodes_data for that node_id) contains data_knowledge and model_knowledge for the draft-stage DB.
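The last step can be sketched as follows. This assumes the JSON output is an object with a results list whose items carry a content object; that shape is inferred from the description above, not confirmed against search.py. extract_task_knowledge is a hypothetical helper, and the sample payload is a made-up placeholder.

```python
import json


def extract_task_knowledge(search_output: str) -> dict:
    """Return data_knowledge and model_knowledge from the top search hit."""
    payload = json.loads(search_output)
    results = payload.get("results", [])
    if not results:
        return {}  # nothing passed the distance threshold
    content = results[0].get("content", {})
    return {
        "data_knowledge": content.get("data_knowledge"),
        "model_knowledge": content.get("model_knowledge"),
    }


# Placeholder output in the assumed shape, for illustration only
sample = json.dumps({
    "results": [
        {"node_id": "n1", "content": {"data_knowledge": "d", "model_knowledge": "m"}}
    ]
})
print(extract_task_knowledge(sample))
```

If content is missing from the result, the same fields can be looked up in nodes_data under the returned node_id.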
Core Components
1. Vector Database Structure
The vector database follows this structure:
MLE_DATABASE/
├── MLE75_tasksummary_v2.json # Task summaries JSON file
├── simple_instructions/ # Simple instructions directory
│ ├── <task_name>/
│ │ └── simple_instructions.txt
│ └── ...
└── node_vectorstore/ # Vector stores (multiple dimensions)
    ├── draft/ # Draft stage vectorstore (768 dims)
    │   ├── faiss.index # FAISS index file
    │   ├── embeddings.npy # Pre-computed embeddings
    │   ├── nodes.jsonl # Node ID mappings (one JSON per line)
    │   └── draft_407_75_db.json # Full node data with knowledge content
    ├── improve/ # Improve stage vectorstore
    │   ├── faiss.index
    │   ├── embeddings.npy
    │   ├── nodes.jsonl
    │   └── nodes_for_improve.json
    └── debug/ # Debug stage vectorstore
        ├── faiss.index
        ├── embeddings.npy
        ├── nodes.jsonl
        └── nodes_for_debug.json
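As an illustration of the nodes.jsonl layout (one JSON value per line), the sketch below reads the file with the standard library only. load_node_ids is a hypothetical helper. It is also an assumption, not stated in this guide, that the line order matches the row order of faiss.index and embeddings.npy.

```python
import json
from pathlib import Path


def load_node_ids(nodes_path):
    """Read a JSON Lines file and return its entries in file order."""
    entries = []
    with Path(nodes_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                entries.append(json.loads(line))
    return entries
```

Under the ordering assumption, entry i of the returned list identifies the node behind row i of the index.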
Note: Multiple vectorstore directories exist for different embedding dimensions. Each directory corresponds to a specific model:
- node_vectorstore/ (768 dims) - Uses local model: evomaster/skills/rag/local_models/all-mpnet-base-v2
- node_vectorstore_512/ (512 dims) - Uses text-embedding-3-large (512 dimensions)
- node_vectorstore_768/ (768 dims) - Uses text-embedding-3-large (768 dimensions)
- node_vectorstore_1024/ (1024 dims) - Uses text-embedding-3-large (1024 dimensions)
- node_vectorstore_2048/ (2048 dims) - Uses text-embedding-3-large (2048 dimensions)
- node_vectorstore_3072/ (3072 dims) - Uses text-embedding-3-large (3072 dimensions, default)
Important: Use the vectorstore directory that matches your model's output dimension. The default node_vectorstore/ uses the local model, while other directories use OpenAI's text-embedding-3-large model with different dimensions.
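The directory naming above can be captured in a small lookup. This is a sketch only: vectorstore_dir is a hypothetical helper, and it encodes only the model and dimension pairs listed in this guide.

```python
LOCAL_MODEL = "evomaster/skills/rag/local_models/all-mpnet-base-v2"
OPENAI_MODEL = "text-embedding-3-large"
OPENAI_DIMS = {512, 768, 1024, 2048, 3072}


def vectorstore_dir(model: str, dim: int) -> str:
    """Map an embedding model and output dimension to its vectorstore directory."""
    if model == LOCAL_MODEL:
        if dim != 768:
            raise ValueError("the local model produces 768-dimensional embeddings")
        return "node_vectorstore"
    if model == OPENAI_MODEL:
        if dim not in OPENAI_DIMS:
            raise ValueError(f"no vectorstore listed for dimension {dim}")
        return f"node_vectorstore_{dim}"
    raise ValueError(f"no vectorstore listed for model {model!r}")


print(vectorstore_dir(OPENAI_MODEL, 1024))
```

Failing early on a mismatch avoids querying an index with vectors of the wrong size.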
2. Embedding Models
Default model location: evomaster/skills/rag/local_models/all-mpnet-base-v2
Supported embedding models:
- Local model (default, for node_vectorstore/): evomaster/skills/rag/local_models/all-mpnet-base-v2 - 768 dimensions
- text-embedding-3-large (for node_vectorstore_* directories): OpenAI embedding model with configurable dimensions (512, 768, 1024, 2048, 3072)
- sentence-transformers/all-mpnet-base-v2: HuggingFace model, 768 dimensions
- sentence-transformers/all-MiniLM-L6-v2: Faster, smaller (384 dimensions)
- Any HuggingFace transformer model compatible with AutoModel
Important:
- When using a local model path, ensure the model directory contains config.json, model.safetensors, and tokenizer files.
- When using text-embedding-3-large, ensure you use the correct vectorstore directory matching the dimension used during indexing.
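A local model directory can be checked before use with a few lines of standard-library code. missing_model_files is a hypothetical helper. Treating any file whose name starts with "tokenizer" as a tokenizer file is an assumption, since the exact file names depend on the model.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """List the required model files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [
        name
        for name in ("config.json", "model.safetensors")
        if not (root / name).is_file()
    ]
    # Assumption: tokenizer files are named tokenizer*, e.g. tokenizer.json
    if not any(root.glob("tokenizer*")):
        missing.append("tokenizer files")
    return missing
```

An empty list means the directory has the files this guide calls for; it does not verify that the weights themselves load.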
3. Retrieval Methods
Similarity Search
# Search with distance threshold
results = searcher.search_similar(
    query_emb,
    top_k=10,
    distance_threshold=1.5  # Filter results beyond this distance
)
Text-based Search
# Direct text search (encodes and searches in one step)
results = searcher.search_by_text(
    query_text="What is the main topic?",
    top_k=5,
    distance_threshold=None
)
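The combined effect of top_k and distance_threshold can be illustrated on plain (node_id, distance) pairs. This is a sketch of the semantics described in the comments above, not the searcher's actual code. filter_hits is a hypothetical name, and the sample distances are made up.

```python
def filter_hits(hits, top_k, distance_threshold=None):
    """Keep the top_k closest hits, dropping any beyond the threshold.

    hits is an iterable of (node_id, distance); smaller distance means closer.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    if distance_threshold is not None:
        ranked = [hit for hit in ranked if hit[1] <= distance_threshold]
    return ranked[:top_k]


# Illustrative values only
hits = [("a", 0.4), ("b", 1.9), ("c", 1.2)]
print(filter_hits(hits, top_k=10, distance_threshold=1.5))
```

With distance_threshold=None, only top_k limits the result, which matches the second call above.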
Usage Patterns
Pattern 1: Basic Knowledge Retrieval
use_skill(
    skill_name="rag",
    action="run_script",
    script_name="search.py",
    script_args="--vec_dir evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve --nodes_data evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve/nodes_for_improve.json --query \"How to improve model performance?\" --top_k 5"
)
Pattern 2: Multi-Stage Retrieval
# improve
use_skill(skill_name="rag", action="run_script", script_name="search.py",
script_args="--vec_dir evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve --nodes_data evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve/nodes_for_improve.json --query \"...\"")
# debug
use_skill(skill_name="rag", action="run_script", script_name="search.py",
script_args="--vec_dir evomaster/skills/rag/MLE_DATABASE/node_vectorstore/debug --nodes_data evomaster/skills/rag/MLE_DATABASE/node_vectorstore/debug/nodes_for_debug.json --query \"...\"")
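Since the stages differ only in directory and nodes file, the argument string can be generated per stage. This sketch uses the paths and file names from the directory listing in this guide. stage_script_args is a hypothetical helper, and shell-style quoting of script_args is an assumption.

```python
import shlex

DB_ROOT = "evomaster/skills/rag/MLE_DATABASE/node_vectorstore"
NODES_FILE = {
    "draft": "draft_407_75_db.json",
    "improve": "nodes_for_improve.json",
    "debug": "nodes_for_debug.json",
}


def stage_script_args(stage: str, query: str, top_k: int = 5) -> str:
    """Build the search.py argument string for one retrieval stage."""
    vec_dir = f"{DB_ROOT}/{stage}"
    return shlex.join([
        "--vec_dir", vec_dir,
        "--nodes_data", f"{vec_dir}/{NODES_FILE[stage]}",  # KeyError on unknown stage
        "--query", query,
        "--top_k", str(top_k),
    ])


print(stage_script_args("debug", "fix NaN loss", top_k=3))
```

Each returned string would be passed as script_args in a separate use_skill call.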
Pattern 3: Custom Knowledge Extraction
Currently, scripts/search.py defaults to reading nodes_data[<node_id>].content.improve_knowledge (consistent with agentic4mle/utils/vectorstore.py).
If you need to read different fields during migration, the recommended approach is:
- Copy scripts/search.py to a new script (e.g., search_debug.py)
- Modify get_knowledge() to read fields like bug_fix_specific / bug_fix_abstract / full_plan / code
- Call it via use_skill(..., script_name="search_debug.py", ...)
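One way a modified get_knowledge() could look is a dotted-path lookup on the node dictionary. This is a sketch only: the real function's signature and the node layout in search.py are not shown in this guide, so both the parameters and the sample node below are assumptions.

```python
def get_knowledge(node, content_path="content.improve_knowledge", default=None):
    """Follow a dotted path such as content.bug_fix_specific through nested dicts."""
    current = node
    for key in content_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # path does not exist in this node
        current = current[key]
    return current


# Illustrative node in an assumed layout
node = {"content": {"improve_knowledge": "x", "bug_fix_specific": "y"}}
print(get_knowledge(node, "content.bug_fix_specific"))
```

Returning a default for missing fields lets one script serve databases whose nodes carry different fields.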
Database Interface
The database interface provides methods for building and managing vector databases. Currently, it provides a simple interface that can be extended:
use_skill(
    skill_name="rag",
    action="run_script",
    script_name="database.py",
    script_args="--action build --output_dir ./vectorstore --model evomaster/skills/rag/local_models/all-mpnet-base-v2"
)
Configuration
Model Selection
Default model: evomaster/skills/rag/local_models/all-mpnet-base-v2 (768 dims)
Choose embedding model based on your needs:
- Default (local): evomaster/skills/rag/local_models/all-mpnet-base-v2 (768 dims)
  - Fast, no download needed
  - Used with node_vectorstore/ (768 dims)
- High accuracy (OpenAI): text-embedding-3-large with configurable dimensions (512, 768, 1024, 2048, 3072)
  - Used with corresponding node_vectorstore_<dim>/ directories
  - Requires API access
- Accuracy priority (HuggingFace): sentence-transformers/all-mpnet-base-v2 (768 dims) - HuggingFace model
- Speed priority: sentence-transformers/all-MiniLM-L6-v2 (384 dims) - Faster, smaller
- Custom models: Any HuggingFace AutoModel-compatible model or local model path
Important: Always use the vectorstore directory that matches the embedding model and dimension used during indexing.
Distance Thresholds
Distance thresholds depend on the similarity metric:
- L2 distance: Lower is better (typically < 1.0 for similar items)
- Cosine distance: Lower is better (typically < 0.3 for similar items)
- Inner product: Higher is better (typically > 0.7 for similar items)
FAISS defaults to the L2 metric, and its flat L2 index (IndexFlatL2) reports squared L2 distances. Adjust thresholds to the metric and scale your index actually returns.
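For unit-length embeddings the three metrics are tied together: squared L2 distance equals 2 - 2 * cosine similarity, and the inner product equals the cosine similarity. The sketch below checks this numerically with plain Python. Whether the embeddings in this project are normalized is an assumption, and the vectors are arbitrary examples.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])
cosine = inner_product(a, b)           # equals cosine similarity for unit vectors
cosine_distance = 1.0 - cosine
print(squared_l2(a, b), 2.0 - 2.0 * cosine, cosine_distance)
```

Under that assumption, a squared L2 threshold t corresponds to a cosine distance threshold of t / 2, which helps when translating a threshold from one metric to another.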
Best Practices
- Chunk Size: Balance context (larger) vs specificity (smaller) - typically 500-1000 tokens
- Overlap: Use 10-20% overlap to preserve context at boundaries
- Metadata: Include source, timestamp, and other metadata in nodes_data.json
- Distance Thresholds: Set appropriate thresholds to filter irrelevant results
- Top-K Selection: Start with k=5-10, adjust based on recall needs
- Model Selection: Use larger models for accuracy, smaller for speed
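The chunk size and overlap guidance can be sketched as a sliding window over a token list. chunk_tokens is a hypothetical helper. It works on any pre-tokenized list, and how tokens are produced (model tokenizer or whitespace split) is left open.

```python
def chunk_tokens(tokens, chunk_size=500, overlap_ratio=0.1):
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(25))  # stand-in for real tokens
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
print([len(chunk) for chunk in chunks])
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.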
Common Issues
- Poor Retrieval: Check embedding quality, ensure proper encoding
- Irrelevant Results: Adjust distance thresholds, check query formulation
- Missing Information: Ensure documents are properly indexed
- Slow Queries: Use smaller embedding models, optimize FAISS index
- Memory Issues: Use CPU mode, reduce batch sizes
Integration with agentic4mle
This RAG implementation is compatible with the agentic4mle project's vectorstore structure. The actual database is located at evomaster/skills/rag/MLE_DATABASE/:
use_skill(
    skill_name="rag",
    action="run_script",
    script_name="search.py",
    script_args="--vec_dir evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve --nodes_data evomaster/skills/rag/MLE_DATABASE/node_vectorstore/improve/nodes_for_improve.json --query \"...\""
)
Next Steps
- For advanced retrieval patterns, see reference.md
- For database building, see database.py interface
- For troubleshooting, check logs and distance metrics