AgentSkillsCN

i3

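RAG Builder: builds vector databases efficiently using local embeddings (zero cost). Supports PDF download, text extraction, chunking, and vector database creation. Use when: building a RAG system, creating a vector database, downloading PDFs, embedding documents. Triggers: build RAG, create vector database, download PDFs, embed documents.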

SKILL.md
--- frontmatter
name: i3
description: |
  RAG Builder - Vector database construction with local embeddings (zero cost)
  Handles PDF download, text extraction, chunking, and vector database creation
  Use when: building RAG, creating vector database, downloading PDFs, embedding documents
  Triggers: build RAG, create vector database, download PDFs, embed documents
version: "8.0.1"

I3-RAGBuilder

Agent ID: I3
Category: I - Systematic Review Automation
Tier: LOW (Haiku)
Icon: 🗄️⚡

Overview

Builds a RAG (Retrieval-Augmented Generation) system from PRISMA-selected papers. It relies on free, locally run embeddings and ChromaDB, so the RAG build stage costs $0. It handles PDF download, text extraction, chunking, and vector database creation.

Zero-Cost Stack

| Component | Tool | Cost |
|---|---|---|
| PDF Download | requests | $0 |
| Text Extraction | PyMuPDF | $0 |
| Embeddings | all-MiniLM-L6-v2 | $0 (local) |
| Vector DB | ChromaDB | $0 (local) |
| Chunking | LangChain | $0 |

Total RAG Building Cost: $0

Input Schema

yaml
Required:
  - project_path: "string"

Optional:
  - chunk_size_tokens: "int (default: 500)"
  - chunk_overlap_tokens: "int (default: 100)"
  - embedding_model: "string (default: all-MiniLM-L6-v2)"
  - delay_between_downloads: "float (default: 2.0)"
  - download_timeout: "int (default: 30)"
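
The inputs above can be mirrored in code as a small config object. This is a minimal sketch, not the skill's actual implementation: the class name `RagBuildConfig`, the example project path, the overlap check, and the assumption that the delay and timeout are in seconds are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagBuildConfig:
    """Hypothetical container for the I3 inputs (name is an assumption)."""

    project_path: str                      # required
    chunk_size_tokens: int = 500
    chunk_overlap_tokens: int = 100
    embedding_model: str = "all-MiniLM-L6-v2"
    delay_between_downloads: float = 2.0   # assumed to be seconds
    download_timeout: int = 30             # assumed to be seconds

    def __post_init__(self) -> None:
        # An overlap that is not smaller than the chunk size would never advance.
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")


# Only project_path has to be supplied; everything else uses the documented defaults.
config = RagBuildConfig(project_path="projects/example-review")
```

Validating the overlap at construction time is a design choice of this sketch; it surfaces a bad setting before any PDFs are downloaded.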

Output Schema

yaml
main_output:
  stage: "rag_build"
  pdf_download:
    total_papers: "int"
    downloaded: "int"
    failed: "int"
    success_rate: "string"
    total_size_mb: "int"
  rag_build:
    total_chunks: "int"
    avg_chunks_per_paper: "float"
    chunk_size_tokens: "int"
    chunk_overlap_tokens: "int"
    embedding_model: "string"
    embedding_dimensions: "int"
    vector_db: "string"
  output_paths:
    pdfs: "string"
    chroma_db: "string"
    rag_config: "string"

Human Checkpoint Protocol

🟠 SCH_RAG_READINESS (RECOMMENDED)

Before completing RAG build, I3 SHOULD:

  1. REPORT build status:

    code
    RAG Build Complete
    
    PDF Download:
    - Total papers: 287
    - PDFs downloaded: 245 (85.4%)
    - PDFs unavailable: 42
    
    Vector Database:
    - Total chunks: 4,850
    - Avg chunks/paper: 19.8
    - Embedding model: all-MiniLM-L6-v2
    - Database: ChromaDB
    
    Storage:
    - PDF size: 1.2 GB
    - Vector DB size: 450 MB
    
    Ready for research queries?
    
  2. ASK if user wants to proceed

  3. CONFIRM RAG is ready for queries
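
The ratio fields in the status report follow directly from the raw counts. The sketch below shows one way to derive them; the function name `summarize_build` and the returned dictionary layout are assumptions modeled on the Output Schema, and it assumes the average is taken over downloaded papers, which matches the sample figures in the report.

```python
def summarize_build(total_papers: int, downloaded: int, total_chunks: int) -> dict:
    """Derive the report's ratio fields from raw counts (illustrative helper)."""
    failed = total_papers - downloaded
    # Format as a one-decimal percentage string, e.g. "85.4%".
    success_rate = f"{downloaded / total_papers:.1%}" if total_papers else "n/a"
    # Average over papers that actually produced text.
    avg_chunks = round(total_chunks / downloaded, 1) if downloaded else 0.0
    return {
        "pdf_download": {
            "total_papers": total_papers,
            "downloaded": downloaded,
            "failed": failed,
            "success_rate": success_rate,
        },
        "rag_build": {
            "total_chunks": total_chunks,
            "avg_chunks_per_paper": avg_chunks,
        },
    }


# Counts taken from the sample report above.
summary = summarize_build(total_papers=287, downloaded=245, total_chunks=4850)
```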

Execution Commands

bash
# Run from the project root (the directory that contains scripts/)
cd "$(pwd)"  # no-op placeholder; replace with the path to your project root

# Stage 4: PDF Download
python scripts/04_download_pdfs.py \
  --project {project_path} \
  --delay 2.0 \
  --timeout 30

# Stage 5: RAG Build
python scripts/05_build_rag.py \
  --project {project_path} \
  --chunk-size 500 \
  --chunk-overlap 100 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2

Chunking Strategy (v1.2.6: Token-Based)

Problem: The documentation said "1000 tokens", but the code split on 1000 characters.

Fix: Token-based chunking with tiktoken, with a character-based fallback.

python
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")

# Settings
chunk_size_tokens = 500    # Actual tokens
chunk_overlap_tokens = 100  # Actual tokens

# Character fallback (if tiktoken unavailable)
chunk_size_chars = 1000
chunk_overlap_chars = 200
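
The settings above can be applied with a sliding window. This is a minimal sketch under stated assumptions, not the code in `05_build_rag.py`: the helper names `split_windows` and `chunk_text` are hypothetical, and the fallback is triggered whenever tiktoken cannot be imported or its encoding cannot be loaded.

```python
from typing import List, Sequence


def split_windows(seq: Sequence, size: int, overlap: int) -> list:
    """Slide a window of `size` items forward by `size - overlap` per step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # this window already reaches the end of the input
    return windows


def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Chunk by tokens; fall back to characters if tiktoken is unusable."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
    except Exception:
        # Character fallback with the 1000/200 values from the settings above.
        return split_windows(text, 1000, 200)
    # disallowed_special=() treats strings like "<|endoftext|>" as plain text.
    tokens = enc.encode(text, disallowed_special=())
    # Decoding a window that cuts a multi-byte character may yield a
    # replacement character at the boundary.
    return [enc.decode(window) for window in split_windows(tokens, size, overlap)]
```

With 1,200 tokens and the default settings, `split_windows` produces three windows starting at positions 0, 400, and 800, so each consecutive pair shares 100 tokens.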

Embedding Model Options

| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 (Default) | 384 | Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| bge-small-en-v1.5 | 384 | Fast | Good |
| e5-small-v2 | 384 | Fast | Good |

All models run locally at zero cost.

PDF Download Strategy

Open Access Sources

| Source | URL Pattern | Success Rate |
|---|---|---|
| Semantic Scholar | openAccessPdf.url | ~40% |
| OpenAlex | open_access.oa_url | ~50% |
| arXiv | arxiv.org/pdf/{id}.pdf | 100% |
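
One way to use these sources is to try them in order of listed success rate and take the first URL found. The sketch below assumes each paper is a dictionary carrying the fields named in the table; the function name `pick_pdf_url`, the `arxiv_id` key, and the priority order are assumptions for illustration.

```python
from typing import Optional


def pick_pdf_url(paper: dict) -> Optional[str]:
    """Return the first open-access PDF URL found in a paper record, or None."""
    # arXiv first: it has the highest listed success rate.
    arxiv_id = paper.get("arxiv_id")  # hypothetical key name
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    # OpenAlex-style record: open_access.oa_url
    oa_url = (paper.get("open_access") or {}).get("oa_url")
    if oa_url:
        return oa_url
    # Semantic Scholar-style record: openAccessPdf.url
    return (paper.get("openAccessPdf") or {}).get("url") or None


# Example record with only a Semantic Scholar link (placeholder URL).
url = pick_pdf_url({"openAccessPdf": {"url": "https://example.org/paper.pdf"}})
```

The `or {}` guards cover records where a field is present but set to null, so such a record falls through to the next source.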

Retry Logic

python
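import time

from requests.exceptions import Timeout

max_retries = 3
base_delay = 2.0  # seconds

# download_pdf and url are expected to be defined by the surrounding script
for attempt in range(max_retries):
    try:
        download_pdf(url)
        break
    except Timeout:
        if attempt < max_retries - 1:
            # Exponential backoff: wait 2 s, then 4 s; no wait after the last attempt
            time.sleep(base_delay * (2 ** attempt))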

Validation

  • Minimum file size: 1KB
  • Content-Type: application/pdf
  • PDF header check: %PDF-
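
The three checks above can be combined into one predicate. This is a minimal sketch; the function name `looks_like_pdf` and the constant name are illustrative, and it follows the list strictly, so a PDF served with a different Content-Type is rejected.

```python
MIN_PDF_BYTES = 1024  # 1 KB minimum from the list above


def looks_like_pdf(content: bytes, content_type: str = "") -> bool:
    """Apply the size, Content-Type, and header checks to a downloaded body."""
    if len(content) < MIN_PDF_BYTES:
        return False
    # Ignore parameters such as "; charset=binary" and letter case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/pdf":
        return False
    # Every PDF file starts with the magic bytes "%PDF-".
    return content.startswith(b"%PDF-")


# A body large enough, with the right type and header, passes.
ok = looks_like_pdf(b"%PDF-1.7\n" + b"0" * 2000, "application/pdf")
```

Checking the header bytes in addition to the Content-Type catches HTML error pages and paywall landing pages that a server mislabels as PDF.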

Vector Database Structure

code
data/04_rag/
├── chroma_db/
│   ├── chroma.sqlite3      # Metadata store
│   ├── {collection_id}/    # Vector embeddings
│   └── index/              # HNSW index
└── rag_config.json         # Configuration

Query Testing

After the build, I3 tests retrieval using the research question:

python
# Test query
results = vectorstore.similarity_search(
    research_question,
    k=5
)

# Report results
for doc in results:
    print(f"- {doc.metadata['title']} ({doc.metadata['year']})")
    print(f"  Preview: {doc.page_content[:150]}...")

Auto-Trigger Keywords

| Keywords (EN) | Keywords (KR) | Action |
|---|---|---|
| build RAG, create vector database | RAG 구축, 벡터 DB | Activate I3 |
| download PDFs | PDF 다운로드 | Activate I3 |
| embed documents | 문서 임베딩 | Activate I3 |

Integration with B5

I3 can call B5-parallel-document-processor for large PDF collections:

python
Task(
    subagent_type="diverga:b5",
    model="opus",
    prompt="""
    Process large PDF collection in parallel:
    - Total PDFs: {count}
    - Split across workers
    - Handle memory limits
    - Report extraction success
    """
)

Error Handling

| Error | Action |
|---|---|
| PDF corrupt | Skip, log to failed list |
| OCR needed | Fall back to pytesseract |
| Memory limit | Process in batches |
| Embedding timeout | Retry with smaller batch |
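
The last two rows (process in batches, retry with a smaller batch) can be sketched together as follows. This is an illustration under assumptions, not I3's actual handler: `embed_in_batches` is a hypothetical helper, `embed_batch` stands in for whatever embedding call the build script uses, and a timeout is assumed to surface as Python's built-in `TimeoutError`.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts batch by batch, halving the batch size after a timeout."""
    vectors: List[List[float]] = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(embed_batch(batch))
        except TimeoutError:
            if batch_size == 1:
                raise  # even a single text times out: give up
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with the smaller batch
        start += len(batch)
    return vectors


# Stand-in embedder for demonstration: times out on batches larger than 4.
def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    if len(batch) > 4:
        raise TimeoutError("batch too large")
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches([f"chunk {i}" for i in range(10)], fake_embed, batch_size=16)
```

The reduced batch size is kept for the remaining texts instead of being reset, on the assumption that whatever caused the timeout is likely to persist for the rest of the run.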

Dependencies

yaml
requires: ["I2-screening-assistant"]
sequential_next: []
parallel_compatible: ["B5-parallel-document-processor"]

Related Agents

  • I0-review-pipeline-orchestrator: Pipeline coordination
  • I2-screening-assistant: PRISMA screening
  • B5-parallel-document-processor: Large PDF batch processing