AgentSkillsCN

rag

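Use this skill when the user wants to build RAG, a Q&A system, or a knowledge base from documents. Triggers include: RAG, retrieval augmented generation, Q&A system, knowledge base, document Q&A, chat with docs, ChatGPT for docs, LLM + retrieval, semantic search over documents, grounding an LLM with facts, reducing hallucination, enterprise search.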

SKILL.md
--- frontmatter
name: rag
description: "Use when user wants to build RAG, Q&A system, or knowledge base with documents. Triggers on: RAG, retrieval augmented generation, Q&A system, knowledge base, document Q&A, chat with docs, ChatGPT for docs, LLM + retrieval, semantic search over documents, ground LLM with facts, reduce hallucination, enterprise search."

RAG - Retrieval Augmented Generation

Build intelligent Q&A systems that ground LLM responses in your documents, reducing hallucinations and enabling knowledge updates without retraining.

When to Activate

This skill should be activated when the user:

  • Wants to build a Q&A system over their documents
  • Needs to reduce LLM hallucinations with factual grounding
  • Asks about "chat with docs", "ChatGPT for my data", or similar
  • Wants to keep knowledge up-to-date without fine-tuning
  • Needs source attribution for answers

Interactive Flow

Before implementing, gather requirements through focused questions.

Step 1: Understand the Use Case

Ask ONE question at a time:

code
"What type of documents will users query?"

A) Internal knowledge base (policies, procedures, docs)
B) Customer-facing FAQ / support articles
C) Technical documentation (API docs, code)
D) Mixed / other

Which one? (A/B/C/D)

Based on answer, follow up:

| Answer | Next Question |
|---|---|
| A (Internal KB) | "Do you need access control (different users see different docs)?" |
| B (Customer FAQ) | "Do you need multi-turn conversation or single Q&A?" |
| C (Technical) | "Will queries include code snippets or just natural language?" |
| D (Mixed) | "Can you describe the main document types?" |

Step 2: Clarify Constraints

code
"What's your priority?"

A) Accuracy first (willing to accept slower responses)
B) Speed first (good-enough answers, fast)
C) Cost first (minimize API calls)

Choose A, B, or C.

| Priority | Recommendation |
|---|---|
| Accuracy | Add reranking → suggest rag-with-rerank |
| Speed | Basic RAG with caching |
| Cost | Local embeddings (BGE), smaller LLM |
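
For the speed-first option, one way to add caching is to memoize answers by normalized question text. The sketch below is an illustration only, not part of this skill: `AnswerCache`, `normalize`, and `answer_fn` are hypothetical names, and `answer_fn` stands in for whatever function runs retrieval plus generation.

```python
from collections import OrderedDict
from typing import Callable


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(question.lower().split())


class AnswerCache:
    """Small LRU cache in front of a (hypothetical) RAG answer function."""

    def __init__(self, answer_fn: Callable[[str], str], max_size: int = 256):
        self.answer_fn = answer_fn
        self.max_size = max_size
        self._store = OrderedDict()

    def ask(self, question: str) -> str:
        key = normalize(question)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        answer = self.answer_fn(question)  # cache miss: run the full pipeline
        self._store[key] = answer
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return answer
```

Cached answers go stale when the underlying documents change, so a real deployment would likely clear the cache after re-indexing.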

Step 3: Confirm Before Implementation

code
"Based on your requirements, I'll build:

- **Document type**: [from Step 1]
- **Chunk size**: [512 for general, 256 for FAQ, 1024 for technical]
- **Embedding**: text-embedding-3-small
- **Retrieval**: Top-5 with AUTOINDEX
- **LLM**: gpt-4o-mini

Does this look right? (yes / adjust [what])"
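
The answers gathered in Steps 1 and 2 could be turned into the Step 3 summary with a small helper. This is a sketch under assumptions: `RagPlan` and `build_plan` are hypothetical names, the chunk sizes and overlaps are the ones this guide lists, and mapping answers A and D to the "general" chunk settings is a guess rather than something the guide specifies.

```python
from dataclasses import dataclass

# (chunk_size, overlap) per Step 1 answer. B = FAQ and C = technical follow the
# guide; treating A and D as "general docs" is an assumption.
CHUNKING = {"A": (512, 50), "B": (256, 0), "C": (1024, 100), "D": (512, 50)}


@dataclass
class RagPlan:
    chunk_size: int
    chunk_overlap: int
    embedding: str = "text-embedding-3-small"
    llm: str = "gpt-4o-mini"
    top_k: int = 5
    rerank: bool = False
    cache: bool = False


def build_plan(doc_type: str, priority: str) -> RagPlan:
    """Map the Step 1 answer (A-D) and Step 2 answer (A-C) to a plan."""
    size, overlap = CHUNKING[doc_type.upper()]
    plan = RagPlan(chunk_size=size, chunk_overlap=overlap)
    choice = priority.upper()
    if choice == "A":    # accuracy first: add a reranking stage
        plan.rerank = True
    elif choice == "B":  # speed first: basic RAG with caching
        plan.cache = True
    elif choice == "C":  # cost first: local embeddings (the smaller LLM is left unspecified)
        plan.embedding = "BAAI/bge-large-en"
    else:
        raise ValueError(f"unknown priority: {priority!r}")
    return plan


print(build_plan("B", "A"))
```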

Decision Points During Implementation

| Checkpoint | Question |
|---|---|
| After chunking | "I've split into X chunks. Sample: [show 2]. Chunk size OK?" |
| After indexing | "Collection created with X documents. Ready to test?" |
| After first query | "Here's a test result. Quality acceptable?" |

Core Concepts

The RAG Paradigm

RAG decouples knowledge storage from reasoning capability:

code
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Documents  │───▶│  Retrieval  │───▶│    LLM      │
│  (Facts)    │    │  (Relevance)│    │  (Reasoning)│
└─────────────┘    └─────────────┘    └─────────────┘
      ▲                   │                  │
      │                   ▼                  ▼
   Update            Top-K chunks        Answer with
   anytime           as context          citations

Key insight: LLMs are excellent reasoners but unreliable knowledge stores. RAG leverages their reasoning while externalizing knowledge to a retrievable corpus.

Mental Model: Library + Librarian

Think of RAG as a library (vector database) with a librarian (retrieval) helping a scholar (LLM):

  • The library stores books (document chunks) indexed by topic (embeddings)
  • The librarian finds relevant books based on the question
  • The scholar synthesizes an answer from the provided materials
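
The analogy can be made concrete with a toy retriever. The sketch below uses plain word counts as a stand-in for real embeddings, which is an assumption made purely for illustration; `embed`, `cosine`, and `retrieve` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, library: list[str], top_k: int = 2) -> list[str]:
    """The 'librarian': rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(library, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:top_k]


# The 'library': document chunks.
library = [
    "Milvus is a vector database for similarity search.",
    "Refunds are processed within five business days.",
    "HNSW is a graph index used for approximate nearest neighbor search.",
]
question = "How long do refunds take?"
context = retrieve(question, library, top_k=1)

# The 'scholar' (LLM) would receive this prompt and synthesize the answer.
prompt = "References:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```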

Why RAG Over Alternatives

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| RAG | No retraining, instant updates, source attribution | Retrieval quality limits accuracy | Dynamic knowledge, audit requirements |
| Fine-tuning | Deep knowledge integration | Expensive, slow updates, no citations | Stable domain expertise |
| Long context | Simple, no chunking | Expensive per query, bounded by the context window (e.g. 128K tokens) | Small corpus, one-off analysis |
| Pure prompting | Zero setup | Knowledge cutoff, hallucinations | General knowledge only |

Choose RAG when:

  • Knowledge changes frequently (docs updated weekly/monthly)
  • Users need source attribution ("where did you get this?")
  • Corpus exceeds context window (>100K tokens)
  • Domain accuracy matters more than response speed

Avoid RAG when:

  • Corpus is tiny (<10 pages) — just use long context
  • Questions don't need specific facts — pure LLM suffices
  • Latency is critical (<100ms) — consider caching or fine-tuning
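
The corpus-size criterion can be checked with a rough token estimate. This sketch assumes about 4 characters per token, a common heuristic for English text and not an exact count, and reuses the 100K-token threshold from the list above; `estimate_tokens` and `suggest_approach` are hypothetical names.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use a real tokenizer for exact numbers."""
    return math.ceil(len(text) / chars_per_token)


def suggest_approach(corpus: list[str], context_budget: int = 100_000) -> str:
    """Suggest 'rag' when the corpus would not fit in the context budget."""
    total = sum(estimate_tokens(doc) for doc in corpus)
    return "rag" if total > context_budget else "long-context"


print(suggest_approach(["short policy document " * 50]))
```

Size is only one of the criteria listed above; update frequency and the need for source attribution can justify RAG even for a small corpus.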

Pipeline Architecture

code
┌──────────────────────────────────────────────────────────────────┐
│                        INDEXING PHASE                            │
├──────────────────────────────────────────────────────────────────┤
│  Documents  ──▶  Chunking  ──▶  Embedding  ──▶  Vector Store    │
│   (raw)         (512 tokens)   (1536-dim)      (Milvus)         │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                        QUERY PHASE                               │
├──────────────────────────────────────────────────────────────────┤
│  Question ──▶ Embed ──▶ Search ──▶ Top-K ──▶ LLM ──▶ Answer     │
│                         (HNSW)     chunks  (GPT-4)              │
└──────────────────────────────────────────────────────────────────┘

Stage Breakdown

| Stage | Purpose | Key Decision |
|---|---|---|
| Chunking | Split docs into retrievable units | Chunk size (see references/chunk-strategies.md) |
| Embedding | Convert text to vectors | Model choice (accuracy vs cost vs speed) |
| Indexing | Enable fast similarity search | Index type (HNSW for most cases) |
| Retrieval | Find relevant chunks | Top-K value (recall vs precision) |
| Generation | Synthesize answer | Prompt design, temperature |

Implementation

Core implementation with production-ready defaults:

python
# pip install pymilvus openai langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
from pymilvus import DataType, MilvusClient

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536  # output dimension of text-embedding-3-small


class RAGSystem:
    def __init__(self, collection_name: str = "rag_kb", uri: str = "./milvus.db"):
        # A local file path as uri runs Milvus Lite; pass a server URI for a deployment.
        self.client = MilvusClient(uri=uri)
        self.collection_name = collection_name
        self.openai = OpenAI()  # reads OPENAI_API_KEY from the environment
        # Note: chunk_size and chunk_overlap are measured in characters here, not tokens.
        self.splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
        self._init_collection()

    def _embed(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts; returns one vector per input, in order."""
        response = self.openai.embeddings.create(model=EMBEDDING_MODEL, input=texts)
        return [item.embedding for item in response.data]

    def _init_collection(self):
        """Create the collection and its vector index if it does not exist yet."""
        if self.client.has_collection(self.collection_name):
            return
        schema = self.client.create_schema(auto_id=True, enable_dynamic_field=True)
        schema.add_field("id", DataType.INT64, is_primary=True)
        schema.add_field("text", DataType.VARCHAR, max_length=65535)
        schema.add_field("source", DataType.VARCHAR, max_length=512)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)

        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
        self.client.create_collection(self.collection_name, schema=schema, index_params=index_params)

    def add_document(self, text: str, source: str = "") -> int:
        """Chunk, embed, and insert one document; returns the number of chunks stored."""
        chunks = self.splitter.split_text(text)
        if not chunks:
            return 0  # empty document: skip the embedding call entirely
        embeddings = self._embed(chunks)
        data = [{"text": c, "source": source, "embedding": e} for c, e in zip(chunks, embeddings)]
        self.client.insert(self.collection_name, data)
        return len(chunks)

    def query(self, question: str, top_k: int = 5) -> dict:
        # Retrieve: embed the question and fetch the top_k most similar chunks
        results = self.client.search(self.collection_name, self._embed([question]),
                                     limit=top_k, output_fields=["text", "source"])
        contexts = [{"text": h["entity"]["text"], "source": h["entity"]["source"]} for h in results[0]]

        # Generate: pass the retrieved chunks, labeled by source, to the LLM
        context_text = "\n\n".join([f"[{c['source']}]: {c['text']}" for c in contexts])
        response = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"""Answer based on these references. Cite sources.

References:
{context_text}

Question: {question}"""}],
            temperature=0.3
        )
        # De-duplicate sources while keeping retrieval order (most relevant first)
        return {"answer": response.choices[0].message.content,
                "sources": list(dict.fromkeys(c["source"] for c in contexts))}

Usage:

python
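rag = RAGSystem()
# Read the file with an explicit encoding and close it when done
with open("docs/intro.md", encoding="utf-8") as f:
    rag.add_document(f.read(), source="intro.md")
result = rag.query("What is Milvus?")
print(result["answer"])
print(result["sources"])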

For advanced patterns (streaming, multi-turn, hybrid search), see references/advanced-patterns.md.

Configuration Guide

Chunk Size Selection

| Document Type | chunk_size | overlap | Rationale |
|---|---|---|---|
| General docs | 512 | 50 | Balance context vs precision |
| Technical docs | 1024 | 100 | Preserve code blocks, procedures |
| FAQ | 256 | 0 | One Q&A per chunk |
| Legal/contracts | 1024 | 200 | High overlap for clause continuity |

Rule of thumb: Start with 512/50, adjust based on answer quality.
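
To see how chunk_size and overlap interact, here is a minimal sliding-window chunker. It is a simplified sketch that counts characters and cuts at fixed offsets; `chunk_text` is a hypothetical helper, not the splitter used in the implementation above, which additionally prefers to break on separators such as paragraphs and newlines.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last `overlap` characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.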

Top-K Selection

| Use Case | top_k | Why |
|---|---|---|
| Precise factual Q&A | 3-5 | Less noise, focused context |
| Research/synthesis | 8-12 | More perspectives |
| With reranking | 20-50 | Recall high, reranker filters |
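
The "with reranking" row describes a two-stage pattern: retrieve broadly, then let a more precise scorer pick the final few. The sketch below illustrates only the control flow; `retrieve_then_rerank` is a hypothetical name, and the word-overlap scorer is a toy stand-in for a real reranking model.

```python
from typing import Callable


def retrieve_then_rerank(
    question: str,
    candidates: list[tuple[str, float]],       # (chunk text, vector-search score)
    rerank_score: Callable[[str, str], float],  # (question, chunk) -> relevance
    recall_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    # Stage 1: keep the recall_k best candidates by the cheap vector-search score.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:recall_k]
    # Stage 2: rescore only the shortlist with the more expensive reranker.
    reranked = sorted(shortlist, key=lambda c: rerank_score(question, c[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


def word_overlap(question: str, chunk: str) -> float:
    """Toy reranker: number of shared lowercase words."""
    return float(len(set(question.lower().split()) & set(chunk.lower().split())))


candidates = [("cats purr", 0.9), ("dogs bark loudly", 0.8), ("fish swim", 0.1)]
print(retrieve_then_rerank("why do dogs bark", candidates, word_overlap, recall_k=2, final_k=1))
# → ['dogs bark loudly']
```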

Embedding Model Tradeoffs

| Model | Dim | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | $0.02/1M |
| text-embedding-3-large | 3072 | Medium | Better | $0.13/1M |
| BAAI/bge-large-en | 1024 | Local | Good | Free |

See references/embedding-models.md for detailed comparison.
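
A quick back-of-the-envelope calculation helps when comparing these models. The sketch below takes the prices and dimensions from the table above, reads "/1M" as per one million tokens, and assumes vectors are stored as 4-byte floats with no index overhead; the function names are hypothetical.

```python
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "BAAI/bge-large-en": 0.0,  # runs locally, no API charge
}
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "BAAI/bge-large-en": 1024,
}


def embedding_cost_usd(model: str, total_tokens: int) -> float:
    """One-off API cost of embedding total_tokens tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]


def vector_storage_mb(model: str, num_chunks: int) -> float:
    """Raw float32 vector storage in megabytes (index overhead not included)."""
    return num_chunks * DIMENSIONS[model] * 4 / 1_000_000


# Example: 10,000 chunks of roughly 512 tokens each.
tokens = 10_000 * 512
for model in DIMENSIONS:
    print(model, round(embedding_cost_usd(model, tokens), 4), "USD,",
          round(vector_storage_mb(model, 10_000), 2), "MB")
```

For this example the small model costs about $0.10 and needs about 61 MB of raw vectors, while the large model costs about $0.67 and needs twice the storage.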

Common Pitfalls

1. Chunks Too Large

Symptom: Irrelevant information pollutes context
Fix: Reduce chunk_size, or use semantic chunking

2. Chunks Too Small

Symptom: Answers lack context, feel fragmented
Fix: Increase chunk_size or overlap

3. Wrong Embedding Model for Language

Symptom: Poor retrieval for non-English text
Fix: Use multilingual model (bge-m3) or language-specific model

4. Ignoring Metadata

Symptom: Can't filter by date, source, or category
Fix: Store metadata fields, use filtered search
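
Filtered search in Milvus takes a boolean expression string. The helper below sketches how such an expression might be assembled from optional metadata; `build_filter` is a hypothetical name, and `category` and `year` are illustrative fields that the schema in the Implementation section does not define (only `source` is).

```python
import json
from typing import Optional


def build_filter(source: Optional[str] = None,
                 category: Optional[str] = None,
                 min_year: Optional[int] = None) -> str:
    """Combine the given metadata constraints into one 'and' expression."""
    clauses = []
    if source is not None:
        # json.dumps adds double quotes and escapes embedded quotes
        clauses.append(f"source == {json.dumps(source)}")
    if category is not None:
        clauses.append(f"category == {json.dumps(category)}")
    if min_year is not None:
        clauses.append(f"year >= {int(min_year)}")
    return " and ".join(clauses)  # empty string means "no filter"


print(build_filter(source="intro.md", min_year=2023))
# → source == "intro.md" and year >= 2023
```

The resulting string would be passed as the `filter` argument of `MilvusClient.search`, alongside `limit` and `output_fields`.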

5. No Source Attribution

Symptom: Users don't trust answers
Fix: Always return sources, include in prompt
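
One way to apply this fix is to number each source in the prompt and return the same list to the caller, so citations in the answer can be mapped back to documents. This is a sketch with a hypothetical helper name, `build_cited_prompt`, and prompt wording chosen for illustration; it expects context dicts with `text` and `source` keys.

```python
def build_cited_prompt(question: str, contexts: list[dict]) -> tuple[str, list[str]]:
    """Return (prompt, sources); sources[n-1] is the document cited as [n]."""
    # De-duplicate sources while keeping retrieval order
    sources = list(dict.fromkeys(c["source"] for c in contexts))
    number = {src: i + 1 for i, src in enumerate(sources)}
    references = "\n\n".join(
        f"[{number[c['source']]}] ({c['source']}) {c['text']}" for c in contexts
    )
    prompt = (
        "Answer using only the references below and cite them as [n]. "
        "If the references do not contain the answer, say so.\n\n"
        f"References:\n{references}\n\n"
        f"Question: {question}"
    )
    return prompt, sources


contexts = [
    {"text": "Milvus is a vector database.", "source": "intro.md"},
    {"text": "AUTOINDEX picks an index type automatically.", "source": "index.md"},
    {"text": "Milvus Lite stores data in a local file.", "source": "intro.md"},
]
prompt, sources = build_cited_prompt("What is Milvus?", contexts)
print(prompt)
print(sources)
```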

When to Level Up

| Symptom | Solution | Skill |
|---|---|---|
| Top results aren't the best | Add reranking | rag-with-rerank |
| Complex multi-step questions | Use agentic approach | agentic-rag |
| Questions need cross-doc reasoning | Multi-hop retrieval | multi-hop-rag |

References

Internal:

Core operators:

  • core:chunking - Document chunking utilities
  • core:embedding - Embedding generation
  • core:ray - Data processing at scale

Verticals: