rag-architecture

Design retrieval-augmented generation pipelines, covering chunking, embedding, retrieval, and context assembly strategies.

SKILL.md
---
name: rag-architecture
description: Design retrieval-augmented generation pipelines including chunking, embedding, retrieval, and context assembly strategies.
allowed-tools: Read, Write, Glob, Grep, Task
---

RAG Architecture Design

When to Use This Skill

Use this skill when:

  • RAG architecture tasks - Designing retrieval-augmented generation pipelines, including chunking, embedding, retrieval, and context assembly strategies
  • Planning or design - Need guidance on RAG architecture approaches
  • Best practices - Want to follow established patterns and standards

Overview

Retrieval-Augmented Generation (RAG) combines retrieval from a knowledge base with LLM generation to produce accurate, grounded responses. The choices made at each stage (chunking, embedding, retrieval, and context assembly) drive both answer quality and latency.

RAG Pipeline Architecture

text
┌─────────────────────────────────────────────────────────────────┐
│                      RAG Pipeline                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  INDEXING PIPELINE                        │   │
│  │                                                           │   │
│  │  Documents → Chunking → Embedding → Vector Store          │   │
│  │                                                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  QUERY PIPELINE                           │   │
│  │                                                           │   │
│  │  Query → Embedding → Retrieval → Reranking → Context →   │   │
│  │         LLM Generation → Response                         │   │
│  │                                                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Document Chunking

Chunking Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| Fixed Size | Split by character/token count | Simple, general |
| Sentence | Split at sentence boundaries | Prose, articles |
| Paragraph | Split at paragraph boundaries | Structured docs |
| Semantic | Split by topic/meaning | Technical docs |
| Recursive | Hierarchical splitting | Mixed content |
| Document Structure | Use headers, sections | Manuals, specs |

Chunk Size Guidelines

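| Document Type | Chunk Size | Overlap |
|---------------|------------|---------|
| FAQ | 100-300 tokens | 10-20% |
| Articles | 300-500 tokens | 15-25% |
| Technical Docs | 500-1000 tokens | 20-30% |
| Legal/Contracts | 200-400 tokens | 25-35% |
| Code | 50-150 lines | By function |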

Chunking Implementation

csharp
public class DocumentChunker
{
    public IEnumerable<Chunk> ChunkDocument(
        Document document,
        ChunkingOptions options)
    {
        return options.Strategy switch
        {
            ChunkingStrategy.FixedSize =>
                FixedSizeChunk(document.Content, options.ChunkSize, options.Overlap),

            ChunkingStrategy.Sentence =>
                SentenceChunk(document.Content, options.MaxSentences),

            ChunkingStrategy.Semantic =>
                SemanticChunk(document.Content, options.SemanticThreshold),

            ChunkingStrategy.Recursive =>
                RecursiveChunk(document.Content, options),

            _ => throw new NotSupportedException(
                $"Chunking strategy '{options.Strategy}' is not supported.")
        };
    }

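    // Tries progressively finer separators until every piece fits the chunk size.
    // CountTokens, MergeSmallChunks, and FixedSizeChunk are helpers not shown here.
    private IEnumerable<Chunk> RecursiveChunk(
        string content,
        ChunkingOptions options)
    {
        var separators = new[] { "\n\n", "\n", ". ", " " };

        foreach (var separator in separators)
        {
            var splits = content.Split(separator);

            if (splits.All(s => CountTokens(s) <= options.ChunkSize))
            {
                return MergeSmallChunks(splits, options.ChunkSize, options.Overlap)
                    .Select((text, i) => new Chunk
                    {
                        Id = Guid.NewGuid(),
                        Content = text,
                        Index = i,
                        Metadata = new ChunkMetadata
                        {
                            TokenCount = CountTokens(text),
                            Separator = separator
                        }
                    })
                    .ToList(); // Materialize so chunk IDs stay stable across enumerations
            }
        }

        // No separator produced small enough pieces; fall back to fixed-size windows
        return FixedSizeChunk(content, options.ChunkSize, options.Overlap);
    }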
}
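
The `FixedSizeChunk` fallback referenced above is not shown in the class. The following is a minimal sketch of what such a helper might look like, under two loud assumptions: whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), and the overlap is given as a token count rather than a percentage. The names `FixedSizeChunker` and `Split` are hypothetical and do not come from the original code.

```csharp
using System;
using System.Collections.Generic;

public static class FixedSizeChunker
{
    // Splits text into windows of at most chunkSize "tokens"; each window
    // starts with the last `overlap` tokens of the previous window.
    // Assumption: whitespace-separated words approximate tokens.
    public static List<string> Split(string text, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var tokens = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        var stride = chunkSize - overlap; // How far each window advances
        var chunks = new List<string>();

        for (var start = 0; start < tokens.Length; start += stride)
        {
            var length = Math.Min(chunkSize, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, length));

            // Stop once a window has reached the end of the text
            if (start + length >= tokens.Length)
                break;
        }

        return chunks;
    }
}
```

As a worked example of the guideline table: 500-token chunks with 20% overlap mean 100 shared tokens and a stride of 400, so a 2,000-token document yields five chunks starting at tokens 0, 400, 800, 1200, and 1600.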

Embedding Strategies

Embedding Model Selection

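| Model | Dimensions | Speed | Quality | Cost |
|-------|------------|-------|---------|------|
| text-embedding-3-small | 1536 | Fast | Good | Low |
| text-embedding-3-large | 3072 | Medium | Excellent | Medium |
| text-embedding-ada-002 | 1536 | Fast | Good | Low |
| Cohere embed-v3 | 1024 | Fast | Excellent | Medium |
| BGE-large | 1024 | Medium | Excellent | Free (local) |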

Embedding Best Practices

csharp
public class EmbeddingService
{
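    private readonly IEmbeddingClient _client;
    private readonly SemaphoreSlim _rateLimiter;

    public EmbeddingService(IEmbeddingClient client, int maxConcurrentRequests = 4)
    {
        _client = client;
        // Caps in-flight embedding requests across all callers sharing this instance
        _rateLimiter = new SemaphoreSlim(maxConcurrentRequests);
    }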

    public async Task<float[][]> EmbedBatch(
        IEnumerable<string> texts,
        CancellationToken ct)
    {
        var textList = texts.ToList();
        var embeddings = new List<float[]>();

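        // Send at most 100 texts per request; the semaphore limits how many
        // requests run at once, which helps stay under provider rate limits
        foreach (var batch in textList.Chunk(100))
        {
            await _rateLimiter.WaitAsync(ct);

            try
            {
                // Enumerable.Chunk already yields string[] batches
                var batchEmbeddings = await _client.EmbedAsync(
                    batch,
                    ct);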

                embeddings.AddRange(batchEmbeddings);
            }
            finally
            {
                _rateLimiter.Release();
            }
        }

        return embeddings.ToArray();
    }

    public async Task<float[]> EmbedQuery(string query, CancellationToken ct)
    {
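        // Only some models expect a query prefix (E5-style models use
        // "query: " for queries and "passage: " for documents).
        // Remove the prefix for models that do not expect one.
        var formattedQuery = $"query: {query}";
        return await _client.EmbedAsync(formattedQuery, ct);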
    }
}

Vector Store Design

Store Selection

| Store | Type | Scalability | Features |
|-------|------|-------------|----------|
| Azure AI Search | Managed | High | Hybrid search, filters |
| Pinecone | Managed | High | Simple API |
| Qdrant | Self-hosted/Managed | High | Payload filters |
| Weaviate | Self-hosted/Managed | High | GraphQL, modules |
| Chroma | Self-hosted | Medium | Simple, local dev |
| pgvector | PostgreSQL extension | Medium | SQL integration |

Index Design

csharp
public class VectorIndexSchema
{
    public string IndexName { get; set; } = "documents";

    public List<VectorField> VectorFields { get; set; } =
    [
        new VectorField
        {
            Name = "content_vector",
            Dimensions = 1536, // Must match the embedding model's output size
            Similarity = SimilarityMetric.Cosine,
            IndexType = IndexType.HNSW,
            HnswConfig = new HnswConfig
            {
                M = 16,
                EfConstruction = 100,
                EfSearch = 40
            }
        }
    ];

    public List<MetadataField> MetadataFields { get; set; } =
    [
        new MetadataField("document_id", FieldType.String, Filterable: true),
        new MetadataField("source", FieldType.String, Filterable: true),
        new MetadataField("created_at", FieldType.DateTime, Filterable: true),
        new MetadataField("category", FieldType.StringArray, Filterable: true),
        new MetadataField("content", FieldType.Text, Searchable: true)
    ];
}
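
The schema above selects `SimilarityMetric.Cosine`. As a reminder of what that metric computes for a query vector and a stored vector, here is a brute-force sketch; an HNSW index approximates a search over this score rather than evaluating it against every document. The names `VectorMath` and `CosineSimilarity` are hypothetical, and returning 0 for a zero vector is a convention chosen for this sketch.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensions.");

        // Accumulate in double to limit rounding error on long vectors
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Convention chosen here: a zero vector is similar to nothing
        if (normA == 0 || normB == 0)
            return 0f;

        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }
}
```

The length check is the reason `Dimensions` has to match the embedding model: vectors from a 1536-dimension model cannot be compared with vectors from a 3072-dimension one, so switching models means re-embedding and re-indexing the corpus.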

Retrieval Strategies

Retrieval Methods

| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| Vector Search | Semantic similarity | Handles synonyms | May miss exact matches |
| Keyword Search | BM25/TF-IDF | Exact matches | Misses synonyms |
| Hybrid | Vector + Keyword | Best of both | More complex |
| Multi-Query | Generate variations | Better recall | Higher cost |
| HyDE | Hypothetical answer | Better precision | Added latency |

Hybrid Search Implementation

csharp
public class HybridRetriever
{
    private readonly IVectorStore _vectorStore;
    private readonly ISearchClient _keywordSearch;

    public async Task<List<SearchResult>> Retrieve(
        string query,
        RetrievalOptions options,
        CancellationToken ct)
    {
        // Run vector and keyword search in parallel
        var vectorTask = _vectorStore.SearchAsync(
            query,
            options.TopK * 2,  // Retrieve more for fusion
            ct);

        var keywordTask = _keywordSearch.SearchAsync(
            query,
            options.TopK * 2,
            ct);

        await Task.WhenAll(vectorTask, keywordTask);

        var vectorResults = await vectorTask;
        var keywordResults = await keywordTask;

        // Reciprocal Rank Fusion
        var fused = ReciprocalRankFusion(
            vectorResults,
            keywordResults,
            options.VectorWeight,
            options.KeywordWeight);

        return fused.Take(options.TopK).ToList();
    }

    private List<SearchResult> ReciprocalRankFusion(
        List<SearchResult> vectorResults,
        List<SearchResult> keywordResults,
        float vectorWeight,
        float keywordWeight,
        int k = 60)
    {
        var scores = new Dictionary<string, float>();
        // Keep the first full result seen per ID so Content and Source survive fusion
        var byId = new Dictionary<string, SearchResult>();

        for (int i = 0; i < vectorResults.Count; i++)
        {
            var id = vectorResults[i].Id;
            byId.TryAdd(id, vectorResults[i]);
            scores.TryAdd(id, 0);
            scores[id] += vectorWeight / (k + i + 1);
        }

        for (int i = 0; i < keywordResults.Count; i++)
        {
            var id = keywordResults[i].Id;
            byId.TryAdd(id, keywordResults[i]);
            scores.TryAdd(id, 0);
            scores[id] += keywordWeight / (k + i + 1);
        }

        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => new SearchResult
            {
                Id = kv.Key,
                Content = byId[kv.Key].Content,
                Source = byId[kv.Key].Source,
                Score = kv.Value // Fused score replaces the per-retriever score
            })
            .ToList();
    }
}
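
To make the fusion step concrete, the sketch below applies the same weighted Reciprocal Rank Fusion formula to plain lists of document IDs, so it runs without a vector store or search client. The `RankFusion` class and `Fuse` method are hypothetical names for illustration and are not part of `HybridRetriever`.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RankFusion
{
    // Weighted Reciprocal Rank Fusion over two ranked lists of document IDs.
    // A document at 1-based rank r in a list contributes weight / (k + r).
    public static List<(string Id, double Score)> Fuse(
        IReadOnlyList<string> vectorRanking,
        IReadOnlyList<string> keywordRanking,
        double vectorWeight = 1.0,
        double keywordWeight = 1.0,
        int k = 60)
    {
        var scores = new Dictionary<string, double>();

        void Accumulate(IReadOnlyList<string> ranking, double weight)
        {
            for (var i = 0; i < ranking.Count; i++)
            {
                // current is 0 when the ID has not been seen yet
                scores.TryGetValue(ranking[i], out var current);
                scores[ranking[i]] = current + weight / (k + i + 1);
            }
        }

        Accumulate(vectorRanking, vectorWeight);
        Accumulate(keywordRanking, keywordWeight);

        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => (Id: kv.Key, Score: kv.Value))
            .ToList();
    }
}
```

With a vector ranking of A, B, C and a keyword ranking of C, A, D at equal weights and `k = 60`, the fused order is A, C, B, D. A sits at ranks 1 and 2 and scores 1/61 + 1/62 ≈ 0.0325, narrowly ahead of C at ranks 3 and 1 with 1/63 + 1/61 ≈ 0.0323. Documents found by both retrievers outrank those found by only one.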

Context Assembly

Context Window Management

csharp
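using System.Text;

public class ContextAssembler
{
    private readonly int _maxTokens;

    // maxTokens is the total budget, typically the model's context window
    public ContextAssembler(int maxTokens) => _maxTokens = maxTokens;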

    public string AssembleContext(
        List<SearchResult> results,
        string query,
        int reservedTokens = 500)
    {
        var availableTokens = _maxTokens - reservedTokens;
        var context = new StringBuilder();
        var usedTokens = 0;

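        // Results are expected in relevance order, as returned by retrieval
        foreach (var result in results)
        {
            // Count the source header as well as the content against the budget
            var chunkTokens =
                CountTokens($"[Source: {result.Source}]") + CountTokens(result.Content);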

            if (usedTokens + chunkTokens > availableTokens)
                break;

            context.AppendLine($"[Source: {result.Source}]");
            context.AppendLine(result.Content);
            context.AppendLine();

            usedTokens += chunkTokens;
        }

        return context.ToString();
    }
}

RAG Evaluation

Evaluation Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| Retrieval Precision | Relevant retrieved docs / All retrieved docs | > 80% |
| Retrieval Recall | Relevant retrieved docs / All relevant docs | > 70% |
| Answer Accuracy | Correct answers | > 90% |
| Faithfulness | Answer supported by context | > 95% |
| Answer Relevancy | Answer matches query | > 85% |

Evaluation Framework

csharp
public class RagEvaluator
{
    public async Task<EvaluationReport> Evaluate(
        List<TestCase> testCases,
        IRagPipeline pipeline,
        CancellationToken ct)
    {
        var results = new List<TestResult>();

        foreach (var testCase in testCases)
        {
            var response = await pipeline.Query(testCase.Query, ct);

            results.Add(new TestResult
            {
                Query = testCase.Query,
                ExpectedAnswer = testCase.ExpectedAnswer,
                ActualAnswer = response.Answer,
                RetrievedDocs = response.Sources,
                RelevantDocs = testCase.RelevantDocs,
                Metrics = new TestMetrics
                {
                    RetrievalPrecision = CalculatePrecision(
                        response.Sources, testCase.RelevantDocs),
                    RetrievalRecall = CalculateRecall(
                        response.Sources, testCase.RelevantDocs),
                    AnswerCorrect = await EvaluateAnswer(
                        response.Answer, testCase.ExpectedAnswer),
                    Faithful = await CheckFaithfulness(
                        response.Answer, response.Context)
                }
            });
        }

        return new EvaluationReport(results);
    }
}
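
`CalculatePrecision` and `CalculateRecall` are called above but not defined. One possible shape for them is sketched below, assuming that retrieved and relevant documents can be compared by string ID (the original does not say what type `Sources` and `RelevantDocs` hold). The `RetrievalMetrics` class name is hypothetical, and returning 0 when a denominator is empty is a convention chosen for this sketch.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RetrievalMetrics
{
    // Precision: share of retrieved documents that are relevant.
    public static double Precision(
        IEnumerable<string> retrievedIds,
        IEnumerable<string> relevantIds)
    {
        var retrieved = retrievedIds.ToHashSet();
        if (retrieved.Count == 0)
            return 0.0; // Convention chosen here: nothing retrieved scores 0

        var relevant = relevantIds.ToHashSet();
        var hits = retrieved.Count(id => relevant.Contains(id));
        return (double)hits / retrieved.Count;
    }

    // Recall: share of relevant documents that were retrieved.
    public static double Recall(
        IEnumerable<string> retrievedIds,
        IEnumerable<string> relevantIds)
    {
        var relevant = relevantIds.ToHashSet();
        if (relevant.Count == 0)
            return 0.0; // Convention chosen here: no labeled relevant docs scores 0

        var retrieved = retrievedIds.ToHashSet();
        var hits = relevant.Count(id => retrieved.Contains(id));
        return (double)hits / relevant.Count;
    }
}
```

For example, if a query retrieves d1, d2, d3, and d4 while the labeled relevant set is d1, d3, and d5, precision is 2/4 = 0.5 and recall is 2/3. Using sets means duplicate IDs, such as several chunks from the same document, are counted once.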

Architecture Template

markdown
# RAG Architecture: [System Name]

## Overview
[Brief description of the RAG system purpose]

## Components

### Document Processing
- **Source**: [Document sources]
- **Chunking**: [Strategy and parameters]
- **Embedding**: [Model and dimensions]

### Vector Store
- **Provider**: [Azure AI Search / Pinecone / etc.]
- **Index**: [Index configuration]
- **Metadata**: [Stored fields]

### Retrieval
- **Method**: [Vector / Hybrid / Multi-query]
- **Top-K**: [Number of results]
- **Filters**: [Applied filters]

### Generation
- **Model**: [LLM model]
- **Context Window**: [Token allocation]
- **Prompt**: [Template reference]

## Data Flow
[Mermaid diagram of the pipeline]

## Performance Targets
| Metric | Target |
|--------|--------|
| Retrieval Latency | < 200ms |
| E2E Latency | < 3s |
| Answer Accuracy | > 90% |

Validation Checklist

  • Document sources identified
  • Chunking strategy selected and tested
  • Embedding model chosen
  • Vector store provisioned
  • Retrieval method determined
  • Context assembly strategy defined
  • Evaluation metrics established
  • Performance targets set
  • Monitoring planned

Integration Points

Inputs from:

  • Data sources → Documents to index
  • model-selection skill → Embedding/LLM choice

Outputs to:

  • prompt-engineering skill → Context integration
  • token-budgeting skill → Cost estimation
  • Application code → RAG implementation