Vector Databases for AI Applications
When to Use This Skill
Use this skill when implementing:
- •RAG (Retrieval-Augmented Generation) systems for AI chatbots
- •Semantic search capabilities (meaning-based, not just keyword)
- •Recommendation systems based on similarity
- •Multi-modal AI (unified search across text, images, audio)
- •Document similarity and deduplication
- •Question answering over private knowledge bases
Quick Decision Framework
1. Vector Database Selection
START: Choosing a Vector Database EXISTING INFRASTRUCTURE? ├─ Using PostgreSQL already? │ └─ pgvector (<10M vectors, tight budget) │ See: references/pgvector.md │ └─ No existing vector database? │ ├─ OPERATIONAL PREFERENCE? │ │ │ ├─ Zero-ops managed only │ │ └─ Pinecone (fully managed, excellent DX) │ │ See: references/pinecone.md │ │ │ └─ Flexible (self-hosted or managed) │ │ │ ├─ SCALE: <100M vectors + complex filtering ⭐ │ │ └─ Qdrant (RECOMMENDED) │ │ • Best metadata filtering │ │ • Built-in hybrid search (BM25 + Vector) │ │ • Self-host: Docker/K8s │ │ • Managed: Qdrant Cloud │ │ See: references/qdrant.md │ │ │ ├─ SCALE: >100M vectors + GPU acceleration │ │ └─ Milvus / Zilliz Cloud │ │ See: references/milvus.md │ │ │ ├─ Embedded / No server │ │ └─ LanceDB (serverless, edge deployment) │ │ │ └─ Local prototyping │ └─ Chroma (simple API, in-memory)
2. Embedding Model Selection
REQUIREMENTS? ├─ Best quality (cost no object) │ └─ Voyage AI voyage-3 (1024d) │ • 9.74% better than OpenAI on MTEB │ • ~$0.12/1M tokens │ See: references/embedding-strategies.md │ ├─ Enterprise reliability │ └─ OpenAI text-embedding-3-large (3072d) │ • Industry standard │ • ~$0.13/1M tokens │ • Maturity shortening: reduce to 256/512/1024d │ ├─ Cost-optimized │ └─ OpenAI text-embedding-3-small (1536d) │ • ~$0.02/1M tokens (6x cheaper) │ • 90-95% of large model performance │ ├─ Multilingual (100+ languages) │ └─ Cohere embed-v3 (1024d) │ • ~$0.10/1M tokens │ └─ Self-hosted / Privacy-critical ├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0) ├─ Multilingual: BAAI/bge-m3 (1024d, MIT) └─ Long docs: jina-embeddings-v2 (768d, 8K context)
Core Concepts
Document Chunking Strategy
Recommended defaults for most RAG systems:
- •Chunk size: 512 tokens (not characters)
- •Overlap: 50 tokens (10% overlap)
Why these numbers?
- •512 tokens balances context vs. precision
- •Too small (128-256): Fragments concepts, loses context
- •Too large (1024-2048): Dilutes relevance, wastes LLM tokens
- •50 token overlap ensures sentences aren't split mid-context
See references/chunking-patterns.md for advanced strategies by content type.
Hybrid Search (Vector + Keyword)
Hybrid Search = Vector Similarity + BM25 Keyword Matching
User Query: "OAuth refresh token implementation"
│
┌──────┴──────┐
│ │
Vector Search Keyword Search
(Semantic) (BM25)
│ │
Top 20 docs Top 20 docs
│ │
└──────┬──────┘
│
Reciprocal Rank Fusion
(Merge + Re-rank)
│
Final Top 5 Results
Why hybrid matters:
- •Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
- •Keyword ensures exact matches ("refresh_token" literal)
- •Combined provides best retrieval quality
See references/hybrid-search.md for implementation details.
Getting Started
Python + Qdrant Example
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
# 1. Initialize client
client = QdrantClient("localhost", port=6333)
# 2. Create collection
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)
# 3. Insert documents with embeddings
points = [
PointStruct(
id=idx,
vector=embedding, # From OpenAI/Voyage/etc
payload={
"text": chunk_text,
"source": "docs/api.md",
"section": "Authentication"
}
)
for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)
# 4. Search with metadata filtering
results = client.search(
collection_name="documents",
query_vector=query_embedding,
limit=5,
query_filter={
"must": [
{"key": "section", "match": {"value": "Authentication"}}
]
}
)
For complete examples, see examples/qdrant-python/.
TypeScript + Qdrant Example
import { QdrantClient } from '@qdrant/js-client-rest';
const client = new QdrantClient({ url: 'http://localhost:6333' });
// Create collection
await client.createCollection('documents', {
vectors: { size: 1024, distance: 'Cosine' }
});
// Insert documents
await client.upsert('documents', {
points: chunks.map((chunk, idx) => ({
id: idx,
vector: chunk.embedding,
payload: {
text: chunk.text,
source: chunk.source
}
}))
});
// Search
const results = await client.search('documents', {
vector: queryEmbedding,
limit: 5,
filter: {
must: [
{ key: 'source', match: { value: 'docs/api.md' } }
]
}
});
For complete examples, see examples/typescript-rag/.
RAG Pipeline Architecture
Complete Pipeline Components
1. INGESTION ├─ Document Loading (PDF, web, code, Office) ├─ Text Extraction & Cleaning ├─ Chunking (semantic, recursive, code-aware) └─ Embedding Generation (batch, rate-limited) 2. INDEXING ├─ Vector Store Insertion (batch upsert) ├─ Index Configuration (HNSW, distance metric) └─ Keyword Index (BM25 for hybrid search) 3. RETRIEVAL (Query Time) ├─ Query Processing (expansion, embedding) ├─ Hybrid Search (vector + keyword) ├─ Filtering & Post-Processing (metadata, MMR) └─ Re-Ranking (cross-encoder, LLM-based) 4. GENERATION ├─ Context Construction (format chunks, citations) ├─ Prompt Engineering (system + context + query) ├─ LLM Inference (streaming, temperature tuning) └─ Response Post-Processing (citations, validation) 5. EVALUATION (Production Critical) ├─ Retrieval Metrics (precision, recall, relevancy) ├─ Generation Metrics (faithfulness, correctness) └─ System Metrics (latency, cost, satisfaction)
Essential Metadata for Production RAG
Critical for filtering and relevance:
metadata = {
# SOURCE TRACKING
"source": "docs/api-reference.md",
"source_type": "documentation", # code, docs, logs, chat
"last_updated": "2025-12-01T12:00:00Z",
# HIERARCHICAL CONTEXT
"section": "Authentication",
"subsection": "OAuth 2.1",
"heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],
# CONTENT CLASSIFICATION
"content_type": "code_example", # prose, code, table, list
"programming_language": "python",
# FILTERING DIMENSIONS
"product_version": "v2.0",
"audience": "enterprise", # free, pro, enterprise
# RETRIEVAL HINTS
"chunk_index": 3,
"total_chunks": 12,
"has_code": True
}
Why metadata matters:
- •Enables filtering BEFORE vector search (reduces search space)
- •Improves relevance through targeted retrieval
- •Supports multi-tenant systems (filter by user/org)
- •Enables versioned documentation (filter by product version)
Evaluation with RAGAS
Use scripts/evaluate_rag.py for automated evaluation:
from ragas import evaluate
from ragas.metrics import (
faithfulness, # Answer grounded in context
answer_relevancy, # Answer addresses query
context_recall, # Retrieved docs cover ground truth
context_precision # Retrieved docs are relevant
)
# Test dataset
test_data = {
"question": ["How do I refresh OAuth tokens?"],
"answer": ["Use /token with refresh_token grant..."],
"contexts": [["OAuth refresh documentation..."]],
"ground_truth": ["POST to /token with grant_type=refresh_token"]
}
# Evaluate
results = evaluate(test_data, metrics=[
faithfulness,
answer_relevancy,
context_recall,
context_precision
])
# Production targets:
# faithfulness: >0.90 (minimal hallucination)
# answer_relevancy: >0.85 (addresses user query)
# context_recall: >0.80 (sufficient context retrieved)
# context_precision: >0.75 (minimal noise)
Performance Optimization
Embedding Generation
- •Batch processing: 100-500 chunks per batch
- •Caching: Cache embeddings by content hash
- •Rate limiting: Respect API provider limits (exponential backoff)
Vector Search
- •Index type: HNSW (Hierarchical Navigable Small World) for most cases
- •Distance metric: Cosine for normalized embeddings
- •Pre-filtering: Apply metadata filters before vector search
- •Result diversity: Use MMR (Maximal Marginal Relevance) to reduce redundancy
Cost Optimization
- •Embedding model: Consider text-embedding-3-small for budget constraints
- •Dimension reduction: Use maturity shortening (3072d → 1024d)
- •Caching: Implement semantic caching for repeated queries
- •Batch operations: Group insertions/updates for efficiency
Common Workflows
1. Building a RAG Chatbot
- •Vector database: Qdrant (self-hosted or cloud)
- •Embeddings: OpenAI text-embedding-3-large
- •Chunking: 512 tokens, 50 overlap, semantic splitter
- •Search: Hybrid (vector + BM25)
- •Integration: Frontend with ai-chat skill
See examples/qdrant-python/ for complete implementation.
2. Semantic Search Engine
- •Vector database: Qdrant or Pinecone
- •Embeddings: Voyage AI voyage-3 (best quality)
- •Chunking: Content-type specific (see chunking-patterns.md)
- •Search: Hybrid with re-ranking
- •Filtering: Pre-filter by metadata (date, category, etc.)
3. Code Search
- •Vector database: Qdrant
- •Embeddings: OpenAI text-embedding-3-large
- •Chunking: AST-based (function/class boundaries)
- •Metadata: Language, file path, imports
- •Search: Hybrid with language filtering
See examples/qdrant-python/ for code-specific implementation.
Integration with Other Skills
Frontend Skills
- •ai-chat: Vector DB powers RAG pipeline behind chat interface
- •search-filter: Replace keyword search with semantic search
- •data-viz: Visualize embedding spaces, similarity scores
Backend Skills
- •databases-relational: Hybrid approach using pgvector extension
- •api-patterns: Expose semantic search via REST/GraphQL
- •observability: Monitor embedding quality and retrieval metrics
Multi-Language Support
Python (Primary)
- •Client:
qdrant-client - •Framework: LangChain, LlamaIndex
- •See:
examples/qdrant-python/
Rust
- •Client:
qdrant-client(1,549 code snippets in Context7) - •Framework: Raw Rust for performance-critical systems
- •See:
examples/rust-axum-vector/
TypeScript
- •Client:
@qdrant/js-client-rest - •Framework: LangChain.js, integration with Next.js
- •See:
examples/typescript-rag/
Go
- •Client:
qdrant-go - •Use case: High-performance microservices
Troubleshooting
Poor Retrieval Quality
- •Check chunking strategy (too large/small?)
- •Verify metadata filtering (too restrictive?)
- •Try hybrid search instead of vector-only
- •Implement re-ranking stage
- •Evaluate with RAGAS metrics
Slow Performance
- •Use HNSW index (not Flat)
- •Pre-filter with metadata before vector search
- •Reduce vector dimensions (maturity shortening)
- •Batch operations (insertions, searches)
- •Consider GPU acceleration (Milvus)
High Costs
- •Switch to text-embedding-3-small
- •Implement semantic caching
- •Reduce chunk overlap
- •Use self-hosted embeddings (nomic, bge-m3)
- •Batch embedding generation
Qdrant Context7 Documentation
Primary resource: /llmstxt/qdrant_tech_llms-full_txt
- •Trust score: High
- •Code snippets: 10,154
- •Quality score: 83.1
Access via Context7:
resolve-library-id({ libraryName: "Qdrant" })
get-library-docs({
context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
topic: "hybrid search collections python",
mode: "code"
})
Additional Resources
Reference Documentation
- •
references/qdrant.md- Comprehensive Qdrant guide - •
references/pgvector.md- PostgreSQL pgvector extension - •
references/milvus.md- Milvus/Zilliz for billion-scale - •
references/embedding-strategies.md- Embedding model comparison - •
references/chunking-patterns.md- Advanced chunking techniques
Code Examples
- •
examples/qdrant-python/- FastAPI + Qdrant RAG pipeline - •
examples/pgvector-prisma/- PostgreSQL + Prisma integration - •
examples/typescript-rag/- TypeScript RAG with Hono
Automation Scripts
- •
scripts/generate_embeddings.py- Batch embedding generation - •
scripts/benchmark_similarity.py- Performance benchmarking - •
scripts/evaluate_rag.py- RAGAS-based evaluation
Next Steps:
- •Choose vector database based on scale and infrastructure
- •Select embedding model based on quality vs. cost trade-off
- •Implement chunking strategy for the content type
- •Set up hybrid search for production quality
- •Evaluate with RAGAS metrics
- •Optimize for performance and cost