
Vector Database Patterns

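A comprehensive introduction to vector databases, covering popular products such as Pinecone, Qdrant, and Weaviate, along with best practices for embedding strategies and similarity search.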

SKILL.md
--- frontmatter
name: Vector Database Patterns
description: Comprehensive guide to vector databases including Pinecone, Qdrant, Weaviate, embedding strategies, and similarity search.

Vector Database Patterns

Overview

Vector databases are specialized databases designed to store, index, and query high-dimensional vectors efficiently. They enable similarity search by finding vectors that are "closest" to a query vector using various distance metrics. This skill covers Pinecone, Qdrant, Weaviate, embedding strategies, similarity search, performance optimization, and production considerations.

Prerequisites

  • Understanding of vectors and embeddings
  • Knowledge of machine learning concepts
  • Familiarity with Python or TypeScript
  • Understanding of similarity metrics (cosine, Euclidean, dot product)
  • Basic knowledge of database concepts

Key Concepts

Vector Database Fundamentals

  • Vectors: Numerical representations of data (text, images, audio) in high-dimensional space
  • Embeddings: Vectors generated by machine learning models that capture semantic meaning
  • Distance Metrics: Measures of similarity between vectors (cosine, Euclidean, dot product)
  • Indexing: Data structures that enable fast similarity search
  • Metadata: Additional information associated with vectors for filtering
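
To make these terms concrete, here is a minimal sketch of what a vector database does conceptually: score every stored vector against a query with a distance metric, optionally narrow the candidates by metadata, and return the top matches. It is an exhaustive scan with no index, so it only illustrates the idea; the `search` helper and the sample records are hypothetical and do not belong to any of the products covered here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(records, query, top_k=2, where=None):
    """Exhaustive nearest-neighbor search with an optional exact-match metadata filter."""
    where = where or {}
    # Metadata filtering: keep only records whose metadata matches every key in `where`
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    # Similarity scoring: higher cosine similarity means a closer match
    scored = sorted(
        ((cosine(r["values"], query), r["id"]) for r in candidates),
        reverse=True,
    )
    return [(rid, round(score, 3)) for score, rid in scored[:top_k]]

records = [
    {"id": "doc1", "values": [1.0, 0.0, 0.0], "metadata": {"category": "tech"}},
    {"id": "doc2", "values": [0.9, 0.1, 0.0], "metadata": {"category": "science"}},
    {"id": "doc3", "values": [0.0, 1.0, 0.0], "metadata": {"category": "tech"}},
]

print(search(records, [1.0, 0.0, 0.0]))
print(search(records, [1.0, 0.0, 0.0], where={"category": "tech"}))
```

An index such as HNSW replaces the exhaustive scan with an approximate one, trading a little recall for much lower query latency on large collections.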

Vector Database Types

  • Pinecone: Managed service, easy setup, good for production
  • Qdrant: Open-source, self-hosted option, flexible
  • Weaviate: Open-source, GraphQL API, good for multimodal

Use Cases

  • Semantic search (finding similar documents, products, images)
  • Recommendation systems
  • Anomaly detection
  • Natural language processing tasks
  • Computer vision applications
  • Personalization engines
  • Knowledge retrieval for RAG (Retrieval-Augmented Generation)

Implementation Guide

Pinecone

Setup and Indexing

python
# Install the Pinecone client
# pip install pinecone  (older releases were published as pinecone-client)

from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,  # OpenAI embedding dimension
    metric="cosine",  # or "euclidean", "dotproduct"
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("my-index")

# Check index stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats['total_vector_count']}")
print(f"Dimension: {stats['dimension']}")
typescript
// Install Pinecone client
// npm install @pinecone-database/pinecone

import { Pinecone } from '@pinecone-database/pinecone';

// Initialize Pinecone
const pinecone = new Pinecone({
  apiKey: 'your-api-key'
});

// Create index
await pinecone.createIndex({
  name: 'my-index',
  dimension: 1536,
  metric: 'cosine',
  spec: {
    serverless: {
      cloud: 'aws',
      region: 'us-east-1'
    }
  }
});

// Connect to index
const index = pinecone.index('my-index');

// Check index stats
const stats = await index.describeIndexStats();
console.log('Total vectors:', stats.totalVectorCount);
console.log('Dimension:', stats.dimension);

Upserting Vectors

python
# Upsert single vector
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": [0.1, 0.2, 0.3, ...],  # 1536-dimensional vector
            "metadata": {
                "title": "Document 1",
                "category": "technology",
                "date": "2024-01-01"
            }
        }
    ]
)

# Upsert multiple vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": vector1,
            "metadata": {"title": "Document 1", "category": "tech"}
        },
        {
            "id": "doc2",
            "values": vector2,
            "metadata": {"title": "Document 2", "category": "science"}
        },
        {
            "id": "doc3",
            "values": vector3,
            "metadata": {"title": "Document 3", "category": "tech"}
        }
    ],
    namespace="documents"
)

# Upsert in batches
from tqdm import tqdm

def upsert_in_batches(vectors, batch_size=100):
    for i in tqdm(range(0, len(vectors), batch_size)):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)
typescript
// Upsert single vector
await index.upsert([
  {
    id: 'doc1',
    values: [0.1, 0.2, 0.3, ...], // 1536-dimensional vector
    metadata: {
      title: 'Document 1',
      category: 'technology',
      date: '2024-01-01'
    }
  }
]);

// Upsert multiple vectors
await index.upsert([
  {
    id: 'doc1',
    values: vector1,
    metadata: { title: 'Document 1', category: 'tech' }
  },
  {
    id: 'doc2',
    values: vector2,
    metadata: { title: 'Document 2', category: 'science' }
  },
  {
    id: 'doc3',
    values: vector3,
    metadata: { title: 'Document 3', category: 'tech' }
  }
]);

// Upsert with namespace (the namespace is selected on the index handle)
await index.namespace('documents').upsert([
  {
    id: 'doc1',
    values: vector1,
    metadata: { title: 'Document 1' }
  }
]);

Querying

python
# Basic similarity search
results = index.query(
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    include_values=False
)

for match in results['matches']:
    print(f"ID: {match['id']}, Score: {match['score']}")
    print(f"Metadata: {match['metadata']}")

# Query with namespace
results = index.query(
    vector=query_vector,
    top_k=10,
    namespace="documents",
    include_metadata=True
)

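# Query with filter
# Pinecone's range operators ($gt, $gte, $lt, $lte) compare numbers only, so this
# assumes each record also stores a numeric "date_ts" field (Unix epoch seconds).
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "technology"},
        "date_ts": {"$gte": 1704067200}  # 2024-01-01T00:00:00Z
    },
    include_metadata=True
)

# Query with complex filter
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "$or": [
            {"category": {"$eq": "technology"}},
            {"category": {"$eq": "science"}}
        ],
        "date_ts": {"$gte": 1704067200}  # 2024-01-01T00:00:00Z
    },
    include_metadata=True
)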
typescript
// Basic similarity search
const results = await index.query({
  vector: queryVector,
  topK: 10,
  includeMetadata: true,
  includeValues: false
});

results.matches.forEach(match => {
  console.log(`ID: ${match.id}, Score: ${match.score}`);
  console.log('Metadata:', match.metadata);
});

// Query with namespace (the namespace is selected on the index handle)
const namespaceResults = await index.namespace('documents').query({
  vector: queryVector,
  topK: 10,
  includeMetadata: true
});

// Query with filter
// Range operators compare numbers only, so a numeric date_ts field (epoch seconds) is assumed
const filteredResults = await index.query({
  vector: queryVector,
  topK: 10,
  filter: {
    category: { $eq: 'technology' },
    date_ts: { $gte: 1704067200 } // 2024-01-01T00:00:00Z
  },
  includeMetadata: true
});

// Query with complex filter
const complexResults = await index.query({
  vector: queryVector,
  topK: 10,
  filter: {
    $or: [
      { category: { $eq: 'technology' } },
      { category: { $eq: 'science' } }
    ],
    date_ts: { $gte: 1704067200 } // 2024-01-01T00:00:00Z
  },
  includeMetadata: true
});

Deleting Vectors

python
# Delete single vector
index.delete(ids=["doc1"])

# Delete multiple vectors
index.delete(ids=["doc1", "doc2", "doc3"])

# Delete all vectors in namespace
index.delete(delete_all=True, namespace="documents")

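# Delete by filter
# Range operators compare numbers, so a numeric "date_ts" field (epoch seconds) is assumed
index.delete(
    filter={
        "category": {"$eq": "old"},
        "date_ts": {"$lt": 1672531200}  # 2023-01-01T00:00:00Z
    },
    namespace="documents"
)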
typescript
// Delete single vector
await index.deleteOne('doc1');

// Delete multiple vectors
await index.deleteMany(['doc1', 'doc2', 'doc3']);

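// Delete all vectors in namespace
await index.namespace('documents').deleteAll();

// Delete by filter (deleteMany accepts a metadata filter object in place of an ID list)
// Range operators compare numbers, so a numeric date_ts field (epoch seconds) is assumed
await index.namespace('documents').deleteMany({
  category: { $eq: 'old' },
  date_ts: { $lt: 1672531200 } // 2023-01-01T00:00:00Z
});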

Qdrant

Collections and Points

python
# Install Qdrant client
# pip install qdrant-client

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Initialize Qdrant client
client = QdrantClient(url="http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE  # or Distance.EUCLID, Distance.DOT
    )
)

# Create collection with multiple vectors
client.create_collection(
    collection_name="multimodal",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),
        "image": VectorParams(size=512, distance=Distance.EUCLID)
    }
)

# List collections
collections = client.get_collections()
for collection in collections.collections:
    print(f"Collection: {collection.name}")

# Get collection info
info = client.get_collection("documents")
print(f"Vectors count: {info.vectors_count}")
print(f"Points count: {info.points_count}")
typescript
// Install Qdrant client
// npm install @qdrant/js-client-rest

import { QdrantClient } from '@qdrant/js-client-rest';

// Initialize Qdrant client
const client = new QdrantClient({
  url: 'http://localhost:6333'
});

// Create collection
await client.createCollection('documents', {
  vectors: {
    size: 1536,
    distance: 'Cosine' // or 'Euclid', 'Dot'
  }
});

// Create collection with multiple vectors
await client.createCollection('multimodal', {
  vectors: {
    text: { size: 1536, distance: 'Cosine' },
    image: { size: 512, distance: 'Euclid' }
  }
});

// List collections
const collections = await client.getCollections();
collections.collections.forEach(collection => {
  console.log('Collection:', collection.name);
});

// Get collection info (the REST client returns the API's snake_case field names)
const info = await client.getCollection('documents');
console.log('Vectors count:', info.vectors_count);
console.log('Points count:', info.points_count);

Inserting Points

python
# Insert single point
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, 0.3, ...],
            payload={
                "title": "Document 1",
                "category": "technology",
                "date": "2024-01-01"
            }
        )
    ]
)

# Insert multiple points
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=vector1, payload={"title": "Doc 1", "category": "tech"}),
        PointStruct(id=2, vector=vector2, payload={"title": "Doc 2", "category": "science"}),
        PointStruct(id=3, vector=vector3, payload={"title": "Doc 3", "category": "tech"}),
    ]
)

# Insert in batches
from qdrant_client.models import Batch

def insert_in_batches(points, batch_size=100):
    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        client.upsert(
            collection_name="documents",
            points=Batch(
                ids=[p.id for p in batch],
                vectors=[p.vector for p in batch],
                payloads=[p.payload for p in batch]
            )
        )
typescript
// Insert single point
await client.upsert('documents', {
  points: [{
    id: 1,
    vector: [0.1, 0.2, 0.3, ...],
    payload: {
      title: 'Document 1',
      category: 'technology',
      date: '2024-01-01'
    }
  }]
});

// Insert multiple points
await client.upsert('documents', {
  points: [
    { id: 1, vector: vector1, payload: { title: 'Doc 1', category: 'tech' } },
    { id: 2, vector: vector2, payload: { title: 'Doc 2', category: 'science' } },
    { id: 3, vector: vector3, payload: { title: 'Doc 3', category: 'tech' } }
  ]
});

// Insert named vectors
await client.upsert('multimodal', {
  points: [{
    id: 1,
    vector: {
      text: textVector,
      image: imageVector
    },
    payload: {
      title: 'Document 1',
      type: 'multimodal'
    }
  }]
});

Querying

python
# Basic search
results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=10,
    with_payload=True
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score}")
    print(f"Payload: {result.payload}")

# Search with filter
from qdrant_client.models import (
    DatetimeRange, FieldCondition, Filter, MatchValue, NamedVector, SearchRequest
)

results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technology")
            ),
            FieldCondition(
                key="date",
                # Range is numeric only; DatetimeRange compares RFC 3339 timestamps
                range=DatetimeRange(
                    gte="2024-01-01T00:00:00Z"
                )
            )
        ]
    ),
    limit=10,
    with_payload=True
)

# Search with named vector
results = client.search(
    collection_name="multimodal",
    query_vector=NamedVector(
        name="text",
        vector=query_vector
    ),
    limit=10
)

# Batch search: several searches in one request, here one per named vector.
# This is not keyword (hybrid) search; both requests are pure vector searches.
results = client.search_batch(
    collection_name="multimodal",
    requests=[
        SearchRequest(
            vector=NamedVector(name="text", vector=query_vector),
            limit=10,
            with_payload=True
        ),
        SearchRequest(
            vector=NamedVector(name="image", vector=image_query_vector),
            limit=10,
            with_payload=True
        )
    ]
)
typescript
// Basic search (request options use the REST API's snake_case names)
const results = await client.search('documents', {
  vector: queryVector,
  limit: 10,
  with_payload: true
});

results.forEach(result => {
  console.log(`ID: ${result.id}, Score: ${result.score}`);
  console.log('Payload:', result.payload);
});

// Search with filter
const filteredResults = await client.search('documents', {
  vector: queryVector,
  filter: {
    must: [
      {
        key: 'category',
        match: { value: 'technology' }
      },
      {
        key: 'date',
        range: { gte: '2024-01-01T00:00:00Z' }
      }
    ]
  },
  limit: 10,
  with_payload: true
});

// Search with named vector
const results = await client.search('multimodal', {
  vector: {
    name: 'text',
    vector: queryVector
  },
  limit: 10
});

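// Batch search: several searches in one request, here one per named vector.
// This is not keyword (hybrid) search; both requests are pure vector searches.
const batchResults = await client.searchBatch('multimodal', {
  searches: [
    {
      vector: {
        name: 'text',
        vector: queryVector
      },
      limit: 10
    },
    {
      vector: {
        name: 'image',
        vector: imageQueryVector
      },
      limit: 10
    }
  ]
});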

Filtering

python
# Exact match filter
filter = Filter(
    must=[
        FieldCondition(
            key="category",
            match=MatchValue(value="technology")
        )
    ]
)

# Range filter
filter = Filter(
    must=[
        FieldCondition(
            key="price",
            range=Range(
                gte=100,
                lte=1000
            )
        )
    ]
)

# OR filter (a point matches when at least one "should" condition holds)
filter = Filter(
    should=[
        FieldCondition(
            key="category",
            match=MatchValue(value="technology")
        ),
        FieldCondition(
            key="category",
            match=MatchValue(value="science")
        )
    ]
)

# Nested filter
filter = Filter(
    must=[
        FieldCondition(
            key="metadata.category",
            match=MatchValue(value="technology")
        )
    ]
)

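# IS NULL condition, negated here to keep points whose deleted_at is not null
from qdrant_client.models import IsNullCondition, PayloadField

filter = Filter(
    must_not=[
        IsNullCondition(
            is_null=PayloadField(key="deleted_at")
        )
    ]
)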
typescript
// Exact match filter
const filter = {
  must: [
    {
      key: 'category',
      match: { value: 'technology' }
    }
  ]
};

// Range filter
const filter = {
  must: [
    {
      key: 'price',
      range: { gte: 100, lte: 1000 }
    }
  ]
};

// OR filter
const filter = {
  should: [
    {
      key: 'category',
      match: { value: 'technology' }
    },
    {
      key: 'category',
      match: { value: 'science' }
    }
  ]
};

// Nested filter
const filter = {
  must: [
    {
      key: 'metadata.category',
      match: { value: 'technology' }
    }
  ]
};
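
The way the three clauses combine can be easier to see in plain code. The sketch below is a simplified model of the filter semantics used above, assuming `must` means all conditions hold, `should` means at least one holds, and `must_not` means none hold. The `matches` function is hypothetical and is not part of the Qdrant client; it supports only exact-match and numeric range conditions.

```python
def matches(payload, flt):
    """Evaluate a simplified must / should / must_not filter against a payload dict."""
    def check(cond):
        value = payload.get(cond["key"])
        if "match" in cond:
            return value == cond["match"]["value"]
        if "range" in cond:
            if value is None:
                return False
            rng = cond["range"]
            return rng.get("gte", float("-inf")) <= value <= rng.get("lte", float("inf"))
        raise ValueError(f"unsupported condition: {cond}")

    must = flt.get("must", [])
    should = flt.get("should", [])
    must_not = flt.get("must_not", [])
    return (
        all(check(c) for c in must)                          # AND
        and (not should or any(check(c) for c in should))    # OR, ignored when empty
        and not any(check(c) for c in must_not)              # NOT
    )

payload = {"category": "science", "price": 250}

category_or = {
    "should": [
        {"key": "category", "match": {"value": "technology"}},
        {"key": "category", "match": {"value": "science"}},
    ],
    "must": [{"key": "price", "range": {"gte": 100, "lte": 1000}}],
}
print(matches(payload, category_or))

exclude_science = {"must_not": [{"key": "category", "match": {"value": "science"}}]}
print(matches(payload, exclude_science))
```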

Weaviate

Schema Setup

python
# Install the Weaviate client (these examples use the v3 client API, weaviate.Client)
# pip install "weaviate-client<4"

import weaviate
from weaviate import Client

# Initialize Weaviate client
client = Client("http://localhost:8080")

# Define schema
schema = {
    "classes": [
        {
            "class": "Document",
            "description": "A document",
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["string"],
                    "description": "The title of document"
                },
                {
                    "name": "content",
                    "dataType": ["text"],
                    "description": "The content of document"
                },
                {
                    "name": "category",
                    "dataType": ["string"],
                    "description": "The category of document"
                },
                {
                    "name": "date",
                    "dataType": ["date"],
                    "description": "The date of document"
                },
                {
                    "name": "metadata",
                    "dataType": ["object"],
                    "description": "Additional metadata"
                }
            ]
        }
    ]
}

# Create schema
client.schema.create(schema)

# Get schema
schema = client.schema.get()
print(schema)
typescript
// Install Weaviate client
// npm install weaviate-ts-client

import weaviate, { WeaviateClient } from 'weaviate-ts-client';

// Initialize Weaviate client
const client: WeaviateClient = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

// Define schema
const schema = {
  classes: [
    {
      class: 'Document',
      description: 'A document',
      vectorizer: 'text2vec-openai',
      properties: [
        {
          name: 'title',
          dataType: ['string'],
          description: 'The title of document'
        },
        {
          name: 'content',
          dataType: ['text'],
          description: 'The content of document'
        },
        {
          name: 'category',
          dataType: ['string'],
          description: 'The category of document'
        },
        {
          name: 'date',
          dataType: ['date'],
          description: 'The date of document'
        },
        {
          name: 'metadata',
          dataType: ['object'],
          description: 'Additional metadata'
        }
      ]
    }
  ]
};

// Create schema
await client.schema
  .creator()
  .withClass(schema.classes[0])
  .do();

// Get schema
const retrievedSchema = await client.schema.getter().do();
console.log(retrievedSchema);

Inserting Data

python
# Insert single object
client.data_object.create(
    class_name="Document",
    data_object={
        "title": "Document 1",
        "content": "This is content of document 1",
        "category": "technology",
        "date": "2024-01-01T00:00:00Z",
        "metadata": {
            "author": "John Doe",
            "tags": ["tech", "ai"]
        }
    }
)

# Insert multiple objects
objects = [
    {
        "title": "Document 1",
        "content": "Content 1",
        "category": "technology"
    },
    {
        "title": "Document 2",
        "content": "Content 2",
        "category": "science"
    }
]

for obj in objects:
    client.data_object.create(
        class_name="Document",
        data_object=obj
    )

# Insert with custom vector
client.data_object.create(
    class_name="Document",
    data_object={
        "title": "Document 1",
        "content": "Content 1"
    },
    vector=[0.1, 0.2, 0.3, ...]
)

# Batch insert (the v3 client exposes batching as a context manager on client.batch)
client.batch.configure(batch_size=100)

with client.batch as batch:
    for obj in objects:
        batch.add_data_object(
            data_object=obj,
            class_name="Document"
        )
typescript
// Insert single object
await client.data
  .creator()
  .withClassName('Document')
  .withProperties({
    title: 'Document 1',
    content: 'This is content of document 1',
    category: 'technology',
    date: '2024-01-01T00:00:00Z',
    metadata: {
      author: 'John Doe',
      tags: ['tech', 'ai']
    }
  })
  .do();

// Insert multiple objects
const objects = [
  {
    title: 'Document 1',
    content: 'Content 1',
    category: 'technology'
  },
  {
    title: 'Document 2',
    content: 'Content 2',
    category: 'science'
  }
];

for (const obj of objects) {
  await client.data
    .creator()
    .withClassName('Document')
    .withProperties(obj)
    .do();
}

// Insert with custom vector
await client.data
  .creator()
  .withClassName('Document')
  .withProperties({
    title: 'Document 1',
    content: 'Content 1'
  })
  .withVector([0.1, 0.2, 0.3, ...])
  .do();

Querying

python
# Semantic search
results = client.query.get(
    class_name="Document",
    properties=["title", "content", "category"]
).with_near_text({
    "concepts": ["artificial intelligence"],
    "distance": 0.7
}).with_additional(["distance"]).with_limit(10).do()

for result in results["data"]["Get"]["Document"]:
    print(f"Title: {result['title']}")
    print(f"Distance: {result['_additional']['distance']}")

# Hybrid search (BM25 + vector)
results = client.query.get(
    class_name="Document",
    properties=["title", "content"]
).with_hybrid(
    query="artificial intelligence",
    alpha=0.7,  # 0 = pure BM25, 1 = pure vector
    vector=query_vector
).with_limit(10).do()

# Filter search
results = client.query.get(
    class_name="Document",
    properties=["title", "content", "category"]
).with_where({
    "path": ["category"],
    "operator": "Equal",
    "valueString": "technology"
}).with_near_text({
    "concepts": ["AI"]
}).with_limit(10).do()

# Filter with range
results = client.query.get(
    class_name="Document",
    properties=["title", "date"]
).with_where({
    "operator": "And",
    "operands": [
        {
            "path": ["category"],
            "operator": "Equal",
            "valueString": "technology"
        },
        {
            "path": ["date"],
            "operator": "GreaterThan",
            "valueDate": "2024-01-01T00:00:00Z"
        }
    ]
}).with_near_text({
    "concepts": ["AI"]
}).do()
typescript
// Semantic search
const results = await client.graphql
  .get()
  .withClassName('Document')
  .withFields('title content category _additional { distance }')
  .withNearText({
    concepts: ['artificial intelligence'],
    distance: 0.7
  })
  .withLimit(10)
  .do();

console.log(results.data.Get.Document);

// Hybrid search (BM25 + vector)
const results = await client.graphql
  .get()
  .withClassName('Document')
  .withFields('title content _additional { score }')
  .withHybrid({
    query: 'artificial intelligence',
    alpha: 0.7, // 0 = pure BM25, 1 = pure vector
    vector: queryVector
  })
  .withLimit(10)
  .do();

// Filter search
const results = await client.graphql
  .get()
  .withClassName('Document')
  .withFields('title content category')
  .withWhere({
    path: ['category'],
    operator: 'Equal',
    valueText: 'technology'
  })
  .withNearText({
    concepts: ['AI']
  })
  .withLimit(10)
  .do();

// Filter with range
const results = await client.graphql
  .get()
  .withClassName('Document')
  .withFields('title date')
  .withWhere({
    operator: 'And',
    operands: [
      {
        path: ['category'],
        operator: 'Equal',
        valueText: 'technology'
      },
      {
        path: ['date'],
        operator: 'GreaterThan',
        valueDate: '2024-01-01T00:00:00Z'
      }
    ]
  })
  .withNearText({
    concepts: ['AI']
  })
  .do();
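
The `alpha` parameter in the hybrid queries above weights vector relevance against BM25 keyword relevance. One way to picture that weighting is a blend of normalized scores. The sketch below is illustrative only: it is not claimed to be Weaviate's internal fusion algorithm, and the function names and sample scores are made up for the example.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} map into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(keyword_scores, vector_scores, alpha=0.7):
    """Rank documents by alpha * vector score + (1 - alpha) * keyword score."""
    kw, vec = normalize(keyword_scores), normalize(vector_scores)
    docs = set(kw) | set(vec)
    # A document missing from one result list contributes 0 for that component
    combined = {
        d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in docs
    }
    # Sort by descending combined score, breaking ties by document ID
    return sorted(combined, key=lambda d: (-combined[d], d))

# Hypothetical raw scores from a keyword engine and a vector search
keyword_scores = {"doc1": 12.0, "doc2": 3.0, "doc3": 7.5}
vector_scores = {"doc1": 0.62, "doc2": 0.91, "doc4": 0.80}

print(hybrid_rank(keyword_scores, vector_scores, alpha=0.7))
print(hybrid_rank(keyword_scores, vector_scores, alpha=0.0))
```

With `alpha=0.0` the ranking follows the keyword scores alone, and with `alpha=1.0` it follows the vector scores alone, which matches the meaning given in the code comments above.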

Embedding Strategies

Text Embeddings

python
# Using OpenAI embeddings
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Batch embeddings
def get_embeddings(texts: list) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

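# Chunking for long texts (character-based windows that overlap by `overlap` characters)
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    # Guard against a zero or negative step, which would make range() fail
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]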
typescript
// Using OpenAI embeddings
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: 'your-api-key'
});

async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return response.data[0].embedding;
}

// Batch embeddings
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts
  });
  return response.data.map(item => item.embedding);
}

// Chunking for long texts
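function chunkText(text: string, chunkSize: number = 1000, overlap: number = 200): string[] {
  // Guard against a zero or negative step, which would loop forever
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be non-negative and smaller than chunkSize');
  }
  const chunks: string[] = [];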
  for (let i = 0; i < text.length; i += chunkSize - overlap) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}
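
To see how the overlap plays out, it can help to look at the character offsets each chunk covers. The sketch below uses a hypothetical helper, `chunk_spans`, that mirrors the sliding-window logic above but returns `(start, end)` offsets instead of text; the small sizes are chosen only to keep the output readable.

```python
def chunk_spans(length, chunk_size, overlap):
    """Return (start, end) offsets for fixed-size windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    return [
        (start, min(start + chunk_size, length))
        for start in range(0, length, step)
    ]

# A 25-character text, 10-character chunks, 3 characters shared between neighbors
for start, end in chunk_spans(25, chunk_size=10, overlap=3):
    print(start, end)
```

Each window starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share exactly `overlap` characters, and the final chunk may be shorter than the rest.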

Image Embeddings

python
# Using CLIP for image embeddings
# Install: pip install torch pillow git+https://github.com/openai/CLIP.git
from PIL import Image
import clip
import torch

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_image_embedding(image_path: str) -> list:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
    return image_features.cpu().numpy().tolist()[0]

# Batch image embeddings
def get_image_embeddings(image_paths: list) -> list:
    images = torch.stack([preprocess(Image.open(path)) for path in image_paths]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(images)
    return image_features.cpu().numpy().tolist()

Multimodal Embeddings

python
# Using OpenAI CLIP for text-image similarity
# (reuses clip, model, preprocess, and device from the image embedding example)
def get_text_embedding(text: str) -> list:
    text_tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
    return text_features.cpu().numpy().tolist()[0]

def get_image_embedding(image_path: str) -> list:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
    return image_features.cpu().numpy().tolist()[0]

# Compute similarity
import numpy as np

def cosine_similarity(vec1: list, vec2: list) -> float:
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

Similarity Search

Cosine Similarity

python
import numpy as np

def cosine_similarity(vec1: list, vec2: list) -> float:
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Example
vector_a = [1, 2, 3]
vector_b = [2, 4, 6]
similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine similarity: {similarity}")

Euclidean Distance

python
import numpy as np

def euclidean_distance(vec1: list, vec2: list) -> float:
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return np.linalg.norm(v1 - v2)

# Example
vector_a = [1, 2, 3]
vector_b = [2, 4, 6]
distance = euclidean_distance(vector_a, vector_b)
print(f"Euclidean distance: {distance}")

Dot Product

python
import numpy as np

def dot_product(vec1: list, vec2: list) -> float:
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return np.dot(v1, v2)

# Example
vector_a = [1, 2, 3]
vector_b = [2, 4, 6]
product = dot_product(vector_a, vector_b)
print(f"Dot product: {product}")
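
The three metrics are closely related once vectors are normalized to unit length: cosine similarity equals the dot product, and squared Euclidean distance equals 2 - 2 * cosine similarity, so all three produce the same ranking. The sketch below checks this with plain Python; the helper names are chosen for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Parallel vectors: cosine ignores magnitude, the other two metrics do not
a, b = [1, 2, 3], [2, 4, 6]
print(cosine(a, b), euclidean(a, b), dot(a, b))

# Unit vectors: the three metrics carry the same information
u, v = normalize([3, 4]), normalize([4, 3])
print(cosine(u, v), dot(u, v))
print(euclidean(u, v) ** 2, 2 - 2 * cosine(u, v))
```

This is why the choice of metric matters mostly for unnormalized embeddings; for unit-length embeddings the metrics differ in scale but not in which neighbors come first.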

Performance Optimization

Batch Operations

python
# Pinecone batch upsert
def upsert_in_batches(vectors, batch_size=100):
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

# Qdrant batch insert
from qdrant_client.models import Batch

def insert_in_batches(points, batch_size=100):
    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        client.upsert(
            collection_name="documents",
            points=Batch(
                ids=[p.id for p in batch],
                vectors=[p.vector for p in batch],
                payloads=[p.payload for p in batch]
            )
        )
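
The slicing helpers above need the full list in memory because they rely on `len()`. When vectors are produced lazily, for example while embedding documents one at a time, a generator-based batcher can feed the same upsert calls. This is a minimal sketch; `batched` is a hypothetical helper name, not part of either client.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield lists of up to `batch_size` items from any iterable, including generators."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list means the input is exhausted
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Example with a generator standing in for a stream of vectors
stream = (f"vec-{i}" for i in range(7))
for batch in batched(stream, 3):
    print(batch)
```

Each yielded batch could then be passed to `index.upsert(vectors=batch)` or `client.upsert(...)` as in the helpers above.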

Indexing Strategies

python
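# Pinecone: pod types apply to pod-based indexes (serverless indexes need no pod sizing)
# s1 pods: storage-optimized, more capacity per pod at higher query latency
# p1 pods: performance-optimized, lower latency with less capacity per pod
# p2 pods: highest throughput and lowest latency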

# Qdrant: Configure HNSW parameters
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        hnsw_config={
            "m": 16,  # Max links per graph node: higher improves recall, uses more memory
            "ef_construct": 100  # Candidates examined while building: higher improves recall, slows indexing
        }
    )
)

Caching

python
# Cache embeddings
import hashlib
from functools import lru_cache

def get_embedding_cache_key(text: str) -> str:
    # Stable content hash, usable as a key in an external cache such as Redis
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@lru_cache(maxsize=1000)
def get_cached_embedding(text: str) -> list:
    # lru_cache keys on the text argument, so a repeated text skips the API call.
    # The cached list is shared between callers; copy it before mutating.
    return get_embedding(text)
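
An in-process `lru_cache` is lost on restart and is not shared between workers, so a cache keyed by a content hash is a common alternative. The sketch below uses a plain dict as the store and a stand-in embedding function so it runs without an API key; `EmbeddingCache` and `fake_embed` are hypothetical names, and in practice the dict could be swapped for Redis or another shared store.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so identical texts are embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}   # stand-in for an external key-value store
        self.misses = 0    # number of times the embedding function was called

    @staticmethod
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        cache_key = self.key(text)
        if cache_key not in self._store:
            self.misses += 1
            self._store[cache_key] = self._embed_fn(text)
        return self._store[cache_key]

def fake_embed(text):
    # Stand-in for a real embedding call; returns a tiny deterministic vector
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
cache.get("hello world")
cache.get("hello world")   # served from the cache
cache.get("another text")
print(cache.misses)
```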

Production Considerations

Scaling

python
# Pinecone: Scale index
# Increase replica count for higher throughput
# Use larger pod types for more storage

# Qdrant: Sharding
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    shard_number=4  # Number of shards
)

# Weaviate: Multi-node setup
# Configure replication factor

Monitoring

python
# Pinecone: Monitor index stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats['total_vector_count']}")
print(f"Dimension: {stats['dimension']}")

# Qdrant: Monitor collection info
info = client.get_collection("documents")
print(f"Vectors count: {info.vectors_count}")
print(f"Points count: {info.points_count}")

# Weaviate: Monitor cluster (v3 client)
cluster_status = client.cluster.get_nodes_status()
print(cluster_status)

Backup and Recovery

python
# Pinecone: Export data
# Use Pinecone's export functionality

# Qdrant: Snapshot
client.create_snapshot(collection_name="documents")

# Weaviate: Backup
# Use Weaviate's backup tools

Cost Optimization

Choosing the Right Service

  • Pinecone: Managed service, easy setup, good for production
  • Qdrant: Open-source, self-hosted option, flexible
  • Weaviate: Open-source, GraphQL API, good for multimodal

Storage Optimization

python
# Use smaller embedding models
# text-embedding-3-small (1536 dims) vs text-embedding-3-large (3072 dims)

# Compress vectors
# Use quantization or dimensionality reduction

# Delete old data
# Implement retention policies
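
As one illustration of vector compression, scalar quantization stores each dimension as a single byte instead of a 4-byte float, cutting storage roughly fourfold at the cost of a small rounding error. The sketch below is a simplified, per-vector version written for clarity; it does not reproduce how any particular database implements quantization, and the function names are hypothetical.

```python
def quantize(vector):
    """Map each float to an integer code in 0..255 using the vector's own min and max."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximately reconstruct the original floats from the byte codes."""
    return [lo + c * scale for c in codes]

vector = [-0.5, 0.0, 0.3, 1.0]
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)

print(len(codes), "bytes instead of", 4 * len(vector))
print(max(abs(a - b) for a, b in zip(vector, restored)))
```

The reconstruction error per dimension is at most half of `scale`, so vectors with a narrow value range lose very little precision.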

Query Optimization

python
# Use filters to reduce search space
# Limit top_k results
# Use appropriate distance metrics

Best Practices

  1. Choose Appropriate Embedding Model

    • For text: OpenAI text-embedding-3-small or text-embedding-ada-002
    • For images: CLIP, DINO, or domain-specific models
    • For multimodal: CLIP or similar models
  2. Preprocess Data

    • Clean text by removing special characters
    • Normalize whitespace
    • Convert to lowercase for consistency
  3. Use Appropriate Chunking

    • Chunk long documents
    • Use semantic chunking
    • Maintain context between chunks
  4. Implement Caching

    • Cache embeddings to reduce API calls
    • Cache query results
    • Use Redis for caching
  5. Monitor Performance

    • Track query latency
    • Monitor storage usage
    • Set up alerts for anomalies
  6. Use Filters Effectively

    • Use metadata filters to reduce search space
    • Combine vector search with keyword search
    • Use hybrid search when appropriate
  7. Handle Errors Gracefully

    • Implement retry logic
    • Handle rate limits
    • Log errors for debugging
  8. Test Thoroughly

    • Test with real data
    • Evaluate search quality
    • Benchmark performance
  9. Security

    • Use authentication in production
    • Encrypt sensitive data
    • Follow principle of least privilege
  10. Scalability

    • Design for horizontal scaling
    • Use appropriate sharding strategies
    • Monitor resource usage
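
The retry and rate-limit advice in item 7 can be sketched as exponential backoff with jitter. This is a minimal, client-agnostic example: `with_retries` is a hypothetical helper, and in real code `retry_on` should be narrowed to the transient error types raised by the client in use rather than every exception.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from many clients hitting the same limit
            sleep(delay + random.uniform(0, delay / 2))

# Example: a call that fails twice before succeeding
attempts = {"count": 0}

def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(with_retries(flaky_query, retry_on=(ConnectionError,), sleep=lambda s: None))
```

A database call can be wrapped the same way, for example `with_retries(lambda: index.query(vector=query_vector, top_k=10))`.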

Related Skills