Semantic Search
Build vector-based semantic search systems that understand meaning, not just keywords.
When to Activate
Activate this skill when:
- User wants to search text by meaning rather than exact keywords
- User mentions "find similar", "semantic", "natural language search"
- User has a collection of documents/texts to make searchable
- User's search queries should understand synonyms and related concepts
Do NOT activate when:
- User needs exact keyword matching → use hybrid-search
- User has multiple data types (text + image) → use multimodal-retrieval
- User needs filtering by attributes → use filtered-search
Interactive Flow
Step 1: Understand the Use Case
"What type of content will users search?"
A) Short texts (titles, product names, questions)
- Typically < 200 characters
- Users expect quick, precise matches
B) Long documents (articles, papers, documentation)
- Need a chunking strategy
- May need to return specific passages
C) Conversational queries (customer support, FAQ)
- Queries are natural language questions
- Answers should be semantically relevant
Which describes your use case? (A/B/C)
Step 2: Clarify Scale and Latency
"What's your expected scale?"
| Scale | Documents | Latency Target |
|---|---|---|
| Small | < 100K | < 50ms |
| Medium | 100K - 10M | < 100ms |
| Large | > 10M | < 200ms |
"For your scale, I'll configure appropriate index parameters."
Step 3: Confirm Before Implementation
"Based on your requirements:
- Embedding model: BAAI/bge-large-en-v1.5 (1024 dim)
- Index type: AUTOINDEX
- Metric: COSINE similarity
Proceed? (yes / adjust [what])"
Core Concepts
Mental Model: Library Catalog
Think of semantic search like a smart librarian:
- Traditional search = looking for exact words in book titles
- Semantic search = understanding "I want books about cooking" includes recipes, cuisine, culinary arts
    ┌─────────────────────────────────────────────────────────┐
    │                     Semantic Search                     │
    │                                                         │
    │  Query: "affordable laptop"                             │
    │          │                                              │
    │          ▼                                              │
    │  ┌───────────────┐                                      │
    │  │   Embedding   │  Convert text to vector              │
    │  │     Model     │  (1024 dimensions)                   │
    │  └───────┬───────┘                                      │
    │          │                                              │
    │          ▼                                              │
    │  [0.12, -0.45, 0.78, ...]                               │
    │          │                                              │
    │          ▼                                              │
    │  ┌────────────────────────┐                             │
    │  │      Vector Index      │  Find nearest vectors       │
    │  │        (Milvus)        │  in high-dim space          │
    │  └────────────┬───────────┘                             │
    │               │                                         │
    │               ▼                                         │
    │  Results: "budget-friendly notebook", "cheap computer"  │
    │  (semantically similar, different keywords)             │
    └─────────────────────────────────────────────────────────┘
Why Vectors Work
| Concept | Explanation |
|---|---|
| Embedding | Text → High-dimensional vector that captures meaning |
| Similarity | Vectors close together = similar meaning |
| COSINE | Measures the cosine of the angle between vectors (range -1 to 1; higher = more similar) |
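To make the table concrete, here is a minimal sketch of cosine similarity and a brute-force nearest-neighbor ranking in plain Python. The three-dimensional vectors are invented for illustration; real embeddings come from the model and have hundreds of dimensions, and a vector index such as Milvus avoids comparing the query against every stored vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy "embeddings", made up for this example only.
toy_vectors = {
    "budget-friendly notebook": [0.9, 0.1, 0.0],
    "cheap computer": [0.8, 0.3, 0.1],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for "affordable laptop"

# Brute force: score every document, then sort by similarity (highest first).
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, toy_vectors[text]):.3f}  {text}")
```

The two laptop-like texts rank above the recipe because their vectors point in nearly the same direction as the query, even though they share no keywords with it.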
Why Semantic Search Over Alternatives
| Need | Solution | Why |
|---|---|---|
| "Find documents about X" | ✅ Semantic Search | Understands meaning |
| "Find documents containing 'X'" | ❌ Use keyword search | Exact match needed |
| "Find documents about X in category Y" | ⚠️ Consider filtered-search | Need attribute filtering |
| "Find documents matching both keywords AND meaning" | ⚠️ Consider hybrid-search | Both precision and recall |
Limitations to Know
- No keyword precision: "iPhone 15" query might return "Samsung Galaxy" (semantically similar)
- Language dependency: Models trained on specific languages work best for those
- Domain shift: General models may miss domain-specific terminology
Implementation
    from pymilvus import MilvusClient, DataType
    from sentence_transformers import SentenceTransformer


    class SemanticSearch:
        def __init__(self, collection_name: str = "semantic_search", uri: str = "./milvus.db"):
            self.client = MilvusClient(uri=uri)
            self.collection_name = collection_name
            self.model = SentenceTransformer('BAAI/bge-large-en-v1.5')
            self.dim = 1024  # must match the embedding model's output dimension
            self._init_collection()

        def _init_collection(self):
            # Reuse the collection if it already exists
            if self.client.has_collection(self.collection_name):
                return

            # Schema: auto-generated primary key, raw text, and its embedding
            schema = self.client.create_schema(auto_id=True, enable_dynamic_field=True)
            schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
            schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
            schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=self.dim)

            index_params = self.client.prepare_index_params()
            index_params.add_index(
                field_name="embedding",
                index_type="AUTOINDEX",
                metric_type="COSINE"
            )

            self.client.create_collection(
                collection_name=self.collection_name,
                schema=schema,
                index_params=index_params
            )

        def add(self, texts: list[str]):
            """Embed the texts and insert them into the collection."""
            embeddings = self.model.encode(texts).tolist()
            data = [{"text": text, "embedding": emb} for text, emb in zip(texts, embeddings)]
            self.client.insert(collection_name=self.collection_name, data=data)

        def search(self, query: str, limit: int = 10):
            """Return the `limit` most similar texts with their similarity scores."""
            query_embedding = self.model.encode([query]).tolist()
            results = self.client.search(
                collection_name=self.collection_name,
                data=query_embedding,
                limit=limit,
                output_fields=["text"]
            )
            # results holds one list of hits per query vector; only one query was sent
            return [
                {"text": hit["entity"]["text"], "score": hit["distance"]}
                for hit in results[0]
            ]


    # Usage
    search = SemanticSearch()
    search.add(["Python is a programming language", "Java is also a programming language", "Machine learning is popular"])
    results = search.search("what is programming")
    for r in results:
        print(f"{r['score']:.3f}: {r['text']}")
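The `add` method above keeps every embedding in memory and sends them in a single insert request, which may become a problem for large corpora. One possible workaround is to feed it in batches. This is a minimal sketch: the `batched` helper and the batch size of 256 are hypothetical choices, not part of pymilvus or sentence-transformers.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 256) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder corpus, invented for this example.
documents = [f"document {i}" for i in range(600)]
batch_sizes = [len(batch) for batch in batched(documents)]
print(batch_sizes)  # [256, 256, 88]

# With the SemanticSearch instance from the usage example, each batch
# would be indexed like this (left commented out so the sketch runs alone):
# for batch in batched(documents):
#     search.add(batch)
```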
Configuration Guide
Embedding Model Selection
| Language | Model | Dimensions | Quality | Speed |
|---|---|---|---|---|
| English | BAAI/bge-large-en-v1.5 | 1024 | ★★★★★ | ★★★ |
| English | BAAI/bge-base-en-v1.5 | 768 | ★★★★ | ★★★★ |
| Chinese | BAAI/bge-large-zh-v1.5 | 1024 | ★★★★★ | ★★★ |
| Multilingual | BAAI/bge-m3 | 1024 | ★★★★ | ★★★ |
| API-based | text-embedding-3-small | 1536 | ★★★★ | ★★★★★ |
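The table can be encoded as a small lookup so that the model name and the vector dimension always change together; the dimension has to match the `dim` of the vector field in the collection schema. This is a sketch only: the dictionary keys (`"english"`, `"english-fast"`, and so on) and the fallback to the multilingual model are assumptions, and `text-embedding-3-small` is left out because it is served through an API rather than loaded locally.

```python
# (model name, embedding dimension) pairs taken from the table above.
# The keys are hypothetical labels chosen for this example.
EMBEDDING_MODELS = {
    "english": ("BAAI/bge-large-en-v1.5", 1024),
    "english-fast": ("BAAI/bge-base-en-v1.5", 768),
    "chinese": ("BAAI/bge-large-zh-v1.5", 1024),
    "multilingual": ("BAAI/bge-m3", 1024),
}


def pick_embedding_model(language: str) -> tuple[str, int]:
    """Return (model_name, dim) for a language label.

    Unknown labels fall back to the multilingual model (an assumption).
    """
    key = language.strip().lower()
    return EMBEDDING_MODELS.get(key, EMBEDDING_MODELS["multilingual"])


model_name, dim = pick_embedding_model("Chinese")
print(model_name, dim)  # BAAI/bge-large-zh-v1.5 1024
```

Switching models on an existing collection means re-embedding and re-indexing every document, because vectors produced by different models are not comparable.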
Similarity Threshold Guidelines
| Similarity Score | Interpretation | Action |
|---|---|---|
| > 0.9 | Near identical | High confidence match |
| 0.7 - 0.9 | Strong match | Good results |
| 0.5 - 0.7 | Related | May need verification |
| < 0.5 | Weak match | Likely irrelevant |
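These bands are rough starting points; score distributions differ between embedding models, so it is worth calibrating them on your own data. A minimal sketch of applying the table to search results follows. The function name `interpret_score`, the sample hits, and the handling of the exact boundary values (0.9, 0.7, 0.5) are assumptions, since the table does not say which band a boundary belongs to.

```python
def interpret_score(score: float) -> str:
    """Map a cosine similarity score to the bands in the table above."""
    if score > 0.9:
        return "near identical"
    if score >= 0.7:
        return "strong match"
    if score >= 0.5:
        return "related"
    return "weak match"


# Sample hits in the shape returned by SemanticSearch.search(); values are made up.
hits = [
    {"text": "Python is a programming language", "score": 0.93},
    {"text": "Java is also a programming language", "score": 0.74},
    {"text": "Machine learning is popular", "score": 0.41},
]
for hit in hits:
    print(f"{hit['score']:.2f}  {interpret_score(hit['score']):<15} {hit['text']}")
```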
Common Pitfalls
❌ Pitfall 1: Expecting Keyword Precision
Problem: User searches "iPhone 15 Pro Max" but gets "Samsung Galaxy S24"
Why: Semantic search finds conceptually similar items, not exact matches
Fix: Use hybrid-search to combine keyword + semantic matching
❌ Pitfall 2: Not Chunking Long Documents
Problem: Searching a 10-page document returns nothing relevant
Why: Embedding models have token limits; long text gets truncated
Fix:
    # Split into chunks before indexing
    chunks = [doc[i:i+500] for i in range(0, len(doc), 500)]
    search.add(chunks)
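Fixed-size slicing can cut a sentence in half at a chunk boundary, so the relevant passage may end up split across two chunks. A common variation is to let neighboring chunks overlap. The sketch below is one way to do that; the `chunk_text` helper and the 50-character overlap are assumptions, and character counts are only a rough proxy for the model's token limit.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size characters.

    Each chunk starts `overlap` characters before the previous one ended.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text,
        # otherwise a final chunk made only of overlap would be emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder document of 1200 characters, invented for this example.
doc = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = chunk_text(doc)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The resulting list can be passed to `search.add(chunks)` in the same way as the simpler slicing version.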
❌ Pitfall 3: Wrong Language Model
Problem: Chinese queries return poor results
Why: Using English-trained model for Chinese text
Fix: Use language-appropriate model (e.g., bge-large-zh-v1.5 for Chinese)
❌ Pitfall 4: Too Many Results
Problem: Returning 100 results when user needs top 3
Why: No relevance threshold filtering
Fix:
    # Filter by similarity score
    results = [r for r in results if r["score"] > 0.7][:3]
When to Level Up
Consider upgrading when you need:
| Need | Upgrade To |
|---|---|
| Keyword + semantic matching | hybrid-search |
| Filter by category/price/date | filtered-search |
| Search across title + description | multi-vector-search |
| Higher precision results | Add core:rerank |
References
- Chunking strategies: core:chunking
- Embedding model details: core:embedding
- Index configuration: core:indexing
- Similarity metrics comparison: references/similarity-metrics.md