vector-search
Expert knowledge of vector embeddings, ANN algorithms, HNSW, IVF, vector database configuration, filtering strategies, quantization, multi-tenancy, and hybrid search with BM25. Trigger phrases: when implementing vector search, embedding similarity search, or HNSW configuration.
Vector Search Expert
Vector search is the core infrastructure primitive for semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and multimodal search. The critical insight: similarity search is an approximation problem — exact nearest neighbor search is O(n) and impractical at scale, so every production vector system uses Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of recall for orders-of-magnitude speedup. Understanding this tradeoff and how to configure it for your use case is the difference between fast, accurate vector search and a slow, imprecise system.
Core Mental Model
An embedding is a dense vector of real numbers that encodes semantic meaning — similar items have vectors that are geometrically close. The embedding model determines the quality of your search; no amount of indexing tuning fixes poor embeddings. ANN indexes build data structures that allow finding "close enough" neighbors without checking every vector. The key parameters: index quality (recall vs speed tradeoff), distance metric (cosine for text, dot product for matrix factorization, L2 for images), and filtering strategy (how to combine metadata filters with vector search efficiently).
Embedding Fundamentals
Distance Metrics:
Cosine similarity: angle between vectors, range [-1, 1].
Best for: text search (sentence transformers, OpenAI)
Normalized vectors: cosine = dot product (slightly faster)
Dot product: directional + magnitude. Range (-∞, +∞).
Best for: recommendation (matrix factorization, retrieval-optimized models)
Warning: magnitude dominates if not normalized
L2 (Euclidean): absolute distance. Range [0, ∞).
Best for: image embeddings, coordinate-space models
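These relationships are easy to verify directly. A small numpy sketch (not tied to any vector database) shows that cosine similarity equals the dot product once vectors are L2-normalized, and that on normalized vectors L2 distance is a monotone function of cosine — which is why many engines internally normalize and use dot product regardless of the metric you configure:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)

# For normalized vectors, cosine similarity and dot product coincide
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
assert np.isclose(cosine(a, b), float(np.dot(a_n, b_n)))

# ...and squared L2 distance becomes a monotone function of cosine:
# ||a_n - b_n||^2 = 2 - 2·cos(a, b)
assert np.isclose(l2(a_n, b_n) ** 2, 2 - 2 * cosine(a, b))
```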
Dimensions:
text-embedding-3-small : 1536 dims (OpenAI, cost-efficient)
text-embedding-3-large : 3072 dims (OpenAI, highest quality)
E5-large-v2 : 1024 dims (open source, strong recall)
BGE-m3 : 1024 dims (multilingual, open source)
CLIP ViT-L/14 : 768 dims (image+text multimodal)
all-MiniLM-L6-v2 : 384 dims (fast, lower quality, fine for low-stakes)
Higher dimensions = more expressive but:
- More memory per vector (1536 dims × float32 = 6KB per vector)
- Slower ANN index build + query
- Quantization helps reduce memory without losing much quality
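The memory arithmetic above is worth wiring into a helper when capacity planning. A sketch (the function name is ours, not from any library; it counts raw vector storage only — the HNSW graph adds link overhead on top):

```python
def index_memory_gb(n_vectors: int, dims: int, bytes_per_dim: float = 4.0) -> float:
    """Raw vector storage only — HNSW graph links add roughly
    M × 2 × 4 bytes per vector on top of this (not counted here)."""
    return n_vectors * dims * bytes_per_dim / 1e9

# float32: 1M × 1536-dim vectors (matches the 6.1 GB figure used below)
print(index_memory_gb(1_000_000, 1536))          # ≈ 6.1
# int8 scalar quantization: 1 byte per dimension
print(index_memory_gb(1_000_000, 1536, 1.0))     # ≈ 1.5
# binary quantization: 1 bit per dimension
print(index_memory_gb(1_000_000, 1536, 1 / 8))   # ≈ 0.19
```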
ANN Algorithm Comparison
HNSW (Hierarchical Navigable Small World):
Best for: General purpose, balanced recall/speed
Query: O(log n) approximate
Build: O(n log n), high memory during construction
Recall: Very high (0.99+ achievable)
Parameters: M (connectivity), ef_construction, ef (query time)
In: Qdrant, Weaviate, Milvus, pgvector, Redis
IVF (Inverted File Index):
Best for: Very large datasets (100M+ vectors), low memory at query time
Query: O(n × nprobe / nlist) — scans only the nprobe nearest clusters out of nlist
Build: Requires training phase (k-means clustering)
Recall: Lower than HNSW (0.9-0.97 typical)
Parameters: nlist (clusters), nprobe (clusters searched per query)
In: FAISS (as IVF_FLAT, IVF_PQ, IVF_SQ8)
LSH (Locality-Sensitive Hashing):
Best for: extremely fast approximate lookups where lower recall is acceptable
Recall: 0.7-0.9 (lower than HNSW/IVF)
Build: Very fast
In: Redis, older systems
ScaNN (Google):
Best for: Large-scale production (Google-grade throughput)
Recall: High, optimized for TPU/CPU SIMD
In: Google Vertex AI Matching Engine
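To make the IVF mechanics (nlist/nprobe) concrete, here is a toy pure-numpy inverted-file index — an illustration, not FAISS: real implementations use properly trained k-means centroids and often compress the lists (PQ/SQ), but the query path is the same: pick the nprobe closest coarse centroids, then scan only their lists.

```python
import numpy as np

class ToyIVF:
    """Minimal IVF index: crude k-means centroids + inverted lists (illustration only)."""

    def __init__(self, n_clusters: int, n_iter: int = 5, seed: int = 0):
        self.nlist = n_clusters
        self.n_iter = n_iter
        self.rng = np.random.default_rng(seed)

    def train(self, xb: np.ndarray):
        # a few crude Lloyd iterations; FAISS runs proper k-means here
        self.centroids = xb[self.rng.choice(len(xb), self.nlist, replace=False)]
        for _ in range(self.n_iter):
            assign = np.argmin(
                ((xb[:, None, :] - self.centroids[None]) ** 2).sum(-1), axis=1)
            for c in range(self.nlist):
                members = xb[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        self.lists = {c: np.where(assign == c)[0] for c in range(self.nlist)}
        self.xb = xb

    def search(self, q: np.ndarray, k: int, nprobe: int) -> np.ndarray:
        # 1. pick the nprobe nearest coarse centroids
        d_centroid = ((self.centroids - q) ** 2).sum(-1)
        probe = np.argsort(d_centroid)[:nprobe]
        # 2. scan only those inverted lists (~ n × nprobe / nlist vectors)
        cand = np.concatenate([self.lists[c] for c in probe])
        d = ((self.xb[cand] - q) ** 2).sum(-1)
        return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
xb = rng.normal(size=(2000, 32)).astype(np.float32)
q = rng.normal(size=32).astype(np.float32)
ivf = ToyIVF(n_clusters=16)
ivf.train(xb)
exact = np.argsort(((xb - q) ** 2).sum(-1))[:10]
# probing every cluster degenerates to exact search
assert set(ivf.search(q, 10, nprobe=16)) == set(exact)
```

Raising nprobe from 1 toward nlist trades latency for recall — the same dial FAISS exposes as `index.nprobe`.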
HNSW Configuration
# Qdrant: production HNSW configuration
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, HnswConfigDiff,
    OptimizersConfigDiff, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType
)

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(
        size=1536,  # embedding dimensions
        distance=Distance.COSINE
    ),
    # HNSW parameters (note: qdrant names the build-time beam width ef_construct)
    hnsw_config=HnswConfigDiff(
        m=16,                # connections per node (default 16)
                             # higher M → better recall, more memory
                             # typical range: 8-64
        ef_construct=200,    # beam width during index build
                             # higher → better recall, slower build
                             # typical range: 100-500
        full_scan_threshold=10000,  # below this threshold (in KB of vectors),
                                    # plain full scan is used instead of HNSW
        on_disk=False        # keep index in RAM (recommended for speed)
    ),
    # Quantization: reduce memory by 4x-32x with minimal recall loss
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # float32 → int8 (4x compression, ~1% recall loss)
            quantile=0.99,         # clip outliers at the 99th percentile
            always_ram=True        # keep quantized vectors in RAM
        )
    ),
    optimizers_config=OptimizersConfigDiff(
        deleted_threshold=0.2,  # rebuild segment when 20% of vectors are deleted
        vacuum_min_vector_number=1000,
        default_segment_number=2,
        max_segment_size=200000  # ~200K vectors per segment
    )
)
HNSW Parameter Guidelines
M (connectivity):
8: faster build, lower recall (0.90-0.95)
16: balanced (0.95-0.98) ← recommended for most use cases
32: high recall (0.98-0.99), 2x memory
64: highest recall, very high memory
ef_construction:
100: fast build, lower recall
200: balanced ← recommended default
400: high quality index, slow build (use for offline indexing)
ef (query time, search parameter):
Can be set per-query, does not affect stored index
ef=100: fast (0.90-0.95 recall)
ef=200: balanced ← default
ef=500: high recall (0.98+), slower
Recall vs latency: test on your data, there's no universal optimum
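Since there is no universal optimum, measure recall on your own data. A minimal harness (pure numpy; `ann_search` in the sweep comment is a placeholder for your vector DB's search call) compares any ANN result set against exact brute-force ground truth:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the true top-k that the ANN search actually returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def exact_top_k(xb: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    # brute-force ground truth: full scan, cosine via normalized dot product
    xb_n = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    return np.argsort(-(xb_n @ q_n))[:k]

# Sweep ef (or nprobe) on a sample of real queries and record recall vs latency:
#   for ef in (64, 128, 256, 512):
#       ids = ann_search(q, k=10, ef=ef)   # ← your vector DB call here
#       print(ef, recall_at_k(ids, exact_top_k(xb, q, 10)))
```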
Filtering Strategies
Pre-filter (filter before ANN):
Execute metadata filter → get candidate IDs → ANN search within candidates
Best when: filter reduces candidates to < 1% of collection
Risk: if filtered set is too small, ANN can't find enough neighbors → poor recall
Post-filter (filter after ANN):
Execute ANN search → filter results by metadata
Best when: filter is loose (> 10% of collection passes)
Risk: if filter is very selective, top-K results may be exhausted → fewer results
Filtered HNSW (Qdrant, native):
Metadata stored alongside vectors in the HNSW graph
Filter applied DURING graph traversal (not before or after)
Best of both worlds: accurate recall even with selective filters
Cost: slightly more memory for payload indexes
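The post-filter starvation effect is easy to see in a toy simulation (exact search stands in for ANN here; the numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
xb = rng.normal(size=(n, 64)).astype(np.float32)
passes = rng.random(n) < 0.01  # highly selective filter: ~1% of docs pass
q = rng.normal(size=64).astype(np.float32)
dist = ((xb - q) ** 2).sum(-1)

k = 10
# Post-filter: take the top-k first, then filter → results get starved
ann_top = np.argsort(dist)[:k]
post = ann_top[passes[ann_top]]
# Pre-filter / filtered traversal: restrict candidates first → full k results
cand = np.where(passes)[0]
pre = cand[np.argsort(dist[cand])[:k]]

print(len(post), len(pre))  # post-filter typically returns far fewer than k
```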
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange, SearchParams

# Filtered vector search in Qdrant
results = client.search(
    collection_name="articles",
    query_vector=query_embedding,
    # Payload filter applied during HNSW traversal
    query_filter=Filter(
        must=[
            FieldCondition(
                key="status",
                match=MatchValue(value="published")
            ),
            FieldCondition(
                key="published_at",
                range=DatetimeRange(gte="2024-01-01T00:00:00Z")  # datetime-aware range
            ),
            FieldCondition(
                key="tags",
                match=MatchValue(value="machine-learning")
            )
        ],
        must_not=[
            FieldCondition(key="is_deleted", match=MatchValue(value=True))
        ]
    ),
    limit=10,
    with_payload=True,
    with_vectors=False,
    # ef for this query (higher = better recall, slower)
    search_params=SearchParams(hnsw_ef=128)
)
Multi-Tenancy
# Strategy 1: One collection per tenant (full isolation, high overhead)
# Good for: < 100 tenants, strict isolation required (GDPR, compliance)
client.create_collection(f"tenant_{tenant_id}_articles", vectors_config=...)
# Downside: 1000 tenants = 1000 collections (operational nightmare)
# Strategy 2: Namespace / payload filter (recommended for most cases)
# All tenants in one collection, filter by tenant_id at query time
# Good for: many small tenants, shared infrastructure
# Index the tenant_id field for fast filtering
client.create_payload_index(
    collection_name="articles",
    field_name="tenant_id",
    field_schema="keyword"  # or "integer" if numeric IDs
)
# Write: always include tenant_id in payload
from qdrant_client.models import PointStruct

client.upsert(
    collection_name="articles",
    points=[PointStruct(
        id=article_id,
        vector=embedding,
        payload={
            "tenant_id": tenant_id,
            "title": title,
            "content": content
        }
    )]
)
# Query: always filter by tenant_id
results = client.search(
    collection_name="articles",
    query_vector=query_embedding,
    query_filter=Filter(must=[
        FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))
    ]),
    limit=10
)
# Strategy 3: Shard collection by tenant group
# For large tenants (> 1M vectors), give them their own collection
# For small tenants, group in shared collection
# Implement routing logic in application layer
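A sketch of that application-layer routing logic (names and the threshold are illustrative, not from any library):

```python
# Hypothetical routing for strategy 3: large tenants get a dedicated
# collection, everyone else shares one and relies on the tenant_id filter.
LARGE_TENANT_THRESHOLD = 1_000_000
SHARED_COLLECTION = "articles_shared"

def collection_for(tenant_id: str, vector_counts: dict[str, int]) -> str:
    """Route a tenant to its collection based on its vector count."""
    if vector_counts.get(tenant_id, 0) > LARGE_TENANT_THRESHOLD:
        return f"tenant_{tenant_id}_articles"  # dedicated collection
    return SHARED_COLLECTION                   # shared + tenant_id payload filter

counts = {"acme": 5_000_000, "smallco": 12_000}
print(collection_for("acme", counts))     # tenant_acme_articles
print(collection_for("smallco", counts))  # articles_shared
```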
Quantization for Memory Reduction
Memory without quantization:
1M vectors × 1536 dims × 4 bytes (float32) = 6.1 GB
Quantization options:
Scalar (INT8): float32 → int8 = 4× reduction → 1.5 GB | ~1% recall loss
Binary: float32 → 1 bit = 32× reduction → 0.19 GB | ~5-10% recall loss
Product (PQ): vector → M×int8 = 8-32× reduction | ~3-8% recall loss
When to use:
Scalar INT8: default recommendation — great recall/memory tradeoff
Binary: extreme memory constraints, recall can tolerate loss
Product PQ: 100M+ vectors, FAISS, need maximum compression
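The mechanics of scalar INT8 quantization are simple enough to sketch in numpy. This is a simplified global symmetric scheme (real engines typically compute scales per segment or per dimension), but it shows exactly what the clipping quantile does — the role played by quantile=0.99 in the Qdrant config below:

```python
import numpy as np

def int8_quantize(x: np.ndarray, quantile: float = 0.99):
    """Symmetric scalar quantization: clip outliers at the given quantile,
    then map float32 → int8. Returns codes plus the scale needed to decode."""
    bound = np.quantile(np.abs(x), quantile)  # clip outliers (cf. quantile=0.99)
    scale = bound / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def int8_dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(7)
x = rng.normal(size=(1000, 256)).astype(np.float32)
codes, scale = int8_quantize(x)
x_hat = int8_dequantize(codes, scale)

print(codes.nbytes / x.nbytes)  # 0.25 → 4× compression
err = np.linalg.norm(x - x_hat, axis=1) / np.linalg.norm(x, axis=1)
print(err.mean())               # small relative reconstruction error
```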
# Qdrant binary quantization (extreme compression)
from qdrant_client.models import (
    BinaryQuantization, BinaryQuantizationConfig,
    SearchParams, QuantizationSearchParams
)

client.create_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True)
    )
)

# With rescoring: use binary for candidate retrieval, full vectors for reranking
results = client.search(
    collection_name="large_corpus",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            ignore=False,     # use the quantized index
            rescore=True,     # rescore top candidates with full precision
            oversampling=4.0  # fetch 4× more candidates before rescoring
        )
    )
)
Hybrid Search with Reciprocal Rank Fusion
# Hybrid search: combine BM25 (keyword) + dense (semantic) for best of both
# BM25 catches exact keyword matches; dense catches semantic similarity
from qdrant_client.models import (
    SparseVector, SparseVectorParams, NamedVector, NamedSparseVector, PointStruct
)

# Setup: collection with both dense and sparse vectors
client.create_collection(
    collection_name="articles_hybrid",
    vectors_config={
        "dense": VectorParams(size=1536, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()  # for BM25 / SPLADE weights
    }
)

# Upsert with both vector types
# Use SPLADE or BM25F to generate sparse vectors
def upsert_article(article_id, title, body, dense_embedding, sparse_vector):
    client.upsert(
        collection_name="articles_hybrid",
        points=[PointStruct(
            id=article_id,
            vector={
                "dense": dense_embedding,
                "sparse": SparseVector(
                    indices=sparse_vector.indices.tolist(),
                    values=sparse_vector.values.tolist()
                )
            },
            payload={"title": title, "body": body}
        )]
    )
# Reciprocal Rank Fusion (RRF): merge ranked lists
def reciprocal_rank_fusion(
    ranked_lists: list[list[tuple[str, float]]],
    k: int = 60
) -> list[tuple[str, float]]:
    """Merge multiple ranked result lists into a single ranking."""
    scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Full hybrid search pipeline
async def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    # 1. Generate both query representations
    dense_embedding = await embed_dense(query)  # OpenAI or E5
    sparse_vector = compute_bm25(query)         # BM25 token weights (SparseVector)
    # 2. Dense search
    dense_results = client.search(
        collection_name="articles_hybrid",
        query_vector=NamedVector(name="dense", vector=dense_embedding),
        limit=top_k * 3  # oversample for RRF
    )
    # 3. Sparse search
    sparse_results = client.search(
        collection_name="articles_hybrid",
        query_vector=NamedSparseVector(name="sparse", vector=sparse_vector),
        limit=top_k * 3
    )
    # 4. RRF fusion
    dense_ranked = [(r.id, r.score) for r in dense_results]
    sparse_ranked = [(r.id, r.score) for r in sparse_results]
    fused = reciprocal_rank_fusion([dense_ranked, sparse_ranked], k=60)
    # 5. Fetch full payloads for top results
    top_ids = [doc_id for doc_id, _ in fused[:top_k]]
    return client.retrieve(collection_name="articles_hybrid", ids=top_ids, with_payload=True)
Updating Embeddings When Model Changes
# When you upgrade the embedding model, all vectors must be regenerated
# Strategy: rebuild into a new collection in the background, then swap over
from qdrant_client.models import PointStruct

async def migrate_embeddings(
    collection_old: str,
    collection_new: str,
    new_model: str,
    batch_size: int = 100
):
    """
    Zero-downtime embedding migration:
    1. Create new collection with new embedding config
    2. Batch re-embed all content into new collection
    3. Switch queries to new collection
    4. Delete old collection
    """
    # Step 1: create new collection (helper that applies the new model's config)
    create_collection(collection_new, model=new_model)
    # Step 2: batch re-embed
    offset = None
    while True:
        records, offset = client.scroll(
            collection_name=collection_old,
            with_payload=True,
            with_vectors=False,
            limit=batch_size,
            offset=offset
        )
        if not records:
            break
        # Generate new embeddings
        texts = [r.payload["content"] for r in records]
        new_embeddings = await batch_embed(texts, model=new_model)
        # Write to new collection
        client.upsert(
            collection_name=collection_new,
            points=[PointStruct(id=r.id, vector=emb, payload=r.payload)
                    for r, emb in zip(records, new_embeddings)]
        )
        if offset is None:  # scroll returns a None offset once exhausted
            break
    # Step 3: atomic alias swap (if your system supports aliases)
    # Or: update application config to point to new collection
    # Step 4: delete old collection after validation
Anti-Patterns
# ❌ Using L2 distance with text embeddings
# Text embeddings are typically normalized — use COSINE or DOT_PRODUCT
VectorParams(size=1536, distance=Distance.EUCLID)
# ✅
VectorParams(size=1536, distance=Distance.COSINE)
# ❌ Indexing without payload indexes (slow filtered searches)
# After creating collection, always create payload indexes for filter fields
# ✅
client.create_payload_index("articles", field_name="status", field_schema="keyword")
client.create_payload_index("articles", field_name="tenant_id", field_schema="keyword")
client.create_payload_index("articles", field_name="published_at", field_schema="datetime")
# ❌ ef_construct too low (poor index quality, low recall)
HnswConfigDiff(m=16, ef_construct=50)
# ✅ ef_construct >= 100, typically 200
HnswConfigDiff(m=16, ef_construct=200)
# ❌ SELECT * on vectors in bulk queries (huge data transfer)
client.scroll(collection_name="articles", with_vectors=True) # returns all vectors!
# ✅ Only fetch vectors when needed
client.scroll(collection_name="articles", with_vectors=False, with_payload=True)
# ❌ Storing full document in vector DB (use reference + payload summary)
payload = {"full_text": "...100KB article..."}
# ✅ Store ID + summary, retrieve full content from primary DB
payload = {"article_id": "art_123", "title": "...", "summary": "...200 chars..."}
Quick Reference
Distance Metric:
Text (sentence transformers, OpenAI) → Cosine
Recommendation (matrix factorization) → Dot product
Image embeddings, coordinates → L2 (Euclidean)
HNSW Parameters:
M=16, ef_construction=200 → balanced default
M=32, ef_construction=400 → high recall, higher memory
ef=128 at query time → default
ef=500 at query time → high recall mode
Memory Budget:
float32: dims × 4 bytes per vector
int8: dims × 1 byte (4× compression, ~1% recall loss)
binary: dims / 8 bytes (32× compression, ~5-10% recall loss)
Filtering:
> 10% of data passes filter → post-filter (simple)
< 10% of data passes filter → filtered HNSW (Qdrant native)
Always: create payload index on filter fields
Multi-tenancy:
< 100 large tenants → one collection per tenant
Many small tenants → shared collection + tenant_id filter
Mixed → route large tenants to own collection
Hybrid Search:
Keyword match needed? → add BM25/SPLADE sparse vectors
Fusion method: → Reciprocal Rank Fusion (k=60)
Oversampling factor: → fetch 3× candidates before RRF
Embedding Model Selection:
General English text: → text-embedding-3-small (cost) or E5-large (open)
Multilingual: → BGE-m3 or multilingual-e5-large
Image + text: → CLIP ViT-L/14
Code: → voyage-code-2 or CodeBERT