vector-search
Expert knowledge of vector embeddings, ANN algorithms, HNSW, IVF, vector database configuration, filtering strategies, quantization, multi-tenancy, and hybrid search with BM25. Trigger phrases: when implementing vector search, embedding similarity search, or HNSW configuration.
Vector Search Expert
Vector search is the core infrastructure primitive for semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and multimodal search. The critical insight: similarity search is an approximation problem — exact nearest neighbor search is O(n) and impractical at scale, so every production vector system uses Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of recall for orders-of-magnitude speedup. Understanding this tradeoff and how to configure it for your use case is the difference between fast, accurate vector search and a slow, imprecise system.
Core Mental Model
An embedding is a dense vector of real numbers that encodes semantic meaning — similar items have vectors that are geometrically close. The embedding model determines the quality of your search; no amount of indexing tuning fixes poor embeddings. ANN indexes build data structures that allow finding "close enough" neighbors without checking every vector. The key parameters: index quality (recall vs speed tradeoff), distance metric (cosine for text, dot product for matrix factorization, L2 for images), and filtering strategy (how to combine metadata filters with vector search efficiently).
Embedding Fundamentals
Distance Metrics:
Cosine similarity: angle between vectors, range [-1, 1].
Best for: text search (sentence transformers, OpenAI)
Normalized vectors: cosine = dot product (slightly faster)
Dot product: directional + magnitude. Range (-∞, +∞).
Best for: recommendation (matrix factorization, retrieval-optimized models)
Warning: magnitude dominates if not normalized
L2 (Euclidean): absolute distance. Range [0, ∞).
Best for: image embeddings, coordinate-space models
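These relationships are easy to verify directly. A small numpy sketch (not tied to any vector database) shows that cosine similarity equals the dot product once vectors are L2-normalized, and that on normalized vectors L2 distance is a monotone function of cosine — which is why many engines internally normalize and use dot product regardless of the metric you configure:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)

# For normalized vectors, cosine similarity and dot product coincide
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
assert np.isclose(cosine(a, b), float(np.dot(a_n, b_n)))

# ...and squared L2 distance becomes a monotone function of cosine:
# ||a_n - b_n||^2 = 2 - 2·cos(a, b)
assert np.isclose(l2(a_n, b_n) ** 2, 2 - 2 * cosine(a, b))
```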
Dimensions:
text-embedding-3-small : 1536 dims (OpenAI, cost-efficient)
text-embedding-3-large : 3072 dims (OpenAI, highest quality)
E5-large-v2 : 1024 dims (open source, strong recall)
BGE-m3 : 1024 dims (multilingual, open source)
CLIP ViT-L/14 : 768 dims (image+text multimodal)
all-MiniLM-L6-v2 : 384 dims (fast, lower quality, fine for low-stakes)
Higher dimensions = more expressive but:
- More memory per vector (1536 dims × float32 = 6KB per vector)
- Slower ANN index build + query
- Quantization helps reduce memory without losing much quality
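The memory arithmetic above is worth wiring into a helper when capacity planning. A sketch (the function name is ours, not from any library; it counts raw vector storage only — the HNSW graph adds link overhead on top):

```python
def index_memory_gb(n_vectors: int, dims: int, bytes_per_dim: float = 4.0) -> float:
    """Raw vector storage only — HNSW graph links add roughly
    M × 2 × 4 bytes per vector on top of this (not counted here)."""
    return n_vectors * dims * bytes_per_dim / 1e9

# float32: 1M × 1536-dim vectors (matches the 6.1 GB figure used below)
print(index_memory_gb(1_000_000, 1536))          # ≈ 6.1
# int8 scalar quantization: 1 byte per dimension
print(index_memory_gb(1_000_000, 1536, 1.0))     # ≈ 1.5
# binary quantization: 1 bit per dimension
print(index_memory_gb(1_000_000, 1536, 1 / 8))   # ≈ 0.19
```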
ANN Algorithm Comparison
HNSW (Hierarchical Navigable Small World):
Best for: General purpose, balanced recall/speed
Query: O(log n) approximate
Build: O(n log n), high memory during construction
Recall: Very high (0.99+ achievable)
Parameters: M (connectivity), ef_construction, ef (query time)
In: Qdrant, Weaviate, Milvus, pgvector, Redis
IVF (Inverted File Index):
Best for: Very large datasets (100M+ vectors), low memory at query time
Query: O(n × nprobe / nlist) — scans only the nprobe nearest clusters out of nlist
Build: Requires training phase (k-means clustering)
Recall: Lower than HNSW (0.9-0.97 typical)
Parameters: nlist (clusters), nprobe (clusters searched per query)
In: FAISS (as IVF_FLAT, IVF_PQ, IVF_SQ8)
LSH (Locality-Sensitive Hashing):
Best for: extremely fast approximate lookups where lower recall is acceptable
Recall: 0.7-0.9 (lower than HNSW/IVF)
Build: Very fast
In: Redis, older systems
ScaNN (Google):
Best for: Large-scale production (Google-grade throughput)
Recall: High, optimized for TPU/CPU SIMD
In: Google Vertex AI Matching Engine
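To make the IVF mechanics (nlist/nprobe) concrete, here is a toy pure-numpy inverted-file index — an illustration, not FAISS: real implementations use properly trained k-means centroids and often compress the lists (PQ/SQ), but the query path is the same: pick the nprobe closest coarse centroids, then scan only their lists.

```python
import numpy as np

class ToyIVF:
    """Minimal IVF index: crude k-means centroids + inverted lists (illustration only)."""

    def __init__(self, n_clusters: int, n_iter: int = 5, seed: int = 0):
        self.nlist = n_clusters
        self.n_iter = n_iter
        self.rng = np.random.default_rng(seed)

    def train(self, xb: np.ndarray):
        # a few crude Lloyd iterations; FAISS runs proper k-means here
        self.centroids = xb[self.rng.choice(len(xb), self.nlist, replace=False)]
        for _ in range(self.n_iter):
            assign = np.argmin(
                ((xb[:, None, :] - self.centroids[None]) ** 2).sum(-1), axis=1)
            for c in range(self.nlist):
                members = xb[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        self.lists = {c: np.where(assign == c)[0] for c in range(self.nlist)}
        self.xb = xb

    def search(self, q: np.ndarray, k: int, nprobe: int) -> np.ndarray:
        # 1. pick the nprobe nearest coarse centroids
        d_centroid = ((self.centroids - q) ** 2).sum(-1)
        probe = np.argsort(d_centroid)[:nprobe]
        # 2. scan only those inverted lists (~ n × nprobe / nlist vectors)
        cand = np.concatenate([self.lists[c] for c in probe])
        d = ((self.xb[cand] - q) ** 2).sum(-1)
        return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
xb = rng.normal(size=(2000, 32)).astype(np.float32)
q = rng.normal(size=32).astype(np.float32)
ivf = ToyIVF(n_clusters=16)
ivf.train(xb)
exact = np.argsort(((xb - q) ** 2).sum(-1))[:10]
# probing every cluster degenerates to exact search
assert set(ivf.search(q, 10, nprobe=16)) == set(exact)
```

Raising nprobe from 1 toward nlist trades latency for recall — the same dial FAISS exposes as `index.nprobe`.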
HNSW Configuration
# Qdrant: production HNSW configuration
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, HnswConfigDiff,
    OptimizersConfigDiff, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType
)

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(
        size=1536,  # embedding dimensions
        distance=Distance.COSINE
    ),
    # HNSW parameters (note: qdrant names the build-time beam width ef_construct)
    hnsw_config=HnswConfigDiff(
        m=16,                # connections per node (default 16)
                             # higher M → better recall, more memory
                             # typical range: 8-64
        ef_construct=200,    # beam width during index build
                             # higher → better recall, slower build
                             # typical range: 100-500
        full_scan_threshold=10000,  # below this threshold (in KB of vectors),
                                    # plain full scan is used instead of HNSW
        on_disk=False        # keep index in RAM (recommended for speed)
    ),
    # Quantization: reduce memory by 4x-32x with minimal recall loss
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # float32 → int8 (4x compression, ~1% recall loss)
            quantile=0.99,         # clip outliers at the 99th percentile
            always_ram=True        # keep quantized vectors in RAM
        )
    ),
    optimizers_config=OptimizersConfigDiff(
        deleted_threshold=0.2,  # rebuild segment when 20% of vectors are deleted
        vacuum_min_vector_number=1000,
        default_segment_number=2,
        max_segment_size=200000  # ~200K vectors per segment
    )
)
HNSW Parameter Guidelines
M (connectivity):
8: faster build, lower recall (0.90-0.95)
16: balanced (0.95-0.98) ← recommended for most use cases
32: high recall (0.98-0.99), 2x memory
64: highest recall, very high memory
ef_construction:
100: fast build, lower recall
200: balanced ← recommended default
400: high quality index, slow build (use for offline indexing)
ef (query time, search parameter):
Can be set per-query, does not affect stored index
ef=100: fast (0.90-0.95 recall)
ef=200: balanced ← default
ef=500: high recall (0.98+), slower
Recall vs latency: test on your data, there's no universal optimum
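Since there is no universal optimum, measure recall on your own data. A minimal harness (pure numpy; `ann_search` in the sweep comment is a placeholder for your vector DB's search call) compares any ANN result set against exact brute-force ground truth:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the true top-k that the ANN search actually returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def exact_top_k(xb: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    # brute-force ground truth: full scan, cosine via normalized dot product
    xb_n = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    return np.argsort(-(xb_n @ q_n))[:k]

# Sweep ef (or nprobe) on a sample of real queries and record recall vs latency:
#   for ef in (64, 128, 256, 512):
#       ids = ann_search(q, k=10, ef=ef)   # ← your vector DB call here
#       print(ef, recall_at_k(ids, exact_top_k(xb, q, 10)))
```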
Filtering Strategies
Pre-filter (filter before ANN):
Execute metadata filter → get candidate IDs → ANN search within candidates
Best when: filter reduces candidates to < 1% of collection
Risk: if filtered set is too small, ANN can't find enough neighbors → poor recall
Post-filter (filter after ANN):
Execute ANN search → filter results by metadata
Best when: filter is loose (> 10% of collection passes)
Risk: if filter is very selective, top-K results may be exhausted → fewer results
Filtered HNSW (Qdrant, native):
Metadata stored alongside vectors in the HNSW graph
Filter applied DURING graph traversal (not before or after)
Best of both worlds: accurate recall even with selective filters
Cost: slightly more memory for payload indexes
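The post-filter starvation effect is easy to see in a toy simulation (exact search stands in for ANN here; the numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
xb = rng.normal(size=(n, 64)).astype(np.float32)
passes = rng.random(n) < 0.01  # highly selective filter: ~1% of docs pass
q = rng.normal(size=64).astype(np.float32)
dist = ((xb - q) ** 2).sum(-1)

k = 10
# Post-filter: take the top-k first, then filter → results get starved
ann_top = np.argsort(dist)[:k]
post = ann_top[passes[ann_top]]
# Pre-filter / filtered traversal: restrict candidates first → full k results
cand = np.where(passes)[0]
pre = cand[np.argsort(dist[cand])[:k]]

print(len(post), len(pre))  # post-filter typically returns far fewer than k
```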
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange, SearchParams

# Filtered vector search in Qdrant
results = client.search(
    collection_name="articles",
    query_vector=query_embedding,
    # Payload filter applied during HNSW traversal
    query_filter=Filter(
        must=[
            FieldCondition(
                key="status",
                match=MatchValue(value="published")
            ),
            FieldCondition(
                key="published_at",
                range=DatetimeRange(gte="2024-01-01T00:00:00Z")  # datetime-aware range
            ),
            FieldCondition(
                key="tags",
                match=MatchValue(value="machine-learning")
            )
        ],
        must_not=[
            FieldCondition(key="is_deleted", match=MatchValue(value=True))
        ]
    ),
    limit=10,
    with_payload=True,
    with_vectors=False,
    # ef for this query (higher = better recall, slower)
    search_params=SearchParams(hnsw_ef=128)
)
Multi-Tenancy
# Strategy 1: One collection per tenant (full isolation, high overhead)
# Good for: < 100 tenants, strict isolation required (GDPR, compliance)
client.create_collection(f"tenant_{tenant_id}_articles", vectors_config=...)
# Downside: 1000 tenants = 1000 collections (operational nightmare)
# Strategy 2: Namespace / payload filter (recommended for most cases)
# All tenants in one collection, filter by tenant_id at query time
# Good for: many small tenants, shared infrastructure
# Index the tenant_id field for fast filtering
client.create_payload_index(
    collection_name="articles",
    field_name="tenant_id",
    field_schema="keyword"  # or "integer" if numeric IDs
)
# Write: always include tenant_id in payload
from qdrant_client.models import PointStruct

client.upsert(
    collection_name="articles",
    points=[PointStruct(
        id=article_id,
        vector=embedding,
        payload={
            "tenant_id": tenant_id,
            "title": title,
            "content": content
        }
    )]
)
# Query: always filter by tenant_id
results = client.search(
    collection_name="articles",
    query_vector=query_embedding,
    query_filter=Filter(must=[
        FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))
    ]),
    limit=10
)
# Strategy 3: Shard collection by tenant group
# For large tenants (> 1M vectors), give them their own collection
# For small tenants, group in shared collection
# Implement routing logic in application layer
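A sketch of that application-layer routing logic (names and the threshold are illustrative, not from any library):

```python
# Hypothetical routing for strategy 3: large tenants get a dedicated
# collection, everyone else shares one and relies on the tenant_id filter.
LARGE_TENANT_THRESHOLD = 1_000_000
SHARED_COLLECTION = "articles_shared"

def collection_for(tenant_id: str, vector_counts: dict[str, int]) -> str:
    """Route a tenant to its collection based on its vector count."""
    if vector_counts.get(tenant_id, 0) > LARGE_TENANT_THRESHOLD:
        return f"tenant_{tenant_id}_articles"  # dedicated collection
    return SHARED_COLLECTION                   # shared + tenant_id payload filter

counts = {"acme": 5_000_000, "smallco": 12_000}
print(collection_for("acme", counts))     # tenant_acme_articles
print(collection_for("smallco", counts))  # articles_shared
```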
Quantization for Memory Reduction
Memory without quantization:
1M vectors × 1536 dims × 4 bytes (float32) = 6.1 GB
Quantization options:
Scalar (INT8): float32 → int8 = 4× reduction → 1.5 GB | ~1% recall loss
Binary: float32 → 1 bit = 32× reduction → 0.19 GB | ~5-10% recall loss
Product (PQ): vector → M×int8 = 8-32× reduction | ~3-8% recall loss
When to use:
Scalar INT8: default recommendation — great recall/memory tradeoff
Binary: extreme memory constraints, recall can tolerate loss
Product PQ: 100M+ vectors, FAISS, need maximum compression
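The mechanics of scalar INT8 quantization are simple enough to sketch in numpy. This is a simplified global symmetric scheme (real engines typically compute scales per segment or per dimension), but it shows exactly what the clipping quantile does — the role played by quantile=0.99 in the Qdrant config below:

```python
import numpy as np

def int8_quantize(x: np.ndarray, quantile: float = 0.99):
    """Symmetric scalar quantization: clip outliers at the given quantile,
    then map float32 → int8. Returns codes plus the scale needed to decode."""
    bound = np.quantile(np.abs(x), quantile)  # clip outliers (cf. quantile=0.99)
    scale = bound / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def int8_dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(7)
x = rng.normal(size=(1000, 256)).astype(np.float32)
codes, scale = int8_quantize(x)
x_hat = int8_dequantize(codes, scale)

print(codes.nbytes / x.nbytes)  # 0.25 → 4× compression
err = np.linalg.norm(x - x_hat, axis=1) / np.linalg.norm(x, axis=1)
print(err.mean())               # small relative reconstruction error
```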
# Qdrant binary quantization (extreme compression)
from qdrant_client.models import (
    BinaryQuantization, BinaryQuantizationConfig,
    SearchParams, QuantizationSearchParams
)

client.create_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True)
    )
)

# With rescoring: use binary for candidate retrieval, full vectors for reranking
results = client.search(
    collection_name="large_corpus",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            ignore=False,     # use the quantized index
            rescore=True,     # rescore top candidates with full precision
            oversampling=4.0  # fetch 4× more candidates before rescoring
        )
    )
)
Hybrid Search with Reciprocal Rank Fusion
# Hybrid search: combine BM25 (keyword) + dense (semantic) for best of both
# BM25 catches exact keyword matches; dense catches semantic similarity
from qdrant_client.models import (
    SparseVector, SparseVectorParams, NamedVector, NamedSparseVector, PointStruct
)

# Setup: collection with both dense and sparse vectors
client.create_collection(
    collection_name="articles_hybrid",
    vectors_config={
        "dense": VectorParams(size=1536, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()  # for BM25 / SPLADE weights
    }
)

# Upsert with both vector types
# Use SPLADE or BM25F to generate sparse vectors
def upsert_article(article_id, title, body, dense_embedding, sparse_vector):
    client.upsert(
        collection_name="articles_hybrid",
        points=[PointStruct(
            id=article_id,
            vector={
                "dense": dense_embedding,
                "sparse": SparseVector(
                    indices=sparse_vector.indices.tolist(),
                    values=sparse_vector.values.tolist()
                )
            },
            payload={"title": title, "body": body}
        )]
    )
# Reciprocal Rank Fusion (RRF): merge ranked lists
def reciprocal_rank_fusion(
    ranked_lists: list[list[tuple[str, float]]],
    k: int = 60
) -> list[tuple[str, float]]:
    """Merge multiple ranked result lists into a single ranking."""
    scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Full hybrid search pipeline
async def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    # 1. Generate both query representations
    dense_embedding = await embed_dense(query)  # OpenAI or E5
    sparse_vector = compute_bm25(query)         # BM25 token weights (SparseVector)
    # 2. Dense search
    dense_results = client.search(
        collection_name="articles_hybrid",
        query_vector=NamedVector(name="dense", vector=dense_embedding),
        limit=top_k * 3  # oversample for RRF
    )
    # 3. Sparse search
    sparse_results = client.search(
        collection_name="articles_hybrid",
        query_vector=NamedSparseVector(name="sparse", vector=sparse_vector),
        limit=top_k * 3
    )
    # 4. RRF fusion
    dense_ranked = [(r.id, r.score) for r in dense_results]
    sparse_ranked = [(r.id, r.score) for r in sparse_results]
    fused = reciprocal_rank_fusion([dense_ranked, sparse_ranked], k=60)
    # 5. Fetch full payloads for top results
    top_ids = [doc_id for doc_id, _ in fused[:top_k]]
    return client.retrieve(collection_name="articles_hybrid", ids=top_ids, with_payload=True)
Updating Embeddings When Model Changes
# When you upgrade the embedding model, all vectors must be regenerated
# Strategy: rebuild into a new collection in the background, then swap over
from qdrant_client.models import PointStruct

async def migrate_embeddings(
    collection_old: str,
    collection_new: str,
    new_model: str,
    batch_size: int = 100
):
    """
    Zero-downtime embedding migration:
    1. Create new collection with new embedding config
    2. Batch re-embed all content into new collection
    3. Switch queries to new collection
    4. Delete old collection
    """
    # Step 1: create new collection (helper that applies the new model's config)
    create_collection(collection_new, model=new_model)
    # Step 2: batch re-embed
    offset = None
    while True:
        records, offset = client.scroll(
            collection_name=collection_old,
            with_payload=True,
            with_vectors=False,
            limit=batch_size,
            offset=offset
        )
        if not records:
            break
        # Generate new embeddings
        texts = [r.payload["content"] for r in records]
        new_embeddings = await batch_embed(texts, model=new_model)
        # Write to new collection
        client.upsert(
            collection_name=collection_new,
            points=[PointStruct(id=r.id, vector=emb, payload=r.payload)
                    for r, emb in zip(records, new_embeddings)]
        )
        if offset is None:  # scroll returns a None offset once exhausted
            break
    # Step 3: atomic alias swap (if your system supports aliases)
    # Or: update application config to point to new collection
    # Step 4: delete old collection after validation
Anti-Patterns
# ❌ Using L2 distance with text embeddings
# Text embeddings are typically normalized — use COSINE or DOT_PRODUCT
VectorParams(size=1536, distance=Distance.EUCLID)
# ✅
VectorParams(size=1536, distance=Distance.COSINE)
# ❌ Indexing without payload indexes (slow filtered searches)
# After creating collection, always create payload indexes for filter fields
# ✅
client.create_payload_index("articles", field_name="status", field_schema="keyword")
client.create_payload_index("articles", field_name="tenant_id", field_schema="keyword")
client.create_payload_index("articles", field_name="published_at", field_schema="datetime")
# ❌ ef_construct too low (poor index quality, low recall)
HnswConfigDiff(m=16, ef_construct=50)
# ✅ ef_construct >= 100, typically 200
HnswConfigDiff(m=16, ef_construct=200)
# ❌ SELECT * on vectors in bulk queries (huge data transfer)
client.scroll(collection_name="articles", with_vectors=True) # returns all vectors!
# ✅ Only fetch vectors when needed
client.scroll(collection_name="articles", with_vectors=False, with_payload=True)
# ❌ Storing full document in vector DB (use reference + payload summary)
payload = {"full_text": "...100KB article..."}
# ✅ Store ID + summary, retrieve full content from primary DB
payload = {"article_id": "art_123", "title": "...", "summary": "...200 chars..."}
Quick Reference
Distance Metric:
Text (sentence transformers, OpenAI) → Cosine
Recommendation (matrix factorization) → Dot product
Image embeddings, coordinates → L2 (Euclidean)
HNSW Parameters:
M=16, ef_construction=200 → balanced default
M=32, ef_construction=400 → high recall, higher memory
ef=128 at query time → default
ef=500 at query time → high recall mode
Memory Budget:
float32: dims × 4 bytes per vector
int8: dims × 1 byte (4× compression, ~1% recall loss)
binary: dims / 8 bytes (32× compression, ~5-10% recall loss)
Filtering:
> 10% of data passes filter → post-filter (simple)
< 10% of data passes filter → filtered HNSW (Qdrant native)
Always: create payload index on filter fields
Multi-tenancy:
< 100 large tenants → one collection per tenant
Many small tenants → shared collection + tenant_id filter
Mixed → route large tenants to own collection
Hybrid Search:
Keyword match needed? → add BM25/SPLADE sparse vectors
Fusion method: → Reciprocal Rank Fusion (k=60)
Oversampling factor: → fetch 3× candidates before RRF
Embedding Model Selection:
General English text: → text-embedding-3-small (cost) or E5-large (open)
Multilingual: → BGE-m3 or multilingual-e5-large
Image + text: → CLIP ViT-L/14
Code: → voyage-code-2 or CodeBERT