embeddings-expert
Expert guide to text embeddings: model selection (OpenAI, E5, BGE, BAAI), semantic vs task-specific embeddings, matryoshka dimension reduction, ColBERT late interaction re-ranking, fine-tuning with contrastive loss, chunking strategy, multi-modal CLIP embeddings, batching,
Embeddings Expert
Embeddings are the foundation of modern RAG systems, semantic search, clustering, and recommendation engines. Most developers treat them as a black box — throw text in, get a vector out. Expert usage requires understanding how embedding models differ by task, how chunking strategy affects retrieval quality, and how to measure and improve embedding quality for your specific domain.
Core Mental Model
An embedding maps text to a point in high-dimensional space where semantically similar texts are geometrically close. The key insight: "semantically similar" means different things depending on the task. For retrieval, you want query-document similarity (asymmetric). For clustering, you want topic similarity. For duplicate detection, you want near-identical similarity. Choosing the wrong model for the task — even a "better" model — can degrade your system. Always benchmark on your data for your task.
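A toy illustration of that geometry, using three hand-made 3-D vectors standing in for real embeddings (which have hundreds of dimensions, but obey the same math):

```python
import numpy as np

# Toy 3-D vectors standing in for embeddings of three texts.
cat    = np.array([0.9, 0.1, 0.0])   # "cat"
kitten = np.array([0.8, 0.2, 0.1])   # "kitten" — related meaning
stock  = np.array([0.0, 0.1, 0.9])   # "stock market" — unrelated

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the L2-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, kitten))  # high — related texts sit close together
print(cosine(cat, stock))   # low — unrelated texts sit far apart
```

The same geometric closeness drives every downstream use: retrieval ranks by it, clustering groups by it, and duplicate detection thresholds it.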
Model Selection Guide
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General retrieval, production default | $0.02/1M tokens |
| text-embedding-3-large | 3072 | High-precision retrieval, re-ranking | $0.13/1M tokens |
| BAAI/bge-m3 | 1024 | Multilingual (100 languages), on-prem | Free (self-host) |
| intfloat/e5-large-v2 | 1024 | Strong general retrieval, self-host | Free (self-host) |
| BAAI/bge-large-en-v1.5 | 1024 | English retrieval, strong MTEB scores | Free (self-host) |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Fast, lightweight, low memory | Free (self-host) |
# Benchmark before choosing — MTEB scores are a starting point, not a destination
# Your domain data may rank models very differently
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim
# Always evaluate on your specific corpus and queries
def benchmark_model(model_name: str, queries: dict, corpus: dict, relevant_docs: dict):
model = SentenceTransformer(model_name)
evaluator = InformationRetrievalEvaluator(
queries=queries, # {query_id: query_text}
corpus=corpus, # {doc_id: doc_text}
relevant_docs=relevant_docs, # {query_id: {doc_id, ...}}
score_functions={"cos_sim": cos_sim},
main_score_function="cos_sim",
)
    # evaluator(model) returns the computed IR metrics — compare models on
    # ndcg@10, mrr@10, and recall@10. The exact key layout of the returned
    # dict varies across sentence-transformers versions, so inspect it once.
    return evaluator(model)
Batch Embedding (Production Pattern)
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer
client = OpenAI()
def embed_texts_openai(
texts: list[str],
model: str = "text-embedding-3-small",
dimensions: int = 1536,
batch_size: int = 2048,
) -> np.ndarray:
"""
Production-quality batch embedding with OpenAI.
Handles batching, cleaning, and ordering.
"""
# Preprocess: strip newlines (they degrade embedding quality measurably)
clean_texts = [text.replace("\n", " ").strip() for text in texts]
all_embeddings = [None] * len(clean_texts)
for i in range(0, len(clean_texts), batch_size):
batch = clean_texts[i:i + batch_size]
response = client.embeddings.create(
model=model,
input=batch,
dimensions=dimensions,
)
# Embeddings may not be returned in order — use index
for item in response.data:
all_embeddings[i + item.index] = item.embedding
return np.array(all_embeddings, dtype=np.float32)
def embed_texts_local(
    texts: list[str],
    model_name: str = "BAAI/bge-large-en-v1.5",
    batch_size: int = 64,
    is_query: bool = False,
) -> np.ndarray:
    """Batch embedding with a local sentence-transformers model."""
    model = SentenceTransformer(model_name, device="cuda")
    # Asymmetric models treat queries and passages differently:
    # E5 expects a "query: " / "passage: " prefix on every input;
    # BGE prepends a retrieval instruction to queries only.
    if "e5" in model_name.lower():
        prefix = "query: " if is_query else "passage: "
        texts = [prefix + t for t in texts]
    elif "bge" in model_name.lower() and is_query:
        texts = [f"Represent this sentence for searching relevant passages: {t}"
                 for t in texts]
embeddings = model.encode(
texts,
batch_size=batch_size,
show_progress_bar=True,
normalize_embeddings=True, # L2 normalize for cosine similarity
convert_to_numpy=True,
)
return embeddings
Embedding Cache
import redis
import hashlib
import numpy as np
class EmbeddingCache:
"""Redis-backed embedding cache. Avoid re-embedding unchanged text."""
def __init__(self, redis_url: str, model: str, ttl_seconds: int = 86400 * 30):
self.redis = redis.from_url(redis_url)
self.model = model
self.ttl = ttl_seconds
def _cache_key(self, text: str) -> str:
text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
return f"emb:{self.model}:{text_hash}"
def get(self, text: str) -> np.ndarray | None:
data = self.redis.get(self._cache_key(text))
if data:
return np.frombuffer(data, dtype=np.float32)
return None
def set(self, text: str, embedding: np.ndarray):
self.redis.setex(
self._cache_key(text),
self.ttl,
embedding.astype(np.float32).tobytes(),
)
def embed_with_cache(self, texts: list[str]) -> np.ndarray:
results = {}
to_embed = []
for text in texts:
cached = self.get(text)
if cached is not None:
results[text] = cached
else:
to_embed.append(text)
if to_embed:
new_embeddings = embed_texts_openai(to_embed, model=self.model)
for text, emb in zip(to_embed, new_embeddings):
self.set(text, emb)
results[text] = emb
return np.array([results[t] for t in texts])
Matryoshka Embeddings & Dimension Reduction
# OpenAI's text-embedding-3 models use Matryoshka training
# You can truncate dimensions without fine-tuning — quality degrades gradually
# Full quality: 1536 dims (100% recall)
full_emb = client.embeddings.create(
model="text-embedding-3-small",
input=["Sample text"],
dimensions=1536,
)
# Reduced: 256 dims (~95% recall, 6x smaller) — great for cost-sensitive use cases
small_emb = client.embeddings.create(
model="text-embedding-3-small",
input=["Sample text"],
dimensions=256,
)
# Practical dimension selection:
# 1536: maximum recall, use when retrieval quality is critical
# 512: ~97% of 1536 quality, 3x storage reduction
# 256: ~94% of 1536 quality, 6x storage reduction — often the sweet spot
# 64: ~85% quality — only for very high-volume systems where storage is the bottleneck
# Manual truncation for non-OpenAI models — only safe if the model was
# trained with Matryoshka Representation Learning; truncating a standard
# embedding model degrades quality sharply, so benchmark either way
def truncate_and_normalize(embeddings: np.ndarray, target_dims: int) -> np.ndarray:
    """Truncate, then re-normalize (re-normalization is required after truncation)."""
truncated = embeddings[:, :target_dims]
norms = np.linalg.norm(truncated, axis=1, keepdims=True)
return truncated / norms
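Whether a given dimension works for your workload is an empirical question. One way to check, sketched here with an illustrative helper (`truncation_recall_at_k` is not a library function), is to measure how much of the full-dimension top-k survives truncation:

```python
import numpy as np

def truncation_recall_at_k(doc_embs: np.ndarray, query_embs: np.ndarray,
                           target_dims: int, k: int = 10) -> float:
    """Fraction of each query's full-dimension top-k that is still retrieved
    after truncating both queries and documents to target_dims."""
    def top_k(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
        # Normalize rows so the dot product is cosine similarity
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        return np.argsort(-(q @ d.T), axis=1)[:, :k]

    full = top_k(query_embs, doc_embs)
    small = top_k(query_embs[:, :target_dims], doc_embs[:, :target_dims])
    overlap = [len(set(f) & set(s)) / k for f, s in zip(full, small)]
    return float(np.mean(overlap))
```

Run this on a sample of real queries and documents before flipping a production index to a smaller dimension.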
ColBERT Late Interaction Re-ranking
ColBERT keeps one embedding per token for both queries and documents, then computes similarity at query time via late interaction: each query token is matched to its most similar document token (MaxSim), and the per-token maxima are summed. Storage and compute cost more than a single-vector bi-encoder, but ranking quality is significantly better.
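The MaxSim scoring itself is only a couple of lines; a toy NumPy sketch (real ColBERT adds query augmentation and masking on top of this):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token picks its best-matching document
    token (MaxSim); the per-token maxima are summed into the final score.
    Both inputs are (num_tokens, dim) with L2-normalized rows."""
    token_sims = query_tokens @ doc_tokens.T    # (n_query, n_doc) cosine matrix
    return float(token_sims.max(axis=1).sum())  # best doc token per query token
```

Because every token keeps its own vector, fine-grained matches (a rare term, a number, a name) are not washed out by pooling the way they are in a single-vector embedding.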
from ragatouille import RAGPretrainedModel
# RAGatouille: easy ColBERT wrapper
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index documents
RAG.index(
collection=documents, # List of strings
index_name="my_corpus",
max_document_length=256, # Chunk size for ColBERT
split_documents=True,
)
# Retrieve (higher recall than cosine similarity alone)
results = RAG.search("What are the safety requirements?", k=10)
# Practical pattern: bi-encoder for initial retrieval, ColBERT for re-ranking
def hybrid_retrieval(query: str, top_k_initial: int = 50, top_k_final: int = 10):
    # Stage 1: fast, approximate bi-encoder retrieval
    # (vector_store stands in for your ANN index — not defined here)
query_emb = embed_texts_openai([query])[0]
initial_results = vector_store.search(query_emb, k=top_k_initial)
# Stage 2: ColBERT re-ranking on initial results (exact)
reranked = RAG.rerank(
query=query,
documents=[r.text for r in initial_results],
k=top_k_final,
)
return reranked
Fine-Tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Prepare training data. The format must match the loss function:
# MultipleNegativesRankingLoss takes (anchor, positive) pairs or
# (anchor, positive, hard_negative) triplets — no float labels.
# (Labeled pairs with a similarity score are for CosineSimilarityLoss.)
# Keep one format per dataset so batches collate cleanly.
train_examples = [
    # (anchor, positive, hard negative) triplets
    InputExample(texts=[
        "user account suspended",            # anchor
        "account has been deactivated",      # positive (same meaning)
        "account created successfully",      # hard negative (superficially similar)
    ]),
    InputExample(texts=[
        "How do I reset my password?",       # anchor (query)
        "To reset your password, click...",  # positive (relevant doc)
        "Our password policy requires...",   # hard negative (not relevant to this query)
    ]),
]
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# MultipleNegativesRankingLoss: best for retrieval — other positives in the
# batch act as additional in-batch negatives, so larger batches help
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
show_progress_bar=True,
output_path="./fine-tuned-embeddings",
)
Chunking Strategy
Chunking dramatically impacts retrieval quality. There is no universal best strategy — it depends on document structure and query type.
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
# Strategy 1: Fixed-size chunking with overlap (default, works everywhere)
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Characters per chunk
chunk_overlap=64, # Overlap to avoid cutting context at boundaries
separators=["\n\n", "\n", ". ", " ", ""], # Try paragraph → sentence → word
)
# Strategy 2: Semantic chunking (group sentences by topic)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where sentence-to-sentence distance exceeds the 95th percentile
)
# Strategy 3: Structure-aware for Markdown (respect headers)
md_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "Section"),
("##", "Subsection"),
("###", "Sub-subsection"),
],
)
# Parent-child indexing pattern (retrieve small chunks, return parent for context)
from langchain_core.documents import Document
import uuid

def build_parent_child_index(documents: list[Document]):
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    parent_docs = parent_splitter.split_documents(documents)
    child_docs = []
    for parent in parent_docs:
        parent.metadata["id"] = str(uuid.uuid4())
        children = child_splitter.split_documents([parent])
        for child in children:
            child.metadata["parent_id"] = parent.metadata["id"]
        child_docs.extend(children)
    # Index children for retrieval; return parents for generation context
    return parent_docs, child_docs
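At query time the child hits are mapped back to their parents. A minimal sketch with plain dicts (`retrieve_with_parent_context` and the hit/parent shapes are illustrative, not a library API):

```python
def retrieve_with_parent_context(query_hits: list[dict], parents: dict[str, str],
                                 max_parents: int = 4) -> list[str]:
    """Map child-chunk hits back to their deduplicated parent chunks,
    preserving rank order.
    query_hits: child chunks as {"text": ..., "parent_id": ...}, best first.
    parents: parent_id -> parent text."""
    seen: set[str] = set()
    context: list[str] = []
    for hit in query_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
        if len(context) >= max_parents:
            break
    return context
```

The dedup step matters: several top-ranked children often share one parent, and returning that parent once keeps the generation context compact.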
Visualization with UMAP
import umap
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(embeddings: np.ndarray, labels: list[str],
                         colors: list | None = None) -> None:
"""
UMAP projection to 2D for debugging embedding quality.
Clusters of same-label points = good embeddings.
Mixed clusters = embedding model not capturing the right features.
"""
reducer = umap.UMAP(
n_components=2,
metric="cosine",
n_neighbors=15,
min_dist=0.1,
random_state=42,
)
projected = reducer.fit_transform(embeddings)
plt.figure(figsize=(12, 8))
    plt.scatter(
        projected[:, 0], projected[:, 1],
        c=colors if colors is not None else range(len(labels)),
cmap="tab20",
alpha=0.7,
s=50,
)
# Annotate a sample of points
for i in range(0, len(labels), max(1, len(labels) // 50)):
plt.annotate(labels[i][:30], (projected[i, 0], projected[i, 1]),
fontsize=6, alpha=0.8)
plt.title("Embedding Space Visualization (UMAP)")
plt.savefig("embeddings_umap.png", dpi=150, bbox_inches="tight")
plt.show()
Anti-Patterns
❌ Embedding long documents without chunking
A 10,000-word document embedded as one vector loses detail — the embedding averages across all topics. Chunk first; embed chunks.
❌ Ignoring query/passage prefixes with asymmetric models
E5 and BGE have distinct query and passage prefixes for a reason. Embedding queries with the passage: prefix (or with no prefix at all) degrades retrieval significantly.
❌ Dot-product "cosine" similarity without L2 normalization
Dot product equals cosine similarity only on L2-normalized vectors. Computed on unnormalized embeddings, scores are biased toward long vectors and rankings come out wrong.
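The length bias is easy to see with toy vectors (illustrative values, not real embeddings):

```python
import numpy as np

query     = np.array([1.0, 0.0])
doc_close = np.array([0.9, 0.1])  # nearly the same direction, small magnitude
doc_far   = np.array([5.0, 5.0])  # 45 degrees away, large magnitude

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Raw dot product favors the long vector: the wrong document wins.
print(query @ doc_far, ">", query @ doc_close)
# Cosine (dot product after L2 normalization) ranks by direction: the right document wins.
print(cosine(query, doc_close), ">", cosine(query, doc_far))
```

Normalizing at embedding time (as `normalize_embeddings=True` does above) makes plain dot product safe and fast.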
❌ Not cleaning text before embedding
Newlines, excess whitespace, HTML tags, and encoding artifacts all affect embedding quality. Strip them before embedding.
❌ Embedding at query time without caching
If the same query text recurs frequently (e.g., FAQ questions), re-embedding on every request wastes money and adds latency. Cache embeddings keyed by content hash.
❌ Choosing dimensions based on storage alone
Reducing from 1536 to 256 saves storage but affects recall. Measure recall@10 before and after dimension reduction on your actual query workload.
Quick Reference
Model decision:
Production, cost-sensitive → text-embedding-3-small (1536 dims)
High-precision retrieval → text-embedding-3-large (3072 dims) or BGE-large
Multilingual → BAAI/bge-m3 (100 languages)
On-prem required → BAAI/bge-large-en-v1.5
Chunking heuristics:
Dense prose text → 512 chars, 64 overlap
Structured docs (Markdown) → Header-based splitting
Q&A documents → Semantic chunking or by QA pair
Code → Function/class level splitting
Retrieval architecture:
Exact search, smaller corpora → FAISS flat index + cosine sim
Low latency at scale → HNSW (Pinecone, Weaviate, Qdrant)
Maximum recall → Bi-encoder → ColBERT re-rank
Hybrid → BM25 + dense retrieval + reciprocal rank fusion
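The reciprocal rank fusion step mentioned above is only a few lines; a minimal sketch (k=60 is the conventional constant from the original RRF formulation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. BM25 and dense retrieval) by summing
    1 / (k + rank) for each document across all lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it needs no score calibration between BM25 and the dense retriever, which is why it is the default fusion choice for hybrid search.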
Cost comparison (1M tokens):
text-embedding-3-small → $0.02
text-embedding-3-large → $0.13
BGE-large (self-hosted) → GPU cost only (~$0.001 on A100)