embeddings-expert
Expert guide to text embeddings: model selection (OpenAI, E5, BGE, BAAI), semantic vs task-specific embeddings, matryoshka dimension reduction, ColBERT late interaction re-ranking, fine-tuning with contrastive loss, chunking strategy, multi-modal CLIP embeddings, batching,
Embeddings Expert
Embeddings are the foundation of modern RAG systems, semantic search, clustering, and recommendation engines. Most developers treat them as a black box — throw text in, get a vector out. Expert usage requires understanding how embedding models differ by task, how chunking strategy affects retrieval quality, and how to measure and improve embedding quality for your specific domain.
Core Mental Model
An embedding maps text to a point in high-dimensional space where semantically similar texts are geometrically close. The key insight: "semantically similar" means different things depending on the task. For retrieval, you want query-document similarity (asymmetric). For clustering, you want topic similarity. For duplicate detection, you want near-identical similarity. Choosing the wrong model for the task — even a "better" model — can degrade your system. Always benchmark on your data for your task.
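A toy illustration of that geometry, using three hand-made 3-D vectors standing in for real embeddings (which have hundreds of dimensions, but obey the same math):

```python
import numpy as np

# Toy 3-D vectors standing in for embeddings of three texts.
cat    = np.array([0.9, 0.1, 0.0])   # "cat"
kitten = np.array([0.8, 0.2, 0.1])   # "kitten" — related meaning
stock  = np.array([0.0, 0.1, 0.9])   # "stock market" — unrelated

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the L2-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, kitten))  # high — related texts sit close together
print(cosine(cat, stock))   # low — unrelated texts sit far apart
```

The same geometric closeness drives every downstream use: retrieval ranks by it, clustering groups by it, and duplicate detection thresholds it.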
Model Selection Guide
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General retrieval, production default | $0.02/1M tokens |
| text-embedding-3-large | 3072 | High-precision retrieval, re-ranking | $0.13/1M tokens |
| BAAI/bge-m3 | 1024 | Multilingual (100 languages), on-prem | Free (self-host) |
| intfloat/e5-large-v2 | 1024 | Strong general retrieval, self-host | Free (self-host) |
| BAAI/bge-large-en-v1.5 | 1024 | English retrieval, strong MTEB scores | Free (self-host) |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Fast, lightweight, low memory | Free (self-host) |
# Benchmark before choosing — MTEB scores are a starting point, not a destination
# Your domain data may rank models very differently
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim
# Always evaluate on your specific corpus and queries
def benchmark_model(model_name: str, queries: dict, corpus: dict, relevant_docs: dict):
model = SentenceTransformer(model_name)
evaluator = InformationRetrievalEvaluator(
queries=queries, # {query_id: query_text}
corpus=corpus, # {doc_id: doc_text}
relevant_docs=relevant_docs, # {query_id: {doc_id, ...}}
score_functions={"cos_sim": cos_sim},
main_score_function="cos_sim",
)
    # evaluator(model) returns the computed IR metrics — compare models on
    # ndcg@10, mrr@10, and recall@10. The exact key layout of the returned
    # dict varies across sentence-transformers versions, so inspect it once.
    return evaluator(model)
Batch Embedding (Production Pattern)
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer
client = OpenAI()
def embed_texts_openai(
texts: list[str],
model: str = "text-embedding-3-small",
dimensions: int = 1536,
batch_size: int = 2048,
) -> np.ndarray:
"""
Production-quality batch embedding with OpenAI.
Handles batching, cleaning, and ordering.
"""
# Preprocess: strip newlines (they degrade embedding quality measurably)
clean_texts = [text.replace("\n", " ").strip() for text in texts]
all_embeddings = [None] * len(clean_texts)
for i in range(0, len(clean_texts), batch_size):
batch = clean_texts[i:i + batch_size]
response = client.embeddings.create(
model=model,
input=batch,
dimensions=dimensions,
)
# Embeddings may not be returned in order — use index
for item in response.data:
all_embeddings[i + item.index] = item.embedding
return np.array(all_embeddings, dtype=np.float32)
def embed_texts_local(
    texts: list[str],
    model_name: str = "BAAI/bge-large-en-v1.5",
    batch_size: int = 64,
    is_query: bool = False,
) -> np.ndarray:
    """Batch embedding with a local sentence-transformers model."""
    model = SentenceTransformer(model_name, device="cuda")
    # Asymmetric models treat queries and passages differently:
    # E5 expects a "query: " / "passage: " prefix on every input;
    # BGE prepends a retrieval instruction to queries only.
    if "e5" in model_name.lower():
        prefix = "query: " if is_query else "passage: "
        texts = [prefix + t for t in texts]
    elif "bge" in model_name.lower() and is_query:
        texts = [f"Represent this sentence for searching relevant passages: {t}"
                 for t in texts]
embeddings = model.encode(
texts,
batch_size=batch_size,
show_progress_bar=True,
normalize_embeddings=True, # L2 normalize for cosine similarity
convert_to_numpy=True,
)
return embeddings
Embedding Cache
import redis
import hashlib
import numpy as np
class EmbeddingCache:
"""Redis-backed embedding cache. Avoid re-embedding unchanged text."""
def __init__(self, redis_url: str, model: str, ttl_seconds: int = 86400 * 30):
self.redis = redis.from_url(redis_url)
self.model = model
self.ttl = ttl_seconds
def _cache_key(self, text: str) -> str:
text_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
return f"emb:{self.model}:{text_hash}"
def get(self, text: str) -> np.ndarray | None:
data = self.redis.get(self._cache_key(text))
if data:
return np.frombuffer(data, dtype=np.float32)
return None
def set(self, text: str, embedding: np.ndarray):
self.redis.setex(
self._cache_key(text),
self.ttl,
embedding.astype(np.float32).tobytes(),
)
def embed_with_cache(self, texts: list[str]) -> np.ndarray:
results = {}
to_embed = []
for text in texts:
cached = self.get(text)
if cached is not None:
results[text] = cached
else:
to_embed.append(text)
if to_embed:
new_embeddings = embed_texts_openai(to_embed, model=self.model)
for text, emb in zip(to_embed, new_embeddings):
self.set(text, emb)
results[text] = emb
return np.array([results[t] for t in texts])
Matryoshka Embeddings & Dimension Reduction
# OpenAI's text-embedding-3 models use Matryoshka training
# You can truncate dimensions without fine-tuning — quality degrades gradually
# Full quality: 1536 dims (100% recall)
full_emb = client.embeddings.create(
model="text-embedding-3-small",
input=["Sample text"],
dimensions=1536,
)
# Reduced: 256 dims (~95% recall, 6x smaller) — great for cost-sensitive use cases
small_emb = client.embeddings.create(
model="text-embedding-3-small",
input=["Sample text"],
dimensions=256,
)
# Practical dimension selection:
# 1536: maximum recall, use when retrieval quality is critical
# 512: ~97% of 1536 quality, 3x storage reduction
# 256: ~94% of 1536 quality, 6x storage reduction — often the sweet spot
# 64: ~85% quality — only for very high-volume systems where storage is the bottleneck
# Manual truncation for non-OpenAI models — only safe if the model was
# trained with Matryoshka Representation Learning; truncating a standard
# embedding model degrades quality sharply, so benchmark either way
def truncate_and_normalize(embeddings: np.ndarray, target_dims: int) -> np.ndarray:
    """Truncate, then re-normalize (re-normalization is required after truncation)."""
truncated = embeddings[:, :target_dims]
norms = np.linalg.norm(truncated, axis=1, keepdims=True)
return truncated / norms
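Whether a given dimension works for your workload is an empirical question. One way to check, sketched here with an illustrative helper (`truncation_recall_at_k` is not a library function), is to measure how much of the full-dimension top-k survives truncation:

```python
import numpy as np

def truncation_recall_at_k(doc_embs: np.ndarray, query_embs: np.ndarray,
                           target_dims: int, k: int = 10) -> float:
    """Fraction of each query's full-dimension top-k that is still retrieved
    after truncating both queries and documents to target_dims."""
    def top_k(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
        # Normalize rows so the dot product is cosine similarity
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        return np.argsort(-(q @ d.T), axis=1)[:, :k]

    full = top_k(query_embs, doc_embs)
    small = top_k(query_embs[:, :target_dims], doc_embs[:, :target_dims])
    overlap = [len(set(f) & set(s)) / k for f, s in zip(full, small)]
    return float(np.mean(overlap))
```

Run this on a sample of real queries and documents before flipping a production index to a smaller dimension.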
ColBERT Late Interaction Re-ranking
ColBERT keeps one embedding per token for both queries and documents, then computes similarity at query time via late interaction: each query token is matched to its most similar document token (MaxSim), and the per-token maxima are summed. Storage and compute cost more than a single-vector bi-encoder, but ranking quality is significantly better.
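The MaxSim scoring itself is only a couple of lines; a toy NumPy sketch (real ColBERT adds query augmentation and masking on top of this):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token picks its best-matching document
    token (MaxSim); the per-token maxima are summed into the final score.
    Both inputs are (num_tokens, dim) with L2-normalized rows."""
    token_sims = query_tokens @ doc_tokens.T    # (n_query, n_doc) cosine matrix
    return float(token_sims.max(axis=1).sum())  # best doc token per query token
```

Because every token keeps its own vector, fine-grained matches (a rare term, a number, a name) are not washed out by pooling the way they are in a single-vector embedding.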
from ragatouille import RAGPretrainedModel
# RAGatouille: easy ColBERT wrapper
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index documents
RAG.index(
collection=documents, # List of strings
index_name="my_corpus",
max_document_length=256, # Chunk size for ColBERT
split_documents=True,
)
# Retrieve (higher recall than cosine similarity alone)
results = RAG.search("What are the safety requirements?", k=10)
# Practical pattern: bi-encoder for initial retrieval, ColBERT for re-ranking
def hybrid_retrieval(query: str, top_k_initial: int = 50, top_k_final: int = 10):
    # Stage 1: fast, approximate bi-encoder retrieval
    # (vector_store stands in for your ANN index — not defined here)
query_emb = embed_texts_openai([query])[0]
initial_results = vector_store.search(query_emb, k=top_k_initial)
# Stage 2: ColBERT re-ranking on initial results (exact)
reranked = RAG.rerank(
query=query,
documents=[r.text for r in initial_results],
k=top_k_final,
)
return reranked
Fine-Tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Prepare training data. The format must match the loss function:
# MultipleNegativesRankingLoss takes (anchor, positive) pairs or
# (anchor, positive, hard_negative) triplets — no float labels.
# (Labeled pairs with a similarity score are for CosineSimilarityLoss.)
# Keep one format per dataset so batches collate cleanly.
train_examples = [
    # (anchor, positive, hard negative) triplets
    InputExample(texts=[
        "user account suspended",            # anchor
        "account has been deactivated",      # positive (same meaning)
        "account created successfully",      # hard negative (superficially similar)
    ]),
    InputExample(texts=[
        "How do I reset my password?",       # anchor (query)
        "To reset your password, click...",  # positive (relevant doc)
        "Our password policy requires...",   # hard negative (not relevant to this query)
    ]),
]
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# MultipleNegativesRankingLoss: best for retrieval — other positives in the
# batch act as additional in-batch negatives, so larger batches help
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
show_progress_bar=True,
output_path="./fine-tuned-embeddings",
)
Chunking Strategy
Chunking dramatically impacts retrieval quality. There is no universal best strategy — it depends on document structure and query type.
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
# Strategy 1: Fixed-size chunking with overlap (default, works everywhere)
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Characters per chunk
chunk_overlap=64, # Overlap to avoid cutting context at boundaries
separators=["\n\n", "\n", ". ", " ", ""], # Try paragraph → sentence → word
)
# Strategy 2: Semantic chunking (group sentences by topic)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where sentence-to-sentence distance exceeds the 95th percentile
)
# Strategy 3: Structure-aware for Markdown (respect headers)
md_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "Section"),
("##", "Subsection"),
("###", "Sub-subsection"),
],
)
# Parent-child indexing pattern (retrieve small chunks, return parent for context)
from langchain_core.documents import Document
import uuid

def build_parent_child_index(documents: list[Document]):
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    parent_docs = parent_splitter.split_documents(documents)
    child_docs = []
    for parent in parent_docs:
        parent.metadata["id"] = str(uuid.uuid4())
        children = child_splitter.split_documents([parent])
        for child in children:
            child.metadata["parent_id"] = parent.metadata["id"]
        child_docs.extend(children)
    # Index children for retrieval; return parents for generation context
    return parent_docs, child_docs
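At query time the child hits are mapped back to their parents. A minimal sketch with plain dicts (`retrieve_with_parent_context` and the hit/parent shapes are illustrative, not a library API):

```python
def retrieve_with_parent_context(query_hits: list[dict], parents: dict[str, str],
                                 max_parents: int = 4) -> list[str]:
    """Map child-chunk hits back to their deduplicated parent chunks,
    preserving rank order.
    query_hits: child chunks as {"text": ..., "parent_id": ...}, best first.
    parents: parent_id -> parent text."""
    seen: set[str] = set()
    context: list[str] = []
    for hit in query_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
        if len(context) >= max_parents:
            break
    return context
```

The dedup step matters: several top-ranked children often share one parent, and returning that parent once keeps the generation context compact.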
Visualization with UMAP
import umap
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(embeddings: np.ndarray, labels: list[str],
                         colors: list | None = None) -> None:
"""
UMAP projection to 2D for debugging embedding quality.
Clusters of same-label points = good embeddings.
Mixed clusters = embedding model not capturing the right features.
"""
reducer = umap.UMAP(
n_components=2,
metric="cosine",
n_neighbors=15,
min_dist=0.1,
random_state=42,
)
projected = reducer.fit_transform(embeddings)
plt.figure(figsize=(12, 8))
    plt.scatter(
        projected[:, 0], projected[:, 1],
        c=colors if colors is not None else range(len(labels)),
cmap="tab20",
alpha=0.7,
s=50,
)
# Annotate a sample of points
for i in range(0, len(labels), max(1, len(labels) // 50)):
plt.annotate(labels[i][:30], (projected[i, 0], projected[i, 1]),
fontsize=6, alpha=0.8)
plt.title("Embedding Space Visualization (UMAP)")
plt.savefig("embeddings_umap.png", dpi=150, bbox_inches="tight")
plt.show()
Anti-Patterns
❌ Embedding long documents without chunking
A 10,000-word document embedded as one vector loses detail — the embedding averages across all topics. Chunk first; embed chunks.
❌ Ignoring query/passage prefixes with asymmetric models
E5 and BGE have distinct query and passage prefixes for a reason. Embedding queries with the passage: prefix (or with no prefix at all) degrades retrieval significantly.
❌ Dot-product "cosine" similarity without L2 normalization
Dot product equals cosine similarity only on L2-normalized vectors. Computed on unnormalized embeddings, scores are biased toward long vectors and rankings come out wrong.
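The length bias is easy to see with toy vectors (illustrative values, not real embeddings):

```python
import numpy as np

query     = np.array([1.0, 0.0])
doc_close = np.array([0.9, 0.1])  # nearly the same direction, small magnitude
doc_far   = np.array([5.0, 5.0])  # 45 degrees away, large magnitude

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Raw dot product favors the long vector: the wrong document wins.
print(query @ doc_far, ">", query @ doc_close)
# Cosine (dot product after L2 normalization) ranks by direction: the right document wins.
print(cosine(query, doc_close), ">", cosine(query, doc_far))
```

Normalizing at embedding time (as `normalize_embeddings=True` does above) makes plain dot product safe and fast.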
❌ Not cleaning text before embedding
Newlines, excess whitespace, HTML tags, and encoding artifacts all affect embedding quality. Strip them before embedding.
❌ Embedding at query time without caching
If the same query text recurs frequently (e.g., FAQ questions), re-embedding on every request wastes money and adds latency. Cache embeddings keyed by content hash.
❌ Choosing dimensions based on storage alone
Reducing from 1536 to 256 saves storage but affects recall. Measure recall@10 before and after dimension reduction on your actual query workload.
Quick Reference
Model decision:
Production, cost-sensitive → text-embedding-3-small (1536 dims)
High-precision retrieval → text-embedding-3-large (3072 dims) or BGE-large
Multilingual → BAAI/bge-m3 (100 languages)
On-prem required → BAAI/bge-large-en-v1.5
Chunking heuristics:
Dense prose text → 512 chars, 64 overlap
Structured docs (Markdown) → Header-based splitting
Q&A documents → Semantic chunking or by QA pair
Code → Function/class level splitting
Retrieval architecture:
Exact search, smaller corpora → FAISS flat index + cosine sim
Low latency at scale → HNSW (Pinecone, Weaviate, Qdrant)
Maximum recall → Bi-encoder → ColBERT re-rank
Hybrid → BM25 + dense retrieval + reciprocal rank fusion
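The reciprocal rank fusion step mentioned above is only a few lines; a minimal sketch (k=60 is the conventional constant from the original RRF formulation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. BM25 and dense retrieval) by summing
    1 / (k + rank) for each document across all lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it needs no score calibration between BM25 and the dense retriever, which is why it is the default fusion choice for hybrid search.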
Cost comparison (1M tokens):
text-embedding-3-small → $0.02
text-embedding-3-large → $0.13
BGE-large (self-hosted) → GPU cost only (~$0.001 on A100)