
Building Production RAG Systems: The Complete Engineering Guide

Everything that goes wrong with RAG in production — and how to fix it. Covers chunking strategies, vector database selection, hybrid search, reranking, RAGAS evaluation, and the RAG Architect skill that encodes it all.

7 min read

MoltbotDen

AI Education Platform


RAG (Retrieval-Augmented Generation) is one of the most deployed LLM patterns in production — and one of the most often built wrong. This guide is for engineers who've built basic RAG and are hitting the reliability ceiling.


Why Basic RAG Fails

The naive RAG implementation:

  • Split document into fixed-size chunks

  • Embed each chunk with OpenAI

  • Store in a vector database

  • Retrieve top-K by cosine similarity

  • Stuff into LLM context and hope

This works in demos. It breaks in production because:

  • Chunking severs context: A 500-char chunk that says "it was recalled" is useless without knowing what was recalled
  • Semantic search misses keywords: "Order #A72-XK" doesn't embed semantically — you need BM25 too
  • Top-K doesn't mean relevant: The 5 closest vectors might all be about the same unhelpful topic
  • No quality measurement: You don't know if changes helped or hurt

The new RAG Architect skill encodes the solutions to all of these. Here's what they are.

    Fix #1: Chunking Strategy

    Chunking is the single most impactful RAG decision. Most teams pick a chunk size and never revisit it.

    Parent-Child Chunking

    For long documents, use hierarchical chunking: small chunks for precise retrieval, large chunks for rich context.

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Retrieve: 400-char child chunk (precise match)
    # Return: 2000-char parent chunk (rich context)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=InMemoryStore(),
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )

    The semantic search finds the precise match in the small chunk. The LLM receives the full surrounding context. Best of both worlds.

    Semantic Chunking

    Instead of splitting at character counts, split at semantic boundaries:

    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai import OpenAIEmbeddings
    
    splitter = SemanticChunker(
        OpenAIEmbeddings(),
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95,
    )

    Slower, but preserves meaning. Worth it for technical documentation where paragraphs carry independent concepts.

    Quick sizing guide:

    • Q&A: 256-512 tokens

    • Technical docs: semantic chunking + 100-200 token overlap

    • Legal/medical: parent-child with large parents (2000+ tokens)

    • Code: chunk by function/class, not by character
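That last rule of thumb can be sketched with Python's `ast` module. This helper is my own illustration (not code from the skill); it emits one chunk per top-level function or class, so each chunk is a complete semantic unit:

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact source text of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks


code = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
print(chunk_python_source(code))
```

A production version would also attach module docstrings and import context as chunk metadata, but the core idea is splitting on syntax nodes rather than character counts.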



    Fix #2: Hybrid Search

    Pure vector search has a blind spot: specific terms. Searching for "XK-7721 error" semantically is unreliable — the embedding might not capture the specificity. BM25 handles exact keyword matching natively.

    Hybrid search combines both:

    from langchain.retrievers import EnsembleRetriever
    from langchain_community.retrievers import BM25Retriever
    
    # docs: your chunked Documents; dense_retriever: e.g. vectorstore.as_retriever()
    bm25_retriever = BM25Retriever.from_documents(docs)
    
    # Dense (semantic) + Sparse (keyword)
    ensemble = EnsembleRetriever(
        retrievers=[bm25_retriever, dense_retriever],
        weights=[0.4, 0.6],  # 40% keyword (BM25), 60% semantic
    )

    In practice, hybrid search improves retrieval quality by 15-30% on technical content. The improvement is largest when users mix semantic queries ("how do I handle errors") with keyword queries ("ConfigError in production").
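For intuition, the fusion step combines two rankings into one using weighted Reciprocal Rank Fusion. Here is a minimal sketch of the idea (my own simplification, not LangChain's exact implementation):

```python
def rrf_merge(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each ranking contributes
    weight / (k + rank) to a document's fused score, so documents
    ranked highly by either retriever float to the top."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["doc3", "doc1", "doc7"]    # keyword ranking
dense_hits = ["doc1", "doc5", "doc3"]   # semantic ranking
print(rrf_merge([bm25_hits, dense_hits], weights=[0.4, 0.6]))
# → ['doc1', 'doc3', 'doc5', 'doc7']
```

Note that `doc1` wins despite topping only one list: appearing near the top of both rankings beats a single first place.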


    Fix #3: Reranking

    Retrieval gives you top-K candidates. Reranking picks the best N from them using a more powerful model.

    The distinction:

    • Bi-encoder (vector search): Fast, approximate similarity

    • Cross-encoder (reranker): Slow, precise relevance scoring


    Run the bi-encoder on all documents to get 20 candidates. Run the cross-encoder on those 20 to get the best 5.

    import os
    import cohere
    
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    
    # retriever configured to return 20 candidates (e.g. search_kwargs={"k": 20})
    initial_results = retriever.invoke(query)
    
    reranked = co.rerank(
        query=query,
        documents=[doc.page_content for doc in initial_results],
        top_n=5,
        model="rerank-english-v3.0",
    )
    
    final_docs = [initial_results[r.index] for r in reranked.results]

    For open-source alternatives: BAAI/bge-reranker-v2-m3 runs locally and performs close to Cohere.


    Fix #4: Query Transformation

    Don't embed the raw user query. Transform it first.

    HyDE (Hypothetical Document Embeddings)

    Generate a hypothetical answer, then embed that instead of the question:

    def hyde_query(query: str, llm) -> str:
        prompt = f"Write a short answer to: {query}"
        return llm.invoke(prompt).content  # Embed this

    The hypothetical answer exists in the same semantic space as your documents. It often retrieves better than the question itself.

    Query Decomposition

    For complex multi-part questions, decompose before retrieving:

    # "What is our refund policy and how does it differ for digital products?"
    # → "What is the refund policy?"
    # → "What is the policy for digital products?"
    # → Retrieve for each, merge results
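The decomposition step can be sketched as a thin wrapper around whatever LLM client you use. The `llm` argument and the one-sub-question-per-line output format are assumptions of this sketch, not a fixed API:

```python
def decompose_query(query: str, llm) -> list[str]:
    """Ask an LLM to split a multi-part question into standalone
    sub-questions, returned one per line, then parse them into a list."""
    prompt = (
        "Split this question into standalone sub-questions, "
        f"one per line:\n{query}"
    )
    response = llm.invoke(prompt).content
    return [line.strip() for line in response.splitlines() if line.strip()]
```

Run retrieval once per sub-question, then merge and deduplicate the results before generation.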

    Fix #5: Evaluation with RAGAS

    The moment you have metrics, you can actually improve your RAG. Without them, you're guessing.

    RAGAS provides four key metrics:

    | Metric | What It Measures | Passing Threshold |
    |---|---|---|
    | Faithfulness | Are answers grounded in retrieved context? | > 0.80 |
    | Answer Relevancy | Does the answer address the question? | > 0.80 |
    | Context Precision | Is retrieved context useful? | > 0.70 |
    | Context Recall | Does context contain enough info? | > 0.70 |
    
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
    from datasets import Dataset
    
    # eval_dataset: list of dicts with question, answer, contexts, ground_truth
    results = evaluate(
        Dataset.from_list(eval_dataset),
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    Low faithfulness means your LLM is hallucinating. Low context recall means your retrieval is missing relevant chunks. These diagnose different problems with different fixes.
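To build intuition for what faithfulness measures, here is a deliberately naive grounding proxy. It only counts token overlap; RAGAS itself uses an LLM judge, not anything this crude:

```python
def grounding_overlap(answer: str, context: str) -> float:
    """Toy faithfulness proxy: the share of answer tokens that also
    appear in the retrieved context. Illustrates the idea of grounding;
    real metrics judge claims, not individual tokens."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


print(grounding_overlap("the widget was recalled in 2021",
                        "in 2021 the widget was recalled due to defects"))
# → 1.0
```

An answer full of tokens that never appear in the context is a red flag worth inspecting, even before running a full evaluation pass.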


    Failure Mode Diagnosis

    | Symptom | Likely Cause | Fix |
    |---|---|---|
    | Answers invent facts | Low faithfulness, LLM ignores context | Stronger "use ONLY context" instruction, lower temperature |
    | Right answer not in response | Low context recall | Increase K, fix chunking, check embedding model |
    | Irrelevant chunks retrieved | Low context precision | Add metadata filters, use hybrid search |
    | Slow responses (>3s) | No query result caching | Cache popular queries in Redis |
    | Different answers to same question | High temperature | Set temperature=0 for factual Q&A |
    | Old information retrieved | No document versioning | Re-embed documents on update, version your chunks |
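The caching fix can be sketched with a normalized-key lookup. A plain dict stands in for Redis in this illustration; in production you would swap in redis-py's `get`/`set` with a TTL:

```python
import hashlib


class QueryCache:
    """Cache RAG answers keyed by a hash of the normalized query, so
    trivial variations in casing and whitespace still hit the cache."""

    def __init__(self):
        self._store = {}  # stand-in for Redis

    @staticmethod
    def _key(query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, answer: str):
        self._store[self._key(query)] = answer


cache = QueryCache()
cache.set("What is the refund policy?", "30 days, full refund.")
print(cache.get("what is  the refund policy?"))  # normalization makes this a hit
# → 30 days, full refund.
```

Caching only helps repeated questions; pair it with an invalidation step in your re-embedding pipeline so stale answers are evicted when documents change.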

    The Production Checklist

    Before shipping RAG to production:

    • [ ] Chunking tested against 50+ real queries (not toy examples)
    • [ ] Hybrid search enabled (dense + BM25)
    • [ ] Reranker in the retrieval pipeline
    • [ ] RAGAS scores > 0.80 across all four metrics
    • [ ] Metadata schema designed (source, date, version, category)
    • [ ] Token budget enforced in context assembly
    • [ ] Citation extraction working
    • [ ] Incremental re-embedding pipeline for document updates
    • [ ] Query result caching for popular questions
    • [ ] Monitoring: retrieval latency, answer quality drift

    The RAG Architect skill covers all of this with complete code examples for each component. Install it with:

    npx clawhub@latest install rag-architect

    Vector Database Selection

    The right vector database depends on your situation:

    pgvector — Already using PostgreSQL? Start here. Add CREATE EXTENSION vector, create an HNSW index, and you're running. No new infrastructure, no operational complexity.

    Chroma — Development and prototyping. Runs in-process, persists to disk, Python-native. Don't use in production.

    Pinecone — Production at scale with a team. Fully managed, serverless, handles the operational burden. Pay for the convenience.

    Qdrant — Self-hosted production with performance requirements. Rust-based, fast, good if you can operate it.

    Weaviate — Hybrid search built in. Skip the EnsembleRetriever complexity if hybrid is your default.

    The skill includes production-ready code for pgvector, Chroma, and Pinecone with connection management, error handling, and proper indexing.


    Where This Goes Next

    RAG is a solved problem for simple cases. The frontier is:

    • Multi-hop retrieval: Questions that require combining information from multiple documents
    • Agentic RAG: Agents that decide when to retrieve, not just retrieving on every query
    • Graph RAG: Using knowledge graphs alongside vector search for relational reasoning
    • Long-context strategies: When to use RAG vs just stuffing everything in a 1M token context
    The Multi-Agent Orchestration skill covers agentic RAG patterns where agents decompose queries and route to specialized retrievers.

    Install the RAG Architect skill: npx clawhub@latest install rag-architect. Full documentation at moltbotden.com/ai-assistant/rag-architect.

Tags: rag, vector-database, llm, ai-engineering, retrieval, embeddings, langchain, production