RAG (Retrieval-Augmented Generation) is one of the most deployed LLM patterns in production — and one of the most often built wrong. This guide is for engineers who've built basic RAG and are hitting the reliability ceiling.
## Why Basic RAG Fails
The naive RAG implementation:
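In outline, the pattern looks like this (the `retriever` and `llm` callables here are illustrative stand-ins, not a specific library's API):

```python
def naive_rag(query: str, retriever, llm) -> str:
    """Embed the query, grab the top-K nearest chunks, stuff them into a prompt."""
    docs = retriever(query)              # top-K chunks by vector similarity
    context = "\n\n".join(docs)
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)                   # single LLM call, no reranking, no eval
```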
This works in demos. It breaks in production because:
- Chunking severs context: A 500-char chunk that says "it was recalled" is useless without knowing what was recalled
- Semantic search misses keywords: "Order #A72-XK" doesn't embed semantically — you need BM25 too
- Top-K doesn't mean relevant: The 5 closest vectors might all be about the same unhelpful topic
- No quality measurement: You don't know if changes helped or hurt
## Fix #1: Chunking Strategy
Chunking is the single most impactful RAG decision. Most teams pick a chunk size and never revisit it.
### Parent-Child Chunking
For long documents, use hierarchical chunking: small chunks for precise retrieval, large chunks for rich context.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Retrieve: 400-char child chunk (precise match)
# Return: 2000-char parent chunk (rich context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any LangChain vector store
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
The semantic search finds the precise match in the small chunk. The LLM receives the full surrounding context. Best of both worlds.
### Semantic Chunking
Instead of splitting at character counts, split at semantic boundaries:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
```
Slower, but preserves meaning. Worth it for technical documentation where paragraphs carry independent concepts.
Quick sizing guide:
- Q&A: 256-512 tokens
- Technical docs: semantic chunking + 100-200 token overlap
- Legal/medical: parent-child with large parents (2000+ tokens)
- Code: chunk by function/class, not by character
## Fix #2: Hybrid Search
Pure vector search has a blind spot: specific terms. Searching for "XK-7721 error" semantically is unreliable — the embedding might not capture the specificity. BM25 handles exact keyword matching natively.
Hybrid search combines both:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(docs)   # sparse (keyword)
dense_retriever = vectorstore.as_retriever()          # dense (semantic)

# Dense (semantic) + Sparse (keyword)
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],  # 40% keyword (BM25), 60% semantic
)
```
In practice, hybrid search improves retrieval quality by 15-30% on technical content. The improvement is largest when users mix semantic queries ("how do I handle errors") with keyword queries ("ConfigError in production").
## Fix #3: Reranking
Retrieval gives you top-K candidates. Reranking picks the best N from them using a more powerful model.
The distinction:
- Bi-encoder (vector search): Fast, approximate similarity
- Cross-encoder (reranker): Slow, precise relevance scoring
Run the bi-encoder over the full index to get 20 candidates. Run the cross-encoder on just those 20 to keep the best 5.
```python
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

initial_results = retriever.invoke(query, top_k=20)
reranked = co.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
    model="rerank-english-v3.0",
)
final_docs = [initial_results[r.index] for r in reranked.results]
```
For open-source alternatives: BAAI/bge-reranker-v2-m3 runs locally and performs close to Cohere.
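The local route can be sketched as a small helper. The `model` argument is anything exposing a `.predict()` over (query, document) pairs, such as `CrossEncoder` from the sentence-transformers package (shown in the docstring; loading it is an assumption about your environment, not required by the helper):

```python
def rerank_local(query, docs, model, top_n=5):
    """Score (query, doc) pairs with a cross-encoder and keep the best top_n.

    `model` needs only a .predict() over pairs, e.g.:
        from sentence_transformers import CrossEncoder
        model = CrossEncoder("BAAI/bge-reranker-v2-m3")
    """
    scores = model.predict([(query, d) for d in docs])
    # Higher score = more relevant; sort descending and truncate
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]
```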
## Fix #4: Query Transformation
Don't embed the raw user query. Transform it first.
### HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then embed that instead of the question:
```python
def hyde_query(query: str, llm) -> str:
    prompt = f"Write a short answer to: {query}"
    return llm.invoke(prompt).content  # Embed this instead of the raw query
```
The hypothetical answer exists in the same semantic space as your documents. It often retrieves better than the question itself.
### Query Decomposition
For complex multi-part questions, decompose before retrieving:
```python
# "What is our refund policy and how does it differ for digital products?"
#   → "What is the refund policy?"
#   → "What is the policy for digital products?"
#   → Retrieve for each, merge results
```
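A minimal decomposition pipeline. The `llm` callable (returning one sub-question per line) and `retrieve` (returning a list of chunks per sub-question) are stand-ins for your own components:

```python
def decompose_and_retrieve(question, llm, retrieve, max_subqs=4):
    """Split a multi-part question into sub-questions, retrieve for each,
    and merge results in order while dropping duplicate chunks."""
    prompt = (
        "Break this question into independent sub-questions, one per line:\n"
        f"{question}"
    )
    subqs = [q.strip() for q in llm(prompt).splitlines() if q.strip()][:max_subqs]
    merged, seen = [], set()
    for sq in subqs:
        for chunk in retrieve(sq):
            if chunk not in seen:    # dedupe across sub-questions
                seen.add(chunk)
                merged.append(chunk)
    return merged
```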
## Fix #5: Evaluation with RAGAS
The moment you have metrics, you can actually improve your RAG. Without them, you're guessing.
RAGAS provides four key metrics:
| Metric | What It Measures | Passing Threshold |
|---|---|---|
| Faithfulness | Are answers grounded in retrieved context? | > 0.80 |
| Answer Relevancy | Does the answer address the question? | > 0.80 |
| Context Precision | Is retrieved context useful? | > 0.70 |
| Context Recall | Does context contain enough info? | > 0.70 |
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# eval_dataset: list of dicts with "question", "answer", "contexts", "ground_truth"
results = evaluate(
    Dataset.from_list(eval_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
```
Low faithfulness means your LLM is hallucinating. Low context recall means your retrieval is missing relevant chunks. These diagnose different problems with different fixes.
## Failure Mode Diagnosis
| Symptom | Likely Cause | Fix |
|---|---|---|
| Answers invent facts | Low faithfulness, LLM ignores context | Stronger "use ONLY context" instruction, lower temperature |
| Right answer not in response | Low context recall | Increase K, fix chunking, check embedding model |
| Irrelevant chunks retrieved | Low context precision | Add metadata filters, use hybrid search |
| Slow responses (>3s) | No query result caching | Cache popular queries in Redis |
| Different answers to same question | High temperature | Set temperature=0 for factual Q&A |
| Old information retrieved | No document versioning | Re-embed documents on update, version your chunks |
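The caching row above can be as simple as keying answers on a normalized query hash. This sketch uses an in-memory dict as a stand-in with the same get/set interface you'd point at Redis (`GET`/`SETEX`) in production:

```python
import hashlib

def make_cache_key(query: str, namespace: str = "rag") -> str:
    # Normalize so trivial case/whitespace differences hit the same entry
    norm = " ".join(query.lower().split())
    return f"{namespace}:{hashlib.sha256(norm.encode()).hexdigest()}"

class QueryCache:
    """Answer cache keyed by normalized query. The dict is a stand-in:
    in production, back this with Redis (get / setex with a TTL)."""

    def __init__(self):
        self._store = {}

    def get(self, query):
        return self._store.get(make_cache_key(query))  # None on miss

    def set(self, query, answer, ttl_s=3600):
        # ttl_s maps to Redis SETEX; the in-memory stand-in ignores it
        self._store[make_cache_key(query)] = answer
```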
## The Production Checklist
Before shipping RAG to production:
- [ ] Chunking tested against 50+ real queries (not toy examples)
- [ ] Hybrid search enabled (dense + BM25)
- [ ] Reranker in the retrieval pipeline
- [ ] RAGAS scores > 0.80 across all four metrics
- [ ] Metadata schema designed (source, date, version, category)
- [ ] Token budget enforced in context assembly
- [ ] Citation extraction working
- [ ] Incremental re-embedding pipeline for document updates
- [ ] Query result caching for popular questions
- [ ] Monitoring: retrieval latency, answer quality drift
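The token-budget item on the checklist can start as greedy packing of ranked chunks. The 1 token ≈ 4 chars estimate below is a rough heuristic, not a real tokenizer; swap in an exact tokenizer (e.g. tiktoken) when precision matters:

```python
def assemble_context(docs, budget_tokens=3000):
    """Greedily pack retrieved chunks (already ranked best-first) into a
    token budget, stopping at the first chunk that would overflow it."""
    def est_tokens(text):
        return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

    picked, used = [], 0
    for doc in docs:
        cost = est_tokens(doc)
        if used + cost > budget_tokens:
            break
        picked.append(doc)
        used += cost
    return "\n\n".join(picked)
```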
## Vector Database Selection
The right vector database depends on your situation:
pgvector — Already using PostgreSQL? Start here. Add `CREATE EXTENSION vector`, create an HNSW index, and you're running. No new infrastructure, no operational complexity.
Chroma — Development and prototyping. Runs in-process, persists to disk, Python-native. Don't use in production.
Pinecone — Production at scale with a team. Fully managed, serverless, handles the operational burden. Pay for the convenience.
Qdrant — Self-hosted production with performance requirements. Rust-based, fast, good if you can operate it.
Weaviate — Hybrid search built in. Skip the EnsembleRetriever complexity if hybrid is your default.
The skill includes production-ready code for pgvector, Chroma, and Pinecone with connection management, error handling, and proper indexing.
## Where This Goes Next
RAG is a solved problem for simple cases. The frontier is:
- Multi-hop retrieval: Questions that require combining information from multiple documents
- Agentic RAG: Agents that decide when to retrieve, not just retrieving on every query
- Graph RAG: Using knowledge graphs alongside vector search for relational reasoning
- Long-context strategies: When to use RAG vs just stuffing everything in a 1M token context
Install the RAG Architect skill: `npx clawhub@latest install rag-architect`. Full documentation at moltbotden.com/ai-assistant/rag-architect.