rag-architect

Production RAG (Retrieval-Augmented Generation) system design. Vector databases (Pinecone, Weaviate, pgvector), chunking strategies, hybrid search, reranking, and RAGAS evaluation. The complete playbook for building RAG that actually works.

Installation

npx clawhub@latest install rag-architect

Documentation

RAG Architecture: Production-Grade Retrieval-Augmented Generation

What Is RAG and When to Use It

RAG grounds LLM responses in your own data — documents, databases, knowledge bases — reducing hallucination and keeping answers current without fine-tuning.

Use RAG when:

  • Answers require proprietary or frequently-updated data

  • Users need source citations

  • The LLM's training cutoff is a problem

  • Fine-tuning is too expensive or slow for your update cadence


Don't use RAG when:
  • Data fits in the context window (use direct injection)

  • You need to change model behavior (use fine-tuning instead)

  • Low latency is critical with no budget for retrieval (~100-500ms overhead)



Architecture Layers

Query → [Pre-Processing] → [Retrieval] → [Augmentation] → [Generation] → [Post-Processing]
          Query rewrite     Vector search   Context window    LLM call      Faithfulness check
          HyDE              BM25 hybrid     Reranking         Streaming     Citation extraction
          Decomposition     Metadata filter Token limit mgmt  Grounding     Hallucination check
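
The layers above can be read as a chain of functions that pass a shared state along. The following is a minimal sketch of that shape only: `RagState`, `run_pipeline`, and the four stage functions are hypothetical names, and the stages are toy stand-ins for a real vector store and LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RagState:
    """Carries the query and intermediate results from stage to stage."""
    query: str
    chunks: list[str] = field(default_factory=list)
    prompt: str = ""
    answer: str = ""


Stage = Callable[[RagState], RagState]


def run_pipeline(query: str, stages: list[Stage]) -> RagState:
    """Apply each stage in order, handing the state to the next one."""
    state = RagState(query=query)
    for stage in stages:
        state = stage(state)
    return state


# Hypothetical stand-ins; real stages would call a vector store and an LLM.
def pre_process(state: RagState) -> RagState:
    state.query = " ".join(state.query.split())  # normalize whitespace
    return state


def retrieve(state: RagState) -> RagState:
    corpus = ["Install with the setup wizard.", "Returns are accepted within 30 days."]
    terms = set(state.query.lower().split())
    state.chunks = [c for c in corpus if terms & set(c.lower().split())]
    return state


def augment(state: RagState) -> RagState:
    context = "\n".join(state.chunks)
    state.prompt = f"Context:\n{context}\n\nQuestion: {state.query}"
    return state


def generate(state: RagState) -> RagState:
    state.answer = f"(stub answer grounded in {len(state.chunks)} chunk(s))"
    return state
```

Keeping each layer behind the same signature makes it easy to swap one stage (for example, adding a reranker) without touching the others.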

Step 1: Document Processing Pipeline

Chunking Strategy

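Chunking is often the most impactful decision in a RAG pipeline. If a chunk cuts an answer in half or buries it in unrelated text, retrieval cannot surface it, no matter how good the embedding model is.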

Fixed-size chunking (fast, baseline):

from langchain_text_splitters import RecursiveCharacterTextSplitter  # older releases: langchain.text_splitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # Characters, not tokens
    chunk_overlap=200,         # 20% overlap prevents context cuts
    separators=["\n\n", "\n", ". ", " ", ""],  # Tries each in order
)
chunks = splitter.create_documents([text], metadatas=[{"source": "doc.pdf"}])
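
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a dependency-free sketch of a plain sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); the function name `chunk_with_overlap` is hypothetical and only illustrates the overlap idea.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, the string `"abcdefghij"` yields `["abcd", "cdef", "efgh", "ghij"]`: every window repeats the last two characters of the previous one.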

Semantic chunking (better coherence, slower):

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,
)

Hierarchical / parent-child chunking (best for long documents):

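# Parent: 2000-character chunks, returned to the LLM for context
# Child: 400-character chunks, embedded for precise retrieval
# Search matches a child, then the retriever returns its parent

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,   # any LangChain vector store created beforehand
    docstore=InMemoryStore(),  # parents are lost on restart; use a persistent store in production
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # docs: list of Documents; splits, embeds children, stores parents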

Chunking rules of thumb:

  • Q&A systems: 256-512 tokens per chunk

  • Summarization: 512-1024 tokens

  • Technical docs: Use semantic chunking + 100-200 token overlap

  • Code: Chunk by function/class, not by character count
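
For the last rule (chunk code by function or class), a sketch using Python's standard `ast` module might look like the following. It handles only top-level definitions in Python source, and `chunk_python_source` is a hypothetical helper name; other languages would need their own parser.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            # Decorator lines above the definition are not included.
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({"name": node.name, "kind": type(node).__name__, "text": text})
    return chunks
```

Module-level statements such as imports are skipped here; in practice you may want to keep them as a separate chunk or prepend them to each definition.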


Metadata Enrichment

Always attach metadata — it enables powerful filtered retrieval:

{
    "source": "user_manual_v2.pdf",
    "page": 42,
    "section": "Installation",
    "doc_type": "manual",
    "created_at": "2024-01-15",
    "last_modified": "2024-03-01",
    "language": "en",
    "product": "ProductX",
    "version": "2.0",
    "chunk_index": 5,
    "total_chunks": 23,
}
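
One way to produce records like the one above is to merge document-level fields with per-chunk position fields at ingestion time. This is a small sketch under that assumption; `build_chunk_metadata` is a hypothetical helper, not part of any library.

```python
def build_chunk_metadata(chunks: list[str], source: str, **doc_fields) -> list[dict]:
    """Attach document-level fields plus positional fields to every chunk."""
    total = len(chunks)
    return [
        {"source": source, **doc_fields, "chunk_index": i, "total_chunks": total}
        for i in range(total)
    ]
```

Usage: `build_chunk_metadata(chunks, "user_manual_v2.pdf", doc_type="manual", version="2.0")` returns one metadata dict per chunk, in the same order as `chunks`.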

Step 2: Embedding Models

| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | General, English, accuracy |
| text-embedding-3-small | OpenAI | 1536 | Cost-sensitive, fast |
| embed-english-v3.0 | Cohere | 1024 | English; pairs with Cohere Rerank |
| embed-multilingual-v3.0 | Cohere | 1024 | Multi-language |
| nomic-embed-text | Nomic/Ollama | 768 | Open source, local |
| bge-m3 | BAAI | 1024 | Open source, multilingual |

Critical: Use the same embedding model for ingestion AND retrieval. Vectors from different models are not comparable, so changing models requires re-embedding everything.

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            encoding_format="float",  # or "base64" for storage efficiency
        )
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings
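
Batch ingestion usually needs some error recovery, since a single failed request should not discard the batches already embedded. The sketch below wraps any batch-embedding callable with exponential backoff. `embed_with_retry` and its parameters are hypothetical, and catching bare `Exception` is a simplification: real code would catch the client's specific rate-limit and timeout errors.

```python
import time
from typing import Callable, Sequence


def embed_with_retry(
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    texts: Sequence[str],
    batch_size: int = 100,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> list[list[float]]:
    """Embed texts batch by batch, retrying a failed batch with exponential backoff."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(1, max_attempts + 1):
            try:
                embeddings.extend(embed_batch(batch))
                break  # batch succeeded, move on to the next one
            except Exception:  # simplification: narrow this to transient errors
                if attempt == max_attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return embeddings
```

The `embed_batch` argument can be any function that maps a list of strings to a list of vectors, so the same wrapper works regardless of provider.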

Step 3: Vector Database Selection

| Database | Type | Best For | Hosted |
|---|---|---|---|
| Pinecone | Managed | Production, large scale, teams | Yes |
| Weaviate | Open source | Hybrid search, complex filtering | Both |
| Chroma | Open source | Development, local, small scale | No |
| pgvector | PostgreSQL ext | Already using Postgres, small-medium | No |
| Qdrant | Open source | Performance, on-prem, Rust-based | Both |
| Milvus | Open source | Large scale, self-hosted | Both |

pgvector (Best for existing Postgres users)

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with embedding column
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),  -- Matches your embedding model dimensions
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

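-- Pick ONE index type per column; both are shown here for comparison.

-- Option A: IVFFlat (faster build, less memory, lower recall at the same speed).
-- Create it after the table has data, because the lists are learned from existing rows.
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- starting point: rows / 1000 up to ~1M rows, sqrt(rows) beyond that

-- Option B: HNSW (better speed-recall tradeoff, slower build, more memory)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);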

-- Similarity search
SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'doc_type' = 'manual'  -- Metadata filter
ORDER BY embedding <=> $1::vector
LIMIT 10;
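
The query above relies on `<=>` returning cosine distance, which is why `1 - distance` is reported as similarity. The sketch below shows that relationship in plain Python, together with a helper that formats a list as pgvector's bracketed text form for the `$1` parameter. Both function names are hypothetical, and the distance function assumes non-zero vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"
```

Some Postgres drivers can bind vectors directly through an adapter; the text form is a fallback that works with a plain `::vector` cast.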

Chroma (Development/Prototyping)

import os

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# Add documents
collection.add(
    documents=["chunk text 1", "chunk text 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"],
)

# Query
results = collection.query(
    query_texts=["what is the installation process?"],
    n_results=5,
    where={"source": "doc1"},  # Metadata filter; must match a stored metadata value
    include=["documents", "metadatas", "distances"],
)

Pinecone (Production Scale)

import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index
pc.create_index(
    name="documents",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("documents")

# Upsert vectors
vectors = [
    {
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "text": chunk_text,  # Store text in metadata for retrieval
            "source": source,
            "page": page_num,
        }
    }
    for chunk_id, embedding, chunk_text, source, page_num in chunks
]

index.upsert(vectors=vectors, namespace="production")

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
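    # Range operators such as $gte compare numbers only, so store version as a number.
    # Assumes doc_type and version were included in the upserted metadata.
    filter={"doc_type": {"$eq": "manual"}, "version": {"$gte": 2.0}},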
    include_metadata=True,
    namespace="production",
)

Step 4: Hybrid Search (Dense + Sparse)

Hybrid search combines semantic (dense vector) and keyword (sparse BM25) retrieval. It often outperforms pure vector search, especially when queries contain exact terms that embeddings tend to blur.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# Dense retriever (semantic)
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse retriever (keyword)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10

# Combine with RRF (Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],  # BM25 gets 40%, vector gets 60%
)

results = ensemble_retriever.invoke("installation steps")

When hybrid search wins most:

  • Technical queries with specific terms/codes

  • Queries with proper nouns, product names, version numbers

  • Mixed bag: some users query semantically, some with keywords
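
Reciprocal Rank Fusion itself is simple enough to write by hand, which helps when your retrievers are not LangChain objects. This sketch scores each document as the weighted sum of `1 / (c + rank)` over every list it appears in. The function name `weighted_rrf` is hypothetical, and `c = 60` is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF uses ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized onto a common scale.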



Step 5: Reranking

After initial retrieval (top-20), rerank to keep the best top-5. A reranker scores each query-document pair jointly, so it can promote relevant chunks that the faster first-stage retriever ranked low.

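import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Initial retrieval: get 20 candidates.
# The candidate count is set on the retriever, e.g. as_retriever(search_kwargs={"k": 20});
# invoke() takes only the query.
initial_results = retriever.invoke(query)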

# Rerank with Cohere
reranked = co.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    top_n=5,
    model="rerank-english-v3.0",
    return_documents=True,
)

# Get reranked documents
final_docs = [initial_results[r.index] for r in reranked.results]

Cross-encoder rerankers to consider:

  • Cohere rerank-english-v3.0: API-based, strong quality

  • BAAI/bge-reranker-v2-m3: Open source, strong multilingual

  • cross-encoder/ms-marco-MiniLM-L-6-v2: Lightweight, local
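
The retrieve-then-rerank pattern does not depend on a particular model. The sketch below takes any scoring function over (query, text) pairs; `term_overlap` is a deliberately crude, hypothetical stand-in for a cross-encoder so the example runs without a model download.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[int, float]]:
    """Score every (query, candidate) pair and return the top_n (index, score) pairs."""
    scored = [(i, score(query, text)) for i, text in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def term_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0
```

Returning indices rather than texts mirrors the Cohere example above, where `r.index` is used to look up the original document and its metadata.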



Step 6: Query Transformation

Don't just embed the raw user query. Transform it first.

HyDE (Hypothetical Document Embeddings)

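Generate a hypothetical answer and embed that instead of the raw query. The hypothetical answer often sits closer in embedding space to the real answer chunks than the question does:

def hyde_query(query: str, llm) -> str: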
    """Generate hypothetical document to improve retrieval."""
    prompt = f"""Write a short document that would answer this question:
    
Question: {query}

Document:"""
    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc  # Embed this instead of the raw query

Query Decomposition

Break complex queries into sub-queries and retrieve for each:

import json

def decompose_query(query: str, llm) -> list[str]:
    prompt = f"""Break this complex question into 2-4 simpler sub-questions.
Return as a JSON array of strings.

Question: {query}
Sub-questions:"""
    result = llm.invoke(prompt).content
    return json.loads(result)
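
LLM output is not guaranteed to be bare JSON; models often wrap the array in prose or a code fence, which makes a direct `json.loads` call fail. The sketch below is one defensive way to parse the reply and to merge the per-sub-query results afterwards. Both helper names are hypothetical.

```python
import json
import re


def parse_sub_questions(raw: str, fallback: str) -> list[str]:
    """Extract a JSON array of strings from LLM output, tolerating surrounding text."""
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            questions = [q.strip() for q in parsed if isinstance(q, str) and q.strip()]
            if questions:
                return questions
        except json.JSONDecodeError:
            pass
    return [fallback]  # fall back to the original query if parsing fails


def merge_results(result_lists: list[list[str]]) -> list[str]:
    """Concatenate per-sub-query results, dropping duplicates but keeping order."""
    seen: set[str] = set()
    merged = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Falling back to the original query keeps the pipeline working when decomposition fails, at the cost of losing the benefit for that request.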

Contextual Compression

Use an LLM to extract only the relevant part of each retrieved chunk:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

Step 7: Generation with Context

Prompt Template

RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite your sources using [Source: document_name, page X].

Context:
{context}

Question: {question}

Answer:"""
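
Because the prompt asks for citations in a fixed `[Source: document_name, page X]` form, they can be pulled back out of the answer with a regular expression and checked against what was actually retrieved. This is a sketch under the assumption that the model follows the format; `extract_citations` and `unsupported_citations` are hypothetical helper names.

```python
import re

# Matches "[Source: <name>, page <X>]"; "page" is matched case-insensitively.
CITATION_RE = re.compile(
    r"\[Source:\s*([^,\]]+?)\s*,\s*page\s+([^\]\s]+)\s*\]",
    re.IGNORECASE,
)


def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return (document_name, page) pairs cited in the answer, in order."""
    return [(m.group(1), m.group(2)) for m in CITATION_RE.finditer(answer)]


def unsupported_citations(answer: str, retrieved_sources: set[str]) -> list[tuple[str, str]]:
    """Citations whose document was not among the retrieved sources."""
    return [c for c in extract_citations(answer) if c[0] not in retrieved_sources]
```

A non-empty result from `unsupported_citations` is a cheap signal that the model cited something outside the provided context.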

Context Assembly with Token Budget

import tiktoken

def assemble_context(docs: list, max_tokens: int = 6000) -> str:
    """Fit as many docs as possible within token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0
    
    for doc in docs:
        text = f"[Source: {doc.metadata.get('source', 'unknown')}, Page {doc.metadata.get('page', '?')}]\n{doc.page_content}\n\n"
        tokens = len(enc.encode(text))
        
        if token_count + tokens > max_tokens:
            break
        
        context_parts.append(text)
        token_count += tokens
    
    return "".join(context_parts)

Step 8: RAG Evaluation with RAGAS

Evaluate your RAG pipeline before shipping it.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Are answers grounded in context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Is retrieved context useful?
    context_recall,     # Does context contain enough info?
)
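from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Build evaluation dataset
# (column names follow the older ragas schema; newer releases rename them, so check your version)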
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days."],
    "contexts": [["Our return policy allows 30-day returns for unused items."]],
    "ground_truth": ["Returns accepted within 30 days."],
}

dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

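print(results)
# Illustrative output from a larger evaluation set (one example will not produce these numbers):
# faithfulness: 0.95 (are answers grounded in retrieved docs?)
# answer_relevancy: 0.88 (does answer address the question?)
# context_precision: 0.82 (is retrieved context useful?)
# context_recall: 0.90 (does context cover the answer?)

Target scores: as a rule of thumb, aim for every metric above 0.80 before production. A score below 0.70 usually means retrieval is broken and should be fixed before tuning prompts.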
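
These thresholds can be enforced automatically, for example as a CI step that blocks a release. The sketch below assumes the scores are available as a plain dict of metric name to float; `check_quality_gate` is a hypothetical helper, and the defaults mirror the 0.80 and 0.70 figures above.

```python
def check_quality_gate(
    scores: dict[str, float],
    target: float = 0.80,
    floor: float = 0.70,
) -> dict[str, str]:
    """Label each metric 'pass' (above target), 'fail' (below floor), or 'warn'."""
    verdicts = {}
    for metric, value in scores.items():
        if value > target:
            verdicts[metric] = "pass"
        elif value >= floor:
            verdicts[metric] = "warn"  # not broken, but not production-ready
        else:
            verdicts[metric] = "fail"
    return verdicts
```

A release script could then refuse to deploy if any verdict is not `"pass"`.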


Common RAG Failure Modes & Fixes

| Problem | Symptom | Fix |
|---|---|---|
| Low faithfulness | LLM adds info not in context | Stronger system prompt, lower temperature |
| Low context recall | Right answer not in top-K | Increase K, fix chunking, improve embeddings |
| Low precision | Retrieved chunks are irrelevant | Add metadata filters, use hybrid search |
| Slow retrieval | >500ms per query | HNSW index, fewer dimensions, cache popular queries |
| Stale embeddings | Old info retrieved | Implement document versioning, re-embed on update |
| Context window exceeded | Truncation errors | Parent-child chunking, contextual compression |
| Poor multilingual | Bad non-English recall | Use multilingual embedding model (bge-m3) |
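
For the "cache popular queries" fix, a small in-memory cache with a time-to-live is often enough to start with. The sketch below keys on a normalized query string so that trivial differences in case and spacing hit the same entry. `QueryCache` is a hypothetical class, and the injectable `clock` parameter exists only to make the expiry behavior testable.

```python
import time
from typing import Callable, Optional


class QueryCache:
    """Tiny in-memory TTL cache keyed on the normalized query string."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list[str]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())  # lowercase, collapse whitespace

    def get(self, query: str) -> Optional[list[str]]:
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return results

    def put(self, query: str, results: list[str]) -> None:
        self._entries[self._key(query)] = (self._clock(), results)
```

Keep the TTL short relative to how often documents change, otherwise the cache reintroduces the stale-results problem listed in the table.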

Production Checklist

  • Chunking tested against real queries
  • Embeddings batch-ingested with error recovery
  • Metadata schema designed and documented
  • Hybrid search enabled (dense + BM25)
  • Reranker in place for top-K selection
  • RAGAS scores > 0.80 across all metrics
  • Token budget enforced in context assembly
  • Citation extraction working
  • Incremental update pipeline for new documents
  • Embedding model version locked (changes require re-embedding)
  • Query caching for popular questions
  • Monitoring: retrieval latency, answer quality drift
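
For the incremental update item, one common approach is to store a content hash per document and re-embed only when the hash changes. The sketch below assumes you keep a mapping of document id to stored hash somewhere durable; `content_hash` and `plan_updates` are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(stored_hashes: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current documents; decide what to embed or delete."""
    to_embed = [
        doc_id for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in current_docs]
    return {"embed": sorted(to_embed), "delete": sorted(to_delete)}
```

When a document changes, delete all of its old chunks from the vector store before inserting the new ones, so stale chunks cannot be retrieved alongside fresh ones.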