llm-evaluation
Evaluate and improve LLM applications in production. Use when building LLM evaluation pipelines, measuring RAG quality, detecting hallucinations, benchmarking models, implementing LLMOps monitoring, selecting evaluation frameworks (RAGAS, Promptfoo, LangSmith, Braintrust), or designing human feedback loops. Covers evals-as-code, metric design, and continuous quality measurement.
LLM Evaluation Engineering
Why Evals Matter
"Evals are to LLMs what tests are to software." — Andrej Karpathy
Without evals:
- You can't measure if a prompt change improved things
- You don't know if a model upgrade regressed outputs
- You're shipping vibes, not quality
Evaluation Taxonomy
Automated Evals (fast, cheap, scalable)
├─ Rule-based: regex, JSON schema, string contains
├─ Model-based (LLM-as-Judge): GPT-4o grades outputs
└─ Reference-based: compare against golden answers
Human Evals (slow, expensive, ground truth)
├─ Expert annotation: domain experts label samples
├─ Preference annotation: A/B comparison voting
└─ Production feedback: thumbs up/down from users
Online vs Offline:
Offline: Test against curated dataset before deploy
Online: Monitor live traffic in production
The Evaluation Loop
1. Build eval dataset
- Golden QA pairs from domain experts
- Edge cases from production failures
- Adversarial examples
2. Define metrics
- What does "good" look like?
- How do you measure it programmatically?
3. Run evals
- Against multiple model versions
- Against prompt variants
- After system changes
4. Analyze regressions
- Which cases failed?
- What patterns exist?
5. Improve system
- Fix prompts, chunking, retrieval
- Retrain / fine-tune
6. Add failing cases to dataset (regression prevention)
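The loop above can be sketched as a minimal harness. This is illustrative scaffolding, not any framework's API: `run_system` stands in for your application call and `grade` for whatever metric you choose.

```python
# Minimal eval-loop harness. `run_system` and `grade` are placeholder
# callables: run_system(question) -> output, grade(output, truth) -> 0..1.
def run_eval(dataset, run_system, grade, threshold=0.8):
    failures = []
    for case in dataset:
        output = run_system(case["question"])
        score = grade(output, case["ground_truth"])
        if score < threshold:
            failures.append({"id": case["id"], "score": score, "output": output})
    pass_rate = 1 - len(failures) / len(dataset)
    return pass_rate, failures

# Step 6: every production failure becomes a new regression case
def add_regression_case(dataset, question, ground_truth, case_id):
    dataset.append({
        "id": case_id,
        "question": question,
        "ground_truth": ground_truth,
        "category": "regression",
        "difficulty": "hard",
        "tags": ["regression"],
    })
```

The point of step 6 is that the dataset only grows: once a failure is captured, it is re-checked on every future run.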
Building an Eval Dataset
# Golden dataset structure
eval_dataset = [
    {
        "id": "q001",
        "question": "What is our refund policy?",
        "ground_truth": "Products can be returned within 30 days of purchase for a full refund.",
        "contexts": [
            "Our refund policy allows returns within 30 days of purchase.",
            "Refunds are processed within 5-7 business days.",
        ],
        "category": "policy",
        "difficulty": "easy",
        "tags": ["refund", "returns"],
    },
    {
        "id": "q002",
        "question": "Can I return a digital product?",
        "ground_truth": "Digital products are not eligible for refunds once accessed.",
        "contexts": [
            "Digital downloads cannot be returned once downloaded or accessed.",
        ],
        "category": "policy",
        "difficulty": "hard",  # Edge case
        "tags": ["refund", "digital"],
    },
]
# Minimum viable eval dataset: 50-100 questions per use case
# Cover: easy/medium/hard, each business scenario, known failure modes
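One way to sanity-check that coverage as the dataset grows is to count cases per bucket. A sketch using the field names from the structure above:

```python
from collections import Counter

def coverage_report(dataset, min_per_bucket=5):
    """Count cases per (category, difficulty) bucket and flag thin buckets."""
    buckets = Counter((c["category"], c["difficulty"]) for c in dataset)
    thin = {bucket: n for bucket, n in buckets.items() if n < min_per_bucket}
    return dict(buckets), thin
```

Running this before each eval cycle makes gaps visible, e.g. a business scenario with zero hard cases.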
Metric Design
For RAG Systems
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # 0-1: Is the answer grounded in retrieved context?
    answer_relevancy,    # 0-1: Does the answer address the question?
    context_precision,   # 0-1: Is the retrieved context useful?
    context_recall,      # 0-1: Does the context contain enough info?
    answer_correctness,  # 0-1: Is the answer factually correct (vs ground truth)?
)
from datasets import Dataset

# RAGAS expects columns: question, answer (the system's generated output),
# contexts, ground_truth — add the generated "answer" to each row before evaluating.
results = evaluate(
    Dataset.from_list(eval_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# Scores explained:
# Faithfulness < 0.8: LLM is hallucinating
# Context Recall < 0.7: Retrieval is missing relevant chunks
# Context Precision < 0.7: Retrieved chunks are irrelevant
# Answer Relevancy < 0.8: Answers are off-topic
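The thresholds above can be turned into a quick triage helper. Note the cutoffs are the rules of thumb from this section, not RAGAS defaults:

```python
def triage_rag_scores(scores: dict) -> list[str]:
    """Map aggregate RAGAS-style scores to likely root causes."""
    issues = []
    if scores.get("faithfulness", 1.0) < 0.8:
        issues.append("generation: answers not grounded in context (hallucination)")
    if scores.get("context_recall", 1.0) < 0.7:
        issues.append("retrieval: relevant chunks are missing")
    if scores.get("context_precision", 1.0) < 0.7:
        issues.append("retrieval: too many irrelevant chunks returned")
    if scores.get("answer_relevancy", 1.0) < 0.8:
        issues.append("generation: answers drift off-topic")
    return issues
```

The split matters because low faithfulness points at the prompt/model while low recall points at chunking or retrieval; fixing the wrong layer wastes an iteration.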
For Chatbots / Assistants
# Custom LLM-as-Judge metric
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.
Question: {question}
AI Response: {response}
Rate the response on each dimension (1-5):
1. Accuracy: Is the information factually correct?
2. Completeness: Does it fully answer the question?
3. Clarity: Is it easy to understand?
4. Conciseness: Is it appropriately concise without being incomplete?
5. Tone: Is the tone appropriate and professional?
Return JSON only:
{{"accuracy": X, "completeness": X, "clarity": X, "conciseness": X, "tone": X, "reasoning": "..."}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        response_format={"type": "json_object"},
        temperature=0,  # Reduces (but does not fully eliminate) grading variance
    )
    return json.loads(result.choices[0].message.content)
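Individual judge scores only become actionable in aggregate. A sketch of rolling the per-dimension 1-5 scores up across a dataset (the dimension names match the judge prompt above; the pass criterion is an illustrative choice):

```python
from statistics import mean

def aggregate_judge_scores(scores: list[dict], pass_threshold: int = 4) -> dict:
    """Average each 1-5 dimension and compute a pass rate (all dims >= threshold)."""
    dims = ["accuracy", "completeness", "clarity", "conciseness", "tone"]
    summary = {d: mean(s[d] for s in scores) for d in dims}
    passed = sum(all(s[d] >= pass_threshold for d in dims) for s in scores)
    summary["pass_rate"] = passed / len(scores)
    return summary
```

Tracking the per-dimension averages over time shows whether a prompt change traded, say, conciseness for completeness.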
Rule-Based Metrics
import json
import re
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    """Compose multiple checks into a single evaluator."""
    def evaluate(response: str, metadata: dict) -> dict:
        results = {}
        for check in checks:
            name = check.__name__
            results[name] = check(response, metadata)
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response: str, _) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response: str, metadata: dict) -> bool:
    min_len = metadata.get("min_length", 50)
    max_len = metadata.get("max_length", 2000)
    return min_len <= len(response) <= max_len

def check_no_hallucination_markers(response: str, _) -> bool:
    """Pass only if the response contains no hedging phrases that often accompany hallucination."""
    hallucination_phrases = [
        "I'm not sure but",
        "I believe but I'm not certain",
        "I might be wrong",
        "as far as I know",
        "I think (but don't quote me)",
    ]
    response_lower = response.lower()
    return not any(phrase.lower() in response_lower for phrase in hallucination_phrases)

def check_cites_source(response: str, _) -> bool:
    """Check if the response includes a source citation."""
    patterns = [r'\[Source:', r'\[Ref:', r'According to', r'Based on']
    return any(re.search(p, response) for p in patterns)
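Composed, the checks run as one evaluator per response. A self-contained usage sketch (the two checks here are compact restatements of the ones above):

```python
import json
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    def evaluate(response: str, metadata: dict) -> dict:
        results = {c.__name__: c(response, metadata) for c in checks}
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response, _):
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response, metadata):
    return metadata.get("min_length", 50) <= len(response) <= metadata.get("max_length", 2000)

# Different endpoints get different evaluator compositions
structured_eval = make_evaluator([check_format_json, check_length])
report = structured_eval(
    '{"answer": "Returns are accepted within 30 days of purchase."}',
    {"min_length": 20},
)
```

The per-check booleans in `report` tell you *which* rule failed, not just that something did, which keeps triage cheap.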
Promptfoo — Evals as Code
# promptfooconfig.yaml
prompts:
  - id: prompt-v1
    raw: |
      You are a helpful customer service agent.
      Answer questions about our products based on this knowledge base:
      {{context}}
      Question: {{question}}
      Answer:
  - id: prompt-v2
    raw: |
      You are a precise customer service agent. Answer ONLY using information
      from the provided context. If unsure, say "I don't have that information."
      Context: {{context}}
      Question: {{question}}
      Answer:
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: anthropic:claude-3-5-haiku-20241022
tests:
  - description: Basic refund policy query
    vars:
      question: "What is your return policy?"
      context: "Products can be returned within 30 days for a full refund."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response accurately describes the return policy and is helpful"
      - type: cost
        threshold: 0.01  # Max $0.01 per call
  - description: Edge case - digital products
    vars:
      question: "Can I return downloaded software?"
      context: "Digital downloads cannot be returned once accessed."
    assert:
      - type: contains
        value: "cannot"
      - type: not-contains
        value: "30 days"  # Should NOT apply the physical-goods policy to digital
      - type: factuality
        value: "Digital products cannot be returned once downloaded"
  - description: Hallucination test - question not in context
    vars:
      question: "What are your store hours?"
      context: "We sell premium software products."
    assert:
      - type: llm-rubric
        value: "The response should NOT invent store hours. It should say the information is not available."
# Run evals
npx promptfoo eval
npx promptfoo eval --output results.json
npx promptfoo view # Browser UI with comparison table
# Compare two configs
npx promptfoo eval --config config-v1.yaml
npx promptfoo eval --config config-v2.yaml
npx promptfoo view # See side-by-side comparison
LangSmith Tracing and Eval
import os

from langsmith import Client
from langsmith.evaluation import evaluate

# Auto-tracing — set these env vars and any LangChain app is traced
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# Create a custom evaluator (wrap it with langsmith's LangChainStringEvaluator
# to pass it into evaluate() alongside the built-ins)
from langchain.evaluation import load_evaluator

correctness_evaluator = load_evaluator(
    "labeled_score_string",
    criteria={
        "correctness": "Is the response factually correct based on the reference?"
    },
)

# Create a dataset from the golden examples
client = Client()
dataset = client.create_dataset("refund-questions")
for example in eval_dataset:
    client.create_example(
        inputs={"question": example["question"], "context": example["contexts"]},
        outputs={"answer": example["ground_truth"]},
        dataset_id=dataset.id,
    )

# Run evaluation
def run_rag_chain(inputs):
    return {"answer": my_rag_chain.invoke(inputs)}

results = evaluate(
    run_rag_chain,
    data=dataset.name,
    evaluators=["qa", "context_qa"],  # Built-in evaluators, referenced by name
    experiment_prefix="rag-v2",
)
Hallucination Detection
import json

HALLUCINATION_CHECK_PROMPT = """Given the following context and response, determine if the response
contains any information NOT supported by the context (hallucination).
Context:
{context}
Response:
{response}
Return JSON:
{{
  "has_hallucination": true/false,
  "hallucinated_claims": ["specific claim not in context", ...],
  "confidence": 0.0-1.0,
  "reasoning": "explanation"
}}"""

def detect_hallucination(
    context: str,
    response: str,
    model: str = "gpt-4o",
) -> dict:
    result = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": HALLUCINATION_CHECK_PROMPT.format(
                context=context, response=response
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
        seed=42,  # Improves (but does not guarantee) reproducibility
    )
    return json.loads(result.choices[0].message.content)

# NLI-based hallucination detection (faster, cheaper)
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base", top_k=None)

def is_faithful(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Check if the hypothesis is entailed by the premise."""
    # top_k=None returns scores for all labels; this model's labels are lowercase
    scores = nli({"text": premise, "text_pair": hypothesis})
    entailment = next((r for r in scores if r["label"].lower() == "entailment"), None)
    return entailment is not None and entailment["score"] > threshold
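NLI entailment works best applied claim-by-claim rather than on the whole response at once. A sketch that splits a response into sentences and checks each one; `entails` is an injected stand-in for an entailment checker like `is_faithful` above, so the splitting logic stays testable without loading a model:

```python
import re

def unfaithful_claims(context: str, response: str, entails) -> list[str]:
    """Return the sentences of `response` that the entailment checker rejects.

    `entails(premise, hypothesis) -> bool` is passed in, e.g. a wrapper around
    an NLI model scoring each sentence against the retrieved context.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if not entails(context, s)]
```

Returning the offending sentences (rather than a single boolean) lets the hallucinated claims be logged, reviewed, and added to the eval dataset.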
CI/CD Integration for LLM Quality
# .github/workflows/eval.yml
name: LLM Evaluation
on:
  pull_request:
    paths: ['prompts/**', 'rag/**', 'src/llm/**']
  push:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo eval --output eval-results.json
      - name: Check quality gates
        run: |
          python scripts/check_eval_gates.py eval-results.json \
            --min-pass-rate 0.90 \
            --max-latency-p95 2000 \
            --max-cost-per-1k 0.50
      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json')
            const passRate = results.stats.successes / results.stats.total
            const comment = `## LLM Eval Results
            - Pass rate: ${(passRate * 100).toFixed(1)}%
            - Avg latency: ${results.stats.avgLatency}ms
            - Total cost: ${results.stats.totalCost.toFixed(4)}`
            github.rest.issues.createComment({...})
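The `scripts/check_eval_gates.py` the workflow calls is not shown in this doc; a minimal sketch of such a gate script follows. The result-JSON field names (`successes`, `total`, `totalCost`, `latencyP95`) mirror the ones the workflow reads and are assumptions about the output shape, not a documented promptfoo schema:

```python
import argparse
import json

def check_gates(argv=None) -> int:
    """Return 0 if all quality gates pass, 1 otherwise.

    Wire into CI with: sys.exit(check_gates())
    """
    parser = argparse.ArgumentParser(description="Fail CI if eval quality gates are violated")
    parser.add_argument("results_file")
    parser.add_argument("--min-pass-rate", type=float, default=0.90)
    parser.add_argument("--max-latency-p95", type=float, default=2000)  # milliseconds
    parser.add_argument("--max-cost-per-1k", type=float, default=0.50)  # USD per 1k calls
    args = parser.parse_args(argv)

    with open(args.results_file) as f:
        stats = json.load(f)["stats"]

    pass_rate = stats["successes"] / stats["total"]
    cost_per_1k = stats["totalCost"] / stats["total"] * 1000
    violations = []
    if pass_rate < args.min_pass_rate:
        violations.append(f"pass rate {pass_rate:.1%} < {args.min_pass_rate:.1%}")
    if stats.get("latencyP95", 0) > args.max_latency_p95:
        violations.append(f"p95 latency {stats['latencyP95']}ms > {args.max_latency_p95}ms")
    if cost_per_1k > args.max_cost_per_1k:
        violations.append(f"cost/1k calls ${cost_per_1k:.2f} > ${args.max_cost_per_1k:.2f}")

    for v in violations:
        print(f"GATE FAILED: {v}")
    return 1 if violations else 0
```

Exiting non-zero on any violation is what makes the step an actual gate: the PR cannot merge on a quality regression.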
Production Monitoring
import random

import prometheus_client as prom

# Metrics to track
llm_request_latency = prom.Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    labelnames=["model", "prompt_version"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)
llm_token_usage = prom.Counter(
    "llm_tokens_total",
    "LLM token usage",
    labelnames=["model", "type"],  # type: input/output
)
response_quality = prom.Histogram(
    "llm_response_quality_score",
    "LLM response quality (0-1)",
    labelnames=["metric_name"],
)

def monitored_llm_call(prompt: str, model: str = "gpt-4o-mini"):
    with llm_request_latency.labels(model=model, prompt_version="v2").time():
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
    # Track token usage
    llm_token_usage.labels(model=model, type="input").inc(response.usage.prompt_tokens)
    llm_token_usage.labels(model=model, type="output").inc(response.usage.completion_tokens)
    # Sample 10% of traffic for quality scoring (LLM-as-judge calls are expensive)
    if random.random() < 0.1:
        # judge_response_quality: your LLM-as-judge scorer, e.g. built on judge_response above
        quality = judge_response_quality(prompt, response.choices[0].message.content)
        response_quality.labels(metric_name="relevancy").observe(quality["relevancy"])
    return response.choices[0].message.content
Eval Framework Comparison
| Framework | Best For | Pros | Cons |
|---|---|---|---|
| Promptfoo | Prompt comparison, CI/CD | Evals-as-code, fast, cheap | Fewer enterprise features |
| LangSmith | LangChain apps | Deep tracing, dataset management | Vendor lock-in |
| Braintrust | Production monitoring | Nice UI, A/B testing | Newer |
| RAGAS | RAG systems | RAG-specific metrics, free | Limited to RAG |
| Phoenix (Arize) | ML + LLM observability | Good observability UX | Complex setup |
| DeepEval | Comprehensive evals | Many metrics, LLM-agnostic | Slower |
| Opik | Open source eval + tracing | Self-hostable | Less mature |