llm-evaluation
LLMOps and evaluation engineering. RAGAS metrics, Promptfoo evals-as-code, LangSmith tracing, hallucination detection, CI/CD for LLM quality, and production monitoring. Build LLM apps you can actually measure.
Installation
npx clawhub@latest install llm-evaluation
View the full skill documentation and source below.
Documentation
LLM Evaluation Engineering
Why Evals Matter
"Evals are to LLMs what tests are to software." — Andrej Karpathy
Without evals:
- You can't measure if a prompt change improved things
- You don't know if a model upgrade regressed outputs
- You're shipping vibes, not quality
Evaluation Taxonomy
Automated Evals (fast, cheap, scalable)
├─ Rule-based: regex, JSON schema, string contains
├─ Model-based (LLM-as-Judge): GPT-4o grades outputs
└─ Reference-based: compare against golden answers
Human Evals (slow, expensive, ground truth)
├─ Expert annotation: domain experts label samples
├─ Preference annotation: A/B comparison voting
└─ Production feedback: thumbs up/down from users
Online vs Offline:
Offline: Test against curated dataset before deploy
Online: Monitor live traffic in production
The Evaluation Loop
1. Build eval dataset
- Golden QA pairs from domain experts
- Edge cases from production failures
- Adversarial examples
2. Define metrics
- What does "good" look like?
- How do you measure it programmatically?
3. Run evals
- Against multiple model versions
- Against prompt variants
- After system changes
4. Analyze regressions
- Which cases failed?
- What patterns exist?
5. Improve system
- Fix prompts, chunking, retrieval
- Retrain / fine-tune
6. Add failing cases to dataset (regression prevention)
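The loop above can be sketched as a minimal harness. `run_eval`, `run_system`, and `score` are illustrative names, not from any library; plug in your own application and metric:

```python
def run_eval(dataset, run_system, score, threshold=0.8):
    """Run every case, record failures, and report an overall pass rate."""
    failures = []
    for case in dataset:
        output = run_system(case["question"])
        if score(output, case["ground_truth"]) < threshold:
            failures.append(case["id"])
    pass_rate = 1 - len(failures) / len(dataset)
    return {"pass_rate": pass_rate, "failures": failures}

# Toy usage: an exact-match scorer and a stubbed system under test
dataset = [
    {"id": "q001", "question": "2+2?", "ground_truth": "4"},
    {"id": "q002", "question": "3+3?", "ground_truth": "6"},
]
report = run_eval(
    dataset,
    run_system=lambda q: "4",  # stub: always answers "4"
    score=lambda out, truth: 1.0 if out == truth else 0.0,
)
# report == {"pass_rate": 0.5, "failures": ["q002"]}
```

The failing ids feed step 4 (regression analysis) and, per step 6, get promoted into the permanent dataset.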
Building an Eval Dataset
# Golden dataset structure
eval_dataset = [
    {
        "id": "q001",
        "question": "What is our refund policy?",
        "ground_truth": "Products can be returned within 30 days of purchase for a full refund.",
        "contexts": [
            "Our refund policy allows returns within 30 days of purchase.",
            "Refunds are processed within 5-7 business days.",
        ],
        "category": "policy",
        "difficulty": "easy",
        "tags": ["refund", "returns"],
    },
    {
        "id": "q002",
        "question": "Can I return a digital product?",
        "ground_truth": "Digital products are not eligible for refunds once accessed.",
        "contexts": [
            "Digital downloads cannot be returned once downloaded or accessed.",
        ],
        "category": "policy",
        "difficulty": "hard",  # Edge case
        "tags": ["refund", "digital"],
    },
]

# Minimum viable eval dataset: 50-100 questions per use case
# Cover: easy/medium/hard, each business scenario, known failure modes
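Before spending tokens on an eval run, it is worth failing fast on dataset problems. A small validation sketch for the structure above (the field names mirror the example; adapt to your own schema):

```python
from collections import Counter

# Field names match the golden dataset structure above
REQUIRED_KEYS = {"id", "question", "ground_truth", "contexts",
                 "category", "difficulty", "tags"}

def validate_dataset(dataset: list[dict]) -> Counter:
    """Fail fast on schema problems; return difficulty coverage counts."""
    for case in dataset:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} missing fields: {missing}")
    ids = [case["id"] for case in dataset]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate ids in eval dataset")
    return Counter(case["difficulty"] for case in dataset)

# For the two-case dataset above this returns Counter({"easy": 1, "hard": 1}),
# which makes thin coverage (e.g. zero "hard" cases) visible at a glance.
```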
Metric Design
For RAG Systems
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # 0-1: Is the answer grounded in the retrieved context?
    answer_relevancy,    # 0-1: Does the answer address the question?
    context_precision,   # 0-1: Is the retrieved context useful?
    context_recall,      # 0-1: Does the context contain enough info?
    answer_correctness,  # 0-1: Is the answer factually correct (vs. ground truth)?
)
from datasets import Dataset

# Rows need question, answer (the model's output), contexts, and ground_truth columns
results = evaluate(
    Dataset.from_list(eval_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Reading the scores:
# Faithfulness < 0.8: the LLM is hallucinating
# Context Recall < 0.7: retrieval is missing relevant chunks
# Context Precision < 0.7: retrieved chunks are irrelevant
# Answer Relevancy < 0.8: answers are off-topic
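Those rules of thumb can be applied mechanically as a gate over the RAGAS scores. A minimal sketch, with the floors copied from the comments above (tune them per application):

```python
# Floors taken from the rules of thumb above
RAG_THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.7,
}

def check_rag_gates(scores: dict) -> list[str]:
    """Return a human-readable violation for each metric below its floor."""
    return [
        f"{name}: {scores[name]:.2f} < {floor}"
        for name, floor in RAG_THRESHOLDS.items()
        if name in scores and scores[name] < floor
    ]

violations = check_rag_gates({"faithfulness": 0.72, "context_recall": 0.9})
# ["faithfulness: 0.72 < 0.8"]  -> retrieval is fine, generation is hallucinating
```

The combination of which gates fail is diagnostic: low recall points at retrieval, low faithfulness with good recall points at generation.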
For Chatbots / Assistants
# Custom LLM-as-Judge metric
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Question: {question}
AI Response: {response}

Rate the response on each dimension (1-5):
1. Accuracy: Is the information factually correct?
2. Completeness: Does it fully answer the question?
3. Clarity: Is it easy to understand?
4. Conciseness: Is it appropriately concise without being incomplete?
5. Tone: Is the tone appropriate and professional?

Return JSON only:
{{"accuracy": X, "completeness": X, "clarity": X, "conciseness": X, "tone": X, "reasoning": "..."}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        response_format={"type": "json_object"},
        temperature=0,  # Deterministic grading
    )
    return json.loads(result.choices[0].message.content)
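Individual judge verdicts are noisy, so in practice you average them over a dataset. A small aggregation helper for the JSON shape returned above (hypothetical, not part of any SDK):

```python
from statistics import mean

DIMENSIONS = ["accuracy", "completeness", "clarity", "conciseness", "tone"]

def aggregate_judge_scores(graded: list[dict]) -> dict:
    """Average each 1-5 dimension across judged responses ('reasoning' is ignored)."""
    return {dim: mean(g[dim] for g in graded) for dim in DIMENSIONS}
```

Comparing these averages across prompt or model versions is usually more informative than any single graded response.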
Rule-Based Metrics
import json
import re
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    """Compose multiple checks into a single evaluator."""
    def evaluate(response: str, metadata: dict) -> dict:
        results = {}
        for check in checks:
            name = check.__name__
            results[name] = check(response, metadata)
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response: str, _) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response: str, metadata: dict) -> bool:
    min_len = metadata.get("min_length", 50)
    max_len = metadata.get("max_length", 2000)
    return min_len <= len(response) <= max_len

def check_no_hallucination_markers(response: str, _) -> bool:
    """Flag responses whose hedging suggests hallucination."""
    hallucination_phrases = [
        "I'm not sure but",
        "I believe but I'm not certain",
        "I might be wrong",
        "as far as I know",
        "I think (but don't quote me)",
    ]
    response_lower = response.lower()
    return not any(phrase.lower() in response_lower for phrase in hallucination_phrases)

def check_cites_source(response: str, _) -> bool:
    """Check if the response includes a source citation."""
    patterns = [r'\[Source:', r'\[Ref:', r'According to', r'Based on']
    return any(re.search(p, response) for p in patterns)
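A self-contained sketch of how a composed evaluator gets used. Trivial stand-in checks replace the real ones so the example runs on its own; the composition pattern is the same:

```python
def make_evaluator(checks):
    """Same composition pattern as above: run every check, AND the results."""
    def evaluate(response, metadata):
        results = {check.__name__: check(response, metadata) for check in checks}
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_nonempty(response, _):
    return bool(response.strip())

def check_max_length(response, metadata):
    return len(response) <= metadata.get("max_length", 2000)

evaluator = make_evaluator([check_nonempty, check_max_length])
verdict = evaluator("Refunds take 5-7 business days.", {"max_length": 100})
# verdict == {"check_nonempty": True, "check_max_length": True, "passed": True}
```

Because each check's name appears in the result dict, a failing response tells you exactly which rule it broke.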
Promptfoo — Evals as Code
# promptfooconfig.yaml
prompts:
  - id: prompt-v1
    raw: |
      You are a helpful customer service agent.
      Answer questions about our products based on this knowledge base:
      {{context}}

      Question: {{question}}
      Answer:
  - id: prompt-v2
    raw: |
      You are a precise customer service agent. Answer ONLY using information
      from the provided context. If unsure, say "I don't have that information."

      Context: {{context}}
      Question: {{question}}
      Answer:

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: anthropic:claude-3-5-haiku-20241022

tests:
  - description: Basic refund policy query
    vars:
      question: "What is your return policy?"
      context: "Products can be returned within 30 days for a full refund."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response accurately describes the return policy and is helpful"
      - type: cost
        threshold: 0.01  # Max $0.01 per call
  - description: Edge case - digital products
    vars:
      question: "Can I return downloaded software?"
      context: "Digital downloads cannot be returned once accessed."
    assert:
      - type: contains
        value: "cannot"
      - type: not-contains
        value: "30 days"  # Should NOT apply the physical-goods policy to digital
      - type: factuality
        value: "Digital products cannot be returned once downloaded"
  - description: Hallucination test - question not in context
    vars:
      question: "What are your store hours?"
      context: "We sell premium software products."
    assert:
      - type: llm-rubric
        value: "The response should NOT invent store hours. It should say the information is not available."
# Run evals
npx promptfoo eval
npx promptfoo eval --output results.json
npx promptfoo view # Browser UI with comparison table
# Compare two configs
npx promptfoo eval --config config-v1.yaml
npx promptfoo eval --config config-v2.yaml
npx promptfoo view # See side-by-side comparison
LangSmith Tracing and Evaluation
import os

from langsmith import Client
from langsmith.evaluation import evaluate

# Auto-tracing: set these env vars in any LangChain app
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# Create a custom evaluator (can be passed to evaluate() alongside the built-ins)
from langchain.evaluation import load_evaluator

correctness_evaluator = load_evaluator(
    "labeled_score_string",
    criteria={
        "correctness": "Is the response factually correct based on the reference?"
    },
)

# Create a dataset from the golden examples
client = Client()
dataset = client.create_dataset("refund-questions")
for example in eval_dataset:
    client.create_example(
        inputs={"question": example["question"], "context": example["contexts"]},
        outputs={"answer": example["ground_truth"]},
        dataset_id=dataset.id,
    )

# Run the evaluation against the app under test
def run_rag_chain(inputs):
    return {"answer": my_rag_chain.invoke(inputs)}

results = evaluate(
    run_rag_chain,
    data=dataset.name,
    evaluators=["qa", "context_qa"],  # Built-in evaluators
    experiment_prefix="rag-v2",
)
Hallucination Detection
HALLUCINATION_CHECK_PROMPT = """Given the following context and response, determine if the response
contains any information NOT supported by the context (hallucination).

Context:
{context}

Response:
{response}

Return JSON:
{{
  "has_hallucination": true/false,
  "hallucinated_claims": ["specific claim not in context", ...],
  "confidence": 0.0-1.0,
  "reasoning": "explanation"
}}"""

def detect_hallucination(
    context: str,
    response: str,
    model: str = "gpt-4o",
) -> dict:
    # Reuses the OpenAI client and json import from the judge example above
    result = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": HALLUCINATION_CHECK_PROMPT.format(
                context=context, response=response
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
        seed=42,  # Best-effort determinism (seed is not guaranteed by the API)
    )
    return json.loads(result.choices[0].message.content)

# NLI-based hallucination detection (faster, cheaper)
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def is_faithful(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Check whether the hypothesis is entailed by the premise."""
    # top_k=None returns a score for every class, not just the top label
    scores = nli({"text": premise, "text_pair": hypothesis}, top_k=None)
    entailment = next((s for s in scores if s["label"].lower() == "entailment"), None)
    return entailment is not None and entailment["score"] > threshold
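Entailment checks work best claim by claim, not on whole answers. A sketch that splits the response into sentences and flags each unsupported one; the naive regex splitter and the substring stand-in for the NLI model are both illustrative only:

```python
import re

def unsupported_claims(context, response, is_faithful):
    """Split the response into sentences; return those not entailed by the context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if not is_faithful(context, s)]

# Toy usage with a substring check standing in for the real NLI model
flagged = unsupported_claims(
    "Returns accepted within 30 days.",
    "Returns are accepted within 30 days. Shipping is always free.",
    is_faithful=lambda ctx, claim: claim.split()[0].lower() in ctx.lower(),
)
# flagged == ["Shipping is always free."]
```

In production you would pass the `is_faithful` function defined above, so only the fabricated sentence gets surfaced rather than failing the whole response.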
CI/CD Integration for LLM Quality
# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths: ['prompts/**', 'rag/**', 'src/llm/**']
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo eval --output eval-results.json

      - name: Check quality gates
        run: |
          python scripts/check_eval_gates.py eval-results.json \
            --min-pass-rate 0.90 \
            --max-latency-p95 2000 \
            --max-cost-per-1k 0.50

      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json')
            const passRate = results.stats.successes / results.stats.total
            const comment = `## LLM Eval Results
            - Pass rate: ${(passRate * 100).toFixed(1)}%
            - Avg latency: ${results.stats.avgLatency}ms
            - Total cost: $${results.stats.totalCost.toFixed(4)}`
            github.rest.issues.createComment({...})
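The workflow invokes a scripts/check_eval_gates.py that is not shown. A minimal sketch of what it might contain; the stats field names (successes, total, latencyP95, totalCost) are assumptions about promptfoo's JSON output, so adjust them to the actual schema:

```python
import json

def check_gates(stats: dict, min_pass_rate: float,
                max_latency_p95: float, max_cost_per_1k: float) -> list[str]:
    """Return the list of violated gates; an empty list means the PR may merge."""
    violations = []
    pass_rate = stats["successes"] / stats["total"]
    if pass_rate < min_pass_rate:
        violations.append(f"pass rate {pass_rate:.1%} < {min_pass_rate:.1%}")
    if stats["latencyP95"] > max_latency_p95:
        violations.append(f"p95 latency {stats['latencyP95']}ms > {max_latency_p95}ms")
    cost_per_1k = stats["totalCost"] / stats["total"] * 1000
    if cost_per_1k > max_cost_per_1k:
        violations.append(f"cost per 1k calls ${cost_per_1k:.2f} > ${max_cost_per_1k:.2f}")
    return violations

def main(results_path: str) -> int:
    # Assumed location of the stats object inside the results file
    stats = json.load(open(results_path))["results"]["stats"]
    violations = check_gates(stats, min_pass_rate=0.90,
                             max_latency_p95=2000, max_cost_per_1k=0.50)
    for v in violations:
        print(f"GATE FAILED: {v}")
    return 1 if violations else 0
```

A real script would parse the CLI flags shown in the workflow and call sys.exit(main(path)) so a non-zero exit fails the job.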
Production Monitoring
import random

import prometheus_client as prom
from openai import OpenAI

client = OpenAI()

# Metrics to track
llm_request_latency = prom.Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    labelnames=["model", "prompt_version"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)
llm_token_usage = prom.Counter(
    "llm_tokens_total",
    "LLM token usage",
    labelnames=["model", "type"],  # type: input/output
)
response_quality = prom.Histogram(
    "llm_response_quality_score",
    "LLM response quality (0-1)",
    labelnames=["metric_name"],
)

def monitored_llm_call(prompt: str, model: str = "gpt-4o-mini"):
    with llm_request_latency.labels(model=model, prompt_version="v2").time():
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )

    # Track token usage
    llm_token_usage.labels(model=model, type="input").inc(response.usage.prompt_tokens)
    llm_token_usage.labels(model=model, type="output").inc(response.usage.completion_tokens)

    # Sample 10% for quality scoring (LLM-as-judge is too expensive for every call);
    # judge_response_quality is your LLM-as-judge scorer, e.g. judge_response above
    if random.random() < 0.1:
        quality = judge_response_quality(prompt, response.choices[0].message.content)
        response_quality.labels(metric_name="relevancy").observe(quality["relevancy"])

    return response.choices[0].message.content
Eval Framework Comparison
| Framework | Best For | Pros | Cons |
|---|---|---|---|
| Promptfoo | Prompt comparison, CI/CD | Evals-as-code, fast, cheap | Fewer enterprise features |
| LangSmith | LangChain apps | Deep tracing, dataset management | Vendor lock-in |
| Braintrust | Production monitoring | Nice UI, A/B testing | Newer |
| RAGAS | RAG systems | RAG-specific metrics, free | Limited to RAG |
| Phoenix (Arize) | ML + LLM observability | Good observability UX | Complex setup |
| DeepEval | Comprehensive evals | Many metrics, LLM-agnostic | Slower |
| Opik | Open source eval + tracing | Self-hostable | Less mature |