fine-tuning-llm
Expert guide to LLM fine-tuning: when to fine-tune vs RAG vs prompting, dataset preparation for quality over quantity, LoRA/QLoRA configuration, SFT vs DPO vs RLHF trade-offs, PEFT training scripts, post-training quantization, and evaluation frameworks.
Fine-Tuning LLMs
Fine-tuning is frequently the wrong answer. Most teams reach for it too early, burn weeks on dataset prep and GPU time, and end up with a model that's worse than a well-prompted base model. This skill helps you decide when fine-tuning is the right tool — and when it is, how to do it correctly.
Core Mental Model
Fine-tuning teaches the model how to respond, not what to know. If your problem is that the model doesn't know recent facts → use RAG. If the problem is that it doesn't know your domain vocabulary exists → use RAG. If the problem is that responses are in the wrong format, style, tone, or structure — even when the model "knows" the right answer — fine-tuning is your lever. The key question: "Would adding this to the system prompt solve it?" If yes, don't fine-tune.
When to Fine-Tune vs RAG vs Prompting
| Approach | Use When | Don't Use When |
| --- | --- | --- |
| Prompting | Format/style issues solvable in <500 tokens | Consistent behavior across thousands of calls |
| RAG | Knowledge gaps (facts, docs, current info) | The model knows the content but responds wrong |
| Fine-tuning | Style consistency, format adherence, domain jargon, response structure | Knowledge injection — fine-tuning doesn't reliably add facts |
| Fine-tune + RAG | Domain-specific retrieval AND response style both matter | Budget constraints — this is the most expensive option |
```python
# Decision framework
def choose_approach(problem_type: str, budget: str,
                    prompt_solves_it: bool = False) -> str:
    """Sketch: prompt_solves_it stands in for actually testing a
    few-shot prompt against your failure cases."""
    if problem_type == "knowledge_gap":
        return "RAG"  # Always RAG for knowledge
    if problem_type in ["style", "format", "tone", "structure"]:
        # Try prompting first — it's free
        if prompt_solves_it:
            return "few_shot_prompting"
        elif budget == "low":
            return "prompt_optimization"  # DSPy, OPRO
        else:
            return "supervised_fine_tuning"
    if problem_type == "instruction_following":
        return "DPO_or_RLHF"  # Alignment techniques
    if problem_type == "domain_jargon_and_reasoning":
        return "SFT_then_RAG"  # Both
```
Dataset Preparation
Quality over quantity is not a cliché — it's the most important rule.
1,000 curated, high-quality examples beat 100,000 scraped, noisy examples every time. Bad data doesn't average out — it degrades the model in ways that are hard to diagnose.
```python
# Dataset quality pipeline
import numpy as np
from datasets import Dataset
from sentence_transformers import SentenceTransformer

def prepare_dataset(raw_examples: list[dict]) -> tuple[Dataset, Dataset]:
    """
    Full quality pipeline: dedup → filter → format → split
    """
    # 1. Deduplication (near-duplicate removal using embeddings)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [ex["instruction"] for ex in raw_examples]
    embeddings = model.encode(texts, batch_size=256)
    # Remove examples within cosine similarity > 0.95
    filtered = remove_near_duplicates(raw_examples, embeddings, threshold=0.95)
    print(f"After dedup: {len(filtered)} / {len(raw_examples)}")

    # 2. Quality filter — remove short, empty, or malformed responses
    filtered = [
        ex for ex in filtered
        if len(ex["response"].split()) >= 10  # Minimum response length
        and ex["response"].strip()
        and ex["instruction"].strip()
        and "TODO" not in ex["response"]  # Remove placeholder responses
    ]

    # 3. Format to chat template (Llama 3 format)
    formatted = []
    for ex in filtered:
        formatted.append({
            "messages": [
                {"role": "system", "content": ex.get("system", "You are a helpful assistant.")},
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        })

    # 4. Train/val split (90/10)
    split_idx = int(len(formatted) * 0.9)
    return (
        Dataset.from_list(formatted[:split_idx]),  # train
        Dataset.from_list(formatted[split_idx:]),  # validation
    )

def remove_near_duplicates(examples, embeddings, threshold=0.95):
    # Greedy O(n²) — replace with FAISS for >100K examples
    keep = []
    kept_embeddings = []
    for ex, emb in zip(examples, embeddings):
        if kept_embeddings:
            kept = np.array(kept_embeddings)
            sims = kept @ emb / (
                np.linalg.norm(kept, axis=1) * np.linalg.norm(emb)
            )
            if sims.max() > threshold:
                continue
        keep.append(ex)
        kept_embeddings.append(emb)
    return keep
```
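As a quick illustration of what the dedup threshold does, here is the cosine-similarity check on toy vectors (the numbers are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.2, 0.0])    # original example
b = np.array([0.9, 0.25, 0.05])  # near-paraphrase of a
c = np.array([0.0, 1.0, 0.3])    # unrelated example

print(cosine(a, b) > 0.95)  # True: b would be dropped as a near-duplicate
print(cosine(a, c) > 0.95)  # False: c is kept
```

Anything above the 0.95 threshold against an already-kept example is discarded; lowering the threshold makes the filter more aggressive.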
LoRA Configuration
LoRA (Low-Rank Adaptation) adds trainable rank-decomposition matrices to frozen model weights. It's the practical choice for most fine-tuning — typically one to three orders of magnitude fewer trainable parameters than full fine-tuning.
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# QLoRA: load the base model in 4-bit, fine-tune LoRA adapters on top
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # Nested quantization for memory
    bnb_4bit_quant_type="nf4",             # NF4 dtype (better than fp4 for LLMs)
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Large attention-memory reduction on long sequences
)

# LoRA config: the most important hyperparameters
lora_config = LoraConfig(
    r=16,               # Rank: 8-64 typical. Higher = more capacity, more params.
                        # Start with 16. Increase if underfitting, decrease if overfitting.
    lora_alpha=32,      # Scale = alpha/r. Convention: alpha = 2*r. Controls update magnitude.
    target_modules=[    # Which layers to adapt. "all-linear" also works well for instruction tuning.
        "q_proj", "v_proj", "k_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",     # MLP (include for better results)
    ],
    lora_dropout=0.05,  # Regularization. 0.05-0.1 typical.
    bias="none",        # Don't train bias terms
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Roughly 42M trainable of ~8B total (≈0.5%) for r=16 with these target modules
```
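To see where that small trainable fraction comes from: LoRA trains two small matrices, B (d_out × r) and A (r × d_in), per adapted layer instead of the full d_out × d_in update. A quick arithmetic sketch for a single 4096×4096 attention projection (shape assumed from Llama 3 8B):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA trains B (d_out x r) and A (r x d_in)
    return r * (d_in + d_out)

def full_params(d_in: int, d_out: int) -> int:
    # A full fine-tune updates every weight in the layer
    return d_in * d_out

d, r = 4096, 16
print(lora_params(d, d, r))  # 131072 trainable params for this layer
print(full_params(d, d))     # 16777216 weights in the frozen matrix
print(lora_params(d, d, r) / full_params(d, d))  # 0.0078125, under 1% per layer
```

Doubling r doubles the adapter size but leaves the frozen base untouched, which is why rank sweeps are cheap to run.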
Training with TRL
```python
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./outputs/llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # With gradient accumulation → effective batch 16
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,     # Trade compute for memory
    learning_rate=2e-4,              # Higher than full fine-tune (LoRA can handle it)
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # Warmup for 5% of steps
    weight_decay=0.01,
    bf16=True,                       # BF16 on Ampere+; fp16 for older GPUs
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    max_seq_length=4096,
    packing=True,                    # Pack multiple short examples into one sequence
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # Newer TRL versions take processing_class= instead
)
trainer.train()
trainer.save_model("./outputs/llama3-finetuned-final")
```
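The effective batch size and warmup schedule implied by the config above, as plain arithmetic (the 10,000-example dataset size is a made-up number for illustration):

```python
# Effective batch size = per-device batch x gradient accumulation x num GPUs
per_device_batch = 2
grad_accum = 8
num_gpus = 1
effective_batch = per_device_batch * grad_accum * num_gpus
print(effective_batch)  # 16

# Optimizer steps and warmup steps from warmup_ratio=0.05
num_examples = 10_000  # made-up dataset size
num_epochs = 3
steps_per_epoch = num_examples // effective_batch
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.05)
print(total_steps, warmup_steps)  # 1875 93
```

With packing=True the real step count depends on how many short examples fit in each 4096-token sequence, so treat this as a rough upper bound.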
SFT vs DPO vs RLHF
| Method | What It Optimizes | When to Use | Complexity |
| --- | --- | --- | --- |
| SFT (Supervised Fine-Tuning) | Mimic example responses | Format, style, domain vocab | Low |
| DPO (Direct Preference Optimization) | Prefer chosen over rejected | Alignment, safety, quality improvement | Medium |
| RLHF | Reward model signal | Complex alignment at scale (GPT-4 level) | Very High |
```python
# DPO training — needs preference pairs (chosen vs rejected)
from datasets import Dataset
from trl import DPOTrainer, DPOConfig

# Dataset format for DPO
dpo_dataset = Dataset.from_list([
    {
        "prompt": "Explain recursion in Python",
        "chosen": "Recursion is when a function calls itself...[good explanation]",
        "rejected": "Just look it up lol",  # Bad response
    },
])

dpo_config = DPOConfig(
    output_dir="./outputs/llama3-dpo",
    beta=0.1,              # KL penalty strength (higher = stay closer to reference model)
    loss_type="sigmoid",   # sigmoid | hinge | ipo
    learning_rate=5e-7,    # Much lower than SFT — small nudges only
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,    # DPO overfits quickly — often 1 epoch is enough
)

dpo_trainer = DPOTrainer(
    model=sft_model,       # Start from the SFT checkpoint, not the base model
    ref_model=ref_model,   # Frozen reference (the original SFT model)
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
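Under the hood, the sigmoid loss rewards the policy for widening the chosen-vs-rejected margin relative to the frozen reference model. A toy calculation with illustrative log-probabilities:

```python
import math

def dpo_sigmoid_loss(logp_chosen: float, logp_rejected: float,
                     ref_logp_chosen: float, ref_logp_rejected: float,
                     beta: float = 0.1) -> float:
    # Implicit rewards are beta-scaled log-ratios against the frozen reference
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy prefers the chosen response slightly more than the reference does
print(round(dpo_sigmoid_loss(-10.0, -14.0, -11.0, -13.0), 4))  # 0.5981
```

Note how beta scales the margin: the same log-probabilities with a larger beta produce a stronger gradient signal per unit of divergence from the reference.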
Evaluation Before and After
```bash
# Evaluation harness using lm-evaluation-harness
# pip install lm-eval

# Command-line evaluation on standard benchmarks
lm_eval --model hf --model_args pretrained=./outputs/my-model \
  --tasks mmlu,hellaswag,truthfulqa_mc1 \
  --num_fewshot 5 --batch_size 8 --output_path results.json
```
```python
# Custom held-out set evaluation
import numpy as np
import torch
from rouge_score import rouge_scorer

def evaluate_on_holdout(model, tokenizer, holdout_examples: list[dict]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    results = []
    for ex in holdout_examples:
        prompt = format_prompt(ex["instruction"])  # your prompt-formatting helper
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,  # Greedy decoding for deterministic eval
            )
        generated = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        scores = scorer.score(ex["reference"], generated)
        results.append({
            "rougeL": scores["rougeL"].fmeasure,
            "generated_length": len(generated.split()),
        })
    return {
        "mean_rougeL": np.mean([r["rougeL"] for r in results]),
        "mean_length": np.mean([r["generated_length"] for r in results]),
    }
```
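The rouge_score library handles stemming and multiple n-gram orders; as a sanity check on what ROUGE-1 F1 actually measures, a simplified version (whitespace tokenization, no stemming) fits in a few lines:

```python
from collections import Counter

def rouge1_f(reference: str, generated: str) -> float:
    # Unigram-overlap F1 with clipped counts (simplified ROUGE-1)
    ref, gen = Counter(reference.split()), Counter(generated.split())
    overlap = sum((ref & gen).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat sat"))  # ≈ 0.667 (P=1.0, R=0.5)
```

The asymmetry matters: a short generation that copies reference words gets perfect precision but low recall, which is why F1 is the number to track.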
Post-Training Quantization
```python
# GPTQ quantization (fast inference, small model)
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # Calibration dataset
    tokenizer=tokenizer,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "./outputs/finetuned-merged",  # LoRA adapters merged into base weights first
    quantization_config=gptq_config,
    device_map="auto",
)
quantized_model.save_pretrained("./outputs/finetuned-gptq-4bit")
```
```bash
# Or: export to GGUF for llama.cpp (CPU inference).
# convert_hf_to_gguf.py emits f16/bf16/q8_0; q4_k_m needs a second llama-quantize pass:
python llama.cpp/convert_hf_to_gguf.py ./outputs/finetuned-merged --outtype f16
# then: llama-quantize <f16 gguf> <output gguf> q4_k_m
```
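As a rough size check before choosing a format, weight memory scales linearly with bits per parameter (a back-of-envelope estimate that ignores embeddings, quantization group metadata, and KV cache):

```python
def model_size_gb(num_params: float, bits_per_param: float) -> float:
    # bits -> bytes -> gigabytes (decimal GB)
    return num_params * bits_per_param / 8 / 1e9

params = 8.03e9  # Llama 3 8B, approximately
print(round(model_size_gb(params, 16), 1))  # 16.1 GB in bf16
print(round(model_size_gb(params, 8), 1))   # 8.0 GB in int8
print(round(model_size_gb(params, 4), 1))   # 4.0 GB at 4-bit
```

Real GGUF/GPTQ files run slightly larger than this because scales and zero-points are stored per quantization group.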
Axolotl Config (Production Fine-Tuning)
```yaml
# axolotl_config.yaml — Axolotl handles all the boilerplate
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

datasets:
  - path: my_dataset.jsonl
    type: chat_template
    chat_template: llama3
dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./outputs/axolotl-run

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
flash_attention: true

warmup_steps: 50
saves_per_epoch: 2
logging_steps: 10
eval_steps: 50
```
Anti-Patterns
❌ Fine-tuning for knowledge injection
Models don't reliably memorize facts from fine-tuning. Even if they appear to, they hallucinate confidently about the edges. Use RAG for factual knowledge.
❌ Training on generated data from the same model family
Fine-tuning Llama on GPT-4 outputs violates OpenAI ToS AND degrades model quality (model collapse). Use human-generated or carefully reviewed data.
❌ Skipping the SFT step before DPO
DPO on a base model rarely works well. Always SFT first to get the model in the right format, then DPO to align preferences.
❌ Using the full dataset without quality filtering
5,000 curated examples typically beat 500,000 scraped ones. The signal-to-noise ratio is everything.
❌ Not monitoring training loss vs validation loss
Overfitting is invisible if you only watch training loss. Watch the gap: if train loss drops but val loss plateaus or rises, stop early.
❌ Forgetting to merge adapters before quantization
PEFT adapters must be merged into the base model weights before GPTQ/AWQ quantization. Unmerged adapters + quantization = broken model.
Quick Reference
LoRA rank selection:
- Style/format only → r=8 (minimal parameters)
- General instruction tune → r=16 (default starting point)
- Complex reasoning tasks → r=32-64 (more capacity needed)
- Full fine-tune territory → r=128+ (diminishing returns; consider full FT)
Learning rates:
- SFT (LoRA) → 1e-4 to 2e-4
- SFT (full FT) → 1e-5 to 5e-5
- DPO → 5e-7 to 1e-6 (much lower — gentle nudges)
GPU VRAM requirements (Llama 3 8B):
- Full fine-tune (bf16) → 80GB+ (A100 or 2x A40)
- LoRA (bf16) → 24GB (A10G, RTX 3090)
- QLoRA (4-bit + LoRA) → 10-12GB (RTX 3080, T4)
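The QLoRA figure can be roughly reconstructed from first principles. A back-of-envelope sketch (the ~42M adapter size and the 6 GB activation/overhead allowance are assumptions; real usage varies with sequence length and batch size):

```python
def qlora_vram_gb(base_params: float = 8.03e9,
                  adapter_params: float = 42e6,
                  overhead_gb: float = 6.0) -> float:
    # Frozen base weights stored in 4-bit
    base_gb = base_params * 4 / 8 / 1e9
    # Adapters: bf16 weights + bf16 grads + two fp32 Adam moments (2+2+4+4 bytes/param)
    adapter_gb = adapter_params * (2 + 2 + 4 + 4) / 1e9
    return base_gb + adapter_gb + overhead_gb

print(round(qlora_vram_gb(), 1))  # 10.5, in line with the 10-12GB range
```

The takeaway: adapter and optimizer memory is negligible next to the base weights, so the lever for fitting on smaller GPUs is quantizing the base and capping sequence length.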
Dataset size heuristics:
- Style/format tuning → 500-2,000 examples (quality matters most)
- Domain adaptation → 5,000-50,000 examples
- Instruction following → 10,000-100,000 examples
- DPO preference pairs → 1,000-10,000 pairs (much fewer needed)