fine-tuning-llm
Expert guide to LLM fine-tuning: when to fine-tune vs RAG vs prompting, dataset preparation for quality over quantity, LoRA/QLoRA configuration, SFT vs DPO vs RLHF trade-offs, PEFT training scripts, post-training quantization, and evaluation frameworks.
Installation
npx clawhub@latest install fine-tuning-llm
View the full skill documentation and source below.
Documentation
Fine-Tuning LLMs
Fine-tuning is frequently the wrong answer. Most teams reach for it too early, burn weeks on dataset prep and GPU time, and end up with a model that's worse than a well-prompted base model. This skill helps you decide when fine-tuning is the right tool — and when it is, how to do it correctly.
Core Mental Model
Fine-tuning teaches the model how to respond, not what to know. If your problem is that the model doesn't know recent facts → use RAG. If the problem is that it doesn't know your domain vocabulary exists → use RAG. If the problem is that responses are in the wrong format, style, tone, or structure — even when the model "knows" the right answer — fine-tuning is your lever. The key question: "Would adding this to the system prompt solve it?" If yes, don't fine-tune.
When to Fine-Tune vs RAG vs Prompting
| Approach | Use When | Don't Use When |
| --- | --- | --- |
| Prompting | Format/style issues solvable in <500 tokens | Consistent behavior across thousands of calls |
| RAG | Knowledge gaps (facts, docs, current info) | The model knows the content but responds wrong |
| Fine-tuning | Style consistency, format adherence, domain jargon, response structure | Knowledge injection — fine-tuning doesn't reliably add facts |
| Fine-tune + RAG | Domain-specific retrieval AND response style both matter | Budget constraints — this is the most expensive |
# Decision framework
def choose_approach(problem_type: str, budget: str, solvable_with_prompt: bool = False) -> str:
    if problem_type == "knowledge_gap":
        return "RAG"  # Always RAG for knowledge
    if problem_type in ("style", "format", "tone", "structure"):
        # Try prompting first — it's free
        if solvable_with_prompt:
            return "few_shot_prompting"
        if budget == "low":
            return "prompt_optimization"  # DSPy, OPRO
        return "supervised_fine_tuning"
    if problem_type == "instruction_following":
        return "DPO_or_RLHF"  # Alignment techniques
    if problem_type == "domain_jargon_and_reasoning":
        return "SFT_then_RAG"  # Both
    raise ValueError(f"Unknown problem type: {problem_type}")
Dataset Preparation
Quality over quantity is not a cliché — it's the most important rule.
1,000 curated, high-quality examples beat 100,000 scraped, noisy examples every time. Bad data doesn't average out — it degrades the model in ways that are hard to diagnose.
# Dataset quality pipeline
import json
from datasets import Dataset
from sentence_transformers import SentenceTransformer
import numpy as np
def prepare_dataset(raw_examples: list[dict]) -> tuple[Dataset, Dataset]:
    """
    Full quality pipeline: dedup → filter → format → validate.
    Returns (train, validation) splits.
    """
    # 1. Deduplication (near-duplicate removal using embeddings)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [ex["instruction"] for ex in raw_examples]
    embeddings = model.encode(texts, batch_size=256)

    # Remove examples with cosine similarity > 0.95 to an already-kept example
    filtered = remove_near_duplicates(raw_examples, embeddings, threshold=0.95)
    print(f"After dedup: {len(filtered)} / {len(raw_examples)}")

    # 2. Quality filter — remove short, empty, or malformed responses
    filtered = [
        ex for ex in filtered
        if len(ex["response"].split()) >= 10  # Minimum response length
        and ex["response"].strip()
        and ex["instruction"].strip()
        and "TODO" not in ex["response"]  # Remove placeholder responses
    ]

    # 3. Format to chat template (Llama 3 format)
    formatted = []
    for ex in filtered:
        formatted.append({
            "messages": [
                {"role": "system", "content": ex.get("system", "You are a helpful assistant.")},
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        })

    # 4. Train/val split (90/10)
    split_idx = int(len(formatted) * 0.9)
    return (
        Dataset.from_list(formatted[:split_idx]),   # train
        Dataset.from_list(formatted[split_idx:]),   # validation
    )
def remove_near_duplicates(examples, embeddings, threshold=0.95):
    # Greedy O(n²) — replace with FAISS for >100K examples
    keep = []
    kept_embeddings = []
    for ex, emb in zip(examples, embeddings):
        if kept_embeddings:
            sims = np.dot(kept_embeddings, emb) / (
                np.linalg.norm(kept_embeddings, axis=1) * np.linalg.norm(emb)
            )
            if sims.max() > threshold:
                continue
        keep.append(ex)
        kept_embeddings.append(emb)
    return keep
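The dedup step hinges on cosine similarity between embedding vectors. A minimal numpy-only demonstration of the same greedy filter, using hand-made 2-D vectors in place of real sentence embeddings:

```python
import numpy as np

def greedy_dedup(vectors: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of vectors to keep, skipping near-duplicates of already-kept ones."""
    kept: list[int] = []
    for i, v in enumerate(vectors):
        is_dup = False
        for j in kept:
            u = vectors[j]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            if cos > threshold:
                is_dup = True
                break
        if not is_dup:
            kept.append(i)
    return kept

vecs = np.array([
    [1.0, 0.0],    # kept
    [0.99, 0.05],  # cosine ≈ 0.999 vs. the first → dropped
    [0.0, 1.0],    # orthogonal → kept
])
print(greedy_dedup(vecs))  # [0, 2]
```

Raising the threshold keeps more near-duplicates; at 0.9999 all three vectors survive.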
LoRA Configuration
LoRA (Low-Rank Adaptation) adds trainable rank-decomposition matrices to frozen model weights. It's the practical choice for most fine-tuning — 10-100x fewer trainable parameters vs full fine-tuning.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# QLoRA: load the base model in 4-bit, fine-tune LoRA adapters on top
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # Nested quantization for extra memory savings
    bnb_4bit_quant_type="nf4",        # NF4 dtype (better than fp4 for LLM weights)
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Large attention-memory savings on long sequences
)
# LoRA config: the most important hyperparameters
lora_config = LoraConfig(
    r=16,               # Rank: 8-64 typical. Higher = more capacity, more params.
                        # Start with 16. Increase if underfitting, decrease if overfitting.
    lora_alpha=32,      # Scale = alpha/r. Convention: alpha = 2*r. Controls update magnitude.
    target_modules=[    # Which layers to adapt. "all-linear" also works well for instruction tuning.
        "q_proj", "v_proj", "k_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP (include for better results)
    ],
    lora_dropout=0.05,  # Regularization. 0.05-0.1 typical.
    bias="none",        # Don't train bias terms
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 across all seven projections this trains ~42M parameters — well under 1% of the 8B total
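The trainable-parameter count is easy to sanity-check by hand: a LoRA adapter on a d_in × d_out layer adds r × (d_in + d_out) parameters (matrices A: r × d_in and B: d_out × r). A back-of-envelope calculation, assuming Llama 3 8B dimensions (hidden 4096, GQA key/value dim 1024, MLP dim 14336, 32 layers):

```python
def lora_param_count(r: int, layer_shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Total LoRA parameters: r * (d_in + d_out) per adapted layer, summed over all layers."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * n_layers

# (d_in, d_out) for the seven target modules, Llama 3 8B dimensions
shapes = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention shrinks k/v)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
total = lora_param_count(r=16, layer_shapes=shapes, n_layers=32)
print(f"{total:,}")                     # 41,943,040
print(f"{total / 8_030_261_248:.2%}")   # ~0.52% of the 8B base parameters
```

Doubling the rank doubles the count: r=32 gives 83,886,080, which is where figures around "84M trainable" come from.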
Training with TRL
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

training_args = SFTConfig(
    output_dir="./outputs/llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,    # With gradient accumulation → effective batch 16
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,      # Trade compute for memory
    learning_rate=2e-4,               # Higher than full fine-tuning (LoRA can handle it)
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                # Warm up for 5% of steps
    weight_decay=0.01,
    bf16=True,                        # BF16 on Ampere+; fp16 for older GPUs
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    max_seq_length=4096,
    packing=True,                     # Pack multiple short examples into one sequence
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # renamed to processing_class= in newer TRL versions
)
trainer.train()
trainer.save_model("./outputs/llama3-finetuned-final")
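The `warmup_ratio=0.05` plus `lr_scheduler_type="cosine"` combination above produces a ramp-then-decay learning-rate curve. A pure-Python sketch of what the scheduler computes at each step (mirroring transformers' cosine-with-warmup schedule):

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.05) -> float:
    """Linear warmup to peak_lr over warmup_ratio of training, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total, 2e-4))     # 0.0 — start of warmup
print(cosine_lr(50, total, 2e-4))    # 2e-4 — peak at end of warmup
print(cosine_lr(1000, total, 2e-4))  # ~0 — fully decayed
```

The practical consequence: the quoted `learning_rate` is a peak, not a constant, so short runs that never leave the warmup phase effectively train at a much lower rate.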
SFT vs DPO vs RLHF
| Method | What It Optimizes | When to Use | Complexity |
| --- | --- | --- | --- |
| SFT (Supervised Fine-Tuning) | Mimic example responses | Format, style, domain vocab | Low |
| DPO (Direct Preference Optimization) | Prefer chosen over rejected | Alignment, safety, quality improvement | Medium |
| RLHF | Reward model signal | Complex alignment at scale (GPT-4 level) | Very High |
# DPO training — needs preference pairs (chosen vs rejected)
from datasets import Dataset
from trl import DPOTrainer, DPOConfig

# Dataset format for DPO
dpo_dataset = Dataset.from_list([
    {
        "prompt": "Explain recursion in Python",
        "chosen": "Recursion is when a function calls itself...[good explanation]",
        "rejected": "Just look it up lol",  # Bad response
    },
])

dpo_config = DPOConfig(
    output_dir="./outputs/llama3-dpo",
    beta=0.1,                # KL penalty strength (higher = stay closer to reference model)
    loss_type="sigmoid",     # sigmoid | hinge | ipo
    learning_rate=5e-7,      # Much lower than SFT — small nudges only
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,      # DPO overfits quickly — often 1 epoch is enough
)

dpo_trainer = DPOTrainer(
    model=sft_model,         # Start from the SFT checkpoint, not the base model
    ref_model=ref_model,     # Frozen reference copy (the original SFT model)
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
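Under the sigmoid loss variant, each preference pair contributes -log σ(β · margin), where the margin is the difference between the chosen and rejected responses' log-probability ratios against the reference model. A pure-Python sketch of the per-pair loss, which makes beta's role concrete:

```python
import math

def dpo_sigmoid_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """
    Per-pair DPO sigmoid loss.
    chosen_logratio   = log pi(chosen | x)   - log pi_ref(chosen | x)
    rejected_logratio = log pi(rejected | x) - log pi_ref(rejected | x)
    """
    margin = chosen_logratio - rejected_logratio
    sigmoid = 1.0 / (1.0 + math.exp(-beta * margin))
    return -math.log(sigmoid)

print(dpo_sigmoid_loss(0.0, 0.0))   # log(2) ≈ 0.693 — no preference learned yet
print(dpo_sigmoid_loss(5.0, -5.0))  # lower loss — chosen already preferred over rejected
```

A larger beta amplifies the margin inside the sigmoid, so the model gets less gradient pressure once a preference is established — one way to see why high beta keeps the policy near the reference.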
Evaluation Before and After
# Evaluation harness using lm-evaluation-harness
# pip install lm-eval
#
# Command-line evaluation on standard benchmarks:
#   lm_eval --model hf --model_args pretrained=./outputs/my-model \
#     --tasks mmlu,hellaswag,truthfulqa_mc1 \
#     --num_fewshot 5 --batch_size 8 --output_path results.json

# Custom held-out set evaluation
import numpy as np
import torch
from rouge_score import rouge_scorer
def evaluate_on_holdout(model, tokenizer, holdout_examples: list[dict]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    results = []
    for ex in holdout_examples:
        prompt = format_prompt(ex["instruction"])  # your prompt-formatting helper
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,  # Greedy decoding for deterministic eval
            )
        generated = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        scores = scorer.score(ex["reference"], generated)
        results.append({
            "rougeL": scores["rougeL"].fmeasure,
            "generated_length": len(generated.split()),
        })
    return {
        "mean_rougeL": np.mean([r["rougeL"] for r in results]),
        "mean_length": np.mean([r["generated_length"] for r in results]),
    }
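rouge_score handles stemming, ROUGE-2, and longest-common-subsequence ROUGE-L, but the core idea is n-gram overlap. A dependency-free ROUGE-1 F1 sketch clarifies what the numbers mean (no stemming, whitespace tokenization):

```python
from collections import Counter

def rouge1_f1(reference: str, generated: str) -> float:
    """Unigram-overlap F1 between reference and generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(rouge1_f1("the cat sat on the mat", "a dog ran"))               # 0.0
```

Overlap metrics reward surface similarity, not correctness, so pair them with spot-checks or an LLM-as-judge pass before trusting a ROUGE delta.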
Post-Training Quantization
# GPTQ quantization (fast inference, small model)
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",      # Calibration dataset
    tokenizer=tokenizer,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "./outputs/finetuned-merged",   # LoRA adapters must already be merged in
    quantization_config=gptq_config,
    device_map="auto",
)
quantized_model.save_pretrained("./outputs/finetuned-gptq-4bit")

# Or: export to GGUF for llama.cpp (CPU inference).
# Convert to GGUF first, then quantize to Q4_K_M with llama-quantize:
#   python llama.cpp/convert_hf_to_gguf.py ./outputs/finetuned-merged --outfile model-f16.gguf
#   llama.cpp/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
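The payoff is easy to estimate: weight size is roughly parameter count × bits per weight, ignoring the few percent of overhead that quantization scales and zero-points add. A back-of-envelope sketch for an 8B-parameter model (the ~4.5 effective bits per weight for 4-bit formats is an assumption, not an exact figure):

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk / in-memory weight size; ignores quantization metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit (approx)", 4.5)]:
    print(f"{label:>15}: {approx_model_size_gb(8.03e9, bits):.1f} GB")
```

This is weights only; KV cache and activations add more at inference time, which is why the VRAM figures later in this guide are higher than these numbers.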
Axolotl Config (Production Fine-Tuning)
# axolotl_config.yaml — Axolotl handles all the boilerplate
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false
datasets:
  - path: my_dataset.jsonl
    type: chat_template
    chat_template: llama3
dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./outputs/axolotl-run
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: true
flash_attention: true
warmup_steps: 50
saves_per_epoch: 2
logging_steps: 10
eval_steps: 50
Anti-Patterns
❌ Fine-tuning for knowledge injection
Models don't reliably memorize facts from fine-tuning. Even if they appear to, they hallucinate confidently about the edges. Use RAG for factual knowledge.
❌ Training on unreviewed model-generated data
Fine-tuning Llama on raw GPT-4 outputs can violate OpenAI's terms of service, and training a model on its own (or a sibling model's) generations compounds errors over rounds (model collapse). Use human-generated or carefully reviewed data.
❌ Skipping the SFT step before DPO
DPO on a base model rarely works well. Always SFT first to get the model in the right format, then DPO to align preferences.
❌ Using the full dataset without quality filtering
5,000 curated examples typically beat 500,000 scraped ones. The signal-to-noise ratio is everything.
❌ Not monitoring training loss vs validation loss
Overfitting is invisible if you only watch training loss. Watch the gap: if train loss drops but val loss plateaus or rises, stop early.
❌ Forgetting to merge adapters before quantization
PEFT adapters must be merged into the base model weights before GPTQ/AWQ quantization. Unmerged adapters + quantization = broken model.
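The overfitting anti-pattern above reduces to a mechanical rule: stop when validation loss has not improved on its previous best for N consecutive evals (transformers ships this as EarlyStoppingCallback; here is a dependency-free sketch of the same logic):

```python
def should_stop_early(val_losses: list[float], patience: int = 3, min_delta: float = 0.0) -> bool:
    """True when the last `patience` evals all failed to beat the best val loss before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)

print(should_stop_early([1.0, 0.8, 0.7, 0.71, 0.72, 0.74]))  # True — val loss rising for 3 evals
print(should_stop_early([1.0, 0.8, 0.7, 0.65, 0.64, 0.63]))  # False — still improving
```

Pair this with the `load_best_model_at_end=True` setting from the training config so the saved checkpoint is the best one, not the last one.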
Quick Reference
LoRA rank selection:
Style/format only → r=8 (minimal parameters)
General instruction tune → r=16 (default starting point)
Complex reasoning tasks → r=32-64 (more capacity needed)
Full fine-tune territory → r=128+ (diminishing returns, consider full FT)
Learning rates:
SFT (LoRA) → 1e-4 to 2e-4
SFT (full FT) → 1e-5 to 5e-5
DPO → 5e-7 to 1e-6 (much lower — gentle nudges)
GPU VRAM requirements (Llama 3 8B):
Full fine-tune (bf16) → 80GB+ (A100 or 2x A40)
LoRA (bf16) → 24GB (A10G, RTX 3090)
QLoRA (4-bit + LoRA) → 10-12GB (RTX 3080, T4)
Dataset size heuristics:
Style/format tuning → 500-2,000 examples (quality matters most)
Domain adaptation → 5,000-50,000 examples
Instruction following → 10,000-100,000 examples
DPO preference pairs → 1,000-10,000 pairs (much fewer needed)