
Expert guide to LLM fine-tuning: when to fine-tune vs RAG vs prompting, dataset preparation for quality over quantity, LoRA/QLoRA configuration, SFT vs DPO vs RLHF trade-offs, PEFT training scripts, post-training quantization, and evaluation frameworks.


Fine-Tuning LLMs

Fine-tuning is frequently the wrong answer. Most teams reach for it too early, burn weeks on dataset prep and GPU time, and end up with a model that's worse than a well-prompted base model. This skill helps you decide when fine-tuning is the right tool — and when it is, how to do it correctly.

Core Mental Model

Fine-tuning teaches the model how to respond, not what to know. If your problem is that the model doesn't know recent facts → use RAG. If the problem is that it doesn't know your domain vocabulary exists → use RAG. If the problem is that responses are in the wrong format, style, tone, or structure — even when the model "knows" the right answer — fine-tuning is your lever. The key question: "Would adding this to the system prompt solve it?" If yes, don't fine-tune.

When to Fine-Tune vs RAG vs Prompting

| Approach | Use When | Don't Use When |
|---|---|---|
| Prompting | Format/style issues solvable in <500 tokens | Consistent behavior across thousands of calls |
| RAG | Knowledge gaps (facts, docs, current info) | The model knows the content but responds wrong |
| Fine-tuning | Style consistency, format adherence, domain jargon, response structure | Knowledge injection — fine-tuning doesn't reliably add facts |
| Fine-tune + RAG | Domain-specific retrieval AND response style both matter | Budget constraints — this is the most expensive |
# Decision framework (can_solve_with_prompt is a placeholder for your own check)
def choose_approach(problem_type: str, budget: str) -> str:
    if problem_type == "knowledge_gap":
        return "RAG"  # Always RAG for knowledge

    if problem_type in ["style", "format", "tone", "structure"]:
        # Try prompting first — it's free
        if can_solve_with_prompt():
            return "few_shot_prompting"
        elif budget == "low":
            return "prompt_optimization"  # DSPy, OPRO
        else:
            return "supervised_fine_tuning"

    if problem_type == "instruction_following":
        return "DPO_or_RLHF"  # Alignment techniques

    if problem_type == "domain_jargon_and_reasoning":
        return "SFT_then_RAG"  # Both

    raise ValueError(f"Unknown problem_type: {problem_type}")

Dataset Preparation

Quality over quantity is not a cliché — it's the most important rule.

1,000 curated, high-quality examples beat 100,000 scraped, noisy examples every time. Bad data doesn't average out — it degrades the model in ways that are hard to diagnose.

# Dataset quality pipeline
import json
from datasets import Dataset
from sentence_transformers import SentenceTransformer
import numpy as np

def prepare_dataset(raw_examples: list[dict]) -> tuple[Dataset, Dataset]:
    """
    Full quality pipeline: dedup → filter → format → validate.
    Returns (train, validation) datasets.
    """
    # 1. Deduplication (near-duplicate removal using embeddings)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [ex["instruction"] for ex in raw_examples]
    embeddings = model.encode(texts, batch_size=256)
    
    # Remove examples within cosine similarity > 0.95
    filtered = remove_near_duplicates(raw_examples, embeddings, threshold=0.95)
    print(f"After dedup: {len(filtered)} / {len(raw_examples)}")
    
    # 2. Quality filter — remove short, empty, or malformed responses
    filtered = [
        ex for ex in filtered
        if len(ex["response"].split()) >= 10  # Minimum response length
        and ex["response"].strip()
        and ex["instruction"].strip()
        and "TODO" not in ex["response"]  # Remove placeholder responses
    ]
    
    # 3. Format to chat template (Llama 3 format)
    formatted = []
    for ex in filtered:
        formatted.append({
            "messages": [
                {"role": "system", "content": ex.get("system", "You are a helpful assistant.")},
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        })
    
    # 4. Shuffle, then train/val split (90/10)
    np.random.default_rng(42).shuffle(formatted)
    split_idx = int(len(formatted) * 0.9)
    return (
        Dataset.from_list(formatted[:split_idx]),  # train
        Dataset.from_list(formatted[split_idx:]),   # validation
    )

def remove_near_duplicates(examples, embeddings, threshold=0.95):
    # Greedy O(n²) — replace with FAISS for >100K examples
    keep = []
    kept_embeddings = []
    for i, (ex, emb) in enumerate(zip(examples, embeddings)):
        if kept_embeddings:
            sims = np.dot(kept_embeddings, emb) / (
                np.linalg.norm(kept_embeddings, axis=1) * np.linalg.norm(emb)
            )
            if sims.max() > threshold:
                continue
        keep.append(ex)
        kept_embeddings.append(emb)
    return keep
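For larger corpora, the same greedy keep-first policy can be vectorized before you need a real ANN index. A sketch (`remove_near_duplicates_vectorized` is a hypothetical name; for >100K examples FAISS is still the better tool, as the comment above notes):

```python
import numpy as np

def remove_near_duplicates_vectorized(examples, embeddings, threshold=0.95):
    """Same greedy policy as above, but embeddings are normalized once
    so each similarity check is a single matrix-vector product."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    keep, kept = [], []
    for ex, emb in zip(examples, normed):
        if kept and (np.stack(kept) @ emb).max() > threshold:
            continue  # too close to an example we already kept
        keep.append(ex)
        kept.append(emb)
    return keep
```

Normalizing up front turns every cosine similarity into a plain dot product, which NumPy batches across all kept embeddings at once.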

LoRA Configuration

LoRA (Low-Rank Adaptation) adds trainable rank-decomposition matrices to frozen model weights. It's the practical choice for most fine-tuning — 10-100x fewer trainable parameters vs full fine-tuning.
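The core idea fits in a few lines of NumPy (a toy sketch with made-up dimensions, not the PEFT internals): the frozen weight W gets a low-rank update B·A scaled by alpha/r, and "merging" simply folds that product back into W:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 16, 32

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

x = rng.normal(size=d_in)
h = W @ x + (alpha / r) * (B @ (A @ x))  # LoRA forward pass

# Zero-init B means the adapted layer starts exactly equal to the frozen one
assert np.allclose(h, W @ x)

# Merging (what peft's merge_and_unload does) folds the update into W
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, h)
```

Only A and B train, so the parameter count per adapted matrix is r·(d_in + d_out) instead of d_in·d_out, which is where the 10-100x savings comes from.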

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# QLoRA: Load in 4-bit, fine-tune LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # Nested quantization for memory
    bnb_4bit_quant_type="nf4",         # NF4 dtype (better than fp4 for LLMs)
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # 2-3x memory reduction
)

# LoRA config: the most important hyperparameters
lora_config = LoraConfig(
    r=16,                    # Rank: 8-64 typical. Higher = more capacity, more params.
                             # Start with 16. Increase if underfitting, decrease if overfitting.
    lora_alpha=32,           # Scale = alpha/r. Convention: alpha = 2*r. Controls update magnitude.
    target_modules=[         # Which layers to adapt. "all-linear" works well for instruction tuning.
        "q_proj", "v_proj", "k_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP (include for better results)
    ],
    lora_dropout=0.05,       # Regularization. 0.05-0.1 typical.
    bias="none",             # Don't train bias terms
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For r=16 on all seven modules of Llama 3 8B this is roughly 42M trainable
# parameters — about 0.5% of the 8B total.

Training with TRL

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./outputs/llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,      # With gradient accumulation = effective batch 16
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,         # Trade compute for memory
    learning_rate=2e-4,                  # Higher than full fine-tune (LoRA can handle it)
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                   # Warmup for 5% of steps
    weight_decay=0.01,
    bf16=True,                           # BF16 on Ampere+; fp16 for older GPUs
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    max_seq_length=4096,
    packing=True,                        # Pack multiple short examples into one sequence
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # renamed to processing_class= in recent TRL versions
)

trainer.train()
trainer.save_model("./outputs/llama3-finetuned-final")
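A quick sanity check on the batch arithmetic implied by this config (assuming a single GPU; `train_examples` is a hypothetical dataset size, and with packing=True the per-step example count varies, so treat the step count as approximate):

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: single-GPU run

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 16, matching the comment in the config above

train_examples = 10_000  # hypothetical dataset size
steps_per_epoch = train_examples // effective_batch
print(steps_per_epoch)  # 625 optimizer steps per epoch
```

Knowing steps_per_epoch also tells you whether eval_steps=100 fires often enough to catch overfitting within one epoch.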

SFT vs DPO vs RLHF

| Method | What It Optimizes | When to Use | Complexity |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Mimic example responses | Format, style, domain vocab | Low |
| DPO (Direct Preference Optimization) | Prefer chosen over rejected | Alignment, safety, quality improvement | Medium |
| RLHF | Reward model signal | Complex alignment at scale (GPT-4 level) | Very High |
# DPO training — needs preference pairs (chosen vs rejected)
from trl import DPOTrainer, DPOConfig

# Dataset format for DPO — wrap in a datasets.Dataset before training
from datasets import Dataset

dpo_dataset = Dataset.from_list([
    {
        "prompt": "Explain recursion in Python",
        "chosen": "Recursion is when a function calls itself...[good explanation]",
        "rejected": "Just look it up lol",  # Bad response
    },
])

dpo_config = DPOConfig(
    beta=0.1,  # KL divergence penalty (higher = stay closer to reference model)
    loss_type="sigmoid",  # sigmoid | hinge | ipo
    learning_rate=5e-7,   # Much lower than SFT — small nudges only
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,   # DPO overfits quickly — often 1 epoch is enough
)

dpo_trainer = DPOTrainer(
    model=sft_model,           # Start from SFT checkpoint, not base
    ref_model=ref_model,       # Frozen reference (original SFT)
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
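For intuition, the sigmoid loss_type minimizes -log σ(β · (Δ_chosen - Δ_rejected)), where each Δ is the policy-minus-reference log-probability of that response. A toy NumPy sketch with illustrative log-probs (not taken from a real model):

```python
import numpy as np

def dpo_sigmoid_loss(logp_chosen, logp_rejected,
                     ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin_chosen = logp_chosen - ref_logp_chosen
    margin_rejected = logp_rejected - ref_logp_rejected
    logits = beta * (margin_chosen - margin_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))

# Policy prefers "chosen" more than the reference does: loss below ln(2)
print(dpo_sigmoid_loss(-5.0, -20.0, -10.0, -12.0) < np.log(2))   # True

# Policy prefers "rejected": loss above ln(2), so gradients push back
print(dpo_sigmoid_loss(-20.0, -5.0, -12.0, -10.0) > np.log(2))   # True
```

This also shows why beta matters: it scales the margin before the sigmoid, so a higher beta punishes drift from the reference model more sharply.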

Evaluation Before and After

# Evaluation harness using lm-evaluation-harness
# pip install lm-eval

# Command line evaluation on standard benchmarks
# lm_eval --model hf --model_args pretrained=./outputs/my-model \
#   --tasks mmlu,hellaswag,truthfulqa_mc1 \
#   --num_fewshot 5 --batch_size 8 --output_path results.json

# Custom held-out set evaluation
from rouge_score import rouge_scorer
import numpy as np
import torch

def evaluate_on_holdout(model, tokenizer, holdout_examples: list[dict]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    results = []
    
    for ex in holdout_examples:
        prompt = format_prompt(ex["instruction"])  # your chat-template helper (not shown)
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,  # greedy decoding for a deterministic eval
            )
        
        generated = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        scores = scorer.score(ex["reference"], generated)
        results.append({
            "rougeL": scores["rougeL"].fmeasure,
            "generated_length": len(generated.split()),
        })
    
    return {
        "mean_rougeL": np.mean([r["rougeL"] for r in results]),
        "mean_length": np.mean([r["generated_length"] for r in results]),
    }
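Because fine-tuning usually targets format rather than knowledge, it helps to track a cheap format-adherence metric alongside ROUGE. A minimal sketch for JSON-output tasks (`json_validity_rate` is a hypothetical helper, stdlib only):

```python
import json

def json_validity_rate(generations: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON —
    a direct measure of format adherence for JSON-output fine-tunes."""
    ok = 0
    for text in generations:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(generations) if generations else 0.0

print(json_validity_rate(['{"a": 1}', 'not json', '[1, 2]']))  # 0.666...
```

Compare the rate before and after fine-tuning on the same held-out prompts; format metrics often move long before ROUGE does.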

Post-Training Quantization

# GPTQ quantization (fast inference, small model)
# NOTE: merge LoRA adapters into the base weights first (peft merge_and_unload)
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # Calibration dataset
    tokenizer=tokenizer,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "./outputs/finetuned-merged",
    quantization_config=gptq_config,
    device_map="auto",
)
quantized_model.save_pretrained("./outputs/finetuned-gptq-4bit")

# Or: export to GGUF for llama.cpp (CPU inference) — convert, then quantize:
# python llama.cpp/convert_hf_to_gguf.py ./outputs/finetuned-merged --outtype f16
# llama.cpp/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m

Axolotl Config (Production Fine-Tuning)

# axolotl_config.yaml — Axolotl handles all the boilerplate
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

datasets:
  - path: my_dataset.jsonl
    type: chat_template
    chat_template: llama3

dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./outputs/axolotl-run

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
flash_attention: true
warmup_steps: 50
saves_per_epoch: 2
logging_steps: 10
eval_steps: 50

Anti-Patterns

❌ Fine-tuning for knowledge injection
Models don't reliably memorize facts from fine-tuning. Even if they appear to, they hallucinate confidently about the edges. Use RAG for factual knowledge.

❌ Training on unvetted model-generated data
Fine-tuning Llama on raw GPT-4 outputs can violate OpenAI's ToS AND degrade model quality (model collapse). Use human-generated or carefully reviewed data.

❌ Skipping the SFT step before DPO
DPO on a base model rarely works well. Always SFT first to get the model in the right format, then DPO to align preferences.

❌ Using the full dataset without quality filtering
5,000 curated examples typically beats 500,000 scraped ones. The signal-to-noise ratio is everything.

❌ Not monitoring training loss vs validation loss
Overfitting is invisible if you only watch training loss. Watch the gap: if train loss drops but val loss plateaus or rises, stop early.

❌ Forgetting to merge adapters before quantization
PEFT adapters must be merged into the base model weights before GPTQ/AWQ quantization. Unmerged adapters + quantization = broken model.

Quick Reference

LoRA rank selection:
  Style/format only        → r=8 (minimal parameters)
  General instruction tune → r=16 (default starting point)
  Complex reasoning tasks  → r=32-64 (more capacity needed)
  Full fine-tune territory → r=128+ (diminishing returns, consider full FT)

Learning rates:
  SFT (LoRA)    → 1e-4 to 2e-4
  SFT (full FT) → 1e-5 to 5e-5
  DPO           → 5e-7 to 1e-6  (much lower — gentle nudges)

GPU VRAM requirements (Llama 3 8B):
  Full fine-tune (bf16)    → 80GB+ (A100 or 2x A40)
  LoRA (bf16)              → 24GB (A10G, RTX 3090)
  QLoRA (4-bit + LoRA)     → 10-12GB (RTX 3080, T4)
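The full fine-tune figure can be sanity-checked with bytes-per-parameter arithmetic (a rough estimate that ignores activations, fp32 master weights, and framework overhead; the QLoRA adapter overhead below is an assumption):

```python
params = 8e9  # Llama 3 8B

# Full fine-tune: bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam m, v (4 + 4 B)
full_ft_gb = params * (2 + 2 + 4 + 4) / 1e9
print(round(full_ft_gb))  # 96, hence "80GB+" and multi-GPU setups

# QLoRA: 4-bit base weights (~0.5 B/param) plus a small adapter and its optimizer
qlora_gb = params * 0.5 / 1e9 + 1.5  # the ~1.5 GB adapter overhead is an assumption
print(round(qlora_gb, 1))  # 5.5, leaving headroom for activations within 10-12GB
```

The same arithmetic explains the middle row: LoRA without quantization keeps bf16 weights (16GB) but only optimizes the tiny adapter, which fits in 24GB.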

Dataset size heuristics:
  Style/format tuning      → 500-2,000 examples (quality matters most)
  Domain adaptation        → 5,000-50,000 examples
  Instruction following    → 10,000-100,000 examples
  DPO preference pairs     → 1,000-10,000 pairs (much fewer needed)

Skill Information

Source: MoltbotDen
Category: AI & LLMs