RAG vs Fine-Tuning — battle

2026-06-16

You have a capable base model and a pile of your own stuff — a wiki, a ticket history, a product catalog, a house writing style — and you want the model to use it. Two camps shout different answers. One says fine-tune: train the model on your data until it “knows” your domain. The other says RAG: leave the model alone and retrieve the right context at question time.

Pick wrong and you feel it fast. Fine-tune on your docs and the model starts answering in a confident in-house voice while citing a policy that changed last week, inventing ticket IDs that look exactly right, and still failing to tell you today’s stock level. Or bolt on RAG and it retrieves the perfect passage and then answers in the wrong format, ignores your style guide, and pastes a raw chunk into the reply.

The mistake is treating RAG and fine-tuning as two roads to the same place. They aren’t — they change different things, and the whole decision collapses to one question:

Do you need to change what the model KNOWS — or how it BEHAVES?

Change what it knows — facts, data, anything that changes over time — that’s RAG.
Change how it behaves — format, style, skill, tone, anything stable — that’s fine-tuning.

The one line to tattoo on the decision: RAG hands the model an open book at question time; fine-tuning sends it to school before the exam. An open book keeps facts current and lets you cite them. School teaches a skill or a way of answering. They are not substitutes — and the strongest systems use both, which we’ll earn at the end.

Companion pieces: once you’ve picked your approach you still have to run the model — see Ollama vs vLLM vs Triton — battle for the serving engines, and LLM vs VLM vs DIFF — battle for the workload classes you point them at.

RAG — retrieve the facts at question time

Retrieval-Augmented Generation leaves the weights untouched and changes the model’s input. Offline, you split your corpus into chunks and embed each chunk into a vector; at query time you embed the question, run a nearest-neighbour search over those vectors, pull the top-k matching chunks, and stuff them into the prompt — “here are the relevant passages; answer using them.” The model is a reader, not a memoriser. The facts live in the corpus, not in the parameters.
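The pipeline above can be sketched end to end in a few lines. This is a toy illustration, not a production recipe: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the search is a brute-force scan rather than a vector database, and all function names and sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Naive fixed-size chunking: pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model (assumption): counts of words longer than 3 letters."""
    return Counter(w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs):
    """Offline step: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]


def retrieve(index, question, k=2):
    """Query-time step: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Stuff the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Here are the relevant passages:\n{context}\n\n"
        f"Answer using them.\nQuestion: {question}"
    )


docs = [
    "A refund is accepted within 30 days of purchase with a receipt.",
    "The office is closed on public holidays and weekends.",
    "Shipping to Europe takes five to seven business days.",
]
question = "How many days do I have to return a purchase for a refund?"
index = build_index(docs)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

Note that the model’s weights never appear in this code: updating a fact means changing `docs` and rebuilding the index.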

That architecture is exactly why RAG is the right tool when the thing you’re adding is knowledge that moves. A fact that lands at 09:00 is answerable at 09:05 after a re-index; nothing is retrained. Because the answer is built from specific retrieved passages, you can cite them and show provenance, filter by access control so a user only retrieves what they’re allowed to see, and scale to a huge or private corpus the model never saw in pre-training. And grounding the answer in real passages is one of the most effective levers against factual hallucination: the model is reading the fact off the page, not recalling it from its weights.
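Access control and citations both fall out of keeping metadata next to each passage. A minimal sketch, assuming each passage carries a document ID and a set of groups allowed to read it; the `Passage` type, the field names, and the sample corpus are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this passage (hypothetical field)


def visible_to(passages, user_groups):
    """Keep only passages the user may read. Run this before ranking,
    so restricted text never reaches the prompt."""
    groups = set(user_groups)
    return [p for p in passages if p.allowed_groups & groups]


def cite(passages):
    """Numbered source list that can be shown alongside the answer."""
    return [f"[{i + 1}] {p.doc_id}" for i, p in enumerate(passages)]


corpus = [
    Passage("hr-001", "Salary bands for 2026.", frozenset({"hr"})),
    Passage("kb-042", "How to reset your password.", frozenset({"hr", "support", "eng"})),
]
allowed = visible_to(corpus, {"support"})
sources = cite(allowed)
```

Filtering before retrieval, rather than after generation, is the design choice that matters here: a passage the user cannot see should never enter the context window at all.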

The costs are real and they’re all downstream of “retrieval is now part of your system.” The answer is only ever as good as what you retrieved — bad retrieval produces a confident wrong answer, so chunking, embeddings, and a reranker become things you tune and monitor. Every call pays for the retrieved passages in context tokens (more latency, more cost) plus the retrieval hop. And none of it changes how the model writes — RAG will not give you a house voice or a strict output schema. RAG changes the input, not the weights.
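One way to make “tune and monitor” concrete is to track recall@k on a small labelled set of questions. The sketch below assumes you have, for each test question, the chunk IDs your retriever returned in ranked order and the chunk IDs a human marked as relevant; the IDs and the evaluation set are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant id")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical evaluation set: two questions with hand-labelled relevant chunks.
runs = [
    (["c7", "c2", "c9"], ["c2"]),        # the relevant chunk is ranked second
    (["c1", "c4", "c5"], ["c5", "c8"]),  # one of two relevant chunks, ranked third
]
recall_top2 = mean_recall_at_k(runs, 2)
recall_top3 = mean_recall_at_k(runs, 3)
```

If this number drops after a change to chunking or embeddings, the generator is being handed worse passages, whatever the final answers look like.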

Fine-tuning — bake the behaviour into the weights

Fine-tuning takes the opposite route: it changes the weights. You assemble curated input → output examples and run gradient updates until the behaviour you want is part of the model. Full fine-tuning updates every parameter; parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA freeze the base weights and train small low-rank adapter matrices (typically megabytes, against gigabytes for the full model) and can run on a single consumer GPU for small-to-mid-size models, which is why fine-tuning stopped being a big-lab-only move.
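To see why the adapters are so small, it helps to count parameters. In LoRA a frozen weight matrix W of shape d_out × d_in gets a trainable update B·A, where A is r × d_in and B is d_out × r with a small rank r. The sketch below is plain-Python arithmetic plus a tiny forward pass; the layer size (4096), the rank (8), and the function names are illustrative assumptions, not values from any particular model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, r):
    """Trainable parameters for a rank-r adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Illustrative sizes for one square projection layer.
full = full_params(4096, 4096)
adapter = lora_params(4096, 4096, r=8)
shrink = full // adapter  # how many times fewer trainable parameters

# Tiny forward pass: 2x2 frozen weights, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]  # B is commonly initialised to zero, so training starts from the base model
B_trained = [[1.0], [0.0]]
x = [1.0, 2.0]
y_start = lora_forward(W, A, B_zero, x, alpha=1.0, r=1)
y_after = lora_forward(W, A, B_trained, x, alpha=1.0, r=1)
```

For this one layer the adapter trains 256 times fewer parameters than a full update, which is where the memory saving comes from.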

Fine-tuning shines when the thing you’re adding is behaviour that’s stable: a consistent format / style / tone, a narrow skill, reliable structured output, a domain idiom or reasoning pattern. A big secondary win is prompt compression — once the task is in the weights you stop re-explaining it in every prompt, which means shorter prompts, lower latency, lower cost, and it’s how a small fine-tuned model can beat a big general one on one narrow job.

The costs are the mirror image of RAG’s. Fine-tuning is a poor way to add facts: knowledge is frozen at training time, and teaching new facts by fine-tuning is the classic way to manufacture confident hallucination. Recent research finds that models learn genuinely new factual knowledge from fine-tuning slowly, and that the more such examples they’re pushed to fit, the more they hallucinate about things they previously got right. Updates are expensive: to change anything you retrain. You get no provenance, you risk catastrophic forgetting and overfitting, and you need curated training data. Fine-tuning changes the weights, not the model’s access to fresh facts.

Two different edits to the same system. RAG edits the input — it retrieves passages and the model reads them, weights untouched. Fine-tuning edits the weights — behaviour is baked in, but the facts it learned are frozen at training time.

Main differences

	RAG	Fine-tuning
What it changes	the input (context at query time)	the weights (behaviour, baked in)
Mechanism	chunk + embed corpus → vector search → top-k into the prompt	curated examples → gradient updates (full, or LoRA/QLoRA adapters)
Best for	knowledge — facts, data, anything that changes	behaviour — format, style, skill, tone, structure
Update a fact	re-index the changed docs — minutes, ~$0, no GPU	retrain — hours, GPU $$ (LoRA is cheaper, still a training run)
Per-query cost	higher — long prompt (question + retrieved passages) + retrieval hop	lower — short prompt; the task already lives in the weights
Provenance / citations	yes — answer points at the passages it used	no — the answer comes from opaque weights
Effect on hallucination	reduces factual hallucination (grounds the answer)	increases it if used to teach facts (confident, plausible, wrong)
Main failure mode	bad retrieval → confidently wrong answer	catastrophic forgetting / overfitting; stale facts
Infra it adds	embeddings + vector DB + (often) a reranker	a training pipeline + curated dataset + eval
Example use	answer over a live KB, docs Q&A, support over tickets	force a JSON schema, a house voice, a niche classifier

“I want to…” → which one

I want to…	Reach for	Why
Answer over a large, frequently-changing knowledge base	RAG	re-index to update; no retrain when facts move
Always reply in our exact JSON schema / house voice	Fine-tune	format is behaviour — bake it into the weights
Add a brand-new fact the model has never seen	RAG	teaching facts by fine-tuning breeds confident hallucination
Cite sources / enforce per-document access control	RAG	the answer points at retrievable, permissioned passages
Teach a niche skill / reasoning style / domain idiom	Fine-tune	a stable skill belongs in the weights, not the prompt
Cut prompt length + latency on a high-volume narrow task	Fine-tune	the task moves into the weights; prompts get short
Get fresh facts AND a strict format	Both	fine-tune the format, RAG the facts (see below)

It’s not either/or: the axes are orthogonal

Re-read the two columns and the trick reveals itself: RAG and fine-tuning aren’t competing on the same axis at all. One is about knowledge, the other about behaviour — orthogonal dimensions. Asking “RAG or fine-tuning?” is like asking “a textbook or a teacher?” For anything real you often want both, and they compose cleanly because they touch different parts of the system: fine-tune the model so it answers the way you need, and let RAG feed it the facts it needs. That combination has a name in the literature — RAFT (Retrieval-Augmented Fine-Tuning): train the model to reason over retrieved passages (and to ignore distractor passages), then serve it with RAG.
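As a rough illustration of what RAFT-style training data can look like, the sketch below builds one fine-tuning record from a question, the passage that actually answers it, and a pool of distractors. It reflects my reading of the recipe, not the authors’ code: the function name, the prompt wording, the record layout, and the `keep_oracle_prob` value are all assumptions.

```python
import random


def make_raft_example(question, oracle, distractors, answer, rng,
                      n_distractors=2, keep_oracle_prob=0.8):
    """Build one training record: shuffled passages (answering passage plus
    distractors) in the prompt, the grounded answer as the completion.
    Some records omit the answering passage entirely (assumed fraction)."""
    passages = rng.sample(distractors, n_distractors)
    if rng.random() < keep_oracle_prob:
        passages.append(oracle)
    rng.shuffle(passages)  # the model must not learn "the answer is always last"
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer using only the relevant passage."
    )
    return {"prompt": prompt, "completion": answer}


rng = random.Random(0)
record = make_raft_example(
    question="How long is the refund window?",
    oracle="A refund is accepted within 30 days of purchase.",
    distractors=[
        "Shipping to Europe takes five to seven business days.",
        "The office is closed on public holidays.",
        "Gift cards cannot be exchanged for cash.",
    ],
    answer="30 days from purchase.",
    rng=rng,
    keep_oracle_prob=1.0,  # always include the answering passage in this demo
)
```

The behaviour being trained is “read the passages, ignore the irrelevant ones”; the facts themselves still arrive through retrieval at serving time.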

There’s also a cost asymmetry worth staring at before you commit, because it decides which bill you’d rather pay, and how often:

The asymmetry that decides the bill: RAG is cheap to update but pays for retrieved context on every call; fine-tuning is costly to update but lean on every call. Pick by the cost you pay most often — churny facts favour RAG; high call volume on a stable task favours fine-tuning.
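The asymmetry can be turned into a back-of-the-envelope break-even. Every number below is a made-up placeholder (token counts, price per 1k tokens, update frequency, re-index and retrain costs), and the model ignores real-world factors such as retrieval infrastructure, output tokens, and any price difference for serving a fine-tuned model; plug in your own figures.

```python
def rag_monthly_cost(calls, base_tokens, retrieved_tokens, price_per_1k_tokens,
                     updates, reindex_cost):
    """Every call pays for the question plus the retrieved passages; updates are cheap."""
    per_call = (base_tokens + retrieved_tokens) / 1000 * price_per_1k_tokens
    return calls * per_call + updates * reindex_cost


def finetune_monthly_cost(calls, base_tokens, price_per_1k_tokens,
                          updates, retrain_cost):
    """Every call pays only for the short prompt; each update is a training run."""
    per_call = base_tokens / 1000 * price_per_1k_tokens
    return calls * per_call + updates * retrain_cost


def break_even_calls(retrieved_tokens, price_per_1k_tokens, updates,
                     reindex_cost, retrain_cost):
    """Monthly call volume at which the two bills are equal."""
    extra_per_call = retrieved_tokens / 1000 * price_per_1k_tokens
    return updates * (retrain_cost - reindex_cost) / extra_per_call


# Hypothetical inputs: 200-token prompts, 2000 retrieved tokens per call,
# 0.001 per 1k tokens, 4 updates a month, 0.5 per re-index, 50 per retrain.
threshold = break_even_calls(2000, 0.001, updates=4, reindex_cost=0.5, retrain_cost=50)
rag_low = rag_monthly_cost(10_000, 200, 2000, 0.001, 4, 0.5)
ft_low = finetune_monthly_cost(10_000, 200, 0.001, 4, 50)
rag_high = rag_monthly_cost(1_000_000, 200, 2000, 0.001, 4, 0.5)
ft_high = finetune_monthly_cost(1_000_000, 200, 0.001, 4, 50)
```

Below the threshold the RAG bill is smaller; above it the fine-tuning bill is. More frequent updates push the threshold up, which is the “churny facts favour RAG” half of the rule.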

In practice this becomes an escalation ladder, not a fork. Reach for the cheapest rung that solves the problem, and climb only when the rung below genuinely isn’t enough:

The ladder, not the fork. Rung 1: start with the prompt. Rung 2: add RAG when you need fresh, private, or citable facts. Rung 3: fine-tune when you need a stable format, skill, or shorter prompts. Rung 4: combine them when you genuinely need both fresh facts and baked-in behaviour. Most teams never need rung 4.
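The ladder reads naturally as a small decision function. This is only a restatement of the rule of thumb above in code, with hypothetical flag names; real decisions involve more inputs than three booleans.

```python
def choose_rung(prompt_is_enough, needs_fresh_or_citable_facts, needs_stable_behaviour):
    """Return the cheapest rung that covers the stated needs."""
    if prompt_is_enough:
        return "prompt"
    if needs_fresh_or_citable_facts and needs_stable_behaviour:
        return "both"
    if needs_fresh_or_citable_facts:
        return "rag"
    if needs_stable_behaviour:
        return "fine-tune"
    # Nothing specific identified yet: keep iterating on the prompt.
    return "prompt"


live_kb = choose_rung(False, needs_fresh_or_citable_facts=True, needs_stable_behaviour=False)
house_voice = choose_rung(False, needs_fresh_or_citable_facts=False, needs_stable_behaviour=True)
fresh_and_strict = choose_rung(False, needs_fresh_or_citable_facts=True, needs_stable_behaviour=True)
```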

The two anti-patterns are the two ways of using a tool on the wrong axis. Don’t fine-tune to teach facts: you’ll get a fluent model that confidently invents them, and every fact change means another training run. Don’t reach for RAG to fix format: retrieved context alone won’t make a model emit your JSON schema reliably; that’s behaviour, and behaviour lives in the weights (or, cheaper, in a tightly-specified prompt). Match the tool to the axis you’re actually on.

TL;DR: RAG and fine-tuning aren’t rivals — they change different things. RAG edits the input: retrieve top-k passages from your corpus at query time and let the (frozen) model read them. It’s the tool for knowledge — facts that change, large/private corpora, citations, access control — it’s cheap to update (re-index, no GPU) but heavier per call (long prompts), and it cuts factual hallucination by grounding. Fine-tuning edits the weights: train (full, or cheap LoRA/QLoRA adapters) until the behaviour is baked in. It’s the tool for behaviour — format, style, skill, structured output — it gives shorter prompts and lower latency but is costly to update (retrain) and a bad way to add facts (teaching facts by fine-tuning increases confident hallucination). The mnemonic: RAG is an open book; fine-tuning is school. Treat it as an escalation ladder — prompt → RAG → fine-tune → both (RAFT) — and climb only when the cheaper rung isn’t enough. Don’t fine-tune to teach facts; don’t RAG to fix format.
