RAG vs Fine-Tuning — battle
You have a capable base model and a pile of your own stuff — a wiki, a ticket history, a product catalog, a house writing style — and you want the model to use it. Two camps shout different answers. One says fine-tune: train the model on your data until it “knows” your domain. The other says RAG: leave the model alone and retrieve the right context at question time.
Pick wrong and you feel it fast. Fine-tune on your docs and the model starts answering in a confident in-house voice while citing a policy that changed last week, inventing ticket IDs that look exactly right, and still failing to tell you today’s stock level. Or bolt on RAG and it retrieves the perfect passage and then answers in the wrong format, ignores your style guide, and pastes a raw chunk into the reply.
The mistake is treating RAG and fine-tuning as two roads to the same place. They aren’t — they change different things, and the whole decision collapses to one question:
Do you need to change what the model KNOWS — or how it BEHAVES?
- Change what it knows — facts, data, anything that changes over time — that’s RAG.
- Change how it behaves — format, style, skill, tone, anything stable — that’s fine-tuning.
The one line to tattoo on the decision: RAG hands the model an open book at question time; fine-tuning sends it to school before the exam. An open book keeps facts current and lets you cite them. School teaches a skill or a way of answering. They are not substitutes — and the strongest systems use both, which we’ll earn at the end.
Companion pieces: once you’ve picked your approach you still have to run the model — see Ollama vs vLLM vs Triton — battle for the serving engines, and LLM vs VLM vs DIFF — battle for the workload classes you point them at.
RAG — retrieve the facts at question time
Retrieval-Augmented Generation leaves the weights untouched and changes the model’s input. Offline, you split your corpus into chunks and embed each chunk into a vector; at query time you embed the question, run a nearest-neighbour search over those vectors, pull the top-k matching chunks, and stuff them into the prompt — “here are the relevant passages; answer using them.” The model is a reader, not a memoriser. The facts live in the corpus, not in the parameters.
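The offline and query-time steps above can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector search" is a brute-force cosine ranking rather than a vector database, and every function name and sample document here is hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Naive chunker: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs):
    """Offline step: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index, question, k=2):
    """Query-time step: embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Stuff the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Here are the relevant passages:\n{context}\n\nAnswer using them.\nQuestion: {question}"

docs = [
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes five business days within the country.",
    "Support is available by email on weekdays.",
]
index = build_index(docs)
question = "How many days do I have to request a refund?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

Nothing in this sketch touches model weights: the only thing that changes between two questions is the text of `prompt`, which is the whole point of the architecture.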
That architecture is exactly why RAG is the right tool when the thing you’re adding is knowledge that moves. A fact that lands at 09:00 is answerable at 09:05 after a re-index; nothing is retrained. Because the answer is built from specific retrieved passages, you can cite them and show provenance, filter by access control so a user only retrieves what they’re allowed to see, and scale to a huge or private corpus the model never saw in pre-training. And grounding the answer in real passages is one of the most effective levers against factual hallucination: the model is reading the answer off the page rather than recalling it from its weights.
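Access control and citations both fall out of keeping metadata next to each chunk. A minimal sketch, assuming each passage carries a document id and a set of groups allowed to read it; the `Passage` type, the group names, and the ids are all hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    allowed_groups: frozenset

def permitted(passages, user_groups):
    """Keep only passages that share at least one group with the user."""
    return [p for p in passages if p.allowed_groups & user_groups]

def with_citations(answer, used):
    """Append the ids of the passages the answer was built from, without repeats."""
    doc_ids = list(dict.fromkeys(p.doc_id for p in used))
    return f"{answer} [sources: {', '.join(doc_ids)}]"

corpus = [
    Passage("hr-017", "Salary bands are reviewed every April.", frozenset({"hr"})),
    Passage("kb-204", "Refunds are accepted within 30 days.", frozenset({"support", "hr"})),
]
# Filter before ranking, so a passage the user may not read never reaches the prompt.
candidates = permitted(corpus, user_groups={"support"})
reply = with_citations("Refunds are accepted within 30 days.", candidates)
```

Filtering ahead of retrieval rather than after generation is the design choice that matters here: a passage that never enters the prompt cannot leak into the answer.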
The costs are real and they’re all downstream of “retrieval is now part of your system.” The answer is only ever as good as what you retrieved — bad retrieval produces a confident wrong answer, so chunking, embeddings, and a reranker become things you tune and monitor. Every call pays for the retrieved passages in context tokens (more latency, more cost) plus the retrieval hop. And none of it changes how the model writes — RAG will not give you a house voice or a strict output schema. RAG changes the input, not the weights.
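Monitoring retrieval can start very small. One common check, sketched here with made-up chunk ids and a hypothetical labelled set of questions, is the hit rate at k: the fraction of questions for which at least one known-relevant chunk shows up in the top-k results.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k retrieved ids include a relevant id.

    retrieved: one ranked list of chunk ids per query
    relevant:  one set of known-relevant chunk ids per query
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved)

# Hypothetical evaluation set: three questions, ranked retrieval results for each.
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
relevant = [{"c1"}, {"c9"}, {"c0"}]

at_1 = hit_rate_at_k(retrieved, relevant, k=1)
at_3 = hit_rate_at_k(retrieved, relevant, k=3)
```

Tracking a number like this while you change chunk size, embedding model, or reranker tells you whether retrieval itself improved, separately from whether the final answers did.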
Fine-tuning — bake the behaviour into the weights
Fine-tuning takes the opposite route: it changes the weights. You assemble curated input → output examples and run gradient updates until the behaviour you want is part of the model. Full fine-tuning updates every parameter; parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA freeze the base weights and train only small low-rank adapter matrices (often a few MB). QLoRA additionally keeps the frozen base model in 4-bit precision, which is what lets a run fit on a single consumer GPU for smaller models. That is why fine-tuning stopped being a big-lab-only move.
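To see why adapters come out so small, it helps to count parameters. In LoRA, a frozen weight matrix of shape d_out × d_in gets a trainable update factored into two low-rank matrices, so the trainable count is rank × (d_out + d_in) instead of d_out × d_in. The layer size, rank, and number of adapted matrices below are illustrative assumptions, not figures from any particular model:

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r update B @ A, with B: d_out x r and A: r x d_in."""
    return rank * (d_out + d_in)

# Hypothetical setup: 4096 x 4096 projections, rank 8, 64 adapted matrices.
d, rank, n_matrices = 4096, 8, 64

per_matrix_full = full_params(d, d)          # 16_777_216
per_matrix_lora = lora_params(d, d, rank)    # 65_536
fraction = per_matrix_lora / per_matrix_full

adapter_params = n_matrices * per_matrix_lora
adapter_mib_fp16 = adapter_params * 2 / (1024 ** 2)  # 2 bytes per parameter in fp16
```

Under these assumptions the adapter trains well under 1% of the parameters in each adapted matrix and weighs in at 8 MiB in fp16; raising the rank or adapting more matrices grows it linearly.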
Fine-tuning shines when the thing you’re adding is behaviour that’s stable: a consistent format / style / tone, a narrow skill, reliable structured output, a domain idiom or reasoning pattern. A big secondary win is prompt compression — once the task is in the weights you stop re-explaining it in every prompt, which means shorter prompts, lower latency, lower cost, and it’s how a small fine-tuned model can beat a big general one on one narrow job.
The costs are the mirror image of RAG’s. Fine-tuning is a poor way to add facts: knowledge is frozen at training time, and teaching new facts by fine-tuning is the classic way to manufacture confident hallucination. Recent research finds that models learn examples carrying new factual knowledge more slowly than examples consistent with what they already know, and that as those new-knowledge examples are eventually fitted, the model’s tendency to hallucinate increases. Updates are expensive: to change anything you retrain. You get no provenance, you risk catastrophic forgetting and overfitting, and you need curated training data. Fine-tuning changes the weights, not the model’s access to fresh facts.
Main differences
| | RAG | Fine-tuning |
|---|---|---|
| What it changes | the input (context at query time) | the weights (behaviour, baked in) |
| Mechanism | chunk + embed corpus → vector search → top-k into the prompt | curated examples → gradient updates (full, or LoRA/QLoRA adapters) |
| Best for | knowledge — facts, data, anything that changes | behaviour — format, style, skill, tone, structure |
| Update a fact | re-index the changed docs — minutes, ~$0, no GPU | retrain — hours, GPU $$ (LoRA is cheaper, still a training run) |
| Per-query cost | higher — long prompt (question + retrieved passages) + retrieval hop | lower — short prompt; the task already lives in the weights |
| Provenance / citations | yes — answer points at the passages it used | no — the answer comes from opaque weights |
| Effect on hallucination | reduces factual hallucination (grounds the answer) | increases it if used to teach facts (confident, plausible, wrong) |
| Main failure mode | bad retrieval → confidently wrong answer | catastrophic forgetting / overfitting; stale facts |
| Infra it adds | embeddings + vector DB + (often) a reranker | a training pipeline + curated dataset + eval |
| Example use | answer over a live KB, docs Q&A, support over tickets | force a JSON schema, a house voice, a niche classifier |
“I want to…” → which one
| I want to… | Reach for | Why |
|---|---|---|
| Answer over a large, frequently-changing knowledge base | RAG | re-index to update; no retrain when facts move |
| Always reply in our exact JSON schema / house voice | Fine-tune | format is behaviour — bake it into the weights |
| Add a brand-new fact the model has never seen | RAG | teaching facts by fine-tuning breeds confident hallucination |
| Cite sources / enforce per-document access control | RAG | the answer points at retrievable, permissioned passages |
| Teach a niche skill / reasoning style / domain idiom | Fine-tune | a stable skill belongs in the weights, not the prompt |
| Cut prompt length + latency on a high-volume narrow task | Fine-tune | the task moves into the weights; prompts get short |
| Get fresh facts AND a strict format | Both | fine-tune the format, RAG the facts (see below) |
It’s not either/or: the axes are orthogonal
Re-read the two columns and the trick reveals itself: RAG and fine-tuning aren’t competing on the same axis at all. One is about knowledge, the other about behaviour — orthogonal dimensions. Asking “RAG or fine-tuning?” is like asking “a textbook or a teacher?” For anything real you often want both, and they compose cleanly because they touch different parts of the system: fine-tune the model so it answers the way you need, and let RAG feed it the facts it needs. That combination has a name in the literature — RAFT (Retrieval-Augmented Fine-Tuning): train the model to reason over retrieved passages (and to ignore distractor passages), then serve it with RAG.
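The RAFT idea is easiest to see in how a training record is assembled. A rough sketch, assuming a simple input/output record format: each question is paired with shuffled distractor passages, and only some of the time with the passage that actually contains the answer, so the model learns both to use the right passage and to cope when retrieval misses. The function name, the record fields, and the 0.8 default are illustrative choices here, not values taken from the paper.

```python
import random

def make_raft_example(question, answer, oracle, distractors, rng,
                      p_oracle=0.8, n_distractors=2):
    """Build one training record: shuffled passages + question -> answer.

    With probability p_oracle the passage containing the answer (the
    "oracle") is included; otherwise the context holds distractors only.
    """
    docs = rng.sample(distractors, n_distractors)
    has_oracle = rng.random() < p_oracle
    if has_oracle:
        docs.append(oracle)
    rng.shuffle(docs)
    context = "\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    return {
        "input": f"{context}\n\nQuestion: {question}",
        "output": answer,
        "has_oracle": has_oracle,
    }

rng = random.Random(0)  # seeded so the dataset build is repeatable
example = make_raft_example(
    question="How long is the refund window?",
    answer="30 days, according to the refund policy passage.",
    oracle="Refunds are accepted within 30 days of purchase.",
    distractors=[
        "Standard shipping takes five business days.",
        "Support is available by email on weekdays.",
        "Gift cards cannot be exchanged for cash.",
    ],
    rng=rng,
)
```

At serving time the fine-tuned model still sits behind an ordinary retriever; the training records simply mirror the noisy context it will see there.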
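There’s also a cost asymmetry worth staring at before you commit, because it decides which bill you’d rather pay, and how often. RAG is cheap to update (re-index the changed docs, no GPU) but pays on every call for a longer prompt plus the retrieval hop. Fine-tuning is expensive to update (every change is a training run) but cheap per call, because the task already lives in the weights and the prompt stays short.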
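The asymmetry can be put into rough numbers. Every figure in this sketch is invented for illustration (token price, prompt lengths, cost of a training run), and it ignores output tokens and the running cost of the retrieval infrastructure; the point is only the shape of the trade-off between a per-call bill and a per-update bill.

```python
def monthly_cost(queries, prompt_tokens, price_per_1k_tokens, updates, cost_per_update):
    """Prompt-token spend for the month plus the cost of pushing updates."""
    inference = queries * prompt_tokens / 1000 * price_per_1k_tokens
    return inference + updates * cost_per_update

PRICE = 0.001  # hypothetical price per 1k prompt tokens

def rag_cost(queries, updates):
    # Long prompt (question + retrieved passages); re-indexing treated as free.
    return monthly_cost(queries, 2000, PRICE, updates, cost_per_update=0.0)

def fine_tune_cost(queries, updates):
    # Short prompt; every update is a (hypothetical) 50-unit training run.
    return monthly_cost(queries, 200, PRICE, updates, cost_per_update=50.0)

# Same traffic, different update cadence.
weekly_updates = (rag_cost(100_000, 4), fine_tune_cost(100_000, 4))
monthly_update = (rag_cost(100_000, 1), fine_tune_cost(100_000, 1))
```

With these made-up numbers, weekly updates favour RAG and a single monthly update favours fine-tuning; more traffic or rarer updates push further toward fine-tuning, and faster-moving facts push back toward RAG.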
In practice this becomes an escalation ladder, not a fork. Reach for the cheapest rung that solves the problem, and climb only when the rung below genuinely isn’t enough: prompt → RAG → fine-tune → both (RAFT).
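Written as a decision function, the ladder is short. This is one possible encoding of the rule of thumb, with hypothetical argument names; the three inputs are judgement calls you make about your own problem, not things the code can measure.

```python
def cheapest_rung(prompt_is_enough, needs_changing_facts, needs_stable_behaviour):
    """Return the lowest rung of the ladder that covers the stated needs."""
    if prompt_is_enough:
        return "prompt"                # a well-specified prompt already solves it
    if needs_changing_facts and needs_stable_behaviour:
        return "both (RAFT)"           # fine-tune the behaviour, retrieve the facts
    if needs_changing_facts:
        return "RAG"                   # knowledge problem: retrieve at question time
    if needs_stable_behaviour:
        return "fine-tune"             # behaviour problem: bake it into the weights
    return "prompt"                    # nothing else is needed, so stay on the first rung

choice = cheapest_rung(
    prompt_is_enough=False,
    needs_changing_facts=True,
    needs_stable_behaviour=False,
)
```

The order of the checks is the design choice: the prompt is tested first because it is the cheapest rung, and the combined option is only returned when both needs are present.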
The two anti-patterns are just the two ways of crossing the axis the wrong way. Don’t fine-tune to teach facts — you’ll get a fluent model that confidently invents them, and every fact change means another training run. Don’t reach for RAG to fix format — no amount of retrieved context makes a model emit your JSON schema reliably; that’s behaviour, and behaviour lives in the weights (or, cheaper, in a tightly-specified prompt). Match the tool to the side of the axis you’re actually on.
TL;DR: RAG and fine-tuning aren’t rivals — they change different things. RAG edits the input: retrieve top-k passages from your corpus at query time and let the (frozen) model read them. It’s the tool for knowledge — facts that change, large/private corpora, citations, access control — it’s cheap to update (re-index, no GPU) but heavier per call (long prompts), and it cuts factual hallucination by grounding. Fine-tuning edits the weights: train (full, or cheap LoRA/QLoRA adapters) until the behaviour is baked in. It’s the tool for behaviour — format, style, skill, structured output — it gives shorter prompts and lower latency but is costly to update (retrain) and a bad way to add facts (teaching facts by fine-tuning increases confident hallucination). The mnemonic: RAG is an open book; fine-tuning is school. Treat it as an escalation ladder — prompt → RAG → fine-tune → both (RAFT) — and climb only when the cheaper rung isn’t enough. Don’t fine-tune to teach facts; don’t RAG to fix format.
Sources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (the RAG paper) — arXiv
- Dense Passage Retrieval for Open-Domain QA — arXiv
- LoRA: Low-Rank Adaptation of Large Language Models — arXiv
- QLoRA: Efficient Finetuning of Quantized LLMs — arXiv
- RAFT: Adapting Language Model to Domain Specific RAG — arXiv
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? — arXiv