RAG vs Fine-Tuning — battle

You have a capable base model and a pile of your own stuff — a wiki, a ticket history, a product catalog, a house writing style — and you want the model to use it. Two camps shout different answers. One says fine-tune: train the model on your data until it “knows” your domain. The other says RAG: leave the model alone and retrieve the right context at question time.

Pick wrong and you feel it fast. Fine-tune on your docs and the model starts answering in a confident in-house voice while citing a policy that changed last week, inventing ticket IDs that look exactly right, and still failing to tell you today’s stock level. Or bolt on RAG and it retrieves the perfect passage and then answers in the wrong format, ignores your style guide, and pastes a raw chunk into the reply.

The mistake is treating RAG and fine-tuning as two roads to the same place. They aren’t — they change different things, and the whole decision collapses to one question:

Do you need to change what the model KNOWS — or how it BEHAVES?

The one line to tattoo on the decision: RAG hands the model an open book at question time; fine-tuning sends it to school before the exam. An open book keeps facts current and lets you cite them. School teaches a skill or a way of answering. They are not substitutes — and the strongest systems use both, which we’ll earn at the end.

Companion pieces: once you’ve picked your approach you still have to run the model — see Ollama vs vLLM vs Triton — battle for the serving engines, and LLM vs VLM vs DIFF — battle for the workload classes you point them at.

RAG — retrieve the facts at question time

Retrieval-Augmented Generation leaves the weights untouched and changes the model’s input. Offline, you split your corpus into chunks and embed each chunk into a vector; at query time you embed the question, run a nearest-neighbour search over those vectors, pull the top-k matching chunks, and stuff them into the prompt — “here are the relevant passages; answer using them.” The model is a reader, not a memoriser. The facts live in the corpus, not in the parameters.
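The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is a plain list rather than a vector database, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(doc: str, size: int = 12) -> list[str]:
    """Split a document into fixed-size word windows."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(corpus: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: chunk every document and embed each chunk."""
    return [(c, embed(c)) for doc in corpus for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query-time step: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Stuff the retrieved passages into the prompt; the model's weights are untouched."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Here are the relevant passages:\n"
        f"{numbered}\n\n"
        "Answer using them.\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    corpus = [
        "Refunds are accepted within 30 days of purchase with a receipt.",
        "Our office is closed on public holidays.",
    ]
    index = build_index(corpus)
    question = "How many days do I have for refunds?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

Updating a fact in this sketch means calling `build_index` again on the changed documents; nothing about the model changes.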

That architecture is exactly why RAG is the right tool when the thing you’re adding is knowledge that moves. A fact that lands at 09:00 is answerable at 09:05 after a re-index; nothing is retrained. Because the answer is built from specific retrieved passages, you can cite them and show provenance, filter by access control so a user only retrieves what they’re allowed to see, and scale to a huge or private corpus the model never saw in pre-training. And grounding the answer in real passages is one of the most effective levers against factual hallucination: the model is quoting, not recalling.
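Access control and provenance both fall out of keeping metadata next to each chunk. The sketch below assumes a hypothetical `Chunk` record with a `source` id and an `allowed_groups` set, and uses plain word overlap as a stand-in for vector similarity; the point is only that the permission filter runs before ranking, so a passage the user cannot read never reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                # document id, reused as the citation
    allowed_groups: frozenset  # groups permitted to read this chunk


def overlap(question: str, text: str) -> int:
    """Stand-in relevance score (assumption): number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve_for_user(chunks, question, user_groups, k=2):
    """Filter by permission first, then rank only what the user may see."""
    visible = [c for c in chunks if c.allowed_groups & set(user_groups)]
    ranked = sorted(visible, key=lambda c: overlap(question, c.text), reverse=True)
    return ranked[:k]


def citations(hits):
    """Provenance: the sources of the passages that went into the prompt."""
    return [c.source for c in hits]


if __name__ == "__main__":
    chunks = [
        Chunk("salary bands for engineers", "hr/pay.md", frozenset({"hr"})),
        Chunk("engineers get 25 days leave", "wiki/leave.md", frozenset({"staff", "hr"})),
    ]
    hits = retrieve_for_user(chunks, "salary for engineers", {"staff"})
    print(citations(hits))  # ['wiki/leave.md']
```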

The costs are real and they’re all downstream of “retrieval is now part of your system.” The answer is only ever as good as what you retrieved — bad retrieval produces a confident wrong answer, so chunking, embeddings, and a reranker become things you tune and monitor. Every call pays for the retrieved passages in context tokens (more latency, more cost) plus the retrieval hop. And none of it changes how the model writes — RAG will not give you a house voice or a strict output schema. RAG changes the input, not the weights.

Fine-tuning — bake the behaviour into the weights

Fine-tuning takes the opposite route: it changes the weights. You assemble curated input → output examples and run gradient updates until the behaviour you want is part of the model. Full fine-tuning updates every parameter; parameter-efficient methods (PEFT) such as LoRA and QLoRA freeze the base model and train small low-rank adapter matrices (often megabytes, against gigabytes for the full weights). QLoRA additionally quantises the frozen base, which is what lets smaller models be tuned on a single consumer GPU, and why fine-tuning stopped being a big-lab-only move.
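A little arithmetic shows why the adapters are so small. LoRA replaces a full update to a `d_out × d_in` weight matrix with the product of two low-rank factors, `B` (`d_out × r`) and `A` (`r × d_in`), so only `r × (d_out + d_in)` values are trained. The layer size and rank below are illustrative assumptions, not figures from any particular model.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def adapter_megabytes(n_params: int, bytes_per_param: int = 2) -> float:
    """Storage for the adapter, assuming 16-bit values by default."""
    return n_params * bytes_per_param / 1_000_000


if __name__ == "__main__":
    d, r = 4096, 8  # hypothetical square projection layer and rank
    full = full_update_params(d, d)
    lora = lora_update_params(d, d, r)
    print(full, lora, full // lora)  # 16777216 65536 256
    print(adapter_megabytes(lora))   # size of one adapted matrix, in MB
```

Under these assumptions each adapted matrix trains 256 times fewer values than a full update; the total adapter size then depends on how many matrices are adapted and at what rank.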

Fine-tuning shines when the thing you’re adding is behaviour that’s stable: a consistent format / style / tone, a narrow skill, reliable structured output, a domain idiom or reasoning pattern. A big secondary win is prompt compression: once the task is in the weights you stop re-explaining it in every prompt, which means shorter prompts, lower latency, and lower cost. It is also how a small fine-tuned model can beat a big general one on one narrow job.

The costs are the mirror image of RAG’s. Fine-tuning is a poor way to add facts: knowledge is frozen at training time, and teaching new facts by fine-tuning is the classic way to manufacture confident hallucination. Recent research finds that models pick up new factual knowledge from fine-tuning slowly, and that the more such examples they are pushed to fit, the more prone they become to hallucinating about things they previously got right. Updates are expensive: to change anything you retrain. You get no provenance, you risk catastrophic forgetting and overfitting, and you need curated training data. Fine-tuning changes the weights, not the model’s access to fresh facts.

[Figure: "Do you need to change what the model KNOWS, or how it BEHAVES?" RAG panel (change what it KNOWS): your question → embed → vector search → top-k passages from your corpus → LLM, weights frozen → grounded answer + citations. Fine-tuning panel (change how it BEHAVES): curated input → output examples → train / LoRA adapter → new weights → LLM′, same model with new behaviour → answer in the learned format / skill.]
Two different edits to the same system. RAG edits the input — it retrieves passages and the model reads them, weights untouched. Fine-tuning edits the weights — behaviour is baked in, but the facts it learned are frozen at training time.

Main differences

| | RAG | Fine-tuning |
|---|---|---|
| What it changes | the input (context at query time) | the weights (behaviour, baked in) |
| Mechanism | chunk + embed corpus → vector search → top-k into the prompt | curated examples → gradient updates (full, or LoRA/QLoRA adapters) |
| Best for | knowledge: facts, data, anything that changes | behaviour: format, style, skill, tone, structure |
| Update a fact | re-index the changed docs: minutes, ~$0, no GPU | retrain: hours, GPU $$ (LoRA is cheaper, still a training run) |
| Per-query cost | higher: long prompt (question + retrieved passages) + retrieval hop | lower: short prompt; the task already lives in the weights |
| Provenance / citations | yes: answer points at the passages it used | no: the answer comes from opaque weights |
| Effect on hallucination | reduces factual hallucination (grounds the answer) | increases it if used to teach facts (confident, plausible, wrong) |
| Main failure mode | bad retrieval → confidently wrong answer | catastrophic forgetting / overfitting; stale facts |
| Infra it adds | embeddings + vector DB + (often) a reranker | a training pipeline + curated dataset + eval |
| Example use | answer over a live KB, docs Q&A, support over tickets | force a JSON schema, a house voice, a niche classifier |


“I want to…” → which one

| I want to… | Reach for | Why |
|---|---|---|
| Answer over a large, frequently-changing knowledge base | RAG | re-index to update; no retrain when facts move |
| Always reply in our exact JSON schema / house voice | Fine-tune | format is behaviour; bake it into the weights |
| Add a brand-new fact the model has never seen | RAG | teaching facts by fine-tuning breeds confident hallucination |
| Cite sources / enforce per-document access control | RAG | the answer points at retrievable, permissioned passages |
| Teach a niche skill / reasoning style / domain idiom | Fine-tune | a stable skill belongs in the weights, not the prompt |
| Cut prompt length + latency on a high-volume narrow task | Fine-tune | the task moves into the weights; prompts get short |
| Get fresh facts AND a strict format | Both | fine-tune the format, RAG the facts (see below) |

It’s not either/or: the axes are orthogonal

Re-read the two columns and the trick reveals itself: RAG and fine-tuning aren’t competing on the same axis at all. One is about knowledge, the other about behaviour — orthogonal dimensions. Asking “RAG or fine-tuning?” is like asking “a textbook or a teacher?” For anything real you often want both, and they compose cleanly because they touch different parts of the system: fine-tune the model so it answers the way you need, and let RAG feed it the facts it needs. That combination has a name in the literature — RAFT (Retrieval-Augmented Fine-Tuning): train the model to reason over retrieved passages (and to ignore distractor passages), then serve it with RAG.
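To make the RAFT idea concrete, here is one way a training example might be assembled: the question is paired with a shuffled mix of distractor passages and, most of the time, the passage that actually contains the answer. This is a loose sketch of the idea as described above, not the published recipe; the function name, the `p_oracle` fraction, and the record layout are all assumptions.

```python
import random


def make_raft_example(question, oracle, distractors, answer, rng,
                      n_distractors=2, p_oracle=0.8):
    """Build one input -> output training pair with distractor passages mixed in.

    With probability p_oracle (an assumed knob) the answer-bearing passage is
    included; otherwise the model sees only distractors for this example.
    """
    docs = rng.sample(distractors, n_distractors)
    if rng.random() < p_oracle:
        docs.append(oracle)
    rng.shuffle(docs)  # the model should not learn a fixed position for the answer
    context = "\n".join(f"[doc {i}] {d}" for i, d in enumerate(docs, 1))
    return {"input": f"{context}\n\nQuestion: {question}", "output": answer}


if __name__ == "__main__":
    rng = random.Random(0)
    example = make_raft_example(
        question="What is the refund window?",
        oracle="Refunds are accepted within 30 days of purchase.",
        distractors=[
            "The office is closed on public holidays.",
            "Shipping takes 3 to 5 business days.",
            "Gift cards cannot be exchanged for cash.",
        ],
        answer="30 days, per the refund policy passage.",
        rng=rng,
    )
    print(example["input"])
```

At serving time the fine-tuned model is fed by an ordinary retrieval step, so the facts still come from the corpus while the habit of reading past irrelevant passages lives in the weights.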

There’s also a cost asymmetry worth staring at before you commit, because it decides which bill you’d rather pay, and how often:

[Figure: cost asymmetry; bar length = relative cost. To update a fact: RAG re-indexes changed docs (minutes, ~$0, no GPU); fine-tuning retrains or re-tunes the adapter (hours, GPU $$, a training run). Per query at inference: RAG sends a long prompt plus retrieved passages (more tokens, plus a retrieval hop); fine-tuning sends a short prompt because the task is in the weights (fewer tokens, lower latency).]
The asymmetry that decides the bill: RAG is cheap to update but pays for retrieved context on every call; fine-tuning is costly to update but lean on every call. Pick by the cost you pay most often — churny facts favour RAG; high call volume on a stable task favours fine-tuning.
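The asymmetry is easy to put into numbers. The sketch below adds up a monthly bill from an update cost and a per-query cost; every figure in it is a made-up placeholder chosen only to show the crossover, so substitute your own measurements.

```python
def monthly_cost(update_cost: float, updates_per_month: int,
                 cost_per_query: float, queries_per_month: int) -> float:
    """Total monthly spend: what you pay to refresh plus what you pay per call."""
    return update_cost * updates_per_month + cost_per_query * queries_per_month


# Hypothetical unit costs (assumptions, not benchmarks).
RAG = {"update_cost": 0.0, "cost_per_query": 0.004}   # re-index ~free, long prompts
TUNED = {"update_cost": 50.0, "cost_per_query": 0.001}  # training run, short prompts


def compare(updates_per_month: int, queries_per_month: int) -> str:
    """Return which approach is cheaper under the placeholder costs above."""
    rag = monthly_cost(RAG["update_cost"], updates_per_month,
                       RAG["cost_per_query"], queries_per_month)
    tuned = monthly_cost(TUNED["update_cost"], updates_per_month,
                         TUNED["cost_per_query"], queries_per_month)
    return "RAG" if rag < tuned else "fine-tuning"


if __name__ == "__main__":
    print(compare(updates_per_month=30, queries_per_month=10_000))     # churny facts
    print(compare(updates_per_month=1, queries_per_month=5_000_000))   # stable, high volume
```

With these placeholders, daily fact changes at modest volume favour RAG, while a stable task at millions of calls favours fine-tuning; the crossover point moves with your actual prices.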

In practice this becomes an escalation ladder, not a fork. Reach for the cheapest rung that solves the problem, and climb only when the rung below genuinely isn’t enough:

[Figure: escalation ladder; reach for the cheapest rung that works, since each rung adds capability, cost, and moving parts. 1 · Prompt: instructions, examples, schema in the prompt; always start here. 2 · + RAG: fresh / private / large facts the model lacks; citations, access control; re-index to update. 3 · + Fine-tune: format, style, skill; shorter prompts, lower latency; retrain to update (LoRA keeps it cheap-ish). 4 · RAG + fine-tune (RAFT): behaviour in the weights, facts from retrieval; most capability, most moving parts.]
The ladder, not the fork. Start with the prompt; add RAG when you need fresh, private, or citable facts; fine-tune when you need a stable format, skill, or shorter prompts; combine them when you genuinely need both fresh facts and baked-in behaviour. Most teams never need rung 4.
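The ladder reads naturally as a small decision function. The three boolean inputs are a simplification invented for this sketch; real projects will have fuzzier answers, but the order of the checks mirrors the rungs.

```python
def pick_rung(prompt_is_enough: bool,
              needs_fresh_facts: bool,
              needs_baked_behaviour: bool) -> str:
    """Return the cheapest rung of the ladder that covers the stated needs."""
    if prompt_is_enough:
        return "1: prompt"
    if needs_fresh_facts and needs_baked_behaviour:
        return "4: RAG + fine-tune"
    if needs_fresh_facts:
        return "2: prompt + RAG"
    if needs_baked_behaviour:
        return "3: prompt + fine-tune"
    # Nothing specific is missing, so keep iterating on the prompt.
    return "1: prompt"


if __name__ == "__main__":
    # A support bot over a live knowledge base with no special format needs.
    print(pick_rung(prompt_is_enough=False,
                    needs_fresh_facts=True,
                    needs_baked_behaviour=False))  # 2: prompt + RAG
```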

The two anti-patterns are just the two ways of crossing the axis the wrong way. Don’t fine-tune to teach facts — you’ll get a fluent model that confidently invents them, and every fact change means another training run. Don’t reach for RAG to fix format — no amount of retrieved context makes a model emit your JSON schema reliably; that’s behaviour, and behaviour lives in the weights (or, cheaper, in a tightly-specified prompt). Match the tool to the side of the axis you’re actually on.

TL;DR: RAG and fine-tuning aren’t rivals — they change different things. RAG edits the input: retrieve top-k passages from your corpus at query time and let the (frozen) model read them. It’s the tool for knowledge — facts that change, large/private corpora, citations, access control — it’s cheap to update (re-index, no GPU) but heavier per call (long prompts), and it cuts factual hallucination by grounding. Fine-tuning edits the weights: train (full, or cheap LoRA/QLoRA adapters) until the behaviour is baked in. It’s the tool for behaviour — format, style, skill, structured output — it gives shorter prompts and lower latency but is costly to update (retrain) and a bad way to add facts (teaching facts by fine-tuning increases confident hallucination). The mnemonic: RAG is an open book; fine-tuning is school. Treat it as an escalation ladder — prompt → RAG → fine-tune → both (RAFT) — and climb only when the cheaper rung isn’t enough. Don’t fine-tune to teach facts; don’t RAG to fix format.
