Ollama vs vLLM vs Triton — battle
Three tools keep getting thrown into the same sentence — “should I serve this on Ollama, vLLM, or Triton?” — as if they’re three speeds of the same thing. They aren’t. One is a friendly local daemon, one is a throughput engine, one is a serving platform that often runs the second one inside it. Picking between them by vibe is the most expensive mistake in the GPU-serving stack, and it almost always traces back to a single question nobody asks first:
Is your workload memory-bound or compute-bound?
Get that wrong and you’ll buy a second GPU to “add concurrency” to a job that can’t use it, or run a high-traffic model one request at a time and wonder why a card rated for thousands of tokens a second is serving dozens.
The short version, which we’ll earn properly in the middle: LLM and VLM text generation is memory-bound, so batching many requests is a huge win. Image and video diffusion is compute-bound, so concurrency on one GPU buys you nothing. Same silicon, opposite playbook. Now the contenders.
Ollama — the easy always-on daemon
Ollama is the one you reach for when you just want a model answering questions on localhost in five minutes. It’s a long-lived local daemon with a clean REST API and a Modelfile story borrowed straight from Docker. Its genuinely nice tricks:
- It manages VRAM for you. Models load on first request and unload after an idle timeout (the keep-alive). Walk away, the card frees up; come back, it reloads. No babysitting.
- It hot-swaps models. Ask for a different model and Ollama drops the old one and loads the new one on demand. One daemon, many models, no restart.
- It’s serial by default. Out of the box a model serves requests one at a time. There is a parallelism knob (OLLAMA_NUM_PARALLEL), but it’s a modest setting, not a batching engine; Ollama’s design center is convenience, not throughput.
That serial-by-default behaviour is the tell. Ollama is built for development, prototyping, and single-model or low-concurrency use — a coding assistant, a local agent, one person’s chat. Point real concurrent production traffic at it and it becomes the bottleneck, because it isn’t trying to pack requests; it’s trying to be effortless. Which it is.
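To make the keep-alive behaviour concrete, here is a minimal sketch of a call to Ollama's REST API from Python's standard library. It assumes the daemon is listening on its default local address and that the model has already been pulled. The model name in the usage line is a placeholder, and the helper names are made up for this example.

```python
import json
import urllib.request

# Default local Ollama address; adjust if your daemon listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, keep_alive: str = "5m") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,            # loaded on first use, swapped in on demand
        "prompt": prompt,
        "stream": False,           # one JSON object back instead of a token stream
        "keep_alive": keep_alive,  # how long the model stays in VRAM after this call
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, keep_alive: str = "5m") -> str:
    """Send the request; needs a running Ollama daemon with the model available."""
    with urllib.request.urlopen(build_generate_request(model, prompt, keep_alive)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("llama3", "Why is the sky blue?", keep_alive="10m")` would keep that model resident for ten minutes after the answer comes back. Asking for a different model name in the next call is all it takes to trigger a swap.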
Verdict: the best developer experience here by a mile, and the only contender that self-manages VRAM and hot-swaps models. Not the tool for high-throughput serving — that was never the job.
vLLM — the throughput engine
vLLM is not “Ollama but faster.” It’s a specialized inference engine for LLMs and VLMs whose entire reason to exist is one thing: serving many concurrent requests at high throughput. Two ideas do the heavy lifting:
- Continuous batching. Instead of waiting for a fixed batch to fill, vLLM merges and splits requests every decoding step — a finished request leaves, a new one slots in, the GPU never idles between batches. This is the lever that turns spare compute into throughput.
- PagedAttention. The KV-cache is managed like virtual memory — non-contiguous “pages” instead of one big reserved block — so the card fits far more concurrent sequences without fragmentation waste.
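The scheduling difference behind continuous batching is easy to see in a toy model. The sketch below is not vLLM's scheduler. It only counts decode steps, assumes every request needs a known, positive number of output tokens, and ignores prefill and memory limits. The function names are invented for the illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill free slots before every decode step, as soon as a request finishes."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # output tokens per request (made-up workload)
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

With static batches, the three short requests in each group sit finished while the long one drags on. With per-step refill, their slots go straight to the next requests in the queue, so the same work completes in fewer steps.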
The payoff is dramatic, and it’s the direct consequence of the memory-bound nature of decoding (more on that below). In one benchmark, pushing concurrency up to ~16 simultaneous requests completed them at near-flat latency — on the order of 10–15× the throughput of serving one at a time. That’s not a tuning tweak; that’s a different regime.
The trade-offs are real and you should plan around them:
- Cold start. Loading a model into vLLM takes real time — weight loading, CUDA-graph capture, cache warmup. It’s a server you start and leave running, not a thing you spin up per request.
- No hot model swap. vLLM serves the model you launched it with. Switching models means restarting (or running another instance). There’s no Ollama-style “just ask for a different one.”
- It holds VRAM. It reserves cache aggressively and keeps the GPU occupied until you stop it. That’s a feature for a dedicated serving box and a nuisance on a shared dev machine.
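Why the cache matters so much comes down to arithmetic. The sketch below sizes the KV-cache per token and compares how many sequences fit when each one reserves its maximum length up front versus when each holds only the fixed-size blocks it has filled, which is the idea behind PagedAttention. The model shape, the 16 GiB budget, the 500-token sequences, and the block size of 16 are illustrative assumptions, not measurements or confirmed vLLM settings.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Keys and values (the factor 2) for every layer and KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_sequences_contiguous(cache_bytes, per_token, max_len):
    """Each sequence reserves room for max_len tokens up front."""
    return cache_bytes // (per_token * max_len)


def max_sequences_paged(cache_bytes, per_token, actual_len, block_size=16):
    """Each sequence holds only the blocks its tokens actually occupy."""
    blocks_per_seq = math.ceil(actual_len / block_size)
    total_blocks = cache_bytes // (per_token * block_size)
    return total_blocks // blocks_per_seq


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
cache = 16 * 1024**3  # hypothetical 16 GiB left for KV-cache after weights

print(per_token)                                                 # 131072 bytes (128 KiB)
print(max_sequences_contiguous(cache, per_token, max_len=8192))  # 16
print(max_sequences_paged(cache, per_token, actual_len=500))     # 256
```

Under these assumptions the same memory holds 16 sequences with worst-case reservations and 256 when sequences average 500 tokens and pay only for the blocks they use. That headroom is what lets the batch grow.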
Verdict: the raw-throughput champion for LLM/VLM serving. If concurrent production traffic hits one model, this is the answer — and the gap over serial serving isn’t incremental, it’s an order of magnitude.
Triton — the serving platform (that hosts vLLM)
Here’s where people get most confused. NVIDIA Triton Inference Server is not a faster LLM engine. It’s a generic, production-grade model-serving platform, and it plays a different game entirely. What it brings:
- Serves anything. PyTorch, ONNX, TensorRT, plain Python, diffusion models, classic CV — many frameworks behind one uniform API. It doesn’t care what the model is.
- Many models, one server. Host a whole zoo behind a single endpoint, pack multiple small models onto one GPU, version them, hot-reload them without downtime.
- Real ops surface. Model versioning, dynamic batching, ensembles and pipelines (chain a tokenizer → model → post-processor as one served unit), and Prometheus metrics out of the box.
And the part that resolves the confusion: for LLMs, the standard pattern is Triton running vLLM or TensorRT-LLM as a backend. Triton has a vLLM backend precisely so you get vLLM’s throughput plus Triton’s ops layer. So Triton doesn’t beat vLLM on LLM throughput — it wraps it. Asking “vLLM or Triton?” for a single LLM is often a false choice; the real production answer can be “Triton, with vLLM inside.”
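The "one uniform API" point is visible in what a client sends. Triton's HTTP endpoint follows the KServe v2 inference protocol, so the request has the same shape whichever backend sits behind the model name. The sketch below only builds such a request. The model name `my_llm`, the input tensor name `text_input`, and the 1-D string shape are assumptions for illustration; the real names and shapes come from the model's configuration, and a vLLM-backed model may expect additional inputs.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, input_name, texts, version=None):
    """Build (but do not send) a KServe v2 style inference request for Triton."""
    path = f"/v2/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"  # pin a specific model version
    body = {
        "inputs": [
            {
                "name": input_name,     # must match the model's configuration
                "shape": [len(texts)],  # assumes a 1-D string input
                "datatype": "BYTES",    # the protocol's type for strings
                "data": list(texts),
            }
        ]
    }
    return urllib.request.Request(
        base_url.rstrip("/") + path + "/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_infer_request("http://localhost:8000", "my_llm", "text_input", ["Hello"], version=2)
print(req.full_url)  # http://localhost:8000/v2/models/my_llm/versions/2/infer
```

Swapping the LLM for an ONNX classifier changes the model name and the tensors, not the client code path. That is the operational value Triton adds on top of whatever engine does the math.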
Verdict: the platform play. Choose it when you’re operating a heterogeneous fleet of many model types and you want one uniform serving, versioning, and metrics layer. Don’t choose it expecting more tokens-per-second out of one model than the engine it’s hosting can already deliver.
Feature comparison
| | Ollama | vLLM | Triton |
|---|---|---|---|
| Model types served | LLMs / VLMs (GGUF) | LLMs / VLMs | any framework — PyTorch, ONNX, TensorRT, Python, diffusion, CV |
| Concurrency model | serial by default (a modest parallelism knob exists) | continuous batching + PagedAttention | dynamic batching; delegates to the backend (e.g. vLLM) for LLMs |
| Server lifecycle / VRAM | always-on daemon; idle-unloads via keep-alive | long-lived server; holds VRAM until stopped | long-lived server; manages many models’ memory |
| Model hot-swap | yes — load/unload on demand | no — switching = restart | yes — load/unload & version without downtime |
| Cold start | fast, lazy per model | slow — weight load + graph capture + warmup | per-model load; amortized across a long-lived server |
| Production ops (versioning, metrics, ensembles) | minimal | metrics, but single-model focus | full — versioning, Prometheus, ensembles/pipelines, hot-reload |
| Best raw LLM throughput | low (serial) | highest | = whatever backend it hosts (typically vLLM / TRT-LLM) |
| Relationship to the others | the easy on-ramp | the engine | the platform — often runs vLLM inside |
When to choose which
| Scenario | Pick |
|---|---|
| Local dev / prototyping, one machine | Ollama |
| One model always loaded, low concurrency, you want hot-swap + idle-unload | Ollama |
| High-concurrency production LLM/VLM serving, one (or few) models | vLLM |
| Many heterogeneous models behind one API, with versioning + metrics + ensembles | Triton (with vLLM or TensorRT-LLM as the LLM backend) |
| Image / video diffusion | none of the above solve it — see below |
The number that lies: “GPU at 96%”
Now the centerpiece. The reason people serve LLMs one-at-a-time and never notice the waste is a single misread metric.
When nvidia-smi says the GPU is 96% utilized, almost everyone reads that as “96% of the compute is in use.” It does not mean that. GPU utilization is a time-occupancy metric: “a kernel was executing during 96% of the sampled time.” It says nothing about whether that kernel was doing useful math at full rate — only that something was scheduled.
So a workload can report 96% “busy” while burning a small fraction of the card’s peak FLOPs, because:
- the tensor cores are stalled waiting on memory — a kernel is technically “running,” but it spends its cycles waiting for weights to arrive from HBM rather than multiplying; or
- it’s running in a precision mode that ignores the fast path — e.g. plain BF16 on hardware that has dramatically faster FP8/FP4 tensor-core instructions it never invokes.
Utilization-% is “was a kernel resident,” not “how much of the silicon’s math throughput am I extracting.” Those are wildly different questions, and the gap between them is exactly where free performance hides.
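You can estimate that gap from numbers you already have. This is a rough sketch that assumes the common approximation of about 2 FLOPs per parameter per generated token and ignores attention over the context. The model size, token rate, and peak figure are hypothetical placeholders, so plug in your own.

```python
def achieved_flops(tokens_per_s, n_params):
    """Approximate math rate: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params * tokens_per_s


def math_utilization(tokens_per_s, n_params, peak_flops):
    """Fraction of the card's peak math throughput actually extracted."""
    return achieved_flops(tokens_per_s, n_params) / peak_flops


# Hypothetical: an 8e9-parameter model serving 50 tokens/s
# on a card with a 1e15 FLOP/s peak in the precision being used.
u = math_utilization(tokens_per_s=50, n_params=8e9, peak_flops=1e15)
print(f"{u:.4%}")  # 0.0800%
```

A card in that situation can still report 96% utilization, because a kernel is resident nearly all the time. The two numbers answer different questions.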
Why batching helps LLMs but not diffusion
This is the whole fight in two sentences.
LLM decoding (batch-1) is memory-bound. Generating one token at a time, the GPU spends most of its cycles streaming the model’s weights and KV-cache out of HBM; the tensor cores do a little math and then wait for the next chunk of memory. The compute units are idle a lot even at “96% utilization.” vLLM’s continuous batching fills that idle compute with other requests’ tokens — the weights are already in flight, so each extra request rides along almost for free. That’s why throughput climbs nearly linearly with concurrency until the card finally runs out of math to give. Idle compute existed; batching spent it.
A single diffusion render is compute-bound. A denoising step is dense convolution and attention over a big latent — the tensor cores are already saturated doing math, not waiting on memory. There’s no idle compute to fill. Stack a second concurrent render and the GPU just time-slices the same math units between the two: each render runs at half speed, and total images-per-second stays flat. Concurrency only wins when there’s idle compute to absorb it — and a saturated diffusion step has none.
That’s the rule, stated once and for all: concurrency converts idle compute into throughput. Memory-bound work has idle compute to convert; compute-bound work doesn’t.
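The rule can be put into a back-of-the-envelope roofline model. This is a simplification, not a benchmark: it assumes a decode step streams every weight once regardless of batch size, costs about 2 FLOPs per parameter per sequence, and ignores KV-cache traffic and scheduling overhead. The hardware figures are hypothetical round numbers.

```python
def decode_tokens_per_s(batch, n_params, dtype_bytes, mem_bw, peak_flops):
    """Step time is whichever is slower: streaming weights or doing the math."""
    t_memory = n_params * dtype_bytes / mem_bw     # same for any batch size
    t_compute = 2 * n_params * batch / peak_flops  # grows with the batch
    return batch / max(t_memory, t_compute)


def diffusion_images_per_s(concurrent, render_s):
    """Compute-bound: k concurrent renders time-slice the GPU, so each takes k times longer."""
    return concurrent / (concurrent * render_s)


# Hypothetical card: 2e12 bytes/s of memory bandwidth, 4e14 FLOP/s peak.
hw = dict(n_params=8e9, dtype_bytes=2, mem_bw=2e12, peak_flops=4e14)
for batch in (1, 16, 400, 800):
    print(batch, round(decode_tokens_per_s(batch, **hw)))
# 1 -> 125, 16 -> 2000, 400 -> 25000, 800 -> 25000 (plateau: now compute-bound)

for k in (1, 2, 4):
    print(k, diffusion_images_per_s(k, render_s=4.0))  # 0.25 every time
```

In this idealised model, LLM throughput scales linearly until the compute term overtakes the memory term. Real systems bend earlier because the KV-cache also has to be read, which is consistent with the 10–15× at roughly 16 concurrent requests quoted above rather than a full 16×.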
So how do you speed up diffusion?
Not by throwing concurrency at one GPU — we just saw the math units are already full. You attack it from two directions:
- Scale out — more GPUs. Diffusion renders are independent, so N cards run N renders in true parallel. This is the only “concurrency” that helps a compute-bound job: separate silicon, separate math units, no time-slicing.
- Make each render cheaper — attack the per-render compute. This is where the real wins are, because the per-render cost is the wall:
  - Step-distilled models: Turbo / Lightning / LCM-style few-step samplers that get there in 4–8 steps instead of 30–50.
  - Lower precision on modern hardware: FP8 / FP4 tensor-core paths on Blackwell-class GPUs, which do the same math with far higher throughput (and which a plain BF16 run silently leaves on the table).
  - Kernel fusion: torch.compile or TensorRT to fuse ops and cut launch/memory overhead.
  - Feature caching: TeaCache / DeepCache and friends, reusing computed features across adjacent steps.
  - Fewer denoising steps and fused attention as the across-the-board baseline.
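These levers multiply, which a tiny cost model makes visible. The sketch assumes a render costs steps times per-step time, that a faster precision path or fused kernels act as a flat per-step speedup, and that extra GPUs run independent renders. All the numbers are hypothetical, including the 1.5× per-step speedup.

```python
def images_per_s(steps, step_s, n_gpus=1, step_speedup=1.0):
    """Throughput when each render takes steps * step_s / step_speedup seconds on its own GPU."""
    render_s = steps * step_s / step_speedup
    return n_gpus / render_s


baseline = images_per_s(steps=40, step_s=0.1)                        # one GPU, 40 steps
distilled = images_per_s(steps=8, step_s=0.1)                        # few-step sampler
scaled_out = images_per_s(steps=8, step_s=0.1, n_gpus=2, step_speedup=1.5)

print(distilled / baseline)   # about 5x from fewer steps alone
print(scaled_out / baseline)  # about 15x with faster steps and a second GPU
```

Notice what is absent from the function: a concurrency parameter for a single GPU. In a compute-bound regime there is nothing for it to multiply.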
And to close the loop on the platform question: Triton is not a diffusion concurrency trick. Hosting a compute-bound diffusion model behind Triton (or any server) doesn’t add per-GPU throughput — the GPU’s math is the limit, and a serving layer doesn’t manufacture more of it. Triton is great for operating that model (versioning, metrics, one API alongside your other models); it is not a way to make a saturated card render faster.
TL;DR — how to choose
TL;DR: First decide your regime. Memory-bound (LLM/VLM text gen) → concurrency is a huge win, so reach for an engine that batches. Compute-bound (image/video diffusion) → concurrency on one GPU does nothing; speed comes from cheaper renders or more GPUs.
- Ollama — local dev, single-model, low concurrency. Self-manages VRAM, hot-swaps models, serial by default. Effortless, not high-throughput.
- vLLM — production LLM/VLM serving under concurrent load. Continuous batching + PagedAttention give ~10–15× over serial in benchmarks. Pays a cold start, holds VRAM, no hot-swap.
- Triton — a serving platform for many heterogeneous models behind one API, with versioning/metrics/ensembles. For LLMs it hosts vLLM/TensorRT-LLM — it complements the engine, it doesn’t out-throughput it.
- Diffusion — none of these add per-GPU throughput. Use step-distilled models, FP8/FP4, torch.compile/TensorRT, feature caching, fewer steps, or more GPUs.
And whatever you do, don’t trust “GPU at 96%.” That’s time-occupancy, not FLOPs. The card can be 96% “busy” and mostly idle on math, which is precisely the gap batching fills for LLMs and can’t fill for diffusion.
Sources
- vLLM — Continuous batching & PagedAttention (docs)
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM paper) — arXiv
- NVIDIA Triton Inference Server — documentation
- Triton vLLM backend — GitHub
- Ollama — keep-alive & parallel request handling (FAQ)
- Understanding GPU utilization — it’s not FLOP efficiency (NVIDIA developer forums / DCGM docs)
- Roofline model: memory-bound vs compute-bound — Wikipedia