Ollama vs vLLM vs Triton — battle

Three tools keep getting thrown into the same sentence — “should I serve this on Ollama, vLLM, or Triton?” — as if they’re three speeds of the same thing. They aren’t. One is a friendly local daemon, one is a throughput engine, one is a serving platform that often runs the second one inside it. Picking between them by vibe is the most expensive mistake in the GPU-serving stack, and it almost always traces back to a single question nobody asks first:

Is your workload memory-bound or compute-bound?

Get that wrong and you’ll buy a second GPU to “add concurrency” to a job that can’t use it, or run a high-traffic model one request at a time and wonder why a card rated for thousands of tokens a second is serving dozens.

The short version, which we’ll earn properly in the middle: LLM and VLM text generation is memory-bound, so batching many requests is a huge win. Image and video diffusion is compute-bound, so concurrency on one GPU buys you nothing. Same silicon, opposite playbook. Now the contenders.

Ollama — the easy always-on daemon

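Ollama is the one you reach for when you just want a model answering questions on localhost in five minutes. It’s a long-lived local daemon with a clean REST API and a Modelfile story borrowed straight from Docker. Its genuinely nice tricks:

  • Lazy, on-demand model loading, so cold starts are fast.
  • Idle-unload via keep-alive, which frees VRAM when a model sits unused.
  • Model hot-swap: load and unload models without restarting the daemon.
  • Serial request handling by default (a modest parallelism knob exists).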

That serial-by-default behaviour is the tell. Ollama is built for development, prototyping, and single-model or low-concurrency use — a coding assistant, a local agent, one person’s chat. Point real concurrent production traffic at it and it becomes the bottleneck, because it isn’t trying to pack requests; it’s trying to be effortless. Which it is.
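As a rough illustration of the REST API and the keep-alive behaviour, here is a minimal sketch of one non-streaming call to Ollama's `/api/generate` endpoint. It assumes a daemon on the default local port; the helper names (`build_generate_request`, `generate`) and the model name in the usage note are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Assumption: an Ollama daemon is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, keep_alive="5m"):
    """Build the JSON body for a non-streaming /api/generate call.

    keep_alive tells the daemon how long to keep the model loaded after
    this request; a value of 0 asks it to unload immediately.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }


def generate(model, prompt, keep_alive="5m", url=OLLAMA_URL, timeout=120):
    """Send one request to a running daemon and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt, keep_alive)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With a model already pulled (say, a hypothetical `llama3.2`), `generate("llama3.2", "Say hi")` would return the completion text. Calling it from several threads at once is a quick way to see the serial-by-default behaviour for yourself.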

Verdict: the best developer experience here by a mile, and the only contender that self-manages VRAM and hot-swaps models. Not the tool for high-throughput serving — that was never the job.

vLLM — the throughput engine

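vLLM is not “Ollama but faster.” It’s a specialized inference engine for LLMs and VLMs whose entire reason to exist is one thing: serving many concurrent requests at high throughput. Two ideas do the heavy lifting:

  • Continuous batching: the scheduler admits new requests and retires finished ones at every decoding step, instead of waiting for a whole batch to drain.
  • PagedAttention: the KV-cache lives in fixed-size blocks rather than one contiguous slab per request, which cuts memory waste and leaves room for more concurrent sequences.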

The payoff is dramatic, and it’s the direct consequence of the memory-bound nature of decoding (more on that below). In one benchmark, pushing concurrency up to ~16 simultaneous requests completed them at near-flat latency — on the order of 10–15× the throughput of serving one at a time. That’s not a tuning tweak; that’s a different regime.
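To make the scheduling idea behind continuous batching concrete, here is a toy step counter. It is not vLLM's scheduler. It assumes every decoding step costs the same regardless of how many slots are busy, which is roughly the memory-bound picture, and the request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled on the very next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step yields one token per running request
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Made-up workload: short and long generations interleaved, two slots.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
# prints: 32 22
```

In the static case every short request waits for its long neighbour to finish. In the continuous case the freed slot is reused immediately, so the same work completes in fewer steps.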

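The trade-offs are real and you should plan around them:

  • It holds its VRAM until the server is stopped; there is no idle-unload.
  • No model hot-swap: switching models means a restart.
  • Cold start is slow: weight load, graph capture, and warmup.
  • It exposes metrics, but the focus is a single model per server.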

Verdict: the raw-throughput champion for LLM/VLM serving. If concurrent production traffic hits one model, this is the answer — and the gap over serial serving isn’t incremental, it’s an order of magnitude.

Triton — the serving platform (that hosts vLLM)

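Here’s where people get most confused. NVIDIA Triton Inference Server is not a faster LLM engine. It’s a generic, production-grade model-serving platform, and it plays a different game entirely. What it brings:

  • Any framework behind one server: PyTorch, ONNX, TensorRT, Python, diffusion, CV.
  • Dynamic batching, delegated to the backend (e.g. vLLM) for LLMs.
  • Model load/unload and versioning without downtime.
  • A full ops layer: Prometheus metrics, ensembles/pipelines, hot-reload.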

And the part that resolves the confusion: for LLMs, the standard pattern is Triton running vLLM or TensorRT-LLM as a backend. Triton has a vLLM backend precisely so you get vLLM’s throughput plus Triton’s ops layer. So Triton doesn’t beat vLLM on LLM throughput — it wraps it. Asking “vLLM or Triton?” for a single LLM is often a false choice; the real production answer can be “Triton, with vLLM inside.”
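For a sense of what “Triton, with vLLM inside” looks like on disk, the sketch below writes a minimal model repository for the vLLM backend. The layout (a `config.pbtxt` naming the backend, plus a `model.json` of engine arguments under a version directory) follows my reading of the backend's published example and should be checked against its README. The function name, repository name, and model identifier are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Assumed minimal config: tells Triton to hand this model to the vLLM backend.
CONFIG_PBTXT = """backend: "vllm"
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
"""


def write_vllm_model_repo(root, name, hf_model, gpu_memory_utilization=0.5):
    """Create <root>/<name>/config.pbtxt and <root>/<name>/1/model.json."""
    model_dir = Path(root) / name
    version_dir = model_dir / "1"
    version_dir.mkdir(parents=True, exist_ok=True)
    # model.json carries vLLM engine arguments, not Triton settings.
    engine_args = {
        "model": hf_model,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    (version_dir / "model.json").write_text(json.dumps(engine_args, indent=2))
    (model_dir / "config.pbtxt").write_text(CONFIG_PBTXT)
    return model_dir


# Demo in a throwaway directory; the model id is just a small example.
with tempfile.TemporaryDirectory() as tmp:
    repo = write_vllm_model_repo(tmp, "vllm_model", "facebook/opt-125m")
    print(sorted(str(p.relative_to(tmp)) for p in repo.rglob("*")))
```

The point of the split is visible in the files themselves: Triton's config only says which backend to use, while everything about throughput lives in the engine arguments that vLLM reads.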

Verdict: the platform play. Choose it when you’re operating a heterogeneous fleet of many model types and you want one uniform serving, versioning, and metrics layer. Don’t choose it expecting more tokens-per-second out of one model than the engine it’s hosting can already deliver.

Feature comparison

Feature | Ollama | vLLM | Triton
--- | --- | --- | ---
Model types served | LLMs / VLMs (GGUF) | LLMs / VLMs | any framework: PyTorch, ONNX, TensorRT, Python, diffusion, CV
Concurrency model | serial by default (a modest parallelism knob exists) | continuous batching + PagedAttention | dynamic batching; delegates to the backend (e.g. vLLM) for LLMs
Server lifecycle / VRAM | always-on daemon; idle-unloads via keep-alive | long-lived server; holds VRAM until stopped | long-lived server; manages many models’ memory
Model hot-swap | yes: load/unload on demand | no: switching = restart | yes: load/unload & version without downtime
Cold start | fast, lazy per model | slow: weight load + graph capture + warmup | per-model load; amortized across a long-lived server
Production ops (versioning, metrics, ensembles) | minimal | metrics, but single-model focus | full: versioning, Prometheus, ensembles/pipelines, hot-reload
Best raw LLM throughput | low (serial) | highest | = whatever backend it hosts (typically vLLM / TRT-LLM)
Relationship to the others | the easy on-ramp | the engine | the platform; often runs vLLM inside

When to choose which

Scenario | Pick
--- | ---
Local dev / prototyping, one machine | Ollama
One model always loaded, low concurrency, you want hot-swap + idle-unload | Ollama
High-concurrency production LLM/VLM serving, one (or few) models | vLLM
Many heterogeneous models behind one API, with versioning + metrics + ensembles | Triton (with vLLM or TensorRT-LLM as the LLM backend)
Image / video diffusion | none of the above solve it; see below
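The same decision table can be written as a few lines of code, which makes the order of the questions explicit. This is only a restatement of the rows above; the function name and its parameters are made up for illustration.

```python
def pick_server(workload, concurrency="low", heterogeneous_models=False):
    """Return the table's pick for a workload.

    workload: "llm" (LLM/VLM text generation) or "diffusion".
    concurrency: "low" or "high".
    heterogeneous_models: True when many model types sit behind one API.
    """
    if workload == "diffusion":
        # Compute-bound: none of the three adds per-GPU throughput.
        return "none of the three; cut per-render cost or add GPUs"
    if heterogeneous_models:
        return "Triton (vLLM or TensorRT-LLM as the LLM backend)"
    if concurrency == "high":
        return "vLLM"
    return "Ollama"


print(pick_server("llm"))
print(pick_server("llm", concurrency="high"))
print(pick_server("llm", concurrency="high", heterogeneous_models=True))
```

The regime check comes first on purpose: whether the workload is diffusion decides if a serving engine can help at all, before any question about which one.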

The number that lies: “GPU at 96%”

Now the centerpiece. The reason people serve LLMs one-at-a-time and never notice the waste is a single misread metric.

When nvidia-smi says the GPU is 96% utilized, almost everyone reads that as “96% of the compute is in use.” It does not mean that. GPU utilization is a time-occupancy metric: “a kernel was executing during 96% of the sampled time.” It says nothing about whether that kernel was doing useful math at full rate — only that something was scheduled.
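If you want to log the number in question, a small sketch like the following reads it from `nvidia-smi`'s CSV query mode. The query fields and format flags are real `nvidia-smi` options; the function names and the sample line in the usage note are illustrative. Both fields are time-occupancy percentages, which is the whole point of this section.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Parse 'gpu, mem' percentage lines, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        gpu_pct, mem_pct = (int(field) for field in line.split(","))
        # Both values say how often something was active during the sample
        # window, not how much of the peak rate was achieved.
        rows.append({"gpu_busy_pct": gpu_pct, "mem_busy_pct": mem_pct})
    return rows


def read_utilization():
    """Run nvidia-smi once; requires an NVIDIA driver on this machine."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_utilization(out)
```

For example, `parse_utilization("96, 41\n")` turns a made-up sample line into `[{"gpu_busy_pct": 96, "mem_busy_pct": 41}]`. The parser assumes numeric fields and will raise if the driver reports a field as unavailable.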

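So a workload can report 96% “busy” while burning a small fraction of the card’s peak FLOPs, because the resident kernel’s tensor cores can be stalled waiting on HBM, or the kernel may not be using the fast precision path at all.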

Utilization-% is “was a kernel resident,” not “how much of the silicon’s math throughput am I extracting.” Those are wildly different questions, and the gap between them is exactly where free performance hides.
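A back-of-the-envelope way to ask the second question is to estimate achieved FLOPs from measured token throughput and divide by the card's peak. The sketch below uses the common approximation of about 2 FLOPs per parameter per generated token. The model size, token rate, and peak figure are hypothetical numbers, not measurements.

```python
def decode_flop_efficiency(tokens_per_s, params, peak_flops):
    """Fraction of peak math throughput used during decoding.

    Assumes roughly 2 FLOPs per parameter per generated token and ignores
    attention over the KV-cache, so treat the result as a rough estimate.
    """
    achieved_flops = 2 * params * tokens_per_s
    return achieved_flops / peak_flops


# Hypothetical: an 8B-parameter model at 60 tokens/s on a 300 TFLOP/s card.
efficiency = decode_flop_efficiency(tokens_per_s=60, params=8e9, peak_flops=300e12)
print(f"{efficiency:.2%} of peak FLOPs")
```

Under those assumed numbers the card is extracting well under one percent of its peak math rate, whatever the utilization counter reports.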

[Figure: GPU utilization is time-occupancy, not FLOPs used. Top panel, what nvidia-smi shows: 96% utilization, meaning a kernel was resident 96% of the time. Bottom panel, what is happening to the math units: tensor cores stalled waiting on HBM or not using the fast precision path, so only a fraction of peak math is extracted. Same GPU, same instant: “96% utilized” and “mostly idle compute” are both true.]

Utilization answers "was a kernel running?"; FLOP-efficiency answers "was the silicon doing math?" The space between them is where batching lives — or where it can't help.

Why batching helps LLMs but not diffusion

This is the whole fight in two sentences.

LLM decoding (batch-1) is memory-bound. Generating one token at a time, the GPU spends most of its cycles streaming the model’s weights and KV-cache out of HBM; the tensor cores do a little math and then wait for the next chunk of memory. The compute units are idle a lot even at “96% utilization.” vLLM’s continuous batching fills that idle compute with other requests’ tokens — the weights are already in flight, so each extra request rides along almost for free. That’s why throughput climbs nearly linearly with concurrency until the card finally runs out of math to give. Idle compute existed; batching spent it.

A single diffusion render is compute-bound. A denoising step is dense convolution and attention over a big latent — the tensor cores are already saturated doing math, not waiting on memory. There’s no idle compute to fill. Stack a second concurrent render and the GPU just time-slices the same math units between the two: each render runs at half speed, and total images-per-second stays flat. Concurrency only wins when there’s idle compute to absorb it — and a saturated diffusion step has none.

That’s the rule, stated once and for all: concurrency converts idle compute into throughput. Memory-bound work has idle compute to convert; compute-bound work doesn’t.
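The rule can be sketched with a simple roofline estimate: a step takes as long as the slower of its math and its memory traffic. The model below is deliberately crude. It ignores KV-cache reads and kernel overheads, and every hardware and workload number in it is a hypothetical placeholder.

```python
def step_seconds(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline estimate: the slower of the math and the memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bw)


def decode_tokens_per_s(batch, params, bytes_per_param, peak_flops, mem_bw):
    """One decode step: weights are read once and shared by the whole batch."""
    flops = 2 * params * batch
    bytes_moved = params * bytes_per_param
    return batch / step_seconds(flops, bytes_moved, peak_flops, mem_bw)


def diffusion_images_per_s(concurrent, flops_per_image, bytes_per_image,
                           peak_flops, mem_bw):
    """Concurrent renders: both math and traffic scale with the count."""
    return concurrent / step_seconds(
        flops_per_image * concurrent, bytes_per_image * concurrent,
        peak_flops, mem_bw,
    )


PEAK, BW = 300e12, 1e12  # hypothetical card: 300 TFLOP/s, 1 TB/s

for batch in (1, 16, 64):
    print("decode batch", batch, decode_tokens_per_s(batch, 8e9, 2, PEAK, BW))
for n in (1, 2, 4):
    print("diffusion x", n, diffusion_images_per_s(n, 3e15, 1e11, PEAK, BW))
```

In this toy model, decode throughput grows in proportion to batch size for as long as the memory term dominates, while diffusion throughput stays flat because the math term already dominates at a single render.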

So how do you speed up diffusion?

Not by throwing concurrency at one GPU — we just saw the math units are already full. You attack it from two directions:

  1. Scale out — more GPUs. Diffusion renders are independent, so N cards run N renders in true parallel. This is the only “concurrency” that helps a compute-bound job: separate silicon, separate math units, no time-slicing.
  2. Make each render cheaper — attack the per-render compute. This is where the real wins are, because the per-render cost is the wall:
    • Step-distilled models — Turbo / Lightning / LCM-style few-step samplers that get there in 4–8 steps instead of 30–50.
    • Lower precision on modern hardware — FP8 / FP4 tensor-core paths on Blackwell-class GPUs, which do the same math with far higher throughput (and which a plain BF16 run silently leaves on the table).
    • Kernel fusion: torch.compile or TensorRT to fuse ops and cut launch/memory overhead.
    • Feature caching — TeaCache / DeepCache and friends, reusing computed features across adjacent steps.
    • Fewer denoising steps and fused attention as the across-the-board baseline.
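The two directions in the list above can be put into one small cost model: per-render time falls with fewer steps and faster precision paths, and fleet throughput scales with GPU count. The per-step time and the speedup factor below are hypothetical, and the model assumes step cost is constant and renders are fully independent.

```python
def render_seconds(steps, seconds_per_step, precision_speedup=1.0):
    """Time for one render on one GPU under a constant per-step cost."""
    return steps * seconds_per_step / precision_speedup


def fleet_images_per_s(num_gpus, seconds_per_render):
    """Independent renders scale out linearly across separate GPUs."""
    return num_gpus / seconds_per_render


baseline = render_seconds(steps=30, seconds_per_step=0.5)
distilled = render_seconds(steps=4, seconds_per_step=0.5)
# Hypothetical 1.5x gain from a lower-precision tensor-core path.
distilled_low_precision = render_seconds(4, 0.5, precision_speedup=1.5)

print(baseline, distilled, distilled_low_precision)
print(fleet_images_per_s(1, baseline), fleet_images_per_s(2, distilled))
```

Neither lever involves stacking concurrent renders on one card. The first function shrinks the work per render, and the second adds silicon.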

And to close the loop on the platform question: Triton is not a diffusion concurrency trick. Hosting a compute-bound diffusion model behind Triton (or any server) doesn’t add per-GPU throughput — the GPU’s math is the limit, and a serving layer doesn’t manufacture more of it. Triton is great for operating that model (versioning, metrics, one API alongside your other models); it is not a way to make a saturated card render faster.

TL;DR — how to choose

TL;DR: First decide your regime. Memory-bound (LLM/VLM text gen) → concurrency is a huge win, so reach for an engine that batches. Compute-bound (image/video diffusion) → concurrency on one GPU does nothing; speed comes from cheaper renders or more GPUs.

And whatever you do, don’t trust “GPU at 96%.” That’s time-occupancy, not FLOPs. The card can be 96% “busy” and mostly idle on math — which is precisely the gap batching fills for LLMs and can’t fill for diffusion.
