LLM vs VLM vs DIFF — battle

2026-05-30

You wire up three models on the same GPU. A chat model, a model that reads images, and an image generator. Same CUDA, same nvidia-smi, all three pin the card to “96% utilization” when they run. So you reason about them the same way — and then production betrays you. You batch ten chat requests onto one card and throughput climbs almost linearly; you batch two image renders onto the same card and each one runs at half speed. You buy a second GPU to “add concurrency” to a job that can’t use it. You run a high-traffic model one request at a time and wonder why a card rated for thousands of tokens a second is serving dozens.

The mistake is treating three different workload classes as one. LLM, VLM, and diffusion look identical from the outside, but they split on a single axis that decides everything downstream:

Is the work memory-bound or compute-bound?

That one question tells you which serving stack to reach for, whether concurrency is a free win or dead weight, and how you scale throughput when traffic grows. Get the class wrong and you buy the wrong hardware and the wrong serving stack. Let’s earn the answer one class at a time.

This is the companion piece to Ollama vs vLLM vs Triton — battle, which compares the serving engines. This one is about the workloads you point those engines at.

LLM — text in, text out

A large language model is the one everyone pictures: text goes in, text comes out. Under the hood it’s a transformer decoder running autoregressively — it generates one token, appends it to the context, and feeds the whole thing back to predict the next token. One forward pass per token, hundreds or thousands of times per response. Llama, Qwen, Mistral — all the same shape.

The thing that matters for serving is the compute regime of that token loop. At batch size 1, generating a single token is memory-bound. To produce each token the GPU has to stream the model’s weights and the growing KV-cache out of HBM, do a relatively small amount of math on them, and move on. The tensor cores — the silicon’s actual math units — do a little work and then sit idle waiting for the next chunk of memory to arrive. The bottleneck isn’t FLOPs; it’s memory bandwidth.
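To put a rough number on that bottleneck, here is a minimal back-of-the-envelope sketch. It assumes every generated token requires streaming all the weights from HBM once, and it ignores KV-cache traffic. The function name and the hardware figures (8 billion parameters, 2 bytes each, 1 TB/s of bandwidth) are illustrative assumptions, not measurements.

```python
def batch1_decode_ceiling(param_count, bytes_per_param, hbm_bytes_per_s):
    """Rough upper bound on batch-1 tokens/s when weight streaming is the limit.

    Assumes all weights are read from HBM once per generated token and
    ignores KV-cache reads, so real numbers will be lower.
    """
    weight_bytes = param_count * bytes_per_param
    return hbm_bytes_per_s / weight_bytes


# Hypothetical model and card, chosen only to make the arithmetic easy.
ceiling = batch1_decode_ceiling(param_count=8e9, bytes_per_param=2, hbm_bytes_per_s=1e12)
print(f"batch-1 ceiling: {ceiling:.1f} tokens/s")
```

Under these assumed numbers the ceiling sits in the tens of tokens per second no matter how fast the math units are, which is the sense in which the loop is bandwidth-limited rather than FLOP-limited.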

That idle compute is not a bug to optimize away — it’s an opportunity. Because the weights are already in flight from HBM, you can run other requests’ tokens through the same loaded weights almost for free. This is why batching and concurrency are an enormous win for LLMs: a batching server packs many concurrent requests into the same forward passes, spending the idle compute that batch-1 leaves on the table. Throughput climbs nearly linearly with concurrency until the card finally runs out of math to give. You serve an LLM with an LLM server — Ollama, vLLM, TGI, and friends — and you scale it by bigger batches or more replicas.
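The "nearly linear until the math runs out" shape can be sketched with a simple roofline-style model. This is a toy under stated assumptions: each decode step streams the weights once regardless of batch size, compute grows with batch at roughly 2 FLOPs per parameter per token, and KV-cache traffic (which grows with batch and context length) is ignored. The function name and all hardware numbers are hypothetical.

```python
def decode_throughput(batch, weight_bytes, hbm_bytes_per_s, flops_per_token, peak_flops_per_s):
    """Tokens/s for one decode step shared by `batch` requests (toy roofline model)."""
    memory_time = weight_bytes / hbm_bytes_per_s                # weights streamed once per step
    compute_time = batch * flops_per_token / peak_flops_per_s   # math scales with batch
    step_time = max(memory_time, compute_time)                  # the slower side sets the pace
    return batch / step_time


# Hypothetical 8B-parameter BF16 model on a card with 1 TB/s and 200 TFLOP/s.
PARAMS = 8e9
card = dict(
    weight_bytes=PARAMS * 2,
    hbm_bytes_per_s=1e12,
    flops_per_token=2 * PARAMS,   # common approximation: ~2 FLOPs per parameter
    peak_flops_per_s=2e14,
)

for batch in (1, 10, 100, 200, 400):
    print(f"batch {batch:>3}: {decode_throughput(batch, **card):>8.0f} tokens/s")
```

In this toy model throughput scales with batch size while the step is memory-bound, then goes flat once the compute time overtakes the memory time. Where that crossover lands on real hardware depends on the model, precision, context length, and server, so it has to be measured.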

VLM — an LLM with eyes

A vision-language model takes image(s) plus text and produces text: caption this photo, answer a question about this chart, read this screenshot. Qwen-VL, LLaVA, Pixtral. It sounds like a third species, and this is exactly where people overcomplicate their architecture.

It isn’t a third species. A VLM is a vision encoder bolted onto the front of the same transformer decoder. The encoder turns pixels into a sequence of embedding tokens (often hundreds per image, depending on the model and the image resolution); those tokens get prepended to the text tokens; and from there the model generates a reply the same autoregressive, one-token-at-a-time way an LLM does. The image-encode step is a one-time prefill cost paid before generation starts, and then the generation half is identical to an LLM’s.

So the compute regime is the same: the encode is a brief compute-bound burst, but the part that dominates a typical response — token generation — is memory-bound, exactly like an LLM. That means a VLM batches like an LLM and is served by the same LLM servers. The “LLM with eyes” framing isn’t a cute analogy; it’s the operational truth. VLM is not a third regime — it rides the LLM regime. If you already know how to serve an LLM, you already know how to serve a VLM. Don’t treat it as exotic.
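A quick way to see why the encode burst rarely changes the picture is to add up the time for one request. The sketch below is a hypothetical cost model, not a benchmark: the function name, the 50 ms image-prefill time, and the 16 ms per-token decode time are all assumed values.

```python
def vlm_request_seconds(encode_seconds, output_tokens, seconds_per_token):
    """Total time for one VLM reply: one-time image encode plus per-token decode."""
    decode_seconds = output_tokens * seconds_per_token
    return encode_seconds + decode_seconds


# Assumed numbers for illustration only.
encode = 0.05
total = vlm_request_seconds(encode_seconds=encode, output_tokens=300, seconds_per_token=0.016)
print(f"total: {total:.2f} s, encode share: {encode / total:.1%}")
```

With these assumed numbers the encode is a small fraction of the request. A very short reply or a very large image would raise that share, so the split is worth checking on a real workload.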

DIFF — noise in, pixels out

Diffusion is the odd one out, and the one that breaks the “just batch it” instinct. A diffusion model takes noise plus conditioning — a text prompt, a reference image, an inpainting mask — and produces an image or a video. SDXL, Flux, Qwen-Image for stills; Wan, Hunyuan for video.

The architecture is a UNet or a DiT (diffusion transformer), but the part that defines its behaviour is the loop around it: diffusion runs that network iteratively over many denoising steps, 20 to 50 of them for a quality result, each step a full forward pass that nudges a noisy latent a little closer to a clean image. And each of those steps is dense math — big convolutions and attention over a large latent tensor.

That makes a single diffusion render compute-bound at batch 1. The tensor cores are already saturated doing real math; they aren’t waiting on memory. A single render can legitimately pin one GPU near 100% with the math units genuinely full. And here is the consequence that catches everyone: concurrency does not help on one GPU. There’s no idle compute to fill — so a second concurrent render doesn’t ride along for free, it just time-slices the same busy math units. Two renders each run at roughly half speed; total images-per-second stays flat. You gained nothing.
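The half-speed effect is easy to model. The sketch below is a toy, assuming the math units are already full with one render, so N concurrent renders each get 1/N of the GPU. The function name and the 30-step, 0.1 s per step figures are hypothetical.

```python
def render_stats(concurrent_renders, steps, step_seconds):
    """Latency per render and total images/s on ONE compute-bound GPU (toy model).

    Assumes the math units are already full at one render, so N concurrent
    renders each get 1/N of the GPU and every step takes N times longer.
    """
    latency = steps * step_seconds * concurrent_renders
    images_per_second = concurrent_renders / latency
    return latency, images_per_second


# Hypothetical render: 30 denoising steps at 0.1 s per step when run alone.
for n in (1, 2, 4):
    latency, ips = render_stats(n, steps=30, step_seconds=0.1)
    print(f"{n} concurrent: {latency:.1f} s per render, {ips:.3f} images/s")
```

Latency grows with the number of concurrent renders while images per second stays the same, which is the flat-throughput result described above.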

Diffusion is also not served by an LLM server. There’s no token stream to batch, no KV-cache to page. It runs in-process in a Python pipeline — diffusers, ComfyUI, and the like. So you don’t scale it with bigger batches. You scale it two ways: more GPUs (each runs an independent render in true parallel), or make each render cheaper — step-distilled models (Turbo / Lightning / LCM that finish in 4–8 steps instead of 30–50), low precision (FP8 / FP4 on modern tensor cores), kernel fusion (torch.compile / TensorRT), feature caching (TeaCache / DeepCache, reusing computed features across adjacent steps), or simply fewer steps. We’ll come back to this.

Main differences

	LLM	VLM	DIFF
Input → output	text → text	image(s) + text → text	noise + conditioning (text/image/mask) → image or video
Core architecture	transformer decoder	vision encoder → same transformer decoder	UNet / DiT denoiser
Generation pattern	autoregressive, one token at a time	image-encode (prefill), then autoregressive tokens	iterative — many denoising steps, each a full forward pass
Compute regime (batch-1)	memory-bound — tensor cores idle, streaming weights/KV-cache from HBM	memory-bound (one-time compute-bound encode, then LLM-like decode)	compute-bound — tensor cores saturated; one render can pin the GPU
Served by	LLM server (Ollama, vLLM, TGI…)	same LLM servers	not an LLM server — runs in-process (`diffusers` / Comfy)
Does batching/concurrency help?	yes — big win (fills idle compute)	yes — batches like an LLM	no on one GPU — a 2nd render time-slices already-busy compute
How you scale throughput	bigger batches / more replicas	bigger batches / more replicas	more GPUs, or cheaper renders (distillation, FP8/FP4, compile, caching, fewer steps)
Example models	Llama, Qwen, Mistral	Qwen-VL, LLaVA, Pixtral	SDXL, Flux, Qwen-Image, Wan / Hunyuan (video)


“I want to…” → which class

I want to…	Class	What runs it	Operational implication
Reason over text, chat, write or analyze code	LLM	LLM server (batching)	one GPU serves many concurrent requests: a batching server packs them into the same forward passes
Caption, understand, or answer questions about an image	VLM	same LLM server (batching)	same as LLM — one GPU, many concurrent requests; VLM rides the LLM regime
Generate or edit an image	DIFF (image)	in-process pipeline (`diffusers` / Comfy)	one GPU does one render at a time: give it the whole card; add GPUs to add throughput
Generate a video	DIFF (video)	in-process pipeline	same as image diffusion, only heavier — one render at a time per GPU, scale by GPUs or cheaper renders

The whole thing in one axis: memory-bound vs compute-bound

Strip away the architectures and the entire split is this:

LLM and VLM text generation is memory-bound. The math units spend most of their cycles waiting on memory. There is idle compute sitting right there.
A single diffusion render is compute-bound. The math units are full. There is no idle compute.

Everything operational falls out of that one fact — and the cleanest way to see it is to stare at the most misread number in the stack.

When nvidia-smi reports the GPU at 96% utilization, almost everyone reads it as “96% of the compute is in use.” It does not mean that. GPU utilization is a time-occupancy metric: “a kernel was executing during 96% of the sampled time.” It says nothing about whether that kernel was doing math at full rate — only that something was scheduled. GPU utilization % ≠ FLOPs used.

So the same “96%” means two completely different things depending on the class:

A diffusion render at 96% is genuinely compute-saturated — the tensor cores really are doing dense math 96% of the time. The number is honest.
An LLM at 96% is often memory-stalled with idle tensor cores — a kernel is technically resident the whole time, but it spends its cycles waiting for weights to stream from HBM rather than multiplying. The number is “busy,” but the math units are starved.

Utilization answers “was a kernel resident?”; FLOP-efficiency answers “was the silicon doing math?” An LLM hides idle compute behind a high utilization number; that’s the slack batching spends. A saturated diffusion step has no such slack.
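The gap between those two questions can be made concrete with arithmetic. The sketch below is an assumption-laden illustration: it takes a hypothetical batch-1 LLM decoding 62.5 tokens/s on an 8B-parameter model, uses the common approximation of about 2 FLOPs per parameter per token, and compares the achieved math rate against an assumed 200 TFLOP/s peak. None of these figures come from a real measurement, and the helper name is made up for this example.

```python
def flop_efficiency(tokens_per_s, params, peak_flops_per_s):
    """Fraction of peak math throughput actually used while decoding."""
    achieved_flops_per_s = tokens_per_s * 2 * params   # ~2 FLOPs per parameter per token
    return achieved_flops_per_s / peak_flops_per_s


reported_utilization = 0.96   # what a time-occupancy metric might show (assumed)
efficiency = flop_efficiency(tokens_per_s=62.5, params=8e9, peak_flops_per_s=2e14)

print(f"utilization (kernel resident): {reported_utilization:.0%}")
print(f"FLOP efficiency (math done):   {efficiency:.1%}")
```

Under these assumptions the card looks busy almost all of the time while using well under 1% of its peak math rate. The two numbers answer different questions, so neither can be read off the other.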

That gap is the whole story. Batching converts idle compute into throughput. Memory-bound work (LLM/VLM) has idle compute to convert, so concurrency is a huge win. Compute-bound work (diffusion) has none, so concurrency on one GPU buys you nothing. Same silicon, opposite playbook — and it all traces back to which side of the memory-bound/compute-bound line the workload sits on.

So how do you serve and scale each?

The operational consequence writes itself once you know the regime.

LLM and VLM — co-locate many requests on one GPU. These are memory-bound, so a batching server is the right tool: it packs concurrent requests into shared forward passes and turns the idle compute into throughput. One reasonably sized GPU can serve a large number of simultaneous chat or vision requests this way. You scale by raising concurrency (bigger effective batches) until the card runs out of math, then by adding replicas behind a load balancer. VLM gets no special treatment here — same server, same batching, same scaling story. The only VLM-specific cost is the image-encode prefill, which is a brief burst, not a regime change.

Diffusion — give it the whole GPU. A render is compute-bound, so the right posture is concurrency-1: let one render own the card’s math units instead of forcing two to share them at half speed each. Then you scale along two independent axes:

More GPUs. Renders are independent jobs, so N cards run N renders in true parallel. This is the only form of “concurrency” that helps a compute-bound workload — separate silicon means separate math units and no time-slicing.
Cheaper renders. Because the per-render compute is the wall, the biggest wins come from making each render cost less math: step-distilled samplers (Turbo / Lightning / LCM) that finish in 4–8 steps instead of 30–50; lower precision (FP8 / FP4 on modern tensor cores, which do the same math at far higher throughput — a plain BF16 run silently leaves that on the table); kernel fusion via torch.compile or TensorRT; feature caching (TeaCache / DeepCache) that reuses computed features across adjacent steps; and simply fewer denoising steps as the across-the-board baseline.
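The two axes multiply, which a small model makes visible. This is a sketch under stated assumptions: one render per GPU, the same per-step cost for the distilled and the full model (which may not hold in practice), and a `kernel_speedup` parameter that is a hypothetical stand-in for whatever low precision, compilation, or caching delivers on a given setup.

```python
def images_per_second(gpus, steps, step_seconds, kernel_speedup=1.0):
    """Throughput with one render per GPU (toy model).

    `kernel_speedup` is a stand-in for whatever low precision, compilation,
    or caching buys on a given setup; it has to be measured, not assumed.
    """
    render_seconds = steps * step_seconds / kernel_speedup
    return gpus / render_seconds


# Hypothetical baseline: 30 steps at 0.1 s per step on one GPU.
base = images_per_second(gpus=1, steps=30, step_seconds=0.1)
more_gpus = images_per_second(gpus=4, steps=30, step_seconds=0.1)
fewer_steps = images_per_second(gpus=1, steps=8, step_seconds=0.1)
both = images_per_second(gpus=4, steps=8, step_seconds=0.1)

for name, value in [("baseline", base), ("4 GPUs", more_gpus),
                    ("8 steps", fewer_steps), ("4 GPUs + 8 steps", both)]:
    print(f"{name:<18} {value:.2f} images/s ({value / base:.2f}x)")
```

In this model the GPU count and the step count contribute independent factors, so combining them gives the product of the two gains. Real gains depend on the model and on whether quality holds at the lower step count.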

Notice what is not on the diffusion list: “put it behind a batching server.” A serving layer doesn’t manufacture FLOPs. Wrapping a compute-bound model in any server adds no per-GPU throughput, because the GPU’s math is already the limit; the server is there for operations (an API, versioning, metrics), not for speed. It’s the same trap as buying a second card and expecting one saturated render to finish faster: extra silicon runs additional independent renders in parallel, but it does not speed up the render already in flight. The card isn’t waiting on anything you can fill.

TL;DR: Three model classes, one axis. LLM (text→text) and VLM (image+text→text) are autoregressive transformer decoders — memory-bound at batch-1, so batching is a huge win and you serve them on an LLM server, scaling by bigger batches and more replicas. A VLM is just an LLM with a vision front-end — same regime, same server, not exotic. Diffusion (noise+conditioning→image/video) is an iterative UNet/DiT — compute-bound at batch-1, so concurrency on one GPU does nothing; you run it in-process at concurrency-1 and scale by adding GPUs or making each render cheaper (step distillation, FP8/FP4, torch.compile/TensorRT, feature caching, fewer steps).

Operationally: co-locate many LLM/VLM requests on one GPU (a server batches them into the idle compute); give diffusion the whole GPU and scale it with more silicon. And don’t trust “GPU at 96%” — that’s time-occupancy, not FLOPs. An LLM can be 96% “busy” and mostly idle on math (which is the slack batching fills); a diffusion render at 96% is genuinely full (which is why nothing fills it). Pick the class right and the hardware and the serving stack pick themselves.
