LLM model scaling — VRAM · RAM · SSD spillover (Ollama vs vLLM)


A single-file in-browser visualiser of how a large-language model actually sits in memory and how fast it serves under real hardware constraints.

Pick a model (3B → 671B, including MoE), a quantization (FP16 / Q8 / Q5 / Q4 / Q3 / Q2), a context length (2K → 128K), a GPU (RTX 4060 → MI300X) and GPU count, system RAM, and SSD speed. The sim does the arithmetic live.


Two tabs

Ollama — llama.cpp under the hood

vLLM — production inference engine

What’s “precise” about it

Controls

Knob | What it does
Backend tab | Switch between Ollama and vLLM. The sim swaps the layout + verdict logic.
Model | Pick the model. Dense models 3B…405B, plus three MoE families (Mixtral 8×7B / 8×22B, DeepSeek-V3).
Quant | Quantization format. Drops the per-param byte cost from 2 (FP16) to 0.27 (Q2_K).
Ctx | Context window in tokens. KV cache size scales linearly with this.
GPU | One of 13 GPUs from RTX 4060 (8 GB) to MI300X (192 GB).
× count | How many of that GPU you have (used as TP factor on the vLLM tab).
RAM | System RAM available for Ollama’s CPU layers.
SSD | Storage backing mmap’d weights or initial loads.
mmap (Ollama) | Whether to fault weights from SSD on demand. Off → if RAM is too small, model won’t load at all.
TP (vLLM) | Tensor-parallel degree across the chosen GPUs.
concurrent (vLLM) | How many in-flight requests vLLM is serving right now.
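
The Quant and Ctx knobs drive the two largest terms in the memory footprint. Below is a minimal Python sketch of that arithmetic, assuming a standard attention layout where each layer stores one key and one value vector per KV head per token. The function names and the 8B-class example shape (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values read from the sim.

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Weight footprint: parameter count times the per-parameter byte cost."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence: K and V (factor 2) per layer, per KV head,
    per token. Linear in ctx_tokens, as the Ctx knob description says."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Hypothetical 8B-class dense model; the shape is an assumption for illustration.
N_PARAMS = 8e9
fp16_weights = weight_bytes(N_PARAMS, 2.0)    # 2 bytes/param (FP16, from the table)
q2_weights = weight_bytes(N_PARAMS, 0.27)     # 0.27 bytes/param (Q2_K, from the table)
kv_8k = kv_cache_bytes(32, 8, 128, 8192)      # FP16 cache at 8K context
kv_16k = kv_cache_bytes(32, 8, 128, 16384)    # doubling context doubles the cache

print(f"FP16 weights: {fp16_weights / 1e9:.1f} GB")
print(f"Q2_K weights: {q2_weights / 1e9:.2f} GB")
print(f"KV cache @ 8K:  {kv_8k / 2**30:.1f} GiB")
print(f"KV cache @ 16K: {kv_16k / 2**30:.1f} GiB")
```

The sketch ignores activation buffers and runtime overhead, so a real footprint would be somewhat larger than these two terms.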

How to read the verdict

Pill | Means
ALL ON GPU (Ollama) | Weights + KV cache fit in VRAM. Decode runs at the GPU’s full memory bandwidth.
CPU OFFLOAD (Ollama) | Some weights spilled to system RAM. Decode is now bound by DDR5 bandwidth; expect 1–10 tok/s.
SSD SPILL (Ollama) | Even RAM was too small; remaining weights are mmap’d from SSD. Decode ≪ 1 tok/s, usually unusable.
WONT LOAD | The (weights + minimum KV) footprint doesn’t fit. The sim explicitly tells you what to change (drop quant, raise TP, switch GPU).
KV STARVED (vLLM) | Weights fit, but almost no VRAM is left for a KV pool. Batch ≤ 1.
TIGHT (vLLM) | KV pool fits a handful of concurrent requests; lower ctx or quant for more.
HEALTHY (vLLM) | Large KV pool; throughput scales with batch size until the GPU saturates.
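
The pills above amount to a small decision procedure over the memory footprint. The following Python sketch shows one way such logic could look; it is not the sim's actual code. The function names are hypothetical, and the `tight_below` cutoff for "a handful" of requests is a made-up threshold chosen only for illustration.

```python
def ollama_verdict(footprint_gb: float, vram_gb: float, ram_gb: float,
                   mmap: bool = True) -> str:
    """Place weights + KV cache in VRAM first, then RAM, then SSD (if mmap)."""
    if footprint_gb <= vram_gb:
        return "ALL ON GPU"
    spill_gb = footprint_gb - vram_gb          # what did not fit in VRAM
    if spill_gb <= ram_gb:
        return "CPU OFFLOAD"
    # RAM is too small as well: only mmap lets the rest live on SSD.
    return "SSD SPILL" if mmap else "WONT LOAD"


def vllm_verdict(weights_gb: float, kv_per_request_gb: float, vram_gb: float,
                 tight_below: int = 8) -> str:
    """Classify by how many requests' KV caches fit in the leftover VRAM.
    tight_below is an assumed cutoff, not a value taken from the sim."""
    pool_gb = vram_gb - weights_gb             # VRAM left for the KV pool
    fits = int(pool_gb // kv_per_request_gb) if pool_gb > 0 else 0
    if fits < 1:
        return "WONT LOAD"
    if fits == 1:
        return "KV STARVED"
    if fits < tight_below:
        return "TIGHT"
    return "HEALTHY"


print(ollama_verdict(20, vram_gb=8, ram_gb=32))
print(vllm_verdict(16, kv_per_request_gb=1, vram_gb=20))
```

A fuller model would also reserve VRAM for activations and runtime overhead before sizing the KV pool, and would divide the weights across GPUs according to the TP degree.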

How it works under the hood

The sim is a single HTML file with no build step and no server. State is six dropdowns plus two backend-specific extras; on any change, the sim recomputes the memory layout and throughput from first principles and redraws the canvas. All numbers are anchored to vendor-published bandwidth specs.
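
To illustrate what a first-principles throughput estimate from bandwidth could look like, here is a minimal Python sketch of a bandwidth-bound decode model. It assumes each generated token requires reading every resident weight byte once from whichever tier holds it. The function name is hypothetical and the bandwidth figures are round placeholders, not vendor specs or values taken from the sim.

```python
def decode_tok_per_s(tiers: list[tuple[float, float]]) -> float:
    """Upper-bound decode rate for a bandwidth-bound model.

    tiers: (gigabytes_resident, bandwidth_gb_per_s) for each memory tier.
    Time per token is the sum of the per-tier read times.
    """
    seconds_per_token = sum(gb / bw for gb, bw in tiers if gb > 0)
    return 1.0 / seconds_per_token


# Placeholder bandwidths in GB/s (assumptions for illustration only).
GPU_BW, RAM_BW, SSD_BW = 1000.0, 80.0, 5.0

all_on_gpu = decode_tok_per_s([(16, GPU_BW)])
cpu_offload = decode_tok_per_s([(16, GPU_BW), (8, RAM_BW)])
ssd_spill = decode_tok_per_s([(16, GPU_BW), (8, RAM_BW), (4, SSD_BW)])

print(f"all on GPU:  {all_on_gpu:.1f} tok/s")
print(f"CPU offload: {cpu_offload:.1f} tok/s")
print(f"SSD spill:   {ssd_spill:.1f} tok/s")
```

Because the slowest tier dominates the sum, even a small spill to RAM or SSD collapses the rate, which matches the ordering of the Ollama verdicts. The sketch ignores compute time, KV cache reads, and the fact that an MoE model reads only its active experts per token.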
