LLM model scaling — VRAM · RAM · SSD spillover (Ollama vs vLLM)

2026-05-19 · updated 2026-05-19

A single-file in-browser visualiser of how a large-language model actually sits in memory and how fast it serves under real hardware constraints.

Pick a model (3B → 671B, including MoE), a quantization (FP16 / Q8 / Q5 / Q4 / Q3 / Q2), a context length (2K → 128K), a GPU (4060 → MI300X) and count, system RAM, and SSD speed. The sim does the arithmetic live:

Memory layout — weights + KV cache + activations + framework overhead stacked across VRAM → system RAM → SSD (via mmap).
Throughput estimate — bandwidth-bound decode for Ollama, batched PagedAttention for vLLM, anchored to real hardware specs.
Verdict pill — ALL ON GPU / CPU OFFLOAD / SSD SPILL / WONT LOAD / TIGHT / HEALTHY.

Open fullscreen ↗

Two tabs

Ollama — `llama.cpp` under the hood

Reads GGUF, optionally mmap’s the file from disk so weights are faulted into RAM on demand.
Splits layers across GPU and CPU via n_gpu_layers. The sim binary- searches the largest layer fraction that still fits in VRAM.
Single-stream — batch is always 1.
Decode throughput is bandwidth-bound at effective_BW / weight_bytes. With weights split across tiers, effective BW is the harmonic mean weighted by fraction. Spilling to RAM costs ≈ 20× over pure GPU; spilling to SSD costs ≈ 200× more. The sim shows this directly.

vLLM — production inference engine

Weights never spill to CPU/SSD. If they don’t fit per-GPU after the tensor-parallel shard, the model won’t load — the sim shows exactly that as a WONT LOAD verdict and tells you to bump TP or drop the quant.
The remaining VRAM after weights + overhead + activations is the PagedAttention KV pool. Max batch = pool / (kv_per_token × ctx).
Aggregate throughput scales near-linearly with batch up to about the pool capacity, then saturates. The sim renders the saturation curve with a marker for the currently-selected concurrent-request count.

What’s “precise” about it

Memory math. weights = params × bytes_per_param. K-quant block overhead baked into bytes_per_param (Q4_K_M ≈ 0.52 B/p, not the naïve 0.5).
KV cache per token = 2 × n_layers × n_kv_heads × head_dim × dtype_bytes. For Llama-3-8B → 128 KB/token; 70B → 320 KB/token; 405B → 504 KB/token; DeepSeek-V3 → 4 MB/token (huge — that’s why it’s such a different beast).
MoE — params (total) determines the on-disk weights footprint; active_params (per token) determines decode bandwidth cost. The sim models this directly: Mixtral 8×22B writes 141 GB of weights but reads only ~39 GB of active params per token, so it’s much faster than a dense 141 B model.
GPU bandwidth anchored to vendor specs: RTX 4060 272 GB/s · 4070 504 · 4080 717 · 3090 936 · 4090 1008 · 5090 1792 · A100-40 1555 · 6000-Ada 960 · A100-80 2039 · H100 SXM 3350 · H200 4800 · B100 8000 · MI300X 5300.
System RAM modeled as DDR5-5600 ≈ 90 GB/s.
SSD options: SATA 0.55 GB/s · NVMe Gen3 3.5 · Gen4 7 · Gen5 14.
Tensor parallelism shards weights N-ways across the TP group; effective bandwidth scales by tp_factor and per-GPU weight share drops accordingly. The sim assumes gpus = TP for simplicity (no pipeline parallelism — that’s a separate axis).

Controls

Knob	What it does
Backend tab	Switch between Ollama and vLLM. The sim swaps the layout + verdict logic.
Model	Pick the model. Dense models 3B…405B, plus three MoE families (Mixtral 8×7B / 8×22B, DeepSeek-V3).
Quant	Quantization format. Drops the per-param byte cost from 2 (FP16) to 0.27 (Q2_K).
Ctx	Context window in tokens. KV cache size scales linearly with this.
GPU	One of 13 GPUs from RTX 4060 (8 GB) to MI300X (192 GB).
× count	How many of that GPU you have (used as TP factor on the vLLM tab).
RAM	System RAM available for Ollama’s CPU layers.
SSD	Storage backing `mmap`’d weights or initial loads.
mmap (Ollama)	Whether to fault weights from SSD on demand. Off → if RAM is too small, model won’t load at all.
TP (vLLM)	Tensor-parallel degree across the chosen GPUs.
concurrent (vLLM)	How many in-flight requests vLLM is serving right now.

How to read the verdict

Pill	Means
`ALL ON GPU` (Ollama)	Weights + KV cache fit in VRAM. Decode is at the GPU’s full memory bandwidth.
`CPU OFFLOAD` (Ollama)	Some weights spilled to system RAM. Decode now bandwidth-bound by DDR5; expect 1–10 tok/s.
`SSD SPILL` (Ollama)	Even RAM was too small; remaining weights mmap’d from SSD. Decode ≪ 1 tok/s — usually unusable.
`WONT LOAD`	The (weights + minimum KV) footprint doesn’t fit. The sim explicitly tells you what to change (drop quant, raise TP, switch GPU).
`KV STARVED` (vLLM)	Weights fit, but almost no VRAM left for a KV pool. Batch ≤ 1.
`TIGHT` (vLLM)	KV pool fits a handful of concurrent requests; lower ctx or quant for more.
`HEALTHY` (vLLM)	Big KV pool, throughput scales with batch up to a healthy saturation point.

How it works under the hood

Single HTML file. No build, no server. State is six dropdowns + two backend-specific extras; on any change it recomputes the memory layout and throughput from first principles and redraws the canvas. All numbers anchored to vendor-published bandwidth specs.

← systems