LLM model scaling — VRAM · RAM · SSD spillover (Ollama vs vLLM)
A single-file in-browser visualiser of how a large-language model actually sits in memory and how fast it serves under real hardware constraints.
Pick a model (3B → 671B, including MoE), a quantization (FP16 / Q8 / Q5 / Q4 / Q3 / Q2), a context length (2K → 128K), a GPU (4060 → MI300X) and count, system RAM, and SSD speed. The sim does the arithmetic live:
- Memory layout — weights + KV cache + activations + framework
overhead stacked across VRAM → system RAM → SSD (via
mmap). - Throughput estimate — bandwidth-bound decode for Ollama, batched PagedAttention for vLLM, anchored to real hardware specs.
- Verdict pill —
ALL ON GPU/CPU OFFLOAD/SSD SPILL/WONT LOAD/TIGHT/HEALTHY.
Two tabs
Ollama — llama.cpp under the hood
- Reads GGUF, optionally
mmap’s the file from disk so weights are faulted into RAM on demand. - Splits layers across GPU and CPU via
n_gpu_layers. The sim binary- searches the largest layer fraction that still fits in VRAM. - Single-stream — batch is always 1.
- Decode throughput is bandwidth-bound at
effective_BW / weight_bytes. With weights split across tiers, effective BW is the harmonic mean weighted by fraction. Spilling to RAM costs ≈ 20× over pure GPU; spilling to SSD costs ≈ 200× more. The sim shows this directly.
vLLM — production inference engine
- Weights never spill to CPU/SSD. If they don’t fit per-GPU after the
tensor-parallel shard, the model won’t load — the sim shows
exactly that as a
WONT LOADverdict and tells you to bump TP or drop the quant. - The remaining VRAM after weights + overhead + activations is the
PagedAttention KV pool. Max batch =
pool / (kv_per_token × ctx). - Aggregate throughput scales near-linearly with batch up to about the pool capacity, then saturates. The sim renders the saturation curve with a marker for the currently-selected concurrent-request count.
What’s “precise” about it
- Memory math.
weights = params × bytes_per_param. K-quant block overhead baked intobytes_per_param(Q4_K_M ≈ 0.52 B/p, not the naïve 0.5). - KV cache per token =
2 × n_layers × n_kv_heads × head_dim × dtype_bytes. For Llama-3-8B → 128 KB/token; 70B → 320 KB/token; 405B → 504 KB/token; DeepSeek-V3 → 4 MB/token (huge — that’s why it’s such a different beast). - MoE —
params(total) determines the on-disk weights footprint;active_params(per token) determines decode bandwidth cost. The sim models this directly: Mixtral 8×22B writes 141 GB of weights but reads only ~39 GB of active params per token, so it’s much faster than a dense 141 B model. - GPU bandwidth anchored to vendor specs: RTX 4060 272 GB/s · 4070 504 · 4080 717 · 3090 936 · 4090 1008 · 5090 1792 · A100-40 1555 · 6000-Ada 960 · A100-80 2039 · H100 SXM 3350 · H200 4800 · B100 8000 · MI300X 5300.
- System RAM modeled as DDR5-5600 ≈ 90 GB/s.
- SSD options: SATA 0.55 GB/s · NVMe Gen3 3.5 · Gen4 7 · Gen5 14.
- Tensor parallelism shards weights N-ways across the TP group;
effective bandwidth scales by
tp_factorand per-GPU weight share drops accordingly. The sim assumesgpus = TPfor simplicity (no pipeline parallelism — that’s a separate axis).
Controls
| Knob | What it does |
|---|---|
| Backend tab | Switch between Ollama and vLLM. The sim swaps the layout + verdict logic. |
| Model | Pick the model. Dense models 3B…405B, plus three MoE families (Mixtral 8×7B / 8×22B, DeepSeek-V3). |
| Quant | Quantization format. Drops the per-param byte cost from 2 (FP16) to 0.27 (Q2_K). |
| Ctx | Context window in tokens. KV cache size scales linearly with this. |
| GPU | One of 13 GPUs from RTX 4060 (8 GB) to MI300X (192 GB). |
| × count | How many of that GPU you have (used as TP factor on the vLLM tab). |
| RAM | System RAM available for Ollama’s CPU layers. |
| SSD | Storage backing mmap’d weights or initial loads. |
| mmap (Ollama) | Whether to fault weights from SSD on demand. Off → if RAM is too small, model won’t load at all. |
| TP (vLLM) | Tensor-parallel degree across the chosen GPUs. |
| concurrent (vLLM) | How many in-flight requests vLLM is serving right now. |
How to read the verdict
| Pill | Means |
|---|---|
ALL ON GPU (Ollama) | Weights + KV cache fit in VRAM. Decode is at the GPU’s full memory bandwidth. |
CPU OFFLOAD (Ollama) | Some weights spilled to system RAM. Decode now bandwidth-bound by DDR5; expect 1–10 tok/s. |
SSD SPILL (Ollama) | Even RAM was too small; remaining weights mmap’d from SSD. Decode ≪ 1 tok/s — usually unusable. |
WONT LOAD | The (weights + minimum KV) footprint doesn’t fit. The sim explicitly tells you what to change (drop quant, raise TP, switch GPU). |
KV STARVED (vLLM) | Weights fit, but almost no VRAM left for a KV pool. Batch ≤ 1. |
TIGHT (vLLM) | KV pool fits a handful of concurrent requests; lower ctx or quant for more. |
HEALTHY (vLLM) | Big KV pool, throughput scales with batch up to a healthy saturation point. |
How it works under the hood
Single HTML file. No build, no server. State is six dropdowns + two backend-specific extras; on any change it recomputes the memory layout and throughput from first principles and redraws the canvas. All numbers anchored to vendor-published bandwidth specs.