Robert Zieliński
LLM model scaling: real-time decode flow · VRAM ↔ RAM ↔ SSD
Ollama
vLLM
Setup
Model
Phi-3 mini (3.8B)
Llama 3.2 (3B)
Llama 3.1 (8B)
Mistral (7B)
Llama 2 (13B)
Qwen 2.5 (32B)
Qwen 3 (32B)
Llama 3.1 (70B)
Llama 3.1 (405B)
Qwen 3 30B-A3B (30B / 3.3B active)
Mixtral 8×7B (47B / 12.9B active)
Qwen 3 235B-A22B (235B / 22B active)
Mixtral 8×22B (141B / 39B active)
DeepSeek-V3 (671B / 37B active)
Quant
FP16/BF16 (2.0 B/p)
Q8_0 (1.0 B/p)
Q5_K_M (0.63 B/p)
Q4_K_M (0.50 B/p)
Q3_K_M (0.38 B/p)
Q2_K (0.25 B/p)
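
The bytes-per-parameter (B/p) figures in the Quant list translate directly into a weight footprint: parameters times B/p. The sketch below is a back-of-envelope illustration, not the simulator's own model. The function names `weight_gb` and `fits_in_vram` are hypothetical, 1 GB is taken as 1e9 bytes, and the KV cache, activations, and runtime overhead are deliberately ignored.

```python
# Bytes per parameter, copied from the Quant list above.
BYTES_PER_PARAM = {
    "FP16/BF16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.63,
    "Q4_K_M": 0.50,
    "Q3_K_M": 0.38,
    "Q2_K": 0.25,
}


def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB (assumes 1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[quant]


def fits_in_vram(params_billions: float, quant: str,
                 vram_gb: float, gpus: int = 1) -> bool:
    """True if the weights alone fit in the pooled VRAM of `gpus` cards.

    Assumption: weights split evenly across GPUs; KV cache and
    activation memory are not counted, so this is optimistic.
    """
    return weight_gb(params_billions, quant) <= vram_gb * gpus


if __name__ == "__main__":
    # Llama 3.1 (70B) at Q4_K_M on one or two 24 GB cards.
    print(weight_gb(70, "Q4_K_M"))
    print(fits_in_vram(70, "Q4_K_M", vram_gb=24, gpus=1))
    print(fits_in_vram(70, "Q4_K_M", vram_gb=24, gpus=2))
```

For mixture-of-experts entries, the total parameter count (not the active count) is what has to be stored, so the first number in each "total / active" pair is the one to pass in.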
Context
2K
4K
8K
16K
32K
64K
128K
GPU
RTX 4060 (8 GB)
RTX 4070 (12 GB)
RTX 4080 (16 GB)
RTX 3090 (24 GB)
RTX 4090 (24 GB)
RTX 5090 (32 GB)
A100 (40 GB)
RTX 6000 Ada (48 GB)
A100 (80 GB)
H100 (80 GB)
H200 (141 GB)
B100 (192 GB)
MI300X (192 GB)
GPUs
1
2
4
8
System RAM
16 GB
32 GB
64 GB
128 GB
256 GB
512 GB
1 TB
SSD
SATA SSD (0.5 GB/s)
NVMe Gen3 (3.5 GB/s)
NVMe Gen4 (7 GB/s)
NVMe Gen5 (14 GB/s)
mmap from SSD
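
When weights are memory-mapped from SSD, the drive's sequential bandwidth puts a ceiling on decode speed. The sketch below is a rough upper bound under stated assumptions, not the simulator's actual logic: it assumes every decoded token must re-read all SSD-resident weight bytes once, with no page-cache reuse and perfectly sequential reads. The function name `max_tokens_per_s` is hypothetical; the bandwidth figures are copied from the SSD list above.

```python
# Sequential read bandwidth in GB/s, copied from the SSD list above.
SSD_GB_PER_S = {
    "SATA SSD": 0.5,
    "NVMe Gen3": 3.5,
    "NVMe Gen4": 7.0,
    "NVMe Gen5": 14.0,
}


def max_tokens_per_s(ssd_resident_gb: float, ssd: str) -> float:
    """Upper bound on decode rate when `ssd_resident_gb` of weights
    must be streamed from disk for every token.

    Assumption: no caching in RAM and ideal sequential throughput,
    so real rates will be lower.
    """
    if ssd_resident_gb <= 0:
        return float("inf")  # nothing spills to SSD, so no SSD limit
    return SSD_GB_PER_S[ssd] / ssd_resident_gb


if __name__ == "__main__":
    # 35 GB of weights streamed per token from each drive type.
    for name in SSD_GB_PER_S:
        print(name, round(max_tokens_per_s(35.0, name), 3))
```

For a mixture-of-experts model, only the active experts' bytes are touched per token, so passing the active-parameter footprint instead of the total gives a less pessimistic bound.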
Tensor parallel
1
2
4
8
Concurrent users
16
Max lanes (--max-num-seqs)
64