Robert Zieliński / Simulators / Computer Science / Systems

LLM model scaling — real-time decode flow · VRAM ↔ RAM ↔ SSD

Setup

Controls: Model · Quant · Context · GPU · GPUs · System RAM · SSD · mmap from SSD · Tensor parallel
Concurrent users: 16
Max lanes (--max-num-seqs): 64
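
One way to read the relationship between "Concurrent users" and "Max lanes" is as an admission cap: at most `max_num_seqs` requests are decoded together per step, and the rest wait. The sketch below illustrates that reading only. The `admit` helper and the request names are hypothetical, and it does not reproduce vLLM's actual scheduler, which also weighs memory and token budgets.

```python
from collections import deque

def admit(waiting, running, max_num_seqs):
    """Move requests from the waiting queue into running lanes until the cap is hit."""
    while waiting and len(running) < max_num_seqs:
        running.append(waiting.popleft())
    return running, waiting

# Values shown in the Setup panel: 16 users, 64 lanes.
waiting = deque(f"user-{i}" for i in range(16))
running, waiting = admit(waiting, [], max_num_seqs=64)
print(len(running), len(waiting))  # prints: 16 0

# Hypothetical tighter cap: only 8 lanes for the same 16 users.
waiting8 = deque(f"user-{i}" for i in range(16))
running8, waiting8 = admit(waiting8, [], max_num_seqs=8)
print(len(running8), len(waiting8))  # prints: 8 8
```

With the panel's values every user gets a lane, so queueing would only appear once concurrent users exceed the lane cap.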

Real-time decode flow. Each tick is one layer step (Ollama) or one batched decode step (vLLM). Particles show weight reads and token emission.

Legend: VRAM layers · RAM layers · SSD-mmap layers · output token · user request · KV cache
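
The VRAM ↔ RAM ↔ SSD split can be sketched as a greedy placement: layers go to the fastest tier that still has room, and the weight-read time per token is the sum of each layer's size divided by its tier's bandwidth. This is a rough sketch under stated assumptions, not the simulator's actual model. The `Tier` class, the helper names, and every capacity and bandwidth figure are illustrative placeholders, and it ignores the KV cache, compute time, and overlap between transfers.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: float    # space assumed available for weights
    bandwidth_gbs: float  # assumed sustained read bandwidth, GB/s

def place_layers(n_layers, layer_gb, tiers):
    """Greedy placement: each layer goes to the first (fastest) tier with room."""
    free = [t.capacity_gb for t in tiers]
    placement = []
    for _ in range(n_layers):
        for i, tier in enumerate(tiers):
            if free[i] >= layer_gb:
                free[i] -= layer_gb
                placement.append(tier.name)
                break
        else:
            raise ValueError("model does not fit in the given tiers")
    return placement

def token_read_seconds(placement, layer_gb, tiers):
    """Time to read every layer's weights once, i.e. one decoded token."""
    bandwidth = {t.name: t.bandwidth_gbs for t in tiers}
    return sum(layer_gb / bandwidth[name] for name in placement)

# Illustrative numbers only: 80 layers of 0.5 GB each (a 40 GB model).
tiers = [
    Tier("VRAM", capacity_gb=24.0, bandwidth_gbs=1000.0),
    Tier("RAM", capacity_gb=32.0, bandwidth_gbs=50.0),
    Tier("SSD-mmap", capacity_gb=float("inf"), bandwidth_gbs=5.0),
]
placement = place_layers(80, 0.5, tiers)
counts = {t.name: placement.count(t.name) for t in tiers}
seconds = token_read_seconds(placement, 0.5, tiers)
print(counts)             # {'VRAM': 48, 'RAM': 32, 'SSD-mmap': 0}
print(round(seconds, 3))  # 0.344
```

In this example the 32 layers in RAM account for 0.32 s of the 0.344 s total, which shows how the slowest tier in use dominates per-token read time.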