GPU SIMT execution — warps, latency hiding, divergence & coalescing

2026-05-27 · updated 2026-05-27

A pure-JS, single-file visualizer of how a GPU actually executes code. You write a small Python kernel — or pick a ready scenario — and the simulator runs it the way a GPU does: a Streaming Multiprocessor (SM) issues 32-thread warps in lockstep, hides memory latency by swapping warps, splits on divergent branches, and coalesces (or scatters) memory accesses. Every number and animation comes from a real per-lane interpreter executing your kernel, not a scripted loop.

The device model is anchored on the NVIDIA GeForce RTX 4090 (Ada Lovelace, AD102, compute capability 8.9): 128 SMs, 16,384 FP32 CUDA cores, warp = 32 threads, 48 resident warps / 1,536 threads / 24 blocks per SM, 4 warp schedulers per SM, a 256 KB register file per SM, 128 KB combined L1/shared, 72 MB L2, 24 GB of GDDR6X at ~1,008 GB/s, and a peak of ~82.6 TFLOPS FP32.
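
Several of these figures follow from one another, which makes for a quick sanity check. The sketch below recomputes them. The 128 FP32 cores per SM, the 2.52 GHz boost clock, the 384-bit bus and the 21 Gbps per-pin data rate are not stated above; they are assumptions taken from NVIDIA's published RTX 4090 specifications.

```python
# Sanity-check the device-model figures against each other.
# ASSUMPTIONS (not stated in the text above): cores per SM, boost clock,
# memory bus width and per-pin data rate.
SMS = 128
CORES_PER_SM = 128          # assumed: FP32 lanes per Ada SM
BOOST_GHZ = 2.52            # assumed: boost clock
BUS_BITS = 384              # assumed: memory bus width
GBPS_PER_PIN = 21           # assumed: GDDR6X data rate per pin

WARP_SIZE = 32
MAX_WARPS_PER_SM = 48
REGISTER_FILE_BYTES = 256 * 1024

cuda_cores = SMS * CORES_PER_SM                     # total FP32 cores
max_threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE   # thread ceiling per SM
registers_per_sm = REGISTER_FILE_BYTES // 4         # count of 32-bit registers
peak_tflops = cuda_cores * 2 * BOOST_GHZ / 1000     # one FMA = 2 FLOP per cycle
bandwidth_gbs = BUS_BITS / 8 * GBPS_PER_PIN         # bytes per transfer x rate

print(cuda_cores, max_threads_per_sm, registers_per_sm)
print(round(peak_tflops, 1), "TFLOPS,", bandwidth_gbs, "GB/s")
```

The register count matters later: it is the budget that all resident threads on an SM have to share.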

How to drive it

Scenarios (toolbar dropdown) each spotlight one phenomenon: vector add (coalesced), strided gather (uncoalesced), branch divergence, shared-memory reduction, atomic histogram, and a high-intensity compute-bound kernel.
Kernel editor (⎘ Kernel, or press e): write your own kernel in a Python subset — if/elif/else, for i in range(...), while, def, lists & dicts, arithmetic, and CUDA intrinsics (cuda.grid(), cuda.shared.array(), cuda.syncthreads(), cuda.atomic.add()). Set the input arrays and the launch geometry (blocks × threads, registers/thread), then Compile & Launch. Every one of the 32 lanes in a warp runs your code for real.
⏱ hiding: ON/OFF is the toggle to play with first. Off forces a single resident warp; watch ALU utilisation collapse during every memory stall — then turn it back on and watch other warps fill the gap.
Views (v cycles them): SM/Scheduler · Lanes/Divergence · Memory/Coalescing · Hierarchy · CPU vs GPU. Hover anything (a warp slot, a lane, a memory transaction, a bank, a roofline point) for a contextual explanation.
Standard controls: space run/pause · s step · S step ×10 · r reset.
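
To give a feel for what a kernel in this style looks like, here is a minimal vector add written against cuda.grid(). The exact signature the editor expects is not specified here, so treat this as a sketch: the `cuda` object and the `launch` helper below are hypothetical stand-ins that run the kernel once per thread in plain Python, not the simulator's own API.

```python
from types import SimpleNamespace

# Global index of the thread currently being emulated.
_current = {"tid": 0}

# Hypothetical stand-in for the `cuda` namespace; only grid(1) is modelled.
cuda = SimpleNamespace(grid=lambda ndim: _current["tid"])

def vector_add(a, b, out):
    i = cuda.grid(1)          # one global index per thread
    if i < len(out):          # guard: the last block may overshoot the array
        out[i] = a[i] + b[i]

def launch(kernel, blocks, threads_per_block, *args):
    """Run the kernel once per thread, sequentially (no real parallelism)."""
    for tid in range(blocks * threads_per_block):
        _current["tid"] = tid
        kernel(*args)

a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
out = [0.0] * 100
launch(vector_add, 4, 32, a, b, out)   # 128 threads cover 100 elements
print(out[:4], out[-1])
```

The bounds check is the part worth noticing: 4 blocks × 32 threads launches 128 threads for 100 elements, so the last 28 threads must do nothing.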

The five things it teaches

1 — Latency hiding & occupancy. A GPU does not avoid memory latency the way a CPU does (big caches, out-of-order execution); it tolerates it. When the issuing warp hits a global load that takes hundreds of cycles, it is marked stalled and the scheduler immediately issues from another ready warp. The switch costs nothing because every resident warp’s registers stay on-chip the whole time. Occupancy = active warps ÷ 48, the per-SM maximum. With only a few warps resident, a stall exposes a bubble and utilisation craters; with enough warps, the SM stays busy. (More is not always better: the register file is shared, so past the sweet spot more resident warps leaves fewer registers per thread, and the compiler spills the excess to slower memory.)
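
The occupancy arithmetic and the effect of warp count on utilisation can be sketched in a few lines. This is a simplified model under stated assumptions, not the simulator's scheduler: it ignores shared-memory limits and register allocation granularity, and `utilisation` assumes every warp alternates a fixed burst of ALU work with a fixed-length stall.

```python
WARP_SIZE = 32
MAX_WARPS = 48                   # resident warps per SM
MAX_BLOCKS = 24                  # resident blocks per SM
REGISTERS = 256 * 1024 // 4      # 32-bit registers in a 256 KB register file

def resident_warps(threads_per_block, regs_per_thread):
    """Warps that fit on one SM, limited by warps, blocks and registers."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)   # ceiling division
    by_warps = MAX_WARPS // warps_per_block
    by_regs = REGISTERS // (regs_per_thread * warps_per_block * WARP_SIZE)
    blocks = min(by_warps, MAX_BLOCKS, by_regs)
    return blocks * warps_per_block

def occupancy(threads_per_block, regs_per_thread):
    return resident_warps(threads_per_block, regs_per_thread) / MAX_WARPS

def utilisation(n_warps, compute_cycles, stall_cycles):
    """Fraction of cycles the ALUs are busy when stalls overlap perfectly."""
    return min(1.0, n_warps * compute_cycles / (compute_cycles + stall_cycles))

for regs in (32, 64, 128):
    print(regs, "regs/thread ->", resident_warps(256, regs), "warps")
print("1 warp  :", round(utilisation(1, 10, 350), 3))
print("48 warps:", utilisation(48, 10, 350))
```

With 256-thread blocks, doubling the registers per thread from 32 to 64 drops the SM from 48 to 32 resident warps, which is the trade-off the parenthesis above describes.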

2 — Branch divergence. A warp executes one instruction across all 32 lanes. When a data-dependent if/else sends some lanes one way and the rest another, the hardware runs each path serially with the other lanes masked off, then the lanes reconverge. The Lanes view shows this directly: lanes split across source lines, the throughput meter halves, and the mask-history strip shows the serial passes resolving back to full rows. Divergence within a warp is the cost; divergence between warps is free.
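
The masking described above can be made concrete by grouping the lanes of one warp by the path they take. The helper below is an illustrative sketch (the name `active_masks` is made up for this example): each distinct path becomes one serial pass, with a 32-bit mask marking which lanes are live during it.

```python
WARP_SIZE = 32

def active_masks(lane_paths):
    """Group the lanes of one warp by the branch path each one takes.

    Returns {path: bitmask}; bit i is set when lane i runs that path.
    Each distinct path costs one serial pass over the warp.
    """
    masks = {}
    for lane, path in enumerate(lane_paths):
        masks[path] = masks.get(path, 0) | (1 << lane)
    return masks

# Divergent: even and odd lanes disagree, so the warp needs two passes.
divergent = active_masks(["then" if lane % 2 == 0 else "else"
                          for lane in range(WARP_SIZE)])

# Uniform: the condition depends only on the warp index, so all lanes agree.
warp_id = 3
uniform = active_masks(["then" if warp_id % 2 == 0 else "else"
                        for _ in range(WARP_SIZE)])

for path, mask in divergent.items():
    print(f"{path}: {mask:032b}")
print(len(divergent), "passes vs", len(uniform), "pass")
```

Branching on a value that is constant across the warp (here the warp index) keeps the mask full, which is why divergence between warps costs nothing.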

3 — Memory coalescing. When the 32 lanes of a warp touch contiguous, aligned addresses, the hardware packs them into a handful of 32-byte transactions (four for 32 lanes × 4-byte elements) and runs near peak bandwidth. Strided or scattered access fans out into many transactions, each mostly unused, and effective bandwidth collapses: for 4-byte elements, stride-2 gives 50% and large strides approach 1/8, since each lane then uses only 4 of the 32 bytes fetched for it. The Memory view draws the 32 lane requests, the transactions they generate, and the resulting bandwidth efficiency.
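
The efficiency figures can be reproduced by counting how many 32-byte segments a warp's addresses fall into. This is a minimal sketch assuming 4-byte elements and one access per lane; the function names are illustrative, and caching effects are ignored.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 32

def segments_touched(stride_elems, elem_bytes=4, base=0):
    """Number of distinct 32-byte segments one warp-wide access touches."""
    segments = set()
    for lane in range(WARP_SIZE):
        first = base + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        segments.add(first // SEGMENT_BYTES)
        segments.add(last // SEGMENT_BYTES)
    return len(segments)

def efficiency(stride_elems, elem_bytes=4, base=0):
    """Bytes the lanes asked for divided by bytes actually moved."""
    requested = WARP_SIZE * elem_bytes
    moved = segments_touched(stride_elems, elem_bytes, base) * SEGMENT_BYTES
    return requested / moved

for stride in (1, 2, 4, 8, 1000):
    print(f"stride {stride:4d}: {segments_touched(stride):2d} segments, "
          f"{efficiency(stride):.3f} efficiency")
print("stride 1, misaligned by 4 bytes:", efficiency(1, base=4))
```

Note the floor at 1/8: once every lane lands in its own segment, a larger stride cannot make things worse in this model. Misalignment alone costs one extra segment.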

4 — Shared-memory bank conflicts. On-chip shared memory has 32 banks (= the warp size), with successive 4-byte words mapped to successive banks. If the 32 lanes hit 32 different banks, the access takes one cycle. Lanes hitting different words in the same bank serialize (an N-way conflict costs N cycles); all lanes reading the same word broadcasts for free.
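
The conflict degree of an access pattern is the largest number of distinct words that land in any one bank. The sketch below assumes 4-byte words with word index modulo 32 selecting the bank; `conflict_degree` is an illustrative helper, not part of the simulator.

```python
WARP_SIZE = 32
BANKS = 32

def conflict_degree(word_indices):
    """Serial passes needed: max count of distinct words in a single bank."""
    words_per_bank = {}
    for word in word_indices:
        words_per_bank.setdefault(word % BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())

patterns = {
    "stride 1":  [lane for lane in range(WARP_SIZE)],
    "stride 2":  [lane * 2 for lane in range(WARP_SIZE)],
    "stride 32": [lane * 32 for lane in range(WARP_SIZE)],
    "stride 33": [lane * 33 for lane in range(WARP_SIZE)],  # padded row
    "broadcast": [7] * WARP_SIZE,
}
for name, words in patterns.items():
    print(f"{name:10s} -> {conflict_degree(words)}-way")
```

The stride-33 case is the usual padding trick: adding one word per row of a 32-wide tile shifts each row by one bank, so a column access spreads across all 32 banks again.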

5 — When the CPU actually wins. The capstone view places your kernel’s arithmetic intensity (FLOP per byte) on a roofline. The 4090’s ridge point sits at about 82 FLOP/byte (82.6 TFLOPS ÷ 1,008 GB/s), so plenty of “heavy” kernels are still memory-bound. GPUs win on regular, high-intensity, massively-parallel arithmetic (GEMM, convolutions, neural nets). CPUs win on pointer-chasing / irregular memory (graphs, trees, hash tables), branchy control flow, low-parallelism / serial work, and small, latency-bound jobs where there isn’t enough parallelism to fill warps or hide a single stall. The same memory-latency problem, two opposite answers: the CPU spends its transistors on caches and reordering to make one thread fast; the GPU spends them on ALUs and a huge register file to keep thousands of threads in flight and tolerate each being slow.
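
The roofline itself is a one-line formula: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses the device figures quoted earlier. The vector-add intensity is an assumption for illustration: float32 data, two loads and one store (12 bytes) per addition.

```python
PEAK_FLOPS = 82.6e12        # FP32 peak, FLOP/s
BANDWIDTH = 1008e9          # DRAM bandwidth, bytes/s

RIDGE = PEAK_FLOPS / BANDWIDTH   # intensity where the two limits meet

def attainable_flops(intensity):
    """Roofline bound for a kernel with the given FLOP-per-byte intensity."""
    return min(PEAK_FLOPS, intensity * BANDWIDTH)

def bound_by(intensity):
    return "compute" if intensity >= RIDGE else "memory"

# Assumed example: float32 vector add, 1 FLOP per 12 bytes moved.
vector_add_intensity = 1 / 12

print(f"ridge point: {RIDGE:.1f} FLOP/byte")
for name, ai in (("vector add", vector_add_intensity),
                 ("10 FLOP/byte", 10.0),
                 ("200 FLOP/byte", 200.0)):
    print(f"{name:14s} {attainable_flops(ai) / 1e12:7.3f} TFLOP/s "
          f"({bound_by(ai)}-bound)")
```

Under these assumptions a vector add can reach only about 0.1% of peak FLOPs however well it is written, because it is limited entirely by bandwidth.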

What this is (and isn’t)

It’s a teaching emulator, not a cycle-accurate model. Every lane really runs an interpreter over a Python subset, so you can see why irregular Python maps badly to SIMT, but a few things are compressed for legibility: memory latency is scaled down from the real ~300–400 cycles to a watchable handful, one SM is shown in detail while the other 127 are summarized, and the kernel language is a subset (it will tell you, with a line number, when you use something it doesn’t support). The execution-model facts (warp size, the 48-warp/1,536-thread occupancy ceiling, 4 schedulers issuing up to 4 warps per cycle, serial divergence with masking, coalescing into 32-byte segments, 32 shared-memory banks) are accurate to NVIDIA’s CUDA C++ Programming Guide and the Ada architecture documentation.

For the CPU side of the story in the same spirit, see the PicoRV32 live core visualizer, which walks a single RISC-V core through fetch → decode → execute one instruction at a time.
