GB10 Grace Blackwell — heterogeneous CPU + GPU task distribution


A pure-JS, single-file visualizer of the NVIDIA GB10 Grace Blackwell Superchip, the SoC inside DGX Spark (announced at CES 2025 as “Project DIGITS”). GB10 puts a 20-core Arm Grace CPU and a Blackwell GPU in one package, sharing one coherent 128 GB LPDDR5X memory pool over NVLink-C2C. You give it a workload of Python tasks, and a distributor routes each task to whichever engine runs it faster, then executes the tasks in place over the shared pool, with no host↔device copies.

It’s the companion to the RTX 4090 SIMT execution sim: that one shows how a GPU executes a kernel (warps, latency hiding, divergence, coalescing); this one zooms out to the whole superchip and asks which engine should run each task, and why. Both are driven by the same per-lane Python interpreter — here it’s used to measure each task’s traits (arithmetic intensity, coalescing, divergence) that decide the routing.

Verified GB10 / DGX Spark specs (NVIDIA + MediaTek + Arm sources): a Grace CPU of 20 Arm cores — 10× Cortex-X925 + 10× Cortex-A725 (Armv9.2-A, co-designed with MediaTek); a Blackwell GPU with 5th-gen Tensor Cores and FP4 (up to 1 PFLOP at FP4, with sparsity); 128 GB LPDDR5X unified, coherent memory at ~273 GB/s; connected by NVLink-C2C.


How to drive it

The lesson: one pool, route to the right engine

A discrete GPU lives across a PCIe bus with its own VRAM, so using it means copying data over and back. GB10 is different: the CPU and GPU share one coherent address space, so either engine can run a task in place, with no copy. That makes fine-grained heterogeneous scheduling actually pay off — you route each task to whichever engine suits it, and the data is already there.
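To make the copy cost concrete, here is a minimal sketch of the arithmetic. It assumes a nominal ~32 GB/s per direction for a PCIe 4.0 x16 link and, purely for comparison, the same kernel time on both systems. The function names and the example task are hypothetical and are not taken from the sim.

```python
# Sketch: time for a task on a discrete GPU (copy in, run, copy out)
# versus on a shared coherent pool (run in place).
# ASSUMPTIONS: ~32 GB/s nominal PCIe 4.0 x16 bandwidth; identical kernel time.

def discrete_time_s(bytes_in, bytes_out, kernel_s, pcie_gb_s=32.0):
    """Kernel time plus the host-to-device and device-to-host copies."""
    copy_s = (bytes_in + bytes_out) / (pcie_gb_s * 1e9)
    return copy_s + kernel_s


def unified_time_s(kernel_s):
    """On a shared coherent pool the task runs in place: no copy term."""
    return kernel_s


# Hypothetical task: 1 GiB of input, 1 GiB of output, a 5 ms kernel.
GIB = 1 << 30
discrete = discrete_time_s(GIB, GIB, 0.005)
unified = unified_time_s(0.005)

print(f"discrete GPU: {discrete * 1e3:.1f} ms, shared pool: {unified * 1e3:.1f} ms")
```

Under these assumptions the two copies dominate the discrete-GPU total, which is why short tasks are rarely worth offloading across a bus but can be on a shared pool.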

So which engine? The distributor measures each task and weighs two costs:

The GB10 twist that the sim makes concrete: because the ~273 GB/s of bandwidth is shared by both engines (and modest next to a discrete GPU’s ~1 TB/s), a memory-bound task hits the same bandwidth ceiling on either engine. The GPU’s only edge then is coalescing. For scattered or irregular access, the CPU’s caches beat the GPU’s wasted wide transactions, so the work goes to the CPU. That’s why, in the Mixed workload, vector-add and the dense compute loop land on the GPU while the irregular gather, the sparse scatter, and the tiny task land on the Grace cores.
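The routing rule can be sketched as a small relative cost model. This is not the sim's actual distributor: the compute peaks, the launch overhead, the cache hit rates, the task sizes, and the additive (no-overlap) time formula are all hypothetical, chosen only so the outcome matches the Mixed workload described above. The one figure taken from the text is the ~273 GB/s shared bandwidth.

```python
from dataclasses import dataclass

SHARED_BW = 273e9      # bytes/s: the ~273 GB/s pool both engines share (from the text)
CPU_FLOPS = 50e9       # HYPOTHETICAL relative figure, not a GB10 spec
GPU_FLOPS = 1000e9     # HYPOTHETICAL relative figure; the real dense peak is undisclosed
GPU_LAUNCH_S = 10e-6   # HYPOTHETICAL fixed cost of dispatching work to the GPU


@dataclass
class Task:
    name: str
    flops: float         # arithmetic operations in the task
    useful_bytes: float  # bytes the task actually needs
    coalescing: float    # 0..1: useful share of each GPU memory transaction
    cache_hit: float     # 0..1: hypothetical share of CPU accesses served from cache


def cpu_time(t: Task) -> float:
    # Cache hits avoid DRAM traffic (line-granularity waste on misses is ignored).
    dram_bytes = t.useful_bytes * (1.0 - t.cache_hit)
    return t.flops / CPU_FLOPS + dram_bytes / SHARED_BW


def gpu_time(t: Task) -> float:
    # Poor coalescing inflates the bytes moved over the same shared bandwidth.
    dram_bytes = t.useful_bytes / t.coalescing
    return GPU_LAUNCH_S + t.flops / GPU_FLOPS + dram_bytes / SHARED_BW


def route(t: Task) -> str:
    return "GPU" if gpu_time(t) < cpu_time(t) else "CPU"


N = 1 << 24  # hypothetical element count
MIXED = [
    Task("vector-add", N, 12 * N, 1.0, 0.0),
    Task("dense compute loop", 2e10, 1e8, 1.0, 0.0),
    Task("irregular gather", N, 8 * N, 0.125, 0.5),
    Task("sparse scatter", N, 8 * N, 0.125, 0.3),
    Task("tiny task", 256, 3072, 1.0, 0.0),
]

ROUTES = {t.name: route(t) for t in MIXED}
for name, engine in ROUTES.items():
    print(f"{name:20s} -> {engine}")
```

In this sketch the memory term is identical on both engines when access is fully coalesced and uncached, so the GPU wins vector-add on its compute term alone. Once coalescing drops to 1/8, the GPU moves eight times the bytes over the same bandwidth and the CPU wins.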

What this is (and isn’t)

A teaching model, not a cycle-accurate simulator. Task traits (FLOP/byte, coalescing efficiency, lane activity) are measured by really running a sample warp of each kernel through the interpreter; the CPU-vs-GPU times are a relative model (clearly labelled), tuned so the routing matches the real qualitative rules. Three GB10 figures NVIDIA never disclosed are labelled as such in the UI and never asserted: the Blackwell GPU’s CUDA-core count, the NVLink-C2C bandwidth, and the GPU’s dense FP32 peak. The headline 1 PFLOP is FP4 with sparsity, not a dense FP32 number.
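As an illustration of what measuring these traits on a sample warp can look like, here is a minimal sketch. It is not the sim's interpreter: the function names are hypothetical, the 32-byte transaction granularity is an assumption of this sketch, and elements are assumed to be aligned and non-overlapping.

```python
WARP_SIZE = 32     # lanes per warp on NVIDIA GPUs
SECTOR_BYTES = 32  # ASSUMED memory transaction granularity for this sketch


def arithmetic_intensity(flops, bytes_moved):
    """FLOP per byte: high values are compute-bound, low values memory-bound."""
    return flops / bytes_moved


def coalescing_efficiency(addresses, elem_bytes=4):
    """Unique useful bytes divided by bytes moved for one warp-wide access."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + elem_bytes - 1) // SECTOR_BYTES  # element may straddle sectors
        sectors.update(range(first, last + 1))
    useful = len(set(addresses)) * elem_bytes
    return useful / (len(sectors) * SECTOR_BYTES)


def lane_activity(active_mask):
    """Fraction of lanes doing useful work; below 1.0 indicates divergence."""
    return sum(active_mask) / len(active_mask)


# Contiguous 4-byte loads: every byte of every sector is used.
contiguous = coalescing_efficiency([4 * lane for lane in range(WARP_SIZE)])
# Stride of 128 bytes: each lane touches its own sector and uses 4 of its 32 bytes.
strided = coalescing_efficiency([128 * lane for lane in range(WARP_SIZE)])
# A branch taken only by even lanes leaves half the warp idle.
half_active = lane_activity([lane % 2 == 0 for lane in range(WARP_SIZE)])
# Vector-add on 4-byte floats: 1 FLOP per 12 bytes (two loads, one store).
vadd_intensity = arithmetic_intensity(1, 12)

print(contiguous, strided, half_active, round(vadd_intensity, 4))
```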

For the microarchitectural half — how a single kernel actually executes on the GPU, warp by warp — see the RTX 4090 SIMT execution sim.
