INT4 vs NVFP4

2026-05-07

Two ways to put a 4-bit weight in a register. They look similar on paper and behave very differently on silicon.

Differences at a glance

	INT4	NVFP4
Number system	uniform integers, 16 evenly-spaced levels	E2M1 float (1 sign · 2 exp · 1 mantissa) — non-uniform, denser near zero
Range	[−8, 7] (asymmetric, 16 codes)	[−6, 6] (15 distinct values, 16 codes incl. ±0)
Block scale	per-channel or block-32, FP16/FP32	block-16, FP8 E4M3
Outer scale	—	one FP32 per tensor
Outlier behaviour	poor; needs Hadamard rotation or GPTQ-style error compensation	absorbed by float dynamic range
Hardware path on Blackwell	emulated (weights dequantized to a wider type before the GEMM)	dedicated Tensor Core instruction
Throughput (Blackwell)	1.0×	~2.3×
Accuracy vs FP16 baseline	within ~1% with AWQ/AutoRound	within ~1%; sometimes ahead of INT4, sometimes slightly behind on language tasks
Weight memory	smallest	slightly larger (~+7 GB on Llama 3.3 70B due to finer scales)

The number grid is the whole story

INT4 spaces 16 integer levels evenly across [−8, 7]. NVFP4 spends its 16 codes on a floating-point grid — 15 distinct values (±0 share a value) clustered near zero, sparser near the tails. Weight distributions in transformers are heavy near zero with rare outliers, so NVFP4’s grid hits the data’s shape; INT4’s wastes precision in regions weights almost never visit.

Same 16 codes, very different placement. NVFP4 trades evenness for resolution where transformer weights actually live.
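
To make the placement concrete, here is a small sketch that enumerates both grids and the spacing between adjacent levels. The decoder assumes the usual E2M1 layout (exponent bias 1, one subnormal step at 0.5); the function names are illustrative and not taken from any library.

```python
def int4_levels():
    """All 16 signed 4-bit integer codes, evenly spaced."""
    return list(range(-8, 8))


def e2m1_decode(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        return sign * 0.5 * man
    # Normal: implicit leading 1, exponent bias of 1 (assumed E2M1 convention).
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


def e2m1_levels():
    """Distinct E2M1 values; +0 and -0 collapse into one entry."""
    return sorted({e2m1_decode(c) for c in range(16)})


def gaps(levels):
    """Spacing between each pair of adjacent levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


if __name__ == "__main__":
    print("INT4 levels:", int4_levels())
    print("INT4 gaps:  ", gaps(int4_levels()))
    print("E2M1 levels:", e2m1_levels())
    print("E2M1 gaps:  ", gaps(e2m1_levels()))
```

Relative to the largest magnitude, the E2M1 step near zero is 0.5/6 (about 0.08) and the step at the tail is 2/6 (about 0.33), while INT4 keeps a constant step of 1/7 (about 0.14) across the whole range.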

Two-level scaling, smaller blocks

INT4 schemes typically attach one scale per channel (or one per 32-element block) in FP16/FP32. NVFP4 splits that job in two: a fine-grained FP8 (E4M3) scale per 16-element micro-block tracks local dynamic range, and a single FP32 per-tensor scale keeps the block scales inside FP8's representable range. The smaller blocks are what close the accuracy gap with INT4. They also mean more scales per weight than a per-channel layout, which is why NVFP4 weights are slightly bigger on disk even though each scale is only one byte.
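
The two-level scheme can be sketched in plain Python. This is a simulation under stated assumptions, not a real kernel: `round_to_e4m3`, `quantize_nvfp4` and `dequantize_nvfp4` are hypothetical helpers, the per-tensor scale is chosen as amax / (6 × 448) so that the largest block scale lands on the E4M3 maximum (one common choice, assumed here), and codes are kept as decoded floats rather than packed nibbles.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
E2M1_MAX = 6.0
E4M3_MAX = 448.0


def round_to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value (simulated)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, e = math.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
    e = max(e - 1, -6)              # clamp to the minimum normal exponent
    step = 2.0 ** (e - 3)           # 3 mantissa bits: 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)


def quantize_nvfp4(weights, block=16):
    """Return (decoded E2M1 codes, FP8 block scales, FP32 tensor scale)."""
    amax = max(abs(w) for w in weights)
    # Assumption: map the largest block scale onto the E4M3 maximum.
    tensor_scale = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for i in range(0, len(weights), block):
        blk = weights[i:i + block]
        bmax = max(abs(w) for w in blk)
        s = round_to_e4m3(bmax / E2M1_MAX / tensor_scale)
        block_scales.append(s)
        eff = s * tensor_scale      # effective scale for this block
        for w in blk:
            if eff == 0.0:
                codes.append(0.0)
                continue
            mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / eff - g))
            codes.append(math.copysign(mag, w))
    return codes, block_scales, tensor_scale


def dequantize_nvfp4(codes, block_scales, tensor_scale, block=16):
    return [c * block_scales[i // block] * tensor_scale
            for i, c in enumerate(codes)]


if __name__ == "__main__":
    w = [0.01 * ((i % 7) - 3) for i in range(32)]
    w[5] = 2.0                      # one outlier in the first block
    codes, scales, ts = quantize_nvfp4(w)
    deq = dequantize_nvfp4(codes, scales, ts)
    print("block scales:", scales)
    print("max error, block 0:", max(abs(a - b) for a, b in zip(w[:16], deq[:16])))
    print("max error, block 1:", max(abs(a - b) for a, b in zip(w[16:], deq[16:])))
```

In this toy input the outlier forces a large scale on its own block of 16, so the small weights next to it are flushed to zero, but the second block keeps its own scale and is reconstructed much more finely. That locality is the point of the small block size.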

NVFP4 spends a few extra bytes on scales to make the 4-bit codes do more work.
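
A back-of-envelope view of those extra bytes, as a sketch: effective bits per weight is the 4-bit code plus the scale bits amortised over the block. The group size of 128 and the nominal 70e9 parameter count are assumptions for illustration, and `bits_per_weight` is a hypothetical helper. Real checkpoints also carry unquantized layers, zero points and other metadata, so this idealised arithmetic is not expected to reproduce any measured checkpoint size.

```python
def bits_per_weight(code_bits, scale_bits, block_size):
    """Code bits plus scale bits amortised over one block."""
    return code_bits + scale_bits / block_size


def weight_gb(params, bpw):
    """Idealised weight footprint in GB (10**9 bytes)."""
    return params * bpw / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # nominal parameter count, assumed for illustration
    layouts = {
        "INT4, FP16 scale per 128 (assumed group size)": bits_per_weight(4, 16, 128),
        "INT4, FP16 scale per 32": bits_per_weight(4, 16, 32),
        "NVFP4, FP8 scale per 16": bits_per_weight(4, 8, 16),
    }
    for name, bpw in layouts.items():
        print(f"{name}: {bpw:.3f} bits/weight, {weight_gb(params, bpw):.1f} GB")
```

Under these assumptions NVFP4 costs 4.5 bits per weight, the same as INT4 with an FP16 scale per 32 elements, and more than INT4 with coarser groups or per-channel scales. The single FP32 per-tensor scale is negligible at this granularity.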

Why Blackwell makes NVFP4 win in practice

Blackwell’s tensor cores have a native NVFP4 instruction. INT4 GEMMs on the same hardware run a slower path. The accuracy gap is small in either direction depending on the model and recipe — but the throughput gap is structural.

Same accuracy band; ~2.3× the tokens per second. That's the trade Blackwell offers.

When to pick which

NVFP4: Blackwell (B100/B200/RTX PRO 6000), inference-heavy serving, where you want the throughput and can’t justify the engineering cost of a custom INT4 recipe.
INT4 (AWQ / AutoRound / GPTQ): Hopper or older, where you need the smallest possible weights and a mature toolchain, and every GB of VRAM matters.
Don’t mix: running INT4 weights through an NVFP4 kernel (or vice versa) loses the hardware advantage that makes either format interesting.

TL;DR: INT4 is a uniform integer grid with a single level of scaling (per channel or per block); NVFP4 is an E2M1 floating-point grid with two-level scaling (FP8 per 16 weights + FP32 per tensor). Accuracy lands in the same band. Throughput on Blackwell does not: NVFP4 is ~2.3× faster because the tensor core has a native instruction for it.
