INT4 vs NVFP4

Two ways to put a 4-bit weight in a register. They look similar on paper and behave very differently on silicon.

Differences at a glance

| | INT4 | NVFP4 |
|---|---|---|
| Number system | uniform integers, 16 evenly-spaced levels | E2M1 float (1 sign · 2 exp · 1 mantissa); non-uniform, denser near zero |
| Range | [−8, 7] (asymmetric, 16 codes) | [−6, 6] (15 distinct values, 16 codes incl. ±0) |
| Block scale | per-channel or block-32, FP16/FP32 | block-16, FP8 E4M3 |
| Outer scale | none | one FP32 per tensor |
| Outlier behaviour | poor; needs mitigation such as Hadamard rotation or GPTQ-style error compensation | absorbed by float dynamic range |
| Hardware path on Blackwell | emulated / general-purpose path (no dedicated instruction) | dedicated Tensor Core instruction |
| Throughput (Blackwell) | 1.0× | ~2.3× |
| Accuracy vs FP16 baseline | within ~1% with AWQ/AutoRound | within ~1%; sometimes ahead, sometimes slightly behind on language tasks |
| Weight memory | smallest | slightly larger (~+7 GB on Llama 3.3 70B due to finer scales) |
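
The weight-memory row can be sanity-checked with simple bits-per-weight arithmetic. The sketch below is only that arithmetic, under stated assumptions: the helper `bits_per_weight` and the row length of 8192 are hypothetical, zero-points and unquantized layers are not modelled, and the single FP32 per-tensor scale is ignored as negligible.

```python
def bits_per_weight(code_bits, scale_bits, block_size):
    """Average storage per weight: the 4-bit code plus its share of one block scale."""
    return code_bits + scale_bits / block_size

# NVFP4: 4-bit codes, one 8-bit (FP8 E4M3) scale per 16 weights.
nvfp4 = bits_per_weight(4, 8, 16)

# INT4 with one FP16 scale per 32-weight block.
int4_block32 = bits_per_weight(4, 16, 32)

# INT4 with one FP16 scale per output channel.
# 8192 is a hypothetical row length chosen for illustration.
int4_per_channel = bits_per_weight(4, 16, 8192)

for name, bpw in [
    ("NVFP4", nvfp4),
    ("INT4 block-32", int4_block32),
    ("INT4 per-channel", int4_per_channel),
]:
    print(f"{name}: {bpw:.3f} bits/weight")
```

Under these assumptions, NVFP4 costs roughly half a bit more per weight than per-channel INT4 and about the same as block-32 INT4 with FP16 scales. Real checkpoint sizes will differ, since they depend on which layers are quantized at all.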

The number grid is the whole story

INT4 spaces 16 integer levels evenly across [−8, 7]. NVFP4 spends its 16 codes on a floating-point grid: 15 distinct values (+0 and −0 share one) that cluster near zero and thin out toward the tails. Transformer weight distributions are concentrated near zero with rare outliers, so NVFP4's grid matches the shape of the data, while INT4 spends codes on regions that weights rarely visit.

[Figure: INT4 vs NVFP4 representable values on the number line. INT4: 16 uniform integer levels in [−8, 7]. NVFP4 (E2M1): 15 distinct values in [−6, 6], denser near zero; the positive grid is {0, 0.5, 1, 1.5, 2, 3, 4, 6}, mirrored for negatives (16 codes in total, with +0 and −0 collapsing to one value).]
Same 16 codes, very different placement. NVFP4 trades evenness for resolution where transformer weights actually live.
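
The two grids can be enumerated directly from the bit layouts. This is a minimal sketch, assuming the usual E2M1 convention (exponent bias 1, a subnormal at exponent 0, no infinities or NaNs); `decode_e2m1` and `gaps` are illustrative helper names, not part of any library.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit; bias 1)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * man
    else:
        # Normal: implicit leading 1 plus one fractional mantissa bit.
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude


def gaps(grid):
    """Spacing between neighbouring grid points."""
    return [b - a for a, b in zip(grid, grid[1:])]


int4_grid = list(range(-8, 8))                      # 16 evenly spaced integers
fp4_values = [decode_e2m1(c) for c in range(16)]    # all 16 E2M1 codes
fp4_positive = sorted(fp4_values[:8])               # codes 0..7 have the sign bit clear
fp4_distinct = sorted(set(fp4_values))              # -0.0 == 0.0, so the zeros collapse

print("INT4 grid:", int4_grid)
print("E2M1 positive grid:", fp4_positive)
print("E2M1 distinct values:", len(fp4_distinct))
print("INT4 gaps:", sorted(set(gaps(int4_grid))))
print("E2M1 positive gaps:", gaps(fp4_positive))
```

The gap lists make the placement difference concrete: INT4 has one spacing everywhere, while the E2M1 spacing grows with magnitude.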

Two-level scaling, smaller blocks

INT4 schemes typically attach one scale per channel (or one per 32-element block) in FP16/FP32. NVFP4 splits that job in two: a fine-grained FP8 (E4M3) scale per 16-element micro-block to track local dynamic range, plus a single FP32 per-tensor scale that keeps the FP8 block scales from overflowing. The smaller blocks are what keep NVFP4's accuracy level with well-tuned INT4. Storing one scale for every 16 weights is also why NVFP4 weights are slightly bigger on disk, even though each scale is only 8 bits.

[Figure: Scale hierarchies, INT4 per-channel vs NVFP4 two-level. INT4: one FP16 scale per output channel (shown over 32 INT4 weights); one scale covers a whole row/channel, so outliers stretch the scale and waste codes on small values. NVFP4: one FP8 scale for every 16 FP4 weights, plus one FP32 scale per tensor that keeps the FP8 block scales from overflowing; tighter blocks track local outliers, and the tensor-level FP32 scale catches the global range.]
NVFP4 spends a few extra bytes on scales to make the 4-bit codes do more work.
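
The two-level scheme can be sketched as a small simulated quantizer. This is an illustration under assumptions, not a reference implementation: choosing the tensor scale as amax / (6 × 448) is one plausible recipe, Python floats stand in for FP32, ties are broken toward the smaller grid value, and all function names are hypothetical. It compares NVFP4 against a single INT4 scale stretched over 32 weights that contain both small and large values.

```python
import math

E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes
FP4_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    _, e = math.frexp(x)              # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)              # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)           # 3 mantissa bits: 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)


def nearest_e2m1(x):
    """Snap x to the nearest E2M1 value (simplified tie-breaking)."""
    mag = min(E2M1_POS, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def quantize_nvfp4(weights, block=16):
    """Quantize then dequantize with an FP8 scale per block and one tensor scale."""
    amax = max(abs(w) for w in weights)
    # Assumed recipe: pick the tensor scale so the largest block scale fits in E4M3.
    tensor_scale = amax / (FP4_MAX * E4M3_MAX) if amax > 0 else 1.0
    out = []
    for i in range(0, len(weights), block):
        chunk = weights[i:i + block]
        block_amax = max(abs(w) for w in chunk)
        block_scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
        scale = block_scale * tensor_scale
        out.extend(nearest_e2m1(w / scale) * scale if scale > 0 else 0.0 for w in chunk)
    return out


def quantize_int4_single_scale(weights):
    """Quantize then dequantize with one symmetric INT4 scale for all weights."""
    scale = max(abs(w) for w in weights) / 7.0
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


# 16 small weights followed by 16 large ones (synthetic data for illustration).
weights = [0.01 * k for k in range(-8, 8)] + [0.5 * k for k in range(-8, 8)]
nv = quantize_nvfp4(weights)
i4 = quantize_int4_single_scale(weights)

err_nv_small = max(abs(a - b) for a, b in zip(weights[:16], nv[:16]))
err_i4_small = max(abs(a - b) for a, b in zip(weights[:16], i4[:16]))
print("max error on the small block, NVFP4:", err_nv_small)
print("max error on the small block, single-scale INT4:", err_i4_small)
```

On this synthetic input the single INT4 scale is set by the large values, so the small block mostly rounds to zero, while the per-16 FP8 scale lets NVFP4 keep resolving it.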

Why Blackwell makes NVFP4 win in practice

Blackwell’s tensor cores have a native NVFP4 instruction. INT4 GEMMs on the same hardware run a slower path. The accuracy gap is small in either direction depending on the model and recipe — but the throughput gap is structural.

[Figure: Relative throughput and accuracy, INT4 vs NVFP4 on Blackwell. Throughput (tokens/s, normalised to INT4): INT4 1.00×, NVFP4 2.30×. Accuracy retention vs FP16 baseline (higher is better): INT4 ~99.0%, NVFP4 ~99.1%. Accuracy is task- and recipe-dependent: broad parity, with INT4 marginally ahead on some language tasks and NVFP4 ahead on reasoning.]
Same accuracy band; ~2.3× the tokens per second. That's the trade Blackwell offers.

When to pick which

TL;DR: INT4 is a uniform integer grid with one big scale; NVFP4 is an E2M1 floating-point grid with two-level scaling (FP8 per 16 weights + FP32 per tensor). Accuracy lands in the same band. Throughput on Blackwell does not — NVFP4 is ~2.3× faster because the tensor core has a native instruction for it.
