INT4 vs NVFP4
Two ways to put a 4-bit weight in a register. They look similar on paper and behave very differently on silicon.
Differences at a glance
| | INT4 | NVFP4 |
|---|---|---|
| Number system | uniform integers, 16 evenly-spaced levels | E2M1 float (1 sign · 2 exp · 1 mantissa) — non-uniform, denser near zero |
| Range | [−8, 7] (asymmetric, 16 codes) | [−6, 6] (15 distinct values, 16 codes incl. ±0) |
| Block scale | per-channel or block-32, FP16/FP32 | block-16, FP8 E4M3 |
| Outer scale | — | one FP32 per tensor |
| Outlier behaviour | poor; needs Hadamard rotation or GPTQ-style error compensation | absorbed by float dynamic range |
| Hardware path on Blackwell | emulated / general FP4-INT4 path | dedicated Tensor Core instruction |
| Throughput (Blackwell) | 1.0× | ~2.3× |
| Accuracy vs FP16 baseline | within ~1% with AWQ/AutoRound | within ~1%, sometimes ahead, sometimes slightly behind on language tasks |
| Weight memory | smallest | slightly larger (~+7 GB on Llama 3.3 70B due to finer scales) |
The number grid is the whole story
INT4 spaces 16 integer levels evenly across [−8, 7]. NVFP4 spends its 16 codes on a floating-point grid — 15 distinct values (±0 share a value) clustered near zero, sparser near the tails. Weight distributions in transformers are heavy near zero with rare outliers, so NVFP4’s grid hits the data’s shape; INT4’s wastes precision in regions weights almost never visit.
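To make the two grids concrete, the sketch below enumerates every 4-bit code in both formats. It assumes the usual E2M1 bit layout (sign in the top bit, then two exponent bits, then one mantissa bit) with an exponent bias of 1; `decode_e2m1` is a name made up for this illustration, not a library function.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * man
    else:
        # Normal: implicit leading 1, exponent bias of 1 (assumed layout).
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude


# INT4: 16 evenly spaced two's-complement levels.
int4_levels = list(range(-8, 8))

# E2M1: 16 codes, but +0 and -0 compare equal, so the set collapses them.
fp4_levels = sorted({decode_e2m1(code) for code in range(16)})

# Gaps between neighbouring levels: constant for INT4, growing for E2M1.
int4_gaps = [b - a for a, b in zip(int4_levels, int4_levels[1:])]
fp4_gaps = [b - a for a, b in zip(fp4_levels, fp4_levels[1:])]

print("INT4 levels:", int4_levels)
print("E2M1 levels:", fp4_levels)
print("INT4 gaps:  ", sorted(set(int4_gaps)))
print("E2M1 gaps:  ", sorted(set(fp4_gaps)))
```

The gap lists show the difference in one line each: INT4 has a single step size everywhere, while E2M1 steps by 0.5 near zero and by 2 at the tails.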
Two-level scaling, smaller blocks
INT4 schemes typically attach one scale per channel (or one per 32-element block) in FP16/FP32. NVFP4 splits that job in two: a fine-grained FP8 (E4M3) scale per 16-element micro-block to track local dynamic range, plus a single FP32 per-tensor scale that keeps the FP8 block scales from overflowing. The smaller blocks are what close the accuracy gap with INT4, and storing each scale in FP8 keeps them cheap; there are still more scales to store, though, which is why NVFP4 weights are slightly bigger on disk.
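The two-level idea can be sketched in a few lines of NumPy. This is a simulation under stated assumptions, not NVIDIA's implementation: the per-tensor scale is chosen so that the largest block scale lands on the E4M3 maximum (448), the block scales are kept in float32 instead of being rounded to FP8, and the function names `quantize_two_level` and `dequantize` are made up for this example.

```python
import numpy as np

E2M1_MAX = 6.0    # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
BLOCK = 16        # elements per micro-block
GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def quantize_two_level(w):
    """Simulate block-16 quantization with a per-block and a per-tensor scale."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, BLOCK)

    # Outer scale (one per tensor): assumed recipe that maps the largest
    # block scale onto E4M3_MAX so no block scale can overflow FP8.
    amax = float(np.abs(w).max())
    tensor_scale = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0

    # Inner scale (one per block), expressed relative to the tensor scale.
    block_amax = np.abs(w).max(axis=1, keepdims=True)
    block_scale = block_amax / E2M1_MAX / tensor_scale
    block_scale = np.where(block_scale == 0, 1.0, block_scale)  # all-zero blocks
    # Simplification: a real kernel would round block_scale to FP8 E4M3 here.

    # Round each scaled weight to the nearest E2M1 magnitude, keeping the sign.
    scaled = w / (block_scale * tensor_scale)
    idx = np.abs(np.abs(scaled)[..., None] - GRID).argmin(axis=-1)
    q = np.sign(scaled) * GRID[idx]
    return q, block_scale, tensor_scale


def dequantize(q, block_scale, tensor_scale):
    """Undo both scaling levels."""
    return q * block_scale * tensor_scale


rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, block_scale, tensor_scale = quantize_two_level(w)
w_hat = dequantize(q, block_scale, tensor_scale).reshape(-1)
print("block scales:", block_scale.ravel())
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Because each block is rescaled so that its own largest weight maps to 6, a single outlier only coarsens the 16 weights that share its block rather than the whole channel.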
Why Blackwell makes NVFP4 win in practice
Blackwell’s tensor cores have a native NVFP4 instruction. INT4 GEMMs on the same hardware run a slower path. The accuracy gap is small in either direction depending on the model and recipe — but the throughput gap is structural.
When to pick which
- NVFP4: Blackwell (B100/B200/RTX PRO 6000), inference-heavy serving, where you want the throughput and the engineering cost of a custom INT4 recipe isn’t justified.
- INT4 (AWQ / AutoRound / GPTQ): Hopper or older, smallest possible weights, mature toolchain, every GB of VRAM matters.
- Don’t mix: running INT4 weights through an NVFP4 kernel (or vice versa) loses the hardware advantage that makes either format interesting.
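The rules of thumb above can be written down as a small decision helper. This is a hypothetical sketch, not part of any toolkit: `pick_format` is a made-up name, the check `major >= 10` assumes that Blackwell parts report CUDA compute capability 10.x or 12.x while Hopper reports 9.0, and treating a tight memory budget as a reason to stay on INT4 is one reading of the list, not a hard rule.

```python
def pick_format(compute_capability: tuple[int, int], memory_bound: bool = False) -> str:
    """Suggest a 4-bit weight format from the GPU generation and memory pressure."""
    major, _minor = compute_capability
    # Assumption: compute capability 10.x / 12.x means Blackwell-class
    # tensor cores with a native NVFP4 instruction.
    has_native_nvfp4 = major >= 10
    if has_native_nvfp4 and not memory_bound:
        return "NVFP4"
    # Hopper or older, or every GB of VRAM matters: stay on INT4.
    return "INT4"


print(pick_format((10, 0)))                     # Blackwell data-center part
print(pick_format((9, 0)))                      # Hopper
print(pick_format((12, 0), memory_bound=True))  # Blackwell, tight on VRAM
```

On a real machine the compute capability could come from the driver or framework in use; it is passed in as a plain tuple here so the sketch has no dependencies.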
TL;DR: INT4 is a uniform integer grid with a single level of scaling (one FP16/FP32 scale per channel or block); NVFP4 is an E2M1 floating-point grid with two-level scaling (FP8 per 16 weights + FP32 per tensor). Accuracy lands in the same band. Throughput on Blackwell does not: NVFP4 is ~2.3× faster because the tensor core has a native instruction for it.
Sources
- Introducing NVFP4 for Efficient and Accurate Low-Precision Inference — NVIDIA
- NVFP4: Same Accuracy with 2.3× Higher Throughput for 4-Bit LLMs — Benjamin Marie
- INT4 vs FP4: The Future of 4-Bit Quantization — Hugging Face
- INT vs FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats — arXiv
- NVIDIA Blackwell: The Impact of NVFP4 For LLM Inference — Edge AI and Vision Alliance
- Source video — YouTube