width-explicit (NEON 128) or VL-agnostic (SVE/SVE2)
Modes
real / protected / long, 16/32/64-bit baggage
AArch64 only at EL1 and above on recent cores; AArch32 (A32/T32) optional at EL0 and being phased out
Backward compat horizon
decades (32-bit Windows binaries still run)
shorter — Apple dropped 32-bit ARM in 2017, server is A64-only
Implementation diversity
2 main vendors (AMD, Intel) — mostly converged
many licensees (Apple, Qualcomm, Ampere, NVIDIA, Marvell, AWS, ARM Ltd Neoverse) — divergent microarchitectures
Standard system platform
PC firmware: UEFI + ACPI + APIC, very uniform
SystemReady (UEFI + ACPI) on servers; Devicetree plus vendor-specific boot firmware on mobile, so the platform is fragmented
Both camps have converged on the same execution back-end: superscalar, out-of-order, register-renamed, with µops. The differences are now in the front-end (decode), the memory-ordering contract, and the surrounding ecosystem.
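The memory-ordering contract mentioned above can be made concrete with a small message-passing sketch in C++. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are hypothetical and chosen only for illustration. On x86-64 (TSO) the release store and acquire load typically compile to plain MOVs, so code that mistakenly uses weaker orderings often still appears to work; on AArch64 the compiler typically emits STLR and LDAR (or LDAPR), and omitting them lets the hardware make `ready` visible before `payload`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared state for illustration only.
int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> ready{false};     // flag that publishes the data

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // orders the payload store before the flag
}

int consumer() {
    // The acquire load pairs with the release store above.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the release/acquire pair guarantees this reads 42
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Replacing both orderings with `std::memory_order_relaxed` would make the program incorrect on every architecture as far as the C++ standard is concerned, but the failure is far more likely to be observed on a weakly ordered ARM core than under x86 TSO.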
Why ARM wins perf-per-watt — and what it costs
ARM’s lead is structural, not magical. Each gain has a paired cost.
| ARM gain | Why | What you pay |
| --- | --- | --- |
| Cheaper decode (often 8-wide at low power) | Fixed 4-byte instructions → trivial parallel length-decode; no µop cache needed → less SRAM and dynamic power | Worse code density. ARM64 binaries are typically ~10–25 % larger than the same C++ compiled for x86-64 → more I-cache pressure → bigger L1i and prefetchers needed at the top end |
| More registers (31 GP, 32 vector) → less register-rename pressure, fewer spills | RISC orthogonality | Wider calling conventions: every callee-saved register preserved on entry/exit adds stack traffic |
| Weak memory model → fewer coherence stalls | Hardware is free to reorder loads and stores aggressively, as seen from other cores | Concurrent code is harder. Forget an LDAR or DMB ISH and the bug may only show up on heavily loaded multi-socket systems. In practice, locks, RCU, and lock-free queues all need extra fences compared with x86 |
| Load/store-only memory pipeline | One memory-access shape to schedule; simpler AGU and store buffer | Every memory RMW outside the LSE atomics takes ≥ 3 instructions (LDR/LDXR + op + STR/STXR), a visible code-size and µop-count hit |
| Predication (CSEL, CSINC, …) | Compiler turns a short if into branchless data flow → fewer mispredicts, fewer branch units | Increases register pressure (both candidates must be computed) and ties up the issue queue |
| LSE atomics (v8.1+) scale better under contention | One-shot CAS/LDADD instead of LL/SC retry loops | Requires v8.1+; older silicon falls back to LL/SC, which can livelock |
| No microcode update path | Saves the µcode patch SRAM and load logic | A late-discovered ISA bug needs a new chip. Spectre-class fixes ship more slowly |
| Aggressive big.LITTLE (P-cores + E-cores) since 2011 | Heterogeneous scheduling at low power | OS scheduler complexity; thread-affinity bugs; vendor-locked QoS |
| SoC integration (Apple M, Snapdragon, Graviton) | Memory controller, NPU, GPU, and fabric on-die or on-package, which kills off-chip DRAM hops | Less vendor choice, less repairability, locked memory configurations |
| ARM-licensee specialisation | Each vendor optimises for its workload (Apple: client; Ampere: cloud; AWS: web) | Fragmentation: code tuned for Neoverse N2 may underperform on Apple Firestorm; vector-length-agnostic SVE code can’t be tuned per width as easily as fixed-width AVX-512 |
| Cleaner ISA: no real mode, no segmentation, no 16-bit | Smaller transistor count for legacy logic | Some legacy software just doesn’t exist for ARM (older proprietary Windows binaries, JITs not yet ported, drivers) |
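The predication entry above is easiest to see in source form. The following is a minimal C++ sketch, with hypothetical helper names `clamp_select` and `count_above`, of the short conditionals that compilers commonly lower to CSEL or CSINC/CINC on AArch64 (and often to CMOV or SETcc on x86-64). Whether a given compiler actually emits a branchless sequence depends on optimisation level and its cost model, so treat the comments as the typical outcome rather than a guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: clamp x into [lo, hi].
// Each ternary computes both candidates and selects one, which is the
// shape a compiler can map to CMP + CSEL instead of a conditional branch.
std::int64_t clamp_select(std::int64_t x, std::int64_t lo, std::int64_t hi) {
    std::int64_t t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}

// Hypothetical helper: count the elements greater than threshold.
// The conditional increment is a candidate for CINC/CSINC (or for
// vectorisation), so the loop body needs no data-dependent branch.
std::size_t count_above(const std::vector<int>& v, int threshold) {
    std::size_t n = 0;
    for (int x : v) {
        n += (x > threshold) ? 1 : 0;
    }
    return n;
}
```

The cost named in the table is visible here too: both `lo` and `x` (and then `hi` and `t`) must be live in registers at the select, whereas a branch would only need the value on the taken path.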
Where x86-64 still wins on absolute perf (not per-watt)
| Workload | Why x86 still leads |
| --- | --- |
| Single-thread integer (peak) | Wider OoO windows + µop cache + aggressive prefetch tuned over 25 years |
| HPC / dense FP / AI inference | AVX-512 (512-bit) + AMX tiles; SVE2 is catching up but its ecosystem is younger |
| JITs that emit RMW patterns | LOCK + RMW = 1 op; ARM needs LL/SC or LSE explicitly |
| Legacy software (Windows desktop, older ISVs) | Binary compatibility is x86’s product; emulation on ARM (Rosetta 2, Prism) costs ~10–30 % |
| Workloads sensitive to memory-ordering surprises | TSO masks a lot of concurrency bugs that ARM exposes |
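The RMW entry above can be illustrated with a single atomic increment. This is a minimal C++ sketch with a hypothetical function name, `count_with_fetch_add`. On x86-64, `fetch_add` typically compiles to one LOCK-prefixed instruction (LOCK XADD or LOCK ADD). On AArch64 the result depends on the target: with LSE enabled (for example `-march=armv8.1-a` or later) compilers typically emit LDADD, while a baseline Armv8.0 target gets an LDXR/STXR retry loop, or a runtime-dispatched helper call when `-moutline-atomics` is in effect.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each increment a shared counter
// `iters` times and the final count is returned.
std::uint64_t count_with_fetch_add(unsigned threads, unsigned iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                // One atomic read-modify-write per iteration.
                // Relaxed ordering is enough for a pure counter.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

Calling `count_with_fetch_add(4, 10000)` returns 40000 on any conforming implementation; what differs between the two architectures is the instruction sequence behind each increment, which is exactly the choice a JIT has to make explicitly when it targets ARM.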
The honest summary
ARM trades strict ordering, RMW convenience, code density, microcode flexibility, and ecosystem maturity for simpler decode, more registers, freer reordering, and lighter cores. That trade currently buys roughly 1.2–1.5× perf-per-watt at the socket level in scale-out server workloads; the gap narrows for peak single-thread performance and widens for throughput per rack.