What the CPU does when there's nothing to do
A CPU is a clocked device. Every cycle, something happens — there’s no “off” while power is applied. So when the OS scheduler looks around and finds zero runnable threads, what does the chip actually do? The answer is more interesting than “spin in a loop.”
The naive answer (and why it’s wrong)
You could imagine the kernel running a tight while(1) { check_for_work(); } loop on the idle core. It would technically work, but the core would run at full active power doing nothing, drain a laptop battery in a couple of hours, and cap turbo boost on the busy cores, because the package thermal budget is shared.
So real systems don’t do this. The CPU has dedicated instructions to stop fetching until something interesting happens.
HLT (x86) and WFI (ARM)
The basic primitive is a “park” instruction:
- x86: HLT halts the core until the next external interrupt. The clock to the execution units gates off; the core sits at low power until the interrupt controller pokes it.
- ARM: WFI (“Wait For Interrupt”) does the same job. Some implementations also support WFE (Wait For Event) for fine-grained signalling between cores.
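HLT is privileged: userspace can’t issue it, and it faults outside ring 0. WFI is architecturally a hint, which the OS can trap when it comes from userspace and which the core is allowed to treat as a no-op. Either way, only the kernel decides when a core gets to sleep, because only the kernel knows whether there’s runnable work elsewhere on the system.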
The kernel idle task
On Linux, every CPU has an idle task (PID 0 per-CPU, the “swapper”). When the scheduler picks next and there’s nothing runnable, it picks the idle task. The idle task’s job is to call into the cpuidle subsystem, which selects an appropriate sleep state and issues the park instruction.
Roughly:
    schedule()
      → pick idle task
      → cpuidle_governor->select()    // "menu" or "teo" governor picks a C-state
      → cpuidle_driver->enter(state)  // arch-specific: HLT, MWAIT, WFI, ...
         // core sleeps here until interrupt
         // on wake, returns and schedule() runs again
The governor is making a real prediction: how long do I expect to sleep? Deeper sleep saves more power but takes longer to wake. Misjudge it and you tank latency-sensitive workloads.
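To make that trade-off concrete, here is a toy version of the selection step. It is a sketch, not the menu or teo algorithm: the state table and its numbers are made up for illustration, and `IdleState` and `pick_state` are hypothetical names. It only captures the two checks described above: the predicted sleep has to be long enough to pay for the state, and the wake-up has to be fast enough for whoever is waiting.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time to get back to executing code
    target_residency_us: int  # minimum sleep for the state to pay off


# Hypothetical values for illustration; real ones come from the cpuidle driver.
STATES = (
    IdleState("C1", 1, 2),
    IdleState("C1E", 10, 20),
    IdleState("C6", 100, 400),
)


def pick_state(predicted_idle_us: int, latency_limit_us: int,
               states: Sequence[IdleState] = STATES) -> IdleState:
    """Return the deepest state that fits the prediction and the latency limit.

    `states` must be ordered shallowest to deepest.
    """
    choice = states[0]  # always fall back to the shallowest state
    for state in states[1:]:
        if state.target_residency_us > predicted_idle_us:
            break  # sleep too short to amortize the entry/exit cost
        if state.exit_latency_us > latency_limit_us:
            break  # waking would violate the latency constraint
        choice = state
    return choice
```

With this table, `pick_state(5, 1000)` stays in C1, while `pick_state(5000, 1000)` goes all the way to C6. The hard part in a real governor is producing `predicted_idle_us` in the first place.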
P-states vs C-states
Modern CPUs have two orthogonal sets of power states, and they’re easy to confuse:
- P-states (performance states): describe how the CPU runs while it’s active: frequency and voltage. P0 is the highest (turbo); higher P-numbers are slower and lower-voltage.
- C-states (idle states): describe how deeply the CPU sleeps when it’s not running anything. This is what HLT/MWAIT/WFI enter.
Two different governors, two different decisions. P-states get picked by a cpufreq governor (“schedutil”, “powersave”, “performance”); C-states by a cpuidle governor (“menu”, “teo”). A core can be in C0 P3 (active but slow) or C0 P0 (active and turboing) or C6 (asleep, P-state irrelevant).
C-states: graduated sleep
x86 exposes a hierarchy of sleep states named C0 through C10 (not all are implemented; vendor-specific):
| State | What’s off | Wake latency | Typical use |
|---|---|---|---|
| C0 | nothing — actively executing | — | running code |
| C1 | clocks gated, caches live | ~1 µs | brief idle |
| C1E | + voltage reduced | a few µs | short idle |
| C3 | L1/L2 caches flushed | tens of µs | longer idle |
| C6 | core powered off, state saved to SRAM | ~50–100 µs | deep idle |
| C7+ | shared L3 / package-level off | hundreds of µs | whole-package idle |
Deeper states save more power but have to rebuild more state on wake — caches have to refill from memory, voltages have to ramp. That’s why the governor matters: entering C6 for a 5 µs sleep loses you more than you save.
You can see this on Linux:
    $ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    POLL
    C1_ACPI
    C2_ACPI
    C3_ACPI
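The same directory also carries per-state counters, which is where residency numbers come from. A minimal sketch for reading them, assuming the usual Linux sysfs layout (`name`, `latency`, `usage`, and `time` files under each `stateN` directory); `read_cpuidle_states` is a hypothetical helper name, and the `root` parameter exists only so the function can be pointed at a different tree:

```python
from pathlib import Path
from typing import Dict, List

SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_cpuidle_states(cpu: int = 0, root: Path = SYSFS_CPU) -> List[Dict[str, object]]:
    """Return one dict per idle state of `cpu`, shallowest first.

    Returns an empty list if the cpuidle directory is missing
    (for example in some containers or VMs).
    """
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    # Sort numerically so state10 comes after state9, not after state1.
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for d in dirs:
        states.append({
            "name": (d / "name").read_text().strip(),
            "latency_us": int((d / "latency").read_text()),  # exit latency
            "entries": int((d / "usage").read_text()),       # times entered
            "time_us": int((d / "time").read_text()),        # total time in state
        })
    return states
```

Calling `read_cpuidle_states()` twice a few seconds apart and diffing `time_us` gives a rough picture of which states the core actually reaches.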
MWAIT — sleeping on a memory address
HLT only wakes on interrupts. But sometimes a core wants to sleep until another core writes a specific memory address. Classic example: a lock word, or a flag that signals new work. Polling burns power; sleeping with HLT requires routing an IPI (inter-processor interrupt), which is overkill.
x86 added MONITOR + MWAIT for this:
    MONITOR [addr]   ; arm a watch on this cache line
    MWAIT            ; sleep until the line is written (or interrupt)
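The CPU uses the cache coherence protocol it already has: when another core’s write invalidates the watched line, this core wakes. No interrupt needed. MONITOR/MWAIT are normally restricted to ring 0, and the kernel’s idle loop is their main user: MWAIT takes a hint that selects the target C-state, and an idle core that monitors its own reschedule flag can be woken by a plain store from another core instead of an IPI. Newer chips add unprivileged variants (UMONITOR/UMWAIT, plus the timed TPAUSE) that let userspace wait briefly in a light sleep state without a syscall.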
The race nobody talks about
There’s a subtle race in the obvious implementation:
    if (no_work) {
        // ← interrupt arrives here, sets work flag
        HLT;  // we sleep anyway, miss the wake-up
    }
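The fix is to check for work with interrupts disabled, and to re-enable them only as part of going to sleep. On x86 the sequence is CLI, check, then STI; HLT. The CPU holds off interrupt delivery for one instruction after STI, so nothing can slip in between the two, and an interrupt that was already pending wakes the core as soon as HLT commits. ARM gets there differently: WFI wakes on a pending interrupt even while interrupts are masked, so the kernel can check and sleep with them masked and unmask after waking. Get this wrong in a hypervisor or a bare-metal kernel and you get rare, awful “system just stops responding” bugs.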
Race to idle
A counterintuitive consequence of all this: it’s often more efficient to sprint at high frequency for a brief burst and then sleep deeply, than to run at moderate frequency for longer. The static power cost of being awake at all (leakage, clock distribution, uncore) doesn’t scale linearly with frequency, so finishing fast and getting to C6 wins. This is why a spike on a CPU graph isn’t necessarily waste — it might be the most power-efficient path back to silence.
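A back-of-the-envelope model shows when the sprint wins and when it doesn’t. This is a sketch under loud assumptions: active power is modeled as a fixed static term plus `k * V² * f`, every number below is invented for illustration, and `energy_joules` is a hypothetical helper, not a measurement of any real chip.

```python
def energy_joules(cycles: float, freq_hz: float, volts: float, window_s: float,
                  static_w: float, sleep_w: float, k: float = 2e-9) -> float:
    """Energy to run `cycles` of work within `window_s`, then sleep the rest.

    Active power is modeled as static_w + k * volts**2 * freq_hz.
    All parameters are illustrative assumptions.
    """
    busy_s = cycles / freq_hz
    if busy_s > window_s:
        raise ValueError("work does not fit in the window at this frequency")
    active_w = static_w + k * volts ** 2 * freq_hz
    return active_w * busy_s + sleep_w * (window_s - busy_s)


# Same job (1e9 cycles in a 1 s window), 3 W of static power while awake.
sprint = energy_joules(1e9, 2e9, 1.0, 1.0, static_w=3.0, sleep_w=0.1)
stroll = energy_joules(1e9, 1e9, 0.8, 1.0, static_w=3.0, sleep_w=0.1)
print(f"sprint then sleep: {sprint:.2f} J, slow and steady: {stroll:.2f} J")
```

With 3 W of static power the sprint comes out ahead (about 3.55 J against 4.28 J). Drop `static_w` to 0.5 and the result flips, because the V² term starts to dominate. That’s the “often” in “often more efficient”: race to idle pays off when the cost of simply being awake is large.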
Timer coalescing — protecting the silence
Idle is fragile. If ten different processes each wake up at slightly different times once per second, the CPU never gets a long enough quiet stretch to enter deep C-states. Wake, work, sleep, wake, work, sleep — and the deep idle states are never amortized.
Modern kernels fight this with timer coalescing: deferrable timers, timer slack, and tickless idle (NO_HZ_IDLE on Linux, which stops the periodic tick on idle CPUs) align wakeups so the system handles a burst of activity then stays quiet. The same idea on Windows lets a laptop battery survive an open browser; without it, one badly-written app polling every 30 ms can take a large bite out of battery life without ever showing up meaningfully in a CPU% graph. The crime isn’t CPU consumption; it’s preventing rest.
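The effect of slack is easy to see in a toy model. The sketch below is not how any kernel implements it; `coalesce` is a hypothetical function that assumes every timer tolerates firing up to `slack_ms` late, then greedily batches them into as few wakeups as possible.

```python
from typing import Iterable, List


def coalesce(expiries_ms: Iterable[int], slack_ms: int) -> List[int]:
    """Return wakeup times when each timer may fire up to slack_ms late.

    Greedy: delay the earliest pending timer as far as its slack allows and
    serve every timer that has expired by then in the same wakeup.
    """
    wakeups: List[int] = []
    for expiry in sorted(expiries_ms):
        if wakeups and expiry <= wakeups[-1]:
            continue  # already handled by the previous wakeup
        wakeups.append(expiry + slack_ms)
    return wakeups


# Ten timers spread 100 ms apart across one second.
timers = list(range(0, 1000, 100))
print(len(coalesce(timers, 0)), "wakeups with no slack")
print(len(coalesce(timers, 250)), "wakeups with 250 ms of slack")
```

Ten separate wakeups become four with 250 ms of slack, and the quiet gaps between them grow from 100 ms to 300 ms, which is the difference between bouncing off C1 and actually reaching a deep state.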
Busy-wait is a lie to the scheduler
This connects to a userspace pattern that matters more than people realize. If an application has nothing to do, it should block — epoll, select, a condition variable, a futex, anything that takes the thread off the runqueue. When it does, the scheduler picks the idle thread, and the core can sleep.
If it instead spins (while (!flag) {}), the thread looks runnable. The scheduler dutifully runs it. The idle thread never gets a turn. That core never sleeps. Battery dies, fan spins up, and the overall CPU% might still look modest, because one pegged core on a 16-thread machine shows up as about 6%. This is why busy-waiting beyond a short, bounded spin is strongly discouraged in code that runs on battery-powered or shared hardware: the cost isn’t measured in cycles, it’s measured in C-state residency.
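You can watch the difference from userspace by measuring how much CPU time a waiting thread consumes. A small sketch using only the standard library; `cpu_seconds_while_waiting` is a hypothetical helper, and the exact numbers will vary with the machine and the interpreter:

```python
import threading
import time


def cpu_seconds_while_waiting(spin: bool, delay_s: float = 0.2) -> float:
    """CPU time the waiting thread burns until the flag is set after delay_s."""
    flag = threading.Event()
    used = []

    def waiter() -> None:
        start = time.thread_time()  # CPU time of this thread only
        if spin:
            while not flag.is_set():  # busy-wait: stays runnable throughout
                pass
        else:
            flag.wait()               # block: thread leaves the runqueue
        used.append(time.thread_time() - start)

    t = threading.Thread(target=waiter)
    t.start()
    time.sleep(delay_s)  # the "work" arrives after delay_s
    flag.set()
    t.join()
    return used[0]


spin_cpu = cpu_seconds_while_waiting(spin=True)
block_cpu = cpu_seconds_while_waiting(spin=False)
print(f"spinning: {spin_cpu:.3f} s of CPU, blocking: {block_cpu:.3f} s of CPU")
```

Both versions wait the same wall-clock time for the same flag. The spinner charges roughly the whole wait to the CPU; the blocker charges close to nothing, and the core is free to go idle in between.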
Hybrid cores: P-cores and E-cores
Recent x86 (Intel 12th gen+) and ARM (big.LITTLE) systems make idle decisions even richer. The scheduler can:
- Park entire cores so they enter the deepest possible state, rather than spreading load thinly across many awake cores.
- Prefer efficiency cores for background work, keeping performance cores asleep for foreground bursts.
- Route latency-sensitive threads to whichever core is already awake, avoiding a wake-up event entirely.
The optimization target isn’t “balanced utilization” anymore — it’s “as much silicon asleep as possible, as deeply as possible, for as long as possible.”
Why this matters in practice
- Laptops: a lightly loaded machine typically spends the large majority of wall-clock time idle, even when it feels “busy”. C-state residency directly determines battery life, and a misbehaving driver that prevents deep sleep can cut it dramatically.
- Servers: datacenters care about the same thing at scale, and additionally about boost headroom — one core staying in C0 with a polling loop steals thermal budget from cores doing actual work.
- Real-time / latency-sensitive: you sometimes want to forbid deep C-states (intel_idle.max_cstate=1, idle=poll) to keep wake latency predictable, accepting the power cost.
- Misreading Task Manager: the “System Idle Process” at 99% isn’t a runaway process; it’s the accounting record for time the CPU had nothing to do. High idle is good news.
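For the latency-sensitive case, Linux also offers a runtime alternative to boot parameters: the PM QoS interface at /dev/cpu_dma_latency. A process writes a latency limit in microseconds and the kernel avoids idle states with a longer exit latency for as long as the file stays open. A minimal sketch, assuming that interface is present and the process has permission to open it (usually root); `max_wake_latency` is a hypothetical wrapper name:

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def max_wake_latency(limit_us: int, path: str = "/dev/cpu_dma_latency"):
    """Hold a PM QoS request capping CPU wake latency at limit_us.

    The kernel honors the request only while the file stays open, so the
    limit is dropped automatically when the block exits.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, struct.pack("i", limit_us))  # native-endian signed 32-bit
        yield
    finally:
        os.close(fd)
```

Wrapping only the latency-critical section in `with max_wake_latency(10): ...` keeps the power cost confined to the time it is actually needed, instead of pinning the whole machine to shallow states from boot.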
TL;DR: an idle CPU isn’t spinning. It’s executing one instruction (HLT, WFI, or MWAIT) that gates its clocks and waits for a hardware event. The kernel’s idle task picks how deep to sleep based on a prediction of how soon it’ll be needed. On a modern power-managed chip, idle isn’t wasted time; it’s the asset that buys you battery life, thermal headroom, and the boost clock you’ll need next. “Do nothing” is one of the most performance-critical things a chip does.