What the CPU does when there's nothing to do

A CPU is a clocked device. Every cycle, something happens — there’s no “off” while power is applied. So when the OS scheduler looks around and finds zero runnable threads, what does the chip actually do? The answer is more interesting than “spin in a loop.”

The naive answer (and why it’s wrong)

You could imagine the kernel running a tight while(1) { check_for_work(); } loop on the idle core. It would technically work, but the core would burn close to its full power budget doing nothing, drain your laptop battery in a couple of hours, and cap turbo boost on the busy cores because the package thermal budget is shared.

So real systems don’t do this. The CPU has dedicated instructions to stop fetching until something interesting happens.

HLT (x86) and WFI (ARM)

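The basic primitive is a “park” instruction: HLT on x86 and WFI (wait for interrupt) on ARM. Each stops the core from fetching instructions until an interrupt arrives.

HLT is a privileged instruction, so userspace can’t issue it. WFI is architecturally a hint that the OS can trap when userspace executes it. Either way, only the kernel decides when a core gets to sleep, because only the kernel knows whether there’s runnable work elsewhere on the system.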

The kernel idle task

On Linux, every CPU has an idle task (PID 0 per-CPU, the “swapper”). When the scheduler picks the next task and there’s nothing runnable, it picks the idle task. The idle task’s job is to call into the cpuidle subsystem, which selects an appropriate sleep state and issues the park instruction.

Roughly:

schedule() → pick idle task
  cpuidle_governor->select()    // "menu" or "teo" governor picks a C-state
  cpuidle_driver->enter(state)  // arch-specific: HLT, MWAIT, WFI, ...
  // core sleeps here until interrupt
  // on wake, returns and schedule() runs again

The governor is making a real prediction: how long do I expect to sleep? Deeper sleep saves more power but takes longer to wake. Misjudge it and you tank latency-sensitive workloads.
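The selection step can be sketched in a few lines. This is a simplified model, not the actual menu or teo code: the state table, its latency and residency numbers, and the `select_state` helper are all illustrative assumptions. It picks the deepest state whose target residency fits the predicted sleep and whose exit latency fits the latency limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum sleep for the state to pay off


# Hypothetical table, ordered shallow to deep. Real values come from the
# cpuidle driver and differ per CPU model.
STATES = [
    IdleState("C1", 1, 1),
    IdleState("C1E", 4, 10),
    IdleState("C3", 40, 100),
    IdleState("C6", 80, 400),
]


def select_state(predicted_idle_us, latency_limit_us, states=STATES):
    """Return the deepest state that fits both the prediction and the limit."""
    choice = states[0]  # the shallowest state is always allowed
    for state in states:
        if state.target_residency_us > predicted_idle_us:
            break  # sleep too short to amortize this state or any deeper one
        if state.exit_latency_us > latency_limit_us:
            break  # waking would take longer than the system can tolerate
        choice = state
    return choice


print(select_state(5, 1000).name)     # short nap -> C1
print(select_state(2000, 1000).name)  # long, latency-tolerant sleep -> C6
print(select_state(2000, 10).name)    # long sleep, tight latency limit -> C1E
```

The hard part in a real governor is producing `predicted_idle_us` in the first place; this sketch takes the prediction as given.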

[Figure: Scheduler view of a per-CPU runqueue in a quiet moment. chrome (renderer) is blocked on epoll, pulseaudio is waiting on an hr-timer, kworker/0:1 has an empty workqueue, and sshd is waiting on a socket. Only idle (swapper/0) is runnable, so the scheduler picks it; cpuidle_governor.select() predicts the sleep length and chooses a C-state; the arch driver's enter() issues HLT, MWAIT, or WFI; the core sleeps with clocks gated until an IRQ arrives and schedule() runs again.]
From the scheduler's view, idle isn't an absence of work — it's a thread that always wins when nothing else is runnable.

P-states vs C-states

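Modern CPUs have two orthogonal sets of power states, and they’re easy to confuse. P-states (performance states) set the frequency and voltage a core runs at while it is executing. C-states (idle states) set how deeply a core sleeps when it has nothing to execute.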

Two different governors, two different decisions. P-states get picked by the cpufreq governor (“schedutil”, “powersave”, “performance”); C-states by the cpuidle governor (“menu”, “teo”). A core can be in C0 P3 (active but slow) or C0 P0 (active and turboing) or C6 (asleep, P-state irrelevant).

C-states: graduated sleep

x86 exposes a hierarchy of sleep states named C0 through C10 (not all are implemented; vendor-specific):

| State | What’s off | Wake latency | Typical use |
|-------|------------|--------------|-------------|
| C0 | nothing (actively executing) | n/a | running code |
| C1 | clocks gated, caches live | ~1 µs | brief idle |
| C1E | + voltage reduced | a few µs | short idle |
| C3 | L1/L2 caches flushed | tens of µs | longer idle |
| C6 | core powered off, state saved to SRAM | ~50–100 µs | deep idle |
| C7+ | shared L3 / package-level off | hundreds of µs | whole-package idle |

Deeper states save more power but have to rebuild more state on wake — caches have to refill from memory, voltages have to ramp. That’s why the governor matters: entering C6 for a 5 µs sleep loses you more than you save.

[Figure: C-state wake latency (log scale, 1 µs to 1 ms) vs power saved for C1, C1E, C3, C6, and C7+. Deeper states save more power and cost more to wake.]
C-state tradeoff: each step down saves more power but costs more to wake.
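The break-even point behind that tradeoff can be worked out from three numbers. A minimal sketch, assuming a made-up power model: the wattages and the transition energy below are illustrative placeholders, not measurements of any real chip, and the helper names are hypothetical.

```python
def idle_energy_uj(duration_us, idle_power_w, transition_energy_uj):
    """Energy in microjoules for one idle period.

    Fixed entry/exit cost plus power times residency. Watts multiplied by
    microseconds gives microjoules directly.
    """
    return transition_energy_uj + idle_power_w * duration_us


def break_even_us(shallow_power_w, deep_power_w, extra_transition_uj):
    """Sleep length at which the deep state starts to beat the shallow one."""
    return extra_transition_uj / (shallow_power_w - deep_power_w)


# Hypothetical numbers: the shallow state idles at 1.0 W with negligible
# transition cost; the deep state idles at 0.1 W but costs 90 uJ per
# enter/exit round trip.
SHALLOW_W, DEEP_W, DEEP_COST_UJ = 1.0, 0.1, 90.0

for sleep_us in (5, 1000):
    shallow = idle_energy_uj(sleep_us, SHALLOW_W, 0.0)
    deep = idle_energy_uj(sleep_us, DEEP_W, DEEP_COST_UJ)
    winner = "deep" if deep < shallow else "shallow"
    print(f"{sleep_us} us: shallow={shallow:.1f} uJ, deep={deep:.1f} uJ -> {winner}")

print(f"break-even: {break_even_us(SHALLOW_W, DEEP_W, DEEP_COST_UJ):.0f} us")
```

Under these assumed numbers the deep state loses badly on a 5 µs sleep and wins easily on a 1 ms one. A state's target residency is, in effect, this break-even time.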

You can see this on Linux:

$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
POLL
C1_ACPI
C2_ACPI
C3_ACPI
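The same sysfs directory also exposes per-state latency, target residency, and usage counters. The sketch below reads them for one CPU; it assumes the standard Linux cpuidle sysfs layout (`name`, `latency`, `residency`, `usage`, `time` under each `stateN` directory), and the `read_idle_states` helper is a hypothetical name, not an existing tool.

```python
from pathlib import Path

CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def read_idle_states(cpuidle_dir=CPUIDLE_DIR):
    """Return one dict per stateN directory, ordered shallow to deep."""
    # Sort numerically so that state10 comes after state9, not after state1.
    dirs = sorted(Path(cpuidle_dir).glob("state[0-9]*"),
                  key=lambda p: int(p.name[len("state"):]))
    states = []
    for d in dirs:
        def field(name, d=d):
            return (d / name).read_text().strip()
        states.append({
            "name": field("name"),
            "exit_latency_us": int(field("latency")),
            "target_residency_us": int(field("residency")),
            "entries": int(field("usage")),   # times this state was entered
            "time_us": int(field("time")),    # total time spent in it
        })
    return states


if __name__ == "__main__":
    try:
        for s in read_idle_states():
            print(f"{s['name']:10} latency={s['exit_latency_us']}us "
                  f"residency={s['target_residency_us']}us "
                  f"entries={s['entries']} time={s['time_us']}us")
    except (OSError, ValueError) as exc:
        print(f"could not read cpuidle state: {exc}")
```

On a machine without cpuidle support (or a non-Linux system) the directory is absent and the script prints nothing.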

MWAIT — sleeping on a memory address

HLT only wakes on interrupts. But sometimes a core wants to sleep until another core writes a specific memory address; the classic example is waiting for a lock or a flag to change. Polling burns power, and sleeping with HLT means the writer has to send an IPI (inter-processor interrupt) to wake the sleeper, which is overkill.

x86 added MONITOR + MWAIT for this:

MONITOR [addr]   ; arm a watch on this cache line
MWAIT            ; sleep until the line is written (or interrupt)

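The CPU uses the cache coherence protocol it already has: when another core’s write invalidates the watched line, this core wakes. No interrupt needed. Linux uses this in its x86 idle loop, where a sleeping core monitors its own thread flags so that another core can wake it just by setting the need-resched flag. MONITOR and MWAIT are normally kernel-only; newer Intel chips add unprivileged counterparts for userspace (UMONITOR/UMWAIT, plus the timed TPAUSE).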

The race nobody talks about

There’s a subtle race in the obvious implementation:

if (no_work) {
  // ← interrupt arrives here, sets work flag
  HLT;  // we sleep anyway, miss the wake-up
}

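The fix on x86 is to do the check with interrupts disabled and then execute STI immediately followed by HLT. STI does not take effect until after the next instruction (the “interrupt shadow”), so no interrupt can be delivered between the two; one that arrived during the check stays pending and wakes the core as soon as HLT executes. ARM gets the same result a different way: WFI wakes on a pending interrupt even while interrupts are masked, so the kernel can check and sleep with interrupts masked. Get this wrong in a hypervisor or a bare-metal kernel and you get rare, awful “system just stops responding” bugs.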
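The same check-then-sleep hazard exists in userspace, which makes it easier to illustrate than ring-0 code. The sketch below is an analogy in Python, not kernel code: the `WorkFlag` class is hypothetical, the condition variable's lock plays the role of “interrupts disabled”, and waiting on the condition releases that lock and sleeps in one atomic step, much as STI; HLT does.

```python
import threading


class WorkFlag:
    """A work flag whose check and sleep cannot be split by a wake-up."""

    def __init__(self):
        self._cond = threading.Condition()
        self._work = False

    def post(self):
        """The 'interrupt': record that work exists and wake a sleeper."""
        with self._cond:
            self._work = True
            self._cond.notify()

    def wait_for_work(self, timeout=None):
        """Sleep until work is posted. Returns False if the wait timed out."""
        with self._cond:
            # The predicate is checked while holding the lock, and the wait
            # releases the lock and blocks atomically, so a post() can never
            # slip in between the check and the sleep.
            got_work = self._cond.wait_for(lambda: self._work, timeout)
            self._work = False
            return got_work


flag = WorkFlag()
flag.post()                              # wake-up arrives before the sleeper
print(flag.wait_for_work(timeout=1.0))   # True: the early wake-up is not lost
print(flag.wait_for_work(timeout=0.05))  # False: nothing posted, wait times out
```

The broken version would test the flag without the lock and then block, which is the userspace twin of the `if (no_work) HLT;` snippet above.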

Race to idle

A counterintuitive consequence of all this: it’s often more efficient to sprint at high frequency for a brief burst and then sleep deeply than to run at moderate frequency for longer. The static power cost of being awake at all (leakage, clock distribution, uncore) is paid for every moment the core is awake, largely regardless of frequency, so finishing fast and getting to C6 often wins. This is why a spike on a CPU graph isn’t necessarily waste: it might be the most power-efficient path back to silence.

[Figure: Race to idle: same task, lower total energy. Power (P) over time (t) for two strategies. Slow and steady: energy ≈ 29.6k. Race to idle (sprint, then deep idle): energy ≈ 20.0k.]
Same work, two strategies. Sprinting then sleeping deeply pays the static-power floor for less time.
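A toy energy model shows both why sprinting can win and why it does not always. Everything here is an assumption made for illustration: the cubic dynamic-power term (voltage taken to scale with frequency), the static and idle wattages, and the workload size. It does not reproduce the numbers in the figure.

```python
def task_energy_j(freq_ghz, work_gcycles=2.0, window_s=2.0,
                  static_w=4.0, dyn_coeff=0.25, idle_w=0.1):
    """Energy over a fixed window: run the work at freq_ghz, then deep idle.

    Assumed model: awake power = static_w + dyn_coeff * f**3,
    deep-idle power = idle_w. All default values are hypothetical.
    """
    busy_s = work_gcycles / freq_ghz
    if busy_s > window_s:
        raise ValueError("work does not fit in the window at this frequency")
    awake_w = static_w + dyn_coeff * freq_ghz ** 3
    return awake_w * busy_s + idle_w * (window_s - busy_s)


for f in (1.0, 2.0, 4.0):
    print(f"{f:.0f} GHz: {task_energy_j(f):.2f} J")
```

With these made-up parameters, 2 GHz finishes the job for about 6.1 J against 8.5 J at 1 GHz, because the static floor is paid for half as long. At 4 GHz the dynamic term dominates and the total climbs to about 10.2 J, so the sprint only pays off up to a point.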

Timer coalescing — protecting the silence

Idle is fragile. If ten different processes each wake up at slightly different times once per second, the CPU never gets a long enough quiet stretch to enter deep C-states. Wake, work, sleep, wake, work, sleep — and the deep idle states are never amortized.

Modern kernels fight this with timer coalescing: deferrable timers, timer slack, and tickless idle (NO_HZ_IDLE on Linux) align wakeups so the system handles a burst of activity and then stays quiet. The same idea on Windows lets a laptop battery survive an open browser; without it, one badly-written app polling every 30 ms can halve battery life without ever showing up meaningfully in a CPU% graph. The crime isn’t CPU consumption; it’s preventing rest.

[Figure: Uncoalesced vs coalesced timer wakeups over time. Uncoalesced: the core reaches C1 only and never gets to deep idle. Coalesced: one burst of wakeups followed by a long deep-idle stretch with C6 residency.]
Coalescing the same number of wakeups into one burst lets the CPU enter and stay in deep idle.
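The effect is easy to reproduce with a toy model. The sketch below rounds timer deadlines up to a shared boundary (roughly what timer slack permits) and measures how much of the window falls in gaps long enough for a deep state. The deadlines, the 500 µs slack, and the 300 µs residency threshold are invented for illustration, and the time spent handling each wakeup is ignored.

```python
def coalesce(deadlines_us, slack_us):
    """Round each deadline up to the next multiple of slack_us.

    Timers may fire late but never early, so nearby deadlines collapse
    onto the same wakeup.
    """
    return [-(-d // slack_us) * slack_us for d in deadlines_us]


def deep_idle_us(wakeups_us, window_us, min_residency_us):
    """Total time in idle gaps long enough to justify a deep C-state."""
    points = sorted(set(wakeups_us) | {0, window_us})
    gaps = [b - a for a, b in zip(points, points[1:])]
    return sum(g for g in gaps if g >= min_residency_us)


# Ten hypothetical timers scattered across a 1000 us window.
DEADLINES_US = [60, 150, 240, 330, 480, 520, 610, 700, 850, 940]
WINDOW_US = 1000
MIN_RESIDENCY_US = 300  # assumed target residency of the deep state

print(deep_idle_us(DEADLINES_US, WINDOW_US, MIN_RESIDENCY_US))  # 0
print(deep_idle_us(coalesce(DEADLINES_US, 500), WINDOW_US,
                   MIN_RESIDENCY_US))                            # 1000
```

The same ten timers still fire; they just fire together at 500 µs and 1000 µs, and the two resulting gaps are both long enough to count as deep idle.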

Busy-wait is a lie to the scheduler

This connects to a userspace pattern that matters more than people realize. If an application has nothing to do, it should block: epoll, select, a condition variable, a futex, anything that takes the thread off the runqueue. When it does, the scheduler picks the idle thread, and the core can sleep.

If it instead spins (while (!flag) {}), the thread looks runnable. The scheduler dutifully runs it. The idle thread never gets a turn. The CPU never sleeps. Battery dies, fan spins up, and the CPU% can pass for legitimate load because the spinner looks like it is “doing work.” This is why busy-waiting is strongly discouraged in code that runs on battery-powered or shared hardware: the cost isn’t measured in cycles, it’s measured in lost C-state residency.
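The difference shows up directly in process CPU time. A minimal sketch using only the Python standard library; the two helper functions are hypothetical names, and exact numbers will vary by machine and load.

```python
import threading
import time


def cpu_time_blocking(delay_s):
    """CPU seconds used while blocked until another thread sets an event."""
    ready = threading.Event()
    timer = threading.Timer(delay_s, ready.set)
    start = time.process_time()
    timer.start()
    ready.wait()  # the thread leaves the runqueue until set() is called
    used = time.process_time() - start
    timer.join()
    return used


def cpu_time_spinning(delay_s):
    """CPU seconds used while polling a flag until another thread sets it."""
    state = {"ready": False}
    timer = threading.Timer(delay_s, lambda: state.update(ready=True))
    start = time.process_time()
    timer.start()
    while not state["ready"]:
        pass  # the thread stays runnable for the entire wait
    used = time.process_time() - start
    timer.join()
    return used


if __name__ == "__main__":
    print(f"blocking: {cpu_time_blocking(0.2):.3f} s of CPU")
    print(f"spinning: {cpu_time_spinning(0.2):.3f} s of CPU")
```

Both calls wait the same 0.2 s of wall-clock time. The blocking version should use close to zero CPU, while the spinning version burns roughly the whole interval, and during that interval the core it runs on cannot enter any idle state.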

Hybrid cores: P-cores and E-cores

Recent x86 (Intel 12th gen+) and ARM (big.LITTLE) systems make idle decisions even richer. The scheduler now also chooses which kind of core to wake for a task and which cores to leave asleep.

The optimization target isn’t “balanced utilization” anymore — it’s “as much silicon asleep as possible, as deeply as possible, for as long as possible.”

Why this matters in practice

TL;DR: an idle CPU isn’t spinning. It’s parked on a single instruction (HLT, WFI, or MWAIT) that gates its clocks and waits for a hardware event. The kernel’s idle task picks how deep to sleep based on a prediction of how soon it’ll be needed. On a modern power-managed chip, idle isn’t wasted time; it’s the asset that buys you battery life, thermal headroom, and the boost clock you’ll need next. “Do nothing” is one of the most performance-critical things a chip does.
