Scaling Web Servers — latency, throughput, bandwidth, concurrency & parallelism

2026-05-19 · updated 2026-05-19

Five words that get used interchangeably but mean different things. The simulator gives each one a knob and a live readout so you can watch what happens to the other four when you change one.

The whole scene is a single flow:

clients (req/s) ──→ pipe (bandwidth) ──→ server (N workers, CPU time) ──→ pipe back ──→ clients

Five inputs:

Request arrival rate — how often clients send requests (offered load).
One-way network delay — pure travel time on the wire.
CPU time per request — how long a worker holds the request to do work.
Pipe bandwidth — max bytes-in-flight the link can carry.
Worker count — how many requests can be processed truly simultaneously.

Five outputs (each averaged over the last simulated second):

Latency — round-trip time for one request.
Throughput — completed requests per second.
Bandwidth used — percent of pipe capacity occupied by in-flight bytes.
Concurrency — total in-flight requests right now (traveling + queued + processing).
Parallelism — workers actively running a request right now.
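
The output definitions above can be written down as a small sketch. This is not the simulator's code: the `Snapshot` type and all of its field names are hypothetical, chosen only to mirror the wording of the list.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One instant of the scene. All field names here are hypothetical."""
    traveling: int            # requests or responses currently on the wire
    queued: int               # at the server, waiting for a free worker
    processing: int           # held by a worker right now
    bytes_in_flight: int      # traveling packets x their size
    pipe_capacity_bytes: int  # most bytes the pipe can hold at once


def concurrency(s: Snapshot) -> int:
    # Everything in flight: traveling + queued + processing.
    return s.traveling + s.queued + s.processing


def parallelism(s: Snapshot) -> int:
    # Only the requests a worker is actively running.
    return s.processing


def bandwidth_used_pct(s: Snapshot) -> float:
    # Share of pipe capacity occupied by in-flight bytes.
    return 100.0 * s.bytes_in_flight / s.pipe_capacity_bytes


snap = Snapshot(traveling=3, queued=2, processing=4,
                bytes_in_flight=250_000, pipe_capacity_bytes=1_000_000)
print(concurrency(snap), parallelism(snap), bandwidth_used_pct(snap))
# prints: 9 4 25.0
```

With made-up numbers, nine requests are in flight but only four are being executed, which is the gap between the concurrency and parallelism readouts.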


The six scenarios on the right-hand side

Preset	What it sets up	What you should see
Balanced	50 req/s · 25 ms net · 40 ms CPU · 1 Gbps · 4 workers	Everything moderate. Baseline to compare the others against.
Low latency	5 req/s · 5 ms net · 8 ms CPU · 10 Gbps · 4 workers	Latency drops to ~20 ms. Throughput is low (only 5 req/s offered), bandwidth used is near 0 — but each individual request is fast. The “real-time API for a single user” case.
High throughput	400 req/s · 25 ms net · 30 ms CPU · 10 Gbps · 16 workers	Throughput climbs near 400 req/s. Individual latency is still modest. The “API at peak” case — what cloud auto-scalers chase.
Bandwidth-bottlenecked	200 req/s · narrow 50 Mbps pipe · 10 ms CPU · 16 workers	Workers are mostly idle, queue stays empty, but throughput tops out far below 200 req/s. Bandwidth-used pegs near 100 %. The pipe is the limit; adding workers does nothing.
Overloaded	250 req/s · 4 workers · 80 ms CPU	Offered load (250 req/s × 80 ms = 20 workers' worth of CPU time per second) exceeds capacity (4 workers). The queue grows without bound, latency keeps climbing, and throughput plateaus at 4 / 0.08 s = 50 req/s.
Little’s Law	100 req/s · 50 ms net · 100 ms CPU · 8 workers	Steady state. The formula strip at the bottom shows `Concurrency ≈ Throughput × Latency`. The error % should stay under 20.
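
The arithmetic behind several of these presets fits in a few lines. The sketch below is a back-of-the-envelope estimate, not the simulator's logic; the function names are hypothetical, and the 100 kB payload used for the pipe estimate is an assumed figure, since the page does not state a request size.

```python
def worker_demand(rate_per_s: float, cpu_s: float) -> float:
    """Average number of workers the offered load would keep busy."""
    return rate_per_s * cpu_s


def worker_capacity(workers: int, cpu_s: float) -> float:
    """Most requests per second the worker pool can complete."""
    return workers / cpu_s


def pipe_capacity(bandwidth_bps: float, bytes_per_request: float) -> float:
    """Most requests per second the pipe can carry (8 bits per byte)."""
    return bandwidth_bps / (8 * bytes_per_request)


def uncontended_latency(net_s: float, cpu_s: float) -> float:
    """Round trip with no queueing: out, CPU time, back."""
    return 2 * net_s + cpu_s


# Overloaded: 250 req/s, 80 ms CPU, 4 workers.
print(f"{worker_demand(250, 0.080):.1f} workers needed")    # 20.0 workers needed
print(f"{worker_capacity(4, 0.080):.1f} req/s at most")     # 50.0 req/s at most

# Bandwidth-bottlenecked: 50 Mbps pipe, 10 ms CPU, 16 workers.
ASSUMED_BYTES = 100_000  # hypothetical payload size, not from the page
print(f"{pipe_capacity(50e6, ASSUMED_BYTES):.1f} req/s through the pipe")  # 62.5
print(f"{worker_capacity(16, 0.010):.1f} req/s through the workers")       # 1600.0

# Low latency: 5 ms each way, 8 ms CPU.
print(f"{uncontended_latency(0.005, 0.008) * 1000:.0f} ms round trip")     # 18 ms
```

Whichever capacity is smallest sets the ceiling on throughput. Under the assumed payload size the pipe ceiling sits far below both the 200 req/s offered and what 16 workers could handle, which matches the idle-workers behaviour described for that preset.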

What’s “honest” about it

Little’s Law is enforced by physics, not by code. The simulator never computes Concurrency = Throughput × Latency anywhere; concurrency, throughput, and latency are each measured independently. The formula strip at the bottom is a sanity check that the simulated queueing model actually obeys the law.
Bandwidth and CPU compete. Bytes-in-flight is computed from the number of packets currently traveling × their size. If you drop the pipe to 50 Mbps with 16 workers, the workers will end up mostly idle — the pipe can’t feed them fast enough.
Concurrency and parallelism are different things. Concurrency includes requests still traveling or waiting in the queue for a worker. Parallelism counts only the requests being executed right now. The two words are interchangeable in casual speech but show up as two different numbers in the metrics strip.
Bandwidth isn’t speed. The “pipe” never makes a single request arrive faster — it caps how many can be in transit at once. That’s why doubling your home internet doesn’t make a single API call snappier.
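
To see what "measured independently" can look like, here is a toy model that is much simpler than the simulator and not taken from it. It assumes evenly spaced arrivals, a fixed CPU time, a first-come-first-served worker pool, and no bandwidth limit; `simulate` and `measure` are hypothetical names. Concurrency comes from time spent in flight, throughput from counting completions, latency from averaging round trips, and only at the end are the three compared.

```python
import heapq


def simulate(rate, net_s, cpu_s, workers, duration_s):
    """Return (sent, received) times for evenly spaced requests."""
    free_at = [0.0] * workers       # min-heap: when each worker is next free
    spans = []
    for i in range(int(rate * duration_s)):
        sent = i / rate
        at_server = sent + net_s
        # Take the earliest-free worker; wait in the queue if it is still busy.
        start = max(at_server, heapq.heappop(free_at))
        done = start + cpu_s
        heapq.heappush(free_at, done)
        spans.append((sent, done + net_s))
    return spans


def measure(spans, t0, t1):
    """Measure the three quantities separately over the window [t0, t1)."""
    window = t1 - t0
    # Concurrency: time-averaged in-flight count = total overlap / window.
    overlap = sum(max(0.0, min(d, t1) - max(s, t0)) for s, d in spans)
    concurrency = overlap / window
    # Throughput and latency: only requests that finished inside the window.
    finished = [(s, d) for s, d in spans if t0 <= d < t1]
    throughput = len(finished) / window
    latency = sum(d - s for s, d in finished) / len(finished)
    return concurrency, throughput, latency


# Balanced preset values: 50 req/s, 25 ms net, 40 ms CPU, 4 workers.
spans = simulate(rate=50, net_s=0.025, cpu_s=0.040, workers=4, duration_s=10.0)
c, x, w = measure(spans, t0=2.0, t1=8.0)
error_pct = 100 * abs(c - x * w) / c
print(f"concurrency={c:.2f}  throughput={x:.1f}  latency={w * 1000:.0f} ms")
print(f"throughput x latency={x * w:.2f}  error={error_pct:.2f}%")
```

Nothing in `measure` multiplies throughput by latency to obtain concurrency, yet the two sides agree closely once the window sits inside steady state. The window skips the first two seconds so warm-up does not skew the averages.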

Controls

Key / button	Action
`space`	Pause / resume the simulation
`r`	Reset all metric buffers
slider in the toolbar	Time multiplier (×0.25 – ×3) — slows or speeds up the sim
any preset button on the right	Snap all 5 knobs to a configuration
any of the 5 knobs	Change one input in isolation
mouse hover on a metric card	Plain-English definition
