Scaling Web Servers — latency, throughput, bandwidth, concurrency & parallelism
Five words that get used interchangeably and aren’t. The simulator gives each one a knob and a live readout so you can watch what happens to the other four when you change one.
The whole scene is a single flow:
clients (req/s) ──→ pipe (bandwidth) ──→ server (N workers, CPU time) ──→ pipe back ──→ clients
Five inputs:
- Request arrival rate — how often clients send requests (offered load).
- One-way network delay — pure travel time on the wire.
- CPU time per request — how long a worker holds the request to do work.
- Pipe bandwidth — max bytes-in-flight the link can carry.
- Worker count — how many requests can be processed truly simultaneously.
Five outputs (averaged over the last 1 simulated second):
- Latency — round-trip time for one request.
- Throughput — completed requests per second.
- Bandwidth used — percent of pipe capacity occupied by in-flight bytes.
- Concurrency — total in-flight requests right now (traveling + queued + processing).
- Parallelism — workers actively running a request right now.
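The last two readouts are easy to mix up, so here is a minimal sketch of how they could be counted at a single instant. The `Snapshot` class and `snapshot` helper are hypothetical names for illustration, not the simulator's own code; the only assumption is that a request at the server is either held by a worker or waiting for one.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    traveling: int   # requests or responses currently on the wire
    queued: int      # at the server, waiting for a free worker
    processing: int  # held by a worker right now

    @property
    def concurrency(self) -> int:
        # Everything in flight, wherever it is.
        return self.traveling + self.queued + self.processing

    @property
    def parallelism(self) -> int:
        # Only the requests a worker is actually executing.
        return self.processing


def snapshot(traveling: int, at_server: int, workers: int) -> Snapshot:
    # Workers cap how many requests run at once; the rest wait in the queue.
    processing = min(at_server, workers)
    return Snapshot(traveling, at_server - processing, processing)


s = snapshot(traveling=6, at_server=10, workers=4)
print(s.concurrency, s.parallelism)
```

Parallelism can never exceed the worker count, while concurrency has no such ceiling.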
The six scenarios on the right-hand side
| Preset | What it sets up | What you should see |
|---|---|---|
| Balanced | 50 req/s · 25 ms net · 40 ms CPU · 1 Gbps · 4 workers | Everything moderate. Baseline to compare the others against. |
| Low latency | 5 req/s · 5 ms net · 8 ms CPU · 10 Gbps · 4 workers | Latency drops to ~20 ms. Throughput is low (only 5 req/s offered), bandwidth used is near 0 — but each individual request is fast. The “real-time API for a single user” case. |
| High throughput | 400 req/s · 25 ms net · 30 ms CPU · 10 Gbps · 16 workers | Throughput climbs near 400 req/s. Individual latency is still modest. The “API at peak” case — what cloud auto-scalers chase. |
| Bandwidth-bottlenecked | 200 req/s · narrow 50 Mbps pipe · 10 ms CPU · 16 workers | Workers are mostly idle, queue stays empty, but throughput tops out far below 200 req/s. Bandwidth-used pegs near 100 %. The pipe is the limit; adding workers does nothing. |
| Overloaded | 250 req/s · 4 workers · 80 ms CPU | Offered load (250 req/s × 80 ms = 20 worker-seconds of work per second) exceeds capacity (4 workers supply only 4 worker-seconds per second). Queue grows without bound; latency climbs; throughput plateaus at ≈ 4 / 0.08 ≈ 50 req/s. |
| Little’s Law | 100 req/s · 50 ms net · 100 ms CPU · 8 workers | Steady state. The formula strip at the bottom shows Concurrency ≈ Throughput × Latency. The error % should stay under 20. |
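The arithmetic in the Overloaded row applies to any preset that lists a rate, a CPU time and a worker count. The sketch below repeats it for three rows of the table; `capacity_check` is a hypothetical helper, and it ignores the pipe entirely, so it says nothing about the bandwidth-bottlenecked case.

```python
def capacity_check(rate_per_s: float, cpu_ms: float, workers: int):
    """Return (offered load in workers, max throughput in req/s, overloaded?)."""
    # Average number of workers the offered load would keep busy.
    offered = rate_per_s * cpu_ms / 1000
    # Each worker finishes 1000 / cpu_ms requests per second.
    max_throughput = workers * 1000 / cpu_ms
    return offered, max_throughput, offered > workers


# (req/s, CPU ms, workers), copied from the preset table.
presets = {
    "Balanced": (50, 40, 4),
    "High throughput": (400, 30, 16),
    "Overloaded": (250, 80, 4),
}

for name, args in presets.items():
    offered, cap, overloaded = capacity_check(*args)
    print(f"{name}: needs {offered:.1f} workers, cap {cap:.0f} req/s, overloaded={overloaded}")
```

When the offered load exceeds the worker count, the queue can only grow, which is the behavior the Overloaded preset is built to show.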
What’s “honest” about it
- Little’s Law is enforced by physics, not by code. The simulator doesn’t compute Concurrency = Throughput × Latency anywhere; both sides are measured independently. The formula at the bottom is a sanity check that the simulated queueing model actually obeys the law.
- Bandwidth and CPU compete. Bytes-in-flight is computed as the number of packets currently traveling × their size. If you drop the pipe to 50 Mbps with 16 workers, the workers end up mostly idle because the pipe can’t feed them fast enough.
- Concurrency and parallelism are different things. Concurrency includes requests waiting for a worker. Parallelism counts only the requests being executed right now. Same word in casual speech, two different numbers in the metrics strip.
- Bandwidth isn’t speed. The “pipe” never makes a single request arrive faster; it only caps how many can be in transit at once. That’s why doubling your home internet bandwidth doesn’t make a single API call snappier.
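The first point can be reproduced outside the simulator. The sketch below is a separate toy model, not the simulator's code: it assumes Poisson arrivals, a fixed CPU time, a FIFO queue and no bandwidth limit, and borrows the Balanced preset's numbers. Concurrency is sampled by counting in-flight requests at regular instants, while throughput and latency come from completed requests, so the three numbers are never derived from one another. All function names are hypothetical.

```python
import bisect
import heapq
import random


def simulate(rate, net_delay, cpu_time, workers, duration, seed=0):
    """Return (sent, done) timestamps for every request, in send order."""
    rng = random.Random(seed)
    free_at = [0.0] * workers           # min-heap of times each worker goes idle
    sent, done = [], []
    t = 0.0
    while True:
        t += rng.expovariate(rate)      # assumed Poisson arrivals
        if t >= duration:
            break
        at_server = t + net_delay
        # FIFO: the next request takes whichever worker frees up first.
        start = max(at_server, heapq.heappop(free_at))
        finish = start + cpu_time
        heapq.heappush(free_at, finish)
        sent.append(t)
        done.append(finish + net_delay)  # response travels back to the client
    return sent, done


def measure(sent, done, t0, t1, step=0.01):
    """Measure concurrency, throughput and latency over the window [t0, t1)."""
    latencies = [d - s for s, d in zip(sent, done) if t0 <= d < t1]
    throughput = len(latencies) / (t1 - t0)
    latency = sum(latencies) / len(latencies)
    done_sorted = sorted(done)
    n = int((t1 - t0) / step)
    # In flight at time t = sent so far minus completed so far.
    samples = [
        bisect.bisect_right(sent, t0 + i * step)
        - bisect.bisect_right(done_sorted, t0 + i * step)
        for i in range(n)
    ]
    return sum(samples) / n, throughput, latency


sent, done = simulate(rate=50, net_delay=0.025, cpu_time=0.040, workers=4, duration=200.0)
L, X, W = measure(sent, done, t0=20.0, t1=180.0)
error = abs(L - X * W) / L
print(f"concurrency={L:.2f}  throughput*latency={X * W:.2f}  error={error:.1%}")
```

The window skips the first and last 20 simulated seconds so that warm-up and requests cut off at the end do not skew the comparison.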
Controls
| Key / button | Action |
|---|---|
| space | Pause / resume the simulation |
| r | Reset all metric buffers |
| slider in the toolbar | Time multiplier (×0.25 – ×3) — slows or speeds up the sim |
| any preset button on the right | Snap all 5 knobs to a configuration |
| any of the 5 knobs | Change one input in isolation |
| mouse hover on a metric card | Plain-English definition |