Scaling Web Servers — latency · throughput · bandwidth · concurrency

Knobs · adjust the inputs

Request arrival rate 161 req/s

How often clients issue new requests. This is the offered load on the system.

One-way network delay 50 ms

Pure travel time on the wire, one way. The minimum possible RTT is 2 × delay.
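As a rough worked example, the latency floor for a single request on an idle system can be sketched by adding the round trip to the service times from the other knobs. The function name is hypothetical, and the model assumes no queueing and that CPU and DB time add serially; the simulator itself may compute this differently.

```python
def min_latency_ms(one_way_delay_ms: float, cpu_ms: float, db_ms: float = 0.0) -> float:
    """Latency floor with no queueing: round trip on the wire plus service time."""
    rtt_ms = 2 * one_way_delay_ms  # minimum possible RTT is 2 x one-way delay
    return rtt_ms + cpu_ms + db_ms


# Default knob values: 50 ms one-way delay, 66 ms CPU, 40 ms DB wait.
print(min_latency_ms(50, 66))      # wire + CPU only
print(min_latency_ms(50, 66, 40))  # wire + CPU + DB wait
```

With the defaults, the wire alone accounts for 100 ms, so no amount of server tuning brings a request below that.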

CPU time per request 66 ms

How long a worker spends on the CPU for each request. This is the CPU-bound part of latency.

Pipe bandwidth 20 Mbps

Maximum data rate the pipe can carry, in bits per second. It caps throughput regardless of how many workers you have.
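To see how the pipe caps throughput, a minimal sketch can divide the link rate by the bits each request moves. The payload size of 12,500 bytes is a hypothetical value chosen for illustration (the page does not state one), and the function name is likewise an assumption.

```python
def bandwidth_cap_rps(bandwidth_mbps: float, bytes_per_request: float) -> float:
    """Upper bound on requests/second imposed by the pipe alone."""
    bits_per_second = bandwidth_mbps * 1_000_000  # Mbps means 10^6 bits per second
    bits_per_request = bytes_per_request * 8
    return bits_per_second / bits_per_request


# 20 Mbps pipe, hypothetical 12,500-byte payload per request.
print(bandwidth_cap_rps(20, 12_500))
```

Under that assumed payload the pipe allows about 200 req/s; doubling the payload halves the cap, and adding workers does not change it.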

Servers (behind LB) horizontal scaling 3 hosts

Horizontal scaling (scale out). Add more app hosts behind a round-robin load balancer. Each host has its own CPU + RAM.

Workers per server vertical scaling 4 procs

Vertical scaling (scale up). App worker processes per server. Each is a separate CPU lane and uses ~220 MB of RAM (Puma cluster / Unicorn model).
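The RAM cost of adding workers is simple arithmetic, sketched below. The 4096 MB figure comes from the RAM readout on this page; treating it as a per-server cap, and ignoring memory used by the OS or by anything other than workers, are assumptions of this sketch.

```python
WORKER_RAM_MB = 220  # approximate per-worker footprint quoted above
RAM_CAP_MB = 4096    # cap shown on the RAM readout; assumed here to be per server


def worker_ram_mb(workers: int, per_worker_mb: int = WORKER_RAM_MB) -> int:
    """RAM consumed by the worker processes on one server."""
    return workers * per_worker_mb


def max_workers(ram_cap_mb: int = RAM_CAP_MB, per_worker_mb: int = WORKER_RAM_MB) -> int:
    """Largest worker count that fits under the cap (workers only)."""
    return ram_cap_mb // per_worker_mb


print(worker_ram_mb(4))  # RAM used by the default 4 workers
print(max_workers())     # how many workers fit before the cap
```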

Threads per worker vertical scaling 4 slots

Concurrency within a worker. Only one thread per worker runs CPU at a time (Ruby's Global VM Lock, the GVL). Extra threads pay off only when a request blocks on I/O: while one thread waits on the DB, another runs CPU. With zero DB time they add nothing; with I/O they lift throughput toward the CPU ceiling.

Max queue depth (backpressure) 50 requests

Hard cap on requests waiting for a worker. Once the queue is full, the LB returns 503 Service Unavailable immediately: backpressure that protects downstream services from overload.
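The cap can be sketched as a bounded FIFO queue that refuses new work when full. The class and method names below are hypothetical and only illustrate the idea; a real load balancer would be configured for this rather than coded by hand.

```python
from collections import deque


class BoundedQueue:
    """FIFO wait queue with a hard depth cap; rejections map to HTTP 503."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self.waiting = deque()
        self.rejected = 0

    def offer(self, request_id: str) -> bool:
        """Queue the request, or reject it at once if the queue is full."""
        if len(self.waiting) >= self.max_depth:
            self.rejected += 1  # caller answers 503 Service Unavailable
            return False
        self.waiting.append(request_id)
        return True

    def take(self) -> str:
        """Hand the oldest waiting request to a free worker."""
        return self.waiting.popleft()


queue = BoundedQueue(max_depth=50)
statuses = [queue.offer(f"req-{i}") for i in range(60)]
print(statuses.count(True), "queued,", queue.rejected, "rejected")
```

Rejecting immediately keeps waiting time bounded: a request either gets one of the 50 queue slots or learns right away that it should retry later.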

DB connection pool 24 conns

Shared pool of Postgres connections, bounded by max_connections. A thread doing I/O checks out a connection; when the pool is exhausted, further queries wait for a free connection, and a blocking migration can make them pile up.
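Two quick estimates help judge whether a pool is large enough: the most connections the app could ever request (one per thread), and the mean number actually in use, which follows from Little's Law applied to the DB. Both function names are hypothetical, and the 500 ms stalled query time is an illustrative assumption rather than a value from the page.

```python
def max_db_clients(servers: int, workers: int, threads: int) -> int:
    """Worst case: every thread on every worker holds a connection at once."""
    return servers * workers * threads


def mean_connections_in_use(rate_rps: float, query_ms: float) -> float:
    """Little's Law at the DB: connections busy = arrival rate x query time."""
    return rate_rps * query_ms / 1000


POOL_SIZE = 24

print(max_db_clients(3, 4, 4))            # threads that could ask for a connection
print(mean_connections_in_use(161, 40))   # normal 40 ms queries
print(mean_connections_in_use(161, 500))  # hypothetical 500 ms queries during a lock
```

At the default settings the pool is mostly idle on average, but the thread count exceeds the pool size, so a slow stretch at the DB is enough to exhaust it.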

DB query time (I/O) 40 ms

Mean time a request waits on the database (jittered per query). During this wait the thread releases the GVL, so another thread in the same worker can run CPU — this is what makes threads pay off.
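The effect of threads can be approximated with a simple capacity model: each thread handles one request at a time, but the GVL limits the whole worker to one CPU's worth of work. This is a back-of-envelope sketch with a hypothetical function name; it assumes each worker has a core to itself and ignores jitter, queueing, and the DB pool limit.

```python
def worker_capacity_rps(threads: int, cpu_ms: float, io_ms: float) -> float:
    """Approximate requests/second one worker process can sustain."""
    cpu_ceiling = 1000 / cpu_ms                        # GVL: one thread on CPU at a time
    thread_limit = threads * 1000 / (cpu_ms + io_ms)   # each thread serves one request at a time
    return min(cpu_ceiling, thread_limit)


# Default knobs: 66 ms CPU and 40 ms DB wait per request.
for threads in (1, 2, 4):
    print(threads, "threads:", round(worker_capacity_rps(threads, 66, 40), 2), "req/s")

# With zero DB time, extra threads add nothing.
print(worker_capacity_rps(1, 66, 0) == worker_capacity_rps(4, 66, 0))
```

In this model a single thread leaves the CPU idle during every DB wait, a second thread is already enough to reach the CPU ceiling at the default values, and further threads change nothing until DB time grows.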

Database event

Inject a DB event: a row lock serialises writes; a blocking CREATE INDEX takes a lock that stalls writes until it finishes (connections pile up, RAM climbs); CREATE INDEX CONCURRENTLY builds the index without blocking writes, but queries run slower while it builds.

Flow visualizer N requests in flight

Latency

—ms

RTT for one request

Throughput

—req/s

completed / second

Bandwidth

—% used

— of 20 Mbps

Concurrency

—

in-flight requests

Parallelism

—

/ — workers busy

CPU

—%

mean across servers

RAM

—MB

/ 4096 MB cap

503s

—

rejected requests

Little's Law: Concurrency = Throughput × Latency — ≈ — × —
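Little's Law can be checked by hand with a few lines. The function name is hypothetical, and the 206 ms latency is an assumed unloaded figure (2 × 50 ms on the wire, 66 ms CPU, 40 ms DB), not a value read from the simulator.

```python
def littles_law_concurrency(throughput_rps: float, latency_ms: float) -> float:
    """Mean requests in flight = throughput x latency (latency converted to seconds)."""
    return throughput_rps * latency_ms / 1000


# 161 req/s completing with an assumed 206 ms end-to-end latency.
print(littles_law_concurrency(161, 206))
```

The law holds for long-run averages in a stable system, so if latency rises while throughput stays flat, the number of in-flight requests must rise with it.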