Local Weather Cluster, how to run ICON DWD at home?

A numerical weather model isn’t a webapp. You can’t horizontally scale it by adding stateless pods behind a load balancer — every rank has to talk to its neighbours every timestep, halos travel at hundreds of MB/s, and the slowest rank gates the whole forecast. So when one 32-core box wasn’t fast enough to run an ICON 2 km LAM forecast for Poland in under a wall hour, the answer wasn’t “add a queue and shard.” It was: build a real cluster, two hosts, one tightly coupled MPI job, and make it idempotent enough that ./run.sh start is the only command anyone has to remember.

This post walks through the implementation: the model, the hardware, the image, the launcher, the shared filesystem, the orchestration, and the numbers it produced.

The model and why a cluster

ICON is the operational global / limited-area weather model developed by DWD (Deutscher Wetterdienst) and MPI-M. Limited-Area Mode runs it over a regional grid — here a triangle mesh covering Poland at the IMGW R5B9 refinement (~2 km, 254 212 cells, 60 vertical levels). DWD’s own ICON-EU at 6.5 km is the source for initial conditions and lateral boundaries; ICON-LAM steps the local domain forward at higher resolution and resolves convection explicitly.

The dycore is non-hydrostatic on a triangular C-grid. A 20-second timestep with 5 dynamics substeps and a single rank-set means 180 timesteps per forecast hour (3600 s / 20 s), each one ending in a halo exchange. On 32 ranks of a Ryzen 9950X-class box this runs at ~0.142 timesteps per second, ~2.8× faster than wall-clock. Acceptable. But a 24 h forecast still needs ~8.5 h wall time, and the goal was twice-daily releases. So: spread it.

[Figure: ICON-LAM Poland domain at R5B9 (~2 km), spanning roughly 14°E to 25°E and 48°N to 55°N, with Warsaw, Kraków, Rzeszów and Gdańsk marked. Compute domain: 254 212 cells, with a 10-cell nudging zone that is cropped in postprocessing to hide the LBC seam.]
The R5B9 LAM grid covering Poland. Cells are triangles ~2 km on a side; the dashed inset is the boundary nudging zone where the LAM blends back to its ICON-EU host.

The two boxes

Two machines wired with a single direct 10 GbE link — no switch in the middle.

| Host | Arch | Cores | RAM | Role |
|---|---|---|---|---|
| beelink | x86_64 (Ryzen AI Max+ 395) | 32 | 124 GiB | head: runs mpiexec, NFS server, prep/postprocess, web |
| spark | aarch64 (DGX Spark / Grace) | 20 | 119 GiB | peer: runs additional MPI ranks |

Two architectures, deliberately. The spark is a Grace-based aarch64 dev kit that happens to be on the desk; throwing it in adds 20 cores of unused capacity for free. The catch — which becomes the central engineering choice below — is that MPI has to bridge x86_64 ↔ aarch64 in the same job.

The 10 GbE link is configured statically:

beelink: enp197s0f0   10.10.10.1/30
spark:   enP7s7        10.10.10.2/30

Both ends keep the link up via a NetworkManager profile (icon-fast, autoconnect=yes). RTT is ~0.3–0.7 ms; raw SSH throughput is ~520 MB/s, which is encryption-bound, not link-bound. WiFi and the Tailscale fallback path are explicitly not used for cluster traffic: MPI halo exchanges are latency-sensitive and lossy paths cause synchronous stalls across every rank.

[Figure: cluster topology, two hosts on a direct 10 GbE link (10.10.10.1 ⇆ 10.10.10.2, no switch, 0.3–0.7 ms RTT, 520 MB/s). beelink, the head (x86_64, 32 cores, 124 GiB), runs mpiexec with the Hydra launcher, 32 MPI ranks, the NFS server, the prep/postprocess containers, the web server (nginx :8080) and the auto-render watcher. spark, the peer (aarch64, 20 cores, 119 GiB), runs the in-container sshd on :2222, 20 MPI ranks of the aarch64 ICON binary and the NFS client (mounts /mnt/SSD1), shares /opt/icon-data via a local clone, and has no compose stack. WiFi and Tailscale paths are not used.]
Two boxes, one cable, two arches. Everything else (NFS, SSH, MPI) rides this single 10 GbE link.

The MPI choice that everything else flowed from

There are two practical MPI implementations on Linux: OpenMPI and MPICH. They have the same API, different ABIs, and comparable performance. For a homogeneous cluster you can use either.

For x86_64 ↔ aarch64 in one job, you cannot use OpenMPI 5. The intercommunicator setup it relies on for ICON’s asynchronous I/O server crashes immediately:

ompi_comm_get_rprocs: Not supported

OpenMPI dropped heterogeneous-arch support in the 5.x line. MPICH 4.2 keeps it. So the whole image switched to MPICH — not just the cluster image, also the single-host one, for consistency. This is the choice that everything below cascades from: per-arch images, SSH-based Hydra launcher, in-container sshd, custom port, manual host list.

If both your boxes were the same architecture, none of the next four sections would exist — you’d just use one image and a hostfile.

A per-arch, multi-stage image

The forecast container has two jobs: compile ICON against MPICH (once), then run the binary with the right environment at forecast time. Splitting those into a multi-stage Dockerfile keeps the runtime image small (no compilers, no source tree) and makes the build cacheable per-arch.

# Stage 1 — build ICON against MPICH
FROM debian:stable-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc g++ gfortran make m4 perl python3 cmake \
        mpich libmpich-dev \
        libnetcdf-dev libnetcdff-dev \
        libblas-dev liblapack-dev \
        libeccodes-dev libfyaml-dev libxml2-dev
WORKDIR /build
RUN --mount=type=bind,from=icon-model,target=/src,readonly \
    /src/config/generic/gcc && \
    make -j"$(nproc)" 2>&1 | tail -30 && \
    install -D -m 755 bin/icon /opt/icon-out/icon

# Stage 2 — runtime only
FROM debian:stable-slim
ENV HYDRA_LAUNCHER_EXTRA_ARGS="-p 2222 -i /root/.ssh/id_ed25519 \
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -o LogLevel=ERROR"
RUN apt-get update && apt-get install -y --no-install-recommends \
        mpich libmpich-dev libnetcdf-dev libnetcdff-dev \
        libgomp1 libgfortran5 \
        openssh-server openssh-client \
        python3 gettext-base
COPY --from=builder /opt/icon-out/icon /opt/icon/bin/icon
COPY forecast/sshd_mpi.conf     /etc/ssh/sshd_mpi_config
COPY forecast/entrypoint-mpi.sh /usr/local/bin/entrypoint-mpi.sh
ENTRYPOINT ["/usr/local/bin/entrypoint-mpi.sh"]

Two things to internalise:

  1. The image is per-arch. docker build on beelink produces an x86_64 image, on spark an aarch64 image. The tag (icon-pl-forecast-mpi:latest) is the same on both hosts; the underlying manifests are different. We rebuild on both hosts after any change to ICON source — docker buildx could produce a multi-arch manifest, but the source tree mounted via --build-context icon-model=… is large and the wins aren’t worth it.
  2. The build context root is /mnt/crucial2/projects/leometeo — not the icon-lam-pl/ subdir — so the Dockerfile’s --mount=type=bind,from=icon-model can reach a sibling directory. Build is invoked with --build-context icon-model=$PWD/icon-model.

The legacy icon-build service in docker-compose.yml that produced a shared icon_bin/ volume is not used by the cluster path. The binary lives inside the per-arch image. Single-host compose runs still use it, for backwards compatibility.

Cross-container SSH for the Hydra launcher

MPICH’s process manager is Hydra, and its default cross-host launcher is plain ssh. To launch ranks inside a peer container, the head’s mpiexec needs to SSH into the peer container — not into the host. Standard sshd on the host is on :22; ours runs inside the container on :2222.

# forecast/sshd_mpi.conf
Port 2222
ListenAddress 0.0.0.0
PermitRootLogin prohibit-password
PubkeyAuthentication yes
AuthorizedKeysFile /root/.ssh/authorized_keys
PasswordAuthentication no
UsePAM no
AcceptEnv OMPI_* PMIX_* PMI_* HYDRA_* OMP_*
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key

The container runs --network=host, so the port is real and reachable as spark:2222 from outside. The entrypoint stages a keypair, generates host keys idempotently, and starts sshd before it exec's the forecast command:

# forecast/entrypoint-mpi.sh
if [ "${MPI_DISTRIBUTED:-0}" = "1" ]; then
    install -d -m 700 /root/.ssh
    install -m 600 /root/.ssh-mpi/id_ed25519      /root/.ssh/id_ed25519
    install -m 644 /root/.ssh-mpi/id_ed25519.pub  /root/.ssh/id_ed25519.pub
    install -m 600 /root/.ssh-mpi/authorized_keys /root/.ssh/authorized_keys
    ssh-keygen -A >/dev/null                           # idempotent
    /usr/sbin/sshd -f /etc/ssh/sshd_mpi_config
fi
exec "$@"

The keypair lives at ~/.ssh-mpi/ on each host (id_ed25519 + id_ed25519.pub + authorized_keys — both hosts trust the same single key). It’s mounted into the container read-only as /root/.ssh-mpi. The container’s /etc/hosts is patched at start-up to point beelink and spark at the 10 GbE addresses:

docker run --add-host beelink:10.10.10.1 --add-host spark:10.10.10.2

That last part is what routes MPI traffic over the fast link without anyone having to configure routes inside the container.
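To make the pieces above concrete, here is a minimal Python sketch that assembles one `docker run` argument list from the host map. The link addresses, image tag and `~/.ssh-mpi` mount come from the text; the function name, the `--rm` flag and passing `MPI_DISTRIBUTED=1` with `-e` are assumptions for illustration, not the actual run.sh logic.

```python
import os

# Link addresses and image tag as quoted in the post.
CLUSTER_HOSTS = {"beelink": "10.10.10.1", "spark": "10.10.10.2"}
IMAGE = "icon-pl-forecast-mpi:latest"


def docker_run_args(command, home=None):
    """Build a docker run argv that is the same on every host (hypothetical helper)."""
    home = home or os.path.expanduser("~")
    args = ["docker", "run", "--rm", "--network=host"]
    # One --add-host per cluster member pins both names to the 10 GbE link.
    for name, addr in sorted(CLUSTER_HOSTS.items()):
        args += ["--add-host", f"{name}:{addr}"]
    # Shared keypair, mounted read-only; the entrypoint copies it into /root/.ssh.
    args += ["-v", f"{home}/.ssh-mpi:/root/.ssh-mpi:ro"]
    # Assumed way of switching on the sshd branch of the entrypoint.
    args += ["-e", "MPI_DISTRIBUTED=1"]
    return args + [IMAGE] + list(command)


print(" ".join(docker_run_args(["sleep", "infinity"], home="/home/user")))
```

Because the host map is the only input, the same argument list can be generated on either machine, which is what keeps the two invocations in step.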

[Figure: how MPICH Hydra launches ranks across containers. mpiexec in the beelink container (-launcher ssh) forks 32 local ranks through hydra_pmi_proxy and opens an ssh connection to spark:2222 over 10.10.10.0/30 using /root/.ssh/id_ed25519. sshd in the spark container, reachable thanks to --network=host, accepts the same public key and starts hydra_pmi_proxy, which launches 20 aarch64 icon processes. HYDRA_LAUNCHER_EXTRA_ARGS carries "-p 2222 -i /root/.ssh/id_ed25519 -o StrictHostKeyChecking=no …".]
Hydra opens an SSH connection into the peer container's sshd on :2222, then spawns a proxy that fork-execs the per-rank ICON processes.

A shared filesystem nobody talks about

ICON writes one large NetCDF file per rank-output plus a few central namelists, but the inputs are the load-bearing thing: every rank reads the same grid file (lam_imgw_R5B9.nc), the same EXTPAR (orography, land cover), the same per-cycle IC file, and 24 hourly LBC files. Having two hosts each maintain their own copy is a coherence headache no one needs.

So beelink runs an NFS server. It exports exactly the two paths that the job has to read and write from both hosts:

# /etc/exports.d/icon.exports     ← note the .exports extension
/mnt/SSD1/icon-lam-pl                                                      10.10.10.0/30(rw,sync,no_subtree_check,no_root_squash)
/mnt/crucial2/projects/leometeo/icon-lam-pl-runtime/forecast_work_poland   10.10.10.0/30(rw,sync,no_subtree_check,no_root_squash)

On spark those paths are NFS-mounted at the same absolute paths, so containers see identical mount points on both hosts and the bind-mount lines in docker run are byte-identical.

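Two gotchas that ate hours:

  1. Files under /etc/exports.d/ are only picked up if their name ends in .exports.
  2. /mnt/SSD1 had to be reformatted as ext4, because exFAT cannot be NFS-exported.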

The static read-only inputs (/opt/icon-data, /opt/icon-vct, /opt/forecast) are spark-local clones rather than NFS — they’re tiny, immutable, and faster to read off local NVMe. The path layout is faked with sudo ln -sfn so the bind-mount lines still work.

[Figure: what is NFS-shared and what is host-local. NFS-exported from beelink (rw, same absolute paths on both hosts): /mnt/SSD1/icon-lam-pl/ (grid_extpar/, poland/{prep_data,forecast_out,output}, tracka/) and /mnt/crucial2/.../icon-lam-pl-runtime/forecast_work_poland/. beelink-local: /mnt/crucial2/.../icon-model/ and /mnt/crucial2/.../icon-lam-pl/forecast/ (namelist templates, scripts). spark-local: ~/icon-model and ~/icon_bin, linked with sudo ln -sfn into /mnt/crucial2/.../{icon-model,icon_bin}, plus forecast/ (rsync'd after edits). ~/.ssh-mpi/ holds the shared keypair and is mounted read-only into containers. /mnt/SSD1 had to be reformatted as ext4 because exFAT cannot be NFS-exported.]
Big writable working dirs go over NFS; tiny static read-only inputs (~100 MB) stay local on each host.

Bridging open data to ICON: the missing vertical remap

Before any of the cluster orchestration matters, the forecast has to have initial conditions. And this is the place where running ICON on a home machine almost broke before it started.

ICON’s official preprocessor for limited-area runs is iconremap, shipped as part of dwd_icon_tools. It’s the canonical way to turn a global-model analysis into a LAM IC. It expects, however, high-resolution IFS analyses or full ICON analysis fields — neither of which is on a public server. Getting them requires a DWD or ECMWF account and an institutional context that a home setup doesn’t have.

What is public, on the DWD Open Data Server, is ICON-EU forecast output: GRIB2 files at hourly cadence, 6.5 km resolution, model levels 10–74 of the global ICON-EU vertical grid. CDO can take those and do a horizontal remap onto an arbitrary LAM grid with a single remapdis call. Solved problem.

What CDO does not do, and what dwd_icon_tools/iconremap would have done, is the vertical step:

  1. Interpolate every atmospheric field (T, U, V, W, P, QV, QC, QI, …) from ICON-EU’s height levels onto the target LAM’s terrain-following SLEVE coordinate.
  2. Handle the case where the LAM surface lies below ICON-EU’s lowest available level (~1700 m AGL over the smoothed ICON-EU orography of the Tatra mountains) — with physically sensible extrapolation, not a constant copy.
  3. Add the fields ICON’s runtime check_variables requires but neither CDO nor DWD opendata provides: surface geopotential GEOSP, vertical wind W on half-levels (DWD publishes full-level W; ICON wants nlev+1 half-level entries).

Without (1) ICON’s vert_interp_atm blows up in 40 seconds because source heights don’t agree with the LAM’s target SLEVE. Without (2) the LAM produces a famous “175 km/h surface gust” spike because the free-troposphere wind from ICON-EU’s lowest model level gets propagated unchanged down to the LAM’s actual surface. Without (3) the forecast won’t start at all.

So I wrote iconremap — a small Python module (MIT) that fills exactly that gap. Not a replacement for the full DWD tool, just the open-data path.

What it does, in one paragraph

Read the CDO-horizontally-remapped IC NetCDF and the LAM EXTPAR (which carries the target topography HSURF). Compute the LAM’s target half-level heights z_ifc from vct_a + SLEVE namelist parameters, using the exact algorithm from ICON’s mo_init_vgrid.f90 so the heights match what ICON computes at runtime to the centimetre. Terrain-shift the source HHL so the source bottom anchors at LAM topography (otherwise interpolation happens in absolute height, asking “what’s the wind 10 m above the LAM’s Tatra peak?” using data that only goes up to 10 m above DWD’s smoothed-down Tatra — different absolute altitude entirely). Then for each variable, run a per-cell vertical interpolation with physics-aware extrapolation outside the source range:

| Variable | Below source surface | Above source top |
|---|---|---|
| T | dry-adiabatic lapse | isothermal stratosphere |
| P | hydrostatic balance using local T | hydrostatic decay |
| QV | Clausius–Clapeyron at extrapolated P | Brewer–Dobson decay (H ≈ 2 km) |
| QC / QI / QR / QS | zero | zero |
| U / V / W | log-law surface-layer decay, capped at ±25 m/s | constant |
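As an illustration of the below-surface column of the table, the following Python sketch extrapolates temperature, pressure and wind downward from the lowest source level. It is not the iconremap source: the function names, the physical constants, the roughness length `z0 = 0.1` m and the use of a layer-mean temperature in the hydrostatic step are all assumptions chosen to keep the example short.

```python
import math

G = 9.80665        # gravity, m/s^2 (assumed value)
R_D = 287.04       # gas constant of dry air, J/(kg K) (assumed value)
CP_D = 1004.64     # specific heat of dry air, J/(kg K) (assumed value)
WIND_CAP = 25.0    # m/s, the cap quoted in the table


def temperature_below(t_src, dz):
    """Warm along the dry adiabat when going dz metres below the source level."""
    return t_src + (G / CP_D) * dz


def pressure_below(p_src, t_src, dz):
    """Hydrostatic pressure dz metres lower, using the layer-mean temperature."""
    t_mean = 0.5 * (t_src + temperature_below(t_src, dz))
    return p_src * math.exp(G * dz / (R_D * t_mean))


def wind_below(u_src, z_src, z_target, z0=0.1):
    """Log-law decay from height z_src down to z_target (metres above ground)."""
    z_target = max(z_target, z0)          # the profile reaches zero at z0
    scale = math.log(z_target / z0) / math.log(z_src / z0)
    return max(-WIND_CAP, min(WIND_CAP, u_src * scale))
```

The point of the logarithmic profile is that a strong free-troposphere wind shrinks substantially on its way to 10 m, and the cap catches whatever the profile does not.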

The log-law wind extrapolation is the one that matters in practice — that’s the line of code that turns the spurious 175 km/h surface gust into a believable 70 km/h frontal gust. Finally, attach the LAM’s z_ifc as the new HHL, derive GEOSP = HSURF × g, and pad W from full to half levels by averaging adjacent full-level values. Write the NetCDF; ICON ingests it directly.
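The two derived fields at the end of that paragraph can be sketched in a few lines of plain Python. This is a hedged illustration rather than the module's code: the function names are hypothetical, and reusing the nearest full-level value at the top and bottom half levels is an assumption, since the text only specifies averaging between adjacent full levels.

```python
G = 9.80665  # m/s^2 (assumed value)


def geosp_from_hsurf(hsurf):
    """Surface geopotential (m^2/s^2) from surface height (m), cell by cell."""
    return [h * G for h in hsurf]


def w_full_to_half(w_full):
    """Pad one column of W from nlev full levels to nlev + 1 half levels."""
    if not w_full:
        raise ValueError("need at least one full level")
    # Interior half levels: mean of the two neighbouring full levels.
    interior = [0.5 * (a + b) for a, b in zip(w_full, w_full[1:])]
    # Assumption: copy the nearest full-level value at both edges.
    return [w_full[0]] + interior + [w_full[-1]]
```

A column with three full levels therefore yields four half-level values, which is the nlev+1 layout ICON expects for W.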

[Figure: where iconremap fits in the prep pipeline. DWD opendata GRIB (ICON-EU forecast, 6.5 km, levels 10–74, hourly) → CDO remapdis onto the LAM grid (horizontal only) → iconremap (Python: vertical remap to SLEVE z_ifc, plus GEOSP and half-level W) → ICON-LAM IC.nc, ingested directly by the forecast. The figure also repeats the per-variable extrapolation rules from the table above.]
The vertical step is the gap between "CDO works on opendata" and "ICON-LAM has an IC it can read." Without physics-aware extrapolation, the lowest source level — ~1700 m AGL — gets propagated unchanged down to the LAM's actual surface, producing the famous spurious 175 km/h surface gust.

The code is small — a few hundred lines across sleve.py (SLEVE z_ifc computation transcribed from ICON Fortran), vertical.py (the interpolation + extrapolation), meteo.py (the physics constants), pipeline.py (glue). Bind-mounted into the prep container as a Python dist-packages entry; called once per cycle from post_prep.sh:

python3 -m iconremap --input "$PRE_IC" --extpar "$EXTPAR" --output "$IC_NC" --quiet

It’s MIT-licensed, on github.com/robertziel/iconremap, and it’s the difference between “ICON LAM is for institutions” and “ICON LAM is for anyone with a laptop and a public-internet connection.”

The orchestrator: one run.sh, three commands

The single source of truth is run.sh, ~475 lines of bash that owns prep, forecast, postprocess, cluster lifecycle, and an auto-render watcher. Its public surface is small:

./run.sh                              # status
./run.sh start                        # = --region poland --hours 24 --machines spark,beelink
./run.sh start --hours 6
./run.sh start --machines beelink     # single-host MPI on beelink
./run.sh start --no-render
./run.sh stop
./run.sh restart --hours 12

The flow start runs is deliberate, and each phase fails fast:

run.sh start, phase ordering:

  1. preflight: image present, peer ssh, peer mounts, ranks configured
  2. stop everything: compose stop, remove peer containers, kill watcher
  3. clean_region: wipe cycle intermediates (keep the icon-eu cache and hhl_lam.nc)
  4. save .cluster_hosts, so a future stop knows which peers to clean up
  5. compose up: prep, postprocess, pipeline, web; skip forecast-poland and icon-build (the cluster replaces them)
  6. wait for latest_cycle (≤45 min): prep polls DWD every 30 min; its log is tailed every 30 s
  7. cluster_start_forecast: peers first (sshd must bind :2222), then the head runs mpiexec
  8. render_start (auto-render watcher): nohup loop that polls last_postprocessed_cycle every 60 s
Each phase is idempotent in isolation; the script just chains them and persists enough state (.cluster_hosts, render PID file) to survive between invocations.
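The chaining idea can be shown in Python, although run.sh itself is bash. The sketch below is an assumption-laden illustration: the helper names are hypothetical and the one-host-per-line layout of .cluster_hosts is a guess. It shows only the two properties described here, stopping at the first failing phase and leaving enough state on disk for a later stop.

```python
import tempfile
from pathlib import Path


def run_phases(phases):
    """Run (name, callable) pairs in order; return (completed, failed_or_None)."""
    completed = []
    for name, phase in phases:
        if not phase():
            return completed, name        # fail fast, later phases never run
        completed.append(name)
    return completed, None


def save_hosts(state_dir, hosts):
    """Persist the peer list (assumed format: one host per line)."""
    Path(state_dir, ".cluster_hosts").write_text("\n".join(hosts) + "\n")
    return True


def recorded_peers(state_dir):
    """Peers written by the last start, or an empty list if there was none."""
    state = Path(state_dir, ".cluster_hosts")
    return state.read_text().split() if state.exists() else []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as state_dir:
        phases = [
            ("preflight", lambda: True),
            ("save_hosts", lambda: save_hosts(state_dir, ["spark", "beelink"])),
            ("compose_up", lambda: False),    # simulated failure
            ("forecast", lambda: True),
        ]
        print(run_phases(phases), recorded_peers(state_dir))
```

Even when a later phase fails, the recorded peer list is already on disk, so a subsequent stop still knows which hosts to clean up.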

A few details that matter:

Per-rank pinning and the IO server

Inside the forecast container the actual mpiexec line — for the cluster path — is small:

mpiexec \
    -launcher ssh \
    -hosts "$MPI_HOSTS" \
    "$ICON_BIN" 2>&1 | tee icon.log | tail -50

$MPI_HOSTS is built up as spark:20,beelink:32. MPICH then divides global ranks across hosts in declaration order: ranks 0–19 on spark, 20–51 on beelink. Order matters here: ICON dedicates the last rank to its asynchronous output server, and the aarch64 build of ICON has a latent SIGILL bug in its CDI varID lookup under certain namelist combinations. Listing beelink last puts the IO rank on x86_64 and sidesteps the bug. Hence the default --machines spark,beelink: the order is load-bearing, not cosmetic.

Each host’s rank count is configurable via BEELINK_RANKS / SPARK_RANKS env vars. Defaults match physical core counts: 32 and 20 → 52 ranks total. OMP_NUM_THREADS=1 everywhere — ICON’s gain from threading on top of MPI on this size of domain is negligible and complicates pinning.
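A small Python sketch of how the host string and the rank placement relate, assuming block placement in listed order as described above. The function names are hypothetical; only the BEELINK_RANKS / SPARK_RANKS variables, the default counts and the host:count syntax come from the text.

```python
import os

DEFAULT_RANKS = {"beelink": 32, "spark": 20}


def ranks_for(host, env=os.environ):
    """Rank count for a host, overridable via BEELINK_RANKS / SPARK_RANKS."""
    return int(env.get(f"{host.upper()}_RANKS", DEFAULT_RANKS[host]))


def mpi_hosts(machines, env=os.environ):
    """Turn a --machines value such as 'spark,beelink' into a -hosts string."""
    names = [m.strip() for m in machines.split(",") if m.strip()]
    return ",".join(f"{name}:{ranks_for(name, env)}" for name in names)


def host_of_rank(rank, hosts_spec):
    """Host that owns a global rank, assuming block placement in listed order."""
    first = 0
    for entry in hosts_spec.split(","):
        name, count = entry.split(":")
        if rank < first + int(count):
            return name
        first += int(count)
    raise ValueError(f"rank {rank} is outside the job")


spec = mpi_hosts("spark,beelink", env={})
print(spec, "-> last rank runs on", host_of_rank(51, spec))
```

With the default order the last global rank lands on beelink; swapping the two names in --machines would move it to spark.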

The namelist sets a single I/O proc and one prefetch proc:

&parallel_nml
  num_io_procs       = 1
  num_prefetch_proc  = 1     ! mandatory for itype_latbc=1
/

So of the 52 ranks: one async output writer, one LBC prefetcher, 50 compute ranks. Halo exchanges fan across all of them every 4 s of model time (one 20 s timestep split into 5 dynamics substeps).
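The rank budget is simple arithmetic, spelled out here in Python with the figures quoted in this post. The even split of cells across compute ranks is an assumption made only to give a sense of scale; the real domain decomposition need not be uniform.

```python
TOTAL_RANKS = 52         # 32 on beelink + 20 on spark
NUM_IO_PROCS = 1         # num_io_procs in parallel_nml
NUM_PREFETCH_PROC = 1    # num_prefetch_proc in parallel_nml
DT_SECONDS = 20.0        # model timestep
DYN_SUBSTEPS = 5         # dynamics substeps per timestep
GRID_CELLS = 254_212     # cells in the R5B9 Poland domain

compute_ranks = TOTAL_RANKS - NUM_IO_PROCS - NUM_PREFETCH_PROC
substep_seconds = DT_SECONDS / DYN_SUBSTEPS
# Assumption: a perfectly even split, for orientation only.
cells_per_compute_rank = GRID_CELLS / compute_ranks

print(compute_ranks, substep_seconds, round(cells_per_compute_rank))
```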

What the numbers look like

The interesting metric isn’t “tokens/s” — it’s Time step: rate, which ICON logs every step. After ~10 minutes of run the dycore is in steady state and the rate is stable. Recorded:

| Case | Ranks | Steps in 10 min | steps/s | Forecast/wall |
|---|---|---|---|---|
| beelink only | 32 | 85 | 0.142 | 2.83× |
| spark only | 20 | 70 | 0.117 | 2.33× |
| mixed cluster | 52 | 124 | 0.207 | 4.13× |
[Figure: forecast-to-wall ratio, single host vs cluster, in forecast hours produced per wall-clock hour (higher is better). beelink (32 ranks) 2.83×, spark (20) 2.33×, cluster (52) 4.13×. The cluster is 1.46× faster than beelink alone and reaches 80% of the theoretical sum (0.142 + 0.117 = 0.259).]
Cross-host halo exchange isn't free — the cluster delivers ~80% of the additive theoretical sum, not 100%.
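The derived columns follow directly from the step counts and the 20 s timestep. The short Python check below reproduces them; the variable names are illustrative, and the inputs are the measured values from the table.

```python
DT_SECONDS = 20.0            # model timestep from the post
WINDOW_SECONDS = 10 * 60     # measurement window

steps_in_window = {"beelink only": 85, "spark only": 70, "mixed cluster": 124}

# steps/s, then model seconds advanced per wall-clock second.
rates = {case: n / WINDOW_SECONDS for case, n in steps_in_window.items()}
speedups = {case: rate * DT_SECONDS for case, rate in rates.items()}

additive = rates["beelink only"] + rates["spark only"]
efficiency = rates["mixed cluster"] / additive
gain_over_beelink = rates["mixed cluster"] / rates["beelink only"]

for case in steps_in_window:
    print(f"{case:14s} {rates[case]:.3f} steps/s  {speedups[case]:.2f}x real time")
print(f"efficiency vs additive sum: {efficiency:.0%}")
print(f"gain over beelink alone:    {gain_over_beelink:.2f}x")
```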

The 20% gap from theoretical is structural:

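At the steady-state rates above, a 6 h forecast takes about 87 min on the cluster against about 127 min on beelink alone, a saving of roughly 40 min. For a 24 h forecast the saving is about 2.7 h (5.8 h against 8.5 h), which is the whole reason this exists.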

Why this layout, in retrospect

A few decisions did most of the work:

| Choice | Why | What it bought |
|---|---|---|
| MPICH over OpenMPI 5 | Heterogeneous-arch support | x86 + aarch64 in one job |
| Multi-stage Docker, per-arch tags | Cacheable build, slim runtime | Single docker run line works on either host |
| In-container sshd on :2222 | Hydra needs to launch into peers, not into hosts | Process isolation, no host sshd changes |
| NFS over the 10 GbE link | Single writable workdir | Identical bind-mount strings on both hosts |
| --add-host beelink:10.10.10.1 | Force traffic onto the fast link | No router config, no static routes |
| One run.sh | Idempotent state machine | start/stop/restart survives reboots and partial failures |

Things that are not in this stack and would have been over-engineering for two hosts: Kubernetes, Slurm, an external scheduler, a service mesh, a metrics backplane. The state is in two text files (latest_cycle, .cluster_hosts) and docker ps. The right amount of infrastructure for two boxes is approximately none.

How does it compare to a reference operational model?

Throughput and engineering are interesting, but the only number that matters in the end is whether the forecast is useful. A side-by-side against a reference operational model is the cheapest sanity check.

The two maps below show 24-hour accumulated precipitation over the same domain (Poland), forecast on consecutive cycles by two different models at comparable resolution.

ALADIN CZ 2.3 km precipitation forecast for Poland, valid 2026-05-14 00:00 UTC
ALADIN CZ 2.3 km, cycle 2026-05-13 00 UTC, +24 h forecast valid 14 May 00:00 UTC. The Czech hydrometeorological service's operational LAM — convection-permitting, full Poland coverage, similar grid spacing to ours. The reference.
Local ICON-LAM Poland 2 km precipitation forecast, cycle 2026-05-12 23 UTC, valid 2026-05-13 23:00 UTC
The local cluster's own ICON-LAM at 2 km, cycle 2026-05-12 23 UTC, +24 h forecast valid 13 May 23:00 UTC. Same precipitation field, same colour scale, same area. Data extent visible here covers the NW quadrant of the LAM domain — the rest of Poland is dry in both forecasts.

What the comparison actually tells us:

This is the bar a home cluster has to clear: not “do my numbers match the reference exactly,” but “are they in the same family.” For a stack that’s two boxes, one cable, and a bash script, landing inside the model-to-model spread of an operational reference is the result that justifies the existence of every section above.

Things that will bite you

Useful to know up front:

TL;DR: A home weather-forecast cluster is two boxes, one direct 10 GbE cable, one shared NFS workdir, MPICH (not OpenMPI), a per-arch multi-stage container with an in-container sshd on a non-standard port, and one bash script that knows how to start, stop, and recover. The cluster runs a 24 h ICON-LAM forecast at ~4× real-time — 1.46× faster than the bigger box alone — for the cost of putting a switch’s worth of cable between two machines that were already on the desk.
