Backends multi-node distributed compilation #10

FlorianDeconinck · 2023-06-21T20:26:59Z

An enduring issue with the model right now is its inability to build efficiently at scale. Every stencil takes a significant amount of time to build because of the well-known slowness of nvcc. Coupled with the fact that the cubed sphere gives us up to 9 different code paths (depending on where a rank sits on any given tile), this leads to build times of 3+ hrs.

A solution is to distribute compilation across multiple nodes*. Using the new identify code path technique, which guarantees relocatability, we should be able to compile with 54 ranks and scale up to any layout.
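To make the 54-rank figure concrete: 6 tiles times 9 code paths gives 54. The sketch below is only an illustration of that arithmetic, not the actual technique. It assumes the 9 code paths correspond to a rank's position on its tile (low edge, interior, or high edge in each direction) and that each tile has at least a 3x3 layout. The names `code_path_index` and `compile_rank` are hypothetical.

```python
def code_path_index(i: int, j: int, nx: int, ny: int) -> int:
    """Map a rank's (i, j) position on an nx-by-ny tile layout to one of 9 classes.

    Assumption (not from the issue): the code path depends only on whether
    the rank touches the low edge, the high edge, or neither, in each direction.
    """
    if nx < 3 or ny < 3:
        raise ValueError("this sketch assumes at least a 3x3 layout per tile")
    if not (0 <= i < nx and 0 <= j < ny):
        raise ValueError("position is outside the layout")

    def classify(pos: int, n: int) -> int:
        if pos == 0:
            return 0  # low edge
        if pos == n - 1:
            return 2  # high edge
        return 1  # interior

    return classify(j, ny) * 3 + classify(i, nx)


def compile_rank(tile: int, i: int, j: int, nx: int, ny: int) -> int:
    """Rank in a 6-tile, 3x3 reference layout (54 ranks) with the same code path."""
    if not 0 <= tile < 6:
        raise ValueError("a cubed sphere has 6 tiles")
    return tile * 9 + code_path_index(i, j, nx, ny)
```

Under these assumptions, a rank in a 12x12 layout would reuse the binary built by `compile_rank(tile, i, j, 12, 12)` in the 54-rank build.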

Here's an outline of a solution:

- Rank 0 spins up a file socket server that acts as a scheduler for everybody else.
- When hitting FrozenStencil, the rank queries the server for the stencil state:
  - Build: the stencil is not built, so build it.
  - Stub: the stencil is being built, so stub it for now and come back when execution is needed.
  - Load: the stencil is ready, so load it.
- When a stencil needs to be executed, the rank queries the server until given the "Load" call.
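The Build/Stub/Load protocol above can be sketched as a small state machine. This is a minimal in-memory illustration, not the proposed implementation: the socket transport is left out, and `CompileScheduler`, `StencilState`, `mark_built`, and `wait_until_loadable` are hypothetical names introduced here.

```python
import time
from enum import Enum


class StencilState(Enum):
    BUILD = "Build"  # nobody has built this stencil: the caller should build it
    STUB = "Stub"    # another rank is building it: stub and come back later
    LOAD = "Load"    # the build is finished: load it


class CompileScheduler:
    """Scheduler logic that rank 0 would hold (hypothetical; transport omitted)."""

    def __init__(self):
        self._status = {}  # stencil key -> "building" | "ready"

    def query(self, key, claim=True):
        status = self._status.get(key)
        if status is None:
            if claim:
                # The first rank to ask becomes the builder.
                self._status[key] = "building"
            return StencilState.BUILD
        if status == "building":
            return StencilState.STUB
        return StencilState.LOAD

    def mark_built(self, key):
        if self._status.get(key) != "building":
            raise KeyError(f"{key!r} is not being built")
        self._status[key] = "ready"


def wait_until_loadable(scheduler, key, poll_interval=0.01, timeout=5.0):
    """Poll at execution time until the scheduler answers Load."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # claim=False: polling must never take over a build.
        if scheduler.query(key, claim=False) is StencilState.LOAD:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"{key!r} was not ready after {timeout} s")
```

In a real run, each rank would send its query to rank 0 over the socket and call the equivalent of `mark_built` once its build finishes. The polling interval and timeout here are placeholders.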

Why not multithread? Because Python + GIL = sad developer.


FlorianDeconinck · 2023-08-30T14:37:09Z

This issue should also be the basis for a full uprooting and refactor of CompileConfig, distributed_caches, etc., and of the whole build/load system that has been growing across multiple files in both the orchestrated and stencil-based systems.

The build system should be the same for every kind of execution, presenting a unified API that users can build workflows on.

FlorianDeconinck mentioned this issue Jun 21, 2023

Backends distributed multi-node compilation GEOS-ESM/pace#17

Open
