Backends multi-node distributed compilation #10

Open
FlorianDeconinck opened this issue Jun 21, 2023 · 1 comment

Comments

@FlorianDeconinck
Collaborator

An enduring issue with the model right now is the inability to build efficiently at scale. Every stencil takes a significant amount of time to build because of the well-known poor performance of nvcc. Combine that with the cubed sphere, which gives us up to 9 different code paths (depending on the placement of a rank on any given tile), and build times climb to 3+ hours.

A solution is to use distributed compilation on multi-node*. Using the new code path identification technique, which guarantees relocatability, we should be able to compile with 54 ranks and scale up to any layout.
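
For reference, the 54 presumably comes from 6 tiles times a 3x3 layout per tile, the smallest layout where each of the 9 placements (corners, edges, interior) shows up. Here is a minimal sketch of that classification, assuming a rank's code path depends only on which tile edges it touches. `axis_class`, `code_path_id` and `distinct_paths` are hypothetical names for illustration, not existing API.

```python
def axis_class(i: int, n: int) -> int:
    """Return 0 if index i sits on the low tile edge, 2 on the high edge, 1 if interior."""
    if i == 0:
        return 0
    if i == n - 1:
        return 2
    return 1


def code_path_id(x: int, y: int, nx: int, ny: int) -> int:
    """Map a rank at (x, y) in an nx-by-ny tile layout to one of 9 placement classes.

    Assumes nx >= 3 and ny >= 3, so no rank touches two opposite edges at once.
    """
    return 3 * axis_class(y, ny) + axis_class(x, nx)


def distinct_paths(nx: int, ny: int) -> set:
    """Collect every placement class that occurs in an nx-by-ny layout."""
    return {code_path_id(x, y, nx, ny) for y in range(ny) for x in range(nx)}
```

Under that assumption, a 3x3 layout already covers all 9 classes, and a larger layout such as 12x12 only produces those same classes, which is why stencils built on the small run could be reused at scale.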

Here's an outline of a solution:

  • Rank 0 spins up a file socket server, acting as a scheduler for everybody else
  • When hitting FrozenStencil, the rank queries the server for the stencil state
    • Build: stencil is not built, build it
    • Stub: stencil is being built, stub for now and come back when execution is needed
    • Load: stencil is ready, load it
  • When a stencil needs to be executed, the rank queries the server until given the "Load" call
  • Why not multithread? Because Python + GIL = sad developer
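
The scheduler logic in the outline can be sketched as a small state machine. This is only an in-process illustration of the Build/Stub/Load protocol, with the socket transport left out. `StencilScheduler`, `Action`, `mark_built` and `wait_for_load` are hypothetical names, not existing code.

```python
import time
from enum import Enum


class Action(Enum):
    BUILD = "Build"  # nobody has claimed the stencil yet: the caller builds it
    STUB = "Stub"    # another rank is building it: come back later
    LOAD = "Load"    # the build is finished: load it from the cache


class StencilScheduler:
    """State that rank 0 would hold; the socket layer is omitted here."""

    def __init__(self):
        self._state = {}  # stencil id -> "building" or "ready"

    def query(self, stencil_id, claim=True):
        state = self._state.get(stencil_id)
        if state is None:
            if claim:
                # The first rank to ask becomes the builder.
                self._state[stencil_id] = "building"
            return Action.BUILD
        if state == "building":
            return Action.STUB
        return Action.LOAD

    def mark_built(self, stencil_id):
        # Called by the building rank once compilation is done.
        self._state[stencil_id] = "ready"


def wait_for_load(scheduler, stencil_id, poll_interval=0.01, timeout=5.0):
    """Poll until the scheduler answers Load; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # claim=False: waiting for execution must not claim a build.
        if scheduler.query(stencil_id, claim=False) is Action.LOAD:
            return True
        time.sleep(poll_interval)
    return False
```

In a real run the `query` and `mark_built` calls would be messages sent to rank 0, and a rank that got "Stub" would call something like `wait_for_load` right before the stencil first executes.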
@FlorianDeconinck
Collaborator Author

This issue should also be the basis for a full overhaul and refactor of CompileConfig, distributed_caches, etc., and of all the build/load systems that have been growing across multiple files in both the orchestrated and stencil-based systems.

The build system should be the same for all execution modes, presenting a unified API that users can build workflows on.
