Bugfix in distributed GPU tests and Distributed set! #3880

Merged: 55 commits, merged Nov 9, 2024
Changes shown are from 8 commits.

Commits
5227a13
fix pipeline
simone-silvestri Oct 29, 2024
1225061
mpi test and gpu test
simone-silvestri Oct 29, 2024
1652c6b
do we need to precompile it inside?
simone-silvestri Oct 29, 2024
9323203
precompile inside the node
simone-silvestri Oct 29, 2024
37b17ff
try previous climacommon version
simone-silvestri Oct 29, 2024
2ac8cde
go even more back
simone-silvestri Oct 29, 2024
0eb2720
use the ClimaOcean implementation
simone-silvestri Oct 29, 2024
50d0ec3
using the ClimaOcean implementation
simone-silvestri Oct 29, 2024
6e183bd
see if this test passes
simone-silvestri Oct 30, 2024
bd84d38
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Oct 31, 2024
c56b15b
maybe precompiling before...
simone-silvestri Oct 31, 2024
371a45b
Merge branch 'ss/fix-gpu-tests' of github.com:CliMA/Oceananigans.jl i…
simone-silvestri Oct 31, 2024
e30973f
double O0
simone-silvestri Oct 31, 2024
e4cb16e
back to previous clima_common
simone-silvestri Oct 31, 2024
0c1f01c
another quick test
simone-silvestri Nov 1, 2024
bec1cd1
change environment
simone-silvestri Nov 1, 2024
75546af
correct the utils
simone-silvestri Nov 1, 2024
5f49ec0
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 1, 2024
9b334af
this should load mpitrampoline
simone-silvestri Nov 4, 2024
f8c6401
Fix formatting
glwagner Nov 4, 2024
1dc42bb
Go back to latest climacommon
glwagner Nov 4, 2024
5a870e7
try adding Manifest
simone-silvestri Nov 5, 2024
9e63f56
Manifest from julia 1.10
simone-silvestri Nov 5, 2024
59548f8
we probably need to initialize on a GPU
simone-silvestri Nov 5, 2024
642cfd9
these options should not create problems
simone-silvestri Nov 5, 2024
4cee49a
let's see if this differs
simone-silvestri Nov 5, 2024
a46b25d
just version infos
simone-silvestri Nov 5, 2024
4dffbe5
fiddling with O0
simone-silvestri Nov 6, 2024
9c3c6cd
why are we using 8 threads?
simone-silvestri Nov 6, 2024
3b28ecb
memory requirements are not this huge
simone-silvestri Nov 6, 2024
7126c7c
speed up the precompilation a bit, to revert later
simone-silvestri Nov 6, 2024
733ab2b
might this be the culprit?
simone-silvestri Nov 6, 2024
2dbf1a0
revert to 8 tasks to precompile
simone-silvestri Nov 6, 2024
a4b129a
final version?
simone-silvestri Nov 6, 2024
29f7d69
return to previous state of affairs
simone-silvestri Nov 6, 2024
b174313
reinclude enzyme
simone-silvestri Nov 6, 2024
0283e6a
set cuda runtime version
simone-silvestri Nov 6, 2024
b4c1f2a
will this help in finding cuda?
simone-silvestri Nov 6, 2024
bc53a97
make sure we don't run OOM
simone-silvestri Nov 6, 2024
811bfdb
bugfix in `set!`
simone-silvestri Nov 7, 2024
cd86a6a
try precompile inside runtests
simone-silvestri Nov 7, 2024
4039299
revert back
simone-silvestri Nov 7, 2024
2c6ad90
recompile everywhere
simone-silvestri Nov 7, 2024
781992c
try nuclear option
simone-silvestri Nov 7, 2024
08949b3
skip all these commands
simone-silvestri Nov 7, 2024
908b31a
some failsafe option
simone-silvestri Nov 7, 2024
466ec0c
increase a bit the memory
simone-silvestri Nov 7, 2024
a27b383
comment
simone-silvestri Nov 7, 2024
eec18c2
whoops unit tests are small
simone-silvestri Nov 7, 2024
62c5834
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 8, 2024
8011ef5
increase memory limits
simone-silvestri Nov 8, 2024
0965067
Merge branch 'ss/fix-gpu-tests' of github.com:CliMA/Oceananigans.jl i…
simone-silvestri Nov 8, 2024
cd00381
tests were running on the CPU on sverdrup
simone-silvestri Nov 8, 2024
eebfc04
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 8, 2024
8fc903e
Merge branch 'main' into ss/fix-gpu-tests
navidcy Nov 9, 2024
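The diff below ("Changes from 8 commits") predates commit 811bfdb ("bugfix in `set!`"), so the `set!` fix named in the PR title is not shown here. For orientation, a hedged sketch of the kind of call the distributed tests exercise, assuming 4 MPI ranks and the constructors used in the test utilities changed by this PR; the grid size and partition are illustrative:

```julia
# Sketch only, not code from this PR. Run under 4 MPI ranks,
# e.g. `mpiexec -n 4 julia --project ...`.
using MPI, Oceananigans
using Oceananigans.DistributedComputations: Distributed, Partition

MPI.Init()

arch = Distributed(CPU(); partition = Partition(2, 2))               # 2 x 2 rank layout
grid = RectilinearGrid(arch; size = (16, 16, 4), extent = (1, 1, 1))
c    = CenterField(grid)

set!(c, (x, y, z) -> x + y)   # set! on a distributed field, the path the bugfix touches
```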
35 changes: 26 additions & 9 deletions .buildkite/distributed/pipeline.yml
@@ -1,26 +1,31 @@
agents:
queue: new-central
slurm_mem: 8G
modules: climacommon/2024_10_09
modules: climacommon/2024_05_27

env:
JULIA_LOAD_PATH: "${JULIA_LOAD_PATH}:${BUILDKITE_BUILD_CHECKOUT_PATH}/.buildkite/distributed"
OPENBLAS_NUM_THREADS: 1
JULIA_PKG_SERVER_REGISTRY_PREFERENCE: eager
JULIA_NUM_PRECOMPILE_TASKS: 8
JULIA_NUM_THREADS: 8
OMPI_MCA_opal_warn_on_missing_libcuda: 0
JULIA_NUM_PRECOMPILE_TASKS: 8

steps:
- label: "initialize"
key: "init_central"
key: "init"
env:
TEST_GROUP: "init"
command:
- echo "--- Instantiate project"
- "julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
- "echo '--- Instantiate project'"
- "julia --project -e 'using Pkg; Pkg.instantiate(verbose=true)'"

- "echo '--- Precompile project'"
- "julia --project -e 'using Pkg; Pkg.precompile(strict=true)'"
- "julia --project -e 'using Pkg; Pkg.status()'"

# Force the initialization of the CUDA runtime as it is lazily loaded by default:
- "echo '--- Initialize the CUDA runtime'"
- "julia --project -e 'using CUDA; CUDA.precompile_runtime()'"
- "julia --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_gpus: 1
slurm_cpus_per_task: 8

@@ -30,6 +35,7 @@ steps:
key: "distributed_cpu"
env:
TEST_GROUP: "distributed"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -40,6 +46,8 @@
key: "distributed_gpu"
env:
TEST_GROUP: "distributed"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -56,6 +64,7 @@
key: "distributed_solvers_cpu"
env:
TEST_GROUP: "distributed_solvers"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -66,6 +75,8 @@
key: "distributed_solvers_gpu"
env:
TEST_GROUP: "distributed_solvers"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -81,6 +92,7 @@
key: "distributed_hydrostatic_model_cpu"
env:
TEST_GROUP: "distributed_hydrostatic_model"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -91,6 +103,8 @@
key: "distributed_hydrostatic_model_gpu"
env:
TEST_GROUP: "distributed_hydrostatic_model"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -106,6 +120,7 @@
key: "distributed_nonhydrostatic_regression_cpu"
env:
TEST_GROUP: "distributed_nonhydrostatic_regression"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
@@ -116,6 +131,8 @@
key: "distributed_nonhydrostatic_regression_gpu"
env:
TEST_GROUP: "distributed_nonhydrostatic_regression"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
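For local reproduction, the Julia commands in the pipeline's "init" step above collapse to the script below. This is a sketch: the Slurm resources, the TEST_GROUP variable, and the climacommon module environment from the pipeline are not reproduced here.

```julia
using Pkg

Pkg.instantiate(verbose = true)   # resolve and install the project environment
Pkg.precompile(strict = true)     # error if any dependency fails to precompile
Pkg.status()                      # log the exact package versions used by the build

# Force initialization of the lazily loaded CUDA runtime, as in the pipeline:
using CUDA
CUDA.precompile_runtime()

Pkg.test()                        # with TEST_GROUP="init", runs only the initialization group
```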
33 changes: 21 additions & 12 deletions test/utils_for_runtests.jl
@@ -3,21 +3,26 @@ using Oceananigans.DistributedComputations: Distributed, Partition, child_archit

import Oceananigans.Fields: interior

test_child_arch() = CUDA.has_cuda() ? GPU() : CPU()
test_child_arch() = parse(Bool, get(ENV, "GPU_TEST", "false")) ? GPU() : CPU()
mpi_test() = parse(Bool, get(ENV, "MPI_TEST", "false"))

function test_architectures()
child_arch = test_child_arch()

# If MPI is initialized with MPI.Comm_size > 0, we are running in parallel.
# We test several different configurations: `Partition(x = 4)`, `Partition(y = 4)`,
# `Partition(x = 2, y = 2)`, and different fractional subdivisions in x, y and xy
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(y = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2), y = Equal())))
if mpi_test()
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(y = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2), y = Equal())))
else
return throw("The MPI partitioning is not correctly configured.")
end
else
return tuple(child_arch)
end
@@ -31,10 +36,14 @@ function nonhydrostatic_regression_test_architectures()
# If MPI is initialized with MPI.Comm_size > 0, we are running in parallel.
# We test 3 different configurations: `Partition(x = 4)`, `Partition(y = 4)`
# and `Partition(x = 2, y = 2)`
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)))
if mpi_test()
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)))
else
return throw("The MPI partitioning is not correctly configured.")
end
else
return tuple(child_arch)
end
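A brief usage sketch of the two code paths introduced above, assuming `test/utils_for_runtests.jl` has been included; the launch command is illustrative and not part of the PR:

```julia
# Serial path: with the new flags unset (both default to "false") the MPI branch
# is skipped entirely and only the child architecture is returned.
ENV["GPU_TEST"] = "false"
ENV["MPI_TEST"] = "false"
test_architectures()              # -> (CPU(),)

# Distributed path (illustrative launch command):
#   GPU_TEST=true MPI_TEST=true srun -n 4 julia -O0 --project -e 'using Pkg; Pkg.test()'
# With exactly 4 ranks, `test_architectures()` returns the six `Distributed`
# partitions listed above on GPU child architectures; any other rank count throws.
```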