Bugfix in distributed GPU tests and Distributed set! #3880

Merged: 55 commits merged into main from ss/fix-gpu-tests on Nov 9, 2024.
Changes shown below are from 39 of the 55 commits.

Commits
5227a13
fix pipeline
simone-silvestri Oct 29, 2024
1225061
mpi test and gpu test
simone-silvestri Oct 29, 2024
1652c6b
do we need to precompile it inside?
simone-silvestri Oct 29, 2024
9323203
precompile inside the node
simone-silvestri Oct 29, 2024
37b17ff
try previous climacommon version
simone-silvestri Oct 29, 2024
2ac8cde
go even more back
simone-silvestri Oct 29, 2024
0eb2720
use the ClimaOcean implementation
simone-silvestri Oct 29, 2024
50d0ec3
using the ClimaOcean implementation
simone-silvestri Oct 29, 2024
6e183bd
see if this test passes
simone-silvestri Oct 30, 2024
bd84d38
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Oct 31, 2024
c56b15b
maybe precompiling before...
simone-silvestri Oct 31, 2024
371a45b
Merge branch 'ss/fix-gpu-tests' of github.com:CliMA/Oceananigans.jl i…
simone-silvestri Oct 31, 2024
e30973f
double O0
simone-silvestri Oct 31, 2024
e4cb16e
back to previous clima_common
simone-silvestri Oct 31, 2024
0c1f01c
another quick test
simone-silvestri Nov 1, 2024
bec1cd1
change environment
simone-silvestri Nov 1, 2024
75546af
correct the utils
simone-silvestri Nov 1, 2024
5f49ec0
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 1, 2024
9b334af
this should load mpitrampoline
simone-silvestri Nov 4, 2024
f8c6401
Fix formatting
glwagner Nov 4, 2024
1dc42bb
Go back to latest climacommon
glwagner Nov 4, 2024
5a870e7
try adding Manifest
simone-silvestri Nov 5, 2024
9e63f56
Manifest from julia 1.10
simone-silvestri Nov 5, 2024
59548f8
we probably need to initialize on a GPU
simone-silvestri Nov 5, 2024
642cfd9
these options should not create problems
simone-silvestri Nov 5, 2024
4cee49a
let's see if this differs
simone-silvestri Nov 5, 2024
a46b25d
just version infos
simone-silvestri Nov 5, 2024
4dffbe5
fiddling with O0
simone-silvestri Nov 6, 2024
9c3c6cd
why are we using 8 threads?
simone-silvestri Nov 6, 2024
3b28ecb
memory requirements are not this huge
simone-silvestri Nov 6, 2024
7126c7c
speed up the precompilation a bit, to revert later
simone-silvestri Nov 6, 2024
733ab2b
might this be the culprit?
simone-silvestri Nov 6, 2024
2dbf1a0
revert to 8 tasks to precompile
simone-silvestri Nov 6, 2024
a4b129a
final version?
simone-silvestri Nov 6, 2024
29f7d69
return to previous state of affairs
simone-silvestri Nov 6, 2024
b174313
reinclude enzyme
simone-silvestri Nov 6, 2024
0283e6a
set cuda runtime version
simone-silvestri Nov 6, 2024
b4c1f2a
will this help in finding cuda?
simone-silvestri Nov 6, 2024
bc53a97
make sure we don't run OOM
simone-silvestri Nov 6, 2024
811bfdb
bugfix in `set!`
simone-silvestri Nov 7, 2024
cd86a6a
try precompile inside runtests
simone-silvestri Nov 7, 2024
4039299
revert back
simone-silvestri Nov 7, 2024
2c6ad90
recompile everywhere
simone-silvestri Nov 7, 2024
781992c
try nuclear option
simone-silvestri Nov 7, 2024
08949b3
skip all these commands
simone-silvestri Nov 7, 2024
908b31a
some failsafe option
simone-silvestri Nov 7, 2024
466ec0c
increase a bit the memory
simone-silvestri Nov 7, 2024
a27b383
comment
simone-silvestri Nov 7, 2024
eec18c2
whoops unit tests are small
simone-silvestri Nov 7, 2024
62c5834
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 8, 2024
8011ef5
increase memory limits
simone-silvestri Nov 8, 2024
0965067
Merge branch 'ss/fix-gpu-tests' of github.com:CliMA/Oceananigans.jl i…
simone-silvestri Nov 8, 2024
cd00381
tests were running on the CPU on sverdrup
simone-silvestri Nov 8, 2024
eebfc04
Merge branch 'main' into ss/fix-gpu-tests
simone-silvestri Nov 8, 2024
8fc903e
Merge branch 'main' into ss/fix-gpu-tests
navidcy Nov 9, 2024
67 changes: 53 additions & 14 deletions .buildkite/distributed/pipeline.yml
@@ -1,7 +1,7 @@
agents:
queue: new-central
slurm_mem: 8G
modules: climacommon/2024_10_09
modules: climacommon/2024_10_08

env:
JULIA_LOAD_PATH: "${JULIA_LOAD_PATH}:${BUILDKITE_BUILD_CHECKOUT_PATH}/.buildkite/distributed"
@@ -16,60 +16,83 @@ steps:
key: "init_central"
env:
TEST_GROUP: "init"
GPU_TEST: "true"
command:
- echo "--- Instantiate project"
- "julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
- "julia -O0 --project -e 'using Pkg; Pkg.instantiate(; verbose=true); Pkg.precompile(; strict=true)'"

# Force the initialization of the CUDA runtime as it is lazily loaded by default
- "julia -O0 --project -e 'using CUDA; CUDA.precompile_runtime(); CUDA.versioninfo()'"
- "julia -O0 --project -e 'using MPI; MPI.versioninfo()'"

- echo "--- Initialize tests"
- "julia -O0 --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_gpus: 1
slurm_cpus_per_task: 8
slurm_mem: 8G
slurm_ntasks: 1
slurm_gpus_per_task: 1

- wait

- label: "🐉 cpu distributed unit tests"
key: "distributed_cpu"
env:
TEST_GROUP: "distributed"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 8G
Review comment (Member): why?

Reply (simone-silvestri, collaborator and author, Nov 7, 2024): 120G is much more memory than we need for these tests. After some frustration with tests that were extremely slow to start, I noticed that the agents started much more quickly when I requested a smaller memory allocation. So I deduce that the tests run on shared nodes instead of exclusive ones, and requesting fewer resources allows us to squeeze in when the cluster is busy.

Reply (Member): Good reason. It might warrant a comment.
slurm_ntasks: 4
retry:
automatic:
- exit_status: 1
limit: 1

- label: "🐲 gpu distributed unit tests"
key: "distributed_gpu"
env:
TEST_GROUP: "distributed"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "julia -O0 --project -e 'using CUDA; CUDA.precompile_runtime(); CUDA.versioninfo()'"
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 8G
slurm_ntasks: 4
slurm_gpus_per_task: 1
retry:
automatic:
- exit_status: 1
limit: 1


- label: "🦾 cpu distributed solvers tests"
key: "distributed_solvers_cpu"
env:
TEST_GROUP: "distributed_solvers"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 8G
slurm_ntasks: 4
retry:
automatic:
- exit_status: 1
limit: 1

- label: "🛸 gpu distributed solvers tests"
key: "distributed_solvers_gpu"
env:
TEST_GROUP: "distributed_solvers"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "julia -O0 --project -e 'using CUDA; CUDA.precompile_runtime(); CUDA.versioninfo()'"
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 32G
slurm_ntasks: 4
slurm_gpus_per_task: 1
retry:
@@ -81,20 +104,28 @@ steps:
key: "distributed_hydrostatic_model_cpu"
env:
TEST_GROUP: "distributed_hydrostatic_model"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 32G
slurm_ntasks: 4
retry:
automatic:
- exit_status: 1
limit: 1

- label: "🦏 gpu distributed hydrostatic model tests"
key: "distributed_hydrostatic_model_gpu"
env:
TEST_GROUP: "distributed_hydrostatic_model"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "julia -O0 --project -e 'using CUDA; CUDA.precompile_runtime(); CUDA.versioninfo()'"
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 32G
slurm_ntasks: 4
slurm_gpus_per_task: 1
retry:
@@ -106,20 +137,28 @@ steps:
key: "distributed_nonhydrostatic_regression_cpu"
env:
TEST_GROUP: "distributed_nonhydrostatic_regression"
MPI_TEST: "true"
commands:
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 32G
slurm_ntasks: 4
retry:
automatic:
- exit_status: 1
limit: 1

- label: "🕺 gpu distributed nonhydrostatic regression"
key: "distributed_nonhydrostatic_regression_gpu"
env:
TEST_GROUP: "distributed_nonhydrostatic_regression"
GPU_TEST: "true"
MPI_TEST: "true"
commands:
- "julia -O0 --project -e 'using CUDA; CUDA.precompile_runtime(); CUDA.versioninfo()'"
- "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
agents:
slurm_mem: 120G
slurm_mem: 32G
slurm_ntasks: 4
slurm_gpus_per_task: 1
retry:
7 changes: 4 additions & 3 deletions Project.toml
@@ -50,7 +50,7 @@ CubedSphere = "0.2, 0.3"
Dates = "1.9"
Distances = "0.10"
DocStringExtensions = "0.8, 0.9"
Enzyme = "0.13.3"
Enzyme = "0.13.14"
FFTW = "1"
Glob = "1.3"
IncompleteLU = "0.2"
@@ -77,10 +77,11 @@ julia = "1.9"

[extras]
DataDeps = "124859b0-ceae-595e-8997-d05f6a7a8dfe"
Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
SafeTestsets = "1bc83da4-3b8d-516f-aca4-4fe02f6d838f"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
TimesDates = "bdfc003b-8df8-5c39-adcd-3a9087f5df4a"
Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
MPIPreferences = "3da0fdf6-3ccc-4f1b-acd9-58baa6c99267"

[targets]
test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]
Review comment (Member): Was this the crucial part?
3 changes: 0 additions & 3 deletions test/runtests.jl
@@ -28,16 +28,13 @@ CUDA.allowscalar() do

# Initialization steps
if group == :init || group == :all
Pkg.instantiate(; verbose=true)
Pkg.precompile(; strict=true)
Pkg.status()

try
MPI.versioninfo()
catch; end

try
CUDA.precompile_runtime()
CUDA.versioninfo()
catch; end
end
4 changes: 1 addition & 3 deletions test/test_distributed_poisson_solvers.jl
@@ -121,9 +121,7 @@ function divergence_free_poisson_tridiagonal_solution(grid_points, ranks, stretc
return Array(interior(∇²ϕ)) ≈ Array(R)
end

@testset "Distributed FFT-based Poisson solver" begin
child_arch = test_child_arch()

@testset "Distributed FFT-based Poisson solver" begin
for topology in ((Periodic, Periodic, Periodic),
(Periodic, Periodic, Bounded),
(Periodic, Bounded, Bounded),
2 changes: 0 additions & 2 deletions test/test_distributed_transpose.jl
@@ -38,8 +38,6 @@ function test_transpose(grid_points, ranks, topo, child_arch)
end

@testset "Distributed Transpose" begin
child_arch = test_child_arch()

for topology in ((Periodic, Periodic, Periodic),
(Periodic, Periodic, Bounded),
(Periodic, Bounded, Bounded),
39 changes: 23 additions & 16 deletions test/utils_for_runtests.jl
@@ -3,21 +3,26 @@ using Oceananigans.DistributedComputations: Distributed, Partition, child_archit

import Oceananigans.Fields: interior

test_child_arch() = CUDA.has_cuda() ? GPU() : CPU()
# Are the tests running on the GPUs?
# Are the tests running in parallel?
child_arch = get(ENV, "GPU_TEST", nothing) == "true" ? GPU() : CPU()
mpi_test = get(ENV, "MPI_TEST", nothing) == "true"

function test_architectures()
child_arch = test_child_arch()

# If MPI is initialized with MPI.Comm_size > 0, we are running in parallel.
# We test several different configurations: `Partition(x = 4)`, `Partition(y = 4)`,
# `Partition(x = 2, y = 2)`, and different fractional subdivisions in x, y and xy
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(y = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2), y = Equal())))
if mpi_test
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(y = Fractional(1, 2, 3, 4))),
Distributed(child_arch; partition = Partition(x = Fractional(1, 2), y = Equal())))
else
return throw("The MPI partitioning is not correctly configured.")
end
else
return tuple(child_arch)
end
@@ -26,15 +31,17 @@ end
# For nonhydrostatic simulations we cannot use `Fractional` at the moment (requirements
# for the transpose are more stringent than for hydrostatic simulations).
function nonhydrostatic_regression_test_architectures()
child_arch = test_child_arch()

# If MPI is initialized with MPI.Comm_size > 0, we are running in parallel.
# We test 3 different configurations: `Partition(x = 4)`, `Partition(y = 4)`
# and `Partition(x = 2, y = 2)`
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)))
if mpi_test
if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
return (Distributed(child_arch; partition = Partition(4)),
Distributed(child_arch; partition = Partition(1, 4)),
Distributed(child_arch; partition = Partition(2, 2)))
else
return throw("The MPI partitioning is not correctly configured.")
end
else
return tuple(child_arch)
end
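
For context on how the new switches interact: child_arch and mpi_test are now read from the environment when this file is included, so the process launching the tests decides both the child architecture and whether distributed partitions are exercised. A minimal local sketch, assuming MPI.jl 0.20 or later (where mpiexec() returns a Cmd) and reusing only the variable names that appear in pipeline.yml and utils_for_runtests.jl:

    # Hypothetical local reproduction of the "distributed_cpu" pipeline step.
    # Four ranks are launched because test_architectures() requires
    # MPI.Comm_size(MPI.COMM_WORLD) == 4 before building Distributed architectures.
    using MPI

    test_cmd = `$(mpiexec()) -n 4 julia -O0 --project -e 'using Pkg; Pkg.test()'`

    # addenv keeps the current environment and adds the variables that the
    # Buildkite agents would set; add "GPU_TEST" => "true" to run the GPU variant.
    run(addenv(test_cmd, "TEST_GROUP" => "distributed", "MPI_TEST" => "true"))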