CUDA CI no longer has version tag #840

Merged
vchuravy merged 8 commits into master from vc/cuda_ci
Jun 21, 2024

Conversation

vchuravy
Member

Since JuliaGPU/buildkite@ed35a79, the CUDA agents no longer have a version tag.

@giordano
Member

The CUDA tests are timing out because they're taking exceedingly long. For example, test_allgather.jl takes 2-3 minutes, while in the GitHub Actions CI it takes 3-4 seconds.

@vchuravy
Member Author

So, locally with Open MPI v5:

vchuravy@odin ~/s/MPI (vc/cuda_ci) [1]> OMPI_MCA_accelerator=cuda  julia +1.10 --project=. -e 'import Pkg; Pkg.test("MPI"; test_args=["--backend=CUDA"])'
The latest version of Julia in the `1.10` channel is 1.10.4+0.x64.linux.gnu. You currently have `1.10.3+0.x64.linux.gnu` installed. Run:

  juliaup update

to install Julia 1.10.4+0.x64.linux.gnu and update the `1.10` channel to that version.
     Testing MPI
      Status `/tmp/jl_XzJu15/Project.toml`
  [497a8b3b] DoubleFloats v1.3.8
  [da04e1cc] MPI v0.20.19 `~/src/MPI`
  [3da0fdf6] MPIPreferences v0.1.11
  [44cfe95a] Pkg v1.10.0
  [9a3f8284] Random
  [fa267f1f] TOML v1.0.3
  [8dfed614] Test
      Status `/tmp/jl_XzJu15/Manifest.toml`
  [34da2185] Compat v4.15.0
  [187b0558] ConstructionBase v1.5.5
  [ffbed154] DocStringExtensions v0.9.3
  [497a8b3b] DoubleFloats v1.3.8
  [14197337] GenericLinearAlgebra v0.3.11
  [92d709cd] IrrationalConstants v0.2.2
  [692b3bcd] JLLWrappers v1.5.0
  [2ab3a3ac] LogExpFunctions v0.3.28
  [da04e1cc] MPI v0.20.19 `~/src/MPI`
  [3da0fdf6] MPIPreferences v0.1.11
  [1914dd2f] MacroTools v0.5.13
  [eebad327] PkgVersion v0.3.3
  [f27b6e38] Polynomials v4.0.11
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [be4d8f0f] Quadmath v0.5.10
  [3cdcf5f2] RecipesBase v1.3.4
  [ae029012] Requires v1.3.0
  [efcf1570] Setfield v1.1.1
  [276daf66] SpecialFunctions v2.4.0
  [1e83bf80] StaticArraysCore v1.4.3
  [e33a78d0] Hwloc_jll v2.10.0+0
  [7cb0a576] MPICH_jll v4.2.1+1
  [f1f71cc9] MPItrampoline_jll v5.4.0+0
  [9237b28f] MicrosoftMPI_jll v10.1.4+2
⌅ [fe0851c0] OpenMPI_jll v4.1.6+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading.
     Testing Running tests...
   Resolving package versions...
    Updating `/tmp/jl_XzJu15/Project.toml`
  [052768ef] + CUDA v5.4.2
    Updating `/tmp/jl_XzJu15/Manifest.toml`
  [621f4979] + AbstractFFTs v1.5.0
  [79e6a3ab] + Adapt v4.0.4
  [a9b6321e] + Atomix v0.1.0
  [ab4f0b2a] + BFloat16s v0.5.0
  [fa961155] + CEnum v0.5.0
  [052768ef] + CUDA v5.4.2
  [1af6417a] + CUDA_Runtime_Discovery v0.3.4
  [3da002f7] + ColorTypes v0.11.5
  [5ae59095] + Colors v0.12.11
  [a8cc5b0e] + Crayons v4.1.1
  [9a962f9c] + DataAPI v1.16.0
  [a93c6f00] + DataFrames v1.6.1
  [864edb3b] + DataStructures v0.18.20
  [e2d170a0] + DataValueInterfaces v1.0.0
  [e2ba6199] + ExprTools v0.1.10
  [53c48c17] + FixedPointNumbers v0.8.5
  [0c68f7d7] + GPUArrays v10.2.0
  [46192b85] + GPUArraysCore v0.1.6
  [61eb1bfa] + GPUCompiler v0.26.5
  [842dd82b] + InlineStrings v1.4.1
  [41ab1584] + InvertedIndices v1.3.0
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [63c18a36] + KernelAbstractions v0.9.20
  [929cbde3] + LLVM v7.2.1
  [8b046642] + LLVMLoopInfo v1.0.0
  [b964fa9f] + LaTeXStrings v1.3.1
  [e1d29d7a] + Missings v1.2.0
  [5da4648a] + NVTX v0.3.4
  [bac558e1] + OrderedCollections v1.6.3
  [69de0a69] + Parsers v2.8.1
  [2dfb63ee] + PooledArrays v1.4.3
  [08abe8d2] + PrettyTables v2.3.2
  [74087812] + Random123 v1.7.0
  [e6cf234a] + RandomNumbers v1.5.3
  [189a3867] + Reexport v1.2.2
  [6c6a2e73] + Scratch v1.2.1
  [91c51154] + SentinelArrays v1.4.3
  [a2af1166] + SortingAlgorithms v1.2.1
  [90137ffa] + StaticArrays v1.9.5
  [892a3eda] + StringManipulation v0.3.4
  [3783bdb8] + TableTraits v1.0.1
  [bd369af6] + Tables v1.11.1
  [a759f4b9] + TimerOutputs v0.5.24
  [013be700] + UnsafeAtomics v0.2.1
  [d80eeb9a] + UnsafeAtomicsLLVM v0.1.4
  [4ee394cb] + CUDA_Driver_jll v0.9.0+0
  [76a88914] + CUDA_Runtime_jll v0.14.0+1
  [9c1d0b0a] + JuliaNVTXCallbacks_jll v0.2.1+0
⌅ [dad2f222] + LLVMExtra_jll v0.0.29+0
  [e98f9f5b] + NVTX_jll v3.1.0+2
  [10745b16] + Statistics v1.10.0
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
CUDA runtime 12.5, artifact installation
CUDA driver 12.5
NVIDIA driver 550.78.0, originally for CUDA 12.4

CUDA libraries: 
- CUBLAS: 12.5.2
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.2
- CUSPARSE: 12.4.1
- CUPTI: 23.0.0
- NVML: 12.0.0+550.78

Julia packages: 
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.0+0
- CUDA_Runtime_jll: 0.14.0+1

Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.7

1 device:
  0: Quadro RTX 4000 (sm_75, 7.774 GiB / 8.000 GiB available)
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.19
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /usr/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v5.0.3, package: Open MPI builduser@buildhost Distribution, ident: 5.0.3, repo rev: v5.0.3, Apr 08, 2024
Hello world, I am rank 0 of 4
Hello world, I am rank 2 of 4
Hello world, I am rank 3 of 4
Hello world, I am rank 1 of 4
Test Summary: | Pass  Total  Time
mpiexecjl     |    7      7  7.6s
Test Summary:     | Pass  Total   Time
test_allgather.jl |    1      1  23.6s
Test Summary:      | Pass  Total   Time
test_allgatherv.jl |    1      1  23.1s
Test Summary:     | Pass  Total   Time
test_allreduce.jl |    1      1  19.4s
Test Summary:    | Pass  Total   Time
test_alltoall.jl |    1      1  22.7s
Test Summary:     | Pass  Total   Time
test_alltoallv.jl |    1      1  23.1s
Test Summary: | Pass  Total  Time
test_basic.jl |    1      1  4.5s
Test Summary: | Pass  Total   Time
test_bcast.jl |    1      1  25.2s
Test Summary:       | Pass  Total  Time
test_cart_coords.jl |    1      1  2.2s
Test Summary:       | Pass  Total  Time
test_cart_create.jl |    1      1  2.0s
Test Summary:    | Pass  Total  Time
test_cart_get.jl |    1      1  1.9s
Test Summary:     | Pass  Total  Time
test_cart_rank.jl |    1      1  2.0s
Test Summary:      | Pass  Total  Time
test_cart_shift.jl |    1      1  2.0s
Test Summary: | Pass  Total  Time
test_comm.jl  |    1      1  1.9s
Test Summary:            | Pass  Total  Time
test_cooperative_wait.jl |    1      1  6.1s

So I am surprised as well by how long the CI runs took.
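
To compare a local run like the one above against the CI logs, a small shell helper can pull the per-file timings out of the `Test Summary` lines. This is only a sketch: `slow_tests` is a hypothetical name, not part of MPI.jl, and it assumes each summary is a header row followed by a `name | Pass Total Time` row, with times printed as `23.6s` or, for runs over a minute, in a form like `2m03.4s`.

```shell
# Hypothetical helper (not part of MPI.jl): list tests slower than a threshold.
# Usage: slow_tests SECONDS < test.log
slow_tests() {
  awk -F'|' -v limit="$1" '
    /^Test Summary:/ { next }            # skip the header row of each summary
    NF == 2 {
      name = $1
      gsub(/[ \t]+$/, "", name)          # trim padding after the test name
      n = split($2, cols, " ")           # whitespace-separated result columns
      t = cols[n]                        # the last column is the time
      if (t !~ /^([0-9]+m)?[0-9.]+s$/) next
      secs = 0
      if (index(t, "m") > 0) {           # minutes form, e.g. 2m03.4s
        split(t, parts, "m")
        secs = parts[1] * 60
        t = parts[2]
      }
      sub(/s$/, "", t)                   # drop the trailing unit
      secs += t
      if (secs > limit) print name, secs "s"
    }'
}
```

For example, `slow_tests 20 < test.log` would list only the files that took more than 20 seconds, where `test.log` is an assumed capture of the `Pkg.test` output.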

@luraess
Contributor

luraess commented Jun 20, 2024

Interestingly,

  • Unit Tests / test-default (macos-13, 1, x64) (pull_request) failed on shared_io, but re-running it succeeded.
  • Unit Tests / Test MVAPICH 3.0 (pull_request) failed because it seemed to have stalled, but re-running it succeeded.

@luraess luraess mentioned this pull request Jun 20, 2024
@luraess
Contributor

luraess commented Jun 20, 2024

Great that all CUDA tests now pass, @vchuravy. Maybe one could merge the changes from #840 with #844 (ROCm), where all is fine except for 3 tests.

@vchuravy vchuravy merged commit 690faae into master Jun 21, 2024
51 of 53 checks passed
@vchuravy vchuravy deleted the vc/cuda_ci branch June 21, 2024 17:48
3 participants