This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

GPU offloading of tensor contraction #6

mofeing opened this issue Mar 27, 2023 · 12 comments
Labels: enhancement (New feature or request), performance (Makes the code go "brrrr")

Comments

@mofeing
Member

mofeing commented Mar 27, 2023

@HerManNav can you write here the results you obtain?

mofeing added the enhancement and performance labels on Mar 27, 2023
@HerManNav

It seems the CuArray structure is not being properly integrated with Tensors. Further work is required.

Creating a Tensor:

cu_tensor = Tensor(CuArray(rand(4,4)), [:i, :j])

4×4 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f71dfb26e10.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/HaQcr/src/GPUArraysCore.jl:106
 0.241556  0.129489   0.80101    0.615841
 0.679561  0.0986361  0.270873   0.855307
 0.955491  0.773435   0.4851     0.902845
 0.917217  0.348947   0.0748666  0.898347
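
One way to make these silent CPU fallbacks easier to track down (a minimal sketch using CUDA.jl's standard scalar-indexing switch, not something specific to this package) is to forbid scalar indexing globally and only re-enable it where it is intended:

using CUDA

# Turn scalar indexing into an error instead of a warning, so any generic
# AbstractArray code path that iterates the GPU array element by element fails loudly.
CUDA.allowscalar(false)

cu_tensor = Tensor(CuArray(rand(4, 4)), [:i, :j])

# Displaying the wrapped CuArray needs element access; opt in explicitly where intended.
CUDA.@allowscalar display(cu_tensor)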

Trying to contract 2 tensors:

data1, data2 = rand(4,4), rand(4,4)

tensor1 = Tensor(data1, [:i, :j])
tensor2 = Tensor(data2, [:j, :k])

cu_tensor1 = Tensor(CuArray(data1), [:i, :j])
cu_tensor2 = Tensor(CuArray(data2), [:j, :k])
julia> @benchmark contract(tensor1, tensor2) evals=100
BenchmarkTools.Trial: 10000 samples with 100 evaluations.
 Range (min … max):  3.221 μs …  66.071 μs  ┊ GC (min … max):  0.00% … 93.54%
 Time  (median):     3.345 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.775 μs ±  4.460 μs  ┊ GC (mean ± σ):  10.50% ±  8.27%

   ▃▇█▇▅▄▄▃▂▁▁▁                                              ▂
  ▇█████████████▇▇▇▅▅▁▅▅▇█▇▅▆▄▅▄▄▄▃▁▄▃▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▄ █
  3.22 μs      Histogram: log(frequency) by time     5.18 μs <

 Memory estimate: 5.89 KiB, allocs estimate: 83.

julia> @benchmark (CUDA.@sync contract(cu_tensor1, cu_tensor2)) evals=100
BenchmarkTools.Trial: 1646 samples with 100 evaluations.
 Range (min … max):  26.898 μs … 202.919 μs  ┊ GC (min … max): 0.00% … 39.54%
 Time  (median):     27.774 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.361 μs ±  16.088 μs  ┊ GC (mean ± σ):  2.44% ±  3.96%

  ▄▇█▇▆▂ ▂▂ ▁▂▂▁   ▁                      ▁                    ▁
  ██████▇███████▇████▇▇█▇▆▆▄▅▅▁▁▁▁▁▁▁▁▅▄▇▇███▇▆▅▆▆▅▅▅▁▁▄▅▁▁▁▁▄ █
  26.9 μs       Histogram: log(frequency) by time      43.4 μs <

 Memory estimate: 6.75 KiB, allocs estimate: 108.

@mofeing
Member Author

mofeing commented May 8, 2023

Those tensors are very small. GPUs have increased throughput, but at the cost of increased latency. It is more than likely that the execution time in this example is dominated by latency costs.

Could you try a tensor contraction (a matrix multiplication) of two tensors of size i = j = k = 2048?

@HerManNav

I've checked it with n = 1024. For higher values I get either overflow (?) errors or printing errors (not important).

The code is the following:

n = 1024
data1, data2 = rand(Float64, (n,n)), rand(Float64, (n,n))
indices1, indices2 = [:i, :j], [:j, :k]

tensor1 = Tensor(data1, indices1)
tensor2 = Tensor(data2, indices2)
cu_tensor1 = Tensor(CuArray(data1), indices1)
cu_tensor2 = Tensor(CuArray(data2), indices2)

I obtained the following benchmarks:

julia> @benchmark contract(tensor1, tensor2) evals=100
BenchmarkTools.Trial: 7 samples with 100 evaluations.
 Range (min … max):  7.631 ms …   8.058 ms  ┊ GC (min … max): 1.18% … 1.42%
 Time  (median):     7.764 ms               ┊ GC (median):    1.27%
 Time  (mean ± σ):   7.786 ms ± 140.965 μs  ┊ GC (mean ± σ):  1.29% ± 0.13%

  ▁   ▁            ▁█            ▁                          ▁  
  █▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  7.63 ms         Histogram: frequency by time        8.06 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 84.

julia> @benchmark (CUDA.@sync contract(cu_tensor1, cu_tensor2)) evals=100
BenchmarkTools.Trial: 6 samples with 100 evaluations.
 Range (min … max):  9.009 ms …   9.364 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.052 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.095 ms ± 133.575 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ █   █ █ █                                               █  
  █▁█▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  9.01 ms         Histogram: frequency by time        9.36 ms <

 Memory estimate: 9.63 KiB, allocs estimate: 165.

For n = 2048 I got the following errors the first and second time I called the contract method:

(1) Overflow errors (?) (graphics memory utilization is at 15-20% at this point):
julia> c_gpu = contract(cu_tensor1, cu_tensor2)
2048×2048 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
 519.256  502.685  508.305  499.599  522.73   525.856  522.624  510.051  512.903  519.176  528.907  …  0.0  0.0  0.0  0.0  0.0
 515.976  499.278  509.186  490.304  515.398  515.891  511.249  499.848  500.446  515.795  520.516  …  0.0  0.0  0.0  0.0  0.0
 510.431  502.946  511.511  492.046  515.862  517.249  510.14   497.419  505.281  512.77   518.809  …  0.0  0.0  0.0  0.0  0.0
   ⋮                                                                                                 ⋱
   0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0    …  0.0  0.0  0.0  0.0  0.0
   0.0      0.0      0.0      0.0      0.0    539.023  526.559  514.493  519.713  517.316  537.925  …  0.0  0.0  0.0  0.0  0.0
 516.104  513.338  509.786  503.44   517.002  532.929  524.534  503.017  513.091  514.49   525.879  …  0.0  0.0  0.0  0.0  0.0
 515.861  505.887  505.846  495.802  510.078  516.646  515.991  498.131  509.188  511.239  520.735  …  0.0  0.0  0.0  0.0  0.0
(2) Printing error:
julia> c_gpu = contract(cu_tensor1, cu_tensor2)
2048×2048 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
 519.256  502.685  508.305  499.599  522.73   525.856  522.624  510.051  512.903  519.176  528.907  …  0.0  0.0  0.0  0.0  0.0
 515.976  499.278  509.186  490.304  515.398  515.891  511.249  499.848  500.446  515.795  520.516  …  0.0  0.0  0.0  0.0  0.0
   ⋮
 509.352  504.151  508.406  483.566  513.979  527.051  512.119  503.92   505.231  511.086  518.15
Error showing value of type Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
ERROR: ArgumentError: can't repeat a string -2 times
Stacktrace:
  [1] repeat(s::String, r::Int64)
    @ Base ./strings/substring.jl:249
  [2] print_matrix_row(io::IOContext{Base.TTY}, X::AbstractVecOrMat, A::Vector{Tuple{Int64, Int64}}, i::Int64, cols::Vector{Int64}, sep::String, idxlast::Int64)
    @ Base ./arrayshow.jl:118
  [3] _print_matrix(io::IOContext{Base.TTY}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})
    @ Base ./arrayshow.jl:254
  [4] print_matrix(io::IOContext{Base.TTY}, X::Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64) (repeats 2 times)
    @ Base ./arrayshow.jl:171
  [5] print_array
    @ ./arrayshow.jl:358 [inlined]
  [6] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, X::Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}})
    @ Base ./arrayshow.jl:399
  [7] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:267
  [8] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:521
  [9] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:260
 [10] display(d::REPL.REPLDisplay, x::Any)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:272
 [11] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [12] #invokelatest#2
    @ ./essentials.jl:729 [inlined]
 [13] invokelatest
    @ ./essentials.jl:726 [inlined]
 [14] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:296
 [15] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:278
 [16] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:521
 [17] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:276
 [18] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:857
 [19] (::VSCodeServer.var"#98#101"{REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt}})(mi::REPL.LineEdit.MIState, buf::IOBuffer, ok::Bool)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:122
 [20] #invokelatest#2
    @ ./essentials.jl:729 [inlined]
 [21] invokelatest
    @ ./essentials.jl:726 [inlined]
 [22] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/LineEdit.jl:2510
 [23] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:1248
 [24] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:484

@jofrevalles
Member

I cannot see the overflow errors. Do they say anything in particular?

@mofeing
Member Author

mofeing commented May 10, 2023

  1. "Overflow" means a different kind of error (one that doesn't actually raise an exception, maybe a hardware interrupt). I think you mean Out-of-Memory (OOM), but you shouldn't get an OOM at 15-20% memory utilization.
  2. That's not an OOM error, nor anything related to the GPU. The error is thrown when Julia tries to print the data. If you end the line with ;, the data will not be printed and thus you should not get the error:
c_gpu = contract(cu_tensor1, cu_tensor2);

Still, this is a strange error. We should look into it.
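
If the result does need to be inspected, a possible workaround (untested here, and assuming the Tensor's underlying CuArray can be reached with parent; adjust to whatever accessor the package actually exposes) is to copy it back to the host in one go before showing it:

c_gpu = contract(cu_tensor1, cu_tensor2);

# Assumption: parent(c_gpu) returns the underlying CuArray.
c_host = Array(parent(c_gpu))   # single device-to-host copy, no scalar indexing
c_host[1:4, 1:4]                # inspect a corner on the CPU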

@HerManNav

Indeed, it seems to be some problem with printing the data from VRAM, since both results (from CPU and GPU) match if we index them manually (without printing). I don't think we should actually care about this anymore.

Anyway, my main concern is that the executions I performed on the GPU are always slower than on the CPU, even for large matrices or 3-tensors, as I explained in my previous post.

I just checked it with Base.@elapsed and Base.@time and their CUDA analogues, this time with n = 8192. I ran it several times since these macros, unlike BenchmarkTools, only measure one execution of the provided code.

# Execute it for the first time to let Julia compile the code (this is done only once):
julia> c_cpu = contract(tensor1, tensor2);

julia> c_gpu = contract(cu_tensor1, cu_tensor2);


# Now let's check those times with @time:
julia> Base.@time contract(tensor1, tensor2);
  3.296003 seconds (84 allocations: 512.006 MiB, 0.26% gc time)

julia> Base.@time contract(tensor1, tensor2);
  3.320518 seconds (84 allocations: 512.006 MiB)

julia> Base.@time contract(tensor1, tensor2);
  3.295321 seconds (84 allocations: 512.006 MiB, 0.49% gc time)

julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
  4.424156 seconds (22.50 k CPU allocations: 1.155 MiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)

julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
  4.382499 seconds (172 CPU allocations: 9.797 KiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)

julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
  4.359802 seconds (173 CPU allocations: 10.094 KiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)


# Now with @elapsed:
julia> CUDA.@elapsed contract(cu_tensor1, cu_tensor2)
4.3339796f0

julia> CUDA.@elapsed contract(cu_tensor1, cu_tensor2)
4.3237624f0

julia> Base.@elapsed contract(tensor1, tensor2)
3.305727348

julia> Base.@elapsed contract(tensor1, tensor2)
3.311123997

Maybe we should take a look at the particular implementation in Tensors and more specifically in OMEinsum?
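
One way to narrow this down (a minimal sketch, untested, and assuming contract ultimately dispatches to OMEinsum's ein string macro) would be to time the same contraction directly on CuArrays, both through OMEinsum and through plain *, bypassing the Tensor wrapper entirely:

using CUDA, OMEinsum

n = 8192
A_d, B_d = CuArray(rand(n, n)), CuArray(rand(n, n))

# Plain GPU matrix multiplication (dispatches to CUBLAS).
CUDA.@time A_d * B_d;

# The same contraction expressed through OMEinsum, without the Tensor wrapper.
CUDA.@time ein"ij,jk->ik"(A_d, B_d);

If the ein call is as fast as *, the slowdown comes from the Tensors integration; if it is as slow as contract, it points at OMEinsum's CuArray path.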

@jofrevalles
Member

I think that performing a simple operation like matrix multiplication could be a good first step to check the performance of the GPU. This is a computation-heavy operation that should benefit from parallel processing on a GPU, especially for large matrices. For example:

using CUDA

# Create two large matrices
n = 8192
A = rand(n, n)
B = rand(n, n)

# Transfer the matrices to the GPU
# (note: cu converts Float64 arrays to Float32 by default; use CuArray(A) to keep Float64)
A_d = cu(A)
B_d = cu(B)

# Warmup run to compile the functions
C_cpu = A * B
C_gpu = A_d * B_d

# Measure the time on the CPU
@time A * B

# Measure the time on the GPU
CUDA.@time A_d * B_d

@HerManNav, could you try this?

@mofeing
Member Author

mofeing commented May 11, 2023

> Maybe we should take a look at the particular implementation in Tensors and more specifically in OMEinsum?

Completely. I mean, isn't there any documentation about GPU computing in OMEinsum?

@HerManNav

HerManNav commented May 11, 2023

The only thing I found is a reference to the lack of GPU support in OMEinsum v0.3.0. I currently use a much more recent version (v0.7.4).

https://under-peter.github.io/OMEinsum.jl/stable/implementation/

I also found some old (2019-2020) issues in the OMEinsum.jl repo itself mentioning an incompatibility between OMEinsum and CuArrays, but no performance issues. Those problems were already fixed and the issues closed.

As for the matrix multiplication @jofrevalles proposed, it works perfectly and with a huge speed-up. So we can conclude the problem is in the integration of OMEinsum with CuArrays.

@mofeing
Member Author

mofeing commented May 11, 2023

Okay, let's discuss it together tomorrow.

@HerManNav

I discovered that VRAM is not being properly freed (at least not always) when executing consecutive contraction operations. I compared this against contraction operations on the CPU with regular Arrays, performing several contractions that overwrite the same variable.

On the CPU, the Julia GC works as expected. On the GPU, however, VRAM fills up at a certain point and starts swapping from then on, taking much longer. This could be caused either by CuArrays or by Tensors.
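
A minimal sketch (untested) of the kind of loop that reproduces this, using CUDA.jl's standard tools for inspecting and reclaiming device memory; if forcing GC.gc() plus CUDA.reclaim() between contractions avoids the swapping, the problem is just that unreferenced CuArrays are collected too late:

using CUDA

n = 8192
cu_tensor1 = Tensor(CuArray(rand(n, n)), [:i, :j])
cu_tensor2 = Tensor(CuArray(rand(n, n)), [:j, :k])

for _ in 1:20
    # Overwrite the same variable; the previous result becomes garbage on the device.
    c_gpu = contract(cu_tensor1, cu_tensor2)

    CUDA.memory_status()   # print used/free device memory

    # Force collection of unreferenced CuArrays and return pooled memory to the driver.
    GC.gc()
    CUDA.reclaim()
end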

@mofeing
Member Author

mofeing commented May 12, 2023

Nice catch! Tensors has nothing to do with this; it would be OMEinsum, in any case.
