GPU offloading of tensor contraction #6
It seems the CuArray structure is not being properly integrated with Tensors; further work is required.

Creating a Tensor:

cu_tensor = Tensor(CuArray(rand(4,4)), [:i, :j])
4×4 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f71dfb26e10.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/HaQcr/src/GPUArraysCore.jl:106
0.241556 0.129489 0.80101 0.615841
0.679561 0.0986361 0.270873 0.855307
0.955491 0.773435 0.4851 0.902845
0.917217 0.348947 0.0748666 0.898347

Trying to contract 2 tensors:

data1, data2 = rand(4,4), rand(4,4)
tensor1 = Tensor(data1, [:i, :j])
tensor2 = Tensor(data2, [:j, :k])
cu_tensor1 = Tensor(CuArray(data1), [:i, :j])
cu_tensor2 = Tensor(CuArray(data2), [:j, :k])

julia> @benchmark contract(tensor1, tensor2) evals=100
BenchmarkTools.Trial: 10000 samples with 100 evaluations.
Range (min … max): 3.221 μs … 66.071 μs ┊ GC (min … max): 0.00% … 93.54%
Time (median): 3.345 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.775 μs ± 4.460 μs ┊ GC (mean ± σ): 10.50% ± 8.27%
▃▇█▇▅▄▄▃▂▁▁▁ ▂
▇█████████████▇▇▇▅▅▁▅▅▇█▇▅▆▄▅▄▄▄▃▁▄▃▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▄ █
3.22 μs Histogram: log(frequency) by time 5.18 μs <
Memory estimate: 5.89 KiB, allocs estimate: 83.
@benchmark (CUDA.@sync contract(cu_tensor1, cu_tensor2)) evals=100
BenchmarkTools.Trial: 1646 samples with 100 evaluations.
Range (min … max): 26.898 μs … 202.919 μs ┊ GC (min … max): 0.00% … 39.54%
Time (median): 27.774 μs ┊ GC (median): 0.00%
Time (mean ± σ): 30.361 μs ± 16.088 μs ┊ GC (mean ± σ): 2.44% ± 3.96%
▄▇█▇▆▂ ▂▂ ▁▂▂▁ ▁ ▁ ▁
██████▇███████▇████▇▇█▇▆▆▄▅▅▁▁▁▁▁▁▁▁▅▄▇▇███▇▆▅▆▆▅▅▅▁▁▄▅▁▁▁▁▄ █
26.9 μs Histogram: log(frequency) by time 43.4 μs <
Memory estimate: 6.75 KiB, allocs estimate: 108.
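As a side note on the scalar-indexing warning shown above: it comes from the generic GPUArrays guard, and the usual workarounds are to copy the data back to the host before element-wise access, or to allow scalar indexing explicitly. A minimal sketch on a raw CuArray (whether Tensor exposes its underlying buffer directly is not assumed here):

using CUDA

a = CuArray(rand(4, 4))

# Option 1: copy the data back to host memory before printing or element-wise access.
Array(a)

# Option 2: explicitly allow scalar indexing for a quick interactive peek.
CUDA.@allowscalar a[1, 1]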
Those tensors are very small. GPUs have higher throughput, but at the cost of higher latency, so in this example the execution time is most likely dominated by latency. Could you try a tensor contraction (a matrix multiplication) of two much larger tensors?
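To illustrate the latency point, a rough sketch of how the fixed per-call overhead shows up (assuming CUDA.jl and BenchmarkTools are available; the exact numbers depend on the hardware):

using CUDA, BenchmarkTools

# A tiny matmul is dominated by kernel-launch and synchronization latency,
# while a large one is dominated by actual compute.
a_small, b_small = CuArray(rand(4, 4)), CuArray(rand(4, 4))
a_big, b_big = CuArray(rand(4096, 4096)), CuArray(rand(4096, 4096))

@btime CUDA.@sync $a_small * $b_small   # mostly latency
@btime CUDA.@sync $a_big * $b_big       # mostly compute; far more work for not much more time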
I've checked it out with n = 1024. For higher values I get either overflow (?) or printing (not important) errors. The code is the following:

n = 1024
data1, data2 = rand(Float64, (n,n)), rand(Float64, (n,n))
indices1, indices2 = [:i, :j], [:j, :k]
tensor1 = Tensor(data1, indices1)
tensor2 = Tensor(data2, indices2)
cu_tensor1 = Tensor(CuArray(data1), indices1)
cu_tensor2 = Tensor(CuArray(data2), indices2)

I obtained the following benchmarks:

julia> @benchmark contract(tensor1, tensor2) evals=100
BenchmarkTools.Trial: 7 samples with 100 evaluations.
Range (min … max): 7.631 ms … 8.058 ms ┊ GC (min … max): 1.18% … 1.42%
Time (median): 7.764 ms ┊ GC (median): 1.27%
Time (mean ± σ): 7.786 ms ± 140.965 μs ┊ GC (mean ± σ): 1.29% ± 0.13%
▁ ▁ ▁█ ▁ ▁
█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
7.63 ms Histogram: frequency by time 8.06 ms <
Memory estimate: 8.01 MiB, allocs estimate: 84.
julia> @benchmark (CUDA.@sync contract(cu_tensor1, cu_tensor2)) evals=100
BenchmarkTools.Trial: 6 samples with 100 evaluations.
Range (min … max): 9.009 ms … 9.364 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.052 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.095 ms ± 133.575 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █ █ █ █
█▁█▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
9.01 ms Histogram: frequency by time 9.36 ms <
Memory estimate: 9.63 KiB, allocs estimate: 165.

(1) Overflow errors (?) (graphics memory is at 15-20% at this point):

julia> c_gpu = contract(cu_tensor1, cu_tensor2)
2048×2048 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
519.256 502.685 508.305 499.599 522.73 525.856 522.624 510.051 512.903 519.176 528.907 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
515.976 499.278 509.186 490.304 515.398 515.891 511.249 499.848 500.446 515.795 520.516 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
510.431 502.946 511.511 492.046 515.862 517.249 510.14 497.419 505.281 512.77 518.809 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
521.561 514.658 509.468 501.92 525.946 527.689 520.318 504.073 515.248 509.789 523.559 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
521.297 503.255 512.17 495.982 518.224 525.764 520.748 505.266 515.825 518.52 521.094 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512.909 503.069 504.048 486.549 521.421 523.918 515.724 502.2 498.471 505.944 511.893 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512.184 504.861 496.004 488.859 518.694 519.71 508.974 491.921 509.117 510.945 513.675 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
503.389 499.401 505.807 481.635 511.843 519.978 515.158 499.42 501.005 505.916 516.758 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
509.352 504.151 508.406 483.566 513.979 527.051 512.119 503.92 505.231 511.086 518.15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
518.336 504.808 509.284 496.577 515.46 526.837 517.211 499.784 506.745 509.873 517.186 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
514.408 506.958 508.64 497.092 516.358 523.78 517.314 502.956 507.395 508.209 525.776 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
523.338 502.725 515.613 495.124 516.755 527.762 523.037 506.025 511.255 521.471 531.632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512.79 496.735 502.622 495.738 516.733 526.588 520.814 501.285 507.721 509.786 517.346 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
529.416 519.35 521.674 511.121 538.795 537.796 526.503 510.938 515.925 528.625 539.632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
517.249 509.661 510.698 498.491 521.574 529.117 513.963 499.958 515.971 516.778 524.376 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 539.023 526.559 514.493 519.713 517.316 537.925 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
516.104 513.338 509.786 503.44 517.002 532.929 524.534 503.017 513.091 514.49 525.879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
513.442 499.847 506.063 486.698 515.506 518.1 512.021 491.542 501.344 505.82 514.353 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
516.77 499.638 505.619 483.69 510.385 518.392 502.505 499.113 502.033 516.214 520.229 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
509.154 500.312 509.26 492.161 510.888 526.873 508.441 499.423 504.183 513.042 520.858 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
515.513 505.091 504.411 490.614 516.265 515.871 506.809 506.871 501.126 508.47 520.99 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
515.861 505.887 505.846 495.802 510.078 516.646 515.991 498.131 509.188 511.239 520.735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

(2) Printing error:

julia> c_gpu = contract(cu_tensor1, cu_tensor2)
2048×2048 Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
519.256 502.685 508.305 499.599 522.73 525.856 522.624 510.051 512.903 519.176 528.907 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
515.976 499.278 509.186 490.304 515.398 515.891 511.249 499.848 500.446 515.795 520.516 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
510.431 502.946 511.511 492.046 515.862 517.249 510.14 497.419 505.281 512.77 518.809 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
521.561 514.658 509.468 501.92 525.946 527.689 520.318 504.073 515.248 509.789 523.559 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
521.297 503.255 512.17 495.982 518.224 525.764 520.748 505.266 515.825 518.52 521.094 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512.909 503.069 504.048 486.549 521.421 523.918 515.724 502.2 498.471 505.944 511.893 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512.184 504.861 496.004 488.859 518.694 519.71 508.974 491.921 509.117 510.945 513.675 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
503.389 499.401 505.807 481.635 511.843 519.978 515.158 499.42 501.005 505.916 516.758 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
509.352 504.151 508.406 483.566 513.979 527.051 512.119 503.92 505.231 511.086 518.15
Error showing value of type Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
ERROR: ArgumentError: can't repeat a string -2 times
Stacktrace:
[1] repeat(s::String, r::Int64)
@ Base ./strings/substring.jl:249
[2] print_matrix_row(io::IOContext{Base.TTY}, X::AbstractVecOrMat, A::Vector{Tuple{Int64, Int64}}, i::Int64, cols::Vector{Int64}, sep::String, idxlast::Int64)
@ Base ./arrayshow.jl:118
[3] _print_matrix(io::IOContext{Base.TTY}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})
@ Base ./arrayshow.jl:254
[4] print_matrix(io::IOContext{Base.TTY}, X::Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64) (repeats 2 times)
@ Base ./arrayshow.jl:171
[5] print_array
@ ./arrayshow.jl:358 [inlined]
[6] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, X::Tensor{Float64, 2, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}})
@ Base ./arrayshow.jl:399
[7] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:267
[8] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:521
[9] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:260
[10] display(d::REPL.REPLDisplay, x::Any)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:272
[11] display(x::Any)
@ Base.Multimedia ./multimedia.jl:328
[12] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[13] invokelatest
@ ./essentials.jl:726 [inlined]
[14] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:296
[15] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:278
[16] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:521
[17] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:276
[18] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:857
[19] (::VSCodeServer.var"#98#101"{REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt}})(mi::REPL.LineEdit.MIState, buf::IOBuffer, ok::Bool)
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:122
[20] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[21] invokelatest
@ ./essentials.jl:726 [inlined]
[22] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
@ REPL.LineEdit ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/LineEdit.jl:2510
[23] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
@ REPL ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/REPL/src/REPL.jl:1248
[24] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
@ REPL ./task.jl:484
I cannot see the overflow errors. Does it say anything in particular?
c_gpu = contract(cu_tensor1, cu_tensor2);

Still, this is a strange error. We should check that.
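One way to check the numbers without going through the (apparently broken) display path is to compare a few entries of the GPU result against the CPU one directly. A minimal sketch, assuming the CPU result from the same contraction is available:

using CUDA

c_cpu = contract(tensor1, tensor2)
c_gpu = contract(cu_tensor1, cu_tensor2);   # trailing semicolon avoids the faulty display

# Spot-check a couple of entries; @allowscalar makes the element-wise GPU reads explicit.
CUDA.@allowscalar c_cpu[1, 1] ≈ c_gpu[1, 1] && c_cpu[end, end] ≈ c_gpu[end, end]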
Indeed, it should be some problem with printing the data from the VRAM, since both results (from CPU and GPU) match if we index them manually (without printing). I don't think we should care about this anymore. Anyway, my main concern is that the executions I performed on the GPU are always slower than on the CPU, even for large matrices or a 3-tensor, as I explained in my previous post. I just checked it again with Base.@time / Base.@elapsed and their CUDA analogues, this time with:

# Execute it for the first time to let Julia compile the code (this is done only once):
julia> c_cpu = contract(tensor1, tensor2);
julia> c_gpu = contract(cu_tensor1, cu_tensor2);
# Now let's check those times with @time:
julia> Base.@time contract(tensor1, tensor2);
3.296003 seconds (84 allocations: 512.006 MiB, 0.26% gc time)
julia> Base.@time contract(tensor1, tensor2);
3.320518 seconds (84 allocations: 512.006 MiB)
julia> Base.@time contract(tensor1, tensor2);
3.295321 seconds (84 allocations: 512.006 MiB, 0.49% gc time)
julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
4.424156 seconds (22.50 k CPU allocations: 1.155 MiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)
julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
4.382499 seconds (172 CPU allocations: 9.797 KiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)
julia> CUDA.@time contract(cu_tensor1, cu_tensor2);
4.359802 seconds (173 CPU allocations: 10.094 KiB) (1 GPU allocation: 512.000 MiB, 0.02% memmgmt time)
# Now with @elapsed:
julia> CUDA.@elapsed contract(cu_tensor1, cu_tensor2)
4.3339796f0
julia> CUDA.@elapsed contract(cu_tensor1, cu_tensor2)
4.3237624f0
julia> Base.@elapsed contract(tensor1, tensor2)
3.305727348
julia> Base.@elapsed contract(tensor1, tensor2)
3.311123997

Maybe we should take a look at the particular implementation in Tensors, and more specifically in OMEinsum?
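As a rough sanity check of those timings, the achieved throughput can be estimated and compared against what a plain CUBLAS Float64 matmul reaches on the same card. A sketch (the size n should be whichever one was actually used for the timings above):

# An (i,j)×(j,k) contraction of n×n matrices performs about 2n^3 floating-point operations.
gflops(n, seconds) = 2 * n^3 / seconds / 1e9

# Example usage with the wall times reported above (plug in the actual n):
# gflops(n, 3.3)   # CPU contraction
# gflops(n, 4.4)   # GPU contraction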
I think that performing a simple operation like matrix multiplication could be a good first step to check the performance of the GPU. This is a computation-heavy operation that should benefit from parallel processing on a GPU, especially for large matrices. For example:

using CUDA
# Create two large matrices
n = 8192
A = rand(n, n)
B = rand(n, n)
# Transfer the matrices to the GPU
A_d = cu(A)
B_d = cu(B)
# Warmup run to compile the functions
C_cpu = A * B
C_gpu = A_d * B_d
# Measure the time on the CPU
@time A * B
# Measure the time on the GPU
CUDA.@time A_d * B_d

@HerManNav, could you try this?
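One detail to watch when running this: cu(A) converts to Float32 by default, whereas the earlier Tensor tests wrapped CuArray(data), which keeps Float64. For an apples-to-apples comparison with the contraction benchmarks, a sketch that keeps everything in Float64:

using CUDA

n = 8192
A = rand(n, n)     # Float64 on the CPU
B = rand(n, n)

# CuArray(A) preserves the element type; cu(A) would demote to Float32.
A_d = CuArray(A)
B_d = CuArray(B)

C_cpu = A * B      # warm-up / reference (CPU BLAS)
C_gpu = A_d * B_d  # warm-up (CUBLAS)

@time A * B;
CUDA.@time A_d * B_d;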
Completely. I mean: isn't there any documentation about GPU computing in OMEinsum?
The only thing I found is a reference to the lack of GPU support in OMEinsum v0.3.0, while I currently use a much more recent version (v0.7.4): https://under-peter.github.io/OMEinsum.jl/stable/implementation/

I also found some old (2019-2020) issues in the OMEinsum.jl repo saying there was an incompatibility between OMEinsum and CuArrays, but no performance issues; those problems were already fixed and closed.

As for the matrix multiplication @jofrevalles proposed, it works perfectly and with a huge speed-up. So we can conclude the problem is in the integration of OMEinsum with CuArrays.
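To pin the slowdown down further, it may help to call OMEinsum directly on raw CuArrays, bypassing the Tensor wrapper, so the wrapper and the einsum backend can be timed separately. A sketch (assuming OMEinsum v0.7.x with CUDA loaded):

using CUDA, OMEinsum

n = 1024
A_d = CuArray(rand(n, n))
B_d = CuArray(rand(n, n))

# Plain einsum contraction on CuArrays, no Tensor wrapper involved.
C_d = ein"ij,jk->ik"(A_d, B_d)

CUDA.@time ein"ij,jk->ik"(A_d, B_d);   # OMEinsum path on the GPU
CUDA.@time A_d * B_d;                  # reference: direct CUBLAS matmul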
Okay, let's discuss it together tomorrow.
I discovered that VRAM is not being properly freed (at least not always) when executing consecutive contraction operations. I compared this against contractions on the CPU with regular Arrays, performing several contractions that overwrite the same variable. On the CPU, Julia's GC works as expected. On the GPU, however, VRAM fills up at a certain point and starts swapping from there on, taking much longer. This could be caused by either CuArrays or Tensors.
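For reference, a sketch of how this can be observed and partially mitigated from the REPL. Note that CUDA.jl's memory pool caches freed blocks, so usage can look high even after Julia's GC has collected the arrays; the CUDA.jl calls below are standard, while contract and the tensors are the ones from above:

using CUDA

for i in 1:100
    c = contract(cu_tensor1, cu_tensor2)   # repeatedly overwrite the same variable
    if i % 10 == 0
        CUDA.memory_status()   # report used / reserved VRAM
        GC.gc()                # let Julia's GC drop dead CuArray references
        CUDA.reclaim()         # return cached blocks from the memory pool to the driver
    end
end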
Nice catch!
@HerManNav, can you write the results you obtain here?