
[NDTensorsCUDAExt] Remove and test for scalar indexing #1245

Merged · 62 commits merged into ITensor:main · Nov 15, 2023

Conversation

kmp5VT
Collaborator

@kmp5VT kmp5VT commented Nov 9, 2023

Description

@mtfishman I thought it was weird that there was a scalar indexing warning in this bug report https://itensor.discourse.group/t/cuda-issue-when-converting-mps-to-mpo/1274/2, so I looked into it. I believe Julia determines which mul! method to use based on the type of the C matrix. When C is a Transpose, the generic matmul mul! implementation is called, which causes a scalar indexing problem. It's completely fine if A or B is a Transpose; the CUDA mul! kernel will still be called. So I created this exposed mul! function to fix the problem.
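For context, a minimal sketch of the kind of specialization described above (an illustration under assumptions, not necessarily the exact definition added to NDTensorsCUDAExt): dispatch on a Transpose-wrapped CuMatrix destination and rewrite the product so the plain CuMatrix is the destination, which lets the CUBLAS-backed mul! be selected.

using CUDA, LinearAlgebra

# Sketch only: transpose(C) = α*A*B + β*transpose(C) is equivalent to
# C = α*transpose(B)*transpose(A) + β*C, so dispatch sees a plain CuMatrix
# destination and the CUBLAS kernel is used instead of the generic
# (scalar-indexing) fallback. The three-argument mul!(C, A, B) forwards to
# this five-argument method as well.
function LinearAlgebra.mul!(
  CT::Transpose{<:Any,<:CuMatrix}, A::AbstractMatrix, B::AbstractMatrix, α::Number, β::Number
)
  mul!(parent(CT), transpose(B), transpose(A), α, β)
  return CT
end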

@kmp5VT kmp5VT requested a review from mtfishman November 9, 2023 21:15
@mtfishman
Member

The issue there turned out to be that one of the tensors was actually on CPU (since the delta tensor got allocated there by default); if you look at the stack trace, there is a Matrix instead of a CuMatrix. So are you certain this PR actually fixes a known issue?

@codecov-commenter

codecov-commenter commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (d9265a2) 85.44% vs. head (ae147c7) 54.81%.
Report is 1 commit behind head on main.

❗ Current head ae147c7 differs from pull request most recent head 21aa398. Consider uploading reports for the commit 21aa398 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1245       +/-   ##
===========================================
- Coverage   85.44%   54.81%   -30.63%     
===========================================
  Files          89       88        -1     
  Lines        8401     8348       -53     
===========================================
- Hits         7178     4576     -2602     
- Misses       1223     3772     +2549     


@kmp5VT
Collaborator Author

kmp5VT commented Nov 10, 2023

The issue there turned out to be that one of the tensors was actually on CPU (since the delta tensor got allocated there by default); if you look at the stack trace, there is a Matrix instead of a CuMatrix. So are you certain this PR actually fixes a known issue?

@mtfishman, yes, the direct problem in that issue was related to the delta tensor, but there was also a warning about scalar indexing. When I ran the test with CUDA.allowscalar(false), it pointed me to scalar indexing in mul! with A and C being Transpose-wrapped, so I made this test:

using ITensors, NDTensors
i,j,k = Index.((2,2,2))
A = NDTensors.cu(randomTensor(j,i)); B = NDTensors.cu(randomTensor(j,k)); C = NDTensors.cu(randomTensor(k,i));
contract(C,(3,1), A, (-2,1), B, (-2,3))

I also tried with just transpose(C) = A * B and transpose(C) = A * transpose(B), and all three scenarios fail (use scalar indexing); it looks like it's because of the Transpose wrapper around C. But this change fixes all three.
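For concreteness, the three failing combinations look like the following (a hedged sketch; the square CuMatrix placeholders A, B, C are assumed here so every wrapped product is dimensionally valid):

using CUDA, LinearAlgebra

# Square placeholders so each wrapped combination has compatible sizes.
A, B, C = ntuple(_ -> CUDA.randn(Float64, 2, 2), 3)
CUDA.allowscalar(false)

# Each call selects the generic fallback (scalar indexing) when the destination
# is wrapped in Transpose, regardless of whether A or B is also wrapped:
mul!(transpose(C), A, B)            # transpose(C) = A * B
mul!(transpose(C), A, transpose(B)) # transpose(C) = A * transpose(B)
mul!(transpose(C), transpose(A), B) # transpose(C) = transpose(A) * B (the case hit by contract above)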

@mtfishman
Member

I see. Can you add tests for those cases? Ideally they would use CuMatrix and mul! directly rather than testing it through contract!.

@kmp5VT
Collaborator Author

kmp5VT commented Nov 10, 2023

I see. Can you add tests for those cases? Ideally they would use CuMatrix and mul! directly rather than testing it through contract!.

Yes, I can add the tests, and it shows the same error with:

using CUDA, LinearAlgebra
A = fill!(CuMatrix{Float64, CUDA.Mem.DeviceBuffer}(undef, (2,3)), randn(Float64))
B = fill!(CuMatrix{Float64, CUDA.Mem.DeviceBuffer}(undef, (3,4)), randn(Float64))
C = fill!(CuMatrix{Float64, CUDA.Mem.DeviceBuffer}(undef, (4,2)), randn(Float64))
CUDA.allowscalar(false)
mul!(transpose(C), A, B, 1.0, 0.0)

And the code throws an error.
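A hypothetical test along these lines (names and values assumed here, not the PR's actual test file) could check that mul! into a wrapped GPU destination runs with scalar indexing disallowed and matches the CPU result:

using CUDA, LinearAlgebra, Test

CUDA.allowscalar(false)
A = CUDA.randn(Float64, 2, 3)
B = CUDA.randn(Float64, 3, 4)
C = CUDA.randn(Float64, 4, 2)
# Passes only if mul! into a Transpose-wrapped CuMatrix avoids the generic
# scalar-indexing fallback; compare against the CPU product.
mul!(transpose(C), A, B, 1.0, 0.0)
@test collect(transpose(C)) ≈ collect(A) * collect(B)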

@mtfishman
Member

I guess Metal has the same issue:

julia> using Metal

julia> A, B, C = mtl.((randn(2, 2), randn(2, 2), randn(2, 2)))
(Float32[1.9648244 -0.14690858; 0.841998 0.8000668], Float32[0.29232308 -1.0040092; -0.30180508 -0.26494592], Float32[1.9440461 -0.0030814873; -0.13209581 0.10974641])

julia> using LinearAlgebra

julia> mul!(transpose(C), A, B)
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000000010ae4c010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:106
2×2 transpose(::MtlMatrix{Float32, Metal.MTL.MTLResourceStorageModePrivate}) with eltype Float32:
 0.618701    -1.93378
 0.00467122  -1.05735

@mtfishman
Member

Also with adjoint (Hermitian transposition):

julia> using Metal

julia> A, B, C = mtl.((randn(2, 2), randn(2, 2), randn(2, 2)))
(Float32[0.418296 -0.7982497; 2.1851687 0.10532443], Float32[0.028218202 0.96657676; -1.4152029 -0.11327304], Float32[1.0778706 0.3015457; -0.34520924 -0.62605405])

julia> using LinearAlgebra

julia> mul!(C', A, B)
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000000010ad68010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:106
2×2 adjoint(::MtlMatrix{Float32, Metal.MTL.MTLResourceStorageModePrivate}) with eltype Float32:
  1.14149    0.494735
 -0.0873939  2.1002

@mtfishman
Member

So please add a similar definition for Adjoint.
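For reference, a minimal sketch of what such an Adjoint definition could look like (assumed for illustration, not the exact code added in this PR), mirroring the Transpose case: taking the adjoint of both sides of C' = α*A*B + β*C' gives C = conj(α)*B'*A' + conj(β)*C, so the plain CuMatrix stays the destination and CUBLAS handles the call.

using CUDA, LinearAlgebra

# Sketch only: route mul! with an Adjoint-wrapped CuMatrix destination through
# the unwrapped parent so the CUBLAS kernel is dispatched.
function LinearAlgebra.mul!(
  CA::Adjoint{<:Any,<:CuMatrix}, A::AbstractMatrix, B::AbstractMatrix, α::Number, β::Number
)
  mul!(parent(CA), adjoint(B), adjoint(A), conj(α), conj(β))
  return CA
end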

@kmp5VT kmp5VT marked this pull request as ready for review November 15, 2023 00:15
@kmp5VT kmp5VT changed the title [NDTensorsCUDAExt] More Scalar indexing fixes [NDTensorsCUDAExt] Address and remove all scalar indexing from NDTensors/test Nov 15, 2023
@mtfishman mtfishman changed the title [NDTensorsCUDAExt] Address and remove all scalar indexing from NDTensors/test [NDTensorsCUDAExt] Remove and test for scalar indexing Nov 15, 2023
NDTensors/test/dense.jl (review thread, outdated, resolved)
NDTensors/Project.toml (review thread, outdated, resolved)
@mtfishman
Member

Thanks @kmp5VT, this will be very useful for ensuring performance of GPU backends going forward. Is this ready to merge from your end once tests pass?

@kmp5VT
Collaborator Author

kmp5VT commented Nov 15, 2023

Thanks @kmp5VT, this will be very useful for ensuring performance of GPU backends going forward. Is this ready to merge from your end once tests pass?

@mtfishman, not a problem. Yes, this is ready to merge; I have no additional changes this round. Thanks!

@mtfishman mtfishman merged commit 61caa1b into ITensor:main Nov 15, 2023
8 checks passed
@kmp5VT kmp5VT deleted the kmp5/debug/scalar_indexing branch November 15, 2023 22:53