
[NDTensors] Fix scalar indexing issue for Diag broadcast on GPU #1497

Merged · 49 commits · Jun 21, 2024

Conversation

@kmp5VT (Collaborator) commented Jun 13, 2024

Description

Fixes #1482.

Checklist:

  • Tests for the new NDTensors.map_diag! and NDTensors.map_diag functions.
  • Fix Jenkins CI (see comment below for a suggestion).
  • Rename NDTensorsGPUArraysCoreExt/diag.jl to NDTensorsGPUArraysCoreExt/blocksparsetensor.jl.
  • Verify that broadcasting over the Diag datatype produces correct results on GPU backends.
  • Create unit tests for Diag and DiagBlockSparse.

@mtfishman mtfishman changed the title [NDTensorsGPUArraysCoreExt] Address issue 1482, scalar indexing for Diag Broadcast on GPU [NDTensorsGPUArraysCoreExt] Fix scalar indexing issue for Diag Broadcast on GPU Jun 13, 2024
@mtfishman (Member) commented Jun 13, 2024

Thanks for looking into this @kmp5VT. It seems better to me to just change the definition of:

function permutedims!(
  R::DiagTensor{<:Number,N},
  T::DiagTensor{<:Number,N},
  perm::NTuple{N,Int},
  f::Function=(r, t) -> t,
) where {N}
  # ...
end

to the one you've defined for GPU (or something similar, see my next comment) so we don't have to support two implementations of that function.

Also, I see for loops/scalar indexing in a number of other functions:

function diag(tensor::DiagTensor)
  # ...
end

# I see this one is using some ad-hoc dispatch
# for GPU but it seems like we should either use
# `expose` or come up with a better code pattern,
# say use a slicing operation, that is generic
# to CPU and GPU.
function dense(::Type{<:Array}, T::DiagTensor)
  # ...
end

function permutedims!(
  R::DenseTensor{ElR,N}, T::DiagTensor{ElT,N}, perm::NTuple{N,Int}, f::Function=(r, t) -> t
) where {ElR,ElT,N}
  # ...
end

Can you look into those as well?

@mtfishman (Member) commented:

One recommendation I have would be to look into the code patterns I came up with in NDTensors.DiagonalArrays for more conveniently working with diagonal values. For example: https://github.com/ITensor/ITensors.jl/blob/v0.6.11/NDTensors/src/lib/DiagonalArrays/src/diaginterface/diaginterface.jl.

As a demonstration:

function diaglength(a::AbstractArray)
  return minimum(size(a))
end
diaglength(a::AbstractArray{<:Any,0}) = 1

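# Linear-index distance between consecutive diagonal elements
# (for an m × n matrix this is m + 1).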
function diagstride(a::AbstractArray)
  s = 1
  p = 1
  for i in 1:(ndims(a) - 1)
    p *= size(a, i)
    s += p
  end
  return s
end

function diagindices(a::AbstractArray)
  maxdiag = LinearIndices(a)[CartesianIndex(ntuple(Returns(diaglength(a)), ndims(a)))]
  return 1:diagstride(a):maxdiag
end
function diagindices(a::AbstractArray{<:Any,0})
  return Base.OneTo(1)
end

function diagview(a::AbstractArray)
  return @view a[diagindices(a)]
end

using LinearAlgebra: Diagonal
function diagview(a::Diagonal)
  return a.diag
end

x = Diagonal(randn(4))
y = randn(4, 4)

using Metal: mtl
x = mtl(x)
y = mtl(y)

diagview(y) .= diagview(x)

That appears to avoid scalar indexing, but I'm not sure if it is actually fast on GPU. If diagview were defined in a similar way for Tensor objects, such as DiagTensor and DenseTensor, we could use a code pattern like that to implement most of the functions listed above that currently use scalar indexing/explicit for loops in a simple way, just in terms of broadcasting over diagview (see the sketch below).
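
For example, here is a hypothetical sketch (not the code merged in this PR) of two of the functions listed above, rewritten purely in terms of broadcasting over diagview, assuming diagview overloads exist for the relevant Tensor types:

# Extract the diagonal as a vector without scalar indexing.
diag(tensor::DiagTensor) = copy(diagview(tensor))

# Permuting a diagonal tensor only touches its diagonal (the diagonal
# indices are invariant under any permutation of the dimensions), so
# the whole operation reduces to a broadcast over the two diagonals.
function permutedims!(
  R::DiagTensor{<:Number,N},
  T::DiagTensor{<:Number,N},
  perm::NTuple{N,Int},
  f::Function=(r, t) -> t,
) where {N}
  diagview(R) .= f.(diagview(R), diagview(T))
  return R
end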

@codecov-commenter commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.84%. Comparing base (82cfd76) to head (cb8d766).
Report is 9 commits behind head on main.

Current head cb8d766 differs from pull request most recent head 51323fc

Please upload reports for the commit 51323fc to get more accurate results.


Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1497       +/-   ##
===========================================
- Coverage   78.05%   60.84%   -17.21%     
===========================================
  Files         148      148               
  Lines        9679     9672        -7     
===========================================
- Hits         7555     5885     -1670     
- Misses       2124     3787     +1663     


@mtfishman mtfishman changed the title [NDTensorsGPUArraysCoreExt] Fix scalar indexing issue for Diag Broadcast on GPU [NDTensors] Fix scalar indexing issue for Diag Broadcast on GPU Jun 13, 2024
@mtfishman mtfishman changed the title [NDTensors] Fix scalar indexing issue for Diag Broadcast on GPU [NDTensors] Fix scalar indexing issue for Diag broadcast on GPU Jun 13, 2024
Review thread on NDTensors/test/test_diag.jl (outdated, resolved)
@mtfishman (Member) commented:

@kmp5VT I think it could simplify the code to define overloads of DiagonalArrays.diagview for DenseTensor and DiagTensor.
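
For illustration, a minimal sketch of what those overloads might look like (the accessor names array and data are assumptions about NDTensors internals here, not necessarily what this PR ends up using):

using NDTensors.DiagonalArrays: DiagonalArrays, diagview

# A DenseTensor's diagonal is a strided view into its underlying array.
DiagonalArrays.diagview(T::DenseTensor) = diagview(array(T))

# A (nonuniform) DiagTensor stores exactly its diagonal as its data.
DiagonalArrays.diagview(T::DiagTensor) = data(T)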

@kmp5VT (Collaborator, Author) commented Jun 19, 2024

Using the new functionality in this PR, can we rewrite that as something like:

function dense(T::DiagTensor)
  R = zeros(dense(typeof(T)), inds(T))
  diagview(R) .= diagview(T)
  return R  
end

i.e. get rid of the dispatch on the unwrapped array type?

I think it is a good idea to remake the function like you are suggesting. One previous point about these functions is that UniformDiag storage types always convert to Array, which could cause issues with GPU code. Right now we have that fixed in other places, but I think it would be a good idea to create a function like:

function dense(T::DiagTensor, ArrayT::Type{<:AbstractArray})
  R = adapt(ArrayT, zeros(dense(typeof(T)), inds(T)))
  diagview(R) .= diagview(T)
  return R
end

Also, with the code as it is now, we convert NonuniformDiag storage types from CPU back to GPU with this line of code: return adapt(unwrap_array_type(T), D_cpu). But this would fail if T::UniformDiag, because unwrap_array_type(T) = Number.

@mtfishman (Member) commented:

What if, in those places in the code where we were using that more complicated version of dense, we instead called adapt explicitly, i.e. changed calls like dense(array_type, T) to adapt(array_type, dense(T))?
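
A runnable illustration of that suggested pattern (hypothetical call site; on CPU the target array type is just Array):

using Adapt: adapt
using NDTensors: Diag, dense, tensor

T = tensor(Diag(3, 3), (3, 3))  # uniform Diag example, as in the REPL session below
# Old pattern: R = dense(array_type, T)
# New pattern: adapt the result after densifying.
R = adapt(Array, dense(T))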

@kmp5VT (Collaborator, Author) commented Jun 19, 2024

@mtfishman Actually, the code you wrote works fine without using adapt:

julia> using NDTensors, Metal
julia> A = tensor(Diag(3,3), (3,3))
julia> dense(A)
Dim 1: 3
Dim 2: 3
Dense{Int64, Vector{Int64}}
 3×3
 3  0  0
 0  3  0
 0  0  3
julia> dense(mtl(A))
Dim 1: 3
Dim 2: 3
Dense{Int64, MtlVector{Int64, Private}}
 3×3
 3  0  0
 0  3  0
 0  0  3

@mtfishman (Member) commented Jun 19, 2024

@kmp5VT could you try adding Pkg.Registry.update(); Pkg.update(); to the Jenkins script, like I did in ITensor/ITensorGPU.jl#9? That may solve the test failures, since it will force Julia to update the registry. I think for some reason it is just seeing an old registry, and therefore only seeing old package versions, so it doesn't know that it can upgrade to BlockArrays v1.1.
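
For reference, a sketch of those calls as a Julia setup step (the actual Jenkins script layout is in ITensor/ITensorGPU.jl#9):

using Pkg
# Force a registry refresh so CI sees newly released package versions
# (e.g. BlockArrays v1.1) instead of a stale cached registry.
Pkg.Registry.update()
Pkg.update()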

@mtfishman (Member) commented Jun 19, 2024

Besides the last two comments I left, this looks good to me.

I think with this change (and also JuliaGPU/Metal.jl#374), we are probably getting very close to having every NDTensors/ITensors operation working on GPU, which is a great milestone! I'm sure a few lingering issues will come up, and there will be continued work on improving performance (particularly for block sparse operations), but support for GPU operations seems quite good now (at least for our supported backends CUDA, Metal, and ROCm), assuming the lack of user reports of issues is a good indicator.

Probably the biggest missing piece (as summarized in https://itensor.github.io/ITensors.jl/dev/RunningOnGPUs.html) is support for Intel GPUs through oneAPI.jl, which presumably would not be difficult to add and could follow the design used in #1325 to add support for AMD GPUs; in fact, that PR could probably just be copied and translated to oneAPI.jl, at least as a good start.

@mtfishman (Member) commented Jun 20, 2024

Not sure why the ITensorGaussianMPS downstream test timed out (maybe just a random failure in GitHub Actions), but it looks like that change fixed the Jenkins tests. EDIT: I reran it; those are passing now.

@mtfishman (Member) commented Jun 20, 2024

Can you add tests for NDTensors.map_diag! and NDTensors.map_diag?
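
A hypothetical sketch of such tests (the exact signatures are assumptions here: a (f, dest, src) argument order is assumed for the in-place version, and diagview is assumed to work on Tensor objects as discussed above):

using Test
using NDTensors
using NDTensors: Diag, tensor
using NDTensors.DiagonalArrays: diagview

A = tensor(Diag([1.0, 2.0, 3.0]), (3, 3))

# Out-of-place: map a function over the stored diagonal.
B = NDTensors.map_diag(x -> 2x, A)
@test collect(diagview(B)) == [2.0, 4.0, 6.0]

# In-place: write the mapped diagonal of A into destination C.
C = copy(A)
NDTensors.map_diag!(x -> x + 1, C, A)
@test collect(diagview(C)) == [2.0, 3.0, 4.0]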

@mtfishman mtfishman merged commit a1e7ec5 into ITensor:main Jun 21, 2024
15 checks passed
Development

Successfully merging this pull request may close these issues.

[NDTensors] [BUG] Scalar indexing issue when broadcasting tensors with Diag storage on GPU
3 participants