Large LocalArray eltypes runs into compiler heuristics #99

Open
maleadt opened this issue Jul 14, 2022 · 5 comments

Comments

@maleadt
Member

maleadt commented Jul 14, 2022

The following works:

function broken_kernel()
    c_frags = LocalArray{Tuple{16}, CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}}(undef)

    frag = CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}(ntuple(_->Float16(0), 8))
    setindex(c_frags, frag, 1)

    return
end

CUDA.code_llvm(GemmKernels.Kernel.broken_kernel, Tuple{})

But bumping the LocalArray size parameter from Tuple{16} to Tuple{64} results in an apply_iterate call in the generated code.

@maleadt
Member Author

maleadt commented Nov 8, 2022

Apparently I fixed that in JuliaLang/julia#46050, so 1.8 is supported. I've added it to CI.

@maleadt maleadt closed this as completed Nov 8, 2022
@maleadt
Member Author

maleadt commented Nov 8, 2022

Let's re-open this to keep track of the max fragment size though. cc @wardvermeulen

@maleadt maleadt reopened this Nov 8, 2022
@maleadt maleadt changed the title from "1.8 compatibility" to "Large LocalArray eltypes runs into compiler heuristics" Nov 8, 2022
@maleadt
Member Author

maleadt commented Jun 27, 2023

Updated MWE:

using GemmKernels, CUDA

using GemmKernels: LocalArray
using Base: setindex

function kernel()
    c_frags = LocalArray{Tuple{64}, Float32, 1, 64}(undef)
    setindex(c_frags, 0f0, 1)
    return
end

function main()
    CUDA.code_llvm(kernel, Tuple{})
end

isinteractive() || main()

The apply_iterate can be avoided by setting the tuple_splat inference and optimization params (max_tuple_splat on 1.10) to a value above their default of 32. @wardvermeulen Where in the FPU operator implementation did you account for this?
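
For readers who want to check which threshold their Julia version uses, here is a minimal sketch. It assumes that `Core.Compiler.InferenceParams` (an internal, unstable API) can be constructed with no arguments, and it matches the field by substring because the spelling differs between Julia versions. The alias `IP` is introduced only for this illustration.

```julia
# Sketch: print the tuple-splat threshold of the running Julia version.
# Assumption: Core.Compiler.InferenceParams is compiler-internal and may change;
# the field name varies by version, so it is matched by substring, not hard-coded.
const IP = Core.Compiler.InferenceParams

params = IP()  # default inference parameters
for name in fieldnames(IP)
    if occursin("tuple_splat", lowercase(String(name)))
        println(name, " = ", getfield(params, name))
    end
end
```

This only inspects the CPU compiler defaults; how a raised value would be passed to the GPU compilation pipeline is not shown here.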

@wardvermeulen
Collaborator

> The apply iterate can be avoided by setting the tuple_splat inference and optimization params (max_tuple_splat on 1.10) to a higher value. It defaults to 32. @wardvermeulen Where in the FPU operator implementation did you account for this?

If you are referring to providing safeguards so this behavior does not occur, I did not account for this in the implementation.

@maleadt
Member Author

maleadt commented Jun 27, 2023

Ah OK, I thought there were some hard-coded limits that relate to the 16-element LocalArray limit.

2 participants