Large LocalArray eltypes runs into compiler heuristics #99

maleadt · 2022-07-14T13:07:15Z

The following works:

function broken_kernel()
    c_frags = LocalArray{Tuple{16}, CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}}(undef)

    frag = CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}(ntuple(_->Float16(0), 8))
    setindex(c_frags, frag, 1)

    return
end

CUDA.code_llvm(GemmKernels.Kernel.broken_kernel, Tuple{})

But bumping the size parameter from Tuple{16} to Tuple{64} results in an apply_iterate.

maleadt · 2022-11-08T15:29:59Z

Apparently I fixed that in JuliaLang/julia#46050, so 1.8 is supported. I've added it to CI.

maleadt · 2022-11-08T17:02:50Z

Let's re-open this to keep track of the max fragment size though. cc @wardvermeulen

maleadt · 2023-06-27T10:17:23Z

Updated MWE:

using GemmKernels, CUDA

using GemmKernels: LocalArray
using Base: setindex

function kernel()
    c_frags = LocalArray{Tuple{64}, Float32, 1, 64}(undef)
    setindex(c_frags, 0f0, 1)
    return
end

function main()
    CUDA.code_llvm(kernel, Tuple{})
end

isinteractive() || main()

The apply iterate can be avoided by setting the tuple_splat inference and optimization params (max_tuple_splat on 1.10) to a higher value. It defaults to 32. @wardvermeulen Where in the FPU operator implementation did you account for this?

wardvermeulen · 2023-06-27T11:06:36Z

> The apply iterate can be avoided by setting the tuple_splat inference and optimization params (max_tuple_splat on 1.10) to a higher value. It defaults to 32. @wardvermeulen Where in the FPU operator implementation did you account for this?

If you are referring to providing safeguards so this behavior does not occur, I did not account for this in the implementation.

maleadt · 2023-06-27T11:09:58Z

Ah OK, I thought there were some hard-coded limits that relate to the 16-element LocalArray limit.
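
For readers without a GPU at hand, the tuple-length sensitivity discussed above can be probed on the CPU with plain tuples. This is only a rough sketch: it assumes that LocalArray's setindex ends up in Base.setindex on a tuple (as the linked setindex(::Tuple) issue suggests), and the helpers update_first and uses_apply_iterate are hypothetical names introduced here, not part of GemmKernels or CUDA. The result depends on the Julia version and its compiler parameters, so no expected output is shown.

```julia
using InteractiveUtils: code_typed
using Base: setindex

# Hypothetical helper: replace the first element of a tuple.
update_first(t::Tuple) = setindex(t, 1f0, 1)

# Hypothetical helper: report whether the optimized IR for an n-element
# Float32 tuple still contains a call to `_apply_iterate`.
function uses_apply_iterate(n::Integer)
    t = ntuple(_ -> 0f0, n)
    ci, _ = only(code_typed(update_first, Tuple{typeof(t)}))
    return occursin("_apply_iterate", string(ci))
end

# Compare tuple lengths below, at, and above the default limit of 32
# mentioned in this thread.
for n in (16, 32, 64)
    println("length ", n, ": _apply_iterate present = ", uses_apply_iterate(n))
end
```

A `true` for the 64-element case on a given Julia version would point to the same heuristic that the GPU kernel runs into; a `false` does not rule it out, since the GPU compiler may use different parameters.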

maleadt mentioned this issue Jul 15, 2022

setindex(::Tuple) performance regression JuliaLang/julia#46049

Closed

maleadt closed this as completed Nov 8, 2022

maleadt reopened this Nov 8, 2022

maleadt changed the title ~~1.8 compatibility~~ Large LocalArray eltypes runs into compiler heuristics Nov 8, 2022

wardvermeulen mentioned this issue May 14, 2023

FPU operator #101

Merged

maleadt mentioned this issue Sep 7, 2023

Questions about usage of registers #152

Closed

wardvermeulen mentioned this issue Nov 1, 2023

FPU operator issues #165

Open
