Polluted result using CUDA 3D rfft in Float32 data type #326
Comments
Hi @doraemonho — Nice! I’ll try to have a look at it soon. |
Inside the GetFk function the GPUgrid is called instead of the grid from the function’s argument. Is that the problem? |
Thanks for taking a look!
I rewrote the function a bit to define the GPUgrid inside the function (shown below), but the result doesn't change.
|
I’ll have a closer look in the morning from my laptop |
It occurs to me that the root of this problem is the fft plan itself.
I am not sure if this is an FFTW.jl or a CUDA.jl problem. |
Quick Update here |
Can we make sure it's not a plotting issue? I tried to do some FFTs back and forth and couldn't find any disagreement between different data types. Have a look. Here's an example just with FFTW and CUDA:
using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using Test
for T in [Float64, Float32]
    nx, ny, nz = 6, 8, 10
    nk, nl, nm = Int(nx//2)+1, ny, nz
    a = randn(T, (nx, ny, nz))
    a_cu = CuArray(a)
    ah = randn(Complex{T}, (nk, nl, nm))
    ah_cu = CuArray(ah)
    plan = plan_rfft(a)
    plan_cu = plan_rfft(a_cu)
    bh = zeros(Complex{T}, (nk, nl, nm))
    bh_cu = CuArray(bh)
    c = zeros(T, (nx, ny, nz))
    c_cu = CuArray(c)
    mul!(bh, plan, a)
    ldiv!(c, plan, deepcopy(bh))
    mul!(bh_cu, plan_cu, a_cu)
    ldiv!(c_cu, plan_cu, deepcopy(bh_cu))
    @testset "CPU test for $T" begin
        @test bh == rfft(a)
        @test c == irfft(bh, nx)
    end
    @testset "GPU test for $T" begin
        @test bh_cu == rfft(a_cu)
        @test c_cu == irfft(bh_cu, nx)
    end
end

and here's an example with CUDA + FourierFlows:
using CUDA
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test
for T in [Float64, Float32]
    nx, ny, nz = 6, 8, 10
    grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)
    grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)
    a = randn(T, (grid.nx, grid.ny, grid.nz))
    a_cu = CuArray(a)
    ah = randn(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm))
    ah_cu = CuArray(ah)
    plan = grid.rfftplan
    plan_cu = grid_cu.rfftplan
    bh = zeros(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm))
    bh_cu = CuArray(bh)
    c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
    c_cu = CuArray(c)
    mul!(bh, plan, a)
    ldiv!(c, plan, deepcopy(bh))
    mul!(bh_cu, plan_cu, a_cu)
    ldiv!(c_cu, plan_cu, deepcopy(bh_cu))
    @testset "CPU test for $T" begin
        @test bh == rfft(a)
        @test c == irfft(bh, nx)
    end
    @testset "GPU test for $T" begin
        @test bh_cu == rfft(a_cu)
        @test c_cu == irfft(bh_cu, nx)
    end
end

All tests pass for me. Same if I choose |
Thanks for checking!
I'm quite certain the problem is not coming from the plot, since the radial spectrum of the FP32 CUDA result shows a spike at large k.
using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test
nx, ny, nz = 128, 128, 128; # define size of the array
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
for T in [Float64, Float32]
    grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)
    grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)
    θT = convert(Array{T,3}, θ)
    eⁱᶿk⁻² = @. exp(im*θT)*grid.invKrsq;  # define the spectral filtered function on the CPU
    eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²);          # define the spectral filtered function on the GPU
    plan = grid.rfftplan
    plan_cu = grid_cu.rfftplan
    c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
    c_cu = CuArray(c);
    ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))
    ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
    c_cu = Array(c_cu);  # move the result from GPU -> CPU
    test_title = "eⁱᶿk⁻² CPU & GPU test for " * string(T)
    @testset "$test_title" begin
        # Compare the results using ≈
        @test c_cu ≈ c
    end
end |
Can you print the types of |
Before and after the conversion to Array for |
Do you mean before/after the rfft or before the testing process?
using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test
nx, ny, nz = 128, 128, 128; # define size of the array
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
for T in [Float64, Float32]
    grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)
    grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)
    θT = convert(Array{T,3}, θ)
    eⁱᶿk⁻² = @. exp(im*θT)*grid.invKrsq;  # define the spectral filtered function on the CPU
    eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²);          # define the spectral filtered function on the GPU
    println("before the rfft")
    @show typeof(eⁱᶿk⁻²)
    @show typeof(eⁱᶿk⁻²_cu)
    plan = grid.rfftplan
    plan_cu = grid_cu.rfftplan
    c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
    c_cu = CuArray(c);
    ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))
    ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
    c_cu = Array(c_cu);  # move the result from GPU -> CPU
    println("before the test")
    @show typeof(c)
    @show typeof(c_cu)
    test_title = "eⁱᶿk⁻² CPU & GPU test for " * string(T)
    @testset "$test_title" begin
        # Compare the results using ≈
        @test c_cu ≈ c
    end
end |
@doraemonho sorry for slacking... I went on leave and am now slowly coming back. Did you make any progress on this or is it still outstanding? |
@navidcy no worries, I still have this issue. |
Just an update from what we were discussing :-) |
@doraemonho thanks for this! |
Hi there! I'm working on cuFFT and following up from JuliaGPU/CUDA.jl#1656 (internal NVIDIA bug# 3847913). Sorry to hijack your issue - I thought it would be easier and faster to chat here directly. Do you have more information about what exactly is the input to the complex-to-real transform that is being done by cuFFT? The reason why this matters is the following. When doing a DFT of length N with a real input, the output satisfies X[k] = conj(X[N-k]); in particular X[0] is purely real and, for even N, so is X[N/2].
Given this, in practice, real-to-complex transforms return only the first N/2+1 complex outputs. Complex-to-real transforms take N/2+1 complex inputs, and those inputs are expected to satisfy the same symmetry: the first element and, for even N, the last element should be purely real.
One might assume that any imaginary part in those elements is simply ignored, but that is not guaranteed. What I suspect is that your input doesn't satisfy one of the above properties. In JuliaGPU/CUDA.jl#1656 you mentioned that
I've attached a cpp repro that shows this: when the last entry of the N/2+1 complex input to the complex-to-real FFT is not real (i.e., the imaginary part is non zero), in some cases (powers of 2 with enough batches on new-enough GPUs), the output will be wrong. If you compile and run it, for instance with CUDA 11.8, you will see
I realize this is a subtle issue. It might also have nothing to do with this, but I thought this was worth a shot. Let me know if this helps, |
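To make the symmetry above concrete, here is a minimal CPU-only sketch (my own illustration, using FFTW and an arbitrary even length N = 8) checking that a real signal's DFT has real k = 0 and k = N/2 entries and that rfft keeps only the non-redundant half:

using FFTW

N = 8
x = randn(N)              # real input of even length
X = fft(x)                # full complex DFT, length N
Xr = rfft(x)              # the N/2 + 1 non-redundant entries kept by rfft

@assert Xr ≈ X[1:N÷2+1]                       # rfft is the first half of the full DFT
@assert abs(imag(X[1])) < 1e-12               # the k = 0 entry is (numerically) real
@assert abs(imag(X[N÷2+1])) < 1e-12           # the k = N/2 entry is (numerically) real
@assert X[2:end] ≈ conj.(reverse(X[2:end]))   # X[k] == conj(X[N-k]) for k = 1, ..., N-1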
Hi, thanks for checking it out!
The input is just a random phase function, eⁱᶿk⁻². We always set the k = 0 element of invKrsq to 0. I have one more question to ask. I was trying the following:
using CUDA
using FFTW
using Random
using LinearAlgebra: mul!, ldiv!
using Test
function getk⁻²_and_rfftplan(dev::Dev;
                             nx = 64, ny = nx, nz = nx, Lx = 2π, Ly = Lx, Lz = Lx,
                             T = Float32, nthreads = Sys.CPU_THREADS, effort = FFTW.MEASURE) where Dev
    ArrayType = dev == "CPU" ? Array : CuArray
    # Define the rfft parameters
    nk = nx
    nl = ny
    nm = nz
    nkr = Int(nx/2 + 1)
    # Wavenumbers
    k  = ArrayType{T}(reshape( fftfreq(nx, 2π/Lx*nx), (nk, 1, 1)))
    l  = ArrayType{T}(reshape( fftfreq(ny, 2π/Ly*ny), (1, nl, 1)))
    m  = ArrayType{T}(reshape( fftfreq(nz, 2π/Lz*nz), (1, 1, nm)))
    kr = ArrayType{T}(reshape(rfftfreq(nx, 2π/Lx*nx), (nkr, 1, 1)))
    Ksq = @. k^2 + l^2 + m^2
    invKsq = @. 1 / Ksq
    CUDA.@allowscalar invKsq[1, 1, 1] = 0
    Krsq = @. kr^2 + l^2 + m^2
    invKrsq = @. 1 / Krsq
    CUDA.@allowscalar invKrsq[1, 1, 1] = 0
    FFTW.set_num_threads(nthreads)
    rfftplan = plan_rfft(ArrayType{T, 3}(undef, nx, ny, nz))
    return invKrsq, rfftplan
end
# Set up the initial condition
Random.seed!(1234); # Set up the random seed
nx = ny = nz = 128
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
T = Float32
k⁻², rfftplan = getk⁻²_and_rfftplan("CPU"; nx = nx, Lx = 2π, ny = ny, nz = nz, T = T)
k⁻²_c, rfftplan_c = getk⁻²_and_rfftplan("GPU"; nx = nx, Lx = 2π, ny = ny, nz = nz, T = T)
θT = im.*convert(Array{T,3},θ)
eⁱᶿk⁻² = @. exp.(θT)*k⁻²; #define the spectral filtered function in CPU
eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²); #define the spectral filtered function in GPU
c_b4_cleaning = CUDA.zeros(T, nx, ny, nz);
c_af_cleaning = CUDA.zeros(T, nx, ny, nz);
# Work out the rfft before cleaning 0 terms
ldiv!(c_b4_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));
#Clean y[N/2] terms (index 0 = 1 julia)
@. eⁱᶿk⁻²_cu[1,:,:] .= 0;
ldiv!(c_af_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));
c_b4_cleaning = Array(c_b4_cleaning); #move the result from GPU-> CPU
c_af_cleaning = Array(c_af_cleaning); #move the result from GPU-> CPU
using PyPlot  # plotting package assumed from the figure/subplot/imshow calls
figure(figsize=(12,6))
subplot(121); imshow(c_b4_cleaning[:,:,100]); title("Before cleaning y[1,:,:]")
subplot(122); imshow(c_af_cleaning[:,:,100]); title("After cleaning y[1,:,:]") |
Yes, there are occasional changes. It doesn't surprise me that you didn't see this behavior in older versions.
Awesome! If that solves your problem, we'd appreciate it if you could follow up on the bug ticket. I think, in summary:
|
Thanks for the information! I appreciate your help. I will follow up on the ticket then. @navidcy do you think we should keep this issue open or just close it? |
Let’s keep it open for now so that people can see that there is an issue. Perhaps we should issue a warning if someone makes a ThreeDGrid with Float32 on GPU. The issue is only for 3D, right? |
From the description, it seems to apply to all cases (1D/2D/3D) as long as the Fourier-transformed array has a nonzero imaginary part in its N/2 element. |
This applies to 1D, 2D, and 3D, yes. If the imaginary part of the N/2 element is nonzero, the complex-to-real transform can produce wrong results. FYI, this is also documented in section 2.3 of cuFFT's online documentation. |
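One possible workaround sketch (my assumption, not an official fix): when a spectral array is built directly in k-space, as with eⁱᶿk⁻² above, zero the imaginary part of its kr = N/2 plane before the complex-to-real transform, since that plane is the trigger described above. This assumes a 3D array whose first dimension is the truncated (rfft) one, as in the FourierFlows grids used in this thread:

using CUDA

function make_nyquist_plane_real!(ah::CuArray{<:Complex, 3})
    # zero the imaginary part of the kr = N/2 plane (the last slice along the truncated dimension)
    ah[end, :, :] .= real.(ah[end, :, :])
    return ah
end

# e.g. make_nyquist_plane_real!(eⁱᶿk⁻²_cu); ldiv!(c_cu, plan_cu, eⁱᶿk⁻²_cu)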
I'm a bit confused. If we start with a real signal and Fourier transform it, then the components that correspond to the 0-th and N/2 frequencies are real anyway, aren't they? |
$ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |
julia> using LinearAlgebra
julia> using CUDA
julia> using CUDA.CUFFT
julia> a = cu(rand(8))
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.4507753
0.06592999
0.96525353
0.6843841
0.94825345
0.046085697
0.38072583
0.8913641
julia> rfft(a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
4.4327717f0 + 0.0f0im
-0.33708918f0 - 0.4522028f0im
0.053049427f0 + 1.4637324f0im
-0.65786713f0 + 0.71685266f0im
1.0572441f0 + 0.0f0im
julia> ah = complex(cu(zeros(5)))
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
julia> rfftplan = plan_rfft(CuArray{Float32, 1}(undef, 8))
CUFFT real-to-complex forward plan for 8-element CuArray of Float32
julia> mul!(ah, rfftplan, a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
4.4327717f0 + 0.0f0im
-0.33708918f0 - 0.4522028f0im
0.053049427f0 + 1.4637324f0im
-0.65786713f0 + 0.71685266f0im
1.0572441f0 + 0.0f0im
julia> b = deepcopy(a*0)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
julia> ldiv!(b, rfftplan, ah)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.45077527
0.06593
0.9652535
0.68438405
0.9482534
0.046085723
0.38072583
0.89136404 |
(same results in 2D and 3D...) |
OK, I see... the issue comes when we generate the Fourier coefficients directly in spectral space, so the N/2 component is not necessarily real. But why is that not an issue with Float64? |
Because cuFFT FP64 transforms use (slightly) different algorithms, so the issue is not visible. |
I guess a warning should at least be displayed when a user provides an N/2 Fourier component with a non-zero imaginary part? I guess by CUDA.jl? |
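Hypothetically, such a check could look something like the sketch below (this is not an existing CUDA.jl or FourierFlows API; the function name and tolerance are assumptions, and it assumes a 3D spectral array whose first dimension is the truncated one):

using CUDA

function warn_if_nyquist_not_real(ah::CuArray{Complex{T}, 3}) where {T<:AbstractFloat}
    nyquist_imag = maximum(abs, imag.(ah[end, :, :]))  # largest imaginary part on the N/2 plane
    if nyquist_imag > eps(T)
        @warn "The N/2 (Nyquist) plane has a nonzero imaginary part; cuFFT complex-to-real " *
              "transforms in Float32 may return polluted results." nyquist_imag
    end
    return nothing
end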
Hi, I am working on the 3D RFFT but found that the result of rfft using CUDA is a bit weird in Float32.
I define a function F(k) in k-space as a Gaussian function like e^{-(k-k0)^2} e^{-iϕ}, where k0 is some number and ϕ is a random number array between 0 and 2π. Here is the implementation.
If I do an rfft on this function, the result of the CUDA rfft will overlay a moiré-like pattern like below:
Here is the implementation:
However, if you change T from Float32 to Float64, the result becomes normal again:
The problem itself is rare and only happens for this kind of function.
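For reference, here is a minimal sketch of the kind of spectral function described above (assuming FourierFlows' ThreeDGrid; the body of GetFk, the value of k₀, and the normalization are illustrative guesses, not the original implementation):

using CUDA, FourierFlows
using Random: rand!

function GetFk(grid; k₀ = 10)
    T = eltype(grid)
    ϕ = T(2π) .* rand!(similar(grid.Krsq))          # random phase ϕ ∈ [0, 2π) on the grid's device
    k = @. sqrt(grid.Krsq)                          # |k| on the rfft wavenumber grid
    return @. exp(-(k - T(k₀))^2) * exp(-im * ϕ)    # Gaussian amplitude times random phase
end

# e.g. Fk = GetFk(ThreeDGrid(GPU(), 128, 2π, 128, 2π, 128, 2π; T = Float32))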