Skip to content

Optimization using cuFFT Callbacks

Alex Nitz edited this page Mar 13, 2015 · 15 revisions

Algorithmic Steps

We'll consider 3 of the kernels within the inner loop of the PyCBC analysis code.

  1. correlate

    multiples a complex template and data segment together to create the 'correlation' vector

  2. ifft

    Using cuFFT, we take the correlation vector and inverse FFT it to produce an SNR time series. This is a C2C out of place transform (2^20 points unless otherwise noted)

  3. threshold

    Peaks in the SNR timeseries are located by windows maximums within a window size.

Callback Variations

  1. No callback used
  2. Fuse the correlate and iFFT together
  3. (2) plus simply return zero for the second half of the correlation vector.

Future Variations that should be safe

  • Store template as real phase, calculate both phases within input callback
  • Store SNR timeseries as fp16
  • Move portion of thresholding code into output callback.
  • Use exact non-zero input boundaries (How do we pass in the boundaries?)
  • Use exact non-corrupted output boundaries (How do we pass in the boundaries?)

Future Variations that need accuracy checking

  • Load template and/or data as fp16

Callback wishlist

  • Callbacks for input/output of intermediate steps in the FFT

Notes on Benchmarking

  • CUDA 6.5
  • Vector sizes are 2^20 points complex single precision unless otherwise noted
  • FFT batching is not used.
  • in-situ analysis on Gaussian data
  • host performance and GPU dead time are ignored.
  • Results obtained by running under COMPUTE_PROFILE=1 and parsing the resultant log files.

Tesla K10 Results

  1. No Callbacks
Operation Kernel Average gputime (µs)
FFT _ZN12spRadix0256B10k 208.00 x 2
FFT _ZN12spRadix0016B10k 180.25
Correlate correlate 175.18
Threshold threshold_and_cluste 75.41
Threshold threshold_and_cluste 5.14
  1. Fuse Correlate FFT
Operation Kernel Average gputime (µs)
FFT _ZN12spRadix0256C18k (callback) 319.51
FFT _ZN12spRadix0256B10k 198.09
FFT _ZN12spRadix0016B10k 181.27
Threshold threshold_and_cluste 75.48
Threshold threshold_and_cluste 5.23
__device__ cufftComplex in_call(void* input, size_t offset,
                            void* caller_info, void* shared) {
    cufftComplex r;

    cufftComplex s = ((cufftComplex*) input)[offset];
    cufftComplex h = ((cufftComplex*) %s)[offset];

    r.x = h.x * s.x + h.y * s.y;
    r.y = h.x * s.y - h.y * s.x;

    return r;
}
  1. Fuse Correlate FFT, and return zeros for second half of vector
Operation Kernel Average gputime (µs)
FFT _ZN12spRadix0256C18k (callback) 254.62
FFT _ZN12spRadix0256B10k 198.03
FFT _ZN12spRadix0016B10k 181.27
Threshold threshold_and_cluste 75.58
Threshold threshold_and_cluste 5.14
__device__ cufftComplex in_call(void* input, size_t offset,
                            void* caller_info, void* shared) {
    if (offset > %s)
        return (cufftComplex){0, 0};
    else{
        cufftComplex r;

        cufftComplex s = ((cufftComplex*) input)[offset];
        cufftComplex h = ((cufftComplex*) %s)[offset];

        r.x = h.x * s.x + h.y * s.y;
        r.y = h.x * s.y - h.y * s.x;

        return r;
    }

}

Curious Results using Store Callback

Adding a callback for the store operation has very strange results at the moment. Using the (3) load callback and adding a simple store callback that just writes out the results with no changes made, results in a very large performance regression in that last FFT kernel. One might expect there to be some overhead using the store callback, and so would be slightly slower than the (3) case alone.

Store callback that simply sets the output
Operation Kernel Average gputime (µs)
FFT _ZN12spRadix0256C18k (load callback) 254.61
FFT _ZN12spRadix0256B10k 198.12
FFT _ZN12spRadix0016A18k (store callback) 558.46
Threshold threshold_and_cluste 75.65
Threshold threshold_and_cluste 5.14
__device__ void out_call(void *out, size_t offset, cufftComplex element,
                         void *caller_info, void *shared){
       ((cufftComplex*) out)[offset] = element;
}
Store callback that does nothing (no output written at all!)

Even if the store callback is a NOOP so that nothing is written to the output memory, it still seems to be anomalously slow. Even with some overhead, I would have expected this to be faster than the non-callback case as you save on the memory writes.

Operation Kernel Average gputime (µs)
FFT _ZN12spRadix0256C18k (load callback) 254.97
FFT _ZN12spRadix0256B10k 197.76
FFT _ZN12spRadix0016A18k (store callback) 478.47
Threshold threshold_and_cluste 76.08
Threshold threshold_and_cluste 4.64
__device__ void out_call(void *out, size_t offset, cufftComplex element,
                         void *caller_info, void *shared){
}