One of a few experiments I did with moving libnabo to CUDA #33
…d is about 20% faster than the CPU implementation. My code is very poorly optimized, so take that number with a grain of salt. I'll be working on it some more later next week.

In the best case, this function will run in O(n log n) time. In the worst case, it'll be O(n^2), where the worst case is two points within the same cluster being on exactly opposite sides of the tree.

Points should be organized like so: [Cluster|Cluster|Cluster|Cluster], where each cluster holds 32 points, ordered from least to greatest by their distance from the center of the cluster. The furthest point should be no more than max_rad from the center.

Eventually this code, if my predictions hold up, should perform 7-9x faster than the CPU implementation. Obviously that's quite a long way off, will take countless days of work, and will never be optimized perfectly, but you get the idea.

This code is highly unstable and has a very large number of bugs (I counted 20+). DO NOT use this code in any application yet. It will almost certainly crash either your GPU driver or the application.

This code was initially written to look as close to the OpenCL code as possible. As a result, the amount of branching that currently occurs is huge, since I directly copied the traversal patterns of the OpenCL version. I will be reducing the amount of branching soon. -Louis
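For illustration, the cluster layout described above can be sketched on the host side in C++. This is a minimal sketch under stated assumptions, not the code from this pull request: the names `Point3`, `kClusterSize`, and `orderClusters` are hypothetical, the "center" of a cluster is assumed to be its centroid, and the input is assumed to be already grouped into consecutive runs of 32 points (how points are assigned to clusters is not specified in the description).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical point type; the PR does not name its own.
struct Point3 { float x, y, z; };

// One cluster per 32-thread warp, as in the description above.
constexpr std::size_t kClusterSize = 32;

static float sqDist(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Assumption: the cluster "center" is the centroid of its 32 points.
static Point3 centroid(const Point3* c) {
    Point3 m{0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < kClusterSize; ++i) {
        m.x += c[i].x; m.y += c[i].y; m.z += c[i].z;
    }
    const float n = static_cast<float>(kClusterSize);
    return Point3{m.x / n, m.y / n, m.z / n};
}

// Sorts every consecutive 32-point cluster in place by increasing distance
// from its center. Returns true if no point in any cluster lies further
// than max_rad from that cluster's center.
bool orderClusters(std::vector<Point3>& pts, float max_rad) {
    assert(pts.size() % kClusterSize == 0);
    bool within = true;
    for (std::size_t base = 0; base < pts.size(); base += kClusterSize) {
        Point3* c = pts.data() + base;
        const Point3 mid = centroid(c);
        std::sort(c, c + kClusterSize, [&mid](const Point3& a, const Point3& b) {
            return sqDist(a, mid) < sqDist(b, mid);
        });
        // After sorting, the last point is the furthest from the center.
        if (sqDist(c[kClusterSize - 1], mid) > max_rad * max_rad) within = false;
    }
    return within;
}
```

Calling `orderClusters(points, max_rad)` on a buffer whose size is a multiple of 32 leaves it in the [Cluster|Cluster|...] layout and reports whether the max_rad bound holds, so the buffer could then be copied to the device as-is.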
Optimized branching. Reduced memory overhead. First tests with dynamic parallelism failed. I'll try again tomorrow. Began to diverge from libnabo's default traversal patterns. I'm using my own, as they seem better for CUDA. This may have a negative impact later on. I don't know yet. Improved comments.
Fixed some syntax errors that were preventing it from compiling.
Added a best practice guide and installation guide for CUDA
Can one of the admins verify this patch?
Oh... I didn't mean to push the change to the README... Um... Should I create a cudareadme.md?
Linking the thread I had started about this: #29. My experience with implementing a KD search in CUDA is that no matter what I did, thread divergence was always the biggest overhead. In the best case, the threads were idle about half the time. In the worst case, they were idle 31/32 of the time. My general prediction is that KD tree search algorithms like the one libnabo uses will not be as effective in a GPGPU environment until something like NVIDIA's Volta releases, which would allow the use of an on-board ARM chip to queue new work for the GPU and therefore potentially reduce warp divergence. FLANN's approach to CUDA seemed interesting, but the error was far too large for it to be practical for anything more than super basic applications. Unless Dustin sees something that I didn't, I'm going to say that doing this on the GPU may not be very viable for a bit longer (2 years perhaps?). And even by then, the monstrous improvements in cache performance that Intel Skylake promises may further reduce the potential advantage of GPGPU KD search over CPU KD search.
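To make the idle fractions above concrete, here is a small C++ sketch of a simplified lockstep model. It is an assumption for illustration only, not a measurement from this code: each of the 32 lanes in a warp needs some number of traversal steps, and the warp stays occupied until its slowest lane finishes. The names `kWarpSize` and `warpUtilization` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

constexpr std::size_t kWarpSize = 32;

// Fraction of lane-steps that do useful work when all 32 lanes run in
// lockstep until the slowest lane is done (simplified model).
double warpUtilization(const std::array<int, kWarpSize>& steps) {
    const int longest = *std::max_element(steps.begin(), steps.end());
    assert(longest > 0);
    const long useful = std::accumulate(steps.begin(), steps.end(), 0L);
    return static_cast<double>(useful) /
           (static_cast<double>(kWarpSize) * static_cast<double>(longest));
}
```

Under this model, a warp where every lane needs the same number of steps has utilization 1.0, a warp where half the lanes finish immediately has 0.5 (idle half the time), and a warp where a single lane does all the work has 1/32 (idle 31/32 of the time).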
@MatrixCompSci thanks a lot for sharing your insight, thoughts and code. Great!
The one thing I really want to try still is doing this on a Xeon Phi. I'm generally curious haha |
As stated before, all attempts were unsuccessful: I was not able to beat or even match libnabo's CPU performance.
I'm passing this project on to a colleague of mine who has been working with KD trees and KNN for 10+ years. I hope he can do a much better job than I did.
The code isn't linked to any C or C++ code. The port I wrote for that was very messy and was constantly changing due to the various approaches I attempted. I'm only putting this here as a proof of concept in case anyone wants to tinker with it.
Dustin, my colleague, will be writing a new C++ wrapper in the coming month when he does his attempt.
I don't know what his approach will be, but his fork is here: https://github.com/dustprog/libnabo