Hello there,

I have a question about the performance of distance join when increasing both the number of available cores and the size of the workload (the size of the datasets being joined). I have a group of point datasets (namely 'points' and 'centers') that grow linearly across runs. For example, points0.txt has ~20000 points, points1.txt ~40000, points2.txt ~60000, and points3.txt ~80000. The centers datasets behave similarly, ranging from 47663 points (centers0.txt) to 190655 (centers3.txt, four times centers0.txt). I then perform a distance join between each corresponding pair of points and centers datasets, scaling the number of cores accordingly: first a distance join between points0 and centers0 using N cores, then between points1 and centers1 using 2N cores, then points2 and centers2 using 3N cores, and so on.
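To illustrate, a single run boils down to something like the sketch below. This is a minimal sketch assuming the Simba Scala API as it appears in the project examples (SimbaSession, distanceJoin) and a comma-separated "x,y" line format; the paths, epsilon value, and core count are illustrative, and the real code is in [2]:

```scala
import org.apache.spark.sql.simba.SimbaSession

case class P(x: Double, y: Double)

object OneRun {
  def main(args: Array[String]): Unit = {
    // One run of the experiment: a fixed core count and the matching pair of files.
    val simba = SimbaSession.builder()
      .master("local[4]")                      // cores per run: N, 2N, 3N, ... across runs
      .appName("DistanceJoinRun")
      .config("simba.join.partitions", "32")
      .getOrCreate()
    import simba.implicits._
    import simba.simbaImplicits._

    // Assumes one "x,y" pair per line; adjust parsing to the real file format.
    def load(path: String) =
      simba.read.textFile(path)
        .map { line => val a = line.split(","); P(a(0).toDouble, a(1).toDouble) }

    val points  = load("points0.txt").toDF("px", "py")
    val centers = load("centers0.txt").toDF("cx", "cy")

    // Epsilon-distance join: every (point, center) pair within distance 10.0.
    val pairs = points.distanceJoin(centers, Array("px", "py"), Array("cx", "cy"), 10.0)
    println(s"result size: ${pairs.count()}")

    simba.stop()
  }
}
```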
Since both the inputs and the cores scale linearly, the work per core stays roughly constant (a weak-scaling setup), so we expected the execution time of each distance join to be about the same. However, as [1] shows, there is a considerable improvement in performance as the number of cores increases. It sounds very good, but we are wondering why...
We have prepared a sample project you can download from [2]. It contains the source code, the random datasets, and scripts to run a similar experiment and plot the results.
We wonder if we are missing a particular parameter or special setting for running distance joins in Simba. Any help or feedback would be much appreciated.

Kind regards,
Andres

[1] http://www.cs.ucr.edu/~acald013/public/simba/Centers2Points.pdf
[2] http://www.cs.ucr.edu/~acald013/public/simba/RandomTester.tar.gz
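(On the question of parameters: the Simba settings most likely to affect join behavior are the partition counts and whether the inputs are indexed. The sketch below is hedged; the parameter names follow the Simba README and examples and may differ between versions, so treat them as assumptions to verify against the source:)

```scala
import org.apache.spark.sql.simba.SimbaSession
import org.apache.spark.sql.simba.index.RTreeType

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val simba = SimbaSession.builder()
      .master("local[4]")
      .appName("TunedDistanceJoin")
      .config("simba.join.partitions", "64")   // partitions used by Simba's spatial joins
      .config("simba.index.partitions", "64")  // partitions used when building an index (name unverified)
      .getOrCreate()
    import simba.implicits._
    import simba.simbaImplicits._

    val points = Seq((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).toDF("px", "py")

    // Building an R-tree index over an input can change how Simba
    // partitions the data and which join strategy it picks.
    points.index(RTreeType, "pointsRT", Array("px", "py"))
  }
}
```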
A quick thought: the distance join algorithm I implemented in Simba can effectively prune most candidate join pairs, which makes its performance deeply coupled with the size of the result set itself. Besides, distance join is both data- and CPU-intensive. So it is quite possible to observe results like yours.
I will try to find some time soon to look into this, and will get back to you if I find the reason.
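(One way to test the result-size hypothesis above: log the output cardinality of each run next to its wall-clock time, and check whether time tracks the result size rather than the input sizes. A hedged sketch under the same API assumptions as before; file names and epsilon are illustrative:)

```scala
import org.apache.spark.sql.simba.SimbaSession

case class P(x: Double, y: Double)

object ResultSizeProbe {
  def main(args: Array[String]): Unit = {
    val simba = SimbaSession.builder()
      .appName("ResultSizeProbe")
      .getOrCreate()
    import simba.implicits._
    import simba.simbaImplicits._

    // Assumes one "x,y" pair per line; adjust parsing to the real file format.
    def load(path: String) =
      simba.read.textFile(path)
        .map { l => val a = l.split(","); P(a(0).toDouble, a(1).toDouble) }

    val epsilon = 10.0 // the same radius used in the experiment
    for (i <- 0 to 3) {
      val points  = load(s"points$i.txt").toDF("px", "py")
      val centers = load(s"centers$i.txt").toDF("cx", "cy")
      val t0 = System.nanoTime()
      val n  = points.distanceJoin(centers, Array("px", "py"), Array("cx", "cy"), epsilon).count()
      val secs = (System.nanoTime() - t0) / 1e9
      // If time grows with n rather than with the input sizes, the
      // pruning/result-size explanation fits the data.
      println(f"run $i: result = $n, time = $secs%.2f s")
    }
    simba.stop()
  }
}
```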