multidplyr runs significantly slower when wrapped in user-defined function #87
Ok, now I see what you mean:

```r
library(tidyverse)
library(multidplyr)

df <- tibble(
  index = rep(1:500000, 3),
  to_concat = rep(1:500000, 3)
)
cluster <- new_cluster(5)

print(system.time({
  df %>%
    group_by(index) %>%
    partition(cluster) %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}))
#>    user  system elapsed
#>   3.175   0.313   8.864

f <- function(df, cluster) {
  df %>%
    group_by(index) %>%
    partition(cluster) %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}

print(system.time(f(df, cluster)))
#>    user  system elapsed
#>  11.328   0.352  14.678
```

Created on 2019-11-18 by the reprex package (v0.3.0)
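A plausible mechanism, offered here as an assumption rather than a confirmed diagnosis of multidplyr's internals: a closure (or captured call) created inside a function keeps a reference to that function's environment, and serializing it, as shipping work to cluster workers requires, also serializes every object in that environment. A minimal base-R sketch of the effect:

```r
# Assumption: the slowdown is R environment capture. A closure created inside
# a function references that function's environment, so serializing the
# closure also serializes every object living in that environment.

make_closure <- function() {
  big <- runif(1e6)      # ~8 MB object in the function's environment
  function(x) x + 1      # closure whose enclosure is that environment
}

inner  <- make_closure()         # drags `big` along when serialized
global <- function(x) x + 1      # enclosure is the global environment

size_inner  <- length(serialize(inner,  NULL))
size_global <- length(serialize(global, NULL))

size_inner > 1e6    # TRUE: `big` travels with the closure
size_global < 1e4   # TRUE: the global environment is stored as a reference
```

If something similar happens to what `partition()` captures, wrapping the pipeline in a function would make each worker transfer far more data than the direct call does, which would be consistent with the timings above.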
The performance difference comes from within …
Has there been any progress on this issue? I'm in the process of updating a package of mine to use multidplyr, and I just ran into the same problem. In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those cores show heavy activity in my CPU history. If I run the same code outside a function, all 8 cores show heavy activity.
I ran into the same problem. For now I have gone back to the beta version, which works OK.
Also having this problem; opened a new issue: #123
I would like to wrap multidplyr code in a function, since I am applying the same parallelised multidplyr code to multiple datasets. However, this significantly slows processing, to the point where parallelising the code is no longer advantageous in terms of run time.
Is there a fix for this? Am I doing something incorrectly? I believe this may have something to do with the global environment versus the environment local to the user-defined function.
Below is an example illustrating the effect on run time of wrapping multidplyr code in a user-defined function.
Note: I understand that for this particular example it is not advantageous to use multidplyr over dplyr, since the overhead outweighs the benefit of parallelising. The issue I want to highlight is the doubling of run time when identical multidplyr code is wrapped in a user-defined function.

Called directly:
#>    user  system elapsed
#>   0.444   0.068   2.753

Wrapped in a user-defined function:
#>    user  system elapsed
#>   1.432   0.288   4.353
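One way to probe the global-versus-local-environment hypothesis above (a hypothetical diagnostic, not a documented multidplyr fix) is to keep the wrapping function's environment small before anything that captures it gets serialized, for example by removing large intermediates with `rm()`. A base-R sketch of the effect:

```r
# Hypothetical mitigation (assumption, not a documented multidplyr fix):
# keep the function's environment small, e.g. drop large intermediates with
# rm() before any closure over that environment is created and serialized.

leaky <- function() {
  scratch <- runif(1e6)          # large intermediate kept alive
  function(x) x * 2
}

tidy <- function() {
  scratch <- runif(1e6)
  rm(scratch)                    # dropped before the closure is created
  function(x) x * 2
}

length(serialize(leaky(), NULL)) > 1e6   # TRUE: `scratch` gets shipped
length(serialize(tidy(),  NULL)) < 1e4   # TRUE: environment is now cheap
```

Whether this helps here depends on exactly what `partition()` captures, so treat it as a diagnostic experiment rather than a fix.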