multidplyr runs significantly slower when wrapped in a user-defined function #87

Open
dzhang32 opened this issue Nov 15, 2019 · 9 comments
Labels: feature (a feature request or enhancement)

dzhang32 commented Nov 15, 2019

I would like to wrap my multidplyr code in a function, since I am applying the parallelised code to multiple datasets. However, doing so significantly slows processing, to the point where parallelising the code no longer provides any run-time advantage.

Is there a fix for this? Am I doing something incorrectly? I believe this may have something to do with the global environment versus the local environment of the user-defined function.

Below is an example illustrating the effect on run time of wrapping the multidplyr code in a user-defined function.

Note: I understand that for this particular example, multidplyr is not advantageous over plain dplyr, since the overhead will outweigh the benefit of parallelising. The issue I want to highlight, however, is that the run time doubles when the identical multidplyr code is wrapped in a user-defined function.

library(tidyverse)
library(multidplyr)
library(stringr)

df <- tibble(index = rep(1:100000, 3), 
             to_concat = (rep(1:100000, 3)))

cluster <- new_cluster(2)

system.time(expr = {
  
  df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
    
})

rm(cluster)

   user  system elapsed 
  0.444   0.068   2.753

user_defined_func <- function(df, cluster_num){
  
  cluster <- new_cluster(cluster_num)
  
  df <- 
    df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
  
  rm(cluster)
  
  return(df)

}

system.time(expr = {
  
  user_defined_func(df, cluster_num = 2)

})

   user  system elapsed 
  1.432   0.288   4.353

(Several earlier comments between hadley and dzhang32 have been minimized.)

hadley closed this as completed Nov 16, 2019

hadley (Member) commented Nov 18, 2019

Ok, now I see what you mean:

library(tidyverse)
library(multidplyr)

df <- tibble(
  index = rep(1:500000, 3),
  to_concat = (rep(1:500000, 3))
)

cluster <- new_cluster(5)

print(system.time({
  df %>%
    group_by(index) %>%
    partition(cluster) %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}))
#>    user  system elapsed 
#>   3.175   0.313   8.864

f <- function(df, cluster) {
  df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
}
print(system.time(f(df, cluster)))
#>    user  system elapsed 
#>  11.328   0.352  14.678

Created on 2019-11-18 by the reprex package (v0.3.0)

hadley reopened this Nov 18, 2019
hadley (Member) commented Nov 18, 2019

The performance difference comes from within summarise.multidplyr_party_df(): the generated call is 12 MB rather than 400 kB. This is because the environment of the ... contains a bunch of stuff that we don't care about (most importantly, in the wrapped-function case, a full copy of the dataset). Even in the case that's currently fast, it looks like we're still copying stuff, so this is worth some further exploration. The easiest approach would probably be to do what dbplyr and dtplyr do and walk the quosure to embed the needed data (building from dtplyr:::dt_squash()).
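[Editor's note: to make the environment-capture point concrete, here is a minimal sketch of the underlying R behaviour. This is illustrative rlang/serialize code, not multidplyr's internals; f, big_df, and q_* are made-up names. A quosure created at top level references the global environment, which serialize() stores as a reference, whereas a quosure created inside a function captures that function's execution environment, so serialising it drags along every local binding, including the data.]

library(rlang)

big_df <- data.frame(to_concat = rep(1:500000, 3))

# Quosure created at top level: its environment is the global
# environment, which serialize() stores as a reference, so the
# payload stays small.
q_global <- quo(stringr::str_c(to_concat, collapse = "_"))
length(serialize(q_global, NULL))  # small

f <- function(df) {
  # Quosure created inside a function: it captures f()'s execution
  # environment, where `df` is bound, so serialising it also
  # serialises the full data frame.
  quo(stringr::str_c(to_concat, collapse = "_"))
}
q_local <- f(big_df)
length(serialize(q_local, NULL))  # large: includes the copy of `df`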

hadley added the feature (a feature request or enhancement) label Jan 24, 2021
adamthrash commented
Has there been any progress on this issue? I'm in the process of updating a package of mine to use multidplyr, and I just ran into the same problem.

In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those show heavy activity in my CPU history. If I run the same code without wrapping it in a function, all 8 cores show heavy activity.

ldanai commented Apr 9, 2021

I ran into the same problem.

For now I've gone back to using the beta version, which works OK:
remotes::install_github("tidyverse/multidplyr#10")
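[Editor's note: an alternative, untested workaround suggested by hadley's diagnosis above: since the captured environment is what bloats the generated call, removing the function's local copy of the data before summarise() may keep the payload small. A sketch under that assumption, adapting the reprex from the original post (user_defined_func2 is a hypothetical name):]

library(dplyr)
library(multidplyr)

user_defined_func2 <- function(df, cluster_num) {
  cluster <- new_cluster(cluster_num)
  
  parted <- df %>%
    group_by(index) %>%
    partition(cluster)
  
  # The data now lives on the workers; drop the local copy so the
  # quosure built by summarise() doesn't capture it through this
  # function's environment.
  rm(df)
  
  parted %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}

[Whether this actually restores the fast path is worth checking with system.time() as in the reprexes above.]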

avsdev-cw commented Nov 10, 2021

> In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those show heavy activity in my CPU history. If I run the same code without wrapping it in a function, all 8 cores show heavy activity.

Also having this problem, opened a new issue: #123
