multidplyr runs significantly slower when wrapped in a user-defined function #87

Open
dzhang32 opened this issue Nov 15, 2019 · 9 comments
Labels: feature (a feature request or enhancement)

dzhang32 commented Nov 15, 2019

I would like to wrap my multidplyr code in a function, since I am applying the parallelised code to multiple datasets. However, doing so significantly slows processing, to the point where parallelising the code no longer provides any run-time advantage.

Is there a fix for this? Am I doing something incorrectly? I believe this may have something to do with the global environment versus the local environment of the user-defined function.

Below is an example illustrating the effect on run time of wrapping the multidplyr code in a user-defined function.

Note: I understand that for this particular example, multidplyr is not advantageous over plain dplyr, since the overhead will outweigh the benefit of parallelising. The issue I want to highlight, however, is that the run time doubles when the identical multidplyr code is wrapped in a user-defined function.

library(tidyverse)
library(multidplyr)
library(stringr)

df <- tibble(index = rep(1:100000, 3), 
             to_concat = (rep(1:100000, 3)))

cluster <- new_cluster(2)

system.time(expr = {
  
  df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
    
})

rm(cluster)

   user  system elapsed 
  0.444   0.068   2.753

user_defined_func <- function(df, cluster_num){
  
  cluster <- new_cluster(cluster_num)
  
  df <- 
    df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
  
  rm(cluster)
  
  return(df)

}

system.time(expr = {
  
  user_defined_func(df, cluster_num = 2)

})

   user  system elapsed 
  1.432   0.288   4.353

(Several earlier comments between hadley and dzhang32 have been minimized.)

hadley closed this as completed Nov 16, 2019

hadley (Member) commented Nov 18, 2019

Ok, now I see what you mean:

library(tidyverse)
library(multidplyr)

df <- tibble(
  index = rep(1:500000, 3),
  to_concat = (rep(1:500000, 3))
)

cluster <- new_cluster(5)

print(system.time({
  df %>%
    group_by(index) %>%
    partition(cluster) %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}))
#>    user  system elapsed 
#>   3.175   0.313   8.864

f <- function(df, cluster) {
  df %>% 
    group_by(index) %>% 
    partition(cluster) %>% 
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>% 
    collect()
}
print(system.time(f(df, cluster)))
#>    user  system elapsed 
#>  11.328   0.352  14.678

Created on 2019-11-18 by the reprex package (v0.3.0)

hadley reopened this Nov 18, 2019
hadley (Member) commented Nov 18, 2019

The performance difference comes from within summarise.multidplyr_party_df(): the generated call is 12 MB rather than 400 kB. This is because the environment of the ... contains a bunch of stuff that we don't care about (most importantly, in the wrapped-function case, a full copy of the dataset). Even in the case that's currently fast, it looks like we're still copying stuff, so this is worth some further exploration. The easiest approach would probably be to do what dbplyr and dtplyr do and walk the quosure to embed the needed data (building from dtplyr:::dt_squash()).
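[Editor's note: to make the environment-capture point concrete, here is a minimal sketch of the underlying R behaviour. This is illustrative rlang/serialize code, not multidplyr's internals; f, big_df, and q_* are made-up names. A quosure created at top level references the global environment, which serialize() stores as a reference, whereas a quosure created inside a function captures that function's execution environment, so serialising it drags along every local binding, including the data.]

library(rlang)

big_df <- data.frame(to_concat = rep(1:500000, 3))

# Quosure created at top level: its environment is the global
# environment, which serialize() stores as a reference, so the
# payload stays small.
q_global <- quo(stringr::str_c(to_concat, collapse = "_"))
length(serialize(q_global, NULL))  # small

f <- function(df) {
  # Quosure created inside a function: it captures f()'s execution
  # environment, where `df` is bound, so serialising it also
  # serialises the full data frame.
  quo(stringr::str_c(to_concat, collapse = "_"))
}
q_local <- f(big_df)
length(serialize(q_local, NULL))  # large: includes the copy of `df`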

hadley added the feature (a feature request or enhancement) label Jan 24, 2021
adamthrash commented
Has there been any progress on this issue? I'm in the process of updating a package of mine to use multidplyr, and I just ran into the same problem.

In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those show heavy activity in my CPU history. If I run the same code without wrapping it in a function, all 8 cores show heavy activity.

ldanai commented Apr 9, 2021

I ran into the same problem.

For now I've gone back to using the beta version, which works OK:
remotes::install_github("tidyverse/multidplyr#10")
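[Editor's note: an alternative, untested workaround suggested by hadley's diagnosis above: since the captured environment is what bloats the generated call, removing the function's local copy of the data before summarise() may keep the payload small. A sketch under that assumption, adapting the reprex from the original post (user_defined_func2 is a hypothetical name):]

library(dplyr)
library(multidplyr)

user_defined_func2 <- function(df, cluster_num) {
  cluster <- new_cluster(cluster_num)
  
  parted <- df %>%
    group_by(index) %>%
    partition(cluster)
  
  # The data now lives on the workers; drop the local copy so the
  # quosure built by summarise() doesn't capture it through this
  # function's environment.
  rm(df)
  
  parted %>%
    summarise(concat = stringr::str_c(to_concat, collapse = "_")) %>%
    collect()
}

[Whether this actually restores the fast path is worth checking with system.time() as in the reprexes above.]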

avsdev-cw commented Nov 10, 2021

> In addition, if I call the function and pass in the number of cores to use (8, in this case), only four of those show heavy activity in my CPU history. If I run the same code without wrapping it in a function, all 8 cores show heavy activity.

Also having this problem, opened a new issue: #123
