GROUP BY semantics not honored #132
Comments
I agree with this. group_by can be a way to sort everything, but replicating the behavior both when we need group by and when we don't can be hard. I think something simpler would do: a function that lets us choose how to split the data across the clusters.
From my understanding, workers cannot communicate, so this is expected behavior. Each worker performs the computation only on its own slice of the data, which can be emulated like this:
library(purrr)
library(dplyr, warn.conflicts = FALSE)
tibble(
  group = c("a", "a", "b", "b"),
  int = c(1, 1, 1, 1)
) %>%
  # Emulate two workers: rows 1 and 3 go to one slice, rows 2 and 4 to the other
  split.data.frame(rep(c(1, 2), times = 2)) %>%
  map(
    ~ .x %>%
      group_by(group) %>%
      summarise(sum = sum(int), n = n())
  ) %>%
  bind_rows()
#> # A tibble: 4 × 3
#>   group   sum     n
#>   <chr> <dbl> <int>
#> 1 a         1     1
#> 2 b         1     1
#> 3 a         1     1
#> 4 b         1     1

Created on 2023-04-06 with reprex v2.0.2
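One possible workaround, sketched under the same emulation as above (plain purrr and dplyr, no real cluster), is to treat the per-worker output as partial aggregates and combine them in a second pass after collecting. The column names `part_sum` and `part_n` are made up for this illustration.

```r
library(purrr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  group = c("a", "a", "b", "b"),
  int = c(1, 1, 1, 1)
)

# Stage 1: each simulated worker summarises only its own slice,
# so every group shows up once per slice.
partials <- df %>%
  split.data.frame(rep(c(1, 2), times = 2)) %>%
  map(
    ~ .x %>%
      group_by(group) %>%
      summarise(part_sum = sum(int), part_n = n())
  ) %>%
  bind_rows()

# Stage 2: after collecting, re-aggregate the partial results per group.
combined <- partials %>%
  group_by(group) %>%
  summarise(sum = sum(part_sum), n = sum(part_n))

print(combined)
```

Here `combined` has one row per group, each with `sum` 2 and `n` 2. This only works for aggregates that decompose across slices, such as sums and counts. A mean has to be rebuilt from the combined sum and count, and something like a median cannot be recovered this way.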
Seems like we are rationalizing a bug. GROUP BY semantics have long been established by the SQL community. I can name many systems that work the way I expect GROUP BY to work, including IBM DB2, MySQL, Postgres, Apache Spark and, yes, [...]. Perhaps workers cannot communicate. In that case, why wouldn't it be reasonable for multidplyr to implement some sort of IPC?
Yeah, this is expected behaviour. Everything is implicitly grouped by the partition and there's no way around it in multidplyr. If this is the behaviour you want, I'd suggest you use a system that does support it.
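Since everything is implicitly grouped by the partition, the result depends on how rows are assigned to partitions. The following is a minimal sketch, again only emulating workers with plain purrr and dplyr, of assigning rows by group key so that each group lands in exactly one partition. `n_workers` and `partition_id` are hypothetical names for this illustration, not part of multidplyr. As far as I know, multidplyr's documentation describes calling `group_by()` before `partition()` for a similar purpose, but that is not exercised here.

```r
library(purrr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  group = c("a", "a", "b", "b"),
  int = c(1, 1, 1, 1)
)

n_workers <- 2  # hypothetical number of simulated workers

# Assign every distinct group key to one partition (round-robin over keys),
# so all rows of a group share the same partition id.
keys <- unique(df$group)
partition_id <- (match(df$group, keys) - 1) %% n_workers + 1

by_key <- df %>%
  split.data.frame(partition_id) %>%
  map(
    ~ .x %>%
      group_by(group) %>%
      summarise(sum = sum(int), n = n())
  ) %>%
  bind_rows()

print(by_key)
```

With this assignment no group is split across partitions, so the per-partition summaries already match the global GROUP BY result: one row per group, with `sum` 2 and `n` 2 for both `a` and `b`. The trade-off is that partition sizes follow the group sizes, so skewed groups can leave some workers with much more data than others.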
It seems multidplyr is not honoring GROUP BY semantics, as seen in the reprex below.
Created on 2022-03-02 by the reprex package (v2.0.1)