Support for spatial data frames - {sf} package format #112

Open
jlacko opened this issue Feb 25, 2021 · 4 comments
Labels
feature: a feature request or enhancement

Comments

@jlacko

jlacko commented Feb 25, 2021

The {multidplyr} package changes the class of the object distributed to the workers to multidplyr_party_df. This causes a loss of the "special sauce" that the {sf} package provides for spatial datasets (special interpretation of the geometry column, and information about the coordinate reference system).

It would be advantageous for spatial data processing to allow parallelization of some tasks, such as the point-in-polygon operation demonstrated in the reprex below.

Doing so would likely require keeping the class of the distributed object unchanged (or perhaps re-implementing the sf methods, in which case the issue would likely fall outside the scope of the {multidplyr} package).

library(sf)
library(dplyr)   

# Well known & much loved shapefile of NC included with sf package
nc_polygons <- sf::st_read(system.file("shape/nc.shp", 
                                       package = "sf"),
                           quiet = TRUE)

# NC state lines
nc_state <- nc_polygons %>% 
  summarise()

# some random points over NC state
nc_points <- sf::st_sample(nc_state, 500) %>% 
  st_as_sf()

# this will work - single core
result <- nc_polygons %>% 
  st_join(nc_points) %>% 
  group_by(NAME) %>% 
  count() %>% 
  st_drop_geometry()

# this will break
library(multidplyr)

cluster <- new_cluster(4)
cluster_library(cluster, packages = c("dplyr", "sf"))
cluster_copy(cluster, "nc_points")

result <- nc_polygons %>% 
  partition(cluster) %>%
  st_join(nc_points) %>%  # the pipe breaks here, even though nc_points is available
  group_by(NAME) %>% 
  count() %>% 
  st_drop_geometry() %>% 
  collect()


@Fredo-XVII

Hello,

I have the same issue with the fable package. I am able to fit the model because the data is a tsibble, which inherits from tibble. But when I go to forecast, the data frame is a mable (a model table, i.e. a table of models), and it gets converted to a plain tibble, so the fable package's forecast function doesn't know what to do with it.

It would be nice if there was some way to inherit the class of the object that is being passed to the cluster.

This is a great package by the way. The future package is much more complicated than this, touchy, and inconsistent when trying to take advantage of nearly all cores. Please do not let this project slide.

Thanks!!

@astraetech

Definitely second the {sf} package inheritance request. I may be wrong, but multidplyr is an incredible opportunity to make massive computations more efficient.

@avsdev-cw

I also would like to see support for sf (or in general other "specialised" tibble classes).

As an aside, to work with sf in a parallel pipe:

  grid_sf3 <- grid_sf2 %>%
    multidplyr::partition(cluster) %>%
    dplyr::mutate(
      dist = as.numeric(sf::st_distance(geometry, coast))
    ) %>%
    dplyr::collect() %>%
    sf::st_sf()

@jfulponi

Hello! Does anyone have a better approach? Unfortunately, the only thing I can think of is to process the non-geometry part of the "data.frame" with multidplyr, do the sf operations separately, and then join the two tables. I also tried a few things with furrr, but it feels like it takes endless, unreadable lines of code to do something relatively simple. The worst case would be setting up a Spark cluster just for a summarize().
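
For illustration, the split-and-rejoin idea could look roughly like the sketch below. It is only a sketch under assumptions: it uses the nc shapefile from the reprex at the top of the thread, treats NAME as a unique key, and the sid_rate column is a made-up stand-in for whatever heavy attribute computation you actually need.

```r
library(sf)
library(dplyr)
library(multidplyr)

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Geometry stays on the main process, with a key to join back on
# (assumption: NAME is unique per row, which holds for the nc dataset)
geoms <- nc %>% select(NAME)

# Attributes only: a plain tibble that multidplyr can partition safely
attrs <- nc %>% st_drop_geometry() %>% as_tibble()

cluster <- new_cluster(2)
cluster_library(cluster, "dplyr")

# Hypothetical attribute-only computation run on the workers
attrs_out <- attrs %>%
  partition(cluster) %>%
  mutate(sid_rate = SID74 / BIR74) %>%
  collect()

# Re-attach the geometry; joining with the sf object on the left keeps the sf class
result <- geoms %>% left_join(attrs_out, by = "NAME")
```

Note that this only parallelizes the non-spatial part; any sf operation (st_join, st_distance, and so on) still runs on a single core.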

@hadley added the "feature" label (a feature request or enhancement) on Oct 31, 2023