Support for pandas DataFrame subclasses #52

jorisvandenbossche · 2021-06-21T08:31:04Z

When dask uses partd, e.g. for shuffle operations, the dataframes always come back as a pandas.DataFrame, even if a subclass was stored (xref geopandas/dask-geopandas#59 (comment)).

For example:

import geopandas
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

import partd
# dask.dataframe shuffle operations use PandasBlocks
p = partd.PandasBlocks(partd.Dict())

p.append({"gdf": gdf})
res = p.get("gdf")

>>> type(gdf)
geopandas.geodataframe.GeoDataFrame
>>> type(res)
pandas.core.frame.DataFrame

To be able to use dask's shuffle operations with dask_geopandas, which uses a pandas subclass as the partition type, the subclass should be preserved in the partd roundtrip (or are there other ways to override / dispatch this operation in dask?).
I was wondering how other dask.dataframe subclasses handle this, but e.g. dask_cudf doesn't seem to support "disk"-based shuffling.

jorisvandenbossche · 2021-06-21T08:42:24Z

The pandas.DataFrame is recreated here:

partd/partd/pandas.py

Lines 194 to 203 in 9c9ba0a

    
    def deserialize(bytes):
        """ Deserialize and decompress bytes back to a pandas DataFrame """
        frames = list(framesplit(bytes))
        headers = pickle.loads(frames[0])
        bytes = frames[1:]
        axes = [index_from_header_bytes(headers[0], bytes[0]),
                index_from_header_bytes(headers[1], bytes[1])]
        blocks = [block_from_header_bytes(h, b)
                  for (h, b) in zip(headers[2:], bytes[2:])]
        return pd.DataFrame(create_block_manager_from_blocks(blocks, axes))

Currently only the actual underlying data is stored, so when deserializing, I don't think partd knows anything about the original class? If we want to change that, it would need to start storing some additional information?
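
To make "storing some additional information" concrete, here is a rough sketch of a header that records which class was written, next to the data. This is not partd's code: `serialize_with_class`, `deserialize_with_class`, `CLASS_REGISTRY` and `MyFrame` are hypothetical names, and plain pickle stands in for partd's block-based format.

```python
import pickle

import pandas as pd


class MyFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass, standing in for e.g. GeoDataFrame."""

    @property
    def _constructor(self):
        return MyFrame


# Hypothetical registry mapping a stored class name back to a class.
CLASS_REGISTRY = {"DataFrame": pd.DataFrame, "MyFrame": MyFrame}


def serialize_with_class(df):
    """Store the class name in a header next to the plain DataFrame payload."""
    header = {"cls": type(df).__name__}
    payload = pd.DataFrame(df)  # drop the subclass, keep only the data
    return pickle.dumps((header, payload))


def deserialize_with_class(data):
    """Rebuild the frame using the class named in the header."""
    header, payload = pickle.loads(data)
    # Fall back to a plain DataFrame for unknown class names.
    cls = CLASS_REGISTRY.get(header["cls"], pd.DataFrame)
    return cls(payload)


original = MyFrame({"a": [1, 2, 3]})
restored = deserialize_with_class(serialize_with_class(original))
```

The registry is only one possible design choice. It avoids unpickling arbitrary class references, at the cost of requiring subclasses to register themselves.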

An alternative might be to tackle this on the dask side, and ensure the retrieved part is of the same type as meta (eg could do something like res = meta._constructor(res) at https://github.com/dask/dask/blob/8aea537d925b794a94f828d35211a5da05ad9dce/dask/dataframe/shuffle.py#L740-L744)
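
A minimal sketch of that dask-side idea, assuming the retrieved part and `meta` are both available at that point. `rewrap_like_meta` and `MyFrame` are hypothetical names used only for illustration; this is not what dask's shuffle code does today.

```python
import pandas as pd


class MyFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass, standing in for e.g. GeoDataFrame."""

    @property
    def _constructor(self):
        return MyFrame


def rewrap_like_meta(part, meta):
    """Return `part` converted via `meta._constructor` if the types differ."""
    if type(part) is type(meta):
        return part
    return meta._constructor(part)


# `meta` is an empty frame of the expected partition type.
meta = MyFrame({"a": pd.Series([], dtype="int64")})
# `part` mimics what comes back from partd: a plain pandas.DataFrame.
part = pd.DataFrame({"a": [1, 2, 3]})
fixed = rewrap_like_meta(part, meta)
```

Note that a subclass may define `_constructor` as something other than the class itself, so the resulting type depends on the subclass.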

jrbourbeau · 2021-06-23T23:35:52Z

cc @madsbk @quasiben for their GPU-shuffle connection

jorisvandenbossche mentioned this issue Jun 21, 2021

BUG: set_index results in invalid dask GeoDataFrame (partitions are DataFrames) geopandas/dask-geopandas#59

Closed

jorisvandenbossche mentioned this issue Sep 24, 2021

BUG: Computing a dask shuffle returns a pd.DataFrame, not gpd.GeoDataFrame geopandas/dask-geopandas#116

Closed

charlesbluca mentioned this issue Aug 2, 2022

Generalizing pandas serialization methods #61

Open
