Feature/153 support polars in pin_write to parquet #263
base: main
pins/drivers.py (outdated):

```python
elif is_pandas_df:
    return "pandas"
else:
    assert_never(df)
```
nit: are you able to remove the

```python
if not is_polars_df and not is_pandas_df:
    return "unknown"
```

chunk to make this a bit more concise?
OR, WDYT of raising `NotImplementedError` here rather than returning `"unknown"`, since we already know `"unknown"` will cause a failure? (We could try/except `NotImplementedError` for creating the default title, potentially.)
I like the `NotImplementedError` idea a lot 😄
pins/drivers.py (outdated):

```diff
@@ -15,14 +16,38 @@
 def _assert_is_pandas_df(x):
     import pandas as pd
     df_family = _get_df_family(x)
```
Is there a benefit to using `_get_df_family()` over checking for a pandas DataFrame?
The motivation is to centralize the logic for deciding a DataFrame's type in one single location. For consistency's sake, my preference is to refactor the `_assert_is_pandas_df` function away and use a similar strategy of case checking as in these lines. The disadvantage is that the error messages could get a bit repetitive.
I agree with eventually refactoring `_assert_is_pandas_df` out in favor of `_get_df_family`! I think we can streamline the errors a bit 😄
Force-pushed from 2c2e502 to f41103e.
pins/drivers.py (outdated):

```python
raise NotImplementedError(
    f"Currently only pandas.DataFrame can be saved as type {file_type!r}."
)


def _get_df_family(df) -> Literal["pandas", "polars"]:
```
Suggested change:

```diff
-def _get_df_family(df) -> Literal["pandas", "polars"]:
+def _get_df_family(df, file_type: str) -> Literal["pandas", "polars"]:
```
Good idea, I think I see a nice abstraction for this...
Force-pushed from dcab779 to d682808.
Force-pushed from 71e58c9 to fd2a12e.
From pairing w/ @isabelizimm, one potentially handy pattern for handling multiple DataFrames could be an adaptor. Basically, everywhere a piece of DataFrame logic is could become a method on an adaptor class. For example...

```python
from abc import abstractmethod
from typing_extensions import TYPE_CHECKING, TypeAlias


class Adaptor:
    def __init__(self, data):
        self._d = data

    @abstractmethod
    def write_parquet(self, name: str):
        raise NotImplementedError()

    @abstractmethod
    def default_title(self):
        # return whatever is needed for use in default_title()
        raise NotImplementedError()


class PandasAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.to_parquet(name)

    def default_title(self):
        return "I'm an n row pandas DataFrame"


class PolarsAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.write_parquet(name)


if TYPE_CHECKING:
    import pandas as pd
    import polars as pl

    DataFrame: TypeAlias = pd.DataFrame | pl.DataFrame


def create_adaptor(d: "DataFrame") -> Adaptor:
    # TODO: some kind of conditional importing
    # can use databackend to avoid imports: https://github.com/machow/databackend
    # this is what Great Tables uses here: https://github.com/posit-dev/great-tables/blob/main/great_tables/_tbl_data.py
    import pandas as pd
    import polars as pl

    if isinstance(d, pd.DataFrame):
        return PandasAdaptor(d)
    elif isinstance(d, pl.DataFrame):
        return PolarsAdaptor(d)

    raise NotImplementedError()


import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
create_adaptor(df).default_title()
```

We could use narwhals to replace the DataFrame logic (which is what is happening in this shiny PR posit-dev/py-shiny#1570). But it might be easier just to start with pandas and polars adaptors (to airgap out DataFrame logic), and then slot narwhals in after.
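One gap in the sketch: `PolarsAdaptor` has no `default_title`. Since pandas and polars DataFrames both expose `.shape`, a shared implementation on the base class is one option. The standalone variant below is purely illustrative (class and method names are assumptions, not the PR's code):

```python
class Adaptor:
    def __init__(self, data):
        self._d = data

    def default_title(self) -> str:
        # Shared sketch: both pandas and polars DataFrames expose
        # .shape, so every subclass gets a usable title for free.
        n_rows, n_cols = self._d.shape
        return f"A DataFrame with {n_rows} rows and {n_cols} columns"
```

Subclasses could still override this when a family has something better to report.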
I've had a go at using that approach over in #298 - great idea.
To resolve #153.

A harder one is #233 - at the moment, `pandas` is the hard-coded return type for `pin_read`. I have added some lightweight documentation in the README files; perhaps it would be good to add the documentation associated with the current recommended approach for #233 at the same time, to cover both the read and write cases.