Incorrect roundtrip of index names on filtered dataframe #732

Open
philippjfr opened this issue Jan 19, 2022 · 6 comments

Comments

@philippjfr

philippjfr commented Jan 19, 2022

When a filtered DataFrame is saved to Parquet using pandas and fastparquet, the index names do not round-trip correctly:

import pandas as pd

df = pd._testing.makeMixedDataFrame()

filtered_df = df[df.A>=1]

filtered_df.to_parquet('test.parq', engine='fastparquet')

loaded_df = pd.read_parquet('test.parq')

print(filtered_df.index.names)  # FrozenList([None])
print(loaded_df.index.names)    # FrozenList(['index'])

Versions

fastparquet 0.7.2
pandas 1.3.2

@yohplala

Hello,
I believe this is expected fastparquet behavior.
When you filter, I am guessing that the index of the dataframe is no longer a range index, which means fastparquet stores it as a regular column.
In that case, when an unnamed index is reset into a column, it is given the default name 'index' (by pandas, actually).
This is what you see.
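
The mechanism described here can be seen in pandas alone, without writing a file. The following is a minimal sketch using a small hand-made frame (the values are made up and are not those of makeMixedDataFrame): a boolean filter that keeps non-consecutive rows leaves an index that is no longer a RangeIndex, and resetting an unnamed index produces a column called 'index'.

```python
import pandas as pd

# Hand-made stand-in for the frame in the report (values are illustrative).
df = pd.DataFrame({"A": [0.0, 1.0, 0.0, 2.0, 3.0], "B": list("vwxyz")})

# Keep rows 1, 3 and 4: the surviving labels are not evenly spaced.
filtered_df = df[df.A >= 1]

range_before = isinstance(df.index, pd.RangeIndex)
range_after = isinstance(filtered_df.index, pd.RangeIndex)

# Turning the unnamed index into a column makes pandas pick a default name.
reset_columns = list(filtered_df.reset_index().columns)

print(range_before, range_after)
print(filtered_df.index.names)
print(reset_columns)
```

Whether fastparquet takes exactly this path internally is not shown here; the sketch only illustrates where the name 'index' can come from.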

Is your index otherwise correct, i.e. does it hold the values you expect? (This is to check that there is no bug at that level.)

You should get the behavior you expect by passing write_index=False to fastparquet.
I believe that pandas forwards extra parameters to fastparquet.
So you would get the expected behavior with:

filtered_df.to_parquet('test.parq', engine='fastparquet', write_index=False)

Bests

@philippjfr
Author

Thanks @yohplala, I can see that reasoning and your technical explanation makes sense. However, I still disagree that this is expected: I would expect a DataFrame to round-trip exactly as is, i.e. it should pass pd.testing.assert_frame_equal(original_df, loaded_df). If you switch to engine='pyarrow', it behaves as expected.
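
The expectation stated here can be written as a small check. This sketch does not touch Parquet at all: `loaded_df` is simulated by naming the index 'index', which mimics the result shown in the report, and `roundtrips` is a hypothetical helper, not part of pandas or fastparquet.

```python
import pandas as pd


def roundtrips(original_df, loaded_df):
    """Return True if the two frames are equal, index names included."""
    try:
        pd.testing.assert_frame_equal(original_df, loaded_df)
    except AssertionError:
        return False
    return True


original_df = pd.DataFrame({"A": [0.0, 1.0, 0.0, 2.0]})
original_df = original_df[original_df.A >= 1]

# Simulated read result: same data, but the index is now named "index".
loaded_df = original_df.rename_axis("index")

named_ok = roundtrips(original_df, loaded_df)

# Clearing the name after loading restores equality, at the cost of also
# clobbering an index that was genuinely named "index".
cleared_ok = roundtrips(original_df, loaded_df.rename_axis(None))

print(named_ok, cleared_ok)
```

Resetting the name after loading is only a stopgap on the reading side; it does not change what is written to the file.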

@martindurant
Member

Indeed, I think we can call this a bug. Parquet requires that the column being saved have a real str name, but we also save pandas metadata, in which we can record the actual final name of the index. Either we are not writing that metadata, or we are not applying it correctly on read. We can check by doing the round trip pyarrow/fastparquet and fastparquet/pyarrow.
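
To make the metadata idea concrete, here is a rough sketch of how a reader could recover index names from the JSON stored under the file's pandas metadata key. The metadata below is hand-written and abbreviated for illustration; it is not output captured from fastparquet or pyarrow, and `resolve_index_names` is a hypothetical helper. It relies only on the `index_columns` and `columns` entries (each with `name` and `field_name`) described in the pandas metadata specification.

```python
import json

# Hand-written, abbreviated metadata (an assumption, not real writer output).
# The stored column needs a string field name, while "name" keeps the
# original pandas name, here null for an unnamed index.
pandas_metadata = json.loads("""
{
  "index_columns": ["index"],
  "columns": [
    {"name": "A", "field_name": "A"},
    {"name": null, "field_name": "index"}
  ]
}
""")


def resolve_index_names(meta):
    """Map each stored index field back to its original pandas name."""
    by_field = {col["field_name"]: col["name"] for col in meta["columns"]}
    names = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict):
            # Range-style descriptor: the name is stored inline.
            names.append(entry.get("name"))
        else:
            # Fall back to the field name if no column entry describes it.
            names.append(by_field.get(entry, entry))
    return names


print(resolve_index_names(pandas_metadata))
```

With metadata of this shape, the stored column can keep the name 'index' while the loaded frame gets its index name back as None.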

This behaviour has been around a long time, I think, and there are tests in dask which use both engines and explicitly ignore the name of the index, if it was None. Fixing this might break those tests! Personally, I think "index" is a fine name for an index :)

@martindurant
Member

@yohplala, you are probably well placed to make sure None round-trips, if you have any interest. I can fix any tests that this causes to fail in Dask. I have the feeling the issue isn't high priority.

@yohplala

Hi @martindurant, to be honest, I have no need for this, and I am only able to code in my spare time, a few hours per week. So this will be a very low priority for me.
That said, be assured I am a proactive supporter of fastparquet, and I would propose leaving this ticket open.
In the short term, I am prioritizing private developments, which I think I should be able to finish within the next 2 months. After that, I was thinking of dealing with some fastparquet tickets. I don't know if I will deal with this one first, but let's keep it in the stack.

@martindurant
Member

No rush! I might do it myself also, but I have a similar problem with finding time :)


3 participants