Incorrect roundtrip of index names on filtered dataframe #732

philippjfr · 2022-01-19T12:58:35Z

When saving a filtered dataframe to parquet using Pandas and fastparquet the index names are round-tripped incorrectly:

import pandas as pd

df = pd._testing.makeMixedDataFrame()

filtered_df = df[df.A>=1]

filtered_df.to_parquet('test.parq', engine='fastparquet')

loaded_df = pd.read_parquet('test.parq')

print(filtered_df.index.names)
print(loaded_df.index.names)

FrozenList([None])
FrozenList(['index'])

Versions

fastparquet 0.7.2
pandas 1.3.2


yohplala · 2022-01-19T13:17:43Z

Hello,
I believe this is expected behavior in fastparquet.
When you filter, I am guessing the index of the dataframe is no longer a range index, so fastparquet stores it as a regular column.
In that case, when an unnamed index is reset to a column, it is given the default name 'index' (by pandas, actually).
This is what you see.
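
The naming step described above can be reproduced with pandas alone. This is a minimal sketch that does not touch fastparquet; the small frame is made up for illustration, and it only shows that resetting an unnamed index yields a column called 'index':

```python
import pandas as pd

# Illustrative frame (not the one from the report).
df = pd.DataFrame({"A": [1.0, 1.0, 0.0, 2.0]})

# Filtering keeps rows 0, 1 and 3, so the index is no longer a plain 0..n-1 range.
filtered = df[df.A >= 1]
print(filtered.index.tolist())
print(filtered.index.names)

# Turning the unnamed index into a column gives it the pandas default name 'index'.
as_column = filtered.reset_index()
print(as_column.columns.tolist())
```

If the writer stores the index this way and the reader does not map the name back to None, the loaded frame ends up with an index named 'index'.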

Is your index otherwise correct, i.e. does it hold the values you expect? (Just to check there is no bug at that level.)

You should get the behavior you expect by passing write_index=False to fastparquet.
I believe pandas forwards extra parameters to fastparquet.
So you would get the expected behavior with:

filtered_df.to_parquet('test.parq', engine='fastparquet', write_index=False)

Bests

philippjfr · 2022-01-19T14:30:39Z

Thanks @yohplala, I can see that reasoning and your technical explanation makes sense. However, I still disagree that this is expected: I would expect a DataFrame to round-trip exactly as is, i.e. it should pass pd.testing.assert_frame_equal(original_df, loaded_df). If you switch to engine='pyarrow' it behaves as expected.
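
To make the round-trip expectation concrete, here is a small sketch using pandas only. It does not write any parquet file; `loaded` simply mimics a frame whose index came back named 'index'. The frame contents are made up for illustration:

```python
import pandas as pd

# Stand-in for the frame before writing: unnamed, non-range index.
original = pd.DataFrame({"A": [1.0, 2.0]}, index=[1, 3])

# Stand-in for the frame after reading: same data, index now named 'index'.
loaded = original.rename_axis("index")

# By default assert_frame_equal also compares index names, so this fails.
try:
    pd.testing.assert_frame_equal(original, loaded)
    names_match = True
except AssertionError:
    names_match = False
print(names_match)

# With check_names=False the comparison ignores index and column names and passes.
pd.testing.assert_frame_equal(original, loaded, check_names=False)
```

So the data itself survives; only the name comparison fails.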

martindurant · 2022-01-19T14:38:58Z

Indeed, I think we can call this a bug. Parquet does require that the column being saved has a real str name, but we also save pandas metadata, in which we can record the actual final name of the index. Either we are not writing the metadata or we are not applying it correctly; we can check by doing the round trip pyarrow/fastparquet and fastparquet/pyarrow.
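
As a rough illustration of how the metadata could carry the real index name, here is a sketch in plain pandas. The JSON below is a heavily simplified stand-in for the pandas metadata block, keeping only the `index_columns`, `name` and `field_name` fields; the stored column name 'index' and the helper `restore_index_names` are assumptions for illustration, not fastparquet's actual code:

```python
import json

import pandas as pd

# Simplified, hypothetical metadata: the index was stored under the field
# name "index", but its original name was None (JSON null).
pandas_metadata = json.loads("""
{
  "index_columns": ["index"],
  "columns": [
    {"name": "A", "field_name": "A"},
    {"name": null, "field_name": "index"}
  ]
}
""")


def restore_index_names(df, metadata):
    """Hypothetical helper: rebuild the index and restore its original names."""
    field_to_name = {c["field_name"]: c["name"] for c in metadata["columns"]}
    index_fields = [f for f in metadata["index_columns"] if isinstance(f, str)]
    out = df.set_index(index_fields)
    out.index.names = [field_to_name[f] for f in index_fields]
    return out


# What a reader would see before applying the metadata: 'index' is a plain column.
raw = pd.DataFrame({"index": [0, 1, 3], "A": [1.0, 1.0, 2.0]})
restored = restore_index_names(raw, pandas_metadata)
print(restored.index.names)
```

Under these assumptions the restored frame has an unnamed index again, which is the behavior the report asks for.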

This behaviour has been around a long time, I think, and there are tests in dask which use both engines and explicitly ignore the name of the index, if it was None. Fixing this might break those tests! Personally, I think "index" is a fine name for an index :)

martindurant · 2022-01-24T15:03:44Z

@yohplala, you are probably in a good place to ensure None round-trips, if you have any interest. I can fix any tests that this causes to fail in Dask. I have the feeling the issue isn't high priority.

yohplala · 2022-01-24T21:05:44Z

Hi @martindurant, to be honest, I have no need for this, and I am only able to code in my spare time, a few hours per week. So this will be a very low priority for me.
That said, be assured I am an active supporter of fastparquet, and I would propose leaving this ticket open.
In the short term, I am prioritizing private developments, which I think I should be able to finish within the next 2 months. After those, I was thinking of dealing with some fastparquet tickets. I don't know if I will deal with this one first, but let's keep it in the stack.

martindurant · 2022-01-24T22:59:24Z

No rush! I might do it myself also, but I have a similar problem with finding time :)
