
append corrupts _metadata and _common_metadata files #807

Open
zdomokos opened this issue Sep 15, 2022 · 6 comments

Comments

@zdomokos

What happened:

Saving dataframe chunks as row groups with the code below:

from pathlib import Path

import fastparquet


def save_parquet(df, file_name: Path, write_index: bool = True):
    # Append only when the dataset directory already contains parquet files.
    append = file_name.exists() and any(file_name.glob("*.parquet"))
    if append:
        fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive", append=True)
    else:
        fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive")

After 215 appends, the append write corrupts the _metadata and _common_metadata files and no further appends are possible.

INFO:root:Processing: L101
ERROR:root:Error at: 216
Traceback (most recent call last):
  File "C:\AVMSDK\Projects\iii_ch\iiich\apps\data_management\etl\adhoc\refinitiv\l1_sampled\a1.py", line 54, in split_file
    save_parquet(df, file_name=out_file, write_index=False)
  File "C:\AVMSDK\Projects\iii_ch\iiich\apps\data_management\etl\adhoc\refinitiv\l1_sampled\a1.py", line 24, in save_parquet
    fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive", append=True)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1172, in write
    pf = ParquetFile(filename, open_with=open_with)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 146, in __init__
    self._parse_header(f, verify)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 216, in _parse_header
    for rg in fmd[4]:
TypeError: 'NoneType' object is not iterable

Environment:

  • fastparquet version: 0.8.3
  • Dask version: 2022.7.0
  • Python version: 3.10 (conda-forge)
  • Operating System: Windows 11
  • Install method (conda, pip, source): conda
@martindurant
Member

Quick question: how do you know that the _common_metadata file is also corrupted? There isn't really any reason for an append operation to touch this file at all (but I haven't checked whether it does or not).

@martindurant
Member

I should have mentioned an obvious workaround, if you have no time to debug further: just write separate data files in the same directory without bothering to use append on the whole dataset. Without (_common)_metadata, a dataset is simply the union of the .parquet files in a directory (a sketch follows below).

However, was there no error/warning during append before the data became unreadable?
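A minimal sketch of that workaround, assuming hypothetical helper names save_chunk and load_dataset: each chunk is written as its own numbered .parquet file (no _metadata is ever created), and the dataset is read back as the union of those files.

from pathlib import Path

import fastparquet


def save_chunk(df, dir_name: Path):
    # Write one chunk as a standalone file; _metadata is never created or touched.
    dir_name.mkdir(parents=True, exist_ok=True)
    part = len(list(dir_name.glob("part.*.parquet")))
    # Zero-pad the part number so lexicographic sorting preserves append order.
    fastparquet.write(str(dir_name / f"part.{part:05d}.parquet"), df, compression="SNAPPY")


def load_dataset(dir_name: Path):
    # Read the dataset as the union of all .parquet files in the directory.
    files = sorted(str(f) for f in dir_name.glob("*.parquet"))
    return fastparquet.ParquetFile(files).to_pandas()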

@martindurant
Member

@yohplala , it's probably time we followed the lead of dask, spark, arrow, etc., and explicitly allowed write and append without creating the global _metadata. It could even be the default, although fastparquet is more likely to be able to fit the whole dataset.

@zdomokos
Author

how do you know that the _common_metadata file is also corrupted?

I compared the file contents before corruption (size ~1.2 kB) and after corruption (size ~30 B).

was there no error/warning during append before the data became unreadable?

The error is pasted into the ticket: #807 (comment)

I am very busy with a different project; I will get back to this issue in a week's time.

@zdomokos reopened this Sep 27, 2022
@martindurant
Member

I am very busy with a different project; I will get back to this issue in a week's time.

No worries, and I appreciate your taking the time to report.

was there no error/warning during append before the data became unreadable?

The error is pasted into the ticket

By the time of the error, the metadata file is apparently already corrupt. I was wondering if there was any warning on the previous iteration.

The next step would be to try to make a reproducer on fake/public data, since, as @yohplala points out in the partner issue, many-iteration appends do normally succeed.
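For reference, a rough sketch of what such a reproducer attempt could look like on fake data; the output path, column names, and iteration count are arbitrary, and this is not a confirmed failing case. It reopens the dataset after every append so that the first corrupt _metadata surfaces immediately.

import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import fastparquet

out = Path("append_repro")
shutil.rmtree(out, ignore_errors=True)

for i in range(300):  # the report above fails around iteration 216
    df = pd.DataFrame({"ts": np.arange(i * 100, (i + 1) * 100),
                       "value": np.random.random(100)})
    fastparquet.write(str(out), df, compression="SNAPPY",
                      file_scheme="hive", append=i > 0)
    # Reopen after every append; a TypeError here would be the corruption seen above.
    pf = fastparquet.ParquetFile(str(out))
    assert len(pf.row_groups) == i + 1, f"unexpected row-group count at iteration {i}"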

@yohplala

@yohplala , it's probably time we followed the lead of dask, spark, arrow, etc., and explicitly allowed write and append without creating the global _metadata. It could even be the default, although fastparquet is more likely to be able to fit the whole dataset.

Hi @martindurant ,
Ok, I will follow :)
Maybe we can discuss this in a specific thread?
If dask already does this by default, you likely already have a clear view about it.

My first reaction: when a ParquetFile instance is created from a list of files, we probably need a new helper to sort the parquet file names fed to ParquetFile.__init__(), to circumvent lexicographic filename sorting and use fastparquet's file-numbering logic instead.
Something based on line 23 in api.py:

PART_ID = re.compile(r'.*part.(?P<i>[\d]+).parquet$')
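A minimal sketch of such a helper (hypothetical name sorted_part_files), reusing the PART_ID pattern above to order part files by their embedded number rather than lexicographically:

import re

PART_ID = re.compile(r'.*part.(?P<i>[\d]+).parquet$')


def sorted_part_files(paths):
    # Sort by the numeric part id, e.g. part.2.parquet before part.10.parquet;
    # names that don't match the pattern fall back to plain lexicographic order.
    def key(path):
        m = PART_ID.match(path)
        return (0, int(m.group('i')), '') if m else (1, 0, path)
    return sorted(paths, key=key)


# sorted_part_files(['a/part.10.parquet', 'a/part.2.parquet'])
# -> ['a/part.2.parquet', 'a/part.10.parquet']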

Other than that, I am guessing the changes are mostly in writer.py, in the routines writing the _metadata file.

At write time, given that a "classical" append does not update the metadata in existing files, we could adopt the convention that the up-to-date metadata has to be read from the last file.
Hence the interest of fastparquet's file naming convention, which preserves file order in the filenames
(as opposed to pyarrow's naming convention, I believe).
