
append corrupts _metadata and _common_metadata files #807

Open
zdomokos opened this issue Sep 15, 2022 · 6 comments

Comments

@zdomokos

What happened:

Saving dataframe chunks as row groups with the code below:

from pathlib import Path

import fastparquet


def save_parquet(df, file_name: Path, write_index: bool = True):
    # Append only when the dataset directory already contains parquet files.
    append = file_name.exists() and any(file_name.glob("*.parquet"))
    if append:
        fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive", append=True)
    else:
        fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive")

After 215 appends, the append write corrupts the _metadata and _common_metadata files and no further appends are possible.

INFO:root:Processing: L101
ERROR:root:Error at: 216
Traceback (most recent call last):
  File "C:\AVMSDK\Projects\iii_ch\iiich\apps\data_management\etl\adhoc\refinitiv\l1_sampled\a1.py", line 54, in split_file
    save_parquet(df, file_name=out_file, write_index=False)
  File "C:\AVMSDK\Projects\iii_ch\iiich\apps\data_management\etl\adhoc\refinitiv\l1_sampled\a1.py", line 24, in save_parquet
    fastparquet.write(str(file_name), df, write_index=write_index, compression='SNAPPY', file_scheme="hive", append=True)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1172, in write
    pf = ParquetFile(filename, open_with=open_with)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 146, in __init__
    self._parse_header(f, verify)
  File "C:\Users\zdomokos\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 216, in _parse_header
    for rg in fmd[4]:
TypeError: 'NoneType' object is not iterable

Environment:

  • fastparquet version: 0.8.3
  • Dask version: 2022.7.0
  • Python version: 3.10 (conda-forge)
  • Operating System: Windows 11
  • Install method (conda, pip, source): conda
@martindurant
Member

Quick question: how do you know that the _common_metadata file is also corrupted? There isn't really any reason for an append operation to touch this file at all (but I haven't checked whether it does or not).

@martindurant
Member

I should have mentioned an obvious workaround, if you have no time to debug further: just write separate data files in the same directory without bothering to use append on the whole dataset. Without (_common)_metadata, a dataset is simply the union of the .parquet files in a directory (a sketch follows below).

However, was there no error/warning during append before the data became unreadable?
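A minimal sketch of that workaround, assuming hypothetical helper names save_chunk and load_dataset: each chunk is written as its own numbered .parquet file (no _metadata is ever created), and the dataset is read back as the union of those files.

from pathlib import Path

import fastparquet


def save_chunk(df, dir_name: Path):
    # Write one chunk as a standalone file; _metadata is never created or touched.
    dir_name.mkdir(parents=True, exist_ok=True)
    part = len(list(dir_name.glob("part.*.parquet")))
    # Zero-pad the part number so lexicographic sorting preserves append order.
    fastparquet.write(str(dir_name / f"part.{part:05d}.parquet"), df, compression="SNAPPY")


def load_dataset(dir_name: Path):
    # Read the dataset as the union of all .parquet files in the directory.
    files = sorted(str(f) for f in dir_name.glob("*.parquet"))
    return fastparquet.ParquetFile(files).to_pandas()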

@martindurant
Member

@yohplala , it's probably time we followed the lead of dask, spark, arrow, etc., and explicitly allowed write and append without creating the global _metadata. It could even be the default, although fastparquet is more likely to be able to fit the whole dataset.

@zdomokos
Author

how do you know that the _common_metadata file is also corrupted?

I compared the file contents before corruption (size ~1.2 kB) and after corruption (size ~30 B).

was there no error/warning during append before the data became unreadable?

The error is pasted into the ticket: #807 (comment)

I am very busy with a different project; I will get back to this issue in a week's time.

@zdomokos reopened this Sep 27, 2022
@martindurant
Member

I am very busy with a different project; I will get back to this issue in a week's time.

No worries, and I appreciate your taking the time to report.

was there no error/warning during append before the data became unreadable?

The error is pasted into the ticket

By the time of the error, the metadata file is apparently already corrupt. I was wondering if there was any warning on the previous iteration.

The next step would be to try to make a reproducer on fake/public data, since, as @yohplala points out in the partner issue, many-iteration appends do normally succeed.
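For reference, a rough sketch of what such a reproducer attempt could look like on fake data; the output path, column names, and iteration count are arbitrary, and this is not a confirmed failing case. It reopens the dataset after every append so that the first corrupt _metadata surfaces immediately.

import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import fastparquet

out = Path("append_repro")
shutil.rmtree(out, ignore_errors=True)

for i in range(300):  # the report above fails around iteration 216
    df = pd.DataFrame({"ts": np.arange(i * 100, (i + 1) * 100),
                       "value": np.random.random(100)})
    fastparquet.write(str(out), df, compression="SNAPPY",
                      file_scheme="hive", append=i > 0)
    # Reopen after every append; a TypeError here would be the corruption seen above.
    pf = fastparquet.ParquetFile(str(out))
    assert len(pf.row_groups) == i + 1, f"unexpected row-group count at iteration {i}"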

@yohplala

@yohplala , it's probably time we followed the lead of dask, spark, arrow, etc., and explicitly allowed write and append without creating the global _metadata. It could even be the default, although fastparquet is more likely to be able to fit the whole dataset.

Hi @martindurant ,
Ok, I will follow :)
Maybe we can discuss this in a specific thread?
If dask already does this by default, you likely already have a clear view about it.

My first reaction: when a ParquetFile instance is created from a list of files, we probably need a new helper to sort the parquet file names fed to ParquetFile.__init__(), to circumvent lexicographic filename sorting and use fastparquet's file-numbering logic instead.
Something based on line 23 in api.py:

PART_ID = re.compile(r'.*part.(?P<i>[\d]+).parquet$')
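A minimal sketch of such a helper (hypothetical name sorted_part_files), reusing the PART_ID pattern above to order part files by their embedded number rather than lexicographically:

import re

PART_ID = re.compile(r'.*part.(?P<i>[\d]+).parquet$')


def sorted_part_files(paths):
    # Sort by the numeric part id, e.g. part.2.parquet before part.10.parquet;
    # names that don't match the pattern fall back to plain lexicographic order.
    def key(path):
        m = PART_ID.match(path)
        return (0, int(m.group('i')), '') if m else (1, 0, path)
    return sorted(paths, key=key)


# sorted_part_files(['a/part.10.parquet', 'a/part.2.parquet'])
# -> ['a/part.2.parquet', 'a/part.10.parquet']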

Other than that, I am guessing the changes are mostly in writer.py, in the routines writing the _metadata file.

At write time, given that a "classical" append does not update the metadata in existing files, we could adopt the convention that the up-to-date metadata has to be read from the last file.
Hence the interest of fastparquet's file naming convention, which preserves file order in the filenames
(as opposed to pyarrow's naming convention, I believe).
