OverflowError: Python int too large to convert to C long #806

Closed · zdomokos opened this issue Sep 15, 2022 · 16 comments · Fixed by #824

Comments

@zdomokos commented Sep 15, 2022

Saving 100,000,000 records with

df.to_parquet(str(out_filepart), engine='fastparquet', compression='snappy', index=False, partition_cols=['#RIC'])

70% of the time I get this error message:

Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long

The exception is displayed in the terminal but must be caught inside fastparquet, because my code continues running.
Do I lose data? Yes, I do.

The data cannot be read back at all.

Using the read method:

df1 = pd.read_parquet(pqfile, engine='fastparquet')  

This throws the following exception; here is the stack trace:

Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-048300019e1f>", line 1, in <cell line: 1>
    runfile('D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py', wdir='D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled')
  File "C:\Users\zdomokos\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\222.3739.56\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\zdomokos\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\222.3739.56\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py", line 26, in <module>
    read_back_data(pqfile)
  File "D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py", line 8, in read_back_data
    df1 = pd.read_parquet(pqfile, engine='fastparquet')  # https://fastparquet.readthedocs.io/en/latest/quickstart.html#reading
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\pandas\io\parquet.py", line 493, in read_parquet
    return impl.read(
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\pandas\io\parquet.py", line 347, in read
    result = parquet_file.to_pandas(columns=columns, **kwargs)
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 752, in to_pandas
    size = sum(rg.num_rows for rg in rgs)
TypeError: unsupported operand type(s) for +: 'int' and 'dict'
@martindurant (Member)

Is this happening as part of the workflow in #807? I would suggest that the two are related. Could you please provide the approximate number of rows per row group, the number of columns, and the approximate row-group size on disk? Perhaps df.dtypes.value_counts() would be useful too.

@zdomokos (Author) commented Sep 15, 2022

No, it is not related to #807. In this case I am not using append, just writing through pandas; in #807 the fastparquet write method is used directly. I have updated the bug report with the information that it does lose data, as it cannot be read back.
Rows: 10,000,000 per row group
Columns: 5
Size on disk for the entire write is ~1 GB, spread across multiple folders because partition_cols=['#RIC'] is used in the to_parquet call.
Saving the same data with the 'pyarrow' engine works.

@martindurant (Member)

I intuit that it might still be related :)

Is there any traceback?

At a guess, the row groups need to be smaller than that. Fastparquet does not split row groups into "pages", because these aren't used by any chunk-wise loader, but pages do have different dtypes for the metadata fields that say how big they are. Arrow probably does use pages internally.
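For reference, a rough, untested sketch of what writing smaller row groups with fastparquet directly might look like; the output directory name and the 1,000,000-row value are arbitrary choices here:

import fastparquet

# Sketch only: as an integer, row_group_offsets means "this many rows per
# row group" (fastparquet's default is 50,000,000); smaller groups keep the
# per-row-group metadata counters small.
fastparquet.write(
    "out_dir",                    # a directory, since hive partitioning is used
    df,                           # the DataFrame from the report above
    row_group_offsets=1_000_000,
    compression="snappy",
    file_scheme="hive",
    partition_on=["#RIC"],
)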

@zdomokos (Author)

I tried with a 1,000,000-row row group; same exception. Practically, I would not want to go below this size for my application. I pasted the traceback into the first comment and added more information there. Thanks a lot for responding to this.

@martindurant (Member)

Are any of your columns strings, and if so, what typical lengths do they have?

Do you think it's feasible to recreate the error with a function generating random data?
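For illustration, a rough, untested sketch of such a generator; the column names follow the schema posted further down, while the RIC codes, prices and sizes are invented:

import numpy as np
import pandas as pd

# Hypothetical generator: three string columns and four float columns,
# shaped like the reported data.
def make_fake_quotes(n, n_rics=500):
    rng = np.random.default_rng(0)
    rics = np.array([f"RIC{i:04d}.N" for i in range(n_rics)])
    return pd.DataFrame({
        "#RIC": rics[rng.integers(0, n_rics, n)],
        "Date-Time": pd.date_range("2022-01-01", periods=n, freq="ms").astype(str),
        "Type": rng.choice(["Quote", "Trade"], n),
        "Bid Price": rng.random(n) * 100,
        "Bid Size": rng.integers(1, 1000, n).astype("float64"),
        "Ask Price": rng.random(n) * 100,
        "Ask Size": rng.integers(1, 1000, n).astype("float64"),
    })

# Heavy: 10,000,000 rows, matching one row group from the report.
df = make_fake_quotes(10_000_000)
df.to_parquet("repro", engine="fastparquet", index=False, partition_cols=["#RIC"])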

@martindurant (Member)

(sorry for all the questions, it's very hard to find the cause of this kind of problem!)

@zdomokos (Author)

This is what the data looks like:

[image: preview of the data]

{
  "Version": 1,
  "Num_rows": 10000000,
  "Created_by": "fastparquet-python version 0.8.3 (build 0)",
  "Schema": [
    {"Field_id": 0, "Name": "#RIC",      "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Date-Time", "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Type",      "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Bid Price", "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Bid Size",  "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Ask Price", "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Ask Size",  "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"}
  ]
}

{
  "column_indexes": [
    {"field_name": null, "metadata": null, "name": null, "numpy_type": "object", "pandas_type": "mixed-integer"}
  ],
  "columns": [
    {"field_name": "#RIC",      "metadata": null, "name": "#RIC",      "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Date-Time", "metadata": null, "name": "Date-Time", "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Type",      "metadata": null, "name": "Type",      "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Bid Price", "metadata": null, "name": "Bid Price", "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Bid Size",  "metadata": null, "name": "Bid Size",  "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Ask Price", "metadata": null, "name": "Ask Price", "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Ask Size",  "metadata": null, "name": "Ask Size",  "numpy_type": "float64", "pandas_type": "float64"}
  ],
  "creator": {"library": "fastparquet", "version": "0.8.3"},
  "index_columns": [],
  "pandas_version": "1.4.2",
  "partition_columns": []
}

@zdomokos (Author)

Trying to save the dataframe back:

df = dd.read_parquet(parquet_files2)
df['data_datetime'] = dd.to_datetime(df['Date-Time'], utc=True, infer_datetime_format=True)
df = df.drop(['Date-Time'], axis=1)
df.compute()
df.to_parquet(file_name, engine='fastparquet')

I get this:

Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long
Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long
Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long

@yohplala

> I tried with a 1,000,000-row row group; same exception. Practically, I would not want to go below this size for my application. I pasted the traceback into the first comment and added more information there. Thanks a lot for responding to this.

Reacting to #807 here.
With a dummy dataset, writing 217 row groups does not cause any trouble.
Do you think you could try with a single data chunk of a few rows and write it X times to your disk until it fails?
If it fails, could you post the dataset here? (You can embed files in GitHub; just post the dataset as a single parquet file, for instance.)
Best,
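A rough, untested sketch of that experiment; note that because the OverflowError is "ignored" (printed to stderr rather than raised), you would have to watch the terminal output while it runs:

import pandas as pd
import fastparquet

# Append one small chunk many times and watch for the thrift OverflowError.
chunk = pd.DataFrame({"#RIC": ["AAA.N"] * 10, "Bid Price": [1.0] * 10})

fastparquet.write("probe.parquet", chunk)              # first write creates the file
for _ in range(500):
    fastparquet.write("probe.parquet", chunk, append=True)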

@martindurant (Member)

On a hunch, can you please try writing with stats=False?
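Assuming the installed fastparquet version accepts the stats keyword, it can be passed straight through the pandas call; a sketch of what that might look like:

df.to_parquet(
    "out_dir",
    engine="fastparquet",
    compression="snappy",
    index=False,
    partition_cols=["#RIC"],
    stats=False,   # skip min/max column statistics (the hunch above)
)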

@hydriniumh2 commented Sep 20, 2022

Is this related to this issue? #348
I was having the same issue with a 3-million-record file, and setting row_group_offsets=1_000_000_00 down from 50,000,000 resolved it for me.
The OverflowError not raising an exception is problematic, though: we didn't even notice our scripts were breaking, because all runs finished without raising errors.

@SebastianLopezO

In this case I would recommend using pyarrow, because fastparquet has a significant limitation when it comes to processing more than 10,000,000 rows or 1 GB per module, so pyarrow is the better choice.

It is installed with this:

pip install pyarrow

In the code use:

df.to_parquet(pqfile, engine='pyarrow')

@martindurant (Member)

pyarrow also has limitations...
I would recommend improving fastparquet as much as possible to deal with a wider set of user workflows. Currently, it is left up to the user to find a suitable value for row_group_offsets, and we could try to make a heuristic to find a safe value. We could also explicitly introduce checks in fastparquet.writer (look for occurrences of i32). Let's make fastparquet as good as we can.
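Not fastparquet code, just an illustrative sketch of the kind of check meant here: validate that a value destined for a thrift i32 field actually fits before writing, instead of overflowing silently at the C level:

I32_MAX = 2**31 - 1

def check_i32(value, name):
    # Raise a clear error if `value` cannot be stored in an i32 thrift field.
    if not -I32_MAX - 1 <= value <= I32_MAX:
        raise OverflowError(
            f"{name}={value} does not fit in an i32 thrift field; "
            "try smaller row groups (row_group_offsets)."
        )
    return value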

@martindurant (Member)

Can you please check if #824 successfully raises an exception without silently writing corrupted data?

@martindurant (Member)

I also added an "auto" flag to prevent stats aggregation for bytes types and a warning when the row group size seems big. I would appreciate testers of that PR.

@martindurant (Member)

@SebastianLopezO, we pride ourselves on our responsiveness. The issue is now fixed in #824 and will be in the next release.
