OverflowError: Python int too large to convert to C long #806

Closed · zdomokos opened this issue Sep 15, 2022 · 16 comments · Fixed by #824

Comments

@zdomokos commented Sep 15, 2022

Saving 100,000,000 records with

df.to_parquet(str(out_filepart), engine='fastparquet', compression='snappy', index=False, partition_cols=['#RIC'])

70% of the time I get this error message:

Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long

The exception is displayed in the terminal but must be caught inside fastparquet, because my code continues running.
Do I lose data? Yes, I do.

The data cannot be read back at all.

Using the read method:

df1 = pd.read_parquet(pqfile, engine='fastparquet')  

This throws the following exception; here is the stack trace:

Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-048300019e1f>", line 1, in <cell line: 1>
    runfile('D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py', wdir='D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled')
  File "C:\Users\zdomokos\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\222.3739.56\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\zdomokos\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\222.3739.56\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py", line 26, in <module>
    read_back_data(pqfile)
  File "D:/AVMSDK/Projects/iii_ch/iiich/apps/data_management/etl/adhoc/refinitiv/l1_sampled/parquet_read.py", line 8, in read_back_data
    df1 = pd.read_parquet(pqfile, engine='fastparquet')  # https://fastparquet.readthedocs.io/en/latest/quickstart.html#reading
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\pandas\io\parquet.py", line 493, in read_parquet
    return impl.read(
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\pandas\io\parquet.py", line 347, in read
    result = parquet_file.to_pandas(columns=columns, **kwargs)
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\api.py", line 752, in to_pandas
    size = sum(rg.num_rows for rg in rgs)
TypeError: unsupported operand type(s) for +: 'int' and 'dict'
@martindurant (Member)

Is this happening as part of the workflow in #807? I would suggest that the two are related. Could you please provide the approximate number of rows per row group, the number of columns, and the approximate row-group size on disk? Perhaps df.dtypes.value_counts() would be useful too.

@zdomokos (Author) commented Sep 15, 2022

No, it is not related to #807. In this case I am not using append, just writing through pandas; in #807 the fastparquet write method is used directly. I have updated the bug report with the information that it does lose data, as it cannot be read back.
Rows: 10,000,000 per row group
Columns: 5
Size on disk for the entire write is ~1 GB, spread across multiple folders because partition_cols=['#RIC'] is used in the to_parquet call.
Saving the same data with the 'pyarrow' engine works.

@martindurant (Member)

I intuit that it might still be related :)

Is there any traceback?

At a guess, the row groups need to be smaller than that. Fastparquet does not split row groups into "pages", because these aren't used by any chunk-wise loader, but pages do have different dtypes for the metadata fields that say how big they are. Arrow probably does use pages internally.
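For reference, a rough, untested sketch of what writing smaller row groups with fastparquet directly might look like; the output directory name and the 1,000,000-row value are arbitrary choices here:

import fastparquet

# Sketch only: as an integer, row_group_offsets means "this many rows per
# row group" (fastparquet's default is 50,000,000); smaller groups keep the
# per-row-group metadata counters small.
fastparquet.write(
    "out_dir",                    # a directory, since hive partitioning is used
    df,                           # the DataFrame from the report above
    row_group_offsets=1_000_000,
    compression="snappy",
    file_scheme="hive",
    partition_on=["#RIC"],
)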

@zdomokos (Author)

I tried with a 1,000,000-row row group; same exception. Practically, I would not want to go below this size for my application. I pasted the traceback into the first comment and added more information there. Thanks a lot for responding to this.

@martindurant (Member)

Are any of your columns strings, and if so, what typical lengths do they have?

Do you think it's feasible to recreate the error with a function generating random data?
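For illustration, a rough, untested sketch of such a generator; the column names follow the schema posted further down, while the RIC codes, prices and sizes are invented:

import numpy as np
import pandas as pd

# Hypothetical generator: three string columns and four float columns,
# shaped like the reported data.
def make_fake_quotes(n, n_rics=500):
    rng = np.random.default_rng(0)
    rics = np.array([f"RIC{i:04d}.N" for i in range(n_rics)])
    return pd.DataFrame({
        "#RIC": rics[rng.integers(0, n_rics, n)],
        "Date-Time": pd.date_range("2022-01-01", periods=n, freq="ms").astype(str),
        "Type": rng.choice(["Quote", "Trade"], n),
        "Bid Price": rng.random(n) * 100,
        "Bid Size": rng.integers(1, 1000, n).astype("float64"),
        "Ask Price": rng.random(n) * 100,
        "Ask Size": rng.integers(1, 1000, n).astype("float64"),
    })

# Heavy: 10,000,000 rows, matching one row group from the report.
df = make_fake_quotes(10_000_000)
df.to_parquet("repro", engine="fastparquet", index=False, partition_cols=["#RIC"])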

@martindurant (Member)

(sorry for all the questions, it's very hard to find the cause of this kind of problem!)

@zdomokos (Author)

This is what the data looks like:

[image: preview of the data]

{
  "Version": 1,
  "Num_rows": 10000000,
  "Created_by": "fastparquet-python version 0.8.3 (build 0)",
  "Schema": [
    {"Field_id": 0, "Name": "#RIC",      "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Date-Time", "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Type",      "Type": "BYTE_ARRAY", "Type_length": 0,  "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Bid Price", "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Bid Size",  "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Ask Price", "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"},
    {"Field_id": 0, "Name": "Ask Size",  "Type": "DOUBLE",     "Type_length": 64, "LogicalType": null, "Scale": 0, "Precision": 0, "Repetition_type": "OPTIONAL", "Converted_type": "UTF8"}
  ]
}

{
  "column_indexes": [
    {"field_name": null, "metadata": null, "name": null, "numpy_type": "object", "pandas_type": "mixed-integer"}
  ],
  "columns": [
    {"field_name": "#RIC",      "metadata": null, "name": "#RIC",      "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Date-Time", "metadata": null, "name": "Date-Time", "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Type",      "metadata": null, "name": "Type",      "numpy_type": "object",  "pandas_type": "unicode"},
    {"field_name": "Bid Price", "metadata": null, "name": "Bid Price", "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Bid Size",  "metadata": null, "name": "Bid Size",  "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Ask Price", "metadata": null, "name": "Ask Price", "numpy_type": "float64", "pandas_type": "float64"},
    {"field_name": "Ask Size",  "metadata": null, "name": "Ask Size",  "numpy_type": "float64", "pandas_type": "float64"}
  ],
  "creator": {"library": "fastparquet", "version": "0.8.3"},
  "index_columns": [],
  "pandas_version": "1.4.2",
  "partition_columns": []
}

@zdomokos (Author)

Trying to save the dataframe back:

df = dd.read_parquet(parquet_files2)
df['data_datetime'] = dd.to_datetime(df['Date-Time'], utc=True, infer_datetime_format=True)
df = df.drop(['Date-Time'], axis=1)
df.compute()
df.to_parquet(file_name, engine='fastparquet')

I get this:

Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long
Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long
Exception ignored in: 'fastparquet.cencoding.write_thrift'
Traceback (most recent call last):
  File "C:\Python\Anaconda3\envs\iii_ch\lib\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
OverflowError: Python int too large to convert to C long

@yohplala

> I tried with a 1,000,000-row row group; same exception. Practically, I would not want to go below this size for my application. I pasted the traceback into the first comment and added more information there. Thanks a lot for responding to this.

Reacting to #807 here.
With a dummy dataset, writing 217 row groups does not cause any trouble.
Do you think you could try with a single data chunk of a few rows and write it X times to your disk until it fails?
If it fails, could you post the dataset here? (You can embed files in GitHub; just post the dataset as a single parquet file, for instance.)
Best,
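A rough, untested sketch of that experiment; note that because the OverflowError is "ignored" (printed to stderr rather than raised), you would have to watch the terminal output while it runs:

import pandas as pd
import fastparquet

# Append one small chunk many times and watch for the thrift OverflowError.
chunk = pd.DataFrame({"#RIC": ["AAA.N"] * 10, "Bid Price": [1.0] * 10})

fastparquet.write("probe.parquet", chunk)              # first write creates the file
for _ in range(500):
    fastparquet.write("probe.parquet", chunk, append=True)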

@martindurant (Member)

On a hunch, can you please try writing with stats=False?
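Assuming the installed fastparquet version accepts the stats keyword, it can be passed straight through the pandas call; a sketch of what that might look like:

df.to_parquet(
    "out_dir",
    engine="fastparquet",
    compression="snappy",
    index=False,
    partition_cols=["#RIC"],
    stats=False,   # skip min/max column statistics (the hunch above)
)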

@hydriniumh2 commented Sep 20, 2022

Is this related to this issue? #348
I was having the same issue with a 3-million-record file, and setting row_group_offsets=1_000_000_00 down from 50,000,000 resolved it for me.
The OverflowError not raising an exception is problematic, though: we didn't even notice our scripts were breaking, because all runs finished without raising errors.

@SebastianLopezO

In this case I would recommend using pyarrow, because fastparquet has a significant limitation when it comes to processing more than 10,000,000 rows or 1 GB per module, so pyarrow is the better choice.

It is installed with this:

pip install pyarrow

In the code use:

df.to_parquet(pqfile, engine='pyarrow')

@martindurant (Member)

pyarrow also has limitations...
I would recommend improving fastparquet as much as possible to deal with a wider set of user workflows. Currently, it is left up to the user to find a suitable value for row_group_offsets, and we could try to make a heuristic to find a safe value. We could also explicitly introduce checks in fastparquet.writer (look for occurrences of i32). Let's make fastparquet as good as we can.
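Not fastparquet code, just an illustrative sketch of the kind of check meant here: validate that a value destined for a thrift i32 field actually fits before writing, instead of overflowing silently at the C level:

I32_MAX = 2**31 - 1

def check_i32(value, name):
    # Raise a clear error if `value` cannot be stored in an i32 thrift field.
    if not -I32_MAX - 1 <= value <= I32_MAX:
        raise OverflowError(
            f"{name}={value} does not fit in an i32 thrift field; "
            "try smaller row groups (row_group_offsets)."
        )
    return value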

@martindurant (Member)

Can you please check if #824 successfully raises an exception without silently writing corrupted data?

@martindurant (Member)

I also added an "auto" flag to prevent stats aggregation for bytes types and a warning when the row group size seems big. I would appreciate testers of that PR.

@martindurant (Member)

@SebastianLopezO, we pride ourselves on our responsiveness. The issue is now fixed in #824 and will be in the next release.
