OverflowError: Python int too large to convert to C long #806
Comments
Is this happening as part of the workflow in #807? I would suggest that the two are related. Could you please provide the approximate number of rows per row group, the number of columns, and the approximate row-group size on disk? Perhaps
No, it is not related to #807. In this case I am not using append, just writing through pandas; in #807 the fastparquet write method is used directly. I updated the bug report with the information that it does lose data, as it cannot be read back.
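For context, the figures asked about (rows per row group, column count, row-group size on disk) can be read from parquet metadata when a file still opens. A minimal sketch, assuming fastparquet's usual `ParquetFile` attributes and an illustrative file name:

```python
import fastparquet

# Hypothetical file name; use any file written the same way that still opens.
# Each entry of pf.row_groups is the raw parquet metadata for one row group.
pf = fastparquet.ParquetFile("sample.parquet")
print("number of columns:", len(pf.columns))
for rg in pf.row_groups:
    print("rows:", rg.num_rows, "bytes on disk:", rg.total_byte_size)
```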
I intuit that it might still be related :) Is there any traceback? At a guess, the row groups need to be smaller than that. Fastparquet does not split row groups into "pages", because these aren't used by any chunk-wise loader, but they do have different dtypes for the fields that tell you how big they are; arrow probably does use pages internally.
I tried with a row-group size of 1,000,000; same exception. Practically, I would not want to go below this size for my application. I pasted the traceback into the first comment and added more info there. Thanks a lot for responding to this.
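For reference, a minimal sketch of how a smaller row-group size can be requested when writing through pandas; the dataframe and file name here are illustrative, and `row_group_offsets` is forwarded to `fastparquet.write`:

```python
import pandas as pd

# Illustrative dataframe standing in for the real data.
df = pd.DataFrame({"x": range(10)})

# row_group_offsets passed as an int is treated as the (approximate)
# number of rows per row group.
df.to_parquet("out.parquet", engine="fastparquet", row_group_offsets=1_000_000)
```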
Are any of your columns strings, and if so, what typical lengths do they have? Do you think it's feasible to recreate the error with a function generating random data?
(Sorry for all the questions, it's very hard to find the cause of this kind of problem!)
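One possible way to attempt such a random-data reproduction, as a sketch only: column names, dtypes, and sizes are made up, and the row count would need to be scaled up toward the failing size as memory allows.

```python
import numpy as np
import pandas as pd

n = 10_000_000  # increase toward 100_000_000 if memory permits

# Random data with integer, float, and short-string columns.
df = pd.DataFrame({
    "id": np.arange(n, dtype="int64"),
    "value": np.random.random(n),
    "label": np.random.choice(["alpha", "beta", "gamma"], size=n),
})

# Write with the fastparquet engine and try to read the file back.
df.to_parquet("repro.parquet", engine="fastparquet", row_group_offsets=1_000_000)
back = pd.read_parquet("repro.parquet", engine="fastparquet")
print(len(back))
```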
This is what the data looks like:
Trying to save the dataframe back, I get this:
Reacting to #807 here.
On a hunch, can you please try writing with
Is this related to this issue? #348
In this case I would recommend using pyarrow, because fastparquet has a significant limitation when it comes to processing more than 10,000,000 rows or 1 GB at a time, so pyarrow is the better choice. It is installed with this:
In the code use:
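For reference, a minimal sketch of installing pyarrow and selecting it as the pandas engine; the file name and data are illustrative:

```python
# Install first: pip install pyarrow
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_parquet("data.parquet", engine="pyarrow")           # write with pyarrow
back = pd.read_parquet("data.parquet", engine="pyarrow")  # read it back
```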
pyarrow also has limitations...
Can you please check if #824 successfully raises an exception without silently writing corrupted data?
I also added an "auto" flag to prevent stats aggregation for bytes types, and a warning when the row group size seems big. I would appreciate testers of that PR.
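A sketch of how the change might be exercised once the PR is merged, assuming the flag is exposed as the `stats` keyword of `fastparquet.write`; the exact name and accepted values should be checked against the released API.

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})  # illustrative data

# Assumed keyword: stats="auto" skips statistics for bytes/str columns and is
# expected to warn when a row group looks too big; verify against the release.
fastparquet.write("out.parquet", df, row_group_offsets=1_000_000, stats="auto")
```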
@SebastianLopezO, we pride ourselves on our responsiveness. The issue is now fixed in #824 and will be in the next release.
Saving 100,000,000 records with
70% of the time I get this error message:
The exception is displayed in the terminal, but it must be caught inside fastparquet, as my code continues.
Do I lose data? Yes, I do.
The data cannot be read back at all.
Using the read method:
throws this exception; here is the stack: