I have a use case very similar to #1177, but `read_csv_chunked` is giving me an error for a very long file.

I am reading in a flat file with ~3 billion rows using the `SideEffectChunkCallback`. The callback lightly processes each chunk, and the resulting data frame is then uploaded into a database using `DBI::dbAppendTable()`. At some point around row `INT_MAX` I get this:
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
attempt to set index 10000/10000 in SET_STRING_ELT
This error appears to differ depending on the data type(s) being read in (and perhaps on how far past `INT_MAX` it goes). For example, reading integer columns causes a segfault, while reading doubles sometimes segfaults and sometimes just raises a warning. A cursory glance at the source suggests the differences lie in the `Collector` implementation.
# Integer
*** caught segfault ***
address 0x[...], cause 'memory not mapped'
malloc(): memory corruption
Aborted
# Occasionally, I see this when reading a Double column
Warning message:
In read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
NAs introduced by coercion to integer range
Right now, my workaround is splitting the file ahead of this step, but it
would be really nice to remove that dependency and run everything in one R script.
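For reference, the split-and-loop workaround looks roughly like this. It's only a sketch: the GNU `split` call, the SQLite connection, the table name, and the single column `x` are illustrative placeholders (and it assumes `test.csv` has no header row), not my actual pipeline.

```r
library(readr)
library(DBI)

# Placeholder connection and target table; the real ones are omitted here.
con <- DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")

# Split the flat file into pieces comfortably below INT_MAX rows each
# (GNU split assumed; test.csv assumed to have no header row).
system("split -l 1000000000 -d test.csv part_")

# The callback appends each (lightly processed) chunk to the database.
append_chunk <- SideEffectChunkCallback$new(function(chunk, pos) {
  DBI::dbAppendTable(con, "target_table", chunk)
})

for (piece in list.files(pattern = "^part_")) {
  read_csv_chunked(
    piece,
    append_chunk,
    chunk_size = 10000,
    col_names = "x",                       # placeholder column name
    col_types = cols(x = col_character())
  )
}

DBI::dbDisconnect(con)
```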
Is there any interest in addressing this use case? I haven't looked into feasibility yet, but I'm hoping it will come down to just replacing some `int`s with `R_xlen_t`s or something to that effect. If I find it's relatively straightforward to address, would a PR be welcome?
Reprex
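A minimal sketch of the kind of call that hits the error (the column name `x` and the character column spec are placeholders, and the callback is a no-op, since the failure appears to come from the reader itself rather than from anything the callback does):

```r
library(readr)

# test.csv: a single-column file with ~3 billion rows (generated by the
# C program below).
noop <- SideEffectChunkCallback$new(function(chunk, pos) invisible(NULL))

read_csv_chunked(
  "test.csv",
  noop,
  chunk_size = 10000,
  col_names = "x",                       # placeholder column name
  col_types = cols(x = col_character())
)
#> Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
#>   attempt to set index 10000/10000 in SET_STRING_ELT
```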
And test.csv was generated with this simple C program.
Session Info
PS: I've gotten a lot of mileage out of the `read_*_chunked` family of functions over the years. Thanks for these!