read_csv_chunked fails when file contains greater than INT_MAX rows #1554

Open
pegeler opened this issue Sep 2, 2024 · 0 comments
pegeler commented Sep 2, 2024

I have a use-case very similar to #1177, except that read_csv_chunked gives me an error on a very long file.
I am reading a flat file with ~3 billion rows using a SideEffectChunkCallback.
The callback lightly processes each chunk and uploads the resulting data frame
to a database with DBI::dbAppendTable(). At some point around row
INT_MAX I get this:

Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  :
  attempt to set index 10000/10000 in SET_STRING_ELT

The error varies depending on the data type(s) being read in (and perhaps on
how far past INT_MAX the file goes). For example, reading integer columns
causes a segfault, and reading doubles sometimes segfaults and sometimes only
raises a warning. A cursory glance at the source suggests the differences lie
in the Collector implementation.

# Integer
 *** caught segfault ***
address 0x[...], cause 'memory not mapped'
malloc(): memory corruption
Aborted

# Occasionally, I see this when reading a Double column
Warning message:
In read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  :
  NAs introduced by coercion to integer range
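
For context, my pipeline is roughly equivalent to the sketch below (the connection, table name, and per-chunk processing are placeholders, not my actual code):

library(DBI)

# Placeholder connection; the real script writes to a different database.
con <- dbConnect(RSQLite::SQLite(), "scratch.db")

readr::read_csv_chunked(
  file = "big_file.csv",   # ~3 billion rows
  col_types = "c",
  chunk_size = 10000,      # the default, matching the 10000/10000 in the error above
  callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
    # Light processing on the chunk, then append it to the database.
    DBI::dbAppendTable(con, "placeholder_table", chunk)
  })
)

dbDisconnect(con)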

Right now, my workaround is to split the file ahead of this step, but it
would be really nice to remove that dependency and run everything in one R script
(see the sketch below). Is there any interest in addressing this use-case? I haven't looked into
feasibility yet, but I'm hoping it comes down to replacing some
ints with R_xlen_ts or something to that effect. If it turns out to be relatively
straightforward to address, would a PR be welcome?
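
Concretely, the current workaround amounts to something like this (the split command and file names are illustrative):

# Split ahead of time, e.g. with GNU coreutils (drop the header first):
#   tail -n +2 big_file.csv | split --lines=1000000000 - part_
for (f in Sys.glob("part_*")) {
  readr::read_csv_chunked(
    file = f,
    col_types = "c",
    col_names = "header",  # the split parts carry no header line
    callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
      # same per-chunk processing + DBI::dbAppendTable() as above
    })
  )
}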

Reprex

#!/usr/bin/env Rscript
readr::read_csv_chunked(
  file = "test.csv",
  col_types = "c",
  callback = readr::SideEffectChunkCallback$new(\(x, pos) {})
)

test.csv was generated with the simple C program below.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

const int64_t max_rows = (int64_t) 3 << 30;  /* 3 billion and change */

int main(int argc, char *argv[]) {
  FILE *stream;
  if (argc == 2) {
    if ((stream = fopen(argv[1], "w")) == NULL) {
      fprintf(stderr, "Could not open file %s\n", argv[1]);
      return 1;
    }
  } else {
    stream = stdout;
  }

  fprintf(stderr, "Writing a file of length %" PRId64 "\n", max_rows);

  /* One header line followed by max_rows one-character rows. */
  fputs("header\n", stream);
  for (int64_t i = 0; i < max_rows; i++)
    fputs("1\n", stream);

  if (stream != stdout)
    fclose(stream);

  return 0;
}
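
The generator was compiled and run with something along the lines of cc -O2 gen_csv.c -o gen_csv && ./gen_csv test.csv (exact compiler, flags, and file names may differ).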

Session Info

  • Platform Linux
  • R 4.2.3
  • readr 2.1.5

PS, I've gotten a lot of mileage out of the read_*_chunked family of functions over the years. Thanks for these!
