read_csv_chunked fails when file contains greater than INT_MAX rows #1554

Open
pegeler opened this issue Sep 2, 2024 · 0 comments
pegeler commented Sep 2, 2024

I have a use-case very similar to #1177, except that read_csv_chunked gives me an error on a very long file.
I am reading a flat file with ~3 billion rows using a SideEffectChunkCallback.
The callback lightly processes each chunk and uploads the resulting data frame
to a database with DBI::dbAppendTable(). At some point around row
INT_MAX I get this:

Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  :
  attempt to set index 10000/10000 in SET_STRING_ELT

The error varies depending on the data type(s) being read in (and perhaps on
how far past INT_MAX the file goes). For example, reading integer columns
causes a segfault, and reading doubles sometimes segfaults and sometimes only
raises a warning. A cursory glance at the source suggests the differences lie
in the Collector implementation.

# Integer
 *** caught segfault ***
address 0x[...], cause 'memory not mapped'
malloc(): memory corruption
Aborted

# Occasionally, I see this when reading a Double column
Warning message:
In read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  :
  NAs introduced by coercion to integer range
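
For context, my pipeline is roughly equivalent to the sketch below (the connection, table name, and per-chunk processing are placeholders, not my actual code):

library(DBI)

# Placeholder connection; the real script writes to a different database.
con <- dbConnect(RSQLite::SQLite(), "scratch.db")

readr::read_csv_chunked(
  file = "big_file.csv",   # ~3 billion rows
  col_types = "c",
  chunk_size = 10000,      # the default, matching the 10000/10000 in the error above
  callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
    # Light processing on the chunk, then append it to the database.
    DBI::dbAppendTable(con, "placeholder_table", chunk)
  })
)

dbDisconnect(con)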

Right now, my workaround is to split the file ahead of this step, but it
would be really nice to remove that dependency and run everything in one R script
(see the sketch below). Is there any interest in addressing this use-case? I haven't looked into
feasibility yet, but I'm hoping it comes down to replacing some
ints with R_xlen_ts or something to that effect. If it turns out to be relatively
straightforward to address, would a PR be welcome?
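
Concretely, the current workaround amounts to something like this (the split command and file names are illustrative):

# Split ahead of time, e.g. with GNU coreutils (drop the header first):
#   tail -n +2 big_file.csv | split --lines=1000000000 - part_
for (f in Sys.glob("part_*")) {
  readr::read_csv_chunked(
    file = f,
    col_types = "c",
    col_names = "header",  # the split parts carry no header line
    callback = readr::SideEffectChunkCallback$new(function(chunk, pos) {
      # same per-chunk processing + DBI::dbAppendTable() as above
    })
  )
}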

Reprex

#!/usr/bin/env Rscript
readr::read_csv_chunked(
  file = "test.csv",
  col_types = "c",
  callback = readr::SideEffectChunkCallback$new(\(x, pos) {})
)

test.csv was generated with the simple C program below.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

const int64_t max_rows = (int64_t) 3 << 30;  /* 3 billion and change */

int main(int argc, char *argv[]) {
  FILE *stream;
  if (argc == 2) {
    if ((stream = fopen(argv[1], "w")) == NULL) {
      fprintf(stderr, "Could not open file %s\n", argv[1]);
      return 1;
    }
  } else {
    stream = stdout;
  }

  fprintf(stderr, "Writing a file of length %" PRId64 "\n", max_rows);

  /* One header line followed by max_rows one-character rows. */
  fputs("header\n", stream);
  for (int64_t i = 0; i < max_rows; i++)
    fputs("1\n", stream);

  if (stream != stdout)
    fclose(stream);

  return 0;
}
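
The generator was compiled and run with something along the lines of cc -O2 gen_csv.c -o gen_csv && ./gen_csv test.csv (exact compiler, flags, and file names may differ).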

Session Info

  • Platform Linux
  • R 4.2.3
  • readr 2.1.5

PS, I've gotten a lot of mileage out of the read_*_chunked family of functions over the years. Thanks for these!
