exceed 2^31-1 bytes #64

Open
partizanos opened this issue Apr 16, 2024 · 13 comments
Comments

@partizanos

partizanos commented Apr 16, 2024

Hello, I am trying to use ricu with the sic dataset, but I run into the issue below. Any ideas?

sic$laboratory
Data for `sic` is missing
Setup now (Y/n)? Y
The requested tables have already been downloaded
── Importing 8 tables for `sic` ───────────────────────────────────────────────────
Error in paste(do.call("c", msg), collapse = "\n") : 
  result would exceed 2^31-1 bytes
In addition: There were 50 or more warnings (use warnings() to see the first 50)
@mcr1213

mcr1213 commented Apr 21, 2024

I also have this issue, with no solution yet. It seems to be specific to using import_src on the 2.15 GB data_float_h.csv.gz file from SICdb; all other datasets worked fine.

Some things I've tried:

  • Upgraded R to 4.3.3 and ricu 0.6.0
  • Upgraded all packages + fresh install of R.
  • Tried different hardware: I first tried on an M2 Mac, but a Linux system gives the same problem.
  • Checked sha256 sums on the downloaded files, since file corruption was the cause in some other threads.

Full traceback is included:
Screenshot 2024-04-21 at 18 40 16

Any other suggestions to try would be much appreciated.

@manuelburger

The configuration under inst/extdata/config/data-sources.json for the SICdb database (tag sic) does not correctly reflect the most recent version, which is the one downloaded from PhysioNet. Mostly correct configurations can be found in #30, a previous PR to integrate the database, but they seem not to have been merged entirely into the current main branch.

The error message posted stems from the fact that ricu, or more specifically the read_csv_chunked function, raises a warning for every single erroneous line when importing the csv. The most problematic is the configuration for the data_float_h table, where in the current main branch here:

the rawdata column is specified to be of type col_double. The database documentation here: https://www.sicdb.com/Documentation/Signal_Data clearly states that this column is a binary data column compressing up to 60 floats into a single cell of the csv table, to keep the row count of the table manageable while still providing up to minute-level resolution for some variables. 60 compressed floats naturally do not cast well to a col_double, so one gets a full error message for every single line of the entire data_float_h table. These error messages are all concatenated by ricu in the function report_problems here, and concatenating this many error messages exceeds the R string size limit of 2^31-1 bytes, which explains the error message.
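To put the 2^31-1 limit in perspective, here is a back-of-the-envelope sketch in Python. The average warning length of 120 bytes is an illustrative assumption, not a value measured from ricu or SICdb, and the helper names are hypothetical.

```python
# Largest number of bytes a single R character string can hold.
R_STRING_LIMIT = 2**31 - 1


def joined_size(num_messages, avg_message_bytes, separator_bytes=1):
    """Size in bytes of num_messages messages joined by a separator."""
    if num_messages == 0:
        return 0
    return num_messages * avg_message_bytes + (num_messages - 1) * separator_bytes


def max_messages(avg_message_bytes, separator_bytes=1):
    """Largest number of messages whose joined size still fits in the limit.

    Solves n * avg + (n - 1) * sep <= limit for the integer n.
    """
    return (R_STRING_LIMIT + separator_bytes) // (avg_message_bytes + separator_bytes)


# Assumed average length of one parse warning (hypothetical value).
AVG_WARNING_BYTES = 120
limit = max_messages(AVG_WARNING_BYTES)
print(f"About {limit:,} warnings of {AVG_WARNING_BYTES} bytes fit in one string")
```

Under these assumptions the limit is reached after fewer than 18 million warnings, so a table that produces one warning per row can overflow it if it has more rows than that.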

Interestingly, there is another report_problems function defined just above this one here, which would handle the problem by reporting only the first 10 issues and ignoring the rest. However, since it is listed first in the source code, the later definition overrides it and is the one that is ultimately used, so all messages are propagated at the moment.
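The "later definition wins" behaviour can be sketched with a Python analogy; R behaves the same way when a name is assigned twice in one file. The function bodies below are simplified stand-ins, not ricu's actual code, and `truncating_report` is a hypothetical name used only to keep the first version reachable.

```python
def report_problems(problems):
    """First definition: show only the first 10 problems."""
    shown = problems[:10]
    lines = list(shown)
    hidden = len(problems) - len(shown)
    if hidden > 0:
        lines.append(f"... and {hidden} more problems")
    return "\n".join(lines)


truncating_report = report_problems  # keep a handle on the first definition


def report_problems(problems):  # same name: this definition replaces the first
    """Second definition: join every problem into one string."""
    return "\n".join(problems)


problems = [f"row {i}: expected a double" for i in range(1000)]
print(len(report_problems(problems).splitlines()))    # all 1000 lines
print(len(truncating_report(problems).splitlines()))  # 10 lines plus a summary
```

Calling the name `report_problems` always reaches the second definition, which is why the truncating version never gets a chance to limit the output.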

Potential fix is:

  • Making sure the correct report_problems function is called, i.e. the one that ignores all but the first 10 problems.
  • The above does not tackle the source of the problem, which is the wrong configuration. The rawdata column should be imported with type col_character; the PR referenced above (Enable SICdb in ricu #30) contains some code to unfold the 60 compressed floats, so that SICdb can be used in its full high resolution.

Hope this helps

@mcr1213

mcr1213 commented Apr 27, 2024

@manuelburger Thank you so much for the clear explanation. I've removed the redundant 'report_problems' function and changed the rawdata column from col_double to col_character in the config file. However, after 31% another error occurs:

Screenshot 2024-04-27 at 19 39 08

This probably has to do with the changes you mentioned in #30, which are not merged into the main branch. Is there any particular reason that these changes are not available? Or am I the only one for whom SICdb 1.0.6 is not working in ricu?

@mcr1213

mcr1213 commented Apr 29, 2024

So short update, I've taken the branch mentioned in #30 as created by @prockenschaub and recompiled the ricu package (the older 0.5.5 version that is) and tried with this to add SICdb. The previous error does not occur, however after importing 86% a new one does:

Screenshot 2024-04-29 at 19 21 41

I've tried tracing back the code to see if there was an obvious explanation, but could not find one. It is not clear to me what function res should be.

Is there anyone with a working SICdb environment? And could they tell me which codebase they used?

@prockenschaub
Collaborator

prockenschaub commented Apr 29, 2024

@mcr1213 I originally meant to work with SICdb when it was released but this has been pushed back repeatedly, so I haven't touched the code in a while. I originally thought that SICdb was fully integrated in ricu 0.6.0 and there was no need for my code, but apparently not.

Since there appears to be increased interest in SICdb, maybe now is a good time to look at it again. I will try to find some time in the coming days to look at your error and see what's wrong / how we can bring the code into the latest version of ricu and SICdb.

Edit: I had a quick look. res should be the function sic_data_float_h as defined in data-sources.json:

"callback": "sic_data_float_h"
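A configuration entry like this names a function as a string, so the import code has to resolve that name at run time, and the import fails if no function with that name can be found. The sketch below shows the general pattern in Python; the registry, the placeholder function body and the error message are hypothetical and do not reproduce ricu's actual lookup code.

```python
import json

# Hypothetical registry mapping callback names to functions.
CALLBACKS = {}


def register(func):
    """Make a function available for lookup by its name."""
    CALLBACKS[func.__name__] = func
    return func


@register
def sic_data_float_h(rows):
    # Placeholder body: the real callback's behaviour is not shown in this thread.
    return rows


def resolve_callback(config_text):
    """Return the function named under "callback" in a JSON config fragment."""
    name = json.loads(config_text)["callback"]
    try:
        return CALLBACKS[name]
    except KeyError:
        raise LookupError(f"callback {name!r} is not defined") from None


res = resolve_callback('{"callback": "sic_data_float_h"}')
print(res.__name__)
```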

@mcr1213

mcr1213 commented May 14, 2024

@prockenschaub Thanks for your suggestion. Unfortunately, I'm no expert in debugging R-packages and it does not yet work for me. At the moment my hypothesis is that the mentioned 'sic_data_float_h' cannot be found. When doing ls("package:ricu") this function does not show up in the available functions. I do know that this function is placed in the new (compared to the original release) file "./R/callback-tb-R". Searches in google/chatgpt suggested mentioning the file in the main DESCRIPTION file, but the other files are not referenced there.

I've also tried to 'roxygenize' the package to regenerate NAMESPACE, but no luck.

Can you tell me if I'm on the right track? Does the sicdb work for you?

@dplecko
Member

dplecko commented May 24, 2024

I will resolve this issue in the next version (i.e., in June). In the meantime, if this is an urgent matter for anyone, my suggestion is to simply perform manual conversion to fst. I am attaching below some (pretty raw) code that I used for converting the sic tables when I first accessed the data. This code could perhaps be helpful for anyone looking for a quick fix, until I resolve the issue properly.

First, I split the data_float_h table into chunks (since it is huge)

import csv
import os
from itertools import islice

def split_csv_file(input_file, output_prefix, num_files):
    # First pass: count the data rows (excluding the header row)
    with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)
        next(reader)
        num_rows = sum(1 for _ in reader)

    # Calculate the number of rows per file, rounded up
    rows_per_file = (num_rows + num_files - 1) // num_files

    # Second pass: write consecutive blocks of rows to numbered files
    with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)

        # Read the header row so it can be repeated in every chunk
        header = next(reader)

        chunk_index = 1
        # Each iteration takes the first row of the next chunk from the reader
        for first_row in reader:
            output_file = f"{output_prefix}_{chunk_index}.csv"
            with open(output_file, 'w', newline='') as output:
                writer = csv.writer(output)
                writer.writerow(header)  # Write the header row
                writer.writerow(first_row)

                # Stream the remaining rows of this chunk without
                # loading them into memory
                writer.writerows(islice(reader, rows_per_file - 1))
            print(f"Saved {output_file}")

            chunk_index += 1

input_path = os.path.expanduser("sic-data/data_float_h.csv")
split_csv_file(input_path, "output", 30)
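Since the chunks are converted one by one afterwards, it may be worth checking that no rows were lost in the split. A minimal sketch, assuming the chunks are named `output_1.csv`, `output_2.csv`, and so on, as produced by the call above; `count_data_rows` and `check_split` are hypothetical helper names.

```python
import csv
import glob


def count_data_rows(path):
    """Count the rows of a CSV file, excluding the header row."""
    with open(path, "r", newline="") as file:
        reader = csv.reader(file)
        next(reader, None)  # skip the header if there is one
        return sum(1 for _ in reader)


def check_split(input_file, output_prefix):
    """Compare the row count of the input with the total over all chunks.

    Returns (rows in input, rows summed over chunks, number of chunks).
    """
    chunk_files = glob.glob(f"{glob.escape(output_prefix)}_*.csv")
    expected = count_data_rows(input_file)
    found = sum(count_data_rows(path) for path in chunk_files)
    return expected, found, len(chunk_files)
```

For example, `check_split("sic-data/data_float_h.csv", "output")` returns the two row counts and the number of chunk files; if the first two values differ, or the number of chunks is not the one requested, some rows did not make it into the chunks.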

And then all tables can be converted to fst


root <- rprojroot::find_root(".gitignore")
r_dir <- file.path(root, "r")
invisible(lapply(list.files(r_dir, full.names = TRUE), source))

library(fst)
library(ricu)

if (!dir.exists(file.path(data_dir(), "sic"))) 
  dir.create(file.path(data_dir(), "sic"))

convert_names <- c(
  "cases", "d_references", "data_range", "data_ref", "laboratory",
  "medication", "unitlog",
  "data_float_h"
)

data_path <- file.path("~", "Desktop", "sic-data")
if (is.element("data_float_hfull", convert_names)) {
  
  convert_names <- paste0(
    "data_float_h/",
    gsub(".csv", "", list.files(file.path(data_path, "data_float_h")))
  )
}

for (tab_name in convert_names) {
  
  if (file.exists(file.path(data_path, paste0(tab_name, ".csv")))) {
    
    tbl <- read.csv(file.path(data_path, paste0(tab_name, ".csv")))
    # file.remove(paste0(tab_name, ".parquet"))
    
    if (grepl("data_float_h_", tab_name)) 
      tab_name <- gsub("data_float_h_", "", tab_name)
    
    if (tab_name == "microbiology") {
      
      off_col <- which(names(tbl) == "offset")
      names(tbl)[off_col] <- "Offset"
    }
    
    if (tab_name == "gcs") {
      
      tbl$Offset <- 0
    }
    
    write_fst(tbl, path = file.path(data_dir(), "sic", paste0(tab_name, ".fst")))
    
  }
  
  print(tab_name)
}

fix_rawdata <- which(
  vapply(
    1:30,
    function(i) {
      class(
        read.fst(file.path(data_dir(), "sic", "data_float_h", 
                           paste0(i, ".fst")))$rawdata
      )
    }, character(1L) 
  ) == "logical"
)

for (i in fix_rawdata) {
  
  lgl_out <- read.fst(file.path(data_dir(), "sic", "data_float_h", 
                                paste0(i, ".fst")))
  lgl_out$rawdata <- as.numeric(lgl_out$rawdata)
  
  write.fst(lgl_out, file.path(data_dir(), "sic", "data_float_h", 
                               paste0(i, ".fst")))
}

Once the fst files are properly named and located in a folder called sic within the directory given by ricu::data_dir(), there should be no further issues.

@mcr1213

mcr1213 commented May 29, 2024

Thanks for the help everyone! The tables can now be successfully imported.

@partizanos
Author

partizanos commented Jun 27, 2024

Happy to see active interest on the issue.

@dplecko I ran the Python and R code snippets and was able to generate data_float_h in parts; however, when I move the parts into a folder data_float_h inside data_dir(), they are not recognized by ricu. Did you merge the chunked output files into one somehow, or does the folder with the 15 .fst files suffice? Is there a timeline for the fix? (I saw an upcoming ricu v0.6.1, but I am not sure whether sic handling will be included.)
@mcr1213 I am glad to hear it. Did you go for the other branch? Which solution worked for you?

Thank you in advance for your help and active maintenance of the repository.

@mcr1213

mcr1213 commented Jun 28, 2024

@partizanos I'm afraid I had to do some combination of all the solutions provided. I'm not exactly sure which step was crucial for getting a working sicdb. I used the script above to unpack the data, and in the end I had a single data_float_h.fst file that worked.

I guess that multiple .fst files in dir data_float_h should work too, as other datasets use the same structure.

@dplecko
Member

dplecko commented Jul 2, 2024

@partizanos here is how the tables are organized for me in the data_dir() location.

data_float_h_layout

The folder should be called data_float_h and inside you should have files that are called 1.fst, 2.fst, and so on (the exact number of chunks should not matter). If you have this setup, but the loading is not working, I would be quite surprised, and would ask you for further details on what exactly is causing the issue.
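The layout described above can be checked with a short script. This is a sketch under assumptions: `check_chunk_layout` is a hypothetical helper, and its strictness (no other files in the folder, chunk numbers running from 1 without gaps) is my reading of "1.fst, 2.fst, and so on" rather than a documented ricu requirement.

```python
import os
import re


def check_chunk_layout(sic_dir, table="data_float_h"):
    """Return the sorted chunk numbers found in <sic_dir>/<table>/.

    Raises ValueError if the folder is missing, contains files that are not
    named <number>.fst, or the numbers are not 1..n without gaps.
    """
    folder = os.path.join(sic_dir, table)
    if not os.path.isdir(folder):
        raise ValueError(f"missing folder: {folder}")

    numbers = []
    for name in os.listdir(folder):
        match = re.fullmatch(r"(\d+)\.fst", name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        numbers.append(int(match.group(1)))

    numbers.sort()
    # Assumption: chunks are expected to be numbered consecutively from 1.
    if numbers != list(range(1, len(numbers) + 1)):
        raise ValueError(f"chunk numbers are not consecutive from 1: {numbers}")
    return numbers
```

Here `sic_dir` would be the `sic` folder inside the directory reported by `ricu::data_dir()`.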

A proper fix for all of this will happen some time this summer in ricu 0.6.2.

@partizanos
Author

partizanos commented Jul 19, 2024

Hello, I tried quite a few of the combinations and made some progress, but still no luck. I list my experience below in case it helps the investigation. Everything below was done using R version 4.3.3 and ricu 0.5.5.

  1. checked out branch 5.5
    I checked out the suggested branch and ran into the same issues as @mcr1213 regarding the 5.5 branch (the same exception at 86%); switching rel to the sic_data_float_h callback did not fix the issue for me. The raw_data column in the branch, though, is handled as @dplecko suggests (character type).

  2. manual splitting
    I managed to convert 7/8 tables using the method mentioned by @dplecko and @mcr1213, modifying the scripts (Python and R) to support my Windows system: sic_data_float_h_script_fixes.zip.
    The same does not happen with data_float_h (no difference).
    I split the csv files and converted them to fst successfully, placed them in data_float_h/, named them 1.fst, 2.fst, 3.fst, and so on, and then removed all the other files (as shown in @dplecko's comment).
    I also tried to concatenate them into one big data_float_h.fst, but without any luck.
    When I access sic$data_float_h, it still requests to set_up_env.
    As a test, I removed cases.fst manually from the data_dir()/sic folder, and ricu correctly recognized that it is missing.

Let me know if you have any suggestions on how to debug this further and thank you for the useful comments and active support of this repository.

@partizanos
Author

partizanos commented Jul 20, 2024

Hello, happy update: I managed to replicate the pipeline you described, using ricu 0.6.1 and reading/writing with the splitting method.

I rewrote the splitting function in C++, since my Python script was taking a long time. Uploaded here:
Then I had to rename the files.
Finally, there is no need to run import_src; the tables should be detected automatically by ricu.
The above does not work with 0.5.* versions.

  • Split the files as indicated by @dplecko.
    File to compile: sic2_fixes_cpp.zip
    Note that you need to clone spdlog in the folder where you run the script and add its include folder (cloned_repo/include), then compile with g++ -std=c++11 -o split_csv sic1.cpp -lz -I./include

Successful compilation should give you a ./split_csv binary to execute.
This creates the output files in your folder. You have to move them to the folder indicated by data_dir() (in my case C:\Users\username\AppData\Local\ricu), and inside sic you must put them in the data_float_h folder.

  • Once done, you can run the second script, sic2_r.zip, which should create the output_1, output_2, etc. fst files.
  • Finally, you have to rename them.
  • When using ricu in R, it should automatically detect the tables.
    image
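The renaming step can be sketched as follows. This assumes the converted chunks are named `output_<n>.fst`, as in the list above, and should end up as `<n>.fst`; `rename_chunks` is a hypothetical helper, not part of ricu or of the attached scripts.

```python
import os
import re


def rename_chunks(folder, prefix="output_"):
    """Rename <prefix><n>.fst to <n>.fst inside folder; return the new names."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.fst")
    renamed = []
    for name in sorted(os.listdir(folder)):
        match = pattern.fullmatch(name)
        if match is None:
            continue  # leave unrelated files untouched
        new_name = f"{int(match.group(1))}.fst"
        target = os.path.join(folder, new_name)
        if os.path.exists(target):
            raise FileExistsError(f"refusing to overwrite {target}")
        os.rename(os.path.join(folder, name), target)
        renamed.append(new_name)
    return renamed
```

The function would be pointed at the data_float_h folder inside the sic directory; files that do not match the assumed naming pattern are left as they are.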

Thanks everyone for the help!
