Replies: 8 comments
-
Update: R can read files from a zip archive via the unz() function.
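For comparison, Python's standard library can do the same thing through the `zipfile` module. Below is a minimal sketch, assuming the archive member is a UTF-8, tab-separated table; the function name `iter_tsv_rows_from_zip` and the file names in the usage line are hypothetical and only there for illustration.

```python
import csv
import io
import zipfile


def iter_tsv_rows_from_zip(zip_path, member):
    """Yield rows of a tab-separated zip member without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open() returns a binary stream that decompresses on the fly.
        with archive.open(member) as raw:
            # Wrap the binary stream so csv.reader receives text lines.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text, delimiter="\t")
```

Usage would look like `for row in iter_tsv_rows_from_zip("contacts.zip", "pairs.tsv"): ...` (both names are placeholders). Rows are streamed, so the whole table never has to fit in memory.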
-
New example: pod5 from Nanopore, a format for storing raw sequencing data with several dataframes per file. See https://pod5-file-format.readthedocs.io/en/latest/SPECIFICATION.html and https://github.com/nanoporetech/pod5-file-format. In pod5, they "glue" several Arrow tables (the same representation Arrow uses in memory, the so-called Feather v2 format, which is not as well suited for long-term storage as Parquet) into a custom "container": https://pod5-file-format.readthedocs.io/en/latest/SPECIFICATION.html#combined-file-layout
-
Huh, looks very interesting! Do you have any idea why they didn't go with Apache Parquet + container instead?
-
Some interesting observations from Twitter: https://twitter.com/Hasindu2008/status/1619914433438040065
-
Not sure, but isn't Parquet much more complex? Also, maybe this Arrow-based one is still faster? There is also this: https://youtu.be/nrXoZ3NTmnU?si=aUCniTgr2bm1br0X
-
Another slightly relevant thing: "Delta Lake".
-
Huh, never heard of Delta Lake! Some of its advantages could be useful to us, e.g. deleting columns.
-
Btw, one of the key issues has always been the lack of an out-of-core merge sort for Parquet files (currently, we use Unix's sort on our .tsv pairs). Two potential solutions have emerged meanwhile:
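To make the problem concrete, here is a generic sketch of what an out-of-core (external) merge sort does. This is not one of the two solutions referred to above, and it works on plain text rather than on Parquet; it only shows the two-pass idea that Unix's sort applies to our .tsv pairs. Assumptions: the names `external_sort` and `pairs_key` are hypothetical, the input has no header lines, and the columns are laid out as readID, chrom1, pos1, chrom2, pos2.

```python
import contextlib
import heapq
import itertools
import os
import tempfile


def pairs_key(line):
    """Sort key (chrom1, chrom2, pos1, pos2); the column layout is an assumption."""
    fields = line.rstrip("\n").split("\t")
    return (fields[1], fields[3], int(fields[2]), int(fields[4]))


def external_sort(in_path, out_path, key, chunk_lines=1_000_000):
    """Sort a text file while holding at most `chunk_lines` lines in memory."""
    run_paths = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Pass 1: cut the input into chunks, sort each in memory, spill to disk.
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                # Make sure every line is newline-terminated before merging.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort(key=key)
                path = os.path.join(tmp_dir, f"run_{len(run_paths)}.tsv")
                with open(path, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)
        # Pass 2: lazy k-way merge of the sorted runs into the output file.
        with contextlib.ExitStack() as stack:
            runs = [stack.enter_context(open(p)) for p in run_paths]
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*runs, key=key))
```

Memory use is bounded by `chunk_lines` in the first pass and by one buffered line per run in the second. A Parquet equivalent would need the same two passes over row groups instead of text lines, which is the piece that has been missing.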
-
Issue: storing Hi-C contacts in a gzipped .tsv causes major slowdowns for some computations. We need to pick a binary container and write software for common operations.
.tsv/.csv:
Cons:
Pros:
The alternative is to store pair tables in existing binary container files. The two options are:
HDF5:
Pros:
Cons:
Parquet:
Pros:
Cons:
Personally, I'm not happy with either of the solutions. Thoughts?...