Writing ntuple larger than 2GB fails when no compression is used #1130
Comments
Another comparison.

Failing version. Code:

Effect:

Working version: it seems that enabling default compression makes it possible to write correct files larger than 2 GB. Code:

Effect:
The simplest code to reproduce the problem (path to the output file can be adjusted):
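Presumably something along these lines; this is a reconstruction based on the compression-enabled snippet shown further down, with compression explicitly disabled via `compression=None`, rather than the exact code from the report:

```python
from pathlib import Path

import numpy as np
import uproot

ntuple_path = Path('file.root')
data_size = 1000_000_000          # 10^9 float64 values -> ~8 GB uncompressed
data_dict = {
    "x": np.ones(data_size, dtype=np.float64),
}

# compression=None disables compression, so the single TBasket exceeds 2 GB
with uproot.recreate(ntuple_path, compression=None) as fout:
    fout["tree"] = data_dict
```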
This snippet gives me the error:
Interesting, I've also tried the same simple code but with compression enabled:

```python
from pathlib import Path

import numpy as np
import uproot

ntuple_path = Path('file.root')
data_size = 1000_000_000
data_dict = {
    "x": np.ones(data_size, dtype=np.float64),
}

with uproot.recreate(ntuple_path) as fout:
    fout["tree"] = data_dict
```

The code took 20 min to run on my cluster, used ~40 GB of RAM, and crashed with:

```
Traceback (most recent call last):
File "/memfs/7685922/bug.py", line 10, in <module>
fout["tree"] = data_dict
~~~~^^^^^^^^
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 984, in __setitem__
self.update({where: what})
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 1555, in update
uproot.writing.identify.add_to_directory(v, name, directory, streamers)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/identify.py", line 152, in add_to_directory
tree.extend(data)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/writable.py", line 1834, in extend
self._cascading.extend(self._file, self._file.sink, data)
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/_cascadetree.py", line 816, in extend
totbytes, zipbytes, location = self.write_np_basket(
^^^^^^^^^^^^^^^^^^^^^
File "/memfs/7685922/venv/lib/python3.11/site-packages/uproot/writing/_cascadetree.py", line 1399, in write_np_basket
uproot.reading._key_format_big.pack(
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
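For context, the `'i'` in this error is struct's signed 32-bit integer format, and 10^9 float64 values amount to 8,000,000,000 bytes, well beyond 2^31 - 1 = 2147483647. A minimal illustration (a hypothetical snippet, not taken from the report):

```python
import struct

# 10^9 float64 entries in one uncompressed TBasket -> 8_000_000_000 bytes
nbytes = 1_000_000_000 * 8

# The TKey byte counts are packed as big-endian 32-bit signed ints ('i'),
# so any value above 2147483647 raises the same struct.error seen above.
struct.pack(">i", nbytes)
```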
I've been meaning to get back to this. Maybe we could add an error message, but the ROOT format itself does not allow TBaskets to be bigger than 2 GB, because of both of the following:

uproot5/src/uproot/writing/_cascadetree.py, Lines 1399 to 1408 in 94c085b

and here's the definition of `_key_format_big`: Line 2196 in 94c085b

Here's the ROOT definition of a TKey: https://root.cern.ch/doc/master/classTKey.html#ab2e59bcc49663466e74286cabd3d42c1

Do you know that `file["tree_name"] = {"branch": branch_data}` writes all of `branch_data` as a single TBasket? To write multiple TBaskets, use:

```python
file["tree_name"] = {"branch": first_basket}
file["tree_name"].extend({"branch": second_basket})
file["tree_name"].extend({"branch": third_basket})
...
```

It can't be an interface that takes all of the data in one call, because the TBasket data might not fit in memory, especially if you have many TBranches (each with one TBasket). This interface is documented here.

In most files, ROOT TBaskets tend to be too small: they tend to be on the order of kilobytes, when it would be more efficient to read if they were megabytes. If you ask ROOT to make big TBaskets, on the order of megabytes or bigger, it just doesn't do it; there seems to be some internal limit. Uproot does exactly what you ask, and you were asking for gigabyte-sized TBaskets. If you didn't run into the 2 GB limit, I wonder if ROOT would be able to read them. Since it prevents the writing of TBaskets that large, I wouldn't be surprised if there's an implicit assumption in the reading code. Did you ever write 1 GB TBaskets and then read them back in ROOT?

About this issue, I think I can close it because the format simply doesn't accept integers of that size, and most likely, you intended to write multiple TBaskets with uproot.WritableTree.extend.
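Following the extend pattern above, here is a sketch (not from the original thread) of writing the same 10^9 float64 values as many small TBaskets instead of one giant basket; the 10^6-entry chunk size is an arbitrary choice for illustration:

```python
import numpy as np
import uproot

total_entries = 1_000_000_000
chunk = 1_000_000  # entries per TBasket; ~8 MB of float64 per basket, far below 2 GB

with uproot.recreate("file.root") as fout:
    # the first chunk creates the TTree and writes its first TBasket
    fout["tree"] = {"x": np.ones(chunk, dtype=np.float64)}
    # each further extend() call writes one more TBasket per branch,
    # so the full dataset never has to sit in memory at once
    for _ in range(total_entries // chunk - 1):
        fout["tree"].extend({"x": np.ones(chunk, dtype=np.float64)})
```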
@jpivarski thanks for your detailed explanation. Indeed, the uproot documentation is great on that aspect; it just needs a careful reading.

Can you suggest some methods for inspecting basket sizes and checking the exact limits? (See the inspection sketch below the benchmark links.)

I am playing with uproot as a tool to convert large HDF files into something that can then be inspected online using JSROOT. The ROOT ntuples are transferred to the S3 filesystem provided by our supercomputing center (ACK Cyfronet in Krakow). As S3 provides a nice way to share the files via URL, JSROOT can load them directly. I exploit the partial-read-over-HTTP feature there (root-project/jsroot#284).

I've played a bit with basket size, and for my reading use case the optimum basket size is about 1000000 (10^6) rows/entries. This gives the fastest loading time in JSROOT. My tree has ~20 branches with mostly 64-bit floats. For a small benchmark I took the same HDF file and generated two ROOT files, one with 100k entries per basket and one with 1000k entries per basket. You can play with them yourselves:

100k entries/basket
1000k entries/basket
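One way to inspect per-branch basket sizes in an existing file is sketched below. It is only a sketch: the file and tree names are placeholders, and the TBranch property names used here (num_baskets, entry_offsets, compressed_bytes, uncompressed_bytes) should be checked against the uproot documentation:

```python
import uproot

# "file.root" and "tree" are placeholder names; adjust to the actual file
with uproot.open("file.root") as f:
    tree = f["tree"]
    for branch in tree.branches:
        offsets = branch.entry_offsets  # entry boundaries of each TBasket
        entries_per_basket = [b - a for a, b in zip(offsets, offsets[1:])]
        print(
            branch.name,
            "baskets:", branch.num_baskets,
            "entries/basket (first few):", entries_per_basket[:5],
            "compressed bytes:", branch.compressed_bytes,
            "uncompressed bytes:", branch.uncompressed_bytes,
        )
```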
In general I feel that making fewer HTTP requests of ~1 MB each gives the best performance. Going down to a basket size of 10k entries slows JSROOT down even more.

The problem

Now the problem is the following: with 1000000 entries per basket I cannot process larger files.

When running this code I get an error with
@jpivarski I am not sure if a discussion on a closed issue is the best place? Should I convert this into a discussion (https://github.com/scikit-hep/uproot5/discussions)?
This issue was moved to a discussion.
You can continue the conversation there.
The problem
Trying to save an ntuple (TTree) with more than 2 GB of data and no compression fails with the following error:

The minimal code to reproduce:
Details
More details, with the original code from which the problem arose, are below. In the comments below I've also provided more examples.
I was trying to write a ROOT ntuple with the following code:

This works nicely as long as the files are small, say smaller than 2 GB.

When trying to save a larger file I get the following error:
I saw a similar error reported long ago here: scikit-hep/uproot3#462
Also, when looking at the source code of the `extend` method in `class NTuple(CascadeNode)`, it seems that all calls to `add_rblob` are made with the `big=False` argument, which suggests that only 4-byte pointers are being used. See:
uproot5/src/uproot/writing/_cascadentuple.py
Line 779 in 8a42e7d
This is my uproot version: