Change-only version selection (PyInf#10524) #264
Comments
Can you think about this and estimate?
I like this as a feature, and I'm trying to think of a sensible way to implement it. A couple of options come to mind off the top of my head:
Both options have strengths and weaknesses. I'll keep thinking on this - there may be another way we can do this.
I have a plan for this, and I've now finished with a number of other, more pressing bug reports. I think we can just compare the slices that have changed between versions. @rahasurana The API would be something like:

```python
with h5py.File('data.hdf5', 'r') as f:
    vf = VersionedHDF5File(f)
    diff = vf.get_diff(dataset_name, 'v0', 'v1')
```
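Conceptually, the comparison would boil down to checking which chunks of the two virtual datasets point at different raw data. A minimal, hypothetical sketch of that check (the chunk-to-raw-slice dicts are toy stand-ins, not an actual versioned-hdf5 structure):

```python
# Illustration only: suppose we can obtain, for each version, a dict mapping
# chunk index -> slice of the raw dataset backing that chunk. versioned-hdf5
# keeps this kind of information internally; the dicts below are toy stand-ins.
def changed_chunks(map_old, map_new):
    """Return chunk indices whose raw-data mapping differs between versions."""
    return [idx for idx, raw in map_new.items() if map_old.get(idx) != raw]

map_v0 = {0: (0, 100), 1: (100, 200)}
map_v1 = {0: (0, 100), 1: (300, 400), 2: (400, 500)}  # chunk 1 rewritten, chunk 2 appended
print(changed_chunks(map_v0, map_v1))  # [1, 2]
```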
Thanks for looking into this @peytondmurray! That sounds good and is in line with the original requirement.
But I think it still won't match up to TileDB's performance, which stores the diff already and therefore needs only a single lookup to fetch the diff info. Forgive me, as I don't understand the internals well enough and might sound over-ambitious with this suggestion, but could we additionally store the slice info while a new version is being written? That would avoid the need to pass the version written before the "interested version" and also avoid computing this info on demand. Could we just refer to the stored slice info and return a list of the corresponding arrays that only appear in the version of interest? Please let me know your comments on the above.
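Purely to illustrate what "storing the slice info at write time" could look like, here is a sketch with an invented layout; the `_version_diffs` group name is made up and is not how versioned-hdf5 organizes its metadata:

```python
import h5py
import numpy as np

# Invented layout for illustration: record, at commit time, which chunk indices
# a version touched, so a later "what changed in v1?" query is a single read.
with h5py.File('data.hdf5', 'a') as f:
    meta = f.require_group('_version_diffs')
    meta.create_dataset('v1', data=np.array([1, 2], dtype=np.int64))

with h5py.File('data.hdf5', 'r') as f:
    print(f['_version_diffs/v1'][:])  # -> [1 2]: one lookup, no slice comparison
```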
You're absolutely right, the proposed method wouldn't be a single lookup, so it wouldn't be as fast as if we stored the slice info additionally at the time a new version is written.
True, although if you'd like we can make that the default behavior.

But with regards to computing this information on the fly, this would be something of a departure from the way information is currently stored. It's certainly possible, but I'd be interested in the opinion of @asmeurer about this. The proposed method compares slices of two virtual datasets.

Either way, I think it should be possible to make this work with old data versions - we just recompute the diffs for each version when opening up old data.
I like the slice comparison: it's fast and matches what's actually going on under the hood. With slice comparison you can see whether chunks are getting reused between versions and where the extra storage is spent. The goal would be, for each chunk, to find the earliest version in which it was mapped to the same raw data chunk. I had actually written something like this for our own diagnostic purposes; it would be nice to get this fully supported!
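For reference, that kind of bookkeeping might look something like the following sketch (again with toy chunk-to-raw-slice maps rather than versioned-hdf5 internals):

```python
# Hypothetical diagnostic sketch: given per-version chunk maps (chunk index ->
# raw-data slice), find for each chunk of a version the earliest version that
# already mapped it to the same raw data.
def earliest_origin(version_maps, version):
    origins = {}
    for idx, raw in version_maps[version].items():
        for v, chunk_map in version_maps.items():  # iterate oldest to newest
            if chunk_map.get(idx) == raw:
                origins[idx] = v
                break
    return origins

version_maps = {
    'v0': {0: (0, 100)},
    'v1': {0: (0, 100), 1: (100, 200)},
    'v2': {0: (0, 100), 1: (300, 400)},  # chunk 1 rewritten in v2
}
print(earliest_origin(version_maps, 'v2'))  # {0: 'v0', 1: 'v2'}
```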
Great, this sounds good to me! I should also add that if we are comparing slices, a chunk is considered different as soon as any of its elements changed, even if the rest of its elements are identical between the two versions.

If you want more granular diffs, we can of course do element-by-element comparisons on the chunks that differ. That's still much faster than diffing the two versions in full, but it is a slower, more granular step than the slice comparison alone.
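To make the granularity trade-off concrete (illustration only, not versioned-hdf5 code):

```python
import numpy as np

# One modified element makes the whole chunk differ at the slice/chunk level;
# an element-wise comparison restricted to that chunk recovers the finer detail.
chunk_v0 = np.array([0, 1, 2, 3])
chunk_v1 = np.array([0, 1, 2, 9])            # only the last element changed

print(np.array_equal(chunk_v0, chunk_v1))    # False: the chunk is "different"
print(np.flatnonzero(chunk_v0 != chunk_v1))  # [3]: the only element that differs
```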
Background:
As part of an internal evaluation, we explored the time travel feature of TileDB. Based on our benchmarking, we found that computing the change in data between two timestamps/versions was, in general, up to 6x slower and used 3x more memory with versioned-hdf5 than with our toy TileDB wrapper.
Findings:
The reason TileDB is faster is that it stores newly added data in a new TileDB fragment, which can later be retrieved based on the timestamp. When the changes between two timestamps are fetched, it simply merges the fragments between those timestamps and returns the result.
On the other hand, versioned-hdf5 creates new chunks for the data, which are added to the global chunk hash table, and the virtual dataset for that version reflects the old chunks plus new chunks that shadow older chunks for the same data. When a user reads data for a given version/timestamp, all the data at that timestamp is read and then needs to be filtered, which is the major bottleneck.
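For context, without a change-only API the diff has to be computed on the user side by reading both versions in full, along these lines (a hedged sketch assuming the dataset shape is unchanged between the two versions; file, dataset, and version names are placeholders):

```python
import h5py
from versioned_hdf5 import VersionedHDF5File

# Sketch of the workaround a diff API would replace: materialize both versions
# in full and compare element-wise, which reads far more data than the change.
dataset_name = 'mydataset'
with h5py.File('data.hdf5', 'r') as f:
    vf = VersionedHDF5File(f)
    old = vf['v0'][dataset_name][:]
    new = vf['v1'][dataset_name][:]
    changed_mask = old != new          # full-size boolean mask
    changed_values = new[changed_mask]
```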
Requirement:
Is there an API already available to fetch only the modified data/chunks? We couldn't find anything on the API documentation page.
If not, can you please add an API that allows access to only the data of the chunks that were added in a given version? (See the usage sketch below.)
This would reduce the amount of data being read/filtered and speed up the change-based selection.
Having a mapping from each version to the chunks added in that version should help here(?).
Also, addressing "Store timestamps in array" (#171) would further help speed up identifying the version corresponding to a timestamp for a couple of internal use cases.
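A hypothetical sketch of how such an API might be called; the method name `chunks_added` is invented here purely to illustrate the request (the comments above converge on a `get_diff` API):

```python
import h5py
from versioned_hdf5 import VersionedHDF5File

# Hypothetical usage of the requested API; `chunks_added` is not an existing
# versioned-hdf5 method, just a stand-in for the desired shape of the call.
with h5py.File('data.hdf5', 'r') as f:
    vf = VersionedHDF5File(f)
    # Desired: only the chunks written in 'v1', without reading the full dataset.
    for chunk_slice, values in vf.chunks_added('mydataset', 'v1'):
        print(chunk_slice, values.shape)
```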
Internal ticket: PyInf#10524
CC: @ArvidJB