Version replaying #146

asmeurer · 2020-11-10T21:15:28Z

We need a way to "replay" versions, i.e., to reconstruct the raw data for a dataset and rebuild the data in each version. This will enable several features:

Changing certain metadata about a dataset that currently cannot be changed across versions, including ndim, chunks, and dtype (see "Support for changing chunk size" #110).
Support for deleting a version. Currently we can delete the version group, but that does not remove the data for that version from the raw data (see "Deleting versions and garbage collecting" #143).
It opens the possibility of tricks like inserting rows without actually rewriting all the data, by leaving some of the raw data empty (see "insert method" #47). The issue with this is that the empty data needs to be garbage collected after a while, or else it becomes more inefficient than just doing it the naive way.


asmeurer · 2020-11-10T21:19:24Z

For changing metadata, there are two points.

If we only want to change the metadata for new versions, but keep old versions as they are, I think this issue isn't what we need. Rather, we need some way to point the new dataset to a different raw dataset. Currently the raw dataset is referenced by name, but I think this wouldn't be hard to change, since we have been storing the path to the raw dataset in the metadata ever since "Continuing work on sparse datasets" #144.
If we want to change it for all old versions, this is what is needed. This is a potentially destructive action, so it may be worth checking that the new metadata isn't lossy relative to the old, and refusing the change unless some sort of force flag is given (e.g., the new ndim should be larger, the new dtype should cast from the old losslessly, etc.).
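A minimal sketch of what such a check might look like, assuming NumPy is available. The function name `check_metadata_change` and its parameters are hypothetical and are not part of versioned-hdf5. It treats an unchanged ndim as acceptable and relies on NumPy's "safe" casting rule as an approximation of "lossless".

```python
import numpy as np


def check_metadata_change(old_ndim, old_dtype, new_ndim, new_dtype, force=False):
    """Hypothetical helper: list the ways a metadata change could lose data.

    Raises ValueError if any problem is found, unless force=True, in which
    case the problems are returned for the caller to log or ignore.
    """
    problems = []
    # Shrinking the number of dimensions would have to drop data.
    if new_ndim < old_ndim:
        problems.append(f"ndim would shrink from {old_ndim} to {new_ndim}")
    # casting="safe" is NumPy's own notion of a value-preserving cast; it is
    # only an approximation of "lossless" for this purpose.
    if not np.can_cast(np.dtype(old_dtype), np.dtype(new_dtype), casting="safe"):
        problems.append(f"dtype {old_dtype} does not cast safely to {new_dtype}")
    if problems and not force:
        raise ValueError("; ".join(problems))
    return problems
```

With this sketch, widening int32 to int64 while adding a dimension passes, whereas going from float64 to float32 raises unless `force=True` is passed.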

ArvidJB · 2020-11-10T22:03:29Z

Sounds good. Let's make this as simple as possible. It would be great if we could have some API which takes a callback and then calls it for each version to write a new HDF5 file:

Input would be the InMemoryGroup for that version (including attrs)
Output would be None (if we want to skip this version) or a new InMemoryGroup with added/removed/modified datasets/attrs

Do you think this would work?
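To make the proposal concrete, here is a rough sketch of the callback shape. It uses plain dicts as stand-ins for InMemoryGroup and does not touch HDF5 at all. The names `replay_versions` and `keep_long_versions` are made up for illustration and do not exist in versioned-hdf5.

```python
from typing import Callable, Dict, Optional

# Stand-in for an InMemoryGroup: maps dataset/attr names to values.
Group = Dict[str, object]


def replay_versions(
    versions: Dict[str, Group],
    callback: Callable[[Group], Optional[Group]],
) -> Dict[str, Group]:
    """Call `callback` once per version, in order, and collect the results.

    If the callback returns None, that version is skipped in the output.
    """
    replayed = {}
    for name, group in versions.items():
        # Pass a shallow copy so adding/removing keys leaves the input alone.
        result = callback(dict(group))
        if result is not None:
            replayed[name] = result
    return replayed


def keep_long_versions(group: Group) -> Optional[Group]:
    """Example callback: skip short versions, tag the ones that are kept."""
    if len(group["data"]) < 3:
        return None
    group["note"] = "replayed"
    return group


versions = {"v1": {"data": [1, 2]}, "v2": {"data": [1, 2, 3]}}
new_versions = replay_versions(versions, keep_long_versions)
```

In a real implementation the collected groups would be written out to the new file rather than kept in a dict.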

asmeurer · 2020-11-16T23:22:48Z

I hadn't considered doing everything in a new file. That would take care of the atomicity concerns, though it also means we would need to copy everything.

This is the start of the work for deshaw#146.

asmeurer · 2021-09-08T22:19:03Z

Version replaying is now implemented in versioned_hdf5/replay.py, in particular in the modify_metadata and delete_version functions. The insertion idea has not yet been implemented, but we can use #47 to track that, so I am closing this issue.

asmeurer mentioned this issue Nov 10, 2020

Deleting versions and garbage collecting #143

Closed

asmeurer added this to the November 2020 milestone Nov 10, 2020

asmeurer added the high_priority label Nov 10, 2020

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Nov 19, 2020

Add a recreate_dataset() function

cd2616a

This is the start of the work for deshaw#146.

asmeurer mentioned this issue Nov 19, 2020

Version replaying #152

Merged

ericdatakelly assigned asmeurer Dec 3, 2020

ericdatakelly modified the milestones: November 2020, December 2020 Dec 17, 2020

ericdatakelly modified the milestones: December 2020, January 2021 Jan 6, 2021

ericdatakelly modified the milestones: January 2021, February 2021 Feb 4, 2021

ericdatakelly modified the milestones: February 2021, March 2021 Mar 8, 2021

ericdatakelly modified the milestones: March 2021, April 2021 Apr 5, 2021

ericdatakelly modified the milestones: April 2021, May 2021 May 10, 2021

ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021

ericdatakelly modified the milestones: June 2021, July 2021 Jul 8, 2021

ericdatakelly modified the milestones: July 2021, August 2021 Aug 12, 2021

asmeurer closed this as completed Sep 8, 2021

ericdatakelly removed this from the August 2021 milestone Sep 29, 2021

ericdatakelly added this to the September 2021 milestone Sep 29, 2021
