Version replaying #146

Closed
asmeurer opened this issue Nov 10, 2020 · 4 comments

@asmeurer
Collaborator

asmeurer commented Nov 10, 2020

We need a way to "replay" versions, i.e., to reconstruct the raw data for a dataset and rebuild the data in each version. This will enable several features:

  • Changing certain metadata about a dataset that currently cannot be changed across versions, including ndim, chunks, and dtype (see "Support for changing chunk size", #110).
  • Support for deleting a version. Currently we can delete the version group, but doing so does not remove that version's data from the raw data (see "Deleting versions and garbage collecting", #143).
  • It opens the possibility of tricks like inserting rows without actually rewriting all the data, by leaving some of the raw data empty (see "insert method", #47). The issue with this is that you need a way to garbage collect the empty data after a while, or else it becomes less efficient than just doing it the naive way.
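
To make the deletion case concrete, here is a toy sketch of what a replay could do. It is only an illustration under simplifying assumptions: raw data is modeled as a plain list of chunks and each version as a list of indices into it, which is not necessarily how the library lays things out. The function name `replay_without` is hypothetical.

```python
def replay_without(raw_chunks, versions, deleted):
    """Rebuild the raw chunk list, keeping only chunks that surviving versions use.

    raw_chunks: list of chunk payloads (toy stand-in for the raw dataset)
    versions:   dict mapping version name -> list of indices into raw_chunks
    deleted:    set of version names to drop
    """
    new_raw = []
    remap = {}  # old chunk index -> new chunk index
    new_versions = {}
    for name, chunk_ids in versions.items():
        if name in deleted:
            continue  # skip the version entirely; its private chunks are never copied
        new_ids = []
        for old in chunk_ids:
            if old not in remap:
                remap[old] = len(new_raw)
                new_raw.append(raw_chunks[old])
            new_ids.append(remap[old])
        new_versions[name] = new_ids
    return new_raw, new_versions


raw = ["a", "b", "c", "d"]
versions = {"v1": [0, 1], "v2": [0, 2], "v3": [0, 3]}
new_raw, new_versions = replay_without(raw, versions, deleted={"v2"})
print(new_raw)       # ['a', 'b', 'd']
print(new_versions)  # {'v1': [0, 1], 'v3': [0, 2]}
```

Chunk "c" was only referenced by the deleted version, so it disappears from the rebuilt raw data, while the shared chunk "a" is kept once and the surviving versions are re-pointed at the new indices.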
@asmeurer
Collaborator Author

For changing metadata, there are two points.

  1. If we only want to change the metadata for new versions, but keep old versions as they are, I think this issue isn't what we need. Rather, we need some way to point the new dataset to a different raw dataset. Currently the raw dataset is referenced by name, but I think this wouldn't be hard to change, as we have been storing the path to the raw dataset in the metadata ever since "Continuing work on sparse datasets" (#144).

  2. If we want to change it for all old versions, this is what is needed. This is a potentially destructive action, so it may be worth having some checks that reject new metadata that would lose information relative to the old, unless some sort of force flag is given (e.g., the new ndim should be larger, the new dtype should cast from the old losslessly, etc.).
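
A rough sketch of what such a check might look like, using NumPy's `np.can_cast` with `casting="safe"` as the test for a lossless dtype change. The function name `check_metadata_change` and its parameters are hypothetical, not part of the library. Two assumptions to flag: the sketch accepts an unchanged ndim as well as a larger one, and NumPy's "safe" rule is its own notion of safety (for example, it allows int64 to float64).

```python
import numpy as np


def check_metadata_change(old_ndim, new_ndim, old_dtype, new_dtype, force=False):
    """Return a list of potential data-loss problems; raise unless force=True.

    Hypothetical helper: the names and behavior here are assumptions.
    """
    old_dtype = np.dtype(old_dtype)
    new_dtype = np.dtype(new_dtype)
    problems = []
    if new_ndim < old_ndim:
        problems.append(f"ndim would shrink from {old_ndim} to {new_ndim}")
    # "safe" casting only allows casts that NumPy considers value-preserving.
    if not np.can_cast(old_dtype, new_dtype, casting="safe"):
        problems.append(f"dtype {old_dtype} does not cast safely to {new_dtype}")
    if problems and not force:
        raise ValueError("; ".join(problems))
    return problems


# Widening is accepted without complaint.
check_metadata_change(1, 2, "int32", "int64")

# Narrowing is rejected unless the caller opts in explicitly.
try:
    check_metadata_change(1, 1, "float64", "float32")
except ValueError as exc:
    print("refused:", exc)
print(check_metadata_change(1, 1, "float64", "float32", force=True))
```

Returning the list of problems even when `force=True` lets a caller log what was overridden.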

@asmeurer asmeurer added this to the November 2020 milestone Nov 10, 2020
@ArvidJB
Collaborator

ArvidJB commented Nov 10, 2020

Sounds good. Let's make this as simple as possible. It would be great if we could have some API which takes a callback and then calls it for each version to write a new HDF5 file:

  • Input would be the InMemoryGroup for that version (including attrs)
  • Output would be None (if we want to skip this version) or a new InMemoryGroup with added/removed/modified datasets/attrs

Do you think this would work?
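
The callback shape described above could be sketched roughly as follows. This is a minimal illustration, not the library's API: `ToyGroup` is a hypothetical stand-in for an InMemoryGroup (datasets plus attrs), a dict of versions stands in for the HDF5 file, and `replay_to_new_file` is an invented name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ToyGroup:
    """Hypothetical stand-in for an InMemoryGroup: datasets plus attrs."""
    datasets: Dict[str, list]
    attrs: Dict[str, object] = field(default_factory=dict)


Callback = Callable[[ToyGroup], Optional[ToyGroup]]


def replay_to_new_file(versions: Dict[str, ToyGroup], callback: Callback) -> Dict[str, ToyGroup]:
    """Call `callback` once per version, in order, and collect the results."""
    new_file = {}
    for name, group in versions.items():
        result = callback(group)
        if result is None:
            continue  # the callback asked to skip this version
        new_file[name] = result
    return new_file


def drop_drafts_and_widen(group: ToyGroup) -> Optional[ToyGroup]:
    """Example callback: skip versions marked as drafts, convert values to float."""
    if group.attrs.get("draft"):
        return None
    return ToyGroup(
        datasets={k: [float(x) for x in v] for k, v in group.datasets.items()},
        attrs=dict(group.attrs),
    )


old_file = {
    "v1": ToyGroup({"values": [1, 2]}),
    "v2": ToyGroup({"values": [1, 2, 3]}, {"draft": True}),
    "v3": ToyGroup({"values": [1, 2, 4]}),
}
new_file = replay_to_new_file(old_file, drop_drafts_and_widen)
print(list(new_file))  # ['v1', 'v3']
```

Because the callback returns new groups and the results go into a separate container, the original versions are left untouched in this sketch.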

@asmeurer
Collaborator Author

I hadn't considered doing everything in a new file. That would take care of the atomicity concerns, though it also means we would need to copy everything.

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Nov 19, 2020
This is the start of the work for deshaw#146.
@ericdatakelly ericdatakelly modified the milestones: November 2020, December 2020 Dec 17, 2020
@ericdatakelly ericdatakelly modified the milestones: December 2020, January 2021 Jan 6, 2021
@ericdatakelly ericdatakelly modified the milestones: January 2021, February 2021 Feb 4, 2021
@ericdatakelly ericdatakelly modified the milestones: February 2021, March 2021 Mar 8, 2021
@ericdatakelly ericdatakelly modified the milestones: March 2021, April 2021 Apr 5, 2021
@ericdatakelly ericdatakelly modified the milestones: April 2021, May 2021 May 10, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021
@ericdatakelly ericdatakelly modified the milestones: June 2021, July 2021 Jul 8, 2021
@ericdatakelly ericdatakelly modified the milestones: July 2021, August 2021 Aug 12, 2021
@asmeurer
Collaborator Author

asmeurer commented Sep 8, 2021

Version replaying is now implemented in versioned_hdf5/replay.py, in particular in the modify_metadata and delete_version functions. The insertion idea has not yet been implemented, but we can use #47 to track that, so I am closing this issue.

@asmeurer asmeurer closed this as completed Sep 8, 2021
@ericdatakelly ericdatakelly removed this from the August 2021 milestone Sep 29, 2021
@ericdatakelly ericdatakelly added this to the September 2021 milestone Sep 29, 2021