
Added functionalities to save and load models #599

Open · wants to merge 9 commits into main
Conversation

digicosmos86 (Collaborator)

This PR adds the functionality to save and load models. This is mainly achieved by serializing the settings stored in the HSSM object into a JSON file inside a zip archive. Optionally, the data and the InferenceData can be saved into the same zip archive, in csv (or parquet) format and netcdf, respectively.

To that end, serialization and deserialization methods are added to multiple objects (UserParam, Params, Config, etc.) so they can be turned into dicts and eventually JSON. Some objects, such as the vi_approx objects, will be saved as pickles.

The reason this implementation is favored over a single pickle file is that pickle files are not secure: they can contain arbitrary code that is executed without the user's knowledge. Users can still pickle the object themselves if they so choose.

The disadvantage is that many objects might not be serializable to JSON, especially when the models are highly customized with custom link functions, custom distributions, etc. Unfortunately, not even pickling solves the problem at that level of customization. This method should work about 90% of the time.
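A minimal sketch of what this JSON-in-zip layout might look like (the function names and the `settings.json` entry name are illustrative assumptions, not the PR's actual API; the data and InferenceData parts are omitted):

```python
import json
import zipfile


def save_model_sketch(settings: dict, path: str) -> None:
    """Write model settings as JSON inside a zip archive.

    Hypothetical sketch: the real PR would also pack the data
    (csv/parquet) and the InferenceData (netcdf) into the archive.
    """
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("settings.json", json.dumps(settings, indent=2))


def load_model_sketch(path: str) -> dict:
    """Read the settings JSON back out of the archive."""
    with zipfile.ZipFile(path) as zf:
        return json.loads(zf.read("settings.json"))
```

Because the settings travel as plain JSON, the archive stays inspectable with any zip tool, which is the transparency argument made below.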

AlexanderFengler (Collaborator) left a comment

Overall this looks good (though at the same time confusing, because stuff is added in many places), but, at the risk of making a controversial statement: this adds a lot of complexity. If we can fruitfully use cpickle here, and have a slim wrapper directly on the model that works with cpickle, I think it would make our life a lot easier.

Could you convince me that this is not a good approach?

Let me just mention, I understand pickle files are not safe, but from what I see we are currently zipping a couple of pickle files anyway to get the job done.

My main issue with the current approach is that it begs for save --> load inconsistencies, and additions to HSSM will now need intervention in more places.

Really not saying this to diminish the work done here, which is great in principle.


-@dataclass
+@dataclass(slots=True)
Collaborator

is using slots=True necessary here?

Collaborator Author

Yes. With slots=True, the dataclass uses the slots implementation rather than a dictionary (__dict__) implementation, which is faster in our use case.

if save_data:
    dataIO = BytesIO()

    if save_data_format == "csv":
Collaborator

Kind of want to avoid csv, tbh.
It offers no guarantee against losing data types.
I had this case before; if I remember correctly, it was really easy to create a situation where you lose information through simple saving and loading.

Collaborator Author

Actually, I was only using the parquet format at first, but then realized that you have to install pyarrow for it. This is an unfortunate compromise for compatibility. We can highlight in the documentation that users should prefer the parquet format if they want maximum information preservation.
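The type-loss concern can be shown with nothing but the standard library (a minimal sketch; the column names are made up): csv stores every cell as text, so numeric and categorical dtypes must be re-inferred, or are simply lost, on load. Parquet embeds a schema and avoids this, at the cost of the pyarrow dependency mentioned above.

```python
import csv
from io import StringIO

# write a float and an int through the csv layer
buf = StringIO()
writer = csv.writer(buf)
writer.writerow(["rt", "response"])
writer.writerow([0.5, 1])

# read them back: both values come back as plain strings,
# so the original types are gone unless re-inferred by the loader
buf.seek(0)
rows = list(csv.reader(buf))
assert rows[1] == ["0.5", "1"]
assert all(isinstance(cell, str) for cell in rows[1])
```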

model._inference_obj.to_netcdf(tmpfile.name)
zipf.write(tmpfile.name, "traces.nc")

_write_pickle(zipf, "traces_vi", model._inference_obj_vi)
Collaborator

We are basically writing a zip file into which we pack other files, including pickle files?

I am taking a defensive stance before making the next statement... but:

Couldn't we just use cpickle on the whole thing and put that in a zip file (compression being the main reason here)? :0

Collaborator Author

Compression is not the main reason here; we just want all these parts to be in one place. This is another compromise for the parts that cannot be properly turned into a transparent format. We could also choose not to save these parts and document that as a limitation of this functionality.

digicosmos86 (Collaborator Author) commented Oct 30, 2024

> Overall this looks good (though at the same time confusing, because stuff is added in many places), but, at the risk of making a controversial statement: this adds a lot of complexity. If we can fruitfully use cpickle here, and have a slim wrapper directly on the model that works with cpickle, I think it would make our life a lot easier.
>
> Could you convince me that this is not a good approach?
>
> Let me just mention, I understand pickle files are not safe, but from what I see we are currently zipping a couple of pickle files anyway to get the job done.
>
> My main issue with the current approach is that it begs for save --> load inconsistencies, and additions to HSSM will now need intervention in more places.
>
> Really not saying this to diminish the work done here, which is great in principle.

Well, apart from the obvious security issues, pickles also cannot be used across Python versions, across HSSM versions, and maybe even across platforms (if I am not mistaken). Imagine this scenario: a researcher has used HSSM to create a model, and now she wants to reproducibly share this model with others. If she only shares a pickle file, then even if the others trust her enough to load it, they would probably still not be able to use her model, because they use a different version of Python, or of HSSM, or a different OS. The pickle file also becomes useless whenever HSSM is updated, so it is not good for long-term preservation either.

Using pickle for some parts is a compromise, because there is no way to save those parts in a transparent format. I didn't want to include this, and maybe I shouldn't have. Maybe the focus of this functionality should be on parameter definitions and traces (InferenceData), not on everything in the model.

In the end, it depends on what use case we want to support. In my opinion, if we only want people to be able to save their own model and reload it in the short term, they can feel free to pickle the whole thing. However, if we want models to be shared and/or preserved across versions, then pickling is not going to work.

AlexanderFengler (Collaborator)

As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.

Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.

If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.

Sounds reasonable @digicosmos86 ?

digicosmos86 (Collaborator Author)

> As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.
>
> Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.
>
> If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.
>
> Sounds reasonable @digicosmos86 ?

Sounds good to me. Since you've had much more experience with cpickle, I wonder if it would make more sense for you to implement that? For this PR, I'll go ahead and remove any changes to hssm.py, but I think all the other extra functionality is still worth keeping. How does that sound?

AlexanderFengler (Collaborator)

> > As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.
> > Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.
> > If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.
> > Sounds reasonable @digicosmos86 ?
>
> Sounds good to me. Since you've had much more experience with cpickle, I wonder if it would make more sense for you to implement that? For this PR, I'll go ahead and remove any changes to hssm.py, but I think all the other extra functionality is still worth keeping. How does that sound?

Alright, will take care of this over the Christmas break.
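The interim approach agreed on above might look roughly like this (a sketch only: the function names and metadata fields are assumptions, and the stdlib `pickle` stands in for whatever cpickle variant is ultimately used):

```python
import json
import pickle
import platform
import zipfile


def pickle_model_sketch(model, path: str, hssm_version: str = "unknown") -> None:
    """Pickle the whole model into a zip, next to reproducibility metadata."""
    metadata = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "hssm_version": hssm_version,
    }
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
        zf.writestr("model.pkl", pickle.dumps(model))


def unpickle_model_sketch(path: str):
    """Load the model and metadata back from the archive.

    NOTE: only unpickle archives from sources you trust -- pickle can
    execute arbitrary code, which is the security concern raised above.
    """
    with zipfile.ZipFile(path) as zf:
        metadata = json.loads(zf.read("metadata.json"))
        model = pickle.loads(zf.read("model.pkl"))
    return model, metadata
```

The metadata file lets a future loader at least warn when the Python or HSSM version differs, partially addressing the cross-version concern without a full serialization layer.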
