
Added functionalities to save and load models #599

Open · wants to merge 9 commits into main
Conversation

digicosmos86 (Collaborator)

This PR adds the functionality to save and load models. This is mainly achieved by serializing the settings stored in the HSSM object into a JSON file inside a zip archive. Optionally, the data and the InferenceData can be saved into the same zip archive, in csv (or parquet) format and netcdf, respectively.

To that end, serialization and deserialization methods are added to multiple objects (UserParam, Params, Config, etc.) so they can be turned into dicts and eventually JSON. Some objects, such as the vi_approx objects, will be saved as pickles.

The reason this implementation is favored over a single pickle file is that pickle files are not secure: they can contain arbitrary code that is executed without the user's knowledge. Users can still pickle the object themselves if they so choose.

The disadvantage is that many objects might not be serializable to JSON, especially when the models are highly customized with custom link functions, custom distributions, etc. Unfortunately, not even pickling solves the problem at that level of customization. This method should work about 90% of the time.
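A minimal sketch of what this JSON-in-zip layout might look like (the function names and the `settings.json` entry name are illustrative assumptions, not the PR's actual API; the data and InferenceData parts are omitted):

```python
import json
import zipfile


def save_model_sketch(settings: dict, path: str) -> None:
    """Write model settings as JSON inside a zip archive.

    Hypothetical sketch: the real PR would also pack the data
    (csv/parquet) and the InferenceData (netcdf) into the archive.
    """
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("settings.json", json.dumps(settings, indent=2))


def load_model_sketch(path: str) -> dict:
    """Read the settings JSON back out of the archive."""
    with zipfile.ZipFile(path) as zf:
        return json.loads(zf.read("settings.json"))
```

Because the settings travel as plain JSON, the archive stays inspectable with any zip tool, which is the transparency argument made below.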

AlexanderFengler (Collaborator) left a comment

Overall this looks good (though at the same time confusing, because stuff is added in many places), but, at the risk of making a controversial statement: this adds a lot of complexity. If we can fruitfully use cpickle here, and have a slim wrapper directly on the model that works with cpickle, I think it would make our life a lot easier.

Could you convince me that this is not a good approach?

Let me just mention, I understand pickle files are not safe, but from what I see we are currently zipping a couple of pickle files anyway to get the job done.

My main issue with the current approach is that it begs for save --> load inconsistencies, and additions to HSSM will now need intervention in more places.

Really not saying this to diminish the work done here, which is great in principle.


-@dataclass
+@dataclass(slots=True)
Collaborator

is using slots=True necessary here?

Collaborator Author

Yes. With slots=True, the dataclass uses the slots implementation rather than a dictionary (__dict__) implementation, which is faster in our use case.

if save_data:
    dataIO = BytesIO()

    if save_data_format == "csv":
Collaborator

Kind of want to avoid csv, tbh.
It offers no guarantee against losing data types.
I had this case before; if I remember correctly, it was really easy to create a situation where you lose information through simple saving and loading.

Collaborator Author

Actually, I was only using the parquet format at first, but then realized that you have to install pyarrow for it. This is an unfortunate compromise for compatibility. We can highlight in the documentation that users should prefer the parquet format if they want maximum information preservation.
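The type-loss concern can be shown with nothing but the standard library (a minimal sketch; the column names are made up): csv stores every cell as text, so numeric and categorical dtypes must be re-inferred, or are simply lost, on load. Parquet embeds a schema and avoids this, at the cost of the pyarrow dependency mentioned above.

```python
import csv
from io import StringIO

# write a float and an int through the csv layer
buf = StringIO()
writer = csv.writer(buf)
writer.writerow(["rt", "response"])
writer.writerow([0.5, 1])

# read them back: both values come back as plain strings,
# so the original types are gone unless re-inferred by the loader
buf.seek(0)
rows = list(csv.reader(buf))
assert rows[1] == ["0.5", "1"]
assert all(isinstance(cell, str) for cell in rows[1])
```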

model._inference_obj.to_netcdf(tmpfile.name)
zipf.write(tmpfile.name, "traces.nc")

_write_pickle(zipf, "traces_vi", model._inference_obj_vi)
Collaborator

We are basically writing a zip file into which we pack other files, including pickle files?

I am taking a defensive stance before making the next statement... but:

Couldn't we just use cpickle on the whole thing and put that in a zip file (compression being the main reason here)? :0

Collaborator Author

Compression is not the main reason here; we just want all these parts to be in one place. This is another compromise for the parts that cannot be properly turned into a transparent format. We could also choose not to save these parts and document that as a limitation of this functionality.

digicosmos86 (Collaborator Author) commented Oct 30, 2024

> Overall this looks good (though at the same time confusing, because stuff is added in many places), but, at the risk of making a controversial statement: this adds a lot of complexity. If we can fruitfully use cpickle here, and have a slim wrapper directly on the model that works with cpickle, I think it would make our life a lot easier.
>
> Could you convince me that this is not a good approach?
>
> Let me just mention, I understand pickle files are not safe, but from what I see we are currently zipping a couple of pickle files anyway to get the job done.
>
> My main issue with the current approach is that it begs for save --> load inconsistencies, and additions to HSSM will now need intervention in more places.
>
> Really not saying this to diminish the work done here, which is great in principle.

Well, apart from the obvious security issues, pickles also cannot be used across Python versions, across HSSM versions, and maybe even across platforms (if I am not mistaken). Imagine this scenario: a researcher has used HSSM to create a model, and now she wants to reproducibly share this model with others. If she only shares a pickle file, then even if the others trust her enough to load it, they would probably still not be able to use her model, because they use a different version of Python, or of HSSM, or a different OS. The pickle file also becomes useless whenever HSSM is updated, so it is not good for long-term preservation either.

Using pickle for some parts is a compromise, because there is no way to save those parts in a transparent format. I didn't want to include this, and maybe I shouldn't have. Maybe the focus of this functionality should be on parameter definitions and traces (InferenceData), not on everything in the model.

In the end, it depends on what use case we want to support. In my opinion, if we only want people to be able to save their own model and reload it in the short term, they can feel free to pickle the whole thing. However, if we want models to be shared and/or preserved across versions, then pickling is not going to work.

AlexanderFengler (Collaborator)

As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.

Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.

If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.

Sounds reasonable @digicosmos86 ?

digicosmos86 (Collaborator Author)

> As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.
>
> Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.
>
> If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.
>
> Sounds reasonable @digicosmos86 ?

Sounds good to me. Since you've had much more experience with cpickle, I wonder if it would make more sense for you to implement that? For this PR, I'll go ahead and remove any changes to hssm.py, but I think all the other extra functionality is still worth keeping. How does that sound?

AlexanderFengler (Collaborator)

> > As per our various discussions, I think it's a bit early to commit to a sophisticated save / load routine via serialization.
> > Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet, and should give things a bit more time to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility.
> > If possible, let's just go with a simple cpickle solution for now, which, as discussed, could be augmented by some metadata to aid reproducibility. We should definitely keep the relevant ideas around, and commit to a proper save / load approach a bit later.
> > Sounds reasonable @digicosmos86 ?
>
> Sounds good to me. Since you've had much more experience with cpickle, I wonder if it would make more sense for you to implement that? For this PR, I'll go ahead and remove any changes to hssm.py, but I think all the other extra functionality is still worth keeping. How does that sound?

Alright, will take care of this over the Christmas break.
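The interim approach agreed on above might look roughly like this (a sketch only: the function names and metadata fields are assumptions, and the stdlib `pickle` stands in for whatever cpickle variant is ultimately used):

```python
import json
import pickle
import platform
import zipfile


def pickle_model_sketch(model, path: str, hssm_version: str = "unknown") -> None:
    """Pickle the whole model into a zip, next to reproducibility metadata."""
    metadata = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "hssm_version": hssm_version,
    }
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
        zf.writestr("model.pkl", pickle.dumps(model))


def unpickle_model_sketch(path: str):
    """Load the model and metadata back from the archive.

    NOTE: only unpickle archives from sources you trust -- pickle can
    execute arbitrary code, which is the security concern raised above.
    """
    with zipfile.ZipFile(path) as zf:
        metadata = json.loads(zf.read("metadata.json"))
        model = pickle.loads(zf.read("model.pkl"))
    return model, metadata
```

The metadata file lets a future loader at least warn when the Python or HSSM version differs, partially addressing the cross-version concern without a full serialization layer.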
