Added functionalities to save and load models #599
base: main
Conversation
Overall this looks good (at the same time confusing, because stuff is added in many places), but, at the risk of making a controversial statement: this adds a lot of complexity. If we can fruitfully use cpickle here, and have a slim wrapper directly on the model that works with cpickle, I think it would make our life a lot easier.
Could you convince me that this is not a good approach?
Let me just mention, I understand pickle files are not safe, but from what I see we are currently zipping a couple of pickle files anyway to get the job done.
My main issue with the current approach is that it invites save --> load inconsistencies, and additions to HSSM will now need intervention in more places.
Really not saying this to diminish the work done here, which is great in principle.
- @dataclass
+ @dataclass(slots=True)
Is using slots=True necessary here?
Yes. This means the dataclass will use the slots implementation rather than a dictionary implementation. In our use case it's faster than the dictionary implementation.
if save_data:
    dataIO = BytesIO()

    if save_data_format == "csv":
I kind of want to avoid csv, tbh. It lacks guarantees about not losing data formats. I ran into this before; if I remember correctly, it was really easy to create a situation where you lose information from simple saving and loading.
Actually I was only using the parquet format at first, but then realized that you have to install pyarrow for that format. This is an unfortunate compromise for compatibility. We can highlight in the documentation that users should use the parquet format if they want maximum information preservation.
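As a concrete illustration of the dtype concern (standard library only; the column names are made up for the demo), a CSV round-trip turns every value into a string and collapses None into an empty string, so type information must be re-inferred, or is silently lost, on load:

```python
import csv
import io

# One row of fake model data: a float, an int, and a missing value.
row = {"rt": 0.5321, "response": 1, "condition": None}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)

buf.seek(0)
loaded = next(csv.DictReader(buf))
# Every value comes back as a string; None became "".
print(loaded)  # {'rt': '0.5321', 'response': '1', 'condition': ''}
```

Parquet, by contrast, stores a typed schema alongside the data, which is why it round-trips dtypes (and missing values) faithfully.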
model._inference_obj.to_netcdf(tmpfile.name)
zipf.write(tmpfile.name, "traces.nc")

_write_pickle(zipf, "traces_vi", model._inference_obj_vi)
We are basically writing a zip file into which we pack other files, including pickle files?
I am taking a defensive stance before making the next statement... but:
Couldn't we just use cpickle on the whole thing and put that in a zip file (compression being the main reason here)? :0
Compression is not the main reason here. We just want all these parts to be in one place. This is another compromise for the parts that cannot be properly turned into a transparent format. We could also choose not to save these parts, as a documented limitation of this functionality.
Well, apart from the obvious security issues, pickles also cannot be used across Python versions, across HSSM versions, and maybe even across platforms (if I am not mistaken). Imagine this scenario: a researcher has used HSSM to create a model. Now she wants to reproducibly share this model with others. If she is only using a pickle file, then even if the others trust her enough to load this file, they would still probably not be able to use her model, because they use a different version of Python, or of HSSM, or a different OS. This pickle file will also become useless whenever the version of HSSM is updated, so it is not good for long-term preservation either.

Using pickle for some parts is a compromise, because there is no way to save those parts in a transparent format. I didn't want to include this, and maybe I shouldn't have. Maybe the focus of this functionality should be on parameter definitions and traces (InferenceData), not everything in the model.

In the end, it depends on what use case we want to support. In my opinion, if we only want people to be able to save their own model and reload it in the short term, they can feel free to pickle the whole thing. However, if we want the models to be shared and/or preserved across versions, then pickling is not going to work here.
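To make the trade-off concrete, here is a hedged sketch of the mixed-archive approach under discussion (all file names and objects are illustrative, not the PR's actual layout): transparent parts are stored as JSON, opaque parts as pickles, all inside one zip, so readers can at least inspect the JSON parts without trusting the pickles.

```python
import io
import json
import pickle
import zipfile

# Transparent part: plain settings that survive across versions.
config = {"model": "ddm", "params": ["v", "a", "z", "t"]}
# Opaque part: stand-in for an object with no transparent representation.
opaque_obj = {"approx": [1.0, 2.0]}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("config.json", json.dumps(config))
    zf.writestr("traces_vi.pkl", pickle.dumps(opaque_obj))

# Loading: the JSON part is safe to read anywhere; unpickling
# traces_vi.pkl would require trusting the archive's author.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    restored = json.loads(zf.read("config.json"))
print(restored["model"])  # ddm
```

The design point is separation: a future HSSM version (or a cautious collaborator) can recover everything in the JSON parts even if the pickled parts are unreadable.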
As per our various discussions, I think it's a bit early to commit to a sophisticated save/load routine via serialization. Main reasoning: I don't think we should lock ourselves in with baggage on version compatibility etc. just yet; let's give it a bit more time for things to mature on that end. I understand the upside of not using pickle, but the downside is that we have to maintain more baggage to keep version compatibility. If possible, let's just go with something simple.
Sounds reasonable @digicosmos86 ?
Sounds good to me. Since you've had much more experience with it, I'll defer to you here.
Alright, will take care of this over Christmas break.
This PR adds the functionality to save and load models. This is mainly achieved by serializing settings saved in the HSSM object into a json file in a zip archive. It is also possible to save the data (optional) and the InferenceData into the same zip archive, by saving them in csv (or parquet) format and netcdf, respectively.

To that end, serialization and deserialization methods are added to multiple objects (UserParam, Params, Config, etc.) so they can be turned into dicts and eventually JSON. Some objects, such as the vi_approx objects, will be saved as pickles.

The reason this implementation is favored over a single pickle file is that pickle files are not secure: they can contain arbitrary code that is executed without the user's knowledge. Users can still pickle the object themselves if they so choose.

The disadvantage is that many objects might not be serializable into JSON, especially when models are highly customized with custom link functions, custom distributions, etc. Unfortunately, not even pickling could solve the problem at that level of customization. This method should work 90% of the time.