
Improve workflow around saving and loading fitted models #4687

Closed
twiecki opened this issue May 12, 2021 · 22 comments

Comments

@twiecki
Copy link
Member

twiecki commented May 12, 2021

After talking to users who are trying to deploy models, one thing that commonly comes up is that it's cumbersome to do so. One reason is that it's hard to fit a PyMC3 model, save it, and then load it later for making predictions, the way scikit-learn allows.

Fortunately, there are two packages that have already implemented this on top of PyMC3: pymc-learn and pymc3-models.

I think these already do most of what we need here, but both of the packages seem abandoned. There are a few options I can see and would like to get opinions on:

  1. We could fork one of these packages and continue to support it (probably pymc-learn, because it has more models).
  2. We could just take the base-class (https://github.com/pymc-learn/pymc-learn/blob/master/pmlearn/base.py#L18) and add it to PyMC3.
  3. Take just the base-class (https://github.com/pymc-learn/pymc-learn/blob/master/pmlearn/base.py#L18) and add it to another dedicated package.

I'm leaning towards the second option because I don't think users really need a model zoo, just a nice API to save, load, and predict with their own models. It's also easier to maintain. I also think this should be core PyMC3 functionality, and it's not a lot of code (hence not option 3).

The only downside I see to option 2 is that it would add scikit-learn as a dependency, because we'd inherit from sklearn.base.BaseEstimator (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L142). Mostly this adds repr features, which can be nice but are maybe not really required. We could either not inherit from BaseEstimator and save the dependency, or make the dependency optional.
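
A minimal sketch of the optional-dependency route (the fallback class here is hypothetical):

try:
    from sklearn.base import BaseEstimator
except ImportError:
    # Stand-in so PyMC3 still imports without scikit-learn installed;
    # we only lose the nicer get_params/repr machinery.
    class BaseEstimator:
        def __repr__(self):
            return f"{self.__class__.__name__}()"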

@ricardoV94
Copy link
Member

ricardoV94 commented May 12, 2021

Just in case this sounds relevant: one feature I often wish existed is a method that automatically loads saved traces if they are available, but (re)samples and saves new traces if they are not, or if the model or data has changed since the last sampling.
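
Something like this would cover the easy half; detecting that the model or data changed is the hard part, so invalidation is left to the caller in this sketch (the function name is made up):

import os

import arviz as az
import pymc3 as pm

def sample_or_load(model, cache_file, **sample_kwargs):
    # Reuse the saved trace if one exists...
    if os.path.exists(cache_file):
        return az.from_netcdf(cache_file)
    # ...otherwise sample and save for next time.
    with model:
        idata = pm.sample(return_inferencedata=True, **sample_kwargs)
    idata.to_netcdf(cache_file)
    return idata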

@Emaasit
Copy link
Contributor

Emaasit commented May 12, 2021

@twiecki I had taken a long hiatus to focus on personal mental health.

This issue is timely because I am back and currently working on upgrading pymc-learn to the latest version of PyMC3.

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

@Emaasit Always an important thing to prioritize, good to have you back.

Good to hear you plan to get back to pymc-learn; it's definitely relevant for this discussion. What are your thoughts on merging the base-class upstream (i.e. into PyMC3)? pymc-learn could then inherit from it and provide the model zoo (which I do think is very useful).

@Emaasit
Copy link
Contributor

Emaasit commented May 12, 2021

@twiecki Yes, I like that idea. That will enable pymc-learn to focus on its original goal of providing a collection of the most commonly known models (i.e. a model zoo).

... focuses on bringing probabilistic machine learning to non-specialists

@parsing-science
Copy link

@twiecki: I haven't been working on pymc3-models because I haven't needed to use it at work in the past few years. You're welcome to fork or use anything in the repo!

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

Just had a nice conversation with @junpenglao about this. I think these features would be nice:

  • scikit-learn compatible (.fit(), .save(), .load(), .predict())
  • Inherit from pymc3.Model and require the user to overwrite a .build() method where the model is specified, without requiring with context usage. This has the additional benefit of making it compatible with our existing API, so users can do with my_model: pm.sample() where my_model is created from the ModelBuilder.
  • We only need to save the InferenceData object (as netcdf); the model itself gets recreated via the class (at least until we can serialize aesara graphs, see pytensor#406: "Don't run merge optimization in ScalarInnerGraphOp.fgraph").
  • @aseyboldt came up with a handy configuration spec for models which we could also support (like configuring prior parameters). This config is a dict that can be stored as json in the InferenceData object (see the sketch after this list).
  • Inheriting from pm.Model allows for model composability where the user-model can be used inside of another with model: context (@lucianopaz has a nice example of this).
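
For the config-as-json point, a rough sketch of the round-trip, assuming idata is an InferenceData returned by pm.sample (the "model_config" attribute name is just a placeholder):

import json
import arviz as az

config = {"beta_sigma": 10.0}  # illustrative prior settings
idata.posterior.attrs["model_config"] = json.dumps(config)
idata.to_netcdf("mymodel.nc")

# later, e.g. in production
idata = az.from_netcdf("mymodel.nc")
config = json.loads(idata.posterior.attrs["model_config"])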

@junpenglao
Copy link
Member

junpenglao commented May 12, 2021

A bit more context re the discussion with @twiecki - we think that adding a wrapper in PyMC3 to support better saving and loading would be very useful. Tentatively it goes a bit like this:
The user works on their pymc3 model REPL-style as usual, say:

X = ...
y = ...
with pm.Model() as model:
    beta = pm.Normal(...)
    sigma = pm.HalfCauchy(...)
    obs = pm.Normal("obs", X @ beta, sigma, observed=y)
    trace = pm.sample(...)

To use ModelBuilder, the user does a slight rewrite, overwriting the _build method:

class ModelBuilder(pm.Model):
    ...
    def build(self, *args, **kwargs):
        # Enter the model context so the user's _build can add RVs
        # without an explicit `with` block.
        with self:
            self._build(*args, **kwargs)


class UserModel(ModelBuilder):
    def _build(self, X, y=None):
        # The same model as above, just moved into a method.
        beta = pm.Normal(...)
        sigma = pm.HalfCauchy(...)
        obs = pm.Normal("obs", X @ beta, sigma, observed=y)

my_model = UserModel()
my_model.build(X, y)
with my_model:
    pm.sample()

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

In addition, with this same API a user can do:

my_model.fit(X, y)
my_model.save('mymodel.netcdf')
# then in prod
my_model = UserModel.load('mymodel.netcdf')
preds = my_model.predict(X_test)

@canyon289
Copy link
Member

canyon289 commented May 13, 2021

Pasting the discussion from bambi where we talk about saving and loading models. If we figure this out at the PyMC3 level, then bambi could either piggyback off of it directly, or at least reduce the amount of work it needs to do.

CC @tomicapretto

bambinos/bambi#259

@AlexIoannides
Copy link

I'm in favour of the suggestion from @twiecki and @junpenglao - I think we need the flexibility to define our own models, not just those from a 'model zoo' (although those are also really helpful, they are a separate entity, in my opinion).

I would add that it'd be great to have some flexibility in the .predict() method, so that we can handle intervals and densities as well as point estimates.

My only (minor) concern is how to verify that the user has loaded the right InferenceData object for their model (before they try doing anything with it). We'd also have to be very clear that the analogy with scikit-learn ends when it comes to persistence: in scikit-learn we serialise the class with the trained weights embedded within it, whereas the suggestion here is to serialise only the 'weights' (the InferenceData), which may confuse a few people the first time they come to this.

@lucianopaz
Copy link
Contributor

I have trouble seeing the benefits of the ModelBuilder class. From what I gather, it has the same main problem that we would run into by simply calling pickle.dump on our current Model instances: it requires that everything used in the model be defined at the top level of a module. If a user uses a DensityDist with an as_op from some function, that will be fairly difficult to serialize. If the user is running from within a Jupyter notebook or IPython kernel, that could also complicate unpickling, because functions defined on the fly will be expected to be defined the same way when the user unserializes their model.

I think this problem actually boils down to having a good way for aesara to serialize computational graphs (or some way to load already-compiled binary code for lighter deployment).

@twiecki
Copy link
Member Author

twiecki commented May 19, 2021

@lucianopaz note that this proposal does not involve any pickling; rather, the ModelBuilder class which contains the model must still be available when loading. From that class we can recreate the model, so the model object is never pickled. In that scenario, we only need to store the trace.

@AlexIoannides Thanks for chiming in. I like the idea of more flexibility for .predict(). Good question re linking the model class to the right InferenceData object. We can store all kinds of meta info in the InferenceData object, like model name and version, and check that it matches. Or maybe even a hash of the ModelBuilder class.
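
A sketch of the class-hash idea (this assumes the class source is importable; classes defined on the fly in a REPL would need different handling):

import hashlib
import inspect

def class_hash(cls):
    # Fingerprint the model definition by hashing its source, so a
    # changed ModelBuilder subclass no longer matches an old trace file.
    return hashlib.sha256(inspect.getsource(cls).encode()).hexdigest()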

@AlexIoannides
Copy link

I think storing the hash of the ModelBuilder class is a great idea.

@canyon289
Copy link
Member

canyon289 commented Jul 26, 2021

User story note

Note I'm also in favor of this because we use cloudpickle now instead of pickle. After spending multiple hours debugging a failing ArviZ CI job where pickling was failing, and being confused because pickle is the go-to tool, I got enough help from the devs and read the tests to figure out we needed to use cloudpickle.

I'm afraid "regular" users will try using pickle, see it fail, and not know to use cloudpickle. This wrapper will help because users won't have to figure out which library is needed to save or load models; they can just call the method and it'll "just work".

arviz-devs/arviz#1743

@AlexIoannides
Copy link

@canyon289 What's the major difference between cloudpickle and pickle?

@twiecki
Copy link
Member Author

twiecki commented Jul 26, 2021

@AlexIoannides cloudpickle supports pickling more things, like lambda functions. For example:
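
import pickle

import cloudpickle

f = lambda x: x + 1

try:
    pickle.dumps(f)  # the stdlib pickle can't serialize lambdas
except (pickle.PicklingError, AttributeError) as e:
    print("pickle failed:", e)

# cloudpickle serializes the function by value instead
g = pickle.loads(cloudpickle.dumps(f))
assert g(1) == 2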

@nikmich1
Copy link

nikmich1 commented Jan 3, 2022

Although the 'beginner friendly' tag was removed a few days ago, I would be more than happy to give this a try over the next few days (it would be my first contribution). Can someone kindly give me a hint where I can find information on @aseyboldt's configuration spec dictionary mentioned above? I have not been able to find material on it so far.

@twiecki
Copy link
Member Author

twiecki commented Jan 4, 2022

@nikmich1 That would be a great contribution. Here is an example: https://gist.github.com/twiecki/86b02349c60385eb6d77793d37bd96a9

It's quite simple: just a static method that returns a dict that you can then reference when building the model, roughly like the sketch below. There are also some nice ideas (like the hash mentioned above) here: https://github.com/quantopian/bayesalpha/blob/master/bayesalpha/base.py
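
Roughly along these lines, using the ModelBuilder sketched earlier (the method and key names here are made up, not from the gist):

import pymc3 as pm

class UserModel(ModelBuilder):
    @staticmethod
    def default_config():
        # Prior settings users can tweak without touching the model code.
        return {"beta_sigma": 10.0, "noise_beta": 2.5}

    def _build(self, X, y=None, config=None):
        config = config or self.default_config()
        beta = pm.Normal("beta", 0.0, config["beta_sigma"])
        sigma = pm.HalfCauchy("sigma", config["noise_beta"])
        pm.Normal("obs", X @ beta, sigma, observed=y)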

@twiecki
Copy link
Member Author

twiecki commented Jan 13, 2022

@nikmich1 Any progress on this?

@nikmich1
Copy link

@twiecki thanks for reaching out. So far, I have created a very preliminary and minimal draft:
https://gist.github.com/nikmich1/61d27a8f9f3cdc6a33dd8a814878b886

To use a model hash instead of simple strings for validation, I will adopt the suggested approach from here. Before bringing it into proper shape, some feedback/help would be very useful for me on two topics:

  • For now, the user has to overwrite the '_data_setter' method for prediction with new data. I did not see a way to change the Data containers in the defined model without this additional user-defined function (a sketch of what I mean follows this list). If there is a way around this, I'm very happy to get some advice.

  • So far, dill is used to save and load the model. To switch to saving/loading only the InferenceData object, I am still looking for a good way to add the model spec dictionary to the InferenceData object, something like this (which does not work):

    inferencedata.model_spec_dict = model_spec_dict
    inferencedata.to_netcdf('filename.netcdf') # dict is lost after loading this file

    I tried inferencedata.extend(az.convert_to_inference_data(model_spec_dict)), but it seems complicated to recover the original data, especially with more than one added dictionary. Happy for any hint in the right direction.
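
For the first bullet, the _data_setter I have in mind looks roughly like this (assuming the model wraps its inputs in pm.Data containers named "X" and "y"):

import pymc3 as pm

class UserModel(ModelBuilder):
    def _data_setter(self, X, y=None):
        # Swap the contents of the pm.Data containers before prediction.
        with self:
            pm.set_data({"X": X})
            if y is not None:
                pm.set_data({"y": y})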

@twiecki
Copy link
Member Author

twiecki commented Jan 17, 2022

This looks like a great start. I think for a first cut, saving both the netcdf file and the model object with cloudpickle (not dill) is fine.

Also, you can store posterior predictive and prior predictive samples in the idata as well, using extend:

with model:
    idata = pm.sample()
    idata.extend(pm.sample_prior_predictive())
    idata.extend(pm.sample_posterior_predictive(idata))

@twiecki
Copy link
Member Author

twiecki commented Feb 22, 2022

@nikmich1 Any progress?
