
Improve workflow around saving and loading fitted models #4687

Closed
twiecki opened this issue May 12, 2021 · 22 comments

Comments

@twiecki
Copy link
Member

twiecki commented May 12, 2021

After talking to users who are trying to deploy models, one thing that commonly comes up is that it's cumbersome to do so. One reason is that it's hard to fit a PyMC3 model, save it, and then load it later for making predictions, the way scikit-learn allows.

Fortunately, there are two packages that have already implemented this on top of PyMC3: pymc-learn and pymc3-models.

I think these already do most of what we need here, but both of the packages seem abandoned. There are a few options I can see and would like to get opinions on:

  1. We could fork one of these packages and continue to support it (probably pymc-learn, because it has more models).
  2. We could just take the base-class (https://github.com/pymc-learn/pymc-learn/blob/master/pmlearn/base.py#L18) and add it to PyMC3.
  3. Take just the base-class (https://github.com/pymc-learn/pymc-learn/blob/master/pmlearn/base.py#L18) and add it to another dedicated package.

I'm leaning towards the second option because I don't think users really need a model zoo, just a nice API to save, load, and predict with their own models. It's also easier to maintain. I also think this should be core PyMC3 functionality, and it's not a lot of code (hence not option 3).

The only downside I see to option 2 is that it would add scikit-learn as a dependency, because we'd inherit from sklearn.base.BaseEstimator (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L142). Mostly this adds repr features, which can be nice but are maybe not really required. We could either not inherit from BaseEstimator and save the dependency, or make the dependency optional.
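
A minimal sketch of the optional-dependency route (the fallback class here is hypothetical):

try:
    from sklearn.base import BaseEstimator
except ImportError:
    # Stand-in so PyMC3 still imports without scikit-learn installed;
    # we only lose the nicer get_params/repr machinery.
    class BaseEstimator:
        def __repr__(self):
            return f"{self.__class__.__name__}()"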

@ricardoV94
Copy link
Member

ricardoV94 commented May 12, 2021

Just in case this sounds relevant: one feature I often wish existed is a method that automatically loads saved traces if they are available, but (re)samples and saves new traces if they are not, or if the model or data has changed since the last sampling.
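
Something like this would cover the easy half; detecting that the model or data changed is the hard part, so invalidation is left to the caller in this sketch (the function name is made up):

import os

import arviz as az
import pymc3 as pm

def sample_or_load(model, cache_file, **sample_kwargs):
    # Reuse the saved trace if one exists...
    if os.path.exists(cache_file):
        return az.from_netcdf(cache_file)
    # ...otherwise sample and save for next time.
    with model:
        idata = pm.sample(return_inferencedata=True, **sample_kwargs)
    idata.to_netcdf(cache_file)
    return idata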

@Emaasit
Copy link
Contributor

Emaasit commented May 12, 2021

@twiecki I had taken a long hiatus to focus on personal mental health.

This issue is timely because I am back and currently working on upgrading pymc-learn to the latest version of PyMC3.

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

@Emaasit Always an important thing to prioritize, good to have you back.

Good to hear you plan to get back to pymc-learn; it's definitely relevant for this discussion. What are your thoughts on merging the base-class upstream (i.e. into PyMC3)? pymc-learn could then inherit from it and provide the model zoo (which I do think is very useful).

@Emaasit
Copy link
Contributor

Emaasit commented May 12, 2021

@twiecki Yes, I like that idea. That will enable pymc-learn to focus on its original goal of providing a collection of the most commonly known models (i.e. a model zoo).

... focuses on bringing probabilistic machine learning to non-specialists

@parsing-science
Copy link

@twiecki: I haven't been working on pymc3-models because I haven't needed to use it at work in the past few years. You're welcome to fork or use anything in the repo!

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

Just had a nice conversation with @junpenglao about this. I think these features would be nice:

  • scikit-learn compatible (.fit(), .save(), .load(), .predict())
  • Inherit from pymc3.Model and require the user to overwrite a .build() method where the model is specified, without requiring with context usage. This has the additional benefit of making it compatible with our existing API, so users can do with my_model: pm.sample() where my_model is created from the ModelBuilder.
  • We only need to save the InferenceData object (as netcdf); the model itself gets recreated via the class (at least until we can serialize aesara graphs, see pytensor#406: "Don't run merge optimization in ScalarInnerGraphOp.fgraph").
  • @aseyboldt came up with a handy configuration spec for models which we could also support (like configuring prior parameters). This config is a dict that can be stored as json in the InferenceData object (see the sketch after this list).
  • Inheriting from pm.Model allows for model composability where the user-model can be used inside of another with model: context (@lucianopaz has a nice example of this).
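
For the config-as-json point, a rough sketch of the round-trip, assuming idata is an InferenceData returned by pm.sample (the "model_config" attribute name is just a placeholder):

import json
import arviz as az

config = {"beta_sigma": 10.0}  # illustrative prior settings
idata.posterior.attrs["model_config"] = json.dumps(config)
idata.to_netcdf("mymodel.nc")

# later, e.g. in production
idata = az.from_netcdf("mymodel.nc")
config = json.loads(idata.posterior.attrs["model_config"])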

@junpenglao
Copy link
Member

junpenglao commented May 12, 2021

A bit more context re the discussion with @twiecki - we think that adding a wrapper in PyMC3 to support better saving and loading would be very useful. Tentatively it goes a bit like this:
The user works on their pymc3 model REPL-style as usual, say:

X = ...
y = ...
with pm.Model() as model:
    beta = pm.Normal(...)
    sigma = pm.HalfCauchy(...)
    obs = pm.Normal("obs", X @ beta, sigma, observed=y)
    trace = pm.sample(...)

To use ModelBuilder, the user does a slight rewrite, overwriting the _build method:

class ModelBuilder(pm.Model):
    ...
    def build(self, *args, **kwargs):
        # Enter the model context so the user's _build can add RVs
        # without an explicit `with` block.
        with self:
            self._build(*args, **kwargs)


class UserModel(ModelBuilder):
    def _build(self, X, y=None):
        # The same model as above, just moved into a method.
        beta = pm.Normal(...)
        sigma = pm.HalfCauchy(...)
        obs = pm.Normal("obs", X @ beta, sigma, observed=y)

my_model = UserModel()
my_model.build(X, y)
with my_model:
    pm.sample()

@twiecki
Copy link
Member Author

twiecki commented May 12, 2021

In addition, with this same API a user can do:

my_model.fit(X, y)
my_model.save('mymodel.netcdf')
# then in prod
my_model = UserModel.load('mymodel.netcdf')
preds = my_model.predict(X_test)

@canyon289
Copy link
Member

canyon289 commented May 13, 2021

Pasting the discussion from bambi where we talk about saving and loading models. If we figure this out at the PyMC3 level, then bambi could either piggyback off of it directly, or at least reduce the amount of work it needs to do.

CC @tomicapretto

bambinos/bambi#259

@AlexIoannides
Copy link

I'm in favour of the suggestion from @twiecki and @junpenglao - I think we need the flexibility to define our own models, not just those from a 'model zoo' (although those are also really helpful, they are a separate entity, in my opinion).

I would add that it'd be great to have some flexibility in the .predict() method, so that we can handle intervals and densities as well as point estimates.

My only (minor) concern is how to verify that the user has loaded the right InferenceData object for their model (before they try doing anything with it). We'd also have to be very clear that the analogy with scikit-learn ends when it comes to persistence: in scikit-learn we serialise the class with the trained weights embedded within it, whereas the suggestion here is to serialise only the 'weights' (the InferenceData), which may confuse a few people the first time they come to this.

@lucianopaz
Copy link
Contributor

I have trouble seeing the benefits of the ModelBuilder class. From what I gather, it has the same main problem that we would run into by simply calling pickle.dump on our current Model instances: it requires that everything used in the model be defined at the top level of a module. If a user uses a DensityDist with an as_op from some function, that will be fairly difficult to serialize. If the user is running from within a Jupyter notebook or IPython kernel, that could also complicate unpickling, because functions defined on the fly will be expected to be defined the same way when the user unserializes their model.

I think this problem actually boils down to having a good way for aesara to serialize computational graphs (or some way to load already-compiled binary code for lighter deployment).

@twiecki
Copy link
Member Author

twiecki commented May 19, 2021

@lucianopaz note that this proposal does not involve any pickling; rather, the ModelBuilder class which contains the model must still be available when loading. From that class we can recreate the model, so the model object is never pickled. In that scenario, we only need to store the trace.

@AlexIoannides Thanks for chiming in. I like the idea of more flexibility for .predict(). Good question re linking the model class to the right InferenceData object. We can store all kinds of meta info in the InferenceData object, like model name and version, and check that it matches. Or maybe even a hash of the ModelBuilder class.
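
A sketch of the class-hash idea (this assumes the class source is importable; classes defined on the fly in a REPL would need different handling):

import hashlib
import inspect

def class_hash(cls):
    # Fingerprint the model definition by hashing its source, so a
    # changed ModelBuilder subclass no longer matches an old trace file.
    return hashlib.sha256(inspect.getsource(cls).encode()).hexdigest()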

@AlexIoannides
Copy link

I think storing the hash of the ModelBuilder class is a great idea.

@canyon289
Copy link
Member

canyon289 commented Jul 26, 2021

User story note

Note I'm also in favor of this because we use cloudpickle now instead of pickle. After spending multiple hours debugging a failing ArviZ CI job where pickling was failing, and being confused because pickle is the go-to tool, I got enough help from the devs and read the tests to figure out we needed to use cloudpickle.

I'm afraid "regular" users will try using pickle, see it fail, and not know to use cloudpickle. This wrapper will help because users won't have to figure out which library is needed to save or load models; they can just call the method and it'll "just work".

arviz-devs/arviz#1743

@AlexIoannides
Copy link

@canyon289 What's the major difference between cloudpickle and pickle?

@twiecki
Copy link
Member Author

twiecki commented Jul 26, 2021

@AlexIoannides cloudpickle supports pickling more things, like lambda functions. For example:
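
import pickle

import cloudpickle

f = lambda x: x + 1

try:
    pickle.dumps(f)  # the stdlib pickle can't serialize lambdas
except (pickle.PicklingError, AttributeError) as e:
    print("pickle failed:", e)

# cloudpickle serializes the function by value instead
g = pickle.loads(cloudpickle.dumps(f))
assert g(1) == 2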

@nikmich1
Copy link

nikmich1 commented Jan 3, 2022

Although the 'beginner friendly' tag was removed a few days ago, I would be more than happy to give this a try over the next few days (it would be my first contribution). Can someone kindly give me a hint where I can find information on @aseyboldt's configuration spec dictionary mentioned above? I have not been able to find material on it so far.

@twiecki
Copy link
Member Author

twiecki commented Jan 4, 2022

@nikmich1 That would be a great contribution. Here is an example: https://gist.github.com/twiecki/86b02349c60385eb6d77793d37bd96a9

It's quite simple: just a static method that returns a dict that you can then reference when building the model, roughly like the sketch below. There are also some nice ideas (like the hash mentioned above) here: https://github.com/quantopian/bayesalpha/blob/master/bayesalpha/base.py
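
Roughly along these lines, using the ModelBuilder sketched earlier (the method and key names here are made up, not from the gist):

import pymc3 as pm

class UserModel(ModelBuilder):
    @staticmethod
    def default_config():
        # Prior settings users can tweak without touching the model code.
        return {"beta_sigma": 10.0, "noise_beta": 2.5}

    def _build(self, X, y=None, config=None):
        config = config or self.default_config()
        beta = pm.Normal("beta", 0.0, config["beta_sigma"])
        sigma = pm.HalfCauchy("sigma", config["noise_beta"])
        pm.Normal("obs", X @ beta, sigma, observed=y)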

@twiecki
Copy link
Member Author

twiecki commented Jan 13, 2022

@nikmich1 Any progress on this?

@nikmich1
Copy link

@twiecki thanks for reaching out. So far, I have created a very preliminary and minimal draft:
https://gist.github.com/nikmich1/61d27a8f9f3cdc6a33dd8a814878b886

To use a model hash instead of simple strings for validation, I will adopt the suggested approach from here. Before bringing it into proper shape, some feedback/help would be very useful for me on two topics:

  • For now, the user has to overwrite the '_data_setter' method for prediction with new data. I did not see a way to change the Data containers in the defined model without this additional user-defined function (a sketch of what I mean follows this list). If there is a way around this, I'm very happy to get some advice.

  • So far, dill is used to save and load the model. To switch to saving/loading only the InferenceData object, I am still looking for a good way to add the model spec dictionary to the InferenceData object, something like this (which does not work):

    inferencedata.model_spec_dict = model_spec_dict
    inferencedata.to_netcdf('filename.netcdf') # dict is lost after loading this file

    I tried inferencedata.extend(az.convert_to_inference_data(model_spec_dict)), but it seems complicated to recover the original data, especially with more than one added dictionary. Happy for any hint in the right direction.
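
For the first bullet, the _data_setter I have in mind looks roughly like this (assuming the model wraps its inputs in pm.Data containers named "X" and "y"):

import pymc3 as pm

class UserModel(ModelBuilder):
    def _data_setter(self, X, y=None):
        # Swap the contents of the pm.Data containers before prediction.
        with self:
            pm.set_data({"X": X})
            if y is not None:
                pm.set_data({"y": y})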

@twiecki
Copy link
Member Author

twiecki commented Jan 17, 2022

This looks like a great start. I think for a first cut, saving both the netcdf file and the model object with cloudpickle (not dill) is fine.

Also, you can store posterior predictive and prior predictive samples in the idata as well, using extend:

with model:
    idata = pm.sample()
    idata.extend(pm.sample_prior_predictive())
    idata.extend(pm.sample_posterior_predictive(idata))

@twiecki
Copy link
Member Author

twiecki commented Feb 22, 2022

@nikmich1 Any progress?
