Improve workflow around saving and loading fitted models #4687
Comments
Just in case this sounds relevant: one feature I often wish existed is a method that automatically loads saved traces if they are available, but (re)samples and saves the new traces if they are not, or if the model or data has changed since the last time it was sampled.
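One way to picture this behaviour is a small cache keyed on a fingerprint of the model specification and the data. The sketch below is library-free and hypothetical: `fingerprint`, `load_or_sample`, and the JSON cache layout are illustrative names rather than PyMC3 API, and `fake_sampler` only stands in for whatever would actually run the sampler.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(model_spec, data):
    """Hash a JSON-serialisable model spec and data into a hex key."""
    payload = json.dumps({"spec": model_spec, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_or_sample(cache_path, model_spec, data, sample_fn):
    """Return (draws, was_cached); resample when the fingerprint changed."""
    key = fingerprint(model_spec, data)
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cached = json.load(fh)
        if cached.get("key") == key:
            return cached["draws"], True
    # Cache miss or stale cache: run the sampler and overwrite the file.
    draws = sample_fn(data)
    with open(cache_path, "w") as fh:
        json.dump({"key": key, "draws": draws}, fh)
    return draws, False


# Hypothetical stand-in for a sampler: returns the data mean as one "draw".
def fake_sampler(data):
    return [sum(data) / len(data)]


cache_file = os.path.join(tempfile.mkdtemp(), "trace_cache.json")
spec = {"likelihood": "normal", "prior_scale": 1.0}

first = load_or_sample(cache_file, spec, [1.0, 2.0, 3.0], fake_sampler)
second = load_or_sample(cache_file, spec, [1.0, 2.0, 3.0], fake_sampler)
third = load_or_sample(cache_file, spec, [1.0, 2.0, 6.0], fake_sampler)
```

The first call samples and writes the cache, the second reuses it, and the third resamples because the data changed. A real version would need a fingerprint that also covers the model code, which is harder than hashing a dict.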
@twiecki I had taken a long hiatus to focus on personal mental health. This is a timely git issue because I am back and currently working on upgrading pymc-learn to the latest version of PyMC3. |
@Emaasit Always an important thing to prioritize, good to have you back. Good to hear you plan to get back to |
@twiecki Yes, I like that idea. That will enable
|
@twiecki: I haven't been working on pymc3-models because I haven't needed to use it at work in the past few years. You're welcome to fork or use anything in the repo!
Just had a nice conversation with @junpenglao about this. I think these features would be nice:
|
A bit more context re the discussion with @twiecki: we think that adding a wrapper in PyMC3 to better support saving and loading would be very useful. Tentatively it goes a bit like:

    import pymc3 as pm

    X = ...
    y = ...
    with pm.Model() as model:
        beta = pm.Normal(...)
        sigma = pm.HalfCauchy(...)
        obs = pm.Normal("obs", X @ beta, sigma, observed=y)
        trace = pm.sample(...)

To use

    class ModelBuilder(pm.Model):
        ...
        def build(self, *args, **kwargs):
            # Enter the model context so _build can register variables.
            with self:
                self._build(*args, **kwargs)

    class UserModel(ModelBuilder):
        def _build(self, X, y=None):
            beta = pm.Normal(...)
            sigma = pm.HalfCauchy(...)
            obs = pm.Normal("obs", X @ beta, sigma, observed=y)

    my_model = UserModel()
    my_model.build(X, y)
    with my_model:
        pm.sample()
In addition, a user can do with this same API:

    my_model.fit(X, y)
    my_model.save('mymodel.netcdf')
    # then in prod
    my_model = UserModel.load('mymodel.netcdf')
    preds = my_model.predict(X_test)
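To make the shape of the proposed fit/save/load/predict workflow concrete, here is a minimal, library-free sketch. It is not the PyMC3 design: the `_fit` and `_predict` hooks, the JSON file format, and the closed-form slope that replaces sampling are all assumptions chosen so the example runs without any dependencies.

```python
import json
import os
import tempfile


class ModelBuilder:
    """Hypothetical base class: subclasses supply _fit and _predict."""

    def __init__(self):
        self.params = None  # filled in by fit() or load()

    def fit(self, X, y):
        self.params = self._fit(X, y)
        return self

    def predict(self, X):
        if self.params is None:
            raise RuntimeError("call fit() or load() before predict()")
        return self._predict(X)

    def save(self, path):
        # Store the class name so load() can reject a mismatched file.
        with open(path, "w") as fh:
            json.dump({"model": type(self).__name__, "params": self.params}, fh)

    @classmethod
    def load(cls, path):
        with open(path) as fh:
            stored = json.load(fh)
        if stored["model"] != cls.__name__:
            raise ValueError(f"file holds {stored['model']}, not {cls.__name__}")
        model = cls()
        model.params = stored["params"]
        return model


class UserModel(ModelBuilder):
    """Toy stand-in: a no-intercept least-squares slope replaces sampling."""

    def _fit(self, X, y):
        slope = sum(x * t for x, t in zip(X, y)) / sum(x * x for x in X)
        return {"beta": slope}

    def _predict(self, X):
        return [self.params["beta"] * x for x in X]


path = os.path.join(tempfile.mkdtemp(), "mymodel.json")
UserModel().fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]).save(path)

# "In prod": rebuild the model from disk and predict on new inputs.
restored = UserModel.load(path)
preds = restored.predict([4.0, 5.0])
```

The point of the sketch is that the user only writes the model-specific hooks, while persistence lives once in the base class.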
Pasting the discussion from bambi where we talk about saving and loading models. If we figure this out at PyMC3 level then bambi could either directly piggyback off of it, or at least reduce the amount of work it needs to do |
I'm in favour of the suggestion from @twiecki and @junpenglao. I think we need the flexibility to define our own models and not just those from a 'model zoo' (although these are also really helpful, they are a separate entity, in my opinion). I would add that it'd be great to have some flexibility in the My only (minor) concern is how to verify that the user has loaded the right |
I have trouble seeing the benefits of the |
@lucianopaz note that this proposal does not propose any pickling take place but that the @AlexIoannides Thanks for chiming in. I like the idea of more flexibility for |
I think storing the hash of the |
User story note: I'm also in favor of this because we use cloudpickle now instead of pickle. I spent multiple hours debugging the failing ArviZ CI job where pickling was failing, and I was confused because pickle is the go-to tool. After getting enough help from the devs and reading the tests, I figured out we needed to use cloudpickle. I'm afraid "regular" users will try using pickle, see it fail, and not know to use cloudpickle. This wrapper will help because users won't have to figure out what library is needed to save or load models; they can just call the method and it'll "just work".
@canyon289 What's the major difference between cloudpickle and pickle? |
@AlexIoannides cloudpickle supports more things to pickle, like lambda functions. |
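A small check illustrates the difference. The standard-library `pickle` stores functions by reference to an importable name, so a lambda fails, while cloudpickle serialises the function body itself. `can_serialize` is a helper made up for this example, and the cloudpickle part only runs if that package happens to be installed.

```python
import pickle

try:
    import cloudpickle  # third-party; optional for this sketch
except ImportError:
    cloudpickle = None


def can_serialize(obj, dumps):
    """Return True if dumps(obj) succeeds, False if it raises."""
    try:
        dumps(obj)
        return True
    except Exception:
        return False


square = lambda x: x * x  # a lambda has no importable name

builtin_ok = can_serialize(len, pickle.dumps)    # found by name, so it works
lambda_ok = can_serialize(square, pickle.dumps)  # stdlib pickle rejects this

if cloudpickle is not None:
    # cloudpickle output can be read back with the regular pickle.loads.
    restored = pickle.loads(cloudpickle.dumps(square))
    assert restored(3) == 9
```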
Although the 'beginner friendly' tag was removed a few days ago, I would be more than happy to give it a try over the next few days (it would be my first contribution). Can someone kindly give me a hint about where I can find information on @aseyboldt's configuration spec dictionary, which was mentioned above? I have not been able to find material on that so far.
@nikmich1 That would be a great contribution. Here is an example: https://gist.github.com/twiecki/86b02349c60385eb6d77793d37bd96a9 It's quite simple, just a static method that returns a dict that you can then reference when building the model. There are also some nice ideas (like the hash mentioned above) here: https://github.com/quantopian/bayesalpha/blob/master/bayesalpha/base.py |
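As a rough illustration of the pattern described here (a static method returning a dict that the model references, plus a hash for validation), consider the sketch below. It does not reproduce the linked gist or the bayesalpha code: the class name, the config keys, and the `model_id` property are all hypothetical.

```python
import hashlib
import json


class ConfiguredModel:
    """Hypothetical model class driven by a configuration dict."""

    @staticmethod
    def default_config():
        # Illustrative prior settings; the keys are made up for this sketch.
        return {"beta_sd": 1.0, "sigma_beta": 5.0}

    def __init__(self, config=None):
        # User overrides are layered on top of the defaults.
        self.config = {**self.default_config(), **(config or {})}

    @property
    def model_id(self):
        """Short hash of class name plus config, usable as a load-time check."""
        payload = json.dumps(
            {"model": type(self).__name__, "config": self.config}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


default_model = ConfiguredModel()
custom_model = ConfiguredModel({"beta_sd": 2.0})
```

Storing `model_id` next to the saved trace would let a loader refuse a file that was produced with a different class or configuration.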
@nikmich1 Any progress on this? |
@twiecki thanks for reaching out, so far, I created a very preliminary and minimal draft: For using a model hash instead of simple strings for validation, I will adopt the suggested approach from here. Before bringing it to proper form, some feedback/help would be very useful for me on two topics:
|
This looks like a great start. I think for a first cut, saving both the netcdf and the model object in cloudpickle (not dill) is fine. Also, you can store ppc and prior predictive in

    with model:
        idata = pm.sample()
        idata.extend(pm.sample_prior_predictive())
        idata.extend(pm.sample_posterior_predictive(idata))
@nikmich1 Any progress? |
After talking to users who are trying to deploy models, one thing that commonly comes up is that deployment is cumbersome. One reason is that it's hard to fit a PyMC3 model, save it, and then load it later to make predictions, as scikit-learn allows.
Fortunately, two packages have already implemented this on top of PyMC3: pymc3_models and pymc-learn, where pymc-learn looks like a fork of pymc3_models with some extensions and more models.
I think these already do most of what we need here, but both of the packages seem abandoned. There are a few options I can see and would like to get opinions on:
pymc-learn because it has more).
I'm leaning towards the second option because I don't think users really need a model zoo, just a nice API to load, save, and predict with their own models. It's also easier to maintain. I also think this should be core PyMC3 functionality, and it's not a lot of code (hence not option 3).
The only downside to 2 I see is that it would add scikit-learn as a dependency because we inherit from sklearn.base.BaseEstimator (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L142). Mostly this adds repr features, which can be nice, but maybe they are not really required. We could either not inherit from BaseEstimator and save the dependency, or make the dependency optional.