Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intervals/quantiles can be negative for models that can only make non-negative predictions #422

Open
fsaforo1 opened this issue Jul 11, 2024 · 3 comments
Labels
feature request New feature or request implementing algorithms Implementing algorithms, estimators, objects native to skpro module:probability&simulation probability distributions and simulators module:regression probabilistic regression module

Comments

@fsaforo1
Copy link

For models with log-link like Tweedie, Poisson, etc., that are constrained to only make non-negative predictions, one would expect the intervals to be non-negative as well. Below is a reproducible example.

#pip install tweedie

from tweedie import tweedie
import numpy as np
import pandas as pd

TWEEDIE_VARIANCE_POWER = 1.5
TWEEDIE_DISPERSION_PARAMETER = 20

def simulate_tweedie_glm_data():
    # Number of parameters for model
    P = 10

    N = 5000

    SEED = 2022
    rng = np.random.RandomState(SEED)
    X = rng.rand(N, P - 1)
    X = np.hstack((np.ones((N, 1)), X))
    beta = np.concatenate(([370], rng.randint(-10, 200, P - 1))) / 100
    eta = np.dot(X, beta)
    mu = np.exp(eta)

    tweedie_variance_power = TWEEDIE_VARIANCE_POWER
    dispersion_parameter = TWEEDIE_DISPERSION_PARAMETER

    y = tweedie(mu=mu, p=tweedie_variance_power, phi=dispersion_parameter).rvs(N, random_state = SEED)
    X = pd.DataFrame(X)
    X.columns = ['intcpt' if i ==0 else 'feat_' + str(i) for i in range(P)]
    X.drop('intcpt', axis=1, inplace=True)

    return X, y


X, y = simulate_tweedie_glm_data()

(y<0).mean()
# 0.0


from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import TweedieRegressor
from skpro.regression.residual import ResidualDouble

reg_mean = TweedieRegressor(power=TWEEDIE_VARIANCE_POWER)
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)
reg_proba.fit(X, y)

y_pred_quantiles = reg_proba.predict_quantiles(X, alpha=[0.05, 0.5, 0.95])

(y_pred_quantiles < 0).mean(axis=0)

#array([0.1868, 0.    , 0.    ])
@fkiraly
Copy link
Collaborator

fkiraly commented Jul 11, 2024

@fsaforo1, this is expected because the model you are constructing actually has a normal predictive distribution, with reg_mean predicting the variance, and reg_resid the residual, so it wil not have non-negative support.

If you want the predictive distribution to be Tweedie, and hence non-negative, the two closest things are:

  1. using the GLMRegressor, this has links which are sub-cases of Tweedie (but not arbitrary Tweedie)

  2. using a Tweedie distribution as the base distribution, for distr_type

A general Tweedie distribution is not yet implemented, but if you'd want to make a PR, that would be appreciated! I can then check that it works with the ResidualDouble strategy. This PR is adding another distribution, you could take it as a template: #421

Also, if you can describe the model that you would like to fit, I can also see whether ResidualDouble needs to be extended, e.g., with parameter transformations.

@fkiraly
Copy link
Collaborator

fkiraly commented Jul 11, 2024

I'm also a bit confused by the sklearn Tweedie regressor.

How does one retrieve the two parameters from the fitted Tweedie distribution? Namely, mean and scale.

@fkiraly
Copy link
Collaborator

fkiraly commented Jul 11, 2024

I opened an issue for interfacing the Tweedie regressor here: #423

I looked at it briefly, and do not know how to get the variance from it. Do you have an idea, @fsaforo1? If so, a reply on #423 would be appreciated!

@fkiraly fkiraly added module:probability&simulation probability distributions and simulators module:regression probabilistic regression module implementing algorithms Implementing algorithms, estimators, objects native to skpro feature request New feature or request labels Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request implementing algorithms Implementing algorithms, estimators, objects native to skpro module:probability&simulation probability distributions and simulators module:regression probabilistic regression module
Projects
None yet
Development

No branches or pull requests

2 participants