diff --git a/.nojekyll b/.nojekyll index 7f5fd795f..58ee289c3 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -ee1058d6 \ No newline at end of file +7665b7c4 \ No newline at end of file diff --git a/notebooks/index.html b/notebooks/index.html index 6a2908fcd..f413d8762 100644 --- a/notebooks/index.html +++ b/notebooks/index.html @@ -456,11 +456,25 @@

Generalized Linear Models

+
+
+ + Zero inflated models +
+ When the outcome is mostly zeros and/or overdispersed
+
+
+ + + +
+

More advanced models

-
+
Distributional models @@ -474,7 +488,7 @@

More advanced models

-
+
Gaussian processes @@ -488,7 +502,7 @@

More advanced models

-
+
Gaussian processes @@ -506,7 +520,7 @@

More advanced models

Tools to interpret model outputs

-
+
Predictions @@ -520,7 +534,7 @@

Tools to interpret model outputs

-
+
Comparisons @@ -534,7 +548,7 @@

Tools to interpret model outputs

-
+
Slopes diff --git a/notebooks/zero_inflated_regression.html b/notebooks/zero_inflated_regression.html new file mode 100644 index 000000000..024b32aad --- /dev/null +++ b/notebooks/zero_inflated_regression.html @@ -0,0 +1,1233 @@ + + + + + + + + + +Bambi – zero_inflated_regression + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ + + +
+
import arviz as az
+import matplotlib.pyplot as plt
+from matplotlib.lines import Line2D
+import numpy as np
+import pandas as pd
+import scipy.stats as stats
+import seaborn as sns
+import warnings
+
+import bambi as bmb
+
+warnings.simplefilter(action='ignore', category=FutureWarning)
+
+
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
+
+
+
+

Zero inflated models

+

In this notebook, we will describe zero-inflated outcomes and why the data generating process behind them requires a special class of generalized linear models: the zero-inflated Poisson (ZIP) and the hurdle Poisson. We will then describe and implement each model using a set of zero-inflated data from ecology. Along the way, we will also use the interpret sub-package to interpret the predictions and parameters of the models.

+
+

Zero inflated outcomes

+

Sometimes, an observation is not generated from a single process, but from a mixture of processes. Whenever a mixture of processes generates an observation, a mixture model may be more appropriate. A mixture model uses more than one probability distribution to model the data. Count data often call for a mixture model because it is common to observe both a large number of zeros and values greater than zero. A zero means “nothing happened”, and this can be either because the rate of events is low, or because the process that generates the events was never “triggered”. For example, in health service utilization data (the number of times a patient used a service during a given time period), a large number of zeros represents patients with no utilization during the time period. However, some patients do use a service, which is the result of some “triggered” process.

+

There are two popular classes of models for zero-inflated data: (1) the ZIP model, and (2) the hurdle Poisson model. First, the ZIP model is described and its implementation in Bambi is outlined. The hurdle Poisson model and its implementation are covered thereafter.

+
+
+

Zero inflated Poisson

+

To model zero-inflated outcomes, the ZIP model uses a distribution that mixes two data generating processes. The first process generates zeros, and the second process uses a Poisson distribution to generate counts (of which some may be zero). The result of this mixture is a distribution that can be described as

+

\[P(Y=0) = (1 - \psi) + \psi e^{-\mu}\]

+

\[P(Y=y_i) = \psi \frac{e^{-\mu} \mu^{y_i}}{y_i!} \ \text{for} \ y_i = 1, 2, 3, \ldots, n\]

+

where \(y_i\) is the outcome, \(\mu\) is the mean of the Poisson process where \(\mu \ge 0\), and \(\psi\) is the probability of the Poisson process where \(0 \lt \psi \lt 1\). To understand how these two processes are “mixed”, let’s simulate some data using the two process equations above (taken from the PyMC docs).

+
+
x = np.arange(0, 22)
+psis = [0.7, 0.4]
+mus = [10, 4]
+plt.figure(figsize=(7, 3))
+for psi, mu in zip(psis, mus):
+    pmf = stats.poisson.pmf(x, mu)
+    pmf[0] = (1 - psi) + pmf[0] # 1.) generate zeros
+    pmf[1:] =  psi * pmf[1:] # 2.) generate counts
+    pmf /= pmf.sum() # normalize to get probabilities
+    plt.plot(x, pmf, '-o', label='$\\psi$ = {}, $\\mu$ = {}'.format(psi, mu))
+
+plt.title("Zero Inflated Poisson Process")
+plt.xlabel('x', fontsize=12)
+plt.ylabel('f(x)', fontsize=12)
+plt.legend(loc=1)
+plt.show()
+
+

+
+
+

Notice how the blue line, corresponding to a higher \(\psi\) and \(\mu\), has a higher rate of counts and fewer zeros. Additionally, the inline comments above describe the first and second processes generating the data.

+
+

ZIP regression model

+

The equations above only describe the ZIP distribution. However, predictors can be added to make this a regression model. Suppose we have a response variable \(Y\), which represents the number of events that occur during a time period, and \(p\) predictors \(X_1, X_2, ..., X_p\). We can model the parameters of the ZIP distribution as a linear combination of the predictors.

+

\[Y_i \sim \text{ZIPoisson}(\mu_i, \psi_i)\]

+

\[g(\mu_i) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}\]

+

\[h(\psi_i) = \alpha_0 + \alpha_1 X_{1i} + \cdots + \alpha_p X_{pi}\]

+

where \(g\) and \(h\) are the link functions for each parameter. By default, Bambi uses the log link for \(g\) and the logit link for \(h\). Notice how there are two linear models and two link functions: one for each parameter in the \(\text{ZIPoisson}\). The parameters of the linear models differ because any predictor such as \(X\) may be associated differently with each part of the mixture. In fact, you don’t even need to use the same predictors in both linear models, but that is beyond the scope of this notebook.

+
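To make the link functions concrete, here is a small illustrative sketch (not part of the original notebook) showing how a value of each linear predictor maps back to \(\mu\) via the inverse of the log link and to \(\psi\) via the inverse of the logit link. The predictor values are arbitrary, hypothetical numbers.
+
from scipy.special import expit  # inverse of the logit link
+
+# hypothetical values of the two linear predictors for a single observation
+eta_mu, eta_psi = 1.2, -0.3
+
+mu = np.exp(eta_mu)   # inverse of the log link, guarantees mu > 0
+psi = expit(eta_psi)  # inverse of the logit link, guarantees 0 < psi < 1
+print(f"mu = {mu:.2f}, psi = {psi:.2f}")
+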
+

The fish dataset

+

To demonstrate the ZIP regression model, we model and predict how many fish are caught by visitors at a state park using survey data. Many visitors catch zero fish, either because they did not fish at all, or because they were unlucky. The dataset contains data on 250 groups that went to a state park to fish. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), whether they used live bait (livebait), and whether or not they brought a camper to the park (camper).

+
+
fish_data = pd.read_stata("http://www.stata-press.com/data/r11/fish.dta")
+cols = ["count", "livebait", "camper", "persons", "child"]
+fish_data = fish_data[cols]
+fish_data["livebait"] = pd.Categorical(fish_data["livebait"])
+fish_data["camper"] = pd.Categorical(fish_data["camper"])
+fish_data = fish_data[fish_data["count"] < 60] # remove outliers
+
+
+
fish_data.head()
+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
countlivebaitcamperpersonschild
00.00.00.01.00.0
10.01.01.01.00.0
20.01.00.01.00.0
30.01.01.02.01.0
41.01.00.01.00.0
+
+
+
+
+
# Excess zeros, and skewed count
+plt.figure(figsize=(7, 3))
+sns.histplot(fish_data["count"], discrete=True)
+plt.xlabel("Number of Fish Caught");
+
+

+
+
+

To fit a ZIP regression model, we pass family=zero_inflated_poisson to the bmb.Model constructor.

+
+
zip_model = bmb.Model(
+    "count ~ livebait + camper + persons + child", 
+    fish_data, 
+    family='zero_inflated_poisson'
+)
+
+zip_idata = zip_model.fit(
+    draws=1000, 
+    target_accept=0.95, 
+    random_seed=1234, 
+    chains=4
+)
+
+
Auto-assigning NUTS sampler...
+Initializing NUTS using jitter+adapt_diag...
+Multiprocess sampling (4 chains in 4 jobs)
+NUTS: [count_psi, Intercept, livebait, camper, persons, child]
+
+
+ + +
+
+ +
+ + 100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences] +
+ +
+
+
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
+
+
+

Let’s take a look at the model components. Why is there only one linear model and link function defined, for \(\mu\)? Where are the linear model and link function for \(\psi\)? By default, the “main” (or first) formula describes the parent parameter, in this case \(\mu\). Since we didn’t pass an additional formula for the non-parent parameter \(\psi\), \(\psi\) is not modeled as a function of the predictors as explained above. If we want to model both \(\mu\) and \(\psi\) as functions of the predictors, we need to explicitly pass two formulas.

+
+
zip_model
+
+
       Formula: count ~ livebait + camper + persons + child
+        Family: zero_inflated_poisson
+          Link: mu = log
+  Observations: 248
+        Priors: 
+    target = mu
+        Common-level effects
+            Intercept ~ Normal(mu: 0.0, sigma: 9.5283)
+            livebait ~ Normal(mu: 0.0, sigma: 7.2685)
+            camper ~ Normal(mu: 0.0, sigma: 5.0733)
+            persons ~ Normal(mu: 0.0, sigma: 2.2583)
+            child ~ Normal(mu: 0.0, sigma: 2.9419)
+        
+        Auxiliary parameters
+            psi ~ Beta(alpha: 2.0, beta: 2.0)
+------
+* To see a plot of the priors call the .plot_priors() method.
+* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
+
+
+
+
formula = bmb.Formula(
+    "count ~ livebait + camper + persons + child", # parent parameter mu
+    "psi ~ livebait + camper + persons + child"    # non-parent parameter psi
+)
+
+zip_model = bmb.Model(
+    formula, 
+    fish_data, 
+    family='zero_inflated_poisson'
+)
+
+zip_idata = zip_model.fit(
+    draws=1000, 
+    target_accept=0.95, 
+    random_seed=1234, 
+    chains=4
+)
+
+
Auto-assigning NUTS sampler...
+Initializing NUTS using jitter+adapt_diag...
+Multiprocess sampling (4 chains in 4 jobs)
+NUTS: [Intercept, livebait, camper, persons, child, psi_Intercept, psi_livebait, psi_camper, psi_persons, psi_child]
+
+
+ + +
+
+ +
+ + 100.00% [8000/8000 00:05<00:00 Sampling 4 chains, 0 divergences] +
+ +
+
+
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
+
+
+
+
zip_model
+
+
       Formula: count ~ livebait + camper + persons + child
+                psi ~ livebait + camper + persons + child
+        Family: zero_inflated_poisson
+          Link: mu = log
+                psi = logit
+  Observations: 248
+        Priors: 
+    target = mu
+        Common-level effects
+            Intercept ~ Normal(mu: 0.0, sigma: 9.5283)
+            livebait ~ Normal(mu: 0.0, sigma: 7.2685)
+            camper ~ Normal(mu: 0.0, sigma: 5.0733)
+            persons ~ Normal(mu: 0.0, sigma: 2.2583)
+            child ~ Normal(mu: 0.0, sigma: 2.9419)
+    target = psi
+        Common-level effects
+            psi_Intercept ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_livebait ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_camper ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_persons ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_child ~ Normal(mu: 0.0, sigma: 1.0)
+------
+* To see a plot of the priors call the .plot_priors() method.
+* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
+
+
+

Now, both \(\mu\) and \(\psi\) are defined as a function of a linear combination of the predictors. Additionally, we can see that the log and logit link functions are defined for \(\mu\) and \(\psi\), respectively.

+
+
zip_model.graph()
+
+

+
+
+

Since each parameter has a different link function and a different meaning, we must be careful about how the coefficients are interpreted. Coefficients without the substring “psi” correspond to the \(\mu\) parameter (the mean of the Poisson process) and are on the log scale. Coefficients with the substring “psi” correspond to the \(\psi\) parameter (which can be thought of as the log-odds of non-zero data) and are on the logit scale. Interpreting these coefficients can be easier with the interpret sub-package. Below, we will show how to use this sub-package to interpret the coefficients conditional on a set of predictors.

+
+
az.summary(
+    zip_idata, 
+    var_names=["Intercept", "livebait", "camper", "persons", "child"], 
+    filter_vars="like"
+)
+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
meansdhdi_3%hdi_97%mcse_meanmcse_sdess_bulkess_tailr_hat
Intercept-1.5730.310-2.130-0.9560.0050.0043593.03173.01.0
livebait[1.0]1.6090.2721.1432.1690.0040.0034158.03085.01.0
camper[1.0]0.2620.0950.0850.4400.0010.0015032.02816.01.0
persons0.6150.0450.5270.6970.0010.0004864.02709.01.0
child-0.7950.094-0.972-0.6250.0020.0013910.03232.01.0
psi_Intercept-1.4430.817-2.9410.1240.0130.0094253.03018.01.0
psi_livebait[1.0]-0.1880.677-1.4901.0520.0100.0114470.02776.01.0
psi_camper[1.0]0.8410.3230.2221.4370.0040.0036002.03114.01.0
psi_persons0.9120.1930.5711.2880.0030.0024145.03169.01.0
psi_child-1.8900.305-2.502-1.3530.0050.0034022.02883.01.0
+
+
+
+
+
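As a brief aside (a hedged sketch, not part of the original notebook), the draws summarized above can be mapped back to more interpretable scales by applying the inverse link functions; the variable names are taken from the summary table, and the group of 4 persons below is a hypothetical example at the reference levels of livebait, camper, and child.
+
from scipy.special import expit
+
+post = az.extract(zip_idata)
+
+# mu uses a log link: exponentiating a coefficient gives the multiplicative
+# change in the expected fish count per unit increase in the predictor
+persons_rate_ratio = np.exp(post["persons"]).mean()
+
+# psi uses a logit link: expit of the linear predictor gives the probability of
+# the count-generating process, here for a hypothetical group of 4 persons at
+# the reference levels of the remaining predictors
+psi_four_persons = expit(post["psi_Intercept"] + 4 * post["psi_persons"]).mean()
+
+print(float(persons_rate_ratio), float(psi_four_persons))
+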
+

Interpret model parameters

+

Since we have fit a distributional model, we can leverage the plot_predictions() function in the interpret sub-package to visualize how the \(\text{ZIPoisson}\) parameters \(\mu\) and \(\psi\) vary as a covariate changes.

+
+
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))
+
+bmb.interpret.plot_predictions(
+    zip_model,
+    zip_idata,
+    covariates="persons",
+    ax=ax[0]
+)
+ax[0].set_ylabel("mu (fish count)")
+ax[0].set_title("$\\mu$ as a function of persons")
+
+bmb.interpret.plot_predictions(
+    zip_model,
+    zip_idata,
+    covariates="persons",
+    target="psi",
+    ax=ax[1]
+)
+ax[1].set_title("$\\psi$ as a function of persons");
+
+

+
+
+

Interpreting the left plot (the \(\mu\) parameter): as the number of people in a fishing group increases, so does the expected number of fish caught. The right plot (the \(\psi\) parameter) shows that as the number of people in a group increases, the probability of the Poisson process increases. One interpretation of this is that as the number of people in a group increases, the probability of catching no fish decreases.

+
+
+

Posterior predictive distribution

+

Lastly, let’s plot the posterior predictive distribution against the observed data to see how well the model fits the data. A utility function is defined below to assist with plotting discrete values.

+
+
def adjust_lightness(color, amount=0.5):
+    import matplotlib.colors as mc
+    import colorsys
+    try:
+        c = mc.cnames[color]
+    except:
+        c = color
+    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
+    return colorsys.hls_to_rgb(c[0], c[1] * amount, c[2])
+
+def plot_ppc_discrete(idata, bins, ax):
+    
+    def add_discrete_bands(x, lower, upper, ax, **kwargs):
+        for i, (l, u) in enumerate(zip(lower, upper)):
+            s = slice(i, i + 2)
+            ax.fill_between(x[s], [l, l], [u, u], **kwargs)
+
+    var_name = list(idata.observed_data.data_vars)[0]
+    y_obs = idata.observed_data[var_name].to_numpy()
+    
+    counts_list = []
+    for draw_values in az.extract(idata, "posterior_predictive")[var_name].to_numpy().T:
+        counts, _ = np.histogram(draw_values, bins=bins)
+        counts_list.append(counts)
+    counts_arr = np.stack(counts_list)
+
+    qts_90 = np.quantile(counts_arr, (0.05, 0.95), axis=0)
+    qts_70 = np.quantile(counts_arr, (0.15, 0.85), axis=0)
+    qts_50 = np.quantile(counts_arr, (0.25, 0.75), axis=0)
+    qts_30 = np.quantile(counts_arr, (0.35, 0.65), axis=0)
+    median = np.quantile(counts_arr, 0.5, axis=0)
+
+    colors = [adjust_lightness("C0", x) for x in [1.8, 1.6, 1.4, 1.2, 0.9]]
+
+    add_discrete_bands(bins, qts_90[0], qts_90[1], ax=ax, color=colors[0])
+    add_discrete_bands(bins, qts_70[0], qts_70[1], ax=ax, color=colors[1])
+    add_discrete_bands(bins, qts_50[0], qts_50[1], ax=ax, color=colors[2])
+    add_discrete_bands(bins, qts_30[0], qts_30[1], ax=ax, color=colors[3])
+
+    
+    ax.step(bins[:-1], median, color=colors[4], lw=2, where="post")
+    ax.hist(y_obs, bins=bins, histtype="step", lw=2, color="black", align="mid")
+    handles = [
+        Line2D([], [], label="Observed data", color="black", lw=2),
+        Line2D([], [], label="Posterior predictive median", color=colors[4], lw=2)
+    ]
+    ax.legend(handles=handles)
+    return ax
+
+
+
zip_pps = zip_model.predict(idata=zip_idata, kind="pps", inplace=False)
+
+bins = np.arange(39)
+fig, ax = plt.subplots(figsize=(7, 3))
+ax = plot_ppc_discrete(zip_pps, bins, ax)
+ax.set_xlabel("Number of Fish Caught")
+ax.set_ylabel("Count")
+ax.set_title("ZIP model - Posterior Predictive Distribution");
+
+

+
+
+

The model captures the number of zeros accurately. However, it seems to slightly underestimate the frequency of counts of 1 and 2. Nonetheless, the plot shows that the model captures the overall distribution of counts reasonably well.

+
+
+
+
+

Hurdle Poisson

+

ZIP and hurdle models both use two processes to generate data. The two models differ in their conceptualization of how the zeros are generated. In the \(\text{ZIPoisson}\), zeros can come from either process, while in the hurdle Poisson they come from only one of them. Thus, a hurdle model assumes zero and positive values are generated from two independent processes. In the hurdle model, there are two components: (1) a “structural” process, such as a binary model, for modeling whether the response variable is zero or not, and (2) a process using a truncated model, such as a truncated Poisson, for modeling the counts. The result of these two components is a distribution that can be described as

+

\[P(Y=0) = 1 - \psi\]

+

\[P(Y=y_i) = \psi \frac{e^{-\mu_i}\mu_{i}^{y_i} / y_i!}{1 - e^{-\mu_i}} \ \text{for} \ y_i = 1, 2, 3,...,n\]

+

where \(y_i\) is the outcome, \(\mu\) is the mean of the Poisson process where \(\mu \ge 0\), and \(\psi\) is the probability of the Poisson process where \(0 \lt \psi \lt 1\). The numerator of the second equation is the Poisson probability mass function, and the denominator is one minus the Poisson cumulative distribution function evaluated at zero (i.e., the probability of a positive count). This is a lot to digest. Again, let’s simulate some data to understand how data is generated from this process.

+
+
x = np.arange(0, 22)
+psis = [0.7, 0.4]
+mus = [10, 4]
+
+plt.figure(figsize=(7, 3))
+for psi, mu in zip(psis, mus):
+    pmf = stats.poisson.pmf(x, mu) # pmf evaluated at x given mu
+    cdf = stats.poisson.cdf(0, mu) # cdf evaluated at 0 given mu
+    pmf[0] = 1 - psi # 1.) generate zeros
+    pmf[1:] =  (psi * pmf[1:]) / (1 - cdf) # 2.) generate counts
+    pmf /= pmf.sum() # normalize to get probabilities
+    plt.plot(x, pmf, '-o', label='$\\psi$ = {}, $\\mu$ = {}'.format(psi, mu))
+
+plt.title("Hurdle Poisson Process")
+plt.xlabel('x', fontsize=12)
+plt.ylabel('f(x)', fontsize=12)
+plt.legend(loc=1)
+plt.show()
+
+

+
+
+

The differences between the ZIP and hurdle models are subtle. Notice how in the code for the hurdle Poisson process, the zero counts are generated by (1 - psi) versus (1 - psi) + pmf[0] for the ZIP process. Additionally, the positive observations are generated by the process (psi * pmf[1:]) / (1 - cdf), where the numerator is a vector of probabilities for positive counts scaled by \(\psi\) and the denominator uses the Poisson cumulative distribution function to evaluate the probability that a count is greater than 0.

+
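To make the contrast concrete, the following is a small numeric check (purely illustrative) of the zero probability implied by each model, reusing the first \(\psi\) and \(\mu\) pair from the simulations above.
+
psi, mu = 0.7, 10
+
+# ZIP: zeros come from the zero-generating process and from the Poisson process
+p_zero_zip = (1 - psi) + psi * np.exp(-mu)
+
+# Hurdle: zeros come only from the binary ("structural") process
+p_zero_hurdle = 1 - psi
+
+print(f"P(Y=0) under ZIP: {p_zero_zip:.4f}, under hurdle: {p_zero_hurdle:.4f}")
+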
+

Hurdle regression model

+

To add predictors in the hurdle model, we follow the same specification as in the ZIP regression model section since both models have the same structure. The only difference is that the hurdle model uses a truncated Poisson distribution instead of a ZIP distribution. Right away, we will model both the parent and non-parent parameters as functions of the predictors.

+
+
hurdle_formula = bmb.Formula(
+    "count ~ livebait + camper + persons + child", # parent parameter mu
+    "psi ~ livebait + camper + persons + child"    # non-parent parameter psi
+)
+
+hurdle_model = bmb.Model(
+    hurdle_formula, 
+    fish_data, 
+    family='hurdle_poisson'
+)
+
+hurdle_idata = hurdle_model.fit(
+    draws=1000, 
+    target_accept=0.95, 
+    random_seed=1234, 
+    chains=4
+)
+
+
Auto-assigning NUTS sampler...
+Initializing NUTS using jitter+adapt_diag...
+Multiprocess sampling (4 chains in 4 jobs)
+NUTS: [Intercept, livebait, camper, persons, child, psi_Intercept, psi_livebait, psi_camper, psi_persons, psi_child]
+
+
+ + +
+
+ +
+ + 100.00% [8000/8000 00:06<00:00 Sampling 4 chains, 0 divergences] +
+ +
+
+
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
+
+
+
+
hurdle_model
+
+
       Formula: count ~ livebait + camper + persons + child
+                psi ~ livebait + camper + persons + child
+        Family: hurdle_poisson
+          Link: mu = log
+                psi = logit
+  Observations: 248
+        Priors: 
+    target = mu
+        Common-level effects
+            Intercept ~ Normal(mu: 0.0, sigma: 9.5283)
+            livebait ~ Normal(mu: 0.0, sigma: 7.2685)
+            camper ~ Normal(mu: 0.0, sigma: 5.0733)
+            persons ~ Normal(mu: 0.0, sigma: 2.2583)
+            child ~ Normal(mu: 0.0, sigma: 2.9419)
+    target = psi
+        Common-level effects
+            psi_Intercept ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_livebait ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_camper ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_persons ~ Normal(mu: 0.0, sigma: 1.0)
+            psi_child ~ Normal(mu: 0.0, sigma: 1.0)
+------
+* To see a plot of the priors call the .plot_priors() method.
+* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
+
+
+
+
hurdle_model.graph()
+
+

+
+
+

Since the same link functions are used for the ZIP and hurdle models, the coefficients can be interpreted in a similar manner.

+
+
az.summary(
+    hurdle_idata,
+    var_names=["Intercept", "livebait", "camper", "persons", "child"], 
+    filter_vars="like"
+)
+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
meansdhdi_3%hdi_97%mcse_meanmcse_sdess_bulkess_tailr_hat
Intercept-1.6150.363-2.278-0.9150.0060.0053832.02121.01.0
livebait[1.0]1.6610.3291.0312.2730.0050.0044149.01871.01.0
camper[1.0]0.2710.1000.0730.4490.0010.0016843.02934.01.0
persons0.6100.0450.5330.7000.0010.0004848.03196.01.0
child-0.7910.094-0.970-0.6180.0010.0014371.03006.01.0
psi_Intercept-2.7800.583-3.906-1.7150.0080.0064929.03258.01.0
psi_livebait[1.0]0.7640.427-0.0671.5570.0060.0055721.02779.01.0
psi_camper[1.0]0.8490.2980.2831.3780.0040.0035523.02855.01.0
psi_persons1.0400.1830.7191.3960.0030.0023852.03007.01.0
psi_child-2.0030.282-2.555-1.5170.0040.0034021.03183.01.0
+
+
+
+
+

Posterior predictive samples

+

As with the ZIP model above, we plot the posterior predictive distribution against the observed data to see how well the model fits the data.

+
+
hurdle_pps = hurdle_model.predict(idata=hurdle_idata, kind="pps", inplace=False)
+
+bins = np.arange(39)
+fig, ax = plt.subplots(figsize=(7, 3))
+ax = plot_ppc_discrete(hurdle_pps, bins, ax)
+ax.set_xlabel("Number of Fish Caught")
+ax.set_ylabel("Count")
+ax.set_title("Hurdle Model - Posterior Predictive Distribution");
+
+

+
+
+

The plot looks similar to the one for the ZIP model above. Again, the model captures the overall distribution of counts reasonably well.

+
+
+
+
+

Summary

+

In this notebook, two classes of models (ZIP and hurdle Poisson) for modeling zero-inflated data were presented and implemented in Bambi. The two models differ in how zeros are generated. The ZIP model uses a distribution that mixes two data generating processes: the first process generates zeros, and the second process uses a Poisson distribution to generate counts (of which some may be zero). The hurdle Poisson also uses two data generating processes, but doesn’t “mix” them: one process generates the zeros, such as a binary model for whether the response variable is zero or not, and a second process models the counts. These two processes are independent of each other.

+

The dataset used to demonstrate the two models had a large number of zeros. These zeros appeared either because a group did not fish at all, or because they fished but caught zero fish. Because zeros could be generated for two different reasons, the ZIP model, which allows zeros to be generated from a mixture of processes, seems to be more appropriate for this dataset.

+
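A natural follow-up, sketched below rather than shown in the original notebook, is to compare the two models with LOO via az.compare. This assumes refitting both models while asking the sampler to store pointwise log-likelihood values in the returned InferenceData.
+
# Hedged sketch: store pointwise log-likelihoods so the two models can be compared
zip_idata_ll = zip_model.fit(
+    draws=1000, target_accept=0.95, random_seed=1234, chains=4,
+    idata_kwargs={"log_likelihood": True},
+)
+hurdle_idata_ll = hurdle_model.fit(
+    draws=1000, target_accept=0.95, random_seed=1234, chains=4,
+    idata_kwargs={"log_likelihood": True},
+)
+az.compare({"ZIP": zip_idata_ll, "Hurdle Poisson": hurdle_idata_ll})
+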
+
%load_ext watermark
+%watermark -n -u -v -iv -w
+
+
Last updated: Mon Sep 25 2023
+
+Python implementation: CPython
+Python version       : 3.11.0
+IPython version      : 8.13.2
+
+seaborn   : 0.12.2
+numpy     : 1.24.2
+scipy     : 1.11.2
+bambi     : 0.13.0.dev0
+matplotlib: 3.7.1
+arviz     : 0.16.1
+pandas    : 2.1.0
+
+Watermark: 2.3.1
+
+
+
+ + +
+
+ +
+ +
+
+ +
+ + + + \ No newline at end of file diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-11-output-1.svg b/notebooks/zero_inflated_regression_files/figure-html/cell-11-output-1.svg new file mode 100644 index 000000000..28a8d9e0c --- /dev/null +++ b/notebooks/zero_inflated_regression_files/figure-html/cell-11-output-1.svg @@ -0,0 +1,199 @@ + + + + + + + + +clusterlivebait_dim (1) + +livebait_dim (1) + + +clustercamper_dim (1) + +camper_dim (1) + + +clusterpsi_livebait_dim (1) + +psi_livebait_dim (1) + + +clusterpsi_camper_dim (1) + +psi_camper_dim (1) + + +clustercount_obs (248) + +count_obs (248) + + + +persons + +persons +~ +Normal + + + +count + +count +~ +MarginalMixture + + + +persons->count + + + + + +psi_Intercept + +psi_Intercept +~ +Normal + + + +psi + +psi +~ +Deterministic + + + +psi_Intercept->psi + + + + + +Intercept + +Intercept +~ +Normal + + + +Intercept->count + + + + + +psi_child + +psi_child +~ +Normal + + + +psi_child->psi + + + + + +child + +child +~ +Normal + + + +child->count + + + + + +psi_persons + +psi_persons +~ +Normal + + + +psi_persons->psi + + + + + +livebait + +livebait +~ +Normal + + + +livebait->count + + + + + +camper + +camper +~ +Normal + + + +camper->count + + + + + +psi_livebait + +psi_livebait +~ +Normal + + + +psi_livebait->psi + + + + + +psi_camper + +psi_camper +~ +Normal + + + +psi_camper->psi + + + + + +psi->count + + + + + diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-13-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-13-output-1.png new file mode 100644 index 000000000..5eeee2a1f Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-13-output-1.png differ diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-15-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-15-output-1.png new file mode 100644 index 000000000..ab6b52b7f Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-15-output-1.png differ diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-16-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-16-output-1.png new file mode 100644 index 000000000..e43467087 Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-16-output-1.png differ diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-19-output-1.svg b/notebooks/zero_inflated_regression_files/figure-html/cell-19-output-1.svg new file mode 100644 index 000000000..28a8d9e0c --- /dev/null +++ b/notebooks/zero_inflated_regression_files/figure-html/cell-19-output-1.svg @@ -0,0 +1,199 @@ + + + + + + + + +clusterlivebait_dim (1) + +livebait_dim (1) + + +clustercamper_dim (1) + +camper_dim (1) + + +clusterpsi_livebait_dim (1) + +psi_livebait_dim (1) + + +clusterpsi_camper_dim (1) + +psi_camper_dim (1) + + +clustercount_obs (248) + +count_obs (248) + + + +persons + +persons +~ +Normal + + + +count + +count +~ +MarginalMixture + + + +persons->count + + + + + +psi_Intercept + +psi_Intercept +~ +Normal + + + +psi + +psi +~ +Deterministic + + + +psi_Intercept->psi + + + + + +Intercept + +Intercept +~ +Normal + + + +Intercept->count + + + + + +psi_child + +psi_child +~ +Normal + + + +psi_child->psi + + + + + +child + +child +~ +Normal + + + +child->count + + + + + +psi_persons + +psi_persons +~ +Normal + + + +psi_persons->psi + + + + + +livebait + +livebait +~ +Normal + + + +livebait->count + + + + + +camper + +camper +~ +Normal + + + 
+camper->count + + + + + +psi_livebait + +psi_livebait +~ +Normal + + + +psi_livebait->psi + + + + + +psi_camper + +psi_camper +~ +Normal + + + +psi_camper->psi + + + + + +psi->count + + + + + diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-21-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-21-output-1.png new file mode 100644 index 000000000..d27271ec0 Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-21-output-1.png differ diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-3-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-3-output-1.png new file mode 100644 index 000000000..dda694197 Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-3-output-1.png differ diff --git a/notebooks/zero_inflated_regression_files/figure-html/cell-6-output-1.png b/notebooks/zero_inflated_regression_files/figure-html/cell-6-output-1.png new file mode 100644 index 000000000..c1c3885ee Binary files /dev/null and b/notebooks/zero_inflated_regression_files/figure-html/cell-6-output-1.png differ diff --git a/search.json b/search.json index 45d6f0060..4d4acd2e5 100644 --- a/search.json +++ b/search.json @@ -1,80 +1,45 @@ [ { - "objectID": "index.html", - "href": "index.html", - "title": "BAyesian Model-Building Interface in Python", + "objectID": "notebooks/negative_binomial.html", + "href": "notebooks/negative_binomial.html", + "title": "Bambi", "section": "", - "text": "Bambi is a high-level Bayesian model-building interface written in Python. It works with the PyMC probabilistic programming framework and is designed to make it extremely easy to fit Bayesian mixed-effects models common in biology, social sciences and other disciplines." - }, - { - "objectID": "index.html#dependencies", - "href": "index.html#dependencies", - "title": "BAyesian Model-Building Interface in Python", - "section": "Dependencies", - "text": "Dependencies\nBambi is tested on Python 3.9+ and depends on ArviZ, formulae, NumPy, pandas and PyMC (see pyproject.toml for version information)." 
- }, - { - "objectID": "index.html#installation", - "href": "index.html#installation", - "title": "BAyesian Model-Building Interface in Python", - "section": "Installation", - "text": "Installation\nBambi is available from the Python Package Index at https://pypi.org/project/bambi, alternatively it can be installed using Conda.\n\nPyPI\nThe latest release of Bambi can be installed using pip:\npip install bambi\nAlternatively, if you want the bleeding edge version of the package, you can install from GitHub:\npip install git+https://github.com/bambinos/bambi.git\n\n\nConda\nIf you use Conda, you can also install the latest release of Bambi with the following command:\nconda install -c conda-forge bambi" - }, - { - "objectID": "index.html#usage", - "href": "index.html#usage", - "title": "BAyesian Model-Building Interface in Python", - "section": "Usage", - "text": "Usage\nA simple fixed effects model is shown in the example below.\nimport arviz as az\nimport bambi as bmb\nimport pandas as pd\n\n# Read in a tab-delimited file containing our data\ndata = pd.read_table('my_data.txt', sep='\\t')\n\n# Initialize the fixed effects only model\nmodel = bmb.Model('DV ~ IV1 + IV2', data)\n\n# Fit the model using 1000 on each of 4 chains\nresults = model.fit(draws=1000, chains=4)\n\n# Use ArviZ to plot the results\naz.plot_trace(results)\n\n# Key summary and diagnostic info on the model parameters\naz.summary(results)\nFor a more in-depth introduction to Bambi see our Quickstart or our set of example notebooks." - }, - { - "objectID": "index.html#citation", - "href": "index.html#citation", - "title": "BAyesian Model-Building Interface in Python", - "section": "Citation", - "text": "Citation\nIf you use Bambi and want to cite it please use\n@article{\n Capretto2022,\n title={Bambi: A Simple Interface for Fitting Bayesian Linear Models in Python},\n volume={103},\n url={https://www.jstatsoft.org/index.php/jss/article/view/v103i15},\n doi={10.18637/jss.v103.i15},\n number={15},\n journal={Journal of Statistical Software},\n author={Capretto, Tomás and Piho, Camen and Kumar, Ravin and Westfall, Jacob and Yarkoni, Tal and Martin, Osvaldo A},\n year={2022},\n pages={1–29}\n}" - }, - { - "objectID": "index.html#contributing", - "href": "index.html#contributing", - "title": "BAyesian Model-Building Interface in Python", - "section": "Contributing", - "text": "Contributing\nWe welcome contributions from interested individuals or groups! For information about contributing to Bambi, check out our instructions, policies, and guidelines here." + "text": "I always experience some kind of confusion when looking at the negative binomial distribution after a while of not working with it. There are so many different definitions that I usually need to read everything more than once. The definition I’ve first learned, and the one I like the most, says as follows: The negative binomial distribution is the distribution of a random variable that is defined as the number of independent Bernoulli trials until the k-th “success”. 
In short, we repeat a Bernoulli experiment until we observe k successes and record the number of trials it required.\n\\[\nY \\sim \\text{NB}(k, p)\n\\]\nwhere \\(0 \\le p \\le 1\\) is the probability of success in each Bernoulli trial, \\(k > 0\\), usually integer, and \\(y \\in \\{k, k + 1, \\cdots\\}\\)\nThe probability mass function (pmf) is\n\\[\np(y | k, p)= \\binom{y - 1}{y-k}(1 -p)^{y - k}p^k\n\\]\nIf you, like me, find it hard to remember whether \\(y\\) starts at \\(0\\), \\(1\\), or \\(k\\), try to think twice about the definition of the variable. But how? First, recall we aim to have \\(k\\) successes. And success is one of the two possible outcomes of a trial, so the number of trials can never be smaller than the number of successes. Thus, we can be confident to say that \\(y \\ge k\\).\nBut this is not the only way of defining the negative binomial distribution, there are plenty of options! One of the most interesting, and the one you see in PyMC3, the library we use in Bambi for the backend, is as a continuous mixture. The negative binomial distribution describes a Poisson random variable whose rate is also a random variable (not a fixed constant!) following a gamma distribution. Or in other words, conditional on a gamma-distributed variable \\(\\mu\\), the variable \\(Y\\) has a Poisson distribution with mean \\(\\mu\\).\nUnder this alternative definition, the pmf is\n\\[\n\\displaystyle p(y | k, \\alpha) = \\binom{y + \\alpha - 1}{y} \\left(\\frac{\\alpha}{\\mu + \\alpha}\\right)^\\alpha\\left(\\frac{\\mu}{\\mu + \\alpha}\\right)^y\n\\]\nwhere \\(\\mu\\) is the parameter of the Poisson distribution (the mean, and variance too!) and \\(\\alpha\\) is the rate parameter of the gamma.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom scipy.stats import nbinom\n\n\naz.style.use(\"arviz-darkgrid\")\n\n\nimport warnings\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\nIn SciPy, the definition of the negative binomial distribution differs a little from the one in our introduction. They define \\(Y\\) = Number of failures until k successes and then \\(y\\) starts at 0. 
In the following plot, we have the probability of observing \\(y\\) failures before we see \\(k=3\\) successes.\n\ny = np.arange(0, 30)\nk = 3\np1 = 0.5\np2 = 0.3\n\n\nfig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)\n\nax[0].bar(y, nbinom.pmf(y, k, p1))\nax[0].set_xticks(np.linspace(0, 30, num=11))\nax[0].set_title(f\"k = {k}, p = {p1}\")\n\nax[1].bar(y, nbinom.pmf(y, k, p2))\nax[1].set_xticks(np.linspace(0, 30, num=11))\nax[1].set_title(f\"k = {k}, p = {p2}\")\n\nfig.suptitle(\"Y = Number of failures until k successes\", fontsize=16);\n\n\n\n\nFor example, when \\(p=0.5\\), the probability of seeing \\(y=0\\) failures before 3 successes (or in other words, the probability of having 3 successes out of 3 trials) is 0.125, and the probability of seeing \\(y=3\\) failures before 3 successes is 0.156.\n\nprint(nbinom.pmf(y, k, p1)[0])\nprint(nbinom.pmf(y, k, p1)[3])\n\n0.12499999999999997\n0.15624999999999992\n\n\nFinally, if one wants to show this probability mass function as if we are following the first definition of negative binomial distribution we introduced, we just need to shift the whole thing to the right by adding \\(k\\) to the \\(y\\) values.\n\nfig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)\n\nax[0].bar(y + k, nbinom.pmf(y, k, p1))\nax[0].set_xticks(np.linspace(3, 30, num=10))\nax[0].set_title(f\"k = {k}, p = {p1}\")\n\nax[1].bar(y + k, nbinom.pmf(y, k, p2))\nax[1].set_xticks(np.linspace(3, 30, num=10))\nax[1].set_title(f\"k = {k}, p = {p2}\")\n\nfig.suptitle(\"Y = Number of trials until k successes\", fontsize=16);\n\n\n\n\n\n\n\nThe negative binomial distribution belongs to the exponential family, and the canonical link function is\n\\[\ng(\\mu_i) = \\log\\left(\\frac{\\mu_i}{k + \\mu_i}\\right) = \\log\\left(\\frac{k}{\\mu_i} + 1\\right)\n\\]\nbut it is difficult to interpret. The log link is usually preferred because of the analogy with Poisson model, and it also tends to give better results.\n\n\n\nThis example is based on this UCLA example.\nSchool administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math. We have attendance data on 314 high school juniors.\nThe variables of insterest in the dataset are\n\ndaysabs: The number of days of absence. It is our response variable.\nprogr: The type of program. 
Can be one of ‘General’, ‘Academic’, or ‘Vocational’.\nmath: Score in a standardized math test.\n\n\ndata = pd.read_stata(\"https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta\")\n\n\ndata.head()\n\n\n\n\n\n \n \n \n id\n gender\n math\n daysabs\n prog\n \n \n \n \n 0\n 1001.0\n male\n 63.0\n 4.0\n 2.0\n \n \n 1\n 1002.0\n male\n 27.0\n 4.0\n 2.0\n \n \n 2\n 1003.0\n female\n 20.0\n 2.0\n 2.0\n \n \n 3\n 1004.0\n female\n 16.0\n 3.0\n 2.0\n \n \n 4\n 1005.0\n female\n 2.0\n 3.0\n 2.0\n \n \n\n\n\n\nWe assign categories to the values 1, 2, and 3 of our \"prog\" variable.\n\ndata[\"prog\"] = data[\"prog\"].map({1: \"General\", 2: \"Academic\", 3: \"Vocational\"})\ndata.head()\n\n\n\n\n\n \n \n \n id\n gender\n math\n daysabs\n prog\n \n \n \n \n 0\n 1001.0\n male\n 63.0\n 4.0\n Academic\n \n \n 1\n 1002.0\n male\n 27.0\n 4.0\n Academic\n \n \n 2\n 1003.0\n female\n 20.0\n 2.0\n Academic\n \n \n 3\n 1004.0\n female\n 16.0\n 3.0\n Academic\n \n \n 4\n 1005.0\n female\n 2.0\n 3.0\n Academic\n \n \n\n\n\n\nThe Academic program is the most popular program (167/314) and General is the least popular one (40/314)\n\ndata[\"prog\"].value_counts()\n\nAcademic 167\nVocational 107\nGeneral 40\nName: prog, dtype: int64\n\n\nLet’s explore the distributions of math score and days of absence for each of the three programs listed above. The vertical lines indicate the mean values.\n\nfig, ax = plt.subplots(3, 2, figsize=(8, 6), sharex=\"col\")\nprograms = list(data[\"prog\"].unique())\nprograms.sort()\n\nfor idx, program in enumerate(programs):\n # Histogram\n ax[idx, 0].hist(data[data[\"prog\"] == program][\"math\"], edgecolor='black', alpha=0.9)\n ax[idx, 0].axvline(data[data[\"prog\"] == program][\"math\"].mean(), color=\"C1\")\n \n # Barplot\n days = data[data[\"prog\"] == program][\"daysabs\"]\n days_mean = days.mean()\n days_counts = days.value_counts()\n values = list(days_counts.index)\n count = days_counts.values\n ax[idx, 1].bar(values, count, edgecolor='black', alpha=0.9)\n ax[idx, 1].axvline(days_mean, color=\"C1\")\n \n # Titles\n ax[idx, 0].set_title(program)\n ax[idx, 1].set_title(program)\n\nplt.setp(ax[-1, 0], xlabel=\"Math score\")\nplt.setp(ax[-1, 1], xlabel=\"Days of absence\");\n\n\n\n\nThe first impression we have is that the distribution of math scores is not equal for any of the programs. It looks right-skewed for students under the Academic program, left-skewed for students under the Vocational program, and roughly uniform for students in the General program (although there’s a drop in the highest values). Clearly those in the Vocational program has the highest mean for the math score.\nOn the other hand, the distribution of the days of absence is right-skewed in all cases. Students in the General program present the highest absence mean while the Vocational group is the one who misses fewer classes on average.\n\n\n\nWe are interested in measuring the association between the type of the program and the math score with the days of absence. It’s also of interest to see if the association between math score and days of absence is different in each type of program.\nIn order to answer our questions, we are going to fit and compare two models. The first model uses the type of the program and the math score as predictors. The second model also includes the interaction between these two variables. The score in the math test is going to be standardized in both cases to make things easier for the sampler and save some seconds. 
A good idea to follow along is to run these models without scaling math and comparing how long it took to fit.\nWe are going to use a negative binomial likelihood to model the days of absence. But let’s stop here and think why we use this likelihood. Earlier, we said that the negative binomial distributon arises when our variable represents the number of trials until we got \\(k\\) successes. However, the number of trials is fixed, i.e. the number of school days in a given year is not a random variable. So if we stick to the definition, we could think of the two alternative views for this problem\n\nEach of the \\(n\\) days is a trial, and we record whether the student is absent (\\(y=1\\)) or not (\\(y=0\\)). This corresponds to a binary regression setting, where we could think of logistic regression or something alike. A problem here is that we have the sum of \\(y\\) for a student, but not the \\(n\\).\nThe whole school year represents the space where events occur and we count how many absences we see in that space for each student. This gives us a Poisson regression setting (count of an event in a given space or time).\n\nWe also know that when \\(n\\) is large and \\(p\\) is small, the Binomial distribution can be approximated with a Poisson distribution with \\(\\lambda = n * p\\). We don’t know exactly \\(n\\) in this scenario, but we know it is around 180, and we do know that \\(p\\) is small because you can’t skip classes all the time. So both modeling approaches should give similar results.\nBut then, why negative binomial? Can’t we just use a Poisson likelihood?\nYes, we can. However, using a Poisson likelihood implies that the mean is equal to the variance, and that is usually an unrealistic assumption. If it turns out the variance is either substantially smaller or greater than the mean, the Poisson regression model results in a poor fit. Alternatively, if we use a negative binomial likelihood, the variance is not forced to be equal to the mean, and there’s more flexibility to handle a given dataset, and consequently, the fit tends to better.\n\n\n\\[\n\\log{Y_i} = \\beta_1 \\text{Academic}_i + \\beta_2 \\text{General}_i + \\beta_3 \\text{Vocational}_i + \\beta_4 \\text{Math\\_std}_i\n\\]\n\n\n\n\\[\n\\log{Y_i} = \\beta_1 \\text{Academic}_i + \\beta_2 \\text{General}_i + \\beta_3 \\text{Vocational}_i + \\beta_4 \\text{Math\\_std}_i\n + \\beta_5 \\text{General}_i \\cdot \\text{Math\\_std}_i + \\beta_6 \\text{Vocational}_i \\cdot \\text{Math\\_std}_i\n\\]\nIn both cases we have the following dummy variables\n\\[\\text{Academic}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under Academic program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\n\\[\\text{General}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under General program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\n\\[\\text{Vocational}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under Vocational program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\nand \\(Y\\) represents the days of absence.\nSo, for example, the first model for a student under the Vocational program reduces to \\[\n\\log{Y_i} = \\beta_3 + \\beta_4 \\text{Math\\_std}_i\n\\]\nAnd one last thing to note is we’ve decided not to inclide an intercept term, that’s why you don’t see any \\(\\beta_0\\) above. 
This choice allows us to represent the effect of each program directly with \\(\\beta_1\\), \\(\\beta_2\\), and \\(\\beta_3\\).\n\n\n\n\nIt’s very easy to fit these models with Bambi. We just pass a formula describing the terms in the model and Bambi will know how to handle each of them correctly. The 0 on the right hand side of ~ simply means we don’t want to have the intercept term that is added by default. scale(math) tells Bambi we want to use standardize math before being included in the model. By default, Bambi uses a log link for negative binomial GLMs. We’ll stick to this default here.\n\n\n\nmodel_additive = bmb.Model(\"daysabs ~ 0 + prog + scale(math)\", data, family=\"negativebinomial\")\nidata_additive = model_additive.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [daysabs_alpha, prog, scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\n\n\nFor this second model we just add prog:scale(math) to indicate the interaction. A shorthand would be to use y ~ 0 + prog*scale(math), which uses the full interaction operator. In other words, it just means we want to include the interaction between prog and scale(math) as well as their main effects.\n\nmodel_interaction = bmb.Model(\"daysabs ~ 0 + prog + scale(math) + prog:scale(math)\", data, family=\"negativebinomial\")\nidata_interaction = model_interaction.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [daysabs_alpha, prog, scale(math), prog:scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\n\n\n\n\nThe first thing we do is calling az.summary(). Here we pass the InferenceData object the .fit() returned. 
This prints information about the marginal posteriors for each parameter in the model as well as convergence diagnostics.\n\naz.summary(idata_additive)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n prog[Academic]\n 1.888\n 0.084\n 1.738\n 2.057\n 0.002\n 0.001\n 2430.0\n 1649.0\n 1.00\n \n \n prog[General]\n 2.339\n 0.174\n 2.013\n 2.651\n 0.003\n 0.002\n 3364.0\n 1610.0\n 1.00\n \n \n prog[Vocational]\n 1.047\n 0.112\n 0.845\n 1.264\n 0.002\n 0.002\n 2062.0\n 1609.0\n 1.00\n \n \n scale(math)\n -0.150\n 0.063\n -0.271\n -0.036\n 0.001\n 0.001\n 2115.0\n 1357.0\n 1.00\n \n \n daysabs_alpha\n 1.020\n 0.109\n 0.835\n 1.236\n 0.002\n 0.002\n 2112.0\n 1339.0\n 1.01\n \n \n\n\n\n\n\naz.summary(idata_interaction)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n prog[Academic]\n 1.876\n 0.083\n 1.721\n 2.032\n 0.002\n 0.001\n 2149.0\n 1451.0\n 1.0\n \n \n prog[General]\n 2.341\n 0.175\n 2.007\n 2.647\n 0.004\n 0.003\n 2188.0\n 1572.0\n 1.0\n \n \n prog[Vocational]\n 0.984\n 0.128\n 0.743\n 1.223\n 0.003\n 0.002\n 2290.0\n 1703.0\n 1.0\n \n \n scale(math)\n -0.194\n 0.081\n -0.334\n -0.030\n 0.002\n 0.001\n 2001.0\n 1625.0\n 1.0\n \n \n prog:scale(math)[General]\n 0.014\n 0.164\n -0.304\n 0.305\n 0.004\n 0.003\n 2008.0\n 1738.0\n 1.0\n \n \n prog:scale(math)[Vocational]\n 0.198\n 0.168\n -0.129\n 0.512\n 0.004\n 0.003\n 1813.0\n 1556.0\n 1.0\n \n \n daysabs_alpha\n 1.017\n 0.104\n 0.821\n 1.208\n 0.002\n 0.002\n 2135.0\n 1397.0\n 1.0\n \n \n\n\n\n\nThe information in the two tables above can be visualized in a more concise manner using a forest plot. ArviZ provides us with plot_forest(). There we simply pass a list containing the InferenceData objects of the models we want to compare.\n\naz.plot_forest(\n [idata_additive, idata_interaction],\n model_names=[\"Additive\", \"Interaction\"],\n var_names=[\"prog\", \"scale(math)\"],\n combined=True,\n figsize=(8, 4)\n);\n\n\n\n\nOne of the first things one can note when seeing this plot is the similarity between the marginal posteriors. Maybe one can conclude that the variability of the marginal posterior of scale(math) is slightly lower in the model that considers the interaction, but the difference is not significant.\nWe can also make conclusions about the association between the program and the math score with the days of absence. First, we see the posterior for the Vocational group is to the left of the posterior for the two other programs, meaning it is associated with fewer absences (as we have seen when first exploring our data). There also seems to be a difference between General and Academic, where we may conclude the students in the General group tend to miss more classes.\nIn addition, the marginal posterior for math shows negative values in both cases. This means that students with higher math scores tend to miss fewer classes. Below, we see a forest plot with the posteriors for the coefficients of the interaction effects. 
Both of them overlap with 0, which means the data does not give much evidence to support there is an interaction effect between program and math score (i.e., the association between math and days of absence is similar for all the programs).\n\naz.plot_forest(idata_interaction, var_names=[\"prog:scale(math)\"], combined=True, figsize=(8, 4))\nplt.axvline(0);\n\n\n\n\n\n\n\nWe finish this example showing how we can get predictions for new data and plot the mean response for each program together with confidence intervals.\n\nmath_score = np.arange(1, 100)\n\n# This function takes a model and an InferenceData object.\n# It returns of length 3 with predictions for each type of program.\ndef predict(model, idata):\n predictions = []\n for program in programs:\n new_data = pd.DataFrame({\"math\": math_score, \"prog\": [program] * len(math_score)})\n new_idata = model.predict(\n idata, \n data=new_data,\n inplace=False\n )\n prediction = new_idata.posterior[\"daysabs_mean\"]\n predictions.append(prediction)\n \n return predictions\n\n\nprediction_additive = predict(model_additive, idata_additive)\nprediction_interaction = predict(model_interaction, idata_interaction)\n\n\nmu_additive = [prediction.mean((\"chain\", \"draw\")) for prediction in prediction_additive]\nmu_interaction = [prediction.mean((\"chain\", \"draw\")) for prediction in prediction_interaction]\n\n\nfig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize = (10, 4))\n\nfor idx, program in enumerate(programs):\n ax[0].plot(math_score, mu_additive[idx], label=f\"{program}\", color=f\"C{idx}\", lw=2)\n az.plot_hdi(math_score, prediction_additive[idx], color=f\"C{idx}\", ax=ax[0])\n\n ax[1].plot(math_score, mu_interaction[idx], label=f\"{program}\", color=f\"C{idx}\", lw=2)\n az.plot_hdi(math_score, prediction_interaction[idx], color=f\"C{idx}\", ax=ax[1])\n\nax[0].set_title(\"Additive\");\nax[1].set_title(\"Interaction\");\nax[0].set_xlabel(\"Math score\")\nax[1].set_xlabel(\"Math score\")\nax[0].set_ylim(0, 25)\nax[0].legend(loc=\"upper right\");\n\n\n\n\nAs we can see in this plot, the interval for the mean response for the Vocational program does not overlap with the interval for the other two groups, representing the group of students who miss fewer classes. On the right panel we can also see that including interaction terms does not change the slopes significantly because the posterior distributions of these coefficients have a substantial overlap with 0.\nIf you’ve made it to the end of this notebook and you’re still curious about what else you can do with these two models, you’re invited to use az.compare() to compare the fit of the two models. What do you expect before seeing the plot? Why? 
Is there anything else you could do to improve the fit of the model?\nAlso, if you’re still curious about what this model would have looked like with the Poisson likelihood, you just need to replace family=\"negativebinomial\" with family=\"poisson\" and then you’re ready to compare results!\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nbambi : 0.9.3\npandas : 1.5.2\nnumpy : 1.23.5\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "index.html#contributors", - "href": "index.html#contributors", - "title": "BAyesian Model-Building Interface in Python", - "section": "Contributors", - "text": "Contributors\nSee the GitHub contributor page." + "objectID": "notebooks/alternative_links_binary.html", + "href": "notebooks/alternative_links_binary.html", + "title": "Bambi", + "section": "", + "text": "In this example we use a simple dataset to fit a Generalized Linear Model for a binary response using different link functions.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom scipy.special import expit as invlogit\nfrom scipy.stats import norm\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\n\n\nFirst of all, let’s review some concepts. A Generalized Linear Model (GLM) is made of three components.\n1. Random component\nA set of independent and identically distributed random variables \\(Y_i\\). Their (conditional) probability distribution belongs to the same family \\(f\\) with a mean given by \\(\\mu_i\\).\n2. Systematic component (a.k.a linear predictor)\nConstructed by a linear combination of the parameters \\(\\beta_j\\) and explanatory variables \\(x_j\\), represented by \\(\\eta_i\\)\n\\[\n\\eta_i = \\mathbf{x}_i^T\\mathbf{\\beta} = x_{i1}\\beta_1 + x_{i2}\\beta_2 + \\cdots + x_{ip}\\beta_p\n\\]\n3. 
Link function\nA monotone and differentiable function \\(g\\) such that\n\\[\ng(\\mu_i) = \\eta_i = \\mathbf{x}_i^T\\mathbf{\\beta}\n\\] where \\(\\mu_i = E(Y_i)\\)\nAs we can see, this function specifies the link between the random and the systematic components of the model.\nAn important feature of GLMs is that no matter we are modeling a function of \\(\\mu\\) (and not just \\(\\mu\\), unless \\(g\\) is the identity function) is that we can show predictions in terms of the mean \\(\\mu\\) by using the inverse of \\(g\\) on the linear predictor \\(\\eta_i\\)\n\\[\ng^{-1}(\\eta_i) = g^{-1}(\\mathbf{x}_i^T\\mathbf{\\beta}) = \\mu_i\n\\]\nIn Bambi, we can use family=\"bernoulli\" to tell we are modeling a binary variable that follows a Bernoulli distribution and our random component is of the form\n\\[\nY_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{with probability } \\pi_i \\\\\n 0 & \\textrm{with probability } 1 - \\pi_i\n \\end{array}\n\\right.\n\\]\nthat has a mean \\(\\mu_i\\) equal to the probability of success \\(\\pi_i\\).\nBy default, this family implies \\(g\\) is the logit function.\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{logit}(\\pi_i) = \\log{\\left( \\frac{\\pi_i}{1 - \\pi_i} \\right)} = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = \\frac{1}{1 + e^{-\\eta}} = \\pi_i\n\\end{array}\n\\]\nBut there are other options available, like the probit and the cloglog link functions.\nThe probit function is the inverse of the cumulative density function of a standard Gaussian distribution\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{probit}(\\pi_i) = \\Phi^{-1}(\\pi_i) = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = \\Phi(\\eta_i) = \\pi_i\n\\end{array}\n\\]\nAnd with the cloglog link function we have\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{cloglog}(\\pi_i) = \\log(-\\log(1 - \\pi)) = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = 1 - \\exp(-\\exp(\\eta_i)) = \\pi_i\n\\end{array}\n\\]\ncloglog stands for complementary log-log and \\(g^{-1}\\) is the cumulative density function of the extreme minimum value distribution.\nLet’s plot them to better understand the implications of what we’re saying.\n\ndef invcloglog(x):\n return 1 - np.exp(-np.exp(x))\n\n\nx = np.linspace(-5, 5, num=200)\n\n# inverse of the logit function\nlogit = invlogit(x)\n\n# cumulative density function of standard gaussian\nprobit = norm.cdf(x)\n\n# inverse of the cloglog function\ncloglog = invcloglog(x)\n\nplt.plot(x, logit, color=\"C0\", lw=2, label=\"Logit\")\nplt.plot(x, probit, color=\"C1\", lw=2, label=\"Probit\")\nplt.plot(x, cloglog, color=\"C2\", lw=2, label=\"CLogLog\")\nplt.axvline(0, c=\"k\", alpha=0.5, ls=\"--\")\nplt.axhline(0.5, c=\"k\", alpha=0.5, ls=\"--\")\nplt.xlabel(r\"$x$\")\nplt.ylabel(r\"$\\pi$\")\nplt.legend();\n\n\n\n\nIn the plot above we can see both the logit and the probit links are symmetric in terms of their slopes at \\(-x\\) and \\(x\\). We can say the function approaches \\(\\pi = 0.5\\) at the same rate as it moves away from it. However, these two functions differ in their tails. The probit link approaches 0 and 1 faster than the logit link as we move away from \\(x=0\\). Just see the orange line is below the blue one for \\(x < 0\\) and it is above for \\(x > 0\\). In other words, the logit function has heavier tails than the probit.\nOn the other hand, the cloglog does not present this symmetry, and we can clearly see it since the green line does not cross the point (0, 0.5). 
This function approaches faster the 1 than 0 as we move away from \\(x=0\\).\n\n\n\nWe use a data set consisting of the numbers of beetles dead after five hours of exposure to gaseous carbon disulphide at various concentrations. This data can be found in An Introduction to Generalized Linear Models by A. J. Dobson and A. G. Barnett, but the original source is (Bliss, 1935).\n\n\n\n\n\n\n\n\nDose, \\(x_i\\) (\\(\\log_{10}\\text{CS}_2\\text{mgl}^{-1}\\))\nNumber of beetles, \\(n_i\\)\nNumber killed, \\(y_i\\)\n\n\n\n\n1.6907\n59\n6\n\n\n1.7242\n60\n13\n\n\n1.7552\n62\n18\n\n\n1.7842\n56\n28\n\n\n1.8113\n63\n52\n\n\n1.8369\n59\n53\n\n\n1.8610\n62\n61\n\n\n1.8839\n60\n60\n\n\n\nWe create a data frame where the data is in long format (i.e. each row is an observation with a 0-1 outcome).\n\nx = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])\nn = np.array([59, 60, 62, 56, 63, 59, 62, 60])\ny = np.array([6, 13, 18, 28, 52, 53, 61, 60])\n\ndata = pd.DataFrame({\"x\": x, \"n\": n, \"y\": y})\n\n\n\n\nBambi has two families to model binary data: Bernoulli and Binomial. The first one can be used when each row represents a single observation with a column containing the binary outcome, while the second is used when each row represents a group of observations or realizations and there’s one column for the number of successes and another column for the number of trials.\nSince we have aggregated data, we’re going to use the Binomial family. This family requires using the function proportion(y, n) on the left side of the model formula to indicate we want to model the proportion between two variables. This function can be replaced by any of its aliases prop(y, n) or p(y, n). Let’s use the shortest one here.\n\nformula = \"p(y, n) ~ x\"\n\n\n\nThe logit link is the default link when we say family=\"binomial\", so there’s no need to add it.\n\nmodel_logit = bmb.Model(formula, data, family=\"binomial\")\nidata_logit = model_logit.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n\n\n\n\n\n\nmodel_probit = bmb.Model(formula, data, family=\"binomial\", link=\"probit\")\nidata_probit = model_probit.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:05<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n\n\n\n\n\n\nmodel_cloglog = bmb.Model(formula, data, family=\"binomial\", link=\"cloglog\")\nidata_cloglog = model_cloglog.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 4 seconds.\n\n\n\n\n\n\nWe can use the samples from the posteriors to see the mean estimate for the probability of dying at each concentration level. To do so, we use a little helper function that will help us to write less code. 
This function leverages the power of the new Model.predict() method that is helpful to obtain both in-sample and out-of-sample predictions.\n\ndef get_predictions(model, idata, seq):\n # Create a data frame with the new data\n new_data = pd.DataFrame({\"x\": seq})\n \n # Predict probability of dying using out of sample data\n model.predict(idata, data=new_data)\n \n # Get posterior mean across all chains and draws\n mu = idata.posterior[\"p(y, n)_mean\"].mean((\"chain\", \"draw\"))\n return mu\n\n\nx_seq = np.linspace(1.6, 2, num=200)\n\nmu_logit = get_predictions(model_logit, idata_logit, x_seq)\nmu_probit = get_predictions(model_probit, idata_probit, x_seq)\nmu_cloglog = get_predictions(model_cloglog, idata_cloglog, x_seq)\n\n\nplt.scatter(x, y / n, c = \"white\", edgecolors = \"black\", s=100)\nplt.plot(x_seq, mu_logit, lw=2, label=\"Logit\")\nplt.plot(x_seq, mu_probit, lw=2, label=\"Probit\")\nplt.plot(x_seq, mu_cloglog, lw=2, label=\"CLogLog\")\nplt.axhline(0.5, c=\"k\", alpha=0.5, ls=\"--\")\nplt.xlabel(r\"Dose $\\log_{10}CS_2mgl^{-1}$\")\nplt.ylabel(\"Probability of death\")\nplt.legend();\n\n\n\n\nIn this example, we can see the models using the logit and probit link functions present very similar estimations. With these particular data, all the three link functions fit the data well and the results do not differ significantly. However, there can be scenarios where the results are more sensitive to the choice of the link function.\nReferences\nBliss, C. I. (1935). The calculation of the dose-mortality curve. Annals of Applied Biology 22, 134–167\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\nnumpy : 1.23.5\nbambi : 0.9.3\nmatplotlib: 3.6.2\npandas : 1.5.2\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/splines_cherry_blossoms.html", - "href": "notebooks/splines_cherry_blossoms.html", + "objectID": "notebooks/categorical_regression.html", + "href": "notebooks/categorical_regression.html", "title": "Bambi", "section": "", - "text": "This example shows how to specify and fit a spline regression in Bambi. This example is based on this example from the PyMC docs.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nRichard McElreath popularized the Cherry Blossom dataset in the second edition of his excellent book Statistical Rethinking. This data represents the day in the year when the first bloom is observed for Japanese cherry blossoms between years 801 and 2015. In his book, Richard McElreath uses this dataset to introduce Basis Splines, or B-Splines in short.\nHere we use Bambi to fit a linear model using B-Splines with the Cherry Blossom data. 
This dataset can be loaded with Bambi as follows:\n\ndata = bmb.load_data(\"cherry_blossoms\")\ndata\n\n\n\n\n\n \n \n \n year\n doy\n temp\n temp_upper\n temp_lower\n \n \n \n \n 0\n 801\n NaN\n NaN\n NaN\n NaN\n \n \n 1\n 802\n NaN\n NaN\n NaN\n NaN\n \n \n 2\n 803\n NaN\n NaN\n NaN\n NaN\n \n \n 3\n 804\n NaN\n NaN\n NaN\n NaN\n \n \n 4\n 805\n NaN\n NaN\n NaN\n NaN\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1210\n 2011\n 99.0\n NaN\n NaN\n NaN\n \n \n 1211\n 2012\n 101.0\n NaN\n NaN\n NaN\n \n \n 1212\n 2013\n 93.0\n NaN\n NaN\n NaN\n \n \n 1213\n 2014\n 94.0\n NaN\n NaN\n NaN\n \n \n 1214\n 2015\n 93.0\n NaN\n NaN\n NaN\n \n \n\n1215 rows × 5 columns\n\n\n\nThe variable we are interested in modeling is \"doy\", which stands for Day of Year. Also notice this variable contains several missing value which are discarded next.\n\ndata = data.dropna(subset=[\"doy\"]).reset_index(drop=True)\ndata.shape\n\n(827, 5)\n\n\n\n\n\nLet’s get started by creating a scatterplot to explore the values of \"doy\" for each year in the dataset.\n\n# We create a function because this plot is going to be used again later\ndef plot_scatter(data, figsize=(10, 6)):\n _, ax = plt.subplots(figsize=figsize)\n ax.scatter(data[\"year\"], data[\"doy\"], alpha=0.4, s=30)\n ax.set_title(\"Day of the first bloom per year\")\n ax.set_xlabel(\"Year\")\n ax.set_ylabel(\"Days of the first bloom\")\n return ax\n\n\nplot_scatter(data);\n\n\n\n\nWe can observe the day of the first bloom ranges between 85 and 125 approximately, which correspond to late March and early May respectively. On average, the first bloom occurs on the 105th day of the year, which is middle April.\n\n\n\nThe spline will have 15 knots. These knots are the boundaries of the basis functions. These knots split the range of the \"year\" variable into 16 contiguous sections. The basis functions make up a piecewise continuous polynomial, and so they are enforced to meet at the knots. We use the default degree for each piecewise polynomial, which is 3. The result is known as a cubic spline.\nBecause of using quantiles and not having observations for all the years in the time window under study, the knots are distributed unevenly over the range of \"year\" in such a way that the same proportion of values fall between each section.\n\nnum_knots = 15\nknots = np.quantile(data[\"year\"], np.linspace(0, 1, num_knots))\n\n\ndef plot_knots(knots, ax):\n for knot in knots:\n ax.axvline(knot, color=\"0.1\", alpha=0.4)\n return ax\n\n\nax = plot_scatter(data)\nplot_knots(knots, ax);\n\n\n\n\nThe previous chart makes it easy to see the knots, represented by the vertical lines, are spaced unevenly over the years.\n\n\n\nThe B-spline model we are about to create is simply a linear regression model with synthetic predictor variables. These predictors are the basis functions that are derived from the original year predictor.\nIn math notation, we usa a \\(\\text{Normal}\\) distribution for the conditional distribution of \\(Y\\) when \\(X = x_i\\), i.e. \\(Y_i\\), the distribution of the day of the first bloom in a given year.\n\\[\nY_i \\sim \\text{Normal}(\\mu_i, \\sigma)\n\\]\nSo far, this looks like a regular linear regression model. The next line is where the spline comes into play:\n\\[\n\\mu_i = \\alpha + \\sum_{k=1}^K{w_kB_{k, i}}\n\\]\nThe line above tells that for each observation \\(i\\), the mean is influenced by all the basis functions (going from \\(k=1\\) to \\(k=K\\)), plus an intercept \\(\\alpha\\). 
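In code, that summation is nothing more than a matrix-vector product: if a matrix holds the basis functions evaluated at each observation and a vector holds the weights, the mean is the intercept plus their product. The following toy sketch uses made-up numbers, not the actual basis built later in this notebook, and only shows the shapes involved.

import numpy as np

rng = np.random.default_rng(0)
B_toy = rng.uniform(size=(5, 3))    # 5 observations evaluated on 3 basis functions
w_toy = np.array([1.0, -2.0, 0.5])  # one weight per basis function
alpha = 100.0                       # intercept

mu = alpha + B_toy @ w_toy          # one mean per observation
print(mu.shape)                     # (5,)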
The \\(w_k\\) values in the summation are the regression coefficients of each of the basis functions, and the \\(B_k\\) are the values of the basis functions.\nFinally, we will be using the following priors\n\\[\n\\begin{aligned}\n\\alpha & \\sim \\text{Normal}(100, 10) \\\\\nw_j & \\sim \\text{Normal}(0, 10)\\\\\n\\sigma & \\sim \\text{Exponential(1)}\n\\end{aligned}\n\\]\nwhere \\(j\\) indexes each of the contiguous sections given by the knots\n\n# We only pass the internal knots to the `bs()` function.\niknots = knots[1:-1]\n\n# Define dictionary of priors\npriors = {\n \"Intercept\": bmb.Prior(\"Normal\", mu=100, sigma=10),\n \"common\": bmb.Prior(\"Normal\", mu=0, sigma=10), \n \"sigma\": bmb.Prior(\"Exponential\", lam=1)\n}\n\n# Define model\n# The intercept=True means the basis also spans the intercept, as originally done in the book example.\nmodel = bmb.Model(\"doy ~ bs(year, knots=iknots, intercept=True)\", data, priors=priors)\nmodel\n\n Formula: doy ~ bs(year, knots=iknots, intercept=True)\n Family: gaussian\n Link: mu = identity\n Observations: 827\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 100.0, sigma: 10.0)\n bs(year, knots=iknots, intercept=True) ~ Normal(mu: 0.0, sigma: 10.0)\n \n Auxiliary parameters\n sigma ~ Exponential(lam: 1.0)\n\n\nLet’s create a function to plot each of the basis functions in the model.\n\ndef plot_spline_basis(basis, year, figsize=(10, 6)):\n df = (\n pd.DataFrame(basis)\n .assign(year=year)\n .melt(\"year\", var_name=\"basis_idx\", value_name=\"value\")\n )\n\n _, ax = plt.subplots(figsize=figsize)\n\n for idx in df.basis_idx.unique():\n d = df[df.basis_idx == idx]\n ax.plot(d[\"year\"], d[\"value\"])\n \n return ax\n\nBelow, we create a chart to visualize the b-spline basis. The overlap between the functions means that, at any given point in time, the regression function is influenced by more than one basis function. For example, if we look at the year 1200, we can see the regression line is going to be influenced mostly by the violet and brown functions, and to a lesser extent by the green and cyan ones. In summary, this is what enables us to capture local patterns in a smooth fashion.\n\nB = model.response_component.design.common[\"bs(year, knots=iknots, intercept=True)\"]\nax = plot_spline_basis(B, data[\"year\"].values)\nplot_knots(knots, ax);\n\n\n\n\n\n\n\nNow we fit the model. In Bambi, it is as easy as calling the .fit() method on the Model instance.\n\n# The seed is to make results reproducible\nidata = model.fit(random_seed=SEED, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [doy_sigma, Intercept, bs(year, knots=iknots, intercept=True)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:32<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 33 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\n\n\nIt is always good to use az.summary() to verify parameter estimates as well as effective sample sizes and R hat values. In this case, the main goal is not to interpret the coefficients of the basis spline, but analyze the ess and r_hat diagnostics. In first place, effective sample sizes don’t look impressively high. Most of them are between 300 and 700, which is low compared to the 2000 draws obtained. 
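If we prefer to query these diagnostics programmatically instead of scanning the full table below, ArviZ can return just that part of the summary. This is an optional aside applied to the idata object fitted above; az.summary with kind="diagnostics" and az.ess are generic ArviZ calls, not something specific to this model.

# Only the sampling diagnostics (mcse, ess_bulk, ess_tail, r_hat) for each parameter
az.summary(idata, kind="diagnostics")

# Or pull the bulk effective sample sizes directly as an xarray Dataset
az.ess(idata)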
The only exception is the residual standard deviation sigma. Finally, the r_hat diagnostic is not always 1 for all the parameters, indicating there may be some issues with the mix of the chains.\n\naz.summary(idata)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 103.387\n 2.444\n 98.582\n 107.719\n 0.131\n 0.093\n 348.0\n 540.0\n 1.01\n \n \n bs(year, knots=iknots, intercept=True)[0]\n -3.074\n 3.819\n -10.477\n 3.705\n 0.127\n 0.090\n 908.0\n 1319.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[1]\n -0.841\n 3.949\n -8.290\n 6.242\n 0.146\n 0.103\n 739.0\n 1089.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[2]\n -1.167\n 3.662\n -8.245\n 5.517\n 0.140\n 0.099\n 685.0\n 935.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[3]\n 4.810\n 2.987\n -0.362\n 10.721\n 0.135\n 0.096\n 487.0\n 915.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[4]\n -0.881\n 2.970\n -6.245\n 4.759\n 0.137\n 0.097\n 472.0\n 951.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[5]\n 4.277\n 2.963\n -0.901\n 9.904\n 0.134\n 0.095\n 488.0\n 1104.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[6]\n -5.350\n 2.883\n -11.223\n -0.312\n 0.137\n 0.097\n 439.0\n 870.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[7]\n 7.786\n 2.813\n 2.161\n 13.013\n 0.129\n 0.091\n 477.0\n 842.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[8]\n -1.017\n 2.977\n -6.426\n 4.689\n 0.141\n 0.100\n 445.0\n 697.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[9]\n 2.927\n 2.958\n -2.100\n 9.282\n 0.136\n 0.096\n 474.0\n 809.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[10]\n 4.693\n 2.990\n -0.911\n 10.137\n 0.137\n 0.097\n 477.0\n 837.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[11]\n -0.246\n 2.943\n -5.760\n 5.126\n 0.133\n 0.094\n 490.0\n 908.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[12]\n 5.548\n 2.984\n 0.328\n 11.413\n 0.140\n 0.099\n 451.0\n 837.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[13]\n 0.653\n 3.115\n -4.897\n 6.839\n 0.132\n 0.094\n 557.0\n 933.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[14]\n -0.778\n 3.345\n -7.165\n 5.314\n 0.142\n 0.101\n 551.0\n 981.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[15]\n -7.039\n 3.527\n -13.975\n -0.638\n 0.137\n 0.097\n 667.0\n 1021.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[16]\n -7.711\n 3.293\n -14.579\n -2.133\n 0.135\n 0.095\n 595.0\n 1090.0\n 1.00\n \n \n doy_sigma\n 5.944\n 0.143\n 5.671\n 6.198\n 0.003\n 0.002\n 3031.0\n 1497.0\n 1.00\n \n \n\n\n\n\nWe can also use az.plot_trace() to visualize the marginal posteriors and the sampling paths. These traces show a stationary random pattern. If these paths were not random stationary, we would be concerned about the convergence of the chains.\n\naz.plot_trace(idata);\n\n\n\n\nNow we can visualize the fitted basis functions. In addition, we include a thicker black line that represents the dot product between \\(B\\) and \\(w\\). 
This is the contribution of the b-spline to the linear predictor in the model.\n\nposterior_stacked = az.extract(idata)\nwp = posterior_stacked[\"bs(year, knots=iknots, intercept=True)\"].mean(\"sample\").values\n\nax = plot_spline_basis(B * wp.T, data[\"year\"].values)\nax.plot(data.year.values, np.dot(B, wp.T), color=\"black\", lw=3)\nplot_knots(knots, ax);\n\n\n\n\n\n\n\nLet’s create a function to plot the predicted mean value as well as credible bands for it.\n\ndef plot_predictions(data, idata, model):\n # Create a test dataset with observations spanning the whole range of year\n new_data = pd.DataFrame({\"year\": np.linspace(data.year.min(), data.year.max(), num=500)})\n \n # Predict the day of first blossom\n model.predict(idata, data=new_data)\n\n posterior_stacked = az.extract_dataset(idata)\n # Extract these predictions\n y_hat = posterior_stacked[\"doy_mean\"]\n\n # Compute the mean of the predictions, plotted as a single line.\n y_hat_mean = y_hat.mean(\"sample\")\n\n # Compute 94% credible intervals for the predictions, plotted as bands\n hdi_data = np.quantile(y_hat, [0.03, 0.97], axis=1)\n\n # Plot obserevd data\n ax = plot_scatter(data)\n \n # Plot predicted line\n ax.plot(new_data[\"year\"], y_hat_mean, color=\"firebrick\")\n \n # Plot credibility bands\n ax.fill_between(new_data[\"year\"], hdi_data[0], hdi_data[1], alpha=0.4, color=\"firebrick\")\n \n # Add knots\n plot_knots(knots, ax)\n \n return ax\n\n\nplot_predictions(data, idata, model);\n\n/tmp/ipykernel_33590/2247671002.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior_stacked = az.extract_dataset(idata)\n\n\n\n\n\n\n\n\nWe can write linear regression models in matrix form as\n\\[\n\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta}\n\\]\nwhere \\(\\mathbf{y}\\) is the response column vector of shape \\((n, 1)\\). \\(\\mathbf{X}\\) is the design matrix that contains the values of the predictors for all the observations, of shape \\((n, p)\\). And \\(\\boldsymbol{\\beta}\\) is the column vector of regression coefficients of shape \\((n, 1)\\).\nBecause it’s not something that you’re supposed to consult regularly, Bambi does not expose the design matrix. However, with a some knowledge of the internals, it is possible to have access to it:\n\nnp.round(model.response_component.design.common.design_matrix, 3)\n\narray([[1. , 1. , 0. , ..., 0. , 0. , 0. ],\n [1. , 0.96 , 0.039, ..., 0. , 0. , 0. ],\n [1. , 0.767, 0.221, ..., 0. , 0. , 0. ],\n ...,\n [1. , 0. , 0. , ..., 0.002, 0.097, 0.902],\n [1. , 0. , 0. , ..., 0. , 0.05 , 0.95 ],\n [1. , 0. , 0. , ..., 0. , 0. , 1. ]])\n\n\nLet’s have a look at its shape:\n\nmodel.response_component.design.common.design_matrix.shape\n\n(827, 18)\n\n\n827 is the number of years we have data for, and 18 is the number of predictors/coefficients in the model. We have the first column of ones due to the Intercept term. Then, there are sixteen columns associated with the the basis functions. And finally, one extra column because we used span_intercept=True when calling the function bs() in the model formula.\nNow we could compute the rank of the design matrix to check whether all the columns are linearly independent.\n\nnp.linalg.matrix_rank(model.response_component.design.common.design_matrix)\n\n17\n\n\nSince \\(\\text{rank}(\\mathbf{X})\\) is smaller than the number of columns, we conclude the columns in \\(\\mathbf{X}\\) are not linearly independent.\nIf we have a second look at our code, we are going to figure out we’re spanning the intercept twice. 
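A quick way to see where the redundancy comes from: a full B-spline basis built with intercept=True sums to one at every observation, so its columns jointly reproduce the column of ones that the Intercept term already contributes. The check below is optional and reuses the B matrix extracted from the design a few cells above.

# Rows of the intercept-spanning basis add up to 1 (within numerical precision),
# which duplicates the intercept column of the design matrix
print(np.allclose(B.sum(axis=1), 1))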
The first time with the intercept term itself, and the second time in the spline basis.\nThis would have been a huge problem in a maximum likelihod estimation approach – we would have obtained an error instead of some parameter estimates. However, since we are doing Bayesian modeling, our priors ensured we obtain our regularized parameter estimates and everything seemed to work pretty well.\nNevertheless, we can still do better. Why would we want to span the intercept twice? Let’s create and fit the model again, this time without spanning the intercept in the spline basis.\n\n# Note we use the same priors\nmodel_new = bmb.Model(\"doy ~ bs(year, knots=iknots)\", data, priors=priors)\nidata_new = model_new.fit(random_seed=SEED, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [doy_sigma, Intercept, bs(year, knots=iknots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:31<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 32 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nAnd let’s have a look at the summary\n\naz.summary(idata_new)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 102.367\n 1.992\n 98.899\n 106.358\n 0.105\n 0.074\n 361.0\n 581.0\n 1.01\n \n \n bs(year, knots=iknots)[0]\n -0.849\n 3.999\n -8.142\n 6.704\n 0.164\n 0.116\n 591.0\n 930.0\n 1.00\n \n \n bs(year, knots=iknots)[1]\n 0.394\n 3.012\n -5.253\n 5.983\n 0.090\n 0.063\n 1132.0\n 1249.0\n 1.00\n \n \n bs(year, knots=iknots)[2]\n 5.707\n 2.712\n 0.074\n 10.305\n 0.120\n 0.085\n 510.0\n 1017.0\n 1.00\n \n \n bs(year, knots=iknots)[3]\n 0.216\n 2.467\n -4.358\n 4.849\n 0.103\n 0.073\n 571.0\n 1320.0\n 1.00\n \n \n bs(year, knots=iknots)[4]\n 5.237\n 2.711\n 0.104\n 10.568\n 0.118\n 0.084\n 526.0\n 789.0\n 1.00\n \n \n bs(year, knots=iknots)[5]\n -4.332\n 2.428\n -8.909\n 0.043\n 0.105\n 0.074\n 535.0\n 890.0\n 1.01\n \n \n bs(year, knots=iknots)[6]\n 8.788\n 2.546\n 3.669\n 13.310\n 0.112\n 0.079\n 518.0\n 854.0\n 1.01\n \n \n bs(year, knots=iknots)[7]\n 0.008\n 2.573\n -5.056\n 4.474\n 0.112\n 0.079\n 525.0\n 916.0\n 1.00\n \n \n bs(year, knots=iknots)[8]\n 3.980\n 2.745\n -0.716\n 9.394\n 0.112\n 0.079\n 597.0\n 927.0\n 1.00\n \n \n bs(year, knots=iknots)[9]\n 5.658\n 2.559\n 0.917\n 10.350\n 0.109\n 0.077\n 552.0\n 850.0\n 1.00\n \n \n bs(year, knots=iknots)[10]\n 0.801\n 2.655\n -4.092\n 5.842\n 0.112\n 0.079\n 565.0\n 956.0\n 1.00\n \n \n bs(year, knots=iknots)[11]\n 6.534\n 2.578\n 1.952\n 11.575\n 0.112\n 0.079\n 531.0\n 845.0\n 1.01\n \n \n bs(year, knots=iknots)[12]\n 1.703\n 2.772\n -3.154\n 7.363\n 0.114\n 0.081\n 591.0\n 1126.0\n 1.00\n \n \n bs(year, knots=iknots)[13]\n 0.190\n 3.076\n -5.277\n 6.077\n 0.115\n 0.081\n 722.0\n 1258.0\n 1.00\n \n \n bs(year, knots=iknots)[14]\n -6.026\n 3.162\n -11.645\n 0.206\n 0.122\n 0.086\n 672.0\n 1164.0\n 1.00\n \n \n bs(year, knots=iknots)[15]\n -6.715\n 3.005\n -12.485\n -1.229\n 0.118\n 0.084\n 641.0\n 1306.0\n 1.00\n \n \n doy_sigma\n 5.949\n 0.146\n 5.674\n 6.221\n 0.003\n 0.002\n 2287.0\n 1466.0\n 1.00\n \n \n\n\n\n\nThere are a couple of things to remark here\n\nThere are 16 coefficients associated with the b-spline now because we’re not spanning the intercept.\nThe ESS numbers have improved in all cases. 
Notice the sampler isn’t raising any warning about low ESS.\nr_hat coefficeints are still 1.\n\nWe can also compare the sampling times:\n\nidata.posterior.sampling_time\n\n32.5815589427948\n\n\n\nidata_new.posterior.sampling_time\n\n31.589828729629517\n\n\nSampling times are similar in this particular example. But in general, we expect the sampler to run faster when there aren’t structural dependencies in the design matrix.\nAnd what about predictions?\n\nplot_predictions(data, idata_new, model_new);\n\n/tmp/ipykernel_33590/2247671002.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior_stacked = az.extract_dataset(idata)\n\n\n\n\n\nAnd model comparison?\n\nmodels_dict = {\"Original\": idata, \"New\": idata_new}\ndf_compare = az.compare(models_dict)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n New\n 0\n -2657.859115\n 15.945629\n 0.000000\n 1.000000e+00\n 21.134973\n 0.000000\n False\n log\n \n \n Original\n 1\n -2658.359085\n 16.652034\n 0.499969\n 3.330669e-16\n 21.173433\n 0.561943\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nFinally let’s check influential points according to the k-hat value\n\n# Compute pointwise LOO\nloo_1 = az.loo(idata, pointwise=True)\nloo_2 = az.loo(idata_new, pointwise=True)\n\n/tmp/ipykernel_33590/3493983793.py:2: DeprecationWarning: `product` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `prod` instead.\n loo_1 = az.loo(idata, pointwise=True)\n/tmp/ipykernel_33590/3493983793.py:3: DeprecationWarning: `product` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `prod` instead.\n loo_2 = az.loo(idata_new, pointwise=True)\n\n\n\n# plot kappa values\naz.plot_khat(loo_1.pareto_k);\n\n\n\n\n\naz.plot_khat(loo_2.pareto_k);\n\n\n\n\n\n\n\nAnother option could have been to use stronger priors on the coefficients associated with the spline functions. For example, the example written in PyMC uses \\(\\text{Normal}(0, 3)\\) priors on them instead of \\(\\text{Normal}(0, 10)\\).\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 2.0.2\nbambi : 0.12.0.dev0\narviz : 0.14.0\nnumpy : 1.25.0\nmatplotlib: 3.6.2\n\nWatermark: 2.3.1" + "text": "In this example, we will use the categorical family to model outcomes with more than two categories. The examples in this notebook were constructed by Tomás Capretto, and assembled into this example by Tyler James Burch (@tjburch on GitHub).\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\n\nfrom matplotlib.lines import Line2D\n\n\nSEED = 1234\naz.style.use(\"arviz-darkgrid\")\n\nWhen modeling binary outcomes with Bambi, the Bernoulli family is used. 
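As a reminder, such a binary-outcome model is a one-liner in Bambi. The snippet below is only a sketch with simulated placeholder data; the data frame df and its columns are not part of this example.

import numpy as np
import pandas as pd
import bambi as bmb

# Placeholder data: one predictor and a 0/1 outcome (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = (df["x"] + rng.normal(size=100) > 0).astype(int)

binary_model = bmb.Model("y ~ x", df, family="bernoulli")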
The multivariate generalization of the Bernoulli family is the Categorical family, and with it, we can model an arbitrary number of outcome categories.\n\n\nTo start, we will create a toy dataset with three classes.\n\nrng = np.random.default_rng(SEED)\nx = np.hstack([rng.normal(m, s, size=50) for m, s in zip([-2.5, 0, 2.5], [1.2, 0.5, 1.2])])\ny = np.array([\"A\"] * 50 + [\"B\"] * 50 + [\"C\"] * 50)\n\ncolors = [\"C0\"] * 50 + [\"C1\"] * 50 + [\"C2\"] * 50\nplt.scatter(x, np.random.uniform(size=150), color=colors)\nplt.xlabel(\"x\")\nplt.ylabel(\"y\");\n\n\n\n\nHere we have 3 classes, generated from three normal distributions: \\(N(-2.5, 1.2)\\), \\(N(0, 0.5)\\), and \\(N(2.5, 1.2)\\). Creating a model to fit these distributions,\n\ndata = pd.DataFrame({\"y\": y, \"x\": x})\nmodel = bmb.Model(\"y ~ x\", data, family=\"categorical\")\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nNote that we pass the family=\"categorical\" argument to Bambi’s Model method in order to call the categorical family. Here, the response variable are strings (“A”, “B”, “C”), however they can also be pd.Categorical objects.\nNext we will use posterior predictions to visualize the mean class probability across the \\(x\\) spectrum.\n\nx_new = np.linspace(-5, 5, num=200)\nmodel.predict(idata, data=pd.DataFrame({\"x\": x_new}))\np = idata.posterior[\"y_mean\"].sel(draw=slice(0, None, 10))\n\n\nx_new = np.linspace(-5, 5, num=200)\nmodel.predict(idata, data=pd.DataFrame({\"x\": x_new}))\np = idata.posterior[\"y_mean\"].sel(draw=slice(0, None, 10))\n\nfor j, g in enumerate(\"ABC\"):\n plt.plot(x_new, p.sel({\"y_dim\":g}).stack(samples=(\"chain\", \"draw\")), color=f\"C{j}\", alpha=0.2)\n\nplt.xlabel(\"x\")\nplt.ylabel(\"y\");\n\n\n\n\nHere, we can notice that the probability phases between classes from left to right. At all points across \\(x\\), sum of the class probabilities is 1, since in our generative model, it must be one of these three outcomes.\n\n\n\nNext, we will look at the classic “iris” dataset, which contains samples from 3 different species of iris plants. Using properties of the plant, we will try to model its species.\n\niris = sns.load_dataset(\"iris\")\niris.head(3)\n\n\n\n\n\n \n \n \n sepal_length\n sepal_width\n petal_length\n petal_width\n species\n \n \n \n \n 0\n 5.1\n 3.5\n 1.4\n 0.2\n setosa\n \n \n 1\n 4.9\n 3.0\n 1.4\n 0.2\n setosa\n \n \n 2\n 4.7\n 3.2\n 1.3\n 0.2\n setosa\n \n \n\n\n\n\nThe dataset includes four different properties of the plants: it’s sepal length, sepal width, petal length, and petal width. 
There are 3 different class possibilities: setosa, versicolor, and virginica.\n\nsns.pairplot(iris, hue=\"species\");\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/seaborn/axisgrid.py:208: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n self._figure.subplots_adjust(right=right)\n\n\n\n\n\nWe can see the three species have several distinct characteristics, which our linear model can capture to distinguish between them.\n\nmodel = bmb.Model(\n \"species ~ sepal_length + sepal_width + petal_length + petal_width\", \n iris, \n family=\"categorical\",\n)\nidata = model.fit()\naz.summary(idata)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sepal_length, sepal_width, petal_length, petal_width]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:21<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 21 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept[versicolor]\n -6.751\n 7.897\n -21.261\n 8.474\n 0.214\n 0.156\n 1369.0\n 1374.0\n 1.0\n \n \n Intercept[virginica]\n -22.546\n 9.566\n -40.257\n -5.208\n 0.229\n 0.164\n 1761.0\n 1556.0\n 1.0\n \n \n sepal_length[versicolor]\n 3.140\n 1.690\n 0.049\n 6.365\n 0.053\n 0.037\n 1031.0\n 1124.0\n 1.0\n \n \n sepal_length[virginica]\n 2.361\n 1.754\n -0.823\n 5.755\n 0.055\n 0.040\n 1020.0\n 974.0\n 1.0\n \n \n sepal_width[versicolor]\n -4.777\n 1.967\n -8.792\n -1.408\n 0.063\n 0.046\n 973.0\n 1096.0\n 1.0\n \n \n sepal_width[virginica]\n -6.681\n 2.368\n -11.597\n -2.590\n 0.076\n 0.055\n 974.0\n 909.0\n 1.0\n \n \n petal_length[versicolor]\n 1.060\n 0.915\n -0.630\n 2.735\n 0.027\n 0.019\n 1187.0\n 1316.0\n 1.0\n \n \n petal_length[virginica]\n 3.986\n 1.071\n 1.972\n 5.882\n 0.029\n 0.021\n 1340.0\n 1187.0\n 1.0\n \n \n petal_width[versicolor]\n 1.905\n 2.024\n -1.927\n 5.871\n 0.060\n 0.045\n 1153.0\n 1113.0\n 1.0\n \n \n petal_width[virginica]\n 9.021\n 2.247\n 5.098\n 13.457\n 0.063\n 0.046\n 1264.0\n 1198.0\n 1.0\n \n \n\n\n\n\n\naz.plot_trace(idata);\n\n\n\n\nWe can see that this has fit quite nicely. You’ll notice there are \\(n-1\\) parameters to fit, where \\(n\\) is the number of categories. In the minimal binary case, recall there’s only one parameter set, since it models probability \\(p\\) of being in a class, and probability \\(1-p\\) of being in the other class. Using the categorical distribution, this extends, so we have \\(p_1\\) for class 1, \\(p_2\\) for class 2, and \\(1-(p_1+p_2)\\) for the final class.\n\n\n\nNext we will look at an example from chapter 8 of Alan Agresti’s Categorical Data Analysis, looking at the primary food choice for 64 alligators caught in Lake George, Florida. 
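Before fitting it, a brief aside on how the n - 1 sets of coefficients mentioned above turn into n probabilities: the linear predictor of the reference category is pinned at zero and a softmax maps the predictors to probabilities. The numbers below are made up purely for illustration.

import numpy as np

# Linear predictors for a single observation: reference category fixed at 0
eta = np.array([0.0, 1.2, -0.4])
probs = np.exp(eta) / np.exp(eta).sum()  # softmax
print(probs, probs.sum())                # the probabilities add up to 1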
We will use their length (a continuous variable) and sex (a categorical variable) as predictors to model their food choice.\nFirst, reproducing the dataset,\n\nlength = [\n 1.3, 1.32, 1.32, 1.4, 1.42, 1.42, 1.47, 1.47, 1.5, 1.52, 1.63, 1.65, 1.65, 1.65, 1.65,\n 1.68, 1.7, 1.73, 1.78, 1.78, 1.8, 1.85, 1.93, 1.93, 1.98, 2.03, 2.03, 2.31, 2.36, 2.46,\n 3.25, 3.28, 3.33, 3.56, 3.58, 3.66, 3.68, 3.71, 3.89, 1.24, 1.3, 1.45, 1.45, 1.55, 1.6, \n 1.6, 1.65, 1.78, 1.78, 1.8, 1.88, 2.16, 2.26, 2.31, 2.36, 2.39, 2.41, 2.44, 2.56, 2.67, \n 2.72, 2.79, 2.84\n]\nchoice = [\n \"I\", \"F\", \"F\", \"F\", \"I\", \"F\", \"I\", \"F\", \"I\", \"I\", \"I\", \"O\", \"O\", \"I\", \"F\", \"F\", \n \"I\", \"O\", \"F\", \"O\", \"F\", \"F\", \"I\", \"F\", \"I\", \"F\", \"F\", \"F\", \"F\", \"F\", \"O\", \"O\", \n \"F\", \"F\", \"F\", \"F\", \"O\", \"F\", \"F\", \"I\", \"I\", \"I\", \"O\", \"I\", \"I\", \"I\", \"F\", \"I\", \n \"O\", \"I\", \"I\", \"F\", \"F\", \"F\", \"F\", \"F\", \"F\", \"F\", \"O\", \"F\", \"I\", \"F\", \"F\"\n]\n\nsex = [\"Male\"] * 32 + [\"Female\"] * 31\ndata = pd.DataFrame({\"choice\": choice, \"length\": length, \"sex\": sex})\ndata[\"choice\"] = pd.Categorical(\n data[\"choice\"].map({\"I\": \"Invertebrates\", \"F\": \"Fish\", \"O\": \"Other\"}), \n [\"Other\", \"Invertebrates\", \"Fish\"], \n ordered=True\n)\ndata.head(3)\n\n\n\n\n\n \n \n \n choice\n length\n sex\n \n \n \n \n 0\n Invertebrates\n 1.30\n Male\n \n \n 1\n Fish\n 1.32\n Male\n \n \n 2\n Fish\n 1.32\n Male\n \n \n\n\n\n\nNext, constructing the model,\n\nmodel = bmb.Model(\"choice ~ length + sex\", data, family=\"categorical\")\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, length, sex]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe can then look at how the food choices vary by length for both male and female alligators.\n\nnew_length = np.linspace(1, 4)\nnew_data = pd.DataFrame({\"length\": np.tile(new_length, 2), \"sex\": [\"Male\"] * 50 + [\"Female\"] * 50})\nmodel.predict(idata, data=new_data)\np = idata.posterior[\"choice_mean\"]\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 5))\nchoices = [\"Other\", \"Invertebrates\", \"Fish\"]\n\nfor j, choice in enumerate(choices):\n males = p.sel({\"choice_dim\":choice, \"choice_obs\":slice(0, 49)})\n females = p.sel({\"choice_dim\":choice, \"choice_obs\":slice(50, 100)})\n axes[0].plot(new_length, males.mean((\"chain\", \"draw\")), color=f\"C{j}\", lw=2)\n axes[1].plot(new_length, females.mean((\"chain\", \"draw\")), color=f\"C{j}\", lw=2)\n az.plot_hdi(new_length, males, color=f\"C{j}\", ax=axes[0])\n az.plot_hdi(new_length, females, color=f\"C{j}\", ax=axes[1])\n\naxes[0].set_title(\"Male\")\naxes[1].set_title(\"Female\")\n\nhandles = [Line2D([], [], color=f\"C{j}\", label=choice) for j, choice in enumerate(choices)]\nfig.subplots_adjust(left=0.05, right=0.975, bottom=0.075, top=0.85)\n\nfig.legend(\n handles,\n choices,\n loc=\"center right\",\n ncol=3,\n bbox_to_anchor=(0.99, 0.95),\n bbox_transform=fig.transFigure\n);\n\n/tmp/ipykernel_30893/358310275.py:21: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n 
fig.subplots_adjust(left=0.05, right=0.975, bottom=0.075, top=0.85)\n\n\n\n\n\nHere we can see that the larger male and female alligators are, the less of a taste they have for invertebrates, and far prefer fish. Additionally, males seem to have a higher propensity to consume “other” foods compared to females at any size. Of note, the posterior means predicted by Bambi contain information about all \\(n\\) categories (despite having only \\(n-1\\) coefficients), so we can directly construct this plot, rather than manually calculating \\(1-(p_1+p_2)\\) for the third class.\nLast, we can make a posterior predictive plot,\n\nmodel.predict(idata, kind=\"pps\")\n\nax = az.plot_ppc(idata)\nax.set_xticks([0.5, 1.5, 2.5])\nax.set_xticklabels(model.response_component.response_term.levels)\nax.set_xlabel(\"Choice\");\nax.set_ylabel(\"Probability\");\n\n\n\n\nwhich depicts posterior predicted probability for each possible food choice for an alligator, which reinforces fish being the most likely food choice, followed by invertebrates.\n\n\nAgresti, A. (2013) Categorical Data Analysis. 3rd Edition, John Wiley & Sons Inc., Hoboken.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nbambi : 0.12.0.dev0\npandas : 2.0.2\nnumpy : 1.25.0\nmatplotlib: 3.6.2\nseaborn : 0.12.2\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/test_sample_new_groups.html", - "href": "notebooks/test_sample_new_groups.html", + "objectID": "notebooks/model_comparison.html", + "href": "notebooks/model_comparison.html", "title": "Bambi", "section": "", - "text": "NOTE This notebook is not part of the documentation. It’s not meant to be in the webpage. It’s something I wrote when I was testing the new functionality and I think it’s nice to have it handy.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\ndata = bmb.load_data(\"sleepstudy\")\n\n\ndata.head()\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n 308\n \n \n 1\n 258.7047\n 1\n 308\n \n \n 2\n 250.8006\n 2\n 308\n \n \n 3\n 321.4398\n 3\n 308\n \n \n 4\n 356.8519\n 4\n 308\n \n \n\n\n\n\n\nmodel = bmb.Model(\"Reaction ~ 1 + Days + (1 + Days | Subject)\", data)\nmodel\n\n Formula: Reaction ~ 1 + Days + (1 + Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0.0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 56.1721)\n\n\n\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Reaction_sigma, Intercept, Days, 1|Subject_sigma, 1|Subject_offset, Days|Subject_sigma, Days|Subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 15 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\ndf_new = data.head(10).reset_index(drop=True)\ndf_new[\"Subject\"] = \"xxx\"\ndf_new = pd.concat([df_new, data.head(10)])\ndf_new = 
df_new.reset_index(drop=True)\ndf_new\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n xxx\n \n \n 1\n 258.7047\n 1\n xxx\n \n \n 2\n 250.8006\n 2\n xxx\n \n \n 3\n 321.4398\n 3\n xxx\n \n \n 4\n 356.8519\n 4\n xxx\n \n \n 5\n 414.6901\n 5\n xxx\n \n \n 6\n 382.2038\n 6\n xxx\n \n \n 7\n 290.1486\n 7\n xxx\n \n \n 8\n 430.5853\n 8\n xxx\n \n \n 9\n 466.3535\n 9\n xxx\n \n \n 10\n 249.5600\n 0\n 308\n \n \n 11\n 258.7047\n 1\n 308\n \n \n 12\n 250.8006\n 2\n 308\n \n \n 13\n 321.4398\n 3\n 308\n \n \n 14\n 356.8519\n 4\n 308\n \n \n 15\n 414.6901\n 5\n 308\n \n \n 16\n 382.2038\n 6\n 308\n \n \n 17\n 290.1486\n 7\n 308\n \n \n 18\n 430.5853\n 8\n 308\n \n \n 19\n 466.3535\n 9\n 308\n \n \n\n\n\n\n\np = model.predict(idata, data=df_new, inplace=False, sample_new_groups=True)\n\nreaction_draws = p.posterior[\"Reaction_mean\"]\nmean = reaction_draws.mean((\"chain\", \"draw\")).to_numpy()\nbounds = reaction_draws.quantile((0.025, 0.975), (\"chain\", \"draw\")).to_numpy()\n\n\nfig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)\n\naxes[0].scatter(df_new.iloc[10:][\"Days\"], df_new.iloc[10:][\"Reaction\"])\naxes[1].scatter(df_new.iloc[:10][\"Days\"], df_new.iloc[:10][\"Reaction\"])\n\naxes[0].fill_between(np.arange(10), bounds[0, 10:], bounds[1, 10:], alpha=0.5, color=\"C0\")\naxes[1].fill_between(np.arange(10), bounds[0, :10], bounds[1, :10], alpha=0.5, color=\"C0\")\n\naxes[0].set_title(\"Original participant\")\naxes[1].set_title(\"New participant\");\n\n\n\n\n\n\ndata = pd.read_csv(\"../../tests/data/crossed_random.csv\")\ndata[\"subj\"] = data[\"subj\"].astype(str)\ndata.head()\n\n\n\n\n\n \n \n \n Unnamed: 0\n subj\n item\n site\n Y\n continuous\n dummy\n threecats\n \n \n \n \n 0\n 0\n 0\n 0\n 0\n 0.276766\n 0.929616\n 0\n a\n \n \n 1\n 1\n 1\n 0\n 0\n -0.058104\n 0.008388\n 0\n a\n \n \n 2\n 2\n 2\n 0\n 1\n -6.847861\n 0.439645\n 0\n a\n \n \n 3\n 3\n 3\n 0\n 1\n 12.474619\n 0.596366\n 0\n a\n \n \n 4\n 4\n 4\n 0\n 2\n -0.426047\n 0.709510\n 0\n a\n \n \n\n\n\n\n\nformula = \"Y ~ 0 + threecats + (0 + threecats | subj)\"\nmodel = bmb.Model(formula, data)\nmodel\n\n Formula: Y ~ 0 + threecats + (0 + threecats | subj)\n Family: gaussian\n Link: mu = identity\n Observations: 120\n Priors: \n target = mu\n Common-level effects\n threecats ~ Normal(mu: [0. 0. 
0.], sigma: [31.1617 31.1617 31.1617])\n \n Group-level effects\n threecats|subj ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: [31.1617 31.1617 31.1617]))\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 5.8759)\n\n\n\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Y_sigma, threecats, threecats|subj_sigma, threecats|subj_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:08<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 8 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\nnew_data = pd.DataFrame(\n {\n \"threecats\": [\"a\", \"a\"],\n \"subj\": [\"0\", \"11\"]\n }\n)\nnew_data\n\n\n\n\n\n \n \n \n threecats\n subj\n \n \n \n \n 0\n a\n 0\n \n \n 1\n a\n 11\n \n \n\n\n\n\n\np1 = model.predict(idata, data=new_data, inplace=False, sample_new_groups=True)\n\n\nfig, axes = plt.subplots(2, 1, figsize=(7, 9), sharex=True)\n\ny1_grs = p1.posterior[\"Y_mean\"].sel(Y_obs=0).to_numpy().flatten()\ny2_grs = p1.posterior[\"Y_mean\"].sel(Y_obs=1).to_numpy().flatten()\n\naxes[0].hist(y1_grs, bins=20);\naxes[1].hist(y2_grs, bins=20);\n\n\n\n\n\n\ninhaler = pd.read_csv(\"../../tests/data/inhaler.csv\")\ninhaler[\"rating\"] = pd.Categorical(inhaler[\"rating\"], categories=[1, 2, 3, 4])\ninhaler[\"treat\"] = pd.Categorical(inhaler[\"treat\"])\n\nmodel = bmb.Model(\n \"rating ~ 1 + period + treat + (1 + treat|subject)\", inhaler, family=\"categorical\"\n)\nidata = model.fit(tune=200, draws=200)\n\nOnly 200 samples in chain.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, period, treat, 1|subject_sigma, 1|subject_offset, treat|subject_sigma, treat|subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [800/800 00:11<00:00 Sampling 2 chains, 1 divergences]\n \n \n\n\nSampling 2 chains for 200 tune and 200 draw iterations (400 + 400 draws total) took 12 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\ndf_new = inhaler.head(2).reset_index(drop=True)\ndf_new[\"subject\"] = [1, 999]\ndf_new\n\n\n\n\n\n \n \n \n subject\n rating\n treat\n period\n carry\n \n \n \n \n 0\n 1\n 1\n 0.5\n 0.5\n 0\n \n \n 1\n 999\n 1\n 0.5\n 0.5\n 0\n \n \n\n\n\n\n\np = model.predict(idata, data=df_new, inplace=False, sample_new_groups=True)\n\n\nfig, axes = plt.subplots(2, 2, figsize=(12, 9))\nbins = np.linspace(0, 1, 20)\n\nfor i, ax in enumerate(axes.ravel()):\n x = p.posterior[\"rating_mean\"].sel({\"rating_dim\": f'{i + 1}'}).to_numpy()\n ax.hist(x[..., 0].flatten(), bins=bins, histtype=\"step\", color=\"C0\")\n ax.hist(x[..., 1].flatten(), bins=bins, histtype=\"step\", color=\"C1\")" + "text": "The adults dataset is comprised of census data from 1994 in United States.\nThe goal is to use demographic variables to predict whether an individual makes more than $50,000 per year.\nThe following is a description of the variables in the dataset.\n\nage: Individual’s age\nworkclass: Labor class.\nfnlwgt: It is not specified, but we guess it is a final sampling weight.\neducation: Education level as a categorical variable.\neducational_num: Education level as numerical variable. 
It does not reflect years of education.\nmarital_status: Marital status.\noccupation: Occupation.\nrelationship: Relationship with the head of household.\nrace: Individual’s race.\nsex: Individual’s sex.\ncapital_gain: Capital gain during unspecified period of time.\ncapital_loss: Capital loss during unspecified period of time.\nhs_week: Hours of work per week.\nnative_country: Country of birth.\nincome: Income as a binary variable (either below or above 50K per year).\n\nWe are only using the following variables in this example: income, sex, race, age, and hs_week. This subset is comprised of both categorical and numerical variables which allows us to visualize how to incorporate both types in a logistic regression model while helping to keep the analysis simpler.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport matplotlib.lines as mlines\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nimport warnings\n\nfrom scipy.special import expit as invlogit\n\n\n# Disable a FutureWarning in ArviZ at the moment of running the notebook\naz.style.use(\"arviz-darkgrid\")\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\n\ndata = bmb.load_data(\"adults\")\n\n\ndata.info()\ndata.head()\n\n\nRangeIndex: 32561 entries, 0 to 32560\nData columns (total 5 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 income 32561 non-null object\n 1 sex 32561 non-null object\n 2 race 32561 non-null object\n 3 age 32561 non-null int64 \n 4 hs_week 32561 non-null int64 \ndtypes: int64(2), object(3)\nmemory usage: 1.2+ MB\n\n\n\n\n\n\n \n \n \n income\n sex\n race\n age\n hs_week\n \n \n \n \n 0\n <=50K\n Male\n White\n 39\n 40\n \n \n 1\n <=50K\n Male\n White\n 50\n 13\n \n \n 2\n <=50K\n Male\n White\n 38\n 40\n \n \n 3\n <=50K\n Male\n Black\n 53\n 40\n \n \n 4\n <=50K\n Female\n Black\n 28\n 40\n \n \n\n\n\n\nCategorical variables are presented as from type object. In this step we convert them to category.\n\ncategorical_cols = data.columns[data.dtypes == object].tolist()\nfor col in categorical_cols:\n data[col] = data[col].astype(\"category\")\ndata.info()\n\n\nRangeIndex: 32561 entries, 0 to 32560\nData columns (total 5 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 income 32561 non-null category\n 1 sex 32561 non-null category\n 2 race 32561 non-null category\n 3 age 32561 non-null int64 \n 4 hs_week 32561 non-null int64 \ndtypes: category(3), int64(2)\nmemory usage: 604.7 KB\n\n\nInstead of going straight to fitting models, we’re going to do a some exploratory analysis of the variables in the dataset. 
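A quick numerical complement to the plots is to tabulate the share of each income level within the groups of interest. This optional snippet reuses the data frame loaded above together with pandas' crosstab and row-wise normalization.

# Proportion of each income level within sex and within race (rows sum to 1)
print(pd.crosstab(data["sex"], data["income"], normalize="index"))
print(pd.crosstab(data["race"], data["income"], normalize="index"))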
First we have some plots, and then some conclusions about the information in the plots.\n\n# Just a utilitary function to truncate labels and avoid overlapping in plots\ndef truncate_labels(ticklabels, width=8):\n def truncate(label, width):\n if len(label) > width - 3:\n return label[0 : (width - 4)] + \"...\"\n else:\n return label\n\n labels = [x.get_text() for x in ticklabels]\n labels = [truncate(lbl, width) for lbl in labels]\n\n return labels\n\n\nfig, axes = plt.subplots(3, 2, figsize=(12, 15))\nsns.countplot(x=\"income\", color=\"C0\", data=data, ax=axes[0, 0], saturation=1)\nsns.countplot(x=\"sex\", color=\"C0\", data=data, ax=axes[0, 1], saturation=1);\nsns.countplot(x=\"race\", color=\"C0\", data=data, ax=axes[1, 0], saturation=1);\naxes[1, 0].set_xticklabels(truncate_labels(axes[1, 0].get_xticklabels()))\naxes[1, 1].hist(data[\"age\"], bins=20);\naxes[1, 1].set_xlabel(\"Age\")\naxes[1, 1].set_ylabel(\"Count\")\naxes[2, 0].hist(data[\"hs_week\"], bins=20);\naxes[2, 0].set_xlabel(\"Hours of work / week\")\naxes[2, 0].set_ylabel(\"Count\")\naxes[2, 1].axis('off');\n\n\n\n\nHighlights\n\nApproximately 25% of the people make more than 50K a year.\nTwo thirds of the subjects are males.\nThe great majority of the subjects are white, only a minority are black and the other categories are very infrequent.\nThe distribution of age is skewed to the right, as one might expect.\nThe distribution of hours of work per week looks weird at first sight. But what is a typical workload per week? You got it, 40 hours :).\n\nWe only keep the races black and white to simplify the analysis. The other categories don’t appear very often in our data.\nNow, we see the distribution of income for the different levels of our explanatory variables. Numerical variables are binned to make the analysis possible.\n\ndata = data[data[\"race\"].isin([\"Black\", \"White\"])]\ndata[\"race\"] = data[\"race\"].cat.remove_unused_categories()\nage_bins = [17, 25, 35, 45, 65, 90]\ndata[\"age_binned\"] = pd.cut(data[\"age\"], age_bins)\nhours_bins = [0, 20, 40, 60, 100]\ndata[\"hs_week_binned\"] = pd.cut(data[\"hs_week\"], hours_bins)\n\n\nfig, axes = plt.subplots(3, 2, figsize=(12, 15))\nsns.countplot(x=\"income\", color=\"C0\", data=data, ax=axes[0, 0])\nsns.countplot(x=\"sex\", hue=\"income\", data=data, ax=axes[0, 1])\nsns.countplot(x=\"race\", hue=\"income\", data=data, ax=axes[1, 0])\nsns.countplot(x=\"age_binned\", hue=\"income\", data=data, ax=axes[1, 1])\nsns.countplot(x=\"hs_week_binned\", hue=\"income\", data=data, ax=axes[2, 0])\naxes[2, 1].axis(\"off\");\n\n\n\n\nSome quick and gross info from the plots\n\nThe probability of making more than \\$50k a year is larger if you are a Male.\nA person also has more probability of making more than \\$50k/yr if she/he is White.\nFor age, we see the probability of making more than \\$50k a year increases as the variable increases, up to a point where it starts to decrease.\nAlso, the more hours a person works per week, the higher the chance of making more than \\$50k/yr. There’s a big jump in that probability when the hours of work per week jump from the (20, 40] bin to the (40, 60] one.\n\nSome data preparation before fitting our model. Here we standardize numerical variables age and hs_week because it may help sampler convergence. Also, we compute their second and third power. 
These powers will be sequentially added to the model.\n\nage_mean = np.mean(data[\"age\"])\nage_std = np.std(data[\"age\"])\nhs_mean = np.mean(data[\"hs_week\"])\nhs_std = np.std(data[\"hs_week\"])\n\ndata[\"age\"] = (data[\"age\"] - age_mean) / age_std\ndata[\"age2\"] = data[\"age\"] ** 2\ndata[\"age3\"] = data[\"age\"] ** 3\ndata[\"hs_week\"] = (data[\"hs_week\"] - hs_mean) / hs_std\ndata[\"hs_week2\"] = data[\"hs_week\"] ** 2\ndata[\"hs_week3\"] = data[\"hs_week\"] ** 3\n\ndata = data.drop(columns=[\"age_binned\", \"hs_week_binned\"])\n\nThis is what our data looks like before fitting the models.\n\ndata.head()\n\n\n\n\n\n \n \n \n income\n sex\n race\n age\n hs_week\n age2\n age3\n hs_week2\n hs_week3\n \n \n \n \n 0\n <=50K\n Male\n White\n 0.024207\n -0.037250\n 0.000586\n 0.000014\n 0.001388\n -0.000052\n \n \n 1\n <=50K\n Male\n White\n 0.827984\n -2.222326\n 0.685557\n 0.567630\n 4.938734\n -10.975479\n \n \n 2\n <=50K\n Male\n White\n -0.048863\n -0.037250\n 0.002388\n -0.000117\n 0.001388\n -0.000052\n \n \n 3\n <=50K\n Male\n Black\n 1.047195\n -0.037250\n 1.096618\n 1.148374\n 0.001388\n -0.000052\n \n \n 4\n <=50K\n Female\n Black\n -0.779569\n -0.037250\n 0.607728\n -0.473766\n 0.001388\n -0.000052\n \n \n\n\n\n\n\n\n\nWe will use a logistic regression model to estimate the probability of making more than \\$50K as a function of age, hours of work per week, sex, and race.\nIf we have a binary response variable \\(Y\\) and a set of predictors or explanatory variables \\(X_1, X_2, \\cdots, X_p\\), the logistic regression model can be defined as follows:\n\\[\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\cdots + \\beta_p X_p\\]\nwhere \\(\\pi = P(Y = 1)\\) (a.k.a. probability of success) and \\(\\beta_0, \\beta_1, \\cdots \\beta_p\\) are unknown parameters. The term on the left side is the logarithm of the odds, simply known as the log-odds. With little effort, the expression can be re-arranged to express our probability of interest, \\(\\pi\\), as a function of the betas and the predictors.\n\\[\n\\pi = \\frac{e^{\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p}}{1 + e^{\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p}}\n = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p)}}\n\\]\nWe need to specify a prior and a likelihood in order to draw samples from the posterior distribution. We could use sociological knowledge about the effects of age and education on income, but instead, let’s use the default prior specification in Bambi.\nThe likelihood is the product of \\(n\\) Bernoulli trials, \\(\\prod_{i=1}^{n}{p_i^{y_i}(1-p_i)^{1-y_i}}\\) where \\(p_i = P(Y=1)\\).\nIn our case, we have\n\\[Y =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person makes more than 50K per year} \\\\\n 0 & \\textrm{if the person makes less than 50K per year}\n \\end{array}\n\\right.\n\\]\n\\[\\pi = P(Y=1)\\]\nBut this is a Bambi example, right? 
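One last bit of arithmetic before we get to it: the inverse of the logit link is scipy's expit function (imported above as invlogit), and a tiny round trip shows how a value of the linear predictor maps to a probability and back. The value 0.8 is arbitrary.

from scipy.special import expit, logit

eta = 0.8             # an arbitrary value of the linear predictor
pi = expit(eta)       # probability implied by that log-odds
print(pi, logit(pi))  # logit(expit(eta)) recovers eta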
Let’s see how Bambi can helps us to build a logistic regression model.\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3 + \\beta_4 X_4 \n\\]\nWhere:\n\\[\n\\begin{split}\nX_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\nX_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\nX_3 &=\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n\\right. \\\\\nX_4 &=\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n\\right.\n\\end{split}\n\\]\n\nmodel1 = bmb.Model(\"income['>50K'] ~ sex + race + age + hs_week\", data, family=\"bernoulli\")\nfitted1 = model1.fit(draws=1000, idata_kwargs={\"log_likelihood\": True})\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sex, race, age, hs_week]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:20<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 21 seconds.\n\n\n\naz.plot_trace(fitted1);\naz.summary(fitted1)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.635\n 0.062\n -2.757\n -2.525\n 0.001\n 0.001\n 2457.0\n 1739.0\n 1.0\n \n \n sex[Male]\n 1.018\n 0.037\n 0.948\n 1.087\n 0.001\n 0.001\n 2141.0\n 1572.0\n 1.0\n \n \n race[White]\n 0.630\n 0.058\n 0.532\n 0.751\n 0.001\n 0.001\n 3060.0\n 1566.0\n 1.0\n \n \n age\n 0.578\n 0.015\n 0.554\n 0.608\n 0.000\n 0.000\n 1837.0\n 1281.0\n 1.0\n \n \n hs_week\n 0.504\n 0.015\n 0.477\n 0.533\n 0.000\n 0.000\n 2047.0\n 1568.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_1^2 + \\beta_3 X_2 + \\beta_4 X_2^2\n + \\beta_5 X_3 + \\beta_6 X_4\n\\]\nWhere:\n$$\n\\[\\begin{aligned}\n X_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\n X_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n \\right. 
\\\\\n\n X_4 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n \\right.\n\\end{aligned}\\]\n$$\n\nmodel2 = bmb.Model(\"income['>50K'] ~ sex + race + age + age2 + hs_week + hs_week2\", data, family=\"bernoulli\")\nfitted2 = model2.fit(idata_kwargs={\"log_likelihood\": True})\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sex, race, age, age2, hs_week, hs_week2]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:29<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 30 seconds.\n\n\n\naz.plot_trace(fitted2);\naz.summary(fitted2)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.282\n 0.065\n -2.406\n -2.166\n 0.001\n 0.001\n 2037.0\n 1330.0\n 1.0\n \n \n sex[Male]\n 1.006\n 0.038\n 0.939\n 1.074\n 0.001\n 0.001\n 2192.0\n 1628.0\n 1.0\n \n \n race[White]\n 0.702\n 0.061\n 0.590\n 0.818\n 0.001\n 0.001\n 2084.0\n 1343.0\n 1.0\n \n \n age\n 1.069\n 0.024\n 1.028\n 1.117\n 0.001\n 0.000\n 1720.0\n 1406.0\n 1.0\n \n \n age2\n -0.538\n 0.018\n -0.570\n -0.503\n 0.000\n 0.000\n 1730.0\n 1161.0\n 1.0\n \n \n hs_week\n 0.499\n 0.022\n 0.455\n 0.538\n 0.001\n 0.000\n 1665.0\n 1431.0\n 1.0\n \n \n hs_week2\n -0.088\n 0.009\n -0.103\n -0.072\n 0.000\n 0.000\n 1687.0\n 1577.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_1^2 + \\beta_3 X_1^3 + \\beta_4 X_2\n + \\beta_5 X_2^2 + \\beta_6 X_2^3 + \\beta_7 X_3 + \\beta_8 X_4\n\\]\nWhere:\n\\[\n\\begin{aligned}\n X_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\n X_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n \\right. 
\\\\\n X_4 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n \\right.\n\\end{aligned}\n\\]\n\nmodel3 = bmb.Model(\n \"income['>50K'] ~ age + age2 + age3 + hs_week + hs_week2 + hs_week3 + sex + race\",\n data,\n family=\"bernoulli\"\n)\nfitted3 = model3.fit(\n draws=1000, random_seed=1234, target_accept=0.9, idata_kwargs={\"log_likelihood\": True}\n)\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, age, age2, age3, hs_week, hs_week2, hs_week3, sex, race]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 01:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 75 seconds.\n\n\n\naz.plot_trace(fitted3);\naz.summary(fitted3)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.145\n 0.064\n -2.270\n -2.028\n 0.001\n 0.001\n 3201.0\n 1540.0\n 1.0\n \n \n age\n 0.963\n 0.026\n 0.913\n 1.009\n 0.001\n 0.000\n 2243.0\n 1290.0\n 1.0\n \n \n age2\n -0.894\n 0.030\n -0.946\n -0.836\n 0.001\n 0.001\n 1541.0\n 1229.0\n 1.0\n \n \n age3\n 0.175\n 0.011\n 0.153\n 0.194\n 0.000\n 0.000\n 1653.0\n 1506.0\n 1.0\n \n \n hs_week\n 0.612\n 0.025\n 0.567\n 0.661\n 0.001\n 0.000\n 2381.0\n 1300.0\n 1.0\n \n \n hs_week2\n -0.010\n 0.010\n -0.030\n 0.010\n 0.000\n 0.000\n 2299.0\n 1590.0\n 1.0\n \n \n hs_week3\n -0.035\n 0.004\n -0.042\n -0.028\n 0.000\n 0.000\n 1815.0\n 1572.0\n 1.0\n \n \n sex[Male]\n 0.985\n 0.038\n 0.918\n 1.059\n 0.001\n 0.001\n 2737.0\n 1549.0\n 1.0\n \n \n race[White]\n 0.681\n 0.060\n 0.573\n 0.798\n 0.001\n 0.001\n 3044.0\n 1514.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\nWe can perform a Bayesian model comparison very easily with az.compare(). Here we pass a dictionary with the InferenceData objects that Model.fit() returned and az.compare() returns a data frame that is ordered from best to worst according to the criteria used. By default, ArviZ uses loo, which is an estimation of leave one out cross validation. Another option is the widely applicable information criterion (WAIC). For more information about the information criteria available and other options within the function see the docs.\n\nmodels_dict = {\n \"model1\": fitted1,\n \"model2\": fitted2,\n \"model3\": fitted3\n}\ndf_compare = az.compare(models_dict)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n model3\n 0\n -13987.197673\n 9.716205\n 0.000000\n 1.000000e+00\n 89.279906\n 0.000000\n False\n log\n \n \n model2\n 1\n -14155.112761\n 8.147063\n 167.915088\n 3.048565e-12\n 91.305227\n 19.879825\n False\n log\n \n \n model1\n 2\n -14915.862090\n 4.871886\n 928.664417\n 0.000000e+00\n 91.010624\n 38.923423\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nThere is a difference in the point estimations (empty circles) between the model with cubic terms (model 3) and the model with quadratic terms (model 2) but there is some overlap between their interval estimations. This time, we are going to select model 2 and do some extra little work with it because from previous experience with this dataset we know there is no substantial difference between them, and model 2 is simpler. 
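If you prefer WAIC, mentioned above as an alternative to LOO, the comparison can be re-run without refitting anything. Here is a minimal sketch that reuses the fitted1, fitted2 and fitted3 objects from above and only switches the information criterion (the log-likelihood values we stored when fitting are all that is needed):

df_compare_waic = az.compare(
    {"model1": fitted1, "model2": fitted2, "model3": fitted3},
    ic="waic",
)
df_compare_waic
az.plot_compare(df_compare_waic, insample_dev=False);

The resulting table is read the same way as the LOO table above, and with this dataset we would expect it to tell a very similar story.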
However, as we mention in the final remarks, this is not the best you can achieve with this dataset. If you want, you could also try to add other predictors, such as education level and see how it impacts in the model comparison :).\n\n\n\nIn this section we plot age vs the probability of making more than 50K a year given different profiles.\nWe set hours of work per week at 40 hours and assign a grid from 18 to 75 age. They’re standardized because they were standardized when we fitted the model.\nHere we use az.plot_hdi() to get Highest Density Interval plots. We get two bands for each profile. One corresponds to an hdi probability of 0.94 (the default) and the other to an hdi probability of 0.5.\n\nHS_WEEK = (40 - hs_mean) / hs_std\nAGE = (np.linspace(18, 75) - age_mean) / age_std\n\nfig, ax = plt.subplots()\nhandles = []\ni = 0\n\nfor race in [\"Black\", \"White\"]:\n for sex in [\"Female\", \"Male\"]: \n color = f\"C{i}\"\n label = f\"{race} - {sex}\"\n handles.append(mlines.Line2D([], [], color=color, label=label, lw=3))\n \n new_data = pd.DataFrame({\n \"sex\": [sex] * len(AGE),\n \"race\": [race] * len(AGE), \n \"age\": AGE,\n \"age2\": AGE ** 2,\n \"hs_week\": [HS_WEEK] * len(AGE),\n \"hs_week2\": [HS_WEEK ** 2] * len(AGE),\n })\n new_idata = model2.predict(fitted2, data=new_data, inplace=False)\n mean = new_idata.posterior[\"income_mean\"].values\n\n az.plot_hdi(AGE * age_std + age_mean, mean, ax=ax, color=color)\n az.plot_hdi(AGE * age_std + age_mean, mean, ax=ax, color=color, hdi_prob=0.5)\n i += 1\n\nax.set_xlabel(\"Age\")\nax.set_ylabel(\"P(Income > $50K)\")\nax.legend(handles=handles, loc=\"upper left\");\n\n\n\n\nThe highest posterior density bands show how the probability of earning more than 50K changes with age for a given profile. In all the cases, we see the probability of making more than $50K increases with age until approximately age 52, when the probability begins to drop off. We can interpret narrow portions of a curve as places where we have low uncertainty and spread out portions of the bands as places where we have somewhat higher uncertainty about our coefficient values.\n\n\nIn this notebook we’ve seen how easy it is to incorporate ArviZ into a Bambi workflow to perform model comparison based on information criteria such as LOO and WAIC. However, an attentive reader might have seen that the highest density interval plot never shows a predicted probability greater than 0.5 (which is not good if we expect to predict that at least some people working 40hrs/wk make more than \\$50k/yr). You can increase the hours of work per week for the profiles we’ve used and the HDIs will show larger values. But we won’t be seeing the whole picture.\nAlthough we’re using some demographic variables such as sex and race, the cells resulting from the combinations of their levels are still very heterogeneous. For example, we are mixing individuals of all educational levels. A possible next step is to incorporate education into the different models we compared. If any of the readers (yes, you!) is interested in doing so, here there are some notes that may help\n\nEducation is an ordinal categorical variable with a lot of levels.\n\nExplore the conditional distribution of income given education levels.\nSee what are the counts/proportions of people within each education level.\nCollapse categories (but respect the ordinality!). Try to end up with 5 or less categories if possible.\n\nStart with a model with only age, sex, race, hs_week and education. 
Then incorporate higher order terms (second and third powers for example). Don’t go beyond fourth powers.\nLook for a nice activity to do while the sampler does its job.\nWe know it’s going to take a couple of hours to fit all those models :)\n\nAnd finally, please feel free to open a new issue if you think there’s something that we can improve.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\nnumpy : 1.23.5\nmatplotlib: 3.6.2\narviz : 0.14.0\nseaborn : 0.12.2\npandas : 1.5.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/beta_regression.html", - "href": "notebooks/beta_regression.html", + "objectID": "notebooks/circular_regression.html", + "href": "notebooks/circular_regression.html", "title": "Bambi", "section": "", - "text": "This example has been contributed by Tyler James Burch (@tjburch on GitHub).\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\nfrom scipy.special import expit\n\n\naz.style.use(\"arviz-darkgrid\")\n\nIn this example, we’ll look at using the Beta distribution for regression models. The Beta distribution is a probability distribution bounded on the interval [0, 1], which makes it well-suited to model probabilities or proportions. In fact, in much of the Bayesian literature, the Beta distribution is introduced as a prior distribution for the probability \\(p\\) parameter of the Binomial distribution (in fact, it’s the conjugate prior for the Binomial distribution).\n\n\nTo start getting an intuitive sense of the Beta distribution, we’ll model coin flipping probabilities. Say we grab all the coins out of our pocket, we might have some fresh from the mint, but we might also have some old ones. Due to the variation, some may be slightly biased toward heads or tails, and our goal is to model distribution of the probabilities of flipping heads for the coins in our pocket.\nSince we trust the mint, we’ll say the \\(\\alpha\\) and \\(\\beta\\) are both large, we’ll use 1,000 for each, which gives a distribution spanning from 0.45 to 0.55.\n\nalpha = 1_000\nbeta = 1_000\np = np.random.beta(alpha, beta, size=10_000)\naz.plot_kde(p)\nplt.xlabel(\"$p$\");\n\n\n\n\nNext, we’ll use Bambi to try to recover the parameters of the Beta distribution. Since we have no predictors, we can do a intercept-only model to try to recover them.\n\ndata = pd.DataFrame({\"probabilities\": p})\nmodel = bmb.Model(\"probabilities ~ 1\", data, family=\"beta\")\nfitted = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [probabilities_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\n\n\n\naz.plot_trace(fitted);\n\n\n\n\n\naz.summary(fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.000\n 0.000\n -0.001\n 0.001\n 0.000\n 0.000\n 2079.0\n 1465.0\n 1.0\n \n \n probabilities_kappa\n 2012.885\n 27.642\n 1960.994\n 2062.262\n 0.592\n 0.418\n 2185.0\n 1548.0\n 1.0\n \n \n\n\n\n\nThe model fit, but clearly these parameters are not the ones that we used above. 
For Beta regression, we use a linear model for the mean, so we use the \\(\\mu\\) and \\(\\sigma\\) formulation. To link the two, we use\n\\(\\alpha = \\mu \\kappa\\)\n\\(\\beta = (1-\\mu)\\kappa\\)\nand \\(\\kappa\\) is a function of the mean and variance,\n\\(\\kappa = \\frac{\\mu(1-\\mu)}{\\sigma^2} - 1\\)\nRather than \\(\\sigma\\), you’ll note Bambi returns \\(\\kappa\\). We’ll define a function to retrieve our original parameters.\n\ndef mukappa_to_alphabeta(mu, kappa):\n # Calculate alpha and beta\n alpha = mu * kappa\n beta = (1 - mu) * kappa\n \n # Get mean values and 95% HDIs \n alpha_mean = alpha.mean((\"chain\", \"draw\")).item()\n alpha_hdi = az.hdi(alpha, hdi_prob=.95)[\"x\"].values\n beta_mean = beta.mean((\"chain\", \"draw\")).item()\n beta_hdi = az.hdi(beta, hdi_prob=.95)[\"x\"].values\n \n return alpha_mean, alpha_hdi, beta_mean, beta_hdi\n\nalpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(\n expit(fitted.posterior[\"Intercept\"]),\n fitted.posterior[\"probabilities_kappa\"]\n)\n\nprint(f\"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi[0])} - {np.round(alpha_hdi[1])}\")\nprint(f\"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi[0])} - {np.round(beta_hdi[1])}\")\n\nAlpha - mean: 1006.0, 95% HDI: 979.0 - 1033.0\nBeta - mean: 1006.0, 95% HDI: 978.0 - 1032.0\n\n\n\ndef mukappa_to_alphabeta(mu, kappa):\n # Calculate alpha and beta\n alpha = mu * kappa\n beta = (1 - mu) * kappa\n \n # Get mean values and 95% HDIs \n alpha_mean = alpha.mean((\"chain\", \"draw\")).item()\n alpha_hdi = az.hdi(alpha, hdi_prob=.95)[\"x\"].values\n beta_mean = beta.mean((\"chain\", \"draw\")).item()\n beta_hdi = az.hdi(beta, hdi_prob=.95)[\"x\"].values\n \n return alpha_mean, alpha_hdi, beta_mean, beta_hdi\n\nalpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(\n expit(fitted.posterior[\"Intercept\"]),\n fitted.posterior[\"probabilities_kappa\"]\n)\n\nprint(f\"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi[0])} - {np.round(alpha_hdi[1])}\")\nprint(f\"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi[0])} - {np.round(beta_hdi[1])}\")\n\nAlpha - mean: 1006.0, 95% HDI: 979.0 - 1033.0\nBeta - mean: 1006.0, 95% HDI: 978.0 - 1032.0\n\n\nWe’ve managed to recover our parameters with an intercept-only model.\n\n\n\nPerhaps we have a little more information on the coins in our pocket. We notice that the coins have accumulated dirt on either side, which would shift the probability of getting a tails or heads. In reality, we would not know how much the dirt affects the probability distribution, we would like to recover that parameter. We’ll construct this toy example by saying that each micron of dirt shifts the \\(\\alpha\\) parameter by 5.0. 
Further, the amount of dirt is distributed according to a Half Normal distribution with a standard deviation of 25 per side.\nWe’ll start by looking at the difference in probability for a coin with a lot of dirt on either side.\n\neffect_per_micron = 5.0\n\n# Clean Coin\nalpha = 1_000\nbeta = 1_000\np = np.random.beta(alpha, beta, size=10_000)\n\n# Add two std to tails side (heads more likely)\np_heads = np.random.beta(alpha + 50 * effect_per_micron, beta, size=10_000)\n# Add two std to heads side (tails more likely)\np_tails = np.random.beta(alpha - 50 * effect_per_micron, beta, size=10_000)\n\naz.plot_kde(p, label=\"Clean Coin\")\naz.plot_kde(p_heads, label=\"Biased toward heads\", plot_kwargs={\"color\":\"C1\"})\naz.plot_kde(p_tails, label=\"Biased toward tails\", plot_kwargs={\"color\":\"C2\"})\nplt.xlabel(\"$p$\")\nplt.ylim(top=plt.ylim()[1]*1.25);\n\n\n\n\nNext, we’ll generate a toy dataset according to our specifications above. As an added foil, we will also assume that we’re limited in our measuring equipment, that we can only measure correctly to the nearest integer micron.\n\n# Create amount of dirt on top and bottom\nheads_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)\ntails_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)\n\n# Create the probability per coin\nalpha = np.repeat(1_000, 1_000)\nalpha = alpha + effect_per_micron * heads_bias_dirt - effect_per_micron * tails_bias_dirt\nbeta = np.repeat(1_000, 1_000)\n\np = np.random.beta(alpha, beta)\n\ndf = pd.DataFrame({\n \"p\" : p,\n \"heads_bias_dirt\" : heads_bias_dirt.round(),\n \"tails_bias_dirt\" : tails_bias_dirt.round()\n})\ndf.head()\n\n\n\n\n\n \n \n \n p\n heads_bias_dirt\n tails_bias_dirt\n \n \n \n \n 0\n 0.508915\n 30.0\n 15.0\n \n \n 1\n 0.533541\n 24.0\n 4.0\n \n \n 2\n 0.482905\n 10.0\n 28.0\n \n \n 3\n 0.555191\n 54.0\n 0.0\n \n \n 4\n 0.526059\n 4.0\n 4.0\n \n \n\n\n\n\nTaking a look at our new dataset:\n\nfig,ax = plt.subplots(1,3, figsize=(16,5))\n\ndf[\"p\"].plot.kde(ax=ax[0])\nax[0].set_xlabel(\"$p$\")\n\ndf[\"heads_bias_dirt\"].plot.hist(ax=ax[1], bins=np.arange(0,df[\"heads_bias_dirt\"].max()))\nax[1].set_xlabel(\"Measured Dirt Biasing Toward Heads ($\\mu m$)\")\ndf[\"tails_bias_dirt\"].plot.hist(ax=ax[2], bins=np.arange(0,df[\"tails_bias_dirt\"].max()))\nax[2].set_xlabel(\"Measured Dirt Biasing Toward Tails ($\\mu m$)\");\n\n\n\n\nNext we want to make a model to recover the effect per micron of dirt per side. So far, we’ve considered the biasing toward one side or another independently. A linear model might look something like this:\n$ p (, )$\n\\(logit(\\mu) = \\text{ Normal}( \\alpha + \\beta_h d_h + \\beta_t d_t)\\)\nWhere \\(d_h\\) and \\(d_t\\) are the measured dirt (in microns) biasing the probability toward heads and tails respectively, \\(\\beta_h\\) and \\(\\beta_t\\) are coefficients for how much a micron of dirt affects each independent side, and \\(\\alpha\\) is the intercept. Also note the logit link function used here, since our outcome is on the scale of 0-1, it makes sense that the link must also put our mean on that scale. Logit is the default link function, however Bambi supports the identity, probit, and cloglog links as well.\nIn this toy example, we’ve constructed it such that dirt should not affect one side differently from another, so we can wrap those into one coefficient: \\(\\beta = \\beta_h = -\\beta_t\\). 
This makes the last line of the model:\n\\(logit(\\mu) = \\text{ Normal}( \\alpha + \\beta \\Delta d)\\)\nwhere\n\\(\\Delta d = d_h - d_t\\)\nPutting that into our dataset, then constructing this model in Bambi,\n\ndf[\"delta_d\"] = df[\"heads_bias_dirt\"] - df[\"tails_bias_dirt\"]\ndirt_model = bmb.Model(\"p ~ delta_d\", df, family=\"beta\")\ndirt_fitted = dirt_model.fit()\ndirt_model.predict(dirt_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [p_kappa, Intercept, delta_d]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\n\naz.summary(dirt_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.006\n 0.001\n -0.009\n -0.004\n 0.000\n 0.000\n 2903.0\n 1479.0\n 1.0\n \n \n delta_d\n 0.005\n 0.000\n 0.005\n 0.005\n 0.000\n 0.000\n 3200.0\n 1597.0\n 1.0\n \n \n p_kappa\n 2018.759\n 91.080\n 1862.252\n 2198.655\n 1.719\n 1.216\n 2805.0\n 1399.0\n 1.0\n \n \n p_mean[0]\n 0.517\n 0.000\n 0.516\n 0.518\n 0.000\n 0.000\n 3477.0\n 1662.0\n 1.0\n \n \n p_mean[1]\n 0.523\n 0.000\n 0.522\n 0.524\n 0.000\n 0.000\n 3564.0\n 1637.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n p_mean[995]\n 0.523\n 0.000\n 0.522\n 0.524\n 0.000\n 0.000\n 3564.0\n 1637.0\n 1.0\n \n \n p_mean[996]\n 0.517\n 0.000\n 0.516\n 0.518\n 0.000\n 0.000\n 3477.0\n 1662.0\n 1.0\n \n \n p_mean[997]\n 0.533\n 0.001\n 0.532\n 0.534\n 0.000\n 0.000\n 3570.0\n 1596.0\n 1.0\n \n \n p_mean[998]\n 0.467\n 0.001\n 0.466\n 0.468\n 0.000\n 0.000\n 2916.0\n 1657.0\n 1.0\n \n \n p_mean[999]\n 0.498\n 0.000\n 0.498\n 0.499\n 0.000\n 0.000\n 2903.0\n 1479.0\n 1.0\n \n \n\n1003 rows × 9 columns\n\n\n\n\naz.plot_ppc(dirt_fitted);\n\n\n\n\nNext, we’ll see if we can in fact recover the effect on \\(\\alpha\\). Remember that in order to return to \\(\\alpha\\), \\(\\beta\\) space, the linear equation passes through an inverse logit transformation, so we must apply this to the coefficient on \\(\\Delta d\\) to get the effect on \\(\\alpha\\). The inverse logit is nicely defined in scipy.special as expit.\n\nmean_effect = expit(dirt_fitted.posterior.delta_d.mean())\nhdi = az.hdi(dirt_fitted.posterior.delta_d, hdi_prob=.95)\nlower = expit(hdi.delta_d[0])\nupper = expit(hdi.delta_d[1])\nprint(f\"Mean effect: {mean_effect.item():.4f}\")\nprint(f\"95% interval {lower.item():.4f} - {upper.item():.4f}\")\n\nMean effect: 0.5012\n95% interval 0.5012 - 0.5013\n\n\nThe recovered effect is very close to the true effect of 0.5.\n\n\n\nIn the Hierarchical Logistic regression with Binomial family notebook, we modeled baseball batting averages (times a player reached first via a hit per times at bat) via a Hierarchical Logisitic regression model. If we’re interested in league-wide effects, we could look at a Beta regression. We work off the assumption that the league-wide batting average follows a Beta distribution, and that individual player’s batting averages are samples from that distribution.\nFirst, load the Batting dataset again, and re-calculate batting average as hits/at-bat. In order to make sure that we have a sufficient sample, we’ll require at least 100 at-bats in order consider a batter. 
Last, we’ll focus on 1990-2018.\n\nbatting = bmb.load_data(\"batting\")\n\n\nbatting[\"batting_avg\"] = batting[\"H\"] / batting[\"AB\"]\nbatting = batting[batting[\"AB\"] > 100]\ndf = batting[ (batting[\"yearID\"] > 1990) & (batting[\"yearID\"] < 2018) ]\n\n\ndf.batting_avg.hist(bins=30)\nplt.xlabel(\"Batting Average\")\nplt.ylabel(\"Count\");\n\n\n\n\nIf we’re interested in modeling the distribution of batting averages, we can start with an intercept-only model.\n\nmodel_avg = bmb.Model(\"batting_avg ~ 1\", df, family=\"beta\")\navg_fitted = model_avg.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\naz.summary(avg_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.038\n 0.002\n -1.041\n -1.035\n 0.000\n 0.000\n 1835.0\n 1455.0\n 1.0\n \n \n batting_avg_kappa\n 152.538\n 1.950\n 149.098\n 156.262\n 0.046\n 0.033\n 1771.0\n 1294.0\n 1.0\n \n \n\n\n\n\nLooking at the posterior predictive,\n\nposterior_predictive = model_avg.predict(avg_fitted, kind=\"pps\")\n\n\naz.plot_ppc(avg_fitted);\n\n\n\n\nThis appears to fit reasonably well. If, for example, we were interested in simulating players, we could sample from this distribution.\nHowever, we can take this further. Say we’re interested in understanding how this distribution shifts if we know a player’s batting average in a previous year. We can condition the model on a player’s n-1 year, and use Beta regrssion to model that. Let’s construct that variable, and for sake of ease, we will ignore players without previous seasons.\n\n# Add the player's batting average in the n-1 year\nbatting[\"batting_avg_shift\"] = np.where(\n batting[\"playerID\"] == batting[\"playerID\"].shift(),\n batting[\"batting_avg\"].shift(),\n np.nan\n)\ndf_shift = batting[ (batting[\"yearID\"] > 1990) & (batting[\"yearID\"] < 2018) ]\ndf_shift = df_shift[~df_shift[\"batting_avg_shift\"].isna()]\ndf_shift[[\"batting_avg_shift\",\"batting_avg\"]].corr()\n\n\n\n\n\n \n \n \n batting_avg_shift\n batting_avg\n \n \n \n \n batting_avg_shift\n 1.000000\n 0.229774\n \n \n batting_avg\n 0.229774\n 1.000000\n \n \n\n\n\n\nThere is a lot of variance in year-to-year batting averages, it’s not known to be incredibly predictive, and we see that here. A correlation coefficient of 0.23 is only lightly predictive. However, we can still use it in our model to get a better understanding. We’ll fit two models. First, we’ll refit the previous, intercept-only, model using this updated dataset so we have an apples-to-apples comparison. 
Then, we’ll fit a model using the previous year’s batting average as a predictor.\nNotice we need to explicitly ask for the inclusion of the log-likelihood values into the inference data object.\n\nmodel_avg = bmb.Model(\"batting_avg ~ 1\", df_shift, family=\"beta\")\navg_fitted = model_avg.fit(idata_kwargs={\"log_likelihood\": True})\n\nmodel_lag = bmb.Model(\"batting_avg ~ batting_avg_shift\", df_shift, family=\"beta\")\nlag_fitted = model_lag.fit(idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept, batting_avg_shift]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\naz.summary(lag_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.374\n 0.074\n -1.517\n -1.240\n 0.001\n 0.001\n 3171.0\n 1435.0\n 1.0\n \n \n batting_avg_shift\n 1.347\n 0.281\n 0.782\n 1.838\n 0.005\n 0.004\n 3091.0\n 1478.0\n 1.0\n \n \n batting_avg_kappa\n 136.149\n 9.414\n 116.879\n 152.420\n 0.184\n 0.132\n 2618.0\n 1463.0\n 1.0\n \n \n\n\n\n\n\naz.compare({\n \"intercept-only\" : avg_fitted,\n \"lag-model\": lag_fitted\n})\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n lag-model\n 0\n 784.894117\n 3.146425\n 0.000000\n 0.995619\n 14.582720\n 0.000000\n False\n log\n \n \n intercept-only\n 1\n 774.193828\n 2.034573\n 10.700289\n 0.004381\n 15.320598\n 4.604911\n False\n log\n \n \n\n\n\n\nAdding the predictor results in a higher loo than the intercept-only model.\n\nppc= model_lag.predict(lag_fitted, kind=\"pps\")\naz.plot_ppc(lag_fitted);\n\n\n\n\nThe biggest question this helps us understand is, for each point of batting average in the previous year, how much better do we expect a player to be in the current year?\n\nmean_effect = lag_fitted.posterior.batting_avg_shift.mean().item()\nhdi = az.hdi(lag_fitted.posterior.batting_avg_shift, hdi_prob=.95)\n\nlower = expit(hdi.batting_avg_shift[0]).item()\nupper = expit(hdi.batting_avg_shift[1]).item()\nprint(f\"Mean effect: {expit(mean_effect):.4f}\")\nprint(f\"95% interval {lower:.4f} - {upper:.4f}\")\n\nMean effect: 0.7936\n95% interval 0.6806 - 0.8650\n\n\n\naz.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=0.95, color=\"goldenrod\", fill_kwargs={\"alpha\":0.8})\naz.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=.68, color=\"forestgreen\", fill_kwargs={\"alpha\":0.8})\n\nintercept = lag_fitted.posterior.Intercept.values.mean()\nx = np.linspace(df_shift.batting_avg_shift.min(), df_shift.batting_avg_shift.max(),100)\nlinear = mean_effect * x + intercept\nplt.plot(x, expit(linear), c=\"black\")\nplt.xlabel(\"Previous Year's Batting Average\")\nplt.ylabel(\"Batting Average\");\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython 
implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nmatplotlib: 3.6.2\nnumpy : 1.23.5\npandas : 1.5.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nscipy : 1.9.3\nbambi : 0.9.3\n\nWatermark: 2.3.1" + "text": "Circular Regression\n\nimport arviz as az\nimport bambi as bmb\nfrom matplotlib.lines import Line2D\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\n\n\naz.style.use(\"arviz-white\")\n\nDirectional statistics, also known as circular statistics or spherical statistics, refers to a branch of statistics dealing with data which domain is the unit circle, as opposed to “linear” data which support is the real line. Circular data is convenient when dealing with directions or rotations. Some examples include temporal periods like hours or days, compass directions, dihedral angles in biomolecules, etc.\nThe fact that a Sunday can be both the day before or after a Monday, or that 0 is a “better average” for 2 and 358 degrees than 180 are illustrations that circular data and circular statistical methods are better equipped to deal with this kind of problem than the more familiar methods 1.\nThere are a few circular distributions, one of them is the VonMises distribution, that we can think as the cousin of the Gaussian that lives in circular space. The domain of this distribution is any interval of length \\(2\\pi\\). We are going to adopt the convention that the interval goes from \\(-\\pi\\) to \\(\\pi\\), so for example 0 radians is the same as \\(2\\pi\\). The VonMises is defined using two parameters, the mean \\(\\mu\\) (the circular mean) and the concentration \\(\\kappa\\), with \\(\\frac{1}{\\kappa}\\) being analogue of the variance. Let see a few example of the VonMises family:\n\nx = np.linspace(-np.pi, np.pi, 200)\nmus = [0., 0., 0., -2.5]\nkappas = [.001, 0.5, 3, 0.5]\nfor mu, kappa in zip(mus, kappas):\n pdf = stats.vonmises.pdf(x, kappa, loc=mu)\n plt.plot(x, pdf, label=r'$\\mu$ = {}, $\\kappa$ = {}'.format(mu, kappa))\nplt.yticks([])\nplt.legend(loc=1);\n\n\n\n\nWhen doing linear regression a commonly used link function is \\(2 \\arctan(u)\\) this ensure that values over the real line are mapped into the interval \\([-\\pi, \\pi]\\)\n\nu = np.linspace(-12, 12, 200)\nplt.plot(u, 2*np.arctan(u))\nplt.xlabel(\"Reals\")\nplt.ylabel(\"Radians\");\n\n\n\n\nBambi supports circular regression with the VonMises family, to exemplify this we are going to use a dataset from the following experiment. 31 periwinkles (a kind of sea snail) were removed from it original place and released down shore. Then, our task is to model the direction of motion as function of the distance travelled by them after being release.\n\ndata = bmb.load_data(\"periwinkles\")\ndata.head()\n\n\n\n\n\n \n \n \n distance\n direction\n \n \n \n \n 0\n 107\n 1.169371\n \n \n 1\n 46\n 1.151917\n \n \n 2\n 33\n 1.291544\n \n \n 3\n 67\n 1.064651\n \n \n 4\n 122\n 1.012291\n \n \n\n\n\n\nJust to compare results, we are going to use the VonMises family and the normal (default) family.\n\nmodel_vm = bmb.Model(\"direction ~ distance\", data, family=\"vonmises\")\nidata_vm = model_vm.fit(include_mean=True)\n\nmodel_n = bmb.Model(\"direction ~ distance\", data)\nidata_n = model_n.fit(include_mean=True)\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. 
As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. 
As well as being potentially slow, this also disables loop fusion.\n warn(\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [direction_kappa, Intercept, distance]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 6 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [direction_sigma, Intercept, distance]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\n\n\n\naz.summary(idata_vm, var_names=[\"~direction_mean\"])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.667\n 0.325\n 1.069\n 2.253\n 0.011\n 0.008\n 974.0\n 806.0\n 1.0\n \n \n distance\n -0.010\n 0.004\n -0.018\n -0.002\n 0.000\n 0.000\n 1168.0\n 1170.0\n 1.0\n \n \n direction_kappa\n 2.601\n 0.590\n 1.528\n 3.699\n 0.015\n 0.011\n 1499.0\n 1277.0\n 1.0\n \n \n\n\n\n\n\n_, ax = plt.subplots(1,2, figsize=(8, 4), sharey=True)\nposterior_mean = bmb.families.link.tan_2(idata_vm.posterior[\"direction_mean\"])\nax[0].plot(data.distance, posterior_mean.mean((\"chain\", \"draw\")))\naz.plot_hdi(data.distance, posterior_mean, ax=ax[0])\n\nax[0].plot(data.distance, data.direction, \"k.\")\nax[0].set_xlabel(\"Distance travelled (in m)\")\nax[0].set_ylabel(\"Direction of travel (radians)\")\nax[0].set_title(\"VonMises Family\")\n\nposterior_mean = idata_n.posterior[\"direction_mean\"]\nax[1].plot(data.distance, posterior_mean.mean((\"chain\", \"draw\")))\naz.plot_hdi(data.distance, posterior_mean, ax=ax[1])\n\nax[1].plot(data.distance, data.direction, \"k.\")\nax[1].set_xlabel(\"Distance travelled (in m)\")\nax[1].set_title(\"Normal Family\");\n\n\n\n\nWe can see that there is a negative relationship between distance and direction. This could be explained as Periwinkles travelling in a direction towards the sea travelled shorter distances than those travelling in directions away from it. From a biological perspective, this could have been due to a propensity of the periwinkles to stop moving once they are close to the sea.\nWe can also see that if inadvertently we had assumed a normal response we would have obtained a fit with higher uncertainty and more importantly the wrong sign for the relationship.\nAs a last step for this example we are going to do a posterior predictive check. 
In the figure below we have to panels showing the same data, with the only difference that the on the right is using a polar projection and the KDE are computing taking into account the circularity of the data.\nWe can see that our modeling is failing at capturing the bimodality in the data (with mode around 1.6 and \\(\\pm \\pi\\)) and hence the predicted distribution is wider and with a mean closer to \\(\\pm \\pi\\).\n\nfig = plt.figure(figsize=(12, 5))\nax0 = plt.subplot(121)\nax1 = plt.subplot(122, projection='polar')\n\nmodel_vm.predict(idata_vm, kind=\"pps\")\npp_samples = az.extract_dataset(idata_vm, group=\"posterior_predictive\", num_samples=200)[\"direction\"]\ncolors = [\"C0\" , \"k\", \"C1\"]\n\nfor ax, circ in zip((ax0, ax1), (False, \"radians\", colors)):\n for s in pp_samples:\n az.plot_kde(s.values, plot_kwargs={\"color\":colors[0], \"alpha\": 0.25}, is_circular=circ, ax=ax)\n az.plot_kde(idata_vm.observed_data[\"direction\"].values,\n plot_kwargs={\"color\":colors[1], \"lw\":3}, is_circular=circ, ax=ax)\n az.plot_kde(idata_vm.posterior_predictive[\"direction\"].values,\n plot_kwargs={\"color\":colors[2], \"ls\":\"--\", \"lw\":3}, is_circular=circ, ax=ax)\n\ncustom_lines = [Line2D([0], [0], color=c) for c in colors]\n\nax0.legend(custom_lines, [\"posterior_predictive\", \"Observed\", 'mean posterior predictive'])\nax0.set_yticks([])\nfig.suptitle(\"Directions (radians)\", fontsize=18);\n\n/tmp/ipykernel_21333/4056881271.py:6: FutureWarning: extract_dataset has been deprecated, please use extract\n pp_samples = az.extract_dataset(idata_vm, group=\"posterior_predictive\", num_samples=200)[\"direction\"]\n\n\n\n\n\nWe have shown an example of regression where the response variable is circular and the covariates are linear. This is sometimes refereed as linear-circular regression in order to distinguish it from other cases. Namely, when the response is linear and the covariates (or at least one of them) is circular the name circular-linear regression is often used. And when both covariates and the response variables are circular, we have a circular-circular regression. When the covariates are circular they are usually modelled with the help of sin and cosine functions. 
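To make the sine and cosine idea concrete, here is a minimal sketch of a circular-linear regression on synthetic data. Everything in it (the coefficients, the column names sin_a and cos_a, the sample size) is made up for illustration, and it reuses the imports from the top of this notebook:

rng = np.random.default_rng(1234)
angle = rng.uniform(-np.pi, np.pi, 100)                  # circular covariate, in radians
y = 2 + 0.8 * np.sin(angle) - 0.3 * np.cos(angle) + rng.normal(0, 0.5, 100)
df_cl = pd.DataFrame({"y": y, "sin_a": np.sin(angle), "cos_a": np.cos(angle)})

model_cl = bmb.Model("y ~ sin_a + cos_a", df_cl)         # linear response, circular covariate
idata_cl = model_cl.fit()
az.summary(idata_cl)

If a more directly interpretable parameterization is needed, the coefficients on sin_a and cos_a can be combined into an amplitude and a phase.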
You can read more about this kind of regression and other circular statistical methods in the following books.\n\nCircular statistics in R\nModern directional statistics\nApplied Directional Statistics\nDirectional Statistics\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nscipy : 1.9.3\nbambi : 0.9.3\nnumpy : 1.23.5\narviz : 0.14.0\npandas : 1.5.2\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/t-test.html", - "href": "notebooks/t-test.html", + "objectID": "notebooks/distributional_models.html", + "href": "notebooks/distributional_models.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\nIn this notebook we demo two equivalent ways of performing a two-sample Bayesian t-test to compare the mean value of two Gaussian populations using Bambi.\n\n\nWe generate 160 values from a Gaussian with \\(\\mu=6\\) and \\(\\sigma=2.5\\) and another 120 values from a Gaussian’ with \\(\\mu=8\\) and \\(\\sigma=2\\)\n\na = np.random.normal(6, 2.5, 160)\nb = np.random.normal(8, 2, 120)\ndf = pd.DataFrame({\"Group\": [\"a\"] * 160 + [\"b\"] * 120, \"Val\": np.hstack([a, b])})\n\n\ndf.head()\n\n\n\n\n\n \n \n \n Group\n Val\n \n \n \n \n 0\n a\n 7.178588\n \n \n 1\n a\n 3.022561\n \n \n 2\n a\n 9.581767\n \n \n 3\n a\n 5.218370\n \n \n 4\n a\n 4.198528\n \n \n\n\n\n\n\naz.plot_violin({\"a\": a, \"b\": b});\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/arviz/plots/backends/matplotlib/violinplot.py:64: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(wspace=0)\n\n\n\n\n\nWhen we carry out a two sample t-test we are implicitly using a linear model that can be specified in different ways. One of these approaches is the following:\n\n\n\\[\n\\mu_i = \\beta_0 + \\beta_1 (i) + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon_i\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon_i\n\\]\nand\n\\[\n\\mu_1 = \\beta_0 + \\beta_1 + \\epsilon_i\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon_i = \\beta_0 + \\beta_1 + \\epsilon_i\\\\\n0 = \\beta_1\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_1\\) is 0.\n\n\n\nWe start by instantiating our model and specifying the model previously described.\n\nmodel_1 = bmb.Model(\"Val ~ Group\", df)\nresults_1 = model_1.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Val_sigma, Intercept, Group]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. 
We can inspect both the setup and the priors as following:\n\nmodel_1\n\n Formula: Val ~ Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 6.9762, sigma: 8.1247)\n Group ~ Normal(mu: 0, sigma: 12.4107)\n \n Auxiliary parameters\n Val_sigma ~ HalfStudentT(nu: 4, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_1.plot_priors();\n\nSampling: [Group, Intercept, Val_sigma]\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_1, kind=\"rank_vlines\");\n\n\n\n\n\naz.summary(results_1)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 6.116\n 0.179\n 5.778\n 6.449\n 0.003\n 0.002\n 3290.0\n 1795.0\n 1.00\n \n \n Group[b]\n 2.005\n 0.270\n 1.498\n 2.507\n 0.005\n 0.003\n 3537.0\n 1634.0\n 1.00\n \n \n Val_sigma\n 2.261\n 0.092\n 2.077\n 2.423\n 0.002\n 0.001\n 3217.0\n 1551.0\n 1.01\n \n \n\n\n\n\nIn the summary table we can see the 94% highest density interval for \\(\\beta_1\\) ranges from 1.511 to 2.499. Thus, according to the data and the model used, we conclude the difference between the two population means is somewhere between 1.2 and 2.2 and hence we support the hypotehsis that \\(\\beta_1 \\ne 0\\).\nSimilar conclusions can be made with the density estimate for the posterior distribution of \\(\\beta_1\\). As seen in the table, most of the probability for the difference in the mean roughly ranges from 1.2 to 2.2.\n\naz.plot_posterior(results_1, var_names=\"Group\", ref_val=0);\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 > 0\\). This probability is equal to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(results_1.posterior[\"Group\"] > 0).mean().item()\n\n1.0\n\n\nThe linear model implicit in the t-test can also be specified without an intercept term, such is the case of Model 2.\n\n\n\nWhen we carry out a two sample t-test we’re implicitly using the following model:\n\\[\n\\mu_i = \\beta_i + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon\n\\]\nand\n\\[\n\\mu_1 = \\beta_1 + \\epsilon\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon = \\beta_1 + \\epsilon\\\\\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_0 = \\beta_1\\).\n\n\n\nWe start by instantiating our model and specifying the model previously described. 
In this model we will bypass the intercept that Bambi adds by default by setting it to zero, even though setting to -1 has the same effect.\n\nmodel_2 = bmb.Model(\"Val ~ 0 + Group\", df)\nresults_2 = model_2.fit() \n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Val_sigma, Group]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_2\n\n Formula: Val ~ 0 + Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Group ~ Normal(mu: [0. 0.], sigma: [12.4107 12.4107])\n \n Auxiliary parameters\n Val_sigma ~ HalfStudentT(nu: 4, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_2.plot_priors();\n\nSampling: [Group, Val_sigma]\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_2, kind=\"rank_vlines\");\n\n\n\n\n\naz.summary(results_2)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Group[a]\n 6.113\n 0.177\n 5.806\n 6.465\n 0.003\n 0.002\n 2973.0\n 1385.0\n 1.0\n \n \n Group[b]\n 8.117\n 0.209\n 7.724\n 8.506\n 0.004\n 0.003\n 3341.0\n 1662.0\n 1.0\n \n \n Val_sigma\n 2.263\n 0.099\n 2.082\n 2.446\n 0.002\n 0.001\n 2727.0\n 1454.0\n 1.0\n \n \n\n\n\n\nIn this summary we can observe the estimated distribution of means for each population. A simple way to compare them is subtracting one to the other. In the next plot we can se that the entirety of the distribution of differences is higher than zero and that the mean of population 2 is higher than the mean of population 1 by a mean of 2.\n\npost_group = results_2.posterior[\"Group\"]\ndiff = post_group.sel(Group_dim=\"b\") - post_group.sel(Group_dim=\"a\") \naz.plot_posterior(diff, ref_val=0);\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 - \\beta_0 > 0\\). 
This probability equals to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(post_group > 0).mean().item()\n\n1.0\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nmatplotlib: 3.6.2\npandas : 1.5.2\nbambi : 0.9.3\narviz : 0.14.0\nnumpy : 1.23.5\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom matplotlib.lines import Line2D\n\n\nimport warnings\nwarnings.simplefilter(action='ignore', category=FutureWarning) # ArviZ\n\naz.style.use(\"arviz-doc\")\n\nFor most regression models, a function of the mean (aka the location parameter) of the response distribution is defined as a linear function of certain predictors, while the remaining parameters are considered auxiliary. For instance, if the response is a Gaussian, we model \\(\\mu\\) as a combination of predictors and \\(\\sigma\\) is estimated from the data, but assumed to be constant for all observations.\nInstead, with distributional models we can specify predictor terms for all parameters of the response distribution. This can be useful, for example, to model heteroskedasticity, i.e. unequal variance. In this notebook we are going to do exactly that.\nTo better understand distributional models, let’s begin fitting a non-distributional models. We are going to model the following syntetic dataset. And we are going to use a Gamma response with a log link function.\n\nrng = np.random.default_rng(121195)\nN = 200\na, b = 0.5, 1.1\nx = rng.uniform(-1.5, 1.5, N)\nshape = np.exp(0.3 + x * 0.5 + rng.normal(scale=0.1, size=N))\ny = rng.gamma(shape, np.exp(a + b * x) / shape, N)\ndata = pd.DataFrame({\"x\": x, \"y\": y})\nnew_data = pd.DataFrame({\"x\": np.linspace(-1.5, 1.5, num=50)})\n\n\n\n\nformula = bmb.Formula(\"y ~ x\")\nmodel_constant = bmb.Model(formula, data, family=\"gamma\", link=\"log\")\nmodel_constant\n\n Formula: y ~ x\n Family: gamma\n Link: mu = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n \n Auxiliary parameters\n alpha ~ HalfCauchy(beta: 1.0)\n\n\n\nmodel_constant.build()\nmodel_constant.graph()\n\n\n\n\nTake a moment to inspect the textual and graphical representations of the model, to ensure you understand how the parameters are related.\n\nidata_constant = model_constant.fit(random_seed=121195, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_alpha, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nOnce the model is fitted let’s visually inspect the result in terms of the mean (the line in the following figure) and the individual predictions (the band).\n\nmodel_constant.predict(idata_constant, kind=\"mean\", data=new_data)\nmodel_constant.predict(idata_constant, kind=\"pps\", data=new_data)\n\nqts_constant = (\n az.extract(idata_constant.posterior_predictive, var_names=\"y\")\n .quantile([0.025, 0.975], 
\"sample\")\n .to_numpy()\n)\nmean_constant = (\n az.extract(idata_constant.posterior_predictive, var_names=\"y\")\n .mean(\"sample\")\n .to_numpy()\n)\n\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\naz.plot_hdi(new_data[\"x\"], qts_constant, ax=ax, fill_kwargs={\"alpha\": 0.4})\nax.plot(new_data[\"x\"], mean_constant, color=\"C0\", lw=2)\nax.scatter(data[\"x\"], data[\"y\"], color=\"k\", alpha=0.2)\nax.set(xlabel=\"Predictor\", ylabel=\"Outcome\");\n\n\n\n\nThe model correctly model that the outcome increases with the values of the predictor. So far so good, let’s dive into the heart of the matter.\n\n\n\nNow we are going to build the same model as before with the only, but crucial difference, that we are also going to make alpha depend on the predictor. The syntax is very simple besides the usual “y ~ x”, we now add “alpha ~ x”. Neat!\n\nformula_varying = bmb.Formula(\"y ~ x\", \"alpha ~ x\")\nmodel_varying = bmb.Model(formula_varying, data, family=\"gamma\", link={\"mu\": \"log\", \"alpha\": \"log\"})\nmodel_varying\n\n Formula: y ~ x\n alpha ~ x\n Family: gamma\n Link: mu = log\n alpha = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n target = alpha\n Common-level effects\n alpha_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n alpha_x ~ Normal(mu: 0.0, sigma: 1.0)\n\n\n\nmodel_varying.build()\nmodel_varying.graph()\n\n\n\n\nTake another moment to inspect the textual and visual representations of model_varying and also go back and compare those from model_constant.\n\nidata_varying = model_varying.fit(random_seed=121195, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x, alpha_Intercept, alpha_x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nNow, with both models being fitted, let’s see how the alpha parameter differs between both models. In the next figure you can see a blueish KDE for the alpha parameter estimated with model_constant and 200 black KDEs for the alpha parameter estimated from the model_varying. You can count it if you want :-), but we know they should be 200 because we should have one for each one of the 200 observations.\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\nfor idx in idata_varying.posterior.coords.get(\"y_obs\"):\n values = idata_varying.posterior[\"alpha\"].sel(y_obs=idx).to_numpy().flatten()\n grid, pdf = az.kde(values)\n ax.plot(grid, pdf, lw=0.05, color=\"k\")\n\nvalues = idata_constant.posterior[\"y_alpha\"].to_numpy().flatten()\ngrid, pdf = az.kde(values)\nax.plot(grid, pdf, lw=2, color=\"C0\");\n\n# Create legend\nhandles = [\n Line2D([0], [0], label=\"Varying alpha\", lw=1.5, color=\"k\", alpha=0.6),\n Line2D([0], [0], label=\"Constant alpha\", lw=1.5, color=\"C0\")\n]\n\nlegend = ax.legend(handles=handles, loc=\"upper right\", fontsize=14)\n\nax.set(xlabel=\"Alpha posterior\", ylabel=\"Density\");\n\n\n\n\nThis is nice statistical art and a good insight into what the model is actully doing. But at this point you may be wondering how results looks like and more important how different they are from model_constant. 
Let’s plot the mean and predictions as we did before, but for both models.\n\nmodel_varying.predict(idata_varying, kind=\"mean\", data=new_data)\nmodel_varying.predict(idata_varying, kind=\"pps\", data=new_data)\n\nqts_varying = (\n az.extract(idata_varying.posterior_predictive, var_names=\"y\")\n .quantile([0.025, 0.975], \"sample\")\n .to_numpy()\n)\nmean_varying = (\n az.extract(idata_varying.posterior_predictive, var_names=\"y\")\n .mean(\"sample\")\n .to_numpy()\n)\n\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\naz.plot_hdi(new_data[\"x\"], qts_constant, ax=ax, fill_kwargs={\"alpha\": 0.4})\nax.plot(new_data[\"x\"], mean_constant, color=\"C1\", label=\"constant\")\n\naz.plot_hdi(new_data[\"x\"], qts_varying, ax=ax, fill_kwargs={\"alpha\": 0.4, \"color\":\"k\"})\nax.plot(new_data[\"x\"], mean_varying, color=\"k\", label=\"varying\")\nax.set(xlabel=\"Predictor\", ylabel=\"Outcome\");\nplt.legend();\n\n\n\n\nWe can see that mean is virtually the same for both model but the predictions are not, in particular for larger values of the predictiors.\nWe can also check that the models actually looks different under the LOO metric, with a slight preference for the varying model.\n\naz.compare({\"constant\": idata_constant, \"varying\": idata_varying})\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n varying\n 0\n -309.191836\n 3.851329\n 0.000000\n 0.933024\n 16.458759\n 0.00000\n False\n log\n \n \n constant\n 1\n -318.913528\n 2.958351\n 9.721692\n 0.066976\n 15.832033\n 4.59755\n False\n log\n \n \n\n\n\n\n\n\n\nTime to step up our game. In this example we are going to use the bikes data set from the University of California Irvine’s Machine Learning Repository, and we are going to estimate the number of rental bikes rented per hour over a 24 hour period.\nAs the number of bikes is a count variable we are going to use a negativebinomial family, and we are going to use two splines: one for the mean, and one for alpha.\n\ndata = bmb.load_data(\"bikes\")\n# Remove data, you may later try to refit the model to the whole data\ndata = data[::50]\ndata = data.reset_index(drop=True)\n\n\nformula = bmb.Formula(\n \"count ~ 0 + bs(hour, 8, intercept=True)\",\n \"alpha ~ 0 + bs(hour, 8, intercept=True)\"\n)\nmodel_bikes = bmb.Model(formula, data, family=\"negativebinomial\")\nmodel_bikes\n\n Formula: count ~ 0 + bs(hour, 8, intercept=True)\n alpha ~ 0 + bs(hour, 8, intercept=True)\n Family: negativebinomial\n Link: mu = log\n alpha = log\n Observations: 348\n Priors: \n target = mu\n Common-level effects\n bs(hour, 8, intercept=True) ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 
0.], sigma: [11.3704 13.9185\n 11.9926 10.6887 10.6819 12.1271 13.623 11.366 ])\n\n target = alpha\n Common-level effects\n alpha_bs(hour, 8, intercept=True) ~ Normal(mu: 0.0, sigma: 1.0)\n\n\n\nidata_bikes = model_bikes.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bs(hour, 8, intercept=True), alpha_bs(hour, 8, intercept=True)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:18<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 19 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\nhour = np.linspace(0, 23, num=200)\nnew_data = pd.DataFrame({\"hour\": hour})\nmodel_bikes.predict(idata_bikes, data=new_data, kind=\"pps\")\n\n\nq = [0.025, 0.975]\ndims = (\"chain\", \"draw\")\n\nmean = idata_bikes.posterior[\"count_mean\"].mean(dims).to_numpy()\nmean_interval = idata_bikes.posterior[\"count_mean\"].quantile(q, dims).to_numpy()\ny_interval = idata_bikes.posterior_predictive[\"count\"].quantile(q, dims).to_numpy()\n\nfig, ax = plt.subplots(figsize=(12, 4))\nax.scatter(data[\"hour\"], data[\"count\"], alpha=0.3, color=\"k\")\nax.plot(hour, mean, color=\"C3\")\nax.fill_between(hour, mean_interval[0],mean_interval[1], alpha=0.5, color=\"C1\");\naz.plot_hdi(hour, y_interval, fill_kwargs={\"color\": \"C1\", \"alpha\": 0.3}, ax=ax);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 2.0.2\nbambi : 0.12.0.dev0\nmatplotlib: 3.6.2\nnumpy : 1.25.0\narviz : 0.14.0\n\nWatermark: 2.3.1" }, { "objectID": "notebooks/plot_slopes.html", @@ -84,60 +49,74 @@ "text": "Bambi’s sub-package interpret features a set of functions to help interpret complex regression models. The sub-package is inspired by the R package marginaleffects. In this notebook we will discuss two functions slopes and plot_slopes. These two functions allow the modeler to easier interpret slopes, either by a inspecting a summary output or plotting them.\nBelow, it is described why estimating the slope of the prediction function is useful in interpreting generalized linear models (GLMs), how this methodology is implemented in Bambi, and how to use slopes and plot_slopes. It is assumed that the reader is familiar with the basics of GLMs. If not, refer to the Bambi Basic Building Blocks example.\n\n\nAssuming we have fit a linear regression model of the form\n\\[y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\dots + \\beta_k x_k + \\epsilon\\]\nthe “safest” interpretation of the regression coefficients \\(\\beta\\) is as a comparison between two groups of items that differ by \\(1\\) in the relevant predictor variable \\(x_i\\) while being identical in all the other predictors. Formally, the predicted difference between two items \\(i\\) and \\(j\\) that differ by an amount \\(n\\) on predictor \\(k\\), but are identical on all other predictors, the predicted difference is \\(y_i - y_j\\) is \\(\\beta_kx\\), on average.\nHowever, once we move away from a regression model with a Gaussian response, the identity function, and no interaction terms, the interpretation of the coefficients are not as straightforward. For example, in a logistic regression model, the coefficients are on a different scale and are measured in logits (log odds), not probabilities or percentage points. 
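A small numeric sketch (with a purely hypothetical coefficient, not taken from any model in this notebook) shows why a fixed change on the logit scale translates into different changes on the probability scale depending on where the baseline sits:

import numpy as np

def expit(eta):
    # inverse logit: maps log-odds to a probability
    return 1 / (1 + np.exp(-eta))

beta = 0.5  # hypothetical slope on the logit (log-odds) scale
for eta0 in (-3.0, 0.0, 3.0):  # different baseline log-odds
    delta = expit(eta0 + beta) - expit(eta0)
    print(f"baseline logit {eta0:+.1f}: a one unit change in x shifts P(y=1) by {delta:+.3f}")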
Thus, you cannot interpret the coefficents as a “one unit increase in \\(x_k\\) is associated with an \\(n\\) percentage point decrease in \\(y\\)”. First, the logits must be converted to the probability scale. Secondly, a one unit change in \\(x_k\\) may produce a larger or smaller change in the outcome, depending upon how far away from zero the logits are.\nslopes and plot_slopes, by default, computes quantities of interest on the response scale for GLMs. For example, for a logistic regression model, this is the probability scale, and for a Poisson regression model, this is the count scale.\n\n\nSpecifying interactions in a regression model is a way of allowing parameters to be conditional on certain aspects of the data. By contrast, for a model with no interactions, the parameters are not conditional and thus, the value of one parameter is not dependent on the value of another covariate. However, once interactions exist, multiple parameters are always in play at the same time. Additionally, interactions can be specified for either categorical, continuous, or both types of covariates. Thus, making the interpretation of the parameters more difficult.\nWith GLMs, every covariate essentially interacts with itself because of the link function. To demonstrate parameters interacting with themselves, consider the mean of a Gaussian linear model with an identity link function\n\\[\\mu = \\alpha + \\beta x\\]\nwhere the rate of change in \\(\\mu\\) with respect to \\(x\\) is just \\(\\beta\\), i.e., the rate of change is constant no matter what the value of \\(x\\) is. But when we consider GLMs with link functions used to map outputs to exponential family distribution parameters, calculating the derivative of the mean output \\(\\mu\\) with respect to the predictor is not as straightforward as in the Gaussian linear model. For example, computing the rate of change in a binomial probability \\(p\\) with respect to \\(x\\)\n\\[p = \\frac{exp(\\alpha + \\beta x)}{1 + exp(\\alpha + \\beta x)}\\]\nAnd taking the derivative of \\(p\\) with respect to \\(x\\) yields\n\\[\\frac{\\partial p}{\\partial x} = \\frac{\\beta}{2(1 + cosh(\\alpha + \\beta x))}\\]\nSince \\(x\\) appears in the derivative, the impact of a change in \\(x\\) depends upon \\(x\\), i.e., an interaction with itself even though no interaction term was specified in the model.Thus, visualizing the rate of change in the mean response with respect to a covariate \\(x\\) becomes a useful tool in interpreting GLMs.\n\n\n\n\nHere, we adopt the notation from Chapter 14.4 of Regression and Other Stories to first describe average predictive differences which is essential to computing slopes, and then secondly, average predictive slopes. Assume we have fit a Bambi model predicting an outcome \\(Y\\) based on inputs \\(X\\) and parameters \\(\\theta\\). Consider the following scalar inputs:\n\\[w: \\text{the input of interest}\\] \\[c: \\text{all the other inputs}\\] \\[X = (w, c)\\]\nIn contrast to comparisons, for slopes we are interested in comparing \\(w^{\\text{value}}\\) to \\(w^{\\text{value}+\\epsilon}\\) (perhaps age = 60 and 60.0001 respectively) with all other inputs \\(c\\) held constant. 
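The formula below makes this precise; as a rough preview, the comparison can be written directly as a finite difference. The prediction function here is a made-up logistic curve with hypothetical coefficients, not Bambi's internal machinery:

import numpy as np

def predict_mean(w, c, intercept=0.5, b_w=-0.4, b_c=-1.0):
    # hypothetical prediction function E(y | w, c) on the probability scale
    eta = intercept + b_w * w + b_c * c
    return 1 / (1 + np.exp(-eta))

eps = 1e-4
w_value, c_value = 1.5, 0.5  # w at a value of interest, all other inputs c held fixed
diff = predict_mean(w_value + eps, c_value) - predict_mean(w_value, c_value)  # difference in the direction of increasing w
print(diff / eps)  # approximate rate of change of the expected outcome at w = 1.5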
The predictive difference in the outcome changing only \\(w\\) is:\n\\[\\text{average predictive difference} = \\mathbb{E}(y|w^{\\text{value}}, c, \\theta) - \\mathbb{E}(y|w^{\\text{value}+\\epsilon}, c, \\theta)\\]\nSelecting \\(w\\) and \\(w^{\\text{value}+\\epsilon}\\) and averaging over all other inputs \\(c\\) in the data gives you a new “hypothetical” dataset and corresponds to counting all pairs of transitions of \\((w^\\text{value})\\) to \\((w^{\\text{value}+\\epsilon})\\), i.e., differences in \\(w\\) with \\(c\\) held constant. The difference between these two terms is the average predictive difference.\nHowever, to obtain the slope estimate, we need to take the above formula and divide by \\(\\epsilon\\) to obtain the average predictive slope:\n\\[\\text{average predictive slope} = \\frac{\\mathbb{E}(y|w^{\\text{value}}, c, \\theta) - \\mathbb{E}(y|w^{\\text{value}+\\epsilon}, c, \\theta)}{\\epsilon}\\]\n\n\n\nThe objective of slopes and plot_slopes is to compute the rate of change (slope) in the mean of the response \\(y\\) with respect to a small change \\(\\epsilon\\) in the predictor \\(x\\) conditional on other covariates \\(c\\) specified in the model. \\(w\\) is specified by the user and the original value is either provided by the user, else a default value (the mean) is computed by Bambi. The values for the other covariates \\(c\\) specified in the model can be determined under the following three scenarios:\n\nuser provided values\na grid of equally spaced and central values\nempirical distribution (original data used to fit the model)\n\nIn the case of (1) and (2) above, Bambi assembles all pairwise combinations (transitions) of \\(w\\) and \\(c\\) into a new “hypothetical” dataset. In (3), Bambi uses the original \\(c\\), and adds a small amount \\(\\epsilon\\) to each unit of observation’s \\(w\\). In each scenario, predictions are made on the data using the fitted model. Once the predictions are made, comparisons are computed using the posterior samples by taking the difference in the predicted outcome for each pair of transitions and dividing by \\(\\epsilon\\). The average of these slopes is the average predictive slopes.\nFor variables \\(w\\) with a string or categorical data type, the comparisons function is called to compute the expected difference in group means. Please refer to the comparisons documentation for more details.\nBelow, we present several examples showing how to use Bambi to perform these computations for us, and to return either a summary dataframe, or a visualization of the results.\n\nimport arviz as az\nimport pandas as pd\n\nimport bambi as bmb\n\n\n\n\nTo demonstrate slopes and plot_slopes, we will use the well switching dataset to model the probability a household in Bangladesh switches water wells. The data are for an area of Arahazar Upazila, Bangladesh. The researchers labelled each well with its level of arsenic and an indication of whether the well was “safe” or “unsafe”. Those using unsafe wells were encouraged to switch. After several years, it was determined whether each household using an unsafe well had changed its well. 
The data contains \\(3020\\) observations on the following five variables:\n\nswitch: a factor with levels no and yes indicating whether the household switched to a new well\narsenic: the level of arsenic in the old well (measured in micrograms per liter)\ndist: the distance to the nearest safe well (measured in meters)\nassoc: a factor with levels no and yes indicating whether the household is a member of an arsenic education group\neduc: years of education of the household head\n\nFirst, a logistic regression model with no interactions is fit to the data. Subsequently, to demonstrate the benefits of plot_slopes in interpreting interactions, we will fit a logistic regression model with an interaction term.\n\ndata = pd.read_csv(\"http://www.stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat\", sep=\" \")\ndata[\"switch\"] = pd.Categorical(data[\"switch\"])\ndata[\"dist100\"] = data[\"dist\"] / 100\ndata[\"educ4\"] = data[\"educ\"] / 4\ndata.head()\n\n\n\n\n\n \n \n \n switch\n arsenic\n dist\n assoc\n educ\n dist100\n educ4\n \n \n \n \n 1\n 1\n 2.36\n 16.826000\n 0\n 0\n 0.16826\n 0.0\n \n \n 2\n 1\n 0.71\n 47.321999\n 0\n 0\n 0.47322\n 0.0\n \n \n 3\n 0\n 2.07\n 20.966999\n 0\n 10\n 0.20967\n 2.5\n \n \n 4\n 1\n 1.15\n 21.486000\n 0\n 12\n 0.21486\n 3.0\n \n \n 5\n 1\n 1.10\n 40.874001\n 1\n 14\n 0.40874\n 3.5\n \n \n\n\n\n\n\nwell_model = bmb.Model(\n \"switch ~ dist100 + arsenic + educ4\",\n data,\n family=\"bernoulli\"\n)\n\nwell_idata = well_model.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nModeling the probability that switch==0\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, dist100, arsenic, educ4]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.\n\n\n\n\nFirst, an example of scenario 1 (user provided values) is given below. In both plot_slopes and slopes, \\(w\\) and \\(c\\) are represented by wrt (with respect to) and conditional, respectively. The modeler has the ability to pass their own values for wrt and conditional by using a dictionary where the key-value pairs are the covariate and value(s) of interest.\nFor example, if we wanted to compute the slope of the probability of switching wells for a typical arsenic value of \\(1.3\\) conditional on a range of dist and educ values, we would pass the following dictionary in the code block below. By default, for \\(w\\), Bambi compares \\(w^\\text{value}\\) to \\(w^{\\text{value} + \\epsilon}\\) where \\(\\epsilon =\\) 1e-4. However, the value for \\(\\epsilon\\) can be changed by passing a value to the argument eps.\nThus, in this example, \\(w^\\text{value} = 1.3\\) and \\(w^{\\text{value} + \\epsilon} = 1.3001\\). The user is not limited to passing a list for the values. A np.array can also be used. Furthermore, Bambi by default, maps the order of the dict keys to the main, group, and panel of the matplotlib figure. Below, since dist100 is the first key, this is used for the x-axis, and educ4 is used for the group (color). 
If a third key was passed, it would be used for the panel (facet).\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model,\n well_idata,\n wrt={\"arsenic\": 1.3},\n conditional={\"dist100\": [0.20, 0.50, 0.80], \"educ4\": [1.00, 1.20, 2.00]},\n)\nfig.set_size_inches(7, 3)\nfig.axes[0].set_ylabel(\"Slope of Well Switching Probability\");\n\n\n\n\nThe plot above shows that, for example, conditional on dist100 \\(= 0.2\\) and educ4 \\(= 1.0\\) a unit increase in arsenic is associated with households being \\(11\\)% less likely to switch wells. Notice that even though we fit a logistic regression model where the coefficients are on the log-odds scale, the slopes function returns the slope on the probability scale. Thus, we can interpret the y-axis (slope) as the expected change in the probability of switching wells for a unit increase in arsenic conditional on the specified covariates.\nslopes can be called directly to view a summary dataframe that includes the term name, estimate type (discussed in detail in the interpreting coefficients as an elasticity section), values \\(w\\) used to compute the estimate, the specified conditional covariates \\(c\\), and the expected slope of the outcome with the uncertainty interval (by default the \\(94\\)% highest density interval is computed).\n\nbmb.interpret.slopes(\n well_model,\n well_idata,\n wrt={\"arsenic\": 1.5},\n conditional={\n \"dist100\": [0.20, 0.50, 0.80], \n \"educ4\": [1.00, 1.20, 2.00]\n }\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n dist100\n educ4\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n (1.5, 1.5001)\n 0.2\n 1.0\n -0.110797\n -0.128775\n -0.092806\n \n \n 1\n arsenic\n dydx\n (1.5, 1.5001)\n 0.2\n 1.2\n -0.109867\n -0.126725\n -0.091065\n \n \n 2\n arsenic\n dydx\n (1.5, 1.5001)\n 0.2\n 2.0\n -0.105618\n -0.122685\n -0.088383\n \n \n 3\n arsenic\n dydx\n (1.5, 1.5001)\n 0.5\n 1.0\n -0.116087\n -0.134965\n -0.096843\n \n \n 4\n arsenic\n dydx\n (1.5, 1.5001)\n 0.5\n 1.2\n -0.115632\n -0.134562\n -0.096543\n \n \n 5\n arsenic\n dydx\n (1.5, 1.5001)\n 0.5\n 2.0\n -0.113140\n -0.130448\n -0.093209\n \n \n 6\n arsenic\n dydx\n (1.5, 1.5001)\n 0.8\n 1.0\n -0.117262\n -0.136850\n -0.098549\n \n \n 7\n arsenic\n dydx\n (1.5, 1.5001)\n 0.8\n 1.2\n -0.117347\n -0.136475\n -0.098044\n \n \n 8\n arsenic\n dydx\n (1.5, 1.5001)\n 0.8\n 2.0\n -0.116957\n -0.135079\n -0.096476\n \n \n\n\n\n\nSince all covariates used to fit the model were also specified to compute the slopes, no default value is used for unspecified covariates. A default value is computed for the unspecified covariates because in order to peform predictions, Bambi is expecting a value for each covariate used to fit the model. Additionally, with GLM models, average predictive slopes are conditional in the sense that the estimate depends on the values of all the covariates in the model. Thus, for unspecified covariates, slopes and plot_slopes computes a default value (mean or mode based on the data type of the covariate). Each row in the summary dataframe is read as “the slope (or rate of change) of the probability of switching wells with respect to a small change in \\(w\\) conditional on \\(c\\) is \\(y\\)”.\n\n\n\nUsers can also compute slopes on multiple values for wrt. For example, if we want to compute the slope of \\(y\\) with respect to arsenic \\(= 1.5\\), \\(2.0\\), and \\(2.5\\), simply pass a list or numpy array as the dictionary values for wrt. 
Keeping the conditional covariate and values the same, the following slope estimates are computed below.\n\nmultiple_values = bmb.interpret.slopes(\n well_model,\n well_idata,\n wrt={\"arsenic\": [1.5, 2.0, 2.5]},\n conditional={\n \"dist100\": [0.20, 0.50, 0.80], \n \"educ4\": [1.00, 1.20, 2.00]\n }\n)\n\nmultiple_values.head(6)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n dist100\n educ4\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n (1.5, 1.5001)\n 0.2\n 1.0\n -0.110797\n -0.128775\n -0.092806\n \n \n 1\n arsenic\n dydx\n (2.0, 2.0001)\n 0.2\n 1.0\n -0.109867\n -0.126725\n -0.091065\n \n \n 2\n arsenic\n dydx\n (2.5, 2.5001)\n 0.2\n 1.0\n -0.105618\n -0.122685\n -0.088383\n \n \n 3\n arsenic\n dydx\n (1.5, 1.5001)\n 0.2\n 1.2\n -0.116087\n -0.134965\n -0.096843\n \n \n 4\n arsenic\n dydx\n (2.0, 2.0001)\n 0.2\n 1.2\n -0.115632\n -0.134562\n -0.096543\n \n \n 5\n arsenic\n dydx\n (2.5, 2.5001)\n 0.2\n 1.2\n -0.113140\n -0.130448\n -0.093209\n \n \n\n\n\n\nThe output above is essentially the same as the summary dataframe when we only passed one value to wrt. However, now each element (value) in the list gets a small amount \\(\\epsilon\\) added to it, and the slope is calculated for each of these values.\n\n\n\nAs stated in the interpreting interaction effects section, interpreting coefficients of multiple interaction terms can be difficult and cumbersome. Thus, plot_slopes provides an effective way to visualize the conditional slopes of the interaction effects. Below, we will use the same well switching dataset, but with interaction terms. Specifically, one interaction is added between dist100 and educ4, and another between arsenic and educ4.\n\nwell_model_interact = bmb.Model(\n \"switch ~ dist100 + arsenic + educ4 + dist100:educ4 + arsenic:educ4\",\n data,\n family=\"bernoulli\"\n)\n\nwell_idata_interact = well_model_interact.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nModeling the probability that switch==0\n\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, dist100, arsenic, educ4, dist100:educ4, arsenic:educ4]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:15<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 15 seconds.\n\n\n\n# summary of coefficients\naz.summary(well_idata_interact)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.097\n 0.122\n -0.322\n 0.137\n 0.003\n 0.002\n 2259.0\n 2203.0\n 1.0\n \n \n dist100\n 1.320\n 0.175\n 0.982\n 1.640\n 0.004\n 0.003\n 2085.0\n 2457.0\n 1.0\n \n \n arsenic\n -0.398\n 0.061\n -0.521\n -0.291\n 0.001\n 0.001\n 2141.0\n 2558.0\n 1.0\n \n \n educ4\n 0.102\n 0.080\n -0.053\n 0.246\n 0.002\n 0.001\n 1935.0\n 2184.0\n 1.0\n \n \n dist100:educ4\n -0.330\n 0.106\n -0.528\n -0.136\n 0.002\n 0.002\n 2070.0\n 2331.0\n 1.0\n \n \n arsenic:educ4\n -0.079\n 0.043\n -0.161\n -0.000\n 0.001\n 0.001\n 2006.0\n 2348.0\n 1.0\n \n \n\n\n\n\nThe coefficients of the linear model are shown in the table above. The interaction coefficents indicate the slope varies in a continuous fashion with the continuous variable.\nA negative value for arsenic:dist100 indicates that the “effect” of arsenic on the outcome is less negative as distance from the well increases. 
Similarly, a negative value for arsenic:educ4 indicates that the “effect” of arsenic on the outcome is more negative as education increases. Remember, these coefficients are still on the logit scale. Furthermore, as more variables and interaction terms are added to the model, interpreting these coefficients becomes more difficult.\nThus, lets use plot_slopes to visually see how the slope changes with respect to arsenic conditional on dist100 and educ4 changing. Notice in the code block below how parameters are passed to the subplot_kwargs and fig_kwargs arguments. At times, it can be useful to pass specific group and panel arguments to aid in the interpretation of the plot. Therefore, subplot_kwargs allows the user to manipulate the plotting by passing a dictionary where the keys are {\"main\": ..., \"group\": ..., \"panel\": ...} and the values are the names of the covariates to be plotted. fig_kwargs are figure level key word arguments such as figsize and sharey.\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=[\"dist100\", \"educ4\"],\n subplot_kwargs={\"main\": \"dist100\", \"group\": \"educ4\", \"panel\": \"educ4\"},\n fig_kwargs={\"figsize\": (16, 4), \"sharey\": True},\n legend=False\n)\n\n\n\n\nWith interaction terms now defined, it can be seen how the slope of the outcome with respect to arsenic differ depending on the value of educ4. Especially in the case of educ4 \\(= 4.25\\), the slope is more “constant”, but with greater uncertainty. Lets compare this with the model that does not include any interaction terms.\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model,\n well_idata,\n wrt=\"arsenic\",\n conditional=[\"dist100\", \"educ4\"],\n subplot_kwargs={\"main\": \"dist100\", \"group\": \"educ4\", \"panel\": \"educ4\"},\n fig_kwargs={\"figsize\": (16, 4), \"sharey\": True},\n legend=False\n)\n\n\n\n\nFor the non-interaction model, conditional on a range of values for educ4 and dist100, the slopes of the outcome are nearly identical.\n\n\n\nEvaluating average predictive slopes at central values for the conditional covariates \\(c\\) can be problematic when the inputs have a large variance since no single central value (mean, median, etc.) is representative of the covariate. This is especially true when \\(c\\) exhibits bi or multimodality. Thus, it may be desireable to use the empirical distribution of \\(c\\) to compute the predictive slopes, and then average over a specific or set of covariates to obtain average slopes. 
To achieve unit level slopes, do not pass a parameter into conditional and or specify None.\n\nunit_level = bmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=None\n)\n\n# empirical distribution\nprint(unit_level.shape[0] == well_model_interact.data.shape[0])\nunit_level.head(10)\n\nTrue\n\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n dist100\n educ4\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n (2.36, 2.3601)\n 0.16826\n 0.00\n -0.084280\n -0.105566\n -0.063403\n \n \n 1\n arsenic\n dydx\n (0.71, 0.7101)\n 0.47322\n 0.00\n -0.097837\n -0.125057\n -0.070959\n \n \n 2\n arsenic\n dydx\n (2.07, 2.0701)\n 0.20967\n 2.50\n -0.118093\n -0.139848\n -0.093442\n \n \n 3\n arsenic\n dydx\n (1.15, 1.1501)\n 0.21486\n 3.00\n -0.150638\n -0.194765\n -0.108946\n \n \n 4\n arsenic\n dydx\n (1.1, 1.1001)\n 0.40874\n 3.50\n -0.161272\n -0.214761\n -0.108663\n \n \n 5\n arsenic\n dydx\n (3.9, 3.9001)\n 0.69518\n 2.25\n -0.073908\n -0.080525\n -0.067493\n \n \n 6\n arsenic\n dydx\n (2.97, 2.9701000000000004)\n 0.80711\n 1.00\n -0.108482\n -0.123517\n -0.093042\n \n \n 7\n arsenic\n dydx\n (3.24, 3.2401000000000004)\n 0.55146\n 2.50\n -0.088049\n -0.097939\n -0.078020\n \n \n 8\n arsenic\n dydx\n (3.28, 3.2801)\n 0.52647\n 0.00\n -0.087388\n -0.107331\n -0.068076\n \n \n 9\n arsenic\n dydx\n (2.52, 2.5201000000000002)\n 0.75072\n 0.00\n -0.099035\n -0.129517\n -0.073222\n \n \n\n\n\n\n\nwell_model_interact.data.head(10)\n\n\n\n\n\n \n \n \n switch\n arsenic\n dist\n assoc\n educ\n dist100\n educ4\n \n \n \n \n 1\n 1\n 2.36\n 16.826000\n 0\n 0\n 0.16826\n 0.00\n \n \n 2\n 1\n 0.71\n 47.321999\n 0\n 0\n 0.47322\n 0.00\n \n \n 3\n 0\n 2.07\n 20.966999\n 0\n 10\n 0.20967\n 2.50\n \n \n 4\n 1\n 1.15\n 21.486000\n 0\n 12\n 0.21486\n 3.00\n \n \n 5\n 1\n 1.10\n 40.874001\n 1\n 14\n 0.40874\n 3.50\n \n \n 6\n 1\n 3.90\n 69.517998\n 1\n 9\n 0.69518\n 2.25\n \n \n 7\n 1\n 2.97\n 80.710999\n 1\n 4\n 0.80711\n 1.00\n \n \n 8\n 1\n 3.24\n 55.146000\n 0\n 10\n 0.55146\n 2.50\n \n \n 9\n 1\n 3.28\n 52.646999\n 1\n 0\n 0.52647\n 0.00\n \n \n 10\n 1\n 2.52\n 75.071999\n 1\n 0\n 0.75072\n 0.00\n \n \n\n\n\n\nAbove, unit_level is the slopes summary dataframe and well_model_interact.data is the empirical data used to fit the model. Notice how the values for \\(c\\) are identical in both dataframes. However, for \\(w\\), the values are the original \\(w\\) value plus \\(\\epsilon\\). Thus, the estimate value represents the instantaneous rate of change for that unit of observation. However, these unit level slopes are difficult to interpret since each row may have a different slope estimate. Therefore, it is useful to average over (marginalize) the estimates to summarize the unit level predictive slopes.\n\n\nSince the empirical distrubution is used for computing the average predictive slopes, the same number of rows (\\(3020\\)) is returned as the data used to fit the model. To average over a covariate, use the average_by argument. If True is passed, then slopes averages over all covariates. 
Else, if a single or list of covariates are passed, then slopes averages by the covariates passed.\n\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=None,\n average_by=True\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n -0.111342\n -0.134846\n -0.088171\n \n \n\n\n\n\nThe code block above is equivalent to taking the mean of the estimate and uncertainty columns. For example:\n\nunit_level[[\"estimate\", \"lower_3.0%\", \"upper_97.0%\"]].mean()\n\nestimate -0.111342\nlower_3.0% -0.134846\nupper_97.0% -0.088171\ndtype: float64\n\n\n\n\n\nAveraging over all covariates may not be desired, and you would rather average by a group or specific covariate. To perform averaging by subgroups, users can pass a single or list of covariates to average_by to average over specific covariates. For example, if we wanted to average by educ4:\n\n# average by educ4\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=None,\n average_by=\"educ4\"\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n educ4\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n 0.00\n -0.092389\n -0.119320\n -0.068167\n \n \n 1\n arsenic\n dydx\n 0.25\n -0.101704\n -0.126096\n -0.076910\n \n \n 2\n arsenic\n dydx\n 0.50\n -0.102112\n -0.122443\n -0.082142\n \n \n 3\n arsenic\n dydx\n 0.75\n -0.106004\n -0.124247\n -0.088132\n \n \n 4\n arsenic\n dydx\n 1.00\n -0.110580\n -0.127803\n -0.093221\n \n \n 5\n arsenic\n dydx\n 1.25\n -0.112334\n -0.128771\n -0.094870\n \n \n 6\n arsenic\n dydx\n 1.50\n -0.114875\n -0.132652\n -0.096790\n \n \n 7\n arsenic\n dydx\n 1.75\n -0.122557\n -0.142921\n -0.101423\n \n \n 8\n arsenic\n dydx\n 2.00\n -0.125187\n -0.148096\n -0.101350\n \n \n 9\n arsenic\n dydx\n 2.25\n -0.125367\n -0.150676\n -0.099852\n \n \n 10\n arsenic\n dydx\n 2.50\n -0.130748\n -0.159912\n -0.101058\n \n \n 11\n arsenic\n dydx\n 2.75\n -0.137422\n -0.170662\n -0.102995\n \n \n 12\n arsenic\n dydx\n 3.00\n -0.136103\n -0.172119\n -0.099548\n \n \n 13\n arsenic\n dydx\n 3.25\n -0.156941\n -0.202215\n -0.107625\n \n \n 14\n arsenic\n dydx\n 3.50\n -0.142571\n -0.186079\n -0.098362\n \n \n 15\n arsenic\n dydx\n 3.75\n -0.138336\n -0.181042\n -0.093120\n \n \n 16\n arsenic\n dydx\n 4.00\n -0.138152\n -0.185974\n -0.089611\n \n \n 17\n arsenic\n dydx\n 4.25\n -0.176623\n -0.244273\n -0.107141\n \n \n\n\n\n\n\n# average by both educ4 and dist100\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=None,\n average_by=[\"educ4\", \"dist100\"]\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n educ4\n dist100\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dydx\n 0.00\n 0.00591\n -0.085861\n -0.109133\n -0.061614\n \n \n 1\n arsenic\n dydx\n 0.00\n 0.02409\n -0.096272\n -0.127518\n -0.069670\n \n \n 2\n arsenic\n dydx\n 0.00\n 0.02454\n -0.056617\n -0.065433\n -0.046970\n \n \n 3\n arsenic\n dydx\n 0.00\n 0.02791\n -0.097646\n -0.128131\n -0.069660\n \n \n 4\n arsenic\n dydx\n 0.00\n 0.03252\n -0.076300\n -0.095832\n -0.057900\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 2992\n arsenic\n dydx\n 4.00\n 1.13727\n -0.070078\n -0.094698\n -0.046623\n \n \n 2993\n arsenic\n dydx\n 4.00\n 1.14418\n -0.125547\n -0.172943\n -0.075368\n \n \n 2994\n arsenic\n dydx\n 4.00\n 1.25308\n -0.156780\n -0.218836\n -0.088258\n \n \n 2995\n arsenic\n dydx\n 4.00\n 1.67025\n -0.161465\n 
-0.227211\n -0.085394\n \n \n 2996\n arsenic\n dydx\n 4.25\n 0.29633\n -0.176623\n -0.244273\n -0.107141\n \n \n\n2997 rows × 7 columns\n\n\n\nIt is still possible to use plot_slopes when passing an argument to average_by. In the plot below, the empirical distribution is used to compute unit level slopes with respect to arsenic and then averaged over educ4 to obtain the average predictive slopes.\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=None,\n average_by=\"educ4\"\n)\nfig.set_size_inches(7, 3)\n\n\n\n\n\n\n\n\nIn some fields, such as economics, it is useful to interpret the results of a regression model in terms of an elasticity (a percent change in \\(x\\) is associated with a percent change in \\(y\\)) or semi-elasticity (a unit change in \\(x\\) is associated with a percent change in \\(y\\), or vice versa). Typically, this is achieved by fitting a model where either the outcome and or the covariates are log-transformed. However, since the log transformation is performed by the modeler, to compute elasticities for slopes and plot_slopes, Bambi “post-processes” the predictions to compute the elasticities. Below, it is shown the possible elasticity arguments and how they are computed for slopes and plot_slopes:\n\neyex: a percentage point increase in \\(x_1\\) is associated with an \\(n\\) percentage point increase in \\(y\\)\n\n\\[\\frac{\\partial \\hat{y}}{\\partial x_1} * \\frac{x_1}{\\hat{y}}\\]\n\neydx: a unit increase in \\(x_1\\) is associated with an \\(n\\) percentage point increase in \\(y\\)\n\n\\[\\frac{\\partial \\hat{y}}{\\partial x_1} * \\frac{1}{\\hat{y}}\\]\n\ndyex: a percentage point increase in \\(x_1\\) is associated with an \\(n\\) unit increase in \\(y\\)\n\n\\[\\frac{\\partial \\hat{y}}{\\partial x_1} * x_1\\]\nBelow, each code cell shows the same model, and wrt and conditional argument, but with a different elasticity (slope) argument. By default, dydx (a derivative with no post-processing) is used.\n\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n slope=\"eyex\",\n conditional=None,\n average_by=True\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n eyex\n -0.525124\n -0.652708\n -0.396082\n \n \n\n\n\n\n\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n slope=\"eydx\",\n conditional=None,\n average_by=True\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n eydx\n -0.286753\n -0.351592\n -0.220459\n \n \n\n\n\n\n\nbmb.interpret.slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n slope=\"dyex\",\n conditional=None,\n average_by=True\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n arsenic\n dyex\n -0.167616\n -0.201147\n -0.134605\n \n \n\n\n\n\nslope is also an argument for plot_slopes. 
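To make those three definitions concrete, here is a small sketch of the post-processing itself, using made-up numbers rather than output from the model above:

dydx = -0.11  # hypothetical derivative of the predicted probability with respect to arsenic
x = 1.5       # hypothetical value of the covariate
y_hat = 0.55  # hypothetical predicted probability at that point

eyex = dydx * x / y_hat  # percent change in y for a percent change in x
eydx = dydx / y_hat      # percent change in y for a unit change in x
dyex = dydx * x          # unit change in y for a percent change in x
print(eyex, eydx, dyex)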
Below, we visualize the elasticity with respect to arsenic conditional on a range of dist100 and educ4 values (notice this is the same plot as in the conditional slopes section).\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"arsenic\",\n conditional=[\"dist100\", \"educ4\"],\n slope=\"eyex\",\n subplot_kwargs={\"main\": \"dist100\", \"group\": \"educ4\", \"panel\": \"educ4\"},\n fig_kwargs={\"figsize\": (16, 4), \"sharey\": True},\n legend=False\n)\n\n\n\n\n\n\n\nAs mentioned in the computing slopes section, if you pass a variable with a string or categorical data type, the comparisons function will be called to compute the expected difference in group means. Here, we fit the same interaction model as above, albeit, by specifying educ4 as an ordinal data type.\n\ndata = pd.read_csv(\"http://www.stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat\", sep=\" \")\ndata[\"switch\"] = pd.Categorical(data[\"switch\"])\ndata[\"dist100\"] = data[\"dist\"] / 100\ndata[\"educ4\"] = pd.Categorical(data[\"educ\"] / 4, ordered=True)\n\n\nwell_model_interact = bmb.Model(\n \"switch ~ dist100 + arsenic + educ4 + dist100:educ4 + arsenic:educ4\",\n data,\n family=\"bernoulli\"\n)\n\nwell_idata_interact = well_model_interact.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nModeling the probability that switch==0\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, dist100, arsenic, educ4, dist100:educ4, arsenic:educ4]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 05:18<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 319 seconds.\n\n\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model_interact,\n well_idata_interact,\n wrt=\"educ4\",\n conditional=\"dist100\",\n average_by=\"dist100\"\n)\nfig.set_size_inches(7, 3)\n\n\n\n\nAs the model was fit with educ4 as a categorical data type, Bambi recognized this, and calls comparisons to compute the differences between each level of educ4. As educ4 contains many category levels, a covariate must be passed to average_by in order to perform plotting. Below, we can see this plot is equivalent to plot_comparisons.\n\nfig, ax = bmb.interpret.plot_comparisons(\n well_model_interact,\n well_idata_interact,\n contrast=\"educ4\",\n conditional=\"dist100\",\n average_by=\"dist100\"\n)\nfig.set_size_inches(7, 3)\n\n\n\n\nHowever, computing the predictive difference between each educ4 level may not be desired. Thus, in plot_slopes, as in plot_comparisons, if wrt is a categorical or string data type, it is possible to specify the wrt values. For example, if we wanted to compute the expected difference in probability of switching wells for when educ4 is \\(4\\) versus \\(1\\) conditional on a range of dist100 and arsenic values, we would pass the following dictionary in the code block below. 
Please refer to the comparisons documentation for more details.\n\nfig, ax = bmb.interpret.plot_slopes(\n well_model_interact,\n well_idata_interact,\n wrt={\"educ4\": [1, 4]},\n conditional=\"dist100\",\n average_by=\"dist100\"\n)\nfig.set_size_inches(7, 3)\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Aug 16 2023\n\nPython implementation: CPython\nPython version : 3.11.0\nIPython version : 8.13.2\n\npandas: 2.0.1\narviz : 0.15.1\nbambi : 0.10.0.dev0\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/wald_gamma_glm.html", - "href": "notebooks/wald_gamma_glm.html", + "objectID": "notebooks/logistic_regression.html", + "href": "notebooks/logistic_regression.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\n\n\nIn this notebook we use a data set consisting of 67856 insurance policies and 4624 (6.8%) claims in Australia between 2004 and 2005. The original source of this dataset is the book Generalized Linear Models for Insurance Data by Piet de Jong and Gillian Z. Heller.\n\ndata = bmb.load_data(\"carclaims\")\ndata.head()\n\n\n\n\n\n \n \n \n veh_value\n exposure\n clm\n numclaims\n claimcst0\n veh_body\n veh_age\n gender\n area\n agecat\n \n \n \n \n 0\n 1.06\n 0.303901\n 0\n 0\n 0.0\n HBACK\n 3\n F\n C\n 2\n \n \n 1\n 1.03\n 0.648871\n 0\n 0\n 0.0\n HBACK\n 2\n F\n A\n 4\n \n \n 2\n 3.26\n 0.569473\n 0\n 0\n 0.0\n UTE\n 2\n F\n E\n 2\n \n \n 3\n 4.14\n 0.317591\n 0\n 0\n 0.0\n STNWG\n 2\n F\n D\n 2\n \n \n 4\n 0.72\n 0.648871\n 0\n 0\n 0.0\n HBACK\n 4\n F\n C\n 2\n \n \n\n\n\n\nLet’s see the meaning of the variables before creating any plot or fitting any model.\n\nveh_value: Vehicle value, ranges from \\$0 to \\$350,000.\nexposure: Proportion of the year where the policy was exposed. In practice each policy is not exposed for the full year. Some policies come into force partly into the year while others are canceled before the year’s end.\nclm: Claim occurrence. 0 (no), 1 (yes).\nnumclaims: Number of claims.\nclaimcst0: Claim amount. 0 if no claim. Ranges from \\$200 to \\$55922.\nveh_body: Vehicle body type. Can be one of bus, convertible, coupe, hatchback, hardtop, motorized caravan/combi, minibus, panel van, roadster, sedan, station wagon, truck, and utility.\nveh_age: Vehicle age. 1 (new), 2, 3, and 4.\ngender: Gender of the driver. M (Male) and F (Female).\narea: Driver’s area of residence. Can be one of A, B, C, D, E, and F.\nagecat: Driver’s age category. 1 (youngest), 2, 3, 4, 5, and 6.\n\nThe variable of interest is the claim amount, given by \"claimcst0\". We keep the records where there is a claim, so claim amount is greater than 0.\n\ndata = data[data[\"claimcst0\"] > 0]\n\nFor clarity, we only show those claims amounts below \\$15,000, since there are only 65 records above that threshold.\n\ndata[data[\"claimcst0\"] > 15000].shape[0]\n\n65\n\n\n\nplt.hist(data[data[\"claimcst0\"] <= 15000][\"claimcst0\"], bins=30)\nplt.title(\"Distribution of claim amount\")\nplt.xlabel(\"Claim amount ($)\");\n\n\n\n\nAnd this is when you say: “Oh, there really are ugly right-skewed distributions out there!”. Well, yes, we’ve all been there :)\nIn this case we are going to fit GLMs with a right-skewed distribution for the random component. This time we will be using Wald and Gamma distributions. 
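As the next sentences explain, one practical difference between the two families is how the variance scales with the mean; a small numeric sketch with hypothetical shape parameters makes the contrast visible before the formal definitions:

import numpy as np

mu = np.array([1.0, 2.0, 4.0])  # hypothetical mean claim amounts
lam, alpha = 1.0, 1.0           # hypothetical Wald shape (lambda) and Gamma shape (alpha)

print(mu**3 / lam)    # Wald variance grows with the cubed mean: [ 1.  8. 64.]
print(mu**2 / alpha)  # Gamma variance grows with the squared mean: [ 1.  4. 16.]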
One of their differences is that the variance is proportional to the cubic mean in the case of the Wald distribution, and proportional to the squared mean in the case of the Gamma distribution.\n\n\n\nThe Wald family (a.k.a inverse Gaussian model) states that\n\\[\n\\begin{array}{cc}\ny_i \\sim \\text{Wald}(\\mu_i, \\lambda) & g(\\mu_i) = \\mathbf{x}_i^T\\beta\n\\end{array}\n\\]\nwhere the pdf of a Wald distribution is given by\n\\[\nf(x|\\mu, \\lambda) =\n\\left(\\frac{\\lambda}{2\\pi}\\right)^{1/2}x^{-3/2}\\exp\\left\\{ -\\frac{\\lambda}{2x} \\left(\\frac{x - \\mu}{\\mu} \\right)^2 \\right\\}\n\\]\nfor \\(x > 0\\), mean \\(\\mu > 0\\) and \\(\\lambda > 0\\) is the shape parameter. The variance is given by \\(\\sigma^2 = \\mu^3/\\lambda\\). The canonical link is \\(g(\\mu_i) = \\mu_i^{-2}\\), but \\(g(\\mu_i) = \\log(\\mu_i)\\) is usually preferred, and it is what we use here.\n\n\n\nThe default parametrization of the Gamma density function is\n\\[\n\\displaystyle f(x | \\alpha, \\beta) = \\frac{\\beta^\\alpha x^{\\alpha -1} e^{-\\beta x}}{\\Gamma(\\alpha)}\n\\]\nwhere \\(x > 0\\), and \\(\\alpha > 0\\) and \\(\\beta > 0\\) are the shape and rate parameters, respectively.\nBut GLMs model the mean of the function, so we need to use an alternative parametrization where\n\\[\n\\begin{array}{ccc}\n\\displaystyle \\mu = \\frac{\\alpha}{\\beta} & \\text{and} & \\displaystyle \\sigma^2 = \\frac{\\alpha}{\\beta^2}\n\\end{array}\n\\]\nand thus we have\n\\[\n\\begin{array}{cccc}\ny_i \\sim \\text{Gamma}(\\mu_i, \\sigma_i), & g(\\mu_i) = \\mathbf{x}_i^T\\beta, & \\text{and} & \\sigma_i = \\mu_i^2/\\alpha\n\\end{array}\n\\]\nwhere \\(\\alpha\\) is the shape parameter in the original parametrization of the gamma pdf. The canonical link is \\(g(\\mu_i) = \\mu_i^{-1}\\), but here we use \\(g(\\mu_i) = \\log(\\mu_i)\\) again.\n\n\n\nIn this example we are going to use the binned age, the gender, and the area of residence to predict the amount of the claim, conditional on the existence of the claim because we are only working with observations where there is a claim.\n\"agecat\" is interpreted as a numeric variable in our data frame, but we know it is categorical, and we wouldn’t be happy if our model takes it as if it was numeric, would we?\nWe have two alternatives to tell Bambi that this numeric variable must be treated as categorical. The first one is to wrap the name of the variable with C(), and the other is to pass the same name to the categorical argument when we create the model. We are going to use the first approach with the Wald family and the second with the Gamma.\nThe C() notation is taken from Patsy and is encouraged when you want to explicitly pass the order of the levels of the variables. 
If you are happy with the default order, better pass the name to categorical so tables and plots have prettier labels :)\n\n\n\nmodel_wald = bmb.Model(\"claimcst0 ~ C(agecat) + gender + area\", data, family = \"wald\", link = \"log\")\nfitted_wald = model_wald.fit(tune=2000, target_accept=0.9, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [claimcst0_lam, Intercept, C(agecat), gender, area]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:17<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 17 seconds.\n\n\n\naz.plot_trace(fitted_wald);\n\n\n\n\n\naz.summary(fitted_wald)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 7.719\n 0.097\n 7.524\n 7.881\n 0.004\n 0.003\n 723.0\n 973.0\n 1.0\n \n \n C(agecat)[2]\n -0.164\n 0.103\n -0.362\n 0.014\n 0.004\n 0.003\n 670.0\n 867.0\n 1.0\n \n \n C(agecat)[3]\n -0.259\n 0.098\n -0.442\n -0.075\n 0.004\n 0.003\n 757.0\n 1077.0\n 1.0\n \n \n C(agecat)[4]\n -0.264\n 0.098\n -0.441\n -0.080\n 0.004\n 0.003\n 729.0\n 1056.0\n 1.0\n \n \n C(agecat)[5]\n -0.377\n 0.106\n -0.582\n -0.191\n 0.004\n 0.003\n 767.0\n 1142.0\n 1.0\n \n \n C(agecat)[6]\n -0.319\n 0.123\n -0.550\n -0.088\n 0.004\n 0.003\n 897.0\n 1379.0\n 1.0\n \n \n gender[M]\n 0.154\n 0.051\n 0.046\n 0.242\n 0.001\n 0.001\n 2325.0\n 1571.0\n 1.0\n \n \n area[B]\n -0.028\n 0.071\n -0.151\n 0.110\n 0.002\n 0.001\n 1582.0\n 1584.0\n 1.0\n \n \n area[C]\n 0.075\n 0.067\n -0.057\n 0.193\n 0.002\n 0.001\n 1652.0\n 1352.0\n 1.0\n \n \n area[D]\n -0.018\n 0.087\n -0.176\n 0.153\n 0.002\n 0.002\n 1779.0\n 1684.0\n 1.0\n \n \n area[E]\n 0.154\n 0.101\n -0.028\n 0.351\n 0.003\n 0.002\n 1632.0\n 1394.0\n 1.0\n \n \n area[F]\n 0.372\n 0.129\n 0.136\n 0.615\n 0.003\n 0.002\n 1878.0\n 1345.0\n 1.0\n \n \n claimcst0_lam\n 723.159\n 15.695\n 693.002\n 751.738\n 0.306\n 0.217\n 2630.0\n 1577.0\n 1.0\n \n \n\n\n\n\nIf we look at the agecat variable, we can see the log mean of the claim amount tends to decrease when the age of the person increases, with the exception of the last category where we can see a slight increase in the mean of the coefficient (-0.307 vs -0.365 of the previous category). However, these differences only represent a slight tendency because of the large overlap between the marginal posteriors for these coefficients (see overlaid density plots for C(agecat).\nThe posterior for gender tells us that the claim amount tends to be larger for males than for females, with the mean being 0.153 and the credible interval ranging from 0.054 to 0.246.\nFinally, from the marginal posteriors for the areas, we can see that F is the only area that clearly stands out, with a higher mean claim amount than in the rest. 
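Because the model uses a log link, the coefficients are easier to read after exponentiating them, since they then act multiplicatively on the mean claim amount. A short sketch using approximate posterior means from the summary table above:

import numpy as np

print(np.exp(0.154))  # gender[M]: about 1.17, i.e. roughly 17% higher expected claim amount for males
print(np.exp(0.372))  # area[F]: about 1.45, i.e. roughly 45% higher expected claim amount in area F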
Area E may also have a higher claim amount, but this difference with the other areas is not as evident as it happens with F.\n\n\n\n\nmodel_gamma = bmb.Model(\n \"claimcst0 ~ agecat + gender + area\",\n data,\n family=\"gamma\",\n link=\"log\",\n categorical=\"agecat\",\n)\nfitted_gamma = model_gamma.fit(tune=2000, target_accept=0.9, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [claimcst0_alpha, Intercept, agecat, gender, area]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:24<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 25 seconds.\n\n\n\naz.plot_trace(fitted_gamma);\n\n\n\n\n\naz.summary(fitted_gamma)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 7.717\n 0.063\n 7.591\n 7.825\n 0.002\n 0.001\n 891.0\n 1280.0\n 1.0\n \n \n agecat[2]\n -0.181\n 0.064\n -0.309\n -0.064\n 0.002\n 0.001\n 949.0\n 1151.0\n 1.0\n \n \n agecat[3]\n -0.275\n 0.063\n -0.395\n -0.164\n 0.002\n 0.001\n 966.0\n 1342.0\n 1.0\n \n \n agecat[4]\n -0.269\n 0.063\n -0.388\n -0.155\n 0.002\n 0.001\n 900.0\n 1406.0\n 1.0\n \n \n agecat[5]\n -0.389\n 0.071\n -0.522\n -0.255\n 0.002\n 0.002\n 1059.0\n 1358.0\n 1.0\n \n \n agecat[6]\n -0.314\n 0.078\n -0.459\n -0.161\n 0.002\n 0.001\n 1367.0\n 1546.0\n 1.0\n \n \n gender[M]\n 0.166\n 0.034\n 0.101\n 0.225\n 0.001\n 0.000\n 2965.0\n 1448.0\n 1.0\n \n \n area[B]\n -0.023\n 0.050\n -0.123\n 0.062\n 0.001\n 0.001\n 1601.0\n 1709.0\n 1.0\n \n \n area[C]\n 0.071\n 0.045\n -0.013\n 0.156\n 0.001\n 0.001\n 1359.0\n 1514.0\n 1.0\n \n \n area[D]\n -0.017\n 0.063\n -0.132\n 0.106\n 0.001\n 0.001\n 1838.0\n 1558.0\n 1.0\n \n \n area[E]\n 0.152\n 0.067\n 0.026\n 0.273\n 0.002\n 0.001\n 1964.0\n 1596.0\n 1.0\n \n \n area[F]\n 0.371\n 0.076\n 0.235\n 0.521\n 0.002\n 0.001\n 1885.0\n 1467.0\n 1.0\n \n \n claimcst0_alpha\n 0.762\n 0.014\n 0.736\n 0.789\n 0.000\n 0.000\n 3212.0\n 1452.0\n 1.0\n \n \n\n\n\n\nThe interpretation of the parameter posteriors is very similar to what we’ve done for the Wald family. The only difference is that some differences, such as the ones for the area posteriors, are a little more exacerbated here.\n\n\n\n\nWe can perform a Bayesian model comparison very easily with az.compare(). Here we pass a dictionary with the InferenceData objects that Model.fit() returned and az.compare() returns a data frame that is ordered from best to worst according to the criteria used.\n\nmodels = {\"wald\": fitted_wald, \"gamma\": fitted_gamma}\ndf_compare = az.compare(models)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n wald\n 0\n -38581.405635\n 12.882981\n 0.00000\n 1.0\n 106.105576\n 0.000000\n False\n log\n \n \n gamma\n 1\n -39628.995425\n 26.607829\n 1047.58979\n 0.0\n 104.988009\n 35.754616\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nBy default, ArviZ uses loo, which is an estimation of leave one out cross-validation. Another option is the widely applicable information criterion (WAIC). 
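If you want to double-check the ranking under WAIC, the same comparison can be re-run by passing the information criterion explicitly (a sketch reusing the models dictionary defined above):

# Rank the same two models by WAIC instead of LOO
az.compare(models, ic="waic")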
Since the results are in the log scale, the better out-of-sample predictive fit is given by the model with the highest value, which is the Wald model.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nmatplotlib: 3.6.2\narviz : 0.14.0\nnumpy : 1.23.5\nbambi : 0.9.3\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nThese data are from the 2016 pilot study. The full study consisted of 1200 people, but here we’ve selected the subset of 487 people who responded to a question about whether they would vote for Hillary Clinton or Donald Trump.\n\ndata = bmb.load_data(\"ANES\")\ndata.head()\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 0\n clinton\n 56\n democrat\n \n \n 1\n trump\n 65\n republican\n \n \n 2\n clinton\n 80\n democrat\n \n \n 3\n trump\n 38\n republican\n \n \n 4\n trump\n 60\n republican\n \n \n\n\n\n\nOur outcome variable is vote, which gives peoples’ responses to the following question prompt:\n“If the 2016 presidential election were between Hillary Clinton for the Democrats and Donald Trump for the Republicans, would you vote for Hillary Clinton, Donald Trump, someone else, or probably not vote?”\n\ndata[\"vote\"].value_counts()\n\nclinton 215\ntrump 158\nsomeone_else 48\nName: vote, dtype: int64\n\n\nThe two predictors we’ll examine are a respondent’s age and their political party affiliation, party_id, which is their response to the following question prompt:\n“Generally speaking, do you usually think of yourself as a Republican, a Democrat, an independent, or what?”\n\ndata[\"party_id\"].value_counts()\n\ndemocrat 186\nindependent 138\nrepublican 97\nName: party_id, dtype: int64\n\n\nThese two predictors are somewhat correlated, but not all that much:\n\nfig, ax = plt.subplots(1, 3, figsize=(10, 4), sharey=True, constrained_layout=True)\nkey = dict(zip(data[\"party_id\"].unique(), range(3)))\nfor label, df in data.groupby(\"party_id\"):\n ax[key[label]].hist(df[\"age\"])\n ax[key[label]].set_xlim([18, 90])\n ax[key[label]].set_xlabel(\"Age\")\n ax[key[label]].set_ylabel(\"Frequency\")\n ax[key[label]].set_title(label)\n ax[key[label]].axvline(df[\"age\"].mean(), color=\"C1\")\n\n\n\n\nWe can get a pretty clear idea of how party identification is related to voting intentions by just looking at a contingency table for these two variables:\n\npd.crosstab(data[\"vote\"], data[\"party_id\"])\n\n\n\n\n\n \n \n party_id\n democrat\n independent\n republican\n \n \n vote\n \n \n \n \n \n \n \n clinton\n 159\n 51\n 5\n \n \n someone_else\n 10\n 22\n 16\n \n \n trump\n 17\n 65\n 76\n \n \n\n\n\n\nBut our main question here will be: How is respondent age related to voting intentions, and is this relationship different for different party affiliations? 
For this we will use a logistic regression.\n\n\n\nTo keep this simple, let’s look at only the data from people who indicated that they would vote for either Clinton or Trump, and we’ll model the probability of voting for Clinton.\n\nclinton_data = data.loc[data[\"vote\"].isin([\"clinton\", \"trump\"]), :]\nclinton_data.head()\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 0\n clinton\n 56\n democrat\n \n \n 1\n trump\n 65\n republican\n \n \n 2\n clinton\n 80\n democrat\n \n \n 3\n trump\n 38\n republican\n \n \n 4\n trump\n 60\n republican\n \n \n\n\n\n\n\n\nWe’ll use a logistic regression model to estimate the probability of voting for Clinton as a function of age and party affiliation. We can think we have a response variable \\(Y\\) defined as\n\\[\nY =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person votes for Clinton} \\\\\n 0 & \\textrm{if the person votes for Trump}\n \\end{array}\n\\right.\n\\]\nand we are interested in modelling \\(\\pi = P(Y = 1)\\) (a.k.a. probability of success) based on two explanatory variables, age and party affiliation.\nA logistic regression is a model that links the \\(\\text{logit}(\\pi)\\) to a linear combination of the predictors. In our example, we’re going to include a main effect for party affiliation and the interaction effect between party affiliation and age (i.e. we’ll have a different age slope for each affiliation). The mathematical equation for our model is\n$$\n\\[\\begin{aligned}\n \\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} &=\n \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3 X_4 + \\beta_4 X_1 X_4 + \\beta_5 X_2 X_4 \\\\\n\n X_1 &= \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Independent} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_2 &= \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Republican} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Democrat} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_4 &= \\text{Age}\n\\end{aligned}\\]\n$$\nNotice we don’t have a main effect for \\(X_3\\). This happens because Democrat party affiliation is being taken as baseline in the encoding of the categorical variable party_id and \\(\\beta_1\\) and \\(\\beta_2\\) represent deviations from that baseline. Thus, we see the main effect of Democrat affiliation is being represented by the Intercept, \\(\\beta_0\\).\nIf we represent the right hand side of the model equation with \\(\\eta\\), the expression can be re-arranged to express our probability of interest, \\(\\pi\\), as a function of the linear predictor \\(\\eta\\).\n\\[\\pi = \\frac{e^\\eta}{1 + e^\\eta}= \\frac{1}{1 + e^{-\\eta}}\\]\nSince we’re Bayesian folks who draw samples from posteriors, we need to specify a prior for the parameters as well as a likelihood function before accomplishing our task. In this occasion, we’re going to use the default priors in Bambi and just note the likelihood is the product of \\(n\\) Bernoulli trials, \\(\\prod_{i=1}^{n}{p_i^y(1-p_i)^{1-y_i}}\\) where \\(p_i = P(Y=1)\\) and \\(y_i = 1\\) if the vote intention is for Clinton and \\(y_i = 0\\) if Trump.\n\n\n\nSpecifying and fitting the model is simple. Bambi is good and doesn’t ask us to translate all the math to code. We just need to specify our model using the formula syntax and pass the correct family argument. 
Notice the (optional) syntax that we use on the left-hand-side of the formula: We say vote[clinton] to instruct Bambi that we wish the model the probability that vote=='clinton', rather than the probability that vote=='trump'. If we leave this unspecified, Bambi will just pick one of the events to model, but will inform you which one it picked when you build the model (and again when you look at model summaries).\nOn the right-hand-side of the formula we use party_id + party_id:age to instruct Bambi that we want to use party_id and the interaction between party_id and age as the explanatory variables in the model.\n\n\nclinton_model = bmb.Model(\"vote['clinton'] ~ party_id + party_id:age\", clinton_data, family=\"bernoulli\")\nclinton_fitted = clinton_model.fit(\n draws=2000, target_accept=0.85, random_seed=SEED, idata_kwargs={\"log_likelihood\": True}\n)\n\nModeling the probability that vote==clinton\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, party_id, party_id:age]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:13<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 14 seconds.\n\n\nWe can print the model object to see information about the response distribution, the link function and the priors.\n\nclinton_model\n\n Formula: vote['clinton'] ~ party_id + party_id:age\n Family: bernoulli\n Link: p = logit\n Observations: 373\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 4.3846)\n party_id ~ Normal(mu: [0. 0.], sigma: [5.4007 6.0634])\n party_id:age ~ Normal(mu: [0. 0. 0.], sigma: [0.0938 0.1007 0.1098])\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nUnder the hood, Bambi selected Gaussian priors for all the parameters in the model. By construction, all the priors, except the one for Intercept, are centered around 0, which is consistent with the desired weakly informative behavior. The standard deviation is specific to each parameter.\nSome more info about these default priors can be found in this technical paper.\nWe can also call clinton_model.plot_priors() to visualize the sensitive default priors Bambi has chosen for us.\n\nclinton_model.plot_priors();\n\nSampling: [Intercept, party_id, party_id:age]\n\n\n\n\n\nNow let’s check out the results! 
We get traceplots and density estimates for the posteriors with az.plot_trace() and a summary of the posteriors with az.summary().\n\naz.plot_trace(clinton_fitted, compact=False);\n\n\n\n\n\naz.summary(clinton_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.674\n 0.725\n 0.251\n 2.998\n 0.016\n 0.011\n 2199.0\n 2105.0\n 1.0\n \n \n party_id[independent]\n -0.293\n 0.956\n -2.037\n 1.543\n 0.021\n 0.018\n 2046.0\n 2230.0\n 1.0\n \n \n party_id[republican]\n -1.151\n 1.575\n -4.122\n 1.806\n 0.039\n 0.027\n 1667.0\n 1843.0\n 1.0\n \n \n party_id:age[democrat]\n 0.013\n 0.015\n -0.016\n 0.042\n 0.000\n 0.000\n 2133.0\n 2064.0\n 1.0\n \n \n party_id:age[independent]\n -0.033\n 0.011\n -0.055\n -0.012\n 0.000\n 0.000\n 3257.0\n 2797.0\n 1.0\n \n \n party_id:age[republican]\n -0.080\n 0.036\n -0.153\n -0.018\n 0.001\n 0.001\n 1692.0\n 1546.0\n 1.0\n \n \n\n\n\n\n\n\n\n\nBefore moving forward to inference, we can evaluate the quality of the model’s fit. We will take a look at two different ways of assessing how good is the model’s fit using its predictions.\n\n\nThere is a way of assessing the performance of a model with binary outcomes (such as logistic regression) in a visual way called separation plot. In a separation plot, the model’s predictions are averaged, ordered and represented as consecutive vertical lines. These vertical lines are colored according to the class indicated by their corresponding observed value, in this case light blue indicates class 0 (vote == 'Trump') and blue represents class 1 (vote =='Clinton'). We can use the ArviZ’ implementation of the separation plot, but first we have to obtain the model’s predictions.\n\nclinton_model.predict(clinton_fitted, kind=\"pps\")\n\n\nax = az.plot_separation(clinton_fitted, y='vote', figsize=(9,0.5));\n\n\n\n\nIn this separation plot we can see that some observations are misspredicted, specially in the right hand side of the plot where the model predicts Trump votes when there were really Clinton ones. We can further investigate this using another of ArviZ model evaluation tool.\n\n\n\n\nWe can also use ArviZ to compute LOO and find influential observations using the estimated \\(\\hat \\kappa\\) parameter value.\n\n# compute pointwise LOO\nloo = az.loo(clinton_fitted, pointwise=True)\n\n\n# plot kappa values\naz.plot_khat(loo.pareto_k);\n\n\n\n\nA first look at the khat plot shows that most observations’ \\(\\hat \\kappa\\) values are grouped together in a range that goes up to roughly 0.2. Above that value, we observe some dispersion and a few points that stand out by having the highest \\(\\hat \\kappa\\) values.\nAn observation is influential in the sense that if we refit the data by first removing that observation from the data set, the fitted result will be more different than if we do the same for a non influential observation. Clearly the level of influence of observations can vary continuously. An observation can be influential either because it is an outlier (a measurement error, a data entry error, etc) or because the model is not flexible enough to capture the observation. 
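A quick way to see whether any observation crosses the usual reliability threshold is to count them directly (a sketch using the loo object computed above):

# Count observations whose Pareto k-hat estimate exceeds the usual 0.7 threshold
k_hat = loo.pareto_k.values.ravel()
print((k_hat > 0.7).sum(), "of", k_hat.size, "observations have k-hat > 0.7")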
The approximations used to compute LOO are no longer reliable for \\(\\hat \\kappa > 0.7\\).\nLet us first take a look at the observation with the highest \\(\\hat \\kappa\\).\n\nax = az.plot_khat(loo.pareto_k.values.ravel())\nsorted_kappas = np.sort(loo.pareto_k.values.ravel())\n\n# find observation where the kappa value exceeds the threshold\nthreshold = sorted_kappas[-1:]\nax.axhline(threshold, ls=\"--\", color=\"orange\")\ninfluential_observations = clinton_data.reset_index()[loo.pareto_k.values >= threshold].index\n\nfor x in influential_observations:\n y = loo.pareto_k.values[x]\n ax.text(x, y + 0.01, str(x), ha=\"center\", va=\"baseline\")\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values >= threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThis observation corresponds to a 95 year old Republican party member that voted for Trump.\n\nLet us take a look at six observations with the highest \\(\\hat \\kappa\\) values.\n\nax = az.plot_khat(loo.pareto_k)\n\n# find observation where the kappa value exceeds the threshold\nthreshold = sorted_kappas[-6:].min()\nax.axhline(threshold, ls=\"--\", color=\"orange\")\ninfluential_observations = clinton_data.reset_index()[loo.pareto_k.values >= threshold].index\n\nfor x in influential_observations:\n y = loo.pareto_k.values[x]\n ax.text(x, y + 0.01, str(x), ha=\"center\", va=\"baseline\")\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values>=threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 34\n 34\n trump\n 83\n republican\n \n \n 58\n 64\n trump\n 84\n republican\n \n \n 62\n 68\n trump\n 91\n republican\n \n \n 87\n 95\n trump\n 80\n republican\n \n \n 191\n 215\n trump\n 95\n republican\n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nObservations number 34, 58, 62, and 191 correspond to individuals in under represented age groups in the data set. The rest correspond to Republican party members that voted for Clinton. Let us check how many observations we have of individuals older than 80 years old.\n\nclinton_data[clinton_data.age>80]\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 34\n trump\n 83\n republican\n \n \n 64\n trump\n 84\n republican\n \n \n 68\n trump\n 91\n republican\n \n \n 97\n clinton\n 83\n democrat\n \n \n 215\n trump\n 95\n republican\n \n \n 246\n clinton\n 82\n democrat\n \n \n 403\n clinton\n 81\n democrat\n \n \n\n\n\n\nLet us check how many observations there are of Republicans who voted for Clinton\n\nclinton_data[(clinton_data.vote =='clinton') & (clinton_data.party_id == 'republican')]\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 170\n clinton\n 27\n republican\n \n \n 248\n clinton\n 36\n republican\n \n \n 359\n clinton\n 22\n republican\n \n \n 361\n clinton\n 37\n republican\n \n \n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThere are only two observations for individuals older than 80 years old and five observations for individuals of the Republican party that vote for Clinton. The fact that the model finds it difficult to predict for these observations is related to model uncertainty, due to a scarce number of observations that exhibit these characteristics.\nLet us repeat the separation plot, this time marking the observations we have analyzed. 
This plot will show us how the model predicted these particular observations.\n\nimport matplotlib.patheffects as pe\n\nax = az.plot_separation(clinton_fitted, y=\"vote\", figsize=(9, 0.5))\n\ny = np.random.uniform(0.1, 0.5, size=len(influential_observations))\n\nfor x, y in zip(influential_observations, y):\n text = str(x)\n x = x / len(clinton_data)\n ax.scatter(x, y, marker=\"+\", s=50, color=\"red\", zorder=3)\n ax.text(\n x, y + 0.1, text, color=\"white\", ha=\"center\", va=\"bottom\",\n path_effects=[pe.withStroke(linewidth=2, foreground=\"black\")]\n )\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values>=threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 34\n 34\n trump\n 83\n republican\n \n \n 58\n 64\n trump\n 84\n republican\n \n \n 62\n 68\n trump\n 91\n republican\n \n \n 87\n 95\n trump\n 80\n republican\n \n \n 191\n 215\n trump\n 95\n republican\n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThis assessment helped us to further understand the model and quality of the fit. It also illustrates the intuition that we should be cautious when predicting for under represented age groups and voting behaviours.\n\n\n\nGrab the posteriors samples of the age slopes for the three party_id categories.\n\nparties = [\"democrat\", \"independent\", \"republican\"]\ndem, ind, rep = [clinton_fitted.posterior[\"party_id:age\"].sel({\"party_id:age_dim\":party}) for party in parties]\n\nPlot the marginal posteriors for the age slopes for the three political affiliations.\n\n_, ax = plt.subplots()\nfor idx, x in enumerate([dem, ind, rep]):\n az.plot_dist(x, label=x[\"party_id:age_dim\"].item(), plot_kwargs={\"color\": f\"C{idx}\"}, ax=ax)\nax.legend(loc=\"upper left\");\n\n\n\n\nNow, using the joint posterior, we can answer our questions in terms of probabilities.\nWhat is the probability that the Democrat slope is greater than the Republican slope?\n\n(dem > rep).mean().item()\n\n0.99625\n\n\nProbability that the Democrat slope is greater than the Independent slope?\n\n(dem > ind).mean().item()\n\n0.99125\n\n\nProbability that the Independent slope is greater than the Republican slope?\n\n(ind > rep).mean().item()\n\n0.899\n\n\nProbability that the Democrat slope is greater than 0?\n\n(dem > 0).mean().item()\n\n0.80875\n\n\nProbability that the Republican slope is less than 0?\n\n(rep < 0).mean().item()\n\n0.995\n\n\nProbability that the Independent slope is less than 0?\n\n(ind < 0).mean().item()\n\n0.99875\n\n\nIf we look at the plot of the marginal posteriors, we may be suspicious that, for example, the probability that Democrat slope is greater than the Republican slope is 0.998 (almost 1!), given the overlap between the blue and green density functions. However, we can’t answer such a question using the marginal posteriors only, as shown in the plot. Since Democrat and Republican slopes (\\(\\beta_3\\) and \\(\\beta_5\\), respectively) are random variables, we need to use their joint distribution to answer probability questions that involve both of them. The fact that logical comparisons (e.g. > in dem > ind) are performed elementwise ensures we’re using samples from the joint posterior as we should. We also note that when the question involves only one of the random variables, it is fine to use the marginal distribution (e.g. 
(rep < 0).mean()).\nFinally, all these comments may have not been necessary since we didn’t need to mention anything about marginal or joint distributions when performing the calculations, we’ve just grabbed the samples and applied some basic math. But that’s an advantage of Bambi and the Bayesian approach. Things that are not so simple, became simpler :)\n\n\n\nHere we make use of the Model.predict() method to predict the probability of voting for Clinton for an out-of-sample dataset that we create.\n\nage = np.arange(18, 91)\nnew_data = pd.DataFrame({\n \"age\": np.tile(age, 3),\n \"party_id\": np.repeat([\"democrat\", \"republican\", \"independent\"], len(age))\n})\nnew_data\n\n\n\n\n\n \n \n \n age\n party_id\n \n \n \n \n 0\n 18\n democrat\n \n \n 1\n 19\n democrat\n \n \n 2\n 20\n democrat\n \n \n 3\n 21\n democrat\n \n \n 4\n 22\n democrat\n \n \n ...\n ...\n ...\n \n \n 214\n 86\n independent\n \n \n 215\n 87\n independent\n \n \n 216\n 88\n independent\n \n \n 217\n 89\n independent\n \n \n 218\n 90\n independent\n \n \n\n219 rows × 2 columns\n\n\n\nObtain predictions for the new dataset. By default, Bambi is going to obtain a posterior distribution for the mean probability of voting for Clinton. These values are stored as the \"vote_mean\" variable in clinton_fitted.posterior.\n\nclinton_model.predict(clinton_fitted, data=new_data)\n\n\n# Select a sample of posterior values for the mean probability of voting for Clinton\nvote_posterior = az.extract_dataset(clinton_fitted, num_samples=2000)[\"vote_mean\"]\n\n/tmp/ipykernel_23763/325773600.py:2: FutureWarning: extract_dataset has been deprecated, please use extract\n vote_posterior = az.extract_dataset(clinton_fitted, num_samples=2000)[\"vote_mean\"]\n\n\nMake the plot!\n\n_, ax = plt.subplots(figsize=(7, 5))\n\nfor i, party in enumerate([\"democrat\", \"republican\", \"independent\"]):\n # Which rows in new_data correspond to party?\n idx = new_data.index[new_data[\"party_id\"] == party].tolist()\n ax.plot(age, vote_posterior[idx], alpha=0.04, color=f\"C{i}\")\n\nax.set_ylabel(\"P(vote='clinton' | age)\")\nax.set_xlabel(\"Age\", fontsize=15)\nax.set_ylim(0, 1)\nax.set_xlim(18, 90);\n\n\n\n\nThe following is a rough interpretation of the information contained in the plot we’ve just created.\nAccording to our logistic model, the mean probability of voting for Clinton is almost always 0.8 or greater for Democrats no matter the age (blue line). Also, the older the person, the closer the mean probability of voting Clinton to 1.\nOn the other hand, Republicans have a non-zero probability of voting for Clinton when they are young, but it tends to zero for older persons (green line). We can also note the high variability of P(vote = ‘Clinton’) for young Republicans. This reflects our high uncertainty when estimating this probability and it is due to the small amount of Republicans in that age range plus there are only 5 Republicans out of 97 voting for Clinton in the dataset.\nFinally, the mean probability of voting Clinton for the independents is around 0.7 for the youngest and decreases towards 0.2 as they get older (orange line). 
Since the spread of the lines is similar along all the ages, we can conclude our uncertainty in this estimate is similar for all the age groups.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 1.5.2\nmatplotlib: 3.6.2\nnumpy : 1.23.5\narviz : 0.14.0\nbambi : 0.9.3\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/hierarchical_binomial_bambi.html", - "href": "notebooks/hierarchical_binomial_bambi.html", + "objectID": "notebooks/how_bambi_works.html", + "href": "notebooks/how_bambi_works.html", "title": "Bambi", "section": "", - "text": "This notebook shows how to build a hierarchical logistic regression model with the Binomial family in Bambi.\nThis example is based on the Hierarchical baseball article in Bayesian Analysis Recipes, a collection of articles on how to do Bayesian data analysis with PyMC3 made by Eric Ma.\n\n\nExtracted from the original work:\n\nBaseball players have many metrics measured for them. Let’s say we are on a baseball team, and would like to quantify player performance, one metric being their batting average (defined by how many times a batter hit a pitched ball, divided by the number of times they were up for batting (“at bat”)). How would you go about this task?\n\n\n\n\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom matplotlib.lines import Line2D\nfrom matplotlib.patches import Patch\n\n\naz.style.use(\"arviz-darkgrid\")\nrandom_seed = 1234\n\nWe first need some measurements of batting data. Today we’re going to use data from the Baseball Databank. It is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.\nThis repository contains several datasets in the form of .csv files. 
This example is going to use the Batting.csv file, which can be loaded directly with Bambi in a convenient way.\n\ndf = bmb.load_data(\"batting\")\n\n# Then clean some of the data\ndf[\"AB\"] = df[\"AB\"].replace(0, np.nan)\ndf = df.dropna()\ndf[\"batting_avg\"] = df[\"H\"] / df[\"AB\"]\ndf = df[df[\"yearID\"] >= 2016]\ndf = df.iloc[0:15] \ndf.head(5)\n\n\n\n\n\n \n \n \n playerID\n yearID\n stint\n teamID\n lgID\n G\n AB\n R\n H\n 2B\n ...\n SB\n CS\n BB\n SO\n IBB\n HBP\n SH\n SF\n GIDP\n batting_avg\n \n \n \n \n 101348\n abadfe01\n 2016\n 1\n MIN\n AL\n 39\n 1.0\n 0\n 0\n 0\n ...\n 0.0\n 0.0\n 0\n 1.0\n 0.0\n 0.0\n 0.0\n 0.0\n 0.0\n 0.000000\n \n \n 101350\n abreujo02\n 2016\n 1\n CHA\n AL\n 159\n 624.0\n 67\n 183\n 32\n ...\n 0.0\n 2.0\n 47\n 125.0\n 7.0\n 15.0\n 0.0\n 9.0\n 21.0\n 0.293269\n \n \n 101352\n ackledu01\n 2016\n 1\n NYA\n AL\n 28\n 61.0\n 6\n 9\n 0\n ...\n 0.0\n 0.0\n 8\n 9.0\n 0.0\n 0.0\n 0.0\n 1.0\n 0.0\n 0.147541\n \n \n 101353\n adamecr01\n 2016\n 1\n COL\n NL\n 121\n 225.0\n 25\n 49\n 7\n ...\n 2.0\n 3.0\n 24\n 47.0\n 0.0\n 4.0\n 3.0\n 0.0\n 5.0\n 0.217778\n \n \n 101355\n adamsma01\n 2016\n 1\n SLN\n NL\n 118\n 297.0\n 37\n 74\n 18\n ...\n 0.0\n 1.0\n 25\n 81.0\n 1.0\n 2.0\n 0.0\n 3.0\n 5.0\n 0.249158\n \n \n\n5 rows × 23 columns\n\n\n\nFrom all the columns above, we’re going to use the following:\n\nplayerID: Unique identification for the player.\nAB: Number of times the player was up for batting.\nH: Number of times the player hit the ball while batting.\nbatting_avg: Simply ratio between H and AB.\n\n\n\n\nIt’s always good to explore the data before starting to write down our models. This is very useful to gain a good understanding of the distribution of the variables and their relationships, and even anticipate some problems that may occur during the sampling process.\nThe following graph summarizes the percentage of hits, as well as the number of times the players were up for batting and the number of times they hit the ball.\n\nBLUE = \"#2a5674\"\nRED = \"#b13f64\"\n\n\n_, ax = plt.subplots(figsize=(10, 6))\n\n# Customize x limits. 
\n# This adds space on the left side to indicate percentage of hits.\nax.set_xlim(-120, 320)\n\n# Add dots for the times at bat and the number of hits\nax.scatter(df[\"AB\"], list(range(15)), s=140, color=BLUE, zorder=10)\nax.scatter(df[\"H\"], list(range(15)), s=140, color=RED, zorder=10)\n\n# Also a line connecting them\nax.hlines(list(range(15)), df[\"AB\"], df[\"H\"], color=\"#b3b3b3\", lw=4)\n\nax.axvline(ls=\"--\", lw=1.4, color=\"#a3a3a3\")\nax.hlines(list(range(15)), -110, -50, lw=6, color=\"#b3b3b3\", capstyle=\"round\")\nax.scatter(60 * df[\"batting_avg\"] - 110, list(range(15)), s=28, color=RED, zorder=10)\n\n# Add the percentage of hits\nfor j in range(15): \n text = f\"{round(df['batting_avg'].iloc[j] * 100)}%\"\n ax.text(-12, j, text, ha=\"right\", va=\"center\", fontsize=14, color=\"#333\")\n\n# Customize tick positions and labels\nax.yaxis.set_ticks(list(range(15)))\nax.yaxis.set_ticklabels(df[\"playerID\"])\nax.xaxis.set_ticks(range(0, 400, 100))\n\n# Create handles for the legend (just dots and labels)\nhandles = [\n Line2D(\n [0], [0], label=\"At Bat\", marker=\"o\", color=\"None\", markeredgewidth=0,\n markerfacecolor=RED, markersize=12\n ),\n Line2D(\n [0], [0], label=\"Hits\", marker=\"o\", color=\"None\", markeredgewidth=0, \n markerfacecolor=BLUE, markersize=13\n )\n]\n\n# Add legend on top-right corner\nlegend = ax.legend(\n handles=handles, \n loc=1, \n fontsize=14, \n handletextpad=0.4,\n frameon=True\n)\n\n# Finally add labels and a title\nax.set_xlabel(\"Count\", fontsize=14)\nax.set_ylabel(\"Player\", fontsize=14)\nax.set_title(\"How often do batters hit the ball?\", fontsize=20);\n\n\n\n\nThe first thing one can see is that the number of times players were up for batting varies quite a lot. Some players have been there for very few times, while there are others who have been there hundreds of times. We can also note the percentage of hits is usually a number between 12% and 29%.\nThere are two players, alberma01 and abadfe01, who had only one chance to bat. The first one hit the ball, while the latter missed. That’s why alberma01 as a 100% hit percentage, while abadfe01 has 0%. There’s another player, aguilje01, who has a success record of 0% because he missed all the few opportunities he had to bat. These extreme situations, where the empirical estimation lives in the boundary of the parameter space, are associated with estimation problems when using a maximum-likelihood estimation approach. Nonetheless, they can also impact the sampling process, especially when using wide priors.\nAs a final note, abreujo02, has been there for batting 624 times, and thus the grey dot representing this number does not appear in the plot.\n\n\n\nLet’s get started with a simple cell-means logistic regression for \\(p_i\\), the probability of hitting the ball for the player \\(i\\)\n\\[\n\\begin{array}{lr}\n \\displaystyle \\text{logit}(p_i) = \\beta_i & \\text{with } i = 0, \\cdots, 14\n\\end{array} \n\\]\nWhere\n\\[\n\\beta_i \\sim \\text{Normal}(0, \\ \\sigma_{\\beta}),\n\\]\n\\(\\sigma_{\\beta}\\) is a common constant for all the players, and \\(\\text{logit}(p_i) = \\log\\left(\\frac{p_i}{1 - p_i}\\right)\\).\nSpecifying this model is quite simple in Bambi thanks to its formula interface.\nFirst of all, note this is a Binomial family and the response involves both the number of hits (H) and the number of times at bat (AB). We use the p(x, n) function for the response term. 
This just tells Bambi we want to model the proportion resulting from dividing x over n.\nThe right-hand side of the formula is \"0 + playerID\". This means the model includes a coefficient for each player ID, but does not include a global intercept.\nFinally, using the Binomial family is as easy as passing family=\"binomial\". By default, the link function for this family is link=\"logit\", so there’s nothing to change there.\n\nmodel_non_hierarchical = bmb.Model(\"p(H, AB) ~ 0 + playerID\", df, family=\"binomial\")\nmodel_non_hierarchical\n\n Formula: p(H, AB) ~ 0 + playerID\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n playerID ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], sigma: [10.0223 10.0223\n 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223\n 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223])\n\n\n\nidata_non_hierarchical = model_non_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [playerID]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\n\n\nNext we observe the posterior of the coefficient for each player. The compact=False argument means we want separated panels for each player.\n\naz.plot_trace(idata_non_hierarchical, compact=False);\n\n\n\n\nSo far so good! The traceplots indicate the sampler worked well.\nNow, let’s keep this posterior aside for later use and let’s fit the hierarchical version.\n\n\n\nThis model incorporates a group-specific intercept for each player:\n\\[\n\\begin{array}{lr}\n \\displaystyle \\text{logit}(p_i) = \\alpha + \\gamma_i & \\text{with } i = 0, \\cdots, 14\n\\end{array} \n\\]\nwhere\n\\[\n\\begin{array}{c}\n \\alpha \\sim \\text{Normal}(0, \\ \\sigma_{\\alpha}) \\\\\n \\gamma_i \\sim \\text{Normal}(0, \\ \\sigma_{\\gamma}) \\\\\n \\sigma_{\\gamma} \\sim \\text{HalfNormal}(\\tau_{\\gamma})\n\\end{array}\n\\]\nThe group-specific terms are indicated with the | operator in the formula. 
In this case, since there is an intercept for each player, we write 1|playerID.\n\nmodel_hierarchical = bmb.Model(\"p(H, AB) ~ 1 + (1|playerID)\", df, family=\"binomial\")\nmodel_hierarchical\n\n Formula: p(H, AB) ~ 1 + (1|playerID)\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 2.5)\n \n Group-level effects\n 1|playerID ~ Normal(mu: 0, sigma: HalfNormal(sigma: 2.5))\n\n\n\nidata_hierarchical = model_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:07<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nAnd there we got several divergences… What can we do?\nOne thing we could try is to increase target_accept as suggested in the message above, but there are so many divergences that instead we are going to first take a look at the prior predictive distribution to check whether our priors are too informative or too wide.\nThe Model instance has a method called prior_predictive() that generates samples from the prior predictive distribution. It returns an InferenceData object that contains the values of the prior predictive distribution.\n\nidata_prior = model_hierarchical.prior_predictive()\nprior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n\nSampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]\n/tmp/ipykernel_23363/2686921361.py:2: FutureWarning: extract_dataset has been deprecated, please use extract\n prior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n\n\nIf we inspect the DataArray, we see there are 500 draws (sample) for each of the 15 players (p(H, AB)_dim_0)\nLet’s plot these distributions together with the observed proportion of hits for every player here.\n\n# We define this function because this plot is going to be repeated below.\ndef plot_prior_predictive(df, prior):\n AB = df[\"AB\"].values\n H = df[\"H\"].values\n\n fig, axes = plt.subplots(5, 3, figsize=(10, 6), sharex=\"col\")\n\n for idx, ax in enumerate(axes.ravel()):\n pps = prior.sel({\"p(H, AB)_obs\":idx})\n ab = AB[idx]\n h = H[idx]\n hist = ax.hist(pps / ab, bins=25, color=\"#a3a3a3\")\n ax.axvline(h / ab, color=RED, lw=2)\n ax.set_yticks([])\n ax.tick_params(labelsize=12)\n \n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n fig.legend(\n handles=[Line2D([0], [0], label=\"Observed proportion\", color=RED, linewidth=2)],\n handlelength=1.5,\n handletextpad=0.8,\n borderaxespad=0,\n frameon=True,\n fontsize=11, \n bbox_to_anchor=(0.975, 0.92),\n loc=\"right\"\n \n )\n fig.text(0.5, 0.025, \"Prior probability of hitting\", fontsize=15, ha=\"center\", va=\"baseline\")\n\n\nplot_prior_predictive(df, prior)\n\n/tmp/ipykernel_23363/3299358313.py:17: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n\n\n\n\n\nIndeed, priors are too wide! 
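A numeric companion to the visual check can make the same point (a sketch; it assumes the prior predictive draws keep the 15 players on the trailing dimension of the array).

```python
# Summarize the hit proportions implied by the prior, pooled over players and draws.
pps_prior = idata_prior.prior_predictive["p(H, AB)"]      # prior predictive counts of hits
prop = (pps_prior / df["AB"].to_numpy()).values.ravel()   # divide by each player's at-bats
print(np.quantile(prop, [0.05, 0.5, 0.95]).round(2))
```

A prior that is not too wide should concentrate these proportions away from the extremes of 0 and 1.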
Let’s use tighter priors and see what’s the result\n\npriors = {\n \"Intercept\": bmb.Prior(\"Normal\", mu=0, sigma=1),\n \"1|playerID\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"HalfNormal\", sigma=1))\n}\nmodel_hierarchical = bmb.Model(\"p(H, AB) ~ 1 + (1|playerID)\", df, family=\"binomial\", priors=priors)\nmodel_hierarchical\n\n Formula: p(H, AB) ~ 1 + (1|playerID)\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 1)\n \n Group-level effects\n 1|playerID ~ Normal(mu: 0, sigma: HalfNormal(sigma: 1))\n\n\nNow let’s check the prior predictive distribution for these new priors.\n\nmodel_hierarchical.build()\nidata_prior = model_hierarchical.prior_predictive()\nprior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\nplot_prior_predictive(df, prior)\n\nSampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]\n/tmp/ipykernel_23363/1302716284.py:3: FutureWarning: extract_dataset has been deprecated, please use extract\n prior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n/tmp/ipykernel_23363/3299358313.py:17: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n\n\n\n\n\nDefinetely it looks much better. Now the priors tend to have a symmetric shape with a mode at 0.5, with substantial probability on the whole domain.\n\nidata_hierarchical = model_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nLet’s try with increasing target_accept and the number of tune samples.\n\nidata_hierarchical = model_hierarchical.fit(tune=2000, draws=2000, target_accept=0.95, random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:17<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.\n\n\n\nvar_names = [\"Intercept\", \"1|playerID\", \"1|playerID_sigma\"]\naz.plot_trace(idata_hierarchical, var_names=var_names, compact=False);\n\n\n\n\nLet’s jump onto the next section where we plot and compare the probability of hit for the players using both models.\n\n\n\nNow we’re going to plot the distribution of the probability of hit for each player, using both models.\nBut before doing that, we need to obtain the posterior in that scale. We could manually take the posterior of the coefficients, compute the linear predictor, and transform that to the probability scale. But that’s a lot of work!\nFortunately, Bambi models have a method called .predict() that we can use to predict in the probability scale. By default, it modifies in-place the InferenceData object we pass to it. 
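For intuition about what `.predict()` automates, here is roughly what that "lot of work" would look like for the simpler non-hierarchical cell-means model (a sketch: it assumes the coefficients are stored as "playerID" in the posterior, and `scipy` is an extra import not used elsewhere in this notebook).

```python
# The "manual" route for the non-hierarchical cell-means model, shown only for intuition.
from scipy.special import expit  # inverse of the logit link

beta = idata_non_hierarchical.posterior["playerID"]   # dims: (chain, draw, playerID_dim)
p_manual = expit(beta)                                # logit(p_i) = beta_i  =>  p_i = expit(beta_i)
print(p_manual.mean(("chain", "draw")).values.round(3))
```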
Then, the posterior samples can be found in the variable p(H, AB)_mean.\n\nmodel_non_hierarchical.predict(idata_non_hierarchical)\nmodel_hierarchical.predict(idata_hierarchical)\n\nLet’s create a forestplot using the posteriors obtained with both models so we can compare them very easily .\n\n_, ax = plt.subplots(figsize = (8, 8))\n\n# Add vertical line for the global probability of hitting\nax.axvline(x=(df[\"H\"] / df[\"AB\"]).mean(), ls=\"--\", color=\"black\", alpha=0.5)\n\n# Create forestplot with ArviZ, only for the mean.\naz.plot_forest(\n [idata_non_hierarchical, idata_hierarchical], \n var_names=\"p(H, AB)_mean\", \n combined=True, \n colors=[\"#666666\", RED], \n linewidth=2.6, \n markersize=8,\n ax=ax\n)\n\n# Create custom y axis tick labels\nylabels = [f\"H: {round(h)}, AB: {round(ab)}\" for h, ab in zip(df[\"H\"].values, df[\"AB\"].values)]\nylabels = list(reversed(ylabels))\n\n# Put the labels for the y axis in the mid of the original location of the tick marks.\nax.set_yticklabels(ylabels, ha=\"right\")\n\n# Create legend\nhandles = [\n Patch(label=\"Non-hierarchical\", facecolor=\"#666666\"),\n Patch(label=\"Hierarchical\", facecolor=RED),\n Line2D([0], [0], label=\"Mean probability\", ls=\"--\", color=\"black\", alpha=0.5)\n]\n\nlegend = ax.legend(handles=handles, loc=4, fontsize=14, frameon=True, framealpha=0.8);\n\n\n\n\nOne of the first things one can see is that not only the center of the distributions varies but also their dispersion. Those posteriors that are very wide are associated with players who have batted only once or few times, while tighter posteriors correspond to players who batted several times.\nPlayers who have extreme empirical proportions have similar extreme posteriors under the non-hierarchical model. However, under the hierarchical model, these distributions are now shrunk towards the global mean. Extreme values are very unlikely under the hierarchical model.\nAnd finally, paraphrasing Eric, there’s nothing ineherently right or wrong about shrinkage and hierarchical models. Whether this is reasonable or not depends on our prior knowledge about the problem. And to me, after having seen the hit rates of the other players, it is much more reasonable to shrink extreme posteriors based on very few data points towards the global mean rather than just let them concentrate around 0 or 1.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\nmatplotlib: 3.6.2\nbambi : 0.9.3\nnumpy : 1.23.5\n\nWatermark: 2.3.1\n\n\n\n\n\n\n\n By default, the .predict() method obtains the posterior for the mean of the likelihood distribution. This mean would be \\(np\\) for the Binomial family. However, since \\(n\\) varies from observation to observation, it returns the value of \\(p\\), as if it was a Bernoulli family. \n .predict()just appends _mean to the name of the response to indicate it is the posterior of the mean." 
+ "text": "Bambi builds linear predictors of the form\n\\[\n\\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} + \\mathbf{Z}\\pmb{u}\n\\]\nThe linear predictor is the sum of two kinds of contributions\n\n\\(\\mathbf{X}\\pmb{\\beta}\\) is the common (fixed) effects contribution\n\\(\\mathbf{Z}\\pmb{u}\\) is the group-specific (random) effects contribution\n\nBoth contributions obey the same rule: A dot product between a data object and a parameter object.\n\n\n\nThe following objects are design matrices\n\\[\n\\begin{array}{c}\n\\underset{n\\times p}{\\mathbf{X}}\n& \\underset{n\\times j}{\\mathbf{Z}}\n\\end{array}\n\\]\n\n\\(\\mathbf{X}\\) is the design matrix for the common (fixed) effects part\n\\(\\mathbf{Z}\\) is the design matrix for the group-specific (random) effects part\n\n\n\n\nThe following objects are parameter vectors\n\\[\n\\begin{array}{c}\n\\underset{p\\times 1}{\\pmb{\\beta}}\n& \\underset{j\\times 1}{\\pmb{u}}\n\\end{array}\n\\]\n\n\\(\\pmb{\\beta}\\) is a vector of parameters/coefficients for the common (fixed) effects part\n\\(\\pmb{u}\\) is a vector of parameters/coefficients for the group-specific (random) effects part\n\nAs result, the linear predictor \\(\\pmb{\\eta}\\) is of shape \\(n \\times 1\\).\nA fundamental question: How do we use linear predictors in modeling?\nLinear predictors (or a function of them) describe the functional relationship between one or more parameters of the response distribution and the predictors.\n\n\n\nA classical linear regression model is a special case where there is no group-specific contribution and a linear predictor is mapped to the mean parameter of the response distribution.\n\\[\n\\begin{aligned}\n\\pmb{\\mu} &= \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\beta} &\\sim \\text{Distribution} \\\\\n\\sigma &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Normal}(\\eta_i, \\sigma)\n\\end{aligned}\n\\]\n\n\n\nLink functions turn linear models in generalized linear models. A link function, \\(g\\), is a function that maps a parameter of the response distribution to the linear predictor. When people talk about generalized linear models, they mean there’s a link function mapping the mean of the response distribution to the linear predictor. But as we will see later, Bambi allows to use linear predictors and link functions to model any parameter of the response distribution – these are known as distributional models or generalized linear models for location, scale, and shape.\n\\[\ng(\\pmb{\\mu}) = \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta}\n\\]\nwhere \\(g\\) is the link function. It must be differentiable, monotonic, and invertible. For example, the logit function is useful when the mean parameter is bounded in the \\((0, 1)\\) domain.\n\\[\n\\begin{aligned}\ng(\\pmb{\\mu}) &= \\text{logit}(\\pmb{\\mu}) = \\log \\left(\\frac{\\pmb{\\mu}}{1 - \\pmb{\\mu}}\\right) = \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\mu} = g^{-1}(\\pmb{\\eta}) &= \\text{logistic}(\\pmb{\\eta}) = \\frac{1}{1 + \\exp (-\\pmb{\\eta})} = \\frac{1}{1 + \\exp (-\\mathbf{X}\\pmb{\\beta})}\n\\end{aligned}\n\\]\n\n\n\n\\[\n\\begin{aligned}\ng(\\pmb{\\mu}) &= \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\beta} &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Bernoulli}(\\mu = g^{-1}(\\mathbf{X}\\pmb{\\beta} )_i)\n\\end{aligned}\n\\]\nwhere \\(g = \\text{logit}\\) and \\(g^{-1} = \\text{logistic}\\) also known as \\(\\text{expit}\\).\n\n\n\nThis is an extension to generalized linear models. 
In a generalized linear model a linear predictor and a link function are used to explain the relationship between the mean (location) of the response distribution and the predictors. In this type of models we are able to use linear predictors and link functions to represent the relationship between any parameter of the response distribution and the predictors.\n\\[\n\\begin{aligned}\ng_1(\\pmb{\\theta}_1) &= \\mathbf{X}_1\\pmb{\\beta}_1 + \\mathbf{Z}_1\\pmb{u}_1 \\\\\ng_2(\\pmb{\\theta}_2) &= \\mathbf{X}_2\\pmb{\\beta}_2 + \\mathbf{Z}_2\\pmb{u}_2 \\\\\n&\\phantom{b=\\,} \\vdots \\\\\ng_k(\\pmb{\\theta}_k) &= \\mathbf{X}_k\\pmb{\\beta}_k + \\mathbf{Z}_k\\pmb{u}_k \\\\\nY_i &\\sim \\text{Distribution}(\\theta_{1i}, \\theta_{2i}, \\dots, \\theta_{ki})\n\\end{aligned}\n\\]\n\n\n\n\\[\n\\begin{aligned}\ng_1(\\pmb{\\mu}) &= \\mathbf{X}_1\\pmb{\\beta}_1 \\\\\ng_2(\\pmb{\\sigma}) &= \\mathbf{X}_2\\pmb{\\beta}_2 \\\\\n\\pmb{\\beta}_1 &\\sim \\text{Distribution} \\\\\n\\pmb{\\beta}_2 &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Normal}(\\mu_i, \\sigma_i)\n\\end{aligned}\n\\]\nWhere\n\n\\(g_1\\) is the identity function\n\\(g_2\\) is a function that maps \\(\\mathbb{R}\\to\\mathbb{R}^+\\).\n\nUsually \\(g_2 = \\log\\)\n\\(\\pmb{\\sigma} = \\exp(\\mathbf{X}_2\\pmb{\\beta}_2)\\).\n\n\n\n\n\nA design matrix is… a matrix. As such, it’s filled up with numbers. However, it does not mean it cannot encode non-numerical variables. In a design matrix we can encode the following\n\nNumerical predictors\nInteraction effects\nTransformations of numerical predictors that don’t depend on model parameters\n\nPowers\nCentering\nStandardization\nBasis functions\n\nBambi currently supports basis splines\n\nAnd anything you can imagine as well as it does not involve model parameters\n\nCategorical predictors\n\nCategorical variables are encoded into their own design matrices\nThe most popular approach is to create binary “dummy” variables. One per level of the categorical variable.\nBut doing it haphazardly will result in non-identifiabilities quite soon.\nEncodings to the rescue\n\nOne can apply different restrictions or contrast matrices to overcome this problem. They usually imply different interpretations of the regression coefficients.\nTreatment encoding: Sets one level to zero\nZero-sum encoding: Sets one level to the opposite of the sum of the other levels\nBackward differences\nOrthogonal polynomials\nHelmert contrasts\n…\n\n\n\nThese all can be expressed as a single set of columns of a design matrix that are matched with a subset of the parameter vector of the same length\n\n\n\n\nData matrices are built by formulae.\n\nData matrices are not dependent on parameter values in any form.\n\nBambi consumes and manipulates them to create model terms, which shape the parameter vector.\n\nThe parameter vector is not influenced by the values in the data matrix.\n\n\nGoing back to planet Earth…" }, { - "objectID": "notebooks/plot_predictions.html", - "href": "notebooks/plot_predictions.html", + "objectID": "notebooks/how_bambi_works.html#example", + "href": "notebooks/how_bambi_works.html#example", "title": "Bambi", - "section": "", - "text": "This notebook shows how to use, and the capabilities, of the plot_predictions function. 
The plot_predictions function is a part of Bambi’s sub-package interpret that features a set of tools used to interpret complex regression models that is inspired by the R package marginaleffects.\n\n\nThe purpose of the generalized linear model (GLM) is to unify the approaches needed to analyze data for which either: (1) the assumption of a linear relation between \\(x\\) and \\(y\\), or (2) the assumption of normal variation is not appropriate. GLMs are typically specified in three stages: 1. the linear predictor \\(\\eta = X\\beta\\) where \\(X\\) is an \\(n\\) x \\(p\\) matrix of explanatory variables. 2. the link function \\(g(\\cdot)\\) that relates the linear predictor to the mean of the outcome variable \\(\\mu = g^{-1}(\\eta) = g^{-1}(X\\beta)\\) 3. the random component specifying the distribution of the outcome variable \\(y\\) with mean \\(\\mathbb{E}(y|X) = \\mu\\).\nBased on these three specifications, the mean of the distribution of \\(y\\), given \\(X\\), is determined by \\(X\\beta: \\mathbb{E}(y|X) = g^{-1}(X\\beta)\\).\nGLMs are a broad family of models where the output \\(y\\) is typically assumed to follow an exponential family distribution, e.g., Binomial, Poisson, Gamma, Exponential, and Normal. The job of the link function is to map the linear space of the model \\(X\\beta\\) onto the non-linear space of a parameter like \\(\\mu\\). Commonly used link function are the logit and log link. Also known as the canonical link functions. This brief introduction to GLMs is not meant to be exhuastive, and another good starting point is the Bambi Basic Building Blocks example.\nDue to the link function, there are typically three quantities of interest to interpret in a GLM: 1. the linear predictor \\(\\eta\\) 2. the mean \\(\\mu = g^{-1}(\\eta)\\) 3. the response variable \\(Y \\sim \\mathcal{D}(\\mu, \\theta)\\) where \\(\\mu\\) is the mean parameter and \\(\\theta\\) is (possibly) a vector that contains all the other “nuissance” parameters of the distribution.\nAs modelers, we are usually more interested in interpreting (2) and (3). However, \\(\\mu\\) is not always on the same scale of the response variable and can be more difficult to interpret. Rather, the response scale is a more interpretable scale. Additionally, it is often the case that modelers would like to analyze how a model parameter varies across a range of explanatory variable values. To achieve such an analysis, Bambi has taken inspiration from the R package marginaleffects, and implemented a plot_predictions function that plots the conditional adjusted predictions to aid in the interpretation of GLMs. Below, it is briefly discussed what are conditionally adjusted predictions, how they are computed, and ultimately how to use the plot_predictions function.\n\n\n\nAdjusted predictions refers to the outcome predicted by a fitted model on a specified scale for a given combination of values of the predictor variables, such as their observed values, their means, or some user specified grid of values. The specification of the scale to make the predictions, the link or response scale, refers to the scale used to estimate the model. In normal linear regression, the link scale and the response scale are identical, and therefore, the adjusted prediction is expressed as the mean value of the response variable at the given values of the predictor variables. On the other hand, a logistic regression’s link and response scale are not identical. 
An adjusted prediction on the link scale will be represented as the log-odds of a successful response given values of the predictor variables. Whereas an adjusted prediction on the response scale gives the probability that the response variable equals 1. The conditional part of conditionally adjusted predictions represents the specific predictor(s) and its values we would like to condition on when plotting predictions.\n\n\nThe objective of plotting conditional adjusted predictions is to visualize how a parameter of the (conditional) response distribution varies as a function of (some) interpolated explanatory variables. This is done by holding all other explanatory variables constant at some specified value, a reference grid, that may or may not correspond to actual observations in the dataset used to fit the model. By default, the plot_predictions function uses a grid of 200 equally spaced values between the minimum and maximum values of the specified explanatory variable as the reference grid.\nThe plot_predictions function uses the fitted model to then compute the predicted values of the model parameter at each value of the reference grid. The plot_predictions function then uses these predictions to plot the model parameter as a function of (some) explanatory variable.\n\nimport arviz as az\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport bambi as bmb\n\n\n\n\n\nFor the first demonstration, we will use a Gaussian linear regression model with the mtcars dataset to better understand the plot_predictions function and its arguments. The mtcars dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The following is a brief description of the variables in the dataset:\n\nmpg: Miles/(US) gallon\ncyl: Number of cylinders\ndisp: Displacement (cu.in.)\nhp: Gross horsepower\ndrat: Rear axle ratio\nwt: Weight (1000 lbs)\nqsec: 1/4 mile time\nvs: Engine (0 = V-shaped, 1 = straight)\nam: Transmission (0 = automatic, 1 = manual)\ngear: Number of forward gear\n\n\n# Load data\ndata = bmb.load_data('mtcars')\ndata[\"cyl\"] = data[\"cyl\"].replace({4: \"low\", 6: \"medium\", 8: \"high\"})\ndata[\"gear\"] = data[\"gear\"].replace({3: \"A\", 4: \"B\", 5: \"C\"})\ndata[\"cyl\"] = pd.Categorical(data[\"cyl\"], categories=[\"low\", \"medium\", \"high\"], ordered=True)\n\n# Define and fit the Bambi model\nmodel = bmb.Model(\"mpg ~ 0 + hp * wt + cyl + gear\", data)\nidata = model.fit(draws=1000, target_accept=0.95, random_seed=1234)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [mpg_sigma, hp, wt, hp:wt, cyl, gear]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:19<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 20 seconds.\n\n\nWe can print the Bambi model object to obtain the model components. Below, we see that the Gaussian linear model uses an identity link function that results in no transformation of the linear predictor to the mean of the outcome variable, and the distrbution of the likelihood is Gaussian.\nNow that we have fitted the model, we can visualize how a model parameter varies as a function of (some) interpolated covariate. 
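To make the reference-grid idea concrete, the following sketch shows roughly what such a helper does by hand: build a grid for `hp`, hold the remaining covariates at fixed values, and predict the mean response on that grid. The fixed values chosen for `wt`, `cyl`, and `gear` are illustrative, not necessarily the defaults `plot_predictions` uses.

```python
# Hand-built reference grid for hp, holding the other covariates fixed
# (illustrative values; plot_predictions chooses its own defaults).
hp_grid = np.linspace(data["hp"].min(), data["hp"].max(), 200)
ref_grid = pd.DataFrame({
    "hp": hp_grid,
    "wt": data["wt"].mean(),   # continuous covariate held at its mean
    "cyl": "medium",           # categorical covariates held at a fixed level
    "gear": "A",
})

# Predict the mean of mpg on the reference grid without modifying `idata` in place.
idata_grid = model.predict(idata, data=ref_grid, inplace=False)
mpg_mean = idata_grid.posterior["mpg_mean"].mean(("chain", "draw"))
```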
For this example, we will visualize how the mean response mpg varies as a function of the covariate hp.\nThe Bambi model, ArviZ inference data object (containing the posterior samples and the data used to fit the model), and a list or dictionary of covariates, in this example only hp, are passed to the plot_predictions function. The plot_predictions function then computes the conditional adjusted predictions for each covariate in the list or dictionary using the method described above. The plot_predictions function returns a matplotlib figure object that can be further customized.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"hp\", ax=ax);\n\n\n\n\nThe plot above shows that as hp increases, the mean mpg decreases. As stated above, this insight was obtained by creating the reference grid and then using the fitted model to compute the predicted values of the model parameter, in this example mpg, at each value of the reference grid.\nBy default, plot_predictions uses the highest density interval (HDI) of the posterior distribution to compute the credible interval of the conditional adjusted predictions. The HDI is a Bayesian analog to the frequentist confidence interval. The HDI is the shortest interval that contains a specified probability of the posterior distribution. By default, plot_predictions uses the 94% HDI.\nplot_predictions uses the posterior distribution by default to visualize some mean outcome parameter . However, the posterior predictive distribution can also be plotted by specifying pps=True where pps stands for posterior predictive samples of the response variable.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"hp\", pps=True, ax=ax);\n\n\n\n\nHere, we notice that the uncertainty in the conditional adjusted predictions is much larger than the uncertainty when pps=False. This is because the posterior predictive distribution accounts for the uncertainty in the model parameters and the uncertainty in the data. Whereas, the posterior distribution only accounts for the uncertainty in the model parameters.\nplot_predictions allows up to three covariates to be plotted simultaneously where the first element in the list represents the main (x-axis) covariate, the second element the group (hue / color), and the third element the facet (panel). However, when plotting more than one covariate, it can be useful to pass specific group and panel arguments to aid in the interpretation of the plot. Therefore, subplot_kwargs allows the user to manipulate the plotting by passing a dictionary where the keys are {\"main\": ..., \"group\": ..., \"panel\": ...} and the values are the names of the covariates to be plotted. For example, passing two covariates hp and wt and specifying subplot_kwargs={\"main\": \"hp\", \"group\": \"wt\", \"panel\": \"wt\"}.\n\nbmb.interpret.plot_predictions(\n model=model, \n idata=idata, \n covariates=[\"hp\", \"wt\"],\n pps=False,\n legend=False,\n subplot_kwargs={\"main\": \"hp\", \"group\": \"wt\", \"panel\": \"wt\"},\n fig_kwargs={\"figsize\": (20, 8), \"sharey\": True}\n)\nplt.tight_layout();\n\n\n\n\nFurthermore, categorical covariates can also be plotted. We plot the the mean mpg as a function of the two categorical covariates gear and cyl below. The plot_predictions function automatically plots the conditional adjusted predictions for each level of the categorical covariate. 
Furthermore, when passing a list of covariates into the plot_predictions function, the list will be converted into a dictionary object where the key is taken from (“horizontal”, “color”, “panel”) and the values are the names of the variables. By default, the first element of the list is specified as the “horizontal” covariate, the second element of the list is specified as the “color” covariate, and the third element of the list is mapped to different plot panels.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, [\"gear\", \"cyl\"], ax=ax);\n\n\n\n\n\n\n\nLets move onto a model that uses a distribution that is a member of the exponential distribution family and utilizes a link function. For this, we will implement the Negative binomial model from the students absences example. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math. We have attendance data on 314 high school juniors. The variables of insterest in the dataset are the following:\n\ndaysabs: The number of days of absence. It is our response variable.\nprogr: The type of program. Can be one of ‘General’, ‘Academic’, or ‘Vocational’.\nmath: Score in a standardized math test.\n\n\n# Load data, define and fit Bambi model\ndata = pd.read_stata(\"https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta\")\ndata[\"prog\"] = data[\"prog\"].map({1: \"General\", 2: \"Academic\", 3: \"Vocational\"})\n\nmodel_interaction = bmb.Model(\n \"daysabs ~ 0 + prog + scale(math) + prog:scale(math)\",\n data,\n family=\"negativebinomial\"\n)\nidata_interaction = model_interaction.fit(\n draws=1000, target_accept=0.95, random_seed=1234, chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [daysabs_alpha, prog, scale(math), prog:scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.\n\n\nThis model utilizes a log link function and a negative binomial distribution for the likelihood. Also note that this model also contains an interaction prog:sale(math).\n\nmodel_interaction\n\n Formula: daysabs ~ 0 + prog + scale(math) + prog:scale(math)\n Family: negativebinomial\n Link: mu = log\n Observations: 314\n Priors: \n target = mu\n Common-level effects\n prog ~ Normal(mu: [0. 0. 0.], sigma: [5.0102 7.4983 5.2746])\n scale(math) ~ Normal(mu: 0.0, sigma: 2.5)\n prog:scale(math) ~ Normal(mu: [0. 0.], sigma: [6.1735 4.847 ])\n \n Auxiliary parameters\n alpha ~ HalfCauchy(beta: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n \"math\", \n ax=ax, \n pps=False\n);\n\n\n\n\nThe plot above shows that as math increases, the mean daysabs decreases. However, as the model contains an interaction term, the effect of math on daysabs depends on the value of prog. 
Therefore, we will use plot_predictions to plot the conditional adjusted predictions for each level of prog.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n [\"math\", \"prog\"], \n ax=ax, \n pps=False\n);\n\n\n\n\nPassing specific subplot_kwargs can allow for a more interpretable plot. Especially when the posterior predictive distribution plot results in overlapping credible intervals.\n\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n covariates=[\"math\", \"prog\"],\n pps=True,\n subplot_kwargs={\"main\": \"math\", \"group\": \"prog\", \"panel\": \"prog\"},\n legend=False,\n fig_kwargs={\"figsize\": (16, 5), \"sharey\": True}\n);\n\n\n\n\n\n\n\nTo further demonstrate the plot_predictions function, we will implement a logistic regression model. This example is taken from the marginaleffects plot_predictions documentation. The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. The movies in this dataset were selected for inclusion if they had a known length and had been rated by at least one imdb user. The dataset below contains 28,819 rows and 24 columns. The variables of interest in the dataset are the following: - title. Title of the movie. - year. Year of release. - budget. Total budget (if known) in US dollars - length. Length in minutes. - rating. Average IMDB user rating. - votes. Number of IMDB users who rated this movie. - r1-10. Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1. - mpaa. MPAA rating. - action, animation, comedy, drama, documentary, romance, short. Binary variables represent- ing if movie was classified as belonging to that genre.\n\ndata = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2movies/movies.csv\")\n\ndata[\"style\"] = \"Other\"\ndata.loc[data[\"Action\"] == 1, \"style\"] = \"Action\"\ndata.loc[data[\"Comedy\"] == 1, \"style\"] = \"Comedy\"\ndata.loc[data[\"Drama\"] == 1, \"style\"] = \"Drama\"\ndata[\"certified_fresh\"] = (data[\"rating\"] >= 8) * 1\ndata = data[data[\"length\"] < 240]\n\npriors = {\"style\": bmb.Prior(\"Normal\", mu=0, sigma=2)}\nmodel = bmb.Model(\"certified_fresh ~ 0 + length * style\", data=data, priors=priors, family=\"bernoulli\")\nidata = model.fit(random_seed=1234, target_accept=0.9, init=\"adapt_diag\")\n\nModeling the probability that certified_fresh==1\nAuto-assigning NUTS sampler...\nInitializing NUTS using adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [length, style, length:style]\n\n\n\n\n\n\n\n \n \n 43.56% [3485/8000 04:04<05:16 Sampling 4 chains, 0 divergences]\n \n \n\n\nThe logistic regression model uses a logit link function and a Bernoulli likelihood. Therefore, the link scale is the log-odds of a successful response and the response scale is the probability of a successful response.\n\nmodel\n\n Formula: certified_fresh ~ 0 + length * style\n Family: bernoulli\n Link: p = logit\n Observations: 58662\n Priors: \n target = p\n Common-level effects\n length ~ Normal(mu: 0.0, sigma: 0.0708)\n style ~ Normal(mu: 0.0, sigma: 2.0)\n length:style ~ Normal(mu: [0. 0. 
0.], sigma: [0.0702 0.0509 0.0611])\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nAgain, by default, the plot_predictions function plots the mean outcome on the response scale. Therefore, the plot below shows the probability of a successful response certified_fresh as a function of length.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"length\", ax=ax);\n\n\n\n\nAdditionally, we can see how the probability of certified_fresh varies as a function of categorical covariates.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"style\", ax=ax);\n\n\n\n\n\n\n\nplot_predictions also has the argument target where target determines what parameter of the response distribution is plotted as a function of the explanatory variables. This argument is useful in distributional models, i.e., when the response distribution contains a parameter for location, scale and or shape. The default of this argument is mean and passing a parameter into target only works when the argument pps=False because when pps=True the posterior predictive distribution is plotted and thus, can only refer to the outcome variable (instead of any of the parameters of the response distribution). For this example, we will simulate our own dataset.\n\nrng = np.random.default_rng(121195)\nN = 200\na, b = 0.5, 1.1\nx = rng.uniform(-1.5, 1.5, N)\nshape = np.exp(0.3 + x * 0.5 + rng.normal(scale=0.1, size=N))\ny = rng.gamma(shape, np.exp(a + b * x) / shape, N)\ndata_gamma = pd.DataFrame({\"x\": x, \"y\": y})\n\nformula = bmb.Formula(\"y ~ x\", \"alpha ~ x\")\nmodel = bmb.Model(formula, data_gamma, family=\"gamma\")\nidata = model.fit(random_seed=1234)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, x, alpha_Intercept, alpha_x]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 25 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.\nThere were 25 divergences after tuning. Increase `target_accept` or reparameterize.\n\n\n\nmodel\n\n Formula: y ~ x\n alpha ~ x\n Family: gamma\n Link: mu = inverse\n alpha = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n target = alpha\n Common-level effects\n alpha_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n alpha_x ~ Normal(mu: 0.0, sigma: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nThe model we defined uses a gamma distribution parameterized by alpha and mu where alpha utilizes a log link and mu goes through an inverse link. 
Therefore, we can plot either: (1) the mu of the response distribution (which is the default), or (2) alpha of the response distribution as a function of the explanatory variable \\(x\\).\n\n# First, the mean of the response (default)\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"x\", ax=ax);\n\n\n\n\nBelow, instead of plotting the default target, target=mean, we set target=alpha to visualize how the model parameter alpha varies as a function of the x predictor.\n\n# Second, another param. of the distribution: alpha\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"x\", target='alpha', ax=ax);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Aug 16 2023\n\nPython implementation: CPython\nPython version : 3.11.0\nIPython version : 8.13.2\n\npandas : 2.0.1\nmatplotlib: 3.7.1\nbambi : 0.10.0.dev0\narviz : 0.15.1\nnumpy : 1.24.2\n\nWatermark: 2.3.1" + "section": "Example", + "text": "Example\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\naz.style.use(\"arviz-darkgrid\")\n\n\ndata = bmb.load_data(\"sleepstudy\")\n\n\ndef plot_data(data):\n fig, axes = plt.subplots(2, 9, figsize=(16, 7.5), sharey=True, sharex=True, dpi=300, constrained_layout=False)\n fig.subplots_adjust(left=0.075, right=0.975, bottom=0.075, top=0.925, wspace=0.03)\n\n axes_flat = axes.ravel()\n\n for i, subject in enumerate(data[\"Subject\"].unique()):\n ax = axes_flat[i]\n idx = data.index[data[\"Subject\"] == subject].tolist()\n days = data.loc[idx, \"Days\"].to_numpy()\n reaction = data.loc[idx, \"Reaction\"].to_numpy()\n\n # Plot observed data points\n ax.scatter(days, reaction, color=\"C0\", ec=\"black\", alpha=0.7)\n\n # Add a title\n ax.set_title(f\"Subject: {subject}\", fontsize=14)\n\n ax.xaxis.set_ticks([0, 2, 4, 6, 8])\n fig.text(0.5, 0.02, \"Days\", fontsize=14)\n fig.text(0.03, 0.5, \"Reaction time (ms)\", rotation=90, fontsize=14, va=\"center\")\n\n return axes\n\nplot_data(data);\n\n\n\n\nThe model\n\\[\n\\begin{aligned}\n\\mu_i & = \\beta_0 + \\beta_1 \\text{Days}_i + u_{0i} + u_{1i}\\text{Days}_i \\\\\n\\beta_0 & \\sim \\text{Normal} \\\\\n\\beta_1 & \\sim \\text{Normal} \\\\\nu_{0i} & \\sim \\text{Normal}(0, \\sigma_{u_0}) \\\\\nu_{1i} & \\sim \\text{Normal}(0, \\sigma_{u_1}) \\\\\n\\sigma_{u_0} & \\sim \\text{HalfNormal} \\\\\n\\sigma_{u_1} & \\sim \\text{HalfNormal} \\\\\n\\sigma & \\sim \\text{HalfStudentT} \\\\\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma)\n\\end{aligned}\n\\]\nWritten in a slightly different way (and omitting some priors)…\n\\[\n\\begin{aligned}\n\\mu_i & = \\text{Intercept}_i + \\text{Slope}_i \\text{Days}_i \\\\\n\\text{Intercept}_i & = \\beta_0 + u_{0i} \\\\\n\\text{Slope}_i & = \\beta_1 + u_{1i} \\\\\n\\sigma & \\sim \\text{HalfStudentT} \\\\\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma) \\\\\n\\end{aligned}\n\\]\nWe can see both the intercept and the slope are made of a “common” component and a “subject-specific” deflection.\nUnder the general representation written above…\n\\[\n\\begin{aligned}\n\\pmb{\\mu} &= \\mathbf{X}\\pmb{\\beta} + \\mathbf{Z}\\pmb{u} \\\\\n\\pmb{\\beta} &\\sim \\text{Normal} \\\\\n\\pmb{u} &\\sim \\text{Normal}(0, \\text{diag}(\\sigma_{\\pmb{u}})) \\\\\n\\sigma &\\sim \\text{HalfStudenT} \\\\\n\\sigma_{\\pmb{u}} &\\sim \\text{HalfNormal} \\\\\nY_i &\\sim \\text{Normal}(\\mu_i, \\sigma)\n\\end{aligned}\n\\]\n\nmodel = 
bmb.Model(\"Reaction ~ 1 + Days + (1 + Days | Subject)\", data, categorical=\"Subject\")\nmodel\n\n Formula: Reaction ~ 1 + Days + (1 + Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0.0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n Reaction_sigma ~ HalfStudentT(nu: 4.0, sigma: 56.1721)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\n\ndm = model.response_component.design\ndm\n\nDesignMatrices\n\n (rows, cols)\nResponse: (180,)\nCommon: (180, 2)\nGroup-specific: (180, 36)\n\nUse .reponse, .common, or .group to access the different members.\n\n\n\nprint(dm.response, \"\\n\")\nprint(np.array(dm.response)[:5])\n\nResponseMatrix \n name: Reaction\n kind: numeric\n shape: (180,)\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[249.56 258.7047 250.8006 321.4398 356.8519]\n\n\n\nprint(dm.common, \"\\n\")\nprint(np.array(dm.common)[:5])\n\nCommonEffectsMatrix with shape (180, 2)\nTerms: \n Intercept \n kind: intercept\n column: 0\n Days \n kind: numeric\n column: 1\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[[1 0]\n [1 1]\n [1 2]\n [1 3]\n [1 4]]\n\n\n\nprint(dm.group, \"\\n\")\nprint(np.array(dm.group)[:14])\n\nGroupEffectsMatrix with shape (180, 36)\nTerms: \n 1|Subject \n kind: intercept\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350',\n '351', '352', '369', '370', '371', '372']\n columns: 0:18\n Days|Subject \n kind: numeric\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350',\n '351', '352', '369', '370', '371', '372']\n columns: 18:36\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n\n\n\nmodel.response_component.intercept_term\n\nCommonTerm( \n name: Intercept,\n prior: Normal(mu: 298.5079, sigma: 261.0092),\n shape: (180,),\n categorical: False\n)\n\n\n\nmodel.response_component.common_terms\n\n{'Days': CommonTerm( \n name: Days,\n prior: Normal(mu: 0.0, sigma: 48.8915),\n shape: (180,),\n categorical: False\n )}\n\n\n\nmodel.response_component.group_specific_terms\n\n{'1|Subject': GroupSpecificTerm( \n name: 1|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092)),\n shape: (180, 18),\n 
categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Days|Subject': GroupSpecificTerm( \n name: Days|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n )}\n\n\nTerms not only exist in the Bambi world. There are three (!!) types of terms being created.\n\nFormulae has its terms\n\nAgnostic information design matrix information\n\nBambi has its terms\n\nContains both the information given by formulae and metadata relevant to Bambi (priors)\n\nThe backend has its terms\n\nAccept a Bambi term and knows how to “compile” itself to that backend.\nE.g. the PyMC backend terms know how to write one or more PyMC distributions out of a Bambi term.\n\n\nCould we have multiple backends? In principle yes. But there’s one aspect which is convoluted, dims and coords, and the solution we found (not the best) prevented us from separating all stuff and making the front-end completely independent of the backend.\nFormulae terms\n\ndm.common.terms\n\n{'Intercept': Intercept(), 'Days': Term([Variable(Days)])}\n\n\n\ndm.group.terms\n\n{'1|Subject': GroupSpecificTerm(\n expr= Intercept(),\n factor= Term([Variable(Subject)])\n ),\n 'Days|Subject': GroupSpecificTerm(\n expr= Term([Variable(Days)]),\n factor= Term([Variable(Subject)])\n )}\n\n\nBambi terms\n\nmodel.response_component.terms\n\n{'Intercept': CommonTerm( \n name: Intercept,\n prior: Normal(mu: 298.5079, sigma: 261.0092),\n shape: (180,),\n categorical: False\n ),\n 'Days': CommonTerm( \n name: Days,\n prior: Normal(mu: 0.0, sigma: 48.8915),\n shape: (180,),\n categorical: False\n ),\n '1|Subject': GroupSpecificTerm( \n name: 1|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Days|Subject': GroupSpecificTerm( \n name: Days|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Reaction': ResponseTerm( \n name: Reaction,\n prior: Normal(mu: 0.0, sigma: 1.0),\n shape: (180,),\n categorical: False\n )}\n\n\nRandom idea: Perhaps in a future we can make Bambi more extensible by using generics-based API and some type of register. I haven’t thought about it at all yet." }, { - "objectID": "notebooks/mister_p.html", - "href": "notebooks/mister_p.html", + "objectID": "notebooks/splines_cherry_blossoms.html", + "href": "notebooks/splines_cherry_blossoms.html", "title": "Bambi", "section": "", - "text": "What are we even doing when we fit a regression model? Is a question that arises when first learning the tools of the trade and again when debugging strange results of your thousandth logistic regression model.\nThis notebook is intended to showcase how regression can be seen as a method for automating the calculation of stratum specific conditional effects. Additionally, we’ll see how we can enrich regression models by a post-stratification adjustment with knowledge of the appropriate stratum specific weights. 
This technique of multilevel regression and post stratification (MrP) is often used in the context of national surveys where we have knowledge of the population weights appropriate to different demographic groups. It can be used in a wide variety of areas ranging from political polling to online market research. We will demonstrate how to fit and and assess these models using Bambi.\n\nimport warnings\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport pymc as pm\n\nwarnings.simplefilter(action=\"ignore\", category=FutureWarning)\n\n\n\n\nFirst consider this example of heart transplant patients adapted from Hernan and Robins’ excellent book Causal Inference: What if. Here we have a number of patients (anonymised with names for the Greek Gods). The data records the outcomes of a heart transplant program for those who were part of the program and those who were not. We also see the different risk levels of each patient assigned the treatment.\nWhat we want to show here is that a regression model fit to this data automatically accounts for the weighting appropriate to the different risk strata. The data is coded with 0-1 indicators for status. Risk_Strata is either 1 for higher risk or 0 for lower risk. Outcome is whether or not the patient died from the procedure, and Treatment is whether or not the patient received treatment.\n\ndf = pd.DataFrame(\n {\n \"name\": [\n \"Rheia\",\n \"Kronos\",\n \"Demeter\",\n \"Hades\",\n \"Hestia\",\n \"Poseidon\",\n \"Hera\",\n \"Zeus\",\n \"Artemis\",\n \"Apollo\",\n \"Leto\",\n \"Ares\",\n \"Athena\",\n \"Hephaestus\",\n \"Aphrodite\",\n \"Cyclope\",\n \"Persephone\",\n \"Hermes\",\n \"Hebe\",\n \"Dionysus\",\n ],\n \"Risk_Strata\": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n \"Treatment\": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n \"Outcome\": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],\n }\n)\n\ndf[\"Treatment_x_Risk_Strata\"] = df.Treatment * df.Risk_Strata\n\ndf\n\n\n\n\n\n \n \n \n name\n Risk_Strata\n Treatment\n Outcome\n Treatment_x_Risk_Strata\n \n \n \n \n 0\n Rheia\n 0\n 0\n 0\n 0\n \n \n 1\n Kronos\n 0\n 0\n 1\n 0\n \n \n 2\n Demeter\n 0\n 0\n 0\n 0\n \n \n 3\n Hades\n 0\n 0\n 0\n 0\n \n \n 4\n Hestia\n 0\n 1\n 0\n 0\n \n \n 5\n Poseidon\n 0\n 1\n 0\n 0\n \n \n 6\n Hera\n 0\n 1\n 0\n 0\n \n \n 7\n Zeus\n 0\n 1\n 1\n 0\n \n \n 8\n Artemis\n 1\n 0\n 1\n 0\n \n \n 9\n Apollo\n 1\n 0\n 1\n 0\n \n \n 10\n Leto\n 1\n 0\n 0\n 0\n \n \n 11\n Ares\n 1\n 1\n 1\n 1\n \n \n 12\n Athena\n 1\n 1\n 1\n 1\n \n \n 13\n Hephaestus\n 1\n 1\n 1\n 1\n \n \n 14\n Aphrodite\n 1\n 1\n 1\n 1\n \n \n 15\n Cyclope\n 1\n 1\n 1\n 1\n \n \n 16\n Persephone\n 1\n 1\n 1\n 1\n \n \n 17\n Hermes\n 1\n 1\n 0\n 1\n \n \n 18\n Hebe\n 1\n 1\n 0\n 1\n \n \n 19\n Dionysus\n 1\n 1\n 0\n 1\n \n \n\n\n\n\nIf the treatment assignment procedure involved complete randomisation then we might expect a reasonable balance of strata effects across the treated and non-treated. 
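As a quick check of that balance, one could cross-tabulate risk strata against treatment (a minimal sketch, reusing the df and the pandas import from above):

# Share of each risk stratum within the treated and untreated groups
pd.crosstab(df["Treatment"], df["Risk_Strata"], normalize="index")

Roughly 43% of the untreated patients sit in the high-risk stratum versus about 69% of the treated, which is exactly the imbalance unpacked next.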
In this sample we see (perhaps counter intuitively) that the treatment seems to induce a higher rate of death than the non-treated group.\n\nsimple_average = df.groupby(\"Treatment\")[[\"Outcome\"]].mean().rename({\"Outcome\": \"Share\"}, axis=1)\nsimple_average\n\n\n\n\n\n \n \n \n Share\n \n \n Treatment\n \n \n \n \n \n 0\n 0.428571\n \n \n 1\n 0.538462\n \n \n\n\n\n\nWhich suggests an alarming causal effect whereby the treatment seems to increase risk of death in the population.\n\ncausal_risk_ratio = simple_average.iloc[1][\"Share\"] / simple_average.iloc[0][\"Share\"]\nprint(\"Causal Risk Ratio:\", causal_risk_ratio)\n\nCausal Risk Ratio: 1.2564102564102564\n\n\nThis finding we know on inspection is driven by the imbalance in the risk strata across the treatment groups.\n\ndf.groupby(\"Risk_Strata\")[[\"Treatment\"]].count().assign(\n proportion=lambda x: x[\"Treatment\"] / len(df)\n)\n\n\n\n\n\n \n \n \n Treatment\n proportion\n \n \n Risk_Strata\n \n \n \n \n \n \n 0\n 8\n 0.4\n \n \n 1\n 12\n 0.6\n \n \n\n\n\n\nWe can correct for this by weighting the results by the share each group represents across the Risk_Strata. In other words when we correct for the population size at the different levels of risk we get a better estimate of the effect. First we see what the expected outcome is for each strata.\n\noutcomes_controlled = (\n df.groupby([\"Risk_Strata\", \"Treatment\"])[[\"Outcome\"]]\n .mean()\n .reset_index()\n .pivot(index=\"Treatment\", columns=[\"Risk_Strata\"], values=\"Outcome\")\n)\n\noutcomes_controlled\n\n\n\n\n\n \n \n Risk_Strata\n 0\n 1\n \n \n Treatment\n \n \n \n \n \n \n 0\n 0.25\n 0.666667\n \n \n 1\n 0.25\n 0.666667\n \n \n\n\n\n\nNote how the expected outcomes are equal across the stratified groups. We can now combine these estimate with the population weights (derived earlier) in each segment to get our weighted average.\n\nweighted_avg = outcomes_controlled.assign(formula=\"0.4*0.25 + 0.6*0.66\").assign(\n weighted_average=lambda x: x[0] * (df[df[\"Risk_Strata\"] == 0].shape[0] / len(df))\n + x[1] * (df[df[\"Risk_Strata\"] == 1].shape[0] / len(df))\n)\n\nweighted_avg\n\n\n\n\n\n \n \n Risk_Strata\n 0\n 1\n formula\n weighted_average\n \n \n Treatment\n \n \n \n \n \n \n \n \n 0\n 0.25\n 0.666667\n 0.4*0.25 + 0.6*0.66\n 0.5\n \n \n 1\n 0.25\n 0.666667\n 0.4*0.25 + 0.6*0.66\n 0.5\n \n \n\n\n\n\nFrom which we can derive a more sensible treatment effect.\n\ncausal_risk_ratio = (\n weighted_avg.iloc[1][\"weighted_average\"] / weighted_avg.iloc[0][\"weighted_average\"]\n)\n\nprint(\"Causal Risk Ratio:\", causal_risk_ratio)\n\nCausal Risk Ratio: 1.0\n\n\n\n\n\nSo far, so good. But so what?\nThe point here is that the above series of steps can be difficult to accomplish with more complex sets of groups and risk profiles. So it’s useful to understand that regression can be used to automatically account for the variation in outcome effects across the different strata of our population. 
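To spell out the arithmetic behind the weighted-average table above: in either treatment arm the standardized risk is \(0.4 \times 0.25 + 0.6 \times \tfrac{2}{3} = 0.1 + 0.4 = 0.5\), so the corrected causal risk ratio is \(0.5 / 0.5 = 1\).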
More prosaically, the example shows that it really matters what variables you put in your model.\n\nreg = bmb.Model(\"Outcome ~ 1 + Treatment\", df)\nresults = reg.fit()\n\nreg_strata = bmb.Model(\"Outcome ~ 1 + Treatment + Risk_Strata + Treatment_x_Risk_Strata\", df)\nresults_strata = reg_strata.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Outcome_sigma, Intercept, Treatment]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Outcome_sigma, Intercept, Treatment, Risk_Strata, Treatment_x_Risk_Strata]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.\n\n\nWe can now inspect the treatment effect and the implied causal risk ratio in each model. We can quickly recover that controlling for the right variables in our regression model automatically adjusts the treatment effect downwards towards 0.\n\naz.summary(results)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.428\n 0.203\n 0.060\n 0.823\n 0.003\n 0.002\n 4840.0\n 2982.0\n 1.0\n \n \n Treatment\n 0.108\n 0.252\n -0.357\n 0.584\n 0.004\n 0.004\n 4258.0\n 2731.0\n 1.0\n \n \n Outcome_sigma\n 0.542\n 0.092\n 0.388\n 0.713\n 0.001\n 0.001\n 4073.0\n 2488.0\n 1.0\n \n \n\n\n\n\n\naz.summary(results_strata)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.254\n 0.261\n -0.233\n 0.743\n 0.005\n 0.004\n 2710.0\n 2648.0\n 1.0\n \n \n Treatment\n -0.001\n 0.367\n -0.653\n 0.730\n 0.008\n 0.006\n 2312.0\n 2648.0\n 1.0\n \n \n Risk_Strata\n 0.405\n 0.395\n -0.349\n 1.119\n 0.008\n 0.006\n 2274.0\n 2503.0\n 1.0\n \n \n Treatment_x_Risk_Strata\n 0.010\n 0.496\n -0.947\n 0.939\n 0.011\n 0.009\n 1986.0\n 2113.0\n 1.0\n \n \n Outcome_sigma\n 0.531\n 0.098\n 0.367\n 0.714\n 0.002\n 0.001\n 2389.0\n 2533.0\n 1.0\n \n \n\n\n\n\n\nax = az.plot_forest(\n [results, results_strata],\n model_names=[\"naive_model\", \"stratified_model\"],\n var_names=[\"Treatment\"],\n kind=\"ridgeplot\",\n ridgeplot_alpha=0.4,\n combined=True,\n figsize=(10, 6),\n)\nax[0].axvline(0, color=\"black\", linestyle=\"--\")\nax[0].set_title(\"Treatment Effects under Stratification/Non-stratification\");\n\n\n\n\nWe can even see this in the predicted outcomes for the model. This is an important step. 
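The counterfactual predictions computed just below can also be pushed one step further, to a full posterior for the causal risk ratio. The following is a minimal sketch, assuming the reg_strata model and results_strata idata fitted above (variable names are illustrative):

# Counterfactual data frames: everyone treated vs. everyone untreated,
# with the interaction column recomputed as Treatment * Risk_Strata.
treated = df[["Risk_Strata"]].assign(Treatment=1)
treated["Treatment_x_Risk_Strata"] = treated["Treatment"] * treated["Risk_Strata"]
untreated = df[["Risk_Strata"]].assign(Treatment=0, Treatment_x_Risk_Strata=0)

idata_treated = reg_strata.predict(results_strata, kind="pps", data=treated, inplace=False)
idata_untreated = reg_strata.predict(results_strata, kind="pps", data=untreated, inplace=False)

def mean_over_obs(arr):
    # Average over the observation dimension(s), keeping the chain and draw dimensions.
    return arr.mean(dim=[d for d in arr.dims if d not in ("chain", "draw")])

# Ratio of the two expected outcomes, draw by draw.
rr = mean_over_obs(idata_treated["posterior_predictive"]["Outcome"]) / mean_over_obs(
    idata_untreated["posterior_predictive"]["Outcome"]
)
az.summary(rr.to_dataset(name="causal_risk_ratio"))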
The regression model automatically adjusts for the risk profile within the appropriate strata in the data “seen” by the model.\n\nnew_df = df[[\"Risk_Strata\"]].assign(Treatment=1).assign(Treatment_x_Risk_Strata=1)\nnew_preds = reg_strata.predict(results_strata, kind=\"pps\", data=new_df, inplace=False)\nprint(\"Expected Outcome in the Treated\")\nnew_preds[\"posterior_predictive\"][\"Outcome\"].mean().item()\n\nExpected Outcome in the Treated\n\n\n0.5068569705412103\n\n\n\nnew_df = df[[\"Risk_Strata\"]].assign(Treatment=0).assign(Treatment_x_Risk_Strata=0)\nnew_preds = reg_strata.predict(results_strata, kind=\"pps\", data=new_df, inplace=False)\nprint(\"Expected Outcome in the Untreated\")\n\nnew_preds[\"posterior_predictive\"][\"Outcome\"].mean().item()\n\nExpected Outcome in the Untreated\n\n\n0.49944292437387866\n\n\nWe can see these results more clearly using Bambi’s model interpretation functions to see the predictions within specific strata.\n\nfig, axs = plt.subplots(1, 2, figsize=(20, 6))\naxs = axs.flatten()\nbmb.interpret.plot_predictions(reg, results, covariates=[\"Treatment\"], ax=axs[0])\nbmb.interpret.plot_predictions(reg_strata, results_strata, covariates=[\"Treatment\"], ax=axs[1])\naxs[0].set_title(\"Non Stratified Regression \\n Model Predictions\")\naxs[1].set_title(\"Stratified Regression \\n Model Predictions\");\n\n\n\n\nHernan and Robins expand on these foundational observations and elaborate the implications for causal inference and the bias of confounding variables. We won’t go into these details, as instead we want to draw out the connection with controlling for the risk of non-representative sampling. The usefulness of “representative-ness” as an idea is disputed in the statistical literature due to the vagueness of the term. To say a sample is representative is usually akin to meaning that it was generated from a high-quality probability sampling design. This design is specified to avoid the creep of bias due to selection effects contaminating the results.\nWe’ve seen how regression can automate stratification across the levels of covariates in the model conditional on the sample data. But what if the prevalence of the risk profile in your data does not reflect the prevalence of risk in the wider population? Then the regression model will automatically adjust to the prevalence in the sample, but it is not adjusting to the correct weights.\n\n\n\nIn the context of national survey design there is always a concern that the sample respondents may be more or less representative of the population across different key demographics, e.g. it’s unlikely we would put much faith in the survey’s accuracy if it had 90% male respondents on a question about the lived experience of women. Given that we can know beforehand that certain demographic splits are not reflective of the census data, we can use this information to appropriately re-weight the regressions fit to non-representative survey data.\nWe’ll demonstrate the idea of multilevel regression and post-stratification adjustment by replicating some of the steps discussed in Martin, Phillips and Gelman’s “Multilevel Regression and Poststratification Case Studies”.\nThey cite data from the Cooperative Congressional Election Study (Schaffner, Ansolabehere, and Luks (2018)), a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov.
The outcome of interest is a binary question: Should employers decline coverage of abortions in insurance plans?\n\ncces_all_df = pd.read_csv(\"data/mr_p_cces18_common_vv.csv.gz\", low_memory=False)\ncces_all_df.head()\n\n\n\n\n\n \n \n \n caseid\n commonweight\n commonpostweight\n vvweight\n vvweight_post\n tookpost\n CCEStake\n birthyr\n gender\n educ\n ...\n CL_party\n CL_2018gvm\n CL_2018pep\n CL_2018pvm\n starttime\n endtime\n starttime_post\n endtime_post\n DMA\n dmaname\n \n \n \n \n 0\n 123464282\n 0.940543\n 0.7936\n 0.740858\n 0.641412\n 2\n 1\n 1964\n 2\n 4\n ...\n 11.0\n 1.0\n NaN\n NaN\n 04oct2018 02:47:10\n 09oct2018 04:16:31\n 11nov2018 00:41:13\n 11nov2018 01:21:53\n 512.0\n BALTIMORE\n \n \n 1\n 170169205\n 0.769724\n 0.7388\n 0.425236\n 0.415134\n 2\n 1\n 1971\n 2\n 2\n ...\n 13.0\n NaN\n 6.0\n 2.0\n 02oct2018 06:55:22\n 02oct2018 07:32:51\n 12nov2018 00:49:50\n 12nov2018 01:08:43\n 531.0\n \"TRI-CITIES\n \n \n 2\n 175996005\n 1.491642\n 1.3105\n 1.700094\n 1.603264\n 2\n 1\n 1958\n 2\n 3\n ...\n 13.0\n 5.0\n NaN\n NaN\n 07oct2018 00:48:23\n 07oct2018 01:38:41\n 12nov2018 21:49:41\n 12nov2018 22:19:28\n 564.0\n CHARLESTON-HUNTINGTON\n \n \n 3\n 176818556\n 5.104709\n 4.6304\n 5.946729\n 5.658840\n 2\n 1\n 1946\n 2\n 6\n ...\n 4.0\n 3.0\n NaN\n 3.0\n 11oct2018 15:20:26\n 11oct2018 16:18:42\n 11nov2018 13:24:16\n 11nov2018 14:00:14\n 803.0\n LOS ANGELES\n \n \n 4\n 202120533\n 0.466526\n 0.3745\n 0.412451\n 0.422327\n 2\n 1\n 1972\n 2\n 2\n ...\n 3.0\n 5.0\n NaN\n NaN\n 08oct2018 02:31:28\n 08oct2018 03:03:48\n 15nov2018 01:04:16\n 15nov2018 01:57:21\n 529.0\n LOUISVILLE\n \n \n\n5 rows × 526 columns\n\n\n\n\n\nTo prepare the census data for modelling we need to break the demographic data into appropriate stratum. We will break out these groupings as along broad categories familiar to audiences of election coverage news. 
Even these steps amount to a significant choice where we use our knowledge of pertinent demographics to decide upon the key strata we wish to represent in our model, as we seek to better predict and understand the voting outcome.\n\nstates = [\n \"AL\",\n \"AK\",\n \"AZ\",\n \"AR\",\n \"CA\",\n \"CO\",\n \"CT\",\n \"DE\",\n \"FL\",\n \"GA\",\n \"HI\",\n \"ID\",\n \"IL\",\n \"IN\",\n \"IA\",\n \"KS\",\n \"KY\",\n \"LA\",\n \"ME\",\n \"MD\",\n \"MA\",\n \"MI\",\n \"MN\",\n \"MS\",\n \"MO\",\n \"MT\",\n \"NE\",\n \"NV\",\n \"NH\",\n \"NJ\",\n \"NM\",\n \"NY\",\n \"NC\",\n \"ND\",\n \"OH\",\n \"OK\",\n \"OR\",\n \"PA\",\n \"RI\",\n \"SC\",\n \"SD\",\n \"TN\",\n \"TX\",\n \"UT\",\n \"VT\",\n \"VA\",\n \"WA\",\n \"WV\",\n \"WI\",\n \"WY\",\n]\n\n\nnumbers = list(range(1, 56, 1))\n\nlkup_states = dict(zip(numbers, states))\nlkup_states\n\n\nethnicity = [\n \"White\",\n \"Black\",\n \"Hispanic\",\n \"Asian\",\n \"Native American\",\n \"Mixed\",\n \"Other\",\n \"Middle Eastern\",\n]\nnumbers = list(range(1, 9, 1))\nlkup_ethnicity = dict(zip(numbers, ethnicity))\nlkup_ethnicity\n\n\nedu = [\"No HS\", \"HS\", \"Some college\", \"Associates\", \"4-Year College\", \"Post-grad\"]\nnumbers = list(range(1, 7, 1))\nlkup_edu = dict(zip(numbers, edu))\n\n\ndef clean_df(df):\n ## 0 Oppose and 1 Support\n df[\"abortion\"] = np.abs(df[\"CC18_321d\"] - 2)\n df[\"state\"] = df[\"inputstate\"].map(lkup_states)\n ## dichotomous (coded as -0.5 Female, +0.5 Male)\n df[\"male\"] = np.abs(df[\"gender\"] - 2) - 0.5\n df[\"eth\"] = df[\"race\"].map(lkup_ethnicity)\n df[\"eth\"] = np.where(\n df[\"eth\"].isin([\"Asian\", \"Other\", \"Middle Eastern\", \"Mixed\", \"Native American\"]),\n \"Other\",\n df[\"eth\"],\n )\n df[\"age\"] = 2018 - df[\"birthyr\"]\n df[\"age\"] = pd.cut(\n df[\"age\"].astype(int),\n [0, 29, 39, 49, 59, 69, 120],\n labels=[\"18-29\", \"30-39\", \"40-49\", \"50-59\", \"60-69\", \"70+\"],\n ordered=True,\n )\n df[\"edu\"] = df[\"educ\"].map(lkup_edu)\n df[\"edu\"] = np.where(df[\"edu\"].isin([\"Some college\", \"Associates\"]), \"Some college\", df[\"edu\"])\n\n df = df[[\"abortion\", \"state\", \"eth\", \"male\", \"age\", \"edu\", \"caseid\"]]\n return df.dropna()\n\n\nstatelevel_predictors_df = pd.read_csv(\"data/mr_p_statelevel_predictors.csv\")\n\ncces_all_df = clean_df(cces_all_df)\ncces_all_df.head()\n\n\n\n\n\n \n \n \n abortion\n state\n eth\n male\n age\n edu\n caseid\n \n \n \n \n 0\n 1.0\n MS\n Other\n -0.5\n 50-59\n Some college\n 123464282\n \n \n 1\n 1.0\n WA\n White\n -0.5\n 40-49\n HS\n 170169205\n \n \n 2\n 1.0\n RI\n White\n -0.5\n 60-69\n Some college\n 175996005\n \n \n 3\n 0.0\n CO\n Other\n -0.5\n 70+\n Post-grad\n 176818556\n \n \n 4\n 1.0\n MA\n White\n -0.5\n 40-49\n HS\n 202120533\n \n \n\n\n\n\nWe will now show how estimates drawn from sample data (biased for whatever reasons of chance and circumstance) can be improved by using a post-stratification adjustment based on known facts about the size of the population in each strata considered in the model. This additional step is simply another modelling choice - another way to invest our model with information. In this manner the technique comes naturally in the Bayesian perspective.\n\n\n\nConsider a deliberately biased sample. 
Biased away from the census data and in this manner we show how to better recover population level estimates by incorporating details about the census population size across each of the stratum.\n\ncces_df = cces_all_df.merge(statelevel_predictors_df, left_on=\"state\", right_on=\"state\", how=\"left\")\ncces_df[\"weight\"] = (\n 5 * cces_df[\"repvote\"]\n + (cces_df[\"age\"] == \"18-29\") * 0.5\n + (cces_df[\"age\"] == \"30-39\") * 1\n + (cces_df[\"age\"] == \"40-49\") * 2\n + (cces_df[\"age\"] == \"50-59\") * 4\n + (cces_df[\"age\"] == \"60-69\") * 6\n + (cces_df[\"age\"] == \"70+\") * 8\n + (cces_df[\"male\"] == 1) * 20\n + (cces_df[\"eth\"] == \"White\") * 1.05\n)\n\ncces_df = cces_df.sample(5000, weights=\"weight\", random_state=1000)\ncces_df.head()\n\n\n\n\n\n \n \n \n abortion\n state\n eth\n male\n age\n edu\n caseid\n repvote\n region\n weight\n \n \n \n \n 35171\n 0.0\n KY\n White\n -0.5\n 60-69\n HS\n 415208636\n 0.656706\n South\n 10.333531\n \n \n 5167\n 0.0\n NM\n White\n 0.5\n 60-69\n No HS\n 412278020\n 0.453492\n West\n 9.317460\n \n \n 52365\n 0.0\n OK\n Hispanic\n -0.5\n 30-39\n 4-Year College\n 419467449\n 0.693047\n South\n 4.465237\n \n \n 23762\n 1.0\n WV\n White\n -0.5\n 50-59\n Post-grad\n 413757903\n 0.721611\n South\n 8.658053\n \n \n 48197\n 0.0\n RI\n White\n 0.5\n 50-59\n 4-Year College\n 417619385\n 0.416893\n Northeast\n 7.134465\n \n \n\n\n\n\n\n\n\nNow we plot the outcome of expected shares within each demographic bucket across both the biased sample and the census data.\n\nmosaic = \"\"\"\n ABCD\n EEEE\n \"\"\"\n\nfig = plt.figure(layout=\"constrained\", figsize=(20, 10))\nax_dict = fig.subplot_mosaic(mosaic)\n\n\ndef plot_var(var, ax):\n a = (\n cces_df.groupby(var, observed=False)[[\"abortion\"]]\n .mean()\n .rename({\"abortion\": \"share\"}, axis=1)\n .reset_index()\n )\n b = (\n cces_all_df.groupby(var, observed=False)[[\"abortion\"]]\n .mean()\n .rename({\"abortion\": \"share_census\"}, axis=1)\n .reset_index()\n )\n a = a.merge(b).sort_values(\"share\")\n ax_dict[ax].vlines(a[var], a.share, a.share_census)\n ax_dict[ax].scatter(a[var], a.share, color=\"blue\", label=\"Sample\")\n ax_dict[ax].scatter(a[var], a.share_census, color=\"red\", label=\"Census\")\n ax_dict[ax].set_ylabel(\"Proportion\")\n\n\nplot_var(\"age\", \"A\")\nplot_var(\"edu\", \"B\")\nplot_var(\"male\", \"C\")\nplot_var(\"eth\", \"D\")\nplot_var(\"state\", \"E\")\n\nax_dict[\"E\"].legend()\n\nax_dict[\"C\"].set_xticklabels([])\nax_dict[\"C\"].set_xlabel(\"Female / Male\")\nplt.suptitle(\"Comparison of Proportions: Survey Sample V Census\", fontsize=20);\n\n\n\n\nWe can see here how the proportions differ markedly across the census report and our biased sample in how they represent the preferential votes with each strata. We now try and quantify the overall differences between the biased sample and the census report. 
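The standard error attached to each proportion is the usual one for a Bernoulli share, \(\operatorname{se}(\hat{p}) = \sqrt{\hat{p}(1 - \hat{p}) / n}\), which is what the small helper in the next cell computes.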
We calculate the expected proportions in each dataset and their standard error.\n\ndef get_se_bernoulli(p, n):\n return np.sqrt(p * (1 - p) / n)\n\n\nsample_cces_estimate = {\n \"mean\": np.mean(cces_df[\"abortion\"].astype(float)),\n \"se\": get_se_bernoulli(np.mean(cces_df[\"abortion\"].astype(float)), len(cces_df)),\n}\nsample_cces_estimate\n\n\nsample_cces_all_estimate = {\n \"mean\": np.mean(cces_all_df[\"abortion\"].astype(float)),\n \"se\": get_se_bernoulli(np.mean(cces_all_df[\"abortion\"].astype(float)), len(cces_all_df)),\n}\nsample_cces_all_estimate\n\nsummary = pd.DataFrame([sample_cces_all_estimate, sample_cces_estimate])\nsummary[\"data\"] = [\"Full Data\", \"Biased Data\"]\nsummary\n\n\n\n\n\n \n \n \n mean\n se\n data\n \n \n \n \n 0\n 0.434051\n 0.002113\n Full Data\n \n \n 1\n 0.465000\n 0.007054\n Biased Data\n \n \n\n\n\n\nA 3 percent difference in a national survey is a substantial error in the case where the difference is due to preventable bias.\n\n\n\n\nTo facilitate regression based stratification we first need a regression model. In our case we will ultimately fit a multi-level regression model with intercept terms for each for each of the groups in our demographic stratum. In this way we try to account for the appropriate set of variables (as in the example above) to better specify the effect modification due to membership within a particular demographic stratum.\nWe will fit the model using bambi using the binomial link function on the biased sample data. But first we aggregate up by demographic strata and count the occurences within each strata.\n\nmodel_df = (\n cces_df.groupby([\"state\", \"eth\", \"male\", \"age\", \"edu\"], observed=False)\n .agg({\"caseid\": \"nunique\", \"abortion\": \"sum\"})\n .reset_index()\n .sort_values(\"abortion\", ascending=False)\n .rename({\"caseid\": \"n\"}, axis=1)\n .merge(statelevel_predictors_df, left_on=\"state\", right_on=\"state\", how=\"left\")\n)\nmodel_df[\"abortion\"] = model_df[\"abortion\"].astype(int)\nmodel_df[\"n\"] = model_df[\"n\"].astype(int)\nmodel_df.head()\n\n\n\n\n\n \n \n \n state\n eth\n male\n age\n edu\n n\n abortion\n repvote\n region\n \n \n \n \n 0\n ID\n White\n -0.5\n 70+\n HS\n 32\n 18\n 0.683102\n West\n \n \n 1\n ID\n White\n 0.5\n 70+\n 4-Year College\n 20\n 16\n 0.683102\n West\n \n \n 2\n WV\n White\n 0.5\n 70+\n Some college\n 17\n 13\n 0.721611\n South\n \n \n 3\n WV\n White\n 0.5\n 70+\n 4-Year College\n 15\n 12\n 0.721611\n South\n \n \n 4\n ID\n White\n 0.5\n 70+\n Post-grad\n 17\n 11\n 0.683102\n West\n \n \n\n\n\n\nOur model_df now has one row per Strata across all the demographic cuts.\n\n\nHere we use some of bambi’s latest functionality to assess the interaction effects between the variables.\n\nformula = \"\"\" p(abortion, n) ~ C(state) + C(eth) + C(edu) + male + repvote\"\"\"\n\nbase_model = bmb.Model(formula, model_df, family=\"binomial\")\n\nresult = base_model.fit(\n random_seed=100,\n target_accept=0.95,\n # inference_method=\"nuts_numpyro\",\n idata_kwargs={\"log_likelihood\": True},\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, C(state), C(eth), C(edu), male, repvote]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 816 seconds.\n\n\nWe plot the predicted outcomes within each group using the plot_predictions function.\n\nmosaic = \"\"\"\n AABB\n CCCC\n \"\"\"\n\nfig = 
plt.figure(layout=\"constrained\", figsize=(20, 7))\naxs = fig.subplot_mosaic(mosaic)\n\nbmb.interpret.plot_predictions(base_model, result, \"eth\", ax=axs[\"A\"])\nbmb.interpret.plot_predictions(base_model, result, \"edu\", ax=axs[\"B\"])\nbmb.interpret.plot_predictions(base_model, result, \"state\", ax=axs[\"C\"])\nplt.suptitle(\"Plot Prediction per Class\", fontsize=20);\n\n\n\n\nMore interesting we can use the comparison functionality to compare differences in eth conditional on age and edu. Where we can see that the differences between ethnicities are pretty stable across all age groups, slightly shifted by within the Post-grad level of education.\n\nfig, ax = bmb.interpret.plot_comparisons(\n model=base_model,\n idata=result,\n contrast={\"eth\": [\"Black\", \"White\"]},\n conditional=[\"age\", \"edu\"],\n comparison_type=\"diff\",\n subplot_kwargs={\"main\": \"age\", \"group\": \"edu\"},\n fig_kwargs={\"figsize\": (12, 5), \"sharey\": True},\n legend=True,\n)\nax[0].set_title(\"Comparison of Difference in Ethnicity \\n within Age and Educational Strata\");\n\n\n\n\nWe can pull these specific estimates out into a table for closer inspection to see that the differences in response expected between the extremes of educational attainment are moderated by state iand race.\n\nbmb.interpret.comparisons(\n model=base_model,\n idata=result,\n contrast={\"edu\": [\"Post-grad\", \"No HS\"]},\n conditional={\"eth\": [\"Black\", \"White\"], \"state\": [\"NY\", \"CA\", \"ID\", \"VA\"]},\n comparison_type=\"diff\",\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n eth\n state\n male\n repvote\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n edu\n diff\n (Post-grad, No HS)\n Black\n NY\n 0.0\n 0.530191\n 0.093161\n 0.000171\n 0.197388\n \n \n 1\n edu\n diff\n (Post-grad, No HS)\n Black\n CA\n 0.0\n 0.530191\n 0.078149\n 0.000014\n 0.188560\n \n \n 2\n edu\n diff\n (Post-grad, No HS)\n Black\n ID\n 0.0\n 0.530191\n 0.085810\n 0.000116\n 0.194178\n \n \n 3\n edu\n diff\n (Post-grad, No HS)\n Black\n VA\n 0.0\n 0.530191\n 0.125538\n 0.024355\n 0.220127\n \n \n 4\n edu\n diff\n (Post-grad, No HS)\n White\n NY\n 0.0\n 0.530191\n 0.093632\n 0.000537\n 0.201009\n \n \n 5\n edu\n diff\n (Post-grad, No HS)\n White\n CA\n 0.0\n 0.530191\n 0.078656\n 0.000037\n 0.193271\n \n \n 6\n edu\n diff\n (Post-grad, No HS)\n White\n ID\n 0.0\n 0.530191\n 0.092998\n 0.000269\n 0.198796\n \n \n 7\n edu\n diff\n (Post-grad, No HS)\n White\n VA\n 0.0\n 0.530191\n 0.099620\n 0.002437\n 0.193426\n \n \n\n\n\n\nWith this in mind we want to fit our final model to incorporate the variation we see here across the different levels of our stratified data.\n\n\n\nWe can specify these features of our model using a hierarchical structure as follows:\n\\[ Pr(y_i = 1) = logit^{-1}(\n\\alpha_{\\rm s[i]}^{\\rm state}\n+ \\alpha_{\\rm a[i]}^{\\rm age}\n+ \\alpha_{\\rm r[i]}^{\\rm eth}\n+ \\alpha_{\\rm e[i]}^{\\rm edu}\n+ \\beta^{\\rm male} \\cdot {\\rm Male}_{\\rm i}\n+ \\alpha_{\\rm g[i], r[i]}^{\\rm male.eth}\n+ \\alpha_{\\rm e[i], a[i]}^{\\rm edu.age}\n+ \\alpha_{\\rm e[i], r[i]}^{\\rm edu.eth}\n)\n\\]\nHere we have used the fact that we can add components to the \\(\\alpha\\) intercept terms and interaction effects to express the stratum specific variation in the outcomes that we’ve seen in our exploratory work. Using the bambi formula syntax. 
We have:\n\n%%capture\nformula = \"\"\" p(abortion, n) ~ (1 | state) + (1 | eth) + (1 | edu) + male + repvote + (1 | male:eth) + (1 | edu:age) + (1 | edu:eth)\"\"\"\n\nmodel_hierarchical = bmb.Model(formula, model_df, family=\"binomial\")\n\nresult = model_hierarchical.fit(\n random_seed=100,\n target_accept=0.99,\n inference_method=\"nuts_numpyro\",\n idata_kwargs={\"log_likelihood\": True},\n)\n\n\nresult\n\n\n\n \n \n arviz.InferenceData\n \n \n \n \n \n posterior\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (chain: 4, draw: 1000, state__factor_dim: 46,\n eth__factor_dim: 4, edu__factor_dim: 5,\n male:eth__factor_dim: 8, edu:age__factor_dim: 30,\n edu:eth__factor_dim: 20)\nCoordinates:\n * chain (chain) int64 0 1 2 3\n * draw (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999\n * state__factor_dim (state__factor_dim) \nDimensions: (chain: 4, draw: 1000, p(abortion, n)_obs: 11040)\nCoordinates:\n * chain (chain) int64 0 1 2 3\n * draw (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999\n * p(abortion, n)_obs (p(abortion, n)_obs) int64 0 1 2 3 ... 11037 11038 11039\nData variables:\n p(abortion, n) (chain, draw, p(abortion, n)_obs) float64 -2.099 ... 0.0\nAttributes:\n created_at: 2023-09-19T12:36:26.181088\n arviz_version: 0.16.1\n modeling_interface: bambi\n modeling_interface_version: 0.13.0.devxarray.DatasetDimensions:chain: 4draw: 1000p(abortion, n)_obs: 11040Coordinates: (3)chain(chain)int640 1 2 3array([0, 1, 2, 3])draw(draw)int640 1 2 3 4 5 ... 995 996 997 998 999array([ 0, 1, 2, ..., 997, 998, 999])p(abortion, n)_obs(p(abortion, n)_obs)int640 1 2 3 ... 11036 11037 11038 11039array([ 0, 1, 2, ..., 11037, 11038, 11039])Data variables: (1)p(abortion, n)(chain, draw, p(abortion, n)_obs)float64-2.099 -6.533 -2.53 ... 0.0 0.0 0.0array([[[-2.09936954, -6.53260554, -2.53030632, ..., 0. ,\n 0. , 0. ],\n [-2.82166815, -5.81761829, -2.15259044, ..., 0. ,\n 0. , 0. ],\n [-2.0615933 , -5.0527332 , -2.64438438, ..., 0. ,\n 0. , 0. ],\n ...,\n [-2.39670744, -3.8831292 , -1.9033823 , ..., 0. ,\n 0. , 0. ],\n [-2.80775422, -6.8989457 , -1.99550724, ..., 0. ,\n 0. , 0. ],\n [-3.24975008, -4.83725754, -2.18790635, ..., 0. ,\n 0. , 0. ]],\n\n [[-2.55424996, -6.66109631, -2.62725853, ..., 0. ,\n 0. , 0. ],\n [-2.52754945, -3.73699859, -1.95339883, ..., 0. ,\n 0. , 0. ],\n [-2.38596107, -4.813151 , -2.04832502, ..., 0. ,\n 0. , 0. ],\n...\n [-2.79655001, -5.47903673, -2.23628333, ..., 0. ,\n 0. , 0. ],\n [-3.34298484, -7.36659506, -2.0555685 , ..., 0. ,\n 0. , 0. ],\n [-2.05738331, -4.38768503, -2.17500452, ..., 0. ,\n 0. , 0. ]],\n\n [[-2.05270967, -4.82758352, -2.39733364, ..., 0. ,\n 0. , 0. ],\n [-2.40053672, -5.01300816, -2.14456134, ..., 0. ,\n 0. , 0. ],\n [-2.43148759, -5.1977399 , -2.09675503, ..., 0. ,\n 0. , 0. ],\n ...,\n [-2.57384043, -4.85548749, -2.16441157, ..., 0. ,\n 0. , 0. ],\n [-2.36708357, -4.65351176, -2.35737355, ..., 0. ,\n 0. , 0. ],\n [-2.52597843, -5.57143406, -2.08559554, ..., 0. ,\n 0. , 0. 
0 0 0 0 0 0array([18, 16, 13, ..., 0, 0, 0])Indexes: (1)p(abortion, n)_obsPandasIndexPandasIndex(Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 11030, 11031, 11032, 11033, 11034, 11035, 11036, 11037, 11038, 11039],\n dtype='int64', name='p(abortion, n)_obs', length=11040))Attributes: (7)created_at :2023-09-19T12:36:26.181386arviz_version :0.16.1inference_library :numpyroinference_library_version :0.13.0sampling_time :870.213079modeling_interface :bambimodeling_interface_version :0.13.0.dev\n \n \n \n \n \n \n \n\n\n\naz.summary(result, var_names=[\"Intercept\", \"male\", \"1|edu\", \"1|eth\", \"repvote\"])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.407\n 0.540\n -0.548\n 1.365\n 0.016\n 0.016\n 1587.0\n 1235.0\n 1.0\n \n \n male\n 0.209\n 0.191\n -0.166\n 0.556\n 0.006\n 0.005\n 1459.0\n 1152.0\n 1.0\n \n \n 1|edu[4-Year College]\n -0.043\n 0.189\n -0.421\n 0.294\n 0.003\n 0.003\n 3269.0\n 2748.0\n 1.0\n \n \n 1|edu[HS]\n 0.059\n 0.186\n -0.285\n 0.433\n 0.003\n 0.003\n 2936.0\n 2716.0\n 1.0\n \n \n 1|edu[No HS]\n 0.169\n 0.224\n -0.181\n 0.638\n 0.005\n 0.003\n 2432.0\n 3248.0\n 1.0\n \n \n 1|edu[Post-grad]\n -0.198\n 0.221\n -0.644\n 0.127\n 0.005\n 0.003\n 2063.0\n 2871.0\n 1.0\n \n \n 1|edu[Some college]\n 0.032\n 0.188\n -0.339\n 0.386\n 0.003\n 0.003\n 3108.0\n 3001.0\n 1.0\n \n \n 1|eth[Black]\n -0.437\n 0.486\n -1.329\n 0.332\n 0.015\n 0.014\n 1692.0\n 1144.0\n 1.0\n \n \n 1|eth[Hispanic]\n 0.059\n 0.455\n -0.649\n 0.953\n 0.014\n 0.013\n 2094.0\n 1166.0\n 1.0\n \n \n 1|eth[Other]\n 0.076\n 0.455\n -0.614\n 1.004\n 0.014\n 0.013\n 1979.0\n 1220.0\n 1.0\n \n \n 1|eth[White]\n 0.162\n 0.459\n -0.622\n 0.970\n 0.015\n 0.013\n 1687.0\n 1124.0\n 1.0\n \n \n repvote\n -1.192\n 0.529\n -2.200\n -0.193\n 0.013\n 0.009\n 1749.0\n 2462.0\n 1.0\n \n \n\n\n\n\nThe terms in the model formula allow for specific intercept terms across the demographic splits of eth, edu, and state. These represent stratum specific adjustments of the intercept term in the model. Similarly we invoke intercepts for the interaction terms of age:edu, male:eth and edu:eth. Each of these cohorts represents a share of the data in our sample.\n\nmodel_hierarchical.graph()\n\n\n\n\nWe then predict the outcomes implied by the biased sample. These predictions are to be adjusted by what we take to be the share of that demographic cohort in population. We can plot the posterior predictive distribution against the observed data from our biased sample to see that we have generally good fit to the distribution.\n\nmodel_hierarchical.predict(result, kind=\"pps\")\nax = az.plot_ppc(result, figsize=(8, 5), kind=\"cumulative\", observed_rug=True)\nax.set_title(\"Posterior Predictive Checks \\n On Biased Sample\");\n\n\n\n\n\n\n\nWe now use the fitted model to predict the voting shares on the data where we use the genuine state numbers per strata. 
To do so we load data from the national census and augment our data set so as to be able to apply the appropriate weights.\n\npoststrat_df = pd.read_csv(\"data/mr_p_poststrat_df.csv\")\n\nnew_data = poststrat_df.merge(\n statelevel_predictors_df, left_on=\"state\", right_on=\"state\", how=\"left\"\n)\nnew_data.rename({\"educ\": \"edu\"}, axis=1, inplace=True)\nnew_data = model_df.merge(\n new_data,\n how=\"left\",\n left_on=[\"state\", \"eth\", \"male\", \"age\", \"edu\"],\n right_on=[\"state\", \"eth\", \"male\", \"age\", \"edu\"],\n).rename({\"n_y\": \"n\", \"repvote_y\": \"repvote\"}, axis=1)[\n [\"state\", \"eth\", \"male\", \"age\", \"edu\", \"n\", \"repvote\"]\n]\n\n\nnew_data = new_data.merge(\n new_data.groupby(\"state\").agg({\"n\": \"sum\"}).reset_index().rename({\"n\": \"state_total\"}, axis=1)\n)\nnew_data[\"state_percent\"] = new_data[\"n\"] / new_data[\"state_total\"]\nnew_data.head()\n\n\n\n\n\n \n \n \n state\n eth\n male\n age\n edu\n n\n repvote\n state_total\n state_percent\n \n \n \n \n 0\n ID\n White\n -0.5\n 70+\n HS\n 31503\n 0.683102\n 1193885\n 0.026387\n \n \n 1\n ID\n White\n 0.5\n 70+\n 4-Year College\n 11809\n 0.683102\n 1193885\n 0.009891\n \n \n 2\n ID\n White\n 0.5\n 70+\n Post-grad\n 9873\n 0.683102\n 1193885\n 0.008270\n \n \n 3\n ID\n White\n 0.5\n 50-59\n Some college\n 30456\n 0.683102\n 1193885\n 0.025510\n \n \n 4\n ID\n White\n 0.5\n 70+\n HS\n 19898\n 0.683102\n 1193885\n 0.016667\n \n \n\n\n\n\nThis dataset is exactly the same structure and length as our input data to the fitted model. We have simply switched the observed counts across the demographic strata with the counts that reflect their proportion in the national survey. Additionally we have calculated the state totals and the share of each strata within the state. This will be important for later when we use this state_percent variable to calculate an adjusted MrP estimate of the predictions at a state level. We now use this data set with our fitted model to generate posterior predictive distribution.\n\nresult_adjust = model_hierarchical.predict(result, data=new_data, inplace=False, kind=\"pps\")\nresult_adjust\n\n\n\n \n \n arviz.InferenceData\n \n \n \n \n \n posterior\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (chain: 4, draw: 1000, state__factor_dim: 46,\n eth__factor_dim: 4, edu__factor_dim: 5,\n male:eth__factor_dim: 8, edu:age__factor_dim: 30,\n edu:eth__factor_dim: 20, p(abortion, n)_obs: 11040)\nCoordinates:\n * chain (chain) int64 0 1 2 3\n * draw (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999\n * state__factor_dim (state__factor_dim) \nDimensions: (chain: 4, draw: 1000, p(abortion, n)_obs: 11040)\nCoordinates:\n * chain (chain) int64 0 1 2 3\n * draw (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999\n * p(abortion, n)_obs (p(abortion, n)_obs) int64 0 1 2 3 ... 11037 11038 11039\nData variables:\n p(abortion, n) (chain, draw, p(abortion, n)_obs) int64 16259 ... 377\nAttributes:\n modeling_interface: bambi\n modeling_interface_version: 0.13.0.devxarray.DatasetDimensions:chain: 4draw: 1000p(abortion, n)_obs: 11040Coordinates: (3)chain(chain)int640 1 2 3array([0, 1, 2, 3])draw(draw)int640 1 2 3 4 5 ... 995 996 997 998 999array([ 0, 1, 2, ..., 997, 998, 999])p(abortion, n)_obs(p(abortion, n)_obs)int640 1 2 3 ... 11036 11037 11038 11039array([ 0, 1, 2, ..., 11037, 11038, 11039])Data variables: (1)p(abortion, n)(chain, draw, p(abortion, n)_obs)int6416259 5481 4277 ... 
We need to adjust each state-specific stratum by the weight appropriate for each state to post-stratify the estimates. To do so we extract the indices for each stratum in our data on a state-by-state basis.
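In symbols, the post-stratified estimate for a state \(s\) is the census-weighted average of its stratum-level predictions,
\[
\hat{\theta}_{s} = \sum_{j \in s} \frac{N_j}{N_s} \hat{\theta}_j,
\]
where \(N_j\) is the census count for stratum \(j\) within the state, \(N_s\) is the state total (so \(N_j / N_s\) is the state_percent column) and \(\hat{\theta}_j\) is the model’s predicted share of support in that stratum.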
Then we weight the predicted estimate by the appropriate percentage on a state basis and sum them to recover a state level estimate.\n\nestimates = []\nabortion_posterior_base = az.extract(result, num_samples=2000)[\"p(abortion, n)_mean\"]\nabortion_posterior_mrp = az.extract(result_adjust, num_samples=2000)[\"p(abortion, n)_mean\"]\n\nfor s in new_data[\"state\"].unique():\n idx = new_data.index[new_data[\"state\"] == s].tolist()\n predicted_mrp = (\n ((abortion_posterior_mrp[idx].mean(dim=\"sample\") * new_data.iloc[idx][\"state_percent\"]))\n .sum()\n .item()\n )\n predicted_mrp_lb = (\n (\n (\n abortion_posterior_mrp[idx].quantile(0.025, dim=\"sample\")\n * new_data.iloc[idx][\"state_percent\"]\n )\n )\n .sum()\n .item()\n )\n predicted_mrp_ub = (\n (\n (\n abortion_posterior_mrp[idx].quantile(0.975, dim=\"sample\")\n * new_data.iloc[idx][\"state_percent\"]\n )\n )\n .sum()\n .item()\n )\n predicted = abortion_posterior_base[idx].mean().item()\n base_lb = abortion_posterior_base[idx].quantile(0.025).item()\n base_ub = abortion_posterior_base[idx].quantile(0.975).item()\n\n estimates.append(\n [s, predicted, base_lb, base_ub, predicted_mrp, predicted_mrp_ub, predicted_mrp_lb]\n )\n\n\nstate_predicted = pd.DataFrame(\n estimates,\n columns=[\"state\", \"base_expected\", \"base_lb\", \"base_ub\", \"mrp_adjusted\", \"mrp_ub\", \"mrp_lb\"],\n)\n\nstate_predicted = (\n state_predicted.merge(cces_all_df.groupby(\"state\")[[\"abortion\"]].mean().reset_index())\n .sort_values(\"mrp_adjusted\")\n .rename({\"abortion\": \"census_share\"}, axis=1)\n)\nstate_predicted.head()\n\n\n\n\n\n \n \n \n state\n base_expected\n base_lb\n base_ub\n mrp_adjusted\n mrp_ub\n mrp_lb\n census_share\n \n \n \n \n 9\n OK\n 0.423350\n 0.209144\n 0.660533\n 0.326291\n 0.413912\n 0.245431\n 0.321553\n \n \n 34\n MS\n 0.439145\n 0.215565\n 0.683780\n 0.381575\n 0.493799\n 0.278498\n 0.374640\n \n \n 2\n CO\n 0.475961\n 0.251250\n 0.698478\n 0.397101\n 0.482699\n 0.315535\n 0.354857\n \n \n 24\n ME\n 0.438638\n 0.236010\n 0.669674\n 0.418964\n 0.537156\n 0.296373\n 0.403636\n \n \n 25\n MO\n 0.513291\n 0.225326\n 0.748539\n 0.420735\n 0.525425\n 0.321195\n 0.302954\n \n \n\n\n\n\nThis was the crucial step and we’ll need to unpack it a little. We have taken (state by state) each demographic strata and reweighted the expected posterior predictive value by the share that strata represents in the national census within that state. We have then aggregated this score within the state to generate a state specific value. This value can now be compared to the expected value derived from our biased data and, more interestingly, the value reported in the national census.\n\n\n\nThese adjusted estimates can be plotted against the shares ascribed at the state level in the census. 
These adjustments provide a far better reflection of the national picture than the ones derived from model fitted to the biased sample.\n\nfig, axs = plt.subplots(2, 1, figsize=(17, 10))\naxs = axs.flatten()\nax = axs[0]\nax1 = axs[1]\nax.scatter(\n state_predicted[\"state\"], state_predicted[\"base_expected\"], color=\"red\", label=\"Biased Sample\"\n)\nax.scatter(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n color=\"slateblue\",\n label=\"Mr P Adjusted\",\n)\nax.scatter(\n state_predicted[\"state\"],\n state_predicted[\"census_share\"],\n color=\"darkgreen\",\n label=\"Census Aggregates\",\n)\nax.legend()\nax.vlines(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n state_predicted[\"census_share\"],\n color=\"black\",\n linestyles=\"--\",\n)\n\n\nax1.scatter(\n state_predicted[\"state\"], state_predicted[\"base_expected\"], color=\"red\", label=\"Biased Sample\"\n)\nax1.scatter(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n color=\"slateblue\",\n label=\"Mr P Adjusted\",\n)\nax1.legend()\n\nax1.vlines(\n state_predicted[\"state\"], state_predicted[\"base_ub\"], state_predicted[\"base_lb\"], color=\"red\"\n)\nax1.vlines(\n state_predicted[\"state\"],\n state_predicted[\"mrp_ub\"],\n state_predicted[\"mrp_lb\"],\n color=\"slateblue\",\n)\nax.set_xlabel(\"State\")\nax.set_ylabel(\"Proportion\")\nax1.set_title(\n \"Comparison of Uncertainty in Biased Predictions and Post-stratified Adjustment\", fontsize=15\n)\nax.set_title(\"Comparison of Post-stratified Adjustment and Census Report\", fontsize=15)\nax1.set_ylabel(\"Proportion\");\n\n\n\n\nIn the top plot here we see the state specific MrP estimates for the proportion voting yes, compared to the estimate inferred from the biased sample and estimates from the national census. We can see how the MrP estimates are much closer to those drawn from the national census.\nIn the below plot we’ve shown the estimates from the MrP model and the estimates drawn from the biased sample, but here we’ve shown the uncertainty in the estimation on a state level. Clearly, the MrP adjustments also shrinks the uncertainty in our estimate of vote-share.\nMrP is in this sense a corrective procedure for the avoidance of bias in sample data, where we have strong evidence for adjusting the weight accorded to any stratum of data in our population.\n\n\n\n\nIn this notebook we have seen how to use bambi to concisely and quickly apply the technique of multilevel regression and post-stratification. We’ve seen how this technique is a natural and compelling extension to regression modelling in general, that incorporates prior knowledge in an interesting and flexible manner.\nThe problems of representation in data are serious. Policy gets made and changed on the basis of anticipated policy effects. Without the ability to control and adjust for non-representative samples, politicians and policy makers risk prioritising initiatives for a vocal majority among the represented in the sample. The question of whether a given sample is “good” or “bad” cannot (at the time) ever be known, so some care needs to be taken when choosing to adjust your model of the data.\nPredictions made from sample data are consequential. It’s not even an exaggeration to say that the fates of entire nations can hang on decisions made from poorly understood sampling procedures. 
Multilevel regression and post-stratification is an apt tool for making the adjustments required and guiding decisions makers in crucial policy choices, but it should be used carefully." + "text": "This example shows how to specify and fit a spline regression in Bambi. This example is based on this example from the PyMC docs.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nRichard McElreath popularized the Cherry Blossom dataset in the second edition of his excellent book Statistical Rethinking. This data represents the day in the year when the first bloom is observed for Japanese cherry blossoms between years 801 and 2015. In his book, Richard McElreath uses this dataset to introduce Basis Splines, or B-Splines in short.\nHere we use Bambi to fit a linear model using B-Splines with the Cherry Blossom data. This dataset can be loaded with Bambi as follows:\n\ndata = bmb.load_data(\"cherry_blossoms\")\ndata\n\n\n\n\n\n \n \n \n year\n doy\n temp\n temp_upper\n temp_lower\n \n \n \n \n 0\n 801\n NaN\n NaN\n NaN\n NaN\n \n \n 1\n 802\n NaN\n NaN\n NaN\n NaN\n \n \n 2\n 803\n NaN\n NaN\n NaN\n NaN\n \n \n 3\n 804\n NaN\n NaN\n NaN\n NaN\n \n \n 4\n 805\n NaN\n NaN\n NaN\n NaN\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1210\n 2011\n 99.0\n NaN\n NaN\n NaN\n \n \n 1211\n 2012\n 101.0\n NaN\n NaN\n NaN\n \n \n 1212\n 2013\n 93.0\n NaN\n NaN\n NaN\n \n \n 1213\n 2014\n 94.0\n NaN\n NaN\n NaN\n \n \n 1214\n 2015\n 93.0\n NaN\n NaN\n NaN\n \n \n\n1215 rows × 5 columns\n\n\n\nThe variable we are interested in modeling is \"doy\", which stands for Day of Year. Also notice this variable contains several missing value which are discarded next.\n\ndata = data.dropna(subset=[\"doy\"]).reset_index(drop=True)\ndata.shape\n\n(827, 5)\n\n\n\n\n\nLet’s get started by creating a scatterplot to explore the values of \"doy\" for each year in the dataset.\n\n# We create a function because this plot is going to be used again later\ndef plot_scatter(data, figsize=(10, 6)):\n _, ax = plt.subplots(figsize=figsize)\n ax.scatter(data[\"year\"], data[\"doy\"], alpha=0.4, s=30)\n ax.set_title(\"Day of the first bloom per year\")\n ax.set_xlabel(\"Year\")\n ax.set_ylabel(\"Days of the first bloom\")\n return ax\n\n\nplot_scatter(data);\n\n\n\n\nWe can observe the day of the first bloom ranges between 85 and 125 approximately, which correspond to late March and early May respectively. On average, the first bloom occurs on the 105th day of the year, which is middle April.\n\n\n\nThe spline will have 15 knots. These knots are the boundaries of the basis functions. These knots split the range of the \"year\" variable into 16 contiguous sections. The basis functions make up a piecewise continuous polynomial, and so they are enforced to meet at the knots. We use the default degree for each piecewise polynomial, which is 3. 
The result is known as a cubic spline.\nBecause of using quantiles and not having observations for all the years in the time window under study, the knots are distributed unevenly over the range of \"year\" in such a way that the same proportion of values fall between each section.\n\nnum_knots = 15\nknots = np.quantile(data[\"year\"], np.linspace(0, 1, num_knots))\n\n\ndef plot_knots(knots, ax):\n for knot in knots:\n ax.axvline(knot, color=\"0.1\", alpha=0.4)\n return ax\n\n\nax = plot_scatter(data)\nplot_knots(knots, ax);\n\n\n\n\nThe previous chart makes it easy to see the knots, represented by the vertical lines, are spaced unevenly over the years.\n\n\n\nThe B-spline model we are about to create is simply a linear regression model with synthetic predictor variables. These predictors are the basis functions that are derived from the original year predictor.\nIn math notation, we usa a \\(\\text{Normal}\\) distribution for the conditional distribution of \\(Y\\) when \\(X = x_i\\), i.e. \\(Y_i\\), the distribution of the day of the first bloom in a given year.\n\\[\nY_i \\sim \\text{Normal}(\\mu_i, \\sigma)\n\\]\nSo far, this looks like a regular linear regression model. The next line is where the spline comes into play:\n\\[\n\\mu_i = \\alpha + \\sum_{k=1}^K{w_kB_{k, i}}\n\\]\nThe line above tells that for each observation \\(i\\), the mean is influenced by all the basis functions (going from \\(k=1\\) to \\(k=K\\)), plus an intercept \\(\\alpha\\). The \\(w_k\\) values in the summation are the regression coefficients of each of the basis functions, and the \\(B_k\\) are the values of the basis functions.\nFinally, we will be using the following priors\n\\[\n\\begin{aligned}\n\\alpha & \\sim \\text{Normal}(100, 10) \\\\\nw_j & \\sim \\text{Normal}(0, 10)\\\\\n\\sigma & \\sim \\text{Exponential(1)}\n\\end{aligned}\n\\]\nwhere \\(j\\) indexes each of the contiguous sections given by the knots\n\n# We only pass the internal knots to the `bs()` function.\niknots = knots[1:-1]\n\n# Define dictionary of priors\npriors = {\n \"Intercept\": bmb.Prior(\"Normal\", mu=100, sigma=10),\n \"common\": bmb.Prior(\"Normal\", mu=0, sigma=10), \n \"sigma\": bmb.Prior(\"Exponential\", lam=1)\n}\n\n# Define model\n# The intercept=True means the basis also spans the intercept, as originally done in the book example.\nmodel = bmb.Model(\"doy ~ bs(year, knots=iknots, intercept=True)\", data, priors=priors)\nmodel\n\n Formula: doy ~ bs(year, knots=iknots, intercept=True)\n Family: gaussian\n Link: mu = identity\n Observations: 827\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 100.0, sigma: 10.0)\n bs(year, knots=iknots, intercept=True) ~ Normal(mu: 0.0, sigma: 10.0)\n \n Auxiliary parameters\n sigma ~ Exponential(lam: 1.0)\n\n\nLet’s create a function to plot each of the basis functions in the model.\n\ndef plot_spline_basis(basis, year, figsize=(10, 6)):\n df = (\n pd.DataFrame(basis)\n .assign(year=year)\n .melt(\"year\", var_name=\"basis_idx\", value_name=\"value\")\n )\n\n _, ax = plt.subplots(figsize=figsize)\n\n for idx in df.basis_idx.unique():\n d = df[df.basis_idx == idx]\n ax.plot(d[\"year\"], d[\"value\"])\n \n return ax\n\nBelow, we create a chart to visualize the b-spline basis. The overlap between the functions means that, at any given point in time, the regression function is influenced by more than one basis function. 
For example, if we look at the year 1200, we can see the regression line is going to be influenced mostly by the violet and brown functions, and to a lesser extent by the green and cyan ones. In summary, this is what enables us to capture local patterns in a smooth fashion.\n\nB = model.response_component.design.common[\"bs(year, knots=iknots, intercept=True)\"]\nax = plot_spline_basis(B, data[\"year\"].values)\nplot_knots(knots, ax);\n\n\n\n\n\n\n\nNow we fit the model. In Bambi, it is as easy as calling the .fit() method on the Model instance.\n\n# The seed is to make results reproducible\nidata = model.fit(random_seed=SEED, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [doy_sigma, Intercept, bs(year, knots=iknots, intercept=True)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:32<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 33 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\n\n\nIt is always good to use az.summary() to verify parameter estimates as well as effective sample sizes and R hat values. In this case, the main goal is not to interpret the coefficients of the basis spline, but analyze the ess and r_hat diagnostics. In first place, effective sample sizes don’t look impressively high. Most of them are between 300 and 700, which is low compared to the 2000 draws obtained. The only exception is the residual standard deviation sigma. Finally, the r_hat diagnostic is not always 1 for all the parameters, indicating there may be some issues with the mix of the chains.\n\naz.summary(idata)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 103.387\n 2.444\n 98.582\n 107.719\n 0.131\n 0.093\n 348.0\n 540.0\n 1.01\n \n \n bs(year, knots=iknots, intercept=True)[0]\n -3.074\n 3.819\n -10.477\n 3.705\n 0.127\n 0.090\n 908.0\n 1319.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[1]\n -0.841\n 3.949\n -8.290\n 6.242\n 0.146\n 0.103\n 739.0\n 1089.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[2]\n -1.167\n 3.662\n -8.245\n 5.517\n 0.140\n 0.099\n 685.0\n 935.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[3]\n 4.810\n 2.987\n -0.362\n 10.721\n 0.135\n 0.096\n 487.0\n 915.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[4]\n -0.881\n 2.970\n -6.245\n 4.759\n 0.137\n 0.097\n 472.0\n 951.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[5]\n 4.277\n 2.963\n -0.901\n 9.904\n 0.134\n 0.095\n 488.0\n 1104.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[6]\n -5.350\n 2.883\n -11.223\n -0.312\n 0.137\n 0.097\n 439.0\n 870.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[7]\n 7.786\n 2.813\n 2.161\n 13.013\n 0.129\n 0.091\n 477.0\n 842.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[8]\n -1.017\n 2.977\n -6.426\n 4.689\n 0.141\n 0.100\n 445.0\n 697.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[9]\n 2.927\n 2.958\n -2.100\n 9.282\n 0.136\n 0.096\n 474.0\n 809.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[10]\n 4.693\n 2.990\n -0.911\n 10.137\n 0.137\n 0.097\n 477.0\n 837.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[11]\n -0.246\n 2.943\n -5.760\n 5.126\n 0.133\n 0.094\n 490.0\n 908.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[12]\n 5.548\n 2.984\n 0.328\n 
11.413\n 0.140\n 0.099\n 451.0\n 837.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[13]\n 0.653\n 3.115\n -4.897\n 6.839\n 0.132\n 0.094\n 557.0\n 933.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[14]\n -0.778\n 3.345\n -7.165\n 5.314\n 0.142\n 0.101\n 551.0\n 981.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[15]\n -7.039\n 3.527\n -13.975\n -0.638\n 0.137\n 0.097\n 667.0\n 1021.0\n 1.00\n \n \n bs(year, knots=iknots, intercept=True)[16]\n -7.711\n 3.293\n -14.579\n -2.133\n 0.135\n 0.095\n 595.0\n 1090.0\n 1.00\n \n \n doy_sigma\n 5.944\n 0.143\n 5.671\n 6.198\n 0.003\n 0.002\n 3031.0\n 1497.0\n 1.00\n \n \n\n\n\n\nWe can also use az.plot_trace() to visualize the marginal posteriors and the sampling paths. These traces show a stationary random pattern. If these paths were not random stationary, we would be concerned about the convergence of the chains.\n\naz.plot_trace(idata);\n\n\n\n\nNow we can visualize the fitted basis functions. In addition, we include a thicker black line that represents the dot product between \\(B\\) and \\(w\\). This is the contribution of the b-spline to the linear predictor in the model.\n\nposterior_stacked = az.extract(idata)\nwp = posterior_stacked[\"bs(year, knots=iknots, intercept=True)\"].mean(\"sample\").values\n\nax = plot_spline_basis(B * wp.T, data[\"year\"].values)\nax.plot(data.year.values, np.dot(B, wp.T), color=\"black\", lw=3)\nplot_knots(knots, ax);\n\n\n\n\n\n\n\nLet’s create a function to plot the predicted mean value as well as credible bands for it.\n\ndef plot_predictions(data, idata, model):\n # Create a test dataset with observations spanning the whole range of year\n new_data = pd.DataFrame({\"year\": np.linspace(data.year.min(), data.year.max(), num=500)})\n \n # Predict the day of first blossom\n model.predict(idata, data=new_data)\n\n posterior_stacked = az.extract_dataset(idata)\n # Extract these predictions\n y_hat = posterior_stacked[\"doy_mean\"]\n\n # Compute the mean of the predictions, plotted as a single line.\n y_hat_mean = y_hat.mean(\"sample\")\n\n # Compute 94% credible intervals for the predictions, plotted as bands\n hdi_data = np.quantile(y_hat, [0.03, 0.97], axis=1)\n\n # Plot obserevd data\n ax = plot_scatter(data)\n \n # Plot predicted line\n ax.plot(new_data[\"year\"], y_hat_mean, color=\"firebrick\")\n \n # Plot credibility bands\n ax.fill_between(new_data[\"year\"], hdi_data[0], hdi_data[1], alpha=0.4, color=\"firebrick\")\n \n # Add knots\n plot_knots(knots, ax)\n \n return ax\n\n\nplot_predictions(data, idata, model);\n\n/tmp/ipykernel_33590/2247671002.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior_stacked = az.extract_dataset(idata)\n\n\n\n\n\n\n\n\nWe can write linear regression models in matrix form as\n\\[\n\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta}\n\\]\nwhere \\(\\mathbf{y}\\) is the response column vector of shape \\((n, 1)\\). \\(\\mathbf{X}\\) is the design matrix that contains the values of the predictors for all the observations, of shape \\((n, p)\\). And \\(\\boldsymbol{\\beta}\\) is the column vector of regression coefficients of shape \\((n, 1)\\).\nBecause it’s not something that you’re supposed to consult regularly, Bambi does not expose the design matrix. However, with a some knowledge of the internals, it is possible to have access to it:\n\nnp.round(model.response_component.design.common.design_matrix, 3)\n\narray([[1. , 1. , 0. , ..., 0. , 0. , 0. ],\n [1. , 0.96 , 0.039, ..., 0. , 0. , 0. ],\n [1. 
, 0.767, 0.221, ..., 0. , 0. , 0. ],\n ...,\n [1. , 0. , 0. , ..., 0.002, 0.097, 0.902],\n [1. , 0. , 0. , ..., 0. , 0.05 , 0.95 ],\n [1. , 0. , 0. , ..., 0. , 0. , 1. ]])\n\n\nLet’s have a look at its shape:\n\nmodel.response_component.design.common.design_matrix.shape\n\n(827, 18)\n\n\n827 is the number of years we have data for, and 18 is the number of predictors/coefficients in the model. We have the first column of ones due to the Intercept term. Then, there are sixteen columns associated with the the basis functions. And finally, one extra column because we used span_intercept=True when calling the function bs() in the model formula.\nNow we could compute the rank of the design matrix to check whether all the columns are linearly independent.\n\nnp.linalg.matrix_rank(model.response_component.design.common.design_matrix)\n\n17\n\n\nSince \\(\\text{rank}(\\mathbf{X})\\) is smaller than the number of columns, we conclude the columns in \\(\\mathbf{X}\\) are not linearly independent.\nIf we have a second look at our code, we are going to figure out we’re spanning the intercept twice. The first time with the intercept term itself, and the second time in the spline basis.\nThis would have been a huge problem in a maximum likelihod estimation approach – we would have obtained an error instead of some parameter estimates. However, since we are doing Bayesian modeling, our priors ensured we obtain our regularized parameter estimates and everything seemed to work pretty well.\nNevertheless, we can still do better. Why would we want to span the intercept twice? Let’s create and fit the model again, this time without spanning the intercept in the spline basis.\n\n# Note we use the same priors\nmodel_new = bmb.Model(\"doy ~ bs(year, knots=iknots)\", data, priors=priors)\nidata_new = model_new.fit(random_seed=SEED, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [doy_sigma, Intercept, bs(year, knots=iknots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:31<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 32 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nAnd let’s have a look at the summary\n\naz.summary(idata_new)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 102.367\n 1.992\n 98.899\n 106.358\n 0.105\n 0.074\n 361.0\n 581.0\n 1.01\n \n \n bs(year, knots=iknots)[0]\n -0.849\n 3.999\n -8.142\n 6.704\n 0.164\n 0.116\n 591.0\n 930.0\n 1.00\n \n \n bs(year, knots=iknots)[1]\n 0.394\n 3.012\n -5.253\n 5.983\n 0.090\n 0.063\n 1132.0\n 1249.0\n 1.00\n \n \n bs(year, knots=iknots)[2]\n 5.707\n 2.712\n 0.074\n 10.305\n 0.120\n 0.085\n 510.0\n 1017.0\n 1.00\n \n \n bs(year, knots=iknots)[3]\n 0.216\n 2.467\n -4.358\n 4.849\n 0.103\n 0.073\n 571.0\n 1320.0\n 1.00\n \n \n bs(year, knots=iknots)[4]\n 5.237\n 2.711\n 0.104\n 10.568\n 0.118\n 0.084\n 526.0\n 789.0\n 1.00\n \n \n bs(year, knots=iknots)[5]\n -4.332\n 2.428\n -8.909\n 0.043\n 0.105\n 0.074\n 535.0\n 890.0\n 1.01\n \n \n bs(year, knots=iknots)[6]\n 8.788\n 2.546\n 3.669\n 13.310\n 0.112\n 0.079\n 518.0\n 854.0\n 1.01\n \n \n bs(year, knots=iknots)[7]\n 0.008\n 2.573\n -5.056\n 4.474\n 0.112\n 0.079\n 525.0\n 916.0\n 1.00\n \n \n bs(year, knots=iknots)[8]\n 3.980\n 2.745\n -0.716\n 9.394\n 0.112\n 
0.079\n 597.0\n 927.0\n 1.00\n \n \n bs(year, knots=iknots)[9]\n 5.658\n 2.559\n 0.917\n 10.350\n 0.109\n 0.077\n 552.0\n 850.0\n 1.00\n \n \n bs(year, knots=iknots)[10]\n 0.801\n 2.655\n -4.092\n 5.842\n 0.112\n 0.079\n 565.0\n 956.0\n 1.00\n \n \n bs(year, knots=iknots)[11]\n 6.534\n 2.578\n 1.952\n 11.575\n 0.112\n 0.079\n 531.0\n 845.0\n 1.01\n \n \n bs(year, knots=iknots)[12]\n 1.703\n 2.772\n -3.154\n 7.363\n 0.114\n 0.081\n 591.0\n 1126.0\n 1.00\n \n \n bs(year, knots=iknots)[13]\n 0.190\n 3.076\n -5.277\n 6.077\n 0.115\n 0.081\n 722.0\n 1258.0\n 1.00\n \n \n bs(year, knots=iknots)[14]\n -6.026\n 3.162\n -11.645\n 0.206\n 0.122\n 0.086\n 672.0\n 1164.0\n 1.00\n \n \n bs(year, knots=iknots)[15]\n -6.715\n 3.005\n -12.485\n -1.229\n 0.118\n 0.084\n 641.0\n 1306.0\n 1.00\n \n \n doy_sigma\n 5.949\n 0.146\n 5.674\n 6.221\n 0.003\n 0.002\n 2287.0\n 1466.0\n 1.00\n \n \n\n\n\n\nThere are a couple of things to remark here\n\nThere are 16 coefficients associated with the b-spline now because we’re not spanning the intercept.\nThe ESS numbers have improved in all cases. Notice the sampler isn’t raising any warning about low ESS.\nr_hat coefficeints are still 1.\n\nWe can also compare the sampling times:\n\nidata.posterior.sampling_time\n\n32.5815589427948\n\n\n\nidata_new.posterior.sampling_time\n\n31.589828729629517\n\n\nSampling times are similar in this particular example. But in general, we expect the sampler to run faster when there aren’t structural dependencies in the design matrix.\nAnd what about predictions?\n\nplot_predictions(data, idata_new, model_new);\n\n/tmp/ipykernel_33590/2247671002.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior_stacked = az.extract_dataset(idata)\n\n\n\n\n\nAnd model comparison?\n\nmodels_dict = {\"Original\": idata, \"New\": idata_new}\ndf_compare = az.compare(models_dict)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n New\n 0\n -2657.859115\n 15.945629\n 0.000000\n 1.000000e+00\n 21.134973\n 0.000000\n False\n log\n \n \n Original\n 1\n -2658.359085\n 16.652034\n 0.499969\n 3.330669e-16\n 21.173433\n 0.561943\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nFinally let’s check influential points according to the k-hat value\n\n# Compute pointwise LOO\nloo_1 = az.loo(idata, pointwise=True)\nloo_2 = az.loo(idata_new, pointwise=True)\n\n/tmp/ipykernel_33590/3493983793.py:2: DeprecationWarning: `product` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `prod` instead.\n loo_1 = az.loo(idata, pointwise=True)\n/tmp/ipykernel_33590/3493983793.py:3: DeprecationWarning: `product` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `prod` instead.\n loo_2 = az.loo(idata_new, pointwise=True)\n\n\n\n# plot kappa values\naz.plot_khat(loo_1.pareto_k);\n\n\n\n\n\naz.plot_khat(loo_2.pareto_k);\n\n\n\n\n\n\n\nAnother option could have been to use stronger priors on the coefficients associated with the spline functions. 
For example, the example written in PyMC uses \\(\\text{Normal}(0, 3)\\) priors on them instead of \\(\\text{Normal}(0, 10)\\).\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 2.0.2\nbambi : 0.12.0.dev0\narviz : 0.14.0\nnumpy : 1.25.0\nmatplotlib: 3.6.2\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/shooter_crossed_random_ANOVA.html", - "href": "notebooks/shooter_crossed_random_ANOVA.html", + "objectID": "notebooks/zero_inflated_regression.html", + "href": "notebooks/zero_inflated_regression.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 1234\n\nHere we will analyze a dataset from experimental psychology in which a sample of 36 human participants engaged in what is called the shooter task, yielding 3600 responses and reaction times (100 from each subject). The link above gives some more information about the shooter task, but basically it is a sort of crude first-person-shooter video game in which the subject plays the role of a police officer. The subject views a variety of urban scenes, and in each round or “trial” a person or “target” appears on the screen after some random interval. This person is either Black or White (with 50% probability), and they are holding some object that is either a gun or some other object like a phone or wallet (with 50% probability). When a target appears, the subject has a very brief response window – 0.85 seconds in this particular experiment – within which to press one of two keyboard buttons indicating a “shoot” or “don’t shoot” response. Subjects receive points for correct and timely responses in each trial; subjects’ scores are penalized for incorrect reponses (i.e., shooting an unarmed person or failing to shoot an armed person) or if they don’t respond within the 0.85 response window. The goal of the task, from the subject’s perspective, is to maximize their score.\nThe typical findings in the shooter task are that\n\nSubjects are quicker to respond to armed targets than to unarmed targets, but are especially quick toward armed black targets and especially slow toward unarmed black targets.\nSubjects are more likely to shoot black targets than white targets, whether they are armed or not.\n\n\n\n\nshooter = pd.read_csv(\"data/shooter.csv\", na_values=\".\")\nshooter.head(10)\n\n\n\n\n\n \n \n \n subject\n target\n trial\n race\n object\n time\n response\n \n \n \n \n 0\n 1\n w05\n 19\n white\n nogun\n 658.0\n correct\n \n \n 1\n 2\n b07\n 19\n black\n gun\n 573.0\n correct\n \n \n 2\n 3\n w05\n 19\n white\n gun\n 369.0\n correct\n \n \n 3\n 4\n w07\n 19\n white\n gun\n 495.0\n correct\n \n \n 4\n 5\n w15\n 19\n white\n nogun\n 483.0\n correct\n \n \n 5\n 6\n w96\n 19\n white\n nogun\n 786.0\n correct\n \n \n 6\n 7\n w13\n 19\n white\n nogun\n 519.0\n correct\n \n \n 7\n 8\n w06\n 19\n white\n nogun\n 567.0\n correct\n \n \n 8\n 9\n b14\n 19\n black\n gun\n 672.0\n incorrect\n \n \n 9\n 10\n w90\n 19\n white\n gun\n 457.0\n correct\n \n \n\n\n\n\nThe design of the experiment is such that the subject, target, and object (i.e., gun vs. 
no gun) factors are fully crossed: each subject views each target twice, once with a gun and once without a gun.\n\npd.crosstab(shooter[\"subject\"], [shooter[\"target\"], shooter[\"object\"]])\n\n\n\n\n\n \n \n target\n b01\n b02\n b03\n b04\n b05\n ...\n w95\n w96\n w97\n w98\n w99\n \n \n object\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n ...\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n \n \n subject\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 2\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 3\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 4\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 5\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 6\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 7\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 8\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 9\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 10\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 11\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 12\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 13\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 14\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 15\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 16\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 17\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 18\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 19\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 20\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 21\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 22\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 23\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 24\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 25\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 26\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 27\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 28\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 29\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 30\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 31\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 32\n 1\n 1\n 1\n 
1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 33\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 34\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 35\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 36\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n\n36 rows × 100 columns\n\n\n\nThe response speeds on each trial are recorded given as reaction times (milliseconds per response), but here we invert them to and multiply by 1000 so that we are analyzing response rates (responses per second). There is no theoretical reason to prefer one of these metrics over the other, but it turns out that response rates tend to have nicer distributional properties than reaction times (i.e., deviate less strongly from the standard Gaussian assumptions), so response rates will be a little more convenient for us by allowing us to use some fairly standard distributional models.\n\nshooter[\"rate\"] = 1000.0 / shooter[\"time\"]\n\n\nplt.hist(shooter[\"rate\"].dropna());\n\n\n\n\n\n\n\n\n\nOur first model is analogous to how the data from the shooter task are usually analyzed: incorporating all subject-level sources of variability, but ignoring the sampling variability due to the sample of 50 targets. This is a Bayesian generalized linear mixed model (GLMM) with a Normal response and with intercepts and slopes that vary randomly across subjects.\nOf note here is the S(x) syntax, which is from the Formulae library that we use to parse model formulas. This instructs Bambi to use contrast codes of -1 and +1 for the two levels of each of the common factors of race (black vs. white) and object (gun vs. no gun), so that the race and object coefficients can be interpreted as simple effects on average across the levels of the other factor (directly analogous, but not quite equivalent, to the main effects). This is the standard coding used in ANOVA.\n\nsubj_model = bmb.Model(\n \"rate ~ S(race) * S(object) + (S(race) * S(object) | subject)\", \n shooter, \n dropna=True\n)\nsubj_fitted = subj_model.fit(random_seed=SEED)\n\nAutomatically removing 98/3600 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [rate_sigma, Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:43<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 44 seconds.\n\n\nFirst let’s visualize the default priors that Bambi automatically decided on for each of the parameters. We do this by calling the .plot_priors() method of the Model object.\n\nsubj_model.plot_priors();\n\nSampling: [1|subject_sigma, Intercept, S(object), S(object)|subject_sigma, S(race), S(race):S(object), S(race):S(object)|subject_sigma, S(race)|subject_sigma, rate_sigma]\n\n\n\n\n\nThe priors on the common effects seem quite reasonable. Recall that because of the -1 vs +1 contrast coding, the coefficients correspond to half the difference between the two levels of each factor. So the priors on the common effects essentially say that the black vs. 
white and gun vs. no gun (and their interaction) response rate differences are very unlikely to be as large as a full response per second.\nNow let’s visualize the model estimates. We do this by passing the InferenceData object that resulted from the Model.fit() call to az.plot_trace().\n\naz.plot_trace(subj_fitted);\n\n\n\n\nEach distribution in the plots above has 2 densities because we used 2 MCMC chains, so we are viewing the results of all 2 chains prior to their aggregation. The main message from the plot above is that the chains all seem to have converged well and the resulting posterior distributions all look quite reasonable. It’s a bit easier to digest all this information in a concise, tabular form, which we can get by passing the object that resulted from the Model.fit() call to az.summary().\n\naz.summary(subj_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.708\n 0.014\n 1.682\n 1.736\n 0.001\n 0.0\n 406.0\n 571.0\n 1.02\n \n \n S(race)[black]\n -0.001\n 0.004\n -0.009\n 0.007\n 0.000\n 0.0\n 3103.0\n 1200.0\n 1.00\n \n \n S(object)[gun]\n 0.093\n 0.006\n 0.082\n 0.105\n 0.000\n 0.0\n 1290.0\n 1237.0\n 1.00\n \n \n S(race):S(object)[black, gun]\n 0.024\n 0.004\n 0.015\n 0.031\n 0.000\n 0.0\n 3353.0\n 1333.0\n 1.00\n \n \n rate_sigma\n 0.240\n 0.003\n 0.234\n 0.245\n 0.000\n 0.0\n 3307.0\n 885.0\n 1.00\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(race):S(object)|subject[black, gun, 32]\n -0.001\n 0.006\n -0.013\n 0.011\n 0.000\n 0.0\n 2663.0\n 1697.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 33]\n -0.000\n 0.006\n -0.013\n 0.012\n 0.000\n 0.0\n 2553.0\n 1503.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 34]\n -0.000\n 0.006\n -0.013\n 0.012\n 0.000\n 0.0\n 3585.0\n 1455.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 35]\n 0.000\n 0.006\n -0.012\n 0.011\n 0.000\n 0.0\n 3093.0\n 1745.0\n 1.01\n \n \n S(race):S(object)|subject[black, gun, 36]\n -0.001\n 0.006\n -0.016\n 0.009\n 0.000\n 0.0\n 2359.0\n 1725.0\n 1.00\n \n \n\n153 rows × 9 columns\n\n\n\nThe take-home message from the analysis seems to be that we do find evidence for the usual finding that subjects are especially quick to respond (presumably with a shoot response) to armed black targets and especially slow to respond to unarmed black targets (while unarmed white targets receive “don’t shoot” responses with less hesitation). We see this in the fact that the marginal posterior for the S(race):S(object) interaction coefficient is concentrated strongly away from 0.\n\n\n\nA major flaw in the analysis above is that stimulus specific effects are ignored. The model does include group specific effects for subjects, reflecting the fact that the subjects we observed are but a sample from the broader population of subjects we are interested in and that potentially could have appeared in our study. But the targets we observed – the 50 photographs of white and black men that subjets responded to – are also but a sample from the broader theoretical population of targets we are interested in talking about, and that we could have just as easily and justifiably used as the experimental stimuli in the study. Since the stimuli comprise a random sample, they are subject to sampling variability, and this sampling variability should be accounted in the analysis by including stimulus specific effects. 
For some more information on this, see here, particularly pages 62-63.\nTo account for this, we let the intercept and slope for object be different for each target. Specific slopes for object across targets are possible because, if you recall, the design of the study was such that each target gets viewed twice by each subject, once with a gun and once without a gun. However, because each target is always either white or black, it’s not possible to add group specific slopes for the race factor or the interaction.\n\nstim_model = bmb.Model(\n \"rate ~ S(race) * S(object) + (S(race) * S(object) | subject) + (S(object) | target)\", \n shooter, \n dropna=True\n)\nstim_fitted = stim_model.fit(random_seed=SEED)\n\nAutomatically removing 98/3600 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [rate_sigma, Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset, 1|target_sigma, 1|target_offset, S(object)|target_sigma, S(object)|target_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:58<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 60 seconds.\n\n\nNow let’s look at the results…\n\naz.plot_trace(stim_fitted);\n\n\n\n\n\naz.summary(stim_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.702\n 0.020\n 1.666\n 1.738\n 0.002\n 0.001\n 158.0\n 331.0\n 1.01\n \n \n S(race)[black]\n -0.001\n 0.013\n -0.026\n 0.024\n 0.001\n 0.001\n 239.0\n 469.0\n 1.00\n \n \n S(object)[gun]\n 0.093\n 0.014\n 0.068\n 0.122\n 0.001\n 0.001\n 200.0\n 394.0\n 1.01\n \n \n S(race):S(object)[black, gun]\n 0.025\n 0.014\n 0.001\n 0.054\n 0.001\n 0.001\n 134.0\n 246.0\n 1.00\n \n \n rate_sigma\n 0.205\n 0.002\n 0.201\n 0.210\n 0.000\n 0.000\n 1698.0\n 1484.0\n 1.00\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(object)|target[gun, w95]\n -0.175\n 0.029\n -0.229\n -0.117\n 0.001\n 0.001\n 458.0\n 706.0\n 1.00\n \n \n S(object)|target[gun, w96]\n 0.080\n 0.031\n 0.026\n 0.137\n 0.001\n 0.001\n 502.0\n 907.0\n 1.00\n \n \n S(object)|target[gun, w97]\n 0.007\n 0.029\n -0.046\n 0.064\n 0.001\n 0.001\n 433.0\n 953.0\n 1.00\n \n \n S(object)|target[gun, w98]\n 0.087\n 0.029\n 0.036\n 0.141\n 0.001\n 0.001\n 466.0\n 1033.0\n 1.00\n \n \n S(object)|target[gun, w99]\n -0.019\n 0.029\n -0.076\n 0.034\n 0.001\n 0.001\n 419.0\n 686.0\n 1.00\n \n \n\n255 rows × 9 columns\n\n\n\nThere are two interesting things to note here. The first is that the key interaction effect, S(race):S(object) is much less clear now. The marginal posterior is still mostly concentrated away from 0, but there’s certainly a nontrivial part that overlaps with 0; 2.9% of the distribution, to be exact.\n\n(stim_fitted.posterior[\"S(race):S(object)\"] < 0).mean()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\narray(0.031)xarray.DataArray'S(race):S(object)'0.031array(0.031)Coordinates: (0)Indexes: (0)Attributes: (0)\n\n\nThe second interesting thing is that the two new variance components in the model, those associated with the stimulus specific effects, are actually rather large. 
This actually largely explains the first fact above, since if these where estimated to be close to 0 anyway, the model estimates wouldn’t be much different than they were in the subj_model. It makes sense that there is a strong tendency for different targets to elicit difference reaction times on average, which leads to a large estimate of 1|target_sigma.\nLess obviously, the large estimate of S(object)|target_sigma (targets tend to vary a lot in their response rate differences when they have a gun vs. some other object) also makes sense, because in this experiment, different targets were pictured with different non-gun objects. Some of these objects, such as a bright red can of Coca-Cola, are not easily confused with a gun, so subjects are able to quickly decide on the correct response. Other objects, such as a black cell phone, are possibly easier to confuse with a gun, so subjects take longer to decide on the correct response when confronted with this object.\nSince each target is yoked to a particular non-gun object, there is good reason to expect large target-to-target variability in the object effect, which is indeed what we see in the model estimates.\n\n\n\n\nHere we seek evidence of the second traditional finding, that subjects are more likely to response ‘shoot’ toward black targets than toward white targets, regardless of whether they are armed or not. Currently the dataset just records whether the given response was correct or not, so first we transformed this into whether the response was ‘shoot’ or ‘dontshoot’.\n\nshooter[\"shoot_or_not\"] = shooter[\"response\"].astype(str)\n\n# armed targets\nnew_values = {\"correct\": \"shoot\", \"incorrect\": \"dontshoot\", \"timeout\": np.nan}\nshooter.loc[shooter[\"object\"] == \"gun\", \"shoot_or_not\"] = (\n shooter.loc[shooter[\"object\"] == \"gun\", \"response\"].astype(str).replace(new_values)\n)\n \n# unarmed targets\nnew_values = {\"correct\": \"dontshoot\", \"incorrect\": \"shoot\", \"timeout\": np.nan}\nshooter.loc[shooter[\"object\"] == \"nogun\", \"shoot_or_not\"] = (\n shooter.loc[shooter[\"object\"] == \"nogun\", \"response\"].astype(str).replace(new_values)\n)\n \n# view result\nshooter.head(20)\n\n\n\n\n\n \n \n \n subject\n target\n trial\n race\n object\n time\n response\n rate\n shoot_or_not\n \n \n \n \n 0\n 1\n w05\n 19\n white\n nogun\n 658.0\n correct\n 1.519757\n dontshoot\n \n \n 1\n 2\n b07\n 19\n black\n gun\n 573.0\n correct\n 1.745201\n shoot\n \n \n 2\n 3\n w05\n 19\n white\n gun\n 369.0\n correct\n 2.710027\n shoot\n \n \n 3\n 4\n w07\n 19\n white\n gun\n 495.0\n correct\n 2.020202\n shoot\n \n \n 4\n 5\n w15\n 19\n white\n nogun\n 483.0\n correct\n 2.070393\n dontshoot\n \n \n 5\n 6\n w96\n 19\n white\n nogun\n 786.0\n correct\n 1.272265\n dontshoot\n \n \n 6\n 7\n w13\n 19\n white\n nogun\n 519.0\n correct\n 1.926782\n dontshoot\n \n \n 7\n 8\n w06\n 19\n white\n nogun\n 567.0\n correct\n 1.763668\n dontshoot\n \n \n 8\n 9\n b14\n 19\n black\n gun\n 672.0\n incorrect\n 1.488095\n dontshoot\n \n \n 9\n 10\n w90\n 19\n white\n gun\n 457.0\n correct\n 2.188184\n shoot\n \n \n 10\n 11\n w91\n 19\n white\n nogun\n 599.0\n correct\n 1.669449\n dontshoot\n \n \n 11\n 12\n b17\n 19\n black\n nogun\n 769.0\n correct\n 1.300390\n dontshoot\n \n \n 12\n 13\n b04\n 19\n black\n nogun\n 600.0\n correct\n 1.666667\n dontshoot\n \n \n 13\n 14\n w17\n 19\n white\n nogun\n 653.0\n correct\n 1.531394\n dontshoot\n \n \n 14\n 15\n b93\n 19\n black\n gun\n 468.0\n correct\n 2.136752\n shoot\n \n \n 15\n 16\n w96\n 19\n 
white\n gun\n 546.0\n correct\n 1.831502\n shoot\n \n \n 16\n 17\n w91\n 19\n white\n gun\n 591.0\n incorrect\n 1.692047\n dontshoot\n \n \n 17\n 18\n b95\n 19\n black\n gun\n NaN\n timeout\n NaN\n NaN\n \n \n 18\n 19\n b09\n 19\n black\n gun\n 656.0\n correct\n 1.524390\n shoot\n \n \n 19\n 20\n b02\n 19\n black\n gun\n 617.0\n correct\n 1.620746\n shoot\n \n \n\n\n\n\nLet’s skip straight to the correct model that includes stimulus specific effects. This looks quite similiar to the stim_model from above except that we change the response to the new shoot_or_not variable – notice the [shoot] syntax indicating that we wish to model the prbability that shoot_or_not=='shoot', not shoot_or_not=='dontshoot' – and then change to family='bernoulli' to indicate a mixed effects logistic regression.\n\nstim_response_model = bmb.Model(\n \"shoot_or_not[shoot] ~ S(race)*S(object) + (S(race)*S(object) | subject) + (S(object) | target)\",\n shooter,\n family=\"bernoulli\",\n dropna=True\n)\n\n# Note we increased target_accept from default 0.8 to 0.9 because there were divergences\nstim_response_fitted = stim_response_model.fit(\n draws=2000, \n target_accept=0.9,\n random_seed=SEED\n)\n\nAutomatically removing 98/3600 rows from the dataset.\nModeling the probability that shoot_or_not==shoot\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset, 1|target_sigma, 1|target_offset, S(object)|target_sigma, S(object)|target_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 01:49<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 110 seconds.\n\n\nShow the trace plot\n\naz.plot_trace(stim_response_fitted);\n\n\n\n\nLooks pretty good! 
Now for the more concise summary.\n\naz.summary(stim_response_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.021\n 0.151\n -0.309\n 0.262\n 0.002\n 0.002\n 4864.0\n 2794.0\n 1.0\n \n \n S(race)[black]\n 0.224\n 0.145\n -0.032\n 0.508\n 0.002\n 0.002\n 5123.0\n 3256.0\n 1.0\n \n \n S(object)[gun]\n 4.172\n 0.248\n 3.724\n 4.636\n 0.005\n 0.003\n 2687.0\n 2887.0\n 1.0\n \n \n S(race):S(object)[black, gun]\n 0.200\n 0.170\n -0.120\n 0.516\n 0.003\n 0.002\n 3508.0\n 3153.0\n 1.0\n \n \n 1|subject_sigma\n 0.222\n 0.151\n 0.000\n 0.486\n 0.004\n 0.003\n 1734.0\n 2208.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(object)|target[gun, w95]\n 0.349\n 0.598\n -0.749\n 1.497\n 0.007\n 0.009\n 8476.0\n 3118.0\n 1.0\n \n \n S(object)|target[gun, w96]\n 0.030\n 0.554\n -0.997\n 1.062\n 0.006\n 0.010\n 7719.0\n 2645.0\n 1.0\n \n \n S(object)|target[gun, w97]\n 0.310\n 0.582\n -0.734\n 1.439\n 0.008\n 0.010\n 5782.0\n 2261.0\n 1.0\n \n \n S(object)|target[gun, w98]\n 0.344\n 0.584\n -0.637\n 1.525\n 0.007\n 0.008\n 7543.0\n 3183.0\n 1.0\n \n \n S(object)|target[gun, w99]\n 0.017\n 0.548\n -0.993\n 1.069\n 0.006\n 0.009\n 8789.0\n 3061.0\n 1.0\n \n \n\n254 rows × 9 columns\n\n\n\nThere is some slight evidence here for the hypothesis that subjects are more likely to shoot the black targets, regardless of whether they are armed or not, but the evidence is not too strong. The marginal posterior for the S(race) coefficient is mostly concentrated away from 0, but it overlaps even more in this case with 0 than did the key interaction effect in the previous model.\n\n(stim_response_fitted.posterior[\"S(race)\"] < 0).mean()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\narray(0.06275)xarray.DataArray'S(race)'0.06275array(0.06275)Coordinates: (0)Indexes: (0)Attributes: (0)\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\narviz : 0.14.0\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\nbambi : 0.9.3\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport matplotlib.pyplot as plt\nfrom matplotlib.lines import Line2D\nimport numpy as np\nimport pandas as pd\nimport scipy.stats as stats\nimport seaborn as sns\nimport warnings\n\nimport bambi as bmb\n\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\nWARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions." }, { - "objectID": "notebooks/quantile_regression.html", - "href": "notebooks/quantile_regression.html", + "objectID": "notebooks/zero_inflated_regression.html#zero-inflated-outcomes", + "href": "notebooks/zero_inflated_regression.html#zero-inflated-outcomes", "title": "Bambi", - "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 12947\n\nUsually when doing regression we model the conditional mean of some distribution. Common cases are a Normal distribution for continuous unbounded responses, a Poisson distribution for count data, etc.\nQuantile regression, instead estimates a conditional quantile of the response variable. 
If the quantile is 0.5, then we will be estimating the median (instead of the mean), this could be useful as a way of performing robust regression, in a similar fashion as using a Student-t distribution instead of a Normal. But for some problem we actually care of the behaviour of the response away from the mean (or median). For example, in medical research, pathologies or potential health risks occur at high or low quantile, for instance, overweight and underweight. In some other fields like ecology, quantile regression is justified due to the existence of complex interactions between variables, where the effect of one variable on another is different for different ranges of the variable.\n\n\nAt first it could be weird to think which distribution we should use as the likelihood for quantile regression or how to write a Bayesian model for quantile regression. But it turns out the answer is quite simple, we just need to use the asymmetric Laplace distribution. This distribution has one parameter controling the mean, another for the scale and a third one for the asymmetry. There are at least two alternative parametrizations regarding this asymmetric parameter. In terms of \\(\\kappa\\) a parameter that goes from 0 to \\(\\infty\\) and in terms of \\(q\\) a number between 0 and 1. This later parametrization is more intuitive for quantile regression as we can directly interpre it as the quantile of interest.\nOn the next cell we compute the pdf of 3 distribution from the Asymmetric Laplace family\n\nx = np.linspace(-6, 6, 2000)\nquantiles = np.array([0.2, 0.5, 0.8])\nfor q, m in zip(quantiles, [0, 0, -1]):\n κ = (q/(1-q))**0.5\n plt.plot(x, stats.laplace_asymmetric(κ, m, 1).pdf(x), label=f\"q={q:}, μ={m}, σ=1\")\nplt.yticks([]);\nplt.legend();\n\n\n\n\nWe are going to use a simple dataset to model the Body Mass Index for Dutch kids and young men as a function of their age.\n\ndata = pd.read_csv(\"data/bmi.csv\")\ndata.head()\n\n\n\n\n\n \n \n \n age\n bmi\n \n \n \n \n 0\n 0.03\n 13.235289\n \n \n 1\n 0.04\n 12.438775\n \n \n 2\n 0.04\n 14.541775\n \n \n 3\n 0.04\n 11.773954\n \n \n 4\n 0.04\n 15.325614\n \n \n\n\n\n\nAs we can see from the next figure the relationship between BMI and age is far from linear, and hence we are going to use splines.\n\nplt.plot(data.age, data.bmi, \"k.\");\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nWe are going to model 3 quantiles, 0.1, 0.5 and 0.9. For that reasoson we are going to fit 3 separated models, being the sole different the value of kappa of the Asymmetric Laplace distribution, that will be fix at a different value each time. 
In the future Bambi will allow to directly work with the parameter q instead of kappa, in the meantime we have to apply a transformation to go from quantiles to suitable values of kappa.\n\\[\n\\kappa = \\sqrt{\\frac{q}{1 - q}}\n\\]\n\nquantiles = np.array([0.1, 0.5, 0.9])\nkappas = (quantiles/(1-quantiles))**0.5\n\nknots = np.quantile(data.age, np.linspace(0, 1, 10))[1:-1]\n\nidatas = []\nfor κ in kappas:\n model = bmb.Model(\"bmi ~ bs(age, knots=knots)\",\n data=data, family=\"asymmetriclaplace\", priors={\"kappa\": κ})\n idata = model.fit()\n model.predict(idata)\n idatas.append(idata)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:27<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 28 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:22<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 22 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:28<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 29 seconds.\n\n\nWe can see the result of the 3 fitted curves in the next figure. One feature that stand-out is that the gap or distance between the median (orange) line and the two other lines is not the same. Also the shapes of the curve while following a similar pattern, are not exactly the same.\n\nplt.plot(data.age, data.bmi, \".\", color=\"0.5\")\nfor idata, q in zip(idatas, quantiles):\n plt.plot(data.age.values, idata.posterior[\"bmi_mean\"].mean((\"chain\", \"draw\")),\n label=f\"q={q:}\", lw=3);\n \nplt.legend()\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nTo better undestand these remarks let’s compute a simple linear regression and then compute the same 3 quantiles from that fit.\n\nmodel_g = bmb.Model(\"bmi ~ bs(age, knots=knots)\",\n data=data)\nidata_g = model_g.fit()\nmodel_g.predict(idata_g, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_sigma, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 16 seconds.\n\n\n\nidata_g_mean_quantiles = idata_g.posterior_predictive[\"bmi\"].quantile(quantiles, (\"chain\", \"draw\"))\n\n\nplt.plot(data.age, data.bmi, \".\", color=\"0.5\")\nfor q in quantiles:\n plt.plot(data.age.values, idata_g_mean_quantiles.sel(quantile=q),\n label=f\"q={q:}\");\n \nplt.legend()\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nWe can see that when we use a Gaussian family and from that fit we compute the quantiles, the quantiles q=0.1 and q=0.9 are symetrical with respect to q=0.5, also the shape of the curves is essentially the same just shifted up or down. 
Additionally the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\nmatplotlib: 3.6.2\nscipy : 1.9.3\nnumpy : 1.23.5\npandas : 1.5.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\n\nWatermark: 2.3.1" + "section": "Zero inflated outcomes", + "text": "Zero inflated outcomes\nSometimes, an observation is not generated from a single process, but from a mixture of processes. Whenever there is a mixture of processes generating an observation, a mixture model may be more appropriate. A mixture model uses more than one probability distribution to model the data. Count data are more susceptible to needing a mixture model as it is common to have a large number of zeros and values greater than zero. A zero means “nothing happened”, and this can be either because the rate of events is low, or because the process that generates the events was never “triggered”. For example, in health service utilization data (the number of times a patient used a service during a given time period), a large number of zeros represents patients with no utilization during the time period. However, some patients do use a service which is a result of some “triggered process”.\nThere are two popular classes of models for modeling zero-inflated data: (1) ZIP, and (2) hurdle Poisson. First, the ZIP model is described and how to implement it in Bambi is outlined. Subsequently, the hurdle Poisson model and how to implement it is outlined thereafter." }, { - "objectID": "notebooks/distributional_models.html", - "href": "notebooks/distributional_models.html", + "objectID": "notebooks/zero_inflated_regression.html#zero-inflated-poisson", + "href": "notebooks/zero_inflated_regression.html#zero-inflated-poisson", "title": "Bambi", - "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom matplotlib.lines import Line2D\n\n\nimport warnings\nwarnings.simplefilter(action='ignore', category=FutureWarning) # ArviZ\n\naz.style.use(\"arviz-doc\")\n\nFor most regression models, a function of the mean (aka the location parameter) of the response distribution is defined as a linear function of certain predictors, while the remaining parameters are considered auxiliary. For instance, if the response is a Gaussian, we model \\(\\mu\\) as a combination of predictors and \\(\\sigma\\) is estimated from the data, but assumed to be constant for all observations.\nInstead, with distributional models we can specify predictor terms for all parameters of the response distribution. This can be useful, for example, to model heteroskedasticity, i.e. unequal variance. In this notebook we are going to do exactly that.\nTo better understand distributional models, let’s begin fitting a non-distributional models. We are going to model the following syntetic dataset. 
And we are going to use a Gamma response with a log link function.\n\nrng = np.random.default_rng(121195)\nN = 200\na, b = 0.5, 1.1\nx = rng.uniform(-1.5, 1.5, N)\nshape = np.exp(0.3 + x * 0.5 + rng.normal(scale=0.1, size=N))\ny = rng.gamma(shape, np.exp(a + b * x) / shape, N)\ndata = pd.DataFrame({\"x\": x, \"y\": y})\nnew_data = pd.DataFrame({\"x\": np.linspace(-1.5, 1.5, num=50)})\n\n\n\n\nformula = bmb.Formula(\"y ~ x\")\nmodel_constant = bmb.Model(formula, data, family=\"gamma\", link=\"log\")\nmodel_constant\n\n Formula: y ~ x\n Family: gamma\n Link: mu = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n \n Auxiliary parameters\n alpha ~ HalfCauchy(beta: 1.0)\n\n\n\nmodel_constant.build()\nmodel_constant.graph()\n\n\n\n\nTake a moment to inspect the textual and graphical representations of the model, to ensure you understand how the parameters are related.\n\nidata_constant = model_constant.fit(random_seed=121195, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_alpha, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nOnce the model is fitted let’s visually inspect the result in terms of the mean (the line in the following figure) and the individual predictions (the band).\n\nmodel_constant.predict(idata_constant, kind=\"mean\", data=new_data)\nmodel_constant.predict(idata_constant, kind=\"pps\", data=new_data)\n\nqts_constant = (\n az.extract(idata_constant.posterior_predictive, var_names=\"y\")\n .quantile([0.025, 0.975], \"sample\")\n .to_numpy()\n)\nmean_constant = (\n az.extract(idata_constant.posterior_predictive, var_names=\"y\")\n .mean(\"sample\")\n .to_numpy()\n)\n\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\naz.plot_hdi(new_data[\"x\"], qts_constant, ax=ax, fill_kwargs={\"alpha\": 0.4})\nax.plot(new_data[\"x\"], mean_constant, color=\"C0\", lw=2)\nax.scatter(data[\"x\"], data[\"y\"], color=\"k\", alpha=0.2)\nax.set(xlabel=\"Predictor\", ylabel=\"Outcome\");\n\n\n\n\nThe model correctly model that the outcome increases with the values of the predictor. So far so good, let’s dive into the heart of the matter.\n\n\n\nNow we are going to build the same model as before with the only, but crucial difference, that we are also going to make alpha depend on the predictor. The syntax is very simple besides the usual “y ~ x”, we now add “alpha ~ x”. 
Neat!\n\nformula_varying = bmb.Formula(\"y ~ x\", \"alpha ~ x\")\nmodel_varying = bmb.Model(formula_varying, data, family=\"gamma\", link={\"mu\": \"log\", \"alpha\": \"log\"})\nmodel_varying\n\n Formula: y ~ x\n alpha ~ x\n Family: gamma\n Link: mu = log\n alpha = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n target = alpha\n Common-level effects\n alpha_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n alpha_x ~ Normal(mu: 0.0, sigma: 1.0)\n\n\n\nmodel_varying.build()\nmodel_varying.graph()\n\n\n\n\nTake another moment to inspect the textual and visual representations of model_varying and also go back and compare those from model_constant.\n\nidata_varying = model_varying.fit(random_seed=121195, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x, alpha_Intercept, alpha_x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nNow, with both models being fitted, let’s see how the alpha parameter differs between both models. In the next figure you can see a blueish KDE for the alpha parameter estimated with model_constant and 200 black KDEs for the alpha parameter estimated from the model_varying. You can count it if you want :-), but we know they should be 200 because we should have one for each one of the 200 observations.\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\nfor idx in idata_varying.posterior.coords.get(\"y_obs\"):\n values = idata_varying.posterior[\"alpha\"].sel(y_obs=idx).to_numpy().flatten()\n grid, pdf = az.kde(values)\n ax.plot(grid, pdf, lw=0.05, color=\"k\")\n\nvalues = idata_constant.posterior[\"y_alpha\"].to_numpy().flatten()\ngrid, pdf = az.kde(values)\nax.plot(grid, pdf, lw=2, color=\"C0\");\n\n# Create legend\nhandles = [\n Line2D([0], [0], label=\"Varying alpha\", lw=1.5, color=\"k\", alpha=0.6),\n Line2D([0], [0], label=\"Constant alpha\", lw=1.5, color=\"C0\")\n]\n\nlegend = ax.legend(handles=handles, loc=\"upper right\", fontsize=14)\n\nax.set(xlabel=\"Alpha posterior\", ylabel=\"Density\");\n\n\n\n\nThis is nice statistical art and a good insight into what the model is actully doing. But at this point you may be wondering how results looks like and more important how different they are from model_constant. 
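Before plotting, a small numerical complement to the KDEs above (a sketch added here, reusing the idata_varying object and the same "alpha" posterior variable used in the plotting loop): the spread of the per-observation posterior means of alpha shows how much the shape parameter actually moves across the data.

# Range of the posterior-mean alpha across the observations
alpha_means = idata_varying.posterior["alpha"].mean(("chain", "draw"))
print(float(alpha_means.min()), float(alpha_means.max()))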
Let’s plot the mean and predictions as we did before, but for both models.\n\nmodel_varying.predict(idata_varying, kind=\"mean\", data=new_data)\nmodel_varying.predict(idata_varying, kind=\"pps\", data=new_data)\n\nqts_varying = (\n az.extract(idata_varying.posterior_predictive, var_names=\"y\")\n .quantile([0.025, 0.975], \"sample\")\n .to_numpy()\n)\nmean_varying = (\n az.extract(idata_varying.posterior_predictive, var_names=\"y\")\n .mean(\"sample\")\n .to_numpy()\n)\n\n\nfig, ax = plt.subplots(figsize=(8, 4.5), dpi=120)\n\naz.plot_hdi(new_data[\"x\"], qts_constant, ax=ax, fill_kwargs={\"alpha\": 0.4})\nax.plot(new_data[\"x\"], mean_constant, color=\"C1\", label=\"constant\")\n\naz.plot_hdi(new_data[\"x\"], qts_varying, ax=ax, fill_kwargs={\"alpha\": 0.4, \"color\":\"k\"})\nax.plot(new_data[\"x\"], mean_varying, color=\"k\", label=\"varying\")\nax.set(xlabel=\"Predictor\", ylabel=\"Outcome\");\nplt.legend();\n\n\n\n\nWe can see that mean is virtually the same for both model but the predictions are not, in particular for larger values of the predictiors.\nWe can also check that the models actually looks different under the LOO metric, with a slight preference for the varying model.\n\naz.compare({\"constant\": idata_constant, \"varying\": idata_varying})\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n varying\n 0\n -309.191836\n 3.851329\n 0.000000\n 0.933024\n 16.458759\n 0.00000\n False\n log\n \n \n constant\n 1\n -318.913528\n 2.958351\n 9.721692\n 0.066976\n 15.832033\n 4.59755\n False\n log\n \n \n\n\n\n\n\n\n\nTime to step up our game. In this example we are going to use the bikes data set from the University of California Irvine’s Machine Learning Repository, and we are going to estimate the number of rental bikes rented per hour over a 24 hour period.\nAs the number of bikes is a count variable we are going to use a negativebinomial family, and we are going to use two splines: one for the mean, and one for alpha.\n\ndata = bmb.load_data(\"bikes\")\n# Remove data, you may later try to refit the model to the whole data\ndata = data[::50]\ndata = data.reset_index(drop=True)\n\n\nformula = bmb.Formula(\n \"count ~ 0 + bs(hour, 8, intercept=True)\",\n \"alpha ~ 0 + bs(hour, 8, intercept=True)\"\n)\nmodel_bikes = bmb.Model(formula, data, family=\"negativebinomial\")\nmodel_bikes\n\n Formula: count ~ 0 + bs(hour, 8, intercept=True)\n alpha ~ 0 + bs(hour, 8, intercept=True)\n Family: negativebinomial\n Link: mu = log\n alpha = log\n Observations: 348\n Priors: \n target = mu\n Common-level effects\n bs(hour, 8, intercept=True) ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 
0.], sigma: [11.3704 13.9185\n 11.9926 10.6887 10.6819 12.1271 13.623 11.366 ])\n\n target = alpha\n Common-level effects\n alpha_bs(hour, 8, intercept=True) ~ Normal(mu: 0.0, sigma: 1.0)\n\n\n\nidata_bikes = model_bikes.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bs(hour, 8, intercept=True), alpha_bs(hour, 8, intercept=True)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:18<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 19 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\nhour = np.linspace(0, 23, num=200)\nnew_data = pd.DataFrame({\"hour\": hour})\nmodel_bikes.predict(idata_bikes, data=new_data, kind=\"pps\")\n\n\nq = [0.025, 0.975]\ndims = (\"chain\", \"draw\")\n\nmean = idata_bikes.posterior[\"count_mean\"].mean(dims).to_numpy()\nmean_interval = idata_bikes.posterior[\"count_mean\"].quantile(q, dims).to_numpy()\ny_interval = idata_bikes.posterior_predictive[\"count\"].quantile(q, dims).to_numpy()\n\nfig, ax = plt.subplots(figsize=(12, 4))\nax.scatter(data[\"hour\"], data[\"count\"], alpha=0.3, color=\"k\")\nax.plot(hour, mean, color=\"C3\")\nax.fill_between(hour, mean_interval[0],mean_interval[1], alpha=0.5, color=\"C1\");\naz.plot_hdi(hour, y_interval, fill_kwargs={\"color\": \"C1\", \"alpha\": 0.3}, ax=ax);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 2.0.2\nbambi : 0.12.0.dev0\nmatplotlib: 3.6.2\nnumpy : 1.25.0\narviz : 0.14.0\n\nWatermark: 2.3.1" + "section": "Zero inflated poisson", + "text": "Zero inflated poisson\nTo model zero-inflated outcomes, the ZIP model uses a distribution that mixes two data generating processes. The first process generates zeros, and the second process uses a Poisson distribution to generate counts (of which some may be zero). The result of this mixture is a distribution that can be described as\n\\[P(Y=0) = (1 - \\psi) + \\psi e^{-\\mu}\\]\n\\[P(Y=y_i) = \\psi \\frac{e^{-\\mu} \\mu_{i}^y}{y_{i}!} \\ \\text{for} \\ y_i = 1, 2, 3,...,n\\]\nwhere \\(y_i\\) is the outcome, \\(\\mu\\) is the mean of the Poisson process where \\(\\mu \\ge 0\\), and \\(\\psi\\) is the probability of the Poisson process where \\(0 \\lt \\psi \\lt 1\\). To understand how these two processes are “mixed”, let’s simulate some data using the two process equations above (taken from the PyMC docs).\n\nx = np.arange(0, 22)\npsis = [0.7, 0.4]\nmus = [10, 4]\nplt.figure(figsize=(7, 3))\nfor psi, mu in zip(psis, mus):\n pmf = stats.poisson.pmf(x, mu)\n pmf[0] = (1 - psi) + pmf[0] # 1.) generate zeros\n pmf[1:] = psi * pmf[1:] # 2.) generate counts\n pmf /= pmf.sum() # normalize to get probabilities\n plt.plot(x, pmf, '-o', label='$\\\\psi$ = {}, $\\\\mu$ = {}'.format(psi, mu))\n\nplt.title(\"Zero Inflated Poisson Process\")\nplt.xlabel('x', fontsize=12)\nplt.ylabel('f(x)', fontsize=12)\nplt.legend(loc=1)\nplt.show()\n\n\n\n\nNotice how the blue line, corresponding to a higher \\(\\psi\\) and \\(\\mu\\), has a higher rate of counts and less zeros. Additionally, the inline comments above describe the first and second process generating the data.\n\nZIP regression model\nThe equations above only describe the ZIP distribution. However, predictors can be added to make this a regression model. 
Suppose we have a response variable \\(Y\\), which represents the number of events that occur during a time period, and \\(p\\) predictors \\(X_1, X_2, ..., X_p\\). We can model the parameters of the ZIP distribution as a linear combination of the predictors.\n\\[Y_i \\sim \\text{ZIPoisson}(\\mu_i, \\psi_i)\\]\n\\[g(\\mu_i) = \\beta_0 + \\beta_1 X_{1i}+,...,+\\beta_p X_{pi}\\]\n\\[h(\\psi_i) = \\alpha_0 + \\alpha_1 X_{1i}+,...,+\\alpha_p X_{pi}\\]\nwhere \\(g\\) and \\(h\\) are the link functions for each parameter. Bambi, by default, uses the log link for \\(g\\) and the logit link for \\(h\\). Notice how there are two linear models and two link functions: one for each parameter in the \\(\\text{ZIPoisson}\\). The parameters of the linear model differ, because any predictor such as \\(X\\) may be associated differently with each part of the mixture. Actually, you don’t even need to use the same predictors in both linear models—but this beyond the scope of this notebook.\n\nThe fish dataset\nTo demonstrate the ZIP regression model, we model and predict how many fish are caught by visitors at a state park using survey data. Many visitors catch zero fish, either because they did not fish at all, or because they were unlucky. The dataset contains data on 250 groups that went to a state park to fish. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), if they used a live bait (livebait) and whether or not they brought a camper to the park (camper).\n\nfish_data = pd.read_stata(\"http://www.stata-press.com/data/r11/fish.dta\")\ncols = [\"count\", \"livebait\", \"camper\", \"persons\", \"child\"]\nfish_data = fish_data[cols]\nfish_data[\"livebait\"] = pd.Categorical(fish_data[\"livebait\"])\nfish_data[\"camper\"] = pd.Categorical(fish_data[\"camper\"])\nfish_data = fish_data[fish_data[\"count\"] < 60] # remove outliers\n\n\nfish_data.head()\n\n\n\n\n\n \n \n \n count\n livebait\n camper\n persons\n child\n \n \n \n \n 0\n 0.0\n 0.0\n 0.0\n 1.0\n 0.0\n \n \n 1\n 0.0\n 1.0\n 1.0\n 1.0\n 0.0\n \n \n 2\n 0.0\n 1.0\n 0.0\n 1.0\n 0.0\n \n \n 3\n 0.0\n 1.0\n 1.0\n 2.0\n 1.0\n \n \n 4\n 1.0\n 1.0\n 0.0\n 1.0\n 0.0\n \n \n\n\n\n\n\n# Excess zeros, and skewed count\nplt.figure(figsize=(7, 3))\nsns.histplot(fish_data[\"count\"], discrete=True)\nplt.xlabel(\"Number of Fish Caught\");\n\n\n\n\nTo fit a ZIP regression model, we pass family=zero_inflated_poisson to the bmb.Model constructor.\n\nzip_model = bmb.Model(\n \"count ~ livebait + camper + persons + child\", \n fish_data, \n family='zero_inflated_poisson'\n)\n\nzip_idata = zip_model.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [count_psi, Intercept, livebait, camper, persons, child]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.\n\n\nLets take a look at the model components. Why is there only one linear model and link function defined for \\(\\mu\\). Where is the linear model and link function for \\(\\psi\\)? By default, the “main” (or first) formula is defined for the parent parameter; in this case \\(\\mu\\). 
Since we didn’t pass an additional formula for the non-parent parameter \\(\\psi\\), \\(\\psi\\) was never modeled as a function of the predictors as explained above. If we want to model both \\(\\mu\\) and \\(\\psi\\) as a function of the predictor, we need to expicitly pass two formulas.\n\nzip_model\n\n Formula: count ~ livebait + camper + persons + child\n Family: zero_inflated_poisson\n Link: mu = log\n Observations: 248\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 9.5283)\n livebait ~ Normal(mu: 0.0, sigma: 7.2685)\n camper ~ Normal(mu: 0.0, sigma: 5.0733)\n persons ~ Normal(mu: 0.0, sigma: 2.2583)\n child ~ Normal(mu: 0.0, sigma: 2.9419)\n \n Auxiliary parameters\n psi ~ Beta(alpha: 2.0, beta: 2.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nformula = bmb.Formula(\n \"count ~ livebait + camper + persons + child\", # parent parameter mu\n \"psi ~ livebait + camper + persons + child\" # non-parent parameter psi\n)\n\nzip_model = bmb.Model(\n formula, \n fish_data, \n family='zero_inflated_poisson'\n)\n\nzip_idata = zip_model.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, livebait, camper, persons, child, psi_Intercept, psi_livebait, psi_camper, psi_persons, psi_child]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:05<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.\n\n\n\nzip_model\n\n Formula: count ~ livebait + camper + persons + child\n psi ~ livebait + camper + persons + child\n Family: zero_inflated_poisson\n Link: mu = log\n psi = logit\n Observations: 248\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 9.5283)\n livebait ~ Normal(mu: 0.0, sigma: 7.2685)\n camper ~ Normal(mu: 0.0, sigma: 5.0733)\n persons ~ Normal(mu: 0.0, sigma: 2.2583)\n child ~ Normal(mu: 0.0, sigma: 2.9419)\n target = psi\n Common-level effects\n psi_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n psi_livebait ~ Normal(mu: 0.0, sigma: 1.0)\n psi_camper ~ Normal(mu: 0.0, sigma: 1.0)\n psi_persons ~ Normal(mu: 0.0, sigma: 1.0)\n psi_child ~ Normal(mu: 0.0, sigma: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nNow, both \\(\\mu\\) and \\(\\psi\\) are defined as a function of a linear combination of the predictors. Additionally, we can see that the log and logit link functions are defined for \\(\\mu\\) and \\(\\psi\\), respectively.\n\nzip_model.graph()\n\n\n\n\nSince each parameter has a different link function, and each parameter has a different meaning, we must be careful on how the coefficients are interpreted. Coefficients without the substring “psi” correspond to the \\(\\mu\\) parameter (the mean of the Poisson process) and are on the log scale. Coefficients with the substring “psi” correspond to the \\(\\psi\\) parameter (this can be thought of as the log-odds of non-zero data) and are on the logit scale. Interpreting these coefficients can be easier with the interpret sub-package. 
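As a hand-rolled complement (a sketch added here, not part of the original notebook), we can back-transform a couple of coefficients ourselves: exponentiating a mu-coefficient gives a multiplicative effect on the expected count, and the inverse logit maps psi-scale quantities to probabilities. The variable names below are the ones listed in the sampler output above.

from scipy.special import expit  # inverse logit

# Posterior mean of the multiplicative effect on mu of one extra person in the group
rate_ratio = np.exp(zip_idata.posterior["persons"]).mean().item()

# psi intercept mapped to a probability (numeric predictors at 0, categorical
# predictors at their reference level -- purely illustrative)
baseline_psi = expit(zip_idata.posterior["psi_Intercept"]).mean().item()

print(rate_ratio, baseline_psi)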
Below, we will show how to use this sub-package to interpret the coefficients conditional on a set of the predictors.\n\naz.summary(\n zip_idata, \n var_names=[\"Intercept\", \"livebait\", \"camper\", \"persons\", \"child\"], \n filter_vars=\"like\"\n)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.573\n 0.310\n -2.130\n -0.956\n 0.005\n 0.004\n 3593.0\n 3173.0\n 1.0\n \n \n livebait[1.0]\n 1.609\n 0.272\n 1.143\n 2.169\n 0.004\n 0.003\n 4158.0\n 3085.0\n 1.0\n \n \n camper[1.0]\n 0.262\n 0.095\n 0.085\n 0.440\n 0.001\n 0.001\n 5032.0\n 2816.0\n 1.0\n \n \n persons\n 0.615\n 0.045\n 0.527\n 0.697\n 0.001\n 0.000\n 4864.0\n 2709.0\n 1.0\n \n \n child\n -0.795\n 0.094\n -0.972\n -0.625\n 0.002\n 0.001\n 3910.0\n 3232.0\n 1.0\n \n \n psi_Intercept\n -1.443\n 0.817\n -2.941\n 0.124\n 0.013\n 0.009\n 4253.0\n 3018.0\n 1.0\n \n \n psi_livebait[1.0]\n -0.188\n 0.677\n -1.490\n 1.052\n 0.010\n 0.011\n 4470.0\n 2776.0\n 1.0\n \n \n psi_camper[1.0]\n 0.841\n 0.323\n 0.222\n 1.437\n 0.004\n 0.003\n 6002.0\n 3114.0\n 1.0\n \n \n psi_persons\n 0.912\n 0.193\n 0.571\n 1.288\n 0.003\n 0.002\n 4145.0\n 3169.0\n 1.0\n \n \n psi_child\n -1.890\n 0.305\n -2.502\n -1.353\n 0.005\n 0.003\n 4022.0\n 2883.0\n 1.0\n \n \n\n\n\n\n\n\nInterpret model parameters\nSince we have fit a distributional model, we can leverage the plot_predictions() function in the interpret sub-package to visualize how the \\(\\text{ZIPoisson}\\) parameters \\(\\mu\\) and \\(\\psi\\) vary as a covariate changes.\n\nfig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))\n\nbmb.interpret.plot_predictions(\n zip_model,\n zip_idata,\n covariates=\"persons\",\n ax=ax[0]\n)\nax[0].set_ylabel(\"mu (fish count)\")\nax[0].set_title(\"$\\\\mu$ as a function of persons\")\n\nbmb.interpret.plot_predictions(\n zip_model,\n zip_idata,\n covariates=\"persons\",\n target=\"psi\",\n ax=ax[1]\n)\nax[1].set_title(\"$\\\\psi$ as a function of persons\");\n\n\n\n\nInterpreting the left plot (the \\(\\mu\\) parameter) as the number of people in a group fishing increases, so does the number of fish caught. The right plot (the \\(\\psi\\) parameter) shows that as the number of people in a group fishing increases, the probability of the Poisson process increases. One interpretation of this is that as the number of people in a group increases, the probability of catching no fish decreases.\n\n\nPosterior predictive distribution\nLastly, lets plot the posterior predictive distribution against the observed data to see how well the model fits the data. 
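One more quantity ties the two linear predictors together (a short aside added here): under the ZIP parametrization above, the marginal expected count is the product of the two parameters, E[Y] = psi * mu, so a covariate that increases both psi and mu (such as persons) raises the expected catch through both routes.

# Expected count for the first curve of the ZIP simulation above
mu, psi = 10, 0.7
print(psi * mu)  # E[Y] = 7.0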
To plot the samples, a utility function is defined below to assist in the plotting of discrete values.\n\ndef adjust_lightness(color, amount=0.5):\n import matplotlib.colors as mc\n import colorsys\n try:\n c = mc.cnames[color]\n except:\n c = color\n c = colorsys.rgb_to_hls(*mc.to_rgb(c))\n return colorsys.hls_to_rgb(c[0], c[1] * amount, c[2])\n\ndef plot_ppc_discrete(idata, bins, ax):\n \n def add_discrete_bands(x, lower, upper, ax, **kwargs):\n for i, (l, u) in enumerate(zip(lower, upper)):\n s = slice(i, i + 2)\n ax.fill_between(x[s], [l, l], [u, u], **kwargs)\n\n var_name = list(idata.observed_data.data_vars)[0]\n y_obs = idata.observed_data[var_name].to_numpy()\n \n counts_list = []\n for draw_values in az.extract(idata, \"posterior_predictive\")[var_name].to_numpy().T:\n counts, _ = np.histogram(draw_values, bins=bins)\n counts_list.append(counts)\n counts_arr = np.stack(counts_list)\n\n qts_90 = np.quantile(counts_arr, (0.05, 0.95), axis=0)\n qts_70 = np.quantile(counts_arr, (0.15, 0.85), axis=0)\n qts_50 = np.quantile(counts_arr, (0.25, 0.75), axis=0)\n qts_30 = np.quantile(counts_arr, (0.35, 0.65), axis=0)\n median = np.quantile(counts_arr, 0.5, axis=0)\n\n colors = [adjust_lightness(\"C0\", x) for x in [1.8, 1.6, 1.4, 1.2, 0.9]]\n\n add_discrete_bands(bins, qts_90[0], qts_90[1], ax=ax, color=colors[0])\n add_discrete_bands(bins, qts_70[0], qts_70[1], ax=ax, color=colors[1])\n add_discrete_bands(bins, qts_50[0], qts_50[1], ax=ax, color=colors[2])\n add_discrete_bands(bins, qts_30[0], qts_30[1], ax=ax, color=colors[3])\n\n \n ax.step(bins[:-1], median, color=colors[4], lw=2, where=\"post\")\n ax.hist(y_obs, bins=bins, histtype=\"step\", lw=2, color=\"black\", align=\"mid\")\n handles = [\n Line2D([], [], label=\"Observed data\", color=\"black\", lw=2),\n Line2D([], [], label=\"Posterior predictive median\", color=colors[4], lw=2)\n ]\n ax.legend(handles=handles)\n return ax\n\n\nzip_pps = zip_model.predict(idata=zip_idata, kind=\"pps\", inplace=False)\n\nbins = np.arange(39)\nfig, ax = plt.subplots(figsize=(7, 3))\nax = plot_ppc_discrete(zip_pps, bins, ax)\nax.set_xlabel(\"Number of Fish Caught\")\nax.set_ylabel(\"Count\")\nax.set_title(\"ZIP model - Posterior Predictive Distribution\");\n\n\n\n\nThe model captures the number of zeros accurately. However, the model seems to slightly underestimate the counts 1 and 2. Nonetheless, the plot shows that the model captures the overall distribution of counts reasonably well." }, { - "objectID": "notebooks/multi-level_regression.html", - "href": "notebooks/multi-level_regression.html", + "objectID": "notebooks/zero_inflated_regression.html#hurdle-poisson", + "href": "notebooks/zero_inflated_regression.html#hurdle-poisson", + "title": "Bambi", + "section": "Hurdle poisson", + "text": "Hurdle poisson\nBoth ZIP and hurdle models both use two processes to generate data. The two models differ in their conceptualization of how the zeros are generated. In \\(\\text{ZIPoisson}\\), the zeroes can come from any of the processes, while in the hurdle Poisson they come only from one of the processes. Thus, a hurdle model assumes zero and positive values are generated from two independent processes. In the hurdle model, there are two components: (1) a “structural” process such as a binary model for modeling whether the response variable is zero or not, and (2) a process using a truncated model such as a truncated Poisson for modeling the counts. 
The result of these two components is a distribution that can be described as\n\\[P(Y=0) = 1 - \\psi\\]\n\\[P(Y=y_i) = \\psi \\frac{e^{-\\mu_i}\\mu_{i}^{y_i} / y_i!}{1 - e^{-\\mu_i}} \\ \\text{for} \\ y_i = 1, 2, 3,...,n\\]\nwhere \\(y_i\\) is the outcome, \\(\\mu\\) is the mean of the Poisson process where \\(\\mu \\ge 0\\), and \\(\\psi\\) is the probability of the Poisson process where \\(0 \\lt \\psi \\lt 1\\). The numerator of the second equation is the Poisson probability mass function, and the denominator is one minus the Poisson cumulative distribution function. This is a lot to digest. Again, let’s simulate some data to understand how data is generated from this process.\n\nx = np.arange(0, 22)\npsis = [0.7, 0.4]\nmus = [10, 4]\n\nplt.figure(figsize=(7, 3))\nfor psi, mu in zip(psis, mus):\n pmf = stats.poisson.pmf(x, mu) # pmf evaluated at x given mu\n cdf = stats.poisson.cdf(0, mu) # cdf evaluated at 0 given mu\n pmf[0] = 1 - psi # 1.) generate zeros\n pmf[1:] = (psi * pmf[1:]) / (1 - cdf) # 2.) generate counts\n pmf /= pmf.sum() # normalize to get probabilities\n plt.plot(x, pmf, '-o', label='$\\\\psi$ = {}, $\\\\mu$ = {}'.format(psi, mu))\n\nplt.title(\"Hurdle Poisson Process\")\nplt.xlabel('x', fontsize=12)\nplt.ylabel('f(x)', fontsize=12)\nplt.legend(loc=1)\nplt.show()\n\n\n\n\nThe differences between the ZIP and hurdle models are subtle. Notice how in the code for the hurdle Poisson process, the zero counts are generate by (1 - psi) versus (1 - psi) + pmf[0] for the ZIP process. Additionally, the positive observations are generated by the process (psi * pmf[1:]) / (1 - cdf) where the numerator is a vector of probabilities for positive counts scaled by \\(\\psi\\) and the denominator uses the Poisson cumulative distribution function to evaluate the probability a count is greater than 0.\n\nHurdle regression model\nTo add predictors in the hurdle model, we follow the same specification as in the ZIP regression model section since both models have the same structure. The only difference is that the hurdle model uses a truncated Poisson distribution instead of a ZIP distribution. 
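To make that difference concrete, here is a minimal sketch (an addition for illustration, not the implementation Bambi uses internally) of the hurdle-Poisson log-probability implied by the equations above, keeping the zero part and the zero-truncated count part separate:

import numpy as np
from scipy import stats

def hurdle_poisson_logpmf(y, mu, psi):
    # P(Y = 0) = 1 - psi
    log_zero = np.log1p(-psi)
    # P(Y = y > 0) = psi * Poisson(y; mu) / (1 - Poisson(0; mu))
    log_pos = (
        np.log(psi)
        + stats.poisson.logpmf(y, mu)
        - np.log1p(-np.exp(-mu))
    )
    return np.where(np.asarray(y) == 0, log_zero, log_pos)

print(hurdle_poisson_logpmf([0, 1, 5], mu=4.0, psi=0.7))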
Right away, we will model both the parent and non-parent parameter as a function of the predictors.\n\nhurdle_formula = bmb.Formula(\n \"count ~ livebait + camper + persons + child\", # parent parameter mu\n \"psi ~ livebait + camper + persons + child\" # non-parent parameter psi\n)\n\nhurdle_model = bmb.Model(\n hurdle_formula, \n fish_data, \n family='hurdle_poisson'\n)\n\nhurdle_idata = hurdle_model.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, livebait, camper, persons, child, psi_Intercept, psi_livebait, psi_camper, psi_persons, psi_child]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:06<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.\n\n\n\nhurdle_model\n\n Formula: count ~ livebait + camper + persons + child\n psi ~ livebait + camper + persons + child\n Family: hurdle_poisson\n Link: mu = log\n psi = logit\n Observations: 248\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 9.5283)\n livebait ~ Normal(mu: 0.0, sigma: 7.2685)\n camper ~ Normal(mu: 0.0, sigma: 5.0733)\n persons ~ Normal(mu: 0.0, sigma: 2.2583)\n child ~ Normal(mu: 0.0, sigma: 2.9419)\n target = psi\n Common-level effects\n psi_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n psi_livebait ~ Normal(mu: 0.0, sigma: 1.0)\n psi_camper ~ Normal(mu: 0.0, sigma: 1.0)\n psi_persons ~ Normal(mu: 0.0, sigma: 1.0)\n psi_child ~ Normal(mu: 0.0, sigma: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nhurdle_model.graph()\n\n\n\n\nAs the same link functions are used for ZIP and Hurdle model, the coefficients can be interpreted in a similar manner.\n\naz.summary(\n hurdle_idata,\n var_names=[\"Intercept\", \"livebait\", \"camper\", \"persons\", \"child\"], \n filter_vars=\"like\"\n)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.615\n 0.363\n -2.278\n -0.915\n 0.006\n 0.005\n 3832.0\n 2121.0\n 1.0\n \n \n livebait[1.0]\n 1.661\n 0.329\n 1.031\n 2.273\n 0.005\n 0.004\n 4149.0\n 1871.0\n 1.0\n \n \n camper[1.0]\n 0.271\n 0.100\n 0.073\n 0.449\n 0.001\n 0.001\n 6843.0\n 2934.0\n 1.0\n \n \n persons\n 0.610\n 0.045\n 0.533\n 0.700\n 0.001\n 0.000\n 4848.0\n 3196.0\n 1.0\n \n \n child\n -0.791\n 0.094\n -0.970\n -0.618\n 0.001\n 0.001\n 4371.0\n 3006.0\n 1.0\n \n \n psi_Intercept\n -2.780\n 0.583\n -3.906\n -1.715\n 0.008\n 0.006\n 4929.0\n 3258.0\n 1.0\n \n \n psi_livebait[1.0]\n 0.764\n 0.427\n -0.067\n 1.557\n 0.006\n 0.005\n 5721.0\n 2779.0\n 1.0\n \n \n psi_camper[1.0]\n 0.849\n 0.298\n 0.283\n 1.378\n 0.004\n 0.003\n 5523.0\n 2855.0\n 1.0\n \n \n psi_persons\n 1.040\n 0.183\n 0.719\n 1.396\n 0.003\n 0.002\n 3852.0\n 3007.0\n 1.0\n \n \n psi_child\n -2.003\n 0.282\n -2.555\n -1.517\n 0.004\n 0.003\n 4021.0\n 3183.0\n 1.0\n \n \n\n\n\n\n\nPosterior predictive samples\nAs with the ZIP model above, we plot the posterior predictive distribution against the observed data to see how well the model fits the data.\n\nhurdle_pps = hurdle_model.predict(idata=hurdle_idata, kind=\"pps\", inplace=False)\n\nbins = np.arange(39)\nfig, ax = plt.subplots(figsize=(7, 3))\nax = plot_ppc_discrete(hurdle_pps, bins, 
ax)\nax.set_xlabel(\"Number of Fish Caught\")\nax.set_ylabel(\"Count\")\nax.set_title(\"Hurdle Model - Posterior Predictive Distribution\");\n\n\n\n\nThe plot looks similar to the ZIP model above. Nonetheless, the plot shows that the model captures the overall distribution of counts reasonably well." + }, + { + "objectID": "notebooks/zero_inflated_regression.html#summary", + "href": "notebooks/zero_inflated_regression.html#summary", + "title": "Bambi", + "section": "Summary", + "text": "Summary\nIn this notebook, two classes of models (ZIP and hurdle Poisson) for modeling zero-inflated data were presented and implemented in Bambi. The difference of the data generating process between the two models differ in how zeros are generated. The ZIP model uses a distribution that mixes two data generating processes. The first process generates zeros, and the second process uses a Poisson distribution to generate counts (of which some may be zero). The hurdle Poisson also uses two data generating processes, but doesn’t “mix” them. A process is used for generating zeros such as a binary model for modeling whether the response variable is zero or not, and a second process for modeling the counts. These two proceses are independent of each other.\nThe datset used to demonstrate the two models had a large number of zeros. These zeros appeared because the group doesn’t fish, or because they fished, but caught zero fish. Because zeros could be generated due to two different reasons, the ZIP model, which allows zeros to be generated from a mixture of processes, seems to be more appropriate for this datset.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Mon Sep 25 2023\n\nPython implementation: CPython\nPython version : 3.11.0\nIPython version : 8.13.2\n\nseaborn : 0.12.2\nnumpy : 1.24.2\nscipy : 1.11.2\nbambi : 0.13.0.dev0\nmatplotlib: 3.7.1\narviz : 0.16.1\npandas : 2.1.0\n\nWatermark: 2.3.1" + }, + { + "objectID": "notebooks/shooter_crossed_random_ANOVA.html", + "href": "notebooks/shooter_crossed_random_ANOVA.html", "title": "Bambi", "section": "", - "text": "Hierarchical Linear Regression (Pigs dataset)\n\nimport arviz as az\nimport bambi as bmb\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport statsmodels.api as sm\nimport xarray as xr\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\nIn this notebook we demo how to perform a Bayesian hierarchical linear regression.\nWe’ll use a multi-level dataset included with statsmodels containing the growth curve of pigs. 
Since the weight of each pig is measured multiple times, we’ll estimate a model that allows varying intercepts and slopes for time, for each pig.\n\nLoad data\n\n# Load up data from statsmodels\ndata = sm.datasets.get_rdataset(\"dietox\", \"geepack\").data\ndata.describe()\n\n\n\n\n\n \n \n \n Pig\n Litter\n Start\n Weight\n Feed\n Time\n \n \n \n \n count\n 861.000000\n 861.000000\n 861.000000\n 861.000000\n 789.000000\n 861.000000\n \n \n mean\n 6238.319396\n 12.135889\n 25.672701\n 60.725769\n 80.728645\n 6.480836\n \n \n std\n 1323.845928\n 7.427252\n 3.624336\n 24.978881\n 52.877736\n 3.444735\n \n \n min\n 4601.000000\n 1.000000\n 15.000000\n 15.000000\n 3.300003\n 1.000000\n \n \n 25%\n 4857.000000\n 5.000000\n 23.799990\n 38.299990\n 32.800003\n 3.000000\n \n \n 50%\n 5866.000000\n 11.000000\n 25.700000\n 59.199980\n 74.499996\n 6.000000\n \n \n 75%\n 8050.000000\n 20.000000\n 27.299990\n 81.199950\n 123.000000\n 9.000000\n \n \n max\n 8442.000000\n 24.000000\n 35.399990\n 117.000000\n 224.500000\n 12.000000\n \n \n\n\n\n\n\n\nModel\n\\[\nY_i = \\beta_{0, i} + \\beta_{1, i} X + \\epsilon_i\n\\]\nwith\n\\(\\beta_{0, i} = \\beta_0 + \\alpha_{0, i}\\)\n\\(\\beta_{1, i} = \\beta_1 + \\alpha_{1, i}\\)\nwhere \\(\\beta_0\\) and \\(\\beta_1\\) are usual common intercept and slope you find in a linear regression. \\(\\alpha_{0, i}\\) and \\(\\alpha_{1, i}\\) are the group specific components for the pig \\(i\\), influencing the intercept and the slope respectively. Finally \\(\\epsilon_i\\) is the random error we always see in this type of models, assumed to be Gaussian with mean 0. Note that here we use “common” and “group specific” effects to denote what in many fields are known as “fixed” and “random” effects, respectively.\nWe use the formula syntax to specify the model. Previously, you had to specify common and group specific components separately. Now, thanks to formulae, you can specify model formulas just as you would do with R packages like lme4 and brms. In a nutshell, the term on the left side tells Weight is the response variable, Time on the right-hand side tells we include a main effect for the variable Time, and (Time|Pig) indicates we want to allow a each pig to have its own slope for Time as well as its own intercept (which is implicit). 
If we only wanted different intercepts, we would have written Weight ~ Time + (1 | Pig) and if we wanted slopes specific to each pig without including a pig specific intercept, we would write Weight ~ Time + (0 + Time | Pig).\n\nmodel = bmb.Model(\"Weight ~ Time + (Time|Pig)\", data)\nresults = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Weight_sigma, Intercept, Time, 1|Pig_sigma, 1|Pig_offset, Time|Pig_sigma, Time|Pig_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:25<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 26 seconds.\n\n\nWe can print the model to have a summary of the details\n\nmodel\n\n Formula: Weight ~ Time + (Time|Pig)\n Family: gaussian\n Link: mu = identity\n Observations: 861\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 60.7258, sigma: 133.0346)\n Time ~ Normal(mu: 0, sigma: 18.1283)\n \n Group-level effects\n 1|Pig ~ Normal(mu: 0, sigma: HalfNormal(sigma: 133.0346))\n Time|Pig ~ Normal(mu: 0, sigma: HalfNormal(sigma: 18.1283))\n Auxiliary parameters\n Weight_sigma ~ HalfStudentT(nu: 4, sigma: 24.9644)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nSince we have not specified prior distributions for the parameters in the model, Bambi has chosen sensible defaults for us. We can explore these priors through samples generated from them with a call to Model.plot_priors(), which plots a kernel density estimate for each prior.\n\nmodel.plot_priors();\n\nSampling: [1|Pig_sigma, Intercept, Time, Time|Pig_sigma, Weight_sigma]\n\n\n\n\n\nNow we are ready to check the results. Using az.plot_trace() we get traceplots that show the values sampled from the posteriors and density estimates that gives us an idea of the shape of the posterior distribution of our parameters.\nIn this case it is very convenient to use compact=True. We tell ArviZ to plot all the group specific posteriors in the same panel which saves space and makes it easier to compare group specific posteriors. Thus, we’ll have a panel with all the group specific intercepts, and another panel with all the group specific slopes. If we used compact=False, which is the default, we would end up with a huge number of panels which would make the plot unreadable.\n\n# Plot posteriors\naz.plot_trace(\n results,\n var_names=[\"Intercept\", \"Time\", \"1|Pig\", \"Time|Pig\", \"Weight_sigma\"],\n compact=True,\n);\n\n\n\n\nThe same plot could have been generated with less typing by calling\naz.plot_trace(results, var_names=[\"~1|Pig_sigma\", \"~Time|Pig_sigma\"], compact=True);\nwhich uses an alternative notation to pass var_names based on the negation symbol in Python, ~. There we are telling ArviZ to plot all the variables in the InferenceData object results, except from 1|Pig_sigma and Time|Pig_sigma.\nCan’t believe it? Come on, run this notebook on your side and have a try!\nThe plots generated by az.plot_trace() are enough to be confident that the sampler did a good job and conclude about plausible values for the distribution of each parameter in the model. 
But if we want to, and it is a good idea to do it, we can get umerical summaries for the posteriors with az.summary().\n\naz.summary(results, var_names=[\"Intercept\", \"Time\", \"1|Pig_sigma\", \"Time|Pig_sigma\", \"Weight_sigma\"])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 15.741\n 0.543\n 14.781\n 16.814\n 0.030\n 0.021\n 330.0\n 719.0\n 1.01\n \n \n Time\n 6.944\n 0.084\n 6.802\n 7.108\n 0.005\n 0.004\n 236.0\n 424.0\n 1.03\n \n \n 1|Pig_sigma\n 4.537\n 0.423\n 3.811\n 5.369\n 0.018\n 0.013\n 586.0\n 1161.0\n 1.00\n \n \n Time|Pig_sigma\n 0.662\n 0.063\n 0.546\n 0.774\n 0.003\n 0.002\n 443.0\n 931.0\n 1.00\n \n \n Weight_sigma\n 2.461\n 0.064\n 2.348\n 2.580\n 0.001\n 0.001\n 2534.0\n 1534.0\n 1.00\n \n \n\n\n\n\n\n\nEstimated regression line\nHere we’ll visualize the regression equations we have sampled for a particular pig and then we’ll compare the mean regression equation for all the 72 pigs in the dataset.\nIn the following plot we can see the 2000 linear regressions we have sampled for the pig ‘4601’. The mean regression line is plotted in black and the observed weights for this pig are respresented by the blue dots.\n\n# The ID of the first pig is '4601'\ndata_0 = data[data[\"Pig\"] == 4601][[\"Time\", \"Weight\"]]\ntime = np.array([1, 12])\n\nposterior = az.extract_dataset(results)\nintercept_common = posterior[\"Intercept\"]\nslope_common = posterior[\"Time\"]\n\nintercept_specific_0 = posterior[\"1|Pig\"].sel(Pig__factor_dim=\"4601\")\nslope_specific_0 = posterior[\"Time|Pig\"].sel(Pig__factor_dim=\"4601\")\n\na = (intercept_common + intercept_specific_0)\nb = (slope_common + slope_specific_0)\n\n# make time a DataArray so we can get automatic broadcasting\ntime_xi = xr.DataArray(time)\nplt.plot(time_xi, (a + b * time_xi).T, color=\"C1\", lw=0.1)\nplt.plot(time_xi, a.mean() + b.mean() * time_xi, color=\"black\")\nplt.scatter(data_0[\"Time\"], data_0[\"Weight\"], zorder=2)\nplt.ylabel(\"Weight (kg)\")\nplt.xlabel(\"Time (weeks)\");\n\n/tmp/ipykernel_25969/3021069513.py:5: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior = az.extract_dataset(results)\n\n\n\n\n\nNext, we calculate the mean regression line for each pig and show them together in one plot. Here we clearly see each pig has a different pair of intercept and slope.\n\nintercept_group_specific = posterior[\"1|Pig\"]\nslope_group_specific = posterior[\"Time|Pig\"]\na = intercept_common.mean() + intercept_group_specific.mean(\"sample\")\nb = slope_common.mean() + slope_group_specific.mean(\"sample\")\ntime_xi = xr.DataArray(time)\nplt.plot(time_xi, (a + b * time_xi).T, color=\"C1\", alpha=0.7, lw=0.8)\nplt.ylabel(\"Weight (kg)\")\nplt.xlabel(\"Time (weeks)\");\n\n\n\n\nWe can get credible interval plots with ArviZ. Here the line indicates a 94% credible interval calculated as higher posterior density, the thicker line represents the interquartile range and the dot is the median. 
We can quickly note two things:\n\nThe uncertainty about the intercept estimate is much higher than the uncertainty about the Time slope.\nThe credible interval for Time is far away from 0, so we can be confident there’s a positive relationship the Weight of the pigs and Time.\n\nWe’re not making any great discovering by stating that as time passes we expect the pigs to weight more, but this very simple example can be used as a starting point in applications where the relationship between the variables is not that clear beforehand.\n\naz.plot_forest(\n results,\n var_names=[\"Intercept\", \"Time\"],\n figsize=(8, 2),\n);\n\n\n\n\nWe can also plot the posterior overlaid with a region of practical equivalence (ROPE). This region indicates a range of parameter values that are considered to be practically equivalent to some reference value of interest to the particular application, for example 0. In the following plot we can see that all our posterior distributions fall outside of this range.\n\naz.plot_posterior(results, var_names=[\"Intercept\", \"Time\"], ref_val=0, rope=[-1, 1]);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nmatplotlib : 3.6.2\nxarray : 2022.11.0\nnumpy : 1.23.5\narviz : 0.14.0\nstatsmodels: 0.13.2\nbambi : 0.9.3\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 1234\n\nHere we will analyze a dataset from experimental psychology in which a sample of 36 human participants engaged in what is called the shooter task, yielding 3600 responses and reaction times (100 from each subject). The link above gives some more information about the shooter task, but basically it is a sort of crude first-person-shooter video game in which the subject plays the role of a police officer. The subject views a variety of urban scenes, and in each round or “trial” a person or “target” appears on the screen after some random interval. This person is either Black or White (with 50% probability), and they are holding some object that is either a gun or some other object like a phone or wallet (with 50% probability). When a target appears, the subject has a very brief response window – 0.85 seconds in this particular experiment – within which to press one of two keyboard buttons indicating a “shoot” or “don’t shoot” response. Subjects receive points for correct and timely responses in each trial; subjects’ scores are penalized for incorrect reponses (i.e., shooting an unarmed person or failing to shoot an armed person) or if they don’t respond within the 0.85 response window. 
The goal of the task, from the subject’s perspective, is to maximize their score.\nThe typical findings in the shooter task are that\n\nSubjects are quicker to respond to armed targets than to unarmed targets, but are especially quick toward armed black targets and especially slow toward unarmed black targets.\nSubjects are more likely to shoot black targets than white targets, whether they are armed or not.\n\n\n\n\nshooter = pd.read_csv(\"data/shooter.csv\", na_values=\".\")\nshooter.head(10)\n\n\n\n\n\n \n \n \n subject\n target\n trial\n race\n object\n time\n response\n \n \n \n \n 0\n 1\n w05\n 19\n white\n nogun\n 658.0\n correct\n \n \n 1\n 2\n b07\n 19\n black\n gun\n 573.0\n correct\n \n \n 2\n 3\n w05\n 19\n white\n gun\n 369.0\n correct\n \n \n 3\n 4\n w07\n 19\n white\n gun\n 495.0\n correct\n \n \n 4\n 5\n w15\n 19\n white\n nogun\n 483.0\n correct\n \n \n 5\n 6\n w96\n 19\n white\n nogun\n 786.0\n correct\n \n \n 6\n 7\n w13\n 19\n white\n nogun\n 519.0\n correct\n \n \n 7\n 8\n w06\n 19\n white\n nogun\n 567.0\n correct\n \n \n 8\n 9\n b14\n 19\n black\n gun\n 672.0\n incorrect\n \n \n 9\n 10\n w90\n 19\n white\n gun\n 457.0\n correct\n \n \n\n\n\n\nThe design of the experiment is such that the subject, target, and object (i.e., gun vs. no gun) factors are fully crossed: each subject views each target twice, once with a gun and once without a gun.\n\npd.crosstab(shooter[\"subject\"], [shooter[\"target\"], shooter[\"object\"]])\n\n\n\n\n\n \n \n target\n b01\n b02\n b03\n b04\n b05\n ...\n w95\n w96\n w97\n w98\n w99\n \n \n object\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n ...\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n gun\n nogun\n \n \n subject\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 2\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 3\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 4\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 5\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 6\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 7\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 8\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 9\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 10\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 11\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 12\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 13\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 14\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 15\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 16\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 17\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 18\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 19\n 
1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 20\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 21\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 22\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 23\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 24\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 25\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 26\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 27\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 28\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 29\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 30\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 31\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 32\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 33\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 34\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 35\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n 36\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n ...\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n 1\n \n \n\n36 rows × 100 columns\n\n\n\nThe response speeds on each trial are recorded given as reaction times (milliseconds per response), but here we invert them to and multiply by 1000 so that we are analyzing response rates (responses per second). There is no theoretical reason to prefer one of these metrics over the other, but it turns out that response rates tend to have nicer distributional properties than reaction times (i.e., deviate less strongly from the standard Gaussian assumptions), so response rates will be a little more convenient for us by allowing us to use some fairly standard distributional models.\n\nshooter[\"rate\"] = 1000.0 / shooter[\"time\"]\n\n\nplt.hist(shooter[\"rate\"].dropna());\n\n\n\n\n\n\n\n\n\nOur first model is analogous to how the data from the shooter task are usually analyzed: incorporating all subject-level sources of variability, but ignoring the sampling variability due to the sample of 50 targets. This is a Bayesian generalized linear mixed model (GLMM) with a Normal response and with intercepts and slopes that vary randomly across subjects.\nOf note here is the S(x) syntax, which is from the Formulae library that we use to parse model formulas. This instructs Bambi to use contrast codes of -1 and +1 for the two levels of each of the common factors of race (black vs. white) and object (gun vs. no gun), so that the race and object coefficients can be interpreted as simple effects on average across the levels of the other factor (directly analogous, but not quite equivalent, to the main effects). 
This is the standard coding used in ANOVA.\n\nsubj_model = bmb.Model(\n \"rate ~ S(race) * S(object) + (S(race) * S(object) | subject)\", \n shooter, \n dropna=True\n)\nsubj_fitted = subj_model.fit(random_seed=SEED)\n\nAutomatically removing 98/3600 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [rate_sigma, Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:43<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 44 seconds.\n\n\nFirst let’s visualize the default priors that Bambi automatically decided on for each of the parameters. We do this by calling the .plot_priors() method of the Model object.\n\nsubj_model.plot_priors();\n\nSampling: [1|subject_sigma, Intercept, S(object), S(object)|subject_sigma, S(race), S(race):S(object), S(race):S(object)|subject_sigma, S(race)|subject_sigma, rate_sigma]\n\n\n\n\n\nThe priors on the common effects seem quite reasonable. Recall that because of the -1 vs +1 contrast coding, the coefficients correspond to half the difference between the two levels of each factor. So the priors on the common effects essentially say that the black vs. white and gun vs. no gun (and their interaction) response rate differences are very unlikely to be as large as a full response per second.\nNow let’s visualize the model estimates. We do this by passing the InferenceData object that resulted from the Model.fit() call to az.plot_trace().\n\naz.plot_trace(subj_fitted);\n\n\n\n\nEach distribution in the plots above has 2 densities because we used 2 MCMC chains, so we are viewing the results of all 2 chains prior to their aggregation. The main message from the plot above is that the chains all seem to have converged well and the resulting posterior distributions all look quite reasonable. 
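A quick numerical complement to eyeballing the trace plots (a sketch added here, reusing the subj_fitted object from above): if the chains mixed well, the largest split R-hat across all parameters should sit very close to 1.

# Worst-case split R-hat over every parameter in the model
rhat = az.rhat(subj_fitted)
print(max(float(rhat[v].max()) for v in rhat.data_vars))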
It’s a bit easier to digest all this information in a concise, tabular form, which we can get by passing the object that resulted from the Model.fit() call to az.summary().\n\naz.summary(subj_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.708\n 0.014\n 1.682\n 1.736\n 0.001\n 0.0\n 406.0\n 571.0\n 1.02\n \n \n S(race)[black]\n -0.001\n 0.004\n -0.009\n 0.007\n 0.000\n 0.0\n 3103.0\n 1200.0\n 1.00\n \n \n S(object)[gun]\n 0.093\n 0.006\n 0.082\n 0.105\n 0.000\n 0.0\n 1290.0\n 1237.0\n 1.00\n \n \n S(race):S(object)[black, gun]\n 0.024\n 0.004\n 0.015\n 0.031\n 0.000\n 0.0\n 3353.0\n 1333.0\n 1.00\n \n \n rate_sigma\n 0.240\n 0.003\n 0.234\n 0.245\n 0.000\n 0.0\n 3307.0\n 885.0\n 1.00\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(race):S(object)|subject[black, gun, 32]\n -0.001\n 0.006\n -0.013\n 0.011\n 0.000\n 0.0\n 2663.0\n 1697.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 33]\n -0.000\n 0.006\n -0.013\n 0.012\n 0.000\n 0.0\n 2553.0\n 1503.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 34]\n -0.000\n 0.006\n -0.013\n 0.012\n 0.000\n 0.0\n 3585.0\n 1455.0\n 1.00\n \n \n S(race):S(object)|subject[black, gun, 35]\n 0.000\n 0.006\n -0.012\n 0.011\n 0.000\n 0.0\n 3093.0\n 1745.0\n 1.01\n \n \n S(race):S(object)|subject[black, gun, 36]\n -0.001\n 0.006\n -0.016\n 0.009\n 0.000\n 0.0\n 2359.0\n 1725.0\n 1.00\n \n \n\n153 rows × 9 columns\n\n\n\nThe take-home message from the analysis seems to be that we do find evidence for the usual finding that subjects are especially quick to respond (presumably with a shoot response) to armed black targets and especially slow to respond to unarmed black targets (while unarmed white targets receive “don’t shoot” responses with less hesitation). We see this in the fact that the marginal posterior for the S(race):S(object) interaction coefficient is concentrated strongly away from 0.\n\n\n\nA major flaw in the analysis above is that stimulus specific effects are ignored. The model does include group specific effects for subjects, reflecting the fact that the subjects we observed are but a sample from the broader population of subjects we are interested in and that potentially could have appeared in our study. But the targets we observed – the 50 photographs of white and black men that subjets responded to – are also but a sample from the broader theoretical population of targets we are interested in talking about, and that we could have just as easily and justifiably used as the experimental stimuli in the study. Since the stimuli comprise a random sample, they are subject to sampling variability, and this sampling variability should be accounted in the analysis by including stimulus specific effects. For some more information on this, see here, particularly pages 62-63.\nTo account for this, we let the intercept and slope for object be different for each target. Specific slopes for object across targets are possible because, if you recall, the design of the study was such that each target gets viewed twice by each subject, once with a gun and once without a gun. 
However, because each target is always either white or black, it’s not possible to add group specific slopes for the race factor or the interaction.\n\nstim_model = bmb.Model(\n \"rate ~ S(race) * S(object) + (S(race) * S(object) | subject) + (S(object) | target)\", \n shooter, \n dropna=True\n)\nstim_fitted = stim_model.fit(random_seed=SEED)\n\nAutomatically removing 98/3600 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [rate_sigma, Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset, 1|target_sigma, 1|target_offset, S(object)|target_sigma, S(object)|target_offset]\n\n \n \n 100.00% [4000/4000 00:58<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 60 seconds.\n\n\nNow let’s look at the results…\n\naz.plot_trace(stim_fitted);\n\n\naz.summary(stim_fitted)\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.702\n 0.020\n 1.666\n 1.738\n 0.002\n 0.001\n 158.0\n 331.0\n 1.01\n \n \n S(race)[black]\n -0.001\n 0.013\n -0.026\n 0.024\n 0.001\n 0.001\n 239.0\n 469.0\n 1.00\n \n \n S(object)[gun]\n 0.093\n 0.014\n 0.068\n 0.122\n 0.001\n 0.001\n 200.0\n 394.0\n 1.01\n \n \n S(race):S(object)[black, gun]\n 0.025\n 0.014\n 0.001\n 0.054\n 0.001\n 0.001\n 134.0\n 246.0\n 1.00\n \n \n rate_sigma\n 0.205\n 0.002\n 0.201\n 0.210\n 0.000\n 0.000\n 1698.0\n 1484.0\n 1.00\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(object)|target[gun, w95]\n -0.175\n 0.029\n -0.229\n -0.117\n 0.001\n 0.001\n 458.0\n 706.0\n 1.00\n \n \n S(object)|target[gun, w96]\n 0.080\n 0.031\n 0.026\n 0.137\n 0.001\n 0.001\n 502.0\n 907.0\n 1.00\n \n \n S(object)|target[gun, w97]\n 0.007\n 0.029\n -0.046\n 0.064\n 0.001\n 0.001\n 433.0\n 953.0\n 1.00\n \n \n S(object)|target[gun, w98]\n 0.087\n 0.029\n 0.036\n 0.141\n 0.001\n 0.001\n 466.0\n 1033.0\n 1.00\n \n \n S(object)|target[gun, w99]\n -0.019\n 0.029\n -0.076\n 0.034\n 0.001\n 0.001\n 419.0\n 686.0\n 1.00\n \n \n\n255 rows × 9 columns\n\n\n\nThere are two interesting things to note here. The first is that the key interaction effect, S(race):S(object), is much less clear now. The marginal posterior is still mostly concentrated away from 0, but there’s certainly a nontrivial part that overlaps with 0; 3.1% of the distribution, to be exact.\n\n(stim_fitted.posterior[\"S(race):S(object)\"] < 0).mean()\n\narray(0.031)\n\n\nThe second interesting thing is that the two new variance components in the model, those associated with the stimulus specific effects, are actually rather large. This largely explains the first fact above, since if these were estimated to be close to 0 anyway, the model estimates wouldn’t be much different than they were in the subj_model. 
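To look at just those two new variance components without scanning the full table, we can ask ArviZ for them by name (a small optional sketch; the parameter names are the ones shown in the summary above).

# Posterior summaries for the stimulus-specific standard deviations only
az.summary(stim_fitted, var_names=["1|target_sigma", "S(object)|target_sigma"])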
It makes sense that there is a strong tendency for different targets to elicit difference reaction times on average, which leads to a large estimate of 1|target_sigma.\nLess obviously, the large estimate of S(object)|target_sigma (targets tend to vary a lot in their response rate differences when they have a gun vs. some other object) also makes sense, because in this experiment, different targets were pictured with different non-gun objects. Some of these objects, such as a bright red can of Coca-Cola, are not easily confused with a gun, so subjects are able to quickly decide on the correct response. Other objects, such as a black cell phone, are possibly easier to confuse with a gun, so subjects take longer to decide on the correct response when confronted with this object.\nSince each target is yoked to a particular non-gun object, there is good reason to expect large target-to-target variability in the object effect, which is indeed what we see in the model estimates.\n\n\n\n\nHere we seek evidence of the second traditional finding, that subjects are more likely to response ‘shoot’ toward black targets than toward white targets, regardless of whether they are armed or not. Currently the dataset just records whether the given response was correct or not, so first we transformed this into whether the response was ‘shoot’ or ‘dontshoot’.\n\nshooter[\"shoot_or_not\"] = shooter[\"response\"].astype(str)\n\n# armed targets\nnew_values = {\"correct\": \"shoot\", \"incorrect\": \"dontshoot\", \"timeout\": np.nan}\nshooter.loc[shooter[\"object\"] == \"gun\", \"shoot_or_not\"] = (\n shooter.loc[shooter[\"object\"] == \"gun\", \"response\"].astype(str).replace(new_values)\n)\n \n# unarmed targets\nnew_values = {\"correct\": \"dontshoot\", \"incorrect\": \"shoot\", \"timeout\": np.nan}\nshooter.loc[shooter[\"object\"] == \"nogun\", \"shoot_or_not\"] = (\n shooter.loc[shooter[\"object\"] == \"nogun\", \"response\"].astype(str).replace(new_values)\n)\n \n# view result\nshooter.head(20)\n\n\n\n\n\n \n \n \n subject\n target\n trial\n race\n object\n time\n response\n rate\n shoot_or_not\n \n \n \n \n 0\n 1\n w05\n 19\n white\n nogun\n 658.0\n correct\n 1.519757\n dontshoot\n \n \n 1\n 2\n b07\n 19\n black\n gun\n 573.0\n correct\n 1.745201\n shoot\n \n \n 2\n 3\n w05\n 19\n white\n gun\n 369.0\n correct\n 2.710027\n shoot\n \n \n 3\n 4\n w07\n 19\n white\n gun\n 495.0\n correct\n 2.020202\n shoot\n \n \n 4\n 5\n w15\n 19\n white\n nogun\n 483.0\n correct\n 2.070393\n dontshoot\n \n \n 5\n 6\n w96\n 19\n white\n nogun\n 786.0\n correct\n 1.272265\n dontshoot\n \n \n 6\n 7\n w13\n 19\n white\n nogun\n 519.0\n correct\n 1.926782\n dontshoot\n \n \n 7\n 8\n w06\n 19\n white\n nogun\n 567.0\n correct\n 1.763668\n dontshoot\n \n \n 8\n 9\n b14\n 19\n black\n gun\n 672.0\n incorrect\n 1.488095\n dontshoot\n \n \n 9\n 10\n w90\n 19\n white\n gun\n 457.0\n correct\n 2.188184\n shoot\n \n \n 10\n 11\n w91\n 19\n white\n nogun\n 599.0\n correct\n 1.669449\n dontshoot\n \n \n 11\n 12\n b17\n 19\n black\n nogun\n 769.0\n correct\n 1.300390\n dontshoot\n \n \n 12\n 13\n b04\n 19\n black\n nogun\n 600.0\n correct\n 1.666667\n dontshoot\n \n \n 13\n 14\n w17\n 19\n white\n nogun\n 653.0\n correct\n 1.531394\n dontshoot\n \n \n 14\n 15\n b93\n 19\n black\n gun\n 468.0\n correct\n 2.136752\n shoot\n \n \n 15\n 16\n w96\n 19\n white\n gun\n 546.0\n correct\n 1.831502\n shoot\n \n \n 16\n 17\n w91\n 19\n white\n gun\n 591.0\n incorrect\n 1.692047\n dontshoot\n \n \n 17\n 18\n b95\n 19\n black\n gun\n NaN\n timeout\n 
NaN\n NaN\n \n \n 18\n 19\n b09\n 19\n black\n gun\n 656.0\n correct\n 1.524390\n shoot\n \n \n 19\n 20\n b02\n 19\n black\n gun\n 617.0\n correct\n 1.620746\n shoot\n \n \n\n\n\n\nLet’s skip straight to the correct model that includes stimulus specific effects. This looks quite similiar to the stim_model from above except that we change the response to the new shoot_or_not variable – notice the [shoot] syntax indicating that we wish to model the prbability that shoot_or_not=='shoot', not shoot_or_not=='dontshoot' – and then change to family='bernoulli' to indicate a mixed effects logistic regression.\n\nstim_response_model = bmb.Model(\n \"shoot_or_not[shoot] ~ S(race)*S(object) + (S(race)*S(object) | subject) + (S(object) | target)\",\n shooter,\n family=\"bernoulli\",\n dropna=True\n)\n\n# Note we increased target_accept from default 0.8 to 0.9 because there were divergences\nstim_response_fitted = stim_response_model.fit(\n draws=2000, \n target_accept=0.9,\n random_seed=SEED\n)\n\nAutomatically removing 98/3600 rows from the dataset.\nModeling the probability that shoot_or_not==shoot\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, S(race), S(object), S(race):S(object), 1|subject_sigma, 1|subject_offset, S(race)|subject_sigma, S(race)|subject_offset, S(object)|subject_sigma, S(object)|subject_offset, S(race):S(object)|subject_sigma, S(race):S(object)|subject_offset, 1|target_sigma, 1|target_offset, S(object)|target_sigma, S(object)|target_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 01:49<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 110 seconds.\n\n\nShow the trace plot\n\naz.plot_trace(stim_response_fitted);\n\n\n\n\nLooks pretty good! Now for the more concise summary.\n\naz.summary(stim_response_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.021\n 0.151\n -0.309\n 0.262\n 0.002\n 0.002\n 4864.0\n 2794.0\n 1.0\n \n \n S(race)[black]\n 0.224\n 0.145\n -0.032\n 0.508\n 0.002\n 0.002\n 5123.0\n 3256.0\n 1.0\n \n \n S(object)[gun]\n 4.172\n 0.248\n 3.724\n 4.636\n 0.005\n 0.003\n 2687.0\n 2887.0\n 1.0\n \n \n S(race):S(object)[black, gun]\n 0.200\n 0.170\n -0.120\n 0.516\n 0.003\n 0.002\n 3508.0\n 3153.0\n 1.0\n \n \n 1|subject_sigma\n 0.222\n 0.151\n 0.000\n 0.486\n 0.004\n 0.003\n 1734.0\n 2208.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n S(object)|target[gun, w95]\n 0.349\n 0.598\n -0.749\n 1.497\n 0.007\n 0.009\n 8476.0\n 3118.0\n 1.0\n \n \n S(object)|target[gun, w96]\n 0.030\n 0.554\n -0.997\n 1.062\n 0.006\n 0.010\n 7719.0\n 2645.0\n 1.0\n \n \n S(object)|target[gun, w97]\n 0.310\n 0.582\n -0.734\n 1.439\n 0.008\n 0.010\n 5782.0\n 2261.0\n 1.0\n \n \n S(object)|target[gun, w98]\n 0.344\n 0.584\n -0.637\n 1.525\n 0.007\n 0.008\n 7543.0\n 3183.0\n 1.0\n \n \n S(object)|target[gun, w99]\n 0.017\n 0.548\n -0.993\n 1.069\n 0.006\n 0.009\n 8789.0\n 3061.0\n 1.0\n \n \n\n254 rows × 9 columns\n\n\n\nThere is some slight evidence here for the hypothesis that subjects are more likely to shoot the black targets, regardless of whether they are armed or not, but the evidence is not too strong. 
The marginal posterior for the S(race) coefficient is mostly concentrated away from 0, but in this case it overlaps with 0 even more than the key interaction effect in the previous model did.\n\n(stim_response_fitted.posterior[\"S(race)\"] < 0).mean()\n\narray(0.06275)\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\narviz : 0.14.0\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\nbambi : 0.9.3\n\nWatermark: 2.3.1"
Consider the following scalar inputs:\n\\[w: \\text{the input of interest}\\] \\[c: \\text{all the other inputs}\\] \\[X = (w, c)\\]\nSuppose for the input of interest, we are interested in comparing \\(w^{\\text{high}}\\) to \\(w^{\\text{low}}\\) (perhaps age = \\(60\\) and \\(40\\) respectively) with all other inputs \\(c\\) held constant. The predictive difference in the outcome changing only \\(w\\) is:\n\\[\\text{average predictive difference} = \\mathbb{E}(y|w^{\\text{high}}, c, \\theta) - \\mathbb{E}(y|w^{\\text{low}}, c, \\theta)\\]\nSelecting the maximum and minimum values of \\(w\\) and averaging over all other inputs \\(c\\) in the data gives you a new “hypothetical” dataset and corresponds to counting all pairs of transitions of \\((w^\\text{low})\\) to \\((w^\\text{high})\\), i.e., differences in \\(w\\) with \\(c\\) held constant. The difference between these two terms is the average predictive difference.\n\n\nThe objective of comparisons and plot_comparisons is to compute the expected difference in the outcome corresponding to three different scenarios for \\(w\\) and \\(c\\) where \\(w\\) is either provided by the user, else a default value is computed by Bambi (described in the default values section). The three scenarios are:\n\nuser provided values for \\(c\\).\na grid of equally spaced and central values for \\(c\\).\nempirical distribution (original data used to fit the model) for \\(c\\).\n\nIn the case of (1) and (2) above, Bambi assembles all pairwise combinations (transitions) of \\(w\\) and \\(c\\) into a new “hypothetical” dataset. In (3), Bambi uses the original \\(c\\), but replaces \\(w\\) with the user provided value or the default value computed by Bambi. In each scenario, predictions are made on the data using the fitted model. Once the predictions are made, comparisons are computed using the posterior samples by taking the difference in the predicted outcome for each pair of transitions. The average of these differences is the average predictive difference.\nThus, the goal of comparisons and plot_comparisons is to provide the modeler with a summary and visualization of the average predictive difference. Below, we demonstrate how to compute and plot average predictive differences with comparisons and plot_comparions using several examples.\n\nimport arviz as az\nimport numpy as np\nimport pandas as pd\n\n\nimport bambi as bmb\nfrom bambi.plots import comparisons, plot_comparisons\n\n\n\n\n\nWe model and predict how many fish are caught by visitors at a state park using survey data. Many visitors catch zero fish, either because they did not fish at all, or because they were unlucky. We would like to explicitly model this bimodal behavior (zero versus non-zero) using a Zero Inflated Poisson model, and to compare how different inputs of interest \\(w\\) and other covariate values \\(c\\) are associated with the number of fish caught. The dataset contains data on 250 groups that went to a state park to fish. 
Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), if they used a live bait and whether or not they brought a camper to the park (camper).\n\nfish_data = pd.read_stata(\"http://www.stata-press.com/data/r11/fish.dta\")\ncols = [\"count\", \"livebait\", \"camper\", \"persons\", \"child\"]\nfish_data = fish_data[cols]\nfish_data[\"livebait\"] = pd.Categorical(fish_data[\"livebait\"])\nfish_data[\"camper\"] = pd.Categorical(fish_data[\"camper\"])\n\n\nfish_model = bmb.Model(\n \"count ~ livebait + camper + persons + child\", \n fish_data, \n family='zero_inflated_poisson'\n)\n\nfish_idata = fish_model.fit(\n draws=1000, \n target_accept=0.95, \n random_seed=1234, \n chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [count_psi, Intercept, livebait, camper, persons, child]\n\n\n |████████████████████████████████| 100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.\n\n\n\n\nFirst, an example of scenario 1 (user provided values) is given below. In both plot_comparisons and comparisons, \\(w\\) and \\(c\\) are represented by contrast and conditional, respectively. The modeler has the ability to pass their own values for contrast and conditional by using a dictionary where the key-value pairs are the covariate and value(s) of interest. For example, if we wanted to compare the number of fish caught for \\(4\\) versus \\(1\\) persons conditional on a range of child and livebait values, we would pass the following dictionary in the code block below. By default, for \\(w\\), Bambi compares \\(w^\\text{high}\\) to \\(w^\\text{low}\\). Thus, in this example, \\(w^\\text{high}\\) = 4 and \\(w^\\text{low}\\) = 1. The user is not limited to passing a list for the values. A np.array can also be used. Furthermore, Bambi by default, maps the order of the dict keys to the main, group, and panel of the matplotlib figure. Below, since child is the first key, this is used for the x-axis, and livebait is used for the group (color). If a third key was passed, it would be used for the panel (facet).\n\nfig, ax = plot_comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast={\"persons\": [1, 4]},\n conditional={\"child\": [0, 1, 2], \"livebait\": [0, 1]},\n) \nfig.set_size_inches(7, 3)\n\n\n\n\nThe plot above shows that, comparing \\(4\\) to \\(1\\) persons given \\(0\\) children and using livebait, the expected difference is about \\(26\\) fish. When not using livebait, the expected difference decreases substantially to about \\(5\\) fish. 
Using livebait with a group of people is associated with a much larger expected difference in the number of fish caught.\ncomparisons can be called to view a summary dataframe that includes the term \(w\) and its contrast, the specified conditional covariate, and the expected difference in the outcome with the uncertainty interval (by default the 94% highest density interval is computed).\n\ncomparisons(\n model=fish_model,\n idata=fish_idata,\n contrast={\"persons\": [1, 4]},\n conditional={\"child\": [0, 1, 2], \"livebait\": [0, 1]},\n) \n\n\n \n \n \n term\n estimate_type\n value\n child\n livebait\n camper\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n persons\n diff\n (1.0, 4.0)\n 0.0\n 0.0\n 1.0\n 4.834472\n 2.563472\n 7.037150\n \n \n 1\n persons\n diff\n (1.0, 4.0)\n 0.0\n 1.0\n 1.0\n 26.423188\n 23.739729\n 29.072748\n \n \n 2\n persons\n diff\n (1.0, 4.0)\n 1.0\n 0.0\n 1.0\n 1.202003\n 0.631629\n 1.780965\n \n \n 3\n persons\n diff\n (1.0, 4.0)\n 1.0\n 1.0\n 1.0\n 6.571943\n 5.469275\n 7.642248\n \n \n 4\n persons\n diff\n (1.0, 4.0)\n 2.0\n 0.0\n 1.0\n 0.301384\n 0.143676\n 0.467608\n \n \n 5\n persons\n diff\n (1.0, 4.0)\n 2.0\n 1.0\n 1.0\n 1.648417\n 1.140415\n 2.187190\n \n \n\n\n\n\nBut why is camper also in the summary dataframe? This is because, in order to perform predictions, Bambi expects a value for each covariate used to fit the model. Additionally, with GLM models, average predictive comparisons are conditional in the sense that the estimate depends on the values of all the covariates in the model. Thus, for unspecified covariates, comparisons and plot_comparisons compute a default value (mean or mode based on the data type of the covariate). Here, \(c\) = child, livebait, camper. Each row in the summary dataframe is read as “comparing \(4\) to \(1\) persons conditional on \(c\), the expected difference in the outcome is \(y\).”\n\n\n\nUsers can also perform comparisons on multiple contrast values. 
For example, if we wanted to compare the number of fish caught between \\((1, 2)\\), \\((1, 4)\\), and \\((2, 4)\\) persons conditional on a range of values for child and livebait.\n\nmultiple_values = comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast={\"persons\": [1, 2, 4]},\n conditional={\"child\": [0, 1, 2], \"livebait\": [0, 1]}\n)\n\nmultiple_values\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n child\n livebait\n camper\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n persons\n diff\n (1, 2)\n 0.0\n 0.0\n 1.0\n 0.527627\n 0.295451\n 0.775465\n \n \n 1\n persons\n diff\n (1, 2)\n 0.0\n 1.0\n 1.0\n 2.883694\n 2.605690\n 3.177685\n \n \n 2\n persons\n diff\n (1, 2)\n 1.0\n 0.0\n 1.0\n 0.131319\n 0.067339\n 0.195132\n \n \n 3\n persons\n diff\n (1, 2)\n 1.0\n 1.0\n 1.0\n 0.717965\n 0.592968\n 0.857893\n \n \n 4\n persons\n diff\n (1, 2)\n 2.0\n 0.0\n 1.0\n 0.032960\n 0.015212\n 0.052075\n \n \n 5\n persons\n diff\n (1, 2)\n 2.0\n 1.0\n 1.0\n 0.180270\n 0.123173\n 0.244695\n \n \n 6\n persons\n diff\n (1, 4)\n 0.0\n 0.0\n 1.0\n 4.834472\n 2.563472\n 7.037150\n \n \n 7\n persons\n diff\n (1, 4)\n 0.0\n 1.0\n 1.0\n 26.423188\n 23.739729\n 29.072748\n \n \n 8\n persons\n diff\n (1, 4)\n 1.0\n 0.0\n 1.0\n 1.202003\n 0.631629\n 1.780965\n \n \n 9\n persons\n diff\n (1, 4)\n 1.0\n 1.0\n 1.0\n 6.571943\n 5.469275\n 7.642248\n \n \n 10\n persons\n diff\n (1, 4)\n 2.0\n 0.0\n 1.0\n 0.301384\n 0.143676\n 0.467608\n \n \n 11\n persons\n diff\n (1, 4)\n 2.0\n 1.0\n 1.0\n 1.648417\n 1.140415\n 2.187190\n \n \n 12\n persons\n diff\n (2, 4)\n 0.0\n 0.0\n 1.0\n 4.306845\n 2.267097\n 6.280005\n \n \n 13\n persons\n diff\n (2, 4)\n 0.0\n 1.0\n 1.0\n 23.539494\n 20.990931\n 26.240169\n \n \n 14\n persons\n diff\n (2, 4)\n 1.0\n 0.0\n 1.0\n 1.070683\n 0.565931\n 1.585718\n \n \n 15\n persons\n diff\n (2, 4)\n 1.0\n 1.0\n 1.0\n 5.853978\n 4.858957\n 6.848519\n \n \n 16\n persons\n diff\n (2, 4)\n 2.0\n 0.0\n 1.0\n 0.268423\n 0.124033\n 0.412274\n \n \n 17\n persons\n diff\n (2, 4)\n 2.0\n 1.0\n 1.0\n 1.468147\n 1.024800\n 1.960934\n \n \n\n\n\n\nNotice how the contrast \\(w\\) varies while the covariates \\(c\\) are held constant. Currently, however, plotting multiple contrast values can be difficult to interpret since the contrast is “abstracted” away onto the y-axis. Thus, it would be difficult to interpret which portion of the plot corresponds to which contrast value. Therefore, it is currently recommended that if you want to plot multiple contrast values, call comparisons directly to obtain the summary dataframe and plot the results yourself.\n\n\n\nNow, we move onto scenario 2 described above (grid of equally spaced and central values) in computing average predictive comparisons. You are not required to pass values for contrast and conditional. If you do not pass values, Bambi will compute default values for you. Below, it is described how these default values are computed.\nThe default value for contrast is a centered difference at the mean for a contrast variable with a numeric dtype, and unique levels for a contrast varaible with a categorical dtype. For example, if the modeler is interested in the comparison of a \\(5\\) unit increase in \\(w\\) where \\(w\\) is a numeric variable, Bambi computes the mean and then subtracts and adds \\(2.5\\) units to the mean to obtain a centered difference. By default, if no value is passed for the contrast covariate, Bambi computes a one unit centered difference at the mean. 
For example, if only contrast=\"persons\" is passed, then \\(\\pm\\) \\(0.5\\) is applied to the mean of persons. If \\(w\\) is a categorical variable, Bambi computes and returns the unique levels. For example, if \\(w\\) has levels [“high scool”, “vocational”, “university”], Bambi computes and returns the unique values of this variable.\nThe default values for conditional are more involved. Currently, by default, if a dict or list is passed to conditional, Bambi uses the ordering (keys if dict and elements if list) to determine which covariate to use as the main, group (color), and panel (facet) variable. This is the same logic used in plot_comparisons described above. Subsequently, the default values used for the conditional covariates depend on their ordering and dtype. Below, the psuedocode used for computing default values covariates passed to conditional is outlined:\nif v == \"main\":\n \n if v == numeric:\n return np.linspace(v.min(), v.max(), 50)\n elif v == categorical:\n return np.unique(v)\n\nelif v == \"group\":\n \n if v == numeric:\n return np.quantile(v, np.linspace(0, 1, 5))\n elif v == categorical:\n return np.unique(v)\n\nelif v == \"panel\":\n \n if v == numeric:\n return np.quantile(v, np.linspace(0, 1, 5))\n elif v == categorical:\n return np.unique(v)\nThus, letting Bambi compute default values for conditional is equivalent to creating a hypothetical “data grid” of new values. Lets say we are interested in comparing the number of fish caught for the contrast livebait conditional on persons and child. This time, lets call comparisons first to gain an understanding of the data generating the plot.\n\ncontrast_df = comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=[\"persons\", \"child\"],\n)\n\ncontrast_df.head(10)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n persons\n child\n camper\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n livebait\n diff\n (0.0, 1.0)\n 1.000000\n 0.0\n 1.0\n 1.694646\n 1.252803\n 2.081207\n \n \n 1\n livebait\n diff\n (0.0, 1.0)\n 1.000000\n 1.0\n 1.0\n 0.422448\n 0.299052\n 0.551766\n \n \n 2\n livebait\n diff\n (0.0, 1.0)\n 1.000000\n 3.0\n 1.0\n 0.026923\n 0.012752\n 0.043035\n \n \n 3\n livebait\n diff\n (0.0, 1.0)\n 1.061224\n 0.0\n 1.0\n 1.787412\n 1.342979\n 2.203158\n \n \n 4\n livebait\n diff\n (0.0, 1.0)\n 1.061224\n 1.0\n 1.0\n 0.445555\n 0.317253\n 0.580117\n \n \n 5\n livebait\n diff\n (0.0, 1.0)\n 1.061224\n 3.0\n 1.0\n 0.028393\n 0.013452\n 0.045276\n \n \n 6\n livebait\n diff\n (0.0, 1.0)\n 1.122449\n 0.0\n 1.0\n 1.885270\n 1.422937\n 2.313218\n \n \n 7\n livebait\n diff\n (0.0, 1.0)\n 1.122449\n 1.0\n 1.0\n 0.469929\n 0.335373\n 0.609249\n \n \n 8\n livebait\n diff\n (0.0, 1.0)\n 1.122449\n 3.0\n 1.0\n 0.029944\n 0.014165\n 0.047593\n \n \n 9\n livebait\n diff\n (0.0, 1.0)\n 1.183674\n 0.0\n 1.0\n 1.988500\n 1.501650\n 2.424762\n \n \n\n\n\n\nAs livebait was encoded as a categorical dtype, Bambi returned the unique levels of \\([0, 1]\\) for the contrast. persons and child were passed as the first and second element and thus act as the main and group variables, respectively. It can be see from the output above, that an equally spaced grid was used to compute the values for persons, whereas a quantile based grid was used for child. Furthermore, as camper was unspecified, the mode was used as the default value. 
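To make this default-value logic concrete, here is a rough sketch of how an equivalent grid could be assembled by hand with NumPy and pandas; it mirrors the pseudocode above and is not Bambi’s internal implementation.

# Main (numeric) covariate: 50 equally spaced values over the observed range
persons_grid = np.linspace(fish_data["persons"].min(), fish_data["persons"].max(), 50)
# Group (numeric) covariate: five quantile-based values
child_grid = np.quantile(fish_data["child"], np.linspace(0, 1, 5))
# Unspecified covariate: the mode (most frequent level) of the observed data
camper_default = fish_data["camper"].mode().iloc[0]
print(persons_grid[:5], np.unique(child_grid), camper_default)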
Lets go ahead and plot the commparisons.\n\nfig, ax = plot_comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=[\"persons\", \"child\"],\n) \nfig.set_size_inches(7, 3)\n\n\n\n\nThe plot shows us that the expected differences in fish caught comparing a group of people who use livebait and no livebait is not only conditional on the number of persons, but also children. However, the plotted comparisons for child = \\(3\\) is difficult to interpret on a single plot. Thus, it can be useful to pass specific group and panel arguments to aid in the interpretation of the plot. Therefore, subplot_kwargs allows the user to manipulate the plotting by passing a dictionary where the keys are {\"main\": ..., \"group\": ..., \"panel\": ...} and the values are the names of the covariates to be plotted. Below, we plot the same comparisons as above, but this time we specify group and panel to both be child.\n\nfig, ax = plot_comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=[\"persons\", \"child\"],\n subplot_kwargs={\"main\": \"persons\", \"group\": \"child\", \"panel\": \"child\"},\n fig_kwargs={\"figsize\":(12, 3), \"sharey\": True},\n legend=False\n) \n\n\n\n\n\n\n\nEvaluating average predictive comparisons at central values for the conditional covariates \\(c\\) can be problematic when the inputs have a large variance since no single central value (mean, median, etc.) is representative of the covariate. This is especially true when \\(c\\) exhibits bi or multimodality. Thus, it may be desireable to use the empirical distribution of \\(c\\) to compute the predictive comparisons, and then average over a specific or set of covariates to obtain the average predictive comparisons. To achieve unit level contrasts, do not pass a parameter into conditional and or specify None.\n\nunit_level = comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n)\n\n# empirical distribution\nprint(unit_level.shape[0] == fish_model.data.shape[0])\nunit_level.head(10)\n\nTrue\n\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n camper\n child\n persons\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n livebait\n diff\n (0.0, 1.0)\n 0.0\n 0.0\n 1.0\n 0.864408\n 0.627063\n 1.116105\n \n \n 1\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 0.0\n 1.0\n 1.694646\n 1.252803\n 2.081207\n \n \n 2\n livebait\n diff\n (0.0, 1.0)\n 0.0\n 0.0\n 1.0\n 0.864408\n 0.627063\n 1.116105\n \n \n 3\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 1.0\n 2.0\n 1.009094\n 0.755449\n 1.249551\n \n \n 4\n livebait\n diff\n (0.0, 1.0)\n 0.0\n 0.0\n 1.0\n 0.864408\n 0.627063\n 1.116105\n \n \n 5\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 2.0\n 4.0\n 1.453235\n 0.964674\n 1.956434\n \n \n 6\n livebait\n diff\n (0.0, 1.0)\n 0.0\n 1.0\n 3.0\n 1.233247\n 0.900295\n 1.569891\n \n \n 7\n livebait\n diff\n (0.0, 1.0)\n 0.0\n 3.0\n 4.0\n 0.188019\n 0.090328\n 0.289560\n \n \n 8\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 2.0\n 3.0\n 0.606361\n 0.390571\n 0.818549\n \n \n 9\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 0.0\n 1.0\n 1.694646\n 1.252803\n 2.081207\n \n \n\n\n\n\n\n# empirical (observed) data used to fit the model\nfish_model.data.head(10)\n\n\n\n\n\n \n \n \n count\n livebait\n camper\n persons\n child\n \n \n \n \n 0\n 0.0\n 0.0\n 0.0\n 1.0\n 0.0\n \n \n 1\n 0.0\n 1.0\n 1.0\n 1.0\n 0.0\n \n \n 2\n 0.0\n 1.0\n 0.0\n 1.0\n 0.0\n \n \n 3\n 0.0\n 1.0\n 1.0\n 2.0\n 1.0\n \n \n 4\n 1.0\n 1.0\n 0.0\n 1.0\n 0.0\n \n \n 5\n 0.0\n 1.0\n 1.0\n 4.0\n 2.0\n \n \n 
6\n 0.0\n 1.0\n 0.0\n 3.0\n 1.0\n \n \n 7\n 0.0\n 1.0\n 0.0\n 4.0\n 3.0\n \n \n 8\n 0.0\n 0.0\n 1.0\n 3.0\n 2.0\n \n \n 9\n 1.0\n 1.0\n 1.0\n 1.0\n 0.0\n \n \n\n\n\n\nAbove, unit_level is the comparisons summary dataframe and fish_model.data is the empirical data. Notice how the values for \\(c\\) are identical in both dataframes. However, for \\(w\\), the values are different. However, these unit level contrasts are difficult to interpret as each row corresponds to that unit’s contrast. Therefore, it is useful to average over (marginalize) the estimates to summarize the unit level predictive comparisons.\n\n\nSince the empirical distrubution is used for computing the average predictive comparisons, the same number of rows (250) is returned as the data used to fit the model. To average over a covariate, use the average_by argument. If True is passed, then comparisons averages over all covariates. Else, if a single or list of covariates are passed, then comparisons averages by the covariates passed.\n\n# marginalize over all covariates\ncomparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n average_by=True\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n livebait\n diff\n (0.0, 1.0)\n 3.649691\n 2.956185\n 4.333621\n \n \n\n\n\n\nPassing True to average_by averages over all covariates and is equivalent to taking the mean of the estimate and uncertainty columns. For example:\n\nunit_level = comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n)\n\nunit_level[[\"estimate\", \"lower_3.0%\", \"upper_97.0%\"]].mean()\n\nestimate 3.649691\nlower_3.0% 2.956185\nupper_97.0% 4.333621\ndtype: float64\n\n\n\n\n\nAveraging over all covariates may not be desired, and you would rather average by a group or specific covariate. To perform averaging by subgroups, users can pass a single or list of covariates to average_by to average over specific covariates. 
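As a sanity check, averaging by a covariate should be equivalent to grouping the unit-level dataframe by that covariate and taking the mean of the estimate and interval columns (a sketch with pandas; the column names follow the output above).

# Manual counterpart of average_by="persons" on the unit-level comparisons
unit_level.groupby("persons")[["estimate", "lower_3.0%", "upper_97.0%"]].mean()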
For example, if we wanted to average by persons:\n\n# average by number of persons\ncomparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n average_by=\"persons\"\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n persons\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 1.374203\n 1.011290\n 1.708711\n \n \n 1\n livebait\n diff\n (0.0, 1.0)\n 2.0\n 1.963362\n 1.543330\n 2.376636\n \n \n 2\n livebait\n diff\n (0.0, 1.0)\n 3.0\n 3.701510\n 3.056586\n 4.357385\n \n \n 3\n livebait\n diff\n (0.0, 1.0)\n 4.0\n 7.358662\n 6.047642\n 8.655654\n \n \n\n\n\n\n\n# average by number of persons and camper by passing a list\ncomparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n average_by=[\"persons\", \"camper\"]\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n persons\n camper\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 0.0\n 0.864408\n 0.627063\n 1.116105\n \n \n 1\n livebait\n diff\n (0.0, 1.0)\n 1.0\n 1.0\n 1.694646\n 1.252803\n 2.081207\n \n \n 2\n livebait\n diff\n (0.0, 1.0)\n 2.0\n 0.0\n 1.424598\n 1.078389\n 1.777154\n \n \n 3\n livebait\n diff\n (0.0, 1.0)\n 2.0\n 1.0\n 2.344439\n 1.872191\n 2.800661\n \n \n 4\n livebait\n diff\n (0.0, 1.0)\n 3.0\n 0.0\n 2.429459\n 1.871578\n 2.964242\n \n \n 5\n livebait\n diff\n (0.0, 1.0)\n 3.0\n 1.0\n 4.443540\n 3.747840\n 5.170052\n \n \n 6\n livebait\n diff\n (0.0, 1.0)\n 4.0\n 0.0\n 3.541921\n 2.686445\n 4.391176\n \n \n 7\n livebait\n diff\n (0.0, 1.0)\n 4.0\n 1.0\n 10.739204\n 9.024702\n 12.432764\n \n \n\n\n\n\nIt is still possible to use plot_comparisons when passing an argument to average_by. In the plot below, the empirical distribution is used to compute unit level contrasts for livebait and then averaged over persons to obtain the average predictive comparisons. The plot below is similar to the second plot in this notebook. The differences being that: (1) a pairwise transition grid is defined for the second plot above, whereas the empirical distribution is used in the plot below, and (2) in the plot below, we marginalized over the other covariates in the model (thus the reason for not having a camper or child group and panel, and a reduction in the uncertainty interval).\n\nfig, ax = plot_comparisons(\n model=fish_model,\n idata=fish_idata,\n contrast=\"livebait\",\n conditional=None,\n average_by=\"persons\"\n)\nfig.set_size_inches(7, 3)\n\n\n\n\n\n\n\n\n\nTo showcase an additional functionality of comparisons and plot_comparisons, we fit a logistic regression model to the titanic dataset with interaction terms to model the probability of survival. The titanic dataset gives the values of four categorical attributes for each of the 2201 people on board the Titanic when it struck an iceberg and sank. 
The attributes are social class (first class, second class, third class, crewmember), age, sex (0 = female, 1 = male), and whether or not the person survived (0 = deceased, 1 = survived).\n\ndat = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Titanic.csv\", index_col=0)\n\ndat[\"PClass\"] = dat[\"PClass\"].str.replace(\"[st, nd, rd]\", \"\", regex=True)\ndat[\"PClass\"] = dat[\"PClass\"].str.replace(\"*\", \"0\").astype(int)\ndat[\"PClass\"] = dat[\"PClass\"].replace(0, np.nan)\ndat[\"PClass\"] = pd.Categorical(dat[\"PClass\"], ordered=True)\ndat[\"SexCode\"] = pd.Categorical(dat[\"SexCode\"], ordered=True)\n\ndat = dat.dropna(axis=0, how=\"any\")\n\n\ntitanic_model = bmb.Model(\n \"Survived ~ PClass * SexCode * Age\", \n data=dat, \n family=\"bernoulli\"\n)\ntitanic_idata = titanic_model.fit(draws=1000, target_accept=0.95, random_seed=1234)\n\nModeling the probability that Survived==1\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, PClass, SexCode, PClass:SexCode, Age, PClass:Age, SexCode:Age, PClass:SexCode:Age]\n\n\n |████████████████████████████████| 100.00% [8000/8000 00:15<00:00 Sampling 4 chains, 0 divergences]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 16 seconds.\n\n\n\n\ncomparisons and plot_comparisons also allow you to specify the type of comparison to be computed. By default, a difference is used. However, it is also possible to take the ratio where comparisons would then become average predictive ratios. To achieve this, pass \"ratio\" into the argument comparison_type. Using different comparison types offers a way to produce alternative insights; especially when there are interaction terms as the value of one covariate depends on the value of the other covariate.\n\nfig, ax = plot_comparisons(\n model=titanic_model,\n idata=titanic_idata,\n contrast={\"PClass\": [1, 3]},\n conditional=[\"Age\", \"SexCode\"],\n comparison_type=\"ratio\",\n subplot_kwargs={\"main\": \"Age\", \"group\": \"SexCode\", \"panel\": \"SexCode\"},\n fig_kwargs={\"figsize\":(12, 3), \"sharey\": True},\n legend=False\n\n)\n\n\n\n\nThe left panel shows that the ratio of the probability of survival comparing PClass \\(3\\) to \\(1\\) conditional on Age is non-constant. Whereas the right panel shows an approximately constant ratio in the probability of survival comparing PClass \\(3\\) to \\(1\\) conditional on Age." }, { - "objectID": "notebooks/hsgp_1d.html", - "href": "notebooks/hsgp_1d.html", + "objectID": "notebooks/hsgp_2d.html", + "href": "notebooks/hsgp_2d.html", "title": "Bambi", "section": "", - "text": "This article demonstrates the how to use Bambi with Gaussian Processes with 1 dimensional predictors. 
Bambi supports Gaussian Processes through the approximation known as Hilbert Space Gaussian Processes (HSGP).\nHSGP is a framework that falls under the class of low-rank approximations that are based on forming a basis function approximation with \\(m\\) basis functions, where \\(m\\) is usually much less smaller than \\(n\\), the number of observations.\nFor references see Hilbert Space Methods for Reduced-Rank Gaussian Process Regression and Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.\n\nfrom formulae import design_matrices\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom bambi.plots import plot_cap\nfrom matplotlib.lines import Line2D\n\n\n\nLet’s get started simulating some data from a smooth function. The goal is to fit a normal likelihood model where a Gaussian process term contributes to the mean.\n\nrng = np.random.default_rng(seed=121195)\n\nsize = 100\nx = np.linspace(0, 50, size)\nb = 0.1 * rng.normal(size=6)\nsigma = 0.15\n\ndm = design_matrices(\"0 + bs(x, df=6, intercept=True)\", pd.DataFrame({\"x\": x}))\nX = np.array(dm.common)\nf = 10 * X @ b\ny = f + rng.normal(size=size) * sigma\ndf = pd.DataFrame({\"x\": x, \"y\": y})\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.scatter(x, y, s=30, alpha=0.8);\nax.plot(x, f, color=\"black\");\n\n\n\n\nNow let’s simply create and fit the model. We use the hsgp to initialize a HSGP term in the model formula. Notice we pass the variable x and values for two other arguments m and c that we’ll cover later.\n\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2)\", df)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ InverseGamma(alpha: 3.0, beta: 2.0)\n \n Auxiliary parameters\n y_sigma ~ HalfStudentT(nu: 4.0, sigma: 0.2745)\n\n\nIn the model description we can see the contribution of the HSGP term. It consists of two things: the name of the covariance kernel and the priors for its parameters. In this case, it’s an Exponentiated Quadratic covariance kernel with parameters sigma (amplitude) and ell (lengthscale). The prior for the amplitude is Exponential(1) and the prior for the lengthscale is InverseGamma(3, 2).\n\nidata = model.fit(inference_method=\"nuts_numpyro\", random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\n/home/tomas/anaconda3/envs/bambi_hsgp/lib/python3.10/site-packages/pymc/sampling/jax.py:39: UserWarning: This module is experimental.\n warnings.warn(\"This module is experimental.\")\n\n\nCompiling...\n\n\nNo GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n\n\nCompilation time = 0:00:02.804363\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:04.776557\nTransforming variables...\nTransformation time = 0:00:00.521686\n527\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nThe fit is horrible. To fix that we can use better priors. But before doing that, it’s important to note that HSGP terms have a unique characteristic in that they do not receive priors themselves. Rather, the associated parameters of an HSGP term, such as sigma and ell, are the ones that are assigned priors. 
Therefore, we need to assign the HSGP term a dictionary of priors instead of a single prior.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=2), # amplitude\n \"ell\": bmb.Prior(\"InverseGamma\", mu=10, sigma=1) # lengthscale\n}\n\n# This is the dictionary we pass to Bambi\npriors = {\n \"hsgp(x, m=10, c=2)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=10)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2)\", df, priors=priors)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2)\n cov: ExpQuad\n sigma ~ Exponential(lam: 2.0)\n ell ~ InverseGamma(mu: 10.0, sigma: 1.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 10.0)\n\n\nNotice the priors were updated in the model summary. Now we’re ready to fit the model!\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9, random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\nCompiling...\nCompilation time = 0:00:02.378503\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:05.336123\nTransforming variables...\nTransformation time = 0:00:00.174204\n7\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nThe marginal posteriors look somehow better, but we still have lots of divergences. What else can we do? Change the parametrization!\nThe hsgp() function has a centered argument which defaults to False and thus Bambi uses a non-centered parametrization by default. But we can change that actually. Let’s try it!\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=2), # amplitude\n \"ell\": bmb.Prior(\"InverseGamma\", mu=10, sigma=1) # lengthscale\n}\n\n# This is the dictionary we pass to Bambi\npriors = {\n \"hsgp(x, m=10, c=2, centered=True)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=10)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2, centered=True)\", df, priors=priors)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2, centered=True)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2, centered=True)\n cov: ExpQuad\n sigma ~ Exponential(lam: 2.0)\n ell ~ InverseGamma(mu: 10.0, sigma: 1.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 10.0)\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9, random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\nCompiling...\nCompilation time = 0:00:02.560797\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:04.839103\nTransforming variables...\nTransformation time = 0:00:00.028475\n0\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nAwesome! That looks much better now.\nWe still get all the nice things from Bambi when using GPs. An example of this is the plot_cap() function which enables us to generate a visualization of the adjusted mean with credible bands automatically.\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.scatter(df[\"x\"], df[\"y\"], s=30, color=\"0.5\", alpha=0.5)\nplot_cap(model, idata, \"x\", ax=ax);\nax.set(xlabel=\"Predictor\", ylabel=\"Observed\");\n\n\n\n\nAnd on top of that, it’s possible to get draws from the posterior predictive distribution and plot credible bands for it. 
All we need is the .predict() method from the model class.\n\nnew_data = pd.DataFrame({\"x\": np.linspace(0, 50, num=500)})\nmodel.predict(idata, kind=\"pps\", data=new_data)\npps = idata.posterior_predictive[\"y\"].to_numpy().reshape(4000, 500)\nqts = np.quantile(pps, q=(0.025, 0.975), axis=0)\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.fill_between(new_data[\"x\"], qts[0], qts[1], color=\"C0\", alpha=0.6)\nax.scatter(df[\"x\"], df[\"y\"], s=30, color=\"C1\", alpha=0.9)\nax.plot(x, f, color=\"black\", ls=\"--\")\nax.set(xlabel=\"Predictor\", ylabel=\"Observed\");\n\nhandles = [Line2D([], [], color=\"black\", ls=\"--\"), Line2D([], [], color=\"C0\")]\nlabels = [\"True curve\", \"Posterior predictive distribution\"]\nax.legend(handles, labels);\n\n\n\nhsgp() is a transformation that is available in the namespace where the model formula is evaluated. In plain English, hsgp() is like a function you can use in your model formulas. You don’t need to worry about the details; Bambi knows how to handle them. But if you still want to see the actual code, you can have a look at the implementation of the HSGP class in bambi/transformations.py.\nWhat users do need to care about are the arguments the hsgp() transformation supports. There are a bunch of arguments that can be passed after the variable number of non-keyword arguments representing the variables of the HSGP contribution. Below is a brief overview of these arguments and their respective descriptions.\n\nm: The number of basis vectors\nL: The boundary of the variable space\nc: The proportion extension factor\nby: This argument specifies the values of a variable used for grouping. It is used to create a HSGP term by group. If left unspecified, the default value is None, which means that there is no group variable and all observations belong to the same group.\ncov: This argument specifies the name of the covariance function to be used. The default value is \"ExpQuad\".\nshare_cov: Determines whether the same covariance function is shared across all groups. This argument is relevant only when by is not None and the default value is True.\nscale: When set to True, the predictors are rescaled such that the largest Euclidean distance between two points is 1. This adjustment often improves the sampling speed and convergence.\niso: Determines whether to use an isotropic or non-isotropic Gaussian Process. With an isotropic GP, the same level of smoothing is applied to all predictors, while an anisotropic GP allows different levels of smoothing for individual predictors. Note that this argument is ignored if only one predictor is provided. The default value is True.\ndrop_first: Whether to exclude the first basis vector or not. The default value is False.\ncentered: Whether to use the centered or the non-centered parametrization. Defaults to False.\n\nThe parameters m, L and c are directly related to the basis vectors of the HSGP approximation. If you want to know more about m, L, and/or c, it’s recommended to have a look at the documentation of the HSGP class in PyMC.\n\nSo far, we showcased how to use m, c and centered. In the remainder of this article we’re going to see how by and share_cov are used when we add a GP contribution by groups.\n\n\n\nIn this section we fit a model with a HSGP contribution by levels of a categorical variable. 
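Before loading the data for this example, here is a compact, purely illustrative call that combines two of the arguments not demonstrated elsewhere in this article, scale and drop_first. The settings are arbitrary, it reuses the df simulated at the beginning of the article, and creating the model does not sample from it.

# Illustrative only: rescale the predictor and drop the first basis vector
bmb.Model("y ~ 0 + hsgp(x, m=10, c=2, scale=True, drop_first=True)", df)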
The data was simulated with the gamSim() function from the R package {mgcv} by Simon Wood.\n\ndata = pd.read_csv(\"data/gam_data.csv\")\ndata[\"fac\"] = pd.Categorical(data[\"fac\"])\ndata.head()[[\"x2\", \"y\", \"fac\"]]\n\n\n \n \n \n x2\n y\n fac\n \n \n \n \n 0\n 0.497183\n 3.085274\n 3\n \n \n 1\n 0.196003\n -2.250410\n 2\n \n \n 2\n 0.958474\n 0.070548\n 3\n \n \n 3\n 0.972759\n -0.230454\n 1\n \n \n 4\n 0.755836\n 2.173497\n 2\n \n \n\n\n\n\nLet’s visualize x2 versus y for the different levels in fac.\n\nfig, ax = plt.subplots(figsize=(9, 5))\ncolors = [f\"C{i}\" for i in pd.Categorical(data[\"fac\"]).codes]\nax.scatter(data[\"x2\"], data[\"y\"], color=colors, alpha=0.6)\nax.set(xlabel=\"x2\", ylabel=\"y\");\n\n\n\n\nWe can observe that the relation between x2 and y can be approximated by a smooth non-linear curve for all groups.\nBelow, we create the model with Bambi. The biggest difference is that we’re passing by=fac in the hsgp() call. This is all we need to ask Bambi to create multiple GP contribution terms, one per group.\nAnother trick that was not shown previously is the usage of an alias. .set_alias() from the Model class allows us to have more readable and shorter names for the components of a model. As you’ll see below, it makes a huge difference when displaying summaries or visualizations for the parameters of the model.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"Exponential\", lam=3)\n}\npriors = {\n \"hsgp(x2, by=fac, m=12, c=1.5)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=1)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x2, by=fac, m=12, c=1.5)\": \"hsgp\"})\nmodel\n\n Formula: y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5)\n Family: gaussian\n Link: mu = identity\n Observations: 300\n Priors: \n target = mu\n HSGP contributions\n hsgp(x2, by=fac, m=12, c=1.5)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ Exponential(lam: 3.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 1.0)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\nSee how much nicer the names for the HSGP contribution parameters are with the alias!\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.95, random_seed=121195)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:03.565702\nSampling...\n\nSampling time = 0:00:06.818602\nTransforming variables...\nTransformation time = 0:00:00.885410\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_weights\", \"hsgp_sigma\", \"hsgp_ell\", \"y_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nThis time we got no divergences and good mixing and nice convergence in our first try (or perhaps it wasn’t the first try!). One thing that stands out is the shape of the marginal posteriors for some of the beta parameters (the weights of the basis). This may indicate our approximation is using more basis vectors than what’s really needed.\nNote: At this point we have used the term ‘basis vector’ several times. This concept is very close to the concept of ‘basis functions’. The difference is that the ‘basis vector’ is a ‘basis function’ already evaluated at a set of points. In this case, the set of points is made by the values of the numerical predictor x2.\nDo you remember how easy it was to use plot_cap() above? Should it be harder now? 
Well, the answer will surprise you: No!\nAll we need to do is passing a second variable name which is mapped to the color in the visualization. Voilà!\n\nfig, ax = plt.subplots(figsize = (9, 5))\ncolors = [f\"C{i}\" for i in pd.Categorical(data[\"fac\"]).codes]\nax.scatter(data[\"x2\"], data[\"y\"], color=colors, alpha=0.6)\nplot_cap(model, idata, [\"x2\", \"fac\"], ax=ax);\n\n\n\n\nWe can go one step further and modify the model to use different covariance functions for the different groups. For that purpose, we pass share_cov=False. As always, Bambi takes care of all the details.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n \"ell\": bmb.Prior(\"Exponential\", lam=3)\n}\npriors = {\n \"hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=1)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\": \"hsgp\"})\nmodel\n\n Formula: y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 300\n Priors: \n target = mu\n HSGP contributions\n hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ Exponential(lam: 3.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 1.0)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\nHave a closer look at the model graph. See that the hsgp_sigma and hsgp_ell parameters are no longer scalar. There are three of each, one for each group.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.95, random_seed=121195)\n\nCompiling...\nCompilation time = 0:00:04.396845\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:07.743907\nTransforming variables...\nTransformation time = 0:00:00.519422\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"y_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nIn fact, we can see not all the groups have similar posteriors for the covariance function parameters when they are allowed to vary.\nBefore closing the article, it’s worth looking at a particular but not uncommon pattern when using the HSGP approximation. Let’s have a look at the posterior distributions for the weights of the basis.\n\naz.plot_trace(idata, var_names=[\"hsgp_weights\"], backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nLooks like some distributions are extremely flat, and others are extremely tight around zero.\nTo investigate this further we can manually plot the posterior for the first J basis vectors and see what they look like.\n\nbasis_n = 6\nfig, axes = plt.subplots(3, 1, figsize = (7, 10))\nfor i in range(3):\n ax = axes[i]\n values = idata.posterior[\"hsgp_weights\"].sel({\"hsgp_by\": i + 1})\n for j in range(basis_n):\n az.plot_kde(\n values.sel({\"hsgp_weights_dim\": j}).to_numpy().flatten(), \n ax=ax, \n plot_kwargs={\"color\": f\"C{j}\"}\n );\n\n\n\n\nIndeed, we can see that, at least for the first group, the posterior of the weights start being too tight around zero when we consider the 6th basis vector. 
But the posteriors for the weights of the previous basis vectors look nice.\nTo confirm our thought, let’s increase the value of basis_n to 9 and see what happens.\n\nbasis_n = 9\nfig, axes = plt.subplots(3, 1, figsize = (7, 10))\nfor i in range(3):\n ax = axes[i]\n values = idata.posterior[\"hsgp_weights\"].sel({\"hsgp_by\": i + 1})\n for j in range(basis_n):\n az.plot_kde(\n values.sel({\"hsgp_weights_dim\": j}).to_numpy().flatten(), \n ax=ax, \n plot_kwargs={\"color\": f\"C{j}\"}\n );\n\n\n\n\nAlright, that’s too spiky! Nonetheless, we don’t see that happening for the third group yet, indicating the higher number of basis vectors is more appropriate for this group." + "text": "This article demonstrates how to use Bambi with Gaussian Processes with 2 dimensional predictors. Bambi supports Gaussian Processes through the low-rank approximation known as Hilbert Space Gaussian Processes. For references see Hilbert Space Methods for Reduced-Rank Gaussian Process Regression and Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.\nFor a demonstration of Gaussian Processes in 1D together with a more in depth explanation see To Do.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport pymc as pm\n\nThe goal of this notebook is to showcase Bambi’s support for Gaussian Processes on two-dimensional data using the HSGP approximation.\nTo achieve this, we begin by creating a matrix of coordinates that will serve as the locations where we measure the values of a continuous response variable.\n\nx1 = np.linspace(0, 10, 12)\nx2 = np.linspace(0, 10, 12)\nxx, yy = np.meshgrid(x1, x2)\nX = np.column_stack([xx.flatten(), yy.flatten()])\nX.shape\n\n(144, 2)\n\n\n\n\nIn modeling multi-dimensional data with a Gaussian Process, we must choose between using an isotropic or an anisotropic Gaussian Process. An isotropic GP applies the same degree of smoothing to all predictors and is rotationally invariant. On the other hand, an anisotropic GP assigns different degrees of smoothing to each predictor and is not rotationally invariant.\nFurthermore, as the hsgp() function allows for the creation of separate GP contribution terms for the levels of a categorical variable through its by argument, we also examine both single-group and multiple-group scenarios.\n\n\nWe create a covariance kernel using ExpQuad from the gp submodule in PyMC. Note that the lengthscale and amplitude for both dimensions are 2 and 1.2, respectively. Then, we simply use NumPy to get a random draw from the 144-dimensional multivariate normal distribution.\n\nrng = np.random.default_rng(1234)\n\nell = 2\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\nprint(mu.shape, K.shape)\n\nf = rng.multivariate_normal(mu, K)\n\nfig, ax = plt.subplots()\nax.scatter(xx, yy, c=f, s=900, marker=\"s\");\n\n(144,) (144, 144)\n\n\n\n\n\nSince Bambi works with long-format data frames, we need to reshape our data before creating the data frame.\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 1),\n \"y\": np.tile(yy.flatten(), 1), \n \"outcome\": f.flatten()\n }\n)\n\nNow, let’s construct the model. 
The only notable distinction from the one-dimensional case is that we provide two unnamed arguments to the hsgp() function, representing the predictors on each dimension.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, c=1.5, m=10)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, c=1.5, m=10)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, c=1.5, m=10)\": \"hsgp\"})\nmodel\n\n Formula: outcome ~ 0 + hsgp(x, y, c=1.5, m=10)\n Family: gaussian\n Link: mu = identity\n Observations: 144\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, c=1.5, m=10)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\nThe parameters c and m of the HSGP aproximation are specific to each dimension, and can have different values for each. However, as we are passing scalars instead of sequences, Bambi will internally recycle them, causing the HSGP approximation to use the same values of c and m for both dimensions.\nLet’s build the internal PyMC model and create a graph to have a visual representation of the relationships between the model parameters.\n\nmodel.build()\nmodel.graph()\n\n\n\n\nAnd finally, we quickly fit the model and show a traceplot to explore the posterior and spot any issues with the sampler.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\n/home/tomas/anaconda3/envs/bambi_hsgp/lib/python3.10/site-packages/pymc/sampling/jax.py:39: UserWarning: This module is experimental.\n warnings.warn(\"This module is experimental.\")\n\n\nCompiling...\n\n\nNo GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n\n\nCompilation time = 0:00:02.522713\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:23.313351\nTransforming variables...\nTransformation time = 0:00:00.628279\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nWe don’t see any divergences. However, the autocorrelation in the chains for the covariance function parameters, along with the insufficient mixing, indicates that there may be an issue with the prior specification of the model.\nSince the goal of the notebook is to simply show what features Bambi supports and how to use them, we won’t further investigate these issues. 
However, such posteriors shouldn’t be considered in any serious application.\nFrom now on, the notebook will follow the same structure as the one already shown, which consists of\n\nData simulation with some specific settings\nCreation of the Bambi model\nBuilding of the internal PyMC model and visualization of the graph\nModel fit and inspection of the traceplot\n\n\n\n\nIn this scenario we have multiple groups that share the same covariance function.\n\nrng = np.random.default_rng(123)\n\nell = 2\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K, 3)\n\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\nNotice we don’t modify anything substantial in the call to hsgp() for now.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.721842\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:02:17.782596\nTransforming variables...\nTransformation time = 0:00:00.838094\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nWhile we have three groups, we only have one hsgp_sigma and one hsgp_ell for all groups. This is because, by default, the HSGP contributions by groups use the same instance of the covariance function.\n\n\n\nAgain we have multiple groups. 
But this time, each group has specific values for the amplitude and the lengthscale.\n\nrng = np.random.default_rng(12)\n\nsigmas = [1.2, 1.5, 1.8]\nells = [1.5, 2, 3]\n\nsamples = []\nfor sigma, ell in zip(sigmas, ells):\n cov = sigma * pm.gp.cov.ExpQuad(2, ls=ell)\n K = cov(X).eval()\n mu = np.zeros(X.shape[0])\n samples.append(rng.multivariate_normal(mu, K))\n\nf = np.stack(samples)\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\nIn situations like this, we can tell Bambi not to use the same covariance function for all the groups with share_cov=False and Bambi will create a separate instance for each group, resulting in group specific estimates of the amplitude and the lengthscale.\nNotice, however, we’re still using the same kind of covariance function, which in this case is ExpQuad.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\n \"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\", \n data, \n priors=priors\n)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\nSee the all the HSGP related parameters gained the new dimension hsgp_by.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:04.491697\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:02:35.274256\nTransforming variables...\nTransformation time = 0:00:00.801181\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nUnlike the previous case, now there are three hsgp_sigma and three hsgp_ell parameters, one per group. We can see them in different colors in the visualization.\n\n\n\n\nIn this second part we repeat exactly the same that we did for the isotropic case. First, we start with a single group. Then, we continue with multiple groups that share the covariance function. And finally, multiple groups with different covariance functions. 
The main difference is that we use iso=False, which asks to use an anisotropic GP.\n\n\n\nrng = np.random.default_rng(1234)\n\nell = [2, 0.9]\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K)\n\nfig, ax = plt.subplots(figsize = (4.5, 4.5))\nax.scatter(xx, yy, c=f, s=900, marker=\"s\");\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 1),\n \"y\": np.tile(yy.flatten(), 1), \n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, c=1.5, m=10, iso=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, c=1.5, m=10, iso=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, c=1.5, m=10, iso=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, c=1.5, m=10, iso=False)\n Family: gaussian\n Link: mu = identity\n Observations: 144\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, c=1.5, m=10, iso=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\nAlthough there is only one group in this case, the graph includes a hsgp_var dimension. This dimension represents the variables in the HSGP component, indicating that there is one lengthscale parameter per variable.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.320646\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:06.159032\nTransforming variables...\nTransformation time = 0:00:00.173091\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nrng = np.random.default_rng(123)\n\nell = [2, 0.9]\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K, 3)\n\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, iso=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, iso=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, iso=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", 
target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.464203\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:17.674547\nTransforming variables...\nTransformation time = 0:00:00.249682\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nrng = np.random.default_rng(12)\n\nsigmas = [1.2, 1.5, 1.8]\nells = [[1.5, 0.8], [2, 1.5], [3, 1]]\n\nsamples = []\nfor sigma, ell in zip(sigmas, ells):\n cov = sigma * pm.gp.cov.ExpQuad(2, ls=ell)\n K = cov(X).eval()\n mu = np.zeros(X.shape[0])\n samples.append(rng.multivariate_normal(mu, K))\n\nf = np.stack(samples)\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\n \"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\", \n data, \n priors=priors\n)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:03.955870\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:20.713181\nTransforming variables...\nTransformation time = 0:00:00.513813\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nFor this final demonstration we’re going to use a simulated dataset where the outcome is a count variable. 
For the predictors, we have the location in terms of the latitude and longitude, as well as other variables such as the year of the measurement, the site where the measure was made, and one continuous predictor.\n\ndata = pd.read_csv(\"data/poisson_data.csv\")\ndata[\"Year\"] = pd.Categorical(data[\"Year\"])\nprint(data.shape)\ndata.head()\n\n(100, 6)\n\n\n\n\n\n\n \n \n \n Year\n Count\n Site\n Lat\n Lon\n X1\n \n \n \n \n 0\n 2015\n 4\n Site1\n 47.559880\n 7.216754\n 3.316140\n \n \n 1\n 2016\n 0\n Site1\n 47.257079\n 7.135390\n 2.249612\n \n \n 2\n 2015\n 0\n Site1\n 47.061967\n 7.804383\n 2.835283\n \n \n 3\n 2016\n 0\n Site1\n 47.385533\n 7.433145\n 2.776692\n \n \n 4\n 2015\n 1\n Site1\n 47.034987\n 7.434643\n 2.295769\n \n \n\n\n\n\nWe can visualize the outcome variable by location and year.\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 4))\nfor i, (ax, year) in enumerate(zip(axes, [2015, 2016])):\n mask = data[\"Year\"] == year\n x = data.loc[mask, \"Lat\"]\n y = data.loc[mask, \"Lon\"]\n count = data.loc[mask, \"Count\"]\n ax.scatter(x, y, c=count, s=30, marker=\"s\")\n ax.set_title(f\"Year {year}\")\n\n\n\n\nThere’s not much we can conclude from here but it’s not a problem. The most relevant part of the example is not the data itself, but how to use Bambi to include GP components in a complex model.\nIt’s very easy to create a model that uses both regular common and group-specific predictors as well as a GP contribution term. We just add them to the model formula, treat hsgp() as any other call, and that’s it!\nBelow we have common effects for the Year, the interaction between X1 and Year, and group-specific intercepts by Site. Finally, we add hsgp() as any other call.\n\nformula = \"Count ~ 0 + Year + X1:Year + (1|Site) + hsgp(Lon, Lat, by=Year, m=5, c=1.5)\"\nmodel = bmb.Model(formula, data, family=\"poisson\")\nmodel\n\n Formula: Count ~ 0 + Year + X1:Year + (1|Site) + hsgp(Lon, Lat, by=Year, m=5, c=1.5)\n Family: poisson\n Link: mu = log\n Observations: 100\n Priors: \n target = mu\n Common-level effects\n Year ~ Normal(mu: [0. 0.], sigma: [5. 5.])\n X1:Year ~ Normal(mu: [0. 0.], sigma: [1.5693 1.4766])\n \n Group-level effects\n 1|Site ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 5.3683))\n \n HSGP contributions\n hsgp(Lon, Lat, by=Year, m=5, c=1.5)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ InverseGamma(alpha: 3.0, beta: 2.0)\n\n\nLet’s use an alias to make the graph representation more readable.\n\nmodel.set_alias({\"hsgp(Lon, Lat, by=Year, m=5, c=1.5)\": \"gp\"})\nmodel.build()\nmodel.graph()\n\n\n\n\nAnd finally, let’s fit the model.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.99)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:04.433012\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:09.698066\nTransforming variables...\nTransformation time = 0:00:00.668909\n15\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"gp_sigma\", \"gp_ell\", \"gp_weights\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nNotice the posteriors for the gp_weights are all centered at zero. 
This is a symptom of the absence of any spatial effect.\n\naz.plot_trace(\n idata, \n var_names=[\"Year\", \"X1:Year\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"1|Site\", \"1|Site_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);" }, { - "objectID": "notebooks/negative_binomial.html", - "href": "notebooks/negative_binomial.html", + "objectID": "notebooks/quantile_regression.html", + "href": "notebooks/quantile_regression.html", "title": "Bambi", "section": "", - "text": "I always experience some kind of confusion when looking at the negative binomial distribution after a while of not working with it. There are so many different definitions that I usually need to read everything more than once. The definition I’ve first learned, and the one I like the most, says as follows: The negative binomial distribution is the distribution of a random variable that is defined as the number of independent Bernoulli trials until the k-th “success”. In short, we repeat a Bernoulli experiment until we observe k successes and record the number of trials it required.\n\\[\nY \\sim \\text{NB}(k, p)\n\\]\nwhere \\(0 \\le p \\le 1\\) is the probability of success in each Bernoulli trial, \\(k > 0\\), usually integer, and \\(y \\in \\{k, k + 1, \\cdots\\}\\)\nThe probability mass function (pmf) is\n\\[\np(y | k, p)= \\binom{y - 1}{y-k}(1 -p)^{y - k}p^k\n\\]\nIf you, like me, find it hard to remember whether \\(y\\) starts at \\(0\\), \\(1\\), or \\(k\\), try to think twice about the definition of the variable. But how? First, recall we aim to have \\(k\\) successes. And success is one of the two possible outcomes of a trial, so the number of trials can never be smaller than the number of successes. Thus, we can be confident to say that \\(y \\ge k\\).\nBut this is not the only way of defining the negative binomial distribution, there are plenty of options! One of the most interesting, and the one you see in PyMC3, the library we use in Bambi for the backend, is as a continuous mixture. The negative binomial distribution describes a Poisson random variable whose rate is also a random variable (not a fixed constant!) following a gamma distribution. Or in other words, conditional on a gamma-distributed variable \\(\\mu\\), the variable \\(Y\\) has a Poisson distribution with mean \\(\\mu\\).\nUnder this alternative definition, the pmf is\n\\[\n\\displaystyle p(y | k, \\alpha) = \\binom{y + \\alpha - 1}{y} \\left(\\frac{\\alpha}{\\mu + \\alpha}\\right)^\\alpha\\left(\\frac{\\mu}{\\mu + \\alpha}\\right)^y\n\\]\nwhere \\(\\mu\\) is the parameter of the Poisson distribution (the mean, and variance too!) and \\(\\alpha\\) is the rate parameter of the gamma.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom scipy.stats import nbinom\n\n\naz.style.use(\"arviz-darkgrid\")\n\n\nimport warnings\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\nIn SciPy, the definition of the negative binomial distribution differs a little from the one in our introduction. They define \\(Y\\) = Number of failures until k successes and then \\(y\\) starts at 0. 
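A small optional check, not part of the original example, may help connect the mixture definition above with SciPy’s parametrization: if the Poisson rate follows a Gamma distribution with shape \\(k\\) and scale \\((1-p)/p\\) (the standard mixture identity, stated here as an assumption), the marginal distribution of the counts matches nbinom(k, p).\n\nimport numpy as np\nfrom scipy import stats\n\nrng = np.random.default_rng(0)\nk, p = 3, 0.3\n# Draw rates from the Gamma mixing distribution, then Poisson counts given each rate.\nlam = rng.gamma(shape=k, scale=(1 - p) / p, size=200_000)\ndraws = rng.poisson(lam)\n# The Monte Carlo estimate and the exact pmf should be close.\nprint(np.mean(draws == 5), stats.nbinom.pmf(5, k, p))\n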
In the following plot, we have the probability of observing \\(y\\) failures before we see \\(k=3\\) successes.\n\ny = np.arange(0, 30)\nk = 3\np1 = 0.5\np2 = 0.3\n\n\nfig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)\n\nax[0].bar(y, nbinom.pmf(y, k, p1))\nax[0].set_xticks(np.linspace(0, 30, num=11))\nax[0].set_title(f\"k = {k}, p = {p1}\")\n\nax[1].bar(y, nbinom.pmf(y, k, p2))\nax[1].set_xticks(np.linspace(0, 30, num=11))\nax[1].set_title(f\"k = {k}, p = {p2}\")\n\nfig.suptitle(\"Y = Number of failures until k successes\", fontsize=16);\n\n\n\n\nFor example, when \\(p=0.5\\), the probability of seeing \\(y=0\\) failures before 3 successes (or in other words, the probability of having 3 successes out of 3 trials) is 0.125, and the probability of seeing \\(y=3\\) failures before 3 successes is 0.156.\n\nprint(nbinom.pmf(y, k, p1)[0])\nprint(nbinom.pmf(y, k, p1)[3])\n\n0.12499999999999997\n0.15624999999999992\n\n\nFinally, if one wants to show this probability mass function as if we are following the first definition of negative binomial distribution we introduced, we just need to shift the whole thing to the right by adding \\(k\\) to the \\(y\\) values.\n\nfig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)\n\nax[0].bar(y + k, nbinom.pmf(y, k, p1))\nax[0].set_xticks(np.linspace(3, 30, num=10))\nax[0].set_title(f\"k = {k}, p = {p1}\")\n\nax[1].bar(y + k, nbinom.pmf(y, k, p2))\nax[1].set_xticks(np.linspace(3, 30, num=10))\nax[1].set_title(f\"k = {k}, p = {p2}\")\n\nfig.suptitle(\"Y = Number of trials until k successes\", fontsize=16);\n\n\n\n\n\n\n\nThe negative binomial distribution belongs to the exponential family, and the canonical link function is\n\\[\ng(\\mu_i) = \\log\\left(\\frac{\\mu_i}{k + \\mu_i}\\right) = \\log\\left(\\frac{k}{\\mu_i} + 1\\right)\n\\]\nbut it is difficult to interpret. The log link is usually preferred because of the analogy with Poisson model, and it also tends to give better results.\n\n\n\nThis example is based on this UCLA example.\nSchool administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math. We have attendance data on 314 high school juniors.\nThe variables of insterest in the dataset are\n\ndaysabs: The number of days of absence. It is our response variable.\nprogr: The type of program. 
Can be one of ‘General’, ‘Academic’, or ‘Vocational’.\nmath: Score in a standardized math test.\n\n\ndata = pd.read_stata(\"https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta\")\n\n\ndata.head()\n\n\n\n\n\n \n \n \n id\n gender\n math\n daysabs\n prog\n \n \n \n \n 0\n 1001.0\n male\n 63.0\n 4.0\n 2.0\n \n \n 1\n 1002.0\n male\n 27.0\n 4.0\n 2.0\n \n \n 2\n 1003.0\n female\n 20.0\n 2.0\n 2.0\n \n \n 3\n 1004.0\n female\n 16.0\n 3.0\n 2.0\n \n \n 4\n 1005.0\n female\n 2.0\n 3.0\n 2.0\n \n \n\n\n\n\nWe assign categories to the values 1, 2, and 3 of our \"prog\" variable.\n\ndata[\"prog\"] = data[\"prog\"].map({1: \"General\", 2: \"Academic\", 3: \"Vocational\"})\ndata.head()\n\n\n\n\n\n \n \n \n id\n gender\n math\n daysabs\n prog\n \n \n \n \n 0\n 1001.0\n male\n 63.0\n 4.0\n Academic\n \n \n 1\n 1002.0\n male\n 27.0\n 4.0\n Academic\n \n \n 2\n 1003.0\n female\n 20.0\n 2.0\n Academic\n \n \n 3\n 1004.0\n female\n 16.0\n 3.0\n Academic\n \n \n 4\n 1005.0\n female\n 2.0\n 3.0\n Academic\n \n \n\n\n\n\nThe Academic program is the most popular program (167/314) and General is the least popular one (40/314)\n\ndata[\"prog\"].value_counts()\n\nAcademic 167\nVocational 107\nGeneral 40\nName: prog, dtype: int64\n\n\nLet’s explore the distributions of math score and days of absence for each of the three programs listed above. The vertical lines indicate the mean values.\n\nfig, ax = plt.subplots(3, 2, figsize=(8, 6), sharex=\"col\")\nprograms = list(data[\"prog\"].unique())\nprograms.sort()\n\nfor idx, program in enumerate(programs):\n # Histogram\n ax[idx, 0].hist(data[data[\"prog\"] == program][\"math\"], edgecolor='black', alpha=0.9)\n ax[idx, 0].axvline(data[data[\"prog\"] == program][\"math\"].mean(), color=\"C1\")\n \n # Barplot\n days = data[data[\"prog\"] == program][\"daysabs\"]\n days_mean = days.mean()\n days_counts = days.value_counts()\n values = list(days_counts.index)\n count = days_counts.values\n ax[idx, 1].bar(values, count, edgecolor='black', alpha=0.9)\n ax[idx, 1].axvline(days_mean, color=\"C1\")\n \n # Titles\n ax[idx, 0].set_title(program)\n ax[idx, 1].set_title(program)\n\nplt.setp(ax[-1, 0], xlabel=\"Math score\")\nplt.setp(ax[-1, 1], xlabel=\"Days of absence\");\n\n\n\n\nThe first impression we have is that the distribution of math scores is not equal for any of the programs. It looks right-skewed for students under the Academic program, left-skewed for students under the Vocational program, and roughly uniform for students in the General program (although there’s a drop in the highest values). Clearly those in the Vocational program has the highest mean for the math score.\nOn the other hand, the distribution of the days of absence is right-skewed in all cases. Students in the General program present the highest absence mean while the Vocational group is the one who misses fewer classes on average.\n\n\n\nWe are interested in measuring the association between the type of the program and the math score with the days of absence. It’s also of interest to see if the association between math score and days of absence is different in each type of program.\nIn order to answer our questions, we are going to fit and compare two models. The first model uses the type of the program and the math score as predictors. The second model also includes the interaction between these two variables. The score in the math test is going to be standardized in both cases to make things easier for the sampler and save some seconds. 
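Standardizing here just means z-scoring the math score, which is what the scale() transform used later in the model formulas amounts to. A minimal sketch of that transformation (the math_std column name is only for illustration and is not used afterwards):\n\ndata[\"math_std\"] = (data[\"math\"] - data[\"math\"].mean()) / data[\"math\"].std()\n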
A good idea to follow along is to run these models without scaling math and compare how long they take to fit.\nWe are going to use a negative binomial likelihood to model the days of absence. But let’s stop here and think about why we use this likelihood. Earlier, we said that the negative binomial distribution arises when our variable represents the number of trials until we get \\(k\\) successes. However, the number of trials is fixed, i.e. the number of school days in a given year is not a random variable. So if we stick to the definition, we could think of two alternative views for this problem:\n\nEach of the \\(n\\) days is a trial, and we record whether the student is absent (\\(y=1\\)) or not (\\(y=0\\)). This corresponds to a binary regression setting, where we could think of logistic regression or something similar. A problem here is that we have the sum of \\(y\\) for a student, but not the \\(n\\).\nThe whole school year represents the space where events occur and we count how many absences we see in that space for each student. This gives us a Poisson regression setting (count of an event in a given space or time).\n\nWe also know that when \\(n\\) is large and \\(p\\) is small, the Binomial distribution can be approximated with a Poisson distribution with \\(\\lambda = n * p\\). We don’t know \\(n\\) exactly in this scenario, but we know it is around 180, and we do know that \\(p\\) is small because you can’t skip classes all the time. So both modeling approaches should give similar results.\nBut then, why negative binomial? Can’t we just use a Poisson likelihood?\nYes, we can. However, using a Poisson likelihood implies that the mean is equal to the variance, and that is usually an unrealistic assumption. If it turns out the variance is either substantially smaller or greater than the mean, the Poisson regression model results in a poor fit. Alternatively, if we use a negative binomial likelihood, the variance is not forced to be equal to the mean, there’s more flexibility to handle a given dataset, and, consequently, the fit tends to be better.\n\n\n\\[\n\\log{Y_i} = \\beta_1 \\text{Academic}_i + \\beta_2 \\text{General}_i + \\beta_3 \\text{Vocational}_i + \\beta_4 \\text{Math\\_std}_i\n\\]\n\n\n\n\\[\n\\log{Y_i} = \\beta_1 \\text{Academic}_i + \\beta_2 \\text{General}_i + \\beta_3 \\text{Vocational}_i + \\beta_4 \\text{Math\\_std}_i\n + \\beta_5 \\text{General}_i \\cdot \\text{Math\\_std}_i + \\beta_6 \\text{Vocational}_i \\cdot \\text{Math\\_std}_i\n\\]\nIn both cases we have the following dummy variables\n\\[\\text{Academic}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under Academic program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\n\\[\\text{General}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under General program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\n\\[\\text{Vocational}_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if student is under Vocational program} \\\\\n 0 & \\textrm{other case}\n \\end{array}\n\\right.\n\\]\nand \\(Y\\) represents the days of absence.\nSo, for example, the first model for a student under the Vocational program reduces to \\[\n\\log{Y_i} = \\beta_3 + \\beta_4 \\text{Math\\_std}_i\n\\]\nOne last thing to note is that we’ve decided not to include an intercept term, which is why you don’t see any \\(\\beta_0\\) above. 
This choice allows us to represent the effect of each program directly with \\(\\beta_1\\), \\(\\beta_2\\), and \\(\\beta_3\\).\n\n\n\n\nIt’s very easy to fit these models with Bambi. We just pass a formula describing the terms in the model and Bambi will know how to handle each of them correctly. The 0 on the right hand side of ~ simply means we don’t want to have the intercept term that is added by default. scale(math) tells Bambi we want to use standardize math before being included in the model. By default, Bambi uses a log link for negative binomial GLMs. We’ll stick to this default here.\n\n\n\nmodel_additive = bmb.Model(\"daysabs ~ 0 + prog + scale(math)\", data, family=\"negativebinomial\")\nidata_additive = model_additive.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [daysabs_alpha, prog, scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\n\n\nFor this second model we just add prog:scale(math) to indicate the interaction. A shorthand would be to use y ~ 0 + prog*scale(math), which uses the full interaction operator. In other words, it just means we want to include the interaction between prog and scale(math) as well as their main effects.\n\nmodel_interaction = bmb.Model(\"daysabs ~ 0 + prog + scale(math) + prog:scale(math)\", data, family=\"negativebinomial\")\nidata_interaction = model_interaction.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [daysabs_alpha, prog, scale(math), prog:scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\n\n\n\n\nThe first thing we do is calling az.summary(). Here we pass the InferenceData object the .fit() returned. 
This prints information about the marginal posteriors for each parameter in the model as well as convergence diagnostics.\n\naz.summary(idata_additive)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n prog[Academic]\n 1.888\n 0.084\n 1.738\n 2.057\n 0.002\n 0.001\n 2430.0\n 1649.0\n 1.00\n \n \n prog[General]\n 2.339\n 0.174\n 2.013\n 2.651\n 0.003\n 0.002\n 3364.0\n 1610.0\n 1.00\n \n \n prog[Vocational]\n 1.047\n 0.112\n 0.845\n 1.264\n 0.002\n 0.002\n 2062.0\n 1609.0\n 1.00\n \n \n scale(math)\n -0.150\n 0.063\n -0.271\n -0.036\n 0.001\n 0.001\n 2115.0\n 1357.0\n 1.00\n \n \n daysabs_alpha\n 1.020\n 0.109\n 0.835\n 1.236\n 0.002\n 0.002\n 2112.0\n 1339.0\n 1.01\n \n \n\n\n\n\n\naz.summary(idata_interaction)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n prog[Academic]\n 1.876\n 0.083\n 1.721\n 2.032\n 0.002\n 0.001\n 2149.0\n 1451.0\n 1.0\n \n \n prog[General]\n 2.341\n 0.175\n 2.007\n 2.647\n 0.004\n 0.003\n 2188.0\n 1572.0\n 1.0\n \n \n prog[Vocational]\n 0.984\n 0.128\n 0.743\n 1.223\n 0.003\n 0.002\n 2290.0\n 1703.0\n 1.0\n \n \n scale(math)\n -0.194\n 0.081\n -0.334\n -0.030\n 0.002\n 0.001\n 2001.0\n 1625.0\n 1.0\n \n \n prog:scale(math)[General]\n 0.014\n 0.164\n -0.304\n 0.305\n 0.004\n 0.003\n 2008.0\n 1738.0\n 1.0\n \n \n prog:scale(math)[Vocational]\n 0.198\n 0.168\n -0.129\n 0.512\n 0.004\n 0.003\n 1813.0\n 1556.0\n 1.0\n \n \n daysabs_alpha\n 1.017\n 0.104\n 0.821\n 1.208\n 0.002\n 0.002\n 2135.0\n 1397.0\n 1.0\n \n \n\n\n\n\nThe information in the two tables above can be visualized in a more concise manner using a forest plot. ArviZ provides us with plot_forest(). There we simply pass a list containing the InferenceData objects of the models we want to compare.\n\naz.plot_forest(\n [idata_additive, idata_interaction],\n model_names=[\"Additive\", \"Interaction\"],\n var_names=[\"prog\", \"scale(math)\"],\n combined=True,\n figsize=(8, 4)\n);\n\n\n\n\nOne of the first things one can note when seeing this plot is the similarity between the marginal posteriors. Maybe one can conclude that the variability of the marginal posterior of scale(math) is slightly lower in the model that considers the interaction, but the difference is not significant.\nWe can also make conclusions about the association between the program and the math score with the days of absence. First, we see the posterior for the Vocational group is to the left of the posterior for the two other programs, meaning it is associated with fewer absences (as we have seen when first exploring our data). There also seems to be a difference between General and Academic, where we may conclude the students in the General group tend to miss more classes.\nIn addition, the marginal posterior for math shows negative values in both cases. This means that students with higher math scores tend to miss fewer classes. Below, we see a forest plot with the posteriors for the coefficients of the interaction effects. 
Both of them overlap with 0, which means the data does not give much evidence to support there is an interaction effect between program and math score (i.e., the association between math and days of absence is similar for all the programs).\n\naz.plot_forest(idata_interaction, var_names=[\"prog:scale(math)\"], combined=True, figsize=(8, 4))\nplt.axvline(0);\n\n\n\n\n\n\n\nWe finish this example showing how we can get predictions for new data and plot the mean response for each program together with confidence intervals.\n\nmath_score = np.arange(1, 100)\n\n# This function takes a model and an InferenceData object.\n# It returns of length 3 with predictions for each type of program.\ndef predict(model, idata):\n predictions = []\n for program in programs:\n new_data = pd.DataFrame({\"math\": math_score, \"prog\": [program] * len(math_score)})\n new_idata = model.predict(\n idata, \n data=new_data,\n inplace=False\n )\n prediction = new_idata.posterior[\"daysabs_mean\"]\n predictions.append(prediction)\n \n return predictions\n\n\nprediction_additive = predict(model_additive, idata_additive)\nprediction_interaction = predict(model_interaction, idata_interaction)\n\n\nmu_additive = [prediction.mean((\"chain\", \"draw\")) for prediction in prediction_additive]\nmu_interaction = [prediction.mean((\"chain\", \"draw\")) for prediction in prediction_interaction]\n\n\nfig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize = (10, 4))\n\nfor idx, program in enumerate(programs):\n ax[0].plot(math_score, mu_additive[idx], label=f\"{program}\", color=f\"C{idx}\", lw=2)\n az.plot_hdi(math_score, prediction_additive[idx], color=f\"C{idx}\", ax=ax[0])\n\n ax[1].plot(math_score, mu_interaction[idx], label=f\"{program}\", color=f\"C{idx}\", lw=2)\n az.plot_hdi(math_score, prediction_interaction[idx], color=f\"C{idx}\", ax=ax[1])\n\nax[0].set_title(\"Additive\");\nax[1].set_title(\"Interaction\");\nax[0].set_xlabel(\"Math score\")\nax[1].set_xlabel(\"Math score\")\nax[0].set_ylim(0, 25)\nax[0].legend(loc=\"upper right\");\n\n\n\n\nAs we can see in this plot, the interval for the mean response for the Vocational program does not overlap with the interval for the other two groups, representing the group of students who miss fewer classes. On the right panel we can also see that including interaction terms does not change the slopes significantly because the posterior distributions of these coefficients have a substantial overlap with 0.\nIf you’ve made it to the end of this notebook and you’re still curious about what else you can do with these two models, you’re invited to use az.compare() to compare the fit of the two models. What do you expect before seeing the plot? Why? 
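As a starting point for that exercise, here is a hedged sketch of the comparison; it assumes both models were fitted with idata_kwargs={\"log_likelihood\": True} so that az.compare() finds the log-likelihood values it needs.\n\nmodels = {\"additive\": idata_additive, \"interaction\": idata_interaction}\ndf_compare = az.compare(models)\naz.plot_compare(df_compare, insample_dev=False);\n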
Is there anything else you could do to improve the fit of the model?\nAlso, if you’re still curious about what this model would have looked like with the Poisson likelihood, you just need to replace family=\"negativebinomial\" with family=\"poisson\" and then you’re ready to compare results!\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nbambi : 0.9.3\npandas : 1.5.2\nnumpy : 1.23.5\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 12947\n\nUsually when doing regression we model the conditional mean of some distribution. Common cases are a Normal distribution for continuous unbounded responses, a Poisson distribution for count data, etc.\nQuantile regression, instead estimates a conditional quantile of the response variable. If the quantile is 0.5, then we will be estimating the median (instead of the mean), this could be useful as a way of performing robust regression, in a similar fashion as using a Student-t distribution instead of a Normal. But for some problem we actually care of the behaviour of the response away from the mean (or median). For example, in medical research, pathologies or potential health risks occur at high or low quantile, for instance, overweight and underweight. In some other fields like ecology, quantile regression is justified due to the existence of complex interactions between variables, where the effect of one variable on another is different for different ranges of the variable.\n\n\nAt first it could be weird to think which distribution we should use as the likelihood for quantile regression or how to write a Bayesian model for quantile regression. But it turns out the answer is quite simple, we just need to use the asymmetric Laplace distribution. This distribution has one parameter controling the mean, another for the scale and a third one for the asymmetry. There are at least two alternative parametrizations regarding this asymmetric parameter. In terms of \\(\\kappa\\) a parameter that goes from 0 to \\(\\infty\\) and in terms of \\(q\\) a number between 0 and 1. This later parametrization is more intuitive for quantile regression as we can directly interpre it as the quantile of interest.\nOn the next cell we compute the pdf of 3 distribution from the Asymmetric Laplace family\n\nx = np.linspace(-6, 6, 2000)\nquantiles = np.array([0.2, 0.5, 0.8])\nfor q, m in zip(quantiles, [0, 0, -1]):\n κ = (q/(1-q))**0.5\n plt.plot(x, stats.laplace_asymmetric(κ, m, 1).pdf(x), label=f\"q={q:}, μ={m}, σ=1\")\nplt.yticks([]);\nplt.legend();\n\n\n\n\nWe are going to use a simple dataset to model the Body Mass Index for Dutch kids and young men as a function of their age.\n\ndata = pd.read_csv(\"data/bmi.csv\")\ndata.head()\n\n\n\n\n\n \n \n \n age\n bmi\n \n \n \n \n 0\n 0.03\n 13.235289\n \n \n 1\n 0.04\n 12.438775\n \n \n 2\n 0.04\n 14.541775\n \n \n 3\n 0.04\n 11.773954\n \n \n 4\n 0.04\n 15.325614\n \n \n\n\n\n\nAs we can see from the next figure the relationship between BMI and age is far from linear, and hence we are going to use splines.\n\nplt.plot(data.age, data.bmi, \"k.\");\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nWe are going to model 3 quantiles, 0.1, 0.5 and 0.9. 
For that reason we are going to fit 3 separate models, the only difference being the value of kappa of the Asymmetric Laplace distribution, which will be fixed at a different value each time. In the future Bambi will allow working directly with the parameter q instead of kappa; in the meantime we have to apply a transformation to go from quantiles to suitable values of kappa.\n\\[\n\\kappa = \\sqrt{\\frac{q}{1 - q}}\n\\]\n\nquantiles = np.array([0.1, 0.5, 0.9])\nkappas = (quantiles/(1-quantiles))**0.5\n\nknots = np.quantile(data.age, np.linspace(0, 1, 10))[1:-1]\n\nidatas = []\nfor κ in kappas:\n model = bmb.Model(\"bmi ~ bs(age, knots=knots)\",\n data=data, family=\"asymmetriclaplace\", priors={\"kappa\": κ})\n idata = model.fit()\n model.predict(idata)\n idatas.append(idata)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:27<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 28 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:22<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 22 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_b, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:28<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 29 seconds.\n\n\nWe can see the 3 fitted curves in the next figure. One feature that stands out is that the gap between the median (orange) line and the two other lines is not the same. 
Also the shapes of the curve while following a similar pattern, are not exactly the same.\n\nplt.plot(data.age, data.bmi, \".\", color=\"0.5\")\nfor idata, q in zip(idatas, quantiles):\n plt.plot(data.age.values, idata.posterior[\"bmi_mean\"].mean((\"chain\", \"draw\")),\n label=f\"q={q:}\", lw=3);\n \nplt.legend()\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nTo better undestand these remarks let’s compute a simple linear regression and then compute the same 3 quantiles from that fit.\n\nmodel_g = bmb.Model(\"bmi ~ bs(age, knots=knots)\",\n data=data)\nidata_g = model_g.fit()\nmodel_g.predict(idata_g, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [bmi_sigma, Intercept, bs(age, knots = knots)]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 16 seconds.\n\n\n\nidata_g_mean_quantiles = idata_g.posterior_predictive[\"bmi\"].quantile(quantiles, (\"chain\", \"draw\"))\n\n\nplt.plot(data.age, data.bmi, \".\", color=\"0.5\")\nfor q in quantiles:\n plt.plot(data.age.values, idata_g_mean_quantiles.sel(quantile=q),\n label=f\"q={q:}\");\n \nplt.legend()\nplt.xlabel(\"Age\")\nplt.ylabel(\"BMI\");\n\n\n\n\nWe can see that when we use a Gaussian family and from that fit we compute the quantiles, the quantiles q=0.1 and q=0.9 are symetrical with respect to q=0.5, also the shape of the curves is essentially the same just shifted up or down. Additionally the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\nmatplotlib: 3.6.2\nscipy : 1.9.3\nnumpy : 1.23.5\npandas : 1.5.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/t_regression.html", - "href": "notebooks/t_regression.html", + "objectID": "notebooks/hsgp_1d.html", + "href": "notebooks/hsgp_1d.html", "title": "Bambi", "section": "", - "text": "Robust Linear Regression\nThis example has been lifted from the PyMC Docs, and adapted to for Bambi by Tyler James Burch (@tjburch on GitHub).\nMany toy datasets circumvent problems that practitioners run into with real data. Specifically, the assumption of normality can be easily violated by outliers, which can cause havoc in traditional linear regression. One way to navigate this is through robust linear regression, outlined in this example.\nFirst load modules and set the RNG for reproducibility.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1111)\n\nNext, generate pseudodata. 
The bulk of the data will be linear with noise distributed normally, but additionally several outliers will be interjected.\n\nsize = 100\ntrue_intercept = 1\ntrue_slope = 2\n\nx = np.linspace(0, 1, size)\n# y = a + b*x\ntrue_regression_line = true_intercept + true_slope * x\n# add noise\ny = true_regression_line + np.random.normal(scale=0.5, size=size)\n\n# Add outliers\nx_out = np.append(x, [0.1, 0.15, 0.2])\ny_out = np.append(y, [8, 6, 9])\n\ndata = pd.DataFrame({\n \"x\": x_out, \n \"y\": y_out\n})\n\nPlot this data. The three data points in the top left are the interjected data.\n\nfig = plt.figure(figsize=(7, 7))\nax = fig.add_subplot(111, xlabel=\"x\", ylabel=\"y\", title=\"Generated data and underlying model\")\nax.plot(x_out, y_out, \"x\", label=\"sampled data\")\nax.plot(x, true_regression_line, label=\"true regression line\", lw=2.0)\nplt.legend(loc=0);\n\n\n\n\nTo highlight the problem, first fit a standard normally-distributed linear regression.\n\n# Note, \"gaussian\" is the default argument for family. Added to be explicit. \ngauss_model = bmb.Model(\"y ~ x\", data, family=\"gaussian\")\ngauss_fitted = gauss_model.fit(draws=2000, idata_kwargs={\"log_likelihood\": True})\ngauss_model.predict(gauss_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_sigma, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds.\n\n\n\naz.summary(gauss_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.533\n 0.230\n 1.093\n 1.959\n 0.003\n 0.002\n 5481.0\n 2857.0\n 1.0\n \n \n x\n 1.201\n 0.400\n 0.458\n 1.964\n 0.005\n 0.004\n 5177.0\n 2869.0\n 1.0\n \n \n y_sigma\n 1.186\n 0.085\n 1.032\n 1.351\n 0.001\n 0.001\n 5873.0\n 2891.0\n 1.0\n \n \n y_mean[0]\n 1.533\n 0.230\n 1.093\n 1.959\n 0.003\n 0.002\n 5481.0\n 2857.0\n 1.0\n \n \n y_mean[1]\n 1.546\n 0.227\n 1.113\n 1.963\n 0.003\n 0.002\n 5487.0\n 2857.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n y_mean[98]\n 2.722\n 0.227\n 2.288\n 3.143\n 0.003\n 0.002\n 5461.0\n 3205.0\n 1.0\n \n \n y_mean[99]\n 2.734\n 0.230\n 2.307\n 3.176\n 0.003\n 0.002\n 5454.0\n 3232.0\n 1.0\n \n \n y_mean[100]\n 1.653\n 0.197\n 1.290\n 2.027\n 0.003\n 0.002\n 5512.0\n 3038.0\n 1.0\n \n \n y_mean[101]\n 1.714\n 0.181\n 1.376\n 2.048\n 0.002\n 0.002\n 5539.0\n 3273.0\n 1.0\n \n \n y_mean[102]\n 1.774\n 0.166\n 1.447\n 2.064\n 0.002\n 0.002\n 5572.0\n 3294.0\n 1.0\n \n \n\n106 rows × 9 columns\n\n\n\nRemember, the true intercept was 1, the true slope was 2. 
The recovered intercept is much higher, and the slope is much lower, so the influence of the outliers is apparent.\nVisually, looking at the recovered regression line and posterior predictive HDI highlights the problem further.\n\nplt.figure(figsize=(7, 5))\n# Plot Data\nplt.plot(x_out, y_out, \"x\", label=\"data\")\n# Plot recovered linear regression\nx_range = np.linspace(min(x_out), max(x_out), 2000)\ny_pred = gauss_fitted.posterior.x.mean().item() * x_range + gauss_fitted.posterior.Intercept.mean().item()\nplt.plot(x_range, y_pred, \n color=\"black\",linestyle=\"--\",\n label=\"Recovered regression line\"\n )\n# Plot HDIs\nfor interval in [0.38, 0.68]:\n az.plot_hdi(x_out, gauss_fitted.posterior_predictive.y, \n hdi_prob=interval, color=\"firebrick\")\n# Plot true regression line\nplt.plot(x, true_regression_line, \n label=\"True regression line\", lw=2.0, color=\"black\")\nplt.legend(loc=0);\n\n\n\n\nThe recovered regression line, as well as the \\(0.5\\sigma\\) and \\(1\\sigma\\) bands are shown.\nClearly there is skew in the fit. At lower \\(x\\) values, the regression line is far higher than the true line. This is a result of the outliers, which cause the model to assume a higher value in that regime.\nAdditionally the uncertainty bands are too wide (remember, the \\(1\\sigma\\) band ought to cover 68% of the data, while here it covers most of the points). Due to the small probability mass in the tails of a normal distribution, the outliers have an large effect, causing the uncertainty bands to be oversized.\nClearly, assuming the data are distributed normally is inducing problems here. Bayesian robust linear regression forgoes the normality assumption by instead using a Student T distribution to describe the distribution of the data. The Student T distribution has thicker tails, and by allocating more probability mass to the tails, outliers have a less strong effect.\nComparing the two distributions,\n\nnormal_data = np.random.normal(loc=0, scale=1, size=100_000)\nt_data = np.random.standard_t(df=1, size=100_000)\n\nbins = np.arange(-8,8,0.15)\nplt.hist(normal_data, \n bins=bins, density=True,\n alpha=0.6,\n label=\"Normal\"\n )\nplt.hist(t_data, \n bins=bins,density=True,\n alpha=0.6,\n label=\"Student T\"\n )\nplt.xlabel(\"x\")\nplt.ylabel(\"Probability density\")\nplt.xlim(-8,8)\nplt.legend();\n\n\n\n\nAs we can see, the tails of the Student T are much larger, which means values far from the mean are more likely when compared to the normal distribution.\nThe T distribution is specified by a number of degrees of freedom (\\(\\nu\\)). In numpy.random.standard_t this is the parameter df, in the pymc T distribution, it’s nu. It is constrained to real numbers greater than 0. As the degrees of freedom increase, the probability in the tails Student T distribution decrease. In the limit of \\(\\nu \\rightarrow + \\infty\\), the Student T distribution is a normal distribution. 
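As a rough numeric illustration of how the tails thin out as \\(\\nu\\) grows (a quick sketch using scipy.stats, an extra import not used elsewhere in this example):\n\nfrom scipy import stats\n\n# The tail probability P(X < -3) shrinks toward the Gaussian value as the degrees of freedom increase.\nfor nu in [1, 2, 10, 100]:\n    print(f\"nu = {nu:>3}: P(X < -3) = {stats.t.cdf(-3, df=nu):.4f}\")\nprint(f\"normal  : P(X < -3) = {stats.norm.cdf(-3):.4f}\")\n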
Below, the T distribution is plotted for various \\(\\nu\\).\n\nbins = np.arange(-8,8,0.15)\nfor ndof in [0.1, 1, 10]:\n\n t_data = np.random.standard_t(df=ndof, size=100_000)\n\n plt.hist(t_data, \n bins=bins,density=True,\n label=f\"$\\\\nu = {ndof}$\",\n histtype=\"step\"\n )\nplt.hist(normal_data, \n bins=bins, density=True,\n histtype=\"step\",\n label=\"Normal\"\n ) \n \nplt.xlabel(\"x\")\nplt.ylabel(\"Probability density\")\nplt.xlim(-6,6)\nplt.legend();\n\n\n\n\nIn Bambi, the way to specify a regression with Student T distributed data is by passing \"t\" to the family parameter of a Model.\n\nt_model = bmb.Model(\"y ~ x\", data, family=\"t\")\nt_fitted = t_model.fit(draws=2000, idata_kwargs={\"log_likelihood\": True})\nt_model.predict(t_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_sigma, y_nu, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 7 seconds.\n\n\n\naz.summary(t_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.994\n 0.107\n 0.797\n 1.199\n 0.002\n 0.001\n 4029.0\n 3029.0\n 1.0\n \n \n x\n 1.900\n 0.184\n 1.562\n 2.254\n 0.003\n 0.002\n 4172.0\n 3105.0\n 1.0\n \n \n y_sigma\n 0.405\n 0.046\n 0.321\n 0.492\n 0.001\n 0.001\n 4006.0\n 3248.0\n 1.0\n \n \n y_nu\n 2.601\n 0.620\n 1.500\n 3.727\n 0.011\n 0.008\n 3431.0\n 3063.0\n 1.0\n \n \n y_mean[0]\n 0.994\n 0.107\n 0.797\n 1.199\n 0.002\n 0.001\n 4029.0\n 3029.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n y_mean[98]\n 2.875\n 0.103\n 2.688\n 3.079\n 0.001\n 0.001\n 4786.0\n 3228.0\n 1.0\n \n \n y_mean[99]\n 2.894\n 0.105\n 2.709\n 3.105\n 0.002\n 0.001\n 4768.0\n 3155.0\n 1.0\n \n \n y_mean[100]\n 1.184\n 0.091\n 1.009\n 1.350\n 0.001\n 0.001\n 4046.0\n 3140.0\n 1.0\n \n \n y_mean[101]\n 1.279\n 0.084\n 1.118\n 1.432\n 0.001\n 0.001\n 4074.0\n 3151.0\n 1.0\n \n \n y_mean[102]\n 1.374\n 0.077\n 1.232\n 1.519\n 0.001\n 0.001\n 4128.0\n 3194.0\n 1.0\n \n \n\n107 rows × 9 columns\n\n\n\nNote the new parameter in the model, y_nu. This is the aforementioned degrees of freedom. If this number were very high, we would expect it to be well described by a normal distribution. However, the HDI of this spans from 1.5 to 3.7, meaning that the tails are much heavier than a normal distribution. 
As a result of the heavier tails, y_sigma has also dropped precipitously from the normal model, meaning the oversized uncertainty bands from above have shrunk.\nComparing the extracted values of the two models,\n\ndef get_slope_intercept(mod):\n return (\n mod.posterior.x.mean().item(),\n mod.posterior.Intercept.mean().item()\n )\ngauss_slope, gauss_int = get_slope_intercept(gauss_fitted)\nt_slope, t_int = get_slope_intercept(t_fitted)\n\npd.DataFrame({\n \"Model\":[\"True\",\"Normal\",\"T\"],\n \"Slope\":[2, gauss_slope, t_slope],\n \"Intercept\": [1, gauss_int, t_int]\n}).set_index(\"Model\").T.round(decimals=2)\n\n\n\n\n\n \n \n Model\n True\n Normal\n T\n \n \n \n \n Slope\n 2.0\n 1.20\n 1.90\n \n \n Intercept\n 1.0\n 1.53\n 0.99\n \n \n\n\n\n\nHere we can see the mean recovered values of both the slope and intercept are far closer to the true values using the robust regression model compared to the normally distributed one.\nVisually comparing the robust regression line,\n\nplt.figure(figsize=(7, 5))\n# Plot Data\nplt.plot(x_out, y_out, \"x\", label=\"data\")\n# Plot recovered robust linear regression\nx_range = np.linspace(min(x_out), max(x_out), 2000)\ny_pred = t_fitted.posterior.x.mean().item() * x_range + t_fitted.posterior.Intercept.mean().item()\nplt.plot(x_range, y_pred, \n color=\"black\",linestyle=\"--\",\n label=\"Recovered regression line\"\n )\n# Plot HDIs\nfor interval in [0.05, 0.38, 0.68]:\n az.plot_hdi(x_out, t_fitted.posterior_predictive.y, \n hdi_prob=interval, color=\"firebrick\")\n# Plot true regression line\nplt.plot(x, true_regression_line, \n label=\"true regression line\", lw=2.0, color=\"black\")\nplt.legend(loc=0);\n\n\n\n\nThis is much better. The true and recovered regression lines are much closer, and the uncertainty bands are appropriate sized. The effect of the outliers is not entirely gone, the recovered line still slightly differs from the true line, but the effect is far smaller, which is a result of the Student T likelihood function ascribing a higher probability to outliers than the normal distribution. Additionally, this inference is based on sampling methods, so it is expected to have small differences (especially given a relatively small number of samples).\nLast, another way to evaluate the models is to compare based on Leave-one-out Cross-validation (LOO), which provides an estimate of accuracy on out-of-sample predictions.\n\nmodels = {\n \"gaussian\": gauss_fitted,\n \"Student T\": t_fitted\n}\ndf_compare = az.compare(models)\ndf_compare\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/arviz/stats/stats.py:803: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. 
This is more likely to happen with a non-robust model and highly influential observations.\n warnings.warn(\n\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n Student T\n 0\n -101.760564\n 5.603439\n 0.000000\n 1.000000e+00\n 14.994794\n 0.000000\n False\n log\n \n \n gaussian\n 1\n -171.732028\n 14.081743\n 69.971464\n 3.053913e-11\n 29.382970\n 17.542539\n True\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nHere it is quite obvious that the Student T model is much better, due to having a clearly larger value of LOO.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\npandas : 1.5.2\nnumpy : 1.23.5\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\n\nWatermark: 2.3.1" + "text": "This article demonstrates the how to use Bambi with Gaussian Processes with 1 dimensional predictors. Bambi supports Gaussian Processes through the approximation known as Hilbert Space Gaussian Processes (HSGP).\nHSGP is a framework that falls under the class of low-rank approximations that are based on forming a basis function approximation with \\(m\\) basis functions, where \\(m\\) is usually much less smaller than \\(n\\), the number of observations.\nFor references see Hilbert Space Methods for Reduced-Rank Gaussian Process Regression and Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.\n\nfrom formulae import design_matrices\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom bambi.plots import plot_cap\nfrom matplotlib.lines import Line2D\n\n\n\nLet’s get started simulating some data from a smooth function. The goal is to fit a normal likelihood model where a Gaussian process term contributes to the mean.\n\nrng = np.random.default_rng(seed=121195)\n\nsize = 100\nx = np.linspace(0, 50, size)\nb = 0.1 * rng.normal(size=6)\nsigma = 0.15\n\ndm = design_matrices(\"0 + bs(x, df=6, intercept=True)\", pd.DataFrame({\"x\": x}))\nX = np.array(dm.common)\nf = 10 * X @ b\ny = f + rng.normal(size=size) * sigma\ndf = pd.DataFrame({\"x\": x, \"y\": y})\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.scatter(x, y, s=30, alpha=0.8);\nax.plot(x, f, color=\"black\");\n\n\n\n\nNow let’s simply create and fit the model. We use the hsgp to initialize a HSGP term in the model formula. Notice we pass the variable x and values for two other arguments m and c that we’ll cover later.\n\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2)\", df)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ InverseGamma(alpha: 3.0, beta: 2.0)\n \n Auxiliary parameters\n y_sigma ~ HalfStudentT(nu: 4.0, sigma: 0.2745)\n\n\nIn the model description we can see the contribution of the HSGP term. It consists of two things: the name of the covariance kernel and the priors for its parameters. In this case, it’s an Exponentiated Quadratic covariance kernel with parameters sigma (amplitude) and ell (lengthscale). 
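For reference, this is roughly what that kernel computes (a sketch of the standard textbook form of the Exponentiated Quadratic, not code taken from Bambi or PyMC):

# k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 * ell^2))
# sigma sets the amplitude of the function; ell sets how quickly the correlation
# between function values decays with the distance between inputs.
import numpy as np

def exp_quad(x1, x2, sigma=1.0, ell=10.0):
    return sigma**2 * np.exp(-((x1 - x2) ** 2) / (2 * ell**2))

print(exp_quad(0.0, 1.0))    # nearby inputs: covariance close to sigma^2
print(exp_quad(0.0, 40.0))   # distant inputs: covariance close to 0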
The prior for the amplitude is Exponential(1) and the prior for the lengthscale is InverseGamma(3, 2).\n\nidata = model.fit(inference_method=\"nuts_numpyro\", random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\n/home/tomas/anaconda3/envs/bambi_hsgp/lib/python3.10/site-packages/pymc/sampling/jax.py:39: UserWarning: This module is experimental.\n warnings.warn(\"This module is experimental.\")\n\n\nCompiling...\n\n\nNo GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n\n\nCompilation time = 0:00:02.804363\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:04.776557\nTransforming variables...\nTransformation time = 0:00:00.521686\n527\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nThe fit is horrible. To fix that we can use better priors. But before doing that, it’s important to note that HSGP terms have a unique characteristic in that they do not receive priors themselves. Rather, the associated parameters of an HSGP term, such as sigma and ell, are the ones that are assigned priors. Therefore, we need to assign the HSGP term a dictionary of priors instead of a single prior.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=2), # amplitude\n \"ell\": bmb.Prior(\"InverseGamma\", mu=10, sigma=1) # lengthscale\n}\n\n# This is the dictionary we pass to Bambi\npriors = {\n \"hsgp(x, m=10, c=2)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=10)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2)\", df, priors=priors)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2)\n cov: ExpQuad\n sigma ~ Exponential(lam: 2.0)\n ell ~ InverseGamma(mu: 10.0, sigma: 1.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 10.0)\n\n\nNotice the priors were updated in the model summary. Now we’re ready to fit the model!\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9, random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\nCompiling...\nCompilation time = 0:00:02.378503\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:05.336123\nTransforming variables...\nTransformation time = 0:00:00.174204\n7\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nThe marginal posteriors look somehow better, but we still have lots of divergences. What else can we do? Change the parametrization!\nThe hsgp() function has a centered argument which defaults to False and thus Bambi uses a non-centered parametrization by default. But we can change that actually. 
Let’s try it!\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=2), # amplitude\n \"ell\": bmb.Prior(\"InverseGamma\", mu=10, sigma=1) # lengthscale\n}\n\n# This is the dictionary we pass to Bambi\npriors = {\n \"hsgp(x, m=10, c=2, centered=True)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=10)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x, m=10, c=2, centered=True)\", df, priors=priors)\nmodel\n\n Formula: y ~ 0 + hsgp(x, m=10, c=2, centered=True)\n Family: gaussian\n Link: mu = identity\n Observations: 100\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, m=10, c=2, centered=True)\n cov: ExpQuad\n sigma ~ Exponential(lam: 2.0)\n ell ~ InverseGamma(mu: 10.0, sigma: 1.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 10.0)\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9, random_seed=121195)\nprint(idata.sample_stats[\"diverging\"].sum().to_numpy())\n\nCompiling...\nCompilation time = 0:00:02.560797\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:04.839103\nTransforming variables...\nTransformation time = 0:00:00.028475\n0\n\n\n\naz.plot_trace(idata, backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nAwesome! That looks much better now.\nWe still get all the nice things from Bambi when using GPs. An example of this is the plot_cap() function which enables us to generate a visualization of the adjusted mean with credible bands automatically.\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.scatter(df[\"x\"], df[\"y\"], s=30, color=\"0.5\", alpha=0.5)\nplot_cap(model, idata, \"x\", ax=ax);\nax.set(xlabel=\"Predictor\", ylabel=\"Observed\");\n\n\n\n\nAnd on top of that, it’s possible to get draws from the posterior predictive distribution and plot credible bands for it. All we need is the .predict() method from the model class.\n\nnew_data = pd.DataFrame({\"x\": np.linspace(0, 50, num=500)})\nmodel.predict(idata, kind=\"pps\", data=new_data)\npps = idata.posterior_predictive[\"y\"].to_numpy().reshape(4000, 500)\nqts = np.quantile(pps, q=(0.025, 0.975), axis=0)\n\nfig, ax = plt.subplots(figsize=(9, 6))\nax.fill_between(new_data[\"x\"], qts[0], qts[1], color=\"C0\", alpha=0.6)\nax.scatter(df[\"x\"], df[\"y\"], s=30, color=\"C1\", alpha=0.9)\nax.plot(x, f, color=\"black\", ls=\"--\")\nax.set(xlabel=\"Predictor\", ylabel=\"Observed\");\n\nhandles = [Line2D([], [], color=\"black\", ls=\"--\"), Line2D([], [], color=\"C0\")]\nlabels = [\"True curve\", \"Posterior predictive distribution\"]\nax.legend(handles, labels);\n\n\n\n\n\n\n\nhsgp() is a transformation that is available in the namespace where the model formula is evaluated. In plain english, hsgp() is like a function you can use in your model formulas. You don’t need to worry about the details, Bambi knows how to handle them.But if still you want to see the actual code, you can have a look at the implementation of the HSGP class in bambi/transformations.py.\nWhat users do need to care about is the arguments the hsgp() transformation support. There are a bunch of arguments that can be passed after the variable number of non-keyword arguments representing the variables of the HSGP contribution. Below is a brief overview of these arguments and their respective descriptions.\n\nm: The number of basis vectors\nL: The boundary of the variable space\nc: The proportion extension factor\nby: This argument specifies the values of a variable used for grouping. It is used to create a HSGP term by group. 
If left unspecified, the default value is None, which means that there is no group variable and all observations belong to the same group.\ncov: This argument specifies the name of the covariance function to be used. The default value is \"ExpQuad\".\nshare_cov: Determines whether the same covariance function is shared across all groups. This argument is relevant only when by is not None and the default value is True.\nscale: When set to True, the predictors are be rescaled such that the largest Euclidean distance between two points is 1. This adjustment often improves the sampling speed and convergence.\niso: Determines whether to use an isotropic or non-isotropic Gaussian Process. With an isotropic GP, the same level of smoothing is applied to all predictors, while a anisotropic GP allows different levels of smoothing for individual predictors. Note that this argument is ignored if only one predictor is provided. The default value is True.\ndrop_first: Whether to exclude the first basis vector or not. The default value is False.\ncentered: Whether to use the centered or the non-centered parametrization. Defaults to False.\n\nThe parameters m, L and c are directly related to the basis vectors of the HSGP approximation. If you want to know more about m, L, and/or c, it’s recommended to have a look at the documentation of the HSGP class in PyMC.\n\nSo far, we showcased how to use m, c and centered. In the remainder of this article we’re going to see how by and share_cov are used when we add a GP contribution by groups.\n\n\n\nIn this section we fit a model with a HSGP contribution by levels of a categorical variable. The data was simulated with the gamSim() function from the R package {mgcv} by Simon Wood.\n\ndata = pd.read_csv(\"data/gam_data.csv\")\ndata[\"fac\"] = pd.Categorical(data[\"fac\"])\ndata.head()[[\"x2\", \"y\", \"fac\"]]\n\n\n\n\n\n \n \n \n x2\n y\n fac\n \n \n \n \n 0\n 0.497183\n 3.085274\n 3\n \n \n 1\n 0.196003\n -2.250410\n 2\n \n \n 2\n 0.958474\n 0.070548\n 3\n \n \n 3\n 0.972759\n -0.230454\n 1\n \n \n 4\n 0.755836\n 2.173497\n 2\n \n \n\n\n\n\nLet’s visualize x2 versus y for the different levels in fac.\n\nfig, ax = plt.subplots(figsize=(9, 5))\ncolors = [f\"C{i}\" for i in pd.Categorical(data[\"fac\"]).codes]\nax.scatter(data[\"x2\"], data[\"y\"], color=colors, alpha=0.6)\nax.set(xlabel=\"x2\", ylabel=\"y\");\n\n\n\n\nWe can observe the relation between x2 and y can be approximated by a smooth non-linear curve, for all groups.\nBelow, we create the model with Bambi. The biggest difference is that we’re passing by=fac in the hsgp() call. This is all we need to ask Bambi to create multiple GP contribution terms, one per group.\nAnother trick that was not shown previously is the usage of an alias. .set_alias() from the Model class allow us to have more readable and shorter names for the components of a model. 
As you’ll see below, it makes a huge difference when displaying summaries or visualizations for the parameters of the model.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"Exponential\", lam=3)\n}\npriors = {\n \"hsgp(x2, by=fac, m=12, c=1.5)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=1)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x2, by=fac, m=12, c=1.5)\": \"hsgp\"})\nmodel\n\n Formula: y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5)\n Family: gaussian\n Link: mu = identity\n Observations: 300\n Priors: \n target = mu\n HSGP contributions\n hsgp(x2, by=fac, m=12, c=1.5)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ Exponential(lam: 3.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 1.0)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\nSee how nicer are the names for the HSGP contribution parameters with the alias!\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.95, random_seed=121195)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:03.565702\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:06.818602\nTransforming variables...\nTransformation time = 0:00:00.885410\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_weights\", \"hsgp_sigma\", \"hsgp_ell\", \"y_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nThis time we got no divergences and good mixing and nice convergence in our first try (or perhaps it wasn’t the first try!). One thing that stands out are the marginal posterior for some of the beta parameters (the weights of the basis). This may indicate our approximation is using more basis vectors than what’s really needed.\nNote: At this point we have used the term ‘basis vector’ several times. This concept is very close to the concept of ‘basis functions’. The difference is that the ‘basis vector’ is a ‘basis function’ already evaluated at a set of points. In this case, the set of points is made by the values of the numerical predictor x2.\nDo you remember how easy was it to use plot_cap() above? Should it be harder now? Well, the answer will surprise you: No!\nAll we need to do is passing a second variable name which is mapped to the color in the visualization. Voilà!\n\nfig, ax = plt.subplots(figsize = (9, 5))\ncolors = [f\"C{i}\" for i in pd.Categorical(data[\"fac\"]).codes]\nax.scatter(data[\"x2\"], data[\"y\"], color=colors, alpha=0.6)\nplot_cap(model, idata, [\"x2\", \"fac\"], ax=ax);\n\n\n\n\nWe can go one step further and modify the model to use different covariance functions for the different groups. For that purpose, we pass share_cov=False. 
As always, Bambi takes care of all the details.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n \"ell\": bmb.Prior(\"Exponential\", lam=3)\n}\npriors = {\n \"hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\": prior_hsgp,\n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=1)\n}\nmodel = bmb.Model(\"y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\": \"hsgp\"})\nmodel\n\n Formula: y ~ 0 + hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 300\n Priors: \n target = mu\n HSGP contributions\n hsgp(x2, by=fac, m=12, c=1.5, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ Exponential(lam: 3.0)\n \n Auxiliary parameters\n y_sigma ~ HalfNormal(sigma: 1.0)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\nHave a closer look at the model graph. See that the hsgp_sigma and hsgp_ell parameters are no longer scalar. There are three of each, one for each group.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.95, random_seed=121195)\n\nCompiling...\nCompilation time = 0:00:04.396845\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:07.743907\nTransforming variables...\nTransformation time = 0:00:00.519422\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"y_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nIn fact, we can see not all the groups have similar posteriors for the covariance function parameters when they are allowed to vary.\nBefore closing the article, it’s worth looking at a particular but not uncommon pattern when using the HSGP approximation. Let’s have a look at the posterior distributions for the weights of the basis.\n\naz.plot_trace(idata, var_names=[\"hsgp_weights\"], backend_kwargs={\"layout\": \"constrained\"});\n\n\n\n\nLooks like some distributions are extremely flat, and others are extremely tight around zero.\nTo investigate this further we can manually plot the posterior for the first J basis vectors and see what they look like.\n\nbasis_n = 6\nfig, axes = plt.subplots(3, 1, figsize = (7, 10))\nfor i in range(3):\n ax = axes[i]\n values = idata.posterior[\"hsgp_weights\"].sel({\"hsgp_by\": i + 1})\n for j in range(basis_n):\n az.plot_kde(\n values.sel({\"hsgp_weights_dim\": j}).to_numpy().flatten(), \n ax=ax, \n plot_kwargs={\"color\": f\"C{j}\"}\n );\n\n\n\n\nIndeed, we can see that, at least for the first group, the posterior of the weights start being too tight around zero when we consider the 6th basis vector. But the posteriors for the weights of the previous basis vectors look nice.\nTo confirm our thought, let’s increase the value of basis_n to 9 and see what happens.\n\nbasis_n = 9\nfig, axes = plt.subplots(3, 1, figsize = (7, 10))\nfor i in range(3):\n ax = axes[i]\n values = idata.posterior[\"hsgp_weights\"].sel({\"hsgp_by\": i + 1})\n for j in range(basis_n):\n az.plot_kde(\n values.sel({\"hsgp_weights_dim\": j}).to_numpy().flatten(), \n ax=ax, \n plot_kwargs={\"color\": f\"C{j}\"}\n );\n\n\n\n\nAlright, that’s too spiky! Nonetheless, we don’t see that happening for the third group yet, indicating the higher number of basis vectors is more appropriate for this group." 
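One way to make that visual impression more concrete is to summarize the posterior spread of every basis weight per group. The following sketch reuses the idata object and the dimension names (hsgp_by, hsgp_weights_dim) from the model above:

# Sketch: posterior standard deviation of each basis weight, by group.
# Weights whose posteriors collapse tightly around zero suggest basis vectors the
# approximation is not really using; a group that keeps wide weights at the highest
# indices may genuinely need more basis vectors.
import numpy as np

weights = idata.posterior["hsgp_weights"]        # dims: chain, draw, hsgp_by, hsgp_weights_dim
sd_per_weight = weights.std(dim=("chain", "draw"))

for group in sd_per_weight["hsgp_by"].to_numpy():
    sds = sd_per_weight.sel({"hsgp_by": group}).to_numpy()
    print(f"group {group}:", np.round(sds, 2))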
}, { - "objectID": "notebooks/ESCS_multiple_regression.html", - "href": "notebooks/ESCS_multiple_regression.html", + "objectID": "notebooks/radon_example.html", + "href": "notebooks/radon_example.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport xarray as xr\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nBambi comes with several datasets. These can be accessed via the load_data() function.\n\ndata = bmb.load_data(\"ESCS\")\nnp.round(data.describe(), 2)\n\n\n\n\n\n \n \n \n drugs\n n\n e\n o\n a\n c\n hones\n emoti\n extra\n agree\n consc\n openn\n \n \n \n \n count\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n \n \n mean\n 2.21\n 80.04\n 106.52\n 113.87\n 124.63\n 124.23\n 3.89\n 3.18\n 3.21\n 3.13\n 3.57\n 3.41\n \n \n std\n 0.65\n 23.21\n 19.88\n 21.12\n 16.67\n 18.69\n 0.45\n 0.46\n 0.53\n 0.47\n 0.44\n 0.52\n \n \n min\n 1.00\n 23.00\n 42.00\n 51.00\n 63.00\n 44.00\n 2.56\n 1.47\n 1.62\n 1.59\n 2.00\n 1.28\n \n \n 25%\n 1.71\n 65.75\n 93.00\n 101.00\n 115.00\n 113.00\n 3.59\n 2.88\n 2.84\n 2.84\n 3.31\n 3.06\n \n \n 50%\n 2.14\n 76.00\n 107.00\n 112.00\n 126.00\n 125.00\n 3.88\n 3.19\n 3.22\n 3.16\n 3.56\n 3.44\n \n \n 75%\n 2.64\n 93.00\n 120.00\n 129.00\n 136.00\n 136.00\n 4.20\n 3.47\n 3.56\n 3.44\n 3.84\n 3.75\n \n \n max\n 4.29\n 163.00\n 158.00\n 174.00\n 171.00\n 180.00\n 4.94\n 4.62\n 4.75\n 4.44\n 4.75\n 4.72\n \n \n\n\n\n\nIt’s always a good idea to start off with some basic plotting. Here’s what our outcome variable drugs (some index of self-reported illegal drug use) looks like:\n\ndata[\"drugs\"].hist();\n\n\n\n\nThe five numerical predictors that we’ll use are sum-scores measuring participants’ standings on the Big Five personality dimensions. The dimensions are:\n\nO = Openness to experience\nC = Conscientiousness\nE = Extraversion\nA = Agreeableness\nN = Neuroticism\n\nHere’s what our predictors look like:\n\naz.plot_pair(data[[\"o\", \"c\", \"e\", \"a\", \"n\"]].to_dict(\"list\"), marginals=True, textsize=24);\n\n\n\n\nWe can easily see all the predictors are more or less symmetrically distributed without outliers and the pairwise correlations between them are not strong.\n\n\n\nWe’re going to fit a pretty straightforward additive multiple regression model predicting drug index from all 5 personality dimension scores. It’s simple to specify the model using a familiar formula interface. Here we also tell Bambi to run two parallel Markov Chain Monte Carlo (MCMC) chains, each one with 2000 draws. The first 1000 draws are tuning steps that we discard and the last 1000 draws are considered to be taken from the joint posterior distribution of all the parameters (to be confirmed when we analyze the convergence of the chains).\n\nmodel = bmb.Model(\"drugs ~ o + c + e + a + n\", data)\nfitted = model.fit(tune=2000, draws=2000, init=\"adapt_diag\", random_seed=SEED)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [drugs_sigma, Intercept, o, c, e, a, n]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:11<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.\n\n\nGreat! But this is a Bayesian model, right? What about the priors? 
If no priors are given explicitly by the user, then Bambi chooses smart default priors for all parameters of the model based on the implied partial correlations between the outcome and the predictors. Here’s what the default priors look like in this case – the plots below show 1000 draws from each prior distribution:\n\nmodel.plot_priors();\n\nSampling: [Intercept, a, c, drugs_sigma, e, n, o]\n\n\n\n\n\n\n# Normal priors on the coefficients\n{x.name: x.prior.args for x in model.response_component.terms.values()}\n\n{'Intercept': {'mu': array(2.21014664), 'sigma': array(21.19375074)},\n 'o': {'mu': array(0), 'sigma': array(0.0768135)},\n 'c': {'mu': array(0), 'sigma': array(0.08679683)},\n 'e': {'mu': array(0), 'sigma': array(0.0815892)},\n 'a': {'mu': array(0), 'sigma': array(0.09727366)},\n 'n': {'mu': array(0), 'sigma': array(0.06987412)},\n 'drugs': {'mu': array(0), 'sigma': array(1)}}\n\n\n\n# HalfStudentT prior on the residual standard deviation\nfor name, component in model.constant_components.items():\n print(f\"{name}: {component.prior}\")\n\nsigma: HalfStudentT(nu: 4, sigma: 0.6482)\n\n\nYou could also just print the model and see it also contains the same information about the priors\n\nmodel\n\n Formula: drugs ~ o + c + e + a + n\n Family: gaussian\n Link: mu = identity\n Observations: 604\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 2.2101, sigma: 21.1938)\n o ~ Normal(mu: 0, sigma: 0.0768)\n c ~ Normal(mu: 0, sigma: 0.0868)\n e ~ Normal(mu: 0, sigma: 0.0816)\n a ~ Normal(mu: 0, sigma: 0.0973)\n n ~ Normal(mu: 0, sigma: 0.0699)\n Auxiliary parameters\n drugs_sigma ~ HalfStudentT(nu: 4, sigma: 0.6482)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nSome more info about the default prior distributions can be found in this technical paper.\nNotice the apparently small SDs of the slope priors. This is due to the relative scales of the outcome and the predictors: remember from the plots above that the outcome, drugs, ranges from 1 to about 4, while the predictors all range from about 20 to 180 or so. A one-unit change in any of the predictors – which is a trivial increase on the scale of the predictors – is likely to lead to a very small absolute change in the outcome. Believe it or not, these priors are actually quite wide on the partial correlation scale!\n\n\n\nLet’s start with a pretty picture of the parameter estimates!\n\naz.plot_trace(fitted);\n\n\n\n\nThe left panels show the marginal posterior distributions for all of the model’s parameters, which summarize the most plausible values of the regression coefficients, given the data we have now observed. These posterior density plots show two overlaid distributions because we ran two MCMC chains. The panels on the right are “trace plots” showing the sampling paths of the two MCMC chains as they wander through the parameter space. 
If any of these paths exhibited a pattern other than white noise we would be concerned about the convergence of the chains.\nA much more succinct (non-graphical) summary of the parameter estimates can be found like so:\n\naz.summary(fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 3.298\n 0.351\n 2.609\n 3.924\n 0.006\n 0.004\n 3956.0\n 3180.0\n 1.0\n \n \n o\n 0.006\n 0.001\n 0.004\n 0.009\n 0.000\n 0.000\n 4217.0\n 3214.0\n 1.0\n \n \n c\n -0.004\n 0.001\n -0.007\n -0.001\n 0.000\n 0.000\n 3820.0\n 3286.0\n 1.0\n \n \n e\n 0.003\n 0.001\n 0.001\n 0.006\n 0.000\n 0.000\n 4252.0\n 3625.0\n 1.0\n \n \n a\n -0.012\n 0.001\n -0.015\n -0.010\n 0.000\n 0.000\n 4846.0\n 3437.0\n 1.0\n \n \n n\n -0.002\n 0.001\n -0.004\n 0.001\n 0.000\n 0.000\n 4048.0\n 3317.0\n 1.0\n \n \n drugs_sigma\n 0.592\n 0.017\n 0.561\n 0.623\n 0.000\n 0.000\n 5882.0\n 2962.0\n 1.0\n \n \n\n\n\n\nWhen there are multiple MCMC chains, the default summary output includes some basic convergence diagnostic info (the effective MCMC sample sizes and the Gelman-Rubin “R-hat” statistics), although in this case it’s pretty clear from the trace plots above that the chains have converged just fine.\n\n\n\n\nsamples = fitted.posterior\n\nIt turns out that we can convert each regression coefficient into a partial correlation by multiplying it by a constant that depends on (1) the SD of the predictor, (2) the SD of the outcome, and (3) the degree of multicollinearity with the set of other predictors. Two of these statistics are actually already computed and stored in the fitted model object, in a dictionary called dm_statistics (for design matrix statistics), because they are used internally. We will compute the others manually.\nSome information about the relationship between linear regression parameters and partial correlation can be found here.\n\n# the names of the predictors\nvarnames = ['o', 'c', 'e', 'a', 'n']\n\n# compute the needed statistics like R-squared when each predictor is response and all the \n# other predictors are the predictor\n\n# x_matrix = common effects design matrix (excluding intercept/constant term)\nterms = [t for t in model.response_component.common_terms.values() if t.name != \"Intercept\"]\nx_matrix = [pd.DataFrame(x.data, columns=x.levels) for x in terms]\nx_matrix = pd.concat(x_matrix, axis=1)\nx_matrix.columns = varnames\n\ndm_statistics = {\n 'r2_x': pd.Series(\n {\n x: sm.OLS(\n endog=x_matrix[x],\n exog=sm.add_constant(x_matrix.drop(x, axis=1))\n if \"Intercept\" in model.response_component.terms\n else x_matrix.drop(x, axis=1),\n )\n .fit()\n .rsquared\n for x in list(x_matrix.columns)\n }\n ),\n 'sigma_x': x_matrix.std(),\n 'mean_x': x_matrix.mean(axis=0),\n}\n\nr2_x = dm_statistics['r2_x']\nsd_x = dm_statistics['sigma_x']\nr2_y = pd.Series([sm.OLS(endog=data['drugs'],\n exog=sm.add_constant(data[[p for p in varnames if p != x]])).fit().rsquared\n for x in varnames], index=varnames)\nsd_y = data['drugs'].std()\n\n# compute the products to multiply each slope with to produce the partial correlations\nslope_constant = (sd_x[varnames] / sd_y) * ((1 - r2_x[varnames]) / (1 - r2_y)) ** 0.5\nslope_constant\n\no 32.392557\nc 27.674284\ne 30.305117\na 26.113299\nn 34.130431\ndtype: float64\n\n\nNow we just multiply each sampled regression coefficient by its corresponding slope_constant to transform it into a sample partial correlation coefficient.\n\npcorr_samples = (samples[varnames] * slope_constant)\n\nAnd voilà! 
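As a quick sanity check (a sketch that is not part of the original analysis), the sample partial correlation can also be computed directly from the data as the correlation between two sets of OLS residuals; it should land close to the corresponding posterior mean:

# Sketch: frequentist partial correlation between drugs and openness ('o'),
# controlling for the other four traits, via the residual-on-residual method.
import numpy as np
import statsmodels.api as sm

others = [p for p in varnames if p != "o"]
res_drugs = sm.OLS(data["drugs"], sm.add_constant(data[others])).fit().resid
res_o = sm.OLS(data["o"], sm.add_constant(data[others])).fit().resid
print(np.corrcoef(res_drugs, res_o)[0, 1])  # should be close to the posterior mean for 'o'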
We now have a joint posterior distribution for the partial correlation coefficients. Let’s plot the marginal posterior distributions:\n\n# Pass the same axes to az.plot_kde to have all the densities in the same plot\n_, ax = plt.subplots()\nfor idx, (k, v) in enumerate(pcorr_samples.items()):\n az.plot_dist(v, label=k, plot_kwargs={'color':f'C{idx}'}, ax=ax)\nax.axvline(x=0, color='k', linestyle='--');\n\n\n\n\nThe means of these distributions serve as good point estimates of the partial correlations:\n\npcorr_samples.mean()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: ()\nData variables:\n o float64 0.1973\n c float64 -0.105\n e float64 0.1016\n a float64 -0.324\n n float64 -0.0513xarray.DatasetDimensions:Coordinates: (0)Data variables: (5)o()float640.1973array(0.19728891)c()float64-0.105array(-0.105046)e()float640.1016array(0.10159318)a()float64-0.324array(-0.32396902)n()float64-0.0513array(-0.05130356)Indexes: (0)Attributes: (0)\n\n\n\n\n\nWe just take the square of the partial correlation coefficients, so it’s easy to get posteriors on that scale too:\n\n_, ax = plt.subplots()\nfor idx, (k, v) in enumerate(pcorr_samples.items()):\n az.plot_dist(v ** 2, label=k, plot_kwargs={'color':f'C{idx}'}, ax=ax)\nax.set_ylim(0, 80);\n\n\n\n\nWith these posteriors we can ask: What is the probability that the squared partial correlation for Openness (blue) is greater than the squared partial correlation for Conscientiousness (orange)?\n\n(pcorr_samples['o'] ** 2 > pcorr_samples['c'] ** 2).mean().item()\n\n0.9365\n\n\nIf we contrast this result with the plot we’ve just shown, we may think the probability is too high when looking at the overlap between the blue and orange curves. However, the previous plot is only showing marginal posteriors, which don’t account for correlations between the coefficients. In our Bayesian world, our model parameters’ are random variables (and consequently, any combination of them are too). As such, squared partial correlation have a joint distribution. When computing probabilities involving at least two of these parameters, one has to use the joint distribution. Otherwise, if we choose to work only with marginals, we are implicitly assuming independence.\nLet’s check the joint distribution of the squared partial correlation for Openness and Conscientiousness. We highlight with a blue color the draws where the coefficient for Openness is greater than the coefficient for Conscientiousness.\n\nsq_partial_c = pcorr_samples['c'] ** 2\nsq_partial_o = pcorr_samples['o'] ** 2\n\n\ncolors = np.where(sq_partial_c > sq_partial_o, \"C1\", \"C0\").flatten().tolist()\n\nplt.scatter(sq_partial_o, sq_partial_c, c=colors)\nplt.xlabel(\"Openness to experience\")\nplt.ylabel(\"Conscientiousness\");\n\n\n\n\nWe can see that in the great majority of the draws (92.8%) the squared partial correlation for Openness is greater than the one for Conscientiousness. In fact, we can check the correlation between them is\n\nxr.corr(sq_partial_c, sq_partial_o).item()\n\n-0.19487146395840146\n\n\nwhich explains why ony looking at the marginal posteriors (i.e. 
assuming independence) is not the best approach here.\nFor each predictor, what is the probability that it has the largest squared partial correlation?\n\npc_df = pcorr_samples.to_dataframe()\n(pc_df**2).idxmax(axis=1).value_counts() / len(pc_df.index)\n\na 0.989\no 0.011\ndtype: float64\n\n\nAgreeableness is clearly the strongest predictor of drug use among the Big Five personality traits in terms of partial correlation, but it’s still not a particularly strong predictor in an absolute sense. Walter Mischel famously claimed that it is rare to see correlations between personality measure and relevant behavioral outcomes exceed 0.3. In this case, the probability that the agreeableness partial correlation exceeds 0.3 is:\n\n(np.abs(pcorr_samples['a']) > 0.3).mean().item()\n\n0.7515\n\n\n\n\n\nOnce we have computed the posterior distribution, we can use it to compute the posterior predictive distribution. As the name implies, these are predictions assuming the model’s parameter are distributed as the posterior. Thus, the posterior predictive includes the uncertainty about the parameters.\nWith bambi we can use the model’s predict() method with the fitted az.InferenceData to generate a posterior predictive samples, which are then automatically added to the az.InferenceData object\n\nposterior_predictive = model.predict(fitted, kind=\"pps\")\nfitted\n\n\n\n \n \n arviz.InferenceData\n \n \n \n \n \n posterior\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (chain: 2, draw: 2000, drugs_obs: 604)\nCoordinates:\n * chain (chain) int64 0 1\n * draw (draw) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999\n * drugs_obs (drugs_obs) int64 0 1 2 3 4 5 6 ... 597 598 599 600 601 602 603\nData variables:\n Intercept (chain, draw) float64 3.176 3.52 3.331 ... 2.713 3.273 3.582\n o (chain, draw) float64 0.004945 0.004528 ... 0.007971 0.005363\n c (chain, draw) float64 -0.003048 -0.004202 ... -0.006359\n e (chain, draw) float64 0.004493 0.003775 ... 0.002476 0.003399\n a (chain, draw) float64 -0.01186 -0.01245 ... -0.0138 -0.01127\n n (chain, draw) float64 -0.001693 -0.001597 ... -0.001553\n drugs_sigma (chain, draw) float64 0.6181 0.5667 0.6038 ... 0.5624 0.5909\n drugs_mean (chain, draw, drugs_obs) float64 2.404 2.112 ... 2.465 2.221\nAttributes:\n created_at: 2023-01-05T13:59:47.818007\n arviz_version: 0.14.0\n inference_library: pymc\n inference_library_version: 5.0.1\n sampling_time: 12.082805395126343\n tuning_steps: 2000\n modeling_interface: bambi\n modeling_interface_version: 0.9.3xarray.DatasetDimensions:chain: 2draw: 2000drugs_obs: 604Coordinates: (3)chain(chain)int640 1array([0, 1])draw(draw)int640 1 2 3 4 ... 1996 1997 1998 1999array([ 0, 1, 2, ..., 1997, 1998, 1999])drugs_obs(drugs_obs)int640 1 2 3 4 5 ... 599 600 601 602 603array([ 0, 1, 2, ..., 601, 602, 603])Data variables: (8)Intercept(chain, draw)float643.176 3.52 3.331 ... 3.273 3.582array([[3.17599154, 3.51973775, 3.33138916, ..., 3.78559953, 3.69898904,\n 4.10173002],\n [3.22869629, 3.05970267, 4.05607078, ..., 2.71262619, 3.27329852,\n 3.5817772 ]])o(chain, draw)float640.004945 0.004528 ... 0.005363array([[0.00494498, 0.00452772, 0.00756356, ..., 0.00563804, 0.00567455,\n 0.00605511],\n [0.00566662, 0.00734347, 0.00455193, ..., 0.00781383, 0.00797121,\n 0.00536332]])c(chain, draw)float64-0.003048 -0.004202 ... 
-0.006359array([[-0.00304773, -0.00420235, -0.00330325, ..., -0.00690254,\n -0.00304681, -0.0049652 ],\n [-0.00114941, -0.00482514, -0.00490148, ..., -0.00216353,\n -0.00381585, -0.00635936]])e(chain, draw)float640.004493 0.003775 ... 0.003399array([[0.00449327, 0.00377511, 0.00301912, ..., 0.00351791, 0.00326551,\n 0.00174616],\n [0.00282394, 0.00321167, 0.002755 , ..., 0.00387329, 0.00247556,\n 0.00339919]])a(chain, draw)float64-0.01186 -0.01245 ... -0.01127array([[-0.01185915, -0.01244995, -0.01392545, ..., -0.01249392,\n -0.01503491, -0.01621822],\n [-0.0145613 , -0.01041646, -0.01505147, ..., -0.01161795,\n -0.01379694, -0.01127277]])n(chain, draw)float64-0.001693 -0.001597 ... -0.001553array([[-0.00169305, -0.00159663, -0.00154174, ..., -0.0021489 ,\n -0.00288474, -0.00150735],\n [-0.0001852 , -0.00150535, -0.00206979, ..., -0.00123601,\n -0.00069656, -0.00155252]])drugs_sigma(chain, draw)float640.6181 0.5667 ... 0.5624 0.5909array([[0.61807081, 0.56667133, 0.60383893, ..., 0.59876874, 0.5881488 ,\n 0.59612521],\n [0.57607916, 0.59275997, 0.59122171, ..., 0.62229347, 0.56236473,\n 0.59090382]])drugs_mean(chain, draw, drugs_obs)float642.404 2.112 1.809 ... 2.465 2.221array([[[2.40445527, 2.11215753, 1.80914249, ..., 2.03306693,\n 2.47439674, 2.12557432],\n [2.46462075, 2.10211223, 1.75099921, ..., 2.06373429,\n 2.40050445, 2.17458259],\n [2.48629432, 2.19120177, 1.78244804, ..., 1.99662782,\n 2.50554453, 2.14334816],\n ...,\n [2.52971839, 2.10128993, 1.60904467, ..., 2.03221421,\n 2.4590608 , 2.18403842],\n [2.38667108, 2.0990887 , 1.74141595, ..., 1.91173016,\n 2.39889978, 2.06061017],\n [2.53520294, 2.09223729, 1.62325609, ..., 1.99719024,\n 2.25196002, 2.17630404]],\n\n [[2.40955116, 2.09007661, 1.81021061, ..., 2.0214824 ,\n 2.25523148, 2.12606555],\n [2.47973927, 2.19885636, 1.76301224, ..., 2.02577592,\n 2.58236465, 2.15872438],\n [2.48310012, 2.05888822, 1.64851281, ..., 2.00492937,\n 2.27433231, 2.15713283],\n ...,\n [2.38775743, 2.18114872, 1.83901419, ..., 1.96842912,\n 2.57436394, 2.0785617 ],\n [2.49053754, 2.14963624, 1.70581067, ..., 1.99018703,\n 2.41675113, 2.14096604],\n [2.53882175, 2.1313453 , 1.67693896, ..., 2.08870657,\n 2.46499185, 2.22116629]]])Indexes: (3)chainPandasIndexPandasIndex(Int64Index([0, 1], dtype='int64', name='chain'))drawPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999],\n dtype='int64', name='draw', length=2000))drugs_obsPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 594, 595, 596, 597, 598, 599, 600, 601, 602, 603],\n dtype='int64', name='drugs_obs', length=604))Attributes: (8)created_at :2023-01-05T13:59:47.818007arviz_version :0.14.0inference_library :pymcinference_library_version :5.0.1sampling_time :12.082805395126343tuning_steps :2000modeling_interface :bambimodeling_interface_version :0.9.3\n \n \n \n \n \n \n posterior_predictive\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (chain: 2, draw: 2000, drugs_obs: 604)\nCoordinates:\n * chain (chain) int64 0 1\n * draw (draw) int64 0 1 2 3 4 5 6 ... 1993 1994 1995 1996 1997 1998 1999\n * drugs_obs (drugs_obs) int64 0 1 2 3 4 5 6 7 ... 597 598 599 600 601 602 603\nData variables:\n drugs (chain, draw, drugs_obs) float64 2.695 1.825 ... 1.757 2.475\nAttributes:\n modeling_interface: bambi\n modeling_interface_version: 0.9.3xarray.DatasetDimensions:chain: 2draw: 2000drugs_obs: 604Coordinates: (3)chain(chain)int640 1array([0, 1])draw(draw)int640 1 2 3 4 ... 
1996 1997 1998 1999array([ 0, 1, 2, ..., 1997, 1998, 1999])drugs_obs(drugs_obs)int640 1 2 3 4 5 ... 599 600 601 602 603array([ 0, 1, 2, ..., 601, 602, 603])Data variables: (1)drugs(chain, draw, drugs_obs)float642.695 1.825 1.951 ... 1.757 2.475array([[[2.69503226, 1.82467892, 1.95143733, ..., 2.5699741 ,\n 1.84978551, 1.36654724],\n [2.59077371, 3.11220779, 0.79189108, ..., 2.38631284,\n 2.62021493, 2.18537113],\n [3.11823781, 2.23392971, 1.75284024, ..., 2.35781091,\n 1.90029844, 2.27354726],\n ...,\n [1.72739111, 1.72704894, 1.95692669, ..., 2.55793246,\n 2.12482296, 2.65996429],\n [2.07203446, 0.57259278, 2.09124301, ..., 2.36280251,\n 2.23606286, 3.02304092],\n [2.52625525, 1.61450826, 2.41667227, ..., 1.83555475,\n 2.0276591 , 1.89229018]],\n\n [[2.50995335, 2.67645277, 0.38388315, ..., 1.78983849,\n 2.42224863, 1.7022833 ],\n [1.67277622, 1.9170972 , 2.49938629, ..., 1.99462421,\n 3.11777803, 2.60929834],\n [2.71727227, 1.99501171, 1.27468368, ..., 2.79142362,\n 2.5874147 , 1.43214944],\n ...,\n [2.08488322, 1.23115871, 1.46252038, ..., 2.13904811,\n 2.67930592, 2.54997319],\n [2.47050601, 2.46610699, 1.6644404 , ..., 1.63954359,\n 2.41069391, 2.59553247],\n [2.11421864, 1.3497171 , 1.67469565, ..., 1.24887731,\n 1.75678247, 2.4746553 ]]])Indexes: (3)chainPandasIndexPandasIndex(Int64Index([0, 1], dtype='int64', name='chain'))drawPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999],\n dtype='int64', name='draw', length=2000))drugs_obsPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 594, 595, 596, 597, 598, 599, 600, 601, 602, 603],\n dtype='int64', name='drugs_obs', length=604))Attributes: (2)modeling_interface :bambimodeling_interface_version :0.9.3\n \n \n \n \n \n \n sample_stats\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (chain: 2, draw: 2000)\nCoordinates:\n * chain (chain) int64 0 1\n * draw (draw) int64 0 1 2 3 4 5 ... 1995 1996 1997 1998 1999\nData variables: (12/17)\n tree_depth (chain, draw) int64 3 2 3 3 3 2 2 3 ... 2 2 3 2 3 3 3\n n_steps (chain, draw) float64 7.0 3.0 7.0 7.0 ... 7.0 7.0 7.0\n step_size_bar (chain, draw) float64 0.8184 0.8184 ... 0.8091 0.8091\n acceptance_rate (chain, draw) float64 0.8022 0.9751 ... 0.957 0.5038\n index_in_trajectory (chain, draw) int64 -2 2 2 3 -4 2 -2 ... 2 -5 1 4 2 5\n process_time_diff (chain, draw) float64 0.002245 0.001611 ... 0.002694\n ... ...\n max_energy_error (chain, draw) float64 1.085 -0.1348 ... 0.2997 1.41\n diverging (chain, draw) bool False False False ... False False\n perf_counter_start (chain, draw) float64 7.401e+03 ... 7.405e+03\n energy_error (chain, draw) float64 0.06257 -0.1283 ... -0.2735\n lp (chain, draw) float64 -536.3 -536.0 ... -537.7 -536.3\n step_size (chain, draw) float64 0.757 0.757 ... 0.8614 0.8614\nAttributes:\n created_at: 2023-01-05T13:59:47.843311\n arviz_version: 0.14.0\n inference_library: pymc\n inference_library_version: 5.0.1\n sampling_time: 12.082805395126343\n tuning_steps: 2000\n modeling_interface: bambi\n modeling_interface_version: 0.9.3xarray.DatasetDimensions:chain: 2draw: 2000Coordinates: (2)chain(chain)int640 1array([0, 1])draw(draw)int640 1 2 3 4 ... 1996 1997 1998 1999array([ 0, 1, 2, ..., 1997, 1998, 1999])Data variables: (17)tree_depth(chain, draw)int643 2 3 3 3 2 2 3 ... 2 2 2 3 2 3 3 3array([[3, 2, 3, ..., 3, 2, 2],\n [2, 2, 3, ..., 3, 3, 3]])n_steps(chain, draw)float647.0 3.0 7.0 7.0 ... 
3.0 7.0 7.0 7.0array([[7., 3., 7., ..., 7., 3., 3.],\n [3., 3., 7., ..., 7., 7., 7.]])step_size_bar(chain, draw)float640.8184 0.8184 ... 0.8091 0.8091array([[0.81840616, 0.81840616, 0.81840616, ..., 0.81840616, 0.81840616,\n 0.81840616],\n [0.8090762 , 0.8090762 , 0.8090762 , ..., 0.8090762 , 0.8090762 ,\n 0.8090762 ]])acceptance_rate(chain, draw)float640.8022 0.9751 ... 0.957 0.5038array([[0.80218379, 0.97508852, 0.98194673, ..., 0.92311194, 0.8097277 ,\n 0.40372929],\n [0.86113225, 0.76594351, 0.67048735, ..., 0.92215663, 0.95695655,\n 0.50382814]])index_in_trajectory(chain, draw)int64-2 2 2 3 -4 2 -2 ... 2 2 -5 1 4 2 5array([[-2, 2, 2, ..., 4, 3, -1],\n [-1, 3, -3, ..., 4, 2, 5]])process_time_diff(chain, draw)float640.002245 0.001611 ... 0.002694array([[0.00224537, 0.00161071, 0.00262122, ..., 0.00185244, 0.0009074 ,\n 0.00094349],\n [0.00145086, 0.00134784, 0.00256195, ..., 0.00220908, 0.00232599,\n 0.00269432]])perf_counter_diff(chain, draw)float640.002245 0.001717 ... 0.002694array([[0.00224466, 0.00171671, 0.0026596 , ..., 0.00185201, 0.00090715,\n 0.00094316],\n [0.00196686, 0.00134622, 0.00256046, ..., 0.00220816, 0.00232364,\n 0.0026936 ]])largest_eigval(chain, draw)float64nan nan nan nan ... nan nan nan nanarray([[nan, nan, nan, ..., nan, nan, nan],\n [nan, nan, nan, ..., nan, nan, nan]])smallest_eigval(chain, draw)float64nan nan nan nan ... nan nan nan nanarray([[nan, nan, nan, ..., nan, nan, nan],\n [nan, nan, nan, ..., nan, nan, nan]])energy(chain, draw)float64541.3 538.1 537.9 ... 540.4 545.0array([[541.34133531, 538.08024449, 537.88965217, ..., 538.86443359,\n 539.86045571, 540.70508204],\n [540.93659924, 541.96396793, 539.54197037, ..., 539.03858149,\n 540.43513443, 545.02637008]])reached_max_treedepth(chain, draw)boolFalse False False ... False Falsearray([[False, False, False, ..., False, False, False],\n [False, False, False, ..., False, False, False]])max_energy_error(chain, draw)float641.085 -0.1348 ... 0.2997 1.41array([[ 1.08481213, -0.13479529, -0.38578247, ..., 0.34598906,\n 0.71378583, 1.09914272],\n [ 0.53888782, 0.86242187, 0.89569298, ..., 0.27006348,\n 0.29971737, 1.40998718]])diverging(chain, draw)boolFalse False False ... False Falsearray([[False, False, False, ..., False, False, False],\n [False, False, False, ..., False, False, False]])perf_counter_start(chain, draw)float647.401e+03 7.401e+03 ... 7.405e+03array([[7400.94306119, 7400.9458602 , 7400.94833494, ..., 7405.97315966,\n 7405.97520941, 7405.97632057],\n [7400.48802495, 7400.49063799, 7400.49262723, ..., 7405.30901618,\n 7405.31147156, 7405.31411165]])energy_error(chain, draw)float640.06257 -0.1283 ... 0.03155 -0.2735array([[ 0.06256698, -0.12827742, -0.0469405 , ..., -0.05069006,\n -0.02168315, 0.86390736],\n [-0.34268291, -0.14879415, 0.31310984, ..., 0.27006348,\n 0.03154649, -0.27345004]])lp(chain, draw)float64-536.3 -536.0 ... -537.7 -536.3array([[-536.32650601, -536.03480607, -536.27365583, ..., -536.17211968,\n -536.13502656, -539.14388769],\n [-537.44038441, -535.68599013, -536.82172512, ..., -537.58815552,\n -537.70851584, -536.32463184]])step_size(chain, draw)float640.757 0.757 0.757 ... 
0.8614 0.8614array([[0.75699092, 0.75699092, 0.75699092, ..., 0.75699092, 0.75699092,\n 0.75699092],\n [0.86139733, 0.86139733, 0.86139733, ..., 0.86139733, 0.86139733,\n 0.86139733]])Indexes: (2)chainPandasIndexPandasIndex(Int64Index([0, 1], dtype='int64', name='chain'))drawPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999],\n dtype='int64', name='draw', length=2000))Attributes: (8)created_at :2023-01-05T13:59:47.843311arviz_version :0.14.0inference_library :pymcinference_library_version :5.0.1sampling_time :12.082805395126343tuning_steps :2000modeling_interface :bambimodeling_interface_version :0.9.3\n \n \n \n \n \n \n observed_data\n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDimensions: (drugs_obs: 604)\nCoordinates:\n * drugs_obs (drugs_obs) int64 0 1 2 3 4 5 6 7 ... 597 598 599 600 601 602 603\nData variables:\n drugs (drugs_obs) float64 1.857 3.071 1.571 2.214 ... 1.5 2.5 3.357\nAttributes:\n created_at: 2023-01-05T13:59:47.853402\n arviz_version: 0.14.0\n inference_library: pymc\n inference_library_version: 5.0.1\n modeling_interface: bambi\n modeling_interface_version: 0.9.3xarray.DatasetDimensions:drugs_obs: 604Coordinates: (1)drugs_obs(drugs_obs)int640 1 2 3 4 5 ... 599 600 601 602 603array([ 0, 1, 2, ..., 601, 602, 603])Data variables: (1)drugs(drugs_obs)float641.857 3.071 1.571 ... 1.5 2.5 3.357array([1.85714286, 3.07142857, 1.57142857, 2.21428571, 1.07142857,\n 1.42857143, 1.14285714, 2.14285714, 2.14285714, 1.07142857,\n 1.85714286, 2.5 , 1.85714286, 2.71428571, 1.42857143,\n 1.71428571, 1.71428571, 3.14285714, 2.71428571, 1.92857143,\n 2.71428571, 2.28571429, 2.35714286, 1.71428571, 2. ,\n 2.92857143, 2.5 , 2.92857143, 2.64285714, 2.21428571,\n 2.78571429, 2.71428571, 3.07142857, 2. , 3. ,\n 1.92857143, 3.07142857, 2.57142857, 2.71428571, 3.07142857,\n 1.78571429, 1.78571429, 3.57142857, 2.28571429, 2.78571429,\n 2.14285714, 2.71428571, 2.71428571, 2.35714286, 2.28571429,\n 1.85714286, 2.57142857, 2.14285714, 3.07142857, 2.07142857,\n 3.5 , 1.71428571, 2.5 , 2.14285714, 1.14285714,\n 3.5 , 1.85714286, 3.28571429, 2.64285714, 2. ,\n 1.85714286, 2.35714286, 2.21428571, 3.14285714, 2.64285714,\n 1.28571429, 1.64285714, 2.64285714, 2.07142857, 2.21428571,\n 3.07142857, 2.42857143, 3.21428571, 2.71428571, 2.07142857,\n 2.42857143, 2.07142857, 2.92857143, 3.42857143, 1.92857143,\n 2.57142857, 1. , 2.42857143, 2.14285714, 1.71428571,\n 1.78571429, 3.35714286, 1.71428571, 1.85714286, 2.07142857,\n 2.71428571, 1.5 , 1.57142857, 1.14285714, 1. ,\n...\n 1.35714286, 3.07142857, 1.42857143, 2.64285714, 1.35714286,\n 2.07142857, 3. , 1.35714286, 1.85714286, 1.42857143,\n 1.78571429, 2. , 2.42857143, 1.42857143, 2. ,\n 3.07142857, 1.5 , 2. , 2.42857143, 2. ,\n 2.64285714, 3.92857143, 2.42857143, 2. , 1.71428571,\n 1.42857143, 2. , 1.78571429, 1.85714286, 2.78571429,\n 1.14285714, 1.42857143, 2.21428571, 2.07142857, 1.42857143,\n 1.85714286, 2.64285714, 3.5 , 2. , 2. ,\n 2.92857143, 1.71428571, 2.57142857, 2.28571429, 1.21428571,\n 2.64285714, 1.21428571, 1.92857143, 1.85714286, 1.5 ,\n 1.5 , 1. , 1.85714286, 2.28571429, 2.28571429,\n 2. , 2.85714286, 1.21428571, 2.14285714, 1.71428571,\n 1.42857143, 2.64285714, 1.64285714, 1.57142857, 1.64285714,\n 1.57142857, 1.07142857, 2.07142857, 1.42857143, 2.35714286,\n 2.42857143, 2.42857143, 2.28571429, 1.85714286, 1.42857143,\n 1.78571429, 1.64285714, 1.64285714, 1.07142857, 3.71428571,\n 3.07142857, 2.21428571, 2.14285714, 1.78571429, 2. 
,\n 2.14285714, 3.85714286, 1.64285714, 3. , 2.64285714,\n 1.71428571, 2.78571429, 1.85714286, 3.14285714, 2.42857143,\n 1.57142857, 1.5 , 2.5 , 3.35714286])Indexes: (1)drugs_obsPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 594, 595, 596, 597, 598, 599, 600, 601, 602, 603],\n dtype='int64', name='drugs_obs', length=604))Attributes: (6)created_at :2023-01-05T13:59:47.853402arviz_version :0.14.0inference_library :pymcinference_library_version :5.0.1modeling_interface :bambimodeling_interface_version :0.9.3\n \n \n \n \n \n \n \n\n\nOne use of the posterior predictive is as a diagnostic tool, shown below using az.plot_ppc().The blue lines represent the posterior predictive distribution estimates, and the black line represents the observed data. Our posterior predictions seems perform an adequately good job representing the observed data in all regions except near the value of 1, where the observed data and posterior estimates diverge.\n\naz.plot_ppc(fitted);\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n func(*args, **kwargs)\n\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 1.5.2\narviz : 0.14.0\nstatsmodels: 0.13.2\nmatplotlib : 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nbambi : 0.9.3\nnumpy : 1.23.5\nxarray : 2022.11.0\n\nWatermark: 2.3.1" + "text": "In this notebook we want to revisit the classical hierarchical linear regression model based on the dataset of the Radon Contamination by Gelman and Hill. In particular, we want to show how easy is to port the PyMC models, presented in the very complete article A Primer on Bayesian Methods for Multilevel Modeling, to Bambi using the more concise formula specification for the models.\nThis example has been ported from PyMC by Juan Orduz (@juanitorduz) and Bambi developers.\n\n\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport pymc as pm\nimport seaborn as sns\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.default_rng(8924)\n\nGenerator(PCG64) at 0x7FDEF2EFEC00\n\n\n\n\n\nLet us load the data into a pandas data frame.\n\n# Get radon data\npath = \"https://raw.githubusercontent.com/pymc-devs/pymc-examples/main/examples/data/srrs2.dat\"\nradon_df = pd.read_csv(path)\n\n# Get city data\ncity_df = pd.read_csv(pm.get_data(\"cty.dat\"))\n\n\ndisplay(radon_df.head())\nprint(radon_df.shape[0])\n\n\n\n\n\n \n \n \n idnum\n state\n state2\n stfips\n zip\n region\n typebldg\n floor\n room\n basement\n ...\n stoptm\n startdt\n stopdt\n activity\n pcterr\n adjwt\n dupflag\n zipflag\n cntyfips\n county\n \n \n \n \n 0\n 1\n AZ\n AZ\n 4\n 85920\n 1\n 1\n 1\n 2\n N\n ...\n 1100\n 112987\n 120287\n 0.3\n 0.0\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n 1\n 2\n AZ\n AZ\n 4\n 85920\n 1\n 0\n 9\n 0\n \n ...\n 700\n 70788\n 71188\n 0.6\n 33.3\n 128.784975\n 0\n 0\n 1\n APACHE\n \n \n 2\n 3\n AZ\n AZ\n 4\n 85924\n 1\n 1\n 1\n 3\n N\n ...\n 1145\n 70788\n 70788\n 0.5\n 0.0\n 150.245112\n 0\n 0\n 1\n APACHE\n \n \n 3\n 4\n AZ\n AZ\n 4\n 85925\n 1\n 1\n 1\n 3\n N\n ...\n 1900\n 52088\n 52288\n 0.6\n 97.2\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n 4\n 5\n AZ\n AZ\n 4\n 85932\n 1\n 1\n 1\n 1\n N\n ...\n 900\n 70788\n 70788\n 0.3\n 0.0\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n\n5 rows 
× 25 columns\n\n\n\n12777\n\n\n\ndisplay(city_df.head())\nprint(city_df.shape[0])\n\n\n\n\n\n \n \n \n stfips\n ctfips\n st\n cty\n lon\n lat\n Uppm\n \n \n \n \n 0\n 1\n 1\n AL\n AUTAUGA\n -86.643\n 32.534\n 1.78331\n \n \n 1\n 1\n 3\n AL\n BALDWIN\n -87.750\n 30.661\n 1.38323\n \n \n 2\n 1\n 5\n AL\n BARBOUR\n -85.393\n 31.870\n 2.10105\n \n \n 3\n 1\n 7\n AL\n BIBB\n -87.126\n 32.998\n 1.67313\n \n \n 4\n 1\n 9\n AL\n BLOUNT\n -86.568\n 33.981\n 1.88501\n \n \n\n\n\n\n3194\n\n\n\n\n\nWe are going to preprocess the data as done in the article A Primer on Bayesian Methods for Multilevel Modeling.\n\n# Strip spaces from column names\nradon_df.columns = radon_df.columns.map(str.strip)\n\n# Filter to keep observations for \"MN\" state only\ndf = radon_df[radon_df.state == \"MN\"].copy()\ncity_mn_df = city_df[city_df.st == \"MN\"].copy()\n\n# Compute fips\ndf[\"fips\"] = 1_000 * df.stfips + df.cntyfips\ncity_mn_df[\"fips\"] = 1_000 * city_mn_df.stfips + city_mn_df.ctfips\n\n# Merge data\ndf = df.merge(city_mn_df[[\"fips\", \"Uppm\"]], on=\"fips\")\ndf = df.drop_duplicates(subset=\"idnum\")\n\n# Clean county names\ndf.county = df.county.map(str.strip)\n\n# Compute log(radon + 0.1)\ndf[\"log_radon\"] = np.log(df[\"activity\"] + 0.1)\n\n# Compute log of Uranium\ndf[\"log_u\"] = np.log(df[\"Uppm\"])\n\n# Let's map floor. 0 -> Basement and 1 -> Floor\ndf[\"floor\"] = df[\"floor\"].map({0: \"Basement\", 1: \"Floor\"})\n\n# Sort values by floor\ndf = df.sort_values(by=\"floor\")\n\n# Reset index\ndf = df.reset_index(drop=True)\n\nIn this exercise, we model the logarithm of the Radon measurements. This is because the distribution of the Radon level is approximately log-normal. We also add a small number, 0.1, to prevent us from trying to compute the logarithm of 0.\n\n\n\nIn order to get a glimpse of the data, we are going to do some exploratory data analysis. First, let’s have a look at the global distribution of the untransformed radon levels.\n\n_, ax = plt.subplots()\nsns.histplot(x=\"activity\", alpha=0.2, stat=\"density\", element=\"step\", common_norm=False, data=df, ax=ax)\nsns.kdeplot(x=\"activity\", data=df, ax=ax, cut=0)\nax.set(title=\"Density of Radon\", xlabel=\"Radon\", ylabel=\"Density\");\n\n\n\n\nNext, let us see the global log(radon + 0.1) distribution.\n\n_, ax = plt.subplots()\nsns.histplot(x=\"log_radon\", alpha=0.2, stat=\"density\", element=\"step\", common_norm=False, data=df, ax=ax)\nsns.kdeplot(x=\"log_radon\", data=df, ax=ax)\nax.set(title=\"Density of log(Radon + 0.1)\", xlabel=\"$\\log(Radon + 0.1)$\", ylabel=\"Density\");\n\n\n\n\nThere are many a priori reasons to think houses with basement has higher radon levels. From geochemistry to composition of building materials to poor ventilation. 
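Before splitting the plots by floor, it helps to check how many measurements fall in each category (a quick sketch; the imbalance becomes relevant later when interpreting posterior uncertainty):

# Sketch: sample sizes and county coverage per floor level.
print(df["floor"].value_counts())
print(df.groupby("floor")["county"].nunique())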
We can split the distribution of log(radon + 0.1) per floor to see if we are able to see that difference in our data.\n\n_, ax = plt.subplots()\nsns.histplot(\n x=\"log_radon\", hue=\"floor\", alpha=0.2, stat=\"density\", element=\"step\", \n common_norm=False, data=df, ax=ax\n)\nsns.kdeplot(x=\"log_radon\", hue=\"floor\", common_norm=False, data=df, ax=ax)\nax.set(title=\"Density of log(Radon + 0.1)\", xlabel=\"$\\log + 0.1$\", ylabel=\"Density\");\n\n\n\n\nThis exploration tell us that, as expected, the average radon level is higher in Basement than Floor.\nNext, we are going to count the number of counties.\n\nn_counties = df[\"county\"].unique().size\nprint(f\"Number of counties: {n_counties}\")\n\nNumber of counties: 85\n\n\nLet us dig deeper into the distribution of radon and number of observations per county and floor level.\n\nlog_radon_county_agg = (\n df \n .groupby([\"county\", \"floor\"], as_index=False)\n .agg(\n log_radon_mean=(\"log_radon\", \"mean\"),\n n_obs=(\"log_radon\", \"count\")\n )\n)\n\nfig, ax= plt.subplots(nrows=1, ncols=2, figsize=(12, 6), layout=\"constrained\")\nsns.boxplot(x=\"floor\", y=\"log_radon_mean\", data=log_radon_county_agg, ax=ax[0])\nax[0].set(title=\"log(Radon + 0.1) Mean per County\", ylabel=\"$\\log + 0.1$\")\n\nsns.boxplot(x=\"floor\", y=\"n_obs\", data=log_radon_county_agg, ax=ax[1])\nax[1].set(title=\"Number of Observations\", xlabel=\"floor\", ylabel=\"Number of observations\");\n\n\n\n\n\nOn the left hand side we can see that the \"Basement\" distribution per county is shifted to higher values with respect to the \"Floor\" distribution. We had seen this above when considering all counties together.\nOn the right hand side we see that the number of observations per county is not the same for the floor levels. In particular, we see that there are some counties with a lot of basement observations. This can create some bias when computing simple statistics to compare across counties. Moreover, not all county and floor combinations are present in the dataset. For example:\n\n\nassert df.query(\"county == 'YELLOW MEDICINE' and floor == 'Floor'\").empty\n\n\n\n\n\n\n\n\nFor this first model we only consider the predictor floor, which represents the floor level. The following equation describes the linear model that we are going to build with Bambi\n\\[\ny = \\beta_{j} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement }\\\\\n\\beta_{j} &= \\text{Coefficient for the floor level } j \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\nEach \\(j\\) indexes a different floor level. In this case, \\(j=1\\) means \"basement\" and \\(j=2\\) means \"floor\".\n\n\n\n\n\nThe only common effect in this model is the floor effect represented by the \\(\\beta_{j}\\) coefficients. We have\n\\[\n\\beta_{j} \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nfor \\(j: 1, 2\\), where \\(\\sigma_{\\beta_j}\\) is a positive constant that we set to 10 for all \\(j\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\nLet us now write the Bambi model.\nThe 0 on the right side of ~ in the model formula removes the global intercept that is added by default. 
This allows Bambi to use one coefficient for each floor level.\n\n# A dictionary with the priors we pass to the model initialization\npooled_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\npooled_model = bmb.Model(\"log_radon ~ 0 + floor\", df, priors=pooled_priors)\npooled_model\n\n Formula: log_radon ~ 0 + floor\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nThe Family name: Gaussian indicates the selected family, which defaults to Gaussian. And Link: identity indicates the default value for the link argument in bmb.Model(). Taken together this simply means that we are fitting a normal linear regression model.\nLet’s see the graph representation of the model before fitting. To do so, we first need to call the .build() method. Internally, this builds the underlying PyMC model.\n\npooled_model.build()\npooled_model.graph()\n\n\n\n\nLet’s now fit the model.\n\npooled_results = pooled_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\n\n\nNow we can examine the posterior distribution, i.e. the joint distribution of model parameters conditional on the data:\n\naz.plot_trace(data=pooled_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Pooled Model Trace\");\n\n\n\n\nWe can also see some posterior summary statistics.\n\npooled_summary = az.summary(data=pooled_results)\npooled_summary\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n floor[Basement]\n 1.362\n 0.029\n 1.308\n 1.416\n 0.001\n 0.000\n 2861.0\n 1584.0\n 1.0\n \n \n floor[Floor]\n 0.776\n 0.060\n 0.664\n 0.885\n 0.001\n 0.001\n 2818.0\n 1502.0\n 1.0\n \n \n log_radon_sigma\n 0.791\n 0.018\n 0.755\n 0.823\n 0.000\n 0.000\n 2950.0\n 1459.0\n 1.0\n \n \n\n\n\n\nFrom the posterior plot and the summary, we can see the mean radon level is considerably higher in the Basement than in the Floor level. This reflects what we originally saw in the initial data exploration. In addition, sice we have more measurements in the Basement, the uncertainty in its posterior is smaller than the uncertainty in the posterior for the Floor level.\nWe can compare the mean of the posterior distribution of the floor terms to the sample mean. 
This is going to be useful to understand the meaning of complete pooling.\n\n_, ax = plt.subplots()\n\n(\n pooled_summary[\"mean\"]\n .iloc[:-1]\n .reset_index()\n .assign(floor = lambda x: x[\"index\"].str.slice(6, -1).str.strip())\n .merge(\n right=df.groupby([\"floor\"])[\"log_radon\"].mean(),\n left_on=\"floor\",\n right_index=True\n )\n .rename(columns={\n \"mean\": \"posterior mean\",\n \"log_radon\": \"sample mean\"\n })\n .melt(\n id_vars=\"floor\",\n value_vars=[\"posterior mean\", \"sample mean\"]\n )\n .pipe((sns.barplot, \"data\"),\n x=\"floor\",\n y=\"value\",\n hue=\"variable\",\n ax=ax\n )\n)\nax.set(title=\"log(Radon + 0.1) Mean per Floor - Pooled Model\", ylabel=\"$\\log + 0.1$\");\n\n\n\n\nFrom the plot alone it is hard to detect the difference between the posterior mean and the sample mean. This happens because the estimation for any observation in either group is simply the group mean plus the smoothing due to the non-flat priors.\nIn other words, for every observation where floor is \"Basement\" the model predicts the mean radon for all the basement measurements, and for every observation where floor is \"Floor\", the model predicts the mean radon for all the floor measurements.\nWhat does complete pooling exactly mean here?\nIn this example, the pooling refers to how we treat the different counties when computing estimates (i.e. this does not refer to pooling across floor levels for example). Complete pooling means that all measurements for all counties are pooled into a single estimate (“treat all counties the same”), conditional on the floor level (because it is used as a covariate/predictor). For that reason, when computing the prediction for a given observation, we do not discriminate which county it belongs to. We pool all the counties into a single estimate, or in other words, we perform a complete pooling.\nLet’s now compare the posterior predictive distribution for each group with the distribution of the observed data.\nTo do this we need to perform a couple of steps:\n\nObtain samples from the posterior predictive distribution using the .predict() method.\nApply the inverse transform to have the posterior predictive samples in the original scale of the response.\n\n\n# Note we create a new data set. 
\n# One observation per group is enough to obtain posterior predictive samples for that group\n# The more observations we create, the more posterior predictive samples from the same distribution\n# we obtain.\nnew_data = pd.DataFrame({\"floor\": [\"Basement\", \"Floor\"]})\npooled_model.predict(pooled_results, kind=\"pps\", data=new_data)\n\n# Stack chains and draws and extract posterior predictive samples\npps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n# Inverse transform the posterior predictive samples\npps = np.exp(pps) - 0.1\n\nfig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 6), layout=\"constrained\")\nax = ax.flatten()\n\nsns.histplot(x=pps[0].flatten(), stat=\"density\", color=\"C0\", ax=ax[0])\nax[0].set(title=\"Basement (Posterior Predictive Distribution)\", xlabel=\"radon\", ylabel=\"Density\")\nsns.histplot(x=\"activity\", data=df.query(\"floor == 'Basement'\"), stat=\"density\", ax=ax[2])\nax[2].set(title=\"Basement (Sample Distribution)\", xlim=ax[0].get_xlim(), xlabel=\"radon\", ylabel=\"Density\")\n\nsns.histplot(x=pps[1].flatten(), stat=\"density\", color=\"C1\", ax=ax[1])\nax[1].set(title=\"Floor (Posterior Predictive Distribution)\", xlabel=\"radon\", ylabel=\"Density\")\nsns.histplot(x=\"activity\", data=df.query(\"floor == 'Floor'\"), stat=\"density\", color=\"C1\", ax=ax[3])\nax[3].set(title=\"Floor (Sample Distribution)\", xlim=ax[1].get_xlim(), xlabel=\"radon\", ylabel=\"Density\");\n\n/tmp/ipykernel_29247/1213510270.py:9: FutureWarning: extract_dataset has been deprecated, please use extract\n pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\n\n\n\nThe distributions look very similar, but we see that we have some extreme values. Hence if we need a number to compare them let us use the median.\n\nnp.median(a=pps, axis=1)\n\narray([3.71183577, 2.01142545])\n\n\n\ndf.groupby([\"floor\"])[\"activity\"].median()\n\nfloor\nBasement 3.9\nFloor 2.1\nName: activity, dtype: float64\n\n\n\n\n\n\n\nThe following model uses both floor and county as predictors. They are represented with an interaction effect. It means the predicted radon level for a given measurement depends both on the floor level as well as the county. This interaction coefficient allows the floor effect to vary across counties. Or said analogously, the county effect can vary across floor levels.\n\n\n\\[\ny = \\gamma_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement }\\\\\n\\gamma_{jk} &= \\text{Coefficient for floor level } j \\text{ and county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect is the interaction between floor and county. The prior is\n\\[\n\\gamma_{jk} \\sim \\text{Normal}(0, \\sigma_{\\gamma_{jk}})\n\\]\nfor all \\(j: 1, 2\\) and \\(k: 1, \\cdots, 85\\).\n\\(\\sigma_{\\gamma_{jk}}\\) is a positive constant that we set to 10 in all cases.\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon_i & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\] where \\(\\lambda\\) is a positive constant that we set to 1.\nTo specify this model in Bambi we can use the formula log_radon ~ 0 + county:floor. Again, we remove the global intercept with the 0 on the right hand side. 
county:floor specifies the multiplicative interaction between county and floor.\n\nunpooled_priors = {\n \"county:floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nunpooled_model = bmb.Model(\"log_radon ~ 0 + county:floor\", df, priors=unpooled_priors)\nunpooled_model\n\n Formula: log_radon ~ 0 + county:floor\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n county:floor ~ Normal(mu: 0, sigma: 10)\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\nunpooled_results = unpooled_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, county:floor]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 01:14<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 74 seconds.\n\n\n\nunpooled_model.graph()\n\n\n\n\nFrom the graph representation of the model we see the model estimates \\(170 = 85 \\times 2\\) parameters for the county:floor interaction. Let us now explore the model fit.\nFirst, we can now see the plot of the marginal posterior distributions along with the sampling traces.\n\naz.plot_trace(data=unpooled_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Un-Pooled Model Trace\");\n\n\n\n\nSome posteriors for county:floor are much more spread than others, which makes it harder to compare them. To obtain a better summary visualization we can use a forest plot. This plot also allows us to identify exactly the combination of county and floor level.\n\naz.plot_forest(data=unpooled_results, figsize=(6, 32), r_hat=True, combined=True, textsize=8);\n\n\n\n\nNote how for the combination county == 'YELLOW MEDICINE' and floor == 'Floor' where we do not have any observations, the model can still generate predictions which are essentially coming from the prior distributions, which explains the large HDI intervals.\nNext, let’s have a look into the posterior mean for each county and floor combination:\n\nunpooled_summary = az.summary(data=unpooled_results)\n\nWe can now plot the posterior distribution mean of the gamma coefficients against the observed values (sample).\n\n# Get county and floor names from summary table\nvar_mapping = (\n unpooled_summary\n .iloc[:-1]\n .reset_index(drop=False)[\"index\"].str.slice(13, -1).str.split(\",\").apply(pd.Series)\n)\n\nvar_mapping.rename(columns={0: \"county\", 1: \"floor\"}, inplace=True)\nvar_mapping[\"county\"] = var_mapping[\"county\"].str.strip()\nvar_mapping[\"floor\"] = var_mapping[\"floor\"].str.strip()\nvar_mapping.index = unpooled_summary.iloc[:-1].index\n \n# Merge with observed values\nunpooled_summary_2 = pd.concat([var_mapping, unpooled_summary.iloc[:-1]], axis=1)\n\nfig, ax = plt.subplots(figsize=(7, 6))\n\n(\n unpooled_summary_2\n .merge(right=log_radon_county_agg, on=[\"county\", \"floor\"], how=\"left\")\n .pipe(\n (sns.scatterplot, \"data\"),\n x=\"log_radon_mean\",\n y=\"mean\",\n hue=\"floor\",\n ax=ax\n )\n)\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.legend()\nax.set(\n title=\"log(Radon + 0.1) Mean per County (Unpooled Model)\",\n xlabel=\"observed (sample)\",\n ylabel=\"prediction\",\n);\n\n\n\n\nAs expected, the values strongly concentrated along the diagonal. 
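As a rough numeric check of how closely the posterior means track the sample means, we can correlate the two columns. This is a sketch that reuses the unpooled_summary_2 and log_radon_county_agg objects created above; the merge mirrors the one used for the plot:\n\n# Added check (not in the original notebook): correlation between posterior means and sample means\ncheck_df = unpooled_summary_2.merge(right=log_radon_county_agg, on=[\"county\", \"floor\"], how=\"left\")\ncheck_df[[\"mean\", \"log_radon_mean\"]].dropna().corr()\n\n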
In other words, for each county and floor level combination, the model uses their sample mean of radon level as prediction, plus smoothing due to the non-flat priors.\nWhat does no pooling exactly mean here?\nIn the previous example we said complete pooling means the observations are pooled together into single estimates no matter the county they belong to. The situation is completely the opposite in this no pooling scenario. Here, none of the measurements in a given county affect the computation of the coefficient for another county. That’s why, in the end, the estimation for each combination of county and floor level (i.e. \\(\\gamma_{jk}\\)) is the mean of the measurements in that county and floor level (plus prior smoothing) as is reflected in the diagonal scatterplot above.\n\n\n\n\n\n\nIn this section we are going to explore various types of hierarchical models. If you’re familiar with the PyMC way of using hierarchies, the Bambi way (borrowed from mixed effects models way) may be a bit unfamiliar in the beginning, but as we will see, the notation is very convenient. A good explanation is found in Chapter 16 from Bayes Rules book, specifically section 16.3.2. Moreover, you can also take a look into the Bambi examples section where you can find other use cases.\n\n\nWe start with a model that considers a global intercept and varying intercepts for each county. The dispersion parameter of the prior for these varying intercepts is an hyperprior that is common to all the counties. As we are going to conclude later, this is what causes the partial pooling in the model estimates.\n\n\nLet us use greek letters for common effects and roman letters for varying effects. In this case, \\(\\alpha\\) is a common intercept and \\(u\\) is a group-specific intercept.\n\\[\ny = \\alpha + u_j + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement } \\\\\n\\alpha &= \\text{Intercept common to all measurements or global intercept} \\\\\nu_j &= \\text{Intercept specific to the county } j \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe only common effect in this model is the intercept \\(\\alpha\\). We have\n\\[\n\\alpha \\sim \\text{Normal}(0, \\sigma_\\alpha)\n\\]\nwhere \\(\\sigma_\\alpha\\) is a positive constant that we set to 10.\n\n\n\n\\[\nu_j \\sim \\text{Normal}(0, \\sigma_u)\n\\]\nfor all \\(j: 1, \\cdots, 85\\).\nContrary to the common effects case, \\(\\sigma_u\\) is considered a random variable.\nWe assign \\(\\sigma_u\\) the following hyperprior, which is the same to all the counties,\n\\[\n\\sigma_u\\sim \\text{Exponential}(\\tau)\n\\]\nand \\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\n\n\n\n\nThe common intercept \\(\\alpha\\) represents the mean response across all counties and floor levels.\nOn top of it, the county-specific intercept terms \\(u_j\\) represent county-specific deviations from that global mean. 
This type of term is also known as a varying intercept in the statistical literature.\n\n# We can add the hyper-priors inside the prior dictionary parameter of the model constructor\npartial_pooling_priors = {\n    \"Intercept\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n    \"1|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n    \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\npartial_pooling_model = bmb.Model(\n    formula=\"log_radon ~ 1 + (1|county)\", \n    data=df, \n    priors=partial_pooling_priors, \n    noncentered=False\n)\npartial_pooling_model\n\n Formula: log_radon ~ 1 + (1|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n 1|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nThe noncentered argument asks Bambi not to use the non-centered representation for the varying effects. This makes the graph representation clearer and is closer to the original implementation in the PyMC documentation.\n\npartial_pooling_results = partial_pooling_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, Intercept, 1|county_sigma, 1|county]\n\n100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 6 seconds.\n\n\nWe can inspect the graphical representation of the model:\n\npartial_pooling_model.graph()\n\n\nWe can clearly see a new hierarchical level compared to the complete pooling and unpooled models.\nNext, we can plot the posterior distribution of the coefficients in the model:\n\naz.plot_trace(data=partial_pooling_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Partial Pooling Model Trace\");\n\n\n1|county is \(u_j\), the county-specific intercepts.\n1|county_sigma is \(\sigma_u\), the standard deviation of the county-specific intercepts above.\n\nLet us now compare the posterior predictive mean against the observed data at county level.\n\npartial_pooling_results\n\n[Output: an arviz.InferenceData object with groups posterior (chain: 2, draw: 1000, county__factor_dim: 85), sample_stats, and observed_data (919 log_radon values); the full interactive representation is omitted here.]\n\npartial_pooling_model.predict(partial_pooling_results, kind=\"pps\")\n\n# Stack chains and draws. pps stands for posterior predictive samples\npps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\npps_df = pd.DataFrame(data=pps).assign(county=df[\"county\"])\ny_pred = pps_df.groupby(\"county\").mean().mean(axis=1)\ny_sample = df.groupby(\"county\")[\"log_radon\"].mean()\n\nfig, ax = plt.subplots(figsize=(8, 7))\nsns.regplot(x=y_sample, y=y_pred, ax=ax)\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.axhline(y=y_pred.mean(), color=\"C3\", linestyle=\"--\", label=\"predicted global mean\")\nax.legend(loc=\"lower right\")\nax.set(\n    title=\"log(Radon + 0.1) Mean per County (Partial Pooling Model)\",\n    xlabel=\"observed (sample)\",\n    ylabel=\"prediction\",\n    xlim=(0.3, 2.7),\n    ylim=(0.3, 2.7),\n);\n\n/tmp/ipykernel_29247/3145587883.py:4: FutureWarning: extract_dataset has been deprecated, please use extract\n  pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\nNote that in this case the points are not concentrated along the diagonal (as was the case for the unpooled model). The reason is that in the partial pooling model the hyperprior shrinks the predictions towards the global mean, represented by the horizontal dashed line.\nWhat does partial pooling exactly mean here?\nWe said the first model we built performed complete pooling because estimates pooled observations regardless of which county they belong to. We could see that in the coefficients for the floor variable. The estimate for each level was the sample mean for that level, plus prior smoothing, without making any special distinction between observations from different counties.\nThen, when we built our second model, we said it performed no pooling. This was the opposite scenario. Estimates for effects involving a specific county were not informed at all by the information in the other counties.\nNow, we say this model performs partial pooling. But what does it mean? 
Well, if we had complete pooling and no pooling, this must be some type of compromise in between.\nIn this model, we have a global intercept \(\alpha\), which represents the mean of the response variable across all counties. We also have group-specific intercepts \(u_j\) that represent deviations from the global mean specific to each county \(j\). These group-specific intercepts are assigned a Normal prior centered at 0. The standard deviations of these priors are considered random, instead of fixed. Since they are random, they are assigned a prior distribution, which is a hyperprior in this case because it is a prior on top of a prior. And that hyperprior is the same distribution for all the county-specific intercepts. Because of that, these random deviations from the global mean are not independent. Indeed, the shared hyperprior is what causes the partial pooling in the model estimates. In other words, some information is shared between counties when computing estimates for their effects, and this results in shrinkage towards the global mean.\nConnecting what we’ve just said with the figure above, we can see that partial pooling is a compromise between complete pooling (global mean) and no pooling (diagonal).\n\n\n\n\nNext, we add floor as a global feature (i.e. one that does not depend on the county) to the model above. We remove the global intercept so Bambi keeps one coefficient for each floor level.\nIn the original PyMC example, this model is introduced under the Varying intercept model title. We feel that “County-specific intercepts and common predictors” is a more accurate representation of the model we build in Bambi. It is correct to say this is a varying intercept model, because of the county-specific intercepts, but so was the last model we built.\n\n\n\\[\ny = \\beta_j + u_k + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement } \\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\nu_k &= \\text{Intercept specific to the county } k \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect in this model is the floor term \(\beta_j\)\n\\[\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nfor all \(j: 1, 2\), where \(\sigma_{\beta_j}\) is a positive constant that we set to \(10\).\n\n\n\n\\[\nu_k \\sim \\text{Normal}(0, \\sigma_u)\n\\]\nfor all \(k: 1, \cdots, 85\). The hyperprior is\n\\[\n\\sigma_u \\sim \\text{Exponential}(\\tau)\n\\]\nand \(\tau\) is a positive constant that we set to \(1\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \(\lambda\) is a positive constant that we set to \(1\).\n\n\n\n\(\beta_j\) and \(u_k\) may look similar. 
The difference is that the latter is a hierarchical effect (it has a hyperprior), while the former is not.\n\nvarying_intercept_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"1|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nvarying_intercept_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + (1|county)\",\n data=df,\n priors=varying_intercept_priors,\n noncentered=False\n )\n\nvarying_intercept_model\n\n Formula: log_radon ~ 0 + floor + (1|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n 1|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\nvarying_intercept_results = varying_intercept_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, 1|county_sigma, 1|county]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nWhen looking at the graph representation of the model we still see the hierarchical structure for the county varying intercepts, but we do not see it for the floor feature as expected.\n\nvarying_intercept_model.graph()\n\n\n\n\nLet us visualize the posterior distributions:\n\naz.plot_trace(data=varying_intercept_results, compact=True, chain_prop={\"ls\": \"-\"});\nplt.suptitle(\"Varying Intercepts Model Trace\");\n\n\n\n\n\n\n\n\n\nNext we want to include a hierarchical structure in the floor effect.\n\n\n\\[\ny = \\beta_j + b_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement}\\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\nb_{jk} &= \\text{Coefficient for the floor level } j \\text{ specific to the county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect in this model is the floor term \\(\\beta_j\\)\n\\[\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nwhere \\(\\sigma_{\\beta_j}\\) is a positive constant that we set to \\(10\\).\n\n\n\nHere, again, we have the floor effects\n\\[\nb_{jk} \\sim \\text{Normal}(0, \\sigma_{b_j})\n\\]\nfor \\(j:1, 2\\) and \\(k: 1, \\cdots, 85\\).\nThe hyperprior is\n\\[\n\\sigma_{b_j} \\sim \\text{Exponential}(\\tau)\n\\]\nfor \\(j:1, 2\\).\n\\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\n\n\n\nBoth \\(\\beta_j\\) and \\(b_{jk}\\) are floor effects. The difference is that the first one is a common effect, while the second is a group-specific effect. In other words, the second floor effect varies from county to county. These effects represent the county specific deviations from the common floor effect \\(\\beta_j\\). Because of the hyperprior, the \\(b_{jk}\\) effects aren’t independent and result in the partial-pooling effect.\nIn this case the Bambi model specification is quite easy, namely log_radon ~ 0 + floor + (0 + floor|county). 
This formula represents the following terms:\n\nThe first 0 tells we don’t want a global intercept.\nfloor is \\(\\beta_j\\). It says we want to include an effect for each floor level. Since there’s no global intercept, a coefficient for each level is included.\nThe 0 in (0 + floor|county) means we don’t want county-specific intercept. We need to explicitly turn it off as we did with the regular intercept.\nfloor|county is \\(b_{jk}\\), the county-specific floor coefficients. Again, since there’s no varying intercepot for the counties, this includes coefficients for both floor levels.\n\n\nvarying_intercept_slope_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"floor|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nvarying_intercept_slope_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + (0 + floor|county)\",\n data=df,\n priors=varying_intercept_slope_priors,\n noncentered=True\n )\n\nvarying_intercept_slope_model\n\n Formula: log_radon ~ 0 + floor + (0 + floor|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n floor|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nNext, we fit the model. Note we increase the default number of draws from the posterior and the tune samples to 2000. In addition, as the structure of the model gets more complex, so does the posterior. That’s why we increase target_accept from the default 0.8 to 0.9, because we want to explore the posterior more cautiously .\n\nvarying_intercept_slope_results = varying_intercept_slope_model.fit(\n draws=2000, \n tune=2000,\n target_accept=0.9\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, floor|county_sigma, floor|county_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:24<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 24 seconds.\n\n\nIn the graph representation of the model we can now see hierarchical structures both in the intercepts and the slopes. The terms that end with _offset appeared because we are using a non-centered parametrization. This parametrization is an algebraic trick that helps computation but leaves the model unchanged.\n\nvarying_intercept_slope_model.graph()\n\n\n\n\nLet’s have a look at the marginal posterior for the coefficients in the model.\n\nvar_names = [\"floor\", \"floor|county\", \"floor|county_sigma\", \"log_radon_sigma\"]\naz.plot_trace(\n data=varying_intercept_slope_results,\n var_names=var_names, \n compact=True, \n chain_prop={\"ls\": \"-\"}\n);\n\n\n\n\n\n\n\n\n\nWe now want to consider a county-level predictor, namely the (log) uranium level. This is not a county-level predictor in the sense that we use a county-specific coefficient, but in the sense that all the uranium concentrations were measured per county. 
Thus all the houses in the same county have the same uranium level.\n\n\n\\[\ny = \\beta_j + \\xi x + b_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement} \\\\\nx &= \\text{Log uranium concentration} \\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\n\\xi &= \\text{Coefficient for the slope of the log uranium concentration}\\\\\nb_{jk} &= \\text{Coefficient for the floor level } j \\text{ specific to the county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThis model has two common effects:\n\\[\n\\begin{aligned}\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j}) \\\\\n\\xi \\sim \\text{Normal}(0, \\sigma_\\xi)\n\\end{aligned}\n\\]\nwhere \\(j:1, 2\\) and all \\(\\sigma_{\\beta_j}\\) and \\(\\sigma_{\\xi}\\) are set to \\(10\\).\n\n\n\nHere, again, we have the floor effects\n\\[\nb_{jk} \\sim \\text{Normal}(0, \\sigma_{b_j})\n\\]\nfor \\(j:1, 2\\) and \\(k: 1, \\cdots, 85\\).\nThe hyperprior is\n\\[\n\\sigma_{b_j} \\sim \\text{Exponential}(\\tau)\n\\]\nfor \\(j:1, 2\\).\n\\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to \\(1\\).\n\ncovariate_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"log_u\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"floor|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\ncovariate_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + log_u + (0 + floor|county)\",\n data=df,\n priors=covariate_priors,\n noncentered=True\n )\n\ncovariate_model\n\n Formula: log_radon ~ 0 + floor + log_u + (0 + floor|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n log_u ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n floor|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\ncovariate_results = covariate_model.fit(\n draws=2000, \n tune=2000,\n target_accept=0.9,\n chains=2\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, log_u, floor|county_sigma, floor|county_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:26<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 27 seconds.\n\n\n\ncovariate_model.graph()\n\n\n\n\n\nvar_names = [\"floor\", \"log_u\", \"floor|county\", \"floor|county_sigma\", \"log_radon_sigma\"]\naz.plot_trace(\n data=covariate_results,\n var_names=var_names, \n compact=True, \n chain_prop={\"ls\": \"-\"}\n);\n\n\n\n\nLet us now visualize the posterior distributions of the intercepts:\n\n# get log_u values per county\nlog_u_sample = df.groupby([\"county\"])[\"log_u\"].mean().values\n\n# compute the slope posterior samples\nlog_u_slope = covariate_results.posterior[\"log_u\"].values[..., None] * log_u_sample\n\n# Compute the posterior for the floor coefficient when it is Basement\nintercepts = (\n covariate_results.posterior.sel(floor_dim=\"Basement\")[\"floor\"]\n + covariate_results.posterior.sel(floor__expr_dim=\"Basement\")[\"floor|county\"] 
\n).values\n\ny_predicted = (intercepts + log_u_slope).reshape(4000, n_counties).T\n\n# reduce the intercepts posterior samples to the mean per county\nmean_intercept = intercepts.mean(axis=2)[..., None] + log_u_slope\n\n\nfig, ax = plt.subplots()\n\ny_predicted_bounds = np.quantile(y_predicted, q=[0.03, 0.96], axis=1)\n\nsns.scatterplot(\n x=log_u_sample,\n y=y_predicted.mean(axis=1),\n alpha=0.8,\n color=\"C0\",\n s=50,\n label=\"Mean county-intercept\",\n ax=ax\n)\nax.vlines(log_u_sample, y_predicted_bounds[0], y_predicted_bounds[1], color=\"C1\", alpha=0.5)\n\naz.plot_hdi(\n x=log_u_sample,\n y=mean_intercept,\n color=\"black\",\n fill_kwargs={\"alpha\": 0.1, \"label\": \"Mean intercept HPD\"},\n ax=ax\n)\n\nsns.lineplot(\n x=log_u_sample,\n y=mean_intercept.reshape(4000, n_counties).mean(axis=0),\n color=\"black\",\n alpha=0.6,\n label=\"Mean intercept\",\n ax=ax\n)\n\nax.legend(loc=\"upper left\")\nax.set(\n title=\"County Intercepts (Covariance Model)\",\n xlabel=\"County-level log uranium\",\n ylabel=\"Intercept estimate\"\n);\n\n\n\n\n\n\n\n\n\n\n\nLet us dig deeper into the model comparison for the pooled, unpooled, and partial pooling models. To do so we are generate predictions for each model ad county level, where we aggregate by taking the mean, and plot them against the observed values.\n\n# generate posterior predictive samples\npooled_model.predict(pooled_results, kind=\"pps\")\nunpooled_model.predict(unpooled_results, kind=\"pps\")\npartial_pooling_model.predict(partial_pooling_results, kind=\"pps\")\n\n# stack chain and draw values\npooled_pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\nunpooled_pps = az.extract_dataset(unpooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\npartial_pooling_pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n# Generate predictions per county\npooled_pps_df = pd.DataFrame(data=pooled_pps).assign(county=df[\"county\"])\ny_pred_pooled = pooled_pps_df.groupby(\"county\").mean().mean(axis=1)\n\nunpooled_pps_df = pd.DataFrame(data=unpooled_pps).assign(county=df[\"county\"])\ny_pred_unpooled = unpooled_pps_df.groupby(\"county\").mean().mean(axis=1)\n\npartial_pooling_pps_df = pd.DataFrame(data=partial_pooling_pps).assign(county=df[\"county\"])\ny_pred_partial_pooling = partial_pooling_pps_df.groupby(\"county\").mean().mean(axis=1)\n\n# observed values\ny_sample = df.groupby(\"county\")[\"log_radon\"].mean()\n\n/tmp/ipykernel_29247/54649629.py:7: FutureWarning: extract_dataset has been deprecated, please use extract\n pooled_pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n/tmp/ipykernel_29247/54649629.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n unpooled_pps = az.extract_dataset(unpooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n/tmp/ipykernel_29247/54649629.py:9: FutureWarning: extract_dataset has been deprecated, please use extract\n partial_pooling_pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\n\nfig, ax = plt.subplots(figsize=(8, 8))\n\nsns.regplot(x=y_sample, y=y_pred_pooled, label=\"pooled\", color=\"C0\", ax=ax)\nsns.regplot(x=y_sample, y=y_pred_unpooled, label=\"unpooled\", color=\"C1\", ax=ax)\nsns.regplot(x=y_sample, y=y_pred_partial_pooling, label=\"partial pooling\", color=\"C2\", ax=ax)\nax.axhline(y=df[\"log_radon\"].mean(), color=\"C0\", 
linestyle=\"--\", label=\"sample mean\")\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.axhline(\n y=y_pred_partial_pooling.mean(), color=\"C3\",\n linestyle=\"--\", label=\"predicted global mean (partial pooling)\"\n)\nax.legend(loc=\"upper center\", bbox_to_anchor=(0.5, -0.1), ncol=2)\nax.set(\n title=\"log(Radon + 0.1) Mean per County - Model Comparison\",\n xlabel=\"observed (sample)\",\n ylabel=\"prediction\",\n xlim=(0.2, 2.8),\n ylim=(0.2, 2.8),\n);\n\n\n\n\n\nThe pooled model consider all the counties together, this explains why the predictions do not vary at county level. This is represented by the almost-flat line in the plot above (blue).\nOn the other hand, the unpooled model considers each county separately, so the prediction is very close to the observation mean. This is represented by the line very close to the diagonal (orange).\nThe partial pooling model is mixing global and information at county level. This is clearly seen by how corresponding (green) line is in between the pooling and unpooling lines.\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\nseaborn : 0.12.2\nmatplotlib: 3.6.2\nbambi : 0.9.3\narviz : 0.14.0\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\npymc : 5.0.1\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/circular_regression.html", - "href": "notebooks/circular_regression.html", + "objectID": "notebooks/wald_gamma_glm.html", + "href": "notebooks/wald_gamma_glm.html", "title": "Bambi", "section": "", - "text": "Circular Regression\n\nimport arviz as az\nimport bambi as bmb\nfrom matplotlib.lines import Line2D\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\n\n\naz.style.use(\"arviz-white\")\n\nDirectional statistics, also known as circular statistics or spherical statistics, refers to a branch of statistics dealing with data which domain is the unit circle, as opposed to “linear” data which support is the real line. Circular data is convenient when dealing with directions or rotations. Some examples include temporal periods like hours or days, compass directions, dihedral angles in biomolecules, etc.\nThe fact that a Sunday can be both the day before or after a Monday, or that 0 is a “better average” for 2 and 358 degrees than 180 are illustrations that circular data and circular statistical methods are better equipped to deal with this kind of problem than the more familiar methods 1.\nThere are a few circular distributions, one of them is the VonMises distribution, that we can think as the cousin of the Gaussian that lives in circular space. The domain of this distribution is any interval of length \\(2\\pi\\). We are going to adopt the convention that the interval goes from \\(-\\pi\\) to \\(\\pi\\), so for example 0 radians is the same as \\(2\\pi\\). The VonMises is defined using two parameters, the mean \\(\\mu\\) (the circular mean) and the concentration \\(\\kappa\\), with \\(\\frac{1}{\\kappa}\\) being analogue of the variance. 
Let see a few example of the VonMises family:\n\nx = np.linspace(-np.pi, np.pi, 200)\nmus = [0., 0., 0., -2.5]\nkappas = [.001, 0.5, 3, 0.5]\nfor mu, kappa in zip(mus, kappas):\n pdf = stats.vonmises.pdf(x, kappa, loc=mu)\n plt.plot(x, pdf, label=r'$\\mu$ = {}, $\\kappa$ = {}'.format(mu, kappa))\nplt.yticks([])\nplt.legend(loc=1);\n\n\n\n\nWhen doing linear regression a commonly used link function is \\(2 \\arctan(u)\\) this ensure that values over the real line are mapped into the interval \\([-\\pi, \\pi]\\)\n\nu = np.linspace(-12, 12, 200)\nplt.plot(u, 2*np.arctan(u))\nplt.xlabel(\"Reals\")\nplt.ylabel(\"Radians\");\n\n\n\n\nBambi supports circular regression with the VonMises family, to exemplify this we are going to use a dataset from the following experiment. 31 periwinkles (a kind of sea snail) were removed from it original place and released down shore. Then, our task is to model the direction of motion as function of the distance travelled by them after being release.\n\ndata = bmb.load_data(\"periwinkles\")\ndata.head()\n\n\n\n\n\n \n \n \n distance\n direction\n \n \n \n \n 0\n 107\n 1.169371\n \n \n 1\n 46\n 1.151917\n \n \n 2\n 33\n 1.291544\n \n \n 3\n 67\n 1.064651\n \n \n 4\n 122\n 1.012291\n \n \n\n\n\n\nJust to compare results, we are going to use the VonMises family and the normal (default) family.\n\nmodel_vm = bmb.Model(\"direction ~ distance\", data, family=\"vonmises\")\nidata_vm = model_vm.fit(include_mean=True)\n\nmodel_n = bmb.Model(\"direction ~ distance\", data)\nidata_n = model_n.fit(include_mean=True)\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. 
As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i1 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:694: UserWarning: Rewrite warning: The Op i0 does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.\n warn(\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [direction_kappa, Intercept, distance]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 6 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [direction_sigma, Intercept, distance]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\n\n\n\naz.summary(idata_vm, var_names=[\"~direction_mean\"])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.667\n 0.325\n 1.069\n 2.253\n 0.011\n 0.008\n 974.0\n 806.0\n 1.0\n \n \n distance\n -0.010\n 0.004\n -0.018\n -0.002\n 0.000\n 0.000\n 1168.0\n 1170.0\n 1.0\n \n \n direction_kappa\n 2.601\n 0.590\n 1.528\n 3.699\n 0.015\n 0.011\n 1499.0\n 1277.0\n 1.0\n \n \n\n\n\n\n\n_, ax = plt.subplots(1,2, figsize=(8, 4), sharey=True)\nposterior_mean = bmb.families.link.tan_2(idata_vm.posterior[\"direction_mean\"])\nax[0].plot(data.distance, posterior_mean.mean((\"chain\", \"draw\")))\naz.plot_hdi(data.distance, posterior_mean, ax=ax[0])\n\nax[0].plot(data.distance, data.direction, \"k.\")\nax[0].set_xlabel(\"Distance travelled (in m)\")\nax[0].set_ylabel(\"Direction of travel (radians)\")\nax[0].set_title(\"VonMises Family\")\n\nposterior_mean = idata_n.posterior[\"direction_mean\"]\nax[1].plot(data.distance, posterior_mean.mean((\"chain\", \"draw\")))\naz.plot_hdi(data.distance, posterior_mean, ax=ax[1])\n\nax[1].plot(data.distance, data.direction, \"k.\")\nax[1].set_xlabel(\"Distance travelled (in m)\")\nax[1].set_title(\"Normal Family\");\n\n\n\n\nWe can see that there is a negative relationship between distance and direction. This could be explained as Periwinkles travelling in a direction towards the sea travelled shorter distances than those travelling in directions away from it. 
From a biological perspective, this could have been due to a propensity of the periwinkles to stop moving once they are close to the sea.\nWe can also see that if inadvertently we had assumed a normal response we would have obtained a fit with higher uncertainty and more importantly the wrong sign for the relationship.\nAs a last step for this example we are going to do a posterior predictive check. In the figure below we have to panels showing the same data, with the only difference that the on the right is using a polar projection and the KDE are computing taking into account the circularity of the data.\nWe can see that our modeling is failing at capturing the bimodality in the data (with mode around 1.6 and \\(\\pm \\pi\\)) and hence the predicted distribution is wider and with a mean closer to \\(\\pm \\pi\\).\n\nfig = plt.figure(figsize=(12, 5))\nax0 = plt.subplot(121)\nax1 = plt.subplot(122, projection='polar')\n\nmodel_vm.predict(idata_vm, kind=\"pps\")\npp_samples = az.extract_dataset(idata_vm, group=\"posterior_predictive\", num_samples=200)[\"direction\"]\ncolors = [\"C0\" , \"k\", \"C1\"]\n\nfor ax, circ in zip((ax0, ax1), (False, \"radians\", colors)):\n for s in pp_samples:\n az.plot_kde(s.values, plot_kwargs={\"color\":colors[0], \"alpha\": 0.25}, is_circular=circ, ax=ax)\n az.plot_kde(idata_vm.observed_data[\"direction\"].values,\n plot_kwargs={\"color\":colors[1], \"lw\":3}, is_circular=circ, ax=ax)\n az.plot_kde(idata_vm.posterior_predictive[\"direction\"].values,\n plot_kwargs={\"color\":colors[2], \"ls\":\"--\", \"lw\":3}, is_circular=circ, ax=ax)\n\ncustom_lines = [Line2D([0], [0], color=c) for c in colors]\n\nax0.legend(custom_lines, [\"posterior_predictive\", \"Observed\", 'mean posterior predictive'])\nax0.set_yticks([])\nfig.suptitle(\"Directions (radians)\", fontsize=18);\n\n/tmp/ipykernel_21333/4056881271.py:6: FutureWarning: extract_dataset has been deprecated, please use extract\n pp_samples = az.extract_dataset(idata_vm, group=\"posterior_predictive\", num_samples=200)[\"direction\"]\n\n\n\n\n\nWe have shown an example of regression where the response variable is circular and the covariates are linear. This is sometimes refereed as linear-circular regression in order to distinguish it from other cases. Namely, when the response is linear and the covariates (or at least one of them) is circular the name circular-linear regression is often used. And when both covariates and the response variables are circular, we have a circular-circular regression. When the covariates are circular they are usually modelled with the help of sin and cosine functions. You can read more about this kind of regression and other circular statistical methods in the following books.\n\nCircular statistics in R\nModern directional statistics\nApplied Directional Statistics\nDirectional Statistics\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nscipy : 1.9.3\nbambi : 0.9.3\nnumpy : 1.23.5\narviz : 0.14.0\npandas : 1.5.2\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\n\n\nIn this notebook we use a data set consisting of 67856 insurance policies and 4624 (6.8%) claims in Australia between 2004 and 2005. 
The original source of this dataset is the book Generalized Linear Models for Insurance Data by Piet de Jong and Gillian Z. Heller.\n\ndata = bmb.load_data(\"carclaims\")\ndata.head()\n\n\n\n\n\n \n \n \n veh_value\n exposure\n clm\n numclaims\n claimcst0\n veh_body\n veh_age\n gender\n area\n agecat\n \n \n \n \n 0\n 1.06\n 0.303901\n 0\n 0\n 0.0\n HBACK\n 3\n F\n C\n 2\n \n \n 1\n 1.03\n 0.648871\n 0\n 0\n 0.0\n HBACK\n 2\n F\n A\n 4\n \n \n 2\n 3.26\n 0.569473\n 0\n 0\n 0.0\n UTE\n 2\n F\n E\n 2\n \n \n 3\n 4.14\n 0.317591\n 0\n 0\n 0.0\n STNWG\n 2\n F\n D\n 2\n \n \n 4\n 0.72\n 0.648871\n 0\n 0\n 0.0\n HBACK\n 4\n F\n C\n 2\n \n \n\n\n\n\nLet’s see the meaning of the variables before creating any plot or fitting any model.\n\nveh_value: Vehicle value, ranges from \\$0 to \\$350,000.\nexposure: Proportion of the year where the policy was exposed. In practice each policy is not exposed for the full year. Some policies come into force partly into the year while others are canceled before the year’s end.\nclm: Claim occurrence. 0 (no), 1 (yes).\nnumclaims: Number of claims.\nclaimcst0: Claim amount. 0 if no claim. Ranges from \\$200 to \\$55922.\nveh_body: Vehicle body type. Can be one of bus, convertible, coupe, hatchback, hardtop, motorized caravan/combi, minibus, panel van, roadster, sedan, station wagon, truck, and utility.\nveh_age: Vehicle age. 1 (new), 2, 3, and 4.\ngender: Gender of the driver. M (Male) and F (Female).\narea: Driver’s area of residence. Can be one of A, B, C, D, E, and F.\nagecat: Driver’s age category. 1 (youngest), 2, 3, 4, 5, and 6.\n\nThe variable of interest is the claim amount, given by \"claimcst0\". We keep the records where there is a claim, so claim amount is greater than 0.\n\ndata = data[data[\"claimcst0\"] > 0]\n\nFor clarity, we only show those claims amounts below \\$15,000, since there are only 65 records above that threshold.\n\ndata[data[\"claimcst0\"] > 15000].shape[0]\n\n65\n\n\n\nplt.hist(data[data[\"claimcst0\"] <= 15000][\"claimcst0\"], bins=30)\nplt.title(\"Distribution of claim amount\")\nplt.xlabel(\"Claim amount ($)\");\n\n\n\n\nAnd this is when you say: “Oh, there really are ugly right-skewed distributions out there!”. Well, yes, we’ve all been there :)\nIn this case we are going to fit GLMs with a right-skewed distribution for the random component. This time we will be using Wald and Gamma distributions. One of their differences is that the variance is proportional to the cubic mean in the case of the Wald distribution, and proportional to the squared mean in the case of the Gamma distribution.\n\n\n\nThe Wald family (a.k.a inverse Gaussian model) states that\n\\[\n\\begin{array}{cc}\ny_i \\sim \\text{Wald}(\\mu_i, \\lambda) & g(\\mu_i) = \\mathbf{x}_i^T\\beta\n\\end{array}\n\\]\nwhere the pdf of a Wald distribution is given by\n\\[\nf(x|\\mu, \\lambda) =\n\\left(\\frac{\\lambda}{2\\pi}\\right)^{1/2}x^{-3/2}\\exp\\left\\{ -\\frac{\\lambda}{2x} \\left(\\frac{x - \\mu}{\\mu} \\right)^2 \\right\\}\n\\]\nfor \\(x > 0\\), mean \\(\\mu > 0\\) and \\(\\lambda > 0\\) is the shape parameter. The variance is given by \\(\\sigma^2 = \\mu^3/\\lambda\\). 
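To get a feel for what the Wald's cubic mean-variance relationship implies compared with the Gamma's quadratic one, here is a tiny numeric illustration (a sketch with arbitrary shape parameters, not part of the analysis):

```python
import numpy as np

mu = np.array([1.0, 2.0, 4.0])
lam, alpha = 1.0, 1.0  # arbitrary shape parameters, for illustration only

print(mu**3 / lam)    # Wald variance: doubling the mean multiplies the variance by 8
print(mu**2 / alpha)  # Gamma variance: doubling the mean multiplies the variance by 4
```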
The canonical link is \\(g(\\mu_i) = \\mu_i^{-2}\\), but \\(g(\\mu_i) = \\log(\\mu_i)\\) is usually preferred, and it is what we use here.\n\n\n\nThe default parametrization of the Gamma density function is\n\\[\n\\displaystyle f(x | \\alpha, \\beta) = \\frac{\\beta^\\alpha x^{\\alpha -1} e^{-\\beta x}}{\\Gamma(\\alpha)}\n\\]\nwhere \\(x > 0\\), and \\(\\alpha > 0\\) and \\(\\beta > 0\\) are the shape and rate parameters, respectively.\nBut GLMs model the mean of the function, so we need to use an alternative parametrization where\n\\[\n\\begin{array}{ccc}\n\\displaystyle \\mu = \\frac{\\alpha}{\\beta} & \\text{and} & \\displaystyle \\sigma^2 = \\frac{\\alpha}{\\beta^2}\n\\end{array}\n\\]\nand thus we have\n\\[\n\\begin{array}{cccc}\ny_i \\sim \\text{Gamma}(\\mu_i, \\sigma_i), & g(\\mu_i) = \\mathbf{x}_i^T\\beta, & \\text{and} & \\sigma_i = \\mu_i^2/\\alpha\n\\end{array}\n\\]\nwhere \\(\\alpha\\) is the shape parameter in the original parametrization of the gamma pdf. The canonical link is \\(g(\\mu_i) = \\mu_i^{-1}\\), but here we use \\(g(\\mu_i) = \\log(\\mu_i)\\) again.\n\n\n\nIn this example we are going to use the binned age, the gender, and the area of residence to predict the amount of the claim, conditional on the existence of the claim because we are only working with observations where there is a claim.\n\"agecat\" is interpreted as a numeric variable in our data frame, but we know it is categorical, and we wouldn’t be happy if our model takes it as if it was numeric, would we?\nWe have two alternatives to tell Bambi that this numeric variable must be treated as categorical. The first one is to wrap the name of the variable with C(), and the other is to pass the same name to the categorical argument when we create the model. We are going to use the first approach with the Wald family and the second with the Gamma.\nThe C() notation is taken from Patsy and is encouraged when you want to explicitly pass the order of the levels of the variables. 
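If you prefer to fix the order outside the formula, a third option is to make the column an ordered pandas categorical before building the model (a sketch, assuming the six age categories described above):

```python
import pandas as pd

# Sketch: fix the level order up front by making agecat an ordered categorical
data["agecat"] = pd.Categorical(data["agecat"], categories=[1, 2, 3, 4, 5, 6], ordered=True)
```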
If you are happy with the default order, better pass the name to categorical so tables and plots have prettier labels :)\n\n\n\nmodel_wald = bmb.Model(\"claimcst0 ~ C(agecat) + gender + area\", data, family = \"wald\", link = \"log\")\nfitted_wald = model_wald.fit(tune=2000, target_accept=0.9, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [claimcst0_lam, Intercept, C(agecat), gender, area]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:17<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 17 seconds.\n\n\n\naz.plot_trace(fitted_wald);\n\n\n\n\n\naz.summary(fitted_wald)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 7.719\n 0.097\n 7.524\n 7.881\n 0.004\n 0.003\n 723.0\n 973.0\n 1.0\n \n \n C(agecat)[2]\n -0.164\n 0.103\n -0.362\n 0.014\n 0.004\n 0.003\n 670.0\n 867.0\n 1.0\n \n \n C(agecat)[3]\n -0.259\n 0.098\n -0.442\n -0.075\n 0.004\n 0.003\n 757.0\n 1077.0\n 1.0\n \n \n C(agecat)[4]\n -0.264\n 0.098\n -0.441\n -0.080\n 0.004\n 0.003\n 729.0\n 1056.0\n 1.0\n \n \n C(agecat)[5]\n -0.377\n 0.106\n -0.582\n -0.191\n 0.004\n 0.003\n 767.0\n 1142.0\n 1.0\n \n \n C(agecat)[6]\n -0.319\n 0.123\n -0.550\n -0.088\n 0.004\n 0.003\n 897.0\n 1379.0\n 1.0\n \n \n gender[M]\n 0.154\n 0.051\n 0.046\n 0.242\n 0.001\n 0.001\n 2325.0\n 1571.0\n 1.0\n \n \n area[B]\n -0.028\n 0.071\n -0.151\n 0.110\n 0.002\n 0.001\n 1582.0\n 1584.0\n 1.0\n \n \n area[C]\n 0.075\n 0.067\n -0.057\n 0.193\n 0.002\n 0.001\n 1652.0\n 1352.0\n 1.0\n \n \n area[D]\n -0.018\n 0.087\n -0.176\n 0.153\n 0.002\n 0.002\n 1779.0\n 1684.0\n 1.0\n \n \n area[E]\n 0.154\n 0.101\n -0.028\n 0.351\n 0.003\n 0.002\n 1632.0\n 1394.0\n 1.0\n \n \n area[F]\n 0.372\n 0.129\n 0.136\n 0.615\n 0.003\n 0.002\n 1878.0\n 1345.0\n 1.0\n \n \n claimcst0_lam\n 723.159\n 15.695\n 693.002\n 751.738\n 0.306\n 0.217\n 2630.0\n 1577.0\n 1.0\n \n \n\n\n\n\nIf we look at the agecat variable, we can see the log mean of the claim amount tends to decrease when the age of the person increases, with the exception of the last category where we can see a slight increase in the mean of the coefficient (-0.307 vs -0.365 of the previous category). However, these differences only represent a slight tendency because of the large overlap between the marginal posteriors for these coefficients (see overlaid density plots for C(agecat).\nThe posterior for gender tells us that the claim amount tends to be larger for males than for females, with the mean being 0.153 and the credible interval ranging from 0.054 to 0.246.\nFinally, from the marginal posteriors for the areas, we can see that F is the only area that clearly stands out, with a higher mean claim amount than in the rest. 
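We can make this comparison concrete with a posterior probability (a sketch; it assumes the coefficient coordinate is named "area_dim", which is Bambi's usual naming convention):

```python
area_post = fitted_wald.posterior["area"]
# Probability that the effect of area F is larger than that of area E, using the joint posterior
prob_f_gt_e = (area_post.sel(area_dim="F") > area_post.sel(area_dim="E")).mean().item()
print(f"P(area F effect > area E effect) = {prob_f_gt_e:.3f}")
```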
Area E may also have a higher claim amount, but this difference with the other areas is not as evident as it happens with F.\n\n\n\n\nmodel_gamma = bmb.Model(\n \"claimcst0 ~ agecat + gender + area\",\n data,\n family=\"gamma\",\n link=\"log\",\n categorical=\"agecat\",\n)\nfitted_gamma = model_gamma.fit(tune=2000, target_accept=0.9, idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [claimcst0_alpha, Intercept, agecat, gender, area]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:24<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 25 seconds.\n\n\n\naz.plot_trace(fitted_gamma);\n\n\n\n\n\naz.summary(fitted_gamma)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 7.717\n 0.063\n 7.591\n 7.825\n 0.002\n 0.001\n 891.0\n 1280.0\n 1.0\n \n \n agecat[2]\n -0.181\n 0.064\n -0.309\n -0.064\n 0.002\n 0.001\n 949.0\n 1151.0\n 1.0\n \n \n agecat[3]\n -0.275\n 0.063\n -0.395\n -0.164\n 0.002\n 0.001\n 966.0\n 1342.0\n 1.0\n \n \n agecat[4]\n -0.269\n 0.063\n -0.388\n -0.155\n 0.002\n 0.001\n 900.0\n 1406.0\n 1.0\n \n \n agecat[5]\n -0.389\n 0.071\n -0.522\n -0.255\n 0.002\n 0.002\n 1059.0\n 1358.0\n 1.0\n \n \n agecat[6]\n -0.314\n 0.078\n -0.459\n -0.161\n 0.002\n 0.001\n 1367.0\n 1546.0\n 1.0\n \n \n gender[M]\n 0.166\n 0.034\n 0.101\n 0.225\n 0.001\n 0.000\n 2965.0\n 1448.0\n 1.0\n \n \n area[B]\n -0.023\n 0.050\n -0.123\n 0.062\n 0.001\n 0.001\n 1601.0\n 1709.0\n 1.0\n \n \n area[C]\n 0.071\n 0.045\n -0.013\n 0.156\n 0.001\n 0.001\n 1359.0\n 1514.0\n 1.0\n \n \n area[D]\n -0.017\n 0.063\n -0.132\n 0.106\n 0.001\n 0.001\n 1838.0\n 1558.0\n 1.0\n \n \n area[E]\n 0.152\n 0.067\n 0.026\n 0.273\n 0.002\n 0.001\n 1964.0\n 1596.0\n 1.0\n \n \n area[F]\n 0.371\n 0.076\n 0.235\n 0.521\n 0.002\n 0.001\n 1885.0\n 1467.0\n 1.0\n \n \n claimcst0_alpha\n 0.762\n 0.014\n 0.736\n 0.789\n 0.000\n 0.000\n 3212.0\n 1452.0\n 1.0\n \n \n\n\n\n\nThe interpretation of the parameter posteriors is very similar to what we’ve done for the Wald family. The only difference is that some differences, such as the ones for the area posteriors, are a little more exacerbated here.\n\n\n\n\nWe can perform a Bayesian model comparison very easily with az.compare(). Here we pass a dictionary with the InferenceData objects that Model.fit() returned and az.compare() returns a data frame that is ordered from best to worst according to the criteria used.\n\nmodels = {\"wald\": fitted_wald, \"gamma\": fitted_gamma}\ndf_compare = az.compare(models)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n wald\n 0\n -38581.405635\n 12.882981\n 0.00000\n 1.0\n 106.105576\n 0.000000\n False\n log\n \n \n gamma\n 1\n -39628.995425\n 26.607829\n 1047.58979\n 0.0\n 104.988009\n 35.754616\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nBy default, ArviZ uses loo, which is an estimation of leave one out cross-validation. Another option is the widely applicable information criterion (WAIC). 
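If we wanted the comparison to be based on WAIC instead, we could request it explicitly (a quick sketch reusing the same dictionary of fitted models):

```python
# Same comparison, but using the widely applicable information criterion
df_compare_waic = az.compare(models, ic="waic")
df_compare_waic
```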
Since the results are in the log scale, the better out-of-sample predictive fit is given by the model with the highest value, which is the Wald model.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nmatplotlib: 3.6.2\narviz : 0.14.0\nnumpy : 1.23.5\nbambi : 0.9.3\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/logistic_regression.html", - "href": "notebooks/logistic_regression.html", + "objectID": "notebooks/sleepstudy.html", + "href": "notebooks/sleepstudy.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nThese data are from the 2016 pilot study. The full study consisted of 1200 people, but here we’ve selected the subset of 487 people who responded to a question about whether they would vote for Hillary Clinton or Donald Trump.\n\ndata = bmb.load_data(\"ANES\")\ndata.head()\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 0\n clinton\n 56\n democrat\n \n \n 1\n trump\n 65\n republican\n \n \n 2\n clinton\n 80\n democrat\n \n \n 3\n trump\n 38\n republican\n \n \n 4\n trump\n 60\n republican\n \n \n\n\n\n\nOur outcome variable is vote, which gives peoples’ responses to the following question prompt:\n“If the 2016 presidential election were between Hillary Clinton for the Democrats and Donald Trump for the Republicans, would you vote for Hillary Clinton, Donald Trump, someone else, or probably not vote?”\n\ndata[\"vote\"].value_counts()\n\nclinton 215\ntrump 158\nsomeone_else 48\nName: vote, dtype: int64\n\n\nThe two predictors we’ll examine are a respondent’s age and their political party affiliation, party_id, which is their response to the following question prompt:\n“Generally speaking, do you usually think of yourself as a Republican, a Democrat, an independent, or what?”\n\ndata[\"party_id\"].value_counts()\n\ndemocrat 186\nindependent 138\nrepublican 97\nName: party_id, dtype: int64\n\n\nThese two predictors are somewhat correlated, but not all that much:\n\nfig, ax = plt.subplots(1, 3, figsize=(10, 4), sharey=True, constrained_layout=True)\nkey = dict(zip(data[\"party_id\"].unique(), range(3)))\nfor label, df in data.groupby(\"party_id\"):\n ax[key[label]].hist(df[\"age\"])\n ax[key[label]].set_xlim([18, 90])\n ax[key[label]].set_xlabel(\"Age\")\n ax[key[label]].set_ylabel(\"Frequency\")\n ax[key[label]].set_title(label)\n ax[key[label]].axvline(df[\"age\"].mean(), color=\"C1\")\n\n\n\n\nWe can get a pretty clear idea of how party identification is related to voting intentions by just looking at a contingency table for these two variables:\n\npd.crosstab(data[\"vote\"], data[\"party_id\"])\n\n\n\n\n\n \n \n party_id\n democrat\n independent\n republican\n \n \n vote\n \n \n \n \n \n \n \n clinton\n 159\n 51\n 5\n \n \n someone_else\n 10\n 22\n 16\n \n \n trump\n 17\n 65\n 76\n \n \n\n\n\n\nBut our main question here will be: How is respondent age related to voting intentions, and is this relationship different for different party affiliations? 
For this we will use a logistic regression.\n\n\n\nTo keep this simple, let’s look at only the data from people who indicated that they would vote for either Clinton or Trump, and we’ll model the probability of voting for Clinton.\n\nclinton_data = data.loc[data[\"vote\"].isin([\"clinton\", \"trump\"]), :]\nclinton_data.head()\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 0\n clinton\n 56\n democrat\n \n \n 1\n trump\n 65\n republican\n \n \n 2\n clinton\n 80\n democrat\n \n \n 3\n trump\n 38\n republican\n \n \n 4\n trump\n 60\n republican\n \n \n\n\n\n\n\n\nWe’ll use a logistic regression model to estimate the probability of voting for Clinton as a function of age and party affiliation. We can think we have a response variable \\(Y\\) defined as\n\\[\nY =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person votes for Clinton} \\\\\n 0 & \\textrm{if the person votes for Trump}\n \\end{array}\n\\right.\n\\]\nand we are interested in modelling \\(\\pi = P(Y = 1)\\) (a.k.a. probability of success) based on two explanatory variables, age and party affiliation.\nA logistic regression is a model that links the \\(\\text{logit}(\\pi)\\) to a linear combination of the predictors. In our example, we’re going to include a main effect for party affiliation and the interaction effect between party affiliation and age (i.e. we’ll have a different age slope for each affiliation). The mathematical equation for our model is\n$$\n\\[\\begin{aligned}\n \\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} &=\n \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3 X_4 + \\beta_4 X_1 X_4 + \\beta_5 X_2 X_4 \\\\\n\n X_1 &= \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Independent} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_2 &= \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Republican} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if party affiliation is Democrat} \\\\\n 0 & \\textrm{in other case}\n \\end{array}\n \\right. \\\\\n\n X_4 &= \\text{Age}\n\\end{aligned}\\]\n$$\nNotice we don’t have a main effect for \\(X_3\\). This happens because Democrat party affiliation is being taken as baseline in the encoding of the categorical variable party_id and \\(\\beta_1\\) and \\(\\beta_2\\) represent deviations from that baseline. Thus, we see the main effect of Democrat affiliation is being represented by the Intercept, \\(\\beta_0\\).\nIf we represent the right hand side of the model equation with \\(\\eta\\), the expression can be re-arranged to express our probability of interest, \\(\\pi\\), as a function of the linear predictor \\(\\eta\\).\n\\[\\pi = \\frac{e^\\eta}{1 + e^\\eta}= \\frac{1}{1 + e^{-\\eta}}\\]\nSince we’re Bayesian folks who draw samples from posteriors, we need to specify a prior for the parameters as well as a likelihood function before accomplishing our task. In this occasion, we’re going to use the default priors in Bambi and just note the likelihood is the product of \\(n\\) Bernoulli trials, \\(\\prod_{i=1}^{n}{p_i^y(1-p_i)^{1-y_i}}\\) where \\(p_i = P(Y=1)\\) and \\(y_i = 1\\) if the vote intention is for Clinton and \\(y_i = 0\\) if Trump.\n\n\n\nSpecifying and fitting the model is simple. Bambi is good and doesn’t ask us to translate all the math to code. We just need to specify our model using the formula syntax and pass the correct family argument. 
Notice the (optional) syntax that we use on the left-hand-side of the formula: We say vote[clinton] to instruct Bambi that we wish the model the probability that vote=='clinton', rather than the probability that vote=='trump'. If we leave this unspecified, Bambi will just pick one of the events to model, but will inform you which one it picked when you build the model (and again when you look at model summaries).\nOn the right-hand-side of the formula we use party_id + party_id:age to instruct Bambi that we want to use party_id and the interaction between party_id and age as the explanatory variables in the model.\n\n\nclinton_model = bmb.Model(\"vote['clinton'] ~ party_id + party_id:age\", clinton_data, family=\"bernoulli\")\nclinton_fitted = clinton_model.fit(\n draws=2000, target_accept=0.85, random_seed=SEED, idata_kwargs={\"log_likelihood\": True}\n)\n\nModeling the probability that vote==clinton\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, party_id, party_id:age]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:13<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 14 seconds.\n\n\nWe can print the model object to see information about the response distribution, the link function and the priors.\n\nclinton_model\n\n Formula: vote['clinton'] ~ party_id + party_id:age\n Family: bernoulli\n Link: p = logit\n Observations: 373\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 4.3846)\n party_id ~ Normal(mu: [0. 0.], sigma: [5.4007 6.0634])\n party_id:age ~ Normal(mu: [0. 0. 0.], sigma: [0.0938 0.1007 0.1098])\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nUnder the hood, Bambi selected Gaussian priors for all the parameters in the model. By construction, all the priors, except the one for Intercept, are centered around 0, which is consistent with the desired weakly informative behavior. The standard deviation is specific to each parameter.\nSome more info about these default priors can be found in this technical paper.\nWe can also call clinton_model.plot_priors() to visualize the sensitive default priors Bambi has chosen for us.\n\nclinton_model.plot_priors();\n\nSampling: [Intercept, party_id, party_id:age]\n\n\n\n\n\nNow let’s check out the results! 
We get traceplots and density estimates for the posteriors with az.plot_trace() and a summary of the posteriors with az.summary().\n\naz.plot_trace(clinton_fitted, compact=False);\n\n\n\n\n\naz.summary(clinton_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.674\n 0.725\n 0.251\n 2.998\n 0.016\n 0.011\n 2199.0\n 2105.0\n 1.0\n \n \n party_id[independent]\n -0.293\n 0.956\n -2.037\n 1.543\n 0.021\n 0.018\n 2046.0\n 2230.0\n 1.0\n \n \n party_id[republican]\n -1.151\n 1.575\n -4.122\n 1.806\n 0.039\n 0.027\n 1667.0\n 1843.0\n 1.0\n \n \n party_id:age[democrat]\n 0.013\n 0.015\n -0.016\n 0.042\n 0.000\n 0.000\n 2133.0\n 2064.0\n 1.0\n \n \n party_id:age[independent]\n -0.033\n 0.011\n -0.055\n -0.012\n 0.000\n 0.000\n 3257.0\n 2797.0\n 1.0\n \n \n party_id:age[republican]\n -0.080\n 0.036\n -0.153\n -0.018\n 0.001\n 0.001\n 1692.0\n 1546.0\n 1.0\n \n \n\n\n\n\n\n\n\n\nBefore moving forward to inference, we can evaluate the quality of the model’s fit. We will take a look at two different ways of assessing how good is the model’s fit using its predictions.\n\n\nThere is a way of assessing the performance of a model with binary outcomes (such as logistic regression) in a visual way called separation plot. In a separation plot, the model’s predictions are averaged, ordered and represented as consecutive vertical lines. These vertical lines are colored according to the class indicated by their corresponding observed value, in this case light blue indicates class 0 (vote == 'Trump') and blue represents class 1 (vote =='Clinton'). We can use the ArviZ’ implementation of the separation plot, but first we have to obtain the model’s predictions.\n\nclinton_model.predict(clinton_fitted, kind=\"pps\")\n\n\nax = az.plot_separation(clinton_fitted, y='vote', figsize=(9,0.5));\n\n\n\n\nIn this separation plot we can see that some observations are misspredicted, specially in the right hand side of the plot where the model predicts Trump votes when there were really Clinton ones. We can further investigate this using another of ArviZ model evaluation tool.\n\n\n\n\nWe can also use ArviZ to compute LOO and find influential observations using the estimated \\(\\hat \\kappa\\) parameter value.\n\n# compute pointwise LOO\nloo = az.loo(clinton_fitted, pointwise=True)\n\n\n# plot kappa values\naz.plot_khat(loo.pareto_k);\n\n\n\n\nA first look at the khat plot shows that most observations’ \\(\\hat \\kappa\\) values are grouped together in a range that goes up to roughly 0.2. Above that value, we observe some dispersion and a few points that stand out by having the highest \\(\\hat \\kappa\\) values.\nAn observation is influential in the sense that if we refit the data by first removing that observation from the data set, the fitted result will be more different than if we do the same for a non influential observation. Clearly the level of influence of observations can vary continuously. An observation can be influential either because it is an outlier (a measurement error, a data entry error, etc) or because the model is not flexible enough to capture the observation. 
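Before inspecting individual observations, it is worth checking whether any \(\hat \kappa\) crosses the usual reliability threshold, which is discussed next (a quick sketch using the loo object computed above):

```python
# Count observations whose Pareto k estimate exceeds the usual 0.7 warning threshold
n_unreliable = int((loo.pareto_k > 0.7).sum())
print(f"{n_unreliable} observation(s) with Pareto k > 0.7")
```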
The approximations used to compute LOO are no longer reliable for \\(\\hat \\kappa > 0.7\\).\nLet us first take a look at the observation with the highest \\(\\hat \\kappa\\).\n\nax = az.plot_khat(loo.pareto_k.values.ravel())\nsorted_kappas = np.sort(loo.pareto_k.values.ravel())\n\n# find observation where the kappa value exceeds the threshold\nthreshold = sorted_kappas[-1:]\nax.axhline(threshold, ls=\"--\", color=\"orange\")\ninfluential_observations = clinton_data.reset_index()[loo.pareto_k.values >= threshold].index\n\nfor x in influential_observations:\n y = loo.pareto_k.values[x]\n ax.text(x, y + 0.01, str(x), ha=\"center\", va=\"baseline\")\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values >= threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThis observation corresponds to a 95 year old Republican party member that voted for Trump.\n\nLet us take a look at six observations with the highest \\(\\hat \\kappa\\) values.\n\nax = az.plot_khat(loo.pareto_k)\n\n# find observation where the kappa value exceeds the threshold\nthreshold = sorted_kappas[-6:].min()\nax.axhline(threshold, ls=\"--\", color=\"orange\")\ninfluential_observations = clinton_data.reset_index()[loo.pareto_k.values >= threshold].index\n\nfor x in influential_observations:\n y = loo.pareto_k.values[x]\n ax.text(x, y + 0.01, str(x), ha=\"center\", va=\"baseline\")\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values>=threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 34\n 34\n trump\n 83\n republican\n \n \n 58\n 64\n trump\n 84\n republican\n \n \n 62\n 68\n trump\n 91\n republican\n \n \n 87\n 95\n trump\n 80\n republican\n \n \n 191\n 215\n trump\n 95\n republican\n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nObservations number 34, 58, 62, and 191 correspond to individuals in under represented age groups in the data set. The rest correspond to Republican party members that voted for Clinton. Let us check how many observations we have of individuals older than 80 years old.\n\nclinton_data[clinton_data.age>80]\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 34\n trump\n 83\n republican\n \n \n 64\n trump\n 84\n republican\n \n \n 68\n trump\n 91\n republican\n \n \n 97\n clinton\n 83\n democrat\n \n \n 215\n trump\n 95\n republican\n \n \n 246\n clinton\n 82\n democrat\n \n \n 403\n clinton\n 81\n democrat\n \n \n\n\n\n\nLet us check how many observations there are of Republicans who voted for Clinton\n\nclinton_data[(clinton_data.vote =='clinton') & (clinton_data.party_id == 'republican')]\n\n\n\n\n\n \n \n \n vote\n age\n party_id\n \n \n \n \n 170\n clinton\n 27\n republican\n \n \n 248\n clinton\n 36\n republican\n \n \n 359\n clinton\n 22\n republican\n \n \n 361\n clinton\n 37\n republican\n \n \n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThere are only two observations for individuals older than 80 years old and five observations for individuals of the Republican party that vote for Clinton. The fact that the model finds it difficult to predict for these observations is related to model uncertainty, due to a scarce number of observations that exhibit these characteristics.\nLet us repeat the separation plot, this time marking the observations we have analyzed. 
This plot will show us how the model predicted these particular observations.\n\nimport matplotlib.patheffects as pe\n\nax = az.plot_separation(clinton_fitted, y=\"vote\", figsize=(9, 0.5))\n\ny = np.random.uniform(0.1, 0.5, size=len(influential_observations))\n\nfor x, y in zip(influential_observations, y):\n text = str(x)\n x = x / len(clinton_data)\n ax.scatter(x, y, marker=\"+\", s=50, color=\"red\", zorder=3)\n ax.text(\n x, y + 0.1, text, color=\"white\", ha=\"center\", va=\"bottom\",\n path_effects=[pe.withStroke(linewidth=2, foreground=\"black\")]\n )\n\n\n\n\n\nclinton_data.reset_index()[loo.pareto_k.values>=threshold]\n\n\n\n\n\n \n \n \n index\n vote\n age\n party_id\n \n \n \n \n 34\n 34\n trump\n 83\n republican\n \n \n 58\n 64\n trump\n 84\n republican\n \n \n 62\n 68\n trump\n 91\n republican\n \n \n 87\n 95\n trump\n 80\n republican\n \n \n 191\n 215\n trump\n 95\n republican\n \n \n 365\n 410\n clinton\n 55\n republican\n \n \n\n\n\n\nThis assessment helped us to further understand the model and quality of the fit. It also illustrates the intuition that we should be cautious when predicting for under represented age groups and voting behaviours.\n\n\n\nGrab the posteriors samples of the age slopes for the three party_id categories.\n\nparties = [\"democrat\", \"independent\", \"republican\"]\ndem, ind, rep = [clinton_fitted.posterior[\"party_id:age\"].sel({\"party_id:age_dim\":party}) for party in parties]\n\nPlot the marginal posteriors for the age slopes for the three political affiliations.\n\n_, ax = plt.subplots()\nfor idx, x in enumerate([dem, ind, rep]):\n az.plot_dist(x, label=x[\"party_id:age_dim\"].item(), plot_kwargs={\"color\": f\"C{idx}\"}, ax=ax)\nax.legend(loc=\"upper left\");\n\n\n\n\nNow, using the joint posterior, we can answer our questions in terms of probabilities.\nWhat is the probability that the Democrat slope is greater than the Republican slope?\n\n(dem > rep).mean().item()\n\n0.99625\n\n\nProbability that the Democrat slope is greater than the Independent slope?\n\n(dem > ind).mean().item()\n\n0.99125\n\n\nProbability that the Independent slope is greater than the Republican slope?\n\n(ind > rep).mean().item()\n\n0.899\n\n\nProbability that the Democrat slope is greater than 0?\n\n(dem > 0).mean().item()\n\n0.80875\n\n\nProbability that the Republican slope is less than 0?\n\n(rep < 0).mean().item()\n\n0.995\n\n\nProbability that the Independent slope is less than 0?\n\n(ind < 0).mean().item()\n\n0.99875\n\n\nIf we look at the plot of the marginal posteriors, we may be suspicious that, for example, the probability that Democrat slope is greater than the Republican slope is 0.998 (almost 1!), given the overlap between the blue and green density functions. However, we can’t answer such a question using the marginal posteriors only, as shown in the plot. Since Democrat and Republican slopes (\\(\\beta_3\\) and \\(\\beta_5\\), respectively) are random variables, we need to use their joint distribution to answer probability questions that involve both of them. The fact that logical comparisons (e.g. > in dem > ind) are performed elementwise ensures we’re using samples from the joint posterior as we should. We also note that when the question involves only one of the random variables, it is fine to use the marginal distribution (e.g. 
(rep < 0).mean()).\nFinally, all these comments may have not been necessary since we didn’t need to mention anything about marginal or joint distributions when performing the calculations, we’ve just grabbed the samples and applied some basic math. But that’s an advantage of Bambi and the Bayesian approach. Things that are not so simple, became simpler :)\n\n\n\nHere we make use of the Model.predict() method to predict the probability of voting for Clinton for an out-of-sample dataset that we create.\n\nage = np.arange(18, 91)\nnew_data = pd.DataFrame({\n \"age\": np.tile(age, 3),\n \"party_id\": np.repeat([\"democrat\", \"republican\", \"independent\"], len(age))\n})\nnew_data\n\n\n\n\n\n \n \n \n age\n party_id\n \n \n \n \n 0\n 18\n democrat\n \n \n 1\n 19\n democrat\n \n \n 2\n 20\n democrat\n \n \n 3\n 21\n democrat\n \n \n 4\n 22\n democrat\n \n \n ...\n ...\n ...\n \n \n 214\n 86\n independent\n \n \n 215\n 87\n independent\n \n \n 216\n 88\n independent\n \n \n 217\n 89\n independent\n \n \n 218\n 90\n independent\n \n \n\n219 rows × 2 columns\n\n\n\nObtain predictions for the new dataset. By default, Bambi is going to obtain a posterior distribution for the mean probability of voting for Clinton. These values are stored as the \"vote_mean\" variable in clinton_fitted.posterior.\n\nclinton_model.predict(clinton_fitted, data=new_data)\n\n\n# Select a sample of posterior values for the mean probability of voting for Clinton\nvote_posterior = az.extract_dataset(clinton_fitted, num_samples=2000)[\"vote_mean\"]\n\n/tmp/ipykernel_23763/325773600.py:2: FutureWarning: extract_dataset has been deprecated, please use extract\n vote_posterior = az.extract_dataset(clinton_fitted, num_samples=2000)[\"vote_mean\"]\n\n\nMake the plot!\n\n_, ax = plt.subplots(figsize=(7, 5))\n\nfor i, party in enumerate([\"democrat\", \"republican\", \"independent\"]):\n # Which rows in new_data correspond to party?\n idx = new_data.index[new_data[\"party_id\"] == party].tolist()\n ax.plot(age, vote_posterior[idx], alpha=0.04, color=f\"C{i}\")\n\nax.set_ylabel(\"P(vote='clinton' | age)\")\nax.set_xlabel(\"Age\", fontsize=15)\nax.set_ylim(0, 1)\nax.set_xlim(18, 90);\n\n\n\n\nThe following is a rough interpretation of the information contained in the plot we’ve just created.\nAccording to our logistic model, the mean probability of voting for Clinton is almost always 0.8 or greater for Democrats no matter the age (blue line). Also, the older the person, the closer the mean probability of voting Clinton to 1.\nOn the other hand, Republicans have a non-zero probability of voting for Clinton when they are young, but it tends to zero for older persons (green line). We can also note the high variability of P(vote = ‘Clinton’) for young Republicans. This reflects our high uncertainty when estimating this probability and it is due to the small amount of Republicans in that age range plus there are only 5 Republicans out of 97 voting for Clinton in the dataset.\nFinally, the mean probability of voting Clinton for the independents is around 0.7 for the youngest and decreases towards 0.2 as they get older (orange line). 
Since the spread of the lines is similar along all the ages, we can conclude our uncertainty in this estimate is similar for all the age groups.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 1.5.2\nmatplotlib: 3.6.2\nnumpy : 1.23.5\narviz : 0.14.0\nbambi : 0.9.3\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\nIn this example we are going to use sleepstudy dataset. It is derived from the study described in Belenky et al. (2003) and popularized in the lme4 R package. This dataset contains the average reaction time per day (in milliseconds) on a series of tests for the most sleep-deprived group in a sleep deprivation study. The first two days of the study are considered as adaptation and training, the third day is a baseline, and sleep deprivation started after day 3. The subjects in this group were restricted to 3 hours of sleep per night.\n\n\nThe sleepstudy dataset can be loaded using the load_data() function:\n\ndata = bmb.load_data(\"sleepstudy\")\ndata\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n 308\n \n \n 1\n 258.7047\n 1\n 308\n \n \n 2\n 250.8006\n 2\n 308\n \n \n 3\n 321.4398\n 3\n 308\n \n \n 4\n 356.8519\n 4\n 308\n \n \n ...\n ...\n ...\n ...\n \n \n 175\n 329.6076\n 5\n 372\n \n \n 176\n 334.4818\n 6\n 372\n \n \n 177\n 343.2199\n 7\n 372\n \n \n 178\n 369.1417\n 8\n 372\n \n \n 179\n 364.1236\n 9\n 372\n \n \n\n180 rows × 3 columns\n\n\n\nThe response variable is Reaction, the average of the reaction time measurements on a given subject for a given day. The two covariates are Days, the number of days of sleep deprivation, and Subject, the identifier of the subject on which the observation was made.\n\n\n\nLet’s get started by displaying the data in a multi-panel layout. There’s a panel for each subject in the study. This allows us to observe and compare the association of Days and Reaction between subjects.\n\ndef plot_data(data):\n fig, axes = plt.subplots(2, 9, figsize=(16, 7.5), sharey=True, sharex=True, dpi=300, constrained_layout=False)\n fig.subplots_adjust(left=0.075, right=0.975, bottom=0.075, top=0.925, wspace=0.03)\n\n axes_flat = axes.ravel()\n\n for i, subject in enumerate(data[\"Subject\"].unique()):\n ax = axes_flat[i]\n idx = data.index[data[\"Subject\"] == subject].tolist()\n days = data.loc[idx, \"Days\"].values\n reaction = data.loc[idx, \"Reaction\"].values\n\n # Plot observed data points\n ax.scatter(days, reaction, color=\"C0\", ec=\"black\", alpha=0.7)\n\n # Add a title\n ax.set_title(f\"Subject: {subject}\", fontsize=14)\n\n ax.xaxis.set_ticks([0, 2, 4, 6, 8])\n fig.text(0.5, 0.02, \"Days\", fontsize=14)\n fig.text(0.03, 0.5, \"Reaction time (ms)\", rotation=90, fontsize=14, va=\"center\")\n\n return axes\n\n\nplot_data(data);\n\n\n\n\nFor most of the subjects, there’s a clear positive association between Days and Reaction time. Reaction times increase as people accumulate more days of sleep deprivation. Participants differ in the initial reaction times as well as in the association between sleep deprivation and reaction time. Reaction times increase faster for some subjects and slower for others. 
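To put a rough number on that between-subject variation before doing any modeling, we can compute a quick per-subject least-squares slope (an exploratory sketch only, unrelated to the Bayesian model specified below):

```python
# Ordinary least-squares slope of Reaction on Days, computed separately for each subject
slopes = data.groupby("Subject").apply(
    lambda d: np.polyfit(d["Days"], d["Reaction"], deg=1)[0]
)
print(slopes.describe().round(1))
```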
Finally, the relationship between Days and Reaction time presents some deviations from linearity within the panels, but these are neither substantial nor systematic.\n\n\n\nOur main goal is to measure the association between Days and Reaction times. We are interested both in the common effect across all subjects, as well as the effects associated with each individual. To do this, we’re going to use a hierarchical linear regression model that includes the effect of a common intercept and slope, as well as intercepts and slopes specific to each subject. These types of effects are also known as fixed and random effects in the statistical literature.\nThe model can be written as follows:\n\\[\n\\begin{aligned}\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma) \\\\\n\\mu_i & = \\beta_{\\text{Intercept}[i]} + \\beta_{\\text{Days}[i]}\\text{Days}_i \\\\\n\\beta_{\\text{Intercept}[i]} & = \\beta_{\\text{Intercept}} + \\alpha_{\\text{Intercept}_i}\\\\\n\\beta_{\\text{Days}[i]} & = \\beta_{\\text{Days}} + \\alpha_{\\text{Days}_i}\\\\\n\\end{aligned}\n\\]\nwhere \\(\\beta_{\\text{Intercept}}\\) and \\(\\beta_{\\text{Days}}\\) are the intercept and day slope effects common to all subjects in the study, and \\(\\alpha_{\\text{Intercept}_i}\\) and \\(\\alpha_{\\text{Days}_i}\\) are the subject-specific intercept and slope effects. These group-specific effects represent the deviation of each subject from the average behavior.\nNote we’re not describing the prior distributions for \\(\\beta_{\\text{Intercept}}\\), \\(\\beta_{\\text{Days}}\\), \\(\\alpha_{\\text{Intercept}_i}\\), \\(\\alpha_{\\text{Days}_i}\\), and \\(\\sigma\\) because we’re going to use default priors in Bambi.\nNext, let’s create the Bambi model. Here we use the formula syntax to specify the model in a clear and concise manner. The term on the left side of ~ tells Reaction is the response variable. The Days term on the right-hand side tells we want to include a slope effect for the Days variable common to all subjects. (Days | Subject) indicates the Days slope for a given subject is going to consist of the common slope plus a deviation specific to that subject. The common and subject-specific intercepts are added implicitly. We could suppress them by adding a 0 on the common or the group-specific part of the formula (e.g. 0 + Days + (0 + Days|Subject)).\nIf we wanted subject-specific intercepts, but not subjec-specific slopes we would have written Reaction ~ Days + (1 | Subject) and if we wanted slopes specific to each Subject without including a Subject specific intercept, we would write Reaction ~ Days + (0 + Days | Subject).\nThat’s been quite a long introduction for the model. 
Let’s write it down in code now:\n\nmodel = bmb.Model(\"Reaction ~ 1 + Days + (Days | Subject)\", data, categorical=\"Subject\")\n\nA description of the model and the priors can be obtained by simply printing the model object\n\nmodel\n\n Formula: Reaction ~ 1 + Days + (Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n Reaction_sigma ~ HalfStudentT(nu: 4, sigma: 56.1721)\n\n\nThere we see the formula used to specify the model, the name of the response distribution (Gaussian), the link function (identity), together with the number of observations (180). Below, we have a description of the prior distributions for the different terms in the model. This tells Bambi is using Normal priors for both common and group-specific terms, and a HalfStudentT distribution for the residual error term of the linear regression.\nNow it’s time to hit the inference button. In Bambi, it is as simple as using the .fit() method. This returns an InferenceData object from the ArviZ library. The draws=2000 argument asks the sampler to obtain 2000 draws from the posterior for each chain.\n\nidata = model.fit(draws=2000, random_seed=SEED)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Reaction_sigma, Intercept, Days, 1|Subject_sigma, 1|Subject_offset, Days|Subject_sigma, Days|Subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:29<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 30 seconds.\n\n\n\n\n\nFirst of all, let’s obtain a summary of the posterior distribution of the Intercept and Days effects.\n\naz.summary(idata, var_names=[\"Intercept\", \"Days\"], kind=\"stats\")\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n \n \n \n \n Intercept\n 251.494\n 7.621\n 238.327\n 266.683\n \n \n Days\n 10.467\n 1.686\n 7.411\n 13.720\n \n \n\n\n\n\nOn average, people’s average reaction time at the beginning of the study is between 235 and 265 milliseconds. With every extra day of sleep deprivation, the mean reaction times increase, on average, between 7.2 and 13.9 milliseconds.\nSo far so good with the interpretation of the common effects. It’s quite straightforward and simple. But this analysis would be incomplete and misleading if we don’t evaluate the subject-specific terms we added to the model. These terms are telling us how much subjects differ from each other in terms of the initial reaction time and the association between days of sleep deprivation and reaction times.\nBelow we use ArviZ to obtain a traceplot of the subject-specific intercepts 1|Subject and slopes Days|Subject. This traceplot contains two columns. On the left, we have the posterior distributions that we analyze below, and on the right, we have the draws from the posterior in the order the sampler draw them for us. 
The stationary random pattern, or white noise appearence, tells us the sampler converged and the chains mixed well.\nFrom the range of the posteriors of the subject-specific intercepts we can see the initial mean reaction time for a given subject can differ substantially from the general mean we see in the table above. There’s also a large difference in the slopes. Some subjects see their reaction times increase quite rapidly as they’re deprived from sleep, while others have a better tolerance and get worse more slowly. Finally, from the pink posterior centered at ~ -11, there seems to be one person who gets better at reaction times. Looks like they took this as a serious challenge!\nIn summary, the model is capturing the behavior we saw in the data exploration stage. People differ both in the initial reaction times as well as in how these reaction times are affected by the successive days of sleep deprivation.\n\naz.plot_trace(idata, var_names=[\"1|Subject\", \"Days|Subject\"]);\n\n\n\n\nSo far, we’ve made the following conclusions\n\nPeople’s mean reaction time increase as they are deprived from sleep.\nPeople have different reaction times in the beginning of the study.\nSome people are more affected by sleep deprivation than others.\n\nBut there’s another question we haven’t answered yet: Are the initial reaction times associated with how much the sleep deprivation affects the evolution of reaction times? Let’s create a scatterplot to visualize the joint posterior of the subject-specific intercepts and slopes. This chart uses different colors for the individuals.\n\n# extract a subsample from the posterior and stack the chain and draw dims \nposterior = az.extract(idata, num_samples=500)\n\n_, ax = plt.subplots()\n\nidata.posterior.plot.scatter(\n x=\"1|Subject\", y=\"Days|Subject\",\n hue=\"Subject__factor_dim\",\n add_colorbar=False,\n add_legend=False,\n cmap=\"tab20\",\n edgecolors=None,\n) \n\nax.axhline(c=\"0.25\", ls=\"--\")\nax.axvline(c=\"0.25\", ls=\"--\")\nax.set_xlabel(\"Subject-specific intercept\")\nax.set_ylabel(\"Subject-specific slope\");\n\n\n\n\nIf we look at the bigger picture, i.e omitting the groups, we can conclude there’s no association between the intercept and slope. In other words, having lower or higher intial reaction times does not say anything about how much sleep deprivation affects the average reaction time on a given subject.\nOn the other hand, if we look at the joint posterior for a given individual, we can see a negative correlation between the intercept and the slope. This is telling that, conditional on a given subject, the intercept and slope posteriors are not independent. 
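That within-subject correlation can be computed explicitly (a sketch; it assumes the group-specific terms are stored under the names "1|Subject" and "Days|Subject", the same names used in the traceplot above):

```python
import xarray as xr

post = idata.posterior
# Correlation between each subject's intercept and slope deviations, across posterior draws
within_subject_corr = xr.corr(post["1|Subject"], post["Days|Subject"], dim=("chain", "draw"))
print(within_subject_corr.to_series().round(2))
```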
However, it doesn’t imply anything about the overall relationship between the intercept and the slope, which is what we need if we want to know whether the initial time is associated with how much sleep deprivation affects the reaction time.\nTo conclude with this example, we’re going create the same plot we created in the beginning with the mean regression lines and a credible bands for them.\n\n# Obtain the posterior of the mean\nmodel.predict(idata)\n\n# Plot the data\naxes = plot_data(data)\n\n# Take the posterior of the mean reaction time\nreaction_mean = az.extract(idata)[\"Reaction_mean\"].values\n\nfor subject, ax in zip(data[\"Subject\"].unique(), axes.ravel()):\n\n idx = data.index[data[\"Subject\"]== subject].tolist()\n days = data.loc[idx, \"Days\"].values\n \n # Plot highest density interval / credibility interval\n az.plot_hdi(days, reaction_mean[idx].T[np.newaxis], color=\"C0\", ax=ax)\n \n # Plot mean regression line\n ax.plot(days, reaction_mean[idx].mean(axis=1), color=\"C0\")\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nThe watermark extension is already loaded. To reload it, use:\n %reload_ext watermark\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\narviz : 0.14.0\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\nbambi : 0.9.3\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/alternative_links_binary.html", - "href": "notebooks/alternative_links_binary.html", + "objectID": "notebooks/test_sample_new_groups.html", + "href": "notebooks/test_sample_new_groups.html", "title": "Bambi", "section": "", - "text": "In this example we use a simple dataset to fit a Generalized Linear Model for a binary response using different link functions.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom scipy.special import expit as invlogit\nfrom scipy.stats import norm\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\n\n\nFirst of all, let’s review some concepts. A Generalized Linear Model (GLM) is made of three components.\n1. Random component\nA set of independent and identically distributed random variables \\(Y_i\\). Their (conditional) probability distribution belongs to the same family \\(f\\) with a mean given by \\(\\mu_i\\).\n2. Systematic component (a.k.a linear predictor)\nConstructed by a linear combination of the parameters \\(\\beta_j\\) and explanatory variables \\(x_j\\), represented by \\(\\eta_i\\)\n\\[\n\\eta_i = \\mathbf{x}_i^T\\mathbf{\\beta} = x_{i1}\\beta_1 + x_{i2}\\beta_2 + \\cdots + x_{ip}\\beta_p\n\\]\n3. 
Link function\nA monotone and differentiable function \\(g\\) such that\n\\[\ng(\\mu_i) = \\eta_i = \\mathbf{x}_i^T\\mathbf{\\beta}\n\\] where \\(\\mu_i = E(Y_i)\\)\nAs we can see, this function specifies the link between the random and the systematic components of the model.\nAn important feature of GLMs is that no matter we are modeling a function of \\(\\mu\\) (and not just \\(\\mu\\), unless \\(g\\) is the identity function) is that we can show predictions in terms of the mean \\(\\mu\\) by using the inverse of \\(g\\) on the linear predictor \\(\\eta_i\\)\n\\[\ng^{-1}(\\eta_i) = g^{-1}(\\mathbf{x}_i^T\\mathbf{\\beta}) = \\mu_i\n\\]\nIn Bambi, we can use family=\"bernoulli\" to tell we are modeling a binary variable that follows a Bernoulli distribution and our random component is of the form\n\\[\nY_i =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{with probability } \\pi_i \\\\\n 0 & \\textrm{with probability } 1 - \\pi_i\n \\end{array}\n\\right.\n\\]\nthat has a mean \\(\\mu_i\\) equal to the probability of success \\(\\pi_i\\).\nBy default, this family implies \\(g\\) is the logit function.\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{logit}(\\pi_i) = \\log{\\left( \\frac{\\pi_i}{1 - \\pi_i} \\right)} = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = \\frac{1}{1 + e^{-\\eta}} = \\pi_i\n\\end{array}\n\\]\nBut there are other options available, like the probit and the cloglog link functions.\nThe probit function is the inverse of the cumulative density function of a standard Gaussian distribution\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{probit}(\\pi_i) = \\Phi^{-1}(\\pi_i) = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = \\Phi(\\eta_i) = \\pi_i\n\\end{array}\n\\]\nAnd with the cloglog link function we have\n\\[\n\\begin{array}{lcr} \n \\displaystyle \\text{cloglog}(\\pi_i) = \\log(-\\log(1 - \\pi)) = \\eta_i &\n \\text{ with } &\n \\displaystyle g^{-1}(\\eta) = 1 - \\exp(-\\exp(\\eta_i)) = \\pi_i\n\\end{array}\n\\]\ncloglog stands for complementary log-log and \\(g^{-1}\\) is the cumulative density function of the extreme minimum value distribution.\nLet’s plot them to better understand the implications of what we’re saying.\n\ndef invcloglog(x):\n return 1 - np.exp(-np.exp(x))\n\n\nx = np.linspace(-5, 5, num=200)\n\n# inverse of the logit function\nlogit = invlogit(x)\n\n# cumulative density function of standard gaussian\nprobit = norm.cdf(x)\n\n# inverse of the cloglog function\ncloglog = invcloglog(x)\n\nplt.plot(x, logit, color=\"C0\", lw=2, label=\"Logit\")\nplt.plot(x, probit, color=\"C1\", lw=2, label=\"Probit\")\nplt.plot(x, cloglog, color=\"C2\", lw=2, label=\"CLogLog\")\nplt.axvline(0, c=\"k\", alpha=0.5, ls=\"--\")\nplt.axhline(0.5, c=\"k\", alpha=0.5, ls=\"--\")\nplt.xlabel(r\"$x$\")\nplt.ylabel(r\"$\\pi$\")\nplt.legend();\n\n\n\n\nIn the plot above we can see both the logit and the probit links are symmetric in terms of their slopes at \\(-x\\) and \\(x\\). We can say the function approaches \\(\\pi = 0.5\\) at the same rate as it moves away from it. However, these two functions differ in their tails. The probit link approaches 0 and 1 faster than the logit link as we move away from \\(x=0\\). Just see the orange line is below the blue one for \\(x < 0\\) and it is above for \\(x > 0\\). In other words, the logit function has heavier tails than the probit.\nOn the other hand, the cloglog does not present this symmetry, and we can clearly see it since the green line does not cross the point (0, 0.5). 
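In fact, we can compute exactly where the inverse cloglog reaches 0.5, using the invcloglog helper defined above:

```python
# The inverse cloglog crosses 0.5 at x = log(log(2)) ≈ -0.367, not at x = 0
x_half = np.log(np.log(2))
print(x_half, invcloglog(x_half))
```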
This function approaches faster the 1 than 0 as we move away from \\(x=0\\).\n\n\n\nWe use a data set consisting of the numbers of beetles dead after five hours of exposure to gaseous carbon disulphide at various concentrations. This data can be found in An Introduction to Generalized Linear Models by A. J. Dobson and A. G. Barnett, but the original source is (Bliss, 1935).\n\n\n\n\n\n\n\n\nDose, \\(x_i\\) (\\(\\log_{10}\\text{CS}_2\\text{mgl}^{-1}\\))\nNumber of beetles, \\(n_i\\)\nNumber killed, \\(y_i\\)\n\n\n\n\n1.6907\n59\n6\n\n\n1.7242\n60\n13\n\n\n1.7552\n62\n18\n\n\n1.7842\n56\n28\n\n\n1.8113\n63\n52\n\n\n1.8369\n59\n53\n\n\n1.8610\n62\n61\n\n\n1.8839\n60\n60\n\n\n\nWe create a data frame where the data is in long format (i.e. each row is an observation with a 0-1 outcome).\n\nx = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])\nn = np.array([59, 60, 62, 56, 63, 59, 62, 60])\ny = np.array([6, 13, 18, 28, 52, 53, 61, 60])\n\ndata = pd.DataFrame({\"x\": x, \"n\": n, \"y\": y})\n\n\n\n\nBambi has two families to model binary data: Bernoulli and Binomial. The first one can be used when each row represents a single observation with a column containing the binary outcome, while the second is used when each row represents a group of observations or realizations and there’s one column for the number of successes and another column for the number of trials.\nSince we have aggregated data, we’re going to use the Binomial family. This family requires using the function proportion(y, n) on the left side of the model formula to indicate we want to model the proportion between two variables. This function can be replaced by any of its aliases prop(y, n) or p(y, n). Let’s use the shortest one here.\n\nformula = \"p(y, n) ~ x\"\n\n\n\nThe logit link is the default link when we say family=\"binomial\", so there’s no need to add it.\n\nmodel_logit = bmb.Model(formula, data, family=\"binomial\")\nidata_logit = model_logit.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n\n\n\n\n\n\nmodel_probit = bmb.Model(formula, data, family=\"binomial\", link=\"probit\")\nidata_probit = model_probit.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:05<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.\n\n\n\n\n\n\nmodel_cloglog = bmb.Model(formula, data, family=\"binomial\", link=\"cloglog\")\nidata_cloglog = model_cloglog.fit(draws=2000)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 4 seconds.\n\n\n\n\n\n\nWe can use the samples from the posteriors to see the mean estimate for the probability of dying at each concentration level. To do so, we use a little helper function that will help us to write less code. 
This function leverages the power of the new Model.predict() method that is helpful to obtain both in-sample and out-of-sample predictions.\n\ndef get_predictions(model, idata, seq):\n # Create a data frame with the new data\n new_data = pd.DataFrame({\"x\": seq})\n \n # Predict probability of dying using out of sample data\n model.predict(idata, data=new_data)\n \n # Get posterior mean across all chains and draws\n mu = idata.posterior[\"p(y, n)_mean\"].mean((\"chain\", \"draw\"))\n return mu\n\n\nx_seq = np.linspace(1.6, 2, num=200)\n\nmu_logit = get_predictions(model_logit, idata_logit, x_seq)\nmu_probit = get_predictions(model_probit, idata_probit, x_seq)\nmu_cloglog = get_predictions(model_cloglog, idata_cloglog, x_seq)\n\n\nplt.scatter(x, y / n, c = \"white\", edgecolors = \"black\", s=100)\nplt.plot(x_seq, mu_logit, lw=2, label=\"Logit\")\nplt.plot(x_seq, mu_probit, lw=2, label=\"Probit\")\nplt.plot(x_seq, mu_cloglog, lw=2, label=\"CLogLog\")\nplt.axhline(0.5, c=\"k\", alpha=0.5, ls=\"--\")\nplt.xlabel(r\"Dose $\\log_{10}CS_2mgl^{-1}$\")\nplt.ylabel(\"Probability of death\")\nplt.legend();\n\n\n\n\nIn this example, we can see the models using the logit and probit link functions present very similar estimations. With these particular data, all the three link functions fit the data well and the results do not differ significantly. However, there can be scenarios where the results are more sensitive to the choice of the link function.\nReferences\nBliss, C. I. (1935). The calculation of the dose-mortality curve. Annals of Applied Biology 22, 134–167\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\nnumpy : 1.23.5\nbambi : 0.9.3\nmatplotlib: 3.6.2\npandas : 1.5.2\n\nWatermark: 2.3.1" + "text": "NOTE This notebook is not part of the documentation. It’s not meant to be in the webpage. 
It’s something I wrote when I was testing the new functionality and I think it’s nice to have it handy.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\ndata = bmb.load_data(\"sleepstudy\")\n\n\ndata.head()\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n 308\n \n \n 1\n 258.7047\n 1\n 308\n \n \n 2\n 250.8006\n 2\n 308\n \n \n 3\n 321.4398\n 3\n 308\n \n \n 4\n 356.8519\n 4\n 308\n \n \n\n\n\n\n\nmodel = bmb.Model(\"Reaction ~ 1 + Days + (1 + Days | Subject)\", data)\nmodel\n\n Formula: Reaction ~ 1 + Days + (1 + Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0.0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 56.1721)\n\n\n\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Reaction_sigma, Intercept, Days, 1|Subject_sigma, 1|Subject_offset, Days|Subject_sigma, Days|Subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 15 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\ndf_new = data.head(10).reset_index(drop=True)\ndf_new[\"Subject\"] = \"xxx\"\ndf_new = pd.concat([df_new, data.head(10)])\ndf_new = df_new.reset_index(drop=True)\ndf_new\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n xxx\n \n \n 1\n 258.7047\n 1\n xxx\n \n \n 2\n 250.8006\n 2\n xxx\n \n \n 3\n 321.4398\n 3\n xxx\n \n \n 4\n 356.8519\n 4\n xxx\n \n \n 5\n 414.6901\n 5\n xxx\n \n \n 6\n 382.2038\n 6\n xxx\n \n \n 7\n 290.1486\n 7\n xxx\n \n \n 8\n 430.5853\n 8\n xxx\n \n \n 9\n 466.3535\n 9\n xxx\n \n \n 10\n 249.5600\n 0\n 308\n \n \n 11\n 258.7047\n 1\n 308\n \n \n 12\n 250.8006\n 2\n 308\n \n \n 13\n 321.4398\n 3\n 308\n \n \n 14\n 356.8519\n 4\n 308\n \n \n 15\n 414.6901\n 5\n 308\n \n \n 16\n 382.2038\n 6\n 308\n \n \n 17\n 290.1486\n 7\n 308\n \n \n 18\n 430.5853\n 8\n 308\n \n \n 19\n 466.3535\n 9\n 308\n \n \n\n\n\n\n\np = model.predict(idata, data=df_new, inplace=False, sample_new_groups=True)\n\nreaction_draws = p.posterior[\"Reaction_mean\"]\nmean = reaction_draws.mean((\"chain\", \"draw\")).to_numpy()\nbounds = reaction_draws.quantile((0.025, 0.975), (\"chain\", \"draw\")).to_numpy()\n\n\nfig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)\n\naxes[0].scatter(df_new.iloc[10:][\"Days\"], df_new.iloc[10:][\"Reaction\"])\naxes[1].scatter(df_new.iloc[:10][\"Days\"], df_new.iloc[:10][\"Reaction\"])\n\naxes[0].fill_between(np.arange(10), bounds[0, 10:], bounds[1, 10:], alpha=0.5, color=\"C0\")\naxes[1].fill_between(np.arange(10), bounds[0, :10], bounds[1, :10], alpha=0.5, color=\"C0\")\n\naxes[0].set_title(\"Original participant\")\naxes[1].set_title(\"New participant\");\n\n\n\n\n\n\ndata = pd.read_csv(\"../../tests/data/crossed_random.csv\")\ndata[\"subj\"] = data[\"subj\"].astype(str)\ndata.head()\n\n\n\n\n\n \n \n \n Unnamed: 0\n subj\n item\n site\n Y\n continuous\n dummy\n threecats\n \n \n \n \n 0\n 0\n 0\n 0\n 0\n 0.276766\n 
0.929616\n 0\n a\n \n \n 1\n 1\n 1\n 0\n 0\n -0.058104\n 0.008388\n 0\n a\n \n \n 2\n 2\n 2\n 0\n 1\n -6.847861\n 0.439645\n 0\n a\n \n \n 3\n 3\n 3\n 0\n 1\n 12.474619\n 0.596366\n 0\n a\n \n \n 4\n 4\n 4\n 0\n 2\n -0.426047\n 0.709510\n 0\n a\n \n \n\n\n\n\n\nformula = \"Y ~ 0 + threecats + (0 + threecats | subj)\"\nmodel = bmb.Model(formula, data)\nmodel\n\n Formula: Y ~ 0 + threecats + (0 + threecats | subj)\n Family: gaussian\n Link: mu = identity\n Observations: 120\n Priors: \n target = mu\n Common-level effects\n threecats ~ Normal(mu: [0. 0. 0.], sigma: [31.1617 31.1617 31.1617])\n \n Group-level effects\n threecats|subj ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: [31.1617 31.1617 31.1617]))\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 5.8759)\n\n\n\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Y_sigma, threecats, threecats|subj_sigma, threecats|subj_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:08<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 8 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\nnew_data = pd.DataFrame(\n {\n \"threecats\": [\"a\", \"a\"],\n \"subj\": [\"0\", \"11\"]\n }\n)\nnew_data\n\n\n\n\n\n \n \n \n threecats\n subj\n \n \n \n \n 0\n a\n 0\n \n \n 1\n a\n 11\n \n \n\n\n\n\n\np1 = model.predict(idata, data=new_data, inplace=False, sample_new_groups=True)\n\n\nfig, axes = plt.subplots(2, 1, figsize=(7, 9), sharex=True)\n\ny1_grs = p1.posterior[\"Y_mean\"].sel(Y_obs=0).to_numpy().flatten()\ny2_grs = p1.posterior[\"Y_mean\"].sel(Y_obs=1).to_numpy().flatten()\n\naxes[0].hist(y1_grs, bins=20);\naxes[1].hist(y2_grs, bins=20);\n\n\n\n\n\n\ninhaler = pd.read_csv(\"../../tests/data/inhaler.csv\")\ninhaler[\"rating\"] = pd.Categorical(inhaler[\"rating\"], categories=[1, 2, 3, 4])\ninhaler[\"treat\"] = pd.Categorical(inhaler[\"treat\"])\n\nmodel = bmb.Model(\n \"rating ~ 1 + period + treat + (1 + treat|subject)\", inhaler, family=\"categorical\"\n)\nidata = model.fit(tune=200, draws=200)\n\nOnly 200 samples in chain.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, period, treat, 1|subject_sigma, 1|subject_offset, treat|subject_sigma, treat|subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [800/800 00:11<00:00 Sampling 2 chains, 1 divergences]\n \n \n\n\nSampling 2 chains for 200 tune and 200 draw iterations (400 + 400 draws total) took 12 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\ndf_new = inhaler.head(2).reset_index(drop=True)\ndf_new[\"subject\"] = [1, 999]\ndf_new\n\n\n\n\n\n \n \n \n subject\n rating\n treat\n period\n carry\n \n \n \n \n 0\n 1\n 1\n 0.5\n 0.5\n 0\n \n \n 1\n 999\n 1\n 0.5\n 0.5\n 0\n \n \n\n\n\n\n\np = model.predict(idata, data=df_new, inplace=False, sample_new_groups=True)\n\n\nfig, axes = plt.subplots(2, 2, figsize=(12, 9))\nbins = np.linspace(0, 1, 20)\n\nfor i, ax in enumerate(axes.ravel()):\n x = p.posterior[\"rating_mean\"].sel({\"rating_dim\": f'{i + 1}'}).to_numpy()\n ax.hist(x[..., 0].flatten(), bins=bins, histtype=\"step\", color=\"C0\")\n ax.hist(x[..., 1].flatten(), bins=bins, histtype=\"step\", color=\"C1\")" }, { - "objectID": "notebooks/sleepstudy.html", - "href": "notebooks/sleepstudy.html", + "objectID": 
"notebooks/Strack_RRR_re_analysis.html", + "href": "notebooks/Strack_RRR_re_analysis.html", "title": "Bambi", "section": "", - "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\nIn this example we are going to use sleepstudy dataset. It is derived from the study described in Belenky et al. (2003) and popularized in the lme4 R package. This dataset contains the average reaction time per day (in milliseconds) on a series of tests for the most sleep-deprived group in a sleep deprivation study. The first two days of the study are considered as adaptation and training, the third day is a baseline, and sleep deprivation started after day 3. The subjects in this group were restricted to 3 hours of sleep per night.\n\n\nThe sleepstudy dataset can be loaded using the load_data() function:\n\ndata = bmb.load_data(\"sleepstudy\")\ndata\n\n\n\n\n\n \n \n \n Reaction\n Days\n Subject\n \n \n \n \n 0\n 249.5600\n 0\n 308\n \n \n 1\n 258.7047\n 1\n 308\n \n \n 2\n 250.8006\n 2\n 308\n \n \n 3\n 321.4398\n 3\n 308\n \n \n 4\n 356.8519\n 4\n 308\n \n \n ...\n ...\n ...\n ...\n \n \n 175\n 329.6076\n 5\n 372\n \n \n 176\n 334.4818\n 6\n 372\n \n \n 177\n 343.2199\n 7\n 372\n \n \n 178\n 369.1417\n 8\n 372\n \n \n 179\n 364.1236\n 9\n 372\n \n \n\n180 rows × 3 columns\n\n\n\nThe response variable is Reaction, the average of the reaction time measurements on a given subject for a given day. The two covariates are Days, the number of days of sleep deprivation, and Subject, the identifier of the subject on which the observation was made.\n\n\n\nLet’s get started by displaying the data in a multi-panel layout. There’s a panel for each subject in the study. This allows us to observe and compare the association of Days and Reaction between subjects.\n\ndef plot_data(data):\n fig, axes = plt.subplots(2, 9, figsize=(16, 7.5), sharey=True, sharex=True, dpi=300, constrained_layout=False)\n fig.subplots_adjust(left=0.075, right=0.975, bottom=0.075, top=0.925, wspace=0.03)\n\n axes_flat = axes.ravel()\n\n for i, subject in enumerate(data[\"Subject\"].unique()):\n ax = axes_flat[i]\n idx = data.index[data[\"Subject\"] == subject].tolist()\n days = data.loc[idx, \"Days\"].values\n reaction = data.loc[idx, \"Reaction\"].values\n\n # Plot observed data points\n ax.scatter(days, reaction, color=\"C0\", ec=\"black\", alpha=0.7)\n\n # Add a title\n ax.set_title(f\"Subject: {subject}\", fontsize=14)\n\n ax.xaxis.set_ticks([0, 2, 4, 6, 8])\n fig.text(0.5, 0.02, \"Days\", fontsize=14)\n fig.text(0.03, 0.5, \"Reaction time (ms)\", rotation=90, fontsize=14, va=\"center\")\n\n return axes\n\n\nplot_data(data);\n\n\n\n\nFor most of the subjects, there’s a clear positive association between Days and Reaction time. Reaction times increase as people accumulate more days of sleep deprivation. Participants differ in the initial reaction times as well as in the association between sleep deprivation and reaction time. Reaction times increase faster for some subjects and slower for others. Finally, the relationship between Days and Reaction time presents some deviations from linearity within the panels, but these are neither substantial nor systematic.\n\n\n\nOur main goal is to measure the association between Days and Reaction times. We are interested both in the common effect across all subjects, as well as the effects associated with each individual. 
To do this, we’re going to use a hierarchical linear regression model that includes the effect of a common intercept and slope, as well as intercepts and slopes specific to each subject. These types of effects are also known as fixed and random effects in the statistical literature.\nThe model can be written as follows:\n\\[\n\\begin{aligned}\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma) \\\\\n\\mu_i & = \\beta_{\\text{Intercept}[i]} + \\beta_{\\text{Days}[i]}\\text{Days}_i \\\\\n\\beta_{\\text{Intercept}[i]} & = \\beta_{\\text{Intercept}} + \\alpha_{\\text{Intercept}_i}\\\\\n\\beta_{\\text{Days}[i]} & = \\beta_{\\text{Days}} + \\alpha_{\\text{Days}_i}\\\\\n\\end{aligned}\n\\]\nwhere \\(\\beta_{\\text{Intercept}}\\) and \\(\\beta_{\\text{Days}}\\) are the intercept and day slope effects common to all subjects in the study, and \\(\\alpha_{\\text{Intercept}_i}\\) and \\(\\alpha_{\\text{Days}_i}\\) are the subject-specific intercept and slope effects. These group-specific effects represent the deviation of each subject from the average behavior.\nNote we’re not describing the prior distributions for \\(\\beta_{\\text{Intercept}}\\), \\(\\beta_{\\text{Days}}\\), \\(\\alpha_{\\text{Intercept}_i}\\), \\(\\alpha_{\\text{Days}_i}\\), and \\(\\sigma\\) because we’re going to use default priors in Bambi.\nNext, let’s create the Bambi model. Here we use the formula syntax to specify the model in a clear and concise manner. The term on the left side of ~ tells Reaction is the response variable. The Days term on the right-hand side tells we want to include a slope effect for the Days variable common to all subjects. (Days | Subject) indicates the Days slope for a given subject is going to consist of the common slope plus a deviation specific to that subject. The common and subject-specific intercepts are added implicitly. We could suppress them by adding a 0 on the common or the group-specific part of the formula (e.g. 0 + Days + (0 + Days|Subject)).\nIf we wanted subject-specific intercepts, but not subjec-specific slopes we would have written Reaction ~ Days + (1 | Subject) and if we wanted slopes specific to each Subject without including a Subject specific intercept, we would write Reaction ~ Days + (0 + Days | Subject).\nThat’s been quite a long introduction for the model. Let’s write it down in code now:\n\nmodel = bmb.Model(\"Reaction ~ 1 + Days + (Days | Subject)\", data, categorical=\"Subject\")\n\nA description of the model and the priors can be obtained by simply printing the model object\n\nmodel\n\n Formula: Reaction ~ 1 + Days + (Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n Reaction_sigma ~ HalfStudentT(nu: 4, sigma: 56.1721)\n\n\nThere we see the formula used to specify the model, the name of the response distribution (Gaussian), the link function (identity), together with the number of observations (180). Below, we have a description of the prior distributions for the different terms in the model. This tells Bambi is using Normal priors for both common and group-specific terms, and a HalfStudentT distribution for the residual error term of the linear regression.\nNow it’s time to hit the inference button. 
In Bambi, it is as simple as using the .fit() method. This returns an InferenceData object from the ArviZ library. The draws=2000 argument asks the sampler to obtain 2000 draws from the posterior for each chain.\n\nidata = model.fit(draws=2000, random_seed=SEED)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Reaction_sigma, Intercept, Days, 1|Subject_sigma, 1|Subject_offset, Days|Subject_sigma, Days|Subject_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:29<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 30 seconds.\n\n\n\n\n\nFirst of all, let’s obtain a summary of the posterior distribution of the Intercept and Days effects.\n\naz.summary(idata, var_names=[\"Intercept\", \"Days\"], kind=\"stats\")\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n \n \n \n \n Intercept\n 251.494\n 7.621\n 238.327\n 266.683\n \n \n Days\n 10.467\n 1.686\n 7.411\n 13.720\n \n \n\n\n\n\nOn average, people’s average reaction time at the beginning of the study is between 235 and 265 milliseconds. With every extra day of sleep deprivation, the mean reaction times increase, on average, between 7.2 and 13.9 milliseconds.\nSo far so good with the interpretation of the common effects. It’s quite straightforward and simple. But this analysis would be incomplete and misleading if we don’t evaluate the subject-specific terms we added to the model. These terms are telling us how much subjects differ from each other in terms of the initial reaction time and the association between days of sleep deprivation and reaction times.\nBelow we use ArviZ to obtain a traceplot of the subject-specific intercepts 1|Subject and slopes Days|Subject. This traceplot contains two columns. On the left, we have the posterior distributions that we analyze below, and on the right, we have the draws from the posterior in the order the sampler draw them for us. The stationary random pattern, or white noise appearence, tells us the sampler converged and the chains mixed well.\nFrom the range of the posteriors of the subject-specific intercepts we can see the initial mean reaction time for a given subject can differ substantially from the general mean we see in the table above. There’s also a large difference in the slopes. Some subjects see their reaction times increase quite rapidly as they’re deprived from sleep, while others have a better tolerance and get worse more slowly. Finally, from the pink posterior centered at ~ -11, there seems to be one person who gets better at reaction times. Looks like they took this as a serious challenge!\nIn summary, the model is capturing the behavior we saw in the data exploration stage. People differ both in the initial reaction times as well as in how these reaction times are affected by the successive days of sleep deprivation.\n\naz.plot_trace(idata, var_names=[\"1|Subject\", \"Days|Subject\"]);\n\n\n\n\nSo far, we’ve made the following conclusions\n\nPeople’s mean reaction time increase as they are deprived from sleep.\nPeople have different reaction times in the beginning of the study.\nSome people are more affected by sleep deprivation than others.\n\nBut there’s another question we haven’t answered yet: Are the initial reaction times associated with how much the sleep deprivation affects the evolution of reaction times? 
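One way to put a number on this question is to compute, draw by draw, the correlation across subjects between the subject-specific intercepts and slopes. This is a sketch rather than part of the original notebook; it assumes the idata object fitted above and uses xarray directly (ArviZ already depends on it), with the Subject__factor_dim coordinate name taken from the posterior.

import xarray as xr

intercepts = idata.posterior["1|Subject"]
slopes = idata.posterior["Days|Subject"]

# Correlation across subjects, computed separately for every chain/draw
corr = xr.corr(intercepts, slopes, dim="Subject__factor_dim")
print(float(corr.mean()), corr.quantile([0.03, 0.97]).values)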
Let’s create a scatterplot to visualize the joint posterior of the subject-specific intercepts and slopes. This chart uses different colors for the individuals.\n\n# extract a subsample from the posterior and stack the chain and draw dims \nposterior = az.extract(idata, num_samples=500)\n\n_, ax = plt.subplots()\n\nidata.posterior.plot.scatter(\n x=\"1|Subject\", y=\"Days|Subject\",\n hue=\"Subject__factor_dim\",\n add_colorbar=False,\n add_legend=False,\n cmap=\"tab20\",\n edgecolors=None,\n) \n\nax.axhline(c=\"0.25\", ls=\"--\")\nax.axvline(c=\"0.25\", ls=\"--\")\nax.set_xlabel(\"Subject-specific intercept\")\nax.set_ylabel(\"Subject-specific slope\");\n\n\n\n\nIf we look at the bigger picture, i.e omitting the groups, we can conclude there’s no association between the intercept and slope. In other words, having lower or higher intial reaction times does not say anything about how much sleep deprivation affects the average reaction time on a given subject.\nOn the other hand, if we look at the joint posterior for a given individual, we can see a negative correlation between the intercept and the slope. This is telling that, conditional on a given subject, the intercept and slope posteriors are not independent. However, it doesn’t imply anything about the overall relationship between the intercept and the slope, which is what we need if we want to know whether the initial time is associated with how much sleep deprivation affects the reaction time.\nTo conclude with this example, we’re going create the same plot we created in the beginning with the mean regression lines and a credible bands for them.\n\n# Obtain the posterior of the mean\nmodel.predict(idata)\n\n# Plot the data\naxes = plot_data(data)\n\n# Take the posterior of the mean reaction time\nreaction_mean = az.extract(idata)[\"Reaction_mean\"].values\n\nfor subject, ax in zip(data[\"Subject\"].unique(), axes.ravel()):\n\n idx = data.index[data[\"Subject\"]== subject].tolist()\n days = data.loc[idx, \"Days\"].values\n \n # Plot highest density interval / credibility interval\n az.plot_hdi(days, reaction_mean[idx].T[np.newaxis], color=\"C0\", ax=ax)\n \n # Plot mean regression line\n ax.plot(days, reaction_mean[idx].mean(axis=1), color=\"C0\")\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nThe watermark extension is already loaded. To reload it, use:\n %reload_ext watermark\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\narviz : 0.14.0\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\nbambi : 0.9.3\n\nWatermark: 2.3.1" + "text": "from glob import glob\nfrom os.path import basename\n\nimport arviz as az\nimport bambi as bmb\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\n\nIn this Jupyter notebook, we do a Bayesian reanalysis of the data reported in the recent registered replication report (RRR) of a famous study by Strack, Martin & Stepper (1988). The original Strack et al. study tested a facial feedback hypothesis arguing that emotional responses are, in part, driven by facial expressions (rather than expressions always following from emotions). Strack and colleagues reported that participants rated cartoons as more funny when the participants held a pen in their teeth (unknowingly inducing a smile) than when they held a pen between their lips (unknowingly inducing a pout). 
The article has been cited over 1,400 times, and has been enormously influential in popularizing the view that affective experiences and outward expressions of affective experiences can both influence each other (instead of the relationship being a one-way street from experience to expression). In 2016, a Registered Replication Report led by Wagenmakers and colleagues attempted to replicate Study 1 from Strack, Martin, & Stepper (1988) in 17 independent experiments comprising over 2,500 participants. The RRR reported no evidence in support of the effect.\nBecause the emphasis here is on fitting models in Bambi, we spend very little time on quality control and data exploration; our goal is simply to show how one can replicate and extend the primary analysis reported in the RRR in a few lines of Bambi code.\n\n\nThe data for the RRR of Strack, Martin, & Stepper (henceforth SMS) is available as a set of CSV files from the project’s repository on the Open Science Framework. For the sake of completeness, we’ll show how to go from the raw CSV to the “long” data format that Bambi can use.\nOne slightly annoying thing about these 17 CSV files–each of which represents a different replication site–is that they don’t all contain exactly the same columns. Some labs added a column or two at the end (mostly for notes). To keep things simple, we’ll just truncate each dataset to only the first 22 columns. Because the variable names are structured in a bit of a confusing way, we’ll also just drop the first two rows in each file, and manually set the column names for all 22 variables. Once we’ve done that, we can simply concatenate all of the 17 datasets along the row axis to create one big dataset.\n\nDL_PATH = 'data/facial_feedback/*csv'\n\ndfs = []\ncolumns = ['subject', 'cond_id', 'condition', 'correct_c1', 'correct_c2', 'correct_c3', 'correct_c4',\n 'correct_total', 'rating_t1', 'rating_t2', 'rating_c1', 'rating_c2', 'rating_c3',\n 'rating_c4', 'self_perf', 'comprehension', 'awareness', 'transcript', 'age', 'gender',\n 'student', 'occupation']\n\ncount = 0\nfor idx, study in enumerate(glob(DL_PATH)):\n data = pd.read_csv(study, encoding='latin1', skiprows=2, header=None, index_col=False).iloc[:, :22]\n data.columns = columns\n # Add study name\n data['study'] = idx\n # Some sites used the same subject id numbering schemes, so prepend with study to create unique ids.\n # Note that if we don't do this, Bambi would have no way of distinguishing two subjects who share\n # the same id, which would hose our results.\n data['uid'] = data['subject'].astype(float) + count\n dfs.append(data)\ndata = pd.concat(dfs, axis=0).apply(pd.to_numeric, errors='coerce', axis=1)\n\nLet’s see what the first few rows look like…\n\ndata.head()\n\n\n\n\n\n \n \n \n subject\n cond_id\n condition\n correct_c1\n correct_c2\n correct_c3\n correct_c4\n correct_total\n rating_t1\n rating_t2\n ...\n self_perf\n comprehension\n awareness\n transcript\n age\n gender\n student\n occupation\n study\n uid\n \n \n \n \n 0\n 1.0\n 1.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 5.0\n 9.0\n ...\n 5.0\n 1.0\n 0.0\n NaN\n 21.0\n 1.0\n 1.0\n NaN\n 0.0\n 1.0\n \n \n 1\n 2.0\n 2.0\n 1.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 3.0\n 4.0\n ...\n 7.0\n 1.0\n 0.0\n NaN\n 25.0\n 1.0\n 1.0\n NaN\n 0.0\n 2.0\n \n \n 2\n 3.0\n 3.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 4.0\n 4.0\n ...\n 9.0\n 1.0\n 0.0\n NaN\n 23.0\n 0.0\n 1.0\n NaN\n 0.0\n 3.0\n \n \n 3\n 4.0\n 4.0\n 1.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 7.0\n 3.0\n ...\n 4.0\n 1.0\n 0.0\n NaN\n 19.0\n 0.0\n 1.0\n NaN\n 
0.0\n 4.0\n \n \n 4\n 5.0\n 5.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 5.0\n 7.0\n ...\n 6.0\n 1.0\n 0.0\n NaN\n 19.0\n 0.0\n 1.0\n NaN\n 0.0\n 5.0\n \n \n\n5 rows × 24 columns\n\n\n\n\n\n\nAt this point we have our data in a pandas DataFrame with shape of (2612, 24). Unfortunately, we can’t use the data in this form. We’ll need to (a) conduct some basic quality control, and (b) “melt” the dataset–currently in so-called “wide” format, with each subject in a separate row–into long format, where each row is a single trial. Fortunately, we can do this easily in pandas:\n\n# Keep only subjects who (i) respond appropriately on all trials,\n# (ii) understand the cartoons, and (iii) don't report any awareness\n# of the hypothesis or underlying theory.\nvalid = data.query('correct_total==4 and comprehension==1 and awareness==0')\nlong = pd.melt(valid, ['uid', 'condition', 'gender', 'age', 'study', 'self_perf'],\n ['rating_c1', 'rating_c2', 'rating_c3', 'rating_c4'], var_name='stimulus')\n\n\nlong\n\n\n\n\n\n \n \n \n uid\n condition\n gender\n age\n study\n self_perf\n stimulus\n value\n \n \n \n \n 0\n 1.0\n 0.0\n 1.0\n 21.0\n 0.0\n 5.0\n rating_c1\n 5.0\n \n \n 1\n 2.0\n 1.0\n 1.0\n 25.0\n 0.0\n 7.0\n rating_c1\n 0.0\n \n \n 2\n 3.0\n 0.0\n 0.0\n 23.0\n 0.0\n 9.0\n rating_c1\n 4.0\n \n \n 3\n 4.0\n 1.0\n 0.0\n 19.0\n 0.0\n 4.0\n rating_c1\n 7.0\n \n \n 4\n 5.0\n 0.0\n 0.0\n 19.0\n 0.0\n 6.0\n rating_c1\n 4.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 6935\n 164.0\n 0.0\n 0.0\n 18.0\n 16.0\n 4.0\n rating_c4\n 0.0\n \n \n 6936\n 168.0\n 0.0\n 0.0\n 18.0\n 16.0\n 8.0\n rating_c4\n 6.0\n \n \n 6937\n 169.0\n 1.0\n 0.0\n 18.0\n 16.0\n 7.0\n rating_c4\n 7.0\n \n \n 6938\n 171.0\n 1.0\n 0.0\n 19.0\n 16.0\n 7.0\n rating_c4\n 4.0\n \n \n 6939\n 172.0\n 0.0\n 1.0\n 21.0\n 16.0\n 7.0\n rating_c4\n 3.0\n \n \n\n6940 rows × 8 columns\n\n\n\nNotice that in the melt() call above, we’re treating not only the unique subject ID (uid) as an identifying variable, but also gender, experimental condition, age, and study name. Since these are all between-subject variables, these columns are all completely redundant with uid, and adding them does nothing to change the structure of our data. The point of explicitly listing them is just to keep them around in the dataset, so that we can easily add them to our models.\n\n\n\nNow that we’re all done with our (minimal) preprocessing, it’s time to fit the model! This turns out to be a snap in Bambi. We’ll begin with a very naive (and, as we’ll see later, incorrect) model that includes only the following terms:\n\nAn overall (common) intercept.\nThe common effect of experimental condition (“smiling” by holding a pen in one’s teeth vs. “pouting” by holding a pen in one’s lips). This is the primary variable of interest in the study.\nA group specific intercept for each of the 1,728 subjects in the ‘long’ dataset. (There were 2,576 subjects in the original dataset, but about 25% were excluded for various reasons, and we’re further excluding all subjects who lack complete data. 
As an exercise, you can try relaxing some of these criteria and re-fitting the models, though you’ll probably find that it makes no meaningful difference to the results.)\n\nWe’ll create a Bambi model, fit it, and store the results in a new object–which we can then interrogate in various ways.\n\n# Initialize the model, passing in the dataset we want to use.\nmodel = bmb.Model(\"value ~ condition + (1|uid)\", long, dropna=True)\n\n# Set a custom prior on group specific factor variances—just for illustration\ngroup_specific_sd = bmb.Prior(\"HalfNormal\", sigma=10)\ngroup_specific_prior = bmb.Prior(\"Normal\", mu=0, sigma=group_specific_sd)\nmodel.set_priors(group_specific=group_specific_prior)\n\n# Fit the model, drawing 1,000 MCMC draws per chain\nresults = model.fit(draws=1000)\n\nAutomatically removing 9/6940 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [value_sigma, Intercept, condition, 1|uid_sigma, 1|uid_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:23<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 23 seconds.\n\n\nNotice that, in Bambi, the common and group specific effects are specified in the same formula. This is the same convention used by other similar packages like brms.\n\n\n\nWe can plot the prior distributions for all parameters with a call to the plot_priors() method.\n\nmodel.plot_priors();\n\nSampling: [1|uid_sigma, Intercept, condition, value_sigma]\n\n\n\n\n\nAnd we can easily get the posterior distributions with az.plot_trace(). We can select a subset of the parameters with the var_names arguments, like in the following cell. Or alternative by negating variables like var_names=\"~1|uid\".\n\naz.plot_trace(results,\n var_names=[\"Intercept\", \"condition\", \"value_sigma\", \"1|uid_sigma\"],\n compact=False,\n);\n\n\n\n\nIf we want a numerical summary of the results, we just pass the results object to az.summary(). By default, summary shows the mean, standard deviation, and 94% highest density interval for the posterior. Summary also includes the Monte Carlo standard error, the effective sample size and the R-hat statistic.\n\naz.summary(results, var_names=['Intercept', 'condition', 'value_sigma', '1|uid_sigma'])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 4.563\n 0.047\n 4.472\n 4.647\n 0.001\n 0.001\n 2208.0\n 1508.0\n 1.0\n \n \n condition\n -0.030\n 0.058\n -0.143\n 0.073\n 0.001\n 0.001\n 2473.0\n 1198.0\n 1.0\n \n \n value_sigma\n 2.402\n 0.021\n 2.360\n 2.439\n 0.000\n 0.000\n 2429.0\n 1335.0\n 1.0\n \n \n 1|uid_sigma\n 0.306\n 0.045\n 0.228\n 0.392\n 0.002\n 0.001\n 643.0\n 915.0\n 1.0\n \n \n\n\n\n\n\n\n\nLooking at the parameter estimates produced by our model, it seems pretty clear that there’s no meaningful effect of condition. The posterior distribution is centered almost exactly on 0, with most of the probability mass on very small values. The 94% HDI spans from \\(\\approx -0.14\\) to \\(\\approx 0.08\\)–in other words, the plausible effect of the experimental manipulation is, at best, to produce a change of < 0.2 on cartoon ratings made on a 10-point scale. For perspective, the variation between subjects is enormous in comparison–the standard deviation for group specific effects 1|uid_sigma is around 0.3. 
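To complement the summary table, the condition posterior can also be inspected directly against zero. The following is a small sketch, not part of the original analysis, reusing the results object from above.

# Posterior of the condition effect with a reference line at zero
az.plot_posterior(results, var_names=["condition"], ref_val=0);

# Posterior probability that the effect is negative
condition_draws = results.posterior["condition"].values.ravel()
print((condition_draws < 0).mean())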
We can also see that the model is behaving well, and the sampler seems to have converged nicely (the traces for all parameters look stationary).\nUnfortunately, our first model has at least two pretty serious problems. First, it gives no consideration to between-study variation–we’re simply lumping all 1,728 subjects together, as if they came from the same study. A better model would properly account for study-level variation. We could model study as either a common or a group specific factor in this case–both choices are defensible, depending on whether we want to think of the 17 studies in this dataset as the only sites of interest, or as if they’re just 17 random sites drawn from some much larger population that have particular characteristics we want to account for.\nFor present purposes, we’ll adopt the latter strategy (as an exercise, you can modify the the code below and re-run the model with study as a common factor). We’ll “keep it maximal” by adding both group specific study intercepts and group specific study slopes to the model. That is, we’ll assume that the subjects at each research site have a different baseline appreciation of the cartoons (some find the cartoons funnier than others), and that the effect of condition also varies across sites.\nSecond, our model also fails to explicitly model variation in cartoon ratings that should properly be attributed to the 4 stimuli. In principle, our estimate of the common effect of condition could change somewhat once we correctly account for stimulus variability (though in practice, the net effect is almost always to reduce effects, not increase them–so in this case, it’s very unlikely that adding group specific stimulus effects will produce a meaningful effect of condition). So we’ll deal with this by adding specific intercepts for the 4 stimuli. We’ll model the stimuli as group specific effect, rather than common, because it wouldn’t make sense to think of these particular cartoons as exhausting the universe of stimuli we care about (i.e., we wouldn’t really care about the facial-feedback effect if we knew that it only applied to 4 specific Far Side cartoons, and no other stimuli).\nLastly, just for fun, we can throw in some additional covariates, since they’re readily available in the dataset, and may be of interest even if they don’t directly inform the core hypothesis. 
Specifically, we’ll add common effects of gender and age to the model, which will let us estimate the degree to which participants’ ratings of the cartoons varies as a function of these background variables.\nOnce we’ve done all that, we end up with a model that’s in a good position to answer the question we care about–namely, whether the smiling/pouting manipulation has an effect on cartoon ratings that generalizes across the subjects, studies, and stimuli found in the RRR dataset.\n\nmodel = bmb.Model(\n \"value ~ condition + age + gender + (1|uid) + (condition|study) + (condition|stimulus)\",\n long,\n dropna=True,\n)\n\ngroup_specific_sd = bmb.Prior(\"HalfNormal\", sigma=10)\ngroup_specific_prior = bmb.Prior(\"Normal\", mu=0, sigma=group_specific_sd)\nmodel.set_priors(group_specific=group_specific_prior)\n\n# Not we use 2000 samples for tuning and increase the taget_accept to 0.99.\n# The default values result in divergences.\nresults = model.fit(draws=1000, tune=2000, target_accept=0.99)\n\nAutomatically removing 33/6940 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [value_sigma, Intercept, condition, age, gender, 1|uid_sigma, 1|uid_offset, 1|study_sigma, 1|study_offset, condition|study_sigma, condition|study_offset, 1|stimulus_sigma, 1|stimulus_offset, condition|stimulus_sigma, condition|stimulus_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 26:22<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 1583 seconds.\n\n\n\naz.plot_trace(results, \n var_names=['Intercept', 'age', 'gender', 'condition', 'value_sigma', \n '1|study', '1|stimulus', 'condition|study', 'condition|stimulus',\n '1|study_sigma', '1|stimulus_sigma', 'condition|study_sigma', \n ],\n compact=True);\n\n\n\n\n\n\n\nNo. There’s still no discernible effect. Modeling the data using a mixed-effects model does highlight a number of other interesting features, however: * The stimulus-level standard deviation 1|stimulus_sigma is quite large compared to the other factors. This is potentially problematic, because it suggests that a more conventional analysis that left individual stimulus effects out of the model could potentially run a high false positive rate. Note that this is a problem that affects both the RRR and the original Strack study equally; the moral of the story is to deliberately sample large numbers of stimuli and explicitly model their influence. * Older people seem to rate cartoons as being (a little bit) funnier. * The variation across sites is surprisingly small–in terms of both the group specific intercepts (1|study) and the group specific slopes (condition|study). 
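Because this discussion hinges on comparing the group-specific standard deviations, a compact way to see them side by side is a forest plot. A sketch, using the parameter names listed in the sampler output above:

az.plot_forest(
    results,
    var_names=[
        "1|uid_sigma", "1|study_sigma", "1|stimulus_sigma",
        "condition|study_sigma", "condition|stimulus_sigma",
    ],
    combined=True,
);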
In other words, the constitution of the sample, the gender of the experimenter, or any of the hundreds of others of between-site differences that one might conceivably have expected to matter, don’t really seem to make much of a difference to participants’ ratings of the cartoons.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\npandas: 1.5.2\nnumpy : 1.23.5\narviz : 0.14.0\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/radon_example.html", - "href": "notebooks/radon_example.html", + "objectID": "notebooks/plot_predictions.html", + "href": "notebooks/plot_predictions.html", "title": "Bambi", "section": "", - "text": "In this notebook we want to revisit the classical hierarchical linear regression model based on the dataset of the Radon Contamination by Gelman and Hill. In particular, we want to show how easy is to port the PyMC models, presented in the very complete article A Primer on Bayesian Methods for Multilevel Modeling, to Bambi using the more concise formula specification for the models.\nThis example has been ported from PyMC by Juan Orduz (@juanitorduz) and Bambi developers.\n\n\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport pymc as pm\nimport seaborn as sns\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.default_rng(8924)\n\nGenerator(PCG64) at 0x7FDEF2EFEC00\n\n\n\n\n\nLet us load the data into a pandas data frame.\n\n# Get radon data\npath = \"https://raw.githubusercontent.com/pymc-devs/pymc-examples/main/examples/data/srrs2.dat\"\nradon_df = pd.read_csv(path)\n\n# Get city data\ncity_df = pd.read_csv(pm.get_data(\"cty.dat\"))\n\n\ndisplay(radon_df.head())\nprint(radon_df.shape[0])\n\n\n\n\n\n \n \n \n idnum\n state\n state2\n stfips\n zip\n region\n typebldg\n floor\n room\n basement\n ...\n stoptm\n startdt\n stopdt\n activity\n pcterr\n adjwt\n dupflag\n zipflag\n cntyfips\n county\n \n \n \n \n 0\n 1\n AZ\n AZ\n 4\n 85920\n 1\n 1\n 1\n 2\n N\n ...\n 1100\n 112987\n 120287\n 0.3\n 0.0\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n 1\n 2\n AZ\n AZ\n 4\n 85920\n 1\n 0\n 9\n 0\n \n ...\n 700\n 70788\n 71188\n 0.6\n 33.3\n 128.784975\n 0\n 0\n 1\n APACHE\n \n \n 2\n 3\n AZ\n AZ\n 4\n 85924\n 1\n 1\n 1\n 3\n N\n ...\n 1145\n 70788\n 70788\n 0.5\n 0.0\n 150.245112\n 0\n 0\n 1\n APACHE\n \n \n 3\n 4\n AZ\n AZ\n 4\n 85925\n 1\n 1\n 1\n 3\n N\n ...\n 1900\n 52088\n 52288\n 0.6\n 97.2\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n 4\n 5\n AZ\n AZ\n 4\n 85932\n 1\n 1\n 1\n 1\n N\n ...\n 900\n 70788\n 70788\n 0.3\n 0.0\n 136.060971\n 0\n 0\n 1\n APACHE\n \n \n\n5 rows × 25 columns\n\n\n\n12777\n\n\n\ndisplay(city_df.head())\nprint(city_df.shape[0])\n\n\n\n\n\n \n \n \n stfips\n ctfips\n st\n cty\n lon\n lat\n Uppm\n \n \n \n \n 0\n 1\n 1\n AL\n AUTAUGA\n -86.643\n 32.534\n 1.78331\n \n \n 1\n 1\n 3\n AL\n BALDWIN\n -87.750\n 30.661\n 1.38323\n \n \n 2\n 1\n 5\n AL\n BARBOUR\n -85.393\n 31.870\n 2.10105\n \n \n 3\n 1\n 7\n AL\n BIBB\n -87.126\n 32.998\n 1.67313\n \n \n 4\n 1\n 9\n AL\n BLOUNT\n -86.568\n 33.981\n 1.88501\n \n \n\n\n\n\n3194\n\n\n\n\n\nWe are going to preprocess the data as done in the article A Primer on Bayesian Methods for Multilevel Modeling.\n\n# Strip spaces from column names\nradon_df.columns = radon_df.columns.map(str.strip)\n\n# Filter to keep observations for \"MN\" state only\ndf = 
radon_df[radon_df.state == \"MN\"].copy()\ncity_mn_df = city_df[city_df.st == \"MN\"].copy()\n\n# Compute fips\ndf[\"fips\"] = 1_000 * df.stfips + df.cntyfips\ncity_mn_df[\"fips\"] = 1_000 * city_mn_df.stfips + city_mn_df.ctfips\n\n# Merge data\ndf = df.merge(city_mn_df[[\"fips\", \"Uppm\"]], on=\"fips\")\ndf = df.drop_duplicates(subset=\"idnum\")\n\n# Clean county names\ndf.county = df.county.map(str.strip)\n\n# Compute log(radon + 0.1)\ndf[\"log_radon\"] = np.log(df[\"activity\"] + 0.1)\n\n# Compute log of Uranium\ndf[\"log_u\"] = np.log(df[\"Uppm\"])\n\n# Let's map floor. 0 -> Basement and 1 -> Floor\ndf[\"floor\"] = df[\"floor\"].map({0: \"Basement\", 1: \"Floor\"})\n\n# Sort values by floor\ndf = df.sort_values(by=\"floor\")\n\n# Reset index\ndf = df.reset_index(drop=True)\n\nIn this exercise, we model the logarithm of the Radon measurements. This is because the distribution of the Radon level is approximately log-normal. We also add a small number, 0.1, to prevent us from trying to compute the logarithm of 0.\n\n\n\nIn order to get a glimpse of the data, we are going to do some exploratory data analysis. First, let’s have a look at the global distribution of the untransformed radon levels.\n\n_, ax = plt.subplots()\nsns.histplot(x=\"activity\", alpha=0.2, stat=\"density\", element=\"step\", common_norm=False, data=df, ax=ax)\nsns.kdeplot(x=\"activity\", data=df, ax=ax, cut=0)\nax.set(title=\"Density of Radon\", xlabel=\"Radon\", ylabel=\"Density\");\n\n\n\n\nNext, let us see the global log(radon + 0.1) distribution.\n\n_, ax = plt.subplots()\nsns.histplot(x=\"log_radon\", alpha=0.2, stat=\"density\", element=\"step\", common_norm=False, data=df, ax=ax)\nsns.kdeplot(x=\"log_radon\", data=df, ax=ax)\nax.set(title=\"Density of log(Radon + 0.1)\", xlabel=\"$\\log(Radon + 0.1)$\", ylabel=\"Density\");\n\n\n\n\nThere are many a priori reasons to think houses with basement has higher radon levels. From geochemistry to composition of building materials to poor ventilation. 
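A quick numerical check of that difference, before looking at the densities (a one-line sketch, not in the original notebook):

print(df.groupby("floor")["log_radon"].agg(["mean", "count"]))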
We can split the distribution of log(radon + 0.1) per floor to see if we are able to see that difference in our data.\n\n_, ax = plt.subplots()\nsns.histplot(\n x=\"log_radon\", hue=\"floor\", alpha=0.2, stat=\"density\", element=\"step\", \n common_norm=False, data=df, ax=ax\n)\nsns.kdeplot(x=\"log_radon\", hue=\"floor\", common_norm=False, data=df, ax=ax)\nax.set(title=\"Density of log(Radon + 0.1)\", xlabel=\"$\\log + 0.1$\", ylabel=\"Density\");\n\n\n\n\nThis exploration tell us that, as expected, the average radon level is higher in Basement than Floor.\nNext, we are going to count the number of counties.\n\nn_counties = df[\"county\"].unique().size\nprint(f\"Number of counties: {n_counties}\")\n\nNumber of counties: 85\n\n\nLet us dig deeper into the distribution of radon and number of observations per county and floor level.\n\nlog_radon_county_agg = (\n df \n .groupby([\"county\", \"floor\"], as_index=False)\n .agg(\n log_radon_mean=(\"log_radon\", \"mean\"),\n n_obs=(\"log_radon\", \"count\")\n )\n)\n\nfig, ax= plt.subplots(nrows=1, ncols=2, figsize=(12, 6), layout=\"constrained\")\nsns.boxplot(x=\"floor\", y=\"log_radon_mean\", data=log_radon_county_agg, ax=ax[0])\nax[0].set(title=\"log(Radon + 0.1) Mean per County\", ylabel=\"$\\log + 0.1$\")\n\nsns.boxplot(x=\"floor\", y=\"n_obs\", data=log_radon_county_agg, ax=ax[1])\nax[1].set(title=\"Number of Observations\", xlabel=\"floor\", ylabel=\"Number of observations\");\n\n\n\n\n\nOn the left hand side we can see that the \"Basement\" distribution per county is shifted to higher values with respect to the \"Floor\" distribution. We had seen this above when considering all counties together.\nOn the right hand side we see that the number of observations per county is not the same for the floor levels. In particular, we see that there are some counties with a lot of basement observations. This can create some bias when computing simple statistics to compare across counties. Moreover, not all county and floor combinations are present in the dataset. For example:\n\n\nassert df.query(\"county == 'YELLOW MEDICINE' and floor == 'Floor'\").empty\n\n\n\n\n\n\n\n\nFor this first model we only consider the predictor floor, which represents the floor level. The following equation describes the linear model that we are going to build with Bambi\n\\[\ny = \\beta_{j} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement }\\\\\n\\beta_{j} &= \\text{Coefficient for the floor level } j \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\nEach \\(j\\) indexes a different floor level. In this case, \\(j=1\\) means \"basement\" and \\(j=2\\) means \"floor\".\n\n\n\n\n\nThe only common effect in this model is the floor effect represented by the \\(\\beta_{j}\\) coefficients. We have\n\\[\n\\beta_{j} \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nfor \\(j: 1, 2\\), where \\(\\sigma_{\\beta_j}\\) is a positive constant that we set to 10 for all \\(j\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\nLet us now write the Bambi model.\nThe 0 on the right side of ~ in the model formula removes the global intercept that is added by default. 
This allows Bambi to use one coefficient for each floor level.\n\n# A dictionary with the priors we pass to the model initialization\npooled_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\npooled_model = bmb.Model(\"log_radon ~ 0 + floor\", df, priors=pooled_priors)\npooled_model\n\n Formula: log_radon ~ 0 + floor\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nThe Family name: Gaussian indicates the selected family, which defaults to Gaussian. And Link: identity indicates the default value for the link argument in bmb.Model(). Taken together this simply means that we are fitting a normal linear regression model.\nLet’s see the graph representation of the model before fitting. To do so, we first need to call the .build() method. Internally, this builds the underlying PyMC model.\n\npooled_model.build()\npooled_model.graph()\n\n\n\n\nLet’s now fit the model.\n\npooled_results = pooled_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\n\n\nNow we can examine the posterior distribution, i.e. the joint distribution of model parameters conditional on the data:\n\naz.plot_trace(data=pooled_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Pooled Model Trace\");\n\n\n\n\nWe can also see some posterior summary statistics.\n\npooled_summary = az.summary(data=pooled_results)\npooled_summary\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n floor[Basement]\n 1.362\n 0.029\n 1.308\n 1.416\n 0.001\n 0.000\n 2861.0\n 1584.0\n 1.0\n \n \n floor[Floor]\n 0.776\n 0.060\n 0.664\n 0.885\n 0.001\n 0.001\n 2818.0\n 1502.0\n 1.0\n \n \n log_radon_sigma\n 0.791\n 0.018\n 0.755\n 0.823\n 0.000\n 0.000\n 2950.0\n 1459.0\n 1.0\n \n \n\n\n\n\nFrom the posterior plot and the summary, we can see the mean radon level is considerably higher in the Basement than in the Floor level. This reflects what we originally saw in the initial data exploration. In addition, sice we have more measurements in the Basement, the uncertainty in its posterior is smaller than the uncertainty in the posterior for the Floor level.\nWe can compare the mean of the posterior distribution of the floor terms to the sample mean. 
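The two quantities being compared are simply the posterior means from the summary above and the per-floor sample means. A minimal sketch before the full plot below; the variable names are illustrative:

posterior_means = pooled_summary["mean"].iloc[:-1]      # floor[Basement], floor[Floor]
sample_means = df.groupby("floor")["log_radon"].mean()  # sample mean per floor level
print(posterior_means)
print(sample_means)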
This is going to be useful to understand the meaning of complete pooling.\n\n_, ax = plt.subplots()\n\n(\n pooled_summary[\"mean\"]\n .iloc[:-1]\n .reset_index()\n .assign(floor = lambda x: x[\"index\"].str.slice(6, -1).str.strip())\n .merge(\n right=df.groupby([\"floor\"])[\"log_radon\"].mean(),\n left_on=\"floor\",\n right_index=True\n )\n .rename(columns={\n \"mean\": \"posterior mean\",\n \"log_radon\": \"sample mean\"\n })\n .melt(\n id_vars=\"floor\",\n value_vars=[\"posterior mean\", \"sample mean\"]\n )\n .pipe((sns.barplot, \"data\"),\n x=\"floor\",\n y=\"value\",\n hue=\"variable\",\n ax=ax\n )\n)\nax.set(title=\"log(Radon + 0.1) Mean per Floor - Pooled Model\", ylabel=\"$\\log + 0.1$\");\n\n\n\n\nFrom the plot alone it is hard to detect the difference between the posterior mean and the sample mean. This happens because the estimation for any observation in either group is simply the group mean plus the smoothing due to the non-flat priors.\nIn other words, for every observation where floor is \"Basement\" the model predicts the mean radon for all the basement measurements, and for every observation where floor is \"Floor\", the model predicts the mean radon for all the floor measurements.\nWhat does complete pooling exactly mean here?\nIn this example, the pooling refers to how we treat the different counties when computing estimates (i.e. this does not refer to pooling across floor levels for example). Complete pooling means that all measurements for all counties are pooled into a single estimate (“treat all counties the same”), conditional on the floor level (because it is used as a covariate/predictor). For that reason, when computing the prediction for a given observation, we do not discriminate which county it belongs to. We pool all the counties into a single estimate, or in other words, we perform a complete pooling.\nLet’s now compare the posterior predictive distribution for each group with the distribution of the observed data.\nTo do this we need to perform a couple of steps:\n\nObtain samples from the posterior predictive distribution using the .predict() method.\nApply the inverse transform to have the posterior predictive samples in the original scale of the response.\n\n\n# Note we create a new data set. 
\n# One observation per group is enough to obtain posterior predictive samples for that group\n# The more observations we create, the more posterior predictive samples from the same distribution\n# we obtain.\nnew_data = pd.DataFrame({\"floor\": [\"Basement\", \"Floor\"]})\npooled_model.predict(pooled_results, kind=\"pps\", data=new_data)\n\n# Stack chains and draws and extract posterior predictive samples\npps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n# Inverse transform the posterior predictive samples\npps = np.exp(pps) - 0.1\n\nfig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 6), layout=\"constrained\")\nax = ax.flatten()\n\nsns.histplot(x=pps[0].flatten(), stat=\"density\", color=\"C0\", ax=ax[0])\nax[0].set(title=\"Basement (Posterior Predictive Distribution)\", xlabel=\"radon\", ylabel=\"Density\")\nsns.histplot(x=\"activity\", data=df.query(\"floor == 'Basement'\"), stat=\"density\", ax=ax[2])\nax[2].set(title=\"Basement (Sample Distribution)\", xlim=ax[0].get_xlim(), xlabel=\"radon\", ylabel=\"Density\")\n\nsns.histplot(x=pps[1].flatten(), stat=\"density\", color=\"C1\", ax=ax[1])\nax[1].set(title=\"Floor (Posterior Predictive Distribution)\", xlabel=\"radon\", ylabel=\"Density\")\nsns.histplot(x=\"activity\", data=df.query(\"floor == 'Floor'\"), stat=\"density\", color=\"C1\", ax=ax[3])\nax[3].set(title=\"Floor (Sample Distribution)\", xlim=ax[1].get_xlim(), xlabel=\"radon\", ylabel=\"Density\");\n\n/tmp/ipykernel_29247/1213510270.py:9: FutureWarning: extract_dataset has been deprecated, please use extract\n pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\n\n\n\nThe distributions look very similar, but we see that we have some extreme values. Hence if we need a number to compare them let us use the median.\n\nnp.median(a=pps, axis=1)\n\narray([3.71183577, 2.01142545])\n\n\n\ndf.groupby([\"floor\"])[\"activity\"].median()\n\nfloor\nBasement 3.9\nFloor 2.1\nName: activity, dtype: float64\n\n\n\n\n\n\n\nThe following model uses both floor and county as predictors. They are represented with an interaction effect. It means the predicted radon level for a given measurement depends both on the floor level as well as the county. This interaction coefficient allows the floor effect to vary across counties. Or said analogously, the county effect can vary across floor levels.\n\n\n\\[\ny = \\gamma_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement }\\\\\n\\gamma_{jk} &= \\text{Coefficient for floor level } j \\text{ and county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect is the interaction between floor and county. The prior is\n\\[\n\\gamma_{jk} \\sim \\text{Normal}(0, \\sigma_{\\gamma_{jk}})\n\\]\nfor all \\(j: 1, 2\\) and \\(k: 1, \\cdots, 85\\).\n\\(\\sigma_{\\gamma_{jk}}\\) is a positive constant that we set to 10 in all cases.\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon_i & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\] where \\(\\lambda\\) is a positive constant that we set to 1.\nTo specify this model in Bambi we can use the formula log_radon ~ 0 + county:floor. Again, we remove the global intercept with the 0 on the right hand side. 
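Before fitting, it can help to count how many county-floor cells this interaction creates and how many of them actually contain observations. A short pandas sketch (the variable names are illustrative):

n_cells = df["county"].nunique() * df["floor"].nunique()            # 85 counties x 2 floor levels = 170
n_observed_cells = df.groupby(["county", "floor"]).size().shape[0]  # only the observed combinations
print(n_cells, n_observed_cells)  # some cells, like YELLOW MEDICINE / Floor, have no data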
county:floor specifies the multiplicative interaction between county and floor.\n\nunpooled_priors = {\n \"county:floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nunpooled_model = bmb.Model(\"log_radon ~ 0 + county:floor\", df, priors=unpooled_priors)\nunpooled_model\n\n Formula: log_radon ~ 0 + county:floor\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n county:floor ~ Normal(mu: 0, sigma: 10)\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\nunpooled_results = unpooled_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, county:floor]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 01:14<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 74 seconds.\n\n\n\nunpooled_model.graph()\n\n\n\n\nFrom the graph representation of the model we see the model estimates \\(170 = 85 \\times 2\\) parameters for the county:floor interaction. Let us now explore the model fit.\nFirst, we can now see the plot of the marginal posterior distributions along with the sampling traces.\n\naz.plot_trace(data=unpooled_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Un-Pooled Model Trace\");\n\n\n\n\nSome posteriors for county:floor are much more spread than others, which makes it harder to compare them. To obtain a better summary visualization we can use a forest plot. This plot also allows us to identify exactly the combination of county and floor level.\n\naz.plot_forest(data=unpooled_results, figsize=(6, 32), r_hat=True, combined=True, textsize=8);\n\n\n\n\nNote how for the combination county == 'YELLOW MEDICINE' and floor == 'Floor' where we do not have any observations, the model can still generate predictions which are essentially coming from the prior distributions, which explains the large HDI intervals.\nNext, let’s have a look into the posterior mean for each county and floor combination:\n\nunpooled_summary = az.summary(data=unpooled_results)\n\nWe can now plot the posterior distribution mean of the gamma coefficients against the observed values (sample).\n\n# Get county and floor names from summary table\nvar_mapping = (\n unpooled_summary\n .iloc[:-1]\n .reset_index(drop=False)[\"index\"].str.slice(13, -1).str.split(\",\").apply(pd.Series)\n)\n\nvar_mapping.rename(columns={0: \"county\", 1: \"floor\"}, inplace=True)\nvar_mapping[\"county\"] = var_mapping[\"county\"].str.strip()\nvar_mapping[\"floor\"] = var_mapping[\"floor\"].str.strip()\nvar_mapping.index = unpooled_summary.iloc[:-1].index\n \n# Merge with observed values\nunpooled_summary_2 = pd.concat([var_mapping, unpooled_summary.iloc[:-1]], axis=1)\n\nfig, ax = plt.subplots(figsize=(7, 6))\n\n(\n unpooled_summary_2\n .merge(right=log_radon_county_agg, on=[\"county\", \"floor\"], how=\"left\")\n .pipe(\n (sns.scatterplot, \"data\"),\n x=\"log_radon_mean\",\n y=\"mean\",\n hue=\"floor\",\n ax=ax\n )\n)\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.legend()\nax.set(\n title=\"log(Radon + 0.1) Mean per County (Unpooled Model)\",\n xlabel=\"observed (sample)\",\n ylabel=\"prediction\",\n);\n\n\n\n\nAs expected, the values strongly concentrated along the diagonal. 
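A quick way to quantify the small departures from the diagonal is to rank the county-floor cells by the absolute gap between the posterior mean and the sample mean; cells with few observations are the ones most affected by the prior. A sketch reusing the merged objects from above (abs_gap is an illustrative name):

gap = (
    unpooled_summary_2
    .merge(log_radon_county_agg, on=["county", "floor"], how="left")
    .assign(abs_gap=lambda d: (d["mean"] - d["log_radon_mean"]).abs())
    .sort_values("abs_gap", ascending=False)
)
# Cells with few observations tend to sit furthest from their sample mean
print(gap[["county", "floor", "n_obs", "abs_gap"]].head())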
In other words, for each county and floor level combination, the model uses their sample mean of radon level as prediction, plus smoothing due to the non-flat priors.\nWhat does no pooling exactly mean here?\nIn the previous example we said complete pooling means the observations are pooled together into single estimates no matter the county they belong to. The situation is completely the opposite in this no pooling scenario. Here, none of the measurements in a given county affect the computation of the coefficient for another county. That’s why, in the end, the estimation for each combination of county and floor level (i.e. \\(\\gamma_{jk}\\)) is the mean of the measurements in that county and floor level (plus prior smoothing) as is reflected in the diagonal scatterplot above.\n\n\n\n\n\n\nIn this section we are going to explore various types of hierarchical models. If you’re familiar with the PyMC way of using hierarchies, the Bambi way (borrowed from mixed effects models way) may be a bit unfamiliar in the beginning, but as we will see, the notation is very convenient. A good explanation is found in Chapter 16 from Bayes Rules book, specifically section 16.3.2. Moreover, you can also take a look into the Bambi examples section where you can find other use cases.\n\n\nWe start with a model that considers a global intercept and varying intercepts for each county. The dispersion parameter of the prior for these varying intercepts is an hyperprior that is common to all the counties. As we are going to conclude later, this is what causes the partial pooling in the model estimates.\n\n\nLet us use greek letters for common effects and roman letters for varying effects. In this case, \\(\\alpha\\) is a common intercept and \\(u\\) is a group-specific intercept.\n\\[\ny = \\alpha + u_j + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement } \\\\\n\\alpha &= \\text{Intercept common to all measurements or global intercept} \\\\\nu_j &= \\text{Intercept specific to the county } j \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe only common effect in this model is the intercept \\(\\alpha\\). We have\n\\[\n\\alpha \\sim \\text{Normal}(0, \\sigma_\\alpha)\n\\]\nwhere \\(\\sigma_\\alpha\\) is a positive constant that we set to 10.\n\n\n\n\\[\nu_j \\sim \\text{Normal}(0, \\sigma_u)\n\\]\nfor all \\(j: 1, \\cdots, 85\\).\nContrary to the common effects case, \\(\\sigma_u\\) is considered a random variable.\nWe assign \\(\\sigma_u\\) the following hyperprior, which is the same to all the counties,\n\\[\n\\sigma_u\\sim \\text{Exponential}(\\tau)\n\\]\nand \\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\n\n\n\n\nThe common intercept \\(\\alpha\\) represents the mean response across all counties and floor levels.\nOn top of it, the county-specific intercept terms \\(u_j\\) represent county-specific deviations from that global mean. 
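\nTo make this decomposition concrete: once the model defined below is fitted, the implied intercept for county \\(j\\) can be recovered from the posterior as \\(\\alpha + u_j\\). The following is only a sketch; it assumes the fitted results are stored in partial_pooling_results and uses the default Bambi variable names \"Intercept\" and \"1|county\".\n\n# Sketch (run after fitting): county-level intercept = global intercept + county deviation\npost = partial_pooling_results.posterior\ncounty_intercepts = post[\"Intercept\"] + post[\"1|county\"]\n# posterior mean of the implied intercept for each county\ncounty_intercepts.mean(dim=(\"chain\", \"draw\")).to_series().head()\n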
This type of term is also known as a varying intercept in the statistical literature.\n\n# We can add the hyper-priors inside the prior dictionary parameter of the model constructor\npartial_pooling_priors = {\n \"Intercept\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"1|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\npartial_pooling_model = bmb.Model(\n formula=\"log_radon ~ 1 + (1|county)\", \n data=df, \n priors=partial_pooling_priors, \n noncentered=False\n)\npartial_pooling_model\n\n Formula: log_radon ~ 1 + (1|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n 1|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nThe noncentered argument asks Bambi not to use the non-centered representation for the varying effects. This makes the graph representation clearer and is closer to the original implementation in the PyMC documentation.\n\npartial_pooling_results = partial_pooling_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, Intercept, 1|county_sigma, 1|county]\n\n\n\n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 6 seconds.\n\n\nWe can inspect the graphical representation of the model:\n\npartial_pooling_model.graph()\n\n\n\n\nWe can clearly see a new hierarchical level as compared to the complete pooling model and the unpooled model.\nNext, we can plot the posterior distribution of the coefficients in the model:\n\naz.plot_trace(data=partial_pooling_results, compact=True, chain_prop={\"ls\": \"-\"})\nplt.suptitle(\"Partial Pooling Model Trace\");\n\n\n\n\n\n1|county is \\(u_j\\), the county-specific intercepts.\n1|county_sigma is \\(\\sigma_u\\), the standard deviation of the county-specific intercepts above.\n\nLet us now compare the posterior predictive mean against the observed data at county level.\n\npartial_pooling_results\n\n(Output: an arviz.InferenceData object with posterior, sample_stats, and observed_data groups; the full xarray repr is omitted here.)
\n\npartial_pooling_model.predict(partial_pooling_results, kind=\"pps\")\n\n# Stack chains and draws. pps stands for posterior predictive samples\npps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\npps_df = pd.DataFrame(data=pps).assign(county=df[\"county\"])\ny_pred = pps_df.groupby(\"county\").mean().mean(axis=1)\ny_sample = df.groupby(\"county\")[\"log_radon\"].mean()\n\nfig, ax = plt.subplots(figsize=(8, 7))\nsns.regplot(x=y_sample, y=y_pred, ax=ax)\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.axhline(y=y_pred.mean(), color=\"C3\", linestyle=\"--\", label=\"predicted global mean\")\nax.legend(loc=\"lower right\")\nax.set(\n title=\"log(Radon + 0.1) Mean per County (Partial Pooling Model)\",\n xlabel=\"observed (sample)\",\n ylabel=\"prediction\",\n xlim=(0.3, 2.7),\n ylim=(0.3, 2.7),\n);\n\n/tmp/ipykernel_29247/3145587883.py:4: FutureWarning: extract_dataset has been deprecated, please use extract\n pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\n\n\nNote that in this case the points are not concentrated along the diagonal (as was the case for the unpooled model). The reason is that in the partial pooling model the hyperprior shrinks the predictions towards the global mean, represented by the horizontal dashed line.\nWhat does partial pooling exactly mean here?\nWe said the first model we built performed complete pooling because estimates pooled observations regardless of which county they belong to. We could see that in the coefficients for the floor variable. The estimate for each level was the sample mean for that level, plus prior smoothing, without making any special distinction between observations from different counties.\nThen, when we built our second model, we said it performed no pooling. This was the opposite scenario. Estimates for effects involving a specific county were not informed at all by the information in the other counties.\nNow, we say this model performs partial pooling. But what does it mean?
Well, if we had complete pooling and no pooling, this must be some type of compromise in between.\nIn this model, we have a global intercept \\(\\alpha\\), which represents the mean of the response variable across all counties. We also have group-specific intercepts \\(u_j\\) that represent deviations from the global mean specific to each county \\(j\\). These group-specific intercepts are assigned a Normal prior centered at 0. The standard deviations of these priors are considered random, instead of fixed. Since they are random, they are assigned a prior distribution, which is a hyperprior in this case because it is a prior on top of a prior. And that hyperprior is the same distribution for all the county-specific intercepts. Because of that, these random deviations from the global mean are not independent. Indeed, the shared hyperprior is what causes the partial pooling in the model estimates. In other words, some information is shared between counties when computing estimates for their effects, and it results in a shrinkage towards the global mean.\nConnecting what we’ve just said with the figure above, we can see that partial pooling is a compromise between complete pooling (global mean) and no pooling (diagonal).\n\n\n\n\nNext, we add the floor global feature (i.e. it does not depend on the county) into the model above. We remove the global intercept so Bambi keeps one coefficient for each floor level.\nIn the original PyMC example, this model is introduced under the Varying intercept model title. We feel that “County-specific intercepts and common predictors” is a more accurate representation of the model we build in Bambi. It is correct to say this is a varying intercept model, because of the county-specific intercepts, but so was the last model we built.\n\n\n\\[\ny = \\beta_j + u_k + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement } \\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\nu_k &= \\text{Intercept specific to the county } k \\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect in this model is the floor term \\(\\beta_j\\)\n\\[\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nfor all \\(j: 1, 2\\) and \\(\\sigma_{\\beta_j}\\) is a positive constant that we set to \\(10\\).\n\n\n\n\\[\nu_k \\sim \\text{Normal}(0, \\sigma_u)\n\\]\nfor all \\(k: 1, \\cdots, 85\\). The hyperprior is\n\\[\n\\sigma_u \\sim \\text{Exponential}(\\tau)\n\\]\nand \\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\(\\beta_j\\) and \\(u_k\\) may look similar. 
The difference is that the latter is a hierarchical effect (it has a hyperprior), while the former is not.\n\nvarying_intercept_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"1|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nvarying_intercept_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + (1|county)\",\n data=df,\n priors=varying_intercept_priors,\n noncentered=False\n )\n\nvarying_intercept_model\n\n Formula: log_radon ~ 0 + floor + (1|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n 1|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\nvarying_intercept_results = varying_intercept_model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, 1|county_sigma, 1|county]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nWhen looking at the graph representation of the model we still see the hierarchical structure for the county varying intercepts, but we do not see it for the floor feature as expected.\n\nvarying_intercept_model.graph()\n\n\n\n\nLet us visualize the posterior distributions:\n\naz.plot_trace(data=varying_intercept_results, compact=True, chain_prop={\"ls\": \"-\"});\nplt.suptitle(\"Varying Intercepts Model Trace\");\n\n\n\n\n\n\n\n\n\nNext we want to include a hierarchical structure in the floor effect.\n\n\n\\[\ny = \\beta_j + b_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement}\\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\nb_{jk} &= \\text{Coefficient for the floor level } j \\text{ specific to the county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThe common effect in this model is the floor term \\(\\beta_j\\)\n\\[\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j})\n\\]\nwhere \\(\\sigma_{\\beta_j}\\) is a positive constant that we set to \\(10\\).\n\n\n\nHere, again, we have the floor effects\n\\[\nb_{jk} \\sim \\text{Normal}(0, \\sigma_{b_j})\n\\]\nfor \\(j:1, 2\\) and \\(k: 1, \\cdots, 85\\).\nThe hyperprior is\n\\[\n\\sigma_{b_j} \\sim \\text{Exponential}(\\tau)\n\\]\nfor \\(j:1, 2\\).\n\\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to 1.\n\n\n\nBoth \\(\\beta_j\\) and \\(b_{jk}\\) are floor effects. The difference is that the first one is a common effect, while the second is a group-specific effect. In other words, the second floor effect varies from county to county. These effects represent the county specific deviations from the common floor effect \\(\\beta_j\\). Because of the hyperprior, the \\(b_{jk}\\) effects aren’t independent and result in the partial-pooling effect.\nIn this case the Bambi model specification is quite easy, namely log_radon ~ 0 + floor + (0 + floor|county). 
This formula represents the following terms:\n\nThe first 0 tells we don’t want a global intercept.\nfloor is \\(\\beta_j\\). It says we want to include an effect for each floor level. Since there’s no global intercept, a coefficient for each level is included.\nThe 0 in (0 + floor|county) means we don’t want county-specific intercept. We need to explicitly turn it off as we did with the regular intercept.\nfloor|county is \\(b_{jk}\\), the county-specific floor coefficients. Again, since there’s no varying intercepot for the counties, this includes coefficients for both floor levels.\n\n\nvarying_intercept_slope_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"floor|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\nvarying_intercept_slope_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + (0 + floor|county)\",\n data=df,\n priors=varying_intercept_slope_priors,\n noncentered=True\n )\n\nvarying_intercept_slope_model\n\n Formula: log_radon ~ 0 + floor + (0 + floor|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n floor|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\nNext, we fit the model. Note we increase the default number of draws from the posterior and the tune samples to 2000. In addition, as the structure of the model gets more complex, so does the posterior. That’s why we increase target_accept from the default 0.8 to 0.9, because we want to explore the posterior more cautiously .\n\nvarying_intercept_slope_results = varying_intercept_slope_model.fit(\n draws=2000, \n tune=2000,\n target_accept=0.9\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, floor|county_sigma, floor|county_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:24<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 24 seconds.\n\n\nIn the graph representation of the model we can now see hierarchical structures both in the intercepts and the slopes. The terms that end with _offset appeared because we are using a non-centered parametrization. This parametrization is an algebraic trick that helps computation but leaves the model unchanged.\n\nvarying_intercept_slope_model.graph()\n\n\n\n\nLet’s have a look at the marginal posterior for the coefficients in the model.\n\nvar_names = [\"floor\", \"floor|county\", \"floor|county_sigma\", \"log_radon_sigma\"]\naz.plot_trace(\n data=varying_intercept_slope_results,\n var_names=var_names, \n compact=True, \n chain_prop={\"ls\": \"-\"}\n);\n\n\n\n\n\n\n\n\n\nWe now want to consider a county-level predictor, namely the (log) uranium level. This is not a county-level predictor in the sense that we use a county-specific coefficient, but in the sense that all the uranium concentrations were measured per county. 
Thus all the houses in the same county have the same uranium level.\n\n\n\\[\ny = \\beta_j + \\xi x + b_{jk} + \\varepsilon\n\\]\nwhere\n\\[\n\\begin{aligned}\ny &= \\text{Response for the (log) radon measurement} \\\\\nx &= \\text{Log uranium concentration} \\\\\n\\beta_j &= \\text{Coefficient for the floor level } j \\\\\n\\xi &= \\text{Coefficient for the slope of the log uranium concentration}\\\\\nb_{jk} &= \\text{Coefficient for the floor level } j \\text{ specific to the county } k\\\\\n\\varepsilon & = \\text{Residual random error}\n\\end{aligned}\n\\]\n\n\n\n\n\nThis model has two common effects:\n\\[\n\\begin{aligned}\n\\beta_j \\sim \\text{Normal}(0, \\sigma_{\\beta_j}) \\\\\n\\xi \\sim \\text{Normal}(0, \\sigma_\\xi)\n\\end{aligned}\n\\]\nwhere \\(j:1, 2\\) and all \\(\\sigma_{\\beta_j}\\) and \\(\\sigma_{\\xi}\\) are set to \\(10\\).\n\n\n\nHere, again, we have the floor effects\n\\[\nb_{jk} \\sim \\text{Normal}(0, \\sigma_{b_j})\n\\]\nfor \\(j:1, 2\\) and \\(k: 1, \\cdots, 85\\).\nThe hyperprior is\n\\[\n\\sigma_{b_j} \\sim \\text{Exponential}(\\tau)\n\\]\nfor \\(j:1, 2\\).\n\\(\\tau\\) is a positive constant that we set to \\(1\\).\n\n\n\n\\[\n\\begin{aligned}\n\\varepsilon & \\sim \\text{Normal}(0, \\sigma) \\\\\n\\sigma & \\sim \\text{Exponential}(\\lambda)\n\\end{aligned}\n\\]\nwhere \\(\\lambda\\) is a positive constant that we set to \\(1\\).\n\ncovariate_priors = {\n \"floor\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"log_u\": bmb.Prior(\"Normal\", mu=0, sigma=10),\n \"floor|county\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"Exponential\", lam=1)),\n \"sigma\": bmb.Prior(\"Exponential\", lam=1),\n}\n\ncovariate_model = bmb.Model(\n formula=\"log_radon ~ 0 + floor + log_u + (0 + floor|county)\",\n data=df,\n priors=covariate_priors,\n noncentered=True\n )\n\ncovariate_model\n\n Formula: log_radon ~ 0 + floor + log_u + (0 + floor|county)\n Family: gaussian\n Link: mu = identity\n Observations: 919\n Priors: \n target = mu\n Common-level effects\n floor ~ Normal(mu: 0, sigma: 10)\n log_u ~ Normal(mu: 0, sigma: 10)\n \n Group-level effects\n floor|county ~ Normal(mu: 0, sigma: Exponential(lam: 1))\n Auxiliary parameters\n log_radon_sigma ~ Exponential(lam: 1)\n\n\n\ncovariate_results = covariate_model.fit(\n draws=2000, \n tune=2000,\n target_accept=0.9,\n chains=2\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [log_radon_sigma, floor, log_u, floor|county_sigma, floor|county_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:26<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 27 seconds.\n\n\n\ncovariate_model.graph()\n\n\n\n\n\nvar_names = [\"floor\", \"log_u\", \"floor|county\", \"floor|county_sigma\", \"log_radon_sigma\"]\naz.plot_trace(\n data=covariate_results,\n var_names=var_names, \n compact=True, \n chain_prop={\"ls\": \"-\"}\n);\n\n\n\n\nLet us now visualize the posterior distributions of the intercepts:\n\n# get log_u values per county\nlog_u_sample = df.groupby([\"county\"])[\"log_u\"].mean().values\n\n# compute the slope posterior samples\nlog_u_slope = covariate_results.posterior[\"log_u\"].values[..., None] * log_u_sample\n\n# Compute the posterior for the floor coefficient when it is Basement\nintercepts = (\n covariate_results.posterior.sel(floor_dim=\"Basement\")[\"floor\"]\n + covariate_results.posterior.sel(floor__expr_dim=\"Basement\")[\"floor|county\"] 
\n).values\n\ny_predicted = (intercepts + log_u_slope).reshape(4000, n_counties).T\n\n# reduce the intercepts posterior samples to the mean per county\nmean_intercept = intercepts.mean(axis=2)[..., None] + log_u_slope\n\n\nfig, ax = plt.subplots()\n\ny_predicted_bounds = np.quantile(y_predicted, q=[0.03, 0.96], axis=1)\n\nsns.scatterplot(\n x=log_u_sample,\n y=y_predicted.mean(axis=1),\n alpha=0.8,\n color=\"C0\",\n s=50,\n label=\"Mean county-intercept\",\n ax=ax\n)\nax.vlines(log_u_sample, y_predicted_bounds[0], y_predicted_bounds[1], color=\"C1\", alpha=0.5)\n\naz.plot_hdi(\n x=log_u_sample,\n y=mean_intercept,\n color=\"black\",\n fill_kwargs={\"alpha\": 0.1, \"label\": \"Mean intercept HPD\"},\n ax=ax\n)\n\nsns.lineplot(\n x=log_u_sample,\n y=mean_intercept.reshape(4000, n_counties).mean(axis=0),\n color=\"black\",\n alpha=0.6,\n label=\"Mean intercept\",\n ax=ax\n)\n\nax.legend(loc=\"upper left\")\nax.set(\n title=\"County Intercepts (Covariance Model)\",\n xlabel=\"County-level log uranium\",\n ylabel=\"Intercept estimate\"\n);\n\n\n\n\n\n\n\n\n\n\n\nLet us dig deeper into the model comparison for the pooled, unpooled, and partial pooling models. To do so we are generate predictions for each model ad county level, where we aggregate by taking the mean, and plot them against the observed values.\n\n# generate posterior predictive samples\npooled_model.predict(pooled_results, kind=\"pps\")\nunpooled_model.predict(unpooled_results, kind=\"pps\")\npartial_pooling_model.predict(partial_pooling_results, kind=\"pps\")\n\n# stack chain and draw values\npooled_pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\nunpooled_pps = az.extract_dataset(unpooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\npartial_pooling_pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n# Generate predictions per county\npooled_pps_df = pd.DataFrame(data=pooled_pps).assign(county=df[\"county\"])\ny_pred_pooled = pooled_pps_df.groupby(\"county\").mean().mean(axis=1)\n\nunpooled_pps_df = pd.DataFrame(data=unpooled_pps).assign(county=df[\"county\"])\ny_pred_unpooled = unpooled_pps_df.groupby(\"county\").mean().mean(axis=1)\n\npartial_pooling_pps_df = pd.DataFrame(data=partial_pooling_pps).assign(county=df[\"county\"])\ny_pred_partial_pooling = partial_pooling_pps_df.groupby(\"county\").mean().mean(axis=1)\n\n# observed values\ny_sample = df.groupby(\"county\")[\"log_radon\"].mean()\n\n/tmp/ipykernel_29247/54649629.py:7: FutureWarning: extract_dataset has been deprecated, please use extract\n pooled_pps = az.extract_dataset(pooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n/tmp/ipykernel_29247/54649629.py:8: FutureWarning: extract_dataset has been deprecated, please use extract\n unpooled_pps = az.extract_dataset(unpooled_results, group=\"posterior_predictive\")[\"log_radon\"].values\n/tmp/ipykernel_29247/54649629.py:9: FutureWarning: extract_dataset has been deprecated, please use extract\n partial_pooling_pps = az.extract_dataset(partial_pooling_results, group=\"posterior_predictive\")[\"log_radon\"].values\n\n\n\nfig, ax = plt.subplots(figsize=(8, 8))\n\nsns.regplot(x=y_sample, y=y_pred_pooled, label=\"pooled\", color=\"C0\", ax=ax)\nsns.regplot(x=y_sample, y=y_pred_unpooled, label=\"unpooled\", color=\"C1\", ax=ax)\nsns.regplot(x=y_sample, y=y_pred_partial_pooling, label=\"partial pooling\", color=\"C2\", ax=ax)\nax.axhline(y=df[\"log_radon\"].mean(), color=\"C0\", 
linestyle=\"--\", label=\"sample mean\")\nax.axline(xy1=(1, 1), slope=1, color=\"black\", linestyle=\"--\", label=\"diagonal\")\nax.axhline(\n y=y_pred_partial_pooling.mean(), color=\"C3\",\n linestyle=\"--\", label=\"predicted global mean (partial pooling)\"\n)\nax.legend(loc=\"upper center\", bbox_to_anchor=(0.5, -0.1), ncol=2)\nax.set(\n title=\"log(Radon + 0.1) Mean per County - Model Comparison\",\n xlabel=\"observed (sample)\",\n ylabel=\"prediction\",\n xlim=(0.2, 2.8),\n ylim=(0.2, 2.8),\n);\n\n\n\n\n\nThe pooled model consider all the counties together, this explains why the predictions do not vary at county level. This is represented by the almost-flat line in the plot above (blue).\nOn the other hand, the unpooled model considers each county separately, so the prediction is very close to the observation mean. This is represented by the line very close to the diagonal (orange).\nThe partial pooling model is mixing global and information at county level. This is clearly seen by how corresponding (green) line is in between the pooling and unpooling lines.\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nnumpy : 1.23.5\nseaborn : 0.12.2\nmatplotlib: 3.6.2\nbambi : 0.9.3\narviz : 0.14.0\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\npandas : 1.5.2\npymc : 5.0.1\n\nWatermark: 2.3.1" + "text": "This notebook shows how to use, and the capabilities, of the plot_predictions function. The plot_predictions function is a part of Bambi’s sub-package interpret that features a set of tools used to interpret complex regression models that is inspired by the R package marginaleffects.\n\n\nThe purpose of the generalized linear model (GLM) is to unify the approaches needed to analyze data for which either: (1) the assumption of a linear relation between \\(x\\) and \\(y\\), or (2) the assumption of normal variation is not appropriate. GLMs are typically specified in three stages: 1. the linear predictor \\(\\eta = X\\beta\\) where \\(X\\) is an \\(n\\) x \\(p\\) matrix of explanatory variables. 2. the link function \\(g(\\cdot)\\) that relates the linear predictor to the mean of the outcome variable \\(\\mu = g^{-1}(\\eta) = g^{-1}(X\\beta)\\) 3. the random component specifying the distribution of the outcome variable \\(y\\) with mean \\(\\mathbb{E}(y|X) = \\mu\\).\nBased on these three specifications, the mean of the distribution of \\(y\\), given \\(X\\), is determined by \\(X\\beta: \\mathbb{E}(y|X) = g^{-1}(X\\beta)\\).\nGLMs are a broad family of models where the output \\(y\\) is typically assumed to follow an exponential family distribution, e.g., Binomial, Poisson, Gamma, Exponential, and Normal. The job of the link function is to map the linear space of the model \\(X\\beta\\) onto the non-linear space of a parameter like \\(\\mu\\). Commonly used link function are the logit and log link. Also known as the canonical link functions. This brief introduction to GLMs is not meant to be exhuastive, and another good starting point is the Bambi Basic Building Blocks example.\nDue to the link function, there are typically three quantities of interest to interpret in a GLM: 1. the linear predictor \\(\\eta\\) 2. the mean \\(\\mu = g^{-1}(\\eta)\\) 3. 
the response variable \\(Y \\sim \\mathcal{D}(\\mu, \\theta)\\) where \\(\\mu\\) is the mean parameter and \\(\\theta\\) is (possibly) a vector that contains all the other “nuissance” parameters of the distribution.\nAs modelers, we are usually more interested in interpreting (2) and (3). However, \\(\\mu\\) is not always on the same scale of the response variable and can be more difficult to interpret. Rather, the response scale is a more interpretable scale. Additionally, it is often the case that modelers would like to analyze how a model parameter varies across a range of explanatory variable values. To achieve such an analysis, Bambi has taken inspiration from the R package marginaleffects, and implemented a plot_predictions function that plots the conditional adjusted predictions to aid in the interpretation of GLMs. Below, it is briefly discussed what are conditionally adjusted predictions, how they are computed, and ultimately how to use the plot_predictions function.\n\n\n\nAdjusted predictions refers to the outcome predicted by a fitted model on a specified scale for a given combination of values of the predictor variables, such as their observed values, their means, or some user specified grid of values. The specification of the scale to make the predictions, the link or response scale, refers to the scale used to estimate the model. In normal linear regression, the link scale and the response scale are identical, and therefore, the adjusted prediction is expressed as the mean value of the response variable at the given values of the predictor variables. On the other hand, a logistic regression’s link and response scale are not identical. An adjusted prediction on the link scale will be represented as the log-odds of a successful response given values of the predictor variables. Whereas an adjusted prediction on the response scale gives the probability that the response variable equals 1. The conditional part of conditionally adjusted predictions represents the specific predictor(s) and its values we would like to condition on when plotting predictions.\n\n\nThe objective of plotting conditional adjusted predictions is to visualize how a parameter of the (conditional) response distribution varies as a function of (some) interpolated explanatory variables. This is done by holding all other explanatory variables constant at some specified value, a reference grid, that may or may not correspond to actual observations in the dataset used to fit the model. By default, the plot_predictions function uses a grid of 200 equally spaced values between the minimum and maximum values of the specified explanatory variable as the reference grid.\nThe plot_predictions function uses the fitted model to then compute the predicted values of the model parameter at each value of the reference grid. The plot_predictions function then uses these predictions to plot the model parameter as a function of (some) explanatory variable.\n\nimport arviz as az\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nimport bambi as bmb\n\n\n\n\n\nFor the first demonstration, we will use a Gaussian linear regression model with the mtcars dataset to better understand the plot_predictions function and its arguments. The mtcars dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). 
The following is a brief description of the variables in the dataset:\n\nmpg: Miles/(US) gallon\ncyl: Number of cylinders\ndisp: Displacement (cu.in.)\nhp: Gross horsepower\ndrat: Rear axle ratio\nwt: Weight (1000 lbs)\nqsec: 1/4 mile time\nvs: Engine (0 = V-shaped, 1 = straight)\nam: Transmission (0 = automatic, 1 = manual)\ngear: Number of forward gear\n\n\n# Load data\ndata = bmb.load_data('mtcars')\ndata[\"cyl\"] = data[\"cyl\"].replace({4: \"low\", 6: \"medium\", 8: \"high\"})\ndata[\"gear\"] = data[\"gear\"].replace({3: \"A\", 4: \"B\", 5: \"C\"})\ndata[\"cyl\"] = pd.Categorical(data[\"cyl\"], categories=[\"low\", \"medium\", \"high\"], ordered=True)\n\n# Define and fit the Bambi model\nmodel = bmb.Model(\"mpg ~ 0 + hp * wt + cyl + gear\", data)\nidata = model.fit(draws=1000, target_accept=0.95, random_seed=1234)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [mpg_sigma, hp, wt, hp:wt, cyl, gear]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:19<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 20 seconds.\n\n\nWe can print the Bambi model object to obtain the model components. Below, we see that the Gaussian linear model uses an identity link function that results in no transformation of the linear predictor to the mean of the outcome variable, and the distrbution of the likelihood is Gaussian.\nNow that we have fitted the model, we can visualize how a model parameter varies as a function of (some) interpolated covariate. For this example, we will visualize how the mean response mpg varies as a function of the covariate hp.\nThe Bambi model, ArviZ inference data object (containing the posterior samples and the data used to fit the model), and a list or dictionary of covariates, in this example only hp, are passed to the plot_predictions function. The plot_predictions function then computes the conditional adjusted predictions for each covariate in the list or dictionary using the method described above. The plot_predictions function returns a matplotlib figure object that can be further customized.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"hp\", ax=ax);\n\n\n\n\nThe plot above shows that as hp increases, the mean mpg decreases. As stated above, this insight was obtained by creating the reference grid and then using the fitted model to compute the predicted values of the model parameter, in this example mpg, at each value of the reference grid.\nBy default, plot_predictions uses the highest density interval (HDI) of the posterior distribution to compute the credible interval of the conditional adjusted predictions. The HDI is a Bayesian analog to the frequentist confidence interval. The HDI is the shortest interval that contains a specified probability of the posterior distribution. By default, plot_predictions uses the 94% HDI.\nplot_predictions uses the posterior distribution by default to visualize some mean outcome parameter . However, the posterior predictive distribution can also be plotted by specifying pps=True where pps stands for posterior predictive samples of the response variable.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"hp\", pps=True, ax=ax);\n\n\n\n\nHere, we notice that the uncertainty in the conditional adjusted predictions is much larger than the uncertainty when pps=False. 
This is because the posterior predictive distribution accounts for the uncertainty in the model parameters and the uncertainty in the data. Whereas, the posterior distribution only accounts for the uncertainty in the model parameters.\nplot_predictions allows up to three covariates to be plotted simultaneously where the first element in the list represents the main (x-axis) covariate, the second element the group (hue / color), and the third element the facet (panel). However, when plotting more than one covariate, it can be useful to pass specific group and panel arguments to aid in the interpretation of the plot. Therefore, subplot_kwargs allows the user to manipulate the plotting by passing a dictionary where the keys are {\"main\": ..., \"group\": ..., \"panel\": ...} and the values are the names of the covariates to be plotted. For example, passing two covariates hp and wt and specifying subplot_kwargs={\"main\": \"hp\", \"group\": \"wt\", \"panel\": \"wt\"}.\n\nbmb.interpret.plot_predictions(\n model=model, \n idata=idata, \n covariates=[\"hp\", \"wt\"],\n pps=False,\n legend=False,\n subplot_kwargs={\"main\": \"hp\", \"group\": \"wt\", \"panel\": \"wt\"},\n fig_kwargs={\"figsize\": (20, 8), \"sharey\": True}\n)\nplt.tight_layout();\n\n\n\n\nFurthermore, categorical covariates can also be plotted. We plot the the mean mpg as a function of the two categorical covariates gear and cyl below. The plot_predictions function automatically plots the conditional adjusted predictions for each level of the categorical covariate. Furthermore, when passing a list of covariates into the plot_predictions function, the list will be converted into a dictionary object where the key is taken from (“horizontal”, “color”, “panel”) and the values are the names of the variables. By default, the first element of the list is specified as the “horizontal” covariate, the second element of the list is specified as the “color” covariate, and the third element of the list is mapped to different plot panels.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, [\"gear\", \"cyl\"], ax=ax);\n\n\n\n\n\n\n\nLets move onto a model that uses a distribution that is a member of the exponential distribution family and utilizes a link function. For this, we will implement the Negative binomial model from the students absences example. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math. We have attendance data on 314 high school juniors. The variables of insterest in the dataset are the following:\n\ndaysabs: The number of days of absence. It is our response variable.\nprogr: The type of program. 
Can be one of ‘General’, ‘Academic’, or ‘Vocational’.\nmath: Score in a standardized math test.\n\n\n# Load data, define and fit Bambi model\ndata = pd.read_stata(\"https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta\")\ndata[\"prog\"] = data[\"prog\"].map({1: \"General\", 2: \"Academic\", 3: \"Vocational\"})\n\nmodel_interaction = bmb.Model(\n \"daysabs ~ 0 + prog + scale(math) + prog:scale(math)\",\n data,\n family=\"negativebinomial\"\n)\nidata_interaction = model_interaction.fit(\n draws=1000, target_accept=0.95, random_seed=1234, chains=4\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [daysabs_alpha, prog, scale(math), prog:scale(math)]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.\n\n\nThis model utilizes a log link function and a negative binomial distribution for the likelihood. Also note that this model also contains an interaction prog:sale(math).\n\nmodel_interaction\n\n Formula: daysabs ~ 0 + prog + scale(math) + prog:scale(math)\n Family: negativebinomial\n Link: mu = log\n Observations: 314\n Priors: \n target = mu\n Common-level effects\n prog ~ Normal(mu: [0. 0. 0.], sigma: [5.0102 7.4983 5.2746])\n scale(math) ~ Normal(mu: 0.0, sigma: 2.5)\n prog:scale(math) ~ Normal(mu: [0. 0.], sigma: [6.1735 4.847 ])\n \n Auxiliary parameters\n alpha ~ HalfCauchy(beta: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n \"math\", \n ax=ax, \n pps=False\n);\n\n\n\n\nThe plot above shows that as math increases, the mean daysabs decreases. However, as the model contains an interaction term, the effect of math on daysabs depends on the value of prog. Therefore, we will use plot_predictions to plot the conditional adjusted predictions for each level of prog.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n [\"math\", \"prog\"], \n ax=ax, \n pps=False\n);\n\n\n\n\nPassing specific subplot_kwargs can allow for a more interpretable plot. Especially when the posterior predictive distribution plot results in overlapping credible intervals.\n\nbmb.interpret.plot_predictions(\n model_interaction, \n idata_interaction, \n covariates=[\"math\", \"prog\"],\n pps=True,\n subplot_kwargs={\"main\": \"math\", \"group\": \"prog\", \"panel\": \"prog\"},\n legend=False,\n fig_kwargs={\"figsize\": (16, 5), \"sharey\": True}\n);\n\n\n\n\n\n\n\nTo further demonstrate the plot_predictions function, we will implement a logistic regression model. This example is taken from the marginaleffects plot_predictions documentation. The internet movie database, http://imdb.com/, is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. The movies in this dataset were selected for inclusion if they had a known length and had been rated by at least one imdb user. The dataset below contains 28,819 rows and 24 columns. The variables of interest in the dataset are the following: - title. Title of the movie. - year. Year of release. - budget. 
Total budget (if known) in US dollars - length. Length in minutes. - rating. Average IMDB user rating. - votes. Number of IMDB users who rated this movie. - r1-10. Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1. - mpaa. MPAA rating. - action, animation, comedy, drama, documentary, romance, short. Binary variables represent- ing if movie was classified as belonging to that genre.\n\ndata = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2movies/movies.csv\")\n\ndata[\"style\"] = \"Other\"\ndata.loc[data[\"Action\"] == 1, \"style\"] = \"Action\"\ndata.loc[data[\"Comedy\"] == 1, \"style\"] = \"Comedy\"\ndata.loc[data[\"Drama\"] == 1, \"style\"] = \"Drama\"\ndata[\"certified_fresh\"] = (data[\"rating\"] >= 8) * 1\ndata = data[data[\"length\"] < 240]\n\npriors = {\"style\": bmb.Prior(\"Normal\", mu=0, sigma=2)}\nmodel = bmb.Model(\"certified_fresh ~ 0 + length * style\", data=data, priors=priors, family=\"bernoulli\")\nidata = model.fit(random_seed=1234, target_accept=0.9, init=\"adapt_diag\")\n\nModeling the probability that certified_fresh==1\nAuto-assigning NUTS sampler...\nInitializing NUTS using adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [length, style, length:style]\n\n\n\n\n\n\n\n \n \n 43.56% [3485/8000 04:04<05:16 Sampling 4 chains, 0 divergences]\n \n \n\n\nThe logistic regression model uses a logit link function and a Bernoulli likelihood. Therefore, the link scale is the log-odds of a successful response and the response scale is the probability of a successful response.\n\nmodel\n\n Formula: certified_fresh ~ 0 + length * style\n Family: bernoulli\n Link: p = logit\n Observations: 58662\n Priors: \n target = p\n Common-level effects\n length ~ Normal(mu: 0.0, sigma: 0.0708)\n style ~ Normal(mu: 0.0, sigma: 2.0)\n length:style ~ Normal(mu: [0. 0. 0.], sigma: [0.0702 0.0509 0.0611])\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nAgain, by default, the plot_predictions function plots the mean outcome on the response scale. Therefore, the plot below shows the probability of a successful response certified_fresh as a function of length.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"length\", ax=ax);\n\n\n\n\nAdditionally, we can see how the probability of certified_fresh varies as a function of categorical covariates.\n\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"style\", ax=ax);\n\n\n\n\n\n\n\nplot_predictions also has the argument target where target determines what parameter of the response distribution is plotted as a function of the explanatory variables. This argument is useful in distributional models, i.e., when the response distribution contains a parameter for location, scale and or shape. The default of this argument is mean and passing a parameter into target only works when the argument pps=False because when pps=True the posterior predictive distribution is plotted and thus, can only refer to the outcome variable (instead of any of the parameters of the response distribution). 
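\nSchematically, the two calling patterns described above look as follows. This is only a sketch with placeholder names (model, idata, and a predictor \"x\"); a concrete example comes next.\n\n# Posterior of a specific parameter of the response distribution (requires pps=False)\nbmb.interpret.plot_predictions(model, idata, \"x\", target=\"alpha\", pps=False)\n\n# Posterior predictive of the response itself; `target` does not apply here\nbmb.interpret.plot_predictions(model, idata, \"x\", pps=True)\n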
For this example, we will simulate our own dataset.\n\nrng = np.random.default_rng(121195)\nN = 200\na, b = 0.5, 1.1\nx = rng.uniform(-1.5, 1.5, N)\nshape = np.exp(0.3 + x * 0.5 + rng.normal(scale=0.1, size=N))\ny = rng.gamma(shape, np.exp(a + b * x) / shape, N)\ndata_gamma = pd.DataFrame({\"x\": x, \"y\": y})\n\nformula = bmb.Formula(\"y ~ x\", \"alpha ~ x\")\nmodel = bmb.Model(formula, data_gamma, family=\"gamma\")\nidata = model.fit(random_seed=1234)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, x, alpha_Intercept, alpha_x]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 25 divergences]\n \n \n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.\nThere were 25 divergences after tuning. Increase `target_accept` or reparameterize.\n\n\n\nmodel\n\n Formula: y ~ x\n alpha ~ x\n Family: gamma\n Link: mu = inverse\n alpha = log\n Observations: 200\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.0, sigma: 2.5037)\n x ~ Normal(mu: 0.0, sigma: 2.8025)\n target = alpha\n Common-level effects\n alpha_Intercept ~ Normal(mu: 0.0, sigma: 1.0)\n alpha_x ~ Normal(mu: 0.0, sigma: 1.0)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nThe model we defined uses a gamma distribution parameterized by alpha and mu where alpha utilizes a log link and mu goes through an inverse link. Therefore, we can plot either: (1) the mu of the response distribution (which is the default), or (2) alpha of the response distribution as a function of the explanatory variable \\(x\\).\n\n# First, the mean of the response (default)\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"x\", ax=ax);\n\n\n\n\nBelow, instead of plotting the default target, target=mean, we set target=alpha to visualize how the model parameter alpha varies as a function of the x predictor.\n\n# Second, another param. of the distribution: alpha\nfig, ax = plt.subplots(figsize=(7, 3), dpi=120)\nbmb.interpret.plot_predictions(model, idata, \"x\", target='alpha', ax=ax);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Aug 16 2023\n\nPython implementation: CPython\nPython version : 3.11.0\nIPython version : 8.13.2\n\npandas : 2.0.1\nmatplotlib: 3.7.1\nbambi : 0.10.0.dev0\narviz : 0.15.1\nnumpy : 1.24.2\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/hsgp_2d.html", - "href": "notebooks/hsgp_2d.html", + "objectID": "notebooks/ESCS_multiple_regression.html", + "href": "notebooks/ESCS_multiple_regression.html", "title": "Bambi", "section": "", - "text": "This article demonstrates how to use Bambi with Gaussian Processes with 2 dimensional predictors. Bambi supports Gaussian Processes through the low-rank approximation known as Hilbert Space Gaussian Processes. 
For references see Hilbert Space Methods for Reduced-Rank Gaussian Process Regression and Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.\nFor a demonstration of Gaussian Processes in 1D together with a more in depth explanation see To Do.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport pymc as pm\n\nThe goal of this notebook is to showcase Bambi’s support for Gaussian Processes on two-dimensional data using the HSGP approximation.\nTo achieve this, we begin by creating a matrix of coordinates that will serve as the locations where we measure the values of a continuous response variable.\n\nx1 = np.linspace(0, 10, 12)\nx2 = np.linspace(0, 10, 12)\nxx, yy = np.meshgrid(x1, x2)\nX = np.column_stack([xx.flatten(), yy.flatten()])\nX.shape\n\n(144, 2)\n\n\n\n\nIn modeling multi-dimensional data with a Gaussian Process, we must choose between using an isotropic or an anisotropic Gaussian Process. An isotropic GP applies the same degree of smoothing to all predictors and is rotationally invariant. On the other hand, an anisotropic GP assigns different degrees of smoothing to each predictor and is not rotationally invariant.\nFurthermore, as the hsgp() function allows for the creation of separate GP contribution terms for the levels of a categorical variable through its by argument, we also examine both single-group and multiple-group scenarios.\n\n\nWe create a covariance kernel using ExpQuad from the gp submodule in PyMC. Note that the lengthscale and amplitude for both dimensions are 2 and 1.2, respectively. Then, we simply use NumPy to get a random draw from the 144-dimensional multivariate normal distribution.\n\nrng = np.random.default_rng(1234)\n\nell = 2\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\nprint(mu.shape, K.shape)\n\nf = rng.multivariate_normal(mu, K)\n\nfig, ax = plt.subplots()\nax.scatter(xx, yy, c=f, s=900, marker=\"s\");\n\n(144,) (144, 144)\n\n\n\n\n\nSince Bambi works with long-format data frames, we need to reshape our data before creating the data frame.\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 1),\n \"y\": np.tile(yy.flatten(), 1), \n \"outcome\": f.flatten()\n }\n)\n\nNow, let’s construct the model. The only notable distinction from the one-dimensional case is that we provide two unnamed arguments to the hsgp() function, representing the predictors on each dimension.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, c=1.5, m=10)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, c=1.5, m=10)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, c=1.5, m=10)\": \"hsgp\"})\nmodel\n\n Formula: outcome ~ 0 + hsgp(x, y, c=1.5, m=10)\n Family: gaussian\n Link: mu = identity\n Observations: 144\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, c=1.5, m=10)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\nThe parameters c and m of the HSGP aproximation are specific to each dimension, and can have different values for each. 
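\nFor illustration only, per-dimension values could be passed as sequences instead of scalars. This is a sketch that is not fitted in this notebook; it assumes hsgp() accepts one value per dimension when given a sequence, as described above.\n\n# Hypothetical sketch: different m and c along each dimension (not used below)\nmodel_per_dim = bmb.Model(\"outcome ~ 0 + hsgp(x, y, c=(1.5, 2), m=(12, 10))\", data)\n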
However, as we are passing scalars instead of sequences, Bambi will internally recycle them, causing the HSGP approximation to use the same values of c and m for both dimensions.\nLet’s build the internal PyMC model and create a graph to have a visual representation of the relationships between the model parameters.\n\nmodel.build()\nmodel.graph()\n\n\n\n\nAnd finally, we quickly fit the model and show a traceplot to explore the posterior and spot any issues with the sampler.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\n/home/tomas/anaconda3/envs/bambi_hsgp/lib/python3.10/site-packages/pymc/sampling/jax.py:39: UserWarning: This module is experimental.\n warnings.warn(\"This module is experimental.\")\n\n\nCompiling...\n\n\nNo GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n\n\nCompilation time = 0:00:02.522713\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:23.313351\nTransforming variables...\nTransformation time = 0:00:00.628279\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nWe don’t see any divergences. However, the autocorrelation in the chains for the covariance function parameters, along with the insufficient mixing, indicates that there may be an issue with the prior specification of the model.\nSince the goal of the notebook is to simply show what features Bambi supports and how to use them, we won’t further investigate these issues. However, such posteriors shouldn’t be considered in any serious application.\nFrom now on, the notebook will follow the same structure as the one already shown, which consists of\n\nData simulation with some specific settings\nCreation of the Bambi model\nBuilding of the internal PyMC model and visualization of the graph\nModel fit and inspection of the traceplot\n\n\n\n\nIn this scenario we have multiple groups that share the same covariance function.\n\nrng = np.random.default_rng(123)\n\nell = 2\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K, 3)\n\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\nNotice we don’t modify anything substantial in the call to hsgp() for now.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = 
model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.721842\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:02:17.782596\nTransforming variables...\nTransformation time = 0:00:00.838094\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nWhile we have three groups, we only have one hsgp_sigma and one hsgp_ell for all groups. This is because, by default, the HSGP contributions by groups use the same instance of the covariance function.\n\n\n\nAgain we have multiple groups. But this time, each group has specific values for the amplitude and the lengthscale.\n\nrng = np.random.default_rng(12)\n\nsigmas = [1.2, 1.5, 1.8]\nells = [1.5, 2, 3]\n\nsamples = []\nfor sigma, ell in zip(sigmas, ells):\n cov = sigma * pm.gp.cov.ExpQuad(2, ls=ell)\n K = cov(X).eval()\n mu = np.zeros(X.shape[0])\n samples.append(rng.multivariate_normal(mu, K))\n\nf = np.stack(samples)\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\nIn situations like this, we can tell Bambi not to use the same covariance function for all the groups with share_cov=False and Bambi will create a separate instance for each group, resulting in group specific estimates of the amplitude and the lengthscale.\nNotice, however, we’re still using the same kind of covariance function, which in this case is ExpQuad.\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\n \"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\", \n data, \n priors=priors\n)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\nSee the all the HSGP related parameters gained the new dimension hsgp_by.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:04.491697\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:02:35.274256\nTransforming variables...\nTransformation time = 0:00:00.801181\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nUnlike the previous case, now there are three hsgp_sigma and three hsgp_ell parameters, one per group. We can see them in different colors in the visualization.\n\n\n\n\nIn this second part we repeat exactly the same that we did for the isotropic case. 
First, we start with a single group. Then, we continue with multiple groups that share the covariance function. And finally, multiple groups with different covariance functions. The main difference is that we use iso=False, which asks to use an anisotropic GP.\n\n\n\nrng = np.random.default_rng(1234)\n\nell = [2, 0.9]\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K)\n\nfig, ax = plt.subplots(figsize = (4.5, 4.5))\nax.scatter(xx, yy, c=f, s=900, marker=\"s\");\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 1),\n \"y\": np.tile(yy.flatten(), 1), \n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, c=1.5, m=10, iso=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, c=1.5, m=10, iso=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, c=1.5, m=10, iso=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, c=1.5, m=10, iso=False)\n Family: gaussian\n Link: mu = identity\n Observations: 144\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, c=1.5, m=10, iso=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\nAlthough there is only one group in this case, the graph includes a hsgp_var dimension. This dimension represents the variables in the HSGP component, indicating that there is one lengthscale parameter per variable.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.320646\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:06.159032\nTransforming variables...\nTransformation time = 0:00:00.173091\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nrng = np.random.default_rng(123)\n\nell = [2, 0.9]\ncov = 1.2 * pm.gp.cov.ExpQuad(2, ls=ell)\nK = cov(X).eval()\nmu = np.zeros(X.shape[0])\n\nf = rng.multivariate_normal(mu, K, 3)\n\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, iso=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False)\", data, priors=priors)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, iso=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, iso=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n 
\n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:02.464203\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:17.674547\nTransforming variables...\nTransformation time = 0:00:00.249682\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nrng = np.random.default_rng(12)\n\nsigmas = [1.2, 1.5, 1.8]\nells = [[1.5, 0.8], [2, 1.5], [3, 1]]\n\nsamples = []\nfor sigma, ell in zip(sigmas, ells):\n cov = sigma * pm.gp.cov.ExpQuad(2, ls=ell)\n K = cov(X).eval()\n mu = np.zeros(X.shape[0])\n samples.append(rng.multivariate_normal(mu, K))\n\nf = np.stack(samples)\nfig, axes = plt.subplots(1, 3, figsize=(12, 4))\nfor i, ax in enumerate(axes):\n ax.scatter(xx, yy, c=f[i], s=320, marker=\"s\")\n ax.grid(False)\n ax.set_title(f\"Group {i}\")\n\n\n\n\n\ndata = pd.DataFrame(\n {\n \"x\": np.tile(xx.flatten(), 3),\n \"y\": np.tile(yy.flatten(), 3),\n \"group\": np.repeat(list(\"ABC\"), 12 * 12),\n \"outcome\": f.flatten()\n }\n)\n\n\nprior_hsgp = {\n \"sigma\": bmb.Prior(\"Exponential\", lam=3),\n \"ell\": bmb.Prior(\"InverseGamma\", mu=2, sigma=0.2),\n}\npriors = {\n \"hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\": prior_hsgp, \n \"sigma\": bmb.Prior(\"HalfNormal\", sigma=2)\n}\nmodel = bmb.Model(\n \"outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\", \n data, \n priors=priors\n)\nmodel.set_alias({\"hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\": \"hsgp\"})\nprint(model)\nmodel.build()\nmodel.graph()\n\n Formula: outcome ~ 0 + hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\n Family: gaussian\n Link: mu = identity\n Observations: 432\n Priors: \n target = mu\n HSGP contributions\n hsgp(x, y, by=group, c=1.5, m=10, iso=False, share_cov=False)\n cov: ExpQuad\n sigma ~ Exponential(lam: 3.0)\n ell ~ InverseGamma(mu: 2.0, sigma: 0.2)\n \n Auxiliary parameters\n outcome_sigma ~ HalfNormal(sigma: 2.0)\n\n\n\n\n\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.9)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:03.955870\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:20.713181\nTransforming variables...\nTransformation time = 0:00:00.513813\n0\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"hsgp_sigma\", \"hsgp_ell\", \"outcome_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\n\n\n\nFor this final demonstration we’re going to use a simulated dataset where the outcome is a count variable. 
For the predictors, we have the location in terms of the latitude and longitude, as well as other variables such as the year of the measurement, the site where the measure was made, and one continuous predictor.\n\ndata = pd.read_csv(\"data/poisson_data.csv\")\ndata[\"Year\"] = pd.Categorical(data[\"Year\"])\nprint(data.shape)\ndata.head()\n\n(100, 6)\n\n\n\n\n\n\n \n \n \n Year\n Count\n Site\n Lat\n Lon\n X1\n \n \n \n \n 0\n 2015\n 4\n Site1\n 47.559880\n 7.216754\n 3.316140\n \n \n 1\n 2016\n 0\n Site1\n 47.257079\n 7.135390\n 2.249612\n \n \n 2\n 2015\n 0\n Site1\n 47.061967\n 7.804383\n 2.835283\n \n \n 3\n 2016\n 0\n Site1\n 47.385533\n 7.433145\n 2.776692\n \n \n 4\n 2015\n 1\n Site1\n 47.034987\n 7.434643\n 2.295769\n \n \n\n\n\n\nWe can visualize the outcome variable by location and year.\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 4))\nfor i, (ax, year) in enumerate(zip(axes, [2015, 2016])):\n mask = data[\"Year\"] == year\n x = data.loc[mask, \"Lat\"]\n y = data.loc[mask, \"Lon\"]\n count = data.loc[mask, \"Count\"]\n ax.scatter(x, y, c=count, s=30, marker=\"s\")\n ax.set_title(f\"Year {year}\")\n\n\n\n\nThere’s not much we can conclude from here but it’s not a problem. The most relevant part of the example is not the data itself, but how to use Bambi to include GP components in a complex model.\nIt’s very easy to create a model that uses both regular common and group-specific predictors as well as a GP contribution term. We just add them to the model formula, treat hsgp() as any other call, and that’s it!\nBelow we have common effects for the Year, the interaction between X1 and Year, and group-specific intercepts by Site. Finally, we add hsgp() as any other call.\n\nformula = \"Count ~ 0 + Year + X1:Year + (1|Site) + hsgp(Lon, Lat, by=Year, m=5, c=1.5)\"\nmodel = bmb.Model(formula, data, family=\"poisson\")\nmodel\n\n Formula: Count ~ 0 + Year + X1:Year + (1|Site) + hsgp(Lon, Lat, by=Year, m=5, c=1.5)\n Family: poisson\n Link: mu = log\n Observations: 100\n Priors: \n target = mu\n Common-level effects\n Year ~ Normal(mu: [0. 0.], sigma: [5. 5.])\n X1:Year ~ Normal(mu: [0. 0.], sigma: [1.5693 1.4766])\n \n Group-level effects\n 1|Site ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 5.3683))\n \n HSGP contributions\n hsgp(Lon, Lat, by=Year, m=5, c=1.5)\n cov: ExpQuad\n sigma ~ Exponential(lam: 1.0)\n ell ~ InverseGamma(alpha: 3.0, beta: 2.0)\n\n\nLet’s use an alias to make the graph representation more readable.\n\nmodel.set_alias({\"hsgp(Lon, Lat, by=Year, m=5, c=1.5)\": \"gp\"})\nmodel.build()\nmodel.graph()\n\n\n\n\nAnd finally, let’s fit the model.\n\nidata = model.fit(inference_method=\"nuts_numpyro\", target_accept=0.99)\nprint(idata.sample_stats.diverging.sum().item())\n\nCompiling...\nCompilation time = 0:00:04.433012\nSampling...\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSampling time = 0:00:09.698066\nTransforming variables...\nTransformation time = 0:00:00.668909\n15\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"gp_sigma\", \"gp_ell\", \"gp_weights\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\nNotice the posteriors for the gp_weights are all centered at zero. 
This is a symptom of the absence of any spatial effect.\n\naz.plot_trace(\n idata, \n var_names=[\"Year\", \"X1:Year\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);\n\n\n\n\n\naz.plot_trace(\n idata, \n var_names=[\"1|Site\", \"1|Site_sigma\"], \n backend_kwargs={\"layout\": \"constrained\"}\n);" + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nimport xarray as xr\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\n\n\nBambi comes with several datasets. These can be accessed via the load_data() function.\n\ndata = bmb.load_data(\"ESCS\")\nnp.round(data.describe(), 2)\n\n\n\n\n\n \n \n \n drugs\n n\n e\n o\n a\n c\n hones\n emoti\n extra\n agree\n consc\n openn\n \n \n \n \n count\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n 604.00\n \n \n mean\n 2.21\n 80.04\n 106.52\n 113.87\n 124.63\n 124.23\n 3.89\n 3.18\n 3.21\n 3.13\n 3.57\n 3.41\n \n \n std\n 0.65\n 23.21\n 19.88\n 21.12\n 16.67\n 18.69\n 0.45\n 0.46\n 0.53\n 0.47\n 0.44\n 0.52\n \n \n min\n 1.00\n 23.00\n 42.00\n 51.00\n 63.00\n 44.00\n 2.56\n 1.47\n 1.62\n 1.59\n 2.00\n 1.28\n \n \n 25%\n 1.71\n 65.75\n 93.00\n 101.00\n 115.00\n 113.00\n 3.59\n 2.88\n 2.84\n 2.84\n 3.31\n 3.06\n \n \n 50%\n 2.14\n 76.00\n 107.00\n 112.00\n 126.00\n 125.00\n 3.88\n 3.19\n 3.22\n 3.16\n 3.56\n 3.44\n \n \n 75%\n 2.64\n 93.00\n 120.00\n 129.00\n 136.00\n 136.00\n 4.20\n 3.47\n 3.56\n 3.44\n 3.84\n 3.75\n \n \n max\n 4.29\n 163.00\n 158.00\n 174.00\n 171.00\n 180.00\n 4.94\n 4.62\n 4.75\n 4.44\n 4.75\n 4.72\n \n \n\n\n\n\nIt’s always a good idea to start off with some basic plotting. Here’s what our outcome variable drugs (some index of self-reported illegal drug use) looks like:\n\ndata[\"drugs\"].hist();\n\n\n\n\nThe five numerical predictors that we’ll use are sum-scores measuring participants’ standings on the Big Five personality dimensions. The dimensions are:\n\nO = Openness to experience\nC = Conscientiousness\nE = Extraversion\nA = Agreeableness\nN = Neuroticism\n\nHere’s what our predictors look like:\n\naz.plot_pair(data[[\"o\", \"c\", \"e\", \"a\", \"n\"]].to_dict(\"list\"), marginals=True, textsize=24);\n\n\n\n\nWe can easily see all the predictors are more or less symmetrically distributed without outliers and the pairwise correlations between them are not strong.\n\n\n\nWe’re going to fit a pretty straightforward additive multiple regression model predicting drug index from all 5 personality dimension scores. It’s simple to specify the model using a familiar formula interface. Here we also tell Bambi to run two parallel Markov Chain Monte Carlo (MCMC) chains, each one with 2000 draws. The first 1000 draws are tuning steps that we discard and the last 1000 draws are considered to be taken from the joint posterior distribution of all the parameters (to be confirmed when we analyze the convergence of the chains).\n\nmodel = bmb.Model(\"drugs ~ o + c + e + a + n\", data)\nfitted = model.fit(tune=2000, draws=2000, init=\"adapt_diag\", random_seed=SEED)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [drugs_sigma, Intercept, o, c, e, a, n]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:11<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.\n\n\nGreat! But this is a Bayesian model, right? 
What about the priors? If no priors are given explicitly by the user, then Bambi chooses smart default priors for all parameters of the model based on the implied partial correlations between the outcome and the predictors. Here’s what the default priors look like in this case – the plots below show 1000 draws from each prior distribution:\n\nmodel.plot_priors();\n\nSampling: [Intercept, a, c, drugs_sigma, e, n, o]\n\n\n\n\n\n\n# Normal priors on the coefficients\n{x.name: x.prior.args for x in model.response_component.terms.values()}\n\n{'Intercept': {'mu': array(2.21014664), 'sigma': array(21.19375074)},\n 'o': {'mu': array(0), 'sigma': array(0.0768135)},\n 'c': {'mu': array(0), 'sigma': array(0.08679683)},\n 'e': {'mu': array(0), 'sigma': array(0.0815892)},\n 'a': {'mu': array(0), 'sigma': array(0.09727366)},\n 'n': {'mu': array(0), 'sigma': array(0.06987412)},\n 'drugs': {'mu': array(0), 'sigma': array(1)}}\n\n\n\n# HalfStudentT prior on the residual standard deviation\nfor name, component in model.constant_components.items():\n print(f\"{name}: {component.prior}\")\n\nsigma: HalfStudentT(nu: 4, sigma: 0.6482)\n\n\nYou could also just print the model and see it also contains the same information about the priors\n\nmodel\n\n Formula: drugs ~ o + c + e + a + n\n Family: gaussian\n Link: mu = identity\n Observations: 604\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 2.2101, sigma: 21.1938)\n o ~ Normal(mu: 0, sigma: 0.0768)\n c ~ Normal(mu: 0, sigma: 0.0868)\n e ~ Normal(mu: 0, sigma: 0.0816)\n a ~ Normal(mu: 0, sigma: 0.0973)\n n ~ Normal(mu: 0, sigma: 0.0699)\n Auxiliary parameters\n drugs_sigma ~ HalfStudentT(nu: 4, sigma: 0.6482)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nSome more info about the default prior distributions can be found in this technical paper.\nNotice the apparently small SDs of the slope priors. This is due to the relative scales of the outcome and the predictors: remember from the plots above that the outcome, drugs, ranges from 1 to about 4, while the predictors all range from about 20 to 180 or so. A one-unit change in any of the predictors – which is a trivial increase on the scale of the predictors – is likely to lead to a very small absolute change in the outcome. Believe it or not, these priors are actually quite wide on the partial correlation scale!\n\n\n\nLet’s start with a pretty picture of the parameter estimates!\n\naz.plot_trace(fitted);\n\n\n\n\nThe left panels show the marginal posterior distributions for all of the model’s parameters, which summarize the most plausible values of the regression coefficients, given the data we have now observed. These posterior density plots show two overlaid distributions because we ran two MCMC chains. The panels on the right are “trace plots” showing the sampling paths of the two MCMC chains as they wander through the parameter space. 
If any of these paths exhibited a pattern other than white noise we would be concerned about the convergence of the chains.\nA much more succinct (non-graphical) summary of the parameter estimates can be found like so:\n\naz.summary(fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 3.298\n 0.351\n 2.609\n 3.924\n 0.006\n 0.004\n 3956.0\n 3180.0\n 1.0\n \n \n o\n 0.006\n 0.001\n 0.004\n 0.009\n 0.000\n 0.000\n 4217.0\n 3214.0\n 1.0\n \n \n c\n -0.004\n 0.001\n -0.007\n -0.001\n 0.000\n 0.000\n 3820.0\n 3286.0\n 1.0\n \n \n e\n 0.003\n 0.001\n 0.001\n 0.006\n 0.000\n 0.000\n 4252.0\n 3625.0\n 1.0\n \n \n a\n -0.012\n 0.001\n -0.015\n -0.010\n 0.000\n 0.000\n 4846.0\n 3437.0\n 1.0\n \n \n n\n -0.002\n 0.001\n -0.004\n 0.001\n 0.000\n 0.000\n 4048.0\n 3317.0\n 1.0\n \n \n drugs_sigma\n 0.592\n 0.017\n 0.561\n 0.623\n 0.000\n 0.000\n 5882.0\n 2962.0\n 1.0\n \n \n\n\n\n\nWhen there are multiple MCMC chains, the default summary output includes some basic convergence diagnostic info (the effective MCMC sample sizes and the Gelman-Rubin “R-hat” statistics), although in this case it’s pretty clear from the trace plots above that the chains have converged just fine.\n\n\n\n\nsamples = fitted.posterior\n\nIt turns out that we can convert each regression coefficient into a partial correlation by multiplying it by a constant that depends on (1) the SD of the predictor, (2) the SD of the outcome, and (3) the degree of multicollinearity with the set of other predictors. Two of these statistics are actually already computed and stored in the fitted model object, in a dictionary called dm_statistics (for design matrix statistics), because they are used internally. We will compute the others manually.\nSome information about the relationship between linear regression parameters and partial correlation can be found here.\n\n# the names of the predictors\nvarnames = ['o', 'c', 'e', 'a', 'n']\n\n# compute the needed statistics like R-squared when each predictor is response and all the \n# other predictors are the predictor\n\n# x_matrix = common effects design matrix (excluding intercept/constant term)\nterms = [t for t in model.response_component.common_terms.values() if t.name != \"Intercept\"]\nx_matrix = [pd.DataFrame(x.data, columns=x.levels) for x in terms]\nx_matrix = pd.concat(x_matrix, axis=1)\nx_matrix.columns = varnames\n\ndm_statistics = {\n 'r2_x': pd.Series(\n {\n x: sm.OLS(\n endog=x_matrix[x],\n exog=sm.add_constant(x_matrix.drop(x, axis=1))\n if \"Intercept\" in model.response_component.terms\n else x_matrix.drop(x, axis=1),\n )\n .fit()\n .rsquared\n for x in list(x_matrix.columns)\n }\n ),\n 'sigma_x': x_matrix.std(),\n 'mean_x': x_matrix.mean(axis=0),\n}\n\nr2_x = dm_statistics['r2_x']\nsd_x = dm_statistics['sigma_x']\nr2_y = pd.Series([sm.OLS(endog=data['drugs'],\n exog=sm.add_constant(data[[p for p in varnames if p != x]])).fit().rsquared\n for x in varnames], index=varnames)\nsd_y = data['drugs'].std()\n\n# compute the products to multiply each slope with to produce the partial correlations\nslope_constant = (sd_x[varnames] / sd_y) * ((1 - r2_x[varnames]) / (1 - r2_y)) ** 0.5\nslope_constant\n\no 32.392557\nc 27.674284\ne 30.305117\na 26.113299\nn 34.130431\ndtype: float64\n\n\nNow we just multiply each sampled regression coefficient by its corresponding slope_constant to transform it into a sample partial correlation coefficient.\n\npcorr_samples = (samples[varnames] * slope_constant)\n\nAnd voilà! 
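In symbols, the conversion we have just applied is\n\\[ r_x = \\beta_x \\cdot \\frac{\\text{SD}(x)}{\\text{SD}(y)} \\cdot \\sqrt{\\frac{1 - R^2_x}{1 - R^2_y}} \\]\nwhere \\(R^2_x\\) is the R-squared from regressing the predictor \\(x\\) on the remaining predictors and \\(R^2_y\\) is the R-squared from regressing the outcome on all predictors except \\(x\\); the product of the last two factors is exactly the quantity stored in slope_constant above.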
We now have a joint posterior distribution for the partial correlation coefficients. Let's plot the marginal posterior distributions:\n\n# Pass the same axes to az.plot_kde to have all the densities in the same plot\n_, ax = plt.subplots()\nfor idx, (k, v) in enumerate(pcorr_samples.items()):\n az.plot_dist(v, label=k, plot_kwargs={'color':f'C{idx}'}, ax=ax)\nax.axvline(x=0, color='k', linestyle='--');\n\n\n\n\nThe means of these distributions serve as good point estimates of the partial correlations:\n\npcorr_samples.mean()\n\nDimensions: ()\nData variables:\n o float64 0.1973\n c float64 -0.105\n e float64 0.1016\n a float64 -0.324\n n float64 -0.0513\n\n\nIf we just take the square of the partial correlation coefficients, it's easy to get posteriors on that scale too:\n\n_, ax = plt.subplots()\nfor idx, (k, v) in enumerate(pcorr_samples.items()):\n az.plot_dist(v ** 2, label=k, plot_kwargs={'color':f'C{idx}'}, ax=ax)\nax.set_ylim(0, 80);\n\n\n\n\nWith these posteriors we can ask: What is the probability that the squared partial correlation for Openness (blue) is greater than the squared partial correlation for Conscientiousness (orange)?\n\n(pcorr_samples['o'] ** 2 > pcorr_samples['c'] ** 2).mean().item()\n\n0.9365\n\n\nIf we contrast this result with the plot we've just shown, we may think the probability is too high when looking at the overlap between the blue and orange curves. However, the previous plot is only showing marginal posteriors, which don't account for correlations between the coefficients. In our Bayesian world, our model parameters are random variables (and consequently, any combination of them is too). As such, the squared partial correlations have a joint distribution. When computing probabilities involving at least two of these parameters, one has to use the joint distribution. Otherwise, if we choose to work only with marginals, we are implicitly assuming independence.\nLet's check the joint distribution of the squared partial correlations for Openness and Conscientiousness. We highlight in blue the draws where the coefficient for Openness is greater than the coefficient for Conscientiousness.\n\nsq_partial_c = pcorr_samples['c'] ** 2\nsq_partial_o = pcorr_samples['o'] ** 2\n\n\ncolors = np.where(sq_partial_c > sq_partial_o, \"C1\", \"C0\").flatten().tolist()\n\nplt.scatter(sq_partial_o, sq_partial_c, c=colors)\nplt.xlabel(\"Openness to experience\")\nplt.ylabel(\"Conscientiousness\");\n\n\n\n\nWe can see that in the great majority of the draws (about 94%) the squared partial correlation for Openness is greater than the one for Conscientiousness. In fact, we can check that the correlation between them is\n\nxr.corr(sq_partial_c, sq_partial_o).item()\n\n-0.19487146395840146\n\n\nwhich explains why only looking at the marginal posteriors (i.e. 
assuming independence) is not the best approach here.\nFor each predictor, what is the probability that it has the largest squared partial correlation?\n\npc_df = pcorr_samples.to_dataframe()\n(pc_df**2).idxmax(axis=1).value_counts() / len(pc_df.index)\n\na 0.989\no 0.011\ndtype: float64\n\n\nAgreeableness is clearly the strongest predictor of drug use among the Big Five personality traits in terms of partial correlation, but it's still not a particularly strong predictor in an absolute sense. Walter Mischel famously claimed that it is rare to see correlations between personality measures and relevant behavioral outcomes exceed 0.3. In this case, the probability that the agreeableness partial correlation exceeds 0.3 is:\n\n(np.abs(pcorr_samples['a']) > 0.3).mean().item()\n\n0.7515\n\n\n\n\n\nOnce we have computed the posterior distribution, we can use it to compute the posterior predictive distribution. As the name implies, these are predictions that assume the model's parameters are distributed according to the posterior. Thus, the posterior predictive includes the uncertainty about the parameters.\nWith bambi we can use the model's predict() method with the fitted az.InferenceData to generate posterior predictive samples, which are then automatically added to the az.InferenceData object.\n\nposterior_predictive = model.predict(fitted, kind=\"pps\")\nfitted\n\n(arviz.InferenceData repr with the groups posterior, posterior_predictive, sample_stats and observed_data; the posterior now also holds the deterministic drugs_mean, and the new posterior_predictive group holds draws of drugs with dimensions chain: 2, draw: 2000, drugs_obs: 604.)
,\n 2.14285714, 3.85714286, 1.64285714, 3. , 2.64285714,\n 1.71428571, 2.78571429, 1.85714286, 3.14285714, 2.42857143,\n 1.57142857, 1.5 , 2.5 , 3.35714286])Indexes: (1)drugs_obsPandasIndexPandasIndex(Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 594, 595, 596, 597, 598, 599, 600, 601, 602, 603],\n dtype='int64', name='drugs_obs', length=604))Attributes: (6)created_at :2023-01-05T13:59:47.853402arviz_version :0.14.0inference_library :pymcinference_library_version :5.0.1modeling_interface :bambimodeling_interface_version :0.9.3\n \n \n \n \n \n \n \n\n\nOne use of the posterior predictive is as a diagnostic tool, shown below using az.plot_ppc().The blue lines represent the posterior predictive distribution estimates, and the black line represents the observed data. Our posterior predictions seems perform an adequately good job representing the observed data in all regions except near the value of 1, where the observed data and posterior estimates diverge.\n\naz.plot_ppc(fitted);\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n func(*args, **kwargs)\n\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\npandas : 1.5.2\narviz : 0.14.0\nstatsmodels: 0.13.2\nmatplotlib : 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nbambi : 0.9.3\nnumpy : 1.23.5\nxarray : 2022.11.0\n\nWatermark: 2.3.1" }, { - "objectID": "notebooks/model_comparison.html", - "href": "notebooks/model_comparison.html", + "objectID": "notebooks/mister_p.html", + "href": "notebooks/mister_p.html", "title": "Bambi", "section": "", - "text": "The adults dataset is comprised of census data from 1994 in United States.\nThe goal is to use demographic variables to predict whether an individual makes more than $50,000 per year.\nThe following is a description of the variables in the dataset.\n\nage: Individual’s age\nworkclass: Labor class.\nfnlwgt: It is not specified, but we guess it is a final sampling weight.\neducation: Education level as a categorical variable.\neducational_num: Education level as numerical variable. It does not reflect years of education.\nmarital_status: Marital status.\noccupation: Occupation.\nrelationship: Relationship with the head of household.\nrace: Individual’s race.\nsex: Individual’s sex.\ncapital_gain: Capital gain during unspecified period of time.\ncapital_loss: Capital loss during unspecified period of time.\nhs_week: Hours of work per week.\nnative_country: Country of birth.\nincome: Income as a binary variable (either below or above 50K per year).\n\nWe are only using the following variables in this example: income, sex, race, age, and hs_week. 
This subset is comprised of both categorical and numerical variables which allows us to visualize how to incorporate both types in a logistic regression model while helping to keep the analysis simpler.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport matplotlib.lines as mlines\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nimport warnings\n\nfrom scipy.special import expit as invlogit\n\n\n# Disable a FutureWarning in ArviZ at the moment of running the notebook\naz.style.use(\"arviz-darkgrid\")\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\n\ndata = bmb.load_data(\"adults\")\n\n\ndata.info()\ndata.head()\n\n\nRangeIndex: 32561 entries, 0 to 32560\nData columns (total 5 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 income 32561 non-null object\n 1 sex 32561 non-null object\n 2 race 32561 non-null object\n 3 age 32561 non-null int64 \n 4 hs_week 32561 non-null int64 \ndtypes: int64(2), object(3)\nmemory usage: 1.2+ MB\n\n\n\n\n\n\n \n \n \n income\n sex\n race\n age\n hs_week\n \n \n \n \n 0\n <=50K\n Male\n White\n 39\n 40\n \n \n 1\n <=50K\n Male\n White\n 50\n 13\n \n \n 2\n <=50K\n Male\n White\n 38\n 40\n \n \n 3\n <=50K\n Male\n Black\n 53\n 40\n \n \n 4\n <=50K\n Female\n Black\n 28\n 40\n \n \n\n\n\n\nCategorical variables are presented as from type object. In this step we convert them to category.\n\ncategorical_cols = data.columns[data.dtypes == object].tolist()\nfor col in categorical_cols:\n data[col] = data[col].astype(\"category\")\ndata.info()\n\n\nRangeIndex: 32561 entries, 0 to 32560\nData columns (total 5 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 income 32561 non-null category\n 1 sex 32561 non-null category\n 2 race 32561 non-null category\n 3 age 32561 non-null int64 \n 4 hs_week 32561 non-null int64 \ndtypes: category(3), int64(2)\nmemory usage: 604.7 KB\n\n\nInstead of going straight to fitting models, we’re going to do a some exploratory analysis of the variables in the dataset. First we have some plots, and then some conclusions about the information in the plots.\n\n# Just a utilitary function to truncate labels and avoid overlapping in plots\ndef truncate_labels(ticklabels, width=8):\n def truncate(label, width):\n if len(label) > width - 3:\n return label[0 : (width - 4)] + \"...\"\n else:\n return label\n\n labels = [x.get_text() for x in ticklabels]\n labels = [truncate(lbl, width) for lbl in labels]\n\n return labels\n\n\nfig, axes = plt.subplots(3, 2, figsize=(12, 15))\nsns.countplot(x=\"income\", color=\"C0\", data=data, ax=axes[0, 0], saturation=1)\nsns.countplot(x=\"sex\", color=\"C0\", data=data, ax=axes[0, 1], saturation=1);\nsns.countplot(x=\"race\", color=\"C0\", data=data, ax=axes[1, 0], saturation=1);\naxes[1, 0].set_xticklabels(truncate_labels(axes[1, 0].get_xticklabels()))\naxes[1, 1].hist(data[\"age\"], bins=20);\naxes[1, 1].set_xlabel(\"Age\")\naxes[1, 1].set_ylabel(\"Count\")\naxes[2, 0].hist(data[\"hs_week\"], bins=20);\naxes[2, 0].set_xlabel(\"Hours of work / week\")\naxes[2, 0].set_ylabel(\"Count\")\naxes[2, 1].axis('off');\n\n\n\n\nHighlights\n\nApproximately 25% of the people make more than 50K a year.\nTwo thirds of the subjects are males.\nThe great majority of the subjects are white, only a minority are black and the other categories are very infrequent.\nThe distribution of age is skewed to the right, as one might expect.\nThe distribution of hours of work per week looks weird at first sight. 
But what is a typical workload per week? You got it, 40 hours :).\n\nWe only keep the races black and white to simplify the analysis. The other categories don’t appear very often in our data.\nNow, we see the distribution of income for the different levels of our explanatory variables. Numerical variables are binned to make the analysis possible.\n\ndata = data[data[\"race\"].isin([\"Black\", \"White\"])]\ndata[\"race\"] = data[\"race\"].cat.remove_unused_categories()\nage_bins = [17, 25, 35, 45, 65, 90]\ndata[\"age_binned\"] = pd.cut(data[\"age\"], age_bins)\nhours_bins = [0, 20, 40, 60, 100]\ndata[\"hs_week_binned\"] = pd.cut(data[\"hs_week\"], hours_bins)\n\n\nfig, axes = plt.subplots(3, 2, figsize=(12, 15))\nsns.countplot(x=\"income\", color=\"C0\", data=data, ax=axes[0, 0])\nsns.countplot(x=\"sex\", hue=\"income\", data=data, ax=axes[0, 1])\nsns.countplot(x=\"race\", hue=\"income\", data=data, ax=axes[1, 0])\nsns.countplot(x=\"age_binned\", hue=\"income\", data=data, ax=axes[1, 1])\nsns.countplot(x=\"hs_week_binned\", hue=\"income\", data=data, ax=axes[2, 0])\naxes[2, 1].axis(\"off\");\n\n\n\n\nSome quick and gross info from the plots\n\nThe probability of making more than \\$50k a year is larger if you are a Male.\nA person also has more probability of making more than \\$50k/yr if she/he is White.\nFor age, we see the probability of making more than \\$50k a year increases as the variable increases, up to a point where it starts to decrease.\nAlso, the more hours a person works per week, the higher the chance of making more than \\$50k/yr. There’s a big jump in that probability when the hours of work per week jump from the (20, 40] bin to the (40, 60] one.\n\nSome data preparation before fitting our model. Here we standardize numerical variables age and hs_week because it may help sampler convergence. Also, we compute their second and third power. 
These powers will be sequantialy added to the model.\n\nage_mean = np.mean(data[\"age\"])\nage_std = np.std(data[\"age\"])\nhs_mean = np.mean(data[\"hs_week\"])\nhs_std = np.std(data[\"hs_week\"])\n\ndata[\"age\"] = (data[\"age\"] - age_mean) / age_std\ndata[\"age2\"] = data[\"age\"] ** 2\ndata[\"age3\"] = data[\"age\"] ** 3\ndata[\"hs_week\"] = (data[\"hs_week\"] - hs_mean) / hs_std\ndata[\"hs_week2\"] = data[\"hs_week\"] ** 2\ndata[\"hs_week3\"] = data[\"hs_week\"] ** 3\n\ndata = data.drop(columns=[\"age_binned\", \"hs_week_binned\"])\n\nThis is how our data looks like before fitting the models.\n\ndata.head()\n\n\n\n\n\n \n \n \n income\n sex\n race\n age\n hs_week\n age2\n age3\n hs_week2\n hs_week3\n \n \n \n \n 0\n <=50K\n Male\n White\n 0.024207\n -0.037250\n 0.000586\n 0.000014\n 0.001388\n -0.000052\n \n \n 1\n <=50K\n Male\n White\n 0.827984\n -2.222326\n 0.685557\n 0.567630\n 4.938734\n -10.975479\n \n \n 2\n <=50K\n Male\n White\n -0.048863\n -0.037250\n 0.002388\n -0.000117\n 0.001388\n -0.000052\n \n \n 3\n <=50K\n Male\n Black\n 1.047195\n -0.037250\n 1.096618\n 1.148374\n 0.001388\n -0.000052\n \n \n 4\n <=50K\n Female\n Black\n -0.779569\n -0.037250\n 0.607728\n -0.473766\n 0.001388\n -0.000052\n \n \n\n\n\n\n\n\n\nWe will use a logistic regression model to estimate the probability of making more than \\$50K as a function of age, hours of work per week, sex, race and education level.\nIf we have a binary response variable \\(Y\\) and a set of predictors or explanatory variables \\(X_1, X_2, \\cdots, X_p\\) the logistic regression model can be defined as follows:\n\\[\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\cdots + \\beta_p X_p\\]\nwhere \\(\\pi = P(Y = 1)\\) (a.k.a. probability of success) and \\(\\beta_0, \\beta_1, \\cdots \\beta_p\\) are unknown parameters. The term on the left side is the logarithm of the odds ratio or simply known as the log-odds. With little effort, the expression can be re-arranged to express our probability of interest, \\(\\pi\\), as a function of the betas and the predictors.\n\\[\n\\pi = \\frac{e^{\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p}}{1 + e^{\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p}}\n = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 X_1 + \\cdots + \\beta_p X_p)}}\n\\]\nWe need to specify a prior and a likelihood in order to draw samples from the posterior distribution. We could use sociological knowledge about the effects of age and education on income, but instead, let’s use the default prior specification in Bambi.\nThe likelihood is the product of \\(n\\) Bernoulli trials, \\(\\prod_{i=1}^{n}{p_i^y(1-p_i)^{1-y_i}}\\) where \\(p_i = P(Y=1)\\).\nIn our case, we have\n\\[Y =\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person makes more than 50K per year} \\\\\n 0 & \\textrm{if the person makes less than 50K per year}\n \\end{array}\n\\right.\n\\]\n\\[\\pi = P(Y=1)\\]\nBut this is a Bambi example, right? 
Let’s see how Bambi can helps us to build a logistic regression model.\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3 + \\beta_4 X_4 \n\\]\nWhere:\n\\[\n\\begin{split}\nX_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\nX_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\nX_3 &=\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n\\right. \\\\\nX_4 &=\n\\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n\\right.\n\\end{split}\n\\]\n\nmodel1 = bmb.Model(\"income['>50K'] ~ sex + race + age + hs_week\", data, family=\"bernoulli\")\nfitted1 = model1.fit(draws=1000, idata_kwargs={\"log_likelihood\": True})\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sex, race, age, hs_week]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:20<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 21 seconds.\n\n\n\naz.plot_trace(fitted1);\naz.summary(fitted1)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.635\n 0.062\n -2.757\n -2.525\n 0.001\n 0.001\n 2457.0\n 1739.0\n 1.0\n \n \n sex[Male]\n 1.018\n 0.037\n 0.948\n 1.087\n 0.001\n 0.001\n 2141.0\n 1572.0\n 1.0\n \n \n race[White]\n 0.630\n 0.058\n 0.532\n 0.751\n 0.001\n 0.001\n 3060.0\n 1566.0\n 1.0\n \n \n age\n 0.578\n 0.015\n 0.554\n 0.608\n 0.000\n 0.000\n 1837.0\n 1281.0\n 1.0\n \n \n hs_week\n 0.504\n 0.015\n 0.477\n 0.533\n 0.000\n 0.000\n 2047.0\n 1568.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_1^2 + \\beta_3 X_2 + \\beta_4 X_2^2\n + \\beta_5 X_3 + \\beta_6 X_4\n\\]\nWhere:\n$$\n\\[\\begin{aligned}\n X_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\n X_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n \\right. 
\\\\\n\n X_4 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n \\right.\n\\end{aligned}\\]\n$$\n\nmodel2 = bmb.Model(\"income['>50K'] ~ sex + race + age + age2 + hs_week + hs_week2\", data, family=\"bernoulli\")\nfitted2 = model2.fit(idata_kwargs={\"log_likelihood\": True})\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sex, race, age, age2, hs_week, hs_week2]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:29<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 30 seconds.\n\n\n\naz.plot_trace(fitted2);\naz.summary(fitted2)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.282\n 0.065\n -2.406\n -2.166\n 0.001\n 0.001\n 2037.0\n 1330.0\n 1.0\n \n \n sex[Male]\n 1.006\n 0.038\n 0.939\n 1.074\n 0.001\n 0.001\n 2192.0\n 1628.0\n 1.0\n \n \n race[White]\n 0.702\n 0.061\n 0.590\n 0.818\n 0.001\n 0.001\n 2084.0\n 1343.0\n 1.0\n \n \n age\n 1.069\n 0.024\n 1.028\n 1.117\n 0.001\n 0.000\n 1720.0\n 1406.0\n 1.0\n \n \n age2\n -0.538\n 0.018\n -0.570\n -0.503\n 0.000\n 0.000\n 1730.0\n 1161.0\n 1.0\n \n \n hs_week\n 0.499\n 0.022\n 0.455\n 0.538\n 0.001\n 0.000\n 1665.0\n 1431.0\n 1.0\n \n \n hs_week2\n -0.088\n 0.009\n -0.103\n -0.072\n 0.000\n 0.000\n 1687.0\n 1577.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\n\\[\n\\log{\\left(\\frac{\\pi}{1 - \\pi}\\right)} = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_1^2 + \\beta_3 X_1^3 + \\beta_4 X_2\n + \\beta_5 X_2^2 + \\beta_6 X_2^3 + \\beta_7 X_3 + \\beta_8 X_4\n\\]\nWhere:\n\\[\n\\begin{aligned}\n X_1 &= \\displaystyle \\frac{\\text{Age} - \\text{Age}_{\\text{mean}}}{\\text{Age}_{\\text{std}}} \\\\\n X_2 &= \\displaystyle \\frac{\\text{Hours\\_week} - \\text{Hours\\_week}_{\\text{mean}}}{\\text{Hours\\_week}_{\\text{std}}} \\\\\n X_3 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is male} \\\\\n 0 & \\textrm{if the person is female}\n \\end{array}\n \\right. 
\\\\\n X_4 &=\n \\left\\{\n \\begin{array}{ll}\n 1 & \\textrm{if the person is white} \\\\\n 0 & \\textrm{if the person is black}\n \\end{array}\n \\right.\n\\end{aligned}\n\\]\n\nmodel3 = bmb.Model(\n \"income['>50K'] ~ age + age2 + age3 + hs_week + hs_week2 + hs_week3 + sex + race\",\n data,\n family=\"bernoulli\"\n)\nfitted3 = model3.fit(\n draws=1000, random_seed=1234, target_accept=0.9, idata_kwargs={\"log_likelihood\": True}\n)\n\nModeling the probability that income==>50K\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, age, age2, age3, hs_week, hs_week2, hs_week3, sex, race]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 01:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 75 seconds.\n\n\n\naz.plot_trace(fitted3);\naz.summary(fitted3)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -2.145\n 0.064\n -2.270\n -2.028\n 0.001\n 0.001\n 3201.0\n 1540.0\n 1.0\n \n \n age\n 0.963\n 0.026\n 0.913\n 1.009\n 0.001\n 0.000\n 2243.0\n 1290.0\n 1.0\n \n \n age2\n -0.894\n 0.030\n -0.946\n -0.836\n 0.001\n 0.001\n 1541.0\n 1229.0\n 1.0\n \n \n age3\n 0.175\n 0.011\n 0.153\n 0.194\n 0.000\n 0.000\n 1653.0\n 1506.0\n 1.0\n \n \n hs_week\n 0.612\n 0.025\n 0.567\n 0.661\n 0.001\n 0.000\n 2381.0\n 1300.0\n 1.0\n \n \n hs_week2\n -0.010\n 0.010\n -0.030\n 0.010\n 0.000\n 0.000\n 2299.0\n 1590.0\n 1.0\n \n \n hs_week3\n -0.035\n 0.004\n -0.042\n -0.028\n 0.000\n 0.000\n 1815.0\n 1572.0\n 1.0\n \n \n sex[Male]\n 0.985\n 0.038\n 0.918\n 1.059\n 0.001\n 0.001\n 2737.0\n 1549.0\n 1.0\n \n \n race[White]\n 0.681\n 0.060\n 0.573\n 0.798\n 0.001\n 0.001\n 3044.0\n 1514.0\n 1.0\n \n \n\n\n\n\n\n\n\n\n\n\nWe can perform a Bayesian model comparison very easily with az.compare(). Here we pass a dictionary with the InferenceData objects that Model.fit() returned and az.compare() returns a data frame that is ordered from best to worst according to the criteria used. By default, ArviZ uses loo, which is an estimation of leave one out cross validation. Another option is the widely applicable information criterion (WAIC). For more information about the information criteria available and other options within the function see the docs.\n\nmodels_dict = {\n \"model1\": fitted1,\n \"model2\": fitted2,\n \"model3\": fitted3\n}\ndf_compare = az.compare(models_dict)\ndf_compare\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n model3\n 0\n -13987.197673\n 9.716205\n 0.000000\n 1.000000e+00\n 89.279906\n 0.000000\n False\n log\n \n \n model2\n 1\n -14155.112761\n 8.147063\n 167.915088\n 3.048565e-12\n 91.305227\n 19.879825\n False\n log\n \n \n model1\n 2\n -14915.862090\n 4.871886\n 928.664417\n 0.000000e+00\n 91.010624\n 38.923423\n False\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nThere is a difference in the point estimations (empty circles) between the model with cubic terms (model 3) and the model with quadratic terms (model 2) but there is some overlap between their interval estimations. This time, we are going to select model 2 and do some extra little work with it because from previous experience with this dataset we know there is no substantial difference between them, and model 2 is simpler. 
However, as we mention in the final remarks, this is not the best you can achieve with this dataset. If you want, you could also try to add other predictors, such as education level and see how it impacts in the model comparison :).\n\n\n\nIn this section we plot age vs the probability of making more than 50K a year given different profiles.\nWe set hours of work per week at 40 hours and assign a grid from 18 to 75 age. They’re standardized because they were standardized when we fitted the model.\nHere we use az.plot_hdi() to get Highest Density Interval plots. We get two bands for each profile. One corresponds to an hdi probability of 0.94 (the default) and the other to an hdi probability of 0.5.\n\nHS_WEEK = (40 - hs_mean) / hs_std\nAGE = (np.linspace(18, 75) - age_mean) / age_std\n\nfig, ax = plt.subplots()\nhandles = []\ni = 0\n\nfor race in [\"Black\", \"White\"]:\n for sex in [\"Female\", \"Male\"]: \n color = f\"C{i}\"\n label = f\"{race} - {sex}\"\n handles.append(mlines.Line2D([], [], color=color, label=label, lw=3))\n \n new_data = pd.DataFrame({\n \"sex\": [sex] * len(AGE),\n \"race\": [race] * len(AGE), \n \"age\": AGE,\n \"age2\": AGE ** 2,\n \"hs_week\": [HS_WEEK] * len(AGE),\n \"hs_week2\": [HS_WEEK ** 2] * len(AGE),\n })\n new_idata = model2.predict(fitted2, data=new_data, inplace=False)\n mean = new_idata.posterior[\"income_mean\"].values\n\n az.plot_hdi(AGE * age_std + age_mean, mean, ax=ax, color=color)\n az.plot_hdi(AGE * age_std + age_mean, mean, ax=ax, color=color, hdi_prob=0.5)\n i += 1\n\nax.set_xlabel(\"Age\")\nax.set_ylabel(\"P(Income > $50K)\")\nax.legend(handles=handles, loc=\"upper left\");\n\n\n\n\nThe highest posterior density bands show how the probability of earning more than 50K changes with age for a given profile. In all the cases, we see the probability of making more than $50K increases with age until approximately age 52, when the probability begins to drop off. We can interpret narrow portions of a curve as places where we have low uncertainty and spread out portions of the bands as places where we have somewhat higher uncertainty about our coefficient values.\n\n\nIn this notebook we’ve seen how easy it is to incorporate ArviZ into a Bambi workflow to perform model comparison based on information criteria such as LOO and WAIC. However, an attentive reader might have seen that the highest density interval plot never shows a predicted probability greater than 0.5 (which is not good if we expect to predict that at least some people working 40hrs/wk make more than \\$50k/yr). You can increase the hours of work per week for the profiles we’ve used and the HDIs will show larger values. But we won’t be seeing the whole picture.\nAlthough we’re using some demographic variables such as sex and race, the cells resulting from the combinations of their levels are still very heterogeneous. For example, we are mixing individuals of all educational levels. A possible next step is to incorporate education into the different models we compared. If any of the readers (yes, you!) is interested in doing so, here there are some notes that may help\n\nEducation is an ordinal categorical variable with a lot of levels.\n\nExplore the conditional distribution of income given education levels.\nSee what are the counts/proportions of people within each education level.\nCollapse categories (but respect the ordinality!). Try to end up with 5 or less categories if possible.\n\nStart with a model with only age, sex, race, hs_week and education. 
Then incorporate higher order terms (second and third powers, for example). Don't go beyond fourth powers.
Look for a nice activity to do while the sampler does its job.
We know it's going to take a couple of hours to fit all those models :)

And finally, please feel free to open a new issue if you think there's something that we can improve.

%load_ext watermark
%watermark -n -u -v -iv -w

Last updated: Thu Jan 05 2023

Python implementation: CPython
Python version : 3.10.4
IPython version : 8.5.0

bambi : 0.9.3
numpy : 1.23.5
matplotlib: 3.6.2
arviz : 0.14.0
seaborn : 0.12.2
pandas : 1.5.2
sys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]

Watermark: 2.3.1" + "text": "“What are we even doing when we fit a regression model?” is a question that arises when first learning the tools of the trade, and again when debugging the strange results of your thousandth logistic regression model.
This notebook is intended to showcase how regression can be seen as a method for automating the calculation of stratum-specific conditional effects. Additionally, we'll see how we can enrich regression models with a post-stratification adjustment when we know the appropriate stratum-specific weights. This technique of multilevel regression and post-stratification (MrP) is often used in the context of national surveys, where we have knowledge of the population weights appropriate to different demographic groups. It can be used in a wide variety of areas ranging from political polling to online market research. We will demonstrate how to fit and assess these models using Bambi.

import warnings

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

warnings.simplefilter(action="ignore", category=FutureWarning)



First consider this example of heart transplant patients adapted from Hernan and Robins' excellent book Causal Inference: What If. Here we have a number of patients (anonymised with the names of Greek gods). The data records the outcomes of a heart transplant program for those who were part of the program and those who were not. We also see the different risk levels of each patient alongside the assigned treatment.
What we want to show here is that a regression model fit to this data automatically accounts for the weighting appropriate to the different risk strata. The data is coded with 0-1 indicators for each column: Risk_Strata is either 1 for higher risk or 0 for lower risk. 
Outcome is whether or not the patient died from the procedure, and Treatment is whether or not the patient received treatment.\n\ndf = pd.DataFrame(\n {\n \"name\": [\n \"Rheia\",\n \"Kronos\",\n \"Demeter\",\n \"Hades\",\n \"Hestia\",\n \"Poseidon\",\n \"Hera\",\n \"Zeus\",\n \"Artemis\",\n \"Apollo\",\n \"Leto\",\n \"Ares\",\n \"Athena\",\n \"Hephaestus\",\n \"Aphrodite\",\n \"Cyclope\",\n \"Persephone\",\n \"Hermes\",\n \"Hebe\",\n \"Dionysus\",\n ],\n \"Risk_Strata\": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n \"Treatment\": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n \"Outcome\": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],\n }\n)\n\ndf[\"Treatment_x_Risk_Strata\"] = df.Treatment * df.Risk_Strata\n\ndf\n\n\n\n\n\n \n \n \n name\n Risk_Strata\n Treatment\n Outcome\n Treatment_x_Risk_Strata\n \n \n \n \n 0\n Rheia\n 0\n 0\n 0\n 0\n \n \n 1\n Kronos\n 0\n 0\n 1\n 0\n \n \n 2\n Demeter\n 0\n 0\n 0\n 0\n \n \n 3\n Hades\n 0\n 0\n 0\n 0\n \n \n 4\n Hestia\n 0\n 1\n 0\n 0\n \n \n 5\n Poseidon\n 0\n 1\n 0\n 0\n \n \n 6\n Hera\n 0\n 1\n 0\n 0\n \n \n 7\n Zeus\n 0\n 1\n 1\n 0\n \n \n 8\n Artemis\n 1\n 0\n 1\n 0\n \n \n 9\n Apollo\n 1\n 0\n 1\n 0\n \n \n 10\n Leto\n 1\n 0\n 0\n 0\n \n \n 11\n Ares\n 1\n 1\n 1\n 1\n \n \n 12\n Athena\n 1\n 1\n 1\n 1\n \n \n 13\n Hephaestus\n 1\n 1\n 1\n 1\n \n \n 14\n Aphrodite\n 1\n 1\n 1\n 1\n \n \n 15\n Cyclope\n 1\n 1\n 1\n 1\n \n \n 16\n Persephone\n 1\n 1\n 1\n 1\n \n \n 17\n Hermes\n 1\n 1\n 0\n 1\n \n \n 18\n Hebe\n 1\n 1\n 0\n 1\n \n \n 19\n Dionysus\n 1\n 1\n 0\n 1\n \n \n\n\n\n\nIf the treatment assignment procedure involved complete randomisation then we might expect a reasonable balance of strata effects across the treated and non-treated. In this sample we see (perhaps counter intuitively) that the treatment seems to induce a higher rate of death than the non-treated group.\n\nsimple_average = df.groupby(\"Treatment\")[[\"Outcome\"]].mean().rename({\"Outcome\": \"Share\"}, axis=1)\nsimple_average\n\n\n\n\n\n \n \n \n Share\n \n \n Treatment\n \n \n \n \n \n 0\n 0.428571\n \n \n 1\n 0.538462\n \n \n\n\n\n\nWhich suggests an alarming causal effect whereby the treatment seems to increase risk of death in the population.\n\ncausal_risk_ratio = simple_average.iloc[1][\"Share\"] / simple_average.iloc[0][\"Share\"]\nprint(\"Causal Risk Ratio:\", causal_risk_ratio)\n\nCausal Risk Ratio: 1.2564102564102564\n\n\nThis finding we know on inspection is driven by the imbalance in the risk strata across the treatment groups.\n\ndf.groupby(\"Risk_Strata\")[[\"Treatment\"]].count().assign(\n proportion=lambda x: x[\"Treatment\"] / len(df)\n)\n\n\n\n\n\n \n \n \n Treatment\n proportion\n \n \n Risk_Strata\n \n \n \n \n \n \n 0\n 8\n 0.4\n \n \n 1\n 12\n 0.6\n \n \n\n\n\n\nWe can correct for this by weighting the results by the share each group represents across the Risk_Strata. In other words when we correct for the population size at the different levels of risk we get a better estimate of the effect. First we see what the expected outcome is for each strata.\n\noutcomes_controlled = (\n df.groupby([\"Risk_Strata\", \"Treatment\"])[[\"Outcome\"]]\n .mean()\n .reset_index()\n .pivot(index=\"Treatment\", columns=[\"Risk_Strata\"], values=\"Outcome\")\n)\n\noutcomes_controlled\n\n\n\n\n\n \n \n Risk_Strata\n 0\n 1\n \n \n Treatment\n \n \n \n \n \n \n 0\n 0.25\n 0.666667\n \n \n 1\n 0.25\n 0.666667\n \n \n\n\n\n\nNote how the expected outcomes are equal across the stratified groups. 
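Written out, the adjustment we are about to apply is just a weighted average of the stratum-specific outcome rates, using each stratum's share of the population as the weight (here \(\hat{E}\) denotes the sample means shown in the tables above):
\[
\hat{E}[\text{Outcome} \mid \text{Treatment} = t] = \sum_{s} P(\text{Risk\_Strata} = s)\, \hat{E}[\text{Outcome} \mid \text{Treatment} = t, \text{Risk\_Strata} = s] = 0.4 \times 0.25 + 0.6 \times \tfrac{2}{3} = 0.5
\]
for both \(t = 0\) and \(t = 1\), which is why the causal risk ratio computed below comes out to 1.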
We can now combine these estimate with the population weights (derived earlier) in each segment to get our weighted average.\n\nweighted_avg = outcomes_controlled.assign(formula=\"0.4*0.25 + 0.6*0.66\").assign(\n weighted_average=lambda x: x[0] * (df[df[\"Risk_Strata\"] == 0].shape[0] / len(df))\n + x[1] * (df[df[\"Risk_Strata\"] == 1].shape[0] / len(df))\n)\n\nweighted_avg\n\n\n\n\n\n \n \n Risk_Strata\n 0\n 1\n formula\n weighted_average\n \n \n Treatment\n \n \n \n \n \n \n \n \n 0\n 0.25\n 0.666667\n 0.4*0.25 + 0.6*0.66\n 0.5\n \n \n 1\n 0.25\n 0.666667\n 0.4*0.25 + 0.6*0.66\n 0.5\n \n \n\n\n\n\nFrom which we can derive a more sensible treatment effect.\n\ncausal_risk_ratio = (\n weighted_avg.iloc[1][\"weighted_average\"] / weighted_avg.iloc[0][\"weighted_average\"]\n)\n\nprint(\"Causal Risk Ratio:\", causal_risk_ratio)\n\nCausal Risk Ratio: 1.0\n\n\n\n\n\nSo far, so good. But so what?\nThe point here is that the above series of steps can be difficult to accomplish with more complex sets of groups and risk profiles. So it’s useful to understand that regression can be used to automatically account for the variation in outcome effects across the different strata of our population. More prosaically, the example shows that it really matters what variables you put in your model.\n\nreg = bmb.Model(\"Outcome ~ 1 + Treatment\", df)\nresults = reg.fit()\n\nreg_strata = bmb.Model(\"Outcome ~ 1 + Treatment + Risk_Strata + Treatment_x_Risk_Strata\", df)\nresults_strata = reg_strata.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Outcome_sigma, Intercept, Treatment]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Outcome_sigma, Intercept, Treatment, Risk_Strata, Treatment_x_Risk_Strata]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.\n\n\nWe can now inspect the treatment effect and the implied causal risk ratio in each model. 
We can quickly recover that controlling for the right variables in our regression model automatically adjusts the treatment effect downwards towards 0.\n\naz.summary(results)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.428\n 0.203\n 0.060\n 0.823\n 0.003\n 0.002\n 4840.0\n 2982.0\n 1.0\n \n \n Treatment\n 0.108\n 0.252\n -0.357\n 0.584\n 0.004\n 0.004\n 4258.0\n 2731.0\n 1.0\n \n \n Outcome_sigma\n 0.542\n 0.092\n 0.388\n 0.713\n 0.001\n 0.001\n 4073.0\n 2488.0\n 1.0\n \n \n\n\n\n\n\naz.summary(results_strata)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.254\n 0.261\n -0.233\n 0.743\n 0.005\n 0.004\n 2710.0\n 2648.0\n 1.0\n \n \n Treatment\n -0.001\n 0.367\n -0.653\n 0.730\n 0.008\n 0.006\n 2312.0\n 2648.0\n 1.0\n \n \n Risk_Strata\n 0.405\n 0.395\n -0.349\n 1.119\n 0.008\n 0.006\n 2274.0\n 2503.0\n 1.0\n \n \n Treatment_x_Risk_Strata\n 0.010\n 0.496\n -0.947\n 0.939\n 0.011\n 0.009\n 1986.0\n 2113.0\n 1.0\n \n \n Outcome_sigma\n 0.531\n 0.098\n 0.367\n 0.714\n 0.002\n 0.001\n 2389.0\n 2533.0\n 1.0\n \n \n\n\n\n\n\nax = az.plot_forest(\n [results, results_strata],\n model_names=[\"naive_model\", \"stratified_model\"],\n var_names=[\"Treatment\"],\n kind=\"ridgeplot\",\n ridgeplot_alpha=0.4,\n combined=True,\n figsize=(10, 6),\n)\nax[0].axvline(0, color=\"black\", linestyle=\"--\")\nax[0].set_title(\"Treatment Effects under Stratification/Non-stratification\");\n\n\n\n\nWe can even see this in the predicted outcomes for the model. This is an important step. The regression model automatically adjusts for the risk profile within the appropriate strata in the data “seen” by the model.\n\nnew_df = df[[\"Risk_Strata\"]].assign(Treatment=1).assign(Treatment_x_Risk_Strata=1)\nnew_preds = reg_strata.predict(results_strata, kind=\"pps\", data=new_df, inplace=False)\nprint(\"Expected Outcome in the Treated\")\nnew_preds[\"posterior_predictive\"][\"Outcome\"].mean().item()\n\nExpected Outcome in the Treated\n\n\n0.5068569705412103\n\n\n\nnew_df = df[[\"Risk_Strata\"]].assign(Treatment=0).assign(Treatment_x_Risk_Strata=0)\nnew_preds = reg_strata.predict(results_strata, kind=\"pps\", data=new_df, inplace=False)\nprint(\"Expected Outcome in the Untreated\")\n\nnew_preds[\"posterior_predictive\"][\"Outcome\"].mean().item()\n\nExpected Outcome in the Untreated\n\n\n0.49944292437387866\n\n\nWe can see these results more clearly using bambi model interpretation functions to see the predictions within a specific strata.\n\nfig, axs = plt.subplots(1, 2, figsize=(20, 6))\naxs = axs.flatten()\nbmb.interpret.plot_predictions(reg, results, covariates=[\"Treatment\"], ax=axs[0])\nbmb.interpret.plot_predictions(reg_strata, results_strata, covariates=[\"Treatment\"], ax=axs[1])\naxs[0].set_title(\"Non Stratified Regression \\n Model Predictions\")\naxs[1].set_title(\"Stratified Regression \\n Model Predictions\");\n\n\n\n\nHernan and Robins expand on these foundational observations and elaborate the implications for causal inference and the bias of confounding variables. We won’t go into these details, as we instead we want to draw out the connection with controlling for the risk of non-representative sampling. The usefulness of “representative-ness” as an idea is disputed in the statistical literature due to the vagueness of the term. 
To say a sample is representative is usually akin to saying that it was generated from a high-quality probability sampling design. This design is specified to avoid the creep of bias due to selection effects contaminating the results.
We've seen how regression can automate stratification across the levels of covariates in the model, conditional on the sample data. But what if the prevalence of the risk profile in your data does not reflect the prevalence of risk in the wider population? Then the regression model will automatically adjust to the prevalence in the sample, but it is not adjusting to the correct weights.


In the context of national survey design there is always a concern that the sample respondents may be more or less representative of the population across different key demographics, e.g. it's unlikely we would put much faith in the survey's accuracy if it had 90% male respondents on a question about the lived experience of women. Given that we can know beforehand that certain demographic splits are not reflective of the census data, we can use this information to appropriately re-weight the regressions fit to non-representative survey data.
We'll demonstrate the idea of multi-level regression and post-stratification adjustment by replicating some of the steps discussed in Martin, Phillips and Gelman's “Multilevel Regression and Poststratification Case Studies”.
They cite data from the Cooperative Congressional Election Study (Schaffner, Ansolabehere, and Luks (2018)), a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov. The outcome of interest is a binary question: Should employers decline coverage of abortions in insurance plans?

cces_all_df = pd.read_csv("data/mr_p_cces18_common_vv.csv.gz", low_memory=False)
cces_all_df.head()

   caseid     commonweight  commonpostweight  vvweight  vvweight_post  tookpost  CCEStake  birthyr  gender  educ  ...  CL_party  CL_2018gvm  CL_2018pep  CL_2018pvm  starttime           endtime             starttime_post      endtime_post        DMA    dmaname
0  123464282  0.940543      0.7936            0.740858  0.641412       2         1         1964     2       4     ...  11.0      1.0         NaN         NaN         04oct2018 02:47:10  09oct2018 04:16:31  11nov2018 00:41:13  11nov2018 01:21:53  512.0  BALTIMORE
1  170169205  0.769724      0.7388            0.425236  0.415134       2         1         1971     2       2     ...  13.0      NaN         6.0         2.0         02oct2018 06:55:22  02oct2018 07:32:51  12nov2018 00:49:50  12nov2018 01:08:43  531.0  "TRI-CITIES
2  175996005  1.491642      1.3105            1.700094  1.603264       2         1         1958     2       3     ...  13.0      5.0         NaN         NaN         07oct2018 00:48:23  07oct2018 01:38:41  12nov2018 21:49:41  12nov2018 22:19:28  564.0  CHARLESTON-HUNTINGTON
3  176818556  5.104709      4.6304            5.946729  5.658840       2         1         1946     2       6     ...  4.0       3.0         NaN         3.0         11oct2018 15:20:26  11oct2018 16:18:42  11nov2018 13:24:16  11nov2018 14:00:14  803.0  LOS ANGELES
4  202120533  0.466526      0.3745            0.412451  0.422327       2         1         1972     2       2     ...  3.0       5.0         NaN         NaN         08oct2018 02:31:28  08oct2018 03:03:48  15nov2018 01:04:16  15nov2018 01:57:21  529.0  LOUISVILLE

5 rows × 526 columns

To prepare the census data for modelling we need to break the demographic data into appropriate strata. We will break out these groupings along broad categories familiar to audiences of election coverage news. 
Even these steps amount to a significant choice where we use our knowledge of pertinent demographics to decide upon the key strata we wish to represent in our model, as we seek to better predict and understand the voting outcome.\n\nstates = [\n \"AL\",\n \"AK\",\n \"AZ\",\n \"AR\",\n \"CA\",\n \"CO\",\n \"CT\",\n \"DE\",\n \"FL\",\n \"GA\",\n \"HI\",\n \"ID\",\n \"IL\",\n \"IN\",\n \"IA\",\n \"KS\",\n \"KY\",\n \"LA\",\n \"ME\",\n \"MD\",\n \"MA\",\n \"MI\",\n \"MN\",\n \"MS\",\n \"MO\",\n \"MT\",\n \"NE\",\n \"NV\",\n \"NH\",\n \"NJ\",\n \"NM\",\n \"NY\",\n \"NC\",\n \"ND\",\n \"OH\",\n \"OK\",\n \"OR\",\n \"PA\",\n \"RI\",\n \"SC\",\n \"SD\",\n \"TN\",\n \"TX\",\n \"UT\",\n \"VT\",\n \"VA\",\n \"WA\",\n \"WV\",\n \"WI\",\n \"WY\",\n]\n\n\nnumbers = list(range(1, 56, 1))\n\nlkup_states = dict(zip(numbers, states))\nlkup_states\n\n\nethnicity = [\n \"White\",\n \"Black\",\n \"Hispanic\",\n \"Asian\",\n \"Native American\",\n \"Mixed\",\n \"Other\",\n \"Middle Eastern\",\n]\nnumbers = list(range(1, 9, 1))\nlkup_ethnicity = dict(zip(numbers, ethnicity))\nlkup_ethnicity\n\n\nedu = [\"No HS\", \"HS\", \"Some college\", \"Associates\", \"4-Year College\", \"Post-grad\"]\nnumbers = list(range(1, 7, 1))\nlkup_edu = dict(zip(numbers, edu))\n\n\ndef clean_df(df):\n ## 0 Oppose and 1 Support\n df[\"abortion\"] = np.abs(df[\"CC18_321d\"] - 2)\n df[\"state\"] = df[\"inputstate\"].map(lkup_states)\n ## dichotomous (coded as -0.5 Female, +0.5 Male)\n df[\"male\"] = np.abs(df[\"gender\"] - 2) - 0.5\n df[\"eth\"] = df[\"race\"].map(lkup_ethnicity)\n df[\"eth\"] = np.where(\n df[\"eth\"].isin([\"Asian\", \"Other\", \"Middle Eastern\", \"Mixed\", \"Native American\"]),\n \"Other\",\n df[\"eth\"],\n )\n df[\"age\"] = 2018 - df[\"birthyr\"]\n df[\"age\"] = pd.cut(\n df[\"age\"].astype(int),\n [0, 29, 39, 49, 59, 69, 120],\n labels=[\"18-29\", \"30-39\", \"40-49\", \"50-59\", \"60-69\", \"70+\"],\n ordered=True,\n )\n df[\"edu\"] = df[\"educ\"].map(lkup_edu)\n df[\"edu\"] = np.where(df[\"edu\"].isin([\"Some college\", \"Associates\"]), \"Some college\", df[\"edu\"])\n\n df = df[[\"abortion\", \"state\", \"eth\", \"male\", \"age\", \"edu\", \"caseid\"]]\n return df.dropna()\n\n\nstatelevel_predictors_df = pd.read_csv(\"data/mr_p_statelevel_predictors.csv\")\n\ncces_all_df = clean_df(cces_all_df)\ncces_all_df.head()\n\n\n\n\n\n \n \n \n abortion\n state\n eth\n male\n age\n edu\n caseid\n \n \n \n \n 0\n 1.0\n MS\n Other\n -0.5\n 50-59\n Some college\n 123464282\n \n \n 1\n 1.0\n WA\n White\n -0.5\n 40-49\n HS\n 170169205\n \n \n 2\n 1.0\n RI\n White\n -0.5\n 60-69\n Some college\n 175996005\n \n \n 3\n 0.0\n CO\n Other\n -0.5\n 70+\n Post-grad\n 176818556\n \n \n 4\n 1.0\n MA\n White\n -0.5\n 40-49\n HS\n 202120533\n \n \n\n\n\n\nWe will now show how estimates drawn from sample data (biased for whatever reasons of chance and circumstance) can be improved by using a post-stratification adjustment based on known facts about the size of the population in each strata considered in the model. This additional step is simply another modelling choice - another way to invest our model with information. In this manner the technique comes naturally in the Bayesian perspective.\n\n\n\nConsider a deliberately biased sample. 
Biased away from the census data and in this manner we show how to better recover population level estimates by incorporating details about the census population size across each of the stratum.\n\ncces_df = cces_all_df.merge(statelevel_predictors_df, left_on=\"state\", right_on=\"state\", how=\"left\")\ncces_df[\"weight\"] = (\n 5 * cces_df[\"repvote\"]\n + (cces_df[\"age\"] == \"18-29\") * 0.5\n + (cces_df[\"age\"] == \"30-39\") * 1\n + (cces_df[\"age\"] == \"40-49\") * 2\n + (cces_df[\"age\"] == \"50-59\") * 4\n + (cces_df[\"age\"] == \"60-69\") * 6\n + (cces_df[\"age\"] == \"70+\") * 8\n + (cces_df[\"male\"] == 1) * 20\n + (cces_df[\"eth\"] == \"White\") * 1.05\n)\n\ncces_df = cces_df.sample(5000, weights=\"weight\", random_state=1000)\ncces_df.head()\n\n\n\n\n\n \n \n \n abortion\n state\n eth\n male\n age\n edu\n caseid\n repvote\n region\n weight\n \n \n \n \n 35171\n 0.0\n KY\n White\n -0.5\n 60-69\n HS\n 415208636\n 0.656706\n South\n 10.333531\n \n \n 5167\n 0.0\n NM\n White\n 0.5\n 60-69\n No HS\n 412278020\n 0.453492\n West\n 9.317460\n \n \n 52365\n 0.0\n OK\n Hispanic\n -0.5\n 30-39\n 4-Year College\n 419467449\n 0.693047\n South\n 4.465237\n \n \n 23762\n 1.0\n WV\n White\n -0.5\n 50-59\n Post-grad\n 413757903\n 0.721611\n South\n 8.658053\n \n \n 48197\n 0.0\n RI\n White\n 0.5\n 50-59\n 4-Year College\n 417619385\n 0.416893\n Northeast\n 7.134465\n \n \n\n\n\n\n\n\n\nNow we plot the outcome of expected shares within each demographic bucket across both the biased sample and the census data.\n\nmosaic = \"\"\"\n ABCD\n EEEE\n \"\"\"\n\nfig = plt.figure(layout=\"constrained\", figsize=(20, 10))\nax_dict = fig.subplot_mosaic(mosaic)\n\n\ndef plot_var(var, ax):\n a = (\n cces_df.groupby(var, observed=False)[[\"abortion\"]]\n .mean()\n .rename({\"abortion\": \"share\"}, axis=1)\n .reset_index()\n )\n b = (\n cces_all_df.groupby(var, observed=False)[[\"abortion\"]]\n .mean()\n .rename({\"abortion\": \"share_census\"}, axis=1)\n .reset_index()\n )\n a = a.merge(b).sort_values(\"share\")\n ax_dict[ax].vlines(a[var], a.share, a.share_census)\n ax_dict[ax].scatter(a[var], a.share, color=\"blue\", label=\"Sample\")\n ax_dict[ax].scatter(a[var], a.share_census, color=\"red\", label=\"Census\")\n ax_dict[ax].set_ylabel(\"Proportion\")\n\n\nplot_var(\"age\", \"A\")\nplot_var(\"edu\", \"B\")\nplot_var(\"male\", \"C\")\nplot_var(\"eth\", \"D\")\nplot_var(\"state\", \"E\")\n\nax_dict[\"E\"].legend()\n\nax_dict[\"C\"].set_xticklabels([])\nax_dict[\"C\"].set_xlabel(\"Female / Male\")\nplt.suptitle(\"Comparison of Proportions: Survey Sample V Census\", fontsize=20);\n\n\n\n\nWe can see here how the proportions differ markedly across the census report and our biased sample in how they represent the preferential votes with each strata. We now try and quantify the overall differences between the biased sample and the census report. 
We calculate the expected proportions in each dataset and their standard error.\n\ndef get_se_bernoulli(p, n):\n return np.sqrt(p * (1 - p) / n)\n\n\nsample_cces_estimate = {\n \"mean\": np.mean(cces_df[\"abortion\"].astype(float)),\n \"se\": get_se_bernoulli(np.mean(cces_df[\"abortion\"].astype(float)), len(cces_df)),\n}\nsample_cces_estimate\n\n\nsample_cces_all_estimate = {\n \"mean\": np.mean(cces_all_df[\"abortion\"].astype(float)),\n \"se\": get_se_bernoulli(np.mean(cces_all_df[\"abortion\"].astype(float)), len(cces_all_df)),\n}\nsample_cces_all_estimate\n\nsummary = pd.DataFrame([sample_cces_all_estimate, sample_cces_estimate])\nsummary[\"data\"] = [\"Full Data\", \"Biased Data\"]\nsummary\n\n\n\n\n\n \n \n \n mean\n se\n data\n \n \n \n \n 0\n 0.434051\n 0.002113\n Full Data\n \n \n 1\n 0.465000\n 0.007054\n Biased Data\n \n \n\n\n\n\nA 3 percent difference in a national survey is a substantial error in the case where the difference is due to preventable bias.\n\n\n\n\nTo facilitate regression based stratification we first need a regression model. In our case we will ultimately fit a multi-level regression model with intercept terms for each for each of the groups in our demographic stratum. In this way we try to account for the appropriate set of variables (as in the example above) to better specify the effect modification due to membership within a particular demographic stratum.\nWe will fit the model using bambi using the binomial link function on the biased sample data. But first we aggregate up by demographic strata and count the occurences within each strata.\n\nmodel_df = (\n cces_df.groupby([\"state\", \"eth\", \"male\", \"age\", \"edu\"], observed=False)\n .agg({\"caseid\": \"nunique\", \"abortion\": \"sum\"})\n .reset_index()\n .sort_values(\"abortion\", ascending=False)\n .rename({\"caseid\": \"n\"}, axis=1)\n .merge(statelevel_predictors_df, left_on=\"state\", right_on=\"state\", how=\"left\")\n)\nmodel_df[\"abortion\"] = model_df[\"abortion\"].astype(int)\nmodel_df[\"n\"] = model_df[\"n\"].astype(int)\nmodel_df.head()\n\n\n\n\n\n \n \n \n state\n eth\n male\n age\n edu\n n\n abortion\n repvote\n region\n \n \n \n \n 0\n ID\n White\n -0.5\n 70+\n HS\n 32\n 18\n 0.683102\n West\n \n \n 1\n ID\n White\n 0.5\n 70+\n 4-Year College\n 20\n 16\n 0.683102\n West\n \n \n 2\n WV\n White\n 0.5\n 70+\n Some college\n 17\n 13\n 0.721611\n South\n \n \n 3\n WV\n White\n 0.5\n 70+\n 4-Year College\n 15\n 12\n 0.721611\n South\n \n \n 4\n ID\n White\n 0.5\n 70+\n Post-grad\n 17\n 11\n 0.683102\n West\n \n \n\n\n\n\nOur model_df now has one row per Strata across all the demographic cuts.\n\n\nHere we use some of bambi’s latest functionality to assess the interaction effects between the variables.\n\nformula = \"\"\" p(abortion, n) ~ C(state) + C(eth) + C(edu) + male + repvote\"\"\"\n\nbase_model = bmb.Model(formula, model_df, family=\"binomial\")\n\nresult = base_model.fit(\n random_seed=100,\n target_accept=0.95,\n # inference_method=\"nuts_numpyro\",\n idata_kwargs={\"log_likelihood\": True},\n)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (4 chains in 4 jobs)\nNUTS: [Intercept, C(state), C(eth), C(edu), male, repvote]\n\n\nSampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 816 seconds.\n\n\nWe plot the predicted outcomes within each group using the plot_predictions function.\n\nmosaic = \"\"\"\n AABB\n CCCC\n \"\"\"\n\nfig = 
plt.figure(layout=\"constrained\", figsize=(20, 7))\naxs = fig.subplot_mosaic(mosaic)\n\nbmb.interpret.plot_predictions(base_model, result, \"eth\", ax=axs[\"A\"])\nbmb.interpret.plot_predictions(base_model, result, \"edu\", ax=axs[\"B\"])\nbmb.interpret.plot_predictions(base_model, result, \"state\", ax=axs[\"C\"])\nplt.suptitle(\"Plot Prediction per Class\", fontsize=20);\n\n\n\n\nMore interesting we can use the comparison functionality to compare differences in eth conditional on age and edu. Where we can see that the differences between ethnicities are pretty stable across all age groups, slightly shifted by within the Post-grad level of education.\n\nfig, ax = bmb.interpret.plot_comparisons(\n model=base_model,\n idata=result,\n contrast={\"eth\": [\"Black\", \"White\"]},\n conditional=[\"age\", \"edu\"],\n comparison_type=\"diff\",\n subplot_kwargs={\"main\": \"age\", \"group\": \"edu\"},\n fig_kwargs={\"figsize\": (12, 5), \"sharey\": True},\n legend=True,\n)\nax[0].set_title(\"Comparison of Difference in Ethnicity \\n within Age and Educational Strata\");\n\n\n\n\nWe can pull these specific estimates out into a table for closer inspection to see that the differences in response expected between the extremes of educational attainment are moderated by state iand race.\n\nbmb.interpret.comparisons(\n model=base_model,\n idata=result,\n contrast={\"edu\": [\"Post-grad\", \"No HS\"]},\n conditional={\"eth\": [\"Black\", \"White\"], \"state\": [\"NY\", \"CA\", \"ID\", \"VA\"]},\n comparison_type=\"diff\",\n)\n\n\n\n\n\n \n \n \n term\n estimate_type\n value\n eth\n state\n male\n repvote\n estimate\n lower_3.0%\n upper_97.0%\n \n \n \n \n 0\n edu\n diff\n (Post-grad, No HS)\n Black\n NY\n 0.0\n 0.530191\n 0.093161\n 0.000171\n 0.197388\n \n \n 1\n edu\n diff\n (Post-grad, No HS)\n Black\n CA\n 0.0\n 0.530191\n 0.078149\n 0.000014\n 0.188560\n \n \n 2\n edu\n diff\n (Post-grad, No HS)\n Black\n ID\n 0.0\n 0.530191\n 0.085810\n 0.000116\n 0.194178\n \n \n 3\n edu\n diff\n (Post-grad, No HS)\n Black\n VA\n 0.0\n 0.530191\n 0.125538\n 0.024355\n 0.220127\n \n \n 4\n edu\n diff\n (Post-grad, No HS)\n White\n NY\n 0.0\n 0.530191\n 0.093632\n 0.000537\n 0.201009\n \n \n 5\n edu\n diff\n (Post-grad, No HS)\n White\n CA\n 0.0\n 0.530191\n 0.078656\n 0.000037\n 0.193271\n \n \n 6\n edu\n diff\n (Post-grad, No HS)\n White\n ID\n 0.0\n 0.530191\n 0.092998\n 0.000269\n 0.198796\n \n \n 7\n edu\n diff\n (Post-grad, No HS)\n White\n VA\n 0.0\n 0.530191\n 0.099620\n 0.002437\n 0.193426\n \n \n\n\n\n\nWith this in mind we want to fit our final model to incorporate the variation we see here across the different levels of our stratified data.\n\n\n\nWe can specify these features of our model using a hierarchical structure as follows:\n\\[ Pr(y_i = 1) = logit^{-1}(\n\\alpha_{\\rm s[i]}^{\\rm state}\n+ \\alpha_{\\rm a[i]}^{\\rm age}\n+ \\alpha_{\\rm r[i]}^{\\rm eth}\n+ \\alpha_{\\rm e[i]}^{\\rm edu}\n+ \\beta^{\\rm male} \\cdot {\\rm Male}_{\\rm i}\n+ \\alpha_{\\rm g[i], r[i]}^{\\rm male.eth}\n+ \\alpha_{\\rm e[i], a[i]}^{\\rm edu.age}\n+ \\alpha_{\\rm e[i], r[i]}^{\\rm edu.eth}\n)\n\\]\nHere we have used the fact that we can add components to the \\(\\alpha\\) intercept terms and interaction effects to express the stratum specific variation in the outcomes that we’ve seen in our exploratory work. Using the bambi formula syntax. 
We have:

%%capture
formula = """ p(abortion, n) ~ (1 | state) + (1 | eth) + (1 | edu) + male + repvote + (1 | male:eth) + (1 | edu:age) + (1 | edu:eth)"""

model_hierarchical = bmb.Model(formula, model_df, family="binomial")

result = model_hierarchical.fit(
    random_seed=100,
    target_accept=0.99,
    inference_method="nuts_numpyro",
    idata_kwargs={"log_likelihood": True},
)


result

[Output: the arviz.InferenceData object for the hierarchical model, with groups posterior, log_likelihood, sample_stats and observed_data (4 chains × 1,000 draws over 11,040 strata, sampled with numpyro); the lengthy HTML repr is omitted here.]

az.summary(result, var_names=["Intercept", "male", "1|edu", "1|eth", "repvote"])

                        mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Intercept              0.407  0.540  -0.548    1.365      0.016    0.016    1587.0    1235.0    1.0
male                   0.209  0.191  -0.166    0.556      0.006    0.005    1459.0    1152.0    1.0
1|edu[4-Year College] -0.043  0.189  -0.421    0.294      0.003    0.003    3269.0    2748.0    1.0
1|edu[HS]              0.059  0.186  -0.285    0.433      0.003    0.003    2936.0    2716.0    1.0
1|edu[No HS]           0.169  0.224  -0.181    0.638      0.005    0.003    2432.0    3248.0    1.0
1|edu[Post-grad]      -0.198  0.221  -0.644    0.127      0.005    0.003    2063.0    2871.0    1.0
1|edu[Some college]    0.032  0.188  -0.339    0.386      0.003    0.003    3108.0    3001.0    1.0
1|eth[Black]          -0.437  0.486  -1.329    0.332      0.015    0.014    1692.0    1144.0    1.0
1|eth[Hispanic]        0.059  0.455  -0.649    0.953      0.014    0.013    2094.0    1166.0    1.0
1|eth[Other]           0.076  0.455  -0.614    1.004      0.014    0.013    1979.0    1220.0    1.0
1|eth[White]           0.162  0.459  -0.622    0.970      0.015    0.013    1687.0    1124.0    1.0
repvote               -1.192  0.529  -2.200   -0.193      0.013    0.009    1749.0    2462.0    1.0

The terms in the model formula allow for specific intercept terms across the demographic splits of eth, edu, and state. These represent stratum-specific adjustments of the intercept term in the model. Similarly, we invoke intercepts for the interaction terms age:edu, male:eth and edu:eth. Each of these cohorts represents a share of the data in our sample.

model_hierarchical.graph()

We then predict the outcomes implied by the biased sample. These predictions are to be adjusted by what we take to be the share of each demographic cohort in the population. We can plot the posterior predictive distribution against the observed data from our biased sample to see that we have a generally good fit to the distribution.

model_hierarchical.predict(result, kind="pps")
ax = az.plot_ppc(result, figsize=(8, 5), kind="cumulative", observed_rug=True)
ax.set_title("Posterior Predictive Checks \n On Biased Sample");


We now use the fitted model to predict the voting shares on the data where we use the genuine state numbers per strata. 
To do so we load data from the national census and augment our dataset so that we can apply the appropriate weights.

poststrat_df = pd.read_csv("data/mr_p_poststrat_df.csv")

new_data = poststrat_df.merge(
    statelevel_predictors_df, left_on="state", right_on="state", how="left"
)
new_data.rename({"educ": "edu"}, axis=1, inplace=True)
new_data = model_df.merge(
    new_data,
    how="left",
    left_on=["state", "eth", "male", "age", "edu"],
    right_on=["state", "eth", "male", "age", "edu"],
).rename({"n_y": "n", "repvote_y": "repvote"}, axis=1)[
    ["state", "eth", "male", "age", "edu", "n", "repvote"]
]


new_data = new_data.merge(
    new_data.groupby("state").agg({"n": "sum"}).reset_index().rename({"n": "state_total"}, axis=1)
)
new_data["state_percent"] = new_data["n"] / new_data["state_total"]
new_data.head()

   state  eth    male  age    edu             n      repvote   state_total  state_percent
0  ID     White  -0.5  70+    HS              31503  0.683102  1193885      0.026387
1  ID     White   0.5  70+    4-Year College  11809  0.683102  1193885      0.009891
2  ID     White   0.5  70+    Post-grad        9873  0.683102  1193885      0.008270
3  ID     White   0.5  50-59  Some college    30456  0.683102  1193885      0.025510
4  ID     White   0.5  70+    HS              19898  0.683102  1193885      0.016667

This dataset has exactly the same structure and length as the input data to the fitted model. We have simply switched the observed counts across the demographic strata for counts that reflect their proportion in the national survey. Additionally, we have calculated the state totals and the share of each stratum within its state. This will be important later, when we use the state_percent variable to calculate an adjusted MrP estimate of the predictions at the state level. We now use this dataset with our fitted model to generate a posterior predictive distribution.

result_adjust = model_hierarchical.predict(result, data=new_data, inplace=False, kind="pps")
result_adjust

[Output: a new arviz.InferenceData object that now also carries a posterior_predictive group with predicted counts for each of the 11,040 strata (4 chains × 1,000 draws); the lengthy HTML repr is omitted here.]

We need to adjust each state-specific stratum by the weight appropriate for that state to post-stratify the estimates. To do so we extract the indices for each stratum in our data on a state-by-state basis. 
Then we weight the predicted estimate by the appropriate percentage on a state basis and sum them to recover a state level estimate.\n\nestimates = []\nabortion_posterior_base = az.extract(result, num_samples=2000)[\"p(abortion, n)_mean\"]\nabortion_posterior_mrp = az.extract(result_adjust, num_samples=2000)[\"p(abortion, n)_mean\"]\n\nfor s in new_data[\"state\"].unique():\n idx = new_data.index[new_data[\"state\"] == s].tolist()\n predicted_mrp = (\n ((abortion_posterior_mrp[idx].mean(dim=\"sample\") * new_data.iloc[idx][\"state_percent\"]))\n .sum()\n .item()\n )\n predicted_mrp_lb = (\n (\n (\n abortion_posterior_mrp[idx].quantile(0.025, dim=\"sample\")\n * new_data.iloc[idx][\"state_percent\"]\n )\n )\n .sum()\n .item()\n )\n predicted_mrp_ub = (\n (\n (\n abortion_posterior_mrp[idx].quantile(0.975, dim=\"sample\")\n * new_data.iloc[idx][\"state_percent\"]\n )\n )\n .sum()\n .item()\n )\n predicted = abortion_posterior_base[idx].mean().item()\n base_lb = abortion_posterior_base[idx].quantile(0.025).item()\n base_ub = abortion_posterior_base[idx].quantile(0.975).item()\n\n estimates.append(\n [s, predicted, base_lb, base_ub, predicted_mrp, predicted_mrp_ub, predicted_mrp_lb]\n )\n\n\nstate_predicted = pd.DataFrame(\n estimates,\n columns=[\"state\", \"base_expected\", \"base_lb\", \"base_ub\", \"mrp_adjusted\", \"mrp_ub\", \"mrp_lb\"],\n)\n\nstate_predicted = (\n state_predicted.merge(cces_all_df.groupby(\"state\")[[\"abortion\"]].mean().reset_index())\n .sort_values(\"mrp_adjusted\")\n .rename({\"abortion\": \"census_share\"}, axis=1)\n)\nstate_predicted.head()\n\n\n\n\n\n \n \n \n state\n base_expected\n base_lb\n base_ub\n mrp_adjusted\n mrp_ub\n mrp_lb\n census_share\n \n \n \n \n 9\n OK\n 0.423350\n 0.209144\n 0.660533\n 0.326291\n 0.413912\n 0.245431\n 0.321553\n \n \n 34\n MS\n 0.439145\n 0.215565\n 0.683780\n 0.381575\n 0.493799\n 0.278498\n 0.374640\n \n \n 2\n CO\n 0.475961\n 0.251250\n 0.698478\n 0.397101\n 0.482699\n 0.315535\n 0.354857\n \n \n 24\n ME\n 0.438638\n 0.236010\n 0.669674\n 0.418964\n 0.537156\n 0.296373\n 0.403636\n \n \n 25\n MO\n 0.513291\n 0.225326\n 0.748539\n 0.420735\n 0.525425\n 0.321195\n 0.302954\n \n \n\n\n\n\nThis was the crucial step and we’ll need to unpack it a little. We have taken (state by state) each demographic strata and reweighted the expected posterior predictive value by the share that strata represents in the national census within that state. We have then aggregated this score within the state to generate a state specific value. This value can now be compared to the expected value derived from our biased data and, more interestingly, the value reported in the national census.\n\n\n\nThese adjusted estimates can be plotted against the shares ascribed at the state level in the census. 
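In symbols, the quantity computed in the loop above is a weighted average (the notation here is added for clarity and does not appear in the original notebook): for a state \\(s\\) made up of demographic strata \\(j\\),\n\\[\n\\hat{\\theta}_s = \\sum_{j \\in s} \\frac{n_{js}}{\\sum_{k \\in s} n_{ks}} \\hat{p}_{js}\n\\]\nwhere \\(\\hat{p}_{js}\\) is the posterior expected proportion for stratum \\(j\\) in state \\(s\\) and \\(n_{js}\\) is its census count, so the weights are exactly the state_percent column constructed earlier.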
These adjustments provide a far better reflection of the national picture than the ones derived from the model fitted to the biased sample.\n\nfig, axs = plt.subplots(2, 1, figsize=(17, 10))\naxs = axs.flatten()\nax = axs[0]\nax1 = axs[1]\nax.scatter(\n state_predicted[\"state\"], state_predicted[\"base_expected\"], color=\"red\", label=\"Biased Sample\"\n)\nax.scatter(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n color=\"slateblue\",\n label=\"Mr P Adjusted\",\n)\nax.scatter(\n state_predicted[\"state\"],\n state_predicted[\"census_share\"],\n color=\"darkgreen\",\n label=\"Census Aggregates\",\n)\nax.legend()\nax.vlines(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n state_predicted[\"census_share\"],\n color=\"black\",\n linestyles=\"--\",\n)\n\n\nax1.scatter(\n state_predicted[\"state\"], state_predicted[\"base_expected\"], color=\"red\", label=\"Biased Sample\"\n)\nax1.scatter(\n state_predicted[\"state\"],\n state_predicted[\"mrp_adjusted\"],\n color=\"slateblue\",\n label=\"Mr P Adjusted\",\n)\nax1.legend()\n\nax1.vlines(\n state_predicted[\"state\"], state_predicted[\"base_ub\"], state_predicted[\"base_lb\"], color=\"red\"\n)\nax1.vlines(\n state_predicted[\"state\"],\n state_predicted[\"mrp_ub\"],\n state_predicted[\"mrp_lb\"],\n color=\"slateblue\",\n)\nax.set_xlabel(\"State\")\nax.set_ylabel(\"Proportion\")\nax1.set_title(\n \"Comparison of Uncertainty in Biased Predictions and Post-stratified Adjustment\", fontsize=15\n)\nax.set_title(\"Comparison of Post-stratified Adjustment and Census Report\", fontsize=15)\nax1.set_ylabel(\"Proportion\");\n\n\n\n\nIn the top plot we see the state-specific MrP estimates for the proportion voting yes, compared to the estimate inferred from the biased sample and estimates from the national census. We can see how the MrP estimates are much closer to those drawn from the national census.\nIn the bottom plot we’ve shown the estimates from the MrP model and the estimates drawn from the biased sample, but here we’ve shown the uncertainty in the estimation on a state level. Clearly, the MrP adjustments also shrink the uncertainty in our estimate of vote-share.\nMrP is in this sense a corrective procedure for the avoidance of bias in sample data, where we have strong evidence for adjusting the weight accorded to any stratum of data in our population.\n\n\n\n\nIn this notebook we have seen how to use bambi to concisely and quickly apply the technique of multilevel regression and post-stratification. We’ve seen how this technique is a natural and compelling extension to regression modelling in general, one that incorporates prior knowledge in an interesting and flexible manner.\nThe problems of representation in data are serious. Policy gets made and changed on the basis of anticipated policy effects. Without the ability to control and adjust for non-representative samples, politicians and policy makers risk prioritising initiatives for a vocal majority among those represented in the sample. The question of whether a given sample is “good” or “bad” cannot really be known at the time, so some care needs to be taken when choosing to adjust your model of the data.\nPredictions made from sample data are consequential. It is no exaggeration to say that the fates of entire nations can hang on decisions made from poorly understood sampling procedures. 
Multilevel regression and post-stratification is an apt tool for making the adjustments required and guiding decisions makers in crucial policy choices, but it should be used carefully." }, { - "objectID": "notebooks/how_bambi_works.html", - "href": "notebooks/how_bambi_works.html", + "objectID": "notebooks/hierarchical_binomial_bambi.html", + "href": "notebooks/hierarchical_binomial_bambi.html", "title": "Bambi", "section": "", - "text": "Bambi builds linear predictors of the form\n\\[\n\\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} + \\mathbf{Z}\\pmb{u}\n\\]\nThe linear predictor is the sum of two kinds of contributions\n\n\\(\\mathbf{X}\\pmb{\\beta}\\) is the common (fixed) effects contribution\n\\(\\mathbf{Z}\\pmb{u}\\) is the group-specific (random) effects contribution\n\nBoth contributions obey the same rule: A dot product between a data object and a parameter object.\n\n\n\nThe following objects are design matrices\n\\[\n\\begin{array}{c}\n\\underset{n\\times p}{\\mathbf{X}}\n& \\underset{n\\times j}{\\mathbf{Z}}\n\\end{array}\n\\]\n\n\\(\\mathbf{X}\\) is the design matrix for the common (fixed) effects part\n\\(\\mathbf{Z}\\) is the design matrix for the group-specific (random) effects part\n\n\n\n\nThe following objects are parameter vectors\n\\[\n\\begin{array}{c}\n\\underset{p\\times 1}{\\pmb{\\beta}}\n& \\underset{j\\times 1}{\\pmb{u}}\n\\end{array}\n\\]\n\n\\(\\pmb{\\beta}\\) is a vector of parameters/coefficients for the common (fixed) effects part\n\\(\\pmb{u}\\) is a vector of parameters/coefficients for the group-specific (random) effects part\n\nAs result, the linear predictor \\(\\pmb{\\eta}\\) is of shape \\(n \\times 1\\).\nA fundamental question: How do we use linear predictors in modeling?\nLinear predictors (or a function of them) describe the functional relationship between one or more parameters of the response distribution and the predictors.\n\n\n\nA classical linear regression model is a special case where there is no group-specific contribution and a linear predictor is mapped to the mean parameter of the response distribution.\n\\[\n\\begin{aligned}\n\\pmb{\\mu} &= \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\beta} &\\sim \\text{Distribution} \\\\\n\\sigma &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Normal}(\\eta_i, \\sigma)\n\\end{aligned}\n\\]\n\n\n\nLink functions turn linear models in generalized linear models. A link function, \\(g\\), is a function that maps a parameter of the response distribution to the linear predictor. When people talk about generalized linear models, they mean there’s a link function mapping the mean of the response distribution to the linear predictor. But as we will see later, Bambi allows to use linear predictors and link functions to model any parameter of the response distribution – these are known as distributional models or generalized linear models for location, scale, and shape.\n\\[\ng(\\pmb{\\mu}) = \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta}\n\\]\nwhere \\(g\\) is the link function. It must be differentiable, monotonic, and invertible. 
For example, the logit function is useful when the mean parameter is bounded in the \\((0, 1)\\) domain.\n\\[\n\\begin{aligned}\ng(\\pmb{\\mu}) &= \\text{logit}(\\pmb{\\mu}) = \\log \\left(\\frac{\\pmb{\\mu}}{1 - \\pmb{\\mu}}\\right) = \\pmb{\\eta} = \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\mu} = g^{-1}(\\pmb{\\eta}) &= \\text{logistic}(\\pmb{\\eta}) = \\frac{1}{1 + \\exp (-\\pmb{\\eta})} = \\frac{1}{1 + \\exp (-\\mathbf{X}\\pmb{\\beta})}\n\\end{aligned}\n\\]\n\n\n\n\\[\n\\begin{aligned}\ng(\\pmb{\\mu}) &= \\mathbf{X}\\pmb{\\beta} \\\\\n\\pmb{\\beta} &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Bernoulli}(\\mu = g^{-1}(\\mathbf{X}\\pmb{\\beta} )_i)\n\\end{aligned}\n\\]\nwhere \\(g = \\text{logit}\\) and \\(g^{-1} = \\text{logistic}\\) also known as \\(\\text{expit}\\).\n\n\n\nThis is an extension to generalized linear models. In a generalized linear model a linear predictor and a link function are used to explain the relationship between the mean (location) of the response distribution and the predictors. In this type of models we are able to use linear predictors and link functions to represent the relationship between any parameter of the response distribution and the predictors.\n\\[\n\\begin{aligned}\ng_1(\\pmb{\\theta}_1) &= \\mathbf{X}_1\\pmb{\\beta}_1 + \\mathbf{Z}_1\\pmb{u}_1 \\\\\ng_2(\\pmb{\\theta}_2) &= \\mathbf{X}_2\\pmb{\\beta}_2 + \\mathbf{Z}_2\\pmb{u}_2 \\\\\n&\\phantom{b=\\,} \\vdots \\\\\ng_k(\\pmb{\\theta}_k) &= \\mathbf{X}_k\\pmb{\\beta}_k + \\mathbf{Z}_k\\pmb{u}_k \\\\\nY_i &\\sim \\text{Distribution}(\\theta_{1i}, \\theta_{2i}, \\dots, \\theta_{ki})\n\\end{aligned}\n\\]\n\n\n\n\\[\n\\begin{aligned}\ng_1(\\pmb{\\mu}) &= \\mathbf{X}_1\\pmb{\\beta}_1 \\\\\ng_2(\\pmb{\\sigma}) &= \\mathbf{X}_2\\pmb{\\beta}_2 \\\\\n\\pmb{\\beta}_1 &\\sim \\text{Distribution} \\\\\n\\pmb{\\beta}_2 &\\sim \\text{Distribution} \\\\\nY_i &\\sim \\text{Normal}(\\mu_i, \\sigma_i)\n\\end{aligned}\n\\]\nWhere\n\n\\(g_1\\) is the identity function\n\\(g_2\\) is a function that maps \\(\\mathbb{R}\\to\\mathbb{R}^+\\).\n\nUsually \\(g_2 = \\log\\)\n\\(\\pmb{\\sigma} = \\exp(\\mathbf{X}_2\\pmb{\\beta}_2)\\).\n\n\n\n\n\nA design matrix is… a matrix. As such, it’s filled up with numbers. However, it does not mean it cannot encode non-numerical variables. In a design matrix we can encode the following\n\nNumerical predictors\nInteraction effects\nTransformations of numerical predictors that don’t depend on model parameters\n\nPowers\nCentering\nStandardization\nBasis functions\n\nBambi currently supports basis splines\n\nAnd anything you can imagine as well as it does not involve model parameters\n\nCategorical predictors\n\nCategorical variables are encoded into their own design matrices\nThe most popular approach is to create binary “dummy” variables. One per level of the categorical variable.\nBut doing it haphazardly will result in non-identifiabilities quite soon.\nEncodings to the rescue\n\nOne can apply different restrictions or contrast matrices to overcome this problem. 
They usually imply different interpretations of the regression coefficients.\nTreatment encoding: Sets one level to zero\nZero-sum encoding: Sets one level to the opposite of the sum of the other levels\nBackward differences\nOrthogonal polynomials\nHelmert contrasts\n…\n\n\n\nThese all can be expressed as a single set of columns of a design matrix that are matched with a subset of the parameter vector of the same length\n\n\n\n\nData matrices are built by formulae.\n\nData matrices are not dependent on parameter values in any form.\n\nBambi consumes and manipulates them to create model terms, which shape the parameter vector.\n\nThe parameter vector is not influenced by the values in the data matrix.\n\n\nGoing back to planet Earth…" + "text": "This notebook shows how to build a hierarchical logistic regression model with the Binomial family in Bambi.\nThis example is based on the Hierarchical baseball article in Bayesian Analysis Recipes, a collection of articles on how to do Bayesian data analysis with PyMC3 made by Eric Ma.\n\n\nExtracted from the original work:\n\nBaseball players have many metrics measured for them. Let’s say we are on a baseball team, and would like to quantify player performance, one metric being their batting average (defined by how many times a batter hit a pitched ball, divided by the number of times they were up for batting (“at bat”)). How would you go about this task?\n\n\n\n\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom matplotlib.lines import Line2D\nfrom matplotlib.patches import Patch\n\n\naz.style.use(\"arviz-darkgrid\")\nrandom_seed = 1234\n\nWe first need some measurements of batting data. Today we’re going to use data from the Baseball Databank. It is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.\nThis repository contains several datasets in the form of .csv files. 
This example is going to use the Batting.csv file, which can be loaded directly with Bambi in a convenient way.\n\ndf = bmb.load_data(\"batting\")\n\n# Then clean some of the data\ndf[\"AB\"] = df[\"AB\"].replace(0, np.nan)\ndf = df.dropna()\ndf[\"batting_avg\"] = df[\"H\"] / df[\"AB\"]\ndf = df[df[\"yearID\"] >= 2016]\ndf = df.iloc[0:15] \ndf.head(5)\n\n\n\n\n\n \n \n \n playerID\n yearID\n stint\n teamID\n lgID\n G\n AB\n R\n H\n 2B\n ...\n SB\n CS\n BB\n SO\n IBB\n HBP\n SH\n SF\n GIDP\n batting_avg\n \n \n \n \n 101348\n abadfe01\n 2016\n 1\n MIN\n AL\n 39\n 1.0\n 0\n 0\n 0\n ...\n 0.0\n 0.0\n 0\n 1.0\n 0.0\n 0.0\n 0.0\n 0.0\n 0.0\n 0.000000\n \n \n 101350\n abreujo02\n 2016\n 1\n CHA\n AL\n 159\n 624.0\n 67\n 183\n 32\n ...\n 0.0\n 2.0\n 47\n 125.0\n 7.0\n 15.0\n 0.0\n 9.0\n 21.0\n 0.293269\n \n \n 101352\n ackledu01\n 2016\n 1\n NYA\n AL\n 28\n 61.0\n 6\n 9\n 0\n ...\n 0.0\n 0.0\n 8\n 9.0\n 0.0\n 0.0\n 0.0\n 1.0\n 0.0\n 0.147541\n \n \n 101353\n adamecr01\n 2016\n 1\n COL\n NL\n 121\n 225.0\n 25\n 49\n 7\n ...\n 2.0\n 3.0\n 24\n 47.0\n 0.0\n 4.0\n 3.0\n 0.0\n 5.0\n 0.217778\n \n \n 101355\n adamsma01\n 2016\n 1\n SLN\n NL\n 118\n 297.0\n 37\n 74\n 18\n ...\n 0.0\n 1.0\n 25\n 81.0\n 1.0\n 2.0\n 0.0\n 3.0\n 5.0\n 0.249158\n \n \n\n5 rows × 23 columns\n\n\n\nFrom all the columns above, we’re going to use the following:\n\nplayerID: Unique identification for the player.\nAB: Number of times the player was up for batting.\nH: Number of times the player hit the ball while batting.\nbatting_avg: Simply ratio between H and AB.\n\n\n\n\nIt’s always good to explore the data before starting to write down our models. This is very useful to gain a good understanding of the distribution of the variables and their relationships, and even anticipate some problems that may occur during the sampling process.\nThe following graph summarizes the percentage of hits, as well as the number of times the players were up for batting and the number of times they hit the ball.\n\nBLUE = \"#2a5674\"\nRED = \"#b13f64\"\n\n\n_, ax = plt.subplots(figsize=(10, 6))\n\n# Customize x limits. 
\n# This adds space on the left side to indicate percentage of hits.\nax.set_xlim(-120, 320)\n\n# Add dots for the times at bat and the number of hits\nax.scatter(df[\"AB\"], list(range(15)), s=140, color=BLUE, zorder=10)\nax.scatter(df[\"H\"], list(range(15)), s=140, color=RED, zorder=10)\n\n# Also a line connecting them\nax.hlines(list(range(15)), df[\"AB\"], df[\"H\"], color=\"#b3b3b3\", lw=4)\n\nax.axvline(ls=\"--\", lw=1.4, color=\"#a3a3a3\")\nax.hlines(list(range(15)), -110, -50, lw=6, color=\"#b3b3b3\", capstyle=\"round\")\nax.scatter(60 * df[\"batting_avg\"] - 110, list(range(15)), s=28, color=RED, zorder=10)\n\n# Add the percentage of hits\nfor j in range(15): \n text = f\"{round(df['batting_avg'].iloc[j] * 100)}%\"\n ax.text(-12, j, text, ha=\"right\", va=\"center\", fontsize=14, color=\"#333\")\n\n# Customize tick positions and labels\nax.yaxis.set_ticks(list(range(15)))\nax.yaxis.set_ticklabels(df[\"playerID\"])\nax.xaxis.set_ticks(range(0, 400, 100))\n\n# Create handles for the legend (just dots and labels)\nhandles = [\n Line2D(\n [0], [0], label=\"At Bat\", marker=\"o\", color=\"None\", markeredgewidth=0,\n markerfacecolor=RED, markersize=12\n ),\n Line2D(\n [0], [0], label=\"Hits\", marker=\"o\", color=\"None\", markeredgewidth=0, \n markerfacecolor=BLUE, markersize=13\n )\n]\n\n# Add legend on top-right corner\nlegend = ax.legend(\n handles=handles, \n loc=1, \n fontsize=14, \n handletextpad=0.4,\n frameon=True\n)\n\n# Finally add labels and a title\nax.set_xlabel(\"Count\", fontsize=14)\nax.set_ylabel(\"Player\", fontsize=14)\nax.set_title(\"How often do batters hit the ball?\", fontsize=20);\n\n\n\n\nThe first thing one can see is that the number of times players were up for batting varies quite a lot. Some players have been there for very few times, while there are others who have been there hundreds of times. We can also note the percentage of hits is usually a number between 12% and 29%.\nThere are two players, alberma01 and abadfe01, who had only one chance to bat. The first one hit the ball, while the latter missed. That’s why alberma01 as a 100% hit percentage, while abadfe01 has 0%. There’s another player, aguilje01, who has a success record of 0% because he missed all the few opportunities he had to bat. These extreme situations, where the empirical estimation lives in the boundary of the parameter space, are associated with estimation problems when using a maximum-likelihood estimation approach. Nonetheless, they can also impact the sampling process, especially when using wide priors.\nAs a final note, abreujo02, has been there for batting 624 times, and thus the grey dot representing this number does not appear in the plot.\n\n\n\nLet’s get started with a simple cell-means logistic regression for \\(p_i\\), the probability of hitting the ball for the player \\(i\\)\n\\[\n\\begin{array}{lr}\n \\displaystyle \\text{logit}(p_i) = \\beta_i & \\text{with } i = 0, \\cdots, 14\n\\end{array} \n\\]\nWhere\n\\[\n\\beta_i \\sim \\text{Normal}(0, \\ \\sigma_{\\beta}),\n\\]\n\\(\\sigma_{\\beta}\\) is a common constant for all the players, and \\(\\text{logit}(p_i) = \\log\\left(\\frac{p_i}{1 - p_i}\\right)\\).\nSpecifying this model is quite simple in Bambi thanks to its formula interface.\nFirst of all, note this is a Binomial family and the response involves both the number of hits (H) and the number of times at bat (AB). We use the p(x, n) function for the response term. 
This just tells Bambi we want to model the proportion resulting from dividing x over n.\nThe right-hand side of the formula is \"0 + playerID\". This means the model includes a coefficient for each player ID, but does not include a global intercept.\nFinally, using the Binomial family is as easy as passing family=\"binomial\". By default, the link function for this family is link=\"logit\", so there’s nothing to change there.\n\nmodel_non_hierarchical = bmb.Model(\"p(H, AB) ~ 0 + playerID\", df, family=\"binomial\")\nmodel_non_hierarchical\n\n Formula: p(H, AB) ~ 0 + playerID\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n playerID ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], sigma: [10.0223 10.0223\n 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223\n 10.0223 10.0223 10.0223 10.0223 10.0223 10.0223])\n\n\n\nidata_non_hierarchical = model_non_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [playerID]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\n\n\nNext we observe the posterior of the coefficient for each player. The compact=False argument means we want separated panels for each player.\n\naz.plot_trace(idata_non_hierarchical, compact=False);\n\n\n\n\nSo far so good! The traceplots indicate the sampler worked well.\nNow, let’s keep this posterior aside for later use and let’s fit the hierarchical version.\n\n\n\nThis model incorporates a group-specific intercept for each player:\n\\[\n\\begin{array}{lr}\n \\displaystyle \\text{logit}(p_i) = \\alpha + \\gamma_i & \\text{with } i = 0, \\cdots, 14\n\\end{array} \n\\]\nwhere\n\\[\n\\begin{array}{c}\n \\alpha \\sim \\text{Normal}(0, \\ \\sigma_{\\alpha}) \\\\\n \\gamma_i \\sim \\text{Normal}(0, \\ \\sigma_{\\gamma}) \\\\\n \\sigma_{\\gamma} \\sim \\text{HalfNormal}(\\tau_{\\gamma})\n\\end{array}\n\\]\nThe group-specific terms are indicated with the | operator in the formula. 
In this case, since there is an intercept for each player, we write 1|playerID.\n\nmodel_hierarchical = bmb.Model(\"p(H, AB) ~ 1 + (1|playerID)\", df, family=\"binomial\")\nmodel_hierarchical\n\n Formula: p(H, AB) ~ 1 + (1|playerID)\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 2.5)\n \n Group-level effects\n 1|playerID ~ Normal(mu: 0, sigma: HalfNormal(sigma: 2.5))\n\n\n\nidata_hierarchical = model_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:07<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nAnd there we got several divergences… What can we do?\nOne thing we could try is to increase target_accept as suggested in the message above, but there are so many divergences that instead we are going to first take a look at the prior predictive distribution to check whether our priors are too informative or too wide.\nThe Model instance has a method called prior_predictive() that generates samples from the prior predictive distribution. It returns an InferenceData object that contains the values of the prior predictive distribution.\n\nidata_prior = model_hierarchical.prior_predictive()\nprior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n\nSampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]\n/tmp/ipykernel_23363/2686921361.py:2: FutureWarning: extract_dataset has been deprecated, please use extract\n prior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n\n\nIf we inspect the DataArray, we see there are 500 draws (sample) for each of the 15 players (p(H, AB)_dim_0)\nLet’s plot these distributions together with the observed proportion of hits for every player here.\n\n# We define this function because this plot is going to be repeated below.\ndef plot_prior_predictive(df, prior):\n AB = df[\"AB\"].values\n H = df[\"H\"].values\n\n fig, axes = plt.subplots(5, 3, figsize=(10, 6), sharex=\"col\")\n\n for idx, ax in enumerate(axes.ravel()):\n pps = prior.sel({\"p(H, AB)_obs\":idx})\n ab = AB[idx]\n h = H[idx]\n hist = ax.hist(pps / ab, bins=25, color=\"#a3a3a3\")\n ax.axvline(h / ab, color=RED, lw=2)\n ax.set_yticks([])\n ax.tick_params(labelsize=12)\n \n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n fig.legend(\n handles=[Line2D([0], [0], label=\"Observed proportion\", color=RED, linewidth=2)],\n handlelength=1.5,\n handletextpad=0.8,\n borderaxespad=0,\n frameon=True,\n fontsize=11, \n bbox_to_anchor=(0.975, 0.92),\n loc=\"right\"\n \n )\n fig.text(0.5, 0.025, \"Prior probability of hitting\", fontsize=15, ha=\"center\", va=\"baseline\")\n\n\nplot_prior_predictive(df, prior)\n\n/tmp/ipykernel_23363/3299358313.py:17: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n\n\n\n\n\nIndeed, priors are too wide! 
Let’s use tighter priors and see what’s the result\n\npriors = {\n \"Intercept\": bmb.Prior(\"Normal\", mu=0, sigma=1),\n \"1|playerID\": bmb.Prior(\"Normal\", mu=0, sigma=bmb.Prior(\"HalfNormal\", sigma=1))\n}\nmodel_hierarchical = bmb.Model(\"p(H, AB) ~ 1 + (1|playerID)\", df, family=\"binomial\", priors=priors)\nmodel_hierarchical\n\n Formula: p(H, AB) ~ 1 + (1|playerID)\n Family: binomial\n Link: p = logit\n Observations: 15\n Priors: \n target = p\n Common-level effects\n Intercept ~ Normal(mu: 0, sigma: 1)\n \n Group-level effects\n 1|playerID ~ Normal(mu: 0, sigma: HalfNormal(sigma: 1))\n\n\nNow let’s check the prior predictive distribution for these new priors.\n\nmodel_hierarchical.build()\nidata_prior = model_hierarchical.prior_predictive()\nprior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\nplot_prior_predictive(df, prior)\n\nSampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]\n/tmp/ipykernel_23363/1302716284.py:3: FutureWarning: extract_dataset has been deprecated, please use extract\n prior = az.extract_dataset(idata_prior, group=\"prior_predictive\")[\"p(H, AB)\"]\n/tmp/ipykernel_23363/3299358313.py:17: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)\n\n\n\n\n\nDefinetely it looks much better. Now the priors tend to have a symmetric shape with a mode at 0.5, with substantial probability on the whole domain.\n\nidata_hierarchical = model_hierarchical.fit(random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\nLet’s try with increasing target_accept and the number of tune samples.\n\nidata_hierarchical = model_hierarchical.fit(tune=2000, draws=2000, target_accept=0.95, random_seed=random_seed)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [8000/8000 00:17<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.\n\n\n\nvar_names = [\"Intercept\", \"1|playerID\", \"1|playerID_sigma\"]\naz.plot_trace(idata_hierarchical, var_names=var_names, compact=False);\n\n\n\n\nLet’s jump onto the next section where we plot and compare the probability of hit for the players using both models.\n\n\n\nNow we’re going to plot the distribution of the probability of hit for each player, using both models.\nBut before doing that, we need to obtain the posterior in that scale. We could manually take the posterior of the coefficients, compute the linear predictor, and transform that to the probability scale. But that’s a lot of work!\nFortunately, Bambi models have a method called .predict() that we can use to predict in the probability scale. By default, it modifies in-place the InferenceData object we pass to it. 
Then, the posterior samples can be found in the variable p(H, AB)_mean.\n\nmodel_non_hierarchical.predict(idata_non_hierarchical)\nmodel_hierarchical.predict(idata_hierarchical)\n\nLet’s create a forestplot using the posteriors obtained with both models so we can compare them very easily .\n\n_, ax = plt.subplots(figsize = (8, 8))\n\n# Add vertical line for the global probability of hitting\nax.axvline(x=(df[\"H\"] / df[\"AB\"]).mean(), ls=\"--\", color=\"black\", alpha=0.5)\n\n# Create forestplot with ArviZ, only for the mean.\naz.plot_forest(\n [idata_non_hierarchical, idata_hierarchical], \n var_names=\"p(H, AB)_mean\", \n combined=True, \n colors=[\"#666666\", RED], \n linewidth=2.6, \n markersize=8,\n ax=ax\n)\n\n# Create custom y axis tick labels\nylabels = [f\"H: {round(h)}, AB: {round(ab)}\" for h, ab in zip(df[\"H\"].values, df[\"AB\"].values)]\nylabels = list(reversed(ylabels))\n\n# Put the labels for the y axis in the mid of the original location of the tick marks.\nax.set_yticklabels(ylabels, ha=\"right\")\n\n# Create legend\nhandles = [\n Patch(label=\"Non-hierarchical\", facecolor=\"#666666\"),\n Patch(label=\"Hierarchical\", facecolor=RED),\n Line2D([0], [0], label=\"Mean probability\", ls=\"--\", color=\"black\", alpha=0.5)\n]\n\nlegend = ax.legend(handles=handles, loc=4, fontsize=14, frameon=True, framealpha=0.8);\n\n\n\n\nOne of the first things one can see is that not only the center of the distributions varies but also their dispersion. Those posteriors that are very wide are associated with players who have batted only once or few times, while tighter posteriors correspond to players who batted several times.\nPlayers who have extreme empirical proportions have similar extreme posteriors under the non-hierarchical model. However, under the hierarchical model, these distributions are now shrunk towards the global mean. Extreme values are very unlikely under the hierarchical model.\nAnd finally, paraphrasing Eric, there’s nothing ineherently right or wrong about shrinkage and hierarchical models. Whether this is reasonable or not depends on our prior knowledge about the problem. And to me, after having seen the hit rates of the other players, it is much more reasonable to shrink extreme posteriors based on very few data points towards the global mean rather than just let them concentrate around 0 or 1.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\nmatplotlib: 3.6.2\nbambi : 0.9.3\nnumpy : 1.23.5\n\nWatermark: 2.3.1\n\n\n\n\n\n\n\n By default, the .predict() method obtains the posterior for the mean of the likelihood distribution. This mean would be \\(np\\) for the Binomial family. However, since \\(n\\) varies from observation to observation, it returns the value of \\(p\\), as if it was a Bernoulli family. \n .predict()just appends _mean to the name of the response to indicate it is the posterior of the mean." 
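As a quick, illustrative sketch (this cell is not part of the original notebook), the posterior of \\(p\\) can be rescaled by each player’s at-bats to obtain a posterior for the expected number of hits, \\(np\\):\n\nimport arviz as az\nimport xarray as xr\n\n# Posterior of the hit probability p per player, added by the .predict() call above\np_post = az.extract(idata_hierarchical)[\"p(H, AB)_mean\"]\n\n# Scale by each player's at-bats to get the expected number of hits (n * p)\nab = xr.DataArray(df[\"AB\"].to_numpy(), dims=\"p(H, AB)_obs\")\nexpected_hits = (p_post * ab).mean(\"sample\")\nprint(expected_hits.round(1).values)\n\nEach value is simply \\(n\\) times \\(p\\) for that player, so it can be compared directly with the observed H column.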
}, { - "objectID": "notebooks/how_bambi_works.html#example", - "href": "notebooks/how_bambi_works.html#example", + "objectID": "notebooks/multi-level_regression.html", + "href": "notebooks/multi-level_regression.html", "title": "Bambi", - "section": "Example", - "text": "Example\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\naz.style.use(\"arviz-darkgrid\")\n\n\ndata = bmb.load_data(\"sleepstudy\")\n\n\ndef plot_data(data):\n fig, axes = plt.subplots(2, 9, figsize=(16, 7.5), sharey=True, sharex=True, dpi=300, constrained_layout=False)\n fig.subplots_adjust(left=0.075, right=0.975, bottom=0.075, top=0.925, wspace=0.03)\n\n axes_flat = axes.ravel()\n\n for i, subject in enumerate(data[\"Subject\"].unique()):\n ax = axes_flat[i]\n idx = data.index[data[\"Subject\"] == subject].tolist()\n days = data.loc[idx, \"Days\"].to_numpy()\n reaction = data.loc[idx, \"Reaction\"].to_numpy()\n\n # Plot observed data points\n ax.scatter(days, reaction, color=\"C0\", ec=\"black\", alpha=0.7)\n\n # Add a title\n ax.set_title(f\"Subject: {subject}\", fontsize=14)\n\n ax.xaxis.set_ticks([0, 2, 4, 6, 8])\n fig.text(0.5, 0.02, \"Days\", fontsize=14)\n fig.text(0.03, 0.5, \"Reaction time (ms)\", rotation=90, fontsize=14, va=\"center\")\n\n return axes\n\nplot_data(data);\n\n\n\n\nThe model\n\\[\n\\begin{aligned}\n\\mu_i & = \\beta_0 + \\beta_1 \\text{Days}_i + u_{0i} + u_{1i}\\text{Days}_i \\\\\n\\beta_0 & \\sim \\text{Normal} \\\\\n\\beta_1 & \\sim \\text{Normal} \\\\\nu_{0i} & \\sim \\text{Normal}(0, \\sigma_{u_0}) \\\\\nu_{1i} & \\sim \\text{Normal}(0, \\sigma_{u_1}) \\\\\n\\sigma_{u_0} & \\sim \\text{HalfNormal} \\\\\n\\sigma_{u_1} & \\sim \\text{HalfNormal} \\\\\n\\sigma & \\sim \\text{HalfStudentT} \\\\\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma)\n\\end{aligned}\n\\]\nWritten in a slightly different way (and omitting some priors)…\n\\[\n\\begin{aligned}\n\\mu_i & = \\text{Intercept}_i + \\text{Slope}_i \\text{Days}_i \\\\\n\\text{Intercept}_i & = \\beta_0 + u_{0i} \\\\\n\\text{Slope}_i & = \\beta_1 + u_{1i} \\\\\n\\sigma & \\sim \\text{HalfStudentT} \\\\\n\\text{Reaction}_i & \\sim \\text{Normal}(\\mu_i, \\sigma) \\\\\n\\end{aligned}\n\\]\nWe can see both the intercept and the slope are made of a “common” component and a “subject-specific” deflection.\nUnder the general representation written above…\n\\[\n\\begin{aligned}\n\\pmb{\\mu} &= \\mathbf{X}\\pmb{\\beta} + \\mathbf{Z}\\pmb{u} \\\\\n\\pmb{\\beta} &\\sim \\text{Normal} \\\\\n\\pmb{u} &\\sim \\text{Normal}(0, \\text{diag}(\\sigma_{\\pmb{u}})) \\\\\n\\sigma &\\sim \\text{HalfStudenT} \\\\\n\\sigma_{\\pmb{u}} &\\sim \\text{HalfNormal} \\\\\nY_i &\\sim \\text{Normal}(\\mu_i, \\sigma)\n\\end{aligned}\n\\]\n\nmodel = bmb.Model(\"Reaction ~ 1 + Days + (1 + Days | Subject)\", data, categorical=\"Subject\")\nmodel\n\n Formula: Reaction ~ 1 + Days + (1 + Days | Subject)\n Family: gaussian\n Link: mu = identity\n Observations: 180\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 298.5079, sigma: 261.0092)\n Days ~ Normal(mu: 0.0, sigma: 48.8915)\n \n Group-level effects\n 1|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092))\n Days|Subject ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915))\n \n Auxiliary parameters\n Reaction_sigma ~ HalfStudentT(nu: 4.0, sigma: 56.1721)\n\n\n\nmodel.build()\nmodel.graph()\n\n\n\n\n\ndm = model.response_component.design\ndm\n\nDesignMatrices\n\n (rows, cols)\nResponse: (180,)\nCommon: (180, 
2)\nGroup-specific: (180, 36)\n\nUse .reponse, .common, or .group to access the different members.\n\n\n\nprint(dm.response, \"\\n\")\nprint(np.array(dm.response)[:5])\n\nResponseMatrix \n name: Reaction\n kind: numeric\n shape: (180,)\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[249.56 258.7047 250.8006 321.4398 356.8519]\n\n\n\nprint(dm.common, \"\\n\")\nprint(np.array(dm.common)[:5])\n\nCommonEffectsMatrix with shape (180, 2)\nTerms: \n Intercept \n kind: intercept\n column: 0\n Days \n kind: numeric\n column: 1\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[[1 0]\n [1 1]\n [1 2]\n [1 3]\n [1 4]]\n\n\n\nprint(dm.group, \"\\n\")\nprint(np.array(dm.group)[:14])\n\nGroupEffectsMatrix with shape (180, 36)\nTerms: \n 1|Subject \n kind: intercept\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350',\n '351', '352', '369', '370', '371', '372']\n columns: 0:18\n Days|Subject \n kind: numeric\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350',\n '351', '352', '369', '370', '371', '372']\n columns: 18:36\n\nTo access the actual design matrix do 'np.array(this_obj)' \n\n[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n\n\n\nmodel.response_component.intercept_term\n\nCommonTerm( \n name: Intercept,\n prior: Normal(mu: 298.5079, sigma: 261.0092),\n shape: (180,),\n categorical: False\n)\n\n\n\nmodel.response_component.common_terms\n\n{'Days': CommonTerm( \n name: Days,\n prior: Normal(mu: 0.0, sigma: 48.8915),\n shape: (180,),\n categorical: False\n )}\n\n\n\nmodel.response_component.group_specific_terms\n\n{'1|Subject': GroupSpecificTerm( \n name: 1|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Days|Subject': GroupSpecificTerm( \n name: Days|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n )}\n\n\nTerms not only exist in the Bambi world. There are three (!!) 
types of terms being created.\n\nFormulae has its terms\n\nAgnostic information design matrix information\n\nBambi has its terms\n\nContains both the information given by formulae and metadata relevant to Bambi (priors)\n\nThe backend has its terms\n\nAccept a Bambi term and knows how to “compile” itself to that backend.\nE.g. the PyMC backend terms know how to write one or more PyMC distributions out of a Bambi term.\n\n\nCould we have multiple backends? In principle yes. But there’s one aspect which is convoluted, dims and coords, and the solution we found (not the best) prevented us from separating all stuff and making the front-end completely independent of the backend.\nFormulae terms\n\ndm.common.terms\n\n{'Intercept': Intercept(), 'Days': Term([Variable(Days)])}\n\n\n\ndm.group.terms\n\n{'1|Subject': GroupSpecificTerm(\n expr= Intercept(),\n factor= Term([Variable(Subject)])\n ),\n 'Days|Subject': GroupSpecificTerm(\n expr= Term([Variable(Days)]),\n factor= Term([Variable(Subject)])\n )}\n\n\nBambi terms\n\nmodel.response_component.terms\n\n{'Intercept': CommonTerm( \n name: Intercept,\n prior: Normal(mu: 298.5079, sigma: 261.0092),\n shape: (180,),\n categorical: False\n ),\n 'Days': CommonTerm( \n name: Days,\n prior: Normal(mu: 0.0, sigma: 48.8915),\n shape: (180,),\n categorical: False\n ),\n '1|Subject': GroupSpecificTerm( \n name: 1|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 261.0092)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Days|Subject': GroupSpecificTerm( \n name: Days|Subject,\n prior: Normal(mu: 0.0, sigma: HalfNormal(sigma: 48.8915)),\n shape: (180, 18),\n categorical: False,\n groups: ['308', '309', '310', '330', '331', '332', '333', '334', '335', '337', '349', '350', '351', '352', '369', '370', '371', '372']\n ),\n 'Reaction': ResponseTerm( \n name: Reaction,\n prior: Normal(mu: 0.0, sigma: 1.0),\n shape: (180,),\n categorical: False\n )}\n\n\nRandom idea: Perhaps in a future we can make Bambi more extensible by using generics-based API and some type of register. I haven’t thought about it at all yet." + "section": "", + "text": "Hierarchical Linear Regression (Pigs dataset)\n\nimport arviz as az\nimport bambi as bmb\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport statsmodels.api as sm\nimport xarray as xr\n\n\naz.style.use(\"arviz-darkgrid\")\nSEED = 7355608\n\nIn this notebook we demo how to perform a Bayesian hierarchical linear regression.\nWe’ll use a multi-level dataset included with statsmodels containing the growth curve of pigs. 
Since the weight of each pig is measured multiple times, we’ll estimate a model that allows varying intercepts and slopes for time, for each pig.\n\nLoad data\n\n# Load up data from statsmodels\ndata = sm.datasets.get_rdataset(\"dietox\", \"geepack\").data\ndata.describe()\n\n\n\n\n\n \n \n \n Pig\n Litter\n Start\n Weight\n Feed\n Time\n \n \n \n \n count\n 861.000000\n 861.000000\n 861.000000\n 861.000000\n 789.000000\n 861.000000\n \n \n mean\n 6238.319396\n 12.135889\n 25.672701\n 60.725769\n 80.728645\n 6.480836\n \n \n std\n 1323.845928\n 7.427252\n 3.624336\n 24.978881\n 52.877736\n 3.444735\n \n \n min\n 4601.000000\n 1.000000\n 15.000000\n 15.000000\n 3.300003\n 1.000000\n \n \n 25%\n 4857.000000\n 5.000000\n 23.799990\n 38.299990\n 32.800003\n 3.000000\n \n \n 50%\n 5866.000000\n 11.000000\n 25.700000\n 59.199980\n 74.499996\n 6.000000\n \n \n 75%\n 8050.000000\n 20.000000\n 27.299990\n 81.199950\n 123.000000\n 9.000000\n \n \n max\n 8442.000000\n 24.000000\n 35.399990\n 117.000000\n 224.500000\n 12.000000\n \n \n\n\n\n\n\n\nModel\n\\[\nY_i = \\beta_{0, i} + \\beta_{1, i} X + \\epsilon_i\n\\]\nwith\n\\(\\beta_{0, i} = \\beta_0 + \\alpha_{0, i}\\)\n\\(\\beta_{1, i} = \\beta_1 + \\alpha_{1, i}\\)\nwhere \\(\\beta_0\\) and \\(\\beta_1\\) are usual common intercept and slope you find in a linear regression. \\(\\alpha_{0, i}\\) and \\(\\alpha_{1, i}\\) are the group specific components for the pig \\(i\\), influencing the intercept and the slope respectively. Finally \\(\\epsilon_i\\) is the random error we always see in this type of models, assumed to be Gaussian with mean 0. Note that here we use “common” and “group specific” effects to denote what in many fields are known as “fixed” and “random” effects, respectively.\nWe use the formula syntax to specify the model. Previously, you had to specify common and group specific components separately. Now, thanks to formulae, you can specify model formulas just as you would do with R packages like lme4 and brms. In a nutshell, the term on the left side tells Weight is the response variable, Time on the right-hand side tells we include a main effect for the variable Time, and (Time|Pig) indicates we want to allow a each pig to have its own slope for Time as well as its own intercept (which is implicit). 
If we only wanted different intercepts, we would have written Weight ~ Time + (1 | Pig) and if we wanted slopes specific to each pig without including a pig specific intercept, we would write Weight ~ Time + (0 + Time | Pig).\n\nmodel = bmb.Model(\"Weight ~ Time + (Time|Pig)\", data)\nresults = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Weight_sigma, Intercept, Time, 1|Pig_sigma, 1|Pig_offset, Time|Pig_sigma, Time|Pig_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:25<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 26 seconds.\n\n\nWe can print the model to have a summary of the details\n\nmodel\n\n Formula: Weight ~ Time + (Time|Pig)\n Family: gaussian\n Link: mu = identity\n Observations: 861\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 60.7258, sigma: 133.0346)\n Time ~ Normal(mu: 0, sigma: 18.1283)\n \n Group-level effects\n 1|Pig ~ Normal(mu: 0, sigma: HalfNormal(sigma: 133.0346))\n Time|Pig ~ Normal(mu: 0, sigma: HalfNormal(sigma: 18.1283))\n Auxiliary parameters\n Weight_sigma ~ HalfStudentT(nu: 4, sigma: 24.9644)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\nSince we have not specified prior distributions for the parameters in the model, Bambi has chosen sensible defaults for us. We can explore these priors through samples generated from them with a call to Model.plot_priors(), which plots a kernel density estimate for each prior.\n\nmodel.plot_priors();\n\nSampling: [1|Pig_sigma, Intercept, Time, Time|Pig_sigma, Weight_sigma]\n\n\n\n\n\nNow we are ready to check the results. Using az.plot_trace() we get traceplots that show the values sampled from the posteriors and density estimates that gives us an idea of the shape of the posterior distribution of our parameters.\nIn this case it is very convenient to use compact=True. We tell ArviZ to plot all the group specific posteriors in the same panel which saves space and makes it easier to compare group specific posteriors. Thus, we’ll have a panel with all the group specific intercepts, and another panel with all the group specific slopes. If we used compact=False, which is the default, we would end up with a huge number of panels which would make the plot unreadable.\n\n# Plot posteriors\naz.plot_trace(\n results,\n var_names=[\"Intercept\", \"Time\", \"1|Pig\", \"Time|Pig\", \"Weight_sigma\"],\n compact=True,\n);\n\n\n\n\nThe same plot could have been generated with less typing by calling\naz.plot_trace(results, var_names=[\"~1|Pig_sigma\", \"~Time|Pig_sigma\"], compact=True);\nwhich uses an alternative notation to pass var_names based on the negation symbol in Python, ~. There we are telling ArviZ to plot all the variables in the InferenceData object results, except from 1|Pig_sigma and Time|Pig_sigma.\nCan’t believe it? Come on, run this notebook on your side and have a try!\nThe plots generated by az.plot_trace() are enough to be confident that the sampler did a good job and conclude about plausible values for the distribution of each parameter in the model. 
But if we want to, and it is a good idea to do it, we can get umerical summaries for the posteriors with az.summary().\n\naz.summary(results, var_names=[\"Intercept\", \"Time\", \"1|Pig_sigma\", \"Time|Pig_sigma\", \"Weight_sigma\"])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 15.741\n 0.543\n 14.781\n 16.814\n 0.030\n 0.021\n 330.0\n 719.0\n 1.01\n \n \n Time\n 6.944\n 0.084\n 6.802\n 7.108\n 0.005\n 0.004\n 236.0\n 424.0\n 1.03\n \n \n 1|Pig_sigma\n 4.537\n 0.423\n 3.811\n 5.369\n 0.018\n 0.013\n 586.0\n 1161.0\n 1.00\n \n \n Time|Pig_sigma\n 0.662\n 0.063\n 0.546\n 0.774\n 0.003\n 0.002\n 443.0\n 931.0\n 1.00\n \n \n Weight_sigma\n 2.461\n 0.064\n 2.348\n 2.580\n 0.001\n 0.001\n 2534.0\n 1534.0\n 1.00\n \n \n\n\n\n\n\n\nEstimated regression line\nHere we’ll visualize the regression equations we have sampled for a particular pig and then we’ll compare the mean regression equation for all the 72 pigs in the dataset.\nIn the following plot we can see the 2000 linear regressions we have sampled for the pig ‘4601’. The mean regression line is plotted in black and the observed weights for this pig are respresented by the blue dots.\n\n# The ID of the first pig is '4601'\ndata_0 = data[data[\"Pig\"] == 4601][[\"Time\", \"Weight\"]]\ntime = np.array([1, 12])\n\nposterior = az.extract_dataset(results)\nintercept_common = posterior[\"Intercept\"]\nslope_common = posterior[\"Time\"]\n\nintercept_specific_0 = posterior[\"1|Pig\"].sel(Pig__factor_dim=\"4601\")\nslope_specific_0 = posterior[\"Time|Pig\"].sel(Pig__factor_dim=\"4601\")\n\na = (intercept_common + intercept_specific_0)\nb = (slope_common + slope_specific_0)\n\n# make time a DataArray so we can get automatic broadcasting\ntime_xi = xr.DataArray(time)\nplt.plot(time_xi, (a + b * time_xi).T, color=\"C1\", lw=0.1)\nplt.plot(time_xi, a.mean() + b.mean() * time_xi, color=\"black\")\nplt.scatter(data_0[\"Time\"], data_0[\"Weight\"], zorder=2)\nplt.ylabel(\"Weight (kg)\")\nplt.xlabel(\"Time (weeks)\");\n\n/tmp/ipykernel_25969/3021069513.py:5: FutureWarning: extract_dataset has been deprecated, please use extract\n posterior = az.extract_dataset(results)\n\n\n\n\n\nNext, we calculate the mean regression line for each pig and show them together in one plot. Here we clearly see each pig has a different pair of intercept and slope.\n\nintercept_group_specific = posterior[\"1|Pig\"]\nslope_group_specific = posterior[\"Time|Pig\"]\na = intercept_common.mean() + intercept_group_specific.mean(\"sample\")\nb = slope_common.mean() + slope_group_specific.mean(\"sample\")\ntime_xi = xr.DataArray(time)\nplt.plot(time_xi, (a + b * time_xi).T, color=\"C1\", alpha=0.7, lw=0.8)\nplt.ylabel(\"Weight (kg)\")\nplt.xlabel(\"Time (weeks)\");\n\n\n\n\nWe can get credible interval plots with ArviZ. Here the line indicates a 94% credible interval calculated as higher posterior density, the thicker line represents the interquartile range and the dot is the median. 
We can quickly note two things:\n\nThe uncertainty about the intercept estimate is much higher than the uncertainty about the Time slope.\nThe credible interval for Time is far away from 0, so we can be confident there’s a positive relationship the Weight of the pigs and Time.\n\nWe’re not making any great discovering by stating that as time passes we expect the pigs to weight more, but this very simple example can be used as a starting point in applications where the relationship between the variables is not that clear beforehand.\n\naz.plot_forest(\n results,\n var_names=[\"Intercept\", \"Time\"],\n figsize=(8, 2),\n);\n\n\n\n\nWe can also plot the posterior overlaid with a region of practical equivalence (ROPE). This region indicates a range of parameter values that are considered to be practically equivalent to some reference value of interest to the particular application, for example 0. In the following plot we can see that all our posterior distributions fall outside of this range.\n\naz.plot_posterior(results, var_names=[\"Intercept\", \"Time\"], ref_val=0, rope=[-1, 1]);\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nmatplotlib : 3.6.2\nxarray : 2022.11.0\nnumpy : 1.23.5\narviz : 0.14.0\nstatsmodels: 0.13.2\nbambi : 0.9.3\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { "objectID": "notebooks/getting_started.html", @@ -245,39 +224,32 @@ "text": "Bambi requires a working Python interpreter (3.7+). We recommend installing Python and key numerical libraries using the Anaconda Distribution, which has one-click installers available on all major platforms.\nAssuming a standard Python environment is installed on your machine (including pip), Bambi itself can be installed in one line using pip:\npip install bambi\nAlternatively, if you want the bleeding edge version of the package, you can install from GitHub:\npip install git+https://github.com/bambinos/bambi.git\n\n\nSuppose we have data for a typical within-subjects psychology experiment with 2 experimental conditions. Stimuli are nested within condition, and subjects are crossed with condition. We want to fit a model predicting reaction time (RT) from the common effect of condition, group specific intercepts for subjects, group specific condition slopes for students, and group specific intercepts for stimuli. 
Using Bambi we can fit this model and summarize its results as follows:\nimport bambi as bmb\n\n# Assume we already have our data loaded as a pandas DataFrame\nmodel = bmb.Model(\"rt ~ condition + (condition|subject) + (1|stimulus)\", data)\nresults = model.fit(draws=5000, chains=2)\naz.plot_trace(results)\naz.summary(results)\n\n\n\n\n\n\nimport arviz as az\nimport bambi as bmb\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\n\n\n\n\nCreating a new model in Bambi is simple:\n\n# Read in a tab-delimited file containing our data\ndata = pd.read_table(\"data/my_data.txt\", sep=\"\\t\")\n\n# Initialize the model\nmodel = bmb.Model(\"y ~ x + z\", data)\n\n# Inspect model object\nmodel\n\n Formula: y ~ x + z\n Family: gaussian\n Link: mu = identity\n Observations: 50\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 0.1852, sigma: 2.5649)\n x ~ Normal(mu: 0, sigma: 2.231)\n z ~ Normal(mu: 0, sigma: 2.4374)\n Auxiliary parameters\n y_sigma ~ HalfStudentT(nu: 4, sigma: 1.013)\n\n\nTypically, we will initialize a Bambi Model by passing it a model formula and a pandas DataFrame. Other arguments such as family, priors, and link are available. By default, it uses family=\"gaussian\" which implies a linear regression with normal error. We get back a model that we can immediately fit by calling model.fit().\n\n\n\nAs with most mixed effect modeling packages, Bambi expects data in “long” format–meaning that each row should reflects a single observation at the most fine-grained level of analysis. For example, given a model where students are nested into classrooms and classrooms are nested into schools, we would want data with the following kind of structure:\n\n\n\n\nstudent\ngender\ngpa\nclass\nschool\n\n\n\n\n1\nF\n3.4\n1\n1\n\n\n2\nF\n3.7\n1\n1\n\n\n3\nM\n2.2\n1\n1\n\n\n4\nF\n3.9\n2\n1\n\n\n5\nM\n3.6\n2\n1\n\n\n6\nM\n3.5\n2\n1\n\n\n7\nF\n2.8\n3\n2\n\n\n8\nM\n3.9\n3\n2\n\n\n9\nF\n4.0\n3\n2\n\n\n\n\n\n\n\n\nModels are specified in Bambi using a formula-based syntax similar to what one might find in R packages like lme4 or brms using the Python formulae library. A couple of examples illustrate the breadth of models that can be easily specified in Bambi:\n\ndata = pd.read_csv(\"data/rrr_long.csv\")\ndata.head(10)\n\n\n\n\n\n \n \n \n uid\n condition\n gender\n age\n study\n self_perf\n stimulus\n value\n \n \n \n \n 0\n 1.0\n 0.0\n 1.0\n 24.0\n 0.0\n 8.0\n rating_c1\n 3.0\n \n \n 1\n 2.0\n 1.0\n 0.0\n 27.0\n 0.0\n 9.0\n rating_c1\n 7.0\n \n \n 2\n 3.0\n 0.0\n 1.0\n 25.0\n 0.0\n 3.0\n rating_c1\n 5.0\n \n \n 3\n 5.0\n 0.0\n 1.0\n 20.0\n 0.0\n 3.0\n rating_c1\n 7.0\n \n \n 4\n 8.0\n 1.0\n 1.0\n 19.0\n 0.0\n 6.0\n rating_c1\n 6.0\n \n \n 5\n 9.0\n 0.0\n 1.0\n 22.0\n 0.0\n 3.0\n rating_c1\n 6.0\n \n \n 6\n 10.0\n 1.0\n 1.0\n 49.0\n 0.0\n 4.0\n rating_c1\n 6.0\n \n \n 7\n 11.0\n 0.0\n 0.0\n 24.0\n 0.0\n 5.0\n rating_c1\n 7.0\n \n \n 8\n 12.0\n 1.0\n 0.0\n 26.0\n 0.0\n 6.0\n rating_c1\n 2.0\n \n \n 9\n 13.0\n 0.0\n 1.0\n 23.0\n 0.0\n 7.0\n rating_c1\n 1.0\n \n \n\n\n\n\n\n# Number of rows with missing values\ndata.isna().any(axis=1).sum()\n\n401\n\n\nWe pass dropna=True to tell Bambi to drop rows containing missing values. 
The number of rows dropped is different from the number of rows with missing values because Bambi only considers columns involved in the model.\n\n# Common (or fixed) effects only\nbmb.Model(\"value ~ condition + age + gender\", data, dropna=True)\n\nAutomatically removing 33/6940 rows from the dataset.\n\n\n Formula: value ~ condition + age + gender\n Family: gaussian\n Link: mu = identity\n Observations: 6907\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 4.5457, sigma: 28.4114)\n condition ~ Normal(mu: 0, sigma: 12.0966)\n age ~ Normal(mu: 0, sigma: 1.3011)\n gender ~ Normal(mu: 0, sigma: 13.1286)\n Auxiliary parameters\n value_sigma ~ HalfStudentT(nu: 4, sigma: 2.4186)\n\n\n\n# Common effects and group specific (or random) intercepts for subject\nbmb.Model(\"value ~ condition + age + gender + (1|uid)\", data, dropna=True)\n\nAutomatically removing 33/6940 rows from the dataset.\n\n\n Formula: value ~ condition + age + gender + (1|uid)\n Family: gaussian\n Link: mu = identity\n Observations: 6907\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 4.5457, sigma: 28.4114)\n condition ~ Normal(mu: 0, sigma: 12.0966)\n age ~ Normal(mu: 0, sigma: 1.3011)\n gender ~ Normal(mu: 0, sigma: 13.1286)\n \n Group-level effects\n 1|uid ~ Normal(mu: 0, sigma: HalfNormal(sigma: 28.4114))\n Auxiliary parameters\n value_sigma ~ HalfStudentT(nu: 4, sigma: 2.4186)\n\n\n\n# Multiple, complex group specific effects with both\n# group specific slopes and group specific intercepts\nbmb.Model(\"value ~ condition + age + gender + (1|uid) + (condition|study) + (condition|stimulus)\", data, dropna=True)\n\nAutomatically removing 33/6940 rows from the dataset.\n\n\n Formula: value ~ condition + age + gender + (1|uid) + (condition|study) + (condition|stimulus)\n Family: gaussian\n Link: mu = identity\n Observations: 6907\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 4.5457, sigma: 28.4114)\n condition ~ Normal(mu: 0, sigma: 12.0966)\n age ~ Normal(mu: 0, sigma: 1.3011)\n gender ~ Normal(mu: 0, sigma: 13.1286)\n \n Group-level effects\n 1|uid ~ Normal(mu: 0, sigma: HalfNormal(sigma: 28.4114))\n 1|study ~ Normal(mu: 0, sigma: HalfNormal(sigma: 28.4114))\n condition|study ~ Normal(mu: 0, sigma: HalfNormal(sigma: 12.0966))\n 1|stimulus ~ Normal(mu: 0, sigma: HalfNormal(sigma: 28.4114))\n condition|stimulus ~ Normal(mu: 0, sigma: HalfNormal(sigma: 12.0966))\n Auxiliary parameters\n value_sigma ~ HalfStudentT(nu: 4, sigma: 2.4186)\n\n\nEach of the above examples specifies a full model that can be fitted using PyMC by doing\nresults = model.fit()\n\n\nWhen a categorical common effect with N levels is added to a model, by default, it is coded by N-1 dummy variables (i.e., reduced-rank coding). For example, suppose we write \"y ~ condition + age + gender\", where condition is a categorical variable with 4 levels, and age and gender are continuous variables. Then our model would contain an intercept term (added to the model by default, as in R), three dummy-coded variables (each contrasting the first level of condition with one of the subsequent levels), and continuous predictors for age and gender. Suppose, however, that we would rather use full-rank coding of conditions. If we explicitly remove the intercept –as in \"y ~ 0 + condition + age + gender\"– then we get the desired effect. 
Now, the intercept is no longer included, and condition will be coded using 4 dummy indicators, each one coding for the presence or absence of the respective condition without reference to the other conditions.\nGroup specific effects are handled in a comparable way. When adding group specific intercepts, coding is always full-rank (e.g., when adding group specific intercepts for 100 schools, one gets 100 dummy-coded indicators coding each school separately, and not 99 indicators contrasting each school with the very first one). For group specific slopes, coding proceeds the same way as for common effects. The group specific effects specification \"(condition|subject)\" would add an intercept for each subject, plus N-1 condition slopes (each coded with respect to the first, omitted, level as the referent). If we instead specify \"(0+condition|subject)\", we get N condition slopes and no intercepts.\n\n\n\nOnce a model is fully specified, we need to run the PyMC sampler to generate parameter estimates. If we’re using the one-line fit() interface, sampling will begin right away:\n\nmodel = bmb.Model(\"value ~ condition + age + gender + (1|uid)\", data, dropna=True)\nresults = model.fit()\n\nAutomatically removing 33/6940 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [value_sigma, Intercept, condition, age, gender, 1|uid_sigma, 1|uid_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:28<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 28 seconds.\n\n\nThe above code obtains 1,000 draws (the default value) and return them as an InferenceData instance.\n.. tip:: InferenceData is a rich data structure to store and manipulate data such as posterior samples, prior/posterior predictive samples, observations, etc. It is based on xarray, a library offering N-dimensional labeled arrays (you can think of it as a generalization of both Numpy arrays and Pandas dataframes). To learn how to perform common operations with InferenceData, like indexing, selection etc please check this and for details of the InferenceData Schema see this specification.\nIn this case, the fit() method accepts optional keyword arguments to pass onto PyMC’s sample() method, so any methods accepted by sample() can be specified here. We can also explicitly set the number of draws via the draws argument. For example, if we call fit(draws=2000, chains=2), the PyMC sampler will sample two chains in parallel, drawing 2,000 draws for each one. We could also specify starting parameter values, the step function to use, and so on (for full details, see the PyMC documentation).\nAlternatively, we can build a model, but not fit it.\n\nmodel = bmb.Model(\"value ~ condition + age + gender + (1|uid)\", data, dropna=True)\nmodel.build()\n\nAutomatically removing 33/6940 rows from the dataset.\n\n\nBuilding without sampling can be useful if we want to inspect the internal PyMC model before we start the (potentially long) sampling process. Once we’re satisfied, and wish to run the sampler, we can then simply call model.fit(), and the sampler will start running. 
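As a small sketch of that build-first workflow (assuming the same data frame as in the examples above), we can compile the model, inspect the underlying PyMC objects, and only then sample:\n\nmodel = bmb.Model(\"value ~ condition + age + gender + (1|uid)\", data, dropna=True)\n# Compile the underlying PyMC model without starting the sampler\nmodel.build()\n# Inspect the PyMC model Bambi generated before committing to a long run\nprint(model.backend.model)\n# Once we are satisfied, sampling proceeds as usual\nresults = model.fit(draws=1000)\n\nThe model.backend.model attribute used here is discussed in more detail at the end of this guide. 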
Another good reason to build a model is to generate plot of the marginal priors using model.plot_priors().\n\nmodel.plot_priors();\n\nSampling: [1|uid_sigma, Intercept, age, condition, gender, value_sigma]\n\n\n\n\n\n\n\n\n\nBayesian inference requires one to specify prior probability distributions that represent the analyst’s belief (in advance of seeing the data) about the likely values of the model parameters. In practice, analysts often lack sufficient information to formulate well-defined priors, and instead opt to use “weakly informative” priors that mainly serve to keep the model from exploring completely pathological parts of the parameter space (e.g., when defining a prior on the distribution of human heights, a value of 3,000 cms should be assigned a probability of exactly 0).\nBy default, Bambi will intelligently generate weakly informative priors for all model terms, by loosely scaling them to the observed data. Currently, Bambi uses a methodology very similar to the one described in the documentation of the R package rstanarm. While the default priors will behave well in most typical settings, there are many cases where an analyst will want to specify their own priors–and in general, when informative priors are available, it’s a good idea to use them.\nFortunately, Bambi is built on top of PyMC, which means that we can seamlessly use any of the over 40 Distribution classes defined in PyMC. We can specify such priors in Bambi using the Prior class, which initializes with a name argument (which must map on exactly to the name of a valid PyMC Distribution) followed by any of the parameters accepted by the corresponding distribution. For example:\n\n# A Laplace prior with mean of 0 and scale of 10\nmy_favorite_prior = bmb.Prior(\"Laplace\", mu=0, b=bmb.Prior(\"HalfNormal\", sigma=1))\n\n# Set the prior when adding a term to the model; more details on this below.\npriors = {\"1|uid\": my_favorite_prior}\nbmb.Model(\"value ~ condition + (1|uid)\", data, priors=priors, dropna=True)\n\nAutomatically removing 9/6940 rows from the dataset.\n\n\n Formula: value ~ condition + (1|uid)\n Family: gaussian\n Link: mu = identity\n Observations: 6931\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 4.5516, sigma: 8.4548)\n condition ~ Normal(mu: 0, sigma: 12.1019)\n \n Group-level effects\n 1|uid ~ Laplace(mu: 0, b: HalfNormal(sigma: 1))\n Auxiliary parameters\n value_sigma ~ HalfStudentT(nu: 4, sigma: 2.4197)\n\n\nPriors specified using the Prior class can be nested to arbitrary depths–meaning, we can set any of a given prior’s argument to point to another Prior instance. 
This is particularly useful when specifying hierarchical priors on group specific effects, where the individual group specific slopes or intercepts are constrained to share a common source distribution:\n\nsubject_sd = bmb.Prior(\"HalfCauchy\", beta=5)\nsubject_prior = bmb.Prior(\"Normal\", mu=0, sd=subject_sd)\npriors = {\"1|uid\": subject_prior}\nbmb.Model(\"value ~ condition + (1|uid)\", data, priors=priors, dropna=True)\n\nAutomatically removing 9/6940 rows from the dataset.\n\n\n Formula: value ~ condition + (1|uid)\n Family: gaussian\n Link: mu = identity\n Observations: 6931\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 4.5516, sigma: 8.4548)\n condition ~ Normal(mu: 0, sigma: 12.1019)\n \n Group-level effects\n 1|uid ~ Normal(mu: 0, sd: HalfCauchy(beta: 5))\n Auxiliary parameters\n value_sigma ~ HalfStudentT(nu: 4, sigma: 2.4197)\n\n\nThe above prior specification indicates that the individual subject intercepts are to be treated as if they are randomly sampled from the same underlying normal distribution, where the variance of that normal distribution is parameterized by a separate hyperprior (a half-cauchy with beta = 5).\nIt’s important to note that explicitly setting priors by passing in Prior objects will disable Bambi’s default behavior of scaling priors to the data in order to ensure that they remain weakly informative. This means that if you specify your own prior, you have to be sure not only to specify the distribution you want, but also any relevant scale parameters. For example, the 0.5 in Prior(\"Normal\", mu=0, sd=0.5) will be specified on the scale of the data, not the bounded partial correlation scale that Bambi uses for default priors. This means that if your outcome variable has a mean value of 10,000 and a standard deviation of, say, 1,000, you could potentially have some problems getting the model to produce reasonable estimates, since from the perspective of the data, you’re specifying an extremely strong prior.\n\n\nBambi’s priors are a thin layer on top of PyMC distributions. If you want to ask for a prior distribution by name, it must be the name of a PyMC distribution. But sometimes we want to use more complex distributions as priors. For all those cases, Bambi’s Prior class allow users to pass a function that returns a distribution that will be used as the prior. See the following example:\ndef CustomPrior(name, *args, dims=None, **kwargs):\n return pm.Normal(name, *args, dims=dims, **kwargs)\n\npriors = {\"x\": Prior(\"CustomPrior\", mu=0, sigma=5, dist=CustomPrior)}\nmodel = Model(\"y ~ x\", data, priors=priors)\nThe example above is trival because it’s just a wrapper of the pm.Normal distribution. But we can use this pattern to construct more complex distributions, such as a Truncated Laplace distribution shown below.\ndef TruncatedLaplace(name, mu,b,lower,upper,*args, dims=None, **kwargs):\n lap_dist = pm.Laplace.dist(mu=mu, b=b)\n return pm.Truncated(name, lap_dist, lower=lower, upper=upper, *args, dims=dims, **kwargs)\nIn summary, custom priors allow for greater flexibility by combining existing PyMC distributions in different ways. If you need to use a distribution that’s not implemented in PyMC, please check the link for further details.\n\n\n\n\nBambi supports the construction of mixed models with non-normal response distributions (i.e., generalized linear mixed models, or GLMMs). 
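As an aside before turning to GLMMs: the TruncatedLaplace helper defined above can be attached to a term following the same pattern used for CustomPrior. The snippet below is a hypothetical sketch, with bounds and scale made up purely for illustration:\n\npriors = {\"x\": bmb.Prior(\"TruncatedLaplace\", mu=0, b=1, lower=-1, upper=1, dist=TruncatedLaplace)}\nmodel = bmb.Model(\"y ~ x\", data, priors=priors) 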
GLMMs are specified in the same way as LMMs, except that the user must specify the distribution to use for the response, and (optionally) the link function with which to transform the linear model prediction into the desired non-normal response. The easiest way to construct a GLMM is to simple set the family when creating the model:\n\ndata = bmb.load_data(\"admissions\")\nmodel = bmb.Model(\"admit ~ gre + gpa + rank\", data, family=\"bernoulli\")\nresults = model.fit()\n\nModeling the probability that admit==1\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, gre, gpa, rank]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:15<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 16 seconds.\n\n\nIf no link argument is explicitly set (see below), the canonical link function (or an otherwise sensible default) will be used. The following table summarizes the currently available families and their associated links:\n\n\n\n\n\n\n\n\n\nFamily name\nResponse distribution\nDefault link\n\n\n\n\nasymmetriclaplace\nAsymmetricLaplace\nidentity\n\n\nbernoulli\nBernoulli\nlogit\n\n\nbeta\nBeta\nlogit\n\n\nbeta_binomial\nBetaBinomial\nlogit\n\n\nbinomial\nBinomial\nlogit\n\n\ncategorical\nCategorical\nsoftmax\n\n\ncumulative\nCumulative\nlogit\n\n\nexponential\nExponential\nlog\n\n\ndirichlet_multinomial\nDirichletMultinomial\nlogit\n\n\ngamma\nGamma\ninverse\n\n\ngaussian\nNormal\nidentity\n\n\nhurdle_gamma\nHurdleGamma\nlog\n\n\nhurdle_lognormal\nHurdleLogNormal\nidentity\n\n\nhurdle_negativebinomial\nHurdleNegativeBinomial\nlog\n\n\nhurdle_poisson\nHurdlePoisson\nlog\n\n\nmultinomial\nMultinomial\nsoftmax\n\n\nnegativebinomial\nNegativeBinomial\nlog\n\n\nlaplace\nLaplace\nidentity\n\n\npoisson\nPoisson\nlog\n\n\nsratio\nStoppingRatio\nlogit\n\n\nt\nStudentT\nidentity\n\n\nvonmises\nVonMises\ntan(x / 2)\n\n\nwald\nInverseGaussian\ninverse squared\n\n\nweibull\nWeibull\nlog\n\n\nzero_inflated_binomial\nZeroInflatedBinomial\nlogit\n\n\nzero_inflated_negativebinomial\nZeroInflatedNegativeBinomial\nlog\n\n\nzero_inflated_poisson\nZeroInflatedPoisson\nlog\n\n\n\n\n\n\n\nFollowing the convention used in many R packages, the response distribution to use for a GLMM is specified in a Family class that indicates how the response variable is distributed, as well as the link function transforming the linear response to a non-linear one. Although the easiest way to specify a family is by name, using one of the options listed in the table above, users can also create and use their own family, providing enormous flexibility. In the following example, we show how the built-in Bernoulli family could be constructed on-the-fly:\n\nfrom scipy import special\n\n# Construct likelihood distribution ------------------------------\n# This must use a valid PyMC distribution name.\n# 'parent' is the name of the variable that represents the mean of the distribution. \n# The mean of the Bernoulli family is given by 'p'.\nlikelihood = bmb.Likelihood(\"Bernoulli\", parent=\"p\")\n\n# Set link function ----------------------------------------------\n# There are two alternative approaches.\n# 1. Pass a name that is known by Bambi\nlink = bmb.Link(\"logit\")\n\n# 2. 
Build everything from scratch\n# link: A function that maps the response to the linear predictor\n# linkinv: A function that maps the linear predictor to the response\n# linkinv_backend: A function that maps the linear predictor to the response\n# that works with PyTensor tensors.\n# bmb.math.sigmoid is a PyTensor tensor function wrapped by PyMC and Bambi \nlink = bmb.Link(\n \"my_logit\", \n link=special.logit,\n linkinv=special.expit,\n linkinv_backend=bmb.math.sigmoid\n)\n\n# Construct the family -------------------------------------------\n# Families are defined by a name, a Likelihood and a Link.\nfamily = bmb.Family(\"bernoulli\", likelihood, link)\n\n# Now it's business as usual\nmodel = bmb.Model(\"admit ~ gre + gpa + rank\", data, family=family)\nresults = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, gre, gpa, rank]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:11<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 12 seconds.\n\n\nThe above example produces results identical to simply setting family='bernoulli'.\nOne complication in specifying a custom Family is that one must pass both a link function and an inverse link function which must be able to operate over PyTensor tensors rather than numpy arrays, so you’ll probably need to rely on tensor operations provided in pytensor.tensor (many of which are also wrapped by PyMC) when defining a new link.\n\n\n\nWhen a model is fitted, it returns an InferenceData object containing data related to the model and the posterior. This object can be passed to many functions in ArviZ to obtain numerical and visual diagnostics and plots in general.\n\n\n\nTo visualize a plot of the posterior estimates and sample traces for all parameters, simply pass the InferenceData object to the ArviZ function az.plot_trace:\n\naz.plot_trace(results, compact=False);\n\n\n\n\nMore details on this plot are available in the ArviZ documentation.\n\n\n\nIf you prefer numerical summaries of the posterior estimates, you can use the az.summary() function from ArviZ, which provides a pandas DataFrame with some key summary and diagnostics info on the model parameters, such as the 94% highest posterior density intervals.\n\naz.summary(results)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -3.510\n 1.172\n -5.663\n -1.414\n 0.018\n 0.014\n 4173.0\n 1944.0\n 1.0\n \n \n gre\n 0.002\n 0.001\n 0.000\n 0.004\n 0.000\n 0.000\n 2012.0\n 1462.0\n 1.0\n \n \n gpa\n 0.793\n 0.328\n 0.181\n 1.406\n 0.006\n 0.005\n 2769.0\n 1954.0\n 1.0\n \n \n rank\n -0.567\n 0.129\n -0.815\n -0.340\n 0.003\n 0.002\n 2125.0\n 1646.0\n 1.0\n \n \n\n\n\n\nIf you want to view summaries or plots for specific parameters, you can pass a list of their names:\n\n# show the names of all variables stored in the InferenceData object\nlist(results.posterior.data_vars)\n\n['Intercept', 'gre', 'gpa', 'rank']\n\n\nYou can find detailed, worked examples of fitting Bambi models and working with the results in the example notebooks here.\n\n\n\nBambi is just a high-level interface to PyMC. As such, Bambi internally stores virtually all objects generated by PyMC, making it easy for users to retrieve, inspect, and modify those objects. 
For example, the Model class created by PyMC (as opposed to the Bambi class of the same name) is accessible from model.backend.model.\n\ntype(model.backend.model)\n\npymc.model.Model\n\n\n\nmodel.backend.model\n\n\\[\n \\begin{array}{rcl}\n \\text{Intercept} &\\sim & \\operatorname{N}(0,~26.6)\\\\\\text{gre} &\\sim & \\operatorname{N}(0,~0.0217)\\\\\\text{gpa} &\\sim & \\operatorname{N}(0,~6.58)\\\\\\text{rank} &\\sim & \\operatorname{N}(0,~2.65)\\\\\\text{admit} &\\sim & \\operatorname{Bern}(f(\\text{Intercept},~\\text{rank},~\\text{gpa},~\\text{gre}))\n \\end{array}\n \\]\n\n\n\nmodel.backend.model.observed_RVs\n\n[admit ~ Bern(f(Intercept, rank, gpa, gre))]\n\n\n\nmodel.backend.model.unobserved_RVs\n\n[Intercept ~ N(0, 26.6),\n gre ~ N(0, 0.0217),\n gpa ~ N(0, 6.58),\n rank ~ N(0, 2.65)]" }, { - "objectID": "notebooks/categorical_regression.html", - "href": "notebooks/categorical_regression.html", - "title": "Bambi", - "section": "", - "text": "In this example, we will use the categorical family to model outcomes with more than two categories. The examples in this notebook were constructed by Tomás Capretto, and assembled into this example by Tyler James Burch (@tjburch on GitHub).\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\n\nfrom matplotlib.lines import Line2D\n\n\nSEED = 1234\naz.style.use(\"arviz-darkgrid\")\n\nWhen modeling binary outcomes with Bambi, the Bernoulli family is used. The multivariate generalization of the Bernoulli family is the Categorical family, and with it, we can model an arbitrary number of outcome categories.\n\n\nTo start, we will create a toy dataset with three classes.\n\nrng = np.random.default_rng(SEED)\nx = np.hstack([rng.normal(m, s, size=50) for m, s in zip([-2.5, 0, 2.5], [1.2, 0.5, 1.2])])\ny = np.array([\"A\"] * 50 + [\"B\"] * 50 + [\"C\"] * 50)\n\ncolors = [\"C0\"] * 50 + [\"C1\"] * 50 + [\"C2\"] * 50\nplt.scatter(x, np.random.uniform(size=150), color=colors)\nplt.xlabel(\"x\")\nplt.ylabel(\"y\");\n\n\n\n\nHere we have 3 classes, generated from three normal distributions: \\(N(-2.5, 1.2)\\), \\(N(0, 0.5)\\), and \\(N(2.5, 1.2)\\). Creating a model to fit these distributions,\n\ndata = pd.DataFrame({\"y\": y, \"x\": x})\nmodel = bmb.Model(\"y ~ x\", data, family=\"categorical\")\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nNote that we pass the family=\"categorical\" argument to Bambi’s Model method in order to call the categorical family. 
Here, the values of the response variable are strings (“A”, “B”, “C”); however, they can also be pd.Categorical objects.\nNext we will use posterior predictions to visualize the mean class probability across the \(x\) spectrum.\n\nx_new = np.linspace(-5, 5, num=200)\nmodel.predict(idata, data=pd.DataFrame({\"x\": x_new}))\np = idata.posterior[\"y_mean\"].sel(draw=slice(0, None, 10))\n\nfor j, g in enumerate(\"ABC\"):\n plt.plot(x_new, p.sel({\"y_dim\":g}).stack(samples=(\"chain\", \"draw\")), color=f\"C{j}\", alpha=0.2)\n\nplt.xlabel(\"x\")\nplt.ylabel(\"y\");\n\n\n\n\nHere, we can notice that the probability transitions between classes from left to right. At all points across \(x\), the sum of the class probabilities is 1, since in our generative model, it must be one of these three outcomes.\n\n\n\nNext, we will look at the classic “iris” dataset, which contains samples from 3 different species of iris plants. Using properties of the plant, we will try to model its species.\n\niris = sns.load_dataset(\"iris\")\niris.head(3)\n\n\n\n\n\n \n \n \n sepal_length\n sepal_width\n petal_length\n petal_width\n species\n \n \n \n \n 0\n 5.1\n 3.5\n 1.4\n 0.2\n setosa\n \n \n 1\n 4.9\n 3.0\n 1.4\n 0.2\n setosa\n \n \n 2\n 4.7\n 3.2\n 1.3\n 0.2\n setosa\n \n \n\n\n\n\nThe dataset includes four different properties of the plants: its sepal length, sepal width, petal length, and petal width. There are 3 different class possibilities: setosa, versicolor, and virginica.\n\nsns.pairplot(iris, hue=\"species\");\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/seaborn/axisgrid.py:208: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n self._figure.subplots_adjust(right=right)\n\n\n\n\n\nWe can see the three species have several distinct characteristics, which our linear model can capture to distinguish between them.\n\nmodel = bmb.Model(\n \"species ~ sepal_length + sepal_width + petal_length + petal_width\", \n iris, \n family=\"categorical\",\n)\nidata = model.fit()\naz.summary(idata)\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, sepal_length, sepal_width, petal_length, petal_width]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:21<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 21 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept[versicolor]\n -6.751\n 7.897\n -21.261\n 8.474\n 0.214\n 0.156\n 1369.0\n 1374.0\n 1.0\n \n \n Intercept[virginica]\n -22.546\n 9.566\n -40.257\n -5.208\n 0.229\n 0.164\n 1761.0\n 1556.0\n 1.0\n \n \n sepal_length[versicolor]\n 3.140\n 1.690\n 0.049\n 6.365\n 0.053\n 0.037\n 1031.0\n 1124.0\n 1.0\n \n \n sepal_length[virginica]\n 2.361\n 1.754\n -0.823\n 5.755\n 0.055\n 0.040\n 1020.0\n 974.0\n 1.0\n \n \n sepal_width[versicolor]\n -4.777\n 1.967\n -8.792\n -1.408\n 0.063\n 0.046\n 973.0\n 1096.0\n 1.0\n \n \n sepal_width[virginica]\n -6.681\n 2.368\n -11.597\n -2.590\n 0.076\n 0.055\n 974.0\n 909.0\n 1.0\n \n \n petal_length[versicolor]\n 1.060\n 0.915\n -0.630\n 2.735\n 
0.027\n 0.019\n 1187.0\n 1316.0\n 1.0\n \n \n petal_length[virginica]\n 3.986\n 1.071\n 1.972\n 5.882\n 0.029\n 0.021\n 1340.0\n 1187.0\n 1.0\n \n \n petal_width[versicolor]\n 1.905\n 2.024\n -1.927\n 5.871\n 0.060\n 0.045\n 1153.0\n 1113.0\n 1.0\n \n \n petal_width[virginica]\n 9.021\n 2.247\n 5.098\n 13.457\n 0.063\n 0.046\n 1264.0\n 1198.0\n 1.0\n \n \n\n\n\n\n\naz.plot_trace(idata);\n\n\n\n\nWe can see that this has fit quite nicely. You’ll notice there are \\(n-1\\) parameters to fit, where \\(n\\) is the number of categories. In the minimal binary case, recall there’s only one parameter set, since it models probability \\(p\\) of being in a class, and probability \\(1-p\\) of being in the other class. Using the categorical distribution, this extends, so we have \\(p_1\\) for class 1, \\(p_2\\) for class 2, and \\(1-(p_1+p_2)\\) for the final class.\n\n\n\nNext we will look at an example from chapter 8 of Alan Agresti’s Categorical Data Analysis, looking at the primary food choice for 64 alligators caught in Lake George, Florida. We will use their length (a continuous variable) and sex (a categorical variable) as predictors to model their food choice.\nFirst, reproducing the dataset,\n\nlength = [\n 1.3, 1.32, 1.32, 1.4, 1.42, 1.42, 1.47, 1.47, 1.5, 1.52, 1.63, 1.65, 1.65, 1.65, 1.65,\n 1.68, 1.7, 1.73, 1.78, 1.78, 1.8, 1.85, 1.93, 1.93, 1.98, 2.03, 2.03, 2.31, 2.36, 2.46,\n 3.25, 3.28, 3.33, 3.56, 3.58, 3.66, 3.68, 3.71, 3.89, 1.24, 1.3, 1.45, 1.45, 1.55, 1.6, \n 1.6, 1.65, 1.78, 1.78, 1.8, 1.88, 2.16, 2.26, 2.31, 2.36, 2.39, 2.41, 2.44, 2.56, 2.67, \n 2.72, 2.79, 2.84\n]\nchoice = [\n \"I\", \"F\", \"F\", \"F\", \"I\", \"F\", \"I\", \"F\", \"I\", \"I\", \"I\", \"O\", \"O\", \"I\", \"F\", \"F\", \n \"I\", \"O\", \"F\", \"O\", \"F\", \"F\", \"I\", \"F\", \"I\", \"F\", \"F\", \"F\", \"F\", \"F\", \"O\", \"O\", \n \"F\", \"F\", \"F\", \"F\", \"O\", \"F\", \"F\", \"I\", \"I\", \"I\", \"O\", \"I\", \"I\", \"I\", \"F\", \"I\", \n \"O\", \"I\", \"I\", \"F\", \"F\", \"F\", \"F\", \"F\", \"F\", \"F\", \"O\", \"F\", \"I\", \"F\", \"F\"\n]\n\nsex = [\"Male\"] * 32 + [\"Female\"] * 31\ndata = pd.DataFrame({\"choice\": choice, \"length\": length, \"sex\": sex})\ndata[\"choice\"] = pd.Categorical(\n data[\"choice\"].map({\"I\": \"Invertebrates\", \"F\": \"Fish\", \"O\": \"Other\"}), \n [\"Other\", \"Invertebrates\", \"Fish\"], \n ordered=True\n)\ndata.head(3)\n\n\n\n\n\n \n \n \n choice\n length\n sex\n \n \n \n \n 0\n Invertebrates\n 1.30\n Male\n \n \n 1\n Fish\n 1.32\n Male\n \n \n 2\n Fish\n 1.32\n Male\n \n \n\n\n\n\nNext, constructing the model,\n\nmodel = bmb.Model(\"choice ~ length + sex\", data, family=\"categorical\")\nidata = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Intercept, length, sex]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe can then look at how the food choices vary by length for both male and female alligators.\n\nnew_length = np.linspace(1, 4)\nnew_data = pd.DataFrame({\"length\": np.tile(new_length, 2), \"sex\": [\"Male\"] * 50 + [\"Female\"] * 50})\nmodel.predict(idata, data=new_data)\np = idata.posterior[\"choice_mean\"]\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 5))\nchoices = [\"Other\", \"Invertebrates\", 
\"Fish\"]\n\nfor j, choice in enumerate(choices):\n males = p.sel({\"choice_dim\":choice, \"choice_obs\":slice(0, 49)})\n females = p.sel({\"choice_dim\":choice, \"choice_obs\":slice(50, 100)})\n axes[0].plot(new_length, males.mean((\"chain\", \"draw\")), color=f\"C{j}\", lw=2)\n axes[1].plot(new_length, females.mean((\"chain\", \"draw\")), color=f\"C{j}\", lw=2)\n az.plot_hdi(new_length, males, color=f\"C{j}\", ax=axes[0])\n az.plot_hdi(new_length, females, color=f\"C{j}\", ax=axes[1])\n\naxes[0].set_title(\"Male\")\naxes[1].set_title(\"Female\")\n\nhandles = [Line2D([], [], color=f\"C{j}\", label=choice) for j, choice in enumerate(choices)]\nfig.subplots_adjust(left=0.05, right=0.975, bottom=0.075, top=0.85)\n\nfig.legend(\n handles,\n choices,\n loc=\"center right\",\n ncol=3,\n bbox_to_anchor=(0.99, 0.95),\n bbox_transform=fig.transFigure\n);\n\n/tmp/ipykernel_30893/358310275.py:21: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(left=0.05, right=0.975, bottom=0.075, top=0.85)\n\n\n\n\n\nHere we can see that the larger male and female alligators are, the less of a taste they have for invertebrates, and far prefer fish. Additionally, males seem to have a higher propensity to consume “other” foods compared to females at any size. Of note, the posterior means predicted by Bambi contain information about all \\(n\\) categories (despite having only \\(n-1\\) coefficients), so we can directly construct this plot, rather than manually calculating \\(1-(p_1+p_2)\\) for the third class.\nLast, we can make a posterior predictive plot,\n\nmodel.predict(idata, kind=\"pps\")\n\nax = az.plot_ppc(idata)\nax.set_xticks([0.5, 1.5, 2.5])\nax.set_xticklabels(model.response_component.response_term.levels)\nax.set_xlabel(\"Choice\");\nax.set_ylabel(\"Probability\");\n\n\n\n\nwhich depicts posterior predicted probability for each possible food choice for an alligator, which reinforces fish being the most likely food choice, followed by invertebrates.\n\n\nAgresti, A. (2013) Categorical Data Analysis. 3rd Edition, John Wiley & Sons Inc., Hoboken.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Wed Jun 28 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nbambi : 0.12.0.dev0\npandas : 2.0.2\nnumpy : 1.25.0\nmatplotlib: 3.6.2\nseaborn : 0.12.2\n\nWatermark: 2.3.1" - }, - { - "objectID": "notebooks/Strack_RRR_re_analysis.html", - "href": "notebooks/Strack_RRR_re_analysis.html", + "objectID": "notebooks/beta_regression.html", + "href": "notebooks/beta_regression.html", "title": "Bambi", "section": "", - "text": "from glob import glob\nfrom os.path import basename\n\nimport arviz as az\nimport bambi as bmb\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\n\nIn this Jupyter notebook, we do a Bayesian reanalysis of the data reported in the recent registered replication report (RRR) of a famous study by Strack, Martin & Stepper (1988). The original Strack et al. study tested a facial feedback hypothesis arguing that emotional responses are, in part, driven by facial expressions (rather than expressions always following from emotions). Strack and colleagues reported that participants rated cartoons as more funny when the participants held a pen in their teeth (unknowingly inducing a smile) than when they held a pen between their lips (unknowingly inducing a pout). 
The article has been cited over 1,400 times, and has been enormously influential in popularizing the view that affective experiences and outward expressions of affective experiences can both influence each other (instead of the relationship being a one-way street from experience to expression). In 2016, a Registered Replication Report led by Wagenmakers and colleagues attempted to replicate Study 1 from Strack, Martin, & Stepper (1988) in 17 independent experiments comprising over 2,500 participants. The RRR reported no evidence in support of the effect.\nBecause the emphasis here is on fitting models in Bambi, we spend very little time on quality control and data exploration; our goal is simply to show how one can replicate and extend the primary analysis reported in the RRR in a few lines of Bambi code.\n\n\nThe data for the RRR of Strack, Martin, & Stepper (henceforth SMS) is available as a set of CSV files from the project’s repository on the Open Science Framework. For the sake of completeness, we’ll show how to go from the raw CSV to the “long” data format that Bambi can use.\nOne slightly annoying thing about these 17 CSV files–each of which represents a different replication site–is that they don’t all contain exactly the same columns. Some labs added a column or two at the end (mostly for notes). To keep things simple, we’ll just truncate each dataset to only the first 22 columns. Because the variable names are structured in a bit of a confusing way, we’ll also just drop the first two rows in each file, and manually set the column names for all 22 variables. Once we’ve done that, we can simply concatenate all of the 17 datasets along the row axis to create one big dataset.\n\nDL_PATH = 'data/facial_feedback/*csv'\n\ndfs = []\ncolumns = ['subject', 'cond_id', 'condition', 'correct_c1', 'correct_c2', 'correct_c3', 'correct_c4',\n 'correct_total', 'rating_t1', 'rating_t2', 'rating_c1', 'rating_c2', 'rating_c3',\n 'rating_c4', 'self_perf', 'comprehension', 'awareness', 'transcript', 'age', 'gender',\n 'student', 'occupation']\n\ncount = 0\nfor idx, study in enumerate(glob(DL_PATH)):\n data = pd.read_csv(study, encoding='latin1', skiprows=2, header=None, index_col=False).iloc[:, :22]\n data.columns = columns\n # Add study name\n data['study'] = idx\n # Some sites used the same subject id numbering schemes, so prepend with study to create unique ids.\n # Note that if we don't do this, Bambi would have no way of distinguishing two subjects who share\n # the same id, which would hose our results.\n data['uid'] = data['subject'].astype(float) + count\n dfs.append(data)\ndata = pd.concat(dfs, axis=0).apply(pd.to_numeric, errors='coerce', axis=1)\n\nLet’s see what the first few rows look like…\n\ndata.head()\n\n\n\n\n\n \n \n \n subject\n cond_id\n condition\n correct_c1\n correct_c2\n correct_c3\n correct_c4\n correct_total\n rating_t1\n rating_t2\n ...\n self_perf\n comprehension\n awareness\n transcript\n age\n gender\n student\n occupation\n study\n uid\n \n \n \n \n 0\n 1.0\n 1.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 5.0\n 9.0\n ...\n 5.0\n 1.0\n 0.0\n NaN\n 21.0\n 1.0\n 1.0\n NaN\n 0.0\n 1.0\n \n \n 1\n 2.0\n 2.0\n 1.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 3.0\n 4.0\n ...\n 7.0\n 1.0\n 0.0\n NaN\n 25.0\n 1.0\n 1.0\n NaN\n 0.0\n 2.0\n \n \n 2\n 3.0\n 3.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 4.0\n 4.0\n ...\n 9.0\n 1.0\n 0.0\n NaN\n 23.0\n 0.0\n 1.0\n NaN\n 0.0\n 3.0\n \n \n 3\n 4.0\n 4.0\n 1.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 7.0\n 3.0\n ...\n 4.0\n 1.0\n 0.0\n NaN\n 19.0\n 0.0\n 1.0\n NaN\n 
0.0\n 4.0\n \n \n 4\n 5.0\n 5.0\n 0.0\n 1.0\n 1.0\n 1.0\n 1.0\n 4.0\n 5.0\n 7.0\n ...\n 6.0\n 1.0\n 0.0\n NaN\n 19.0\n 0.0\n 1.0\n NaN\n 0.0\n 5.0\n \n \n\n5 rows × 24 columns\n\n\n\n\n\n\nAt this point we have our data in a pandas DataFrame with shape of (2612, 24). Unfortunately, we can’t use the data in this form. We’ll need to (a) conduct some basic quality control, and (b) “melt” the dataset–currently in so-called “wide” format, with each subject in a separate row–into long format, where each row is a single trial. Fortunately, we can do this easily in pandas:\n\n# Keep only subjects who (i) respond appropriately on all trials,\n# (ii) understand the cartoons, and (iii) don't report any awareness\n# of the hypothesis or underlying theory.\nvalid = data.query('correct_total==4 and comprehension==1 and awareness==0')\nlong = pd.melt(valid, ['uid', 'condition', 'gender', 'age', 'study', 'self_perf'],\n ['rating_c1', 'rating_c2', 'rating_c3', 'rating_c4'], var_name='stimulus')\n\n\nlong\n\n\n\n\n\n \n \n \n uid\n condition\n gender\n age\n study\n self_perf\n stimulus\n value\n \n \n \n \n 0\n 1.0\n 0.0\n 1.0\n 21.0\n 0.0\n 5.0\n rating_c1\n 5.0\n \n \n 1\n 2.0\n 1.0\n 1.0\n 25.0\n 0.0\n 7.0\n rating_c1\n 0.0\n \n \n 2\n 3.0\n 0.0\n 0.0\n 23.0\n 0.0\n 9.0\n rating_c1\n 4.0\n \n \n 3\n 4.0\n 1.0\n 0.0\n 19.0\n 0.0\n 4.0\n rating_c1\n 7.0\n \n \n 4\n 5.0\n 0.0\n 0.0\n 19.0\n 0.0\n 6.0\n rating_c1\n 4.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 6935\n 164.0\n 0.0\n 0.0\n 18.0\n 16.0\n 4.0\n rating_c4\n 0.0\n \n \n 6936\n 168.0\n 0.0\n 0.0\n 18.0\n 16.0\n 8.0\n rating_c4\n 6.0\n \n \n 6937\n 169.0\n 1.0\n 0.0\n 18.0\n 16.0\n 7.0\n rating_c4\n 7.0\n \n \n 6938\n 171.0\n 1.0\n 0.0\n 19.0\n 16.0\n 7.0\n rating_c4\n 4.0\n \n \n 6939\n 172.0\n 0.0\n 1.0\n 21.0\n 16.0\n 7.0\n rating_c4\n 3.0\n \n \n\n6940 rows × 8 columns\n\n\n\nNotice that in the melt() call above, we’re treating not only the unique subject ID (uid) as an identifying variable, but also gender, experimental condition, age, and study name. Since these are all between-subject variables, these columns are all completely redundant with uid, and adding them does nothing to change the structure of our data. The point of explicitly listing them is just to keep them around in the dataset, so that we can easily add them to our models.\n\n\n\nNow that we’re all done with our (minimal) preprocessing, it’s time to fit the model! This turns out to be a snap in Bambi. We’ll begin with a very naive (and, as we’ll see later, incorrect) model that includes only the following terms:\n\nAn overall (common) intercept.\nThe common effect of experimental condition (“smiling” by holding a pen in one’s teeth vs. “pouting” by holding a pen in one’s lips). This is the primary variable of interest in the study.\nA group specific intercept for each of the 1,728 subjects in the ‘long’ dataset. (There were 2,576 subjects in the original dataset, but about 25% were excluded for various reasons, and we’re further excluding all subjects who lack complete data. 
As an exercise, you can try relaxing some of these criteria and re-fitting the models, though you’ll probably find that it makes no meaningful difference to the results.)\n\nWe’ll create a Bambi model, fit it, and store the results in a new object–which we can then interrogate in various ways.\n\n# Initialize the model, passing in the dataset we want to use.\nmodel = bmb.Model(\"value ~ condition + (1|uid)\", long, dropna=True)\n\n# Set a custom prior on group specific factor variances—just for illustration\ngroup_specific_sd = bmb.Prior(\"HalfNormal\", sigma=10)\ngroup_specific_prior = bmb.Prior(\"Normal\", mu=0, sigma=group_specific_sd)\nmodel.set_priors(group_specific=group_specific_prior)\n\n# Fit the model, drawing 1,000 MCMC draws per chain\nresults = model.fit(draws=1000)\n\nAutomatically removing 9/6940 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [value_sigma, Intercept, condition, 1|uid_sigma, 1|uid_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:23<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 23 seconds.\n\n\nNotice that, in Bambi, the common and group specific effects are specified in the same formula. This is the same convention used by other similar packages like brms.\n\n\n\nWe can plot the prior distributions for all parameters with a call to the plot_priors() method.\n\nmodel.plot_priors();\n\nSampling: [1|uid_sigma, Intercept, condition, value_sigma]\n\n\n\n\n\nAnd we can easily get the posterior distributions with az.plot_trace(). We can select a subset of the parameters with the var_names arguments, like in the following cell. Or alternative by negating variables like var_names=\"~1|uid\".\n\naz.plot_trace(results,\n var_names=[\"Intercept\", \"condition\", \"value_sigma\", \"1|uid_sigma\"],\n compact=False,\n);\n\n\n\n\nIf we want a numerical summary of the results, we just pass the results object to az.summary(). By default, summary shows the mean, standard deviation, and 94% highest density interval for the posterior. Summary also includes the Monte Carlo standard error, the effective sample size and the R-hat statistic.\n\naz.summary(results, var_names=['Intercept', 'condition', 'value_sigma', '1|uid_sigma'])\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 4.563\n 0.047\n 4.472\n 4.647\n 0.001\n 0.001\n 2208.0\n 1508.0\n 1.0\n \n \n condition\n -0.030\n 0.058\n -0.143\n 0.073\n 0.001\n 0.001\n 2473.0\n 1198.0\n 1.0\n \n \n value_sigma\n 2.402\n 0.021\n 2.360\n 2.439\n 0.000\n 0.000\n 2429.0\n 1335.0\n 1.0\n \n \n 1|uid_sigma\n 0.306\n 0.045\n 0.228\n 0.392\n 0.002\n 0.001\n 643.0\n 915.0\n 1.0\n \n \n\n\n\n\n\n\n\nLooking at the parameter estimates produced by our model, it seems pretty clear that there’s no meaningful effect of condition. The posterior distribution is centered almost exactly on 0, with most of the probability mass on very small values. The 94% HDI spans from \\(\\approx -0.14\\) to \\(\\approx 0.08\\)–in other words, the plausible effect of the experimental manipulation is, at best, to produce a change of < 0.2 on cartoon ratings made on a 10-point scale. For perspective, the variation between subjects is enormous in comparison–the standard deviation for group specific effects 1|uid_sigma is around 0.3. 
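To put the size of the condition effect and the subject-level variation on a common scale, we can also ask how much of the posterior for condition falls inside a narrow band around zero. This is a small sketch that is not part of the original analysis, and the ±0.2 band is an arbitrary choice used only for illustration:\n\ncondition_draws = results.posterior[\"condition\"]\nprob_negligible = (np.abs(condition_draws) < 0.2).mean().item()\nprint(f\"P(|condition| < 0.2) = {prob_negligible:.2f}\") 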
We can also see that the model is behaving well, and the sampler seems to have converged nicely (the traces for all parameters look stationary).\nUnfortunately, our first model has at least two pretty serious problems. First, it gives no consideration to between-study variation–we’re simply lumping all 1,728 subjects together, as if they came from the same study. A better model would properly account for study-level variation. We could model study as either a common or a group specific factor in this case–both choices are defensible, depending on whether we want to think of the 17 studies in this dataset as the only sites of interest, or as if they’re just 17 random sites drawn from some much larger population that have particular characteristics we want to account for.\nFor present purposes, we’ll adopt the latter strategy (as an exercise, you can modify the the code below and re-run the model with study as a common factor). We’ll “keep it maximal” by adding both group specific study intercepts and group specific study slopes to the model. That is, we’ll assume that the subjects at each research site have a different baseline appreciation of the cartoons (some find the cartoons funnier than others), and that the effect of condition also varies across sites.\nSecond, our model also fails to explicitly model variation in cartoon ratings that should properly be attributed to the 4 stimuli. In principle, our estimate of the common effect of condition could change somewhat once we correctly account for stimulus variability (though in practice, the net effect is almost always to reduce effects, not increase them–so in this case, it’s very unlikely that adding group specific stimulus effects will produce a meaningful effect of condition). So we’ll deal with this by adding specific intercepts for the 4 stimuli. We’ll model the stimuli as group specific effect, rather than common, because it wouldn’t make sense to think of these particular cartoons as exhausting the universe of stimuli we care about (i.e., we wouldn’t really care about the facial-feedback effect if we knew that it only applied to 4 specific Far Side cartoons, and no other stimuli).\nLastly, just for fun, we can throw in some additional covariates, since they’re readily available in the dataset, and may be of interest even if they don’t directly inform the core hypothesis. 
Specifically, we’ll add common effects of gender and age to the model, which will let us estimate the degree to which participants’ ratings of the cartoons vary as a function of these background variables.\nOnce we’ve done all that, we end up with a model that’s in a good position to answer the question we care about–namely, whether the smiling/pouting manipulation has an effect on cartoon ratings that generalizes across the subjects, studies, and stimuli found in the RRR dataset.\n\nmodel = bmb.Model(\n \"value ~ condition + age + gender + (1|uid) + (condition|study) + (condition|stimulus)\",\n long,\n dropna=True,\n)\n\ngroup_specific_sd = bmb.Prior(\"HalfNormal\", sigma=10)\ngroup_specific_prior = bmb.Prior(\"Normal\", mu=0, sigma=group_specific_sd)\nmodel.set_priors(group_specific=group_specific_prior)\n\n# Note that we use 2000 samples for tuning and increase target_accept to 0.99.\n# The default values result in divergences.\nresults = model.fit(draws=1000, tune=2000, target_accept=0.99)\n\nAutomatically removing 33/6940 rows from the dataset.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [value_sigma, Intercept, condition, age, gender, 1|uid_sigma, 1|uid_offset, 1|study_sigma, 1|study_offset, condition|study_sigma, condition|study_offset, 1|stimulus_sigma, 1|stimulus_offset, condition|stimulus_sigma, condition|stimulus_offset]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 26:22<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 1583 seconds.\n\n\n\naz.plot_trace(results, \n var_names=['Intercept', 'age', 'gender', 'condition', 'value_sigma', \n '1|study', '1|stimulus', 'condition|study', 'condition|stimulus',\n '1|study_sigma', '1|stimulus_sigma', 'condition|study_sigma', \n ],\n compact=True);\n\n\n\n\n\n\n\nNo. There’s still no discernible effect. Modeling the data using a mixed-effects model does highlight a number of other interesting features, however: * The stimulus-level standard deviation 1|stimulus_sigma is quite large compared to the other factors. This is potentially problematic, because it suggests that a more conventional analysis that left individual stimulus effects out of the model could potentially produce a high false positive rate. Note that this is a problem that affects both the RRR and the original Strack study equally; the moral of the story is to deliberately sample large numbers of stimuli and explicitly model their influence. * Older people seem to rate cartoons as being (a little bit) funnier. * The variation across sites is surprisingly small–in terms of both the group specific intercepts (1|study) and the group specific slopes (condition|study). 
In other words, the constitution of the sample, the gender of the experimenter, or any of the hundreds of others of between-site differences that one might conceivably have expected to matter, don’t really seem to make much of a difference to participants’ ratings of the cartoons.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\npandas: 1.5.2\nnumpy : 1.23.5\narviz : 0.14.0\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" + "text": "This example has been contributed by Tyler James Burch (@tjburch on GitHub).\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom scipy import stats\nfrom scipy.special import expit\n\n\naz.style.use(\"arviz-darkgrid\")\n\nIn this example, we’ll look at using the Beta distribution for regression models. The Beta distribution is a probability distribution bounded on the interval [0, 1], which makes it well-suited to model probabilities or proportions. In fact, in much of the Bayesian literature, the Beta distribution is introduced as a prior distribution for the probability \\(p\\) parameter of the Binomial distribution (in fact, it’s the conjugate prior for the Binomial distribution).\n\n\nTo start getting an intuitive sense of the Beta distribution, we’ll model coin flipping probabilities. Say we grab all the coins out of our pocket, we might have some fresh from the mint, but we might also have some old ones. Due to the variation, some may be slightly biased toward heads or tails, and our goal is to model distribution of the probabilities of flipping heads for the coins in our pocket.\nSince we trust the mint, we’ll say the \\(\\alpha\\) and \\(\\beta\\) are both large, we’ll use 1,000 for each, which gives a distribution spanning from 0.45 to 0.55.\n\nalpha = 1_000\nbeta = 1_000\np = np.random.beta(alpha, beta, size=10_000)\naz.plot_kde(p)\nplt.xlabel(\"$p$\");\n\n\n\n\nNext, we’ll use Bambi to try to recover the parameters of the Beta distribution. Since we have no predictors, we can do a intercept-only model to try to recover them.\n\ndata = pd.DataFrame({\"probabilities\": p})\nmodel = bmb.Model(\"probabilities ~ 1\", data, family=\"beta\")\nfitted = model.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [probabilities_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.\n\n\n\naz.plot_trace(fitted);\n\n\n\n\n\naz.summary(fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.000\n 0.000\n -0.001\n 0.001\n 0.000\n 0.000\n 2079.0\n 1465.0\n 1.0\n \n \n probabilities_kappa\n 2012.885\n 27.642\n 1960.994\n 2062.262\n 0.592\n 0.418\n 2185.0\n 1548.0\n 1.0\n \n \n\n\n\n\nThe model fit, but clearly these parameters are not the ones that we used above. For Beta regression, we use a linear model for the mean, so we use the \\(\\mu\\) and \\(\\sigma\\) formulation. 
To link the two, we use\n\(\alpha = \mu \kappa\)\n\(\beta = (1-\mu)\kappa\)\nand \(\kappa\) is a function of the mean and variance,\n\(\kappa = \frac{\mu(1-\mu)}{\sigma^2} - 1\)\nRather than \(\sigma\), you’ll note Bambi returns \(\kappa\). We’ll define a function to retrieve our original parameters.\n\ndef mukappa_to_alphabeta(mu, kappa):\n # Calculate alpha and beta\n alpha = mu * kappa\n beta = (1 - mu) * kappa\n \n # Get mean values and 95% HDIs \n alpha_mean = alpha.mean((\"chain\", \"draw\")).item()\n alpha_hdi = az.hdi(alpha, hdi_prob=.95)[\"x\"].values\n beta_mean = beta.mean((\"chain\", \"draw\")).item()\n beta_hdi = az.hdi(beta, hdi_prob=.95)[\"x\"].values\n \n return alpha_mean, alpha_hdi, beta_mean, beta_hdi\n\nalpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(\n expit(fitted.posterior[\"Intercept\"]),\n fitted.posterior[\"probabilities_kappa\"]\n)\n\nprint(f\"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi[0])} - {np.round(alpha_hdi[1])}\")\nprint(f\"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi[0])} - {np.round(beta_hdi[1])}\")\n\nAlpha - mean: 1006.0, 95% HDI: 979.0 - 1033.0\nBeta - mean: 1006.0, 95% HDI: 978.0 - 1032.0\n\n\nWe’ve managed to recover our parameters with an intercept-only model.\n\n\n\nPerhaps we have a little more information on the coins in our pocket. We notice that the coins have accumulated dirt on either side, which would shift the probability of getting tails or heads. In reality, we would not know how much the dirt affects the probability distribution, and we would like to recover that parameter. We’ll construct this toy example by saying that each micron of dirt shifts the \(\alpha\) parameter by 5.0. 
Further, the amount of dirt is distributed according to a Half Normal distribution with a standard deviation of 25 per side.\nWe’ll start by looking at the difference in probability for a coin with a lot of dirt on either side.\n\neffect_per_micron = 5.0\n\n# Clean Coin\nalpha = 1_000\nbeta = 1_000\np = np.random.beta(alpha, beta, size=10_000)\n\n# Add two std to tails side (heads more likely)\np_heads = np.random.beta(alpha + 50 * effect_per_micron, beta, size=10_000)\n# Add two std to heads side (tails more likely)\np_tails = np.random.beta(alpha - 50 * effect_per_micron, beta, size=10_000)\n\naz.plot_kde(p, label=\"Clean Coin\")\naz.plot_kde(p_heads, label=\"Biased toward heads\", plot_kwargs={\"color\":\"C1\"})\naz.plot_kde(p_tails, label=\"Biased toward tails\", plot_kwargs={\"color\":\"C2\"})\nplt.xlabel(\"$p$\")\nplt.ylim(top=plt.ylim()[1]*1.25);\n\n\n\n\nNext, we’ll generate a toy dataset according to our specifications above. As an added foil, we will also assume that we’re limited in our measuring equipment, that we can only measure correctly to the nearest integer micron.\n\n# Create amount of dirt on top and bottom\nheads_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)\ntails_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)\n\n# Create the probability per coin\nalpha = np.repeat(1_000, 1_000)\nalpha = alpha + effect_per_micron * heads_bias_dirt - effect_per_micron * tails_bias_dirt\nbeta = np.repeat(1_000, 1_000)\n\np = np.random.beta(alpha, beta)\n\ndf = pd.DataFrame({\n \"p\" : p,\n \"heads_bias_dirt\" : heads_bias_dirt.round(),\n \"tails_bias_dirt\" : tails_bias_dirt.round()\n})\ndf.head()\n\n\n\n\n\n \n \n \n p\n heads_bias_dirt\n tails_bias_dirt\n \n \n \n \n 0\n 0.508915\n 30.0\n 15.0\n \n \n 1\n 0.533541\n 24.0\n 4.0\n \n \n 2\n 0.482905\n 10.0\n 28.0\n \n \n 3\n 0.555191\n 54.0\n 0.0\n \n \n 4\n 0.526059\n 4.0\n 4.0\n \n \n\n\n\n\nTaking a look at our new dataset:\n\nfig,ax = plt.subplots(1,3, figsize=(16,5))\n\ndf[\"p\"].plot.kde(ax=ax[0])\nax[0].set_xlabel(\"$p$\")\n\ndf[\"heads_bias_dirt\"].plot.hist(ax=ax[1], bins=np.arange(0,df[\"heads_bias_dirt\"].max()))\nax[1].set_xlabel(\"Measured Dirt Biasing Toward Heads ($\\mu m$)\")\ndf[\"tails_bias_dirt\"].plot.hist(ax=ax[2], bins=np.arange(0,df[\"tails_bias_dirt\"].max()))\nax[2].set_xlabel(\"Measured Dirt Biasing Toward Tails ($\\mu m$)\");\n\n\n\n\nNext we want to make a model to recover the effect per micron of dirt per side. So far, we’ve considered the biasing toward one side or another independently. A linear model might look something like this:\n$ p (, )$\n\\(logit(\\mu) = \\text{ Normal}( \\alpha + \\beta_h d_h + \\beta_t d_t)\\)\nWhere \\(d_h\\) and \\(d_t\\) are the measured dirt (in microns) biasing the probability toward heads and tails respectively, \\(\\beta_h\\) and \\(\\beta_t\\) are coefficients for how much a micron of dirt affects each independent side, and \\(\\alpha\\) is the intercept. Also note the logit link function used here, since our outcome is on the scale of 0-1, it makes sense that the link must also put our mean on that scale. Logit is the default link function, however Bambi supports the identity, probit, and cloglog links as well.\nIn this toy example, we’ve constructed it such that dirt should not affect one side differently from another, so we can wrap those into one coefficient: \\(\\beta = \\beta_h = -\\beta_t\\). 
This makes the last line of the model:\n\\(logit(\\mu) = \\text{ Normal}( \\alpha + \\beta \\Delta d)\\)\nwhere\n\\(\\Delta d = d_h - d_t\\)\nPutting that into our dataset, then constructing this model in Bambi,\n\ndf[\"delta_d\"] = df[\"heads_bias_dirt\"] - df[\"tails_bias_dirt\"]\ndirt_model = bmb.Model(\"p ~ delta_d\", df, family=\"beta\")\ndirt_fitted = dirt_model.fit()\ndirt_model.predict(dirt_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [p_kappa, Intercept, delta_d]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds.\n\n\n\naz.summary(dirt_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -0.006\n 0.001\n -0.009\n -0.004\n 0.000\n 0.000\n 2903.0\n 1479.0\n 1.0\n \n \n delta_d\n 0.005\n 0.000\n 0.005\n 0.005\n 0.000\n 0.000\n 3200.0\n 1597.0\n 1.0\n \n \n p_kappa\n 2018.759\n 91.080\n 1862.252\n 2198.655\n 1.719\n 1.216\n 2805.0\n 1399.0\n 1.0\n \n \n p_mean[0]\n 0.517\n 0.000\n 0.516\n 0.518\n 0.000\n 0.000\n 3477.0\n 1662.0\n 1.0\n \n \n p_mean[1]\n 0.523\n 0.000\n 0.522\n 0.524\n 0.000\n 0.000\n 3564.0\n 1637.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n p_mean[995]\n 0.523\n 0.000\n 0.522\n 0.524\n 0.000\n 0.000\n 3564.0\n 1637.0\n 1.0\n \n \n p_mean[996]\n 0.517\n 0.000\n 0.516\n 0.518\n 0.000\n 0.000\n 3477.0\n 1662.0\n 1.0\n \n \n p_mean[997]\n 0.533\n 0.001\n 0.532\n 0.534\n 0.000\n 0.000\n 3570.0\n 1596.0\n 1.0\n \n \n p_mean[998]\n 0.467\n 0.001\n 0.466\n 0.468\n 0.000\n 0.000\n 2916.0\n 1657.0\n 1.0\n \n \n p_mean[999]\n 0.498\n 0.000\n 0.498\n 0.499\n 0.000\n 0.000\n 2903.0\n 1479.0\n 1.0\n \n \n\n1003 rows × 9 columns\n\n\n\n\naz.plot_ppc(dirt_fitted);\n\n\n\n\nNext, we’ll see if we can in fact recover the effect on \\(\\alpha\\). Remember that in order to return to \\(\\alpha\\), \\(\\beta\\) space, the linear equation passes through an inverse logit transformation, so we must apply this to the coefficient on \\(\\Delta d\\) to get the effect on \\(\\alpha\\). The inverse logit is nicely defined in scipy.special as expit.\n\nmean_effect = expit(dirt_fitted.posterior.delta_d.mean())\nhdi = az.hdi(dirt_fitted.posterior.delta_d, hdi_prob=.95)\nlower = expit(hdi.delta_d[0])\nupper = expit(hdi.delta_d[1])\nprint(f\"Mean effect: {mean_effect.item():.4f}\")\nprint(f\"95% interval {lower.item():.4f} - {upper.item():.4f}\")\n\nMean effect: 0.5012\n95% interval 0.5012 - 0.5013\n\n\nThe recovered effect is very close to the true effect of 0.5.\n\n\n\nIn the Hierarchical Logistic regression with Binomial family notebook, we modeled baseball batting averages (times a player reached first via a hit per times at bat) via a Hierarchical Logisitic regression model. If we’re interested in league-wide effects, we could look at a Beta regression. We work off the assumption that the league-wide batting average follows a Beta distribution, and that individual player’s batting averages are samples from that distribution.\nFirst, load the Batting dataset again, and re-calculate batting average as hits/at-bat. In order to make sure that we have a sufficient sample, we’ll require at least 100 at-bats in order consider a batter. 
Last, we’ll focus on 1990-2018.\n\nbatting = bmb.load_data(\"batting\")\n\n\nbatting[\"batting_avg\"] = batting[\"H\"] / batting[\"AB\"]\nbatting = batting[batting[\"AB\"] > 100]\ndf = batting[ (batting[\"yearID\"] > 1990) & (batting[\"yearID\"] < 2018) ]\n\n\ndf.batting_avg.hist(bins=30)\nplt.xlabel(\"Batting Average\")\nplt.ylabel(\"Count\");\n\n\n\n\nIf we’re interested in modeling the distribution of batting averages, we can start with an intercept-only model.\n\nmodel_avg = bmb.Model(\"batting_avg ~ 1\", df, family=\"beta\")\navg_fitted = model_avg.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\naz.summary(avg_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.038\n 0.002\n -1.041\n -1.035\n 0.000\n 0.000\n 1835.0\n 1455.0\n 1.0\n \n \n batting_avg_kappa\n 152.538\n 1.950\n 149.098\n 156.262\n 0.046\n 0.033\n 1771.0\n 1294.0\n 1.0\n \n \n\n\n\n\nLooking at the posterior predictive,\n\nposterior_predictive = model_avg.predict(avg_fitted, kind=\"pps\")\n\n\naz.plot_ppc(avg_fitted);\n\n\n\n\nThis appears to fit reasonably well. If, for example, we were interested in simulating players, we could sample from this distribution.\nHowever, we can take this further. Say we’re interested in understanding how this distribution shifts if we know a player’s batting average in a previous year. We can condition the model on a player’s n-1 year, and use Beta regrssion to model that. Let’s construct that variable, and for sake of ease, we will ignore players without previous seasons.\n\n# Add the player's batting average in the n-1 year\nbatting[\"batting_avg_shift\"] = np.where(\n batting[\"playerID\"] == batting[\"playerID\"].shift(),\n batting[\"batting_avg\"].shift(),\n np.nan\n)\ndf_shift = batting[ (batting[\"yearID\"] > 1990) & (batting[\"yearID\"] < 2018) ]\ndf_shift = df_shift[~df_shift[\"batting_avg_shift\"].isna()]\ndf_shift[[\"batting_avg_shift\",\"batting_avg\"]].corr()\n\n\n\n\n\n \n \n \n batting_avg_shift\n batting_avg\n \n \n \n \n batting_avg_shift\n 1.000000\n 0.229774\n \n \n batting_avg\n 0.229774\n 1.000000\n \n \n\n\n\n\nThere is a lot of variance in year-to-year batting averages, it’s not known to be incredibly predictive, and we see that here. A correlation coefficient of 0.23 is only lightly predictive. However, we can still use it in our model to get a better understanding. We’ll fit two models. First, we’ll refit the previous, intercept-only, model using this updated dataset so we have an apples-to-apples comparison. 
Then, we’ll fit a model using the previous year’s batting average as a predictor.\nNotice we need to explicitly ask for the inclusion of the log-likelihood values into the inference data object.\n\nmodel_avg = bmb.Model(\"batting_avg ~ 1\", df_shift, family=\"beta\")\navg_fitted = model_avg.fit(idata_kwargs={\"log_likelihood\": True})\n\nmodel_lag = bmb.Model(\"batting_avg ~ batting_avg_shift\", df_shift, family=\"beta\")\nlag_fitted = model_lag.fit(idata_kwargs={\"log_likelihood\": True})\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [batting_avg_kappa, Intercept, batting_avg_shift]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:04<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\n\naz.summary(lag_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n -1.374\n 0.074\n -1.517\n -1.240\n 0.001\n 0.001\n 3171.0\n 1435.0\n 1.0\n \n \n batting_avg_shift\n 1.347\n 0.281\n 0.782\n 1.838\n 0.005\n 0.004\n 3091.0\n 1478.0\n 1.0\n \n \n batting_avg_kappa\n 136.149\n 9.414\n 116.879\n 152.420\n 0.184\n 0.132\n 2618.0\n 1463.0\n 1.0\n \n \n\n\n\n\n\naz.compare({\n \"intercept-only\" : avg_fitted,\n \"lag-model\": lag_fitted\n})\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n lag-model\n 0\n 784.894117\n 3.146425\n 0.000000\n 0.995619\n 14.582720\n 0.000000\n False\n log\n \n \n intercept-only\n 1\n 774.193828\n 2.034573\n 10.700289\n 0.004381\n 15.320598\n 4.604911\n False\n log\n \n \n\n\n\n\nAdding the predictor results in a higher loo than the intercept-only model.\n\nppc= model_lag.predict(lag_fitted, kind=\"pps\")\naz.plot_ppc(lag_fitted);\n\n\n\n\nThe biggest question this helps us understand is, for each point of batting average in the previous year, how much better do we expect a player to be in the current year?\n\nmean_effect = lag_fitted.posterior.batting_avg_shift.mean().item()\nhdi = az.hdi(lag_fitted.posterior.batting_avg_shift, hdi_prob=.95)\n\nlower = expit(hdi.batting_avg_shift[0]).item()\nupper = expit(hdi.batting_avg_shift[1]).item()\nprint(f\"Mean effect: {expit(mean_effect):.4f}\")\nprint(f\"95% interval {lower:.4f} - {upper:.4f}\")\n\nMean effect: 0.7936\n95% interval 0.6806 - 0.8650\n\n\n\naz.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=0.95, color=\"goldenrod\", fill_kwargs={\"alpha\":0.8})\naz.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=.68, color=\"forestgreen\", fill_kwargs={\"alpha\":0.8})\n\nintercept = lag_fitted.posterior.Intercept.values.mean()\nx = np.linspace(df_shift.batting_avg_shift.min(), df_shift.batting_avg_shift.max(),100)\nlinear = mean_effect * x + intercept\nplt.plot(x, expit(linear), c=\"black\")\nplt.xlabel(\"Previous Year's Batting Average\")\nplt.ylabel(\"Batting Average\");\n\n\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython 
implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\narviz : 0.14.0\nmatplotlib: 3.6.2\nnumpy : 1.23.5\npandas : 1.5.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\nscipy : 1.9.3\nbambi : 0.9.3\n\nWatermark: 2.3.1" }, { - "objectID": "api/index.html", - "href": "api/index.html", + "objectID": "notebooks/t-test.html", + "href": "notebooks/t-test.html", "title": "Bambi", "section": "", - "text": "The basics\n\n\n\nModel\nSpecification of model class.\n\n\nFormula\nModel formula\n\n\n\n\n\n\n\n\n\nPrior\nAbstract specification of a term prior.\n\n\n\n\n\n\n\n\n\nFamily\nA specification of model family.\n\n\nLikelihood\nRepresentation of a Likelihood function for a Bambi model.\n\n\nLink\nRepresentation of a link function.\n\n\n\n\n\n\n\n\n\ninterpret.plot_comparisons\nPlot Conditional Adjusted Comparisons\n\n\ninterpret.plot_predictions\nPlot Conditional Adjusted Predictions\n\n\ninterpret.plot_slopes\nPlot Conditional Adjusted Slopes\n\n\n\n\n\n\n\n\n\ninterpret.comparisons\nCompute Conditional Adjusted Comparisons\n\n\ninterpret.predictions\nCompute Conditional Adjusted Predictions\n\n\ninterpret.slopes\nCompute Conditional Adjusted Slopes\n\n\n\n\n\n\n\n\n\nclear_data_home\nDelete all the content of the data home cache.\n\n\nload_data\nLoad a dataset." + "text": "import arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1234)\n\nIn this notebook we demo two equivalent ways of performing a two-sample Bayesian t-test to compare the mean value of two Gaussian populations using Bambi.\n\n\nWe generate 160 values from a Gaussian with \\(\\mu=6\\) and \\(\\sigma=2.5\\) and another 120 values from a Gaussian’ with \\(\\mu=8\\) and \\(\\sigma=2\\)\n\na = np.random.normal(6, 2.5, 160)\nb = np.random.normal(8, 2, 120)\ndf = pd.DataFrame({\"Group\": [\"a\"] * 160 + [\"b\"] * 120, \"Val\": np.hstack([a, b])})\n\n\ndf.head()\n\n\n\n\n\n \n \n \n Group\n Val\n \n \n \n \n 0\n a\n 7.178588\n \n \n 1\n a\n 3.022561\n \n \n 2\n a\n 9.581767\n \n \n 3\n a\n 5.218370\n \n \n 4\n a\n 4.198528\n \n \n\n\n\n\n\naz.plot_violin({\"a\": a, \"b\": b});\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/arviz/plots/backends/matplotlib/violinplot.py:64: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(wspace=0)\n\n\n\n\n\nWhen we carry out a two sample t-test we are implicitly using a linear model that can be specified in different ways. One of these approaches is the following:\n\n\n\\[\n\\mu_i = \\beta_0 + \\beta_1 (i) + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon_i\\) is a random error with mean 0. 
If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon_i\n\\]\nand\n\\[\n\\mu_1 = \\beta_0 + \\beta_1 + \\epsilon_i\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon_i = \\beta_0 + \\beta_1 + \\epsilon_i\\\\\n0 = \\beta_1\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_1\\) is 0.\n\n\n\nWe start by instantiating our model and specifying the model previously described.\n\nmodel_1 = bmb.Model(\"Val ~ Group\", df)\nresults_1 = model_1.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Val_sigma, Intercept, Group]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_1\n\n Formula: Val ~ Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 6.9762, sigma: 8.1247)\n Group ~ Normal(mu: 0, sigma: 12.4107)\n \n Auxiliary parameters\n Val_sigma ~ HalfStudentT(nu: 4, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_1.plot_priors();\n\nSampling: [Group, Intercept, Val_sigma]\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_1, kind=\"rank_vlines\");\n\n\n\n\n\naz.summary(results_1)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 6.116\n 0.179\n 5.778\n 6.449\n 0.003\n 0.002\n 3290.0\n 1795.0\n 1.00\n \n \n Group[b]\n 2.005\n 0.270\n 1.498\n 2.507\n 0.005\n 0.003\n 3537.0\n 1634.0\n 1.00\n \n \n Val_sigma\n 2.261\n 0.092\n 2.077\n 2.423\n 0.002\n 0.001\n 3217.0\n 1551.0\n 1.01\n \n \n\n\n\n\nIn the summary table we can see the 94% highest density interval for \\(\\beta_1\\) ranges from 1.511 to 2.499. Thus, according to the data and the model used, we conclude the difference between the two population means is somewhere between 1.2 and 2.2 and hence we support the hypotehsis that \\(\\beta_1 \\ne 0\\).\nSimilar conclusions can be made with the density estimate for the posterior distribution of \\(\\beta_1\\). As seen in the table, most of the probability for the difference in the mean roughly ranges from 1.2 to 2.2.\n\naz.plot_posterior(results_1, var_names=\"Group\", ref_val=0);\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 > 0\\). 
This probability is equal to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(results_1.posterior[\"Group\"] > 0).mean().item()\n\n1.0\n\n\nThe linear model implicit in the t-test can also be specified without an intercept term, such is the case of Model 2.\n\n\n\nWhen we carry out a two sample t-test we’re implicitly using the following model:\n\\[\n\\mu_i = \\beta_i + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon\n\\]\nand\n\\[\n\\mu_1 = \\beta_1 + \\epsilon\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon = \\beta_1 + \\epsilon\\\\\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_0 = \\beta_1\\).\n\n\n\nWe start by instantiating our model and specifying the model previously described. In this model we will bypass the intercept that Bambi adds by default by setting it to zero, even though setting to -1 has the same effect.\n\nmodel_2 = bmb.Model(\"Val ~ 0 + Group\", df)\nresults_2 = model_2.fit() \n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [Val_sigma, Group]\n\n\n\n\n\n\n\n \n \n 100.00% [4000/4000 00:02<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_2\n\n Formula: Val ~ 0 + Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Group ~ Normal(mu: [0. 0.], sigma: [12.4107 12.4107])\n \n Auxiliary parameters\n Val_sigma ~ HalfStudentT(nu: 4, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_2.plot_priors();\n\nSampling: [Group, Val_sigma]\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_2, kind=\"rank_vlines\");\n\n\n\n\n\naz.summary(results_2)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Group[a]\n 6.113\n 0.177\n 5.806\n 6.465\n 0.003\n 0.002\n 2973.0\n 1385.0\n 1.0\n \n \n Group[b]\n 8.117\n 0.209\n 7.724\n 8.506\n 0.004\n 0.003\n 3341.0\n 1662.0\n 1.0\n \n \n Val_sigma\n 2.263\n 0.099\n 2.082\n 2.446\n 0.002\n 0.001\n 2727.0\n 1454.0\n 1.0\n \n \n\n\n\n\nIn this summary we can observe the estimated distribution of means for each population. A simple way to compare them is subtracting one to the other. 
In the next plot we can se that the entirety of the distribution of differences is higher than zero and that the mean of population 2 is higher than the mean of population 1 by a mean of 2.\n\npost_group = results_2.posterior[\"Group\"]\ndiff = post_group.sel(Group_dim=\"b\") - post_group.sel(Group_dim=\"a\") \naz.plot_posterior(diff, ref_val=0);\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 - \\beta_0 > 0\\). This probability equals to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(post_group > 0).mean().item()\n\n1.0\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nmatplotlib: 3.6.2\npandas : 1.5.2\nbambi : 0.9.3\narviz : 0.14.0\nnumpy : 1.23.5\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\n\nWatermark: 2.3.1" }, { - "objectID": "api/clear_data_home.html", - "href": "api/clear_data_home.html", + "objectID": "notebooks/t_regression.html", + "href": "notebooks/t_regression.html", "title": "Bambi", "section": "", - "text": "data.clear_data_home(data_home=None)\nDelete all the content of the data home cache.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata_home\n\nThe path to Bambi data dir. By default a folder named \"bambi_data\" in the user home folder.\nNone" + "text": "Robust Linear Regression\nThis example has been lifted from the PyMC Docs, and adapted to for Bambi by Tyler James Burch (@tjburch on GitHub).\nMany toy datasets circumvent problems that practitioners run into with real data. Specifically, the assumption of normality can be easily violated by outliers, which can cause havoc in traditional linear regression. One way to navigate this is through robust linear regression, outlined in this example.\nFirst load modules and set the RNG for reproducibility.\n\nimport arviz as az\nimport bambi as bmb\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\n\naz.style.use(\"arviz-darkgrid\")\nnp.random.seed(1111)\n\nNext, generate pseudodata. The bulk of the data will be linear with noise distributed normally, but additionally several outliers will be interjected.\n\nsize = 100\ntrue_intercept = 1\ntrue_slope = 2\n\nx = np.linspace(0, 1, size)\n# y = a + b*x\ntrue_regression_line = true_intercept + true_slope * x\n# add noise\ny = true_regression_line + np.random.normal(scale=0.5, size=size)\n\n# Add outliers\nx_out = np.append(x, [0.1, 0.15, 0.2])\ny_out = np.append(y, [8, 6, 9])\n\ndata = pd.DataFrame({\n \"x\": x_out, \n \"y\": y_out\n})\n\nPlot this data. The three data points in the top left are the interjected data.\n\nfig = plt.figure(figsize=(7, 7))\nax = fig.add_subplot(111, xlabel=\"x\", ylabel=\"y\", title=\"Generated data and underlying model\")\nax.plot(x_out, y_out, \"x\", label=\"sampled data\")\nax.plot(x, true_regression_line, label=\"true regression line\", lw=2.0)\nplt.legend(loc=0);\n\n\n\n\nTo highlight the problem, first fit a standard normally-distributed linear regression.\n\n# Note, \"gaussian\" is the default argument for family. Added to be explicit. 
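# The Gaussian family implies a thin-tailed Normal likelihood, so the three injected outliers
# are expected to drag the fitted line upward at low x; the Student T family used later relaxes this.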
\ngauss_model = bmb.Model(\"y ~ x\", data, family=\"gaussian\")\ngauss_fitted = gauss_model.fit(draws=2000, idata_kwargs={\"log_likelihood\": True})\ngauss_model.predict(gauss_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_sigma, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:03<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds.\n\n\n\naz.summary(gauss_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 1.533\n 0.230\n 1.093\n 1.959\n 0.003\n 0.002\n 5481.0\n 2857.0\n 1.0\n \n \n x\n 1.201\n 0.400\n 0.458\n 1.964\n 0.005\n 0.004\n 5177.0\n 2869.0\n 1.0\n \n \n y_sigma\n 1.186\n 0.085\n 1.032\n 1.351\n 0.001\n 0.001\n 5873.0\n 2891.0\n 1.0\n \n \n y_mean[0]\n 1.533\n 0.230\n 1.093\n 1.959\n 0.003\n 0.002\n 5481.0\n 2857.0\n 1.0\n \n \n y_mean[1]\n 1.546\n 0.227\n 1.113\n 1.963\n 0.003\n 0.002\n 5487.0\n 2857.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n y_mean[98]\n 2.722\n 0.227\n 2.288\n 3.143\n 0.003\n 0.002\n 5461.0\n 3205.0\n 1.0\n \n \n y_mean[99]\n 2.734\n 0.230\n 2.307\n 3.176\n 0.003\n 0.002\n 5454.0\n 3232.0\n 1.0\n \n \n y_mean[100]\n 1.653\n 0.197\n 1.290\n 2.027\n 0.003\n 0.002\n 5512.0\n 3038.0\n 1.0\n \n \n y_mean[101]\n 1.714\n 0.181\n 1.376\n 2.048\n 0.002\n 0.002\n 5539.0\n 3273.0\n 1.0\n \n \n y_mean[102]\n 1.774\n 0.166\n 1.447\n 2.064\n 0.002\n 0.002\n 5572.0\n 3294.0\n 1.0\n \n \n\n106 rows × 9 columns\n\n\n\nRemember, the true intercept was 1, the true slope was 2. The recovered intercept is much higher, and the slope is much lower, so the influence of the outliers is apparent.\nVisually, looking at the recovered regression line and posterior predictive HDI highlights the problem further.\n\nplt.figure(figsize=(7, 5))\n# Plot Data\nplt.plot(x_out, y_out, \"x\", label=\"data\")\n# Plot recovered linear regression\nx_range = np.linspace(min(x_out), max(x_out), 2000)\ny_pred = gauss_fitted.posterior.x.mean().item() * x_range + gauss_fitted.posterior.Intercept.mean().item()\nplt.plot(x_range, y_pred, \n color=\"black\",linestyle=\"--\",\n label=\"Recovered regression line\"\n )\n# Plot HDIs\nfor interval in [0.38, 0.68]:\n az.plot_hdi(x_out, gauss_fitted.posterior_predictive.y, \n hdi_prob=interval, color=\"firebrick\")\n# Plot true regression line\nplt.plot(x, true_regression_line, \n label=\"True regression line\", lw=2.0, color=\"black\")\nplt.legend(loc=0);\n\n\n\n\nThe recovered regression line, as well as the \\(0.5\\sigma\\) and \\(1\\sigma\\) bands are shown.\nClearly there is skew in the fit. At lower \\(x\\) values, the regression line is far higher than the true line. This is a result of the outliers, which cause the model to assume a higher value in that regime.\nAdditionally the uncertainty bands are too wide (remember, the \\(1\\sigma\\) band ought to cover 68% of the data, while here it covers most of the points). Due to the small probability mass in the tails of a normal distribution, the outliers have an large effect, causing the uncertainty bands to be oversized.\nClearly, assuming the data are distributed normally is inducing problems here. Bayesian robust linear regression forgoes the normality assumption by instead using a Student T distribution to describe the distribution of the data. 
The Student T distribution has thicker tails, and by allocating more probability mass to the tails, outliers have a less strong effect.\nComparing the two distributions,\n\nnormal_data = np.random.normal(loc=0, scale=1, size=100_000)\nt_data = np.random.standard_t(df=1, size=100_000)\n\nbins = np.arange(-8,8,0.15)\nplt.hist(normal_data, \n bins=bins, density=True,\n alpha=0.6,\n label=\"Normal\"\n )\nplt.hist(t_data, \n bins=bins,density=True,\n alpha=0.6,\n label=\"Student T\"\n )\nplt.xlabel(\"x\")\nplt.ylabel(\"Probability density\")\nplt.xlim(-8,8)\nplt.legend();\n\n\n\n\nAs we can see, the tails of the Student T are much larger, which means values far from the mean are more likely when compared to the normal distribution.\nThe T distribution is specified by a number of degrees of freedom (\\(\\nu\\)). In numpy.random.standard_t this is the parameter df, in the pymc T distribution, it’s nu. It is constrained to real numbers greater than 0. As the degrees of freedom increase, the probability in the tails Student T distribution decrease. In the limit of \\(\\nu \\rightarrow + \\infty\\), the Student T distribution is a normal distribution. Below, the T distribution is plotted for various \\(\\nu\\).\n\nbins = np.arange(-8,8,0.15)\nfor ndof in [0.1, 1, 10]:\n\n t_data = np.random.standard_t(df=ndof, size=100_000)\n\n plt.hist(t_data, \n bins=bins,density=True,\n label=f\"$\\\\nu = {ndof}$\",\n histtype=\"step\"\n )\nplt.hist(normal_data, \n bins=bins, density=True,\n histtype=\"step\",\n label=\"Normal\"\n ) \n \nplt.xlabel(\"x\")\nplt.ylabel(\"Probability density\")\nplt.xlim(-6,6)\nplt.legend();\n\n\n\n\nIn Bambi, the way to specify a regression with Student T distributed data is by passing \"t\" to the family parameter of a Model.\n\nt_model = bmb.Model(\"y ~ x\", data, family=\"t\")\nt_fitted = t_model.fit(draws=2000, idata_kwargs={\"log_likelihood\": True})\nt_model.predict(t_fitted, kind=\"pps\")\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [y_sigma, y_nu, Intercept, x]\n\n\n\n\n\n\n\n \n \n 100.00% [6000/6000 00:06<00:00 Sampling 2 chains, 0 divergences]\n \n \n\n\nSampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 7 seconds.\n\n\n\naz.summary(t_fitted)\n\n\n\n\n\n \n \n \n mean\n sd\n hdi_3%\n hdi_97%\n mcse_mean\n mcse_sd\n ess_bulk\n ess_tail\n r_hat\n \n \n \n \n Intercept\n 0.994\n 0.107\n 0.797\n 1.199\n 0.002\n 0.001\n 4029.0\n 3029.0\n 1.0\n \n \n x\n 1.900\n 0.184\n 1.562\n 2.254\n 0.003\n 0.002\n 4172.0\n 3105.0\n 1.0\n \n \n y_sigma\n 0.405\n 0.046\n 0.321\n 0.492\n 0.001\n 0.001\n 4006.0\n 3248.0\n 1.0\n \n \n y_nu\n 2.601\n 0.620\n 1.500\n 3.727\n 0.011\n 0.008\n 3431.0\n 3063.0\n 1.0\n \n \n y_mean[0]\n 0.994\n 0.107\n 0.797\n 1.199\n 0.002\n 0.001\n 4029.0\n 3029.0\n 1.0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n y_mean[98]\n 2.875\n 0.103\n 2.688\n 3.079\n 0.001\n 0.001\n 4786.0\n 3228.0\n 1.0\n \n \n y_mean[99]\n 2.894\n 0.105\n 2.709\n 3.105\n 0.002\n 0.001\n 4768.0\n 3155.0\n 1.0\n \n \n y_mean[100]\n 1.184\n 0.091\n 1.009\n 1.350\n 0.001\n 0.001\n 4046.0\n 3140.0\n 1.0\n \n \n y_mean[101]\n 1.279\n 0.084\n 1.118\n 1.432\n 0.001\n 0.001\n 4074.0\n 3151.0\n 1.0\n \n \n y_mean[102]\n 1.374\n 0.077\n 1.232\n 1.519\n 0.001\n 0.001\n 4128.0\n 3194.0\n 1.0\n \n \n\n107 rows × 9 columns\n\n\n\nNote the new parameter in the model, y_nu. This is the aforementioned degrees of freedom. 
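To get a rough sense of how much heavier those tails are (a quick check, not part of the original notebook; 2.6 is simply the posterior mean of y_nu from the summary above), compare the two-sided tail mass beyond |x| = 4 under a standard normal and a Student T:

from scipy import stats

print(2 * stats.norm.sf(4))     # ~6e-5 for the normal
print(2 * stats.t.sf(4, 2.6))   # orders of magnitude larger for a Student T with nu near the fitted y_nu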
If this number were very high, we would expect it to be well described by a normal distribution. However, the HDI of this spans from 1.5 to 3.7, meaning that the tails are much heavier than a normal distribution. As a result of the heavier tails, y_sigma has also dropped precipitously from the normal model, meaning the oversized uncertainty bands from above have shrunk.\nComparing the extracted values of the two models,\n\ndef get_slope_intercept(mod):\n return (\n mod.posterior.x.mean().item(),\n mod.posterior.Intercept.mean().item()\n )\ngauss_slope, gauss_int = get_slope_intercept(gauss_fitted)\nt_slope, t_int = get_slope_intercept(t_fitted)\n\npd.DataFrame({\n \"Model\":[\"True\",\"Normal\",\"T\"],\n \"Slope\":[2, gauss_slope, t_slope],\n \"Intercept\": [1, gauss_int, t_int]\n}).set_index(\"Model\").T.round(decimals=2)\n\n\n\n\n\n \n \n Model\n True\n Normal\n T\n \n \n \n \n Slope\n 2.0\n 1.20\n 1.90\n \n \n Intercept\n 1.0\n 1.53\n 0.99\n \n \n\n\n\n\nHere we can see the mean recovered values of both the slope and intercept are far closer to the true values using the robust regression model compared to the normally distributed one.\nVisually comparing the robust regression line,\n\nplt.figure(figsize=(7, 5))\n# Plot Data\nplt.plot(x_out, y_out, \"x\", label=\"data\")\n# Plot recovered robust linear regression\nx_range = np.linspace(min(x_out), max(x_out), 2000)\ny_pred = t_fitted.posterior.x.mean().item() * x_range + t_fitted.posterior.Intercept.mean().item()\nplt.plot(x_range, y_pred, \n color=\"black\",linestyle=\"--\",\n label=\"Recovered regression line\"\n )\n# Plot HDIs\nfor interval in [0.05, 0.38, 0.68]:\n az.plot_hdi(x_out, t_fitted.posterior_predictive.y, \n hdi_prob=interval, color=\"firebrick\")\n# Plot true regression line\nplt.plot(x, true_regression_line, \n label=\"true regression line\", lw=2.0, color=\"black\")\nplt.legend(loc=0);\n\n\n\n\nThis is much better. The true and recovered regression lines are much closer, and the uncertainty bands are appropriate sized. The effect of the outliers is not entirely gone, the recovered line still slightly differs from the true line, but the effect is far smaller, which is a result of the Student T likelihood function ascribing a higher probability to outliers than the normal distribution. Additionally, this inference is based on sampling methods, so it is expected to have small differences (especially given a relatively small number of samples).\nLast, another way to evaluate the models is to compare based on Leave-one-out Cross-validation (LOO), which provides an estimate of accuracy on out-of-sample predictions.\n\nmodels = {\n \"gaussian\": gauss_fitted,\n \"Student T\": t_fitted\n}\ndf_compare = az.compare(models)\ndf_compare\n\n/home/tomas/anaconda3/envs/bambi/lib/python3.10/site-packages/arviz/stats/stats.py:803: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. 
This is more likely to happen with a non-robust model and highly influential observations.\n warnings.warn(\n\n\n\n\n\n\n \n \n \n rank\n elpd_loo\n p_loo\n elpd_diff\n weight\n se\n dse\n warning\n scale\n \n \n \n \n Student T\n 0\n -101.760564\n 5.603439\n 0.000000\n 1.000000e+00\n 14.994794\n 0.000000\n False\n log\n \n \n gaussian\n 1\n -171.732028\n 14.081743\n 69.971464\n 3.053913e-11\n 29.382970\n 17.542539\n True\n log\n \n \n\n\n\n\n\naz.plot_compare(df_compare, insample_dev=False);\n\n\n\n\nHere it is quite obvious that the Student T model is much better, due to having a clearly larger value of LOO.\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Thu Jan 05 2023\n\nPython implementation: CPython\nPython version : 3.10.4\nIPython version : 8.5.0\n\nbambi : 0.9.3\npandas : 1.5.2\nnumpy : 1.23.5\nmatplotlib: 3.6.2\nsys : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]\narviz : 0.14.0\n\nWatermark: 2.3.1" }, { - "objectID": "api/Link.html", - "href": "api/Link.html", + "objectID": "api/Prior.html", + "href": "api/Prior.html", "title": "Bambi", "section": "", - "text": "families.Link(self, name, link=None, linkinv=None, linkinv_backend=None)\nRepresentation of a link function.\nThis object contains two main functions. One is the link function itself, the function that maps values in the response scale to the linear predictor, and the other is the inverse of the link function, that maps values of the linear predictor to the response scale.\nThe great majority of users will never interact with this class unless they want to create a custom Family with a custom Link. This is automatically handled for all the built-in families.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nThe name of the link function. If it is a known name, it’s not necessary to pass any other arguments because functions are already defined internally. If not known, all of link, linkinv and linkinv_backend must be specified.\nrequired\n\n\nlink\nfunction\nA function that maps the response to the linear predictor. Known as the :math:g function in GLM jargon. Does not need to be specified when name is a known name.\nNone\n\n\nlinkinv\nfunction\nA function that maps the linear predictor to the response. Known as the :math:g^{-1} function in GLM jargon. Does not need to be specified when name is a known name.\nNone\n\n\nlinkinv_backend\nfunction\nSame than linkinv but must be something that works with PyMC backend (i.e. it must work with PyTensor tensors). Does not need to be specified when name is a known name.\nNone" + "text": "priors.Prior(self, name, auto_scale=True, dist=None, **kwargs)\nAbstract specification of a term prior.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nName of prior distribution. Must be the name of a PyMC distribution (e.g., \"Normal\", \"Bernoulli\", etc.)\nrequired\n\n\nauto_scale\n\nWhether to adjust the parameters of the prior or use them as passed. Default to True.\nTrue\n\n\nkwargs\ndict\nOptional keywords specifying the parameters of the named distribution.\n{}\n\n\ndist\npymc.distributions.distribution.DistributionMeta or callable\nA callable that returns a valid PyMC distribution. 
The signature must contain name, dims, and shape, as well as its own keyworded arguments.\nNone\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nupdate\nUpdate the arguments of the prior with additional arguments.\n\n\n\n\n\nPrior.update(self, **kwargs)\nUpdate the arguments of the prior with additional arguments.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkwargs\ndict\nOptional keyword arguments to add to prior args.\n{}" }, { "objectID": "api/interpret.slopes.html", @@ -287,46 +259,46 @@ "text": "interpret.slopes(model, idata, wrt, conditional=None, average_by=None, eps=0.0001, slope='dydx', use_hdi=True, prob=None, transforms=None, sample_new_groups=False)\nCompute Conditional Adjusted Slopes\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\nwrt\n(str, dict)\nThe slope of the regression with respect to (wrt) this predictor will be computed.\nrequired\n\n\nconditional\n(str, dict, list)\nThe covariates we would like to condition on.\nNone\n\n\naverage_by\nUnion[str, list, bool, None]\nThe covariates we would like to average by. The passed covariate(s) will marginalize over the other covariates in the model. If True, it averages over all covariates in the model to obtain the average estimate. Defaults to None.\nNone\n\n\neps\nfloat\nTo compute the slope, ‘wrt’ is evaluated at wrt +/- ‘eps’. The rate of change is then computed as the difference between the two values divided by ‘eps’. Defaults to 1e-4.\n0.0001\n\n\nslope\nstr\nThe type of slope to compute. Defaults to ‘dydx’. ‘dydx’ represents a unit increase in ‘wrt’ is associated with an n-unit change in the response. ‘eyex’ represents a percentage increase in ‘wrt’ is associated with an n-percent change in the response. ‘eydx’ represents a unit increase in ‘wrt’ is associated with an n-percent change in the response. ‘dyex’ represents a percent change in ‘wrt’ is associated with a unit increase in the response.\n'dydx'\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame\nA dataframe with the comparison values, highest density interval, wrt name, contrast value, and conditional values.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf length of wrt is greater than 1. If conditional is None and wrt is passed more than 2 values. If conditional is None and default wrt has more than 2 unique values. If slope is not ‘dydx’, ‘dyex’, ‘eyex’, or ‘eydx’. If prob is not > 0 and < 1." 
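A minimal usage sketch for the function documented above (the names y, x, z and the data frame data are illustrative assumptions, not taken from the reference entry): fit a model, then request the slope of the response with respect to x conditional on z; per the entry above, the result is returned as a pandas DataFrame.

import bambi as bmb

model = bmb.Model("y ~ x + z", data)   # `data` is an assumed pandas DataFrame with columns y, x, z
idata = model.fit()
slopes_df = bmb.interpret.slopes(model, idata, wrt="x", conditional="z")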
}, { - "objectID": "api/interpret.plot_predictions.html", - "href": "api/interpret.plot_predictions.html", + "objectID": "api/load_data.html", + "href": "api/load_data.html", "title": "Bambi", "section": "", - "text": "interpret.plot_predictions(model, idata, covariates, target='mean', sample_new_groups=False, pps=False, use_hdi=True, prob=None, transforms=None, legend=True, ax=None, fig_kwargs=None, subplot_kwargs=None)\nPlot Conditional Adjusted Predictions\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncovariates\nlist or dict\nA sequence of between one and three names of variables in the model.\nrequired\n\n\ntarget\nstr\nWhich model parameter to plot. Defaults to ‘mean’. Passing a parameter into target only works when pps is False as the target may not be available in the posterior predictive distribution.\n'mean'\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\npps\nbool\nWhether to plot the posterior predictive samples. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\nlegend\nbool\nWhether to automatically include a legend in the plot. Defaults to True.\nTrue\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nax\nmatplotlib.axes._subplots.AxesSubplot\nA matplotlib axes object or a sequence of them. If None, this function instantiates a new axes object. Defaults to None.\nNone\n\n\nfig_kwargs\noptional\nKeyword arguments passed to the matplotlib figure function as a dict. For example, fig_kwargs=dict(figsize=(11, 8)), sharey=True would make the figure 11 inches wide by 8 inches high and would share the y-axis values.\nNone\n\n\nsubplot_kwargs\noptional\nKeyword arguments used to determine the covariates used for the horizontal, group, and panel axes. For example, subplot_kwargs=dict(main=\"x\", group=\"y\", panel=\"z\") would plot the horizontal axis as x, the color (hue) as y, and the panel axis as z.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\n(matplotlib.figure.Figure, matplotlib.axes._subplots.AxesSubplot)\nA tuple with the figure and the axes.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nWhen level is not within 0 and 1. When the main covariate is not numeric or categoric.\n\n\nTypeError\nWhen covariates is not a string or a list of strings." + "text": "data.load_data(dataset=None, data_home=None)\nLoad a dataset.\nRun with no parameters to get a list of all available data sets.\nThe directory to save can also be set with the environment variable BAMBI_HOME. The checksum of the dataset is checked against a hardcoded value to watch for data corruption. 
Run bmb.clear_data_home() to clear the data directory.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\n\nName of dataset to load.\nNone\n\n\ndata_home\n\nWhere to save remote datasets\nNone\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame" }, { - "objectID": "api/interpret.plot_comparisons.html", - "href": "api/interpret.plot_comparisons.html", + "objectID": "api/Family.html", + "href": "api/Family.html", "title": "Bambi", "section": "", - "text": "interpret.plot_comparisons(model, idata, contrast, conditional=None, average_by=None, comparison_type='diff', sample_new_groups=False, use_hdi=True, prob=None, legend=True, transforms=None, ax=None, fig_kwargs=None, subplot_kwargs=None)\nPlot Conditional Adjusted Comparisons\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncontrast\n(str, dict, list)\nThe predictor name whose contrast we would like to compare.\nrequired\n\n\nconditional\n(str, dict, list)\nThe covariates we would like to condition on.\nNone\n\n\naverage_by\nUnion[str, list]\nThe covariates we would like to average by. The passed covariate(s) will marginalize over the other covariates in the model. Defaults to None.\nNone\n\n\ncomparison_type\nstr\nThe type of comparison to plot. Defaults to ‘diff’.\n'diff'\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\nlegend\nbool\nWhether to automatically include a legend in the plot. Defaults to True.\nTrue\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nax\nmatplotlib.axes._subplots.AxesSubplot\nA matplotlib axes object or a sequence of them. If None, this function instantiates a new axes object. Defaults to None.\nNone\n\n\nfig_kwargs\noptional\nKeyword arguments passed to the matplotlib figure function as a dict. For example, fig_kwargs=dict(figsize=(11, 8)), sharey=True would make the figure 11 inches wide by 8 inches high and would share the y-axis values.\nNone\n\n\nsubplot_kwargs\noptional\nKeyword arguments used to determine the covariates used for the horizontal, group, and panel axes. For example, subplot_kwargs=dict(main=\"x\", group=\"y\", panel=\"z\") would plot the horizontal axis as x, the color (hue) as y, and the panel axis as z.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\n(matplotlib.figure.Figure, matplotlib.axes._subplots.AxesSubplot)\nA tuple with the figure and the axes.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf conditional and average_by are both None. If length of conditional is greater than 3 and average_by is None.\n\n\nWarning\nIf length of contrast is greater than 2." 
+ "text": "families.Family(self, name, likelihood, link)\nA specification of model family.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nThe name of the family. It can be any string.\nrequired\n\n\nlikelihood\nLikelihood\nA bambi.families.Likelihood instance specifying the model likelihood function.\nrequired\n\n\nlink\nUnion[str, Dict[str, Union[str, Link]]]\nThe link function that’s used for every parameter in the likelihood function. Keys are the names of the parameters and values are the link functions. These can be a str with a name or a bambi.families.Link instance. The link function transforms the linear predictors.\nrequired\n\n\n\n\n\n\n>>> import bambi as bmb\nReplicate the Gaussian built-in family.\n>>> sigma_prior = bmb.Prior(\"HalfNormal\", sigma=1)\n>>> likelihood = bmb.Likelihood(\"Gaussian\", params=[\"mu\", \"sigma\"], parent=\"mu\")\n>>> family = bmb.Family(\"gaussian\", likelihood, \"identity\")\n>>> bmb.Model(\"y ~ x\", data, family=family, priors={\"sigma\": sigma_prior})\nReplicate the Bernoulli built-in family.\n>>> likelihood = bmb.Likelihood(\"Bernoulli\", parent=\"p\")\n>>> family = bmb.Family(\"bernoulli\", likelihood, \"logit\")\n>>> bmb.Model(\"y ~ x\", data, family=family)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nposterior_predictive\nGet draws from the posterior predictive distribution\n\n\nset_default_priors\nSet default priors for non-parent parameters\n\n\n\n\n\nFamily.posterior_predictive(self, model, posterior, **kwargs)\nGet draws from the posterior predictive distribution\nThis function works for almost all the families. It grabs the draws for the parameters needed in the response distribution, and then gets samples from the posterior predictive distribution using pm.draw(). It won’t work when the response distribution requires parameters that are not available in posterior.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model\nrequired\n\n\nposterior\nxr.Dataset\nThe xarray dataset that contains the draws for all the parameters in the posterior. It must contain the parameters that are needed in the distribution of the response, or the parameters that allow to derive them.\nrequired\n\n\nkwargs\n\nParameters that are used to get draws but do not appear in the posterior object or other configuration parameters. 
For instance, the ‘n’ in binomial models and multinomial models.\n{}\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nxr.DataArray\nA data array with the draws from the posterior predictive distribution\n\n\n\n\n\n\n\nFamily.set_default_priors(self, priors)\nSet default priors for non-parent parameters\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npriors\ndict\nThe keys are the names of non-parent parameters and the values are their default priors.\nrequired" }, { - "objectID": "api/Model.html", - "href": "api/Model.html", + "objectID": "api/Link.html", + "href": "api/Link.html", "title": "Bambi", "section": "", - "text": "Model(self, formula, data, family='gaussian', priors=None, link=None, categorical=None, potentials=None, dropna=False, auto_scale=True, noncentered=True, center_predictors=True, extra_namespace=None)\nSpecification of model class.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nformula\nstr or bambi.formula.Formula\nA model description written using the formula syntax from the formulae library.\nrequired\n\n\ndata\npandas.DataFrame\nA pandas dataframe containing the data on which the model will be fit, with column names matching variables defined in the formula.\nrequired\n\n\nfamily\nstr or bambi.families.Family\nA specification of the model family (analogous to the family object in R). Either a string, or an instance of class bambi.families.Family. If a string is passed, a family with the corresponding name must be defined in the defaults loaded at Model initialization. Valid pre-defined families are \"bernoulli\", \"beta\", \"binomial\", \"categorical\", \"gamma\", \"gaussian\", \"negativebinomial\", \"poisson\", \"t\", and \"wald\". Defaults to \"gaussian\".\n'gaussian'\n\n\npriors\ndict\nOptional specification of priors for one or more terms. A dictionary where the keys are the names of terms in the model, “common,” or “group_specific” and the values are instances of class Prior. If priors are unset, uses automatic priors inspired by the R rstanarm library.\nNone\n\n\nlink\nstr or Dict[str, str]\nThe name of the link function to use. Valid names are \"cloglog\", \"identity\", \"inverse_squared\", \"inverse\", \"log\", \"logit\", \"probit\", and \"softmax\". Not all the link functions can be used with all the families. If a dictionary, keys are the names of the target parameters and the values are the names of the link functions.\nNone\n\n\ncategorical\nstr or list\nThe names of any variables to treat as categorical. Can be either a single variable name, or a list of names. If categorical is None, the data type of the columns in the data will be used to infer handling. In cases where numeric columns are to be treated as categorical (e.g., group specific factors coded as numerical IDs), explicitly passing variable names via this argument is recommended.\nNone\n\n\npotentials\nA list of 2-tuples.\nOptional specification of potentials. A potential is an arbitrary expression added to the likelihood, this is generally useful to add constrains to models, that are difficult to express otherwise. The first term of a 2-tuple is the name of a variable in the model, the second a lambda function expressing the desired constraint. If a constraint involves n variables, you can pass n 2-tuples or pass a tuple which first element is a n-tuple and second element is a lambda function with n arguments. 
The number and order of the lambda function has to match the number and order of the variables names.\nNone\n\n\ndropna\nbool\nWhen True, rows with any missing values in either the predictors or outcome are automatically dropped from the dataset in a listwise manner.\nFalse\n\n\nauto_scale\nbool\nIf True (default), priors are automatically rescaled to the data (to be weakly informative) any time default priors are used. Note that any priors explicitly set by the user will always take precedence over default priors.\nTrue\n\n\nnoncentered\nbool\nIf True (default), uses a non-centered parameterization for normal hyperpriors on grouped parameters. If False, naive (centered) parameterization is used.\nTrue\n\n\ncenter_predictors\nbool\nIf True (default), and if there is an intercept in the common terms, the data is centered by subtracting the mean. The centering is undone after sampling to provide the actual intercept in all distributional components that have an intercept. Note that this changes the interpretation of the prior on the intercept because it refers to the intercept of the centered data.\nTrue\n\n\nextra_namespace\ndict\nAdditional user supplied variables with transformations or data to include in the environment where the formula is evaluated. Defaults to None.\nNone\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nbuild\nSet up the model for sampling/fitting.\n\n\nfit\nFit the model using PyMC.\n\n\ngraph\nProduce a graphviz Digraph from a built Bambi model.\n\n\nplot_priors\nSamples from the prior distribution and plots its marginals.\n\n\npredict\nPredict method for Bambi models\n\n\nprior_predictive\nGenerate samples from the prior predictive distribution.\n\n\nset_alias\nSet aliases for the terms and auxiliary parameters in the model\n\n\nset_priors\nSet priors for one or more existing terms.\n\n\n\n\n\nModel.build(self)\nSet up the model for sampling/fitting.\nCreates an instance of the underlying PyMC model and adds all the necessary terms to it.\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone\n\n\n\n\n\n\n\n\nModel.fit(self, draws=1000, tune=1000, discard_tuned_samples=True, omit_offsets=True, include_mean=False, inference_method='mcmc', init='auto', n_init=50000, chains=None, cores=None, random_seed=None, **kwargs)\nFit the model using PyMC.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\n\nThe number of samples to draw from the posterior distribution. Defaults to 1000.\n1000\n\n\ntune\nint\nNumber of iterations to tune. Defaults to 1000. Samplers adjust the step sizes, scalings or similar during tuning. These tuning samples are be drawn in addition to the number specified in the draws argument, and will be discarded unless discard_tuned_samples is set to False.\n1000\n\n\ndiscard_tuned_samples\nbool\nWhether to discard posterior samples of the tune interval. Defaults to True.\nTrue\n\n\nomit_offsets\nbool\nOmits offset terms in the InferenceData object returned when the model includes group specific effects. Defaults to True.\nTrue\n\n\ninclude_mean\nbool\nCompute the posterior of the mean response. Defaults to False.\nFalse\n\n\ninference_method\nstr\nThe method to use for fitting the model. By default, \"mcmc\". This automatically assigns a MCMC method best suited for each kind of variables, like NUTS for continuous variables and Metropolis for non-binary discrete ones. Alternatively, \"vi\", in which case the model will be fitted using variational inference as implemented in PyMC using the fit function. 
Finally, \"laplace\", in which case a Laplace approximation is used and is not recommended other than for pedagogical use. To use the PyMC numpyro and blackjax samplers, use nuts_numpyro or nuts_blackjax respectively. Both methods will only work if you can use NUTS sampling, so your model must be differentiable.\n'mcmc'\n\n\ninit\nstr\nInitialization method. Defaults to \"auto\". The available methods are: * auto: Use \"jitter+adapt_diag\" and if this method fails it uses \"adapt_diag\". * adapt_diag: Start with a identity mass matrix and then adapt a diagonal based on the variance of the tuning samples. All chains use the test value (usually the prior mean) as starting point. * jitter+adapt_diag: Same as \"adapt_diag\", but use test value plus a uniform jitter in [-1, 1] as starting point in each chain. * advi+adapt_diag: Run ADVI and then adapt the resulting diagonal mass matrix based on the sample variance of the tuning samples. * advi+adapt_diag_grad: Run ADVI and then adapt the resulting diagonal mass matrix based on the variance of the gradients during tuning. This is experimental and might be removed in a future release. * advi: Run ADVI to estimate posterior mean and diagonal mass matrix. * advi_map: Initialize ADVI with MAP and use MAP as starting point. * map: Use the MAP as starting point. This is strongly discouraged. * adapt_full: Adapt a dense mass matrix using the sample covariances. All chains use the test value (usually the prior mean) as starting point. * jitter+adapt_full: Same as \"adapt_full\", but use test value plus a uniform jitter in [-1, 1] as starting point in each chain.\n'auto'\n\n\nn_init\nint\nNumber of initialization iterations. Only works for \"advi\" init methods.\n50000\n\n\nchains\nint\nThe number of chains to sample. Running independent chains is important for some convergence statistics and can also reveal multiple modes in the posterior. If None, then set to either cores or 2, whichever is larger.\nNone\n\n\ncores\nint\nThe number of chains to run in parallel. If None, it is equal to the number of CPUs in the system unless there are more than 4 CPUs, in which case it is set to 4.\nNone\n\n\nrandom_seed\nint or list of ints\nA list is accepted if cores is greater than one.\nNone\n\n\n**kwargs\n\nFor other kwargs see the documentation for PyMC.sample().\n{}\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nAn ArviZ InferenceData instance if inference_method is \"mcmc\" (default),\n\n\n\n“nuts_numpyro”, “nuts_blackjax” or “laplace”.\n\n\n\nAn Approximation object if \"vi\".\n\n\n\n\n\n\n\n\nModel.graph(self, formatting='plain', name=None, figsize=None, dpi=300, fmt='png')\nProduce a graphviz Digraph from a built Bambi model.\nRequires graphviz, which may be installed most easily with conda install -c conda-forge python-graphviz\nAlternatively, you may install the graphviz binaries yourself, and then pip install graphviz to get the python bindings. See http://graphviz.readthedocs.io/en/stable/manual.html for more information.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nformatting\nstr\nOne of \"plain\" or \"plain_with_params\". Defaults to \"plain\".\n'plain'\n\n\nname\nstr\nName of the figure to save. Defaults to None, no figure is saved.\nNone\n\n\nfigsize\ntuple\nMaximum width and height of figure in inches. Defaults to None, the figure size is set automatically. If defined and the drawing is larger than the given size, the drawing is uniformly scaled down so that it fits within the given size. 
Only works if name is not None.\nNone\n\n\ndpi\nint\nPoint per inch of the figure to save. Defaults to 300. Only works if name is not None.\n300\n\n\nfmt\nstr\nFormat of the figure to save. Defaults to \"png\". Only works if name is not None.\n'png'\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\ngraphviz.Digraph\nThe graph\n\n\n\n\n\n\n\n\n\nmodel = Model(“y ~ x + (1|z)”) model.build() model.graph()\n\n\n\n\n\n\nmodel = Model(“y ~ x + (1|z)”) model.fit() model.graph()\n\n\n\n\n\n\n\nModel.plot_priors(self, draws=5000, var_names=None, random_seed=None, figsize=None, textsize=None, hdi_prob=None, round_to=2, point_estimate='mean', kind='kde', bins=None, omit_offsets=True, omit_group_specific=True, ax=None, **kwargs)\nSamples from the prior distribution and plots its marginals.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\nint\nNumber of draws to sample from the prior predictive distribution. Defaults to 5000.\n5000\n\n\nvar_names\nstr or list\nA list of names of variables for which to compute the prior predictive distribution. Defaults to None which means to include both observed and unobserved RVs.\nNone\n\n\nrandom_seed\nint\nSeed for the random number generator.\nNone\n\n\nfigsize\ntuple\nFigure size. If None it will be defined automatically.\nNone\n\n\ntextsize\nfloat\nText size scaling factor for labels, titles and lines. If None it will be autoscaled based on figsize.\nNone\n\n\nhdi_prob\nfloat or str\nPlots highest density interval for chosen percentage of density. Use \"hide\" to hide the highest density interval. Defaults to 0.94.\nNone\n\n\nround_to\nint\nControls formatting of floats. Defaults to 2 or the integer part, whichever is bigger.\n2\n\n\npoint_estimate\nstr\nPlot point estimate per variable. Values should be \"mean\", \"median\", \"mode\" or None. Defaults to \"auto\" i.e. it falls back to default set in ArviZ’s rcParams.\n'mean'\n\n\nkind\nstr\nType of plot to display (\"kde\" or \"hist\") For discrete variables this argument is ignored and a histogram is always used.\n'kde'\n\n\nbins\ninteger or sequence or auto\nControls the number of bins, accepts the same keywords matplotlib.pyplot.hist() does. Only works if kind == \"hist\". If None (default) it will use \"auto\" for continuous variables and range(xmin, xmax + 1) for discrete variables.\nNone\n\n\nomit_offsets\nbool\nWhether to omit offset terms in the plot. Defaults to True.\nTrue\n\n\nomit_group_specific\nbool\nWhether to omit group specific effects in the plot. Defaults to True.\nTrue\n\n\nax\nnumpy array-like of matplotlib axes or bokeh figures\nA 2D array of locations into which to plot the densities. If not supplied, ArviZ will create its own array of plot areas (and return it).\nNone\n\n\n**kwargs\n\nPassed as-is to matplotlib.pyplot.hist() or matplotlib.pyplot.plot() function depending on the value of kind.\n{}\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nmatplotlib axes\n\n\n\n\n\n\n\n\nModel.predict(self, idata, kind='mean', data=None, inplace=True, include_group_specific=True, sample_new_groups=False)\nPredict method for Bambi models\nObtains in-sample and out-of-sample predictions from a fitted Bambi model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nidata\nInferenceData\nThe InferenceData instance returned by .fit().\nrequired\n\n\nkind\nstr\nIndicates the type of prediction required. Can be \"mean\" or \"pps\". 
The first returns draws from the posterior distribution of the mean, while the latter returns the draws from the posterior predictive distribution (i.e. the posterior probability distribution for a new observation) in addition to the mean posterior distribution. Defaults to \"mean\".\n'mean'\n\n\ndata\npandas.DataFrame or None\nAn optional data frame with values for the predictors that are used to obtain out-of-sample predictions. If omitted, the original dataset is used.\nNone\n\n\ninplace\nbool\nIf True it will modify idata in-place. Otherwise, it will return a copy of idata with the predictions added. If kind=\"mean\", a new variable ending in \"_mean\" is added to the posterior group. If kind=\"pps\", it appends a posterior_predictive group to idata. If any of these already exist, it will be overwritten.\nTrue\n\n\ninclude_group_specific\nbool\nDetermines if predictions incorporate group-specific effects. If False, predictions are made with common effects only (i.e. group specific are set to zero). Defaults to True.\nTrue\n\n\nsample_new_groups\nbool\nSpecifies if it is allowed to obtain predictions for new groups of group-specific terms. When True, each posterior sample for the new groups is drawn from the posterior draws of a randomly selected existing group. Since different groups may be selected at each draw, the end result represents the variation across existing groups. The method implemented is quivalent to sample_new_levels=\"uncertainty\" in brms.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nInferenceData or None\n\n\n\n\n\n\n\n\nModel.prior_predictive(self, draws=500, var_names=None, omit_offsets=True, random_seed=None)\nGenerate samples from the prior predictive distribution.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\nint\nNumber of draws to sample from the prior predictive distribution. Defaults to 500.\n500\n\n\nvar_names\nstr or list\nA list of names of variables for which to compute the prior predictive distribution. Defaults to None which means both observed and unobserved RVs.\nNone\n\n\nomit_offsets\nbool\nWhether to omit offset terms in the plot. Defaults to True.\nTrue\n\n\nrandom_seed\nint\nSeed for the random number generator.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nInferenceData\nInferenceData object with the groups prior, prior_predictive and observed_data.\n\n\n\n\n\n\n\nModel.set_alias(self, aliases)\nSet aliases for the terms and auxiliary parameters in the model\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\naliases\ndict\nA dictionary where key represents the original term name and the value is the alias.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone\n\n\n\n\n\n\n\n\nModel.set_priors(self, priors=None, common=None, group_specific=None)\nSet priors for one or more existing terms.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npriors\ndict\nDictionary of priors to update. 
Keys are names of terms to update; values are the new priors (either a Prior instance, or an int or float that scales the default priors).\nNone\n\n\ncommon\nPrior, int, or float\nA prior specification to apply to all common terms included in the model.\nNone\n\n\ngroup_specific\nPrior, int, or float\nA prior specification to apply to all group specific terms included in the model.\nNone\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone" + "text": "families.Link(self, name, link=None, linkinv=None, linkinv_backend=None)\nRepresentation of a link function.\nThis object contains two main functions. One is the link function itself, the function that maps values in the response scale to the linear predictor, and the other is the inverse of the link function, that maps values of the linear predictor to the response scale.\nThe great majority of users will never interact with this class unless they want to create a custom Family with a custom Link. This is automatically handled for all the built-in families.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nThe name of the link function. If it is a known name, it’s not necessary to pass any other arguments because functions are already defined internally. If not known, all of link, linkinv and linkinv_backend must be specified.\nrequired\n\n\nlink\nfunction\nA function that maps the response to the linear predictor. Known as the :math:g function in GLM jargon. Does not need to be specified when name is a known name.\nNone\n\n\nlinkinv\nfunction\nA function that maps the linear predictor to the response. Known as the :math:g^{-1} function in GLM jargon. Does not need to be specified when name is a known name.\nNone\n\n\nlinkinv_backend\nfunction\nSame than linkinv but must be something that works with PyMC backend (i.e. it must work with PyTensor tensors). Does not need to be specified when name is a known name.\nNone" }, { - "objectID": "api/interpret.comparisons.html", - "href": "api/interpret.comparisons.html", + "objectID": "api/interpret.plot_comparisons.html", + "href": "api/interpret.plot_comparisons.html", "title": "Bambi", "section": "", - "text": "interpret.comparisons(model, idata, contrast, conditional=None, average_by=None, comparison_type='diff', use_hdi=True, prob=None, transforms=None, sample_new_groups=False)\nCompute Conditional Adjusted Comparisons\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncontrast\n(str, dict)\nThe predictor name whose contrast we would like to compare.\nrequired\n\n\nconditional\n(str, dict, list)\nThe covariates we would like to condition on.\nNone\n\n\naverage_by\nUnion[str, list, bool, None]\nThe covariates we would like to average by. The passed covariate(s) will marginalize over the other covariates in the model. If True, it averages over all covariates in the model to obtain the average estimate. Defaults to None.\nNone\n\n\ncomparison_type\nstr\nThe type of comparison to plot. Defaults to ‘diff’.\n'diff'\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. 
Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame\nA dataframe with the comparison values, highest density interval, contrast name, contrast value, and conditional values.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf wrt is a dict and length of contrast is greater than 1. If wrt is a dict and length of contrast is greater than 2 and conditional is None. If conditional is None and contrast is categorical with > 2 values. If comparison_type is not ‘diff’ or ‘ratio’. If prob is not > 0 and < 1." + "text": "interpret.plot_comparisons(model, idata, contrast, conditional=None, average_by=None, comparison_type='diff', sample_new_groups=False, use_hdi=True, prob=None, legend=True, transforms=None, ax=None, fig_kwargs=None, subplot_kwargs=None)\nPlot Conditional Adjusted Comparisons\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncontrast\n(str, dict, list)\nThe predictor name whose contrast we would like to compare.\nrequired\n\n\nconditional\n(str, dict, list)\nThe covariates we would like to condition on.\nNone\n\n\naverage_by\nUnion[str, list]\nThe covariates we would like to average by. The passed covariate(s) will marginalize over the other covariates in the model. Defaults to None.\nNone\n\n\ncomparison_type\nstr\nThe type of comparison to plot. Defaults to ‘diff’.\n'diff'\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\nlegend\nbool\nWhether to automatically include a legend in the plot. Defaults to True.\nTrue\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nax\nmatplotlib.axes._subplots.AxesSubplot\nA matplotlib axes object or a sequence of them. If None, this function instantiates a new axes object. Defaults to None.\nNone\n\n\nfig_kwargs\noptional\nKeyword arguments passed to the matplotlib figure function as a dict. For example, fig_kwargs=dict(figsize=(11, 8)), sharey=True would make the figure 11 inches wide by 8 inches high and would share the y-axis values.\nNone\n\n\nsubplot_kwargs\noptional\nKeyword arguments used to determine the covariates used for the horizontal, group, and panel axes. 
For example, subplot_kwargs=dict(main=\"x\", group=\"y\", panel=\"z\") would plot the horizontal axis as x, the color (hue) as y, and the panel axis as z.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\n(matplotlib.figure.Figure, matplotlib.axes._subplots.AxesSubplot)\nA tuple with the figure and the axes.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf conditional and average_by are both None. If length of conditional is greater than 3 and average_by is None.\n\n\nWarning\nIf length of contrast is greater than 2." }, { - "objectID": "api/Prior.html", - "href": "api/Prior.html", + "objectID": "api/interpret.predictions.html", + "href": "api/interpret.predictions.html", "title": "Bambi", "section": "", - "text": "priors.Prior(self, name, auto_scale=True, dist=None, **kwargs)\nAbstract specification of a term prior.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nName of prior distribution. Must be the name of a PyMC distribution (e.g., \"Normal\", \"Bernoulli\", etc.)\nrequired\n\n\nauto_scale\n\nWhether to adjust the parameters of the prior or use them as passed. Default to True.\nTrue\n\n\nkwargs\ndict\nOptional keywords specifying the parameters of the named distribution.\n{}\n\n\ndist\npymc.distributions.distribution.DistributionMeta or callable\nA callable that returns a valid PyMC distribution. The signature must contain name, dims, and shape, as well as its own keyworded arguments.\nNone\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nupdate\nUpdate the arguments of the prior with additional arguments.\n\n\n\n\n\nPrior.update(self, **kwargs)\nUpdate the arguments of the prior with additional arguments.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkwargs\ndict\nOptional keyword arguments to add to prior args.\n{}" + "text": "interpret.predictions(model, idata, covariates, target='mean', pps=False, use_hdi=True, prob=None, transforms=None, sample_new_groups=False)\nCompute Conditional Adjusted Predictions\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncovariates\nlist or dict\nA sequence of between one and three names of variables or a dict of length between one and three. If a sequence, the first variable is taken as the main variable and is mapped to the horizontal axis. If present, the second name is a coloring/grouping variable, and the third is mapped to different plot panels. If a dictionary, keys must be taken from (“main”, “group”, “panel”) and the values are the names of the variables.\nrequired\n\n\ntarget\nstr\nWhich model parameter to plot. Defaults to ‘mean’. Passing a parameter into target only works when pps is False as the target may not be available in the posterior predictive distribution.\n'mean'\n\n\npps\nbool\nWhether to plot the posterior predictive samples. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. 
The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame\nA DataFrame with the create_cap_data and model predictions.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf pps is True and target is not \"mean\". If passed covariates is not in correct key, value format. If length of covariates is not between 1 and 3." }, { - "objectID": "api/Family.html", - "href": "api/Family.html", + "objectID": "api/Likelihood.html", + "href": "api/Likelihood.html", "title": "Bambi", "section": "", - "text": "families.Family(self, name, likelihood, link)\nA specification of model family.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nThe name of the family. It can be any string.\nrequired\n\n\nlikelihood\nLikelihood\nA bambi.families.Likelihood instance specifying the model likelihood function.\nrequired\n\n\nlink\nUnion[str, Dict[str, Union[str, Link]]]\nThe link function that’s used for every parameter in the likelihood function. Keys are the names of the parameters and values are the link functions. These can be a str with a name or a bambi.families.Link instance. The link function transforms the linear predictors.\nrequired\n\n\n\n\n\n\n>>> import bambi as bmb\nReplicate the Gaussian built-in family.\n>>> sigma_prior = bmb.Prior(\"HalfNormal\", sigma=1)\n>>> likelihood = bmb.Likelihood(\"Gaussian\", params=[\"mu\", \"sigma\"], parent=\"mu\")\n>>> family = bmb.Family(\"gaussian\", likelihood, \"identity\")\n>>> bmb.Model(\"y ~ x\", data, family=family, priors={\"sigma\": sigma_prior})\nReplicate the Bernoulli built-in family.\n>>> likelihood = bmb.Likelihood(\"Bernoulli\", parent=\"p\")\n>>> family = bmb.Family(\"bernoulli\", likelihood, \"logit\")\n>>> bmb.Model(\"y ~ x\", data, family=family)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nposterior_predictive\nGet draws from the posterior predictive distribution\n\n\nset_default_priors\nSet default priors for non-parent parameters\n\n\n\n\n\nFamily.posterior_predictive(self, model, posterior, **kwargs)\nGet draws from the posterior predictive distribution\nThis function works for almost all the families. It grabs the draws for the parameters needed in the response distribution, and then gets samples from the posterior predictive distribution using pm.draw(). It won’t work when the response distribution requires parameters that are not available in posterior.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model\nrequired\n\n\nposterior\nxr.Dataset\nThe xarray dataset that contains the draws for all the parameters in the posterior. It must contain the parameters that are needed in the distribution of the response, or the parameters that allow to derive them.\nrequired\n\n\nkwargs\n\nParameters that are used to get draws but do not appear in the posterior object or other configuration parameters. 
For instance, the ‘n’ in binomial models and multinomial models.\n{}\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nxr.DataArray\nA data array with the draws from the posterior predictive distribution\n\n\n\n\n\n\n\nFamily.set_default_priors(self, priors)\nSet default priors for non-parent parameters\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npriors\ndict\nThe keys are the names of non-parent parameters and the values are their default priors.\nrequired" + "text": "families.Likelihood(self, name, params=None, parent=None, dist=None)\nRepresentation of a Likelihood function for a Bambi model.\nNotes: * parent must be in params * parent is inferred from the name if it is a known name\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nName of the likelihood function. Must be a valid PyMC distribution name.\nrequired\n\n\nparams\nSequence[str]\nThe name of the parameters the likelihood function accepts.\nNone\n\n\nparent\nstr\nOptional specification of the name of the mean parameter in the likelihood. This is the parameter whose transformation is modeled by the linear predictor.\nNone\n\n\ndist\npymc.distributions.distribution.DistributionMeta or callable\nOptional custom PyMC distribution that will be used to compute the likelihood.\nNone" }, { "objectID": "api/interpret.plot_slopes.html", @@ -343,25 +315,88 @@ "text": "Formula(self, formula, *additionals)\nModel formula\nAllows to describe a model with multiple formulas. The first formula describes the response variable and its predictors. The following formulas describe predictors for other parameters of the likelihood function, allowing distributional models.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nformula\nstr\nA model description written using the formula syntax from the formulae library.\nrequired\n\n\n*additionals\nstr\nAdditional formulas that describe the\n()\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncheck_additional\nCheck if an additional formula matches the expected format\n\n\ncheck_additionals\nCheck if the additional formulas match the expected format\n\n\nget_all_formulas\nGet all the model formulas\n\n\n\n\n\nFormula.check_additional(self, additional)\nCheck if an additional formula matches the expected format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nadditional\nstr\nA model formula that describes a model parameter.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf the formula does not contain a response term\n\n\nValueError\nIf the response term is not a plain name\n\n\n\n\n\n\n\nFormula.check_additionals(self, additionals)\nCheck if the additional formulas match the expected format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nadditionals\nSequence[str]\nModel formulas that describe model parameters rather than a response variable\nrequired\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nSequence[str]\nIf all formulas match the required format, it return them.\n\n\n\n\n\n\n\nFormula.get_all_formulas(self)\nGet all the model formulas\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist\nAll the formulas in the instance" }, { - "objectID": "api/interpret.predictions.html", - "href": "api/interpret.predictions.html", + "objectID": "api/Model.html", + "href": "api/Model.html", "title": "Bambi", "section": "", - "text": "interpret.predictions(model, idata, covariates, target='mean', pps=False, use_hdi=True, prob=None, transforms=None, sample_new_groups=False)\nCompute Conditional Adjusted 
Predictions\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncovariates\nlist or dict\nA sequence of between one and three names of variables or a dict of length between one and three. If a sequence, the first variable is taken as the main variable and is mapped to the horizontal axis. If present, the second name is a coloring/grouping variable, and the third is mapped to different plot panels. If a dictionary, keys must be taken from (“main”, “group”, “panel”) and the values are the names of the variables.\nrequired\n\n\ntarget\nstr\nWhich model parameter to plot. Defaults to ‘mean’. Passing a parameter into target only works when pps is False as the target may not be available in the posterior predictive distribution.\n'mean'\n\n\npps\nbool\nWhether to plot the posterior predictive samples. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame\nA DataFrame with the create_cap_data and model predictions.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf pps is True and target is not \"mean\". If passed covariates is not in correct key, value format. If length of covariates is not between 1 and 3." + "text": "Model(self, formula, data, family='gaussian', priors=None, link=None, categorical=None, potentials=None, dropna=False, auto_scale=True, noncentered=True, center_predictors=True, extra_namespace=None)\nSpecification of model class.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nformula\nstr or bambi.formula.Formula\nA model description written using the formula syntax from the formulae library.\nrequired\n\n\ndata\npandas.DataFrame\nA pandas dataframe containing the data on which the model will be fit, with column names matching variables defined in the formula.\nrequired\n\n\nfamily\nstr or bambi.families.Family\nA specification of the model family (analogous to the family object in R). Either a string, or an instance of class bambi.families.Family. If a string is passed, a family with the corresponding name must be defined in the defaults loaded at Model initialization. Valid pre-defined families are \"bernoulli\", \"beta\", \"binomial\", \"categorical\", \"gamma\", \"gaussian\", \"negativebinomial\", \"poisson\", \"t\", and \"wald\". Defaults to \"gaussian\".\n'gaussian'\n\n\npriors\ndict\nOptional specification of priors for one or more terms. A dictionary where the keys are the names of terms in the model, “common,” or “group_specific” and the values are instances of class Prior. 
If priors are unset, uses automatic priors inspired by the R rstanarm library.\nNone\n\n\nlink\nstr or Dict[str, str]\nThe name of the link function to use. Valid names are \"cloglog\", \"identity\", \"inverse_squared\", \"inverse\", \"log\", \"logit\", \"probit\", and \"softmax\". Not all the link functions can be used with all the families. If a dictionary, keys are the names of the target parameters and the values are the names of the link functions.\nNone\n\n\ncategorical\nstr or list\nThe names of any variables to treat as categorical. Can be either a single variable name, or a list of names. If categorical is None, the data type of the columns in the data will be used to infer handling. In cases where numeric columns are to be treated as categorical (e.g., group specific factors coded as numerical IDs), explicitly passing variable names via this argument is recommended.\nNone\n\n\npotentials\nA list of 2-tuples.\nOptional specification of potentials. A potential is an arbitrary expression added to the likelihood, this is generally useful to add constrains to models, that are difficult to express otherwise. The first term of a 2-tuple is the name of a variable in the model, the second a lambda function expressing the desired constraint. If a constraint involves n variables, you can pass n 2-tuples or pass a tuple which first element is a n-tuple and second element is a lambda function with n arguments. The number and order of the lambda function has to match the number and order of the variables names.\nNone\n\n\ndropna\nbool\nWhen True, rows with any missing values in either the predictors or outcome are automatically dropped from the dataset in a listwise manner.\nFalse\n\n\nauto_scale\nbool\nIf True (default), priors are automatically rescaled to the data (to be weakly informative) any time default priors are used. Note that any priors explicitly set by the user will always take precedence over default priors.\nTrue\n\n\nnoncentered\nbool\nIf True (default), uses a non-centered parameterization for normal hyperpriors on grouped parameters. If False, naive (centered) parameterization is used.\nTrue\n\n\ncenter_predictors\nbool\nIf True (default), and if there is an intercept in the common terms, the data is centered by subtracting the mean. The centering is undone after sampling to provide the actual intercept in all distributional components that have an intercept. Note that this changes the interpretation of the prior on the intercept because it refers to the intercept of the centered data.\nTrue\n\n\nextra_namespace\ndict\nAdditional user supplied variables with transformations or data to include in the environment where the formula is evaluated. 
Defaults to None.\nNone\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nbuild\nSet up the model for sampling/fitting.\n\n\nfit\nFit the model using PyMC.\n\n\ngraph\nProduce a graphviz Digraph from a built Bambi model.\n\n\nplot_priors\nSamples from the prior distribution and plots its marginals.\n\n\npredict\nPredict method for Bambi models\n\n\nprior_predictive\nGenerate samples from the prior predictive distribution.\n\n\nset_alias\nSet aliases for the terms and auxiliary parameters in the model\n\n\nset_priors\nSet priors for one or more existing terms.\n\n\n\n\n\nModel.build(self)\nSet up the model for sampling/fitting.\nCreates an instance of the underlying PyMC model and adds all the necessary terms to it.\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone\n\n\n\n\n\n\n\n\nModel.fit(self, draws=1000, tune=1000, discard_tuned_samples=True, omit_offsets=True, include_mean=False, inference_method='mcmc', init='auto', n_init=50000, chains=None, cores=None, random_seed=None, **kwargs)\nFit the model using PyMC.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\n\nThe number of samples to draw from the posterior distribution. Defaults to 1000.\n1000\n\n\ntune\nint\nNumber of iterations to tune. Defaults to 1000. Samplers adjust the step sizes, scalings or similar during tuning. These tuning samples are be drawn in addition to the number specified in the draws argument, and will be discarded unless discard_tuned_samples is set to False.\n1000\n\n\ndiscard_tuned_samples\nbool\nWhether to discard posterior samples of the tune interval. Defaults to True.\nTrue\n\n\nomit_offsets\nbool\nOmits offset terms in the InferenceData object returned when the model includes group specific effects. Defaults to True.\nTrue\n\n\ninclude_mean\nbool\nCompute the posterior of the mean response. Defaults to False.\nFalse\n\n\ninference_method\nstr\nThe method to use for fitting the model. By default, \"mcmc\". This automatically assigns a MCMC method best suited for each kind of variables, like NUTS for continuous variables and Metropolis for non-binary discrete ones. Alternatively, \"vi\", in which case the model will be fitted using variational inference as implemented in PyMC using the fit function. Finally, \"laplace\", in which case a Laplace approximation is used and is not recommended other than for pedagogical use. To use the PyMC numpyro and blackjax samplers, use nuts_numpyro or nuts_blackjax respectively. Both methods will only work if you can use NUTS sampling, so your model must be differentiable.\n'mcmc'\n\n\ninit\nstr\nInitialization method. Defaults to \"auto\". The available methods are: * auto: Use \"jitter+adapt_diag\" and if this method fails it uses \"adapt_diag\". * adapt_diag: Start with a identity mass matrix and then adapt a diagonal based on the variance of the tuning samples. All chains use the test value (usually the prior mean) as starting point. * jitter+adapt_diag: Same as \"adapt_diag\", but use test value plus a uniform jitter in [-1, 1] as starting point in each chain. * advi+adapt_diag: Run ADVI and then adapt the resulting diagonal mass matrix based on the sample variance of the tuning samples. * advi+adapt_diag_grad: Run ADVI and then adapt the resulting diagonal mass matrix based on the variance of the gradients during tuning. This is experimental and might be removed in a future release. * advi: Run ADVI to estimate posterior mean and diagonal mass matrix. * advi_map: Initialize ADVI with MAP and use MAP as starting point. * map: Use the MAP as starting point. 
This is strongly discouraged. * adapt_full: Adapt a dense mass matrix using the sample covariances. All chains use the test value (usually the prior mean) as starting point. * jitter+adapt_full: Same as \"adapt_full\", but use test value plus a uniform jitter in [-1, 1] as starting point in each chain.\n'auto'\n\n\nn_init\nint\nNumber of initialization iterations. Only works for \"advi\" init methods.\n50000\n\n\nchains\nint\nThe number of chains to sample. Running independent chains is important for some convergence statistics and can also reveal multiple modes in the posterior. If None, then set to either cores or 2, whichever is larger.\nNone\n\n\ncores\nint\nThe number of chains to run in parallel. If None, it is equal to the number of CPUs in the system unless there are more than 4 CPUs, in which case it is set to 4.\nNone\n\n\nrandom_seed\nint or list of ints\nA list is accepted if cores is greater than one.\nNone\n\n\n**kwargs\n\nFor other kwargs see the documentation for PyMC.sample().\n{}\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nAn ArviZ InferenceData instance if inference_method is \"mcmc\" (default),\n\n\n\n“nuts_numpyro”, “nuts_blackjax” or “laplace”.\n\n\n\nAn Approximation object if \"vi\".\n\n\n\n\n\n\n\n\nModel.graph(self, formatting='plain', name=None, figsize=None, dpi=300, fmt='png')\nProduce a graphviz Digraph from a built Bambi model.\nRequires graphviz, which may be installed most easily with conda install -c conda-forge python-graphviz\nAlternatively, you may install the graphviz binaries yourself, and then pip install graphviz to get the python bindings. See http://graphviz.readthedocs.io/en/stable/manual.html for more information.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nformatting\nstr\nOne of \"plain\" or \"plain_with_params\". Defaults to \"plain\".\n'plain'\n\n\nname\nstr\nName of the figure to save. Defaults to None, no figure is saved.\nNone\n\n\nfigsize\ntuple\nMaximum width and height of figure in inches. Defaults to None, the figure size is set automatically. If defined and the drawing is larger than the given size, the drawing is uniformly scaled down so that it fits within the given size. Only works if name is not None.\nNone\n\n\ndpi\nint\nPoint per inch of the figure to save. Defaults to 300. Only works if name is not None.\n300\n\n\nfmt\nstr\nFormat of the figure to save. Defaults to \"png\". Only works if name is not None.\n'png'\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\ngraphviz.Digraph\nThe graph\n\n\n\n\n\n\n\n\n\nmodel = Model(“y ~ x + (1|z)”) model.build() model.graph()\n\n\n\n\n\n\nmodel = Model(“y ~ x + (1|z)”) model.fit() model.graph()\n\n\n\n\n\n\n\nModel.plot_priors(self, draws=5000, var_names=None, random_seed=None, figsize=None, textsize=None, hdi_prob=None, round_to=2, point_estimate='mean', kind='kde', bins=None, omit_offsets=True, omit_group_specific=True, ax=None, **kwargs)\nSamples from the prior distribution and plots its marginals.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\nint\nNumber of draws to sample from the prior predictive distribution. Defaults to 5000.\n5000\n\n\nvar_names\nstr or list\nA list of names of variables for which to compute the prior predictive distribution. Defaults to None which means to include both observed and unobserved RVs.\nNone\n\n\nrandom_seed\nint\nSeed for the random number generator.\nNone\n\n\nfigsize\ntuple\nFigure size. 
If None it will be defined automatically.\nNone\n\n\ntextsize\nfloat\nText size scaling factor for labels, titles and lines. If None it will be autoscaled based on figsize.\nNone\n\n\nhdi_prob\nfloat or str\nPlots highest density interval for chosen percentage of density. Use \"hide\" to hide the highest density interval. Defaults to 0.94.\nNone\n\n\nround_to\nint\nControls formatting of floats. Defaults to 2 or the integer part, whichever is bigger.\n2\n\n\npoint_estimate\nstr\nPlot point estimate per variable. Values should be \"mean\", \"median\", \"mode\" or None. Defaults to \"auto\" i.e. it falls back to default set in ArviZ’s rcParams.\n'mean'\n\n\nkind\nstr\nType of plot to display (\"kde\" or \"hist\") For discrete variables this argument is ignored and a histogram is always used.\n'kde'\n\n\nbins\ninteger or sequence or auto\nControls the number of bins, accepts the same keywords matplotlib.pyplot.hist() does. Only works if kind == \"hist\". If None (default) it will use \"auto\" for continuous variables and range(xmin, xmax + 1) for discrete variables.\nNone\n\n\nomit_offsets\nbool\nWhether to omit offset terms in the plot. Defaults to True.\nTrue\n\n\nomit_group_specific\nbool\nWhether to omit group specific effects in the plot. Defaults to True.\nTrue\n\n\nax\nnumpy array-like of matplotlib axes or bokeh figures\nA 2D array of locations into which to plot the densities. If not supplied, ArviZ will create its own array of plot areas (and return it).\nNone\n\n\n**kwargs\n\nPassed as-is to matplotlib.pyplot.hist() or matplotlib.pyplot.plot() function depending on the value of kind.\n{}\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nmatplotlib axes\n\n\n\n\n\n\n\n\nModel.predict(self, idata, kind='mean', data=None, inplace=True, include_group_specific=True, sample_new_groups=False)\nPredict method for Bambi models\nObtains in-sample and out-of-sample predictions from a fitted Bambi model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nidata\nInferenceData\nThe InferenceData instance returned by .fit().\nrequired\n\n\nkind\nstr\nIndicates the type of prediction required. Can be \"mean\" or \"pps\". The first returns draws from the posterior distribution of the mean, while the latter returns the draws from the posterior predictive distribution (i.e. the posterior probability distribution for a new observation) in addition to the mean posterior distribution. Defaults to \"mean\".\n'mean'\n\n\ndata\npandas.DataFrame or None\nAn optional data frame with values for the predictors that are used to obtain out-of-sample predictions. If omitted, the original dataset is used.\nNone\n\n\ninplace\nbool\nIf True it will modify idata in-place. Otherwise, it will return a copy of idata with the predictions added. If kind=\"mean\", a new variable ending in \"_mean\" is added to the posterior group. If kind=\"pps\", it appends a posterior_predictive group to idata. If any of these already exist, it will be overwritten.\nTrue\n\n\ninclude_group_specific\nbool\nDetermines if predictions incorporate group-specific effects. If False, predictions are made with common effects only (i.e. group specific are set to zero). Defaults to True.\nTrue\n\n\nsample_new_groups\nbool\nSpecifies if it is allowed to obtain predictions for new groups of group-specific terms. When True, each posterior sample for the new groups is drawn from the posterior draws of a randomly selected existing group. 
Since different groups may be selected at each draw, the end result represents the variation across existing groups. The method implemented is quivalent to sample_new_levels=\"uncertainty\" in brms.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nInferenceData or None\n\n\n\n\n\n\n\n\nModel.prior_predictive(self, draws=500, var_names=None, omit_offsets=True, random_seed=None)\nGenerate samples from the prior predictive distribution.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndraws\nint\nNumber of draws to sample from the prior predictive distribution. Defaults to 500.\n500\n\n\nvar_names\nstr or list\nA list of names of variables for which to compute the prior predictive distribution. Defaults to None which means both observed and unobserved RVs.\nNone\n\n\nomit_offsets\nbool\nWhether to omit offset terms in the plot. Defaults to True.\nTrue\n\n\nrandom_seed\nint\nSeed for the random number generator.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nInferenceData\nInferenceData object with the groups prior, prior_predictive and observed_data.\n\n\n\n\n\n\n\nModel.set_alias(self, aliases)\nSet aliases for the terms and auxiliary parameters in the model\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\naliases\ndict\nA dictionary where key represents the original term name and the value is the alias.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone\n\n\n\n\n\n\n\n\nModel.set_priors(self, priors=None, common=None, group_specific=None)\nSet priors for one or more existing terms.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npriors\ndict\nDictionary of priors to update. Keys are names of terms to update; values are the new priors (either a Prior instance, or an int or float that scales the default priors).\nNone\n\n\ncommon\nPrior, int, or float\nA prior specification to apply to all common terms included in the model.\nNone\n\n\ngroup_specific\nPrior, int, or float\nA prior specification to apply to all group specific terms included in the model.\nNone\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nNone" }, { - "objectID": "api/load_data.html", - "href": "api/load_data.html", + "objectID": "api/clear_data_home.html", + "href": "api/clear_data_home.html", "title": "Bambi", "section": "", - "text": "data.load_data(dataset=None, data_home=None)\nLoad a dataset.\nRun with no parameters to get a list of all available data sets.\nThe directory to save can also be set with the environment variable BAMBI_HOME. The checksum of the dataset is checked against a hardcoded value to watch for data corruption. Run bmb.clear_data_home() to clear the data directory.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\n\nName of dataset to load.\nNone\n\n\ndata_home\n\nWhere to save remote datasets\nNone\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame" + "text": "data.clear_data_home(data_home=None)\nDelete all the content of the data home cache.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata_home\n\nThe path to Bambi data dir. 
By default a folder named \"bambi_data\" in the user home folder.\nNone" }, { - "objectID": "api/Likelihood.html", - "href": "api/Likelihood.html", + "objectID": "api/interpret.comparisons.html", + "href": "api/interpret.comparisons.html", "title": "Bambi", "section": "", - "text": "families.Likelihood(self, name, params=None, parent=None, dist=None)\nRepresentation of a Likelihood function for a Bambi model.\nNotes: * parent must be in params * parent is inferred from the name if it is a known name\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nName of the likelihood function. Must be a valid PyMC distribution name.\nrequired\n\n\nparams\nSequence[str]\nThe name of the parameters the likelihood function accepts.\nNone\n\n\nparent\nstr\nOptional specification of the name of the mean parameter in the likelihood. This is the parameter whose transformation is modeled by the linear predictor.\nNone\n\n\ndist\npymc.distributions.distribution.DistributionMeta or callable\nOptional custom PyMC distribution that will be used to compute the likelihood.\nNone" + "text": "interpret.comparisons(model, idata, contrast, conditional=None, average_by=None, comparison_type='diff', use_hdi=True, prob=None, transforms=None, sample_new_groups=False)\nCompute Conditional Adjusted Comparisons\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncontrast\n(str, dict)\nThe predictor name whose contrast we would like to compare.\nrequired\n\n\nconditional\n(str, dict, list)\nThe covariates we would like to condition on.\nNone\n\n\naverage_by\nUnion[str, list, bool, None]\nThe covariates we would like to average by. The passed covariate(s) will marginalize over the other covariates in the model. If True, it averages over all covariates in the model to obtain the average estimate. Defaults to None.\nNone\n\n\ncomparison_type\nstr\nThe type of comparison to plot. Defaults to ‘diff’.\n'diff'\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.DataFrame\nA dataframe with the comparison values, highest density interval, contrast name, contrast value, and conditional values.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nIf wrt is a dict and length of contrast is greater than 1. If wrt is a dict and length of contrast is greater than 2 and conditional is None. If conditional is None and contrast is categorical with > 2 values. If comparison_type is not ‘diff’ or ‘ratio’. If prob is not > 0 and < 1." 
+ }, + { + "objectID": "api/index.html", + "href": "api/index.html", + "title": "Bambi", + "section": "", + "text": "The basics\n\n\n\nModel\nSpecification of model class.\n\n\nFormula\nModel formula\n\n\n\n\n\n\n\n\n\nPrior\nAbstract specification of a term prior.\n\n\n\n\n\n\n\n\n\nFamily\nA specification of model family.\n\n\nLikelihood\nRepresentation of a Likelihood function for a Bambi model.\n\n\nLink\nRepresentation of a link function.\n\n\n\n\n\n\n\n\n\ninterpret.plot_comparisons\nPlot Conditional Adjusted Comparisons\n\n\ninterpret.plot_predictions\nPlot Conditional Adjusted Predictions\n\n\ninterpret.plot_slopes\nPlot Conditional Adjusted Slopes\n\n\n\n\n\n\n\n\n\ninterpret.comparisons\nCompute Conditional Adjusted Comparisons\n\n\ninterpret.predictions\nCompute Conditional Adjusted Predictions\n\n\ninterpret.slopes\nCompute Conditional Adjusted Slopes\n\n\n\n\n\n\n\n\n\nclear_data_home\nDelete all the content of the data home cache.\n\n\nload_data\nLoad a dataset." + }, + { + "objectID": "api/interpret.plot_predictions.html", + "href": "api/interpret.plot_predictions.html", + "title": "Bambi", + "section": "", + "text": "interpret.plot_predictions(model, idata, covariates, target='mean', sample_new_groups=False, pps=False, use_hdi=True, prob=None, transforms=None, legend=True, ax=None, fig_kwargs=None, subplot_kwargs=None)\nPlot Conditional Adjusted Predictions\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nmodel\nbambi.Model\nThe model for which we want to plot the predictions.\nrequired\n\n\nidata\narviz.InferenceData\nThe InferenceData object that contains the samples from the posterior distribution of the model.\nrequired\n\n\ncovariates\nlist or dict\nA sequence of between one and three names of variables in the model.\nrequired\n\n\ntarget\nstr\nWhich model parameter to plot. Defaults to ‘mean’. Passing a parameter into target only works when pps is False as the target may not be available in the posterior predictive distribution.\n'mean'\n\n\nsample_new_groups\nbool\nIf the model contains group-level effects, and data is passed for unseen groups, whether to sample from the new groups. Defaults to False.\nFalse\n\n\npps\nbool\nWhether to plot the posterior predictive samples. Defaults to False.\nFalse\n\n\nuse_hdi\nbool\nWhether to compute the highest density interval (defaults to True) or the quantiles.\nTrue\n\n\nprob\nfloat\nThe probability for the credibility intervals. Must be between 0 and 1. Defaults to 0.94. Changing the global variable az.rcParam[\"stats.hdi_prob\"] affects this default.\nNone\n\n\nlegend\nbool\nWhether to automatically include a legend in the plot. Defaults to True.\nTrue\n\n\ntransforms\ndict\nTransformations that are applied to each of the variables being plotted. The keys are the name of the variables, and the values are functions to be applied. Defaults to None.\nNone\n\n\nax\nmatplotlib.axes._subplots.AxesSubplot\nA matplotlib axes object or a sequence of them. If None, this function instantiates a new axes object. Defaults to None.\nNone\n\n\nfig_kwargs\noptional\nKeyword arguments passed to the matplotlib figure function as a dict. For example, fig_kwargs=dict(figsize=(11, 8)), sharey=True would make the figure 11 inches wide by 8 inches high and would share the y-axis values.\nNone\n\n\nsubplot_kwargs\noptional\nKeyword arguments used to determine the covariates used for the horizontal, group, and panel axes. 
For example, subplot_kwargs=dict(main=\"x\", group=\"y\", panel=\"z\") would plot the horizontal axis as x, the color (hue) as y, and the panel axis as z.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\n(matplotlib.figure.Figure, matplotlib.axes._subplots.AxesSubplot)\nA tuple with the figure and the axes.\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nValueError\nWhen level is not within 0 and 1. When the main covariate is not numeric or categoric.\n\n\nTypeError\nWhen covariates is not a string or a list of strings." + }, + { + "objectID": "index.html", + "href": "index.html", + "title": "BAyesian Model-Building Interface in Python", + "section": "", + "text": "Bambi is a high-level Bayesian model-building interface written in Python. It works with the PyMC probabilistic programming framework and is designed to make it extremely easy to fit Bayesian mixed-effects models common in biology, social sciences and other disciplines." + }, + { + "objectID": "index.html#dependencies", + "href": "index.html#dependencies", + "title": "BAyesian Model-Building Interface in Python", + "section": "Dependencies", + "text": "Dependencies\nBambi is tested on Python 3.9+ and depends on ArviZ, formulae, NumPy, pandas and PyMC (see pyproject.toml for version information)." + }, + { + "objectID": "index.html#installation", + "href": "index.html#installation", + "title": "BAyesian Model-Building Interface in Python", + "section": "Installation", + "text": "Installation\nBambi is available from the Python Package Index at https://pypi.org/project/bambi, alternatively it can be installed using Conda.\n\nPyPI\nThe latest release of Bambi can be installed using pip:\npip install bambi\nAlternatively, if you want the bleeding edge version of the package, you can install from GitHub:\npip install git+https://github.com/bambinos/bambi.git\n\n\nConda\nIf you use Conda, you can also install the latest release of Bambi with the following command:\nconda install -c conda-forge bambi" + }, + { + "objectID": "index.html#usage", + "href": "index.html#usage", + "title": "BAyesian Model-Building Interface in Python", + "section": "Usage", + "text": "Usage\nA simple fixed effects model is shown in the example below.\nimport arviz as az\nimport bambi as bmb\nimport pandas as pd\n\n# Read in a tab-delimited file containing our data\ndata = pd.read_table('my_data.txt', sep='\\t')\n\n# Initialize the fixed effects only model\nmodel = bmb.Model('DV ~ IV1 + IV2', data)\n\n# Fit the model using 1000 on each of 4 chains\nresults = model.fit(draws=1000, chains=4)\n\n# Use ArviZ to plot the results\naz.plot_trace(results)\n\n# Key summary and diagnostic info on the model parameters\naz.summary(results)\nFor a more in-depth introduction to Bambi see our Quickstart or our set of example notebooks." 
+ }, + { + "objectID": "index.html#citation", + "href": "index.html#citation", + "title": "BAyesian Model-Building Interface in Python", + "section": "Citation", + "text": "Citation\nIf you use Bambi and want to cite it please use\n@article{\n Capretto2022,\n title={Bambi: A Simple Interface for Fitting Bayesian Linear Models in Python},\n volume={103},\n url={https://www.jstatsoft.org/index.php/jss/article/view/v103i15},\n doi={10.18637/jss.v103.i15},\n number={15},\n journal={Journal of Statistical Software},\n author={Capretto, Tomás and Piho, Camen and Kumar, Ravin and Westfall, Jacob and Yarkoni, Tal and Martin, Osvaldo A},\n year={2022},\n pages={1–29}\n}" + }, + { + "objectID": "index.html#contributing", + "href": "index.html#contributing", + "title": "BAyesian Model-Building Interface in Python", + "section": "Contributing", + "text": "Contributing\nWe welcome contributions from interested individuals or groups! For information about contributing to Bambi, check out our instructions, policies, and guidelines here." + }, + { + "objectID": "index.html#contributors", + "href": "index.html#contributors", + "title": "BAyesian Model-Building Interface in Python", + "section": "Contributors", + "text": "Contributors\nSee the GitHub contributor page." }, { "objectID": "faq.html",