diff --git a/.nojekyll b/.nojekyll index 6d9eb250..a2e4ab13 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -4ac0403d \ No newline at end of file +c3a3499e \ No newline at end of file diff --git a/notebooks/t-test.html b/notebooks/t-test.html index 9110e4ff..17942797 100644 --- a/notebooks/t-test.html +++ b/notebooks/t-test.html @@ -619,7 +619,8 @@

Model 1

if \(\mu_0 = \mu_1\) then

\[ \beta_0 + \epsilon_i = \beta_0 + \beta_1 + \epsilon_i\\ -0 = \beta_1 +\] \[ +\beta_1 = 0 \]

Thus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \(\beta_1\) is 0.

diff --git a/search.json b/search.json index 45df3ae6..91670c7c 100644 --- a/search.json +++ b/search.json @@ -652,7 +652,7 @@ "href": "notebooks/t-test.html#generate-data", "title": "Comparison of two means (T-test)", "section": "Generate data", - "text": "Generate data\nWe generate 160 values from a Gaussian with \\(\\mu=6\\) and \\(\\sigma=2.5\\) and another 120 values from a Gaussian’ with \\(\\mu=8\\) and \\(\\sigma=2\\)\n\na = np.random.normal(6, 2.5, 160)\nb = np.random.normal(8, 2, 120)\ndf = pd.DataFrame({\"Group\": [\"a\"] * 160 + [\"b\"] * 120, \"Val\": np.hstack([a, b])})\n\n\ndf.head()\n\n\n\n\n\n\n\n\nGroup\nVal\n\n\n\n\n0\na\n7.178588\n\n\n1\na\n3.022561\n\n\n2\na\n9.581767\n\n\n3\na\n5.218370\n\n\n4\na\n4.198528\n\n\n\n\n\n\n\n\naz.plot_violin({\"a\": a, \"b\": b});\n\n/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/site-packages/arviz/plots/backends/matplotlib/violinplot.py:65: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(wspace=0)\n\n\n\n\n\n\n\n\n\nWhen we carry out a two sample t-test we are implicitly using a linear model that can be specified in different ways. One of these approaches is the following:\n\nModel 1\n\\[\n\\mu_i = \\beta_0 + \\beta_1 (i) + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon_i\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon_i\n\\]\nand\n\\[\n\\mu_1 = \\beta_0 + \\beta_1 + \\epsilon_i\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon_i = \\beta_0 + \\beta_1 + \\epsilon_i\\\\\n0 = \\beta_1\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_1\\) is 0.\n\n\nAnalysis\nWe start by instantiating our model and specifying the model previously described.\n\nmodel_1 = bmb.Model(\"Val ~ Group\", df)\nresults_1 = model_1.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [sigma, Intercept, Group]\n\n\n\n\n\n\n\n\n\n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_1\n\n Formula: Val ~ Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 6.9762, sigma: 8.1247)\n Group ~ Normal(mu: 0.0, sigma: 12.4107)\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_1.plot_priors();\n\nSampling: [Group, Intercept, sigma]\n\n\n\n\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_1, kind=\"rank_vlines\");\n\n\n\n\n\n\n\n\n\naz.summary(results_1)\n\n\n\n\n\n\n\n\nmean\nsd\nhdi_3%\nhdi_97%\nmcse_mean\nmcse_sd\ness_bulk\ness_tail\nr_hat\n\n\n\n\nGroup[b]\n2.004\n0.265\n1.548\n2.512\n0.005\n0.003\n3036.0\n1618.0\n1.0\n\n\nIntercept\n6.117\n0.180\n5.777\n6.459\n0.003\n0.002\n3049.0\n1512.0\n1.0\n\n\nsigma\n2.265\n0.096\n2.087\n2.444\n0.002\n0.001\n3570.0\n1662.0\n1.0\n\n\n\n\n\n\n\nIn the summary table we can see the 94% highest density interval for \\(\\beta_1\\) ranges from 1.511 to 2.499. Thus, according to the data and the model used, we conclude the difference between the two population means is somewhere between 1.2 and 2.2 and hence we support the hypotehsis that \\(\\beta_1 \\ne 0\\).\nSimilar conclusions can be made with the density estimate for the posterior distribution of \\(\\beta_1\\). As seen in the table, most of the probability for the difference in the mean roughly ranges from 1.2 to 2.2.\n\naz.plot_posterior(results_1, var_names=\"Group\", ref_val=0);\n\n\n\n\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 > 0\\). This probability is equal to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(results_1.posterior[\"Group\"] > 0).mean().item()\n\n1.0\n\n\nThe linear model implicit in the t-test can also be specified without an intercept term, such is the case of Model 2.\n\n\nModel 2\nWhen we carry out a two sample t-test we’re implicitly using the following model:\n\\[\n\\mu_i = \\beta_i + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon\n\\]\nand\n\\[\n\\mu_1 = \\beta_1 + \\epsilon\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon = \\beta_1 + \\epsilon\\\\\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_0 = \\beta_1\\).\n\n\nAnalysis\nWe start by instantiating our model and specifying the model previously described. In this model we will bypass the intercept that Bambi adds by default by setting it to zero, even though setting to -1 has the same effect.\n\nmodel_2 = bmb.Model(\"Val ~ 0 + Group\", df)\nresults_2 = model_2.fit() \n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [sigma, Group]\n\n\n\n\n\n\n\n\n\n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_2\n\n Formula: Val ~ 0 + Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Group ~ Normal(mu: [0. 0.], sigma: [12.4107 12.4107])\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_2.plot_priors();\n\nSampling: [Group, sigma]\n\n\n\n\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_2, kind=\"rank_vlines\");\n\n\n\n\n\n\n\n\n\naz.summary(results_2)\n\n\n\n\n\n\n\n\nmean\nsd\nhdi_3%\nhdi_97%\nmcse_mean\nmcse_sd\ness_bulk\ness_tail\nr_hat\n\n\n\n\nGroup[a]\n6.124\n0.177\n5.778\n6.431\n0.003\n0.002\n2735.0\n1407.0\n1.00\n\n\nGroup[b]\n8.115\n0.203\n7.708\n8.474\n0.004\n0.003\n3086.0\n1621.0\n1.00\n\n\nsigma\n2.265\n0.099\n2.075\n2.437\n0.002\n0.001\n2884.0\n1630.0\n1.01\n\n\n\n\n\n\n\nIn this summary we can observe the estimated distribution of means for each population. A simple way to compare them is subtracting one to the other. In the next plot we can se that the entirety of the distribution of differences is higher than zero and that the mean of population 2 is higher than the mean of population 1 by a mean of 2.\n\npost_group = results_2.posterior[\"Group\"]\ndiff = post_group.sel(Group_dim=\"b\") - post_group.sel(Group_dim=\"a\") \naz.plot_posterior(diff, ref_val=0);\n\n\n\n\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 - \\beta_0 > 0\\). This probability equals to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(post_group > 0).mean().item()\n\n1.0\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Sat May 25 2024\n\nPython implementation: CPython\nPython version : 3.11.9\nIPython version : 8.24.0\n\npandas : 2.2.2\nmatplotlib: 3.8.4\narviz : 0.18.0\nbambi : 0.13.1.dev37+g2a54df76.d20240525\nnumpy : 1.26.4\n\nWatermark: 2.4.3", + "text": "Generate data\nWe generate 160 values from a Gaussian with \\(\\mu=6\\) and \\(\\sigma=2.5\\) and another 120 values from a Gaussian’ with \\(\\mu=8\\) and \\(\\sigma=2\\)\n\na = np.random.normal(6, 2.5, 160)\nb = np.random.normal(8, 2, 120)\ndf = pd.DataFrame({\"Group\": [\"a\"] * 160 + [\"b\"] * 120, \"Val\": np.hstack([a, b])})\n\n\ndf.head()\n\n\n\n\n\n\n\n\nGroup\nVal\n\n\n\n\n0\na\n7.178588\n\n\n1\na\n3.022561\n\n\n2\na\n9.581767\n\n\n3\na\n5.218370\n\n\n4\na\n4.198528\n\n\n\n\n\n\n\n\naz.plot_violin({\"a\": a, \"b\": b});\n\n/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/site-packages/arviz/plots/backends/matplotlib/violinplot.py:65: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.\n fig.subplots_adjust(wspace=0)\n\n\n\n\n\n\n\n\n\nWhen we carry out a two sample t-test we are implicitly using a linear model that can be specified in different ways. One of these approaches is the following:\n\nModel 1\n\\[\n\\mu_i = \\beta_0 + \\beta_1 (i) + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon_i\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon_i\n\\]\nand\n\\[\n\\mu_1 = \\beta_0 + \\beta_1 + \\epsilon_i\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon_i = \\beta_0 + \\beta_1 + \\epsilon_i\\\\\n\\] \\[\n\\beta_1 = 0\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_1\\) is 0.\n\n\nAnalysis\nWe start by instantiating our model and specifying the model previously described.\n\nmodel_1 = bmb.Model(\"Val ~ Group\", df)\nresults_1 = model_1.fit()\n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [sigma, Intercept, Group]\n\n\n\n\n\n\n\n\n\n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_1\n\n Formula: Val ~ Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Intercept ~ Normal(mu: 6.9762, sigma: 8.1247)\n Group ~ Normal(mu: 0.0, sigma: 12.4107)\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_1.plot_priors();\n\nSampling: [Group, Intercept, sigma]\n\n\n\n\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_1, kind=\"rank_vlines\");\n\n\n\n\n\n\n\n\n\naz.summary(results_1)\n\n\n\n\n\n\n\n\nmean\nsd\nhdi_3%\nhdi_97%\nmcse_mean\nmcse_sd\ness_bulk\ness_tail\nr_hat\n\n\n\n\nGroup[b]\n2.004\n0.265\n1.548\n2.512\n0.005\n0.003\n3036.0\n1618.0\n1.0\n\n\nIntercept\n6.117\n0.180\n5.777\n6.459\n0.003\n0.002\n3049.0\n1512.0\n1.0\n\n\nsigma\n2.265\n0.096\n2.087\n2.444\n0.002\n0.001\n3570.0\n1662.0\n1.0\n\n\n\n\n\n\n\nIn the summary table we can see the 94% highest density interval for \\(\\beta_1\\) ranges from 1.511 to 2.499. Thus, according to the data and the model used, we conclude the difference between the two population means is somewhere between 1.2 and 2.2 and hence we support the hypotehsis that \\(\\beta_1 \\ne 0\\).\nSimilar conclusions can be made with the density estimate for the posterior distribution of \\(\\beta_1\\). As seen in the table, most of the probability for the difference in the mean roughly ranges from 1.2 to 2.2.\n\naz.plot_posterior(results_1, var_names=\"Group\", ref_val=0);\n\n\n\n\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 > 0\\). This probability is equal to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(results_1.posterior[\"Group\"] > 0).mean().item()\n\n1.0\n\n\nThe linear model implicit in the t-test can also be specified without an intercept term, such is the case of Model 2.\n\n\nModel 2\nWhen we carry out a two sample t-test we’re implicitly using the following model:\n\\[\n\\mu_i = \\beta_i + \\epsilon_i\n\\]\nwhere \\(i = 0\\) represents the population 1, \\(i = 1\\) the population 2 and \\(\\epsilon\\) is a random error with mean 0. If we replace the indicator variables for the two groups we have\n\\[\n\\mu_0 = \\beta_0 + \\epsilon\n\\]\nand\n\\[\n\\mu_1 = \\beta_1 + \\epsilon\n\\]\nif \\(\\mu_0 = \\mu_1\\) then\n\\[\n\\beta_0 + \\epsilon = \\beta_1 + \\epsilon\\\\\n\\]\nThus, we can see that testing whether the mean of the two populations are equal is equivalent to testing whether \\(\\beta_0 = \\beta_1\\).\n\n\nAnalysis\nWe start by instantiating our model and specifying the model previously described. In this model we will bypass the intercept that Bambi adds by default by setting it to zero, even though setting to -1 has the same effect.\n\nmodel_2 = bmb.Model(\"Val ~ 0 + Group\", df)\nresults_2 = model_2.fit() \n\nAuto-assigning NUTS sampler...\nInitializing NUTS using jitter+adapt_diag...\nMultiprocess sampling (2 chains in 2 jobs)\nNUTS: [sigma, Group]\n\n\n\n\n\n\n\n\n\n\n\nSampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.\nWe recommend running at least 4 chains for robust computation of convergence diagnostics\n\n\nWe’ve only specified the formula for the model and Bambi automatically selected priors distributions and values for their parameters. We can inspect both the setup and the priors as following:\n\nmodel_2\n\n Formula: Val ~ 0 + Group\n Family: gaussian\n Link: mu = identity\n Observations: 280\n Priors: \n target = mu\n Common-level effects\n Group ~ Normal(mu: [0. 0.], sigma: [12.4107 12.4107])\n \n Auxiliary parameters\n sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4567)\n------\n* To see a plot of the priors call the .plot_priors() method.\n* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()\n\n\n\nmodel_2.plot_priors();\n\nSampling: [Group, sigma]\n\n\n\n\n\n\n\n\n\nTo inspect our posterior and the sampling process we can call az.plot_trace(). The option kind='rank_vlines' gives us a variant of the rank plot that uses lines and dots and helps us to inspect the stationarity of the chains. Since there is no clear pattern or serious deviations from the horizontal lines, we can conclude the chains are stationary.\n\n\naz.plot_trace(results_2, kind=\"rank_vlines\");\n\n\n\n\n\n\n\n\n\naz.summary(results_2)\n\n\n\n\n\n\n\n\nmean\nsd\nhdi_3%\nhdi_97%\nmcse_mean\nmcse_sd\ness_bulk\ness_tail\nr_hat\n\n\n\n\nGroup[a]\n6.124\n0.177\n5.778\n6.431\n0.003\n0.002\n2735.0\n1407.0\n1.00\n\n\nGroup[b]\n8.115\n0.203\n7.708\n8.474\n0.004\n0.003\n3086.0\n1621.0\n1.00\n\n\nsigma\n2.265\n0.099\n2.075\n2.437\n0.002\n0.001\n2884.0\n1630.0\n1.01\n\n\n\n\n\n\n\nIn this summary we can observe the estimated distribution of means for each population. A simple way to compare them is subtracting one to the other. In the next plot we can se that the entirety of the distribution of differences is higher than zero and that the mean of population 2 is higher than the mean of population 1 by a mean of 2.\n\npost_group = results_2.posterior[\"Group\"]\ndiff = post_group.sel(Group_dim=\"b\") - post_group.sel(Group_dim=\"a\") \naz.plot_posterior(diff, ref_val=0);\n\n\n\n\n\n\n\n\nAnother way to arrive to a similar conclusion is by calculating the probability that the parameter \\(\\beta_1 - \\beta_0 > 0\\). This probability equals to 1, telling us that the mean of the two populations are different.\n\n# Probabiliy that posterior is > 0\n(post_group > 0).mean().item()\n\n1.0\n\n\n\n%load_ext watermark\n%watermark -n -u -v -iv -w\n\nLast updated: Sat May 25 2024\n\nPython implementation: CPython\nPython version : 3.11.9\nIPython version : 8.24.0\n\npandas : 2.2.2\nmatplotlib: 3.8.4\narviz : 0.18.0\nbambi : 0.13.1.dev37+g2a54df76.d20240525\nnumpy : 1.26.4\n\nWatermark: 2.4.3", "crumbs": [ "Examples", "Linear regression models",