Video review
s2t2 committed Sep 3, 2024
1 parent 0429bbd commit 1f429e9
Showing 2 changed files with 16 additions and 11 deletions.
19 changes: 12 additions & 7 deletions docs/notes/applied-stats/basic-tests.qmd
@@ -23,12 +23,13 @@ df.head()

## Normality Tests

We can conduct a [normality test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) to see if a given distribution is normally distributed.
We can use the [`normaltest` function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) from `scipy.stats` to conduct a normality test, to see if a given variable is normally distributed.

> This function tests the null hypothesis that a sample comes from a normal distribution.
>
> If the p-value is "small" - that is, if there is a low probability of sampling data from a normally distributed population that produces such an extreme value of the statistic - this may be taken as evidence against the null hypothesis in favor of the alternative: the weights were not drawn from a normal distribution.
In this example, we pass a column or list of values to the `normaltest` function, which returns a result containing the test statistic and p-value:

```{python}
from scipy.stats import normaltest
@@ -39,17 +40,17 @@ result = normaltest(x)
print(result)
```

### Interpreting the results.
Interpreting the results:


> To determine whether the data do not follow a normal distribution, compare the p-value to the significance level. Usually, a significance level (denoted as α or alpha) of 0.05 works well. A significance level of 0.05 indicates a 5% risk of concluding that the data do not follow a normal distribution when the data do follow a normal distribution.
>
> P-value ≤ α: The data do not follow a normal distribution (Reject H0)
> If the p-value is less than or equal to the significance level, the decision is to reject the null hypothesis and conclude that your data do not follow a normal distribution.
>
> P-value > α: You cannot conclude that the data do not follow a normal distribution (Fail to reject H0). If the p-value is larger than the significance level, the decision is to fail to reject the null hypothesis. You do not have enough evidence to conclude that your data do not follow a normal distribution. - [source](https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistics/basic-statistics/how-to/normality-test/interpret-the-results/key-results/
)
> P-value > α: You cannot conclude that the data do not follow a normal distribution (Fail to reject H0). If the p-value is larger than the significance level, the decision is to fail to reject the null hypothesis. You do not have enough evidence to conclude that your data do not follow a normal distribution. - [source](https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistics/basic-statistics/how-to/normality-test/interpret-the-results/key-results/)
We examine the p-value. If the p-value is less than the significance level we set (in this case 0.05), we reject the null hypothesis and conclude the data is not normally distributed. Otherwise, we fail to reject the null hypothesis and conclude the data could plausibly be normally distributed:
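
The cell that performs this check is collapsed in this diff view. As a minimal sketch of the idea, assuming the `result` object from the `normaltest` cell above and a 0.05 significance level:

```{python}
# result comes from the normaltest() call in the earlier cell
if result.pvalue <= 0.05:
    print("REJECT THE NULL HYPOTHESIS (DATA IS NOT NORMALLY DISTRIBUTED)")
else:
    print("FAIL TO REJECT (DATA COULD BE NORMALLY DISTRIBUTED)")
```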

```{python}
@@ -118,7 +119,7 @@ Reference: <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tte
>
> This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
>
> The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the probability of observing as or more extreme values assuming the null hypothesis, that the samples are drawn from populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means. - [source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
> The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the probability of observing as or more extreme values assuming the null hypothesis, that the samples are drawn from populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means.
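
The notebook's setup for this test is mostly collapsed in this diff (only the variance check below is visible). As a minimal sketch of how `ttest_ind` is called, using illustrative values rather than the notebook's actual groups:

```{python}
from scipy.stats import ttest_ind

# illustrative samples only; the notebook's actual groups (e.g. rates_recent)
# are defined in cells collapsed by this diff view
rates_recent = [5.33, 5.25, 5.12, 4.83]
rates_earlier = [1.75, 2.00, 2.25, 2.40]

# by default, ttest_ind assumes the two populations have equal variances;
# pass equal_var=False to use Welch's t-test instead
result = ttest_ind(rates_recent, rates_earlier)
print(result.statistic, result.pvalue)
```
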
```{python}
print(rates_recent.var())
@@ -145,12 +146,14 @@ Reference: <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tte

> Calculate the T-test for the mean of ONE group of scores.
>
> This is a test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean, popmean.
> This is a test for the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean, popmean.
>
> Under certain assumptions about the population from which a sample is drawn, the confidence interval with confidence level 95% is expected to contain the true population mean in 95% of sample replications. - [source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html)
> Under certain assumptions about the population from which a sample is drawn, the confidence interval with confidence level 95% is expected to contain the true population mean in 95% of sample replications.
Suppose we wish to test the null hypothesis that the mean of the fed funds rates is equal to 2.5%.

We pass the column of values and the hypothesized population mean as parameters, then inspect the p-value to interpret the results.

```{python}
from scipy.stats import ttest_1samp
@@ -167,6 +170,8 @@ else:
print("NOT ABLE TO REJECT (MEAN COULT BE EQUAL TO POPMEAN)")
```
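
Since the middle of the cell above is collapsed by this diff, here is a minimal sketch of the full pattern it describes. The column name `df["fed"]`, the 0.05 threshold, and treating the rates as percentages are assumptions for illustration:

```{python}
from scipy.stats import ttest_1samp

x = df["fed"]  # assumed: the fed funds rate column from the dataframe loaded earlier

# null hypothesis: the mean rate equals 2.5 (i.e. 2.5%)
result = ttest_1samp(x, popmean=2.5)
print(result.statistic, result.pvalue)

if result.pvalue <= 0.05:
    print("REJECT (MEAN IS NOT EQUAL TO POPMEAN)")
else:
    print("NOT ABLE TO REJECT (MEAN COULD BE EQUAL TO POPMEAN)")
```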

Finally, we can access information about the confidence interval for this test:

```{python}
ci = result.confidence_interval(confidence_level=0.95)
print(ci)
8 changes: 4 additions & 4 deletions docs/notes/applied-stats/summary-stats.qmd
@@ -34,7 +34,7 @@ We can use the dataframe's [`describe` method](https://pandas.pydata.org/docs/re
df.describe()
```

This will show us the number of rows, mean and median, min and max, and quantiles for each column.
This will show us the number of rows, mean and standard deviation, min and max, and quantiles for each column.

As you may be aware, we can alternatively calculate these metrics ourselves, using `Series` aggregations:
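
The aggregation cell itself is collapsed in this diff view. The sketch below shows the kinds of `Series` aggregations the sentence refers to, assuming the fed funds rate column is named `"fed"`:

```{python}
series = df["fed"]  # assumed: the fed funds rate column as a pandas Series

print("COUNT:", len(series))
print("MEAN:", series.mean())
print("STD:", series.std())
print("MIN:", series.min())
print("25TH PCT:", series.quantile(0.25))
print("MEDIAN:", series.median())
print("75TH PCT:", series.quantile(0.75))
print("MAX:", series.max())
```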

@@ -61,7 +61,7 @@ series.describe() # for comparison

## Distribution Plots

In order to learn more about the distribution of this data, we can create distribution plots of the federal funds rate, to tell a story about the summary statistics for this indicator.
To learn more about the distribution of this data, we can create distribution plots that tell a story about the summary statistics.


A box plot:
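
The box plot cell is collapsed in this diff view. A minimal sketch with plotly express (which the histogram code below also uses), assuming the column of interest is `"fed"`:

```{python}
import plotly.express as px

# assumed: df is the dataframe loaded earlier, with a "fed" rate column
fig = px.box(df, y="fed", title="Distribution of the Federal Funds Rate")
fig.show()
```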
@@ -94,8 +94,8 @@ px.histogram(df, x="fed", #nbins=12,
```

When we make a histogram, we can specify the number of bins.
When we make a histogram, we can specify the number of bins using the `nbins` parameter.
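
For example, a hedged sketch (the `#nbins=12` comment in the cell above suggests 12 as one possible choice; the title is illustrative):

```{python}
import plotly.express as px

# same kind of chart as above, but with an explicit number of bins
px.histogram(df, x="fed", nbins=12, title="Fed Funds Rate (12 bins)")
```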

These charts help us visually assess how the data is distributed.

Looks like the recent higher funds rates are potential outliers. Based on this view, is hard to say for sure if this data is normally distributed, or whether it is too skewed by the outliers. In the next chapter, we will perform more official statistical tests to determine whether this data is normally distributed.
Based on this view, it is hard to say for sure whether this data is normally distributed, or multi-modal, or whether it is too skewed by the outliers. In the next chapter, we will perform more formal statistical tests to determine if this data is normally distributed.
