Skip to content

Commit

Permalink
started writing docu for distris and resids
Browse files Browse the repository at this point in the history
  • Loading branch information
Konrad1991 committed Nov 13, 2024
1 parent 7488aee commit 970ef95
Show file tree
Hide file tree
Showing 20 changed files with 4,654 additions and 0 deletions.
28 changes: 28 additions & 0 deletions bs/.development/Distris.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
dnorm(0.25, mean = 0, sd = 1)
dnorm(0.5, mean = 0, sd = 1)
dnorm(0.005, mean = 0, sd = 1)
dnorm(5, mean = 0, sd = 1)
dnorm(0.01, mean = 0, sd = 1)

mean <- 0
sd <- 1
set.seed(123) # For reproducibility
samples <- rnorm(1000, mean, sd)
hist(samples,
breaks = 30, probability = TRUE, col = "lightgray",
main = "Histogram of Random Samples with PDF Overlay",
xlab = "x", ylab = "Density", ylim = c(0, 1.25)
)

x <- seq(-4, 4, length.out = 100)
lines(x, dnorm(x, mean, sd), col = "blue", lwd = 2)

x <- seq(-4, 4, length.out = 100)
lines(x, pnorm(x, mean, sd), type = "l", lwd = 2, col = "red")

lines(density(samples), col = "green", lwd = 2)

legend("topright",
legend = c("PDF (dnorm)", "CDF (pnorm)", "Empirical Density", "Histogram of Samples"),
col = c("blue", "red", "green", "lightgray"), lwd = 2, fill = c(NA, NA, NA, "lightgray")
)
1,053 changes: 1,053 additions & 0 deletions bs/.development/Distris.html

Large diffs are not rendered by default.

338 changes: 338 additions & 0 deletions bs/.development/Distris.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,338 @@
---
title: "Exploring Probability Distributions in R"
output: html_document # Use `output: quarto::html` for Quarto
---

## Introduction

This document provides an overview of the `d-`, `p-`, `q-`, and `r-` functions in R,
focusing on their roles in understanding probability distributions.

## Theoretical Background

### The `d-` Function (Density)
The `d-` functions, such as `dnorm()`, calculate the density (or probability mass)
at a given point. For continuous distributions, this represents the height
of the probability density function (PDF) at a specific value.
Use it to understand the likelihood of observing a specific value under a given distribution.
For example, dnorm(0, mean = 0, sd = 1) returns the density at x = 0.
It helps assess how probable a particular outcome is.

### The `p-` Function (Cumulative Distribution)
The `p-` functions, like `pnorm()`, return the cumulative probability up
to a given point. This is known as the cumulative distribution function (CDF),
which tells us the probability of a random variable being
less than or equal to a specific value.
Function (Cumulative Probability):
calculates the cumulative probability up to a specific point,
indicating the probability of a value being less than or equal to that point.
For instance, pnorm(1, mean = 0, sd = 1) gives the probability P(X≤1).
It's useful for finding thresholds or quantifying areas under the curve.

### The `q-` Function (Quantile)
The `q-` functions, such as `qnorm()`, provide the inverse of the CDF.
Given a probability, `q-` functions return the quantile that corresponds to that probability.
q- Function (Quantile): calculates the value corresponding to a given
cumulative probability. Use it when you need the value for a specified percentile.
For example, qnorm(0.975, mean = 0, sd = 1) returns the 97.5th percentile
of the standard normal distribution, useful in setting confidence intervals.

### The `r-` Function (Random Generation)
The `r-` functions, like `rnorm()`, are used for generating random samples from a
specified distribution. This is essential for simulations and statistical analysis.

## Visualizing a Standard Normal Distribution

Let's create a histogram of random samples drawn from a standard normal
distribution and overlay it with the PDF, CDF, and empirical density.

```{r setup, include=FALSE}
mean <- 0
sd <- 1
set.seed(123)
```

### Generating Random Samples

```{r random-samples}
samples <- rnorm(1000, mean, sd)
```

### Plotting the Histogram with Overlays

```{r plot-histogram}
library(ggplot2)
x <- seq(-4, 4, length.out = 100)
df <- data.frame(x = x, value = samples, group = "Measured")
# Sequence for plotting PDF and CDF and empirical density
densi_emp <- density(samples)
# Normalize the measured points
hist_data <- hist(samples, breaks = 30, plot = FALSE)
N <- length(samples)
n_counts_in_bins <- hist_data$counts
bin_widths <- hist_data$breaks[2] - hist_data$breaks[1]
densi <- n_counts_in_bins / (bin_widths * N)
df <- rbind(df,
data.frame(x = x, value = dnorm(x, mean, sd), group = "PDF"),
data.frame(x = x, value = pnorm(x, mean, sd), group = "CDF"),
data.frame(x = densi_emp$x, value = densi_emp$y, group = "Empirical Density")
)
ggplot() +
geom_histogram(
data = df[df$group == "Measured", ],
aes(x = value, y = ..density..)
# This normalizes the histogram: n/(binwidth*N) = density;
# Where N is the total number of observations and
# n is the number of observations in each bin.
# See above for more details
) +
geom_line(
data = df[df$group != "Measured", ],
aes(x = x, y = value, color = group),
size = 1.5
) +
theme(legend.title = element_blank())
```


## Likelihood of a Dataset
Likelihood refers to how probable a specific set of data is under a given model.
It quantifies the model’s ability to explain the observed data.
Maximizing the likelihood helps estimate the parameters that make the observed data
most probable (used in maximum likelihood estimation).

### fitdistr and AIC
- fitdistr (in MASS package): Fits a distribution to data by estimating
its parameters using maximum likelihood estimation.
It's useful for modeling and checking which distribution best matches your data.
- AIC (Akaike Information Criterion): Measures the relative quality
of statistical models, balancing model fit and complexity. A lower AIC
indicates a model that better explains the data with fewer parameters.

```{r Likelihood}
library(FAwR)
library(MASS)
data(gutten)
hist(gutten$dbh.cm)
fit_norm <- fitdistr(gutten$dbh.cm, "normal")
fit_norm
likelihood <- prod(
dnorm(gutten$dbh.cm,
mean = fit_norm$estimate[1],
sd = fit_norm$estimate[2])
) # Cannot be calculated like this by a computer --> is 0
likelihood <- sum(
log(
dnorm(gutten$dbh.cm,
mean = fit_norm$estimate[1],
sd = fit_norm$estimate[2])
)
)
likelihood
logLik(fit_norm)
AIC(fit_norm)
BIC(fit_norm)
# log normal
fit_lognorm <- fitdistr(gutten$dbh.cm, "lognormal")
fit_lognorm
likelihood <- sum(
log(
dnorm(gutten$dbh.cm,
mean = fit_lognorm$estimate[1],
sd = fit_lognorm$estimate[2])
)
)
likelihood
logLik(fit_lognorm)
AIC(fit_lognorm)
BIC(fit_lognorm)
```

## Regression i.e. generalized linear models

A regression describes the quantitative relationship between
one or more predictor variables and a response variable.
Here a distribution is again fitted to the measured data.
However, a predictor effects certain parameters of the distribution.
For example, we have three factors in our dataset: genotype, time and treatment.
Moreover, we want to predict the CO2 uptake rate (our response variable).
Furthermore, we assume normally distributed values.
CO2uptake ~ N(µ = a + b*genotype + c*time + d*treatment, σ)

### The link function/scala
Often distributions have only a certain interval in which they
are defined. However, if a predictor takes values which are outside
of the interval,. we need to transform the predictor to a new scale.
This is the task of the link function.



```{r LinkFunctions}
#| echo = FALSE
df <- data.frame(distributions = c("Normal", "Poisson", "Binomial", "Gamma", "Negative Binomial"),
link_functions = c("Identity", "Log", "Logit (ln(y / (1 - y))", "Inverse", "Log"))
knitr::kable(df)
```


```{r Regressions}
df <- data.frame(sizes = c(rnorm(10000, mean = 170, sd = 5), rnorm(10000, mean = 182, sd = 5)),
sex = rep(c("F", "M"), each = 10000))
p1 <- ggplot(df, aes(x = sizes, colour = sex)) +
geom_histogram()
p2 <- ggplot(df, aes(y = sizes, x = sex)) +
geom_boxplot()
library(patchwork)
p1 + p2
m <- glm(sizes ~ 1, data = df, family = "gaussian")
cat(
"Base model: Fit the distribution to the data\n",
"y = N(µ, σ)\n"
)
summary(m)
cat(
"Model: which considers the sex\n",
"y = N(µ = a*F + b*M, σ)\n"
)
m <- glm(sizes ~ sex, data = df, family = "gaussian")
summary(m)
# By hand
# Fit a normal distribution to the data without considering the sex
fit_base <- fitdistr(df$sizes, "normal")
mean_base <- fit_base$estimate[1]
sd_base <- fit_base$estimate[2]
# Calculate log-likelihood for the base model
log_likelihood_base <- sum(log(dnorm(df$sizes, mean = mean_base, sd = sd_base)))
log_likelihood_base
# Fit separate distributions for each group
fit_female <- fitdistr(df$sizes[df$sex == "F"], "normal")
mean_female <- fit_female$estimate[1]
sd_female <- fit_female$estimate[2]
fit_male <- fitdistr(df$sizes[df$sex == "M"], "normal")
mean_male <- fit_male$estimate[1]
sd_male <- fit_male$estimate[2]
# Calculate log-likelihood for the model considering sex
log_likelihood_female <- sum(log(dnorm(df$sizes[df$sex == "F"], mean = mean_female, sd = sd_female)))
log_likelihood_male <- sum(log(dnorm(df$sizes[df$sex == "M"], mean = mean_male, sd = sd_male)))
log_likelihood_total <- log_likelihood_female + log_likelihood_male
log_likelihood_total
cat("Log-likelihood of the base model:", log_likelihood_base, "\n")
cat("Log-likelihood of the model considering sex:", log_likelihood_total, "\n")
```


## Constructing a Regression Model by Hand

### 1. Formulate the Regression Equation

Suppose you're constructing a simple linear regression model for a response variable $Y$ with a single predictor $X$. The model would be:
$$
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
$$
where:
- $\beta_0$ is the intercept,
- $\beta_1$ is the slope coefficient,
- $\epsilon_i$ is the error term assumed to be normally distributed with mean 0 and variance $\sigma^2$.

### 2. Set Up the Design Matrix

For a model with $n$ observations:
- Create a matrix $X$ for the predictors, which includes a column of 1s for the intercept:
$$
X = \begin{bmatrix}
1 & X_1 \\
1 & X_2 \\
\vdots & \vdots \\
1 & X_n \\
\end{bmatrix}
$$

### 3. Estimate the Coefficients

The least squares estimates $\hat{\beta}$ can be found using:
$$
\hat{\beta} = (X^T X)^{-1} X^T Y
$$

#### By Hand Example

- **Calculate $X^T X$**:
$$
X^T X = \begin{bmatrix}
n & \sum X_i \\
\sum X_i & \sum X_i^2 \\
\end{bmatrix}
$$

- **Calculate $X^T Y$**:
$$
X^T Y = \begin{bmatrix}
\sum Y_i \\
\sum X_i Y_i \\
\end{bmatrix}
$$

- **Find $(X^T X)^{-1}$**:
$$
(X^T X)^{-1} = \frac{1}{n \sum X_i^2 - (\sum X_i)^2} \begin{bmatrix}
\sum X_i^2 & -\sum X_i \\
-\sum X_i & n \\
\end{bmatrix}
$$

- **Compute $\hat{\beta}$**:
$$
\hat{\beta} = (X^T X)^{-1} X^T Y
$$

### 4. Compute Residuals

Calculate residuals $\hat{\epsilon}_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ is obtained using the fitted model:
$$
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
$$

### 5. Estimate Variance $\sigma^2$

The estimate of the variance $\sigma^2$ is:
$$
\hat{\sigma}^2 = \frac{\sum \hat{\epsilon}_i^2}{n - p}
$$
where $p$ is the number of estimated parameters (e.g., $p = 2$ for intercept and slope).

```{r Regressions Continued}
design_matrix <- model.matrix(sizes ~ sex, data = df)
head(design_matrix)
tail(design_matrix)
beta_hat <- solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% df$sizes
beta_hat
cat(
"Model: which considers the sex\n",
"y = N(µ = a*F + b*M, σ)\n"
)
m <- glm(sizes ~ sex, data = df, family = "gaussian")
summary(m)
sum(residuals(m)^2) / (nrow(df) - 2)
# Calc residuals
# NOTE: Is beta_hat[1] + beta_hat[2] * df$sizes but in addition sex is considered
y_hat <- design_matrix %*% beta_hat
residuals <- df$sizes - y_hat
variance <- sum(residuals^2) / (nrow(df) - 2)
variance
```

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 970ef95

Please sign in to comment.