started writing docu for distris and resids

ComPlat · Nov 13, 2024 · 970ef95 · 970ef95
1 parent 7488aee
commit 970ef95
Show file tree

Hide file tree

Showing 20 changed files with 4,654 additions and 0 deletions.
diff --git a/bs/.development/Distris.R b/bs/.development/Distris.R
@@ -0,0 +1,28 @@
+dnorm(0.25, mean = 0, sd = 1)
+dnorm(0.5, mean = 0, sd = 1)
+dnorm(0.005, mean = 0, sd = 1)
+dnorm(5, mean = 0, sd = 1)
+dnorm(0.01, mean = 0, sd = 1)
+
+mean <- 0
+sd <- 1
+set.seed(123) # For reproducibility
+samples <- rnorm(1000, mean, sd)
+hist(samples,
+  breaks = 30, probability = TRUE, col = "lightgray",
+  main = "Histogram of Random Samples with PDF Overlay",
+  xlab = "x", ylab = "Density", ylim = c(0, 1.25)
+)
+
+x <- seq(-4, 4, length.out = 100)
+lines(x, dnorm(x, mean, sd), col = "blue", lwd = 2)
+
+x <- seq(-4, 4, length.out = 100)
+lines(x, pnorm(x, mean, sd), type = "l", lwd = 2, col = "red")
+
+lines(density(samples), col = "green", lwd = 2)
+
+legend("topright",
+  legend = c("PDF (dnorm)", "CDF (pnorm)", "Empirical Density", "Histogram of Samples"),
+  col = c("blue", "red", "green", "lightgray"), lwd = 2, fill = c(NA, NA, NA, "lightgray")
+)
diff --git a/bs/.development/Distris.html b/bs/.development/Distris.html
diff --git a/bs/.development/Distris.qmd b/bs/.development/Distris.qmd
@@ -0,0 +1,338 @@
+---
+title: "Exploring Probability Distributions in R"
+output: html_document  # Use `output: quarto::html` for Quarto
+---
+
+## Introduction
+
+This document provides an overview of the `d-`, `p-`, `q-`, and `r-` functions in R,
+focusing on their roles in understanding probability distributions.
+
+## Theoretical Background
+
+### The `d-` Function (Density)
+The `d-` functions, such as `dnorm()`, calculate the density (or probability mass)
+at a given point. For continuous distributions, this represents the height
+of the probability density function (PDF) at a specific value.
+Use it to understand the likelihood of observing a specific value under a given distribution.
+For example, dnorm(0, mean = 0, sd = 1) returns the density at x = 0.
+It helps assess how probable a particular outcome is.
+
+### The `p-` Function (Cumulative Distribution)
+The `p-` functions, like `pnorm()`, return the cumulative probability up
+to a given point. This is known as the cumulative distribution function (CDF),
+which tells us the probability of a random variable being
+less than or equal to a specific value.
+Function (Cumulative Probability):
+calculates the cumulative probability up to a specific point,
+indicating the probability of a value being less than or equal to that point.
+For instance, pnorm(1, mean = 0, sd = 1) gives the probability P(X≤1).
+It's useful for finding thresholds or quantifying areas under the curve.
+
+### The `q-` Function (Quantile)
+The `q-` functions, such as `qnorm()`, provide the inverse of the CDF.
+Given a probability, `q-` functions return the quantile that corresponds to that probability.
+q- Function (Quantile): calculates the value corresponding to a given
+cumulative probability. Use it when you need the value for a specified percentile.
+For example, qnorm(0.975, mean = 0, sd = 1) returns the 97.5th percentile
+of the standard normal distribution, useful in setting confidence intervals.
+
+### The `r-` Function (Random Generation)
+The `r-` functions, like `rnorm()`, are used for generating random samples from a
+specified distribution. This is essential for simulations and statistical analysis.
+
+## Visualizing a Standard Normal Distribution
+
+Let's create a histogram of random samples drawn from a standard normal
+distribution and overlay it with the PDF, CDF, and empirical density.
+
+```{r setup, include=FALSE}
+mean <- 0
+sd <- 1
+set.seed(123)
+```
+
+### Generating Random Samples
+
+```{r random-samples}
+samples <- rnorm(1000, mean, sd)
+```
+
+### Plotting the Histogram with Overlays
+
+```{r plot-histogram}
+library(ggplot2)
+x <- seq(-4, 4, length.out = 100)
+df <- data.frame(x = x, value = samples, group = "Measured")
+# Sequence for plotting PDF and CDF and empirical density
+densi_emp <- density(samples)
+
+# Normalize the measured points
+hist_data <- hist(samples, breaks = 30, plot = FALSE)
+N <- length(samples)
+n_counts_in_bins <- hist_data$counts
+bin_widths <- hist_data$breaks[2] - hist_data$breaks[1]
+densi <- n_counts_in_bins / (bin_widths * N)
+
+df <- rbind(df,
+  data.frame(x = x, value = dnorm(x, mean, sd), group = "PDF"),
+  data.frame(x = x, value = pnorm(x, mean, sd), group = "CDF"),
+  data.frame(x = densi_emp$x, value = densi_emp$y, group = "Empirical Density")
+)
+
+ggplot() +
+  geom_histogram(
+    data = df[df$group == "Measured", ],
+    aes(x = value, y = ..density..) 
+    # This normalizes the histogram: n/(binwidth*N) = density;
+    # Where N is the total number of observations and
+    # n is the number of observations in each bin.
+    # See above for more details
+  ) +
+  geom_line(
+    data = df[df$group != "Measured", ],
+    aes(x = x, y = value, color = group),
+    size = 1.5
+  ) +
+  theme(legend.title = element_blank())
+```
+
+
+## Likelihood of a Dataset
+Likelihood refers to how probable a specific set of data is under a given model.
+It quantifies the model’s ability to explain the observed data.
+Maximizing the likelihood helps estimate the parameters that make the observed data
+most probable (used in maximum likelihood estimation).
+
+### fitdistr and AIC
+- fitdistr (in MASS package): Fits a distribution to data by estimating
+  its parameters using maximum likelihood estimation.
+  It's useful for modeling and checking which distribution best matches your data.
+- AIC (Akaike Information Criterion): Measures the relative quality
+  of statistical models, balancing model fit and complexity. A lower AIC
+  indicates a model that better explains the data with fewer parameters.
+
+```{r Likelihood}
+library(FAwR)
+library(MASS)
+data(gutten)
+hist(gutten$dbh.cm)
+fit_norm <- fitdistr(gutten$dbh.cm, "normal")
+fit_norm
+likelihood <- prod(
+  dnorm(gutten$dbh.cm,
+  mean = fit_norm$estimate[1],
+  sd = fit_norm$estimate[2])
+) # Cannot be calculated like this by a computer --> is 0
+likelihood <- sum(
+  log(
+    dnorm(gutten$dbh.cm,
+      mean = fit_norm$estimate[1],
+      sd = fit_norm$estimate[2])
+  )
+)
+likelihood
+logLik(fit_norm)
+AIC(fit_norm)
+BIC(fit_norm)
+
+# log normal
+fit_lognorm <- fitdistr(gutten$dbh.cm, "lognormal")
+fit_lognorm
+likelihood <- sum(
+  log(
+    dnorm(gutten$dbh.cm,
+      mean = fit_lognorm$estimate[1],
+      sd = fit_lognorm$estimate[2])
+  )
+)
+likelihood
+logLik(fit_lognorm)
+AIC(fit_lognorm)
+BIC(fit_lognorm)
+```
+
+## Regression i.e. generalized linear models
+
+A regression describes the quantitative relationship between
+one or more predictor variables and a response variable.
+Here a distribution is again fitted to the measured data.
+However, a predictor effects certain parameters of the distribution.
+For example, we have three factors in our dataset: genotype, time and treatment.
+Moreover, we want to predict the CO2 uptake rate (our response variable).
+Furthermore, we assume normally distributed values.
+CO2uptake ~ N(µ = a + b*genotype + c*time + d*treatment, σ)
+
+### The link function/scala
+Often distributions have only a certain interval in which they
+are defined. However, if a predictor takes values which are outside
+of the interval,. we need to transform the predictor to a new scale.
+This is the task of the link function. 
+
+
+
+```{r LinkFunctions}
+#| echo = FALSE
+df <- data.frame(distributions = c("Normal", "Poisson", "Binomial", "Gamma", "Negative Binomial"),
+                link_functions = c("Identity", "Log", "Logit (ln(y / (1 - y))", "Inverse", "Log"))
+knitr::kable(df)
+```
+
+
+```{r Regressions}
+df <- data.frame(sizes = c(rnorm(10000, mean = 170, sd = 5), rnorm(10000, mean = 182, sd = 5)),
+                 sex = rep(c("F", "M"), each = 10000))
+p1 <- ggplot(df, aes(x = sizes, colour = sex)) +
+  geom_histogram()
+p2 <- ggplot(df, aes(y = sizes, x = sex)) +
+  geom_boxplot()
+library(patchwork)
+p1 + p2
+m <- glm(sizes ~ 1, data = df, family = "gaussian")
+cat(
+  "Base model: Fit the distribution to the data\n",
+  "y = N(µ, σ)\n"
+)
+summary(m)
+cat(
+  "Model: which considers the sex\n",
+  "y = N(µ = a*F + b*M, σ)\n"
+)
+m <- glm(sizes ~ sex, data = df, family = "gaussian")
+summary(m)
+
+# By hand
+# Fit a normal distribution to the data without considering the sex
+fit_base <- fitdistr(df$sizes, "normal")
+mean_base <- fit_base$estimate[1]
+sd_base <- fit_base$estimate[2]
+
+# Calculate log-likelihood for the base model
+log_likelihood_base <- sum(log(dnorm(df$sizes, mean = mean_base, sd = sd_base)))
+log_likelihood_base
+
+# Fit separate distributions for each group
+fit_female <- fitdistr(df$sizes[df$sex == "F"], "normal")
+mean_female <- fit_female$estimate[1]
+sd_female <- fit_female$estimate[2]
+
+fit_male <- fitdistr(df$sizes[df$sex == "M"], "normal")
+mean_male <- fit_male$estimate[1]
+sd_male <- fit_male$estimate[2]
+
+# Calculate log-likelihood for the model considering sex
+log_likelihood_female <- sum(log(dnorm(df$sizes[df$sex == "F"], mean = mean_female, sd = sd_female)))
+log_likelihood_male <- sum(log(dnorm(df$sizes[df$sex == "M"], mean = mean_male, sd = sd_male)))
+
+log_likelihood_total <- log_likelihood_female + log_likelihood_male
+log_likelihood_total
+
+cat("Log-likelihood of the base model:", log_likelihood_base, "\n")
+cat("Log-likelihood of the model considering sex:", log_likelihood_total, "\n")
+
+```
+
+
+## Constructing a Regression Model by Hand
+
+### 1. Formulate the Regression Equation
+
+Suppose you're constructing a simple linear regression model for a response variable $Y$ with a single predictor $X$. The model would be:
+$$
+Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
+$$
+where:
+- $\beta_0$ is the intercept,
+- $\beta_1$ is the slope coefficient,
+- $\epsilon_i$ is the error term assumed to be normally distributed with mean 0 and variance $\sigma^2$.
+
+### 2. Set Up the Design Matrix
+
+For a model with $n$ observations:
+- Create a matrix $X$ for the predictors, which includes a column of 1s for the intercept:
+$$
+X = \begin{bmatrix}
+1 & X_1 \\
+1 & X_2 \\
+\vdots & \vdots \\
+1 & X_n \\
+\end{bmatrix}
+$$
+
+### 3. Estimate the Coefficients
+
+The least squares estimates $\hat{\beta}$ can be found using:
+$$
+\hat{\beta} = (X^T X)^{-1} X^T Y
+$$
+
+#### By Hand Example
+
+- **Calculate $X^T X$**:
+$$
+X^T X = \begin{bmatrix}
+n & \sum X_i \\
+\sum X_i & \sum X_i^2 \\
+\end{bmatrix}
+$$
+
+- **Calculate $X^T Y$**:
+$$
+X^T Y = \begin{bmatrix}
+\sum Y_i \\
+\sum X_i Y_i \\
+\end{bmatrix}
+$$
+
+- **Find $(X^T X)^{-1}$**:
+$$
+(X^T X)^{-1} = \frac{1}{n \sum X_i^2 - (\sum X_i)^2} \begin{bmatrix}
+\sum X_i^2 & -\sum X_i \\
+-\sum X_i & n \\
+\end{bmatrix}
+$$
+
+- **Compute $\hat{\beta}$**:
+$$
+\hat{\beta} = (X^T X)^{-1} X^T Y
+$$
+
+### 4. Compute Residuals
+
+Calculate residuals $\hat{\epsilon}_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ is obtained using the fitted model:
+$$
+\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
+$$
+
+### 5. Estimate Variance $\sigma^2$
+
+The estimate of the variance $\sigma^2$ is:
+$$
+\hat{\sigma}^2 = \frac{\sum \hat{\epsilon}_i^2}{n - p}
+$$
+where $p$ is the number of estimated parameters (e.g., $p = 2$ for intercept and slope).
+
+```{r Regressions Continued}
+design_matrix <- model.matrix(sizes ~ sex, data = df)
+head(design_matrix)
+tail(design_matrix)
+beta_hat <- solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% df$sizes
+beta_hat
+cat(
+  "Model: which considers the sex\n",
+  "y = N(µ = a*F + b*M, σ)\n"
+)
+m <- glm(sizes ~ sex, data = df, family = "gaussian")
+summary(m)
+sum(residuals(m)^2) / (nrow(df) - 2)
+
+# Calc residuals
+# NOTE: Is beta_hat[1] + beta_hat[2] * df$sizes but in addition sex is considered
+y_hat <- design_matrix %*% beta_hat 
+residuals <- df$sizes - y_hat
+variance <- sum(residuals^2) / (nrow(df) - 2)
+variance
+
+
+```
+
diff --git a/bs/.development/Distris_files/figure-html/Likelihood-1.png b/bs/.development/Distris_files/figure-html/Likelihood-1.png
diff --git a/bs/.development/Distris_files/figure-html/Regressions Continued-1.png b/bs/.development/Distris_files/figure-html/Regressions Continued-1.png
diff --git a/bs/.development/Distris_files/figure-html/Regressions-1.png b/bs/.development/Distris_files/figure-html/Regressions-1.png
diff --git a/bs/.development/Distris_files/figure-html/Regressions-2.png b/bs/.development/Distris_files/figure-html/Regressions-2.png
diff --git a/bs/.development/Distris_files/figure-html/plot-histogram-1.png b/bs/.development/Distris_files/figure-html/plot-histogram-1.png
diff --git a/bs/.development/Distris_files/figure-html/plot-histogram-2.png b/bs/.development/Distris_files/figure-html/plot-histogram-2.png