diff --git a/4F13/gaussian-processes.md b/4F13/gaussian-processes.md
index 73f33c9..29d500f 100644
--- a/4F13/gaussian-processes.md
+++ b/4F13/gaussian-processes.md
@@ -1,9 +1,13 @@
 # Index
 
-- [Gaussian Processes](#9-gaussian-processes)
-
-
+- [9-Gaussian-Processes*](#9-gaussian-processes)
+- [10-Gaussian-Processes-and-Data*](#10-gaussian-processes-and-data)
+- [11-Gaussian-Process-Marginal-Likelihood-and-Hyperparameters](#11-gaussian-process-marginal-likelihood-and-hyperparameters)
+- [12-Correspondence-Between-Linear-Models-and-Gaussian-Processes](#12-correspondence-between-linear-models-and-gaussian-processes)
+- [13-Covariance-Functions](#13-covariance-functions)
+- [14-Finite-and-Infinite-Basis-GPs](#14-finite-and-infinite-basis-gps)
 
+---
 ## 9-Gaussian-Processes
 
 ### From Scalar Gaussians to Multivariate Gaussians to Gaussian Processes
@@ -11,10 +15,7 @@
 1. **Scalar Gaussian**: A single random variable $x$ with distribution $N(\mu, \sigma^2)$.
 
 2. **Multivariate Gaussian**: A vector $x = [x_1, x_2, \dots, x_N]^T$ with joint Gaussian distribution:
-
-   $$
-   p(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{N/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
-   $$
+   $$p(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{N/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
 
 3. **Gaussian Process (GP)**: An extension to infinitely many variables.
 
@@ -39,60 +40,89 @@ Key properties:
 - **Marginalization**: The marginal distribution over any subset of variables is Gaussian.
 - **Conditioning**: The conditional distribution given some variables is also Gaussian.
 
-The marginalization property simplifies Gaussian processes (GPs) by leveraging their unique characteristics. The marginalization property allows you to work with finite-dimensional slices of the GP. Specifically: $$ p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y}) \, d\mathbf{y}. $$ For a multivariate Gaussian: $$ \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & B^\top \\ B & C \end{bmatrix} \right) \quad \Rightarrow \quad p(\mathbf{x}) \sim \mathcal{N}(\mathbf{a}, A). $$ In Gaussian processes, this property enables predictions based only on finite-dimensional covariance matrices without handling infinite-dimensional computations.
+The marginalization property simplifies Gaussian processes (GPs) by leveraging their unique characteristics. The marginalization property allows you to work with finite-dimensional slices of the GP. Specifically:
+$$ p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y}) \, d\mathbf{y}. $$
+For a multivariate Gaussian:
+$$ \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & B^\top \\ B & C \end{bmatrix} \right) \quad \Rightarrow \quad p(\mathbf{x}) \sim \mathcal{N}(\mathbf{a}, A). $$
+In Gaussian processes, this property enables predictions based only on finite-dimensional covariance matrices without handling infinite-dimensional computations.
 
 ### GP as a distribution over functions
 
 A GP defines a distribution over functions. Each finite collection of function values follows a multivariate Gaussian distribution.
 
-**Example**: $$ p(f) \sim \mathcal{N}(m, k), \quad m(x) = 0, \quad k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2\right). $$ For a finite set of points $\{x_1, x_2, ..., x_N\}$, the function values $f(x_1), f(x_2), ..., f(x_N)$ are jointly Gaussian: $$ f \sim \mathcal{N}(0, \Sigma), \quad \Sigma_{ij} = k(x_i, x_j). $$To visualize a GP, draw samples from this multivariate Gaussian and plot them as functions.
+**Example**: $$ p(f) \sim \mathcal{N}(m, k), \quad m(x) = 0, \quad k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2\right). $$
+For a finite set of points $\{x_1, x_2, ..., x_N\}$, the function values $\{f_1, f_2, ..., f_N\}$ are jointly Gaussian:
+
+$$ f \sim \mathcal{N}(0, \Sigma), \quad \Sigma_{ij} = k(x_i, x_j). $$
+
+To visualize a GP, draw samples from this multivariate Gaussian and plot them as functions.
+
+### Sampling from a Gaussian Process (GP)
 
-### Generating Functions from a GP
 - **Goal**: Generate samples from a joint Gaussian distribution with mean $\mathbf{m}$ and covariance $\mathbf{K}$.
 
-Simpler case; assume $m = 0$:
-1. **Select Inputs**: Choose $N$ input points $x_1, x_2, \dots, x_N$.
-2. **Compute Covariance Matrix**: $K_{ij} = k(x_i, x_j)$.
-3. **Sample Function Values**: Draw $f \sim N(0, K)$.
-4. **Plot Function**: Plot $f$ versus $x$.
-Similarly, for $m \neq 0$:
-  1. Generate random standard normal samples $\mathbf{z} \sim \mathcal{N}(0, I)$.
-  2. Compute $\mathbf{y} = \text{chol}(\mathbf{K})^\top \mathbf{z} + \mathbf{m}$,
-  where $\text{chol}(\mathbf{K})$ is the Cholesky decomposition of $\mathbf{K}$ such that $\mathbf{R}^\top \mathbf{R} = \mathbf{K}$.
+The following two methods are **conceptually the same** in the sense that they both generate samples from the same joint Gaussian prior defined by the GP. However, they differ in how they approach the computation:
+
+- **Direct Sampling**: All samples are generated simultaneously using the full covariance matrix $\mathbf{K}$.
+- **Sequential Sampling**: Samples are generated one by one using conditional distributions, which can be derived from the same $\mathbf{K}$.
+
+### **1. Direct Sampling Using Cholesky Decomposition**
+This method generates samples by directly leveraging the multivariate Gaussian distribution:
 
-The Cholesky factorization ensures the generated samples have the correct covariance structure $\mathbf{K}$.
+- **Steps**:
+  1. **Select Inputs**: Choose $N$ input points $\{x_i\}_{i=1}^{N}$.
+  2. **Covariance Matrix**: Compute the covariance matrix $\mathbf{K}$ for all chosen input points $\{x_i\}_{i=1}^{N}$ using the kernel $k(x_i, x_j)$.
+  3. **Sampling $\mathbf{f}$ from a Gaussian Process**. Both methods are equivalent:
 
-#### Sequential Generation
-Generate function values one at a time, conditioning on previous values. This uses properties of conditional Gaussians.
+- **Sample $\mathbf{z}$ and Transform**:
+  - Draw $\mathbf{z} \sim \mathcal{N}(0, I)$, where $\mathbf{z}$ is a vector of independent standard normal samples.
+  - Transform $\mathbf{z}$ to match the desired covariance $\mathbf{K}$:
+    $$
+    \mathbf{f} = \text{chol}(\mathbf{K})^\top \mathbf{z} + \mathbf{m},
+    $$
+    where $\mathbf{m}$ is the mean vector.
 
-- **Factorization**:
+- **Direct Sampling**:
+  - Directly sample $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, \mathbf{K})$ using computational libraries.
+
+- **Purpose**: The Cholesky decomposition ensures that the resulting samples have the correct covariance $\mathbf{K}$ and mean $\mathbf{m}$.
+
+### **2. Sequential Sampling Using Conditional Gaussians**
+This method generates samples by iteratively sampling one point at a time, conditioning on previously sampled points:
+
+- **Steps**:
+
+1. **Factorization**: Use the chain rule for multivariate Gaussians to factorize the joint distribution:
   $$ p(f_1, ..., f_N \mid x_1, ..., x_N) = \prod_{n=1}^N p(f_n \mid f_{