- 9-Gaussian-Processes*
- 10-Gaussian-Processes-and-Data*
- 11-Gaussian-Process-Marginal-Likelihood-and-Hyperparameters*
- 12-Correspondence-Between-Linear-Models-and-Gaussian-Processes*
- 13-Covariance-Functions
- 14-Finite-and-Infinite-Basis-GPs
- Scalar Gaussian: A single random variable $x$ with distribution $\mathcal{N}(\mu, \sigma^2)$.
- Multivariate Gaussian: A vector $x = [x_1, x_2, \dots, x_N]^T$ with joint Gaussian distribution: $$p(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{N/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
- Gaussian Process (GP): An extension to infinitely many variables.
- Definition: A collection of random variables, any finite number of which have a joint Gaussian distribution.
- Intuition: Think of functions as infinitely long vectors.
A GP is fully specified by:
- Mean function $m(x) = E[f(x)]$
- Covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$
Notation: $f \sim \mathcal{GP}(m, k)$.
Key properties:
- Marginalization: The marginal distribution over any subset of variables is Gaussian.
- Conditioning: The conditional distribution given some variables is also Gaussian.
The marginalization property is what makes Gaussian processes (GPs) tractable: it allows you to work with finite-dimensional slices of an infinite-dimensional object. Specifically:
For a multivariate Gaussian: $$ p(x, y) = \mathcal{N}\!\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \right) \quad \Longrightarrow \quad p(x) = \int p(x, y)\, dy = \mathcal{N}(a, A). $$
In Gaussian processes, this property enables predictions based only on finite-dimensional covariance matrices without handling infinite-dimensional computations.
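A small NumPy check of both properties on a 3-dimensional Gaussian (the block sizes and parameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 3-dimensional Gaussian.
mu = np.array([1.0, -0.5, 2.0])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 1e-3 * np.eye(3)          # symmetric positive definite

# Marginalization: the marginal of the first two components is simply
# N(mu[:2], Sigma[:2, :2]) -- no integration required.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(np.cov(samples[:, :2].T))             # approximately equals Sigma[:2, :2]
print(Sigma[:2, :2])

# Conditioning: p(x_1, x_2 | x_3 = c) is Gaussian with
#   mean = mu_a + S_ab S_bb^{-1} (c - mu_b),  cov = S_aa - S_ab S_bb^{-1} S_ba.
c = 1.5
S_aa, S_ab, S_bb = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]
cond_mean = mu[:2] + (S_ab / S_bb[0, 0]) @ np.array([c - mu[2]])
cond_cov = S_aa - S_ab @ S_ab.T / S_bb[0, 0]
print(cond_mean, cond_cov, sep="\n")
```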
A GP defines a distribution over functions. Each finite collection of function values follows a multivariate Gaussian distribution.
Example: $$ p(f) \sim \mathcal{N}(m, k), \quad m(x) = 0, \quad k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2\right). $$
For a finite set of points $x_1, \dots, x_N$, the corresponding function values $\mathbf{f} = [f(x_1), \dots, f(x_N)]^\top$ follow $\mathcal{N}(\mathbf{m}, \mathbf{K})$ with $m_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.
To visualize a GP, draw samples from this multivariate Gaussian and plot them as functions.
- Goal: Generate samples from a joint Gaussian distribution with mean $\mathbf{m}$ and covariance $\mathbf{K}$.
The following two methods are conceptually the same in the sense that they both generate samples from the same joint Gaussian prior defined by the GP. However, they differ in how they approach the computation:
- Direct Sampling: All samples are generated simultaneously using the full covariance matrix $\mathbf{K}$.
- Sequential Sampling: Samples are generated one by one using conditional distributions, which can be derived from the same $\mathbf{K}$.
This method generates samples by directly leveraging the multivariate Gaussian distribution:
- Steps:
  - Select Inputs: Choose $N$ input points $\{x_i\}_{i=1}^{N}$.
  - Covariance Matrix: Compute the covariance matrix $\mathbf{K}$ for all chosen input points $\{x_i\}_{i=1}^{N}$ using the kernel $k(x_i, x_j)$.
  - Sampling $\mathbf{f}$ from the Gaussian process. Both of the following are equivalent:
    - Sample $\mathbf{z}$ and Transform: Draw $\mathbf{z} \sim \mathcal{N}(0, I)$, where $\mathbf{z}$ is a vector of independent standard normal samples, then transform $\mathbf{z}$ to match the desired covariance $\mathbf{K}$: $$ \mathbf{f} = \text{chol}(\mathbf{K})^\top \mathbf{z} + \mathbf{m}, $$ where $\mathbf{m}$ is the mean vector.
    - Direct Sampling: Directly sample $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, \mathbf{K})$ using computational libraries.
  - Purpose: The Cholesky decomposition ensures that the resulting samples have the correct covariance $\mathbf{K}$ and mean $\mathbf{m}$.
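A minimal NumPy sketch of direct sampling via the Cholesky factor (the squared exponential kernel, jitter value, and input grid are illustrative choices, not prescribed by the notes):

```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)

# 1. Select inputs.
x = np.linspace(-5, 5, 200)

# 2. Covariance matrix (a small "jitter" keeps the Cholesky factorization stable).
K = se_kernel(x, x) + 1e-9 * np.eye(len(x))
m = np.zeros(len(x))                        # zero mean function

# 3. Transform standard normals: f = m + L z with L L^T = K
#    (equivalent to chol(K)^T z in the notes' upper-triangular convention).
L = np.linalg.cholesky(K)
z = rng.standard_normal((len(x), 3))
f = m[:, None] + L @ z                      # three independent prior samples

plt.plot(x, f)
plt.title("Samples from the GP prior")
plt.show()
```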
This method generates samples by iteratively sampling one point at a time, conditioning on previously sampled points:
- Steps:
  - Factorization: Use the chain rule for multivariate Gaussians to factorize the joint distribution: $$ p(f_1, \dots, f_N \mid x_1, \dots, x_N) = \prod_{n=1}^N p(f_n \mid \mathbf{f}_{<n}, x_{\leq n}). $$
  - The joint prior over the already-sampled values and the next value is Gaussian: $$ \begin{bmatrix} \mathbf{f}_{<n} \\ f_n \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{<n,<n} & \mathbf{k}_{<n,n} \\ \mathbf{k}_{n,<n} & k_{nn} \end{bmatrix} \right). $$
  - Conditional distribution: each factor is therefore Gaussian, $$ p(f_n \mid \mathbf{f}_{<n}) = \mathcal{N}\!\left( \mathbf{k}_{n,<n} \mathbf{K}_{<n,<n}^{-1} \mathbf{f}_{<n},\; k_{nn} - \mathbf{k}_{n,<n} \mathbf{K}_{<n,<n}^{-1} \mathbf{k}_{<n,n} \right). $$
  - Purpose: This approach samples points sequentially, conditioning on previously sampled values.
- Illustration:
  - The shaded regions and lines in the plots show how the GP updates its predictions as new data points are added.
![[gp-samp.png]]
- Small to Moderate Number of Input Points: Use direct sampling (Cholesky decomposition) for simplicity and efficiency.
- Large Number of Input Points or Online Sampling: Use sequential sampling, especially if you need to incorporate new input points dynamically without recomputing the entire covariance matrix.
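A minimal sketch of sequential sampling using the conditioning formulas above (the kernel is redefined so the snippet is self-contained; the naive full solve at every step is for clarity only):

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    d = np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 100)                # inputs visited one at a time
seen = np.empty(0)                          # inputs sampled so far
f = np.empty(0)                             # corresponding sampled function values

for x_new in xs:
    k_ss = se_kernel([x_new], [x_new])[0, 0]
    if seen.size == 0:
        mean, var = 0.0, k_ss               # unconditional prior at the first point
    else:
        K = se_kernel(seen, seen) + 1e-9 * np.eye(seen.size)
        k_s = se_kernel(seen, [x_new])[:, 0]
        w = np.linalg.solve(K, k_s)
        mean = w @ f                        # conditional mean given earlier samples
        var = k_ss - k_s @ w                # conditional variance
    f = np.append(f, mean + np.sqrt(max(var, 0.0)) * rng.standard_normal())
    seen = np.append(seen, x_new)
```

For clarity this re-solves a growing linear system at every step; an efficient version would instead update a Cholesky factor incrementally, which is what makes sequential sampling attractive when new input points arrive dynamically.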
Given observed data $D = \{(x_i, y_i)\}_{i=1}^N$, we want to predict $f$ at new inputs $x_*$.
Assumption: Observations are noisy measurements of the underlying function, $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\text{noise}})$.
In our non-parametric model, the "parameters" are the function itself!
The joint distribution:
$$
p(\mathbf{f}, \mathbf{y}) = p(\mathbf{f}) p(\mathbf{y} \mid \mathbf{f}) = p(\mathbf{y}) p(\mathbf{f} \mid \mathbf{y})
\quad \Rightarrow \quad \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{k})\, \mathcal{N}(\mathbf{y} \mid \mathbf{f}) = Z_{\mid \mathbf{y}}\, \mathcal{N}(\mathbf{f} \mid \mathbf{m}_{\mid \mathbf{y}}, \mathbf{k}_{\mid \mathbf{y}}).
$$
where:
- Gaussian process prior with zero mean and covariance function $\mathbf{k}$: $$ p(\mathbf{f} \mid \mathcal{M}_i) \sim \mathcal{N}(\mathbf{f} \mid \mathbf{m} \equiv 0, \mathbf{k}), $$
- Gaussian likelihood, with noise variance $\sigma^2_{\text{noise}}$: $$ p(\mathbf{y} \mid \mathbf{f}, \mathcal{M}_i) \sim \mathcal{N}(\mathbf{f}, \sigma^2_{\text{noise}} \mathbf{I}), $$

leads to a Gaussian process posterior: $$ p(\mathbf{f} \mid \mathbf{y}, \mathcal{M}_i) \sim \mathcal{N}(\mathbf{f} \mid \mathbf{m}_{\mid \mathbf{y}}, \mathbf{k}_{\mid \mathbf{y}}), $$ where: $$ \begin{aligned} \mathbf{m}_{\mid \mathbf{y}}(x) &= \mathbf{k}(x, \mathbf{x}) [\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2_\text{noise} \mathbf{I}]^{-1} \mathbf{y}, \\ \mathbf{k}_{\mid \mathbf{y}}(x, x') &= k(x, x') - \mathbf{k}(x, \mathbf{x}) [\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2_\text{noise} \mathbf{I}]^{-1} \mathbf{k}(\mathbf{x}, x'). \end{aligned} $$
Intuition:
- Posterior mean: a weighted combination of the observed targets $\mathbf{y}$, with weights determined by the covariance between $x$ and the training inputs.
- Posterior covariance: the first term is the prior covariance $k(x, x')$; the second, subtracted term is the reduction in uncertainty due to the observations.
Visualization:
- Prior Samples: Functions drawn from the GP prior.
- Posterior Samples: Functions drawn from the GP posterior, which now pass through (or near) the observed data points.
![[gp-data.png]]
The predictive distribution in Gaussian Processes (GPs) is essentially the posterior distribution over the function values at a new input point $x_*$. This is because GPs are non-parametric models, and we do not need to integrate over explicit parameters as in parametric Bayesian models. The predictive distribution for a new input $x_*$ is given by:
$$ p(y_* \mid x_*, \mathbf{x}, \mathbf{y}) \sim \mathcal{N}\!\left( \mathbf{k}(x_*, \mathbf{x})^\top \left[\mathbf{K} + \sigma_{\text{noise}}^2 \mathbf{I} \right]^{-1} \mathbf{y},\; k(x_*, x_*) + \sigma_{\text{noise}}^2 - \mathbf{k}(x_*, \mathbf{x})^\top \left[\mathbf{K} + \sigma_{\text{noise}}^2 \mathbf{I} \right]^{-1} \mathbf{k}(x_*, \mathbf{x}) \right). $$
- Mean: Describes the predicted value at $x_*$.
- Variance: Quantifies uncertainty at $x_*$.
The predictive mean formula: $$ \mu(x_*) = \mathbf{k}(x_*, \mathbf{x}) \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I} \right]^{-1} \mathbf{y}, $$ can be rewritten as: $$ \mu(x_*) = \sum_{n=1}^N \beta_n y_n = \sum_{n=1}^N \alpha_n k(x_*, x_n), $$ providing significant intuition about how Gaussian Processes make predictions by weighting observations using the kernel.
- The formula $\sum_{n=1}^N \beta_n y_n$ expresses the predictive mean $\mu(x_*)$ as a weighted sum of the observed outputs $y_n$, where the weights are $\beta_n = \sum_{m=1}^N \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}\right]^{-1}_{n,m} k(x_*, x_m)$. Alternatively: $$ \boldsymbol{\beta} = \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}\right]^{-1} \mathbf{k}(\mathbf{x}, x_*). $$
Intuition:
- $\beta_n$ as Influence Weights: The coefficients $\beta_n$ represent the influence of each training data point's similarity to the new input $x_*$ on the predictive mean $\mu(x_*)$. They quantify how much each observed output $y_n$ contributes to the prediction, weighted by the covariance between $x_n$ and $x_*$. The closer the test point $x_*$ is to a training point $x_n$ in the input space (as measured by the kernel), the larger the corresponding weight $\beta_n$, and vice versa.
- Role in Prediction: In the expression $\mu(x_*) = \sum_{n=1}^N \beta_n y_n$, each $\beta_n$ scales the observed output $y_n$. This means that the prediction at $x_*$ is a weighted sum of the training outputs, where the weights $\beta_n$ depend on both the covariance structure of the data and the similarity between $x_n$ and $x_*$.
- Effect of Noise and Correlations: The inversion of $\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}$ adjusts the weights $\beta_n$ based on noise and correlations in the data. Observations that are more relevant (e.g., closer to $x_*$ or less noisy) will have larger $\beta_n$ values, contributing more to the prediction.
- The formula $\sum_{n=1}^N \alpha_n k(x_*, x_n)$ expresses the predictive mean $\mu(x_*)$ as a weighted sum of kernel values between the test point $x_*$ and the training points $x_n$, where the weights are $\alpha_n = \sum_{m=1}^N \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}\right]^{-1}_{n,m} y_m$. Alternatively: $$ \boldsymbol{\alpha} = \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I} \right]^{-1} \mathbf{y}. $$
Intuition:
- $\alpha_n$ as Influence Weights: The coefficients $\alpha_n$ represent the influence of each training data point $(x_n, y_n)$ on the prediction at a new input $x_*$. They quantify how much each observed output $y_n$ contributes to the predictive mean $\mu(x_*)$, after accounting for the correlations (captured by the kernel matrix $\mathbf{K}$) and the noise in the observations.
- Role in Prediction: In the expression $\mu(x_*) = \sum_{n=1}^N \alpha_n k(x_*, x_n)$, each $\alpha_n$ scales the similarity between the new input $x_*$ and the training input $x_n$ (measured by $k(x_*, x_n)$). This means the prediction at $x_*$ is a weighted sum of the similarities to all training points, where the weights $\alpha_n$ are determined by both the training outputs and the covariance structure of the data.
- Effect of Noise and Correlations: The inversion of $\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}$ adjusts the weights $\alpha_n$ based on how noise and correlations between data points affect the reliability of each observation. Data points that are less affected by noise or are more informative (due to higher correlations) will generally have larger $\alpha_n$ values.
Conclude that the two sets of weights ($\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$) describe the same prediction from two equivalent viewpoints: weighting the observed outputs, or weighting the kernel similarities to the training inputs.
$$ \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}(x_*, \mathbf{x}) \left[\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma_{\text{noise}}^2 \mathbf{I} \right]^{-1} \mathbf{k}(\mathbf{x}, x_*). $$
Intuition:
- The posterior variance has two terms:
  - Prior Variance ($k(x_*, x_*)$): The initial uncertainty in the prior.
  - Subtract Information Gained from Data: a term subtracted based on how well the observed data explains $x_*$.
    - Similarity to Training Data: If $x_*$ is similar to training points, $\mathbf{k}(x_*, \mathbf{x})$ will have larger values.
    - Adjust for Noise and Correlation: The inverse term adjusts the influence of each training point based on noise and how the points correlate with each other.
    - Overall Reduction: The product $\mathbf{k}(x_*, \mathbf{x}) \left[\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}\right]^{-1} \mathbf{k}(x_*, \mathbf{x})^\top$ quantifies the total reduction in uncertainty at $x_*$ due to the observed data.
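A minimal sketch of the predictive mean and variance equations (the kernel, noise level, and toy data are illustrative choices, not from the notes):

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    d = np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, sigma_noise=0.1):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = se_kernel(x_train, x_train) + sigma_noise**2 * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test)                 # k(x, x_*)
    K_ss = se_kernel(x_test, x_test)

    alpha = np.linalg.solve(K, y_train)              # [K + s^2 I]^{-1} y
    mean = K_s.T @ alpha                             # predictive mean
    v = np.linalg.solve(K, K_s)                      # [K + s^2 I]^{-1} k(x, x_*)
    var = np.diag(K_ss) - np.sum(K_s * v, axis=0)    # latent-function variance
    return mean, var + sigma_noise**2                # add noise variance for y_*

# Toy usage with a noisy sine curve.
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 15)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(15)
x_test = np.linspace(-5, 5, 200)
mu, var = gp_predict(x_train, y_train, x_test)
```

In practice the matrix solves are done with a Cholesky factorization of $\mathbf{K} + \sigma_\text{noise}^2 \mathbf{I}$ for numerical stability; the plain `solve` calls above are kept for readability.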
The marginal likelihood (or evidence) is the probability of the observed data under the GP model: $$ p(y \mid x) = \int p(y \mid f, x)\, p(f \mid x)\, df. $$
Given $f \sim \mathcal{N}(m, K)$ and $\epsilon \sim \mathcal{N}(0, \sigma_n^2 I)$, and since the sum of independent Gaussians is again Gaussian, the observations satisfy $y = f + \epsilon \sim \mathcal{N}(m, K + \sigma_n^2 I)$.
Taking the natural logarithm of both sides gives the log marginal likelihood:
$$
\log p(y \mid x) = -\frac{1}{2} (y - m)^T (K + \sigma_n^2 I)^{-1} (y - m) - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{N}{2} \log 2\pi
$$
where the matrix $K + \sigma_n^2 I$ is the covariance of the observed targets $y$, combining the GP prior covariance and the observation noise.
Interpretation:
1. First Term: Data Fit
- The expression $(y - m)^T (K + \sigma_n^2 I)^{-1} (y - m)$ represents the squared Mahalanobis distance of $y$ from the mean. Unlike Euclidean distance, the Mahalanobis distance accounts for the covariance structure of the data, effectively scaling the dimensions according to their variances and covariances.
- During model training (e.g., hyperparameter optimization), the objective is to minimize this term. Minimizing the discrepancy ensures that the GP model's predictions are as close as possible to the observed data, considering the uncertainty captured by the covariance matrix.
2. Second Term: Model Complexity (Occam's Razor)
- The determinant $|K + \sigma_n^2 I|$ represents the volume of the uncertainty captured by the covariance matrix $\Sigma = K + \sigma_n^2 I$. A larger determinant indicates a more spread-out distribution, implying greater uncertainty.
- This term acts as a penalty for model complexity. A more complex model (with a covariance matrix that allows for greater variability) will generally have a larger determinant, leading to a higher penalty. Conversely, a simpler model will have a smaller determinant and thus a smaller penalty.
- This embodies the principle of Occam's Razor, which favors simpler models when possible.
3. Third Term: A normalization constant; it does not depend on the hyperparameters, so it is irrelevant for optimization.
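A minimal sketch of evaluating the log marginal likelihood and its three terms via a Cholesky factorization (the function and variable names are my own):

```python
import numpy as np

def log_marginal_likelihood(y, K, sigma_n, m=None):
    """log p(y | x) for y ~ N(m, K + sigma_n^2 I), split into its three terms."""
    N = len(y)
    m = np.zeros(N) if m is None else m
    C = K + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(C)                    # C = L L^T
    r = np.linalg.solve(L, y - m)                # L^{-1} (y - m)
    data_fit = -0.5 * r @ r                      # -1/2 (y-m)^T C^{-1} (y-m)
    complexity = -np.sum(np.log(np.diag(L)))     # -1/2 log|C|
    constant = -0.5 * N * np.log(2 * np.pi)      # -N/2 log 2*pi
    return data_fit + complexity + constant, (data_fit, complexity, constant)
```

Returning the three terms separately makes the data-fit vs. complexity trade-off visible when comparing hyperparameter settings.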
- Hyperparameters $\theta$: Parameters of the covariance function (e.g., length-scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$).
- Optimizing Hyperparameters: Find $\theta$ that maximizes the marginal likelihood: $$ \theta^* = \arg \max_\theta \log p(y \mid x, \theta). $$ This is a form of model selection.
Example:
- Squared Exponential Covariance Function: $$ k(x, x') = \sigma_f^2 \exp \left( -\frac{(x - x')^2}{2 \ell^2} \right) $$ By adjusting $\ell$ and $\sigma_f^2$, we can control the smoothness and amplitude of the functions.
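A minimal sketch of hyperparameter optimization by maximizing the log marginal likelihood, assuming SciPy is available; the toy data, initial values, and log-parameterization are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x1, x2, lengthscale, variance):
    d = np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def neg_log_marginal_likelihood(log_theta, x, y):
    # Work with log(theta) so all hyperparameters stay positive.
    ell, sf2, sn2 = np.exp(log_theta)
    C = se_kernel(x, x, ell, sf2) + sn2 * np.eye(len(x))
    L = np.linalg.cholesky(C + 1e-9 * np.eye(len(x)))
    r = np.linalg.solve(L, y)
    return 0.5 * r @ r + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

# Toy data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 30))
y = np.sin(2 * x) + 0.2 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(x, y), method="L-BFGS-B")
ell, sf2, sn2 = np.exp(res.x)
print(f"length-scale={ell:.3f}, signal variance={sf2:.3f}, noise variance={sn2:.3f}")
```

This sketch relies on L-BFGS-B with finite-difference gradients; a production implementation would supply the analytic gradient of the log marginal likelihood.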
The marginal likelihood balances data fit and model complexity:
- Simple models with fewer hyperparameters may not fit the data well but are preferred if they explain the data sufficiently.
- Complex models may overfit the data but are penalized in the marginal likelihood due to increased complexity.
![[Pasted image 20241124175426.png]]
The posterior predictive mean function is plotted for 3 different length-scales (the blue curve corresponds to the length-scale that maximizes the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length-scale, but the marginal likelihood does not favour this!
Bayes' rule helps identify the right model complexity by leveraging the marginal likelihood, which balances goodness-of-fit with model simplicity. Overly simple models (highly peaked marginal likelihood) fail to capture data variability, while overly complex models (broad marginal likelihood) risk overfitting. The optimal model, guided by Occam's Razor, maximizes the marginal likelihood by being complex enough to explain the data but simple enough to generalize well, inherently penalizing unnecessary complexity. This balance ensures a principled trade-off between model flexibility and parsimony.
![[occam.png]]
An illustrative analogous example:
Recall: The formula for the log-likelihood of $N$ i.i.d. observations under a Gaussian $\mathcal{N}(\mu, \sigma^2)$ is: $$ \log p(y \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 - \frac{N}{2} \log \sigma^2 - \frac{N}{2} \log 2\pi. $$
This example demonstrates how fitting the variance $\sigma^2$ already involves the same trade-off as the GP marginal likelihood: the quadratic term rewards fitting the data, while the $\log \sigma^2$ term penalizes an overly spread-out (complex) model.
![[variance-estim.png]]
Consider a linear model with Gaussian priors on the weights: $$ f(x) = \sum_{m=1}^M w_m \phi_m(x), \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \Sigma_{\mathbf{w}}), $$
that is, $f$ is a linear combination of $M$ fixed basis functions $\phi_m(x)$ with zero-mean Gaussian weights. This induces a GP with:
- Mean Function: $$ m(x) = \mathbb{E}_{\mathbf{w}}(f(x)) = \int \left( \sum_{m=1}^M w_m \phi_m(x) \right) p(\mathbf{w})\, d\mathbf{w} = \sum_{m=1}^M \phi_m(x) \int w_m p(w_m)\, dw_m = 0 $$
- Covariance Function: $$ k(x_i, x_j) = \text{Cov}_{\mathbf{w}}(f(x_i), f(x_j)) = \mathbb{E}_{\mathbf{w}}(f(x_i)f(x_j)) = \int \cdots \int \left( \sum_{k=1}^M \sum_{l=1}^M w_k w_l \phi_k(x_i) \phi_l(x_j) \right) p(\mathbf{w})\, d\mathbf{w}, $$
where the integration is over all $M$ weight components.
This simplifies to: $$ k(x_i, x_j) = \sum_{k=1}^M \sum_{l=1}^M \phi_k(x_i)\, \phi_l(x_j)\, \mathbb{E}_{\mathbf{w}}(w_k w_l) = \sum_{k=1}^M \sum_{l=1}^M \phi_k(x_i)\, [\Sigma_{\mathbf{w}}]_{kl}\, \phi_l(x_j). $$
Finally, this can be written compactly as: $$ k(x_i, x_j) = \boldsymbol{\phi}(x_i)^\top \Sigma_{\mathbf{w}}\, \boldsymbol{\phi}(x_j). $$
If $\Sigma_{\mathbf{w}} = \sigma_w^2 I$, the covariance reduces to $k(x_i, x_j) = \sigma_w^2\, \boldsymbol{\phi}(x_i)^\top \boldsymbol{\phi}(x_j)$.
The inner product of the feature vectors $\boldsymbol{\phi}(x)$ therefore plays the role of the covariance function: every finite linear model with Gaussian weight priors is a GP.
Conversely, any GP with a covariance function of the form $k(x, x') = \boldsymbol{\phi}(x)^\top \Sigma_{\mathbf{w}}\, \boldsymbol{\phi}(x')$ corresponds to such a linear-in-the-parameters model (possibly with infinitely many basis functions).
- Mercer's Theorem: Some covariance functions correspond to infinite-dimensional feature spaces.
- Gaussian Processes: Complexity is $O(N^3)$ due to inversion of the $N \times N$ covariance matrix. Feasible for small to medium-sized datasets.
- Linear Models: Complexity is $O(N M^2)$, where $M$ is the number of basis functions. Can be more efficient when $M$ is small.
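A minimal numerical check of this correspondence, assuming an arbitrary set of Gaussian-bump basis functions: the weight-space (Bayesian linear regression) and function-space (GP with $k(x, x') = \boldsymbol{\phi}(x)^\top \Sigma_{\mathbf{w}} \boldsymbol{\phi}(x')$) predictive means coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature map: M Gaussian bumps (an arbitrary choice of basis functions).
centers = np.linspace(-3, 3, 7)
def phi(x):
    return np.exp(-0.5 * (np.asarray(x, float)[:, None] - centers[None, :]) ** 2)

Sigma_w = np.eye(len(centers))        # prior covariance of the weights
sn2 = 0.1**2                          # noise variance

x = np.sort(rng.uniform(-3, 3, 20))
y = np.sin(x) + 0.1 * rng.standard_normal(20)
xs = np.linspace(-4, 4, 5)

# Weight space: Gaussian posterior over w, then predict.
Phi, Phis = phi(x), phi(xs)
A = Phi.T @ Phi / sn2 + np.linalg.inv(Sigma_w)
w_mean = np.linalg.solve(A, Phi.T @ y / sn2)
mean_weight_space = Phis @ w_mean

# Function space: GP with k(x, x') = phi(x)^T Sigma_w phi(x').
K = Phi @ Sigma_w @ Phi.T
Ks = Phi @ Sigma_w @ Phis.T
mean_function_space = Ks.T @ np.linalg.solve(K + sn2 * np.eye(len(x)), y)

print(np.allclose(mean_weight_space, mean_function_space))   # True
```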
- Covariance Functions and Hyperparameters
- Covariance functions define the structure of relationships in Gaussian Processes (GPs).
- Hyperparameters control the behavior of covariance functions and are set using marginal likelihood.
- Choosing the right covariance function and hyperparameters can aid in model selection and data interpretation.
- Common Covariance Functions
- Stationary Covariance Functions: Squared exponential, rational quadratic, and Matérn.
- Special Cases: Radial Basis Function (RBF) networks, splines, large neural networks.
- Covariance functions can be combined into more complex forms for better flexibility.
- Hierarchical Model and ARD
- Hyperparameters of the covariance function are critical for model selection.
- Automatic Relevance Determination (ARD) is useful for feature selection. For instance:
$$
k(x, x') = v_0^2 \exp\left(-\sum_{d=1}^D \frac{(x_d - x'_d)^2}{2v_d^2}\right),
$$
where the hyperparameters are $\theta = (v_0, v_1, \dots, v_D, \sigma_n^2)$ (a minimal kernel sketch follows the figure below).
- Interpretation
  - Hyperparameters $v_d$ scale the importance of input dimension $d$.
  - ARD enables automatic selection of relevant features in the data.
![[gp-2d.png]]
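A minimal sketch of the ARD squared exponential kernel defined above (the array shapes and example values are my own):

```python
import numpy as np

def ard_se_kernel(X1, X2, v0, v):
    """k(x, x') = v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2)).

    X1: (N, D), X2: (M, D), v: per-dimension length-scales of shape (D,).
    """
    diff = X1[:, None, :] - X2[None, :, :]                  # (N, M, D)
    return v0**2 * np.exp(-0.5 * np.sum((diff / v) ** 2, axis=-1))

# A large v_d makes dimension d nearly irrelevant: the kernel barely changes
# when only that coordinate varies.
X = np.array([[0.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
K = ard_se_kernel(X, X, v0=1.0, v=np.array([1.0, 100.0]))
print(np.round(K, 3))   # points differing only in dimension 2 stay highly correlated
```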
- Definition
  - The rational quadratic (RQ) covariance function:
$$
k_{RQ}(r) = \left(1 + \frac{r^2}{2\alpha \ell^2}\right)^{-\alpha},
$$
  where $\alpha > 0$ and $\ell$ is the characteristic length-scale.
- Interpretation
  - RQ can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with varying length-scales.
  - In the limit $\alpha \to \infty$, the RQ covariance function becomes the SE covariance function: $$ k_{SE}(r) = \exp\left(-\frac{r^2}{2\ell^2}\right). $$
![[gp-covariance.png]]
- Definition
  - The Matérn covariance function is given by:
$$
k_{\nu}(x, x') = \frac{1}{\Gamma(\nu) 2^{\nu-1}} \left( \sqrt{2\nu} \frac{|x - x'|}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{|x - x'|}{\ell} \right),
$$
  where $K_\nu$ is the modified Bessel function of the second kind and $\ell$ is the characteristic length-scale.
- Special Cases
  - $\nu = \frac{1}{2}$: Exponential covariance function (Ornstein-Uhlenbeck process). $$ k(r) = \exp\left(-\frac{r}{\ell}\right). $$
  - $\nu = \frac{3}{2}$: Once-differentiable functions. $$ k(r) = \left(1 + \sqrt{3} \frac{r}{\ell}\right) \exp\left(-\sqrt{3} \frac{r}{\ell}\right). $$
  - $\nu = \frac{5}{2}$: Twice-differentiable functions. $$ k(r) = \left(1 + \sqrt{5} \frac{r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\sqrt{5} \frac{r}{\ell}\right). $$
  - $\nu \to \infty$: Equivalent to the SE covariance function.
- Intuition
  - The hyperparameter $\nu$ controls the smoothness of the sampled functions. Larger $\nu$ implies smoother functions.
![[gp-cov.png]]
- Definition
  - Periodic covariance functions model periodic data:
$$
k_{periodic}(x, x') = \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right),
$$
  where $p$ is the period and $\ell$ is the characteristic length-scale.
- Intuition
  - By transforming the inputs into $u = (\sin(x), \cos(x))^\top$, the covariance measures periodic distances in this transformed space.
![[gp-covar.png]]
Three functions drawn at random; the left and right panels use different hyperparameter settings.
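A minimal sketch implementing the stationary covariance functions above (RQ, Matérn with $\nu = 3/2$ and $\nu = 5/2$, and periodic) as plain NumPy functions of the distance $r = |x - x'|$ or of the inputs themselves:

```python
import numpy as np

def rational_quadratic(r, ell=1.0, alpha=1.0):
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def matern32(r, ell=1.0):
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)

def matern52(r, ell=1.0):
    a = np.sqrt(5.0) * r / ell
    return (1.0 + a + 5.0 * r**2 / (3.0 * ell**2)) * np.exp(-a)

def periodic(x, x_prime, p=1.0, ell=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - x_prime) / p) ** 2 / ell**2)

r = np.linspace(0, 3, 5)
print(rational_quadratic(r), matern32(r), matern52(r), sep="\n")
print(periodic(np.array([0.0, 0.25, 0.5]), 0.0, p=1.0))
```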
- Cubic Splines
- The solution to the minimization problem: $$ \sum_{i=1}^n (f(x^{(i)}) - y^{(i)})^2 + \lambda \int (f''(x))^2 dx $$ is the natural cubic spline.
- GP Interpretation
- The same function can be derived as the posterior mean of a GP with a specific covariance function: $$ k(x, x') = \sigma^2 + xx'\sigma^2 + \lambda \int_0^1 \min(x, x')^3 dx. $$
![[gp-vis.png]]
- Large Neural Networks
- As the number of hidden units in a neural network grows, the output becomes equivalent to a GP with a specific covariance function: $$ k(x, x') = \sigma^2 \arcsin\left(\frac{2x^\top \Sigma x'}{\sqrt{(1 + x^\top \Sigma x)(1 + x'^\top \Sigma x')}}\right). $$
- Intuition
- The prior distribution over neural network weights induces a prior over functions, which resembles a GP.
Covariance functions have to be positive definite.
- Combining Covariance Functions
  - Covariance functions can be combined to form new ones:
    - Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$
    - Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
    - Other: $k(x, x') = g(x)\, k_1(x, x')\, g(x')$ for any function $g$
- Applications
  - Composite covariance functions allow for greater modeling flexibility, tailoring the GP to specific data structures (a minimal sketch follows below).
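A minimal sketch of these combinations, assuming simple SE and periodic components (the particular length-scales and the linear scaling function $g$ are arbitrary illustrative choices):

```python
import numpy as np

def se(x, xp, ell=1.0):
    return np.exp(-0.5 * ((x - xp) / ell) ** 2)

def periodic(x, xp, p=1.0, ell=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / p) ** 2 / ell**2)

# Sum: independent additive effects, e.g. a smooth trend plus a periodic part.
k_sum = lambda x, xp: se(x, xp, ell=5.0) + periodic(x, xp, p=1.0)

# Product: a periodic pattern whose shape can slowly drift ("locally periodic").
k_prod = lambda x, xp: se(x, xp, ell=5.0) * periodic(x, xp, p=1.0)

# Scaling by a function g: k(x, x') = g(x) k1(x, x') g(x'), here a linear amplitude.
g = lambda x: 1.0 + 0.5 * x
k_scaled = lambda x, xp: g(x) * se(x, xp) * g(xp)

x = np.linspace(0, 4, 5)
K = np.array([[k_prod(a, b) for b in x] for a in x])   # still a valid covariance matrix
print(np.round(np.linalg.eigvalsh(K), 6))              # eigenvalues are non-negative
```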
- Finite vs. Infinite Models
- A central question in modeling is whether finite or infinite models should be preferred.
- Finite Models: Involve fixed parameters and limited basis functions. These make much stronger assumptions about the data and can lack flexibility.
- Infinite Models (Gaussian Processes): Allow a theoretically infinite number of basis functions, offering more flexibility. Gaussian Processes (GPs) serve as a formalism to define such infinite models.
- Gaussian Processes as Infinite Models
  - GPs represent a fancy yet practical way to implement infinite models. But the key question is:
- Do infinite models make a difference in practice?
- Yes, because they avoid overfitting and ensure generalization by accounting for all possible functions consistent with the data.
- GPs represent a fancy yet practical way to implement infinite models. But, the key question is:
- Illustrative Example
- A GP with a squared exponential covariance function corresponds to an infinite linear model with Gaussian basis functions placed everywhere in the input space, not just at training points. This results in smoother, more realistic models. ![[gp-finite.png]]
- Finite Linear Models with Localized Basis Functions
- Example: A model with only five basis functions is constrained to represent limited patterns.
- Visualization:
- Finite models show high variance and poor uncertainty estimation in regions without training data.
- As more data is added, the performance improves, but the limited number of basis functions prevents robust generalization.
- Gaussian Processes with Infinite Basis Functions
  - In contrast, a GP:
    - Uses infinitely many basis functions.
    - Ensures smooth predictions and uncertainty estimates across the input space.
  - Key Difference: GPs generalize even in regions far from training points by leveraging the covariance function.
- Infinite Basis Expansion
  - The GP framework arises naturally by considering a sum of Gaussian basis functions: $$ f(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=-N/2}^{N/2} \gamma_n \exp\left(-\left(x - \frac{n}{\sqrt{N}}\right)^2\right), $$ where $\gamma_n \sim \mathcal{N}(0, 1)$.
  - Interpretation: As $N \to \infty$, this sum transitions from a finite representation to a continuous integral: $$ f(x) = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2)\, du, $$ with $\gamma(u) \sim \mathcal{N}(0, 1)$.
- GP Foundations
  - Mean Function: $$ m(x) = \mathbb{E}[f(x)] = \int_{-\infty}^{\infty} \exp\left(-(x - u)^2\right) \mathbb{E}[\gamma(u)]\, du = 0, $$ assuming zero-mean priors for $\gamma(u)$.
  - Covariance Function: $$ k(x, x') = \mathbb{E}[f(x) f(x')] = \int_{-\infty}^{\infty} \exp\left(-(x - u)^2 - (x' - u)^2\right) du \;\propto\; \exp\left(-\frac{(x - x')^2}{2}\right), $$ using the independence of the $\gamma(u)$ values.
  - **Key Insight**: The squared exponential covariance function encapsulates an infinite number of Gaussian-shaped basis functions.
- Practical Implication: The GP enables regression over the entire input space, avoiding the overfitting often seen in finite models (a numerical check of the SE limit follows below).
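A minimal numerical check of this limit (the grid ranges and resolutions are my choices): the normalized covariance implied by densely spaced Gaussian bumps matches the SE form $\exp(-(x - x')^2 / 2)$:

```python
import numpy as np

# Dense grid of basis-function centers u covering the region of interest.
u = np.linspace(-15, 15, 5001)
du = u[1] - u[0]

def implied_covariance(x, xp):
    # Cov(f(x), f(x')) is proportional to the Riemann sum of
    # exp(-(x - u)^2) * exp(-(x' - u)^2) over the centers u.
    return np.sum(np.exp(-(x - u) ** 2) * np.exp(-(xp - u) ** 2)) * du

x0 = 0.0
deltas = np.linspace(0.0, 3.0, 7)
c0 = implied_covariance(x0, x0)
empirical = np.array([implied_covariance(x0, x0 + d) for d in deltas]) / c0
analytic = np.exp(-deltas**2 / 2.0)            # SE covariance with unit length-scale
print(np.max(np.abs(empirical - analytic)))    # differences at round-off level
```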
- When to Choose GPs:
- When uncertainty matters (e.g., in scientific predictions or safety-critical systems).
- When flexibility is essential due to limited training data.
- Limitations of GPs:
- Computational cost grows cubically with the number of data points, making scalability a challenge.
- Solutions: Sparse approximations or variational inference.