**Commit 6b98459** ("nw") — antoniofrancaib, Nov 21, 2024 — `4F13/gaussian-processes.md` (240 additions, 36 deletions)
For GPs with Gaussian noise, this integral can be computed analytically:
$$
\log p(y \mid x) = -\frac{1}{2} y^T (K + \sigma_n^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{N}{2} \log 2 \pi
$$
where the matrix $K$ represents the **kernel (or covariance) matrix**.

**Interpretation**:
- The first term measures how well the model fits the data (data fit).
The marginal likelihood balances data fit and model complexity:
- Simple models with fewer hyperparameters may not fit the data well but are preferred if they explain the data sufficiently.
- Complex models may overfit the data but are penalized in the marginal likelihood due to increased complexity.
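
As a concrete illustration of the expression above, the log marginal likelihood can be evaluated in a few lines. This is a minimal sketch, assuming a squared-exponential kernel and synthetic data; the function names, hyperparameter values, and data are illustrative and not part of the original notes.

```python
import numpy as np

def se_kernel(x1, x2, signal_var=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sqdist / lengthscale**2)

def log_marginal_likelihood(x, y, noise_var=0.1):
    """log p(y | x) for a zero-mean GP with Gaussian noise, as in the formula above."""
    N = len(x)
    Ky = se_kernel(x, x) + noise_var * np.eye(N)    # K + sigma_n^2 I
    L = np.linalg.cholesky(Ky)                      # stable alternative to a direct inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                     # -1/2 y^T (K + sigma_n^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))        # -1/2 log |K + sigma_n^2 I|
    constant = -0.5 * N * np.log(2 * np.pi)
    return data_fit + complexity + constant

x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(20)
print(log_marginal_likelihood(x, y))
```

Hyperparameters (length-scale, signal and noise variances) would then be chosen by maximizing this quantity.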


## 12. Correspondence Between Linear Models and Gaussian Processes
### From Linear Models to GPs
Consider a linear model with Gaussian priors:

$$
f(x) = \sum_{m=1}^M w_m \phi_m(x) = \mathbf{w}^\top \boldsymbol{\phi}(x),
$$

where the weights have a Gaussian prior, $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{0}, \mathbf{A}).$

- **Mean Function**:
$$m(x) = \mathbb{E}_{\mathbf{w}}(f(x)) = \int \left( \sum_{m=1}^M w_m \phi_m(x) \right) p(\mathbf{w}) d\mathbf{w} = \sum_{m=1}^M \phi_m(x) \int w_m p(w_m) dw_m = 0$$

- **Covariance Function**:

$$
k(x_i, x_j) = \text{Cov}_{\mathbf{w}}(f(x_i), f(x_j)) = \mathbb{E}_{\mathbf{w}}(f(x_i)f(x_j)) = \int \cdots \int \left( \sum_{k=1}^M \sum_{l=1}^M w_k w_l \phi_k(x_i) \phi_l(x_j) \right) p(\mathbf{w}) d\mathbf{w}.
$$

where the integration symbol $\int \cdots \int$ is shorthand for integrating over all $M$ dimensions of $\mathbf{w}$.

This simplifies to:

$$
k(x_i, x_j) = \sum_{k=1}^M \sum_{l=1}^M \phi_k(x_i) \phi_l(x_j) \int \int w_k w_l p(w_k, w_l) dw_k dw_l = \sum_{k=1}^M \sum_{l=1}^M A_{kl} \phi_k(x_i) \phi_l(x_j).
$$

Finally, this can be written compactly as:

$$
k(x_i, x_j) = \boldsymbol{\phi}(x_i)^\top \mathbf{A} \boldsymbol{\phi}(x_j).
$$

#### Special Case:
If $\mathbf{A} = \sigma_w^2 \mathbf{I}$, then:

$$
k(x_i, x_j) = \sigma_w^2 \sum_{k=1}^M \phi_k(x_i) \phi_k(x_j) = \sigma_w^2 \boldsymbol{\phi}(x_i)^\top \boldsymbol{\phi}(x_j).
$$

The inner product $\boldsymbol{\phi}(x_i)^\top \boldsymbol{\phi}(x_j)$ measures the **similarity** between the feature vectors $\boldsymbol{\phi}(x_i)$ and $\boldsymbol{\phi}(x_j)$. If the two inputs $x_i$ and $x_j$ are very similar, their feature vectors will also be similar, resulting in a large inner product and hence a high covariance, and vice versa.

This shows that the linear model with Gaussian priors corresponds to a GP with covariance function $k(x, x')$.
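
This correspondence is easy to verify numerically: draw many weight vectors from the prior, evaluate $f$ at a few inputs, and compare the sample covariance with $\boldsymbol{\phi}(x_i)^\top \mathbf{A} \boldsymbol{\phi}(x_j)$. A rough sketch, where the Gaussian basis functions, centres, and $\mathbf{A}$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 10, 100_000                        # number of basis functions, Monte Carlo samples
centres = np.linspace(-3, 3, M)

def phi(x):
    """Feature matrix of M Gaussian basis functions evaluated at the inputs x."""
    return np.exp(-0.5 * (np.asarray(x)[:, None] - centres[None, :]) ** 2)

A = np.diag(rng.uniform(0.5, 2.0, M))     # prior weight covariance (diagonal here)
x = np.array([-1.0, 0.0, 2.0])
Phi = phi(x)                              # shape (3, M)

K_exact = Phi @ A @ Phi.T                 # phi(x_i)^T A phi(x_j) for all pairs

W = rng.multivariate_normal(np.zeros(M), A, size=S)   # w ~ N(0, A)
F = W @ Phi.T                                         # f(x) = w^T phi(x) for each sample
K_mc = F.T @ F / S                                    # sample second moment of f

print(np.max(np.abs(K_exact - K_mc)))     # small, and shrinks as S grows
```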

### From GPs to Linear Models
Conversely, any GP with covariance function $k(x, x') = \phi(x)^T A \phi(x')$ can be represented as a linear model with basis functions $\phi(x)$ and weight covariance $A$.
### Computational Considerations
- **Gaussian Processes**: Complexity is $O(N^3)$ due to inversion of the $N \times N$ covariance matrix. Feasible for small to medium-sized datasets.
- **Linear Models**: Complexity is $O(N M^2)$, where $M$ is the number of basis functions. Can be more efficient when $M$ is small.
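
The two costs come from which linear system is solved: an $N \times N$ system in the GP (function-space) view versus an $M \times M$ system in the linear-model (weight-space) view. The sketch below computes the same predictive mean both ways, assuming $\mathbf{A} = \mathbf{I}$ and Gaussian basis functions; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 20
centres = np.linspace(-5, 5, M)
phi = lambda t: np.exp(-0.5 * (t[:, None] - centres[None, :]) ** 2)   # (n, M) features

x = rng.uniform(-5, 5, N)
xs = np.linspace(-5, 5, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(N)
Phi, Phis, noise = phi(x), phi(xs), 0.01

# Function-space (GP) view: solve an N x N system  -> O(N^3)
K = Phi @ Phi.T                                   # k(x, x') = phi(x)^T phi(x') with A = I
mean_gp = Phis @ Phi.T @ np.linalg.solve(K + noise * np.eye(N), y)

# Weight-space (linear-model) view: solve an M x M system -> O(N M^2)
mean_lin = Phis @ np.linalg.solve(Phi.T @ Phi + noise * np.eye(M), Phi.T @ y)

print(np.max(np.abs(mean_gp - mean_lin)))         # identical up to numerical error
```

Both routes give the same predictive mean; which is cheaper depends on whether $N$ or $M$ is smaller.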

---

## 13. Covariance Functions

### Key Concepts

1. **Covariance Functions and Hyperparameters**
- Covariance functions define the structure of relationships in Gaussian Processes (GPs).
- Hyperparameters control the behavior of covariance functions and are typically set by maximizing the marginal likelihood.
- Choosing the right covariance function and hyperparameters can aid in model selection and data interpretation.

2. **Common Covariance Functions**
- **Stationary Covariance Functions** (depending only on $r = |x - x'|$): squared exponential, rational quadratic, and Matérn.
- **Connections**: Radial Basis Function (RBF) networks, splines, and large neural networks correspond to GPs with particular covariance functions.
- Covariance functions can be combined into more complex forms for better flexibility.

---

### Model Selection and Hyperparameters

1. **Hierarchical Model and ARD**
- Hyperparameters of the covariance function are critical for model selection.
- Automatic Relevance Determination (ARD) is useful for feature selection. For instance:
$$
k(x, x') = v_0^2 \exp\left(-\sum_{d=1}^D \frac{(x_d - x'_d)^2}{2v_d^2}\right),
$$
where hyperparameters $\theta = (v_0, v_1, \dots, v_D, \sigma_n^2)$.

2. **Interpretation**
- Hyperparameters $v_d$ scale the importance of each input dimension $d$.
- ARD enables automatic selection of relevant features in the data.

![[Pasted image 20241121124727.png]]
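
A minimal sketch of the ARD covariance defined above: each dimension $d$ gets its own length-scale $v_d$, and a very large $v_d$ effectively removes that dimension from the model. The data and hyperparameter values below are illustrative.

```python
import numpy as np

def ard_se_kernel(X1, X2, v0=1.0, lengthscales=(1.0,)):
    """ARD covariance: v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2))."""
    ls = np.asarray(lengthscales)
    diff = X1[:, None, :] - X2[None, :, :]                  # (n1, n2, D)
    return v0**2 * np.exp(-0.5 * np.sum((diff / ls) ** 2, axis=-1))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                             # 5 points in D = 3 dimensions

# Give dimension 2 a huge length-scale: it then barely influences the covariance.
K_full = ard_se_kernel(X, X, lengthscales=[0.5, 1.0, 1e6])
K_drop = ard_se_kernel(X[:, :2], X[:, :2], lengthscales=[0.5, 1.0])
print(np.max(np.abs(K_full - K_drop)))                      # ~0: dimension 2 is pruned away
```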

---

### Rational Quadratic Covariance Function

1. **Definition**
- The rational quadratic (RQ) covariance function:
$$
k_{RQ}(r) = \left(1 + \frac{r^2}{2\alpha \ell^2}\right)^{-\alpha},
$$
where $\alpha > 0$ and $\ell$ is the characteristic length-scale.

2. **Interpretation**
- RQ can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with varying length-scales.
- In the limit $\alpha \to \infty$, the RQ covariance function becomes the SE covariance function:
$$
k_{SE}(r) = \exp\left(-\frac{r^2}{2\ell^2}\right).
$$
![[Pasted image 20241121124807.png]]
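
The limit statement can be checked numerically: as $\alpha$ grows, the RQ values approach the SE values at the same length-scale. A small sketch with an arbitrary grid of distances:

```python
import numpy as np

def k_rq(r, alpha, ell=1.0):
    """Rational quadratic covariance."""
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def k_se(r, ell=1.0):
    """Squared exponential covariance."""
    return np.exp(-(r**2) / (2.0 * ell**2))

r = np.linspace(0.0, 3.0, 7)
for alpha in (1, 10, 1000):
    # The gap to the SE covariance shrinks as alpha grows.
    print(alpha, np.max(np.abs(k_rq(r, alpha) - k_se(r))))
```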

### Matérn Covariance Functions

1. **Definition**
- The Matérn covariance function is given by:
$$
k_{\nu}(x, x') = \frac{1}{\Gamma(\nu) 2^{\nu-1}} \left( \sqrt{2\nu} \frac{\|x - x'\|}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{\|x - x'\|}{\ell} \right),
$$
where $K_\nu$ is the modified Bessel function of the second kind, $\ell$ is the characteristic length-scale, and $r = \|x - x'\|$.

2. **Special Cases**
- $\nu = \frac{1}{2}$: Exponential covariance function (Ornstein-Uhlenbeck process).
$$
k(r) = \exp\left(-\frac{r}{\ell}\right).
$$
- $\nu = \frac{3}{2}$: Once-differentiable function.
$$
k(r) = \left(1 + \sqrt{3} \frac{r}{\ell}\right) \exp\left(-\sqrt{3} \frac{r}{\ell}\right).
$$
- $\nu = \frac{5}{2}$: Twice-differentiable function.
$$
k(r) = \left(1 + \sqrt{5} \frac{r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\sqrt{5} \frac{r}{\ell}\right).
$$
- $\nu \to \infty$: Equivalent to the SE covariance function.

3. **Intuition**
- The hyperparameter $\nu$ controls the smoothness of the sampled functions. Larger $\nu$ implies smoother functions.

![[Pasted image 20241121124847.png]]
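
The closed-form special cases can be checked against the general Bessel-function expression, for example using `scipy.special.kv`. A sketch for $\nu = 3/2$ (the distance grid is arbitrary):

```python
import numpy as np
from scipy.special import gamma, kv   # Gamma function, modified Bessel function of the 2nd kind

def matern_general(r, nu, ell=1.0):
    """General Matérn covariance; r = 0 is handled separately since K_nu(0) diverges."""
    r = np.asarray(r, dtype=float)
    s = np.sqrt(2.0 * nu) * r / ell
    out = np.ones_like(r)                     # k(0) = 1
    nz = s > 0
    out[nz] = (2.0 ** (1.0 - nu) / gamma(nu)) * s[nz] ** nu * kv(nu, s[nz])
    return out

def matern_32(r, ell=1.0):
    """Closed form for nu = 3/2."""
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

r = np.linspace(0.0, 3.0, 10)
print(np.max(np.abs(matern_general(r, 1.5) - matern_32(r))))   # ~0 up to floating-point error
```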

---

### Periodic Covariance Functions

1. **Definition**
- Periodic covariance functions model periodic data:
$$
k_{periodic}(x, x') = \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right),
$$
where $p$ is the period and $\ell$ is the characteristic length-scale.

2. **Intuition**
- By transforming the inputs into $u = (\sin(x), \cos(x))^\top$, the covariance measures periodic distances in this transformed space.

![[Pasted image 20241121124912.png]]

Three functions drawn at random; left: length-scale $\ell > 1$, right: $\ell < 1$.
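
A small sketch of the periodic covariance in use: build the kernel matrix on a grid and draw functions from the corresponding zero-mean Gaussian, which repeat with period $p$. The period, length-scale, and grid are illustrative.

```python
import numpy as np

def k_periodic(x1, x2, period=1.0, ell=0.7):
    """Periodic covariance: exp(-2 sin^2(pi |x - x'| / p) / ell^2)."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 300)
K = k_periodic(x, x) + 1e-6 * np.eye(len(x))          # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# Each sampled function repeats: f(t) and f(t + period) coincide across the grid.
```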

### Splines and Gaussian Processes

1. **Cubic Splines**
- The solution to the minimization problem:
$$
\sum_{i=1}^n (f(x^{(i)}) - y^{(i)})^2 + \lambda \int (f''(x))^2 dx
$$
is the natural cubic spline.

2. **GP Interpretation**
- The same function can be derived as the posterior mean of a GP with a specific covariance function, one common form being
$$
k(x, x') = \sigma^2 + \sigma^2 x x' + \frac{|x - x'| \min(x, x')^2}{2} + \frac{\min(x, x')^3}{3},
$$
where the noise variance of the GP plays the role of the smoothing parameter $\lambda$.

![[Pasted image 20241121125134.png]]
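
A sketch of the GP view of splines, assuming the covariance written above (constant and linear terms with a large variance plus the cubic-spline term) and treating the observation noise as the smoothing parameter; all numerical choices are illustrative.

```python
import numpy as np

def k_spline(x1, x2, sigma2=100.0):
    """Constant + linear terms (large variance) plus the cubic-spline term from above."""
    X1, X2 = np.asarray(x1)[:, None], np.asarray(x2)[None, :]
    m = np.minimum(X1, X2)
    return sigma2 * (1.0 + X1 * X2) + np.abs(X1 - X2) * m**2 / 2.0 + m**3 / 3.0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 15))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(15)
xs = np.linspace(0.0, 1.0, 200)

noise = 0.01                                          # plays the role of the smoothing parameter
K = k_spline(x, x) + noise * np.eye(len(x))
mean = k_spline(xs, x) @ np.linalg.solve(K, y)        # posterior mean: a cubic-spline-like fit
```
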
### Neural Networks and GPs

1. **Large Neural Networks**
- As the number of hidden units in a neural network grows, the output becomes equivalent to a GP with a specific covariance function:
$$
k(x, x') = \frac{\sigma^2}{\pi} \arcsin\left(\frac{2 x^\top \Sigma x'}{\sqrt{(1 + 2 x^\top \Sigma x)(1 + 2 x'^\top \Sigma x')}}\right).
$$

2. **Intuition**
- The prior distribution over neural network weights induces a prior over functions, which resembles a GP.
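
A minimal sketch of this arcsine (neural-network) covariance, with the inputs augmented by a constant bias component; the choice of $\Sigma$ and the input grid are illustrative.

```python
import numpy as np

def k_nn(X1, X2, Sigma, sigma2=1.0):
    """Arcsine covariance of an infinitely wide one-hidden-layer network, as written above."""
    num = 2.0 * X1 @ Sigma @ X2.T
    d1 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X1, Sigma, X1)    # 1 + 2 x^T Sigma x
    d2 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X2, Sigma, X2)
    return sigma2 / np.pi * np.arcsin(num / np.sqrt(d1[:, None] * d2[None, :]))

x = np.linspace(-3.0, 3.0, 50)
X = np.column_stack([np.ones_like(x), x])        # augment inputs with a constant bias component
Sigma = np.diag([1.0, 2.0])                      # prior covariance of the input-to-hidden weights
K = k_nn(X, X, Sigma)
# Draws from N(0, K) behave like outputs of a very wide network with these weight priors.
```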

### Composite Covariance Functions

Covariance functions must be positive semi-definite; the combinations below preserve this property.

1. **Combining Covariance Functions**
- Covariance functions can be combined to form new ones:
- **Sum**: $k(x, x') = k_1(x, x') + k_2(x, x')$
- **Product**: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
- **Scaling**: $k(x, x') = g(x)\, k_1(x, x')\, g(x')$, where $g(x)$ is any function.

2. **Applications**
- Composite covariance functions allow for greater modeling flexibility, tailoring the GP to specific data structures.
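
A short sketch of kernel composition: sums, products, and scaling of valid covariance functions yield valid covariance functions. The component kernels and inputs below are arbitrary.

```python
import numpy as np

def k_se(x1, x2, ell=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

def k_per(x1, x2, period=1.0, ell=0.7):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell**2)

x = np.linspace(0.0, 4.0, 100)
g = 1.0 + x                                               # an arbitrary scaling function g(x)

k_sum = k_se(x, x, ell=2.0) + k_per(x, x)                 # sum: additive structure
k_prod = k_se(x, x, ell=2.0) * k_per(x, x)                # product: locally periodic behavior
k_scaled = g[:, None] * k_se(x, x) * g[None, :]           # g(x) k1(x, x') g(x')

# All three remain symmetric positive semi-definite (checked up to a small jitter):
for K in (k_sum, k_prod, k_scaled):
    print(np.min(np.linalg.eigvalsh(K + 1e-8 * np.eye(len(x)))) >= 0)
```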

---

## 14. EXTRA: Finite and Infinite Basis GPs

1. **Finite vs. Infinite Models**
- A central question in modeling is whether finite or infinite models should be preferred.
- **Finite Models**: Involve fixed parameters and limited basis functions. These make much stronger assumptions about the data and can lack flexibility.
- **Infinite Models (Gaussian Processes)**: Allow a theoretically infinite number of basis functions, offering more flexibility. Gaussian Processes (GPs) serve as a formalism to define such infinite models.

2. **Gaussian Processes as Infinite Models**
- GPs are an elegant and practical way to implement infinite models. The key question is:
- *Do infinite models make a difference in practice?*
- Yes, because they avoid overfitting and ensure generalization by accounting for all possible functions consistent with the data.

3. **Illustrative Example**
- A GP with a squared exponential covariance function corresponds to an infinite linear model with Gaussian basis functions **placed everywhere in the input space**, not just at training points. This results in smoother, more realistic models.
![[Pasted image 20241121122908.png]]
### Dangers of Finite Basis Functions

1. **Finite Linear Models with Localized Basis Functions**
- Example: A model with only **five basis functions** is constrained to represent limited patterns.
- **Visualization**:
- Finite models give unreliable, poorly calibrated uncertainty estimates in regions without training data.
- As more data is added, the performance improves, but the limited number of basis functions prevents robust generalization.

2. **Gaussian Processes with Infinite Basis Functions**
- In contrast, a GP:
- Uses infinitely many basis functions.
- Ensures smooth predictions and uncertainty estimates across the input space.
- **Key Difference**: GPs generalize even in regions far from training points by leveraging the covariance function.
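
The contrast can be made concrete by comparing predictive variances far from the data: with a handful of localized basis functions, the predictive variance of $f$ collapses wherever all basis functions are close to zero, whereas an SE-covariance GP reverts to its full prior variance. A rough sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 10)                 # training inputs
y = np.sin(x) + 0.1 * rng.standard_normal(10)
xs = np.array([0.0, 10.0])                     # one test point near the data, one far away
noise = 0.01

# Finite model: five Gaussian basis functions centred on [-2, 2], unit-variance weight prior.
centres = np.linspace(-2.0, 2.0, 5)
phi = lambda t: np.exp(-0.5 * (t[:, None] - centres[None, :]) ** 2)
A_post = np.linalg.inv(phi(x).T @ phi(x) / noise + np.eye(5))        # posterior weight covariance
var_finite = np.einsum('ij,jk,ik->i', phi(xs), A_post, phi(xs))      # predictive variance of f

# GP with an SE covariance: basis functions everywhere in the input space.
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
Kinv = np.linalg.inv(k(x, x) + noise * np.eye(len(x)))
var_gp = np.diag(k(xs, xs) - k(xs, x) @ Kinv @ k(x, xs))

print(var_finite)   # collapses towards 0 at x = 10: all five basis functions are ~0 there
print(var_gp)       # returns to the prior variance 1 at x = 10
```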

### From Infinite Linear Models to Gaussian Processes

1. **Infinite Basis Expansion**
The GP framework arises naturally by considering a sum of Gaussian basis functions:
$$
f(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=-N/2}^{N/2} \gamma_n \exp\left(-\left(x - \frac{n}{\sqrt{N}}\right)^2\right),
$$
where $\gamma_n \sim \mathcal{N}(0, 1)$.

- **Interpretation**: As $N \to \infty$, this sum transitions from a finite representation to a continuous integral:
$$
f(x) = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2) \, du,
$$
with $\gamma(u) \sim \mathcal{N}(0, 1)$.


2. **GP Foundations**
- **Mean Function**:

$$
\mu(x) = \mathbb{E}[f(x)] = \int_{-\infty}^\infty \exp(-(x - u)^2) \int_{-\infty}^\infty \gamma(u)p(\gamma(u)) d\gamma(u) \, du = 0,
$$

assuming zero-mean priors for $\gamma(u)$.

- **Covariance Function**:

$$
\mathbb{E}[f(x)f(x')] = \int_{-\infty}^\infty \exp\left(-(x - u)^2 - (x' - u)^2\right) du
$$

$$
= \int \exp\left(-2\left(u - \frac{x + x'}{2}\right)^2 + \frac{(x + x')^2}{2} - x^2 - x'^2\right) du \propto \exp\left(-\frac{(x - x')^2}{2}\right).
$$

- **Key Insight**: The squared exponential covariance function encapsulates an infinite number of Gaussian-shaped basis functions; a numerical check follows after this list.

3. **Practical Implication**
The GP enables regression over the entire input space, avoiding the overfitting often seen in finite models.
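
The claim above can be checked numerically: with many densely spaced Gaussian bumps and independent unit-variance weights, the implied covariance $\sum_n \phi_n(x)\phi_n(x')$ is proportional to the squared exponential. A sketch with an arbitrary grid of bump centres:

```python
import numpy as np

# Finite stand-in for the infinite sum: many Gaussian bumps with densely spaced centres.
N = 5000
centres = np.linspace(-10.0, 10.0, N)
x = np.array([-1.0, 0.0, 0.5, 2.0])

# With independent unit-variance weights gamma_n, E[f(x) f(x')] = sum_n phi_n(x) phi_n(x'),
# where phi_n(x) = exp(-(x - c_n)^2).  Compute that sum directly:
Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2))     # shape (4, N)
K_sum = Phi @ Phi.T

# Up to an overall constant, this matches the squared exponential exp(-(x - x')^2 / 2):
K_se = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(K_sum / K_sum[0, 0])
print(K_se / K_se[0, 0])                                  # the normalized matrices agree closely
```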

### Practical Takeaways

1. **When to Choose GPs**:
- When uncertainty matters (e.g., in scientific predictions or safety-critical systems).
- When flexibility is essential due to limited training data.
2. **Limitations of GPs**:
- Computational cost grows cubically with the number of data points, making scalability a challenge.
- Solutions: Sparse approximations or variational inference.

## Conclusion
Gaussian Processes offer a robust, flexible framework for modeling complex datasets without specifying a fixed number of parameters. By defining a prior directly over functions, GPs capture our beliefs about function properties such as smoothness and periodicity. The marginal likelihood provides a principled way to select hyperparameters and models, embodying Occam's Razor by balancing data fit and model complexity. Understanding the relationship between linear models and GPs, as well as the role of covariance functions, is crucial for effectively applying GPs to real-world problems.
