Commit: nw
antoniofrancaib committed Nov 9, 2024
1 parent 12c8dcb commit 08ec173
Showing 2 changed files with 28 additions and 47 deletions.
MLMI1/preliminaries/3-classification.md (18 changes: 0 additions & 18 deletions)
- **Cross-Validation**: Use validation sets to monitor performance and select model parameters.
- **Early Stopping**: Halt training when performance on validation data starts to degrade.


---

# k-Nearest Neighbours (kNN) Classification Algorithm

The **k-nearest neighbours (kNN)** algorithm is a simple, non-parametric method used for classification and regression tasks. It classifies a new data point based on the majority class among its $k$ nearest neighbours in the feature space. Given a new (test) point $\mathbf{x}^\ast$, the algorithm proceeds as follows:

1. **Find the $k$ nearest neighbours** of $\mathbf{x}^\ast$ in the training set using a chosen distance metric.

2. **Assign the class** to $\mathbf{x}^\ast$ based on the most frequent class among its $k$ nearest neighbours.

In case of a tie, the class can be chosen randomly among those with the highest frequency.
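
A minimal NumPy sketch of this procedure follows; the function name `knn_predict` and the toy data are illustrative rather than taken from the notes, distances are squared Euclidean, and ties are broken uniformly at random as described above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k=3, rng=None):
    """Classify x_star by majority vote among its k nearest training points."""
    rng = np.random.default_rng() if rng is None else rng
    # Squared Euclidean distance from x_star to every training point.
    dists = np.sum((X_train - x_star) ** 2, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Vote counts per class among the neighbours.
    votes = np.bincount(y_train[nearest])
    winners = np.flatnonzero(votes == votes.max())
    return rng.choice(winners)  # random tie-break among the most frequent classes

# Toy usage with two 2-D classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # prints 0
```

For large training sets, the brute-force distance computation is typically replaced by spatial index structures such as k-d trees or ball trees.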

### Effect of $k$ on Decision Boundaries

- **Small $k$** (e.g., $k=1$): Decision boundaries can be irregular and sensitive to noise, potentially leading to overfitting.
- **Large $k$**: Decision boundaries become smoother, which may lead to underfitting and loss of important local patterns.

MLMI1/preliminaries/4-clustering.md (57 changes: 28 additions & 29 deletions)
# Introduction to Clustering

Clustering $\rightarrow$ grouping a set of **unlabeled inputs** $\{\mathbf{x}_n\}_{n=1}^N$ of $D$-dimensional points into $K$ clusters based on some similarity measure, without prior knowledge of class labels; the cluster assignment of $\mathbf{x}_n$ is denoted $s_n$.

Clustering is an **unsupervised learning** task: only the input data $\{\mathbf{x}_n\}$ is provided, and the goal is to uncover hidden structure in the data without explicit output labels.

### Examples:

| Application | Data | Clusters |
| ----------------------- | ------------------------ | ---------------------- |
| Genetic analysis | Genetic markers | Ancestral groups |
| Medical analysis | Patient records and data | Disease subtypes |
| Image segmentation | Image pixel values | Distinct image regions |
| Social network analysis | Node connections | Social communities |
**Clustering Goal**: To find a function $f: \mathbb{R}^D \rightarrow \{1, 2, \dots, K\}$ that assigns each input $\mathbf{x}_n$ to one of $K$ clusters, grouping similar inputs together so that points in the same cluster are more similar to one another than to points in other clusters.

---
# The K-means Algorithm (deterministic approach)

Given a dataset $\{\mathbf{x}_n\}_{n=1}^N$ of two-dimensional real-valued data points $\mathbf{x}_n = [x_{1,n}, x_{2,n}]^\top$, we aim to cluster the points into $K$ clusters using the K-means algorithm. The algorithm assigns each datapoint to one of $K$ clusters with centers $\{\mathbf{m}_k\}_{k=1}^K$.

The K-means algorithm minimizes an energy function called the **within-cluster sum of squares** (WCSS), also referred to as the **inertia** or the **distortion**:
$$
\mathcal{C} = \sum_{n=1}^N \sum_{k=1}^K s_{n,k} \, \lVert \mathbf{x}_n - \mathbf{m}_k \rVert^2
$$
where $s_{n,k} = 1$ if $\mathbf{x}_n$ is assigned to cluster $k$ and $s_{n,k} = 0$ otherwise.

### Optimization Process
The optimization alternates between two steps:

1. **Assignment step**: Assign each point $\mathbf{x}_n$ to the cluster with the nearest center, i.e. set $s_{n,k} = 1$ for $k = \arg\min_j \lVert \mathbf{x}_n - \mathbf{m}_j \rVert^2$ and $s_{n,j} = 0$ otherwise.
2. **Update step**: Recompute each center $\mathbf{m}_k$ as the mean of the points currently assigned to cluster $k$.

These steps are repeated until the cluster assignments $\{s_{n,k}\}$ no longer change. A minimal NumPy sketch of this loop is given after the figure below.
- **Hard Assignments**: Each data point is assigned definitively to one cluster, disregarding the uncertainty or probability of belonging to other clusters.

![[Pasted image 20241107181435.png]]
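
A minimal NumPy sketch of the alternating scheme above; the function name `kmeans` is illustrative, and the centers are initialised from randomly chosen data points (one common choice, not prescribed by the notes).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: alternate assignment and mean-update steps
    until the cluster assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Initialise the centers m_k from K randomly chosen data points.
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assignment step: give each point to its nearest center.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                # assignments unchanged -> converged
        assign = new_assign
        # Update step: each center becomes the mean of its assigned points.
        for k in range(K):
            if np.any(assign == k):              # leave empty clusters untouched
                m[k] = X[assign == k].mean(axis=0)
    wcss = d2[np.arange(len(X)), assign].sum()   # within-cluster sum of squares
    return m, assign, wcss
```

Because the WCSS objective is non-convex, the result depends on the initial centers; in practice the algorithm is usually run from several random initialisations and the solution with the lowest cost is kept.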

---
# Mixture of Gaussians (probabilistic approach) and the Expectation Maximisation (EM) Algorithm

## Introduction
- **Mixture of Gaussians (MoG)**:
  The Mixture of Gaussians model is a probabilistic model that assumes data is generated from a mixture of $K$ Gaussian components, with density
$$ p(\mathbf{x} \mid \theta) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$
where:
- $\pi_k$ is the mixing proportion of cluster $k$, with $\pi_k \geq 0$ and $\sum_{k=1}^K \pi_k = 1$.
- $\boldsymbol{\mu}_k$ is the mean vector of cluster $k$.
- $\boldsymbol{\Sigma}_k$ is the covariance matrix of cluster $k$.
- $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the multivariate Gaussian distribution:
$$ \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
where $D$ is the dimensionality of the data.

- **Generative Process**:
  - To generate each data point, first sample a cluster assignment $s_n$ with $p(s_n = k) = \pi_k$, then draw $\mathbf{x}_n \sim \mathcal{N}(\boldsymbol{\mu}_{s_n}, \boldsymbol{\Sigma}_{s_n})$.
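
To make the generative process concrete, the short sketch below samples synthetic data from a MoG; the particular values of $\pi$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component, 2-D mixture.
pi = np.array([0.3, 0.7])                        # mixing proportions
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances

N = 500
s = rng.choice(len(pi), size=N, p=pi)            # sample latent assignments s_n
X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in s])  # sample x_n given s_n
```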
## Objective
- **Inference Goal**:
- Given the observed data $\{\mathbf{x}_n\}$, infer the latent cluster assignments $\{s_n\}$ and estimate the model parameters $\theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$.

- **Maximum Likelihood Estimation (MLE)**:
  - Aim to maximize the likelihood of the observed data:
$$ p(\mathbf{X} \mid \theta) = \prod_{n=1}^N \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$

The EM algorithm provides an iterative approach to find maximum likelihood estimates of the parameters in the presence of latent variables, where direct maximisation of the likelihood is intractable.
### Detailed Steps
#### 1. Define Free Energy ($\mathcal{F}$)
- **Free Energy** is a lower bound to the log-likelihood, and makes the optimization problem tractable when dealing with latent variables:
$$ \mathcal{F}(q(\mathbf{s}), \theta) = \log p(\mathbf{X} \mid \theta) - KL(q(\mathbf{s}) \parallel p(\mathbf{s} \mid \mathbf{X}, \theta)) \leq \log p(\mathbf{X} \mid \theta) $$
- **Non-negativity**: $KL(q \parallel p) \geq 0$, so $\mathcal{F}(q(\mathbf{s}), \theta) \leq \log p(\mathbf{X} \mid \theta)$ for any $q$.
- **Zero Condition**: $KL = 0$ if and only if $q(\mathbf{s}) = p(\mathbf{s} \mid \mathbf{X}, \theta)$ for all $\mathbf{s}$.
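
For reference, expanding the KL term gives an equivalent form of the free energy (a standard identity, spelled out here because it is the form used in the M step below):
$$ \mathcal{F}(q(\mathbf{s}), \theta) = \sum_{\mathbf{s}} q(\mathbf{s}) \log \frac{p(\mathbf{s}, \mathbf{X} \mid \theta)}{q(\mathbf{s})} = \sum_{\mathbf{s}} q(\mathbf{s}) \log p(\mathbf{s}, \mathbf{X} \mid \theta) - \sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s}) $$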

#### 2. Initialization

**Parameter Initialization**:
- **Importance**: Crucial for the convergence and quality of the final solution.
- **Strategies**:
  - Initialize the covariance matrices to the identity:
$$\boldsymbol{\Sigma}_k^{(0)} = \mathbf{I}, \quad \forall k$$
- Initialize $\boldsymbol{\mu}_k^{(0)}$ based on visual inspection or random selection.

#### 3. E Step (Expectation)

$\rightarrow$ Compute the posterior probabilities (responsibilities) that each data point belongs to each cluster.

- **Objective**: Maximize $\mathcal{F}$ with respect to $q(\mathbf{s})$ while keeping $\theta$ fixed.
- Since $\log p(\mathbf{X} \mid \theta)$ does not depend on $q(\mathbf{s})$, the free energy can be written as
$$ \mathcal{F} = - KL(q(\mathbf{s}) \parallel p(\mathbf{s} \mid \mathbf{X}, \theta)) + \text{constant} $$

- **Result**: Set $q(\mathbf{s})$ to the posterior distribution $p(\mathbf{s} \mid \mathbf{X}, \theta)$; the posterior represents our best guess about the hidden variables, given the data and the current parameters.

- **Calculating the Posterior Probability**:
$$ p(s_n = k \mid \mathbf{x}_n, \theta) = \frac{p(s_n = k \mid \theta)\, p(\mathbf{x}_n \mid s_n = k, \theta)}{p(\mathbf{x}_n \mid \theta)} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} $$
- Writing $u_{nk} = \pi_k \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ for the numerator, the responsibilities are
$$
q(s_n = k) = \frac{u_{nk}}{\sum_{j=1}^K u_{nj}}, \quad \forall n, k
$$
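
A vectorised sketch of this E step; the helper name `e_step` is illustrative, SciPy is assumed to be available for the Gaussian log-density, and the responsibilities are normalised in log space for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """Return the (N, K) matrix of responsibilities q(s_n = k)."""
    N, K = X.shape[0], len(pi)
    log_u = np.empty((N, K))
    for k in range(K):
        # log u_nk = log pi_k + log N(x_n; mu_k, Sigma_k)
        log_u[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
    # Normalise: q(s_n = k) = u_nk / sum_j u_nj, done in log space.
    return np.exp(log_u - logsumexp(log_u, axis=1, keepdims=True))
```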
#### 4. M Step (Maximisation)
$\rightarrow$ Update the model parameters $\theta$ using the responsibilities computed in the E step.
- **Objective**: Maximize $\mathcal{F}$ with respect to $\theta$ while keeping $q(\mathbf{s})$ fixed.
- **Maximisation of Free Energy**:
$$ \mathcal{F}(q(\mathbf{s}), \theta) = \sum_{\mathbf{s}} q(\mathbf{s}) \log p(\mathbf{s}, \mathbf{X} \mid \theta) - \sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s}) $$
- Since $q(\mathbf{s})$ is fixed, it suffices to maximize the expected complete-data log-likelihood:
$$\mathcal{Q}(\theta) = \sum_{\mathbf{s}} q(\mathbf{s}) \log p(\mathbf{s}, \mathbf{X} \mid \theta)$$
- For the MoG model this expands (up to an additive constant) to:
$$ \mathcal{Q}(\theta) = \sum_{n=1}^N \sum_{k=1}^K q(s_n = k) \left[ \log \pi_k - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right] $$

- **Parameter Updates (Taking Derivatives)**:

  **With respect to $\boldsymbol{\mu}_k$**:
$$ \frac{\partial \mathcal{Q}}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^N q(s_n = k)\, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0 $$
  Solving gives the responsibility-weighted mean:
$$ \boldsymbol{\mu}_k = \frac{\sum_{n=1}^N q(s_n = k)\, \mathbf{x}_n}{\sum_{n=1}^N q(s_n = k)} $$
  Analogous updates follow for $\pi_k$ and $\boldsymbol{\Sigma}_k$ (see the code sketch below).

#### 5. Convergence Check

The E and M steps are repeated until the improvement in free energy falls below a tolerance:
$$ |\mathcal{F}^{(m+1)} - \mathcal{F}^{(m)}| < \epsilon $$

Common practical issues:
- **Poor Initialization**: Can lead to suboptimal clustering results.
- **Singular Covariance Matrices**: Can occur if clusters collapse, requiring regularization.
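
Putting the pieces together, the sketch below pairs the E step with the standard MoG M-step updates (mixing proportions from average responsibilities, responsibility-weighted means and covariances); all names are illustrative, the small diagonal jitter guards against the singular-covariance issue above, and iteration stops once the log-likelihood improvement falls below a tolerance.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_joint(X, pi, mu, Sigma):
    """(N, K) matrix with entries log( pi_k * N(x_n; mu_k, Sigma_k) )."""
    return np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                     for k in range(len(pi))], axis=1)

def m_step(X, q):
    """Standard MoG M-step given responsibilities q of shape (N, K)."""
    N, D = X.shape
    Nk = q.sum(axis=0)                                # effective number of points per cluster
    pi = Nk / N                                       # mixing proportions
    mu = (q.T @ X) / Nk[:, None]                      # responsibility-weighted means
    Sigma = np.zeros((len(Nk), D, D))
    for k in range(len(Nk)):
        Xc = X - mu[k]
        Sigma[k] = (q[:, k, None] * Xc).T @ Xc / Nk[k]
        Sigma[k] += 1e-6 * np.eye(D)                  # jitter against singular covariances
    return pi, mu, Sigma

def em_mog(X, K, n_iters=200, tol=1e-6, seed=0):
    """Run EM until the log-likelihood improvement drops below tol."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                                      # uniform mixing proportions
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)    # random data points as means
    Sigma = np.repeat(np.eye(D)[None], K, axis=0)                 # identity covariances, as in the notes
    prev_ll = -np.inf
    for _ in range(n_iters):
        log_u = log_joint(X, pi, mu, Sigma)
        q = np.exp(log_u - logsumexp(log_u, axis=1, keepdims=True))   # E step
        pi, mu, Sigma = m_step(X, q)                                  # M step
        ll = logsumexp(log_u, axis=1).sum()          # log p(X | theta) before the M step
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, q
```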

## Conclusion
- **MoG vs. K-means**:
- **Flexibility**: MoG can model more complex cluster shapes and handle overlapping clusters.
- **Probabilistic Framework**: Provides a probabilistic interpretation of cluster assignments.
- **EM Algorithm**:
- **Powerful Tool**: Efficiently estimates parameters in the presence of latent variables.
- **Limitations**: Sensitive to initialization and may converge to local maxima.
- **Applications**:
- Widely used in pattern recognition, computer vision, and machine learning for clustering and density estimation tasks.
## Limitations

1. **Convergence to Local Optima**: EM is a **greedy algorithm** and only guarantees convergence to a local optimum, which might not be the global optimum.
2. **Slow Convergence**: The algorithm can converge slowly, especially if the likelihood surface is flat or has long, narrow peaks.
3. **Sensitive to Initialization**: Poor initialization can lead to convergence at an inferior local maximum or slow down the algorithm significantly.
4. **Requires Knowing the Number of Components**: If this number is incorrect, it can lead to poor results.
5. **Assumes Independence Among Latent Variables**: In its basic form, EM assumes that latent variables are independent, which may not be realistic for all datasets.


