Decomposition

Purpose

The purpose of decomposition (dimensionality reduction, feature extraction) is to find the "best" principal subspace in which the data can be expressed with far fewer dimensions while preserving as much of the information as possible.

Notebook

Example Notebook: Kerasy.examples.decomposition.ipynb

PCA

class kerasy.ML.decomposition.PCA(n_components=None)

Consider a dataset \(\{\mathbf{x}_n\}\) with \(\mathbf{x}_n\in\mathbb{R}^{D}\). Given a \(D\)-dimensional vector \(\mathbf{u}_1\), each data point is projected onto the scalar value \(\mathbf{u}_1^T\mathbf{x}_n\). In that (one-dimensional) projected space,

  • Mean:
    $$\mathbf{u}_1^T\tilde{\mathbf{x}},\quad \tilde{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^N\mathbf{x}_n$$
  • Variance:
    $$\frac{1}{N}\sum_{n=1}^N\left\{\mathbf{u}_1^T\mathbf{x}_n - \mathbf{u}_1^T\tilde{\mathbf{x}}\right\}^2 = \mathbf{u}_1^T\mathbf{S}\mathbf{u}_1,\quad\mathbf{S}= \frac{1}{N}\sum_{n=1}^N\left(\mathbf{x}_n-\tilde{\mathbf{x}}\right)\left(\mathbf{x}_n-\tilde{\mathbf{x}}\right)^T$$

We now maximize the projected variance \(\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1\) with respect to \(\mathbf{u}_1\), since a larger projected variance means the projection retains more of the data's structure (information).

To prevent \(\|\mathbf{u}_1\|\rightarrow\infty\), we impose the normalization constraint \(\mathbf{u}_1^T\mathbf{u}_1=1\), enforced with a Lagrange multiplier \(\lambda_1\).

Then, the objective function is given by

$$\begin{aligned} L &= \mathbf{u}_1^T\mathbf{S}\mathbf{u}_1 + \lambda_1\left(1-\mathbf{u}_1^T\mathbf{u}_1\right)\\ \frac{\partial L}{\partial \mathbf{u}_1} &= 2\mathbf{S}\mathbf{u}_1 - 2\lambda_1\mathbf{u}_1 = 0\\ \therefore\ \mathbf{S}\mathbf{u}_1 &= \lambda_1\mathbf{u}_1\quad(\text{so $\mathbf{u}_1$ is an eigenvector of $\mathbf{S}$})\\ \mathbf{u}_1^T\mathbf{S}\mathbf{u}_1 &= \lambda_1 \quad (\because \text{left-multiply by $\mathbf{u}_1^T$ and use $\mathbf{u}_1^T\mathbf{u}_1=1$}) \end{aligned}$$

The variance is therefore maximized when we set \(\mathbf{u}_1\) equal to the eigenvector having the largest eigenvalue \(\lambda_1\). This eigenvector is known as the first principal component.

We can define additional principal components in the same way.

Summary

If we consider an \(M\)-dimensional projection space, the optimal linear projection is defined by the \(M\) eigenvectors \(\mathbf{u}_1,\ldots,\mathbf{u}_M\) of the data covariance matrix \(\mathbf{S}\) corresponding to the \(M\) largest eigenvalues \(\lambda_1,\ldots,\lambda_M\).
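This summary translates almost directly into code. Below is a minimal NumPy sketch of the eigendecomposition view of PCA described above (an illustration, not the Kerasy implementation; the function name and toy data are made up):

```python
import numpy as np

def pca_sketch(X, n_components=2):
    """Minimal PCA via eigendecomposition of the covariance matrix S."""
    X_centered = X - X.mean(axis=0)          # subtract the mean x~
    S = X_centered.T @ X_centered / len(X)   # S = (1/N) sum_n (x_n - x~)(x_n - x~)^T
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:n_components]]     # columns u_1, ..., u_M
    return X_centered @ U                    # project onto the principal subspace

# toy usage
X = np.random.RandomState(0).normal(size=(100, 5))
Z = pca_sketch(X, n_components=2)            # Z.shape == (100, 2)
```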

Kernel PCA

class kerasy.ML.decomposition.KernelPCA(n_components=None, kernel="gaussian", **kernelargs)

The main idea is the same as in PCA, but we perform it in feature space, which implicitly defines a nonlinear principal component model in the original data space.

If we assume that the projected data set has zero mean, the covariance matrix in feature space is given by

$$ \mathbf{C} = \frac{1}{N}\sum_{n=1}^N\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T $$

and its eigenvector expansion is defined by

$$\begin{aligned} \mathbf{C}\mathbf{v}_i &= \lambda_i\mathbf{v}_i\\ \frac{1}{N}\sum_{n=1}^N\phi(\mathbf{x}_n)\underset{\text{scalar}}{\left\{\phi(\mathbf{x}_n)^T\mathbf{v}_i\right\}} &= \lambda_i\mathbf{v}_i\quad (\ast)\\ \end{aligned}$$

From this equation, we see that the vector \(\mathbf{v}_i\) is given by a linear combination of the \(\phi(\mathbf{x}_n)\)

$$\mathbf{v}_i = \sum_{n=1}^N a_{in}\phi(\mathbf{x}_n)$$

Substituting this expansion back into the eigenvector equation \((\ast)\), we obtain

$$\begin{aligned} \frac{1}{N}\sum_{n=1}^N\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T\sum_{m=1}^Na_{im}\phi(\mathbf{x}_m) &= \lambda_i\sum_{n=1}^Na_{in}\phi(\mathbf{x}_n)\\ \frac{1}{N}\sum_{n=1}^Nk(\mathbf{x}_l,\mathbf{x}_n)\sum_{m=1}^Na_{im}k(\mathbf{x}_n,\mathbf{x}_m) &= \lambda_i\sum_{n=1}^Na_{in}k(\mathbf{x}_l,\mathbf{x}_n)\quad(\because\text{left-multiplied by $\phi(\mathbf{x}_l)^T$})\\ \mathbf{K}^2\mathbf{a}_i &= \lambda_iN\mathbf{K}\mathbf{a}_i\quad (\because\text{written in matrix notation})\\ \end{aligned}$$

We can find solutions for \(\mathbf{a}_i\) by solving the following eigenvalue problem. The variance in feature space is maximized when we set \(\mathbf{a}_1\) equal to the eigenvector having the largest eigenvalue \(\lambda_1\).

$$\mathbf{Ka}_i = \lambda_iN\mathbf{a}_i$$

Having solved the eigenvector problem, the projection of a point \(\mathbf{x}\) onto eigenvector \(i\) is given by

$$y_i(\mathbf{x}) = \phi(\mathbf{x})^T\mathbf{v}_i = \sum_{n=1}^Na_{in}\phi(\mathbf{x})^T\phi(\mathbf{x}_n) = \sum_{n=1}^Na_{in}k(\mathbf{x},\mathbf{x}_n)$$

If the projected data set does not have zero mean, the projected data points after centralizing are given by

$$\tilde{\phi}(\mathbf{x}_n) = \phi(\mathbf{x}_n)-\frac{1}{N}\sum_{l=1}^N\phi(\mathbf{x}_l)$$

and the corresponding elements of the Gram matrix are given by

$$\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N +\mathbf{1}_N\mathbf{K}\mathbf{1}_N$$

where \(\mathbf{1}_N\) denotes the \(N\times N\) matrix in which every element takes the value \(1/N\).
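To make the recipe concrete, here is a minimal NumPy sketch (an illustration, not the Kerasy implementation) that builds a Gaussian Gram matrix, centers it as above, solves the eigenvalue problem for the \(\mathbf{a}_i\), and projects the training points. The kernel width gamma and the helper names are assumptions, and only the training points are projected:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel matrix K with K[n, m] = exp(-gamma * ||x_n - x_m||^2)."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca_sketch(X, n_components=2, gamma=1.0):
    """Minimal kernel PCA on the training data (no out-of-sample projection)."""
    N = len(X)
    K = rbf_kernel(X, gamma)
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K_tilde)                # eigenvalues are lambda_i * N
    order = np.argsort(eigvals)[::-1][:n_components]
    # scale a_i so that v_i has unit norm: (N * lambda_i) * a_i^T a_i = 1
    A = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return K_tilde @ A            # y_i(x_n) = sum_m a_{im} k~(x_n, x_m)
```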

Warning

I don't clearly understand how to deal with a new (out-of-sample) data point \(\mathbf{x}\).

SNE

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities. The similarity of data point \(\mathbf{x}_j\) to data point \(\mathbf{x}_i\) is the conditional probability \(p_{j|i}\) that \(\mathbf{x}_i\) would pick \(\mathbf{x}_j\) as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at \(\mathbf{x}_i\).

Mathematically, the conditional probability \(p_{j|i}\) is given by

$$p_{j|i} = \frac{\exp\left(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/2\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-\|\mathbf{x}_i-\mathbf{x}_k\|^2/2\sigma_i^2\right)}$$

Because we are only interested in modeling pairwise similarities, we set the value of \(p_{i|i}\) to zero.

For the low-dimensional counterparts \(\mathbf{y}_i\) and \(\mathbf{y}_j\) of the high-dimensional data points \(\mathbf{x}_i\) and \(\mathbf{x}_j\), it is possible to compute a similar conditional probability, which we denote by \(q_{j|i}\)

$$q_{j|i} = \frac{\exp\left(-\|\mathbf{y}_i-\mathbf{y}_j\|^2\right)}{\sum_{k\neq i}\exp\left(-\|\mathbf{y}_i-\mathbf{y}_k\|^2\right)}$$

Again, we set \(q_{i|i} = 0\), and we fix the variance of the Gaussian used for \(q_{j|i}\) to \(\frac{1}{\sqrt{2}}\).

Then, the cost function is given by

$$C = \sum_i\mathrm{KL}(P_i\|Q_i) = \sum_i\sum_j p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}$$

in which \(P_i\) represents the conditional probability distribution over all other data points given data point \(\mathbf{x}_i\).

The remaining parameter to be selected is the variance \(\sigma_i\). It is chosen (in practice by a binary search) so that \(P_i\) has a fixed perplexity specified by the user. The perplexity is defined as

$$\mathrm{Perp}(P_i) = 2^{H(P_i)},\quad H(P_i) = -\sum_jp_{j|i}\log_2p_{j|i}$$
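A minimal sketch of that per-point binary search follows (an illustration, not the Kerasy code; the search bounds and iteration count are arbitrary choices):

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one row of squared distances (the i-th entry is assumed set to np.inf)."""
    logits = -sq_dists_i / (2.0 * sigma**2)
    p = np.exp(logits - logits[np.isfinite(logits)].max())  # stabilize the exponentials
    p[np.isinf(sq_dists_i)] = 0.0                           # p_{i|i} = 0
    return p / p.sum()

def find_sigma(sq_dists_i, target_perplexity, n_iter=50):
    """Binary search for sigma_i such that Perp(P_i) = 2^{H(P_i)} matches the target."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_probs(sq_dists_i, sigma)
        H = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # Shannon entropy in bits
        if 2.0**H > target_perplexity:              # distribution too flat: shrink sigma
            hi = sigma
        else:                                       # too peaked: grow sigma
            lo = sigma
    return sigma
```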

The minimization of the cost function is performed using a gradient descent method. The gradient has a surprisingly simple form

$$\frac{\partial C}{\partial \mathbf{y}_i} = 2\sum_j\left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(\mathbf{y}_i-\mathbf{y}_j\right)$$

tSNE

class kerasy.ML.decomposition.tSNE(initial_momentum=0.5, final_momoentum=0.8, eta=500, min_gain=0.1, tol=1e-5, prec_max_iter=50)

As an alternative to minimizing the sum of Kullback-Leibler divergences between the conditional distributions \(p_{j|i}\) and \(q_{j|i}\), it is also possible to minimize a single Kullback-Leibler divergence between a joint probability distribution \(P\) in the high-dimensional space and a joint probability distribution \(Q\) in the low-dimensional map:

$$C = \mathrm{KL}\left(P\|Q\right) = \sum_i\sum_jp_{ij}\log\frac{p_{ij}}{q_{ij}}$$

In t-SNE, we:

  • Employ a Student t-distribution with one degree of freedom (which is the same as a Cauchy distribution) as the heavy-tailed distribution in the low-dimensional map. Using this distribution, the joint probabilities \(q_{ij}\) are defined as

$$q_{ij} = \frac{\left(1 + \|\mathbf{y}_i-\mathbf{y}_j\|^2\right)^{-1}}{\sum_{k\neq l}\left(1 + \|\mathbf{y}_k-\mathbf{y}_l\|^2\right)^{-1}}$$

  • Symmetrize the conditional probabilities: if \(\mathbf{x}_i\) is an outlier, its low-dimensional map point \(\mathbf{y}_i\) would otherwise have very little effect on the cost function, so we set

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

to ensure that \(\sum_jp_{ij}>\frac{1}{2n}\) for all data points \(\mathbf{x}_i\).

The gradient is given by

$$\frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j\left(p_{ij} - q_{ij}\right) \left(\mathbf{y}_i - \mathbf{y}_j\right)\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}$$
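To make the gradient concrete, here is a minimal NumPy sketch of one t-SNE gradient evaluation, given the symmetrized matrix \(P=(p_{ij})\) and the current map coordinates \(Y\) (an illustration, not the Kerasy implementation; the momentum and gain tricks suggested by the class parameters above are omitted, and the learning rate is arbitrary):

```python
import numpy as np

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}.

    P: symmetric joint probabilities p_ij with zero diagonal.
    Y: current low-dimensional coordinates, shape (N, d).
    """
    diff = Y[:, None, :] - Y[None, :, :]        # (N, N, d) pairwise differences y_i - y_j
    sq_dists = np.sum(diff**2, axis=-1)         # ||y_i - y_j||^2
    inv = 1.0 / (1.0 + sq_dists)                # Student-t kernel (1 + ||.||^2)^{-1}
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                         # joint q_ij, normalized over all pairs k != l
    return 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)

# one plain gradient-descent step (learning rate chosen for illustration)
# Y -= 200.0 * tsne_gradient(P, Y)
```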

UMAP

class kerasy.ML.decomposition.UMAP(metric="euclidean", metric_kwds=None, min_dist=0.1, a=None, b=None, random_state=None, sigma_iter=20, sigma_tol=1e-5, sigma_lower=0, sigma_upper=1e3)

UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology.

tSNE vs. UMAP

The main differences are summarized below; each item gives the tSNE formulation first and the UMAP formulation second.

  • Character of the algorithm:
    tSNE: a pure machine-learning, semi-empirical algorithm.
    UMAP: based on solid mathematical principles.
  • Preserved structure:
    tSNE: preserves only local structure.
    UMAP: preserves global structure.
  • Probability in high-dimensional space:
    tSNE: $$p_{j\mid i} = \frac{\exp\left(-\|x_i-x_j\|^2/2\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-\|x_i-x_k\|^2/2\sigma_i^2\right)}$$
    UMAP: $$p_{i\mid j} = \exp\left(-\frac{d(x_i,x_j)-\rho_i}{\sigma_i}\right), \quad \rho_i = \min_{j\neq i}\left\{d(x_i,x_j)\right\}$$
  • Joint probability in high-dimensional space:
    tSNE: $$p_{ij}=\frac{p_{i\mid j}+p_{j\mid i}}{2N}$$
    UMAP: $$p_{ij} = p_{i\mid j} + p_{j\mid i} - p_{i\mid j}p_{j\mid i}$$
  • Neighborhood-size parameter:
    tSNE (perplexity): $$\text{Perplexity} = 2^{-\sum_j p_{j\mid i}\log_2p_{j\mid i}}$$
    UMAP (number of nearest neighbors): $$k = 2^{\sum_jp_{i\mid j}}$$
  • Probability in low-dimensional space:
    tSNE: $$q_{ij}=\frac{\left(1 + \|y_i-y_j\|^2\right)^{-1}}{\sum_{k\neq i}\left(1 + \|y_i-y_k\|^2\right)^{-1}}$$
    UMAP: $$q_{ij} = \left(1 + a(y_i-y_j)^{2b}\right)^{-1}$$
  • Loss function:
    tSNE (Kullback-Leibler divergence): $$\mathrm{KL}(P_i\|Q_i) = \sum_i\sum_jp_{j\mid i}\log\frac{p_{j\mid i}}{q_{j\mid i}}$$
    UMAP (cross entropy): $$\mathrm{CE}(X,Y) = \sum_i\sum_j\left[p_{ij}(X)\log\left(\frac{p_{ij}(X)}{q_{ij}(Y)}\right) + \left(1-p_{ij}(X)\right)\log\left(\frac{1-p_{ij}(X)}{1-q_{ij}(Y)}\right)\right]$$
  • Optimization:
    tSNE: gradient descent.
    UMAP: stochastic gradient descent.
  • Gradient:
    tSNE: $$\frac{\partial\mathrm{KL}}{\partial y_i} = 4\sum_j(p_{ij}-q_{ij})(y_i-y_j)\left(1 + \|y_i-y_j\|^2\right)^{-1}$$
    UMAP: $$\frac{\partial\mathrm{CE}}{\partial y_i} = \sum_j\left[\frac{2abd_{ij}^{2(b-1)}P(X)}{1 + ad_{ij}^{2b}}-\frac{2b(1-P(X))}{d_{ij}^2\left(1 + ad_{ij}^{2b}\right)}\right](y_i-y_j)$$
  • Initial low-dimensional coordinates:
    tSNE: random normal initialization.
    UMAP: graph Laplacian.
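As one concrete piece of the UMAP machinery listed above: when the class arguments a and b are left as None, the reference UMAP implementation fits them so that \(\left(1+a\,d^{2b}\right)^{-1}\) approximates a curve that equals 1 up to min_dist and decays exponentially beyond it. The sketch below shows one way to do that fit (an illustration under those assumptions, not necessarily how Kerasy chooses a and b; the grid and initial guess are arbitrary):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    """Fit a, b so that (1 + a d^{2b})^{-1} matches the target membership curve."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    # target: 1 inside min_dist, exponential decay outside
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    curve = lambda d, a, b: 1.0 / (1.0 + a * d ** (2.0 * b))
    (a, b), _ = curve_fit(curve, d, target, p0=(1.0, 1.0))
    return a, b

# a, b = fit_ab(min_dist=0.1)
```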