Decomposition
Purpose
The purpose of decomposition (dimensionality reduction, feature extraction) is to find the "best" principal subspace in which the data can be expressed with far fewer dimensions while preserving as much information as possible.
Notebook
Example Notebook: Kerasy.examples.decomposition.ipynb
PCA
class kerasy.ML.decomposition.PCA(n_components=None)
Consider a dataset \(\{\mathbf{x}_n\}_{n=1}^N\) with \(\mathbf{x}_n\in\mathbb{R}^{D}\). Given a \(D\)-dimensional direction vector \(\mathbf{u}_1\), each data point \(\mathbf{x}_n\) is projected onto the scalar value \(\mathbf{u}_1^T\mathbf{x}_n\). In that (one-dimensional) projected space,
- Mean: \(\mathbf{u}_1^T\tilde{\mathbf{x}}\), where
$$\tilde{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^N\mathbf{x}_n$$
- Variance:
$$\frac{1}{N}\sum_{n=1}^N\left\{\mathbf{u}_1^T\mathbf{x}_n - \mathbf{u}_1^T\tilde{\mathbf{x}}\right\}^2 = \mathbf{u}_1^T\mathbf{S}\mathbf{u}_1,\quad\mathbf{S}= \frac{1}{N}\sum_{n=1}^N\left(\mathbf{x}_n-\tilde{\mathbf{x}}\right)\left(\mathbf{x}_n-\tilde{\mathbf{x}}\right)^T$$
We now maximize the projected variance \(\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1\) with respect to \(\mathbf{u}_1\), since a larger projected variance means that more of the structure in the data is retained.
To prevent \(\|\mathbf{u}_1\|\rightarrow\infty\), we impose the normalization constraint \(\mathbf{u}_1^T\mathbf{u}_1=1\), enforced with a Lagrange multiplier \(\lambda_1\).
The objective function is then given by
$$\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1 + \lambda_1\left(1-\mathbf{u}_1^T\mathbf{u}_1\right)$$
Setting the derivative with respect to \(\mathbf{u}_1\) to zero gives \(\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1\), so the projected variance \(\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1 = \lambda_1\) will be a maximum when we set \(\mathbf{u}_1\) equal to the eigenvector having the largest eigenvalue \(\lambda_1\). This eigenvector is known as the first principal component.
We can define additional principal components in the same way.
Summary
If we consider an \(M\)-dimensional projection space, the optimal linear projection is defined by the \(M\) eigenvectors \(\mathbf{u}_1,\ldots,\mathbf{u}_M\) of the data covariance matrix \(\mathbf{S}\) corresponding to the \(M\) largest eigenvalues \(\lambda_1,\ldots,\lambda_M\).
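The following is a minimal NumPy sketch of this procedure (center, build \(\mathbf{S}\), eigendecompose, keep the top \(M\) eigenvectors). It illustrates the math above rather than the kerasy `PCA` class; the function name and sample data are chosen only for this example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X (N, D) onto the top `n_components` principal components."""
    # Center the data around the sample mean x_tilde.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix S = (1/N) sum_n (x_n - x_tilde)(x_n - x_tilde)^T, shape (D, D).
    S = X_centered.T @ X_centered / X.shape[0]
    # Eigendecomposition of the symmetric matrix S (eigh returns ascending eigenvalues).
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:n_components]
    U = eigvecs[:, order]                     # columns u_1, ..., u_M
    return X_centered @ U, eigvals[order]     # projections and projected variances

# Example: 100 correlated points in 5 dimensions reduced to 2 dimensions.
rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)
Z, variances = pca_project(X, n_components=2)
print(Z.shape, variances)   # (100, 2) and the two largest eigenvalues of S
```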
Kernel PCA
class kerasy.ML.decomposition.KernelPCA(n_components=None, kernel="gaussian", **kernelargs)
The main idea is the same as PCA, but the computation is performed in a feature space \(\phi(\mathbf{x})\), which implicitly defines a nonlinear principal component model in the original data space.
If we assume that the projected data set \(\{\phi(\mathbf{x}_n)\}\) has zero mean, the covariance matrix in feature space is given by
$$\mathbf{C} = \frac{1}{N}\sum_{n=1}^N\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T$$
and its eigenvector expansion is defined by
$$\mathbf{C}\mathbf{v}_i = \lambda_i\mathbf{v}_i,\quad i=1,\ldots,M\quad(\ast)$$
From this equation, we see that the vector \(\mathbf{v}_i\) is given by a linear combination of the \(\phi(\mathbf{x}_n)\):
$$\mathbf{v}_i = \sum_{n=1}^N a_{in}\phi(\mathbf{x}_n)$$
Substituting this expansion back into the eigenvector equation \((\ast)\) and multiplying both sides by \(\phi(\mathbf{x}_l)^T\), we obtain, in terms of the Gram matrix \(\mathbf{K}\) with elements \(K_{nm}=k(\mathbf{x}_n,\mathbf{x}_m)=\phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m)\),
$$\mathbf{K}^2\mathbf{a}_i = \lambda_iN\mathbf{K}\mathbf{a}_i$$
We can find solutions for \(\mathbf{a}_i\) by solving the following eigenvalue problem, obtained by removing a factor of \(\mathbf{K}\) from both sides:
$$\mathbf{K}\mathbf{a}_i = \lambda_iN\mathbf{a}_i$$
The variance in feature space will be a maximum when we set \(\mathbf{a}_1\) equal to the eigenvector having the largest eigenvalue \(\lambda_1\).
Having solved the eigenvector problem, the projection of a point \(\mathbf{x}\) onto eigenvector \(i\) is given by
$$y_i(\mathbf{x}) = \phi(\mathbf{x})^T\mathbf{v}_i = \sum_{n=1}^N a_{in}k(\mathbf{x},\mathbf{x}_n)$$
If the projected data set does not have zero mean, the projected data points after centering are given by
$$\tilde{\phi}(\mathbf{x}_n) = \phi(\mathbf{x}_n) - \frac{1}{N}\sum_{l=1}^N\phi(\mathbf{x}_l)$$
and the corresponding elements of the Gram matrix are given by
$$\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N$$
where \(\mathbf{1}_N\) denotes the \(N\times N\) matrix in which every element takes the value \(1/N\).
Warning
I do not clearly understand how to deal with a new data point \(\mathbf{x}_n\) that is not part of the training set.
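For reference, here is a minimal NumPy sketch of the steps above for the training points only (consistent with the warning, no new data point is projected): build a Gaussian Gram matrix, center it with \(\mathbf{1}_N\), solve \(\mathbf{K}\mathbf{a}_i=\lambda_iN\mathbf{a}_i\), and project. It illustrates the math rather than the kerasy `KernelPCA` class, and the kernel width `gamma` is an arbitrary choice for this example.

```python
import numpy as np

def kernel_pca_fit_transform(X, n_components=2, gamma=0.5):
    """Project the training points onto the leading kernel principal components."""
    N = X.shape[0]
    # Gaussian (RBF) Gram matrix: K[n, m] = exp(-gamma * ||x_n - x_m||^2).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    # Center in feature space: K_tilde = K - 1_N K - K 1_N + 1_N K 1_N,
    # where 1_N is the N x N matrix whose every element is 1/N.
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    # Solve K_tilde a_i = (lambda_i N) a_i; eigh returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    order = np.argsort(eigvals)[::-1][:n_components]
    A = eigvecs[:, order]
    # Rescale a_i so that v_i^T v_i = a_i^T K_tilde a_i = 1.
    A = A / np.sqrt(np.maximum(eigvals[order], 1e-12))
    # Projection of training point x_m onto v_i: sum_n a_{in} * K_tilde[m, n].
    return K_tilde @ A

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
Z = kernel_pca_fit_transform(X, n_components=2)
print(Z.shape)   # (50, 2)
```

The training-set projections could equivalently be read off as \(N\lambda_i a_{im}\); the explicit product with \(\tilde{\mathbf{K}}\) is used here to mirror the projection formula above.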
SNE
Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities. The similarity of datapoint \(\mathbf{x}_j\) to datapoint \(\mathbf{x}_i\) is the conditional probability \(p_{j|i}\) that \(\mathbf{x}_i\) would pick \(\mathbf{x}_j\) as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at \(\mathbf{x}_i\).
Mathematically, the conditional probability \(p_{j|i}\) is given by
$$p_{j\mid i} = \frac{\exp\left(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/2\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-\|\mathbf{x}_i-\mathbf{x}_k\|^2/2\sigma_i^2\right)}$$
Because we are only interested in modeling pairwise similarities, we set the value of \(p_{i|i}\) to zero.
For the low-dimensional counterparts \(\mathbf{y}_i\) and \(\mathbf{y}_j\) of the high-dimensional data points \(\mathbf{x}_i\) and \(\mathbf{x}_j\), it is possible to compute a similar conditional probability, which we denote by \(q_{j|i}\):
$$q_{j\mid i} = \frac{\exp\left(-\|\mathbf{y}_i-\mathbf{y}_j\|^2\right)}{\sum_{k\neq i}\exp\left(-\|\mathbf{y}_i-\mathbf{y}_k\|^2\right)}$$
Again, we set \(q_{i|i} = 0\), and we set the variance of the Gaussian to \(\frac{1}{\sqrt{2}}\), so that the exponent is simply \(-\|\mathbf{y}_i-\mathbf{y}_j\|^2\).
Then, the cost function is given by
$$C = \sum_i\mathrm{KL}(P_i\|Q_i) = \sum_i\sum_jp_{j\mid i}\log\frac{p_{j\mid i}}{q_{j\mid i}}$$
in which \(P_i\) represents the conditional probability distribution over all other data points given data point \(\mathbf{x}_i\).
The remaining parameter to be selected is the variance \(\sigma_i\). It is chosen to produce a \(P_i\) with a fixed perplexity that is specified by the user. The perplexity is defined as
$$\mathrm{Perp}(P_i) = 2^{H(P_i)},\quad H(P_i) = -\sum_j p_{j\mid i}\log_2 p_{j\mid i}$$
where \(H(P_i)\) is the Shannon entropy of \(P_i\) measured in bits; \(\sigma_i\) is found by a simple search so that \(P_i\) attains this perplexity, as sketched below.
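The following NumPy sketch shows one common way to choose each \(\sigma_i\): a binary search on the precision \(\beta_i = 1/(2\sigma_i^2)\) until the entropy of \(P_i\) matches \(\log_2(\text{perplexity})\). The function names, search bounds, and iteration count are illustrative choices, not the kerasy internals.

```python
import numpy as np

def conditional_probs(dists_i, beta):
    """p_{j|i} for one row of squared distances at precision beta = 1/(2*sigma_i^2)."""
    p = np.exp(-dists_i * beta)
    return p / p.sum()

def search_sigma(dists_i, target_perplexity=30.0, n_iter=50, tol=1e-5):
    """Binary-search beta so that 2^{H(P_i)} matches the target perplexity."""
    target_entropy = np.log2(target_perplexity)
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    for _ in range(n_iter):
        p = conditional_probs(dists_i, beta)
        entropy = -np.sum(p * np.log2(p + 1e-12))        # H(P_i) in bits
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:                     # too flat -> larger beta (smaller sigma)
            beta_lo = beta
            beta = beta * 2.0 if np.isinf(beta_hi) else (beta_lo + beta_hi) / 2.0
        else:                                            # too peaked -> smaller beta
            beta_hi = beta
            beta = (beta_lo + beta_hi) / 2.0
    p = conditional_probs(dists_i, beta)                 # probabilities at the final beta
    return p, 1.0 / np.sqrt(2.0 * beta)                  # (p_{j|i}, sigma_i)

# Squared distances from x_0 to every other point (p_{i|i} is excluded).
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
d2 = np.sum((X[1:] - X[0]) ** 2, axis=1)
p_row, sigma_0 = search_sigma(d2, target_perplexity=30.0)
print(sigma_0, p_row.sum())   # sigma_i for x_0; the probabilities sum to 1
```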
The minimization of the cost function is performed using a gradient descent method. The gradient has a surprisingly simple form:
$$\frac{\partial C}{\partial \mathbf{y}_i} = 2\sum_j\left(p_{j\mid i} - q_{j\mid i} + p_{i\mid j} - q_{i\mid j}\right)\left(\mathbf{y}_i-\mathbf{y}_j\right)$$
tSNE
class kerasy.ML.decomposition.tSNE(initial_momentum=0.5, final_momoentum=0.8, eta=500, min_gain=0.1, tol=1e-5, prec_max_iter=50)
As an alternative to minimizing the sum of Kullback-Leibler divergences between the conditional distributions \(p_{j|i}\) and \(q_{j|i}\), it is also possible to minimize a single Kullback-Leibler divergence between a joint probability distribution \(P\) in the high-dimensional space and a joint distribution \(Q\) in the low-dimensional map:
$$C = \mathrm{KL}(P\|Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
In t-SNE,
- A Student t-distribution with one degree of freedom (which is the same as a Cauchy distribution) is employed as the heavy-tailed distribution in the low-dimensional map. Using this distribution, the joint probabilities \(q_{ij}\) are defined as
$$q_{ij}=\frac{\left(1 + \|\mathbf{y}_i-\mathbf{y}_j\|^2\right)^{-1}}{\sum_{k\neq l}\left(1 + \|\mathbf{y}_k-\mathbf{y}_l\|^2\right)^{-1}}$$
- If \(\mathbf{x}_i\) is an outlier, its low-dimensional map point \(\mathbf{y}_i\) would have very little effect on the cost function, so the joint probabilities in the high-dimensional space are set to
$$p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}$$
to ensure that \(\sum_jp_{ij}>\frac{1}{2n}\) for all datapoints \(\mathbf{x}_i\).
The gradient is given by
$$\frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j\left(p_{ij}-q_{ij}\right)\left(\mathbf{y}_i-\mathbf{y}_j\right)\left(1 + \|\mathbf{y}_i-\mathbf{y}_j\|^2\right)^{-1}$$
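Below is a minimal NumPy sketch grounded in the two formulas above: it evaluates the Student-t joint probabilities \(q_{ij}\) and the gradient, then takes one plain gradient-descent step with the default learning rate \(\eta=500\) from the class signature. The momentum and gain schedule listed in the signature are omitted, and the toy \(P\) matrix is random, so this is an illustration rather than the kerasy implementation.

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P||Q) w.r.t. the low-dimensional coordinates Y (N, 2)."""
    # Student-t numerators: (1 + ||y_i - y_j||^2)^{-1}, with the diagonal zeroed.
    diff = Y[:, None, :] - Y[None, :, :]                  # (N, N, 2), y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))   # (N, N)
    np.fill_diagonal(inv_dist, 0.0)
    # Joint probabilities q_ij in the low-dimensional map.
    Q = inv_dist / inv_dist.sum()
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    W = (P - Q) * inv_dist                                 # (N, N)
    grad = 4.0 * np.einsum("ij,ijd->id", W, diff)          # (N, 2)
    return grad, Q

# Toy example: symmetric P from random conditionals, random initial map Y.
rng = np.random.RandomState(0)
N = 100
P_cond = rng.rand(N, N); np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)                # rows are p_{j|i}
P = (P_cond + P_cond.T) / (2 * N)                          # p_ij = (p_{j|i} + p_{i|j}) / 2n
Y = 1e-4 * rng.randn(N, 2)
grad, Q = tsne_grad(P, Y)
Y = Y - 500 * grad                                         # one plain gradient-descent step (eta = 500)
print(grad.shape, Q.sum())                                 # (100, 2); Q sums to 1
```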
UMAP
class kerasy.ML.decomposition.UMAP(metric="euclidean", metric_kwds=None, min_dist=0.1, a=None, b=None, random_state=None, sigma_iter=20, sigma_tol=1e-5, sigma_lower=0, sigma_upper=1e3)
UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology.
tSNE vs. UMAP
| | tSNE | UMAP |
|---|---|---|
| | pure Machine Learning semi-empirical algorithm | based on solid mathematical principles |
| | preserves only local structure | preserves global structure |
| probability in high-dimensional space | $$p_{j\mid i} = \frac{\exp\left(-\|x_i-x_j\|^2/2\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-\|x_i-x_k\|^2/2\sigma_i^2\right)}$$ | $$p_{i\mid j} = \exp\left(-\frac{d(x_i,x_j)-\rho_i}{\sigma_i}\right), \quad \rho_i = \min_{j\neq i}\left\{d(x_i,x_j)\right\}$$ |
| joint probability in high-dimensional space | $$p_{ij}=\frac{p_{i\mid j}+p_{j\mid i}}{2N}$$ | $$p_{ij} = p_{i\mid j} + p_{j\mid i} - p_{i\mid j}p_{j\mid i}$$ |
| neighborhood-size hyperparameter | $$\text{Perplexity} = 2^{-\sum_j p_{j\mid i}\log_2p_{j\mid i}}$$ | number of nearest neighbors $$k = 2^{\sum_jp_{i\mid j}}$$ |
| probability in low-dimensional space | $$q_{ij}=\frac{\left(1 + \|y_i-y_j\|^2\right)^{-1}}{\sum_{k\neq i}\left(1 + \|y_i-y_k\|^2\right)^{-1}}$$ | $$q_{ij} = \left(1 + a(y_i-y_j)^{2b}\right)^{-1}$$ |
| loss function | $$\mathrm{KL}(P_i\|Q_i) = \sum_i\sum_jp_{j\mid i}\log\frac{p_{j\mid i}}{q_{j\mid i}}$$ | $$\mathrm{CE}(X,Y) = \sum_i\sum_j\left[p_{ij}(X)\log\left(\frac{p_{ij}(X)}{q_{ij}(Y)}\right) + \left(1-p_{ij}(X)\right)\log\left(\frac{1-p_{ij}(X)}{1-q_{ij}(Y)}\right)\right]$$ |
| optimization | Gradient Descent | Stochastic Gradient Descent |
| gradients | $$\frac{\partial\mathrm{KL}}{\partial y_i} = 4\sum_j(p_{ij}-q_{ij})(y_i-y_j)\left(1 + \|y_i-y_j\|^2\right)^{-1}$$ | $$\frac{\partial\mathrm{CE}}{\partial y_i} = \sum_j\left[\frac{2abd_{ij}^{2(b-1)}P(X)}{1 + ad_{ij}^{2b}}-\frac{2b(1-P(X))}{d_{ij}^2\left(1 + ad_{ij}^{2b}\right)}\right](y_i-y_j)$$ |
| initial low-dimensional coordinates | Random Normal Initialization | Graph Laplacian |
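For reference on the right-hand column of the table, here is a small NumPy sketch of the UMAP-style probabilities: the high-dimensional memberships \(p_{i\mid j}\) with the nearest-neighbor distance \(\rho_i\) subtracted, their fuzzy-union symmetrization, and the low-dimensional curve \(q_{ij}=(1+ad_{ij}^{2b})^{-1}\). A fixed \(\sigma\) is used instead of solving for the target number of neighbors \(k\), and the `a`, `b` values are only rough stand-ins for the values UMAP fits from `min_dist`, so treat this as an illustration, not the kerasy `UMAP` class.

```python
import numpy as np

def umap_probabilities(X, sigma=1.0):
    """High-dimensional memberships p_{i|j} and their fuzzy-union symmetrization p_ij."""
    # Pairwise Euclidean distances d(x_i, x_j).
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    # rho_i = distance from x_i to its nearest neighbor (exclude the zero diagonal).
    D_off = D + np.diag(np.full(len(X), np.inf))
    rho = D_off.min(axis=1)
    # p_{i|j} = exp(-(d(x_i, x_j) - rho_i) / sigma), clipped so the nearest neighbor gets 1.
    P_cond = np.exp(-np.maximum(D - rho[:, None], 0.0) / sigma)
    np.fill_diagonal(P_cond, 0.0)
    # Fuzzy set union: p_ij = p_{i|j} + p_{j|i} - p_{i|j} p_{j|i}.
    return P_cond + P_cond.T - P_cond * P_cond.T

def umap_q(Y, a=1.58, b=0.9):
    """Low-dimensional similarities q_ij = (1 + a * d_ij^{2b})^{-1}."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
P = umap_probabilities(X)
Q = umap_q(rng.randn(30, 2))
print(P.shape, Q.shape)   # (30, 30), (30, 30)
```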