Linear Regression
Notebook
Example Notebook: Kerasy.examples.linear.ipynb
Linear Regression
class kerasy.ML.linear.LinearRegression(basis="none", **basisargs)
The simplest model for regression is
$$y(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \cdots + w_Dx_D$$
However, because of the linearity in the input variables \(x_i\), this model has limited expressive power. We therefore extend it by considering linear combinations of fixed nonlinear functions of the input variables, of the form
$$y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1}w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$$
In terms of the target variable \(t\), we assume that it is given by the deterministic function \(y(\mathbf{x},\mathbf{w})\) with additive Gaussian noise, so that
$$t = y(\mathbf{x},\mathbf{w}) + \epsilon,\qquad p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}\left(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}\right)$$
where \(\beta\) is the noise precision.
Therefore, if we observe a data set \(\{(\mathbf{x}_n, t_n)\}_{n=1}^N\), we obtain the following expression for the likelihood function
$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N\mathcal{N}\left(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1}\right)$$
Given this likelihood, it is easy to maximize it with respect to \(\mathbf{w}\) and \(\beta\) by setting the gradient to zero:
$$\mathbf{w}_{\text{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{t} = \Phi^{\dagger}\mathbf{t},\qquad \frac{1}{\beta_{\text{ML}}} = \frac{1}{N}\sum_{n=1}^N\left\{t_n - \mathbf{w}_{\text{ML}}^T\phi(\mathbf{x}_n)\right\}^2$$
(A NumPy sketch of this closed-form solution follows the definitions below.)
- \(\Phi\) is an \(N\times M\) matrix, called the design matrix, whose elements are given by \(\Phi_{nj} = \phi_j(\mathbf{x}_n)\)
- The quantity \(\Phi^{\dagger}\equiv\left(\Phi^T\Phi\right)^{-1}\Phi^T\) is known as the Moore-Penrose pseudo-inverse of the matrix \(\Phi\)
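For concreteness, here is a minimal NumPy sketch of the closed-form solution above. The toy data, the polynomial basis, and \(M=4\) are assumptions made purely for illustration; this is not the kerasy implementation (see the example notebook for actual usage).

```python
import numpy as np

# Toy data: noisy sine curve (illustration only).
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j (M = 4).
M = 4
Phi = np.vander(x, M, increasing=True)   # shape (N, M), Phi[n, j] = x_n ** j

# Maximum-likelihood solution via the Moore-Penrose pseudo-inverse:
# w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.pinv(Phi) @ t

# Noise precision: 1 / beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)

y_pred = Phi @ w_ml                      # fitted values y(x, w_ML)
```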
Linear Regression (Ridge, L2)
class kerasy.ML.linear.LinearRegressionRidge(lambda_, basis="none", **basisargs)
As you can see in the notebook, the model is likely to over-fit, especially when it is too complicated (\(M\rightarrow\text{Large}\)). To avoid this, we introduce the idea of adding a regularization term to the error function,
$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
where \(\lambda\) is the regularization coefficient that controls the relative importance of the data-dependent error \(E_D(\mathbf{w})\) and the regularization term \(E_W(\mathbf{w})\).
One of the simplest forms of regularizer is given by
$$E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$$
Then, the total error function and the maximum-likelihood estimate of \(\mathbf{w}\) are expressed as
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N\left\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},\qquad \mathbf{w} = \left(\lambda\mathbf{I} + \Phi^T\Phi\right)^{-1}\Phi^T\mathbf{t}$$
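The ridge solution only changes the matrix that is inverted. The helper below is an illustrative NumPy sketch, not the kerasy implementation; the function name `ridge_fit` and the reuse of `Phi` and `t` from the previous sketch are assumptions.

```python
import numpy as np

def ridge_fit(Phi, t, lambda_):
    """Closed-form ridge solution w = (lambda * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lambda_ * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Example (using Phi and t from the previous sketch):
# a larger lambda_ shrinks the weights more strongly toward zero.
# w_ridge = ridge_fit(Phi, t, lambda_=1e-2)
```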
Linear Regression (LASSO, L1)
class kerasy.ML.linear.LinearRegressionLASSO(lambda_, basis="none", **basisargs)
Consider instead the regularizer given by
$$E_W(\mathbf{w}) = \frac{1}{2}\sum_{j=1}^M\left|w_j\right|$$
This function is not differentiable, so we use ADMM (Alternating Direction Method of Multipliers). In this algorithm, we split the variable as \(\mathbf{z}=\mathbf{w}\) and work with the augmented Lagrangian
$$L_{\rho}(\mathbf{w},\mathbf{z},\boldsymbol{\mu}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{z}) + \boldsymbol{\mu}^T\left(\mathbf{w}-\mathbf{z}\right) + \frac{\rho}{2}\left\|\mathbf{w}-\mathbf{z}\right\|_2^2$$
We minimize \(L_{\rho}\) with respect to \(\mathbf{w}\) and \(\mathbf{z}\) alternately, then update the multiplier \(\boldsymbol{\mu}\), and repeat until convergence.
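Below is a minimal NumPy sketch of this ADMM iteration (illustrative only, not kerasy's implementation). It minimizes \(\frac{1}{2}\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^2 + \lambda\lVert\mathbf{w}\rVert_1\), a standard parameterization that matches the objective above up to a rescaling of \(\lambda\); the penalty parameter `rho`, the fixed iteration count, and the use of a scaled dual variable are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(Phi, t, lambda_, rho=1.0, n_iter=200):
    """ADMM for the lasso objective 0.5 * ||t - Phi w||^2 + lambda_ * ||w||_1."""
    M = Phi.shape[1]
    w = np.zeros(M)
    z = np.zeros(M)                  # auxiliary variable constrained to equal w
    u = np.zeros(M)                  # scaled dual variable (multiplier / rho)
    A = Phi.T @ Phi + rho * np.eye(M)   # reused in every w-update
    Phit_t = Phi.T @ t
    for _ in range(n_iter):
        w = np.linalg.solve(A, Phit_t + rho * (z - u))   # minimize L_rho over w
        z = soft_threshold(w + u, lambda_ / rho)          # minimize L_rho over z
        u = u + w - z                                     # multiplier update
    return z
```

The z-update is the soft-thresholding (proximal) operator of the L1 norm, which is what drives many of the estimated weights exactly to zero.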
A more general regularizer is sometimes used, for which the regularized error takes the form
$$\frac{1}{2}\sum_{n=1}^N\left\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^M\left|w_j\right|^q$$
where \(q=1\) corresponds to the LASSO and \(q=2\) to the quadratic (ridge) regularizer.
Bayesian Linear Regression
class kerasy.ML.linear.BayesianLinearRegression(alpha=1, beta=25, basis="none", **basisargs)
We now turn to a Bayesian treatment of linear regression, which avoids the over-fitting problem of maximum likelihood and also leads to automatic methods of determining model complexity using the training data alone (without requiring hold-out methods).
- We assume that the prior distribution of \(\mathbf{w}\) is given by a Gaussian distribution of the form
$$p(\mathbf{w}|\alpha) = \mathcal{N}\left(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}\right)$$
- As mentioned before, the likelihood is given by
$$p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}\left(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}\right)\\ p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N\mathcal{N}\left(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1}\right)$$
- From Bayes' theorem, the posterior distribution is proportional to the product of the prior distribution and the likelihood. Therefore, the posterior distribution is described as
$$p(\mathbf{w}|\mathbf{t}) = \mathcal{N}\left(\mathbf{m}_N,\mathbf{S}_N\right)\\ \begin{cases}\mathbf{m}_N &= \beta\mathbf{S}_N\Phi^T\mathbf{t}\\\mathbf{S}_N^{-1} &= \alpha\mathbf{I} + \beta\Phi^T\Phi\end{cases}$$
- Finally, the predictive distribution is defined by
$$\begin{aligned} p(t|\mathbf{t},\alpha,\beta) &= \int p(t|\mathbf{w},\beta)p(\mathbf{w}|\mathbf{t},\alpha,\beta)d\mathbf{w}\\ &= \mathcal{N}\left(t|\mathbf{m}_N^T\phi(\mathbf{x}),\sigma^2_N(\mathbf{x})\right)\\ \sigma^2_N(\mathbf{x}) &= \frac{1}{\beta} + \phi(\mathbf{x})^T\mathbf{S}_N\phi(\mathbf{x}) \end{aligned}$$
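Putting the posterior and predictive equations together, here is a minimal NumPy sketch (illustrative only, not the kerasy implementation; the defaults `alpha=1.0` and `beta=25.0` simply mirror the constructor signature above).

```python
import numpy as np

def bayesian_fit(Phi, t, alpha=1.0, beta=25.0):
    """Posterior p(w|t) = N(m_N, S_N) for the prior N(w|0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def bayesian_predict(phi_x, m_N, S_N, beta=25.0):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).

    phi_x has shape (n_points, M): one row of basis-function values per query point.
    """
    mean = phi_x @ m_N
    var = 1.0 / beta + np.einsum("nm,mk,nk->n", phi_x, S_N, phi_x)
    return mean, var
```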
Reference
If you want to follow the derivation (calculation) process in detail, please visit here.
Evidence Approximation Bayesian Regression
class kerasy.ML.linear.EvidenceApproxBayesianRegression(alpha=1, beta=1, basis="none", **basisargs)
In a fully Bayesian treatment of the linear basis function model, we would introduce prior distributions over the hyperparameters \(\alpha\) and \(\beta\), and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters \(\mathbf{w}\).
However, the complete marginalization over all of these variables is analytically intractable, so we instead maximize the marginal likelihood function \(p(\mathbf{t}|\alpha,\beta)\), in line with the framework of empirical Bayes (also known as type-2 maximum likelihood, generalized maximum likelihood, or the evidence approximation).
In this framework, we repeat the following steps.
- Obtain the marginal likelihood function by first integrating over the parameters \(\mathbf{w}\)
$$p(\mathbf{t}|\alpha,\beta) = \int p(\mathbf{t}|\mathbf{w},\mathbf{X},\beta)p(\mathbf{w}|\alpha)d\mathbf{w}$$
- Maximize \(p(\mathbf{t}|\alpha,\beta)\) with respect to \(\alpha\) and \(\beta\). Setting the derivatives to zero yields the re-estimation equations below, where \(\lambda_i\) are the eigenvalues of \(\beta\Phi^T\Phi\) and \(\gamma\) represents the effective number of well-determined parameters.
$$ \begin{cases} \begin{aligned} \gamma &= \sum_{i}\frac{\lambda_i}{\lambda_i + \alpha^{\text{old}}}\\ \alpha^{\text{new}} &= \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}\\ \frac{1}{\beta^{\text{new}}} &= \frac{1}{N-\gamma}\sum_{n=1}^N\left\{t_n - \mathbf{m}_N^T\phi(\mathbf{x}_n)\right\}^2 \end{aligned} \end{cases} $$
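A minimal NumPy sketch of this re-estimation loop is shown below (illustrative only, not kerasy's implementation; the convergence tolerance and iteration cap are assumptions). Note that the \(\lambda_i\) are rescaled by the current \(\beta\) at every iteration, since they are eigenvalues of \(\beta\Phi^T\Phi\).

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
    """Re-estimate alpha and beta by maximizing the evidence p(t|alpha, beta)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)   # lambda_i / beta (beta-independent part)
    for _ in range(n_iter):
        # Posterior for the current (alpha, beta).
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # gamma: effective number of well-determined parameters.
        lam = beta * eig
        gamma = np.sum(lam / (lam + alpha))
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, m_N, S_N
```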