Linear Regression
Notebook
Example Notebook: Kerasy.examples.linear.ipynb
Linear Regression
class kerasy.ML.linear.LinearRegression(basis="none", **basisargs)
The simplest model for regression is
$$y(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \cdots + w_Dx_D$$
However, because of the linearity in the input variables \(x_i\), this model has limited expressive power. We therefore extend it by considering linear combinations of fixed nonlinear functions of the input variables, of the form
$$y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1}w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$$
In terms of the target variable \(t\), we assume that it is given by the deterministic function \(y(\mathbf{x},\mathbf{w})\) with additive Gaussian noise, so that
$$t = y(\mathbf{x},\mathbf{w}) + \epsilon,\qquad p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}\left(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}\right)$$
where \(\beta\) is the noise precision.
Therefore, if we observe a data set \(\{(\mathbf{x}_n, t_n)\}_{n=1}^N\), we obtain the following expression for the likelihood function
$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N\mathcal{N}\left(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1}\right)$$
Given this likelihood, it is easy to maximize it with respect to \(\mathbf{w}\) and \(\beta\) by setting the gradient to zero:
$$\mathbf{w}_{\text{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{t} = \Phi^{\dagger}\mathbf{t},\qquad \frac{1}{\beta_{\text{ML}}} = \frac{1}{N}\sum_{n=1}^N\left\{t_n - \mathbf{w}_{\text{ML}}^T\phi(\mathbf{x}_n)\right\}^2$$
(A NumPy sketch of this closed-form solution follows the definitions below.)
- \(\Phi\) is an \(N\times M\) matrix, called the design matrix, whose elements are given by \(\Phi_{nj} = \phi_j(\mathbf{x}_n)\)
- The quantity \(\Phi^{\dagger}\equiv\left(\Phi^T\Phi\right)^{-1}\Phi^T\) is known as the Moore-Penrose pseudo-inverse of the matrix \(\Phi\)
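For concreteness, here is a minimal NumPy sketch of the closed-form solution above. The toy data, the polynomial basis, and \(M=4\) are assumptions made purely for illustration; this is not the kerasy implementation (see the example notebook for actual usage).

```python
import numpy as np

# Toy data: noisy sine curve (illustration only).
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j (M = 4).
M = 4
Phi = np.vander(x, M, increasing=True)   # shape (N, M), Phi[n, j] = x_n ** j

# Maximum-likelihood solution via the Moore-Penrose pseudo-inverse:
# w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.pinv(Phi) @ t

# Noise precision: 1 / beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)

y_pred = Phi @ w_ml                      # fitted values y(x, w_ML)
```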
Linear Regression (Ridge, L2)
class kerasy.ML.linear.LinearRegressionRidge(lambda_, basis="none", **basisargs)
As you can see in the notebook, the model is likely to over-fit, especially when it is too complicated (\(M\rightarrow\text{Large}\)). To avoid this, we introduce the idea of adding a regularization term to the error function,
$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
where \(\lambda\) is the regularization coefficient that controls the relative importance of the data-dependent error \(E_D(\mathbf{w})\) and the regularization term \(E_W(\mathbf{w})\).
One of the simplest forms of regularizer is given by
$$E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$$
Then, the total error function and the maximum-likelihood estimate of \(\mathbf{w}\) are expressed as
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N\left\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},\qquad \mathbf{w} = \left(\lambda\mathbf{I} + \Phi^T\Phi\right)^{-1}\Phi^T\mathbf{t}$$
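The ridge solution only changes the matrix that is inverted. The helper below is an illustrative NumPy sketch, not the kerasy implementation; the function name `ridge_fit` and the reuse of `Phi` and `t` from the previous sketch are assumptions.

```python
import numpy as np

def ridge_fit(Phi, t, lambda_):
    """Closed-form ridge solution w = (lambda * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lambda_ * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Example (using Phi and t from the previous sketch):
# a larger lambda_ shrinks the weights more strongly toward zero.
# w_ridge = ridge_fit(Phi, t, lambda_=1e-2)
```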
Linear Regression (LASSO, L1)
class kerasy.ML.linear.LinearRegressionLASSO(lambda_, basis="none", **basisargs)
Consider instead the regularizer given by
$$E_W(\mathbf{w}) = \frac{1}{2}\sum_{j=1}^M\left|w_j\right|$$
This function is not differentiable, so we use ADMM (Alternating Direction Method of Multipliers). In this algorithm, we split the variable as \(\mathbf{z}=\mathbf{w}\) and work with the augmented Lagrangian
$$L_{\rho}(\mathbf{w},\mathbf{z},\boldsymbol{\mu}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{z}) + \boldsymbol{\mu}^T\left(\mathbf{w}-\mathbf{z}\right) + \frac{\rho}{2}\left\|\mathbf{w}-\mathbf{z}\right\|_2^2$$
We minimize \(L_{\rho}\) with respect to \(\mathbf{w}\) and \(\mathbf{z}\) alternately, then update the multiplier \(\boldsymbol{\mu}\), and repeat until convergence.
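Below is a minimal NumPy sketch of this ADMM iteration (illustrative only, not kerasy's implementation). It minimizes \(\frac{1}{2}\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^2 + \lambda\lVert\mathbf{w}\rVert_1\), a standard parameterization that matches the objective above up to a rescaling of \(\lambda\); the penalty parameter `rho`, the fixed iteration count, and the use of a scaled dual variable are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(Phi, t, lambda_, rho=1.0, n_iter=200):
    """ADMM for the lasso objective 0.5 * ||t - Phi w||^2 + lambda_ * ||w||_1."""
    M = Phi.shape[1]
    w = np.zeros(M)
    z = np.zeros(M)                  # auxiliary variable constrained to equal w
    u = np.zeros(M)                  # scaled dual variable (multiplier / rho)
    A = Phi.T @ Phi + rho * np.eye(M)   # reused in every w-update
    Phit_t = Phi.T @ t
    for _ in range(n_iter):
        w = np.linalg.solve(A, Phit_t + rho * (z - u))   # minimize L_rho over w
        z = soft_threshold(w + u, lambda_ / rho)          # minimize L_rho over z
        u = u + w - z                                     # multiplier update
    return z
```

The z-update is the soft-thresholding (proximal) operator of the L1 norm, which is what drives many of the estimated weights exactly to zero.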
A more general regularizer is sometimes used, for which the regularized error takes the form
$$\frac{1}{2}\sum_{n=1}^N\left\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^M\left|w_j\right|^q$$
where \(q=1\) corresponds to the LASSO and \(q=2\) to the quadratic (ridge) regularizer.
Bayesian Linear Regression
class kerasy.ML.linear.BayesianLinearRegression(alpha=1, beta=25, basis="none", **basisargs)
We now turn to a Bayesian treatment of linear regression, which avoids the over-fitting problem of maximum likelihood and also leads to automatic methods of determining model complexity using the training data alone (without requiring hold-out methods).
- We assume that the prior distribution of \(\mathbf{w}\) is given by a Gaussian distribution of the form
$$p(\mathbf{w}|\alpha) = \mathcal{N}\left(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}\right)$$
- As mentioned before, the likelihood is given by
$$p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}\left(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}\right)\\ p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N\mathcal{N}\left(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\beta^{-1}\right)$$
- From Bayes' theorem, the posterior distribution is proportional to the product of the prior distribution and the likelihood. Therefore, the posterior distribution is described as
$$p(\mathbf{w}|\mathbf{t}) = \mathcal{N}\left(\mathbf{m}_N,\mathbf{S}_N\right)\\ \begin{cases}\mathbf{m}_N &= \beta\mathbf{S}_N\Phi^T\mathbf{t}\\\mathbf{S}_N^{-1} &= \alpha\mathbf{I} + \beta\Phi^T\Phi\end{cases}$$
- Finally, the predictive distribution is defined by
$$\begin{aligned} p(t|\mathbf{t},\alpha,\beta) &= \int p(t|\mathbf{w},\beta)p(\mathbf{w}|\mathbf{t},\alpha,\beta)d\mathbf{w}\\ &= \mathcal{N}\left(t|\mathbf{m}_N^T\phi(\mathbf{x}),\sigma^2_N(\mathbf{x})\right)\\ \sigma^2_N(\mathbf{x}) &= \frac{1}{\beta} + \phi(\mathbf{x})^T\mathbf{S}_N\phi(\mathbf{x}) \end{aligned}$$
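Putting the posterior and predictive equations together, here is a minimal NumPy sketch (illustrative only, not the kerasy implementation; the defaults `alpha=1.0` and `beta=25.0` simply mirror the constructor signature above).

```python
import numpy as np

def bayesian_fit(Phi, t, alpha=1.0, beta=25.0):
    """Posterior p(w|t) = N(m_N, S_N) for the prior N(w|0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def bayesian_predict(phi_x, m_N, S_N, beta=25.0):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).

    phi_x has shape (n_points, M): one row of basis-function values per query point.
    """
    mean = phi_x @ m_N
    var = 1.0 / beta + np.einsum("nm,mk,nk->n", phi_x, S_N, phi_x)
    return mean, var
```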
Reference
If you want to follow the derivation (calculation) process in detail, please visit here.
Evidence Approximation Bayesian Regression
class kerasy.ML.linear.EvidenceApproxBayesianRegression(alpha=1, beta=1, basis="none", **basisargs)
In a fully Bayesian treatment of the linear basis function model, we would introduce prior distributions over the hyperparameters \(\alpha\) and \(\beta\), and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters \(\mathbf{w}\).
However, the complete marginalization over all of these variables is analytically intractable, so we instead maximize the marginal likelihood function \(p(\mathbf{t}|\alpha,\beta)\), in line with the framework of empirical Bayes (also known as type-2 maximum likelihood, generalized maximum likelihood, or the evidence approximation).
In this framework, we repeat the following steps.
- Obtain the marginal likelihood function by first integrating over the parameters \(\mathbf{w}\)
$$p(\mathbf{t}|\alpha,\beta) = \int p(\mathbf{t}|\mathbf{w},\mathbf{X},\beta)p(\mathbf{w}|\alpha)d\mathbf{w}$$
- Maximize \(p(\mathbf{t}|\alpha,\beta)\) with respect to \(\alpha\) and \(\beta\). Setting the derivatives to zero yields the re-estimation equations below, where \(\lambda_i\) are the eigenvalues of \(\beta\Phi^T\Phi\) and \(\gamma\) represents the effective number of well-determined parameters.
$$ \begin{cases} \begin{aligned} \gamma &= \sum_{i}\frac{\lambda_i}{\lambda_i + \alpha^{\text{old}}}\\ \alpha^{\text{new}} &= \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}\\ \frac{1}{\beta^{\text{new}}} &= \frac{1}{N-\gamma}\sum_{n=1}^N\left\{t_n - \mathbf{m}_N^T\phi(\mathbf{x}_n)\right\}^2 \end{aligned} \end{cases} $$
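A minimal NumPy sketch of this re-estimation loop is shown below (illustrative only, not kerasy's implementation; the convergence tolerance and iteration cap are assumptions). Note that the \(\lambda_i\) are rescaled by the current \(\beta\) at every iteration, since they are eigenvalues of \(\beta\Phi^T\Phi\).

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
    """Re-estimate alpha and beta by maximizing the evidence p(t|alpha, beta)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)   # lambda_i / beta (beta-independent part)
    for _ in range(n_iter):
        # Posterior for the current (alpha, beta).
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # gamma: effective number of well-determined parameters.
        lam = beta * eig
        gamma = np.sum(lam / (lam + alpha))
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, m_N, S_N
```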