Support Vector Machine

Notebook

Example Notebook: Kerasy.examples.svm.ipynb

Hard SVC

class kerasy.ML.svm.hardSVC(kernel="gaussian", **kernelargs)

Start from the "two-class classification" problem using "linear models" that "separate the training data set linearly". Under this assumption, we can write down the following:

  • The training data:
    $$D = \left\{(t_n,\mathbf{x}_n)\in \left\{-1, 1\right\} \times \mathbb{R}^m | n = 1,\ldots, N\right\}$$
  • linear models:
    $$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b$$
  • For all training data points:
    $$t_ny(\mathbf{x}_n) > 0$$
  • Margin (derived in the note after this list) is given by:
    $$\frac{1}{\|\mathbf{w}\|}$$
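
The margin expression in the last bullet follows from the canonical scaling of the linear model: the distance from a correctly classified point \(\mathbf{x}_n\) to the decision surface is \(t_ny(\mathbf{x}_n)/\|\mathbf{w}\|\), and since \(\mathbf{w}\) and \(b\) can be rescaled so that the closest point satisfies \(t_ny(\mathbf{x}_n)=1\), the margin of the data set becomes

$$\underset{n}{\min}\frac{t_ny(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$$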

※ If any of the points above are unclear, please check here.

Points

In support vector machines, the decision boundary is chosen to be the one for which the margin is maximized.

From the discussion above, the corresponding Lagrangian (to be minimized with respect to \(\mathbf{w}\) and \(b\), and maximized with respect to \(\mathbf{a}\)) is given by

$$L\left(\mathbf{w},b,\mathbf{a}\right) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^Na_n\left\{t_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) -1 \right\}$$

where \(\mathbf{a} = \left(a_1,\ldots,a_N\right)^T\) are the Lagrange multipliers, with \(a_n\geq0\).

Warning

Note the minus sign in front of the Lagrange multiplier term. Check HERE.

To solve this optimization problem, set the derivatives of \(L\left(\mathbf{w},b,\mathbf{a}\right)\) with respect to \(\mathbf{w}\) and \(b\) equal to zero.

$$ \begin{aligned} \frac{\partial L}{\partial \mathbf{w}} &= \mathbf{w}-\sum_{n=1}^N a_nt_n\phi(\mathbf{x}_n) =0 &\therefore \mathbf{w}^{\ast}=\sum_{n=1}^N a_nt_n\phi(\mathbf{x}_n)\\ \frac{\partial L}{\partial b} &= -\sum_{n=1}^N a_nt_n = 0 &\therefore\sum_{n=1}^Na_n^{\ast}t_n = 0\\ \end{aligned} $$

Eliminating \(\mathbf{w}\) and \(b\) from \(L\left(\mathbf{w},b,\mathbf{a}\right)\) using these conditions then gives the dual representation of the maximum margin problem in which we maximize

$$ \begin{aligned} \tilde{L}\left(\mathbf{a}\right) &= \frac{1}{2}\|\mathbf{w}\|^2 - \mathbf{w}^T\sum_{n=1}^Na_nt_n\phi(\mathbf{x}_n) - b\sum_{n=1}^Na_nt_n + \sum_{n=1}^Na_n\\ &= -\frac{1}{2}\|\mathbf{w}\|^2 - b\cdot0 + \sum_{n=1}^Na_n\\ &= \sum_{n=1}^Na_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^Na_na_mt_nt_mk\left(\mathbf{x}_n,\mathbf{x}_m\right)\quad k\left(\mathbf{x}_n,\mathbf{x}_m\right)=\phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) \end{aligned} $$

with respect to \(\mathbf{a}\) subject to the constraints

$$ \begin{aligned} a_n & \geq 0,\quad n=1,\ldots,N\\ \sum_{n=1}^Na_nt_n &= 0 \end{aligned} $$

This takes the form of a quadratic programming problem, and there are well-known algorithms for solving it (e.g., SMO: Sequential Minimal Optimization).
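As a rough sketch of what solving this dual looks like (this is not Kerasy's implementation; `gaussian_kernel` and `solve_hard_margin_dual` are placeholder names, and a generic SLSQP solver from SciPy is used here in place of SMO):

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def solve_hard_margin_dual(X, t, kernel=gaussian_kernel):
    """Maximize L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    subject to a_n >= 0 and sum_n a_n t_n = 0."""
    N = len(t)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    P = (t[:, None] * t[None, :]) * K               # P_nm = t_n t_m k(x_n, x_m)

    neg_dual = lambda a: 0.5 * a @ P @ a - a.sum()  # minimize -L~(a)
    neg_grad = lambda a: P @ a - np.ones(N)

    res = minimize(
        neg_dual, x0=np.zeros(N), jac=neg_grad, method="SLSQP",
        bounds=[(0.0, None)] * N,                              # a_n >= 0
        constraints=[{"type": "eq", "fun": lambda a: a @ t}],  # sum_n a_n t_n = 0
    )
    return res.x

# Toy data: two linearly separable clusters labeled +1 / -1.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(5, 2) + 3.0, rng.randn(5, 2) - 3.0])
t = np.array([1.0] * 5 + [-1.0] * 5)
a = solve_hard_margin_dual(X, t)   # most a_n are ~0; the rest mark the support vectors
```

SMO instead updates two multipliers at a time analytically, which scales far better than a generic solver; the snippet above is only meant to make the dual objective and its constraints concrete.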

Once the optimal \(\mathbf{a}\) has been calculated, the decision boundary is given by

$$y(\mathbf{x}) = \mathbf{w}^{\ast T}\phi(\mathbf{x})+b = \sum_{n=1}^N a_nt_nk(\mathbf{x},\mathbf{x}_n) + b$$

The bias parameter \(b\) can be calculated by noting that any support vector \(\mathbf{x}_n\) satisfies \(t_ny(\mathbf{x}_n) = 1\).

$$ \begin{aligned} &t_n\left(\sum_{m\in\mathcal{S}}a_mt_mk(\mathbf{x}_n,\mathbf{x}_m) + b\right) = 1\\ &\therefore b^{\ast}= \frac{1}{N_{\mathcal{S}}}\sum_{n\in\mathcal{S}}\left(t_n-\sum_{m\in\mathcal{S}}a_mt_mk(\mathbf{x}_n,\mathbf{x}_m)\right) \end{aligned} $$

where \(\mathcal{S}\) denotes the set of indices of the support vectors and \(N_{\mathcal{S}}\) is their total number.

Points

To minimize the numerical error, it is more appropriate to average this equation over all support vectors instead of choosing an arbitrary support vector.
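Continuing the illustrative sketch from above (reusing `X`, `t`, `a`, and `gaussian_kernel` from that snippet; none of these are Kerasy's internals), the bias is obtained by averaging over the support vectors, after which the decision function can be evaluated:

```python
import numpy as np

def compute_bias(X, t, a, kernel, tol=1e-6):
    """b* = (1/N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) ),
    averaged over the support-vector index set S = {n : a_n > tol}."""
    S = np.where(a > tol)[0]
    K_S = np.array([[kernel(X[n], X[m]) for m in S] for n in S])
    return np.mean(t[S] - K_S @ (a[S] * t[S]))

def decision_function(x, X, t, a, b, kernel, tol=1e-6):
    """y(x) = sum_n a_n t_n k(x, x_n) + b; only support vectors contribute."""
    S = np.where(a > tol)[0]
    return sum(a[n] * t[n] * kernel(x, X[n]) for n in S) + b

b = compute_bias(X, t, a, gaussian_kernel)
print(decision_function(np.array([2.0, 2.0]), X, t, a, b, gaussian_kernel))  # > 0 => class +1
```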

Soft SVC

class kerasy.ML.svm.SVC(kernel="gaussian", C=10, **kernelargs)

In the Hard SVC, we have assumed that the training data points are linearly separable in the feature space \(\phi(\mathbf{x})\). In practice, however, the class distributions may overlap, and forcing an exact separation of the training data can harm the generalization performance.

Therefore, we introduce slack variables \(\xi_n\geq0\) to allow some of the training points to be misclassified.

$$ t_ny(\mathbf{x}_n) \geq 1 \Longrightarrow t_ny(\mathbf{x}_n) \geq 1 - \xi_n $$

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary, which means minimizing

$$C\sum_{n=1}^N\xi_n + \frac{1}{2}\|\mathbf{w}\|^2$$

\(C\) controls the trade-off between minimizing training errors and controlling model complexity. The corresponding Lagrangian is given by

$$L(\mathbf{w},b,\boldsymbol{\xi},\mathbf{a},\boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N\xi_n - \sum_{n=1}^{N}a_n\left\{t_n\left(\mathbf{w}^T\phi(\mathbf{x}_n) + b\right) - 1 + \xi_n\right\} - \sum_{n=1}^N\mu_n\xi_n$$

As before, we set the derivatives of \(L(\mathbf{w},b,\boldsymbol{\xi},\mathbf{a},\boldsymbol{\mu})\) with respect to \(\mathbf{w}\), \(b\), and \(\{\xi_n\}\) equal to zero.

$$ \begin{aligned} \frac{\partial L}{\partial \mathbf{w}} &= \mathbf{w}-\sum_{n=1}^N a_nt_n\phi(\mathbf{x}_n) =0 &\therefore \mathbf{w}^{\ast}=\sum_{n=1}^N a_nt_n\phi(\mathbf{x}_n)\\ \frac{\partial L}{\partial b} &= -\sum_{n=1}^N a_nt_n = 0 &\therefore\sum_{n=1}^Na_n^{\ast}t_n = 0\\ \frac{\partial L}{\partial \xi_n} &= C-a_n-\mu_n = 0 &\therefore a_n = C-\mu_n \end{aligned} $$

Eliminating them from \(L\) using these conditions then gives the dual representation of the maximum margin problem in which we maximize

$$ \begin{aligned} \tilde{L}\left(\mathbf{a}\right) &= \frac{1}{2}\|\mathbf{w}\|^2 +C\sum_{n=1}^N\xi_n - \|\mathbf{w}\|^2 - b\sum_{n=1}^Na_nt_n + \sum_{n=1}^Na_n - \sum_{n=1}^Na_n\xi_n - \sum_{n=1}^N\mu_n\xi_n\\ &= -\frac{1}{2}\|\mathbf{w}\|^2 - b\cdot0 + \sum_{n=1}^Na_n + \sum_{n=1}^N\left(C - a_n - \mu_n\right)\xi_n\\ &= \sum_{n=1}^Na_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^Na_na_mt_nt_mk\left(\mathbf{x}_n,\mathbf{x}_m\right)\quad k\left(\mathbf{x}_n,\mathbf{x}_m\right)=\phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) \end{aligned} $$

with respect to \(\mathbf{a}\) subject to the constraints

$$ \begin{aligned} &a_n \geq 0,\mu_n\geq 0 \quad n=1,\ldots,N\\ &\therefore 0\leq a_n \leq C\\ &\sum_{n=1}^Na_nt_n = 0 \end{aligned} $$
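In terms of the illustrative dual solver sketched in the Hard SVC section (again, not Kerasy's own code), the only change needed for the soft-margin dual is the box constraint on each multiplier:

```python
# Inside a soft-margin variant of solve_hard_margin_dual(X, t, kernel, C=10.0),
# the only change is the bound on each multiplier a_n:
C, N = 10.0, len(t)         # C: e.g. the constructor's default C=10; N: number of points
bounds = [(0.0, C)] * N     # soft margin: 0 <= a_n <= C  (hard margin used (0.0, None))
```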

In this case, as shown in the table below, we use only the data points satisfying \(0 < a_n < C\) to calculate the bias parameter \(b\).

| \(a_n\) | \(\mu_n\) | \(\xi_n\) | Meaning |
|---|---|---|---|
| \(a_n=0\) | \(\mu_n=C\) | \(\xi_n=0\) | Do not contribute to the predictive model. |
| \(0 < a_n < C\) | \(\mu_n>0\) | \(\xi_n=0\) | Satisfy \(t_ny(\mathbf{x}_n) = 1\), i.e., lie exactly on the margin. |
| \(a_n=C\) | \(\mu_n=0\) | \(\xi_n>0\) | Lie inside the margin; correctly classified if \(\xi_n\leq1\), misclassified if \(\xi_n>1\). |
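Continuing the same illustrative sketch, the bias average from the Hard SVC section is then restricted to the multipliers lying strictly inside the box:

```python
tol = 1e-6
on_margin = (a > tol) & (a < C - tol)  # 0 < a_n < C: support vectors with xi_n = 0
S = np.where(on_margin)[0]             # average b* over these indices only
```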

Multiclass SVMs

class kerasy.ML.svm.MultipleSVM(kernel="gaussian", C=10, **kernelargs)

The support vector machine is fundamentally a two-class classifier, but in practice we often face multiclass problems. In such cases, we have to combine multiple two-class SVMs.

Various methods have been proposed for this combination. One commonly used approach (the one also used in Kerasy) is to construct \(K\) separate SVMs, in which the \(k^{\text{th}}\) model \(y_k(\mathbf{x})\) is trained to distinguish whether \(\mathbf{x}\) belongs to class \(k\) or not, and the final decision is given by

$$y(\mathbf{x}) = \underset{k}{\max}y_k(\mathbf{x})$$

This is known as the one-versus-the-rest approach.
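A minimal sketch of the one-versus-the-rest combination, assuming some two-class trainer `fit_binary_svm` that returns a real-valued decision function \(y_k(\mathbf{x})\) (for example, one built from the dual solver sketched earlier; this is not the kerasy.ML.svm.MultipleSVM API):

```python
import numpy as np

def fit_one_vs_rest(X, labels, fit_binary_svm):
    """Train one two-class SVM per class: class k (+1) versus the rest (-1)."""
    classes = np.unique(labels)
    models = {}
    for k in classes:
        t = np.where(labels == k, 1.0, -1.0)
        models[k] = fit_binary_svm(X, t)   # returns a callable x -> y_k(x)
    return classes, models

def predict_one_vs_rest(x, classes, models):
    """Assign x to the class whose decision value y_k(x) is largest."""
    scores = [models[k](x) for k in classes]
    return classes[int(np.argmax(scores))]
```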

Warning

This heuristic approach suffers from some problems:

  • There is no guarantee that the real-valued quantities \(y_k(\mathbf{x})\) for different classifiers will have appropriate scales.
  • The training sets are imbalanced: each classifier is trained with far more negative examples than positive ones.