Support Vector Machine
Notebook
Example Notebook: Kerasy.examples.svm.ipynb
Hard SVC
class kerasy.ML.svm.hardSVC(kernel="gaussian", **kernelargs)
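A minimal usage sketch. The `fit`/`predict` method names and the `gamma` kernel argument are assumptions (a scikit-learn style interface, with `gamma` forwarded through `**kernelargs`); see the example notebook above for the actual interface.

```python
import numpy as np
from kerasy.ML.svm import hardSVC

# Tiny toy problem (not linearly separable in input space); labels are in {-1, +1}.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([1, 1, -1, -1])

model = hardSVC(kernel="gaussian", gamma=1.0)  # gamma passed via **kernelargs (assumed name)
model.fit(X, t)                                # assumed method name
print(model.predict(X))                        # assumed method name
```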
Start from the "two-class classification" problem using "linear models", which "separate the training data set linearly". Under the assumption, we can denote
- The training data:
$$D = \left\{(t_n,\mathbf{x}_n)\in \left\{-1, 1\right\} \times \mathbb{R}^m | n = 1,\ldots, N\right\}$$
- The linear model:
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b$$
- The assumption that every training data point is correctly classified:
$$t_ny(\mathbf{x}_n) > 0$$
- The margin, i.e. the distance from the decision boundary to the closest data point, which after rescaling \(\mathbf{w}\) and \(b\) (see the derivation below) is given by:
$$\frac{1}{\|\mathbf{w}\|}$$
※ If any of these points are unclear, please check here.
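For reference, here is a short sketch of why the margin takes this form, following the standard maximum-margin argument. The distance of a correctly classified point \(\mathbf{x}_n\) from the decision surface is

$$\frac{t_ny(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n\left(\mathbf{w}^T\phi(\mathbf{x}_n)+b\right)}{\|\mathbf{w}\|}$$

Since rescaling \(\mathbf{w}\rightarrow\kappa\mathbf{w}\) and \(b\rightarrow\kappa b\) leaves this distance unchanged, we are free to rescale so that the closest point satisfies \(t_n\left(\mathbf{w}^T\phi(\mathbf{x}_n)+b\right)=1\). All data points then satisfy \(t_ny(\mathbf{x}_n)\geq1\), and the margin becomes \(1/\|\mathbf{w}\|\).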
Points
In support vector machines, the decision boundary is chosen to be the one for which the margin is maximized.
From the discussion above, the optimization problem (minimization) is given by

$$\underset{\mathbf{w},b}{\arg\min}\ \frac{1}{2}\|\mathbf{w}\|^2\quad\text{subject to}\quad t_ny(\mathbf{x}_n)\geq1\quad(n=1,\ldots,N)$$

Introducing Lagrange multipliers, we obtain the Lagrangian function

$$L\left(\mathbf{w},b,\mathbf{a}\right) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}a_n\left\{t_ny(\mathbf{x}_n)-1\right\}$$

where \(\mathbf{a} = \left(a_1,\ldots,a_N\right)^T\) are the Lagrange multipliers, with \(a_n\geq0\).
Warning
Note the minus sign in front of the Lagrange multiplier term. Check HERE.
To solve this optimization problem, set the derivatives of \(L\left(\mathbf{w},b,\mathbf{a}\right)\) with respect to \(\mathbf{w}\) and \(b\) equal to zero, which gives

$$\mathbf{w} = \sum_{n=1}^{N}a_nt_n\phi(\mathbf{x}_n),\qquad \sum_{n=1}^{N}a_nt_n = 0$$

Eliminating \(\mathbf{w}\) and \(b\) from \(L\left(\mathbf{w},b,\mathbf{a}\right)\) using these conditions then gives the dual representation of the maximum margin problem, in which we maximize

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N}a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_na_mt_nt_mk(\mathbf{x}_n,\mathbf{x}_m)$$

with respect to \(\mathbf{a}\) subject to the constraints

$$a_n\geq0\quad(n=1,\ldots,N),\qquad \sum_{n=1}^{N}a_nt_n = 0$$

where \(k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')\) is the kernel function.
This takes the form of a quadratic programming problem, and there are well-known algorithms for solving it (e.g. SMO, Sequential Minimal Optimization).
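As a rough illustration (not Kerasy's actual implementation), the sketch below assembles the dual as a quadratic form using a Gaussian kernel; an SMO or generic QP solver would then maximize it under the constraints above. The helper names and the `gamma` parameter are assumptions; a full solving sketch is given in the Soft SVC section.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2); `gamma` is an assumed parameter name."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def dual_objective(a, t, K):
    """L~(a) = sum_n a_n - 1/2 * sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)."""
    Q = (t[:, None] * t[None, :]) * K          # Q_nm = t_n t_m k(x_n, x_m)
    return a.sum() - 0.5 * a @ Q @ a
```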
Once the optimal \(\mathbf{a}\) has been calculated, new data points are classified according to the sign of

$$y(\mathbf{x}) = \sum_{n=1}^{N}a_nt_nk(\mathbf{x},\mathbf{x}_n) + b$$

The bias parameter \(b\) can be calculated by noting that any support vector \(\mathbf{x}_n\) satisfies \(t_ny(\mathbf{x}_n) = 1\).
Points
To minimize numerical error, it is more appropriate to average this condition over all support vectors rather than choosing an arbitrary support vector:

$$b = \frac{1}{N_{\mathcal{S}}}\sum_{n\in\mathcal{S}}\left(t_n - \sum_{m\in\mathcal{S}}a_mt_mk(\mathbf{x}_n,\mathbf{x}_m)\right)$$

where \(\mathcal{S}\) is the set of support-vector indices and \(N_{\mathcal{S}}\) is their number.
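A minimal sketch of the prediction step once the optimal \(\mathbf{a}\) is available (e.g. from a QP solver); `K` is the training kernel matrix and `K_new[i, n] = k(x_new_i, x_n)`. This is an illustration, not Kerasy's actual implementation.

```python
import numpy as np

def bias(a, t, K, tol=1e-8):
    """b averaged over all support vectors (points with a_n > 0)."""
    sv = a > tol
    return np.mean(t[sv] - (a[sv] * t[sv]) @ K[np.ix_(sv, sv)])

def decision_function(a, t, b, K_new, tol=1e-8):
    """y(x) = sum_n a_n t_n k(x, x_n) + b; the sign gives the predicted class."""
    sv = a > tol
    return K_new[:, sv] @ (a[sv] * t[sv]) + b
```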
Soft SVC
class kerasy.ML.svm.SVC(kernel="gaussian", C=10, **kernelargs)
In the Hard SVC, we assumed that the training data points are linearly separable in the feature space \(\phi(\mathbf{x})\). In practice, however, the class distributions may overlap, and forcing an exact separation of such data can lead to poor generalization performance.
Therefore, we introduce slack variables \(\xi_n\geq0\), one per training data point, to allow some of the training points to be misclassified or to lie inside the margin.
Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

$$C\sum_{n=1}^{N}\xi_n + \frac{1}{2}\|\mathbf{w}\|^2$$

where \(C>0\) controls the trade-off between minimizing training errors and controlling model complexity. The corresponding Lagrangian is given by

$$L\left(\mathbf{w},b,\mathbf{a}\right) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}a_n\left\{t_ny(\mathbf{x}_n) - 1 + \xi_n\right\} - \sum_{n=1}^{N}\mu_n\xi_n$$

where \(a_n\geq0\) and \(\mu_n\geq0\) are Lagrange multipliers.
As before, we set the derivatives of \(L\left(\mathbf{w},b,\mathbf{a}\right)\) with respect to \(\mathbf{w}\), \(b\), and \(\{\xi_n\}\) equal to zero, which gives

$$\mathbf{w} = \sum_{n=1}^{N}a_nt_n\phi(\mathbf{x}_n),\qquad \sum_{n=1}^{N}a_nt_n = 0,\qquad a_n = C - \mu_n$$

Eliminating \(\mathbf{w}\), \(b\), and \(\{\xi_n\}\) from \(L\) using these conditions then gives the dual representation of the maximum margin problem, in which we maximize the same form as before,

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N}a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_na_mt_nt_mk(\mathbf{x}_n,\mathbf{x}_m)$$

with respect to \(\mathbf{a}\) subject to the constraints

$$0\leq a_n\leq C\quad(n=1,\ldots,N),\qquad \sum_{n=1}^{N}a_nt_n = 0$$
In this case, as shown in the table below, we use only the data points with \(0 < a_n < C\) to calculate the bias parameter \(b\) (see the sketch after the table).
\(a_n\) | \(\mu_n\) | \(\xi_n\) | Meaning |
---|---|---|---|
\(a_n=0\) | | | Not support vectors; they do not contribute to the predictive model. |
\(0 < a_n < C\) | \(\mu_n>0\) | \(\xi_n=0\) | Lie on the margin and satisfy \(t_ny(\mathbf{x}_n) = 1\). |
\(a_n=C\) | \(\mu_n=0\) | \(\xi_n>0\) | Lie inside the margin; correctly classified if \(\xi_n\leq1\), misclassified if \(\xi_n>1\). |
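As a rough illustration of the box-constrained dual, the sketch below solves it on a small dataset with SciPy's SLSQP optimizer rather than SMO, and computes \(b\) from the points with \(0 < a_n < C\). This is not Kerasy's actual implementation; `gaussian_kernel` and `gamma` are the assumed helpers from the Hard SVC sketch.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X1, X2, gamma=1.0):
    """Same assumed helper as in the Hard SVC sketch."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_soft_svc(X, t, C=10.0, gamma=1.0):
    """Maximize L~(a) subject to 0 <= a_n <= C and sum_n a_n t_n = 0."""
    N = len(t)
    K = gaussian_kernel(X, X, gamma)
    Q = (t[:, None] * t[None, :]) * K                  # Q_nm = t_n t_m k(x_n, x_m)

    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()     # minimize the negative dual
    neg_grad = lambda a: Q @ a - np.ones(N)

    res = minimize(neg_dual, np.zeros(N), jac=neg_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ t}])
    a = res.x

    # Bias: average over the points with 0 < a_n < C (they lie exactly on the margin).
    sv = a > 1e-8
    on_margin = sv & (a < C - 1e-8)
    b = np.mean(t[on_margin] - (a[sv] * t[sv]) @ K[np.ix_(sv, on_margin)])
    return a, b
```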
Multiclass SVMs
class kerasy.ML.svm.MultipleSVM(kernel="gaussian", C=10, **kernelargs)
The support vector machine is fundamentally a two-class classifier, but in practice we often face multi-class problems. In that case, we have to combine multiple two-class SVMs.
Various methods have therefore been proposed for combining them. One commonly used approach (also used in Kerasy) is to construct \(K\) separate SVMs, in which the \(k^{\text{th}}\) model \(y_k(\mathbf{x})\) is trained to distinguish whether \(\mathbf{x}\) belongs to class \(k\) or not, and the final decision is given by

$$y(\mathbf{x}) = \max_{k}y_k(\mathbf{x})$$
This is known as the one-versus-the-rest approach.
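A minimal sketch of this heuristic: train \(K\) binary classifiers (class \(k\) versus the rest) and pick \(\arg\max_k y_k(\mathbf{x})\). The `make_binary_svc` factory and the `fit`/`decision_function` method names are assumed, scikit-learn style placeholders, not Kerasy's actual API.

```python
import numpy as np

def fit_one_vs_rest(X, labels, make_binary_svc):
    """labels in {0, ..., K-1}; make_binary_svc() returns a fresh two-class SVC."""
    models = []
    for k in np.unique(labels):
        t = np.where(labels == k, 1, -1)   # class k -> +1, every other class -> -1
        svc = make_binary_svc()
        svc.fit(X, t)                      # assumed method name
        models.append(svc)
    return models

def predict_one_vs_rest(models, X_new):
    """Final decision: y(x) = argmax_k y_k(x)."""
    scores = np.stack([m.decision_function(X_new) for m in models], axis=1)
    return scores.argmax(axis=1)
```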
Warning
This heuristic approach suffers from some problems:
- There is no guarantee that the real-valued quantities \(y_k(\mathbf{x})\) for different classifiers will have appropriate scales.
- The training sets are imbalanced.