Tuesday, December 4, 2018

Support Vector Machines

Hard Margin Classifier

Let $x_i$ be a high-dimensional data point indexed by $i$, and $y_i \in \{1, -1\}$ be the corresponding label.

Suppose the dataset is linearly separable; then there exists a boundary that clearly separates the two classes of data:
$$w^T x + b = 0 \;\;\; ...(1)$$

where $w$ is a vector having the same dimension as $x$ and $b$ is a scalar.

If we scale $w$ and $b$ properly, there also exist upper and lower boundaries:
$$w^T x + b = 1 \;\;\; ...(2)$$

$$w^T x + b = -1 \;\;\; ...(3)$$

The unit vector in the direction of $w$ is
$$\tilde w = \frac{w}{\| w \|}$$

Let $x_1, x_2$ be two arbitrary points on (2) and (3) respectively. Substituting these into (2) and (3), we get:

$$\|w\| \tilde w^T x_1 + b = 1 \;\;\; ...(4)$$

$$\|w\| \tilde w^T x_2 + b = -1 \;\;\; ...(5)$$

The margin, which is the perpendicular distance between the upper and lower boundaries, is calculated by:

$$\tilde w^T x_1 - \tilde w^T x_2 = \frac{1-b}{\|w\|} + \frac{1+b}{\|w\|} = \frac{2}{\|w\|} \;\;\; ...(6)$$

The motivation of the hard margin support vector machine (SVM) classifier is to find the boundary as defined in (1) such that the margin calculated in (6) is maximized.

For computational convenience, maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.
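
As a quick numeric check of (6), here is a minimal sketch assuming NumPy; the hyperplane coefficients and the two boundary points are made up for illustration:

```python
import numpy as np

# A made-up separating hyperplane w^T x + b = 0 in 2-D.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

# Points lying exactly on the upper (w^T x + b = 1) and lower (w^T x + b = -1) boundaries.
x1 = np.array([1.0, 0.0])        # 3*1 + 4*0 - 2 = 1
x2 = np.array([1.0 / 3.0, 0.0])  # 3*(1/3) + 4*0 - 2 = -1

w_unit = w / np.linalg.norm(w)   # the unit vector w~

# Perpendicular distance between the two boundaries, as in (6).
margin = w_unit @ x1 - w_unit @ x2
print(margin, 2 / np.linalg.norm(w))   # both evaluate to 0.4
```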

We also want to constrain the upper and lower boundaries to clearly separate the two classes $y_i \in \{1, -1\}$:

$$w^T x_i + b \ge 1 \;\;\; \text{if } y_i = 1 \;\;\; ...(7)$$

$$w^T x_i + b \le -1 \;\;\; \text{if } y_i = -1 \;\;\; ...(8)$$

Combining (7) and (8) into a single constraint, we get:
$$1 - y_i(w^T x_i + b) \le 0 \;\;\; ...(9)$$

Therefore, we form a constrained optimization problem:
$$\min_{w,b} \frac{1}{2}\|w\|^2$$

$$\text{s.t.} \quad 1 - y_i(w^T x_i + b) \le 0$$

By solving this problem using quadratic programming, we obtain the classifier $w^T x + b$, which classifies $x$ as positive if the value is greater than zero and as negative otherwise.
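
As an illustration, the quadratic program can also be solved numerically; below is a minimal sketch using `scipy.optimize.minimize` with the SLSQP method on a tiny made-up dataset (a dedicated QP solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up linearly separable dataset: x_i in R^2, y_i in {1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameter vector p = [w_1, w_2, b].
def objective(p):
    w = p[:2]
    return 0.5 * (w @ w)                    # (1/2)||w||^2

# SLSQP expects inequality constraints g(p) >= 0, so we rewrite (9)
# as y_i (w^T x_i + b) - 1 >= 0 for every data point.
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(y))]

result = minimize(objective, x0=np.zeros(3), method="SLSQP",
                  constraints=constraints)
w, b = result.x[:2], result.x[2]

# Classify by the sign of w^T x + b.
print(np.sign(X @ w + b))                   # should match y
```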

Soft Margin Classifier

If our dataset is not linearly separable, we can use the soft margin SVM classifier.

We define a slack variable $\xi_i \ge 0$ for each data point $x_i$. It is positive if $y_i = 1$ but the point cannot lie above the upper boundary, or if $y_i = -1$ but the point cannot lie below the lower boundary; otherwise $\xi_i = 0$. Our constraints are relaxed as:

$$w^T x_i + b \ge 1 - \xi_i \;\;\; \text{if } y_i = 1 \;\;\; ...(10)$$

$$w^T x_i + b \le -1 + \xi_i \;\;\; \text{if } y_i = -1 \;\;\; ...(11)$$

Combining (10) and (11) into a single constraint, we get:
$$1 - y_i(w^T x_i + b) \le \xi_i \;\;\; ...(12)$$

And $\xi_i$ must not be less than zero:
$$0 \le \xi_i \;\;\; ...(13)$$

Combining (12) and (13), we get:
$$\max \{0, 1 - y_i(w^T x_i + b)\} \le \xi_i \;\;\; ...(14)$$

Besides maximizing the margin, we also want to minimize the slack variables $\xi_i$. Our objective function becomes:
$$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \;\;\; ...(15)$$

where $C$ is a constant that penalizes the slack variables.

By combining (15) and (14), we obtain an unconstrained optimization problem:
$$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \max \{0, 1 - y_i(w^T x_i + b)\}$$

By solving this problem with gradient descent, we obtain the soft margin SVM classifier $w^T x + b$.
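
As an illustration, here is a minimal NumPy sketch of minimizing this objective with subgradient descent (the hinge term is not differentiable at the kink, so a subgradient is used); the dataset, learning rate, number of steps, and $C$ are made up:

```python
import numpy as np

# Made-up 2-D dataset that is not linearly separable: the positive point (2, 2)
# lies on the segment between the negative points (-2, -2) and (3, 3).
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])

C, lr, steps = 1.0, 0.01, 2000
w, b = np.zeros(2), 0.0

for _ in range(steps):
    margins = y * (X @ w + b)
    active = margins < 1.0        # points violating the margin contribute to the hinge term
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
print(np.sign(X @ w + b))         # predictions of the learned soft margin classifier
```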

Kernel Functions

By converting the SVM margin maximization problem into its dual problem, we can solve it by computing the pairwise dot products of data points $\langle x^{(i)}, x^{(j)} \rangle, \; \forall i, j \in \{1, ..., m\}$, for a dataset with $m$ data points.

On the other hand, if an $n$-dimensional dataset $X = [X_1, X_2, ..., X_n]$ with labels $Y \in \{1, -1\}$ is not linearly separable, we can try to expand its dimension by transforming the variables, e.g. $\phi(X) = [X_1, ..., X_n, X_1^2, ..., X_n^2]$. The transformed data may become linearly separable in the higher-dimensional space.
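
For instance, here is a minimal made-up 1-D example showing how adding a squared feature can make non-separable data separable (the specific points and the separating line are chosen for illustration):

```python
import numpy as np

# 1-D points: the positive class sits on both sides of the negative class,
# so no single threshold on x separates them.
x = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([1, -1, -1, 1])

# Transform to phi(x) = [x, x^2]. In this 2-D feature space the classes
# are separated by the line x^2 = 5, i.e. w = [0, 1], b = -5.
phi = np.stack([x, x ** 2], axis=1)
print(np.sign(phi @ np.array([0.0, 1.0]) - 5.0))   # matches y
```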

The kernel technique replaces the pairwise dot products $\langle x^{(i)}, x^{(j)} \rangle$ with a specific kernel function $K(x^{(i)}, x^{(j)})$.

Let’s examine the effect of the kernel function $K(x, z) = (x^T z)^2$:
$$K(x, z) = \left(\sum_{i=1}^n x_i z_i\right)\left(\sum_{j=1}^n x_j z_j\right) = \sum_{i=1}^n \sum_{j=1}^n x_i x_j z_i z_j = \phi(x)^T \phi(z)$$

where the transform function is $\phi(x) = [x_1 x_1, x_1 x_2, ..., x_1 x_n, ..., x_n x_n]^T$.

We see that the kernel function computes the dot product of the transformed variables without performing the transformation explicitly, which is much more computationally efficient.
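
A quick numeric check of this identity, assuming NumPy (the vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

# Kernel evaluated directly in the original n-dimensional space: O(n).
k_direct = (x @ z) ** 2

# The same value via the explicit transform phi(x) = [x_i * x_j for all i, j]: O(n^2).
phi = lambda v: np.outer(v, v).ravel()
k_explicit = phi(x) @ phi(z)

print(k_direct, k_explicit)   # both are 1024.0
```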

Some popular kernel functions are:

  1. Polynomial Kernel
    $K(x, z) = (x^T z + c)^d$
  2. Gaussian Kernel
    $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$

In general, if $x$ and $z$ are similar, the kernel value $K(x, z)$ is large; otherwise it is small.
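
As a usage sketch (assuming scikit-learn is available; it is not part of the derivation above), a kernelized soft margin SVM can be trained by passing a kernel choice to `sklearn.svm.SVC`:

```python
import numpy as np
from sklearn.svm import SVC

# A made-up 1-D dataset that is not separable by a single threshold on x.
X = np.array([[-3.0], [-1.0], [1.0], [3.0]])
y = np.array([1, -1, -1, 1])

# kernel='poly' and kernel='rbf' correspond to the polynomial and Gaussian kernels above;
# C penalizes the slack variables, and gamma plays the role of 1/(2*sigma^2) for 'rbf'.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict(X))   # expected to recover y on this toy example
```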
