Tuesday, December 4, 2018

Support Vector Machines

Hard Margin Classifier

Let $x_i$ be a high-dimensional data point indexed by $i$, and $y_i \in \{1, -1\}$ be the corresponding label.

Suppose the dataset is linearly separable; then there exists a boundary that clearly separates the two classes of data:
$$w^T x + b = 0 \;\;\; ...(1)$$

where $w$ is a vector having the same dimension as $x$ and $b$ is a scalar.

If we scale $w$ and $b$ properly, there also exist upper and lower boundaries:
$$w^T x + b = 1 \;\;\; ...(2)$$

$$w^T x + b = -1 \;\;\; ...(3)$$

The unit vector in the direction of $w$ is
$$\tilde w = \frac{w}{\| w \|}$$

Let $x_1, x_2$ be two arbitrary points on (2) and (3) respectively. Substituting these into (2) and (3), we get:

$$\|w\| \tilde w^T x_1 + b = 1 \;\;\; ...(4)$$

$$\|w\| \tilde w^T x_2 + b = -1 \;\;\; ...(5)$$

The margin, which is the perpendicular distance between the upper and lower boundaries, is calculated by:

$$\tilde w^T x_1 - \tilde w^T x_2 = \frac{1-b}{\|w\|} + \frac{1+b}{\|w\|} = \frac{2}{\|w\|} \;\;\; ...(6)$$

The motivation of the hard margin support vector machine (SVM) classifier is to find the boundary as defined in (1) such that the margin calculated in (6) is maximized.

For computational convenience, maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.
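
As a quick numeric check of (6), here is a minimal sketch assuming NumPy; the hyperplane coefficients and the two boundary points are made up for illustration:

```python
import numpy as np

# A made-up separating hyperplane w^T x + b = 0 in 2-D.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

# Points lying exactly on the upper (w^T x + b = 1) and lower (w^T x + b = -1) boundaries.
x1 = np.array([1.0, 0.0])        # 3*1 + 4*0 - 2 = 1
x2 = np.array([1.0 / 3.0, 0.0])  # 3*(1/3) + 4*0 - 2 = -1

w_unit = w / np.linalg.norm(w)   # the unit vector w~

# Perpendicular distance between the two boundaries, as in (6).
margin = w_unit @ x1 - w_unit @ x2
print(margin, 2 / np.linalg.norm(w))   # both evaluate to 0.4
```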

We also want to constrain the upper and lower boundaries to clearly separate the two classes $y_i \in \{1, -1\}$:

$$w^T x_i + b \ge 1 \;\;\; \text{if } y_i = 1 \;\;\; ...(7)$$

$$w^T x_i + b \le -1 \;\;\; \text{if } y_i = -1 \;\;\; ...(8)$$

Combining (7) and (8) into a single constraint, we get:
$$1 - y_i(w^T x_i + b) \le 0 \;\;\; ...(9)$$

Therefore, we form a constrained optimization problem:
$$\min_{w,b} \frac{1}{2}\|w\|^2$$

$$\text{s.t.} \quad 1 - y_i(w^T x_i + b) \le 0$$

By solving this problem using quadratic programming, we obtain the classifier $w^T x + b$, which classifies $x$ as positive if the value is greater than zero and as negative otherwise.
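
As an illustration, the quadratic program can also be solved numerically; below is a minimal sketch using `scipy.optimize.minimize` with the SLSQP method on a tiny made-up dataset (a dedicated QP solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up linearly separable dataset: x_i in R^2, y_i in {1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameter vector p = [w_1, w_2, b].
def objective(p):
    w = p[:2]
    return 0.5 * (w @ w)                    # (1/2)||w||^2

# SLSQP expects inequality constraints g(p) >= 0, so we rewrite (9)
# as y_i (w^T x_i + b) - 1 >= 0 for every data point.
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(y))]

result = minimize(objective, x0=np.zeros(3), method="SLSQP",
                  constraints=constraints)
w, b = result.x[:2], result.x[2]

# Classify by the sign of w^T x + b.
print(np.sign(X @ w + b))                   # should match y
```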

Soft Margin Classifier

If our dataset is not linearly separable, we can use the soft margin SVM classifier.

We define a slack variable $\xi_i \ge 0$ for each data point $x_i$. It is positive if $y_i = 1$ but the point cannot lie above the upper boundary, or if $y_i = -1$ but the point cannot lie below the lower boundary; otherwise $\xi_i = 0$. Our constraints are relaxed as:

$$w^T x_i + b \ge 1 - \xi_i \;\;\; \text{if } y_i = 1 \;\;\; ...(10)$$

$$w^T x_i + b \le -1 + \xi_i \;\;\; \text{if } y_i = -1 \;\;\; ...(11)$$

Combining (10) and (11) into a single constraint, we get:
$$1 - y_i(w^T x_i + b) \le \xi_i \;\;\; ...(12)$$

And $\xi_i$ must not be less than zero:
$$0 \le \xi_i \;\;\; ...(13)$$

Combining (12) and (13), we get:
$$\max \{0, 1 - y_i(w^T x_i + b)\} \le \xi_i \;\;\; ...(14)$$

Besides maximizing the margin, we also want to minimize the slack variables $\xi_i$. Our objective function becomes:
$$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \;\;\; ...(15)$$

where $C$ is a constant that penalizes the slack variables.

By combining (15) and (14), we obtain an unconstrained optimization problem:
$$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \max \{0, 1 - y_i(w^T x_i + b)\}$$

By solving this problem with gradient descent, we obtain the soft margin SVM classifier $w^T x + b$.
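
As an illustration, here is a minimal NumPy sketch of minimizing this objective with subgradient descent (the hinge term is not differentiable at the kink, so a subgradient is used); the dataset, learning rate, number of steps, and $C$ are made up:

```python
import numpy as np

# Made-up 2-D dataset that is not linearly separable: the positive point (2, 2)
# lies on the segment between the negative points (-2, -2) and (3, 3).
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])

C, lr, steps = 1.0, 0.01, 2000
w, b = np.zeros(2), 0.0

for _ in range(steps):
    margins = y * (X @ w + b)
    active = margins < 1.0        # points violating the margin contribute to the hinge term
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
print(np.sign(X @ w + b))         # predictions of the learned soft margin classifier
```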

Kernel Functions

By converting the SVM margin maximization problem into its dual problem, we can solve it by computing the pairwise dot products of data points $\langle x^{(i)}, x^{(j)} \rangle, \; \forall i, j \in \{1, ..., m\}$, for a dataset with $m$ data points.

On the other hand, if an $n$-dimensional dataset $X = [X_1, X_2, ..., X_n]$ with labels $Y \in \{1, -1\}$ is not linearly separable, we can try to expand its dimension by transforming the variables, e.g. $\phi(X) = [X_1, ..., X_n, X_1^2, ..., X_n^2]$. The transformed data may become linearly separable in the higher-dimensional space.
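
For instance, here is a minimal made-up 1-D example showing how adding a squared feature can make non-separable data separable (the specific points and the separating line are chosen for illustration):

```python
import numpy as np

# 1-D points: the positive class sits on both sides of the negative class,
# so no single threshold on x separates them.
x = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([1, -1, -1, 1])

# Transform to phi(x) = [x, x^2]. In this 2-D feature space the classes
# are separated by the line x^2 = 5, i.e. w = [0, 1], b = -5.
phi = np.stack([x, x ** 2], axis=1)
print(np.sign(phi @ np.array([0.0, 1.0]) - 5.0))   # matches y
```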

The kernel technique replaces the pairwise dot products $\langle x^{(i)}, x^{(j)} \rangle$ with a specific kernel function $K(x^{(i)}, x^{(j)})$.

Let’s examine the effect of the kernel function $K(x, z) = (x^T z)^2$:
$$K(x, z) = \left(\sum_{i=1}^n x_i z_i\right)\left(\sum_{j=1}^n x_j z_j\right) = \sum_{i=1}^n \sum_{j=1}^n x_i x_j z_i z_j = \phi(x)^T \phi(z)$$

where the transform function is $\phi(x) = [x_1 x_1, x_1 x_2, ..., x_1 x_n, ..., x_n x_n]^T$.

We see that the kernel function computes the dot product of the transformed variables without performing the transformation explicitly, which is much more computationally efficient.
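
A quick numeric check of this identity, assuming NumPy (the vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

# Kernel evaluated directly in the original n-dimensional space: O(n).
k_direct = (x @ z) ** 2

# The same value via the explicit transform phi(x) = [x_i * x_j for all i, j]: O(n^2).
phi = lambda v: np.outer(v, v).ravel()
k_explicit = phi(x) @ phi(z)

print(k_direct, k_explicit)   # both are 1024.0
```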

Some popular kernel functions are:

  1. Polynomial Kernel
    $K(x, z) = (x^T z + c)^d$
  2. Gaussian Kernel
    $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$

In general, if $x$ and $z$ are similar, the kernel value $K(x, z)$ is large; otherwise it is small.
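
As a usage sketch (assuming scikit-learn is available; it is not part of the derivation above), a kernelized soft margin SVM can be trained by passing a kernel choice to `sklearn.svm.SVC`:

```python
import numpy as np
from sklearn.svm import SVC

# A made-up 1-D dataset that is not separable by a single threshold on x.
X = np.array([[-3.0], [-1.0], [1.0], [3.0]])
y = np.array([1, -1, -1, 1])

# kernel='poly' and kernel='rbf' correspond to the polynomial and Gaussian kernels above;
# C penalizes the slack variables, and gamma plays the role of 1/(2*sigma^2) for 'rbf'.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict(X))   # expected to recover y on this toy example
```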
