Hard Margin Classifier
Let $x_i$ be a high-dimensional data point indexed by $i$, and $y_i \in \{1, -1\}$ be the corresponding label.
Suppose the dataset is linearly separable. Then there exists a boundary that clearly separates the two classes of data:
$w^T x + b = 0$ ...(1)
where $w$ is a vector having the same dimension as $x$ and $b$ is a scalar.
If we scale $w$ and $b$ properly, there also exist upper and lower boundaries:
$w^T x + b = 1$ ...(2)
$w^T x + b = -1$ ...(3)
The unit vector in the direction of $w$ is
$\tilde{w} = \frac{w}{\|w\|}$
Let $x_1, x_2$ be two arbitrary points on (2) and (3) respectively. Substituting these into (2) and (3), we get:
$\|w\| \tilde{w}^T x_1 + b = 1$ ...(4)
$\|w\| \tilde{w}^T x_2 + b = -1$ ...(5)
The margin, which is the perpendicular distance between the upper and lower boundaries, is calculated by:
$\tilde{w}^T x_1 - \tilde{w}^T x_2 = \frac{1 - b}{\|w\|} + \frac{1 + b}{\|w\|} = \frac{2}{\|w\|}$ ...(6)
The motivation of the hard margin support vector machine (SVM) classifier is to find the boundary as defined in (1) such that the margin calculated in (6) is maximized.
For computational convenience, maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.
We also want to constrain the upper and lower boundaries to clearly separate the two classes $y_i \in \{1, -1\}$:
$w^T x_i + b \geq 1$ if $y_i = 1$ ...(7)
$w^T x_i + b \leq -1$ if $y_i = -1$ ...(8)
Combining (7) and (8) into a single constraint, we get:
$1 - y_i(w^T x_i + b) \leq 0$ ...(9)
Therefore, we form a constrained optimization problem:
$\min_{w,b} \frac{1}{2}\|w\|^2$
s.t.
$1 - y_i(w^T x_i + b) \leq 0$
By solving this problem using quadratic programming, we find the classifier $w^T x + b$, which classifies $x$ as positive if the value is greater than zero and negative otherwise.
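As an illustration, here is a minimal sketch of the primal hard margin QP solved with the CVXOPT quadratic programming solver. The helper name `hard_margin_svm` and the choice of CVXOPT are assumptions of this sketch, `X` is an $m \times n$ NumPy array, `y` holds the $\pm 1$ labels, and the data are assumed linearly separable (otherwise the QP is infeasible):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Sketch: solve the primal hard margin QP  min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1."""
    m, n = X.shape
    y = y.astype(float)
    # Decision variable z = [w, b] has n + 1 entries.
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)                      # the quadratic term acts on w only, not on b
    q = np.zeros(n + 1)
    # Constraint (9): 1 - y_i (w^T x_i + b) <= 0, rewritten as  -y_i [x_i, 1] z <= -1.
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]                         # w and b
```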
Soft Margin Classifier
If our dataset is not linearly separable, we can use the soft margin SVM classifier.
We define a slack variable $\xi_i \geq 0$ for each data point $x_i$. It is positive if $y_i = 1$ but the point cannot lie above the upper boundary, or if $y_i = -1$ but the point cannot lie below the lower boundary; otherwise $\xi_i = 0$. Our constraints are relaxed as:
$w^T x_i + b \geq 1 - \xi_i$ if $y_i = 1$ ...(10)
$w^T x_i + b \leq -1 + \xi_i$ if $y_i = -1$ ...(11)
Combining (10) and (11) into a single constraint, we get:
$1 - y_i(w^T x_i + b) \leq \xi_i$ ...(12)
And $\xi_i$ must not be less than zero:
$0 \leq \xi_i$ ...(13)
Combining (12) and (13), we get:
$\max\{0, 1 - y_i(w^T x_i + b)\} \leq \xi_i$ ...(14)
Besides maximizing the margin, we also want to minimize the slack variables $\xi_i$. Our objective function is changed to:
$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ ...(15)
where $C$ is a constant that penalizes the slack variables.
Combining (15) and (14), we obtain an unconstrained optimization problem:
$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \max\{0, 1 - y_i(w^T x_i + b)\}$
By solving this problem with gradient descent, we obtain the soft margin SVM classifier $w^T x + b$.
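A minimal sketch of this approach, using plain batch subgradient descent on the hinge loss objective above; the function name, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=1000):
    """Sketch: minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)) by subgradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                              # points with non-zero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```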
Kernel Functions
By changing the SVM margin maximization problem into its dual problem, we can solve it by calculating the pairwise dot products of data points $\langle x^{(i)}, x^{(j)} \rangle$, $\forall i, j \in \{1, \ldots, m\}$, for a dataset with $m$ data points.
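For reference, the standard textbook dual of the soft margin problem (derivation omitted here) has the form:

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

where the data points enter only through their dot products.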
On the other hand, if an $n$-dimensional dataset $X = [X_1, X_2, \ldots, X_n]$ is not linearly separable with label $Y \in \{1, -1\}$, we can try to expand its dimension by transforming the variables, e.g. $\phi(X) = [X_1, \ldots, X_n, X_1^2, \ldots, X_n^2]$. The transformed data may become linearly separable in the higher dimensional space.
The kernel technique replaces the dot product of pairwise data points $\langle x^{(i)}, x^{(j)} \rangle$ with a specific kernel function $K(x^{(i)}, x^{(j)})$.
Let’s examine the effect of the kernel function $K(x, z) = (x^T z)^2$:
$K(x, z) = \left(\sum_{i}^{n} x_i z_i\right)\left(\sum_{j}^{n} x_j z_j\right) = \sum_{i}^{n} \sum_{j}^{n} x_i x_j z_i z_j = \phi(x)^T \phi(z)$
where the transform function is $\phi(x) = [x_1 x_1, x_1 x_2, \ldots, x_1 x_n, \ldots, x_n x_n]^T$.
We see that the kernel function computes the dot product of the transformed variables without performing the transformation explicitly. This is much more computationally efficient.
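A quick numerical check of this identity; the helper name `phi` and the random test vectors are assumptions of the sketch:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.normal(size=4), rng.normal(size=4)

k_direct = (x @ z) ** 2                 # kernel evaluated in the original space: O(n)
k_mapped = phi(x) @ phi(z)              # dot product after the explicit map: O(n^2)
print(np.isclose(k_direct, k_mapped))   # True
```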
Some popular kernel functions are:
- Polynomial Kernel
$K(x, z) = (x^T z + c)^d$
- Gaussian Kernel
$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
In general, if $x$ and $z$ are similar, the kernel value $K(x, z)$ is large; otherwise, the value is small, as the sketch below illustrates for the Gaussian kernel.
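Below is a minimal sketch of these two kernels; the parameter defaults such as `c=1.0`, `d=2`, and `sigma=1.0` are illustrative assumptions:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^d."""
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x))                           # 1.0: identical points give the largest value
print(gaussian_kernel(x, x + np.array([5.0, 5.0])))    # close to 0: distant points give a small value
```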