Thursday, December 6, 2018

Principal Component Analysis

Eigenvector Decomposition

Let $A \in \R^{n \times n}$ be an $n \times n$ square matrix.

If there exists a unit vector $v \in \R^n$ and a scalar $\lambda \in \R$ such that $Av = \lambda v$,
then $v$ is called an eigenvector and $\lambda$ an eigenvalue of $A$, respectively.

If $A$ has $n$ independent pairs of eigenvectors $v^{(1)}, ..., v^{(n)}$ and eigenvalues $\lambda^{(1)}, ..., \lambda^{(n)}$, we can arrange them as follows:

$$ A \begin{bmatrix} | & & |\\ v^{(1)} & ... & v^{(n)}\\ | & & | \end{bmatrix} = \begin{bmatrix} | & & |\\ v^{(1)} & ... & v^{(n)}\\ | & & | \end{bmatrix} \begin{bmatrix} \lambda^{(1)} & & 0\\ & ... & \\ 0 & & \lambda^{(n)} \end{bmatrix} $$

$$ AV = V \Lambda $$

The $n \times n$ matrices $V$ and $\Lambda$ denote the eigenvector matrix and the diagonal matrix of eigenvalues, respectively.

If $A$ is a symmetric matrix, which we now call $S$, then it is guaranteed that:

  1. it must have $n$ independent eigenvectors and eigenvalues
  2. its eigenvalues must be real
  3. its eigenvectors can be chosen orthonormal (orthogonal and of unit length)

As a result of property (3), the inverse of the eigenvector matrix $V$ is equal to its transpose: $V^{-1} = V^T$.

$$ SV = V \Lambda $$

$$ S = V \Lambda V^{-1} = V \Lambda V^T $$

We can thus decompose $S$ into the product of three matrices: $V$, $\Lambda$, and $V^T$.

A positive semi-definite matrix is a symmetric matrix whose eigenvalues $\lambda^{(1)}, ..., \lambda^{(n)}$ are all no less than zero. A typical example is the covariance matrix $AA^T$ of an arbitrary matrix $A \in \R^{m \times n}$.
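As a quick numerical check of these properties, here is a minimal NumPy sketch (the data and variable names are my own illustration, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
S = A @ A.T                      # symmetric, positive semi-definite (5 x 5)

lam, V = np.linalg.eigh(S)       # eigh handles symmetric matrices and returns real eigenvalues

print(np.all(lam >= -1e-10))                    # eigenvalues are non-negative (up to round-off)
print(np.allclose(V.T @ V, np.eye(5)))          # eigenvectors are orthonormal: V^T V = I
print(np.allclose(V @ np.diag(lam) @ V.T, S))   # S = V Lambda V^T
```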

Change of Basis

A vector space is defined by a set of basis vectors; e.g. the standard basis of $\R^3$ is:
$$ \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 0\\ 1 \end{bmatrix} $$

Take, for example, a data matrix $X$ with 3 variables and 4 data points.
$$ X = \begin{bmatrix} 4 & 5 & 6\\ 7 & 8 & 9\\ 0 & 1 & 2\\ 3 & 4 & 5\\ \end{bmatrix} $$

In terms of the standard basis, the first data point has 4 units along the first basis vector, 5 units along the second, and 6 units along the third, and so on for the other data points.

$$ \begin{bmatrix} 4 \\ 5 \\ 6 \\ \end{bmatrix} = 4 \times \begin{bmatrix} 1 \\ 0 \\ 0 \\ \end{bmatrix} + 5 \times \begin{bmatrix} 0 \\ 1 \\ 0 \\ \end{bmatrix} + 6 \times \begin{bmatrix} 0 \\ 0 \\ 1 \\ \end{bmatrix} $$
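In NumPy terms, this is just a linear combination of the standard basis vectors (a tiny illustrative sketch):

```python
import numpy as np

e1, e2, e3 = np.eye(3)        # rows of the identity: the standard basis of R^3
x = 4 * e1 + 5 * e2 + 6 * e3
print(x)                      # [4. 5. 6.]
```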

To reduce the dimension of the data points to 2, we project them onto a 2-dimensional subspace (a plane) in $\R^3$. Suppose the basis of the new vector space consists of two orthonormal vectors $v^{(1)}, v^{(2)} \in \R^3$; we define the basis matrix as:

$$ B = \begin{bmatrix} | & |\\ v^{(1)} & v^{(2)}\\ | & | \end{bmatrix} $$

We are interested in the weights $w^{(1)}, ..., w^{(4)} \in \R^2$ that should be assigned to the two basis vectors in the new vector space.

$$ \begin{bmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 0 & 1 & 2 \\ 3 & 4 & 5 \\ \end{bmatrix} = \begin{bmatrix} -& w^{(1)} &- \\ -& w^{(2)} &- \\ -& w^{(3)} &- \\ -& w^{(4)} &- \end{bmatrix} \begin{bmatrix} -& v^{(1)} &-\\ -& v^{(2)} &- \end{bmatrix} $$

This is equivalent to solving the equation for $W \in \R^{4 \times 2}$:

$$ X = WB^T $$

Right-multiplying both sides by $B$ and using $B^TB = I$ (the columns of $B$ are orthonormal) gives

$$ W = XB $$

The solution $W$ becomes the lower-dimensional representation of $X$. Equivalently, $X$ is projected onto a lower-dimensional subspace, the column space of $B$, and $B$ serves as the projection matrix.
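A minimal sketch of this change of basis with NumPy; the two orthonormal basis vectors below are an arbitrary illustrative choice, not ones from the original post:

```python
import numpy as np

X = np.array([[4., 5., 6.],
              [7., 8., 9.],
              [0., 1., 2.],
              [3., 4., 5.]])

# An example orthonormal basis of a 2-dimensional subspace of R^3 (assumed for illustration).
v1 = np.array([1.,  1., 1.]) / np.sqrt(3)
v2 = np.array([1., -1., 0.]) / np.sqrt(2)
B = np.column_stack([v1, v2])   # 3 x 2 basis matrix

W = X @ B                       # 4 x 2 lower-dimensional representation of X
X_hat = W @ B.T                 # the projection of X back in R^3
print(W)
print(X_hat)
```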

Retaining Data Variance

In principal component analysis (PCA), besides reducing the data dimension, we also want to find a subspace that retains as much of the variance of the original data as possible.

An $n$-dimensional data matrix with $m$ records, $X \in \R^{m \times n}$, can be viewed as consisting of $n$ variables $X_1, X_2, ..., X_n \in \R^m$, each containing $m$ samples.
$$ X = \begin{bmatrix} | & | & & | \\ X_1 & X_2 & ... & X_n \\ | & | & & | \end{bmatrix} $$

To begin with, we normalize the $n$ variables.

$$ X_i := \frac{X_i - \bar X_i}{\sigma_{X_i}}, \quad \forall i \in \{1, ..., n\} $$
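A minimal sketch of this normalization step, assuming the sample standard deviation (ddof=1) to match the $m-1$ factor used below:

```python
import numpy as np

def normalize(X):
    """Center each variable (column) and scale it to unit sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```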

The covariance matrix $\Sigma \in \R^{n \times n}$ can be computed as:
$$ \Sigma = \frac{1}{m-1}X^TX = \begin{bmatrix} \sigma^2_{11} & ... & \sigma^2_{1n} \\ ... & \sigma^2_{ij} & ... \\ \sigma^2_{n1} & ... & \sigma^2_{nn} \end{bmatrix} $$

The element $\sigma^2_{ij}$ represents the covariance of variables $X_i$ and $X_j$. When $i$ and $j$ are equal, $\sigma^2_{ii}$ is the variance of variable $X_i$.

We then perform the eigenvector decomposition of $\Sigma$.
$$ \Sigma = V \Lambda V^T = \begin{bmatrix} | & & | \\ V_1 & ... & V_n \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & ... & 0 \\ 0 & 0 & \lambda_n \end{bmatrix} \begin{bmatrix} -& V_1 &- \\ & ... & \\ -& V_n &- \end{bmatrix} $$

As $\Sigma$ is symmetric and positive semi-definite, it is guaranteed that the $\lambda_i$ are real and never less than zero, and that the eigenvectors $V_1, ..., V_n$ are orthonormal. We arrange the positions of the eigenvectors and eigenvalues so that $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n$.
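A sketch of these two steps with NumPy on synthetic data (np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted here to match the convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # m = 100 samples, n = 5 variables (synthetic)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
m = X.shape[0]

Sigma = X.T @ X / (m - 1)                  # n x n covariance matrix

lam, V = np.linalg.eigh(Sigma)             # ascending eigenvalues
order = np.argsort(lam)[::-1]              # re-sort so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]
print(lam)
```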

Suppose we have found a good subspace for the data matrix $X$ to project onto. We call $B$ the projection matrix, whose columns define the basis of the new vector space. The covariance matrix of the projected data is:

$$ \hat \Sigma = \frac{(XB)^T(XB)}{m-1} = \frac{B^TX^TXB}{m-1} = B^T \Sigma B = B^TV \Lambda V^TB $$

We want the projection to retain as much variance as possible, i.e. to maximize the diagonal entries of $\hat \Sigma$. The column vectors of $B$ and $V$, namely $B_i$ and $V_i$ respectively $\forall i \in \{1, ..., n\}$, are unit vectors, so the dot products $\langle B_i, V_i \rangle$ are maximized when $B$ is equal to $V$. Therefore $V$ is the projection matrix that maximizes the variance retained from the data matrix $X$. Then

$$ \hat \Sigma = (V^TV) \Lambda (V^TV)^T = \Lambda $$

We know that $\lambda_i$, the $i$th diagonal entry of $\hat \Sigma$, represents the variance of the $i$th variable in the new vector space. If it is very close to zero, then most of that variable's values are distributed around zero, which means it has almost no descriptive power for the dataset.
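A quick numerical check of this identity, reusing the synthetic data from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
m = X.shape[0]

Sigma = X.T @ X / (m - 1)
lam, V = np.linalg.eigh(Sigma)

# The covariance of the projected data XV is diagonal,
# with the eigenvalues of Sigma on its diagonal.
Sigma_hat = (X @ V).T @ (X @ V) / (m - 1)
print(np.allclose(Sigma_hat, np.diag(lam)))   # True
```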

So we directly retain the top $k \ll n$ eigenvalues and set the others to zero.
$$ \lambda_1, \lambda_2, ..., \lambda_k, 0, ..., 0 $$

This essentially omits the basis vectors $V_{k+1}, ..., V_n$: the dataset is projected onto a $k$-dimensional subspace while retaining most of its variance.
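Putting the whole procedure together, a minimal PCA sketch under the notation above (the function and data are my own illustration, not code from the original post):

```python
import numpy as np

def pca(X, k):
    """Project X (m x n) onto the k eigenvectors of its covariance matrix
    that have the largest eigenvalues, after normalizing each variable."""
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    m = X.shape[0]
    Sigma = X.T @ X / (m - 1)
    lam, V = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    B = V[:, :k]                  # top-k eigenvectors form the projection matrix
    return X @ B, lam             # k-dimensional representation and all eigenvalues

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
W, lam = pca(X, k=2)
print(W.shape)                    # (100, 2)
print(lam / lam.sum())            # fraction of variance retained by each component
```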

Singular Value Decomposition

For an $m \times n$ matrix $A$, where $n \le m$, the rank is some $r \le n$. Both its row space $C(A^T)$ and its column space $C(A)$ are $r$-dimensional, and each has $r$ basis vectors.

For any transformation matrix $X \in \R^{m \times n}$, there always exists a set of orthonormal vectors $V_1, V_2, ..., V_r \in \R^n$ that $X$ transforms into a set of orthogonal vectors in $\R^m$. Dividing each of these $r$ vectors by a scalar $\sigma_1, ..., \sigma_r \in \R$ makes them unit length, giving an orthonormal set $U_1, U_2, ..., U_r \in \R^m$.

$$ \begin{bmatrix} -& X_1 &- \\ & ... & \\ -& X_m &- \end{bmatrix} \begin{bmatrix} | & & | \\ V_1 & ... & V_r \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ U_1 & ... & U_r \\ | & & | \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & ... & 0 \\ 0 & 0 & \sigma_r \end{bmatrix} $$
Let $V \in \R^{n \times r} = [V_1, ..., V_r]$ and $U \in \R^{m \times r} = [U_1, ..., U_r]$, and let $D$ be the diagonal matrix of $\sigma_1, ..., \sigma_r$; then

$$ XV = UD $$

$$ X = UDV^T $$

We see that any rectangular matrix $X$ can be decomposed into two orthonormal matrices $U$, $V$ and a diagonal matrix $D$.

Supposing the variables in $X$ are normalized, let's examine its covariance matrix (dropping the constant factor $\frac{1}{m-1}$ for simplicity).

$$ \Sigma = X^TX = (UDV^T)^TUDV^T = VDU^TUDV^T = VD^2V^T $$

We see that $V$ is in fact the eigenvector matrix of $\Sigma$. If the dimension of $X$ is high, e.g. 1000, then $\Sigma$ would be a $1000 \times 1000$ matrix. Using singular value decomposition (SVD), we can obtain $V$ without ever computing $\Sigma$.
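A sketch verifying this relationship with NumPy; note that eigenvectors are only defined up to sign, and that np.linalg.eigh sorts eigenvalues in ascending order while np.linalg.svd sorts singular values in descending order:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvectors of Sigma = X^T X, sorted by descending eigenvalue
Sigma = X.T @ X
lam, V_eig = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, V_eig = lam[order], V_eig[:, order]

# Right singular vectors of X, obtained without ever forming Sigma
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt.T

print(np.allclose(d**2, lam))                     # squared singular values are the eigenvalues of X^T X
print(np.allclose(np.abs(V_svd), np.abs(V_eig)))  # same eigenvectors, up to sign
```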
