Eigenvector Decomposition
Let $A \in \mathbb{R}^{n \times n}$ be a square matrix.
If there exists a unit vector $v \in \mathbb{R}^n$ and a scalar $\lambda \in \mathbb{R}$ such that $Av = \lambda v$,
then $v$ is called an eigenvector of $A$ and $\lambda$ the corresponding eigenvalue.
If $A$ has $n$ linearly independent eigenvectors $v^{(1)}, \ldots, v^{(n)}$ with eigenvalues $\lambda^{(1)}, \ldots, \lambda^{(n)}$, we can arrange them as follows:
$$A \begin{bmatrix} | & & | \\ v^{(1)} & \cdots & v^{(n)} \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ v^{(1)} & \cdots & v^{(n)} \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda^{(1)} & & 0 \\ & \ddots & \\ 0 & & \lambda^{(n)} \end{bmatrix}$$
$$AV = V\Lambda$$
The $n \times n$ matrices $V$ and $\Lambda$ denote the eigenvector matrix and the diagonal matrix of eigenvalues, respectively.
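As a quick sanity check, here is a minimal sketch in Python with NumPy (not part of the original derivation; the matrix $A$ below is just an arbitrary example) verifying that $AV = V\Lambda$:

```python
import numpy as np

# An arbitrary small square matrix (illustrative example).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns
# are the corresponding unit-length eigenvectors.
eigenvalues, V = np.linalg.eig(A)
Lambda = np.diag(eigenvalues)

print(np.allclose(A @ V, V @ Lambda))  # True: A V = V Lambda
```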
If $A$ is a symmetric matrix, which we now call $S$, then the following are guaranteed:
- it has $n$ linearly independent eigenvectors and $n$ eigenvalues
- its eigenvalues are real
- its eigenvectors are orthonormal (orthogonal and of unit length)
As a result of the third property, the inverse of the eigenvector matrix $V$ is equal to its transpose: $V^{-1} = V^T$.
$$SV = V\Lambda$$
$$S = V\Lambda V^{-1} = V\Lambda V^T$$
We can decompose $S$ into the product of three matrices: $V$, $\Lambda$, and $V^T$.
A positive semi-definite matrix is a symmetric matrix whose eigenvalues $\lambda^{(1)}, \ldots, \lambda^{(n)}$ are all no less than zero. A typical example is the matrix $AA^T$ for an arbitrary matrix $A \in \mathbb{R}^{m \times n}$, such as a covariance matrix.
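A minimal NumPy sketch of these properties, assuming an arbitrary random matrix $A$ so that $S = AA^T$ is symmetric and positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))
S = A @ A.T                      # symmetric, positive semi-definite

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors.
eigenvalues, V = np.linalg.eigh(S)

print(np.all(eigenvalues >= -1e-10))                    # eigenvalues are non-negative
print(np.allclose(V.T @ V, np.eye(4)))                  # V^T V = I, i.e. V^{-1} = V^T
print(np.allclose(S, V @ np.diag(eigenvalues) @ V.T))   # S = V Lambda V^T
```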
Change of Basis
A vector space can be described by a set of basis vectors; for example, the standard basis of $\mathbb{R}^3$ is:
$$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
Let us take, for example, a data matrix $X$ with 3 variables and 4 data points.
$$X = \begin{bmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix}$$
In terms of the standard basis, the first data point has 4 units along the first basis vector, 5 units along the second, and 6 units along the third, and so on:
$$\begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = 4 \times \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + 5 \times \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + 6 \times \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
To reduce the dimension of the data points to 2, we project them onto a 2-dimensional subspace (a plane) in $\mathbb{R}^3$. Suppose the basis of the new vector space consists of two orthonormal vectors $v^{(1)}, v^{(2)} \in \mathbb{R}^3$; we define the basis matrix as:
$$B = \begin{bmatrix} | & | \\ v^{(1)} & v^{(2)} \\ | & | \end{bmatrix}$$
We are interested in the weights $w^{(1)}, \ldots, w^{(4)} \in \mathbb{R}^2$ that should be assigned to the two basis vectors in the new vector space.
$$\begin{bmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} = \begin{bmatrix} - & w^{(1)} & - \\ - & w^{(2)} & - \\ - & w^{(3)} & - \\ - & w^{(4)} & - \end{bmatrix} \begin{bmatrix} - & v^{(1)} & - \\ - & v^{(2)} & - \end{bmatrix}$$
This is equivalent to solving the equation for $W \in \mathbb{R}^{4 \times 2}$:
$$X = WB^T$$
Since the columns of $B$ are orthonormal, $B^T B = I$, so multiplying both sides on the right by $B$ gives:
$$W = XB$$
The solution $W$ becomes the lower-dimensional representation of $X$. Equivalently, $X$ is projected onto a lower-dimensional subspace, namely the column space of $B$, and $B$ serves as the projection matrix.
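A minimal sketch of this change of basis in NumPy, using the $4 \times 3$ data matrix from above; the particular orthonormal vectors $v^{(1)}, v^{(2)}$ chosen here are only for illustration:

```python
import numpy as np

X = np.array([[4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])

# Two orthonormal basis vectors spanning the target plane (illustrative choice).
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])
B = np.column_stack([v1, v2])            # 3 x 2 basis matrix

print(np.allclose(B.T @ B, np.eye(2)))   # columns of B are orthonormal
W = X @ B                                # 4 x 2 lower-dimensional representation
print(W)
```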
Retaining Data Variance
In principal component analysis (PCA), besides reducing the data dimension, we also want to find a subspace that retains as much of the variance of the original data as possible.
An $n$-dimensional data matrix with $m$ records, $X \in \mathbb{R}^{m \times n}$, can be viewed as consisting of $n$ variables $X_1, X_2, \ldots, X_n \in \mathbb{R}^m$, each containing $m$ samples.
$$X = \begin{bmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_n \\ | & | & & | \end{bmatrix}$$
To begin with, we normalize the $n$ variables:
$$X_i := \frac{X_i - \bar{X}_i}{\sigma_{X_i}}, \quad \forall i \in \{1, \ldots, n\}$$
The covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ can be computed as:
$$\Sigma = \frac{1}{m-1} X^T X = \begin{bmatrix} \sigma_{11}^2 & \cdots & \sigma_{1n}^2 \\ \vdots & \sigma_{ij}^2 & \vdots \\ \sigma_{n1}^2 & \cdots & \sigma_{nn}^2 \end{bmatrix}$$
The element $\sigma_{ij}^2$ represents the covariance of variables $X_i$ and $X_j$. When $i = j$, $\sigma_{ii}^2$ is the variance of variable $X_i$.
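In code, the normalization and covariance computation might look like the following sketch (NumPy; the random data and sizes are only placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                      # m samples, n variables (illustrative sizes)
X = rng.normal(size=(m, n))

# Normalize each variable (column): subtract its mean, divide by its std.
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix: Sigma = (1 / (m - 1)) X^T X
Sigma = X.T @ X / (m - 1)
print(Sigma)                       # diagonal entries are the variances of X_1..X_n
```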
We then perform an eigenvector decomposition of $\Sigma$:
$$\Sigma = V\Lambda V^T = \begin{bmatrix} | & & | \\ V_1 & \cdots & V_n \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \begin{bmatrix} - & V_1 & - \\ & \vdots & \\ - & V_n & - \end{bmatrix}$$
Since $\Sigma$ is symmetric and positive semi-definite, it is guaranteed that the $\lambda_i$ are real and never less than zero, and that the eigenvectors $V_1, \ldots, V_n$ are orthonormal. We order the eigenvectors and eigenvalues so that $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.
Suppose we find a good subspace for the data matrix $X$ to project onto; we call $B$ the projection matrix, whose columns define the basis of the new vector space. The covariance matrix of the projected data is:
$$\hat{\Sigma} = \frac{1}{m-1}(XB)^T(XB) = B^T \left( \frac{1}{m-1} X^T X \right) B = B^T V \Lambda V^T B$$
We want to maximize the variances in $\hat{\Sigma}$. The column vectors of $B$ and $V$, namely $B_i$ and $V_i$ for $i \in \{1, \ldots, n\}$, are unit vectors, and the dot products $\langle B_i, V_i \rangle$ are maximized when $B = V$. Therefore $V$ is in fact the projection matrix that maximizes the variance retained from the data matrix $X$. Then
$$\hat{\Sigma} = (V^T V)\Lambda(V^T V)^T = \Lambda$$
We know that the diagonal entries $\lambda_i$ of $\hat{\Sigma}$ represent the variances of the variables in the new vector space. If $\lambda_i$ is very close to zero, then most of that variable's values are distributed around zero, which means it has almost no descriptive power for the dataset.
So we retain only the top $k \ll n$ eigenvalues and set the others to zero:
$$\lambda_1, \lambda_2, \ldots, \lambda_k, 0, \ldots, 0$$
This essentially omits the basis vectors $V_{k+1}, \ldots, V_n$, so the dataset is projected onto a $k$-dimensional subspace while most of its variance is retained.
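Putting the steps together, a minimal PCA sketch along these lines in NumPy (the function name `pca_project` and the random test data are my own, not from the original text):

```python
import numpy as np

def pca_project(X, k):
    """Project already-normalized data X onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / (m - 1)

    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so reverse the order to get lambda_1 >= lambda_2 >= ... >= lambda_n.
    eigenvalues, V = np.linalg.eigh(Sigma)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, V = eigenvalues[order], V[:, order]

    B = V[:, :k]          # keep only the top-k eigenvectors as the new basis
    W = X @ B             # k-dimensional representation of the data
    return W, eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

W, eigenvalues = pca_project(X, k=2)
print(W.shape)                                    # (100, 2)
print(eigenvalues[:2].sum() / eigenvalues.sum())  # fraction of variance retained
```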
Singular Value Decomposition
An $m \times n$ matrix $A$, where $n \le m$, has rank $r \le n$. Both its row space $C(A^T)$ and its column space $C(A)$ have dimension $r$, and each therefore has $r$ basis vectors.
For any transformation matrix $X \in \mathbb{R}^{m \times n}$ of rank $r$, there always exists a set of orthonormal vectors $V_1, V_2, \ldots, V_r \in \mathbb{R}^n$ that $X$ transforms into a set of orthogonal vectors in $\mathbb{R}^m$. If we use scalars $\sigma_1, \ldots, \sigma_r \in \mathbb{R}$ to scale these $r$ vectors, they become unit length, i.e. an orthonormal set $U_1, U_2, \ldots, U_r \in \mathbb{R}^m$.
$$\begin{bmatrix} - & X_1 & - \\ & \vdots & \\ - & X_m & - \end{bmatrix} \begin{bmatrix} | & & | \\ V_1 & \cdots & V_r \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ U_1 & \cdots & U_r \\ | & & | \end{bmatrix} \begin{bmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_r \end{bmatrix}$$
Let $V = [V_1, \ldots, V_r] \in \mathbb{R}^{n \times r}$, $U = [U_1, \ldots, U_r] \in \mathbb{R}^{m \times r}$, and let $D$ be the diagonal matrix of $\sigma_1, \ldots, \sigma_r$; then
$$XV = UD$$
$$X = UDV^T$$
We see that any rectangular matrix $X$ can be decomposed into two matrices with orthonormal columns, $U$ and $V$, and a diagonal matrix $D$.
Supposing the variables in $X$ are normalized, let's examine its covariance matrix (ignoring the constant factor $\frac{1}{m-1}$):
$$\Sigma = X^T X = (UDV^T)^T UDV^T = VDU^T UDV^T = VD^2V^T$$
We see that $V$ is in fact the eigenvector matrix of $\Sigma$. If the dimension of $X$ is high, e.g. $n = 1000$, then $\Sigma$ would be a $1000 \times 1000$ matrix. Using singular value decomposition (SVD), we can obtain $V$ without ever computing $\Sigma$.
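A sketch of this equivalence in NumPy: the right singular vectors returned by `np.linalg.svd` match the eigenvectors of $\Sigma$ up to sign, and the squared singular values, divided by $m-1$, match its eigenvalues (the random data here is only a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4
X = rng.normal(size=(m, n))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the data matrix itself; the n x n covariance matrix is never formed.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Compare against the eigendecomposition of the covariance matrix.
Sigma = X.T @ X / (m - 1)
eigenvalues, V = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

print(np.allclose(s**2 / (m - 1), eigenvalues))   # eigenvalues from singular values
print(np.allclose(np.abs(Vt), np.abs(V.T)))       # same eigenvectors up to sign
```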