Thursday, December 6, 2018

Principal Component Analysis

Eigenvector Decomposition

Let $A \in \R^{n \times n}$ be an $n \times n$ square matrix.

If there exists a unit vector $v \in \R^n$ and a scalar $\lambda \in \R$ such that $Av = \lambda v$,
then $v$ is called an eigenvector and $\lambda$ an eigenvalue of $A$, respectively.

If $A$ has $n$ independent pairs of eigenvectors $v^{(1)}, ..., v^{(n)}$ and eigenvalues $\lambda^{(1)}, ..., \lambda^{(n)}$, we can arrange them as follows:

$$ A \begin{bmatrix} | & & |\\ v^{(1)} & ... & v^{(n)}\\ | & & | \end{bmatrix} = \begin{bmatrix} | & & |\\ v^{(1)} & ... & v^{(n)}\\ | & & | \end{bmatrix} \begin{bmatrix} \lambda^{(1)} & & 0\\ & ... & \\ 0 & & \lambda^{(n)} \end{bmatrix} $$

$$ AV = V \Lambda $$

The $n \times n$ matrices $V$ and $\Lambda$ denote the eigenvector matrix and the diagonal matrix of eigenvalues, respectively.

If $A$ is a symmetric matrix, which we now call $S$, then it is guaranteed that:

  1. it must have $n$ independent eigenvectors and eigenvalues
  2. its eigenvalues must be real
  3. its eigenvectors can be chosen orthonormal (orthogonal and of unit length)

As a result of property (3), the inverse of the eigenvector matrix $V$ is equal to its transpose: $V^{-1} = V^T$.

$$ SV = V \Lambda $$

$$ S = V \Lambda V^{-1} = V \Lambda V^T $$

We can thus decompose $S$ into the product of three matrices: $V$, $\Lambda$, and $V^T$.

A positive semi-definite matrix is a symmetric matrix whose eigenvalues $\lambda^{(1)}, ..., \lambda^{(n)}$ are all no less than zero. A typical example is the covariance matrix $AA^T$ of an arbitrary matrix $A \in \R^{m \times n}$.
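As a quick numerical check of these properties, here is a minimal NumPy sketch (the data and variable names are my own illustration, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
S = A @ A.T                      # symmetric, positive semi-definite (5 x 5)

lam, V = np.linalg.eigh(S)       # eigh handles symmetric matrices and returns real eigenvalues

print(np.all(lam >= -1e-10))                    # eigenvalues are non-negative (up to round-off)
print(np.allclose(V.T @ V, np.eye(5)))          # eigenvectors are orthonormal: V^T V = I
print(np.allclose(V @ np.diag(lam) @ V.T, S))   # S = V Lambda V^T
```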

Change of Basis

A vector space is defined by a set of basis vectors; e.g. the standard basis of $\R^3$ is:
$$ \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 0\\ 1 \end{bmatrix} $$

Take, for example, a data matrix $X$ with 3 variables and 4 data points.
$$ X = \begin{bmatrix} 4 & 5 & 6\\ 7 & 8 & 9\\ 0 & 1 & 2\\ 3 & 4 & 5\\ \end{bmatrix} $$

In terms of the standard basis, the first data point has 4 units along the first basis vector, 5 units along the second, and 6 units along the third, and so on for the other data points.

$$ \begin{bmatrix} 4 \\ 5 \\ 6 \\ \end{bmatrix} = 4 \times \begin{bmatrix} 1 \\ 0 \\ 0 \\ \end{bmatrix} + 5 \times \begin{bmatrix} 0 \\ 1 \\ 0 \\ \end{bmatrix} + 6 \times \begin{bmatrix} 0 \\ 0 \\ 1 \\ \end{bmatrix} $$
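In NumPy terms, this is just a linear combination of the standard basis vectors (a tiny illustrative sketch):

```python
import numpy as np

e1, e2, e3 = np.eye(3)        # rows of the identity: the standard basis of R^3
x = 4 * e1 + 5 * e2 + 6 * e3
print(x)                      # [4. 5. 6.]
```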

To reduce the dimension of the data points to 2, we project them onto a 2-dimensional subspace (a plane) in $\R^3$. Suppose the basis of the new vector space consists of two orthonormal vectors $v^{(1)}, v^{(2)} \in \R^3$; we define the basis matrix as:

$$ B = \begin{bmatrix} | & |\\ v^{(1)} & v^{(2)}\\ | & | \end{bmatrix} $$

We are interested in the weights $w^{(1)}, ..., w^{(4)} \in \R^2$ that should be assigned to the two basis vectors in the new vector space.

$$ \begin{bmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 0 & 1 & 2 \\ 3 & 4 & 5 \\ \end{bmatrix} = \begin{bmatrix} -& w^{(1)} &- \\ -& w^{(2)} &- \\ -& w^{(3)} &- \\ -& w^{(4)} &- \end{bmatrix} \begin{bmatrix} -& v^{(1)} &-\\ -& v^{(2)} &- \end{bmatrix} $$

This is equivalent to solving the equation for $W \in \R^{4 \times 2}$:

$$ X = WB^T $$

Right-multiplying both sides by $B$ and using $B^TB = I$ (the columns of $B$ are orthonormal) gives

$$ W = XB $$

The solution $W$ becomes the lower-dimensional representation of $X$. Equivalently, $X$ is projected onto a lower-dimensional subspace, the column space of $B$, and $B$ serves as the projection matrix.
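A minimal sketch of this change of basis with NumPy; the two orthonormal basis vectors below are an arbitrary illustrative choice, not ones from the original post:

```python
import numpy as np

X = np.array([[4., 5., 6.],
              [7., 8., 9.],
              [0., 1., 2.],
              [3., 4., 5.]])

# An example orthonormal basis of a 2-dimensional subspace of R^3 (assumed for illustration).
v1 = np.array([1.,  1., 1.]) / np.sqrt(3)
v2 = np.array([1., -1., 0.]) / np.sqrt(2)
B = np.column_stack([v1, v2])   # 3 x 2 basis matrix

W = X @ B                       # 4 x 2 lower-dimensional representation of X
X_hat = W @ B.T                 # the projection of X back in R^3
print(W)
print(X_hat)
```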

Retaining Data Variance

In principal component analysis (PCA), besides reducing the data dimension, we also want to find a subspace that retains as much of the variance of the original data as possible.

An $n$-dimensional data matrix with $m$ records, $X \in \R^{m \times n}$, can be viewed as consisting of $n$ variables $X_1, X_2, ..., X_n \in \R^m$, each containing $m$ samples.
$$ X = \begin{bmatrix} | & | & & | \\ X_1 & X_2 & ... & X_n \\ | & | & & | \end{bmatrix} $$

To begin with, we normalize the $n$ variables.

$$ X_i := \frac{X_i - \bar X_i}{\sigma_{X_i}}, \quad \forall i \in \{1, ..., n\} $$
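A minimal sketch of this normalization step, assuming the sample standard deviation (ddof=1) to match the $m-1$ factor used below:

```python
import numpy as np

def normalize(X):
    """Center each variable (column) and scale it to unit sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```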

The covariance matrix $\Sigma \in \R^{n \times n}$ can be computed as:
$$ \Sigma = \frac{1}{m-1}X^TX = \begin{bmatrix} \sigma^2_{11} & ... & \sigma^2_{1n} \\ ... & \sigma^2_{ij} & ... \\ \sigma^2_{n1} & ... & \sigma^2_{nn} \end{bmatrix} $$

The element $\sigma^2_{ij}$ represents the covariance of variables $X_i$ and $X_j$. When $i$ and $j$ are equal, $\sigma^2_{ii}$ is the variance of variable $X_i$.

We then perform the eigenvector decomposition of $\Sigma$.
$$ \Sigma = V \Lambda V^T = \begin{bmatrix} | & & | \\ V_1 & ... & V_n \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & ... & 0 \\ 0 & 0 & \lambda_n \end{bmatrix} \begin{bmatrix} -& V_1 &- \\ & ... & \\ -& V_n &- \end{bmatrix} $$

As $\Sigma$ is symmetric and positive semi-definite, it is guaranteed that the $\lambda_i$ are real and never less than zero, and that the eigenvectors $V_1, ..., V_n$ are orthonormal. We arrange the positions of the eigenvectors and eigenvalues so that $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n$.
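A sketch of these two steps with NumPy on synthetic data (np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted here to match the convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # m = 100 samples, n = 5 variables (synthetic)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
m = X.shape[0]

Sigma = X.T @ X / (m - 1)                  # n x n covariance matrix

lam, V = np.linalg.eigh(Sigma)             # ascending eigenvalues
order = np.argsort(lam)[::-1]              # re-sort so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]
print(lam)
```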

Suppose we have found a good subspace for the data matrix $X$ to project onto. We call $B$ the projection matrix, whose columns define the basis of the new vector space. The covariance matrix of the projected data is:

$$ \hat \Sigma = \frac{(XB)^T(XB)}{m-1} = \frac{B^TX^TXB}{m-1} = B^T \Sigma B = B^TV \Lambda V^TB $$

We want the projection to retain as much variance as possible, i.e. to maximize the diagonal entries of $\hat \Sigma$. The column vectors of $B$ and $V$, namely $B_i$ and $V_i$ respectively $\forall i \in \{1, ..., n\}$, are unit vectors, so the dot products $\langle B_i, V_i \rangle$ are maximized when $B$ is equal to $V$. Therefore $V$ is the projection matrix that maximizes the variance retained from the data matrix $X$. Then

$$ \hat \Sigma = (V^TV) \Lambda (V^TV)^T = \Lambda $$

We know that $\lambda_i$, the $i$th diagonal entry of $\hat \Sigma$, represents the variance of the $i$th variable in the new vector space. If it is very close to zero, then most of that variable's values are distributed around zero, which means it has almost no descriptive power for the dataset.
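A quick numerical check of this identity, reusing the synthetic data from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
m = X.shape[0]

Sigma = X.T @ X / (m - 1)
lam, V = np.linalg.eigh(Sigma)

# The covariance of the projected data XV is diagonal,
# with the eigenvalues of Sigma on its diagonal.
Sigma_hat = (X @ V).T @ (X @ V) / (m - 1)
print(np.allclose(Sigma_hat, np.diag(lam)))   # True
```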

So we directly retain the top $k \ll n$ eigenvalues and set the others to zero.
$$ \lambda_1, \lambda_2, ..., \lambda_k, 0, ..., 0 $$

This essentially omits the basis vectors $V_{k+1}, ..., V_n$: the dataset is projected onto a $k$-dimensional subspace while retaining most of its variance.
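Putting the whole procedure together, a minimal PCA sketch under the notation above (the function and data are my own illustration, not code from the original post):

```python
import numpy as np

def pca(X, k):
    """Project X (m x n) onto the k eigenvectors of its covariance matrix
    that have the largest eigenvalues, after normalizing each variable."""
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    m = X.shape[0]
    Sigma = X.T @ X / (m - 1)
    lam, V = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    B = V[:, :k]                  # top-k eigenvectors form the projection matrix
    return X @ B, lam             # k-dimensional representation and all eigenvalues

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
W, lam = pca(X, k=2)
print(W.shape)                    # (100, 2)
print(lam / lam.sum())            # fraction of variance retained by each component
```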

Singular Value Decomposition

For an $m \times n$ matrix $A$, where $n \le m$, the rank is some $r \le n$. Both its row space $C(A^T)$ and its column space $C(A)$ are $r$-dimensional, and each has $r$ basis vectors.

For any transformation matrix $X \in \R^{m \times n}$, there always exists a set of orthonormal vectors $V_1, V_2, ..., V_r \in \R^n$ that $X$ transforms into a set of orthogonal vectors in $\R^m$. Dividing each of these $r$ vectors by a scalar $\sigma_1, ..., \sigma_r \in \R$ makes them unit length, giving an orthonormal set $U_1, U_2, ..., U_r \in \R^m$.

$$ \begin{bmatrix} -& X_1 &- \\ & ... & \\ -& X_m &- \end{bmatrix} \begin{bmatrix} | & & | \\ V_1 & ... & V_r \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ U_1 & ... & U_r \\ | & & | \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & ... & 0 \\ 0 & 0 & \sigma_r \end{bmatrix} $$
Let $V \in \R^{n \times r} = [V_1, ..., V_r]$ and $U \in \R^{m \times r} = [U_1, ..., U_r]$, and let $D$ be the diagonal matrix of $\sigma_1, ..., \sigma_r$; then

$$ XV = UD $$

$$ X = UDV^T $$

We see that any rectangular matrix $X$ can be decomposed into two orthonormal matrices $U$, $V$ and a diagonal matrix $D$.

Supposing the variables in $X$ are normalized, let's examine its covariance matrix (dropping the constant factor $\frac{1}{m-1}$ for simplicity).

$$ \Sigma = X^TX = (UDV^T)^TUDV^T = VDU^TUDV^T = VD^2V^T $$

We see that $V$ is in fact the eigenvector matrix of $\Sigma$. If the dimension of $X$ is high, e.g. 1000, then $\Sigma$ would be a $1000 \times 1000$ matrix. Using singular value decomposition (SVD), we can obtain $V$ without ever computing $\Sigma$.
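A sketch verifying this relationship with NumPy; note that eigenvectors are only defined up to sign, and that np.linalg.eigh sorts eigenvalues in ascending order while np.linalg.svd sorts singular values in descending order:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvectors of Sigma = X^T X, sorted by descending eigenvalue
Sigma = X.T @ X
lam, V_eig = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, V_eig = lam[order], V_eig[:, order]

# Right singular vectors of X, obtained without ever forming Sigma
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt.T

print(np.allclose(d**2, lam))                     # squared singular values are the eigenvalues of X^T X
print(np.allclose(np.abs(V_svd), np.abs(V_eig)))  # same eigenvectors, up to sign
```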
