Monday, July 30, 2018

Demonstration of Sorting and Subsetting in Python and R

This blog extracts some Python code snippets from my previous Movie Recommender System project, and compares how the same tasks can be accomplished in R.

The data files can be found on GitHub.

Load Movie Names List

Python

# read movie names, dropping the leading id on each line
movieList_ = []
with open("movie_ids.txt") as file:
    for line in file:
        movieName = line.split()[1:]
        movieList_.append(' '.join(movieName))

# no. of movies in dataset
print(len(movieList_))

# check last 5 movies
print(movieList_[-5:])

R

library(stringr)

lines <- readLines("movie_ids.txt")
movieList_ <- word(lines, 2, -1)

# no. of movies in dataset
length(movieList_)

# check last 5 movies
tail(movieList_, 5)

Load Matlab Data File

Python

import scipy.io
import numpy as np

# load data from Matlab file
data = scipy.io.loadmat('ex8_movies.mat')
# transpose the matrix Y & R
Y_ = data['Y'].astype(int).T
R_ = data['R'].astype(int).T

# check dimensions and sample contents
print(Y_.shape)
print(Y_[-5:])
print(R_.shape)
print(R_[-5:])

R

library(R.matlab)

# load data from Matlab file
data <- readMat("ex8_movies.mat")
# transpose the matrix Y & R
Y_ <- t(data$Y) 
R_ <- t(data$R)

# check dimensions and sample contents
dim(Y_)
tail(Y_, 5)
dim(R_)
tail(R_, 5)

Shortlist 100 Movies

Python

# shortlist top 100 most rated movies
top100 = np.argsort(-np.sum(R_, axis=0))[:100]
movieList = [movieList_[i] for i in top100]
Y = Y_[:, top100]
R = R_[:, top100]

# check dimensions
print(R.shape)

# top 10 movies in our shortlist
movieList[:10]

R

top100 <- order(-colSums(R_))[1:100]
movieList <- movieList_[top100]
Y <- Y_[ ,top100]
R <- R_[ ,top100]

# check dimensions
dim(R)

# top 10 movies in our shortlist
movieList[1:10]

Find the Terminator’s Fan

Python

# find the Terminator fan with least rating activities
# 31 - Terminator, The (1984)
# 38 - Terminator 2: Judgment Day (1991)

# shortlist the user ids who have rating >=4 in the two movies
terminator_fans = np.where((Y[:, 31] >= 4) & (Y[:, 38] >= 4))[0]

# find the index in terminator_fans with the least rating count
fan_id = np.sum(R[terminator_fans], axis=1).argsort()[0]

# locate the user
uid = terminator_fans[fan_id]
print(uid)

# check the movies he has rated
[(i, movieList[i], Y[uid, i]) for i in np.argsort(-Y[uid])[:10]]

R

# find the Terminator fan with least rating activities
# 32 - Terminator, The (1984)
# 39 - Terminator 2: Judgment Day (1991)

# shortlist the user ids who have rating >=4 in the two movies
terminator_fans <- which(Y[,32] >= 4 & Y[,39] >= 4)

# find the index in terminator_fans with the least rating count
fan_id <- order(rowSums(R[terminator_fans, ]))[1]

# locate the user
uid <- terminator_fans[fan_id]
uid

# check the movies he has rated
movieList[order(-Y[uid, ])[1:10]]
Y[uid, order(-Y[uid, ])[1:10]]

Sunday, July 29, 2018

Covariance, Correlation Coefficient and R-squared

Covariance

The covariance measures the degree to which two random variables X and Y change together.

Cov(X, Y) = E[(X - E(X))(Y - E(Y))]

= E[XY - XE(Y) - YE(X) + E(X)E(Y)]

= E(XY) - 2E(X)E(Y) + E(X)E(Y)

= E(XY) - E(X)E(Y)
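The identity above is easy to check numerically. The sketch below (the data and names are my own, not from the project) compares E(XY) - E(X)E(Y) against NumPy's population covariance:

```python
import numpy as np

# quick numerical check of Cov(X, Y) = E(XY) - E(X)E(Y)
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# identity form
cov_identity = np.mean(x * y) - np.mean(x) * np.mean(y)
# population covariance from NumPy (bias=True divides by N)
cov_numpy = np.cov(x, y, bias=True)[0, 1]
```

The two values agree up to floating-point error, since `np.cov` with `bias=True` computes exactly the population covariance.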

Correlation Coefficient ( R )

The correlation coefficient, or Pearson's R, standardises the covariance and constrains its value between -1 and +1. The two random variables X and Y have a strong negative correlation if R < -0.5, and a strong positive correlation if R > +0.5.

R(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}
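This standardisation can also be verified in NumPy; in this sketch (my own illustrative data, not from the project) the manual ratio is compared against `np.corrcoef`:

```python
import numpy as np

# Pearson's R as standardised covariance, checked against np.corrcoef
rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = -0.8 * x + rng.normal(size=50_000)   # built to correlate negatively

cov = np.mean(x * y) - np.mean(x) * np.mean(y)    # population covariance
r_manual = cov / np.sqrt(np.var(x) * np.var(y))   # population variances
r_numpy = np.corrcoef(x, y)[0, 1]
```

Because `y` is constructed with a negative slope, the coefficient comes out well below -0.5, a strong negative correlation under the rule of thumb above.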

Generating Correlated Random Variables

Suppose X is a standard normal random variable. We can generate another correlated standard normal random variable Z, with correlation coefficient \rho, like this:

Z = \rho X + \sqrt{1 - \rho^2} Y

where Y is a standard normal random variable independent of X.

Proof: since X and Y are standard normal,

E(X) = E(Y) = 0

Var(X) = Var(Y) = 1

From Var(X) = E(X^2) - E(X)^2 = 1, we get E(X^2) = E(Y^2) = 1.

By independence, E(XY) = E(X)E(Y) = 0.

Z is standard normal, since

Var(Z) = Var(\rho X + \sqrt{1 - \rho^2} Y) = \rho^2 Var(X) + (1 - \rho^2) Var(Y) = 1

And since Var(Z) = Var(X) = 1, the correlation reduces to the covariance:

R(Z, X) = Cov(\rho X + \sqrt{1 - \rho^2} Y, X)

= E[X(\rho X + \sqrt{1 - \rho^2} Y)] - E(X) E(\rho X + \sqrt{1 - \rho^2} Y)

= \rho E(X^2) + \sqrt{1 - \rho^2} E(XY) = \rho

For generating more than two correlated random variables, refer to Cholesky Decomposition.
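The two-variable construction can be sketched in NumPy like this (the sample size, seed, and choice of \rho are my own):

```python
import numpy as np

# generate Z = rho*X + sqrt(1 - rho^2)*Y from independent standard normals
rho = 0.7
n = 200_000
rng = np.random.default_rng(42)
x = rng.normal(size=n)
y = rng.normal(size=n)
z = rho * x + np.sqrt(1 - rho**2) * y

r = np.corrcoef(z, x)[0, 1]   # empirical correlation, close to rho
```

With a sample this large, the empirical correlation lands within about 0.01 of the target \rho, and the variance of `z` stays close to 1, as the proof predicts.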

R-squared

R-squared, as the name implies, can be calculated by squaring the correlation coefficient R. The result ranges from 0 to 1.

In regression, R-squared can also be calculated by:

R^2 = \frac{SST - SSE}{SST}

where

SSE = \sum_i (y_i - \hat{y_i})^2

SST = \sum_i (y_i - \bar{y})^2

It measures the reduction in residual error from using the regression line instead of the mean line, or simply how well the data points fit the regression line or curve.

It also represents the proportion of variation in the dependent variable explained by the independent variable.
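For a simple linear fit, the two ways of computing R-squared agree exactly. A small sketch (my own synthetic data, not from the project):

```python
import numpy as np

# R-squared two ways on a simple linear fit:
# (1) (SST - SSE) / SST from the fitted line, (2) squared Pearson's R
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=1_000)
y = 2.0 * x + rng.normal(scale=3.0, size=1_000)

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
r2_from_sums = (sst - sse) / sst
r2_from_r = np.corrcoef(x, y)[0, 1] ** 2
```

The equality of the two values is a property of ordinary least squares with an intercept; for other models, the sum-of-squares form is the one that generalises.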

Tuesday, July 24, 2018

Installing Jupyter Notebook for R in Multi-users Environment

Beforehand, I assume the following steps were done.

  1. Install Anaconda 3.6
  2. Install R 3.3, RGui, and RStudio, etc.

Don’t Try This

conda update anaconda  
conda install -c r r-essentials

There are web pages that teach you to install it this way.
Doing so will duplicate your libraries in the R environment, and the R kernel cannot start in Jupyter Notebook.

Install with IRkernel

In RGui or RStudio, run the following:

install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest')) 

devtools::install_github('IRkernel/IRkernel')

This is important

Check your R library path in RGui or RStudio:

.libPaths()

In my case, this output:

[1] "C:/Users/Ken Chan/Documents/R/win-library/3.3"
[2] "C:/Program Files/R/R-3.3.3/library"

Copy the first path. In command line or bash console, create an environment variable.

export R_LIBS_PATH="C:/Users/Ken Chan/Documents/R/win-library/3.3"

Start the console with R --no-save, and run:

IRkernel::installspec()

You are Done!

When you restart your Jupyter Notebook, you will see a new R kernel under the menu Kernel -> Change kernel.

Written with StackEdit.
