Monday, July 30, 2018

Demonstration of Sorting and Subsetting in Python and R

This blog extracts some Python code snippets from my previous Movie Recommender System project, and compares how the same tasks can be accomplished in R.

The data files can be found on GitHub.

Load Movie Names List

Python

# read movie names, dropping the leading id on each line
movieList_ = []
with open("movie_ids.txt") as file:
    for line in file:
        movieName = line.split()[1:]
        movieList_.append(' '.join(movieName))

# no. of movies in dataset
print(len(movieList_))

# check last 5 movies
print(movieList_[-5:])

R

library(stringr)

lines <- readLines("movie_ids.txt")
movieList_ <- word(lines, 2, -1)

# no. of movies in dataset
length(movieList_)

# check last 5 movies
tail(movieList_, 5)

Load Matlab Data File

Python

import scipy.io
import numpy as np

# load data from Matlab file
data = scipy.io.loadmat('ex8_movies.mat')
# transpose the matrix Y & R
Y_ = data['Y'].astype(int).T
R_ = data['R'].astype(int).T

# check dimensions and sample contents
print(Y_.shape)
print(Y_[-5:])
print(R_.shape)
print(R_[-5:])

R

library(R.matlab)

# load data from Matlab file
data <- readMat("ex8_movies.mat")
# transpose the matrix Y & R
Y_ <- t(data$Y) 
R_ <- t(data$R)

# check dimensions and sample contents
dim(Y_)
tail(Y_, 5)
dim(R_)
tail(R_, 5)

Shortlist 100 Movies

Python

# shortlist top 100 most rated movies
top100 = np.argsort(-np.sum(R_, axis=0))[:100]
movieList = [movieList_[i] for i in top100]
Y = Y_[:, top100]
R = R_[:, top100]

# check dimensions
print(R.shape)

# top 10 movies in our shortlist
movieList[:10]

R

top100 <- order(-colSums(R_))[1:100]
movieList <- movieList_[top100]
Y <- Y_[ ,top100]
R <- R_[ ,top100]

# check dimensions
dim(R)

# top 10 movies in our shortlist
movieList[1:10]

Find the Terminator’s Fan

Python

# find the Terminator fan with least rating activities
# 31 - Terminator, The (1984)
# 38 - Terminator 2: Judgment Day (1991)

# shortlist the user ids who have rating >=4 in the two movies
terminator_fans = np.where((Y[:, 31] >= 4) & (Y[:, 38] >= 4))[0]

# find the index in terminator_fans with the least rating count
fan_id = np.sum(R[terminator_fans], axis=1).argsort()[0]

# locate the user
uid = terminator_fans[fan_id]
print(uid)

# check the movies he has rated
[(i, movieList[i], Y[uid, i]) for i in np.argsort(-Y[uid])[:10]]

R

# find the Terminator fan with least rating activities
# 32 - Terminator, The (1984)
# 39 - Terminator 2: Judgment Day (1991)

# shortlist the user ids who have rating >=4 in the two movies
terminator_fans <- which(Y[,32] >= 4 & Y[,39] >= 4)

# find the index in terminator_fans with the least rating count
fan_id <- order(rowSums(R[terminator_fans, ]))[1]

# locate the user
uid <- terminator_fans[fan_id]
uid

# check the movies he has rated
movieList[order(-Y[uid, ])[1:10]]
Y[uid, order(-Y[uid, ])[1:10]]

Sunday, July 29, 2018

Covariance, Correlation Coefficient and R-squared

Covariance

The covariance measures the degree to which two random variables X and Y change together.

Cov(X, Y) = E[(X - E(X))(Y - E(Y))]

= E[XY - XE(Y) - YE(X) + E(X)E(Y)]

= E(XY) - 2E(X)E(Y) + E(X)E(Y)

= E(XY) - E(X)E(Y)
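The identity above is easy to check numerically. The sketch below (the data and names are my own, not from the project) compares E(XY) - E(X)E(Y) against NumPy's population covariance:

```python
import numpy as np

# quick numerical check of Cov(X, Y) = E(XY) - E(X)E(Y)
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# identity form
cov_identity = np.mean(x * y) - np.mean(x) * np.mean(y)
# population covariance from NumPy (bias=True divides by N)
cov_numpy = np.cov(x, y, bias=True)[0, 1]
```

The two values agree up to floating-point error, since `np.cov` with `bias=True` computes exactly the population covariance.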

Correlation Coefficient ( R )

The correlation coefficient, or Pearson's R, standardises the covariance and constrains its value between -1 and +1. The two random variables X and Y have a strong negative correlation if R < -0.5, and a strong positive correlation if R > +0.5.

R(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}
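This standardisation can also be verified in NumPy; in this sketch (my own illustrative data, not from the project) the manual ratio is compared against `np.corrcoef`:

```python
import numpy as np

# Pearson's R as standardised covariance, checked against np.corrcoef
rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = -0.8 * x + rng.normal(size=50_000)   # built to correlate negatively

cov = np.mean(x * y) - np.mean(x) * np.mean(y)    # population covariance
r_manual = cov / np.sqrt(np.var(x) * np.var(y))   # population variances
r_numpy = np.corrcoef(x, y)[0, 1]
```

Because `y` is constructed with a negative slope, the coefficient comes out well below -0.5, a strong negative correlation under the rule of thumb above.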

Generating Correlated Random Variables

Suppose X is a standard normal random variable. We can generate another correlated standard normal random variable Z, with correlation coefficient \rho, like this:

Z = \rho X + \sqrt{1 - \rho^2} Y

where Y is a standard normal random variable independent of X.

Proof: since X and Y are standard normal,

E(X) = E(Y) = 0

Var(X) = Var(Y) = 1

From Var(X) = E(X^2) - E(X)^2 = 1, we get E(X^2) = E(Y^2) = 1.

By independence, E(XY) = E(X)E(Y) = 0.

Z is standard normal, since

Var(Z) = Var(\rho X + \sqrt{1 - \rho^2} Y) = \rho^2 Var(X) + (1 - \rho^2) Var(Y) = 1

And since Var(Z) = Var(X) = 1, the correlation reduces to the covariance:

R(Z, X) = Cov(\rho X + \sqrt{1 - \rho^2} Y, X)

= E[X(\rho X + \sqrt{1 - \rho^2} Y)] - E(X) E(\rho X + \sqrt{1 - \rho^2} Y)

= \rho E(X^2) + \sqrt{1 - \rho^2} E(XY) = \rho

For generating more than two correlated random variables, refer to Cholesky Decomposition.
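The two-variable construction can be sketched in NumPy like this (the sample size, seed, and choice of \rho are my own):

```python
import numpy as np

# generate Z = rho*X + sqrt(1 - rho^2)*Y from independent standard normals
rho = 0.7
n = 200_000
rng = np.random.default_rng(42)
x = rng.normal(size=n)
y = rng.normal(size=n)
z = rho * x + np.sqrt(1 - rho**2) * y

r = np.corrcoef(z, x)[0, 1]   # empirical correlation, close to rho
```

With a sample this large, the empirical correlation lands within about 0.01 of the target \rho, and the variance of `z` stays close to 1, as the proof predicts.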

R-squared

R-squared, as the name implies, can be calculated by squaring the correlation coefficient R. The result ranges from 0 to 1.

In regression, R-squared can also be calculated by:

R^2 = \frac{SST - SSE}{SST}

where

SSE = \sum_i (y_i - \hat{y_i})^2

SST = \sum_i (y_i - \bar{y})^2

It measures the reduction in residual error from using the regression line instead of the mean line, or simply how well the data points fit the regression line or curve.

It also represents the proportion of variation in the dependent variable explained by the independent variable.
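For a simple linear fit, the two ways of computing R-squared agree exactly. A small sketch (my own synthetic data, not from the project):

```python
import numpy as np

# R-squared two ways on a simple linear fit:
# (1) (SST - SSE) / SST from the fitted line, (2) squared Pearson's R
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=1_000)
y = 2.0 * x + rng.normal(scale=3.0, size=1_000)

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
r2_from_sums = (sst - sse) / sst
r2_from_r = np.corrcoef(x, y)[0, 1] ** 2
```

The equality of the two values is a property of ordinary least squares with an intercept; for other models, the sum-of-squares form is the one that generalises.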

Tuesday, July 24, 2018

Installing Jupyter Notebook for R in Multi-users Environment

Beforehand, I assume the following steps were done.

  1. Install Anaconda 3.6
  2. Install R 3.3, RGui, and RStudio, etc.

Don’t Try This

conda update anaconda  
conda install -c r r-essentials

There are web pages that teach you to install it this way.
Doing so will duplicate your libraries in the R environment, and the R kernel cannot start in Jupyter Notebook.

Install with IRkernel

In RGui or RStudio, run the following:

install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest')) 

devtools::install_github('IRkernel/IRkernel')

This is important

Check your R library path in RGui or RStudio:

.libPaths()

In my case, this output:

[1] "C:/Users/Ken Chan/Documents/R/win-library/3.3"
[2] "C:/Program Files/R/R-3.3.3/library"

Copy the first path. In command line or bash console, create an environment variable.

export R_LIBS_PATH="C:/Users/Ken Chan/Documents/R/win-library/3.3"

Start the console with R --no-save, and run:

IRkernel::installspec()

You are Done!

When you restart your Jupyter Notebook, you will see a new R kernel under the menu Kernel -> Change kernel.

Written with StackEdit.
