Monday, July 30, 2018

Demonstration of Sorting and Subsetting in Python and R

This post takes some Python code snippets from my previous Movie Recommender System project and compares how the same tasks can be accomplished in R.

The data files can be found on GitHub.

Load Movie Names List

Python

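movieList_ = []
# the with-block closes the file automatically
with open("movie_ids.txt") as file:
    for line in file:
        # drop the leading id token and keep the rest as the movie name
        movieName = line.split()[1:]
        movieList_.append(' '.join(movieName))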

# no. of movies in dataset
print(len(movieList_))

# check last 5 movies
print(movieList_[-5:])

R

library(stringr)

lines <- readLines("movie_ids.txt")
movieList_ <- word(lines, 2, -1)

# no. of movies in dataset
length(movieList_)

# check last 5 movies
tail(movieList_, 5)
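
To make the parsing step concrete, here is a minimal sketch that runs on three sample lines in the same "id title" layout. The sample lines and the helper name `parse_titles` are assumptions for illustration and are not part of the original project.

```python
def parse_titles(lines):
    """Drop the leading id token from each line and keep the rest as the title."""
    titles = []
    for line in lines:
        tokens = line.split()                # split on any whitespace
        titles.append(' '.join(tokens[1:]))  # everything after the id
    return titles


# hypothetical sample lines in the "id title" layout
sample_lines = [
    "1 Toy Story (1995)\n",
    "2 GoldenEye (1995)\n",
    "3 Four Rooms (1995)\n",
]

sample_titles = parse_titles(sample_lines)
print(len(sample_titles))
print(sample_titles[-2:])
```

The R version does the same thing in one call: `word(lines, 2, -1)` keeps everything from the second word to the last.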

Load Matlab Data File

Python

import scipy.io
import numpy as np

# load data from Matlab file
data = scipy.io.loadmat('ex8_movies.mat')
# transpose Y and R so that rows are users and columns are movies
Y_ = data['Y'].astype(int).T
R_ = data['R'].astype(int).T

# check dimensions and sample contents
print(Y_.shape)
print(Y_[-5:])
print(R_.shape)
print(R_[-5:])

R

library(R.matlab)

# load data from Matlab file
data <- readMat("ex8_movies.mat")
# transpose Y and R so that rows are users and columns are movies
Y_ <- t(data$Y)
R_ <- t(data$R)

# check dimensions and sample contents
dim(Y_)
tail(Y_, 5)
dim(R_)
tail(R_, 5)

Shortlist 100 Movies

Python

# shortlist top 100 most rated movies
top100 = np.argsort(-np.sum(R_, axis=0))[:100]
movieList = [movieList_[i] for i in top100]
Y = Y_[:, top100]
R = R_[:, top100]

# check dimensions
print(R.shape)

# top 10 movies in our shortlist
movieList[:10]

R

# shortlist top 100 most rated movies
top100 <- order(-colSums(R_))[1:100]
movieList <- movieList_[top100]
Y <- Y_[ ,top100]
R <- R_[ ,top100]

# check dimensions
dim(R)

# top 10 movies in our shortlist
movieList[1:10]
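
The shortlisting idiom is easier to follow on a small matrix. The sketch below uses a made-up 4 x 5 indicator matrix (`R_toy` is an assumption, not the real data) and picks the two most-rated columns. It passes `kind="stable"` to `np.argsort` so that ties keep their original order, which is also how R's `order` behaves by default; the original snippet uses NumPy's default sort, where the order among tied movies is not guaranteed.

```python
import numpy as np

# hypothetical indicator matrix: 4 users (rows) x 5 movies (columns),
# 1 means the user rated the movie
R_toy = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
])

# number of ratings per movie (sum down each column)
counts = R_toy.sum(axis=0)

# negate so that the largest counts sort first, then keep the first 2
top2 = np.argsort(-counts, kind="stable")[:2]

# subset the columns, exactly like Y_[:, top100] above
R_sub = R_toy[:, top2]

print(counts)
print(top2)
print(R_sub.shape)
```

Keep in mind that `top2` holds 0-based column positions; the equivalent R result from `order` would be each of these values plus one.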

Find the Terminator’s Fan

Python

# find the Terminator fan with the least rating activity
# (the numbers below are 0-based column indices in the shortlist)
# 31 - Terminator, The (1984)
# 38 - Terminator 2: Judgment Day (1991)

# shortlist the ids of users who rated both movies >= 4
terminator_fans = np.where((Y[:, 31] >= 4) & (Y[:, 38] >= 4))[0]

# find the index in terminator_fans with the least rating count
fan_id = np.sum(R[terminator_fans], axis=1).argsort()[0]

# locate the user
uid = terminator_fans[fan_id]
print(uid)

# check the user's 10 highest-rated movies
[(i, movieList[i], Y[uid, i]) for i in np.argsort(-Y[uid])[:10]]

R

# find the Terminator fan with the least rating activity
# (the numbers below are 1-based column indices in the shortlist)
# 32 - Terminator, The (1984)
# 39 - Terminator 2: Judgment Day (1991)

# shortlist the ids of users who rated both movies >= 4
terminator_fans <- which(Y[,32] >= 4 & Y[,39] >= 4)

# find the index in terminator_fans with the least rating count
fan_id <- order(rowSums(R[terminator_fans, ]))[1]

# locate the user
uid <- terminator_fans[fan_id]
uid

# check the user's 10 highest-rated movies
movieList[order(-Y[uid, ])[1:10]]
Y[uid, order(-Y[uid, ])[1:10]]
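
The two-step lookup (position within the fan subset, then back to the user id) is the part that is easiest to get wrong, so here is a small sketch on invented data. `Y_toy` is a hypothetical ratings matrix, and treating 0 as "not rated" to build `R_toy` is an assumption made for this example. It uses `argmin` where the snippets above use `argsort()[0]`; both return the position of the smallest count.

```python
import numpy as np

# hypothetical ratings: 5 users (rows) x 4 movies (columns), 0 = not rated
Y_toy = np.array([
    [5, 4, 3, 2],
    [4, 5, 0, 0],
    [2, 5, 1, 0],
    [5, 5, 4, 0],
    [0, 0, 3, 3],
])
R_toy = (Y_toy > 0).astype(int)  # assumed indicator: 1 where a rating exists

# users who rated both movie 0 and movie 1 at 4 or above
fans = np.where((Y_toy[:, 0] >= 4) & (Y_toy[:, 1] >= 4))[0]

# rating count for each fan; positions here index into fans, not into Y_toy
activity = R_toy[fans].sum(axis=1)

# position of the least active fan inside the subset
pos = activity.argmin()

# map that position back to the row index in the full matrix
uid = fans[pos]

print(fans)
print(activity)
print(uid)
```

The same mapping applies in R, where `terminator_fans[fan_id]` turns the position within the subset back into a row number; the only difference is that R's row numbers start at 1.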
