Movie Recommendation System Part 2¶


Context¶


Online streaming platforms like Netflix maintain large movie repositories. If we can build a recommendation system that suggests movies to users based on their historical interactions with movies, this would improve customer satisfaction, and increased customer satisfaction in turn increases the company's revenue. The techniques we will learn here are not limited to movies; they can be applied to any item for which a recommendation system can be built.


Objective¶


Using the datasets described below, we will build two different types of recommendation systems:

  • Clustering-based recommendation system.
  • Content-based recommendation system.

Dataset¶


We will use the following three datasets for this case study:

  • ratings dataset - This dataset contains the following attributes:

    • userId
    • movieId
    • rating
    • timestamp
  • movies dataset - This dataset contains the following attributes:

    • movieId
    • title
    • genres
  • tags dataset - This dataset contains the following attributes:

    • userId
    • movieId
    • tag
    • timestamp

Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use Google Colab for this case study.

Let's start by mounting Google Drive on Colab.

In [50]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Importing the necessary libraries and overview of the dataset¶

In [51]:
import warnings                                 # Used to ignore the warning given as output of the code
warnings.filterwarnings('ignore')

import numpy as np                              # Basic libraries of python for numeric and dataframe computations
import pandas as pd

import matplotlib.pyplot as plt                 # Basic library for data visualization
import seaborn as sns                           # Slightly advanced library for data visualization

from collections import defaultdict             # A dictionary that does not raise a key error

from sklearn.metrics.pairwise import cosine_similarity # To find the similarity between two vectors

from sklearn.metrics import mean_squared_error  # A performance metric in sklearn

Loading the data¶

In [52]:
# Loading the movies dataset
movies = pd.read_csv('/content/drive/MyDrive/movies.csv')

# Let us see the first five records of the dataset
movies.head()
Out[52]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
In [53]:
# Shape of the DataFrame
movies.shape
Out[53]:
(9742, 3)
In [54]:
# Loading the ratings dataset
ratings = pd.read_csv('/content/drive/MyDrive/ratings.csv')
In [55]:
# Shape of the ratings dataset
ratings.shape
Out[55]:
(100836, 4)

Let's merge both the datasets to get the title and rating of each movie in a single DataFrame.

In [56]:
# Merging datasets on movieId 
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title']], on = 'movieId', how = 'inner')

# See the first five records of the dataset
ratings_with_title.head()
Out[56]:
userId movieId rating timestamp title
0 1 1 4.0 964982703 Toy Story (1995)
1 5 1 4.0 847434962 Toy Story (1995)
2 7 1 4.5 1106635946 Toy Story (1995)
3 15 1 2.5 1510577970 Toy Story (1995)
4 17 1 4.5 1305696483 Toy Story (1995)

Let's check the info of the data.

In [57]:
# Checking info of the merged dataset
ratings_with_title.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 4.6+ MB
  • There are 100,836 observations and 5 columns in the data.
  • All the columns are numeric except the title column, which is of object data type.
  • The timestamp column is of int64 data type. We could convert it to DateTime format, but we don't need the timestamp for our analysis. Hence, we can drop this column.
In [58]:
# Dropping the timestamp column
rating = ratings_with_title.drop(['timestamp'], axis = 1)
In [59]:
# Calculating average ratings
average_rating = rating.groupby('movieId').mean()['rating']

# Calculating the count of ratings
count_rating = rating.groupby('movieId').count()['rating']

# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating': average_rating, 'rating_count': count_rating})
In [60]:
# See the first five records of the final_rating dataset
final_rating.head()
Out[60]:
avg_rating rating_count
movieId
1 3.920930 215
2 3.431818 110
3 3.259615 52
4 2.357143 7
5 3.071429 49

Exploring the dataset¶

Let's explore the dataset and answer some basic data-related questions.

Question 1: How many unique users are present in the above dataset?¶

In [61]:
# Find the number of unique users
ratings_with_title['userId'].nunique()
Out[61]:
610
  • There are 610 unique users in the dataset.

Question 2: What is the total number of unique movies?¶

In [62]:
# Find the number of unique movies
ratings_with_title['title'].nunique()
Out[62]:
9719
  • There are 9719 unique movies in the dataset.

To demonstrate the clustering-based recommendation system, we are going to use the surprise library in this case study.

  • Please use the following code to install the surprise library. You only need to do this once, when running the code for the first time.

!pip install surprise

In [63]:
# Installing the surprise package
!pip install surprise
Requirement already satisfied: surprise in /usr/local/lib/python3.7/dist-packages (0.1)
Requirement already satisfied: scikit-surprise in /usr/local/lib/python3.7/dist-packages (from surprise) (1.1.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.1.0)
Requirement already satisfied: numpy>=1.11.2 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.21.6)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.15.0)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.4.1)
In [64]:
# To compute the accuracy of models
from surprise import accuracy

# Class to parse a file containing ratings, data should be in structure - user; item; rating
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing a clustering-based recommendation system
from surprise import CoClustering

Before building the recommendation systems, let's go over some basic terminology we are going to use:

Relevant item: An item (a movie in this case) whose actual rating is higher than the threshold rating (here 3.5) is relevant, and an item whose actual rating is lower than the threshold rating is a non-relevant item.

Recommended item: An item whose predicted rating is higher than the threshold (here 3.5) is a recommended item, and an item whose predicted rating is lower than the threshold rating is a non-recommended item, i.e., it will not be recommended to the user.

False Negative (FN): It is the frequency of relevant items that are not recommended to the user. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the loss of opportunity for the service provider, which they would like to minimize.

False Positive (FP): It is the frequency of recommended items that are actually not relevant. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in loss of resources for the service provider, which they would also like to minimize.

Recall: It is the fraction of actually relevant items that are recommended to the user, i.e., if out of 10 relevant products, 6 are recommended to the user, then recall is 0.60. The higher the value of recall, the better the model. It is one of the standard metrics for assessing the performance of classification models.

Precision: It is the fraction of recommended items that are actually relevant, i.e., if out of 10 recommended items, 6 are found relevant by the user, then precision is 0.60. The higher the value of precision, the better the model. It is another standard metric for assessing the performance of classification models.

When building a recommendation system, it is customary to assess how many of its recommendations are actually relevant, and vice versa. Below are some of the most commonly used performance metrics for recommendation systems.

Precision@k, Recall@ k, and F1-score@k¶

Precision@k - It is the fraction of recommended items that are relevant in top k predictions. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.

Recall@k - It is the fraction of relevant items that are recommended to the user in top k predictions.

F1-score@k - It is the harmonic mean of Precision@k and Recall@k. When precision@k and recall@k both seem to be important, it is useful to use this metric because it is representative of both of them.
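As a toy illustration of these metrics (hypothetical predicted and actual ratings for a single user, threshold = 3.5, k = 3):

```python
# Hypothetical (predicted_rating, actual_rating) pairs for one user's items
ratings = [(4.5, 5.0), (4.0, 3.0), (3.8, 4.0), (3.0, 4.5), (2.5, 2.0)]

k = 3
threshold = 3.5

# Sort by predicted rating (descending) and take the top k predictions
top_k = sorted(ratings, key=lambda x: x[0], reverse=True)[:k]

# Relevant: actual rating >= threshold; recommended: predicted rating >= threshold
n_rel = sum(actual >= threshold for _, actual in ratings)
n_rec_k = sum(pred >= threshold for pred, _ in top_k)
n_rel_and_rec_k = sum(pred >= threshold and actual >= threshold
                      for pred, actual in top_k)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0   # 2/3
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0          # 2/3
f1_at_k = (2 * precision_at_k * recall_at_k / (precision_at_k + recall_at_k)
           if (precision_at_k + recall_at_k) else 0)

print(round(precision_at_k, 3), round(recall_at_k, 3), round(f1_at_k, 3))
```

Here 2 of the 3 top-k recommendations are actually relevant, and 2 of the 3 relevant items appear in the top k, so precision@3 = recall@3 = F1-score@3 ≈ 0.667.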

Some useful functions¶

  • The function below takes the recommendation model as input and gives the precision@k, recall@k, and F1-score@k for that model.
  • To compute precision and recall, top k predictions are taken under consideration for each user.
  • We will use the precision and recall to compute the F1-score.
In [65]:
def precision_recall_at_k(model, k = 10, threshold = 3.5):
    """Return precision@k and recall@k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)
    
    # Making predictions on the test data (testset is taken from the global scope)
    predictions = model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    # Mean of the per-user precisions
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of the per-user recalls
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
    
    accuracy.rmse(predictions)

    # Printing the overall precision
    print('Precision: ', precision)
    
    # Printing the overall recall
    print('Recall: ', recall)
    
    # Computing the F_1 score (harmonic mean of precision and recall),
    # guarding against division by zero when both are 0
    if (precision + recall) != 0:
        print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))
    else:
        print('F_1 score: ', 0)

Below, we are loading the rating data, currently a pandas DataFrame, into surprise's own format (surprise.dataset.DatasetAutoFolds), which the library requires. To do this, we will be using the Reader and Dataset classes.

In [66]:
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale = (0, 5))

# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)

# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size = 0.2, random_state = 42)

Cluster-Based Recommendation System¶

In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in movies based on how they rate different movies. We cluster similar users together and recommend movies to a user based on ratings from other users in the same cluster.

  • Co-clustering is a set of techniques in cluster analysis. Given some matrix A, we want to cluster the rows and the columns of A simultaneously; this is a common task for user-item matrices.

  • As it clusters both the rows and the columns simultaneously, it is also called bi-clustering. To understand the working of the algorithm, let A be an m x n matrix. The goal is to generate co-clusters: subsets of rows that exhibit similar behavior across subsets of columns, and vice versa.

  • Co-clustering learns two map functions simultaneously: one from rows to row-cluster indexes and one from columns to column-cluster indexes. This differs from other clustering techniques, where we would first cluster the rows and then the columns.
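A minimal sketch of the resulting prediction rule on toy data (hypothetical ratings, with cluster assignments fixed by hand for illustration; the actual algorithm learns these assignments iteratively, and surprise's CoClustering follows essentially this formula):

```python
# Prediction rule sketch:
#   r_hat(u, i) = avg(co-cluster) + (user mean - user-cluster mean)
#               + (item mean - item-cluster mean)

ratings = {  # (user, item) -> rating, toy data
    ('u1', 'i1'): 4.0, ('u1', 'i2'): 5.0,
    ('u2', 'i1'): 3.0, ('u2', 'i2'): 4.0,
    ('u3', 'i3'): 2.0,
}
user_cluster = {'u1': 0, 'u2': 0, 'u3': 1}   # hand-fixed assignments
item_cluster = {'i1': 0, 'i2': 0, 'i3': 1}

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

def predict(u, i):
    uc, ic = user_cluster[u], item_cluster[i]
    # Average rating inside the (user cluster, item cluster) co-cluster
    cocluster_avg = mean(r for (uu, ii), r in ratings.items()
                         if user_cluster[uu] == uc and item_cluster[ii] == ic)
    # User/item means and their respective cluster means
    user_avg = mean(r for (uu, _), r in ratings.items() if uu == u)
    user_cluster_avg = mean(r for (uu, _), r in ratings.items()
                            if user_cluster[uu] == uc)
    item_avg = mean(r for (_, ii), r in ratings.items() if ii == i)
    item_cluster_avg = mean(r for (_, ii), r in ratings.items()
                            if item_cluster[ii] == ic)
    return cocluster_avg + (user_avg - user_cluster_avg) + (item_avg - item_cluster_avg)

print(predict('u1', 'i1'))  # 4.0
```

For the pair (u1, i1), the co-cluster average is 4.0, u1 rates 0.5 above its user cluster, and i1 is rated 0.5 below its item cluster, so the prediction comes out to 4.0.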

In [67]:
# Using CoClustering algorithm
clust_baseline = CoClustering(random_state = 1)

# Training the algorithm on the train set
clust_baseline.fit(trainset)

# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(clust_baseline)
RMSE: 0.9490
Precision:  0.717
Recall:  0.502
F_1 score:  0.591
  • We have calculated the RMSE to check how far the overall predicted ratings are from the actual ratings.
  • The F_1 score of the baseline model is ~0.59. It indicates that most of the recommended movies were relevant, and most of the relevant movies were recommended. We will try to improve this later by tuning different hyperparameters of this algorithm using GridSearchCV.

Now, let's predict the rating for the user with userId = 4 and the movie with movieId = 10 as shown below. Here, the user has already interacted with (watched) the movie with movieId 10.

In [68]:
# Making prediction for userId 4 and movieId 10
clust_baseline.predict(4, 10, r_ui = 4, verbose = True)
user: 4          item: 10         r_ui = 4.00   est = 3.68   {'was_impossible': False}
Out[68]:
Prediction(uid=4, iid=10, r_ui=4, est=3.6757402992691386, details={'was_impossible': False})
  • The actual rating for this user-item pair is 4, and the rating predicted by the Co-clustering model is 3.68, which is close to the actual rating. The model has slightly underestimated the rating. We will try to fix this later by tuning the hyperparameters of the model using GridSearchCV.

Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, as shown below.

In [69]:
# Making prediction for userId 4 and movieId 3
clust_baseline.predict(4, 3, verbose = True)
user: 4          item: 3          r_ui = None   est = 3.26   {'was_impossible': False}
Out[69]:
Prediction(uid=4, iid=3, r_ui=None, est=3.258169827544438, details={'was_impossible': False})

Improving clustering-based recommendation system by tuning its hyperparameters¶

Below, we will be tuning hyperparameters for the CoClustering algorithm. Let's try to understand the different hyperparameters of this algorithm.

  • n_cltr_u (int) – Number of user clusters. The default value is 3.
  • n_cltr_i (int) – Number of item clusters. The default value is 3.
  • n_epochs (int) – Number of iterations of the optimization loop. The default value is 20.
  • random_state (int, RandomState instance from NumPy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from NumPy is used. The default value is None.
  • verbose (bool) – If True, the current epoch will be printed. The default value is False.
In [70]:
# Set the parameter space to tune
param_grid = {'n_cltr_u': [3, 4, 5, 6], 'n_cltr_i': [3, 4, 5, 6], 'n_epochs': [30, 40, 50]}

# Performing 3-Fold gridsearch cross-validation
gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting data
gs.fit(data)

# Printing the best RMSE score
print(gs.best_score['rmse'])

# Printing the combination of parameters that gives the best RMSE score
print(gs.best_params['rmse'])
0.9570102979396152
{'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 30}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.

We will build the final model using the tuned hyperparameter values obtained from the grid search cross-validation above.

In [71]:
# Using tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u = 3, n_cltr_i = 3, n_epochs = 30, random_state = 1)

# Training the algorithm on the train set
clust_tuned.fit(trainset)

# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(clust_tuned)
RMSE: 0.9499
Precision:  0.715
Recall:  0.5
F_1 score:  0.588
  • We can see that the F_1 score for tuned co-clustering model is slightly lower than the F_1 score for the baseline Co-clustering model.

Let's now predict the rating for the user with userId = 4 and for the movie with movieId = 10 as shown below. Here, the user has already rated the movie.

In [72]:
# Using co-clustering_optimized model to recommend for userId 4 and movieId 10
clust_tuned.predict(4, 10, r_ui = 4, verbose = True)
user: 4          item: 10         r_ui = 4.00   est = 3.65   {'was_impossible': False}
Out[72]:
Prediction(uid=4, iid=10, r_ui=4, est=3.6549811878747214, details={'was_impossible': False})

Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, as shown below.

In [73]:
# Using Co-clustering based optimized model
clust_tuned.predict(4, 3, verbose = True)
user: 4          item: 3          r_ui = None   est = 3.23   {'was_impossible': False}
Out[73]:
Prediction(uid=4, iid=3, r_ui=None, est=3.2318175022354345, details={'was_impossible': False})

Implementing the recommendation algorithm based on the optimized CoClustering model¶

Below we will be implementing a function whose input parameters are:

  • data: A rating dataset
  • user_id: The user id for which we want recommendations
  • top_n: The number of movies we want to recommend
  • algo: The algorithm we want to use for predicting the ratings

The function returns the top_n items recommended for the given user id based on the given algorithm.
In [74]:
def get_recommendations(data, user_id, top_n, algo):
    
    # Creating an empty list to store the recommended movie IDs
    recommendations = []
    
    # Creating an user-item interactions matrix 
    user_item_interactions_matrix = data.pivot(index = 'userId', columns = 'movieId', values = 'rating')
    
    # Extracting those movie IDs which the userId has not interacted yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # Looping through each of the movie IDs which userId has not interacted yet
    for item_id in non_interacted_movies:
        
        # Predicting the ratings for those non interacted movie IDs by this user
        est = algo.predict(user_id, item_id).est
        
        # Appending the predicted ratings
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    # Returning the top n highest predicted rating movies for this user
    return recommendations[:top_n]
In [75]:
# Getting top 5 recommendations for userId 4 using Co-clustering based optimized algorithm
clustering_recommendations = get_recommendations(rating, 4, 5, clust_tuned)

Correcting the Ratings and Ranking the above movies¶

While comparing two movies, the predicted rating alone does not capture how much a user is likely to enjoy a movie; the number of users who have rated that movie also matters, because an average based on only a few ratings is less reliable. Due to this, we calculate a "corrected_rating" for each movie by subtracting a penalty that is inversely proportional to the square root of the movie's rating_count, so that movies rated by only a handful of users are ranked more conservatively.
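As a quick numeric sketch of this correction (hypothetical predicted ratings and rating counts):

```python
import math

def corrected_rating(predicted_rating, rating_count):
    # Penalize movies rated by few users: the smaller the rating_count,
    # the larger the 1 / sqrt(n) penalty
    return predicted_rating - 1 / math.sqrt(rating_count)

# A movie predicted at 4.0 but rated by only 3 users...
low_support = corrected_rating(4.0, 3)      # ≈ 3.42
# ...now ranks below a movie predicted at 3.9 but rated by 400 users
high_support = corrected_rating(3.9, 400)   # = 3.85
print(round(low_support, 2), round(high_support, 2))
```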

In [76]:
def ranking_movies(recommendations, final_rating):
  
    # Sort the movies based on ratings count
    ranked_movies = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending = False)[['rating_count']].reset_index()

    # Merge with the recommended movies to get predicted ratings
    ranked_movies = ranked_movies.merge(pd.DataFrame(recommendations, columns = ['movieId', 'predicted_ratings']), on = 'movieId', how = 'inner')

    # Rank the movies based on corrected ratings
    ranked_movies['corrected_ratings'] = ranked_movies['predicted_ratings'] - 1 / np.sqrt(ranked_movies['rating_count'])

    # Sort the movies based on corrected ratings
    ranked_movies = ranked_movies.sort_values('corrected_ratings', ascending = False)

    return ranked_movies

Note: In the corrected-rating formula above, we could add the quantity 1 / np.sqrt(n) instead of subtracting it to get more optimistic predictions. But here we subtract it because some movies have a rating of 5, and a movie's rating cannot exceed 5.

In [77]:
# Ranking movies based on the above recommendations
ranking_movies(clustering_recommendations, final_rating)
Out[77]:
movieId rating_count predicted_ratings corrected_ratings
0 304 3 5 4.422650
1 53 2 5 4.292893
2 99 2 5 4.292893
3 238 2 5 4.292893
4 148 1 5 4.000000

Let us now move to the final recommendation algorithm: the content-based recommendation system.

Content-Based Recommendation System¶

In a content-based recommendation system, we will use a text feature to find similar movies.

Text data generally contains punctuation, stopwords, and non-ASCII characters, which make it very noisy. So, we will first need to pre-process the text, and then we will generate features from it to compute similarities between the texts.

Let's load the tags dataset.

In [78]:
# Importing the tags data
tags = pd.read_csv('/content/drive/MyDrive/tags.csv')
tags.head()
Out[78]:
userId movieId tag timestamp
0 2 60756 funny 1445714994
1 2 60756 Highly quotable 1445714996
2 2 60756 will ferrell 1445714992
3 2 89774 Boxing story 1445715207
4 2 89774 MMA 1445715200

In this dataset, we don't have any movie reviews or plot summaries, so we will combine the title and genres columns from the other datasets with the tag column from the tags dataset to create a text feature. We will then apply the TF-IDF technique to extract features from this text, which we will later use to find similar movies.

In [79]:
# Merging all the three datasets on movieId
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title', 'genres']], on = 'movieId' )

final_ratings = pd.merge(ratings_with_title, tags[['movieId', 'tag']], on = 'movieId' )

# Let us see the dataset
final_ratings
Out[79]:
userId movieId rating timestamp title genres tag
0 1 1 4.0 964982703 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar
1 1 1 4.0 964982703 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar
2 1 1 4.0 964982703 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy fun
3 5 1 4.0 847434962 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar
4 5 1 4.0 847434962 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar
... ... ... ... ... ... ... ...
233208 599 176419 3.5 1516604655 Mother! (2017) Drama|Horror|Mystery|Thriller uncomfortable
233209 599 176419 3.5 1516604655 Mother! (2017) Drama|Horror|Mystery|Thriller unsettling
233210 594 7023 4.5 1108972356 Wedding Banquet, The (Xi yan) (1993) Comedy|Drama|Romance In Netflix queue
233211 606 6107 4.0 1171324428 Night of the Shooting Stars (Notte di San Lore... Drama|War World War II
233212 606 6516 3.5 1171755910 Anastasia (1956) Drama In Netflix queue

233213 rows × 7 columns

  • We can observe that multiple genres are separated by the | character, which we need to replace with a space.
  • We will combine the three columns title, genres, and tag.
In [80]:
# Replacing | character with space in genres column
final_ratings['genres'] = final_ratings['genres'].apply(lambda x: " ".join(x.split('|')))
In [81]:
# Combining title, genres, and tag columns
final_ratings['text'] = final_ratings['title'] + ' ' + final_ratings['genres'] + ' ' + final_ratings['tag']

final_ratings.head()
Out[81]:
userId movieId rating timestamp title genres tag text
0 1 1 4.0 964982703 Toy Story (1995) Adventure Animation Children Comedy Fantasy pixar Toy Story (1995) Adventure Animation Children ...
1 1 1 4.0 964982703 Toy Story (1995) Adventure Animation Children Comedy Fantasy pixar Toy Story (1995) Adventure Animation Children ...
2 1 1 4.0 964982703 Toy Story (1995) Adventure Animation Children Comedy Fantasy fun Toy Story (1995) Adventure Animation Children ...
3 5 1 4.0 847434962 Toy Story (1995) Adventure Animation Children Comedy Fantasy pixar Toy Story (1995) Adventure Animation Children ...
4 5 1 4.0 847434962 Toy Story (1995) Adventure Animation Children Comedy Fantasy pixar Toy Story (1995) Adventure Animation Children ...

Now, we will keep only the userId, movieId, rating, title, and text columns. We will drop the duplicate titles from the data and make the title column the index of the DataFrame.

In [82]:
# Create the final_ratings dataset with specified columns
final_ratings = final_ratings[['userId', 'movieId', 'rating', 'title', 'text']]

# Let us drop the duplicate records
final_ratings = final_ratings.drop_duplicates(subset = ['title'])

# Set the index
final_ratings = final_ratings.set_index('title')

# See the first five records of the dataset
final_ratings.head()
Out[82]:
userId movieId rating text
title
Toy Story (1995) 1 1 4.0 Toy Story (1995) Adventure Animation Children ...
Grumpier Old Men (1995) 1 3 4.0 Grumpier Old Men (1995) Comedy Romance moldy
Seven (a.k.a. Se7en) (1995) 1 47 5.0 Seven (a.k.a. Se7en) (1995) Mystery Thriller m...
Usual Suspects, The (1995) 1 50 5.0 Usual Suspects, The (1995) Crime Mystery Thril...
Bottle Rocket (1996) 1 101 5.0 Bottle Rocket (1996) Adventure Comedy Crime Ro...
In [83]:
# Let us see the shape of final_ratings data
final_ratings.shape
Out[83]:
(1554, 4)

Now, let's process the text data and create features to find the similarity between movies.

Loading libraries to handle the text data¶

In [84]:
# Importing nltk (natural language toolkit library)
import nltk

# Downloading the punkt tokenizer models
nltk.download('punkt')

# Downloading stopwords
nltk.download('stopwords')

# Downloading wordnet
nltk.download('wordnet') 
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[84]:
True
In [85]:
# Importing the regular expression library
import re

# Word_tokenize is used to do tokenization
from nltk import word_tokenize

# Importing the lemmatizer
from nltk.stem import WordNetLemmatizer

# Importing the stopwords
from nltk.corpus import stopwords

# Tfidf vectorizer used to create the computational vectors
from sklearn.feature_extraction.text import TfidfVectorizer

We will create a function to pre-process the text data. Before that, let's see some terminology.

  • Stopwords: A stopword is a commonly used word (such as “the”, “a”, “an”, or “in”) that carries little information in the text and can be ignored.
  • Lemmatization: Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, runs, running, and ran are all forms of the word run; therefore, run is the lemma of all these words.
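A minimal sketch of both steps (using a toy stopword set and a toy lemma dictionary purely for illustration; the notebook uses NLTK's English stopword list and WordNetLemmatizer instead):

```python
# Toy stopword set and lemma dictionary, for illustration only
stop_words = {'the', 'a', 'an', 'in', 'of', 'is', 'and'}
lemma_map = {'runs': 'run', 'running': 'run', 'ran': 'run', 'movies': 'movie'}

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in stop_words]   # drop stopwords
    return [lemma_map.get(t, t) for t in tokens]          # map words to lemmas

print(preprocess('The running of the movies'))  # ['run', 'movie']
```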
In [86]:
# Create the tokenize function
def tokenize(text):
    
    # Converting to lowercase and removing non-alphabetic characters
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    
    # Extracting each word in the text
    tokens = word_tokenize(text)
    
    # Removing stopwords
    words = [word for word in tokens if word not in stopwords.words("english")]
    
    # Lemmatize the words
    text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]

    return text_lems

Feature Extraction¶

Below are some of the ways to extract features from texts:

  • Bag of words
  • TF-IDF
  • One hot encoding
  • Word vectors


Here, we will be using TF-IDF as a feature extraction technique.
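To see what TF-IDF actually computes, here is a hand-coded sketch on three tiny, hypothetical movie texts (using the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which matches the default behaviour of sklearn's TfidfVectorizer):

```python
import math

docs = [
    'toy story adventure comedy',
    'jumanji adventure fantasy',
    'grumpier comedy romance',
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

def df(term):
    # Document frequency: number of documents containing the term
    return sum(term in doc for doc in tokenized)

def tfidf_row(doc):
    # Raw term count times smoothed IDF, then L2-normalized
    weights = [doc.count(t) * (math.log((1 + n_docs) / (1 + df(t))) + 1)
               for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in tokenized]
```

Because 'adventure' appears in two of the three documents, its weight in the first row is lower than that of document-specific terms like 'toy' and 'story'; rare terms discriminate between documents, so TF-IDF up-weights them.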

In [87]:
# Creating the TF-IDF object
tfidf = TfidfVectorizer(tokenizer = tokenize)

movie_tfidf = tfidf.fit_transform(final_ratings['text'].values).toarray()
In [88]:
# Making the DataFrame of movie_tfidf data
pd.DataFrame(movie_tfidf)
Out[88]:
0 1 2 3 4 5 6 7 8 9 ... 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1549 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1550 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1551 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1552 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1553 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1554 rows × 2783 columns

We have extracted features from the text data. Now, we can find similarities between movies using these features. We will use cosine similarity to calculate the similarity.
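Cosine similarity is just the cosine of the angle between two feature vectors: 1 means the vectors point in the same direction (same terms in the same proportions), 0 means no overlap at all. A hand-coded sketch on hypothetical 3-term vectors:

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

movie_a = [0.8, 0.6, 0.0]   # shares terms with movie_b
movie_b = [0.6, 0.8, 0.0]
movie_c = [0.0, 0.0, 1.0]   # no overlapping terms with movie_a

print(round(cosine(movie_a, movie_b), 3))  # 0.96
print(cosine(movie_a, movie_c))            # 0.0
```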

In [89]:
# Calculating the cosine similarity
similar_movies = cosine_similarity(movie_tfidf, movie_tfidf)

# Let us see the above array
similar_movies
Out[89]:
array([[1.        , 0.02268393, 0.        , ..., 0.02022472, 0.        ,
        0.        ],
       [0.02268393, 1.        , 0.        , ..., 0.04779055, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.02022472, 0.04779055, 0.        , ..., 1.        , 0.00719396,
        0.19617374],
       [0.        , 0.        , 0.        , ..., 0.00719396, 1.        ,
        0.01217017],
       [0.        , 0.        , 0.        , ..., 0.19617374, 0.01217017,
        1.        ]])

Finally, let's create a function to find the most similar movies to recommend for a given movie.

In [90]:
# Function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, similar_movies):
    
    recommended_movies = []
    
    indices = pd.Series(final_ratings.index)
    
    # Getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_movies[idx]).sort_values(ascending = False)

    # Getting the indices of 10 most similar movies
    top_10_indexes = list(score_series.iloc[1 : 11].index)
    print(top_10_indexes)
    
    # Populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(final_ratings.index)[i])
        
    return recommended_movies
In [91]:
recommendations('Usual Suspects, The (1995)', similar_movies)
[71, 1186, 124, 551, 569, 77, 719, 766, 123, 658]
Out[91]:
['Game, The (1997)',
 'Andalusian Dog, An (Chien andalou, Un) (1929)',
 'Town, The (2010)',
 'Now You See Me (2013)',
 'Charade (1963)',
 'Negotiator, The (1998)',
 'Following (1998)',
 '21 Grams (2003)',
 'Inception (2010)',
 'Insomnia (2002)']
  • The movie belongs to the Crime, Mystery, and Thriller genres, and the majority of our recommendations lie in one or more of these genres, which suggests that the resulting recommendation system is working well.

Conclusion¶

  • In this case study, we built recommendation systems using two different algorithms. They are as follows:

    • Clustering-based recommendation system
    • Content-based recommendation system
  • To demonstrate the clustering-based recommendation system, the surprise library has been used. Grid search cross-validation was applied to find the best working model, and the corresponding predictions were made with it.
  • To evaluate the performance of these models, precision@k and recall@k were used in this case study. Using these two metrics, the F_1 score was calculated for each working model.
  • We can try to further improve the performance of these models using hyperparameter tuning.