Online streaming platforms like Netflix have a large number of movies in their repositories. If we can build a recommendation system that suggests relevant movies to users based on their historical interactions, it would improve customer satisfaction and, in turn, the revenue of the platform. The techniques covered here are not limited to movies; they can be applied to any item for which you want to build a recommendation system.
In this case study, we will build the following recommendation systems:
- Rank-based recommendation system
- Similarity-based collaborative filtering (user-user and item-item)
- Model-based collaborative filtering (matrix factorization using SVD)
To demonstrate the above techniques, we are going to use the ratings dataset.
The ratings dataset contains the following attributes:
- userId: unique identifier of the user who gave the rating
- movieId: unique identifier of the movie being rated
- rating: the rating given by the user to the movie
- timestamp: the time at which the rating was recorded
Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use Google Colab for this case study.
Let's start by mounting the Google drive on Colab.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import warnings # Used to ignore the warnings given as output of the code
warnings.filterwarnings('ignore')
import numpy as np # Basic Python library for numeric computations
import pandas as pd # Basic Python library for dataframe computations
import matplotlib.pyplot as plt # Basic library for data visualization
import seaborn as sns # Slightly advanced library for data visualization
from collections import defaultdict # A dictionary that does not raise a KeyError for missing keys
from sklearn.metrics import mean_squared_error # A performance metric from sklearn
# Importing the "ratings.csv" dataset
rating = pd.read_csv('/content/drive/MyDrive/ratings.csv')
Let's check the info of the data.
# Info of the dataframe
rating.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userId     100836 non-null  int64
 1   movieId    100836 non-null  int64
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
# drop() is a method used to remove the desired columns/rows from a dataframe
rating = rating.drop(['timestamp'], axis = 1)
Now, let us see the top five records of the rating data.
# head() is a method used to display the first n records of a dataframe, by default n=5.
rating.head()
|  | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |
Let's explore the dataset and answer some basic data-related questions:
# Finding number of unique users by using nunique method
rating['userId'].nunique()
610
# Finding the number of unique movies
rating['movieId'].nunique()
9724
# Counting the number of ratings for each user-movie pair
rating.groupby(['userId', 'movieId']).count()
|  |  | rating |
|---|---|---|
| userId | movieId |  |
| 1 | 1 | 1 |
|  | 3 | 1 |
|  | 6 | 1 |
|  | 47 | 1 |
|  | 50 | 1 |
| ... | ... | ... |
| 610 | 166534 | 1 |
|  | 168248 | 1 |
|  | 168250 | 1 |
|  | 168252 | 1 |
|  | 170875 | 1 |
100836 rows × 1 columns
# Finding the sum of ratings count by user-movie pair
rating.groupby(['userId', 'movieId']).count()['rating'].sum()
100836
# Counting the number of people who have watched a certain movie
rating['movieId'].value_counts()
356 329
318 317
296 307
593 279
2571 278
...
86279 1
86922 1
5962 1
87660 1
163981 1
Name: movieId, Length: 9724, dtype: int64
The movie with movieId 356 has the highest number of interactions (329). Out of these 329 interactions, we also need to consider the distribution of ratings to check whether this movie is the most liked or the most disliked movie.
# Plotting distributions of ratings for 329 interactions with movieid 356
# Let us fix the size of the figure
plt.figure(figsize = (7, 7))
rating[rating['movieId'] == 356]['rating'].value_counts().plot(kind = 'bar')
# This gives a label to the variable on the x-axis
plt.xlabel('Rating')
# This gives a label to the variable on the y-axis
plt.ylabel('Count')
# This displays the plot
plt.show()
# Counting the number of movies each user has watched
rating['userId'].value_counts()
414 2698
599 2478
474 2108
448 1864
274 1346
...
442 20
569 20
320 20
576 20
53 20
Name: userId, Length: 610, dtype: int64
# Finding user-movie interactions distribution
count_interactions = rating.groupby('userId').count()['movieId']
count_interactions
userId
1 232
2 29
3 39
4 216
5 44
...
606 1115
607 187
608 831
609 37
610 1302
Name: movieId, Length: 610, dtype: int64
Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we face the cold start problem. Cold start refers to the situation where a new user enters the system and the machine is not able to recommend movies to them, as the user has no historical interactions available in the dataset. In those cases, we can use a rank-based recommendation system to recommend movies to the new user.
To build the rank-based recommendation system, we take the average of all the ratings provided to each movie and then rank them based on their average rating.
# Calculate average ratings for each movie
average_rating = rating.groupby('movieId').mean()['rating']
# Calculate the count of ratings for each movie
count_rating = rating.groupby('movieId').count()['rating']
# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating': average_rating, 'rating_count': count_rating})
# First 5 records of the final_rating dataset
final_rating.head()
| avg_rating | rating_count | |
|---|---|---|
| movieId | ||
| 1 | 3.920930 | 215 |
| 2 | 3.431818 | 110 |
| 3 | 3.259615 | 52 |
| 4 | 2.357143 | 7 |
| 5 | 3.071429 | 49 |
Now, let's create a function to find the top n movies for a recommendation based on the average ratings of movies. We can also add a threshold for a minimum number of interactions for a movie to be considered for recommendation.
# It gives top n movies among those being watched for more than min_interactions
def top_n_movies(data, n, min_interaction = 100):
# Finding movies with interactions greater than the minimum number of interactions
recommendations = data[data['rating_count'] > min_interaction]
# Sorting values with respect to the average rating
recommendations = recommendations.sort_values(by = 'avg_rating', ascending = False)
return recommendations.index[:n]
We can use this function with different n's and minimum interactions to get movies to be recommended.
list(top_n_movies(final_rating, 5, 50))
[318, 858, 2959, 1276, 750]
list(top_n_movies(final_rating, 5, 100))
[318, 858, 2959, 1221, 48516]
list(top_n_movies(final_rating, 5, 200))
[318, 2959, 50, 260, 527]
Now that we have seen how to apply the rank-based recommendation system, let's move on to collaborative filtering based recommendation systems.
In this type of recommendation system, we do not need any additional information about the user or the item; user-item interaction data alone is enough to build a collaborative recommendation system. For example, given an interaction matrix, if users B and C have rated many of the same movies similarly, a movie that user B liked, such as "The Terminal", is likely to be a good recommendation for user C.
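To make this idea concrete, below is a minimal sketch (not part of the original workflow) that computes the cosine similarity between the rating vectors of two users; the users with userId 1 and 2 are arbitrary examples, and missing ratings are filled with 0 purely for this illustration.
# Sketch: cosine similarity between two users' rating vectors (illustrative only)
from sklearn.metrics.pairwise import cosine_similarity
# Build a user-item matrix from the rating dataframe; unrated movies become 0 for this illustration
user_item = rating.pivot(index = 'userId', columns = 'movieId', values = 'rating').fillna(0)
# Cosine similarity between the rating vectors of userId 1 and userId 2
print(cosine_similarity(user_item.loc[[1]], user_item.loc[[2]])[0][0])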
For the similarity-based models in this case study, we compute cosine similarity and use KNN to find the users (or items) that are the nearest neighbors of a given user (or item). We will use the surprise library to build the remaining models, so let's first install it and import the necessary classes and functions from this library. The installation only needs to be done once, while running the code for the first time.
# Installing the surprise library
!pip install surprise
Requirement already satisfied: surprise in /usr/local/lib/python3.7/dist-packages (0.1)
Requirement already satisfied: scikit-surprise in /usr/local/lib/python3.7/dist-packages (from surprise) (1.1.1)
Requirement already satisfied: numpy>=1.11.2 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.21.6)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.4.1)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.1.0)
# To compute the accuracy of models
from surprise import accuracy
# Class to parse a file containing ratings, data should be in structure - user; item; rating
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV
# For splitting the rating data in train and test datasets
from surprise.model_selection import train_test_split
# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
# for implementing K-Fold cross-validation
from surprise.model_selection import KFold
# For implementing clustering-based recommendation system
from surprise import CoClustering
Before building the recommendation systems, let's go over some basic terminologies we are going to use:
Relevant item: An item (movie in this case) that is actually rated higher than the threshold rating (here 3.5) is relevant, and an item that is actually rated lower than the threshold rating is a non-relevant item.
Recommended item: An item whose predicted rating is higher than the threshold (here 3.5) is a recommended item, and an item whose predicted rating is lower than the threshold rating is a non-recommended item, i.e., it will not be recommended to the user.
False Negative (FN): It is the frequency of relevant items that are not recommended to the user. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the loss of opportunity for the service provider, which they would like to minimize.
False Positive (FP): It is the frequency of recommended items that are actually not relevant. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in loss of resources for the service provider, which they would also like to minimize.
Recall: It is the fraction of actually relevant items that are recommended to the user, i.e., if out of 10 relevant items, 6 are recommended, then recall is 0.60. The higher the value of recall, the better the model. It is one of the metrics used to assess the performance of classification models.
Precision: It is the fraction of recommended items that are actually relevant, i.e., if out of 10 recommended items, 6 are found relevant by the user, then precision is 0.60. The higher the value of precision, the better the model. It is one of the metrics used to assess the performance of classification models.
While building a recommendation system, it is customary to look at the model's performance in terms of how many recommendations are relevant and how many relevant items are recommended. Below are some of the most commonly used performance metrics for assessing recommendation systems.
Precision@k - It is the fraction of recommended items that are relevant in top k predictions. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.
Recall@k - It is the fraction of relevant items that are recommended to the user in top k predictions.
F1-score@k - It is the harmonic mean of Precision@k and Recall@k. When precision@k and recall@k both seem to be important, it is useful to use this metric because it is representative of both of them.
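As a quick worked example of these metrics, consider one user with the following hypothetical (predicted rating, actual rating) pairs, using k = 3 and a relevance threshold of 3.5; the numbers are made up for illustration and are independent of the function defined next.
# Hypothetical (predicted, actual) rating pairs for one user, already sorted by predicted rating
user_ratings = [(4.8, 5.0), (4.5, 3.0), (4.1, 4.0), (3.9, 2.5), (3.0, 4.5)]
k, threshold = 3, 3.5
n_rel = sum(actual >= threshold for _, actual in user_ratings) # 3 relevant items in total
n_rec_k = sum(pred >= threshold for pred, _ in user_ratings[:k]) # 3 recommended items in the top 3
n_rel_and_rec_k = sum(pred >= threshold and actual >= threshold for pred, actual in user_ratings[:k]) # 2 are both
print(n_rel_and_rec_k / n_rec_k) # Precision@3 = 2/3 = 0.667
print(n_rel_and_rec_k / n_rel) # Recall@3 = 2/3 = 0.667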
def precision_recall_at_k(model, k = 10, threshold = 3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)

    # Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()

    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. Therefore, we are setting Precision to 0 when n_rec_k is 0
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. Therefore, we are setting Recall to 0 when n_rel is 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the per-user precisions is calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the per-user recalls is calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    # RMSE of the predictions on the test set
    accuracy.rmse(predictions)

    print('Precision: ', precision) # Command to print the overall precision
    print('Recall: ', recall) # Command to print the overall recall
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3)) # Formula to compute the F_1 score
Below we are converting the rating dataset, which is a pandas dataframe, into a different format called surprise.dataset.DatasetAutoFolds. This is required by the surprise library. To do this, we will be using the classes Reader and Dataset.
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale = (0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test datasets
trainset, testset = train_test_split(data, test_size = 0.2, random_state = 42)
# Declaring the similarity options
sim_options = {'name': 'cosine',
'user_based': True}
# KNN algorithm is used to find desired similar items
sim_user_user = KNNBasic(sim_options = sim_options, verbose = False, random_state = 1)
# Train the algorithm on the trainset, and predict ratings for the test set
sim_user_user.fit(trainset)
# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(sim_user_user)
RMSE: 0.9823
Precision: 0.757
Recall: 0.542
F_1 score: 0.632
Now, let's predict the rating for the user with userId = 4 and the movie with movieId = 10 as shown below. Here, the user has already interacted with (i.e., watched and rated) the movie with movieId 10.
# Predicting rating for a sample user with an interacted movie
sim_user_user.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.41 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.4133289774831344, details={'actual_k': 40, 'was_impossible': False})
Below is the list of users who have seen the movie with movieId 3.
rating[rating.movieId == 3].userId.unique()
array([ 1, 6, 19, 32, 42, 43, 44, 51, 58, 64, 68, 91, 100,
102, 116, 117, 150, 151, 169, 179, 217, 226, 240, 269, 270, 288,
289, 294, 302, 307, 308, 321, 330, 337, 368, 410, 414, 448, 456,
470, 477, 480, 492, 501, 544, 552, 555, 588, 590, 594, 599, 608])
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted yet, i.e., movieId = 3.
# Predicting rating for a sample user with a non interacted movie
sim_user_user.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.26 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.260929007645626, details={'actual_k': 40, 'was_impossible': False})
Above, we have predicted the rating for this user-item pair based on this user-user similarity-based baseline model.
Below, we will be tuning hyperparameters for the KNNBasic algorithm. Let's try to understand some of the hyperparameters of this algorithm:
- k: the (maximum) number of neighbors to take into account for aggregation
- min_k: the minimum number of neighbors to take into account; if there are not enough neighbors, the prediction is set to the global mean of all ratings
- sim_options: a dictionary of options for the similarity measure, such as the name of the similarity metric (e.g., 'msd', 'cosine') and whether it is user_based or item_based
Note: GridSearchCV does not accept the metrics recall@k, precision@k, or F1 Score@k. As a result, we'll tune the model using RMSE.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [30, 40, 50], 'min_k': [3, 6, 9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [True]}
}
# Performing 3-Fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# Fitting the model on data
gs.fit(data)
# Printing the best RMSE score
print(gs.best_score['rmse'])
# Printing the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.947994992166605
{'k': 30, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': True}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
Now, let's build the final model by using the optimal values of the hyperparameters, which we received by using the grid search cross-validation.
# Using the optimal similarity measure for user-user collaborative filtering
sim_options = {'name': 'msd',
'user_based': True}
# Creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options = sim_options, k = 30, min_k = 3, random_state = 1, verbose = False)
# Training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)
# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(sim_user_user_optimized)
RMSE: 0.9467
Precision: 0.762
Recall: 0.554
F_1 score: 0.642
Let us now predict the rating for the user with userId = 4 and the movie with movieId = 10 with the optimized model as shown below.
sim_user_user_optimized.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.50 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.497691535784751, details={'actual_k': 30, 'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, by using the optimized model as shown below.
sim_user_user_optimized.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.45 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.4530529132024763, details={'actual_k': 30, 'was_impossible': False})
We can also find similar users to a given user, i.e., its nearest neighbors, based on this KNNBasic algorithm. Below, we are finding the 5 most similar users to the user with id 4 based on the msd distance metric. Note that get_neighbors expects surprise's internal (inner) ids rather than the raw userId values; a sketch of converting between the two follows the output.
sim_user_user_optimized.get_neighbors(4, k = 5)
[89, 90, 91, 181, 230]
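A minimal sketch, assuming the trainset built earlier, of converting the raw userId to an inner id before querying and mapping the returned neighbors back to raw userIds:
# Convert the raw userId 4 to surprise's inner id before querying for neighbors
inner_uid = trainset.to_inner_uid(4)
# get_neighbors returns inner ids, so we map them back to raw userIds
neighbor_inner_ids = sim_user_user_optimized.get_neighbors(inner_uid, k = 5)
print([trainset.to_raw_uid(inner_id) for inner_id in neighbor_inner_ids])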
Below, we will be implementing a function where the input parameters are:
- data: the rating dataset
- user_id: the user id for which we want the recommendations
- top_n: the number of movies to recommend
- algo: the trained algorithm used to predict the ratings
def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended movie ids
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index = 'userId', columns = 'movieId', values = 'rating')

    # Extracting those movie ids which the user has not interacted with yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each of the movie ids which the user has not interacted with yet
    for item_id in non_interacted_movies:

        # Predicting the rating for each non-interacted movie id for this user
        est = algo.predict(user_id, item_id).est

        # Appending the predicted rating
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    # Returning the top n movies with the highest predicted ratings for this user
    return recommendations[:top_n]
# Making top 5 recommendations for userId 4 using the similarity-based recommendation system
recommendations = get_recommendations(rating, 4, 5, sim_user_user_optimized)
# Building the dataframe for above recommendations with columns "movieId" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['movieId', 'predicted_ratings'])
| movieId | predicted_ratings | |
|---|---|---|
| 0 | 3404 | 5.000000 |
| 1 | 7121 | 5.000000 |
| 2 | 6460 | 4.844207 |
| 3 | 115122 | 4.813285 |
| 4 | 1178 | 4.808807 |
While comparing two movies, the predicted rating alone does not fully describe how likely a user is to enjoy a movie; the number of users who have rated it also matters. Due to this, we calculate a "corrected rating" for each movie. Generally, the higher the rating_count of a movie, the more reliable its rating is: a movie rated 4 by only 3 users is a less reliable signal than a movie rated 3 by 50 users. To account for this, we penalize the predicted rating by a quantity that is inversely proportional to the square root of the movie's rating_count.
def ranking_movies(recommendations, final_rating):

    # Sort the recommended movies based on their ratings count
    ranked_movies = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending = False)[['rating_count']].reset_index()

    # Merge with the recommended movies to get the predicted ratings
    ranked_movies = ranked_movies.merge(pd.DataFrame(recommendations, columns = ['movieId', 'predicted_ratings']), on = 'movieId', how = 'inner')

    # Compute the corrected ratings by penalizing movies with few ratings
    ranked_movies['corrected_ratings'] = ranked_movies['predicted_ratings'] - 1 / np.sqrt(ranked_movies['rating_count'])

    # Sort the movies based on corrected ratings
    ranked_movies = ranked_movies.sort_values('corrected_ratings', ascending = False)

    return ranked_movies
Note: In the corrected-rating formula above, we could add the quantity 1/np.sqrt(n) instead of subtracting it to get more optimistic predictions. But here we are subtracting this quantity, as some movies already have a predicted rating of 5 and a movie's rating cannot exceed 5.
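For example, in the recommendations above, movieId 3404 has a predicted rating of 5.0 and a rating_count of 6, so its corrected rating is 5.0 - 1/sqrt(6) ≈ 4.59, which matches the first row of the output shown below.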
# Applying the ranking movies function and sorting it based on corrected ratings
ranking_movies(recommendations, final_rating)
| movieId | rating_count | predicted_ratings | corrected_ratings | |
|---|---|---|---|---|
| 1 | 3404 | 6 | 5.000000 | 4.591752 |
| 0 | 1178 | 12 | 4.808807 | 4.520132 |
| 3 | 7121 | 4 | 5.000000 | 4.500000 |
| 2 | 6460 | 5 | 4.844207 | 4.396993 |
| 4 | 115122 | 3 | 4.813285 | 4.235934 |
We have seen user-user similarity-based collaborative filtering. Now, let us look into similarity-based collaborative filtering where similarity is computed between items.
# Declaring the similarity options
sim_options = {'name': 'cosine',
'user_based': False}
# The KNN algorithm is used to find desired similar items
sim_item_item = KNNBasic(sim_options = sim_options, random_state = 1, verbose = False)
# Train the algorithm on the trainset, and predict ratings for the testset
sim_item_item.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(sim_item_item)
RMSE: 0.9800
Precision: 0.609
Recall: 0.464
F_1 score: 0.527
Let's now predict the rating for the user with userId = 4 and the movie with movieId = 10 as shown below. Here, the user has already interacted with (i.e., watched and rated) the movie with movieId 10.
# Predicting rating for a sample user with an interacted movie
sim_item_item.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.63 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6257369831511945, details={'actual_k': 40, 'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted yet, i.e., movieId = 3.
# Predicting rating for a sample user with a non interacted movie
sim_item_item.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.67 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6748659322681623, details={'actual_k': 40, 'was_impossible': False})
Below, we will be tuning hyperparameters of the KNNBasic algorithm.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [False]}
}
# Performing 3-Fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# Fitting the model on the data
gs.fit(data)
# Print the best RMSE score
print(gs.best_score['rmse'])
# Print the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.9168710668985779
{'k': 30, 'min_k': 6, 'sim_options': {'name': 'msd', 'user_based': False}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
Now, let's build the model using the optimal values of the hyperparameters, which we received using the grid search cross-validation.
# Using the optimal similarity measure for item-item based collaborative filtering
sim_options = {'name': 'msd',
'user_based': False}
# Creating an instance of KNNBasic with optimal hyperparameter values
sim_item_item_optimized = KNNBasic(sim_options = sim_options, k = 30, min_k = 6, random_state = 1, verbose = False)
# Training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(sim_item_item_optimized)
RMSE: 0.9160
Precision: 0.678
Recall: 0.499
F_1 score: 0.575
Let's now predict the rating for the user with userId = 4 and the movie with movieId = 10 using the optimized model as shown below.
sim_item_item_optimized.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.26 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.2569148418342952, details={'actual_k': 30, 'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, using the optimized model as shown below.
sim_item_item_optimized.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.57 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.574650247164053, details={'actual_k': 30, 'was_impossible': False})
We can also find similar items to a given item, i.e., its nearest neighbors, based on this KNNBasic algorithm. Below, we are finding the 5 most similar items to the item with id 4 based on the msd distance metric (as with users, get_neighbors works with inner item ids).
sim_item_item_optimized.get_neighbors(4, k = 5)
[45, 73, 148, 155, 180]
# Making top 5 recommendations for userId 4 using the similarity-based recommendation system
recommendations = get_recommendations(rating, 4, 5, sim_item_item_optimized)
# Building the dataframe for above recommendations with columns "movieId" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['movieId', 'predicted_ratings'])
| movieId | predicted_ratings | |
|---|---|---|
| 0 | 5706 | 4.771028 |
| 1 | 176579 | 4.748016 |
| 2 | 25959 | 4.744049 |
| 3 | 2149 | 4.730439 |
| 4 | 56176 | 4.724374 |
# Applying the "ranking_movies" function and sorting it based on corrected ratings
ranking_movies(recommendations, final_rating)
| movieId | rating_count | predicted_ratings | corrected_ratings | |
|---|---|---|---|---|
| 0 | 2149 | 4 | 4.730439 | 4.230439 |
| 1 | 56176 | 3 | 4.724374 | 4.147023 |
| 2 | 5706 | 1 | 4.771028 | 3.771028 |
| 3 | 176579 | 1 | 4.748016 | 3.748016 |
| 4 | 25959 | 1 | 4.744049 | 3.744049 |
Now that we have seen similarity-based collaborative filtering algorithms, let us get into model-based collaborative filtering.
Model-based collaborative filtering is a personalized recommendation system: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use latent features to find recommendations for each user.
Latent Features: These are features that are not present in the original data but are inferred from it, and they are given names once they are identified. For example, movies can be described by underlying genres such as Action, Romance, Suspense, and Comedy; these genres act as latent features of the corresponding movies. Similarly, we can compute latent features for users, describing how strongly each user prefers each of these genres.
SVD is used to compute the latent features from the user-item interaction matrix that we discussed earlier. However, plain SVD does not work when there are missing values in the user-item interaction matrix.
First, we need to convert the movie-rating dataset into a user-item interaction matrix. We have already done this while computing cosine similarities; a minimal reminder of the conversion is sketched below.
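# Pivot the (userId, movieId, rating) records into a user-item interaction matrix
# Each row is a user, each column is a movie, and NaN marks movies the user has not rated
user_item_matrix = rating.pivot(index = 'userId', columns = 'movieId', values = 'rating')
print(user_item_matrix.shape) # (610 users, 9724 movies)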
SVD decomposes the above matrix into three separate matrices:
- U matrix: a matrix with dimension m x k, where m is the number of users and k is the number of latent features
- Sigma matrix: a diagonal matrix with dimension k x k, whose entries represent the strength of each latent feature
- V-transpose matrix: a matrix with dimension k x n, where n is the number of items (movies)
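To make these shapes concrete, here is a minimal sketch using NumPy's SVD on a small, fully filled toy matrix (independent of the surprise library, whose SVD algorithm instead learns the factors with SGD to handle missing ratings); the numbers are illustrative only.
import numpy as np
# Toy user-item matrix with m = 4 users and n = 5 items (no missing values)
toy_matrix = np.array([[4, 5, 0, 1, 2],
                       [5, 4, 1, 0, 1],
                       [1, 0, 5, 4, 4],
                       [0, 1, 4, 5, 5]], dtype = float)
# Full SVD, then keep only k = 2 latent features
U, sigma, Vt = np.linalg.svd(toy_matrix, full_matrices = False)
k = 2
U_k, sigma_k, Vt_k = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]
print(U_k.shape, sigma_k.shape, Vt_k.shape) # (4, 2) (2, 2) (2, 5) -> m x k, k x k, k x n
# Multiplying the three matrices back gives a rank-k approximation of the original matrix
approx = U_k @ sigma_k @ Vt_k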
# Using SVD with matrix factorization
svd = SVD(random_state = 1)
# Training the algorithm on the training dataset
svd.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(svd)
RMSE: 0.8797
Precision: 0.738
Recall: 0.507
F_1 score: 0.601
Let's now predict the rating for the user with userId = 4 and the movie with movieId = 10 as shown below. Here, the user has already rated the movie.
# Making prediction for userId 4 and movieId 10
svd.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.33 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.333359479354037, details={'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, as shown below.
# Making prediction for userid 4 and movieId 3
svd.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 2.94 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=2.9386110726567756, details={'was_impossible': False})
In SVD, the rating is predicted as:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the user and item latent factor vectors.
If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.
To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right)$$
The minimization is performed by a very straightforward stochastic gradient descent, where the prediction error is $e_{ui} = r_{ui} - \hat{r}_{ui}$ and the updates at each step are:

$$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)$$
$$b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)$$
$$p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)$$
$$q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)$$
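A minimal sketch of a single SGD step for one observed rating, mirroring the update rules above; the bias, factor, learning rate (gamma), and regularization (lambda) values are illustrative, not taken from the fitted model.
import numpy as np
# Illustrative values: global mean, user/item biases, and latent factors of size k = 3
mu, b_u, b_i = 3.5, 0.1, -0.2
p_u = np.array([0.1, 0.3, -0.2])
q_i = np.array([0.2, -0.1, 0.4])
r_ui = 4.0 # observed rating for this user-item pair
gamma, lam = 0.005, 0.02 # learning rate and regularization term
# Prediction error for this rating
e_ui = r_ui - (mu + b_u + b_i + np.dot(q_i, p_u))
# One stochastic gradient descent update of the biases and factors (old values are used on the right-hand side)
b_u, b_i = b_u + gamma * (e_ui - lam * b_u), b_i + gamma * (e_ui - lam * b_i)
p_u, q_i = p_u + gamma * (e_ui * q_i - lam * p_u), q_i + gamma * (e_ui * p_u - lam * q_i)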
There are many hyperparameters to tune in this algorithm; you can find the full list of hyperparameters in the surprise documentation.
Below, we will be tuning only three hyperparameters:
- n_epochs: the number of iterations of the SGD procedure
- lr_all: the learning rate for all parameters
- reg_all: the regularization term for all parameters
# Set the parameter space to do hyperparameter tuning
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
'reg_all': [0.2, 0.4, 0.6]}
# Performing 3-Fold gridsearch cross-validation
gs = GridSearchCV(SVD, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# Fitting the model on the data
gs.fit(data)
# Print the best RMSE score
print(gs.best_score['rmse'])
# Print the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.8717294809446704
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
Now, we will build the model using the optimal values of hyperparameters that we received from the grid search cross-validation.
# Building the optimized SVD model using optimal hyperparameters search
svd_optimized = SVD(n_epochs = 30, lr_all = 0.01, reg_all = 0.2, random_state = 1)
# Training the algorithm on the train set
svd_optimized = svd_optimized.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 10
precision_recall_at_k(svd_optimized)
RMSE: 0.8752
Precision: 0.731
Recall: 0.511
F_1 score: 0.602
Let's now predict the rating for the user with userId = 4 and the movie with movieId = 10 with the optimized model as shown below.
# Using svd_algo_optimized model to recommend for userId 4 and movieId 10
svd_optimized.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.39 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.3892642624049993, details={'was_impossible': False})
# Using svd_algo_optimized model to recommend for userId 4 and movieId 3 with unknown baseline rating
svd_optimized.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.20 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.20286300753659, details={'was_impossible': False})
Now, let's recommend the movies using the optimized svd model.
# Getting top 5 recommendations for userId 4 using "svd_algo_optimized" algorithm
svd_recommendations = get_recommendations(rating, 4, 5, svd_optimized)
# Ranking movies based on above recommendations
ranking_movies(svd_recommendations, final_rating)
| movieId | rating_count | predicted_ratings | corrected_ratings | |
|---|---|---|---|---|
| 0 | 1178 | 12 | 4.446400 | 4.157725 |
| 1 | 177593 | 8 | 4.380428 | 4.026875 |
| 2 | 106642 | 7 | 4.379596 | 4.001631 |
| 3 | 3266 | 6 | 4.332485 | 3.924236 |
| 4 | 7121 | 4 | 4.342665 | 3.842665 |
In this case study, we built recommendation systems using four different algorithms. They are as follows:
- Rank-based recommendation system using average ratings and the number of interactions
- User-user similarity-based collaborative filtering (KNNBasic)
- Item-item similarity-based collaborative filtering (KNNBasic)
- Model-based collaborative filtering using matrix factorization (SVD)