Online streaming platforms like Netflix have huge movie catalogues. If we can build a recommendation system that recommends movies to users based on their historical interactions, it would improve customer satisfaction, which in turn increases the company's revenue. The techniques we will learn here are not limited to movies; they can be applied to any item for which you want to build a recommendation system.
Using this data, we will build two different types of recommendation systems: a clustering-based system and a content-based system.
We will use the following three datasets for this case study:
ratings dataset - This dataset contains the following attributes:
movies dataset - This dataset contains the following attributes:
tags dataset - This dataset contains the following attributes:
Sometimes, installation of the surprise library, which we will use to build recommendation systems, runs into issues in a local Jupyter environment. To avoid any issues, it is advised to use Google Colab for this case study.
Let's start by mounting the Google drive on Colab.
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import warnings # Used to ignore the warning given as output of the code
warnings.filterwarnings('ignore')
import numpy as np # Basic libraries of python for numeric and dataframe computations
import pandas as pd
import matplotlib.pyplot as plt # Basic library for data visualization
import seaborn as sns # Slightly advanced library for data visualization
from collections import defaultdict # A dictionary that does not raise a key error
from sklearn.metrics.pairwise import cosine_similarity # To find the similarity between two vectors
from sklearn.metrics import mean_squared_error # A performance metric in sklearn
# Loading the movies dataset
movies = pd.read_csv('/content/drive/MyDrive/movies.csv')
# Let us see the first five records of the dataset
movies.head()
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
# Shape of the DataFrame
movies.shape
(9742, 3)
# Loading the ratings dataset
ratings = pd.read_csv('/content/drive/MyDrive/ratings.csv')
# Shape of the ratings dataset
ratings.shape
(100836, 4)
Let's merge both the datasets to get the title and rating of each movie in a single DataFrame.
# Merging datasets on movieId
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title']], on = 'movieId', how = 'inner')
# See the first five records of the dataset
ratings_with_title.head()
| | userId | movieId | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) |
| 1 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) |
| 2 | 7 | 1 | 4.5 | 1106635946 | Toy Story (1995) |
| 3 | 15 | 1 | 2.5 | 1510577970 | Toy Story (1995) |
| 4 | 17 | 1 | 4.5 | 1305696483 | Toy Story (1995) |
Let's check the info of the data.
# Checking info of the merged dataset
ratings_with_title.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userId     100836 non-null  int64
 1   movieId    100836 non-null  int64
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64
 4   title      100836 non-null  object
dtypes: float64(1), int64(3), object(1)
memory usage: 4.6+ MB
# Dropping the timestamp column
rating = ratings_with_title.drop(['timestamp'], axis = 1)
# Calculating average ratings
average_rating = rating.groupby('movieId').mean()['rating']
# Calculating the count of ratings
count_rating = rating.groupby('movieId').count()['rating']
# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating': average_rating, 'rating_count': count_rating})
# See the first five records of the final_rating dataset
final_rating.head()
| movieId | avg_rating | rating_count |
|---|---|---|
| 1 | 3.920930 | 215 |
| 2 | 3.431818 | 110 |
| 3 | 3.259615 | 52 |
| 4 | 2.357143 | 7 |
| 5 | 3.071429 | 49 |
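Since matplotlib and seaborn were imported earlier, we can optionally take a quick look at how the average ratings and rating counts are distributed. This is a small illustrative sketch rather than part of the core pipeline; it only uses the final_rating DataFrame created above (if your seaborn version is older than 0.11, sns.distplot can be used instead of sns.histplot).

# Optional: distributions of average ratings and rating counts per movie
fig, axes = plt.subplots(1, 2, figsize = (12, 4))
# Distribution of the average rating per movie
sns.histplot(final_rating['avg_rating'], bins = 20, ax = axes[0])
axes[0].set_title('Average rating per movie')
# Distribution of the number of ratings per movie
sns.histplot(final_rating['rating_count'], bins = 50, ax = axes[1])
axes[1].set_title('Number of ratings per movie')
plt.show()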
Let's explore the dataset and answer some basic data-related questions.
# Find the number of unique users
ratings_with_title['userId'].nunique()
610
# Find the number of unique movies
ratings_with_title['title'].nunique()
9719
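With 610 users, 9,719 movies, and 100,836 ratings, each user has rated only a small fraction of the catalogue, so the user-item matrix is very sparse. The short check below (added for illustration) computes the fill rate, which comes out to roughly 1.7%.

# Fraction of all possible (user, movie) pairs that actually have a rating
n_users = ratings_with_title['userId'].nunique()
n_movies = ratings_with_title['title'].nunique()
n_ratings = len(ratings_with_title)
density = n_ratings / (n_users * n_movies)
print('Matrix density: {:.4f} ({:.2%} of user-movie pairs are rated)'.format(density, density))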
To demonstrate the clustering-based recommendation system, we are going to use the surprise library in this case study.
Install the surprise library. You only need to do this once, when running the code for the first time.
# Installing the surprise package
!pip install surprise
Requirement already satisfied: surprise in /usr/local/lib/python3.7/dist-packages (0.1)
Requirement already satisfied: scikit-surprise in /usr/local/lib/python3.7/dist-packages (from surprise) (1.1.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.1.0)
Requirement already satisfied: numpy>=1.11.2 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.21.6)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.15.0)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.4.1)
# To compute the accuracy of models
from surprise import accuracy
# Class to parse a file containing ratings, data should be in structure - user; item; rating
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV
# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split
# For implementing a clustering-based recommendation system
from surprise import CoClustering
Before building the recommendation systems, let's go over some basic terminologies we are going to use:
Relevant item: An item (a movie in this case) whose actual rating is higher than the threshold rating (here 3.5) is relevant; an item whose actual rating is lower than the threshold is non-relevant.
Recommended item: An item whose predicted rating is higher than the threshold (here 3.5) is a recommended item; an item whose predicted rating is lower than the threshold is a non-recommended item, i.e., it will not be recommended to the user.
False Negative (FN): It is the frequency of relevant items that are not recommended to the user. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the loss of opportunity for the service provider, which they would like to minimize.
False Positive (FP): It is the frequency of recommended items that are actually not relevant. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in loss of resources for the service provider, which they would also like to minimize.
Recall: It is the fraction of actually relevant items that are recommended to the user, i.e., if 6 out of 10 relevant products are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.
Precision: It is the fraction of recommended items that are actually relevant, i.e., if 6 out of 10 recommended items are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is also one of the metrics used to assess the performance of classification models.
While building a recommendation system, it is customary to look at the model's performance in terms of how many of the recommended items are relevant and how many of the relevant items are recommended. Below are some of the most commonly used metrics for assessing recommendation systems, followed by their formulas.
Precision@k - It is the fraction of recommended items that are relevant in top k predictions. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.
Recall@k - It is the fraction of relevant items that are recommended to the user in top k predictions.
F1-score@k - It is the harmonic mean of Precision@k and Recall@k. When precision@k and recall@k both seem to be important, it is useful to use this metric because it is representative of both of them.
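For reference, with a rating threshold of 3.5 these metrics can be written as shown below; this is exactly what the function implemented next computes for each user before averaging across users.

$$\text{Precision@}k = \frac{\#\{\text{recommended items in top } k \text{ that are relevant}\}}{\#\{\text{recommended items in top } k\}}$$

$$\text{Recall@}k = \frac{\#\{\text{recommended items in top } k \text{ that are relevant}\}}{\#\{\text{relevant items}\}}$$

$$\text{F1@}k = \frac{2 \cdot \text{Precision@}k \cdot \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k}$$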
def precision_recall_at_k(model, k = 10, threshold = 3.5):
"""Return precision@k and recall@k metrics for each user"""
# First map the predictions to each user
user_est_true = defaultdict(list)
# Making predictions on the test data
    predictions = model.test(testset)
for uid, _, true_r, est, _ in predictions:
user_est_true[uid].append((est, true_r))
precisions = dict()
recalls = dict()
for uid, user_ratings in user_est_true.items():
# Sort user ratings by estimated value
user_ratings.sort(key = lambda x: x[0], reverse = True)
# Number of relevant items
n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
# Number of recommended items in top k
n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])
# Number of relevant and recommended items in top k
n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
for (est, true_r) in user_ratings[ : k])
# Precision@K: Proportion of recommended items that are relevant
# When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
# Recall@K: Proportion of relevant items that are recommended
# When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0
recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
# Mean of all the predicted precisions are calculated
precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)
# Mean of all the predicted recalls are calculated
recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
accuracy.rmse(predictions)
# Command to print the overall precision
print('Precision: ', precision)
# Command to print the overall recall
print('Recall: ', recall)
# Formula to compute the F-1 score
print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))
Below, we load the rating data, which is a pandas DataFrame, into a different format called surprise.dataset.DatasetAutoFolds, which is what the surprise library expects. To do this, we will use the Reader and Dataset classes.
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale = (0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size = 0.2, random_state = 42)
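As a quick sanity check (a small addition, not part of the original flow), we can confirm the sizes of the two splits. Note that trainset is a surprise Trainset object, while testset is a plain list of (user, item, rating) tuples.

# Sanity check: sizes of the train and test splits
print('Users in train set   :', trainset.n_users)
print('Movies in train set  :', trainset.n_items)
print('Ratings in train set :', trainset.n_ratings)
print('Ratings in test set  :', len(testset))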
In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in movies based on how they rate different movies. We cluster similar users together and recommend movies to a user based on ratings from other users in the same cluster.
Co-clustering is a family of techniques in cluster analysis. Given some matrix A, we want to cluster the rows of A and the columns of A simultaneously; this is a common task for user-item matrices.
Because it clusters the rows and the columns at the same time, it is also called bi-clustering. To understand how the algorithm works, let A be an m x n matrix. The goal is to generate co-clusters: a subset of rows that exhibit similar behavior across a subset of columns, or vice versa.
Co-clustering is defined by two map functions: one mapping rows to row-cluster indexes and one mapping columns to column-cluster indexes.
These map functions are learned simultaneously, which distinguishes co-clustering from techniques that first cluster the rows and then the columns.
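For intuition, the CoClustering implementation in surprise (following George and Merugu, 2005) predicts a rating roughly as the average rating of the user-item co-cluster, adjusted by how far the user and the item deviate from their own clusters' averages:

$$\hat{r}_{ui} = \overline{C}_{ui} + (\mu_u - \overline{C}_u) + (\mu_i - \overline{C}_i)$$

where $\overline{C}_{ui}$ is the average rating of the co-cluster containing user $u$ and item $i$, $\overline{C}_u$ and $\overline{C}_i$ are the average ratings of $u$'s user cluster and $i$'s item cluster, and $\mu_u$ and $\mu_i$ are the mean ratings of the user and the item.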
# Using CoClustering algorithm
clust_baseline = CoClustering(random_state = 1)
# Training the algorithm on the train set
clust_baseline.fit(trainset)
# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(clust_baseline)
RMSE: 0.9490
Precision:  0.717
Recall:  0.502
F_1 score:  0.591
Now, let's predict the rating for the user with userId = 4 and the movie with movieId = 10, as shown below. Here, the user has already watched and rated the movie with movieId 10.
# Making prediction for userId 4 and movieId 10
clust_baseline.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.68 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6757402992691386, details={'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, as shown below.
# Making prediction for userId 4 and movieId 3
clust_baseline.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.26 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.258169827544438, details={'was_impossible': False})
Below, we will tune the hyperparameters of the CoClustering algorithm. Its main hyperparameters are n_cltr_u (the number of user clusters), n_cltr_i (the number of item clusters), and n_epochs (the number of iterations of the optimization loop).
# Set the parameter space to tune
param_grid = {'n_cltr_u': [3, 4, 5, 6], 'n_cltr_i': [3, 4, 5, 6], 'n_epochs': [30, 40, 50]}
# Performing 3-Fold gridsearch cross-validation
gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# Fitting data
gs.fit(data)
# Printing the best RMSE score
print(gs.best_score['rmse'])
# Printing the combination of parameters that gives the best RMSE score
print(gs.best_params['rmse'])
0.9570102979396152
{'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 30}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
We will build the final model using the tuned hyperparameter values obtained from the grid search cross-validation above.
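As an aside, instead of retyping the best values, surprise's GridSearchCV also exposes an algorithm object pre-configured with the best parameters via its best_estimator attribute. A minimal sketch (the variable name clust_from_gs is just for illustration):

# Alternative sketch: take the best-parameter algorithm directly from the grid search
clust_from_gs = gs.best_estimator['rmse']
# It still has to be trained before it can make predictions
clust_from_gs.fit(trainset)

Below, we instead create the tuned model explicitly with the values found above, fixing random_state for reproducibility.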
# Using tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u = 3, n_cltr_i = 3, n_epochs = 30, random_state = 1)
# Training the algorithm on the train set
clust_tuned.fit(trainset)
# Let us compute precision@k, recall@k, and F_1 score with k = 10
precision_recall_at_k(clust_tuned)
RMSE: 0.9499
Precision:  0.715
Recall:  0.5
F_1 score:  0.588
Let's now predict the rating for the user with userId = 4 and for the movie with movieId = 10 as shown below. Here, the user has already rated the movie.
# Using co-clustering_optimized model to recommend for userId 4 and movieId 10
clust_tuned.predict(4, 10, r_ui = 4, verbose = True)
user: 4 item: 10 r_ui = 4.00 est = 3.65 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6549811878747214, details={'was_impossible': False})
Below, we are predicting the rating for the same userId = 4 but for a movie with which this user has not interacted before, i.e., movieId = 3, as shown below.
# Using Co-clustering based optimized model
clust_tuned.predict(4, 3, verbose = True)
user: 4 item: 3 r_ui = None est = 3.23 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.2318175022354345, details={'was_impossible': False})
Below, we will implement a function whose input parameters are: data (the rating data), user_id (the user for whom we want recommendations), top_n (the number of recommendations to return), and algo (the trained algorithm used to predict ratings).
def get_recommendations(data, user_id, top_n, algo):
# Creating an empty list to store the recommended movie IDs
recommendations = []
    # Creating a user-item interactions matrix
user_item_interactions_matrix = data.pivot(index = 'userId', columns = 'movieId', values = 'rating')
# Extracting those movie IDs which the userId has not interacted yet
non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
# Looping through each of the movie IDs which userId has not interacted yet
for item_id in non_interacted_movies:
# Predicting the ratings for those non interacted movie IDs by this user
est = algo.predict(user_id, item_id).est
# Appending the predicted ratings
recommendations.append((item_id, est))
# Sorting the predicted ratings in descending order
recommendations.sort(key = lambda x: x[1], reverse = True)
    # Returning the top n highest predicted rating movies for this user
return recommendations[:top_n]
# Getting top 5 recommendations for userId 4 using Co-clustering based optimized algorithm
clustering_recommendations = get_recommendations(rating, 4, 5, clust_tuned)
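The function returns (movieId, predicted rating) pairs, so the recommendations are easier to read if we attach the movie titles. A small optional lookup against the movies DataFrame loaded earlier:

# Optional: attach movie titles to the recommended (movieId, predicted rating) pairs
recs_df = pd.DataFrame(clustering_recommendations, columns = ['movieId', 'predicted_rating'])
recs_df = recs_df.merge(movies[['movieId', 'title']], on = 'movieId', how = 'left')
print(recs_df)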
When comparing two movies, the predicted rating alone does not fully capture how much users are likely to enjoy them; the number of users who have rated a movie also matters. For this reason, we compute a "corrected_rating" for each movie. Generally, the higher a movie's "rating_count", the more trustworthy its rating is as a signal of how much it is liked. For example, a movie rated 4 by only 3 users is a weaker recommendation than a movie rated 3 by 50 users. It has been found empirically that a reasonable correction term is inversely proportional to the square root of the rating count, i.e., corrected_rating = predicted_rating - 1 / sqrt(rating_count).
def ranking_movies(recommendations, final_rating):
# Sort the movies based on ratings count
ranked_movies = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending = False)[['rating_count']].reset_index()
# Merge with the recommended movies to get predicted ratings
ranked_movies = ranked_movies.merge(pd.DataFrame(recommendations, columns = ['movieId', 'predicted_ratings']), on = 'movieId', how = 'inner')
# Rank the movies based on corrected ratings
ranked_movies['corrected_ratings'] = ranked_movies['predicted_ratings'] - 1 / np.sqrt(ranked_movies['rating_count'])
# Sort the movies based on corrected ratings
ranked_movies = ranked_movies.sort_values('corrected_ratings', ascending = False)
return ranked_movies
Note: In the corrected-rating formula above, we could add the quantity 1 / np.sqrt(n) instead of subtracting it to get more optimistic corrected ratings. Here we subtract it, because some movies already have a predicted rating of 5 and a movie's rating cannot exceed 5.
# Ranking movies based on the above recommendations
ranking_movies(clustering_recommendations, final_rating)
| | movieId | rating_count | predicted_ratings | corrected_ratings |
|---|---|---|---|---|
| 0 | 304 | 3 | 5 | 4.422650 |
| 1 | 53 | 2 | 5 | 4.292893 |
| 2 | 99 | 2 | 5 | 4.292893 |
| 3 | 238 | 2 | 5 | 4.292893 |
| 4 | 148 | 1 | 5 | 4.000000 |
Let us now move on to the final recommendation approach: the content-based recommendation system.
In a content-based recommendation system, we use text features, such as reviews, to find similar movies.
Text data generally contains punctuation, stopwords, and non-ASCII characters, which makes it noisy. So, we will first pre-process the text and then generate features from it to compute similarities between the texts.
Let's load the tags dataset.
# Importing the tags data
tags = pd.read_csv('/content/drive/MyDrive/tags.csv')
tags.head()
| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |
In this dataset, we don't have any movie reviews or plot summaries, so we will combine the title and genres columns from the other two datasets with the tag column from the tags dataset to create a text feature. We will then apply the TF-IDF feature-extraction technique to this text and later use the resulting features to find similar movies.
# Merging all the three datasets on movieId
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title', 'genres']], on = 'movieId' )
final_ratings = pd.merge(ratings_with_title, tags[['movieId', 'tag']], on = 'movieId' )
# Let us see the dataset
final_ratings
| | userId | movieId | rating | timestamp | title | genres | tag |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | pixar |
| 1 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | pixar |
| 2 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | fun |
| 3 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | pixar |
| 4 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | pixar |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 233208 | 599 | 176419 | 3.5 | 1516604655 | Mother! (2017) | Drama|Horror|Mystery|Thriller | uncomfortable |
| 233209 | 599 | 176419 | 3.5 | 1516604655 | Mother! (2017) | Drama|Horror|Mystery|Thriller | unsettling |
| 233210 | 594 | 7023 | 4.5 | 1108972356 | Wedding Banquet, The (Xi yan) (1993) | Comedy|Drama|Romance | In Netflix queue |
| 233211 | 606 | 6107 | 4.0 | 1171324428 | Night of the Shooting Stars (Notte di San Lore... | Drama|War | World War II |
| 233212 | 606 | 6516 | 3.5 | 1171755910 | Anastasia (1956) | Drama | In Netflix queue |
233213 rows × 7 columns
# Replacing | character with space in genres column
final_ratings['genres'] = final_ratings['genres'].apply(lambda x: " ".join(x.split('|')))
# Combining title, genres, and tag columns
final_ratings['text'] = final_ratings['title'] + ' ' + final_ratings['genres'] + ' ' + final_ratings['tag']
final_ratings.head()
| | userId | movieId | rating | timestamp | title | genres | tag | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy | pixar | Toy Story (1995) Adventure Animation Children ... |
| 1 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy | pixar | Toy Story (1995) Adventure Animation Children ... |
| 2 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy | fun | Toy Story (1995) Adventure Animation Children ... |
| 3 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy | pixar | Toy Story (1995) Adventure Animation Children ... |
| 4 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy | pixar | Toy Story (1995) Adventure Animation Children ... |
Now, we will keep only the columns userId, movieId, rating, title, and text. We will then drop duplicate titles from the data and set the title column as the index of the DataFrame, leaving four data columns.
# Create the final_ratings dataset with specified columns
final_ratings = final_ratings[['userId', 'movieId', 'rating', 'title', 'text']]
# Let us drop the duplicate records
final_ratings = final_ratings.drop_duplicates(subset = ['title'])
# Set the index
final_ratings = final_ratings.set_index('title')
# See the first five records of the dataset
final_ratings.head()
| title | userId | movieId | rating | text |
|---|---|---|---|---|
| Toy Story (1995) | 1 | 1 | 4.0 | Toy Story (1995) Adventure Animation Children ... |
| Grumpier Old Men (1995) | 1 | 3 | 4.0 | Grumpier Old Men (1995) Comedy Romance moldy |
| Seven (a.k.a. Se7en) (1995) | 1 | 47 | 5.0 | Seven (a.k.a. Se7en) (1995) Mystery Thriller m... |
| Usual Suspects, The (1995) | 1 | 50 | 5.0 | Usual Suspects, The (1995) Crime Mystery Thril... |
| Bottle Rocket (1996) | 1 | 101 | 5.0 | Bottle Rocket (1996) Adventure Comedy Crime Ro... |
# Let us see the shape of final_ratings data
final_ratings.shape
(1554, 4)
Only 1,554 movies remain, because the merge with the tags data keeps only movies that have at least one tag, and we kept a single row per title. Now, let's process the text data and create features to find the similarity between movies.
# Importing nltk (natural language toolkit library)
import nltk
# Downloading the punkt tokenizer models
nltk.download('punkt')
# Downloading stopwords
nltk.download('stopwords')
# Downloading wordnet
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True
# This is importing regular expression
import re
# Word_tokenize is used to do tokenization
from nltk import word_tokenize
# Importing the lemmatizer
from nltk.stem import WordNetLemmatizer
# Importing the stopwords
from nltk.corpus import stopwords
# TfidfVectorizer is used to create TF-IDF feature vectors
from sklearn.feature_extraction.text import TfidfVectorizer
We will create a function to pre-process the text data. It lowercases the text, removes non-alphabetical characters, tokenizes the result into words, removes stopwords, and lemmatizes the remaining words (reduces them to their base form).
# Create the tokenize function
def tokenize(text):
# Making each letter as lowercase and removing non-alphabetical text
text = re.sub(r"[^a-zA-Z]"," ", text.lower())
# Extracting each word in the text
tokens = word_tokenize(text)
# Removing stopwords
words = [word for word in tokens if word not in stopwords.words("english")]
# Lemmatize the words
text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]
return text_lems
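A quick check on a sample string shows what the function produces; the exact output may vary slightly with the NLTK data version.

# Example: tokenizing a sample movie text
print(tokenize('Toy Story (1995) Adventure Animation Children Comedy Fantasy pixar'))
# Expected output (approximately):
# ['toy', 'story', 'adventure', 'animation', 'child', 'comedy', 'fantasy', 'pixar']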
There are several ways to extract features from text, such as Bag of Words, TF-IDF, and word embeddings. Here, we will use TF-IDF as the feature-extraction technique.
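As a refresher, scikit-learn's TfidfVectorizer (with its default smoothing) weights a term $t$ in a document $d$ roughly as

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right)$$

where $\text{tf}(t, d)$ is the count of $t$ in $d$, $n$ is the number of documents (movies), and $\text{df}(t)$ is the number of documents containing $t$; each document vector is then L2-normalized. Rare, distinctive tags and genres therefore receive higher weights than terms that appear in many movies.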
# Creating the TF-IDF object
tfidf = TfidfVectorizer(tokenizer = tokenize)
movie_tfidf = tfidf.fit_transform(final_ratings['text'].values).toarray()
# Making the DataFrame of movie_tfidf data
pd.DataFrame(movie_tfidf)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2773 | 2774 | 2775 | 2776 | 2777 | 2778 | 2779 | 2780 | 2781 | 2782 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1549 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1550 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1552 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1553 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1554 rows × 2783 columns
We have extracted features from the text data. Now, we can find similarities between movies using these features. We will use cosine similarity to calculate the similarity.
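Cosine similarity between two TF-IDF vectors $x$ and $y$ is the cosine of the angle between them:

$$\text{cosine similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

A value close to 1 means the two movies share many highly weighted terms, while 0 means they share none.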
# Calculating the cosine similarity
similar_movies = cosine_similarity(movie_tfidf, movie_tfidf)
# Let us see the above array
similar_movies
array([[1. , 0.02268393, 0. , ..., 0.02022472, 0. ,
0. ],
[0.02268393, 1. , 0. , ..., 0.04779055, 0. ,
0. ],
[0. , 0. , 1. , ..., 0. , 0. ,
0. ],
...,
[0.02022472, 0.04779055, 0. , ..., 1. , 0.00719396,
0.19617374],
[0. , 0. , 0. , ..., 0.00719396, 1. ,
0.01217017],
[0. , 0. , 0. , ..., 0.19617374, 0.01217017,
1. ]])
Finally, let's create a function to find the most similar movies to recommend for a given movie.
# Function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, similar_movies):
recommended_movies = []
indices = pd.Series(final_ratings.index)
# Getting the index of the movie that matches the title
idx = indices[indices == title].index[0]
# Creating a Series with the similarity scores in descending order
score_series = pd.Series(similar_movies[idx]).sort_values(ascending = False)
# Getting the indices of 10 most similar movies
top_10_indexes = list(score_series.iloc[1 : 11].index)
print(top_10_indexes)
# Populating the list with the titles of the best 10 matching movies
for i in top_10_indexes:
recommended_movies.append(list(final_ratings.index)[i])
return recommended_movies
recommendations('Usual Suspects, The (1995)', similar_movies)
[71, 1186, 124, 551, 569, 77, 719, 766, 123, 658]
['Game, The (1997)', 'Andalusian Dog, An (Chien andalou, Un) (1929)', 'Town, The (2010)', 'Now You See Me (2013)', 'Charade (1963)', 'Negotiator, The (1998)', 'Following (1998)', '21 Grams (2003)', 'Inception (2010)', 'Insomnia (2002)']
In this case study, we built recommendation systems using five different algorithms. They are as follows: