MoRe recommends movies to users based on their past behavior and on how similar the movies are to one another. It suggests a movie when its recommendation rate is greater than the same user's preference rate for that movie. In short, it can recommend movies that other users never liked but that this particular user might.
Image Credit - [mohamed_hassan](https://pixabay.com/users/mohamed_hassan-5229782/)

Recommendation systems are designed to recommend things to users based on many different factors. They predict the items that users are most likely to purchase or be interested in. Large companies such as Google, Amazon, and Netflix use recommendation systems to help their users find products or movies. A recommendation system can suggest items based on your own past activity, which is known as content-based filtering, or based on the preferences of other users who are similar to you, which is known as collaborative filtering.
Cosine similarity is a metric used to measure how similar two items are. Mathematically, it is the cosine of the angle between two vectors projected in a multidimensional space. Cosine similarity is useful because two similar documents can be far apart by Euclidean distance (for example, because they differ in size) and still be oriented close together. The smaller the angle, the higher the cosine similarity.
1 - cosine-similarity = cosine-distance
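To make the Euclidean-versus-angle point concrete, here is a minimal sketch in plain Python. The vectors `short_doc`, `long_doc`, and `unrelated` are made-up word counts, and `cosine_similarity_pair` is a hypothetical helper written for illustration, not scikit-learn's function. It assumes neither vector is all zeros.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word counts: long_doc uses the same words as short_doc, ten times as often.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]
unrelated = [0, 0, 5]

euclidean = math.dist(short_doc, long_doc)   # requires Python 3.8+
similarity = cosine_similarity_pair(short_doc, long_doc)
distance = 1 - similarity                    # cosine distance

print(euclidean)    # large, because the documents differ in size
print(similarity)   # approximately 1, because they point in the same direction
print(distance)     # approximately 0
print(cosine_similarity_pair(short_doc, unrelated))  # 0.0, no words in common
```

The two documents are more than 20 units apart by Euclidean distance, yet their cosine similarity is approximately 1, which is why the angle is the better signal when document length varies.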
The Jupyter Python notebook is available at nbviewer.
Download the dataset from here
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv("movie_dataset.csv")
We'll choose the features that are most relevant to us and store them in a list named features.
features = ['keywords', 'cast', 'genres', 'director']
Some preprocessing is needed before proceeding further. Every null value in these columns is replaced with an empty string so that the columns can be joined as text.
for feature in features:
    df[feature] = df[feature].fillna('')
Next, we combine all the selected features into a single string and add it as a new column to the existing dataset.
def combined_features(row):
    return row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']

df['combined_features'] = df.apply(combined_features, axis=1)
Now we'll extract the features using sklearn's feature_extraction module, which converts raw data into a format supported by machine learning algorithms.
CountVectorizer's fit_transform counts how many times each word occurs in each document.
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined_features'])
print("Count Matrix: ",count_matrix.toarray())
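To see what a count matrix contains, here is a rough sketch of the same idea in plain Python. The helper `count_words` and the two sample strings are hypothetical, and this is not how CountVectorizer is implemented: its default tokenizer also splits on punctuation and ignores single-character tokens, which this sketch does not.

```python
from collections import Counter

def count_words(documents):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["london paris london", "paris paris london"]
vocabulary, rows = count_words(docs)
print(vocabulary)  # ['london', 'paris']
print(rows)        # [[2, 1], [1, 2]]
```

Each row is the vector for one document, and these are the vectors that the similarity step compares.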
sklearn provides the cosine_similarity function (in sklearn.metrics.pairwise), which we'll use to compute the similarity between every pair of rows in the count matrix.
cosine_sim = cosine_similarity(count_matrix)
cosine_sim is a NumPy array holding the cosine similarity between each pair of movies.
Now we'll store the input movie in the movie_user_like variable. Since we're building a content-based recommendation system, we need to know which content the user likes in order to predict similar items.
movie_user_like = "Dead Poets Society"
def get_index_from(title):
    # Relies on the dataset's "index" column to look up the row position of a title.
    return df[df.title == title]["index"].values[0]
movie_index = get_index_from(movie_user_like)
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key = lambda x:x[1], reverse = True)
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
# Print the 16 highest-scoring titles, starting with the best match.
i = 0
for movies in sorted_similar_movies:
    print(get_title_from_index(movies[0]))
    i = i + 1
    if i > 15:
        break
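Because a movie's similarity with itself is the highest possible score, the loop above will typically print the liked movie first. One possible variation, sketched here in plain Python, skips the query movie and returns only the top n matches. The helper `top_similar` and the scores in `row` are made up for illustration.

```python
def top_similar(similarity_row, query_index, n=5):
    """Return the indices of the n highest-scoring items, skipping the query itself."""
    scored = [
        (index, score)
        for index, score in enumerate(similarity_row)
        if index != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in scored[:n]]

# Made-up similarity scores of movie 0 against four movies (itself included).
row = [1.0, 0.2, 0.9, 0.5]
print(top_similar(row, query_index=0, n=2))  # [2, 3]
```

With the article's variables, the equivalent call would be `top_similar(cosine_sim[movie_index], movie_index, n=15)`, with each returned index passed to get_title_from_index.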