- Total samples: 509,236
- Start date: 1/25/2008, end date: 11/22/2016
- Features: time_created, date_created, up_votes, down_votes, title, over_18, author, category
- Total number of articles published each year
- Total number of articles published each month
- Total number of unique authors publishing each month (see the pandas sketch below)
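A minimal pandas sketch of these aggregations, assuming the dataset is loaded into a DataFrame `df` with the columns listed above (`articles.csv` is a placeholder file name, not the actual dataset path):

```python
import pandas as pd

# Hypothetical file name; substitute the actual dataset path.
df = pd.read_csv("articles.csv", parse_dates=["date_created"])

# Articles published each year
articles_per_year = df.groupby(df["date_created"].dt.year).size()

# Articles published each month
articles_per_month = df.groupby(df["date_created"].dt.to_period("M")).size()

# Unique authors publishing each month
authors_per_month = df.groupby(df["date_created"].dt.to_period("M"))["author"].nunique()
```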
- pandas: Used for data manipulation and analysis
- NumPy: Used for handling multi-dimensional arrays and matrices
- nltk: Used for Natural Language preprocessing
- fastText: Used for learning word embeddings and text classification; created by Facebook AI Research
- faiss: Used for efficient similarity search and clustering of dense vectors
- Removed stopwords and performed stemming on titles using the nltk library
- Appended time_created, date_created, and author to the processed titles in order to increase the similarity between articles published by the same author around the same date
- Trained a fastText model on the preprocessed text with an embedding size of 100, a window size of 5, and a minimum word count of 5 for 25 epochs (see the sketch below)
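A minimal sketch of the preprocessing and training steps, assuming `df` from the earlier snippet; the file name `titles_processed.txt` and the `preprocess` helper are hypothetical (fastText's unsupervised API trains from a file):

```python
import fasttext
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(title):
    # Lowercase, drop stopwords, and stem each remaining token.
    tokens = [stemmer.stem(t) for t in title.lower().split() if t not in stop_words]
    return " ".join(tokens)

# Prepend time_created, date_created, and author to each processed title,
# as described above, so same-author/same-date articles embed more similarly.
df["processed"] = (
    df["time_created"].astype(str) + " "
    + df["date_created"].astype(str) + " "
    + df["author"].astype(str) + " "
    + df["title"].map(preprocess)
)

# One processed title per line for fastText's file-based training API.
with open("titles_processed.txt", "w") as f:
    f.write("\n".join(df["processed"]))

# Unsupervised skip-gram training with the parameters listed above.
model = fasttext.train_unsupervised(
    "titles_processed.txt", model="skipgram", dim=100, ws=5, minCount=5, epoch=25
)
```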
- Builds an L2-distance faiss index over the article embeddings produced by the fastText model, so that the embedding of a target keyword or article can be compared against the embeddings of all other articles (see the sketch below)
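A minimal faiss sketch under the same assumptions, embedding each processed title with the trained fastText model:

```python
import faiss
import numpy as np

# Embed every processed title line with the trained fastText model.
embeddings = np.array(
    [model.get_sentence_vector(text) for text in df["processed"]],
    dtype="float32",
)

# Exact L2 (Euclidean) index over all article embeddings.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
```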
- Based on the search query entered by the user, the most relevant articles are shown, reranked by date_created from most recent to least recent (see the sketch after this list). NOTE: time_created and up_votes could also be used for reranking, but this is not implemented yet.
- Searched the similarity index to find the embeddings closest to the given title embedding
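A sketch of the query path, reusing the `index`, `model`, `df`, and `preprocess` objects from the previous snippets; `search_articles` is a hypothetical helper name:

```python
def search_articles(query, k=20):
    # Embed the query the same way the titles were embedded.
    query_vec = model.get_sentence_vector(preprocess(query)).reshape(1, -1)

    # Retrieve the k nearest articles by L2 distance.
    distances, ids = index.search(query_vec, k)
    results = df.iloc[ids[0]]

    # Rerank: most recent articles first.
    return results.sort_values("date_created", ascending=False)
```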
- Implemented two such recommendation approaches: global article recommendation and last-30-days recommendation (see the sketch after this list)
- Global: recommends 20 similar articles from all articles in the database, reranked by up_votes so that the most upvoted appear first
- Last 30 days: recommends 5 similar articles published in the last 30 days, reranked by up_votes so that the most upvoted appear first
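A sketch of both approaches under the same assumptions; `recommend` is a hypothetical helper built on the `search_articles` retrieval above:

```python
import pandas as pd

def recommend(title, k, days=None):
    # Retrieve a generous candidate pool, then filter and rerank.
    candidates = search_articles(title, k=100)

    if days is not None:
        # Keep only articles from the trailing window (e.g. last 30 days).
        cutoff = candidates["date_created"].max() - pd.Timedelta(days=days)
        candidates = candidates[candidates["date_created"] >= cutoff]

    # Rerank by upvotes, most upvoted first.
    return candidates.sort_values("up_votes", ascending=False).head(k)

# Global recommendation: 20 similar articles across the whole database.
global_recs = recommend("some article title", k=20)

# Last-30-days recommendation: 5 similar articles from the last 30 days.
recent_recs = recommend("some article title", k=5, days=30)
```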
Since there is no ground-truth data such as user interactions, impressions, likes, or dislikes, there is no straightforward way to evaluate the model; it is therefore treated as an unsupervised learning problem. If ground-truth data were available, the generated recommendations/rankings could be scored with metrics such as NDCG (Normalized Discounted Cumulative Gain) and/or MAP (Mean Average Precision).
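For illustration only, a hypothetical evaluation sketch assuming such relevance labels existed, using scikit-learn's `ndcg_score` (toy numbers, not project data):

```python
from sklearn.metrics import ndcg_score

# Hypothetical relevance judgments for 6 recommended articles (higher = better)
# and the similarity scores the system assigned to them.
true_relevance = [[3, 2, 0, 1, 0, 2]]
predicted_scores = [[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]]

print(ndcg_score(true_relevance, predicted_scores))       # NDCG over the full list
print(ndcg_score(true_relevance, predicted_scores, k=5))  # NDCG@5
```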