- Total samples: 509,236
- Start date: 1/25/2008, end date: 11/22/2016
- Features: time_created, date_created, up_votes, down_votes, title, over_18, author, category
- Total number of articles published each year
- Total number of articles published each month
- Total number of unique authors publishing each month (see the pandas sketch below)
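A minimal pandas sketch of these aggregations, assuming the dataset is loaded into a DataFrame `df` with the columns listed above (`articles.csv` is a placeholder file name, not the actual dataset path):

```python
import pandas as pd

# Hypothetical file name; substitute the actual dataset path.
df = pd.read_csv("articles.csv", parse_dates=["date_created"])

# Articles published each year
articles_per_year = df.groupby(df["date_created"].dt.year).size()

# Articles published each month
articles_per_month = df.groupby(df["date_created"].dt.to_period("M")).size()

# Unique authors publishing each month
authors_per_month = df.groupby(df["date_created"].dt.to_period("M"))["author"].nunique()
```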
- pandas: Used for data manipulation and analysis
- NumPy: Used for handling multi-dimensional arrays and matrices
- nltk: Used for Natural Language preprocessing
- fastText: Used for learning word embeddings and text classification; created by Facebook AI Research
- faiss: Used for efficient similarity search and clustering of dense vectors
- Removed stopwords and performed stemming on titles using the nltk library
- Appended time_created, date_created, and author to the processed titles in order to increase the similarity between articles published by the same author around the same date
- Trained a fastText model on the preprocessed text with an embedding size of 100, a window size of 5, and a minimum word count of 5 for 25 epochs (see the sketch below)
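A minimal sketch of the preprocessing and training steps, assuming `df` from the earlier snippet; the file name `titles_processed.txt` and the `preprocess` helper are hypothetical (fastText's unsupervised API trains from a file):

```python
import fasttext
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(title):
    # Lowercase, drop stopwords, and stem each remaining token.
    tokens = [stemmer.stem(t) for t in title.lower().split() if t not in stop_words]
    return " ".join(tokens)

# Prepend time_created, date_created, and author to each processed title,
# as described above, so same-author/same-date articles embed more similarly.
df["processed"] = (
    df["time_created"].astype(str) + " "
    + df["date_created"].astype(str) + " "
    + df["author"].astype(str) + " "
    + df["title"].map(preprocess)
)

# One processed title per line for fastText's file-based training API.
with open("titles_processed.txt", "w") as f:
    f.write("\n".join(df["processed"]))

# Unsupervised skip-gram training with the parameters listed above.
model = fasttext.train_unsupervised(
    "titles_processed.txt", model="skipgram", dim=100, ws=5, minCount=5, epoch=25
)
```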
- Builds an L2-distance faiss index over the article embeddings produced by the fastText model, so that the embedding of a target keyword or article can be compared against the embeddings of all other articles (see the sketch below)
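A minimal faiss sketch under the same assumptions, embedding each processed title with the trained fastText model:

```python
import faiss
import numpy as np

# Embed every processed title line with the trained fastText model.
embeddings = np.array(
    [model.get_sentence_vector(text) for text in df["processed"]],
    dtype="float32",
)

# Exact L2 (Euclidean) index over all article embeddings.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
```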
- Based on the search query entered by the user, the most relevant articles are shown, reranked by date_created from most recent to least recent (see the sketch after this list). NOTE: time_created and up_votes could also be used for reranking, but this is not implemented yet.
- Searched the similarity index to find the embeddings closest to the given title embedding
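A sketch of the query path, reusing the `index`, `model`, `df`, and `preprocess` objects from the previous snippets; `search_articles` is a hypothetical helper name:

```python
def search_articles(query, k=20):
    # Embed the query the same way the titles were embedded.
    query_vec = model.get_sentence_vector(preprocess(query)).reshape(1, -1)

    # Retrieve the k nearest articles by L2 distance.
    distances, ids = index.search(query_vec, k)
    results = df.iloc[ids[0]]

    # Rerank: most recent articles first.
    return results.sort_values("date_created", ascending=False)
```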
- Implemented two such recommendation approaches: global article recommendation and last-30-days recommendation (see the sketch after this list)
- Global: recommends 20 similar articles from all articles in the database, reranked by up_votes so that the most upvoted appear first
- Last 30 days: recommends 5 similar articles published in the last 30 days, reranked by up_votes so that the most upvoted appear first
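A sketch of both approaches under the same assumptions; `recommend` is a hypothetical helper built on the `search_articles` retrieval above:

```python
import pandas as pd

def recommend(title, k, days=None):
    # Retrieve a generous candidate pool, then filter and rerank.
    candidates = search_articles(title, k=100)

    if days is not None:
        # Keep only articles from the trailing window (e.g. last 30 days).
        cutoff = candidates["date_created"].max() - pd.Timedelta(days=days)
        candidates = candidates[candidates["date_created"] >= cutoff]

    # Rerank by upvotes, most upvoted first.
    return candidates.sort_values("up_votes", ascending=False).head(k)

# Global recommendation: 20 similar articles across the whole database.
global_recs = recommend("some article title", k=20)

# Last-30-days recommendation: 5 similar articles from the last 30 days.
recent_recs = recommend("some article title", k=5, days=30)
```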
Since there is no ground-truth data such as user interactions, impressions, likes, or dislikes, there is no straightforward way to evaluate the model; it is therefore treated as an unsupervised learning problem. If ground-truth data were available, the generated recommendations/rankings could be scored with metrics such as NDCG (Normalized Discounted Cumulative Gain) and/or MAP (Mean Average Precision).
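For illustration only, a hypothetical evaluation sketch assuming such relevance labels existed, using scikit-learn's `ndcg_score` (toy numbers, not project data):

```python
from sklearn.metrics import ndcg_score

# Hypothetical relevance judgments for 6 recommended articles (higher = better)
# and the similarity scores the system assigned to them.
true_relevance = [[3, 2, 0, 1, 0, 2]]
predicted_scores = [[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]]

print(ndcg_score(true_relevance, predicted_scores))       # NDCG over the full list
print(ndcg_score(true_relevance, predicted_scores, k=5))  # NDCG@5
```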