Toxic Text Classification Kaggle Competition

The goal of this project is to build a full end-to-end machine learning solution for classifying different types of toxic text. The classes are:

toxic
severe_toxic
obscene
threat
insult
identity_hate

These types of comments can be hurtful and insensitive to others. Being able to remove them helps keep an online community safe and lets all users enjoy participating without fear of judgement.

The project is divided into 4 sections:

EDA
Baseline Model
Improvements
Deployment

EDA

This notebook contains basic code for understanding the data and shows how to preprocess the text for model training. The preprocessing includes:

Removing Stopwords
Removing URLs
Removing newlines
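
The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `preprocess` function, the URL regex, the tiny `STOPWORDS` set, and the lowercasing step are all assumptions made for the example (the notebook may well use a full stopword list such as NLTK's).

```python
import re

# Tiny illustrative stopword set (assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "this", "that"}

# Matches http(s) links and bare www. links (assumed pattern).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text: str) -> str:
    """Remove URLs, newlines, and stopwords from a single comment."""
    # 1. Replace URLs with a space so neighbouring words do not merge.
    text = URL_PATTERN.sub(" ", text)
    # 2. Replace newlines (and carriage returns) with spaces.
    text = text.replace("\r", " ").replace("\n", " ")
    # 3. Lowercase (assumption), split on whitespace, and drop stopwords.
    tokens = [tok for tok in text.lower().split() if tok not in STOPWORDS]
    return " ".join(tokens)
```

Removing URLs before splitting on whitespace keeps link fragments from leaking into the vocabulary that the vectorizer later builds.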

Baseline Model

For the first iteration, I used a Naive Bayes model to classify these toxic comments. It uses the same preprocessing as the EDA notebook, plus sklearn's CountVectorizer to convert the text into a matrix of token counts. This matrix is then fed into the Naive Bayes model for training. Currently, the model's ROC AUC score is ~0.94. I've also integrated Paperspace Gradient into the code commit process: each time a commit is made, it runs a training job that trains the model and saves the model file, which can be downloaded from Gradient.

I've documented more about it here

The training script can be found here
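
The baseline described above can be sketched along these lines. It is only an approximation of the training script: wrapping `MultinomialNB` in `OneVsRestClassifier` to handle the six labels, and the helper names `build_baseline` and `mean_roc_auc`, are assumptions for this example rather than details confirmed by the project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def build_baseline():
    """CountVectorizer followed by one Naive Bayes classifier per label."""
    # One-vs-rest is an assumed way of handling the multi-label targets.
    return make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))


def mean_roc_auc(y_true, y_score):
    """Average the per-label ROC AUC over all label columns."""
    return roc_auc_score(np.asarray(y_true), np.asarray(y_score), average="macro")
```

Usage would look something like `model = build_baseline(); model.fit(train_texts, train_labels)`, where `train_labels` is a 0/1 matrix with one column per entry in `LABELS`, followed by `mean_roc_auc(valid_labels, model.predict_proba(valid_texts))`.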

Improvements

This is still a work in progress. I will be using an LSTM to see whether it performs better than the baseline model. My goal is also to try a few other models, including Google's BERT and a CharCNN.

Deployment

Once our model is better than our NB model, we'll deploy it to a server. Currently, my goal is to use Kubeflow for inference as well as for retraining. This will be completed once the deeper model is finished.

