
Hate Speech Detection: Statistical vs. Deep Learning Approaches

Overview

This project compares statistical and deep learning-based embedding methods for hate speech detection. The analysis is carried out in two Jupyter notebooks, NLP/Statistical models.ipynb and NLP/DEEP Learning Models.ipynb, and focuses on the strengths and weaknesses of each approach in terms of accuracy, precision, recall, and F1-score.

Statistical Models

Preprocessing and Dataset Setup

  • Libraries Used: NLTK, scikit-learn, Gensim, Matplotlib
  • Resources: NLTK stopwords, lemmatization, GloVe embeddings
  • Dataset: cardiffnlp/tweet_eval hate-speech dataset
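
For reference, the dataset is available through the Hugging Face datasets library; a minimal loading sketch (the notebooks may read it differently):

```python
from datasets import load_dataset  # pip install datasets

# "hate" is the hate-speech subset of tweet_eval; labels: 0 = non-hate, 1 = hate
dataset = load_dataset("cardiffnlp/tweet_eval", "hate")
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()
```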

Text Preprocessing

  • Functionality: Handles invalid input, converts to lowercase, cleans Twitter elements, removes noise, and applies lemmatization.
  • Safe Preprocessing: Ensures error handling and adds a processed_text column.
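
A minimal sketch of what this step plausibly looks like; the exact regexes and helper names are assumptions, not the notebook's code (continuing from the loading sketch above):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, strip Twitter artifacts, drop noise and stopwords, lemmatize."""
    if not isinstance(text, str):                        # guard against invalid input
        return ""
    text = text.lower()
    text = re.sub(r"@\w+|#\w+|https?://\S+", " ", text)  # mentions, hashtags, URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation and digits
    tokens = [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# "Safe" application: preprocess() never raises, and adds the processed_text column
for df in (train_df, test_df):
    df["processed_text"] = df["text"].apply(preprocess)
```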

Embedding Methods

  1. Bag of Words (BoW)

    • Uses CountVectorizer with a max of 5000 features (all five methods are sketched in code after this list).
  2. TF-IDF

    • Employs TfidfVectorizer with a max of 5000 features.
  3. Word2Vec

    • Trains on tokenized tweets with vector size 100.
  4. FastText

    • Uses the same parameters as Word2Vec.
  5. GloVe

    • Utilizes pre-trained Twitter GloVe embeddings.
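
A condensed sketch of the five setups, using the parameter values quoted above; the mean-pooling used to turn word vectors into tweet vectors is an assumption (continuing from the preprocessing sketch):

```python
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec, FastText
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = train_df["processed_text"].tolist()
tokenized = [t.split() for t in texts]

# 1. Bag of Words: raw counts over a capped 5,000-term vocabulary
bow_vec = CountVectorizer(max_features=5000)
X_bow = bow_vec.fit_transform(texts)

# 2. TF-IDF: same vocabulary cap, frequency-weighted
tfidf_vec = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vec.fit_transform(texts)

# 3. Word2Vec trained on the tokenized tweets themselves
w2v = Word2Vec(sentences=tokenized, vector_size=100)

# 4. FastText with the same parameters as Word2Vec
ft = FastText(sentences=tokenized, vector_size=100)

# 5. Pre-trained 100-dimensional Twitter GloVe vectors
glove = api.load("glove-twitter-100")

def mean_vector(tokens, kv, dim=100):
    """Pool a tweet into one vector by averaging its in-vocabulary tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# The same pooling applies to ft.wv and glove
X_w2v = np.vstack([mean_vector(toks, w2v.wv) for toks in tokenized])
```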

Model Training and Evaluation

  • Model: Logistic Regression
  • Metrics: Classification report, accuracy, precision, recall, F1-score
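
With those pieces in place, training and evaluation reduce to the standard scikit-learn pattern (a sketch, not the notebook's exact code; BoW features shown, but any feature matrix above slots in):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Reuse the predefined tweet_eval splits from the loading sketch
X_train = X_bow
X_test = bow_vec.transform(test_df["processed_text"])
y_train, y_test = train_df["label"], test_df["label"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["non-hate", "hate"]))
```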

Results

| Model    | Accuracy | Hate Speech Detection Rate | False Positive Rate | F1 (Hate Speech) |
|----------|----------|----------------------------|---------------------|------------------|
| BoW      | 51.0%    | 90.1%                      | 77.5%               | 60.8%            |
| TF-IDF   | 50.7%    | 88.8%                      | 77.1%               | 60.3%            |
| Word2Vec | 47.6%    | 65.6%                      | 65.5%               | 51.3%            |
| FastText | 49.1%    | 67.3%                      | 64.1%               | 52.7%            |
| GloVe    | 57.8%    | 0.0%                       | 0.0%                | 0.0%             |

Note that GloVe's 0.0% detection rate and F1 mean the classifier never predicts the hate class: its higher accuracy simply mirrors the share of non-hate tweets, so it is unusable as a detector in this setup.

Deep Learning Models

Setup

  • Libraries: Transformers, PyTorch
  • Device: Utilizes GPU if available
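
Device selection is the standard PyTorch idiom:

```python
import torch

# Prefer a CUDA GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```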

Model Architecture

  • Transformer-based Models: BERT, RoBERTa, etc.
  • Training: Fine-tuning on the hate-speech dataset
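
The notebook holds the actual training code; a minimal Hugging Face fine-tuning sketch along these lines might look as follows (the BERT checkpoint, hyperparameters, and use of Trainer are assumptions):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("cardiffnlp/tweet_eval", "hate")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="hate-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
```

Trainer moves the model to the GPU automatically when one is present, consistent with the device setup above.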

Evaluation Metrics

  • Accuracy: Fraction of all predictions that are correct
  • Precision: Fraction of predicted hate-speech cases that are truly hate speech
  • Recall: Fraction of actual hate-speech cases that are detected (true positive rate)
  • F1-Score: Harmonic mean of precision and recall
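
For concreteness, the four metrics computed on a toy prediction vector:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = hate, 0 = non-hate
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")  # metrics for the hate class
print(acc, precision, recall, f1)      # all 0.75 here: 3 TP, 1 FP, 1 FN, 3 TN
```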

Results

  • Deep Learning Models: Generally outperform statistical models in terms of precision and recall.
  • Contextual Understanding: Better capture of nuances and context in language.

Key Insights

  1. Statistical Models

    • Pros: Simplicity, quick deployment, lower computational cost.
    • Cons: Higher false positive rates, less nuanced understanding.
  2. Deep Learning Models

    • Pros: Better contextual understanding, improved precision and recall.
    • Cons: Higher computational requirements, longer training times.
  3. Use Case Considerations

    • Statistical Models: Suitable for simpler tasks with limited resources.
    • Deep Learning Models: Ideal for complex language tasks requiring nuanced understanding.

Recommendations

  1. Hybrid Approaches: Combine statistical and deep learning methods for improved performance (a simple ensemble sketch follows this list).
  2. Resource Allocation: Consider computational resources when choosing between approaches.
  3. Domain Adaptation: Fine-tune models on domain-specific data for better results.
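
As an illustration of the first recommendation, a soft-voting ensemble is one simple hybrid; everything here (the blending weight, the probability sources) is a hypothetical sketch rather than an evaluated approach:

```python
import numpy as np

def soft_vote(p_stat, p_deep, weight=0.5):
    """Blend per-class probabilities from a statistical model (e.g. the
    TF-IDF + LogisticRegression pipeline's predict_proba) and a fine-tuned
    transformer (softmax over its logits), then take the argmax."""
    blended = weight * np.asarray(p_stat) + (1 - weight) * np.asarray(p_deep)
    return np.argmax(blended, axis=1)

# Usage: preds = soft_vote(clf.predict_proba(X_test), transformer_probs)
```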

Dependencies

  • Python 3.10+
  • PyTorch
  • Transformers
  • scikit-learn
  • NLTK
  • Gensim
  • pandas
  • numpy

References

Resources

  1. Text Preprocessing

  2. Word Embeddings

  3. BERT Architecture

  4. Sentence Transformers