This project explores the performance of various statistical and deep-learning embedding methods for hate speech detection. The analysis is conducted in two Jupyter notebooks: `NLP/Statistical models.ipynb` and `NLP/DEEP Learning Models.ipynb`. The focus is on understanding the strengths and weaknesses of each approach in terms of accuracy, precision, recall, and F1-score.
- Libraries Used: NLTK, scikit-learn, Gensim, Matplotlib
- Resources: NLTK stopwords, lemmatization, GloVe embeddings
- Dataset: the `cardiffnlp/tweet_eval` hate-speech dataset
- Functionality: Handles invalid input, converts to lowercase, cleans Twitter elements, removes noise, and applies lemmatization.
- Safe Preprocessing: Ensures error handling and adds a `processed_text` column.
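The cleaning steps listed above can be sketched roughly as follows. The function name and exact regular expressions are illustrative, not the notebook's own; the notebook additionally applies NLTK stopword removal and lemmatization, which are omitted here to keep the sketch dependency-free.

```python
import re

# Illustrative sketch of the "safe" preprocessing step; the notebook's
# actual function and cleaning rules may differ.
def preprocess_tweet(text):
    # Guard against invalid input (None, NaN, non-strings)
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Remove Twitter-specific elements: URLs, @mentions, hashtag symbols
    text = re.sub(r"http\S+|www\.\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#", "", text)
    # Strip remaining non-alphabetic noise and collapse whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())
```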
- **Bag of Words (BoW)**: Uses `CountVectorizer` with a maximum of 5000 features.
- **TF-IDF**: Employs `TfidfVectorizer` with a maximum of 5000 features.
- **Word2Vec**: Trained on tokenized tweets with a vector size of 100.
- **FastText**: Trained with the same parameters as Word2Vec.
- **GloVe**: Uses pre-trained Twitter GloVe embeddings.
- Model: Logistic Regression
- Metrics: Classification report, accuracy, precision, recall, F1-score
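End to end, the statistical pipeline amounts to vectorizing the processed text and fitting a Logistic Regression. The sketch below uses `TfidfVectorizer` (swap in `CountVectorizer` for BoW); the corpus and labels are made-up placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Placeholder corpus and labels (1 = hate speech, 0 = not).
texts = ["i hate you", "have a nice day", "you are awful", "what a lovely tweet"]
labels = [1, 0, 1, 0]

# Cap of 5000 features, as in the notebook; use CountVectorizer for BoW.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
preds = clf.predict(X)
print(accuracy_score(labels, preds))
print(classification_report(labels, preds, zero_division=0))
```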
| Model | Accuracy | Hate Speech Detection Rate | False Positive Rate | F1 (Hate Speech) |
|---|---|---|---|---|
| BoW | 51.0% | 90.1% | 77.5% | 60.8% |
| TF-IDF | 50.7% | 88.8% | 77.1% | 60.3% |
| Word2Vec | 47.6% | 65.6% | 65.5% | 51.3% |
| FastText | 49.1% | 67.3% | 64.1% | 52.7% |
| GloVe | 57.8% | 0.0% | 0.0% | 0.0% |

Note: GloVe's 0% detection rate together with its 0% false positive rate means the classifier never predicted the hate-speech class at all; its higher accuracy simply reflects always predicting the non-hate majority class.
- Libraries: Transformers, PyTorch
- Device: Utilizes GPU if available
- Transformer-based Models: BERT, RoBERTa, etc.
- Training: Fine-tuning on the hate-speech dataset
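The fine-tuning loop has the usual PyTorch shape. In the sketch below, a tiny random network stands in for the pre-trained Hugging Face encoder so the loop runs without any downloads; all shapes and hyperparameters are illustrative, not the notebook's.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Use the GPU when available, as in the notebook.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # stand-in for BERT/RoBERTa
head = nn.Linear(16, 2)                                # binary classification head
model = nn.Sequential(encoder, head).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 32).to(device)          # stand-in for encoded tweets
y = torch.randint(0, 2, (8,)).to(device)   # hate / not-hate labels

for _ in range(3):                         # a few fine-tuning steps
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```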
- Accuracy: Overall correctness of predictions
- Precision: Proportion of positive predictions that are correct
- Recall: Proportion of actual positives that are detected (true positive rate)
- F1-Score: Balance between precision and recall
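A worked example on toy predictions shows how the four metrics relate:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy labels and predictions: TP=2, FN=1, TN=1, FP=1.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 3/5 = 0.6
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 2/3
```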
- Deep Learning Models: Generally outperform statistical models in terms of precision and recall.
- Contextual Understanding: Better capture of nuances and context in language.
- **Statistical Models**
  - Pros: Simplicity, quick deployment, lower computational cost.
  - Cons: Higher false positive rates, less nuanced understanding.
- **Deep Learning Models**
  - Pros: Better contextual understanding, improved precision and recall.
  - Cons: Higher computational requirements, longer training times.
- **Use Case Considerations**
  - Statistical Models: Suitable for simpler tasks with limited resources.
  - Deep Learning Models: Ideal for complex language tasks requiring nuanced understanding.
- Hybrid Approaches: Combine statistical and deep learning methods for improved performance.
- Resource Allocation: Consider computational resources when choosing between approaches.
- Domain Adaptation: Fine-tune models on domain-specific data for better results.
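One simple hybrid, for illustration only, averages the hate-speech probabilities of the two model families; both probability arrays below are made-up stand-ins for the outputs of a fitted statistical model and a fine-tuned transformer.

```python
import numpy as np

# Stand-in probabilities of the hate-speech class for three tweets.
p_stat = np.array([0.9, 0.2, 0.6])   # e.g. TF-IDF + Logistic Regression
p_deep = np.array([0.7, 0.1, 0.4])   # e.g. fine-tuned BERT

# Average the two models' probabilities, then threshold at 0.5.
p_hybrid = (p_stat + p_deep) / 2
hybrid_labels = (p_hybrid >= 0.5).astype(int)
```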
- Python 3.10+
- PyTorch
- Transformers
- scikit-learn
- NLTK
- Gensim
- pandas
- numpy
- **Text Preprocessing**
- **Word Embeddings**
- **BERT Architecture**
- **Sentence Transformers**