
Hate Speech Detection: Statistical vs. Deep Learning Approaches

Overview

This project compares statistical and deep learning-based embedding methods for hate speech detection. The analysis is carried out in two Jupyter notebooks, NLP/Statistical models.ipynb and NLP/DEEP Learning Models.ipynb, and focuses on the strengths and weaknesses of each approach in terms of accuracy, precision, recall, and F1-score.

Statistical Models

Preprocessing and Dataset Setup

  • Libraries Used: NLTK, scikit-learn, Gensim, Matplotlib
  • Resources: NLTK stopwords, lemmatization, GloVe embeddings
  • Dataset: cardiffnlp/tweet_eval hate-speech dataset
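
For reference, the dataset is available through the Hugging Face datasets library; a minimal loading sketch (the notebooks may read it differently):

```python
from datasets import load_dataset  # pip install datasets

# "hate" is the hate-speech subset of tweet_eval; labels: 0 = non-hate, 1 = hate
dataset = load_dataset("cardiffnlp/tweet_eval", "hate")
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()
```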

Text Preprocessing

  • Functionality: Handles invalid input, converts to lowercase, cleans Twitter elements, removes noise, and applies lemmatization.
  • Safe Preprocessing: Ensures error handling and adds a processed_text column.
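
A minimal sketch of what this step plausibly looks like; the exact regexes and helper names are assumptions, not the notebook's code (continuing from the loading sketch above):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, strip Twitter artifacts, drop noise and stopwords, lemmatize."""
    if not isinstance(text, str):                        # guard against invalid input
        return ""
    text = text.lower()
    text = re.sub(r"@\w+|#\w+|https?://\S+", " ", text)  # mentions, hashtags, URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation and digits
    tokens = [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# "Safe" application: preprocess() never raises, and adds the processed_text column
for df in (train_df, test_df):
    df["processed_text"] = df["text"].apply(preprocess)
```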

Embedding Methods

  1. Bag of Words (BoW)

    • Uses CountVectorizer with a max of 5000 features (all five methods are sketched in code after this list).
  2. TF-IDF

    • Employs TfidfVectorizer with a max of 5000 features.
  3. Word2Vec

    • Trains on tokenized tweets with vector size 100.
  4. FastText

    • Uses the same parameters as Word2Vec.
  5. GloVe

    • Utilizes pre-trained Twitter GloVe embeddings.
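
A condensed sketch of the five setups, using the parameter values quoted above; the mean-pooling used to turn word vectors into tweet vectors is an assumption (continuing from the preprocessing sketch):

```python
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec, FastText
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = train_df["processed_text"].tolist()
tokenized = [t.split() for t in texts]

# 1. Bag of Words: raw counts over a capped 5,000-term vocabulary
bow_vec = CountVectorizer(max_features=5000)
X_bow = bow_vec.fit_transform(texts)

# 2. TF-IDF: same vocabulary cap, frequency-weighted
tfidf_vec = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vec.fit_transform(texts)

# 3. Word2Vec trained on the tokenized tweets themselves
w2v = Word2Vec(sentences=tokenized, vector_size=100)

# 4. FastText with the same parameters as Word2Vec
ft = FastText(sentences=tokenized, vector_size=100)

# 5. Pre-trained 100-dimensional Twitter GloVe vectors
glove = api.load("glove-twitter-100")

def mean_vector(tokens, kv, dim=100):
    """Pool a tweet into one vector by averaging its in-vocabulary tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# The same pooling applies to ft.wv and glove
X_w2v = np.vstack([mean_vector(toks, w2v.wv) for toks in tokenized])
```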

Model Training and Evaluation

  • Model: Logistic Regression
  • Metrics: Classification report, accuracy, precision, recall, F1-score
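
With those pieces in place, training and evaluation reduce to the standard scikit-learn pattern (a sketch, not the notebook's exact code; BoW features shown, but any feature matrix above slots in):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Reuse the predefined tweet_eval splits from the loading sketch
X_train = X_bow
X_test = bow_vec.transform(test_df["processed_text"])
y_train, y_test = train_df["label"], test_df["label"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["non-hate", "hate"]))
```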

Results

| Model    | Accuracy | Hate Speech Detection Rate | False Positive Rate | F1 (Hate Speech) |
|----------|----------|----------------------------|---------------------|------------------|
| BoW      | 51.0%    | 90.1%                      | 77.5%               | 60.8%            |
| TF-IDF   | 50.7%    | 88.8%                      | 77.1%               | 60.3%            |
| Word2Vec | 47.6%    | 65.6%                      | 65.5%               | 51.3%            |
| FastText | 49.1%    | 67.3%                      | 64.1%               | 52.7%            |
| GloVe    | 57.8%    | 0.0%                       | 0.0%                | 0.0%             |

Note that GloVe's 0.0% detection rate and F1 mean the classifier never predicts the hate class: its higher accuracy simply mirrors the share of non-hate tweets, so it is unusable as a detector in this setup.

Deep Learning Models

Setup

  • Libraries: Transformers, PyTorch
  • Device: Utilizes GPU if available
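
Device selection is the standard PyTorch idiom:

```python
import torch

# Prefer a CUDA GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```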

Model Architecture

  • Transformer-based Models: BERT, RoBERTa, etc.
  • Training: Fine-tuning on the hate-speech dataset
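
The notebook holds the actual training code; a minimal Hugging Face fine-tuning sketch along these lines might look as follows (the BERT checkpoint, hyperparameters, and use of Trainer are assumptions):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("cardiffnlp/tweet_eval", "hate")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="hate-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
```

Trainer moves the model to the GPU automatically when one is present, consistent with the device setup above.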

Evaluation Metrics

  • Accuracy: Fraction of all predictions that are correct
  • Precision: Fraction of predicted hate-speech cases that are truly hate speech
  • Recall: Fraction of actual hate-speech cases that are detected (true positive rate)
  • F1-Score: Harmonic mean of precision and recall
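
For concreteness, the four metrics computed on a toy prediction vector:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = hate, 0 = non-hate
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")  # metrics for the hate class
print(acc, precision, recall, f1)      # all 0.75 here: 3 TP, 1 FP, 1 FN, 3 TN
```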

Results

  • Deep Learning Models: Generally outperform statistical models in terms of precision and recall.
  • Contextual Understanding: Better capture of nuances and context in language.

Key Insights

  1. Statistical Models

    • Pros: Simplicity, quick deployment, lower computational cost.
    • Cons: Higher false positive rates, less nuanced understanding.
  2. Deep Learning Models

    • Pros: Better contextual understanding, improved precision and recall.
    • Cons: Higher computational requirements, longer training times.
  3. Use Case Considerations

    • Statistical Models: Suitable for simpler tasks with limited resources.
    • Deep Learning Models: Ideal for complex language tasks requiring nuanced understanding.

Recommendations

  1. Hybrid Approaches: Combine statistical and deep learning methods for improved performance (a simple ensemble sketch follows this list).
  2. Resource Allocation: Consider computational resources when choosing between approaches.
  3. Domain Adaptation: Fine-tune models on domain-specific data for better results.
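
As an illustration of the first recommendation, a soft-voting ensemble is one simple hybrid; everything here (the blending weight, the probability sources) is a hypothetical sketch rather than an evaluated approach:

```python
import numpy as np

def soft_vote(p_stat, p_deep, weight=0.5):
    """Blend per-class probabilities from a statistical model (e.g. the
    TF-IDF + LogisticRegression pipeline's predict_proba) and a fine-tuned
    transformer (softmax over its logits), then take the argmax."""
    blended = weight * np.asarray(p_stat) + (1 - weight) * np.asarray(p_deep)
    return np.argmax(blended, axis=1)

# Usage: preds = soft_vote(clf.predict_proba(X_test), transformer_probs)
```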

Dependencies

  • Python 3.10+
  • PyTorch
  • Transformers
  • scikit-learn
  • NLTK
  • Gensim
  • pandas
  • numpy

References

Resources

  1. Text Preprocessing

  2. Word Embeddings

  3. BERT Architecture

  4. Sentence Transformers