Sentiment Analysis on Primate Dataset

This project performs sentiment analysis on a dataset of posts related to primates. A pretrained transformer model is used to predict a sentiment label for each post. The workflow is described step by step below:

Workflow Overview

  1. Data Preprocessing:

    • The dataset (primate_dataset.json) is loaded and preprocessed to prepare it for model training.
    • The clean_text function in preprocess_data.py cleans the text data by converting it to lowercase, removing punctuation, tokenizing, removing stopwords, and stemming.
    import pandas as pd
    import nltk
    import string
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    
    nltk.download('punkt')
    nltk.download('stopwords')
    
    def clean_text(data):
        # Implementation of text cleaning
        ...
        return result
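The body of clean_text is elided above. As a minimal sketch of the five steps listed in step 1, and not necessarily what preprocess_data.py does, the function below cleans a single string. The tokenize, stop_words and stem parameters are hypothetical hooks added for illustration; when omitted they fall back to the NLTK tools imported above. Returning a space-joined string is also an assumption.

```python
import string


def clean_text(text, tokenize=None, stop_words=None, stem=None):
    """Lowercase, strip punctuation, tokenize, drop stopwords, stem.

    The three keyword arguments are hypothetical hooks (not in the
    README); left as None, they default to the NLTK tools it imports.
    """
    if tokenize is None:
        from nltk import word_tokenize  # needs nltk.download('punkt')
        tokenize = word_tokenize
    if stop_words is None:
        from nltk.corpus import stopwords  # needs nltk.download('stopwords')
        stop_words = set(stopwords.words("english"))
    if stem is None:
        from nltk.stem.porter import PorterStemmer
        stem = PorterStemmer().stem

    # Steps 1-2: lowercase, then delete every ASCII punctuation character.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Steps 3-5: tokenize, drop stopwords, stem the remaining tokens.
    tokens = [stem(tok) for tok in tokenize(text) if tok not in stop_words]
    return " ".join(tokens)
```

Passing simple stand-ins, such as tokenize=str.split and a small hand-written stop-word set, lets the steps be tried without downloading any NLTK data.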
  2. Data Splitting:

    • The preprocessed data is split into training and testing sets for model evaluation.
    from sklearn.model_selection import train_test_split
    
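    # X and Y are assumed to hold the preprocessed texts and their sentiment labels.
    # test_size=0.2 holds out 20% for testing; random_state makes the split reproducible.
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)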
  3. Model Training:

    • The sentiment analysis model is trained using the training data.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    
    tokenizer = AutoTokenizer.from_pretrained("sbcBI/sentiment_analysis_model")
    model = AutoModelForSequenceClassification.from_pretrained("sbcBI/sentiment_analysis_model")
    
    # Model training code...
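The training code for step 3 is elided above. As a rough sketch only, one epoch of fine-tuning might look like the loop below. It assumes that loader yields (input_ids, attention_mask, labels) batches, for example from a DataLoader over a TensorDataset built from the tokenizer's output, and that optimizer is a PyTorch optimizer such as torch.optim.AdamW. train_one_epoch is a hypothetical helper name, not a function from this repository.

```python
def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one pass over `loader` and return the mean training loss.

    Hypothetical helper: assumes each batch is an
    (input_ids, attention_mask, labels) tuple of tensors.
    """
    model.train()  # enable dropout and other training-time behaviour
    total_loss, num_batches = 0.0, 0
    for input_ids, attention_mask, labels in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        # Passing `labels` makes Hugging Face sequence-classification
        # models return the loss together with the logits.
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        outputs.loss.backward()
        optimizer.step()

        total_loss += outputs.loss.item()
        num_batches += 1
    # Guard against an empty loader.
    return total_loss / max(num_batches, 1)
```

The function only relies on the standard model, optimizer and tensor methods, so the number of epochs, the learning rate and the batch size remain choices for train_model.py.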
  4. Model Evaluation:

    • The trained model is evaluated using the testing data to assess its performance.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
    import numpy as np
    
    # Model evaluation code...
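The evaluation code for step 4 is elided above. To make the reported numbers concrete, here is a dependency-free sketch of what accuracy_score, precision_score, recall_score and f1_score compute when the last three are called with average="macro". The choice of macro averaging and the macro_scores name are assumptions for illustration; the project may average differently, and roc_auc_score is left out because it needs class probabilities rather than predicted labels.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1.

    Illustrative only; expects two non-empty, equally long label sequences.
    """
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        # One-vs-rest confusion counts for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,  # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Macro averaging weights every class equally, which matters when sentiment classes are imbalanced; average="weighted" would instead weight each class by its number of true examples.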
  5. Optional Quantization:

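    • Optionally, the trained model can be dynamically quantized: the weights of its torch.nn.Linear layers are stored as 8-bit integers, which reduces memory usage and can speed up CPU inference.
    import torch
    from torch.quantization import quantize_dynamic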
    
    quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

Code Structure

  • preprocess_data.py: Contains functions for cleaning and preprocessing the dataset.
  • train_model.py: Script for training the sentiment analysis model.
  • evaluate_model.py: Script for evaluating the trained model on the test dataset.
  • quantize_model.py: Optional script for quantizing the trained model.
  • utils.py: Utility functions used across different scripts.

Setup and Dependencies

  1. Install the required Python packages:

    pip install pandas numpy scikit-learn transformers torch nltk
    
  2. Download NLTK data:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
  3. If a CUDA-capable GPU is available, use it for faster model training (torch.cuda.is_available() reports whether PyTorch can see one).

Usage

  1. Preprocess the dataset using preprocess_data.py.
  2. Train the sentiment analysis model using train_model.py.
  3. Evaluate the trained model using evaluate_model.py.
  4. Optionally, quantize the trained model using quantize_model.py.

Model Deployment

The trained model, or its quantized version, can be deployed for inference in production environments. Check that the deployment platform supports the required torch and transformers versions, and optimize for performance if necessary.
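At inference time the model returns one raw score (logit) per sentiment class. As a small sketch of the post-processing step, the function below turns those scores into a label and a probability with a softmax. The function name and the example label names are hypothetical; the real mapping from class index to label should be read from the model's config.id2label.

```python
import math


def logits_to_prediction(logits, labels):
    """Softmax a list of raw scores and return (label, probability)."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class (the first one on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical label order, for illustration only.
EXAMPLE_LABELS = ("negative", "neutral", "positive")
label, prob = logits_to_prediction([1.0, 3.0, 2.0], EXAMPLE_LABELS)
```

Returning the probability alongside the label lets a deployment apply a confidence threshold before acting on a prediction.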