Sentiment Analysis on Primate Dataset

This project performs sentiment analysis on a dataset of posts related to primates, using a transformer-based model to predict a sentiment label for each post. The workflow is described step by step below.

Workflow Overview

Data Preprocessing:

The dataset (primate_dataset.json) is loaded and preprocessed before model training.
The clean_text function in preprocess_data.py cleans each text by lowercasing it, removing punctuation, tokenizing, removing stopwords, and stemming the remaining tokens.

import pandas as pd
import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

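nltk.download('punkt')
# Recent NLTK releases (3.9+) load the word_tokenize models from 'punkt_tab'
nltk.download('punkt_tab')
nltk.download('stopwords')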

def clean_text(data):
    # Implementation of text cleaning
    ...
    return result
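
The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual clean_text: the stop_words and stem parameters are hypothetical additions that keep the sketch free of downloads, and whitespace splitting stands in for nltk.word_tokenize.

```python
import string
from typing import Callable, Iterable


def clean_text(text: str, stop_words: Iterable[str], stem: Callable[[str], str]) -> str:
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem.

    stop_words and stem are hypothetical parameters for this sketch; the
    project's own function may hard-code the NLTK equivalents instead.
    """
    stop_set = set(stop_words)
    # 1. Lowercase so that "Monkey" and "monkey" become the same token.
    lowered = text.lower()
    # 2. Remove ASCII punctuation characters.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize on whitespace (a simple stand-in for nltk.word_tokenize).
    tokens = no_punct.split()
    # 4. Drop stopwords, then 5. stem whatever remains.
    stems = [stem(token) for token in tokens if token not in stop_set]
    return " ".join(stems)
```

With NLTK installed and its data downloaded, the imports shown earlier would supply the arguments, for example stop_words=stopwords.words('english') and stem=PorterStemmer().stem.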

Data Splitting:

The preprocessed data is split into a training set (80%) and a test set (20%) so that the model can be evaluated on posts it did not see during training.

from sklearn.model_selection import train_test_split

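# X holds the cleaned texts and Y the sentiment labels
# (both assumed to come from the preprocessing step)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)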

Model Training:

A pretrained checkpoint (sbcBI/sentiment_analysis_model) is loaded together with its tokenizer and then trained on the training split.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, TensorDataset

tokenizer = AutoTokenizer.from_pretrained("sbcBI/sentiment_analysis_model")
model = AutoModelForSequenceClassification.from_pretrained("sbcBI/sentiment_analysis_model")

# Model training code...
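
The training code itself is elided above. As a rough idea of what such a loop usually looks like with DataLoader and TensorDataset, here is a minimal sketch. It is not the project's train_model.py: the train function, its parameters, and the synthetic data are hypothetical, and a small nn.Linear classifier stands in for the transformer so the example runs without downloading a checkpoint.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, loader, epochs=3, lr=1e-3, device=None):
    """Generic supervised training loop; returns the mean loss per epoch."""
    if device is None:
        # Use the GPU when one is available, otherwise fall back to CPU.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    for _ in range(epochs):
        model.train()
        running = 0.0
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(features)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
            # Weight the batch loss by batch size to get a per-sample mean.
            running += loss.item() * labels.size(0)
        history.append(running / len(loader.dataset))
    return history


# Synthetic stand-in data: 64 samples, 8 features, label decided by feature 0.
torch.manual_seed(0)
features = torch.randn(64, 8)
labels = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

toy_model = nn.Linear(8, 2)  # stand-in for the transformer classifier
losses = train(toy_model, loader, epochs=5, lr=0.05)
```

With the Hugging Face model, each batch would instead carry the tokenizer outputs (input_ids and attention_mask), and the logits would be read from the .logits attribute of the model output.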

Model Evaluation:

The trained model is evaluated using the testing data to assess its performance.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Model evaluation code...
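
To make explicit what the imported metrics measure, the sketch below computes accuracy, precision, recall, and F1 by hand. It assumes binary labels in {0, 1}, which the README does not confirm, and the binary_metrics helper is hypothetical; in the project these values would presumably come from the sklearn.metrics functions imported above. ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}.

    Undefined ratios (zero denominator) are reported as 0.0, a choice
    made for this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Precision answers "of the posts predicted positive, how many were positive?", while recall answers "of the positive posts, how many were found?"; F1 is their harmonic mean.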

Optional Quantization:

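Optionally, the trained model can be quantized. Dynamic quantization stores the weights of the model's Linear layers as 8-bit integers, which reduces memory usage and typically speeds up inference on CPU.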

import torch
from torch.quantization import quantize_dynamic

quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
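
One way to check the memory saving is to compare the serialized size of the model before and after quantization. The sketch below does this on a small stand-in network rather than the project's transformer; the state_dict_bytes helper and the toy layer sizes are hypothetical, and it assumes a PyTorch build with a quantized CPU backend available.

```python
import io

import torch
from torch import nn
from torch.quantization import quantize_dynamic


def state_dict_bytes(module: nn.Module) -> int:
    """Size in bytes of the module's state_dict when serialized in memory."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Small stand-in network; the real model would be the trained classifier.
toy = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2)).eval()

# Same call as above: only nn.Linear layers are converted to int8 weights.
quantized = quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)

size_fp32 = state_dict_bytes(toy)
size_int8 = state_dict_bytes(quantized)

# The quantized model is still called like the original one.
with torch.no_grad():
    out = quantized(torch.randn(4, 256))
```

Because quantization changes the weights slightly, it is worth re-running the evaluation step on the quantized model before relying on it.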

Code Structure

preprocess_data.py: Contains functions for cleaning and preprocessing the dataset.
train_model.py: Script for training the sentiment analysis model.
evaluate_model.py: Script for evaluating the trained model on the test dataset.
quantize_model.py: Optional script for quantizing the trained model.
utils.py: Utility functions used across different scripts.

Setup and Dependencies

Install the required Python packages:

pip install pandas transformers torch nltk scikit-learn

Download NLTK data:

import nltk
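nltk.download('punkt')
# Recent NLTK releases (3.9+) load the word_tokenize models from 'punkt_tab'
nltk.download('punkt_tab')
nltk.download('stopwords')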

If a CUDA-capable GPU is available, install a CUDA-enabled build of PyTorch so that training can use it. Training also works on CPU, only more slowly.

Usage

Preprocess the dataset using preprocess_data.py.
Train the sentiment analysis model using train_model.py.
Evaluate the trained model using evaluate_model.py.
Optionally, quantize the trained model using quantize_model.py.

Model Deployment

The trained and optionally quantized model can be deployed for inference in production environments. Ensure compatibility with the deployment platform and optimize for performance if necessary.
