This project performs sentiment analysis on a dataset of posts related to primates. Transformer-based models are used to predict sentiment labels for the textual data. The project workflow is described below:
- **Data Preprocessing:**
  - The dataset (`primate_dataset.json`) is loaded and preprocessed to prepare it for model training.
  - The `clean_text` function in `preprocess_data.py` cleans the text data by converting it to lowercase, removing punctuation, tokenizing, removing stopwords, and stemming; a sketch of how it is applied to the dataset follows the code below.
```python
import pandas as pd
import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

def clean_text(data):
    # Cleaning steps as described above: lowercase, strip punctuation,
    # tokenize, remove stopwords, stem.
    text = data.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    result = ' '.join(tokens)
    return result
```
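As a rough illustration of how this step might be wired together, the sketch below loads the dataset with pandas and applies `clean_text` to each post. The `post_text` and `clean_text` column names are assumptions for illustration, not the dataset's confirmed schema.

```python
import pandas as pd

from preprocess_data import clean_text

# Load the raw posts; 'post_text' is an assumed field name -- adjust it to the
# actual schema of primate_dataset.json.
df = pd.read_json('primate_dataset.json')
df['clean_text'] = df['post_text'].apply(clean_text)
print(df[['post_text', 'clean_text']].head())
```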
- **Data Splitting:**
  - The preprocessed data is split into training and testing sets for model evaluation (the sketch after the code below shows how `X` and `Y` can be derived).
```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
```
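Continuing the preprocessing sketch above, `X` and `Y` might be derived from the cleaned dataframe as follows. The `clean_text` and `label` column names are assumed for illustration and should be adjusted to the dataset's actual fields.

```python
from sklearn.model_selection import train_test_split

# Assumed columns: 'clean_text' holds the preprocessed posts, 'label' holds the
# sentiment annotations (adjust both to the real dataframe produced above).
X = df['clean_text'].tolist()
Y = df['label'].tolist()

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```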
- **Model Training:**
  - The sentiment analysis model is trained using the training data; a training-loop sketch follows the code below.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, TensorDataset

tokenizer = AutoTokenizer.from_pretrained("sbcBI/sentiment_analysis_model")
model = AutoModelForSequenceClassification.from_pretrained("sbcBI/sentiment_analysis_model")

# Model training code...
```
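The training code itself lives in `train_model.py`; the sketch below shows one plausible fine-tuning loop over the split from the previous step. The batch size, learning rate, number of epochs, and maximum sequence length are illustrative placeholders, and the labels in `Y_train` are assumed to be integer class indices compatible with the model's classification head.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("sbcBI/sentiment_analysis_model")
model = AutoModelForSequenceClassification.from_pretrained("sbcBI/sentiment_analysis_model")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Tokenize the training texts into fixed-length tensors.
encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128, return_tensors="pt")
train_dataset = TensorDataset(
    encodings["input_ids"], encodings["attention_mask"], torch.tensor(Y_train)
)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # illustrative number of epochs
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()  # classification loss computed by the model head
        optimizer.step()
```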
- **Model Evaluation:**
  - The trained model is evaluated using the testing data to assess its performance; an evaluation sketch follows the code below.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Model evaluation code...
```
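The evaluation code lives in `evaluate_model.py`; the sketch below shows one way the listed metrics might be computed on the test split, reusing the `model`, `tokenizer`, and `device` from the training sketch. The `weighted` averaging choice is an illustrative assumption for multi-class sentiment labels.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Tokenize the held-out texts and run the model without gradient tracking.
model.eval()
encodings = tokenizer(X_test, truncation=True, padding=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(
        input_ids=encodings["input_ids"].to(device),
        attention_mask=encodings["attention_mask"].to(device),
    ).logits

predictions = np.argmax(logits.cpu().numpy(), axis=1)

# roc_auc_score would additionally require predicted probabilities
# (e.g. a softmax over the logits) rather than hard class predictions.
print("Accuracy: ", accuracy_score(Y_test, predictions))
print("Precision:", precision_score(Y_test, predictions, average="weighted", zero_division=0))
print("Recall:   ", recall_score(Y_test, predictions, average="weighted", zero_division=0))
print("F1 score: ", f1_score(Y_test, predictions, average="weighted", zero_division=0))
```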
- **Optional Quantization:**
  - Optionally, the trained model can be quantized to reduce memory usage and improve inference speed.
```python
import torch
from torch.quantization import quantize_dynamic

quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```
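As a quick sanity check on the memory savings, the sketch below serializes both state dicts and compares their sizes on disk. The file names are illustrative, and dynamic quantization of this kind is primarily aimed at CPU inference.

```python
import os

import torch

# Serialize both versions and compare on-disk sizes (file names are illustrative).
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized_model.state_dict(), "model_int8.pt")

print("FP32 model: %.1f MB" % (os.path.getsize("model_fp32.pt") / 1e6))
print("INT8 model: %.1f MB" % (os.path.getsize("model_int8.pt") / 1e6))
```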
The project is organized into the following scripts:

- `preprocess_data.py`: Functions for cleaning and preprocessing the dataset.
- `train_model.py`: Script for training the sentiment analysis model.
- `evaluate_model.py`: Script for evaluating the trained model on the test dataset.
- `quantize_model.py`: Optional script for quantizing the trained model.
- `utils.py`: Utility functions used across the different scripts.
- Install the required Python packages (scikit-learn is added here because the splitting and evaluation code imports `sklearn`):

```bash
pip install pandas scikit-learn transformers torch nltk
```
- Download the required NLTK data:

```python
import nltk

nltk.download('punkt')
nltk.download('stopwords')
```
- If a GPU is available, use it for faster model training.
- Preprocess the dataset using `preprocess_data.py`.
- Train the sentiment analysis model using `train_model.py`.
- Evaluate the trained model using `evaluate_model.py`.
- Optionally, quantize the trained model using `quantize_model.py`.
The trained and optionally quantized model can be deployed for inference in production environments. Ensure compatibility with the deployment platform and optimize for performance if necessary.
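A minimal inference sketch for deployment is shown below, assuming the fine-tuned model and tokenizer were saved with `save_pretrained()` to a local directory; the `fine_tuned_model` path is an assumed name, not one defined by this project.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed output directory produced by model.save_pretrained(...) and
# tokenizer.save_pretrained(...) after training.
tokenizer = AutoTokenizer.from_pretrained("fine_tuned_model")
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_model")
model.eval()

def predict_sentiment(text: str) -> int:
    """Return the index of the predicted sentiment label for a single post."""
    inputs = tokenizer(text, truncation=True, padding=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=1).item())

print(predict_sentiment("This is an example post."))
```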