Sentiment analysis laser #274

Merged
Commits (25)
1ad6d26
sentiment analysis using laser encoders
NIXBLACK11 Nov 24, 2023
b40847c
Updated readme
NIXBLACK11 Nov 24, 2023
f73a835
Update README.md
NIXBLACK11 Nov 24, 2023
603a56c
Added button to run on collab
NIXBLACK11 Nov 28, 2023
a1dbe66
Merge branch 'Sentiment-analysis-laser' of github.com:NIXBLACK11/LASE…
NIXBLACK11 Nov 28, 2023
99b242a
Added button to run on collab
NIXBLACK11 Nov 29, 2023
3b22a18
added context to the notebook
NIXBLACK11 Dec 1, 2023
599b4cd
Added the steps to download
NIXBLACK11 Dec 4, 2023
069471f
Changed the dataset which helped the accuracy to 83%
NIXBLACK11 Dec 5, 2023
fcf72cf
Updated the readme
NIXBLACK11 Dec 5, 2023
3e6a144
update
NIXBLACK11 Dec 5, 2023
5de9edc
Update SentimentAnalysis.ipynb
NIXBLACK11 Dec 5, 2023
d12059d
Update README.md
NIXBLACK11 Dec 6, 2023
48cc180
balanced the training and test
NIXBLACK11 Dec 6, 2023
fb869cb
Add files via upload
NIXBLACK11 Dec 6, 2023
bdb4102
balanced
NIXBLACK11 Dec 6, 2023
0582aa7
Merge branch 'Sentiment-analysis-laser' of github.com:NIXBLACK11/LASE…
NIXBLACK11 Dec 6, 2023
19c8966
updated notebook
NIXBLACK11 Dec 6, 2023
1405e67
Changed title of step 14 SentimentAnalysis.ipynb
NIXBLACK11 Dec 7, 2023
b4842a5
Update SentimentAnalysis.ipynb
NIXBLACK11 Dec 7, 2023
a4e0c60
Update SentimentAnalysis.ipynb
NIXBLACK11 Dec 7, 2023
9d616e1
Update SentimentAnalysis.ipynb
NIXBLACK11 Dec 7, 2023
da22645
updated
NIXBLACK11 Dec 7, 2023
6df5966
Changed the title of step 14
NIXBLACK11 Dec 7, 2023
93a6494
updated sentimentAnalysis.ipynb
NIXBLACK11 Dec 7, 2023
32 changes: 32 additions & 0 deletions tasks/SentimentAnalysis/README.md
@@ -0,0 +1,32 @@
# Laser Encoder: Sentiment Analysis

## Overview

This project demonstrates how the Laser Encoder can be used to create sentence embeddings for sentiment analysis. The encoder embeds the text data, and a small classifier is then trained on those embeddings to predict the sentiment of each text.
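As a minimal illustration (a sketch only, using the same calls as the notebook below; the example sentences are placeholders), encoding a few sentences with the laser_encoders package looks roughly like this:

```python
from laser_encoders import LaserEncoderPipeline

# Initialize the encoder for English (the notebook uses the same language code)
encoder = LaserEncoderPipeline(lang="eng_Latn")

# Each sentence is mapped to a 1024-dimensional LASER embedding
embeddings = encoder.encode_sentences([
    "I really enjoyed this movie.",
    "The service was terrible.",
])
print(embeddings.shape)  # expected: (2, 1024)
```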

## Getting Started

To run the notebook in Google Colab, simply click the "Open in Colab" button below:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12gQUG7rPJvOVeWQkpMFzMiixqwDIdv4W?usp=sharing)
@avidale (Contributor) commented on Nov 29, 2023:

It seems that you now have two independent notebooks: one in Google Drive (tied to Colab) and another here on GitHub.

I suggest that instead we keep only one copy of the notebook, the one on GitHub, and modify the Colab URL so that it always loads the version from GitHub. The URL will look like https://colab.research.google.com/github/NIXBLACK11/LASER-fork/blob/Sentiment-analysis-laser/tasks/SentimentAnalysis/SentimentAnalysis.ipynb; you'll only need to update the path so that it refers to the final destination (the main branch of the LASER repository).

(I found this trick here)


## Example Usage

- Download Dataset:
Download the sample dataset from the following link: [Sample Dataset](https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset)

- Run the Example Notebook:
Execute the provided Jupyter notebook SentimentAnalysis.ipynb

```
jupyter notebook SentimentAnalysis.ipynb
```


## Customization

- Modify the model architecture, hyperparameters, and training settings in the neural network model section based on your requirements (a sketch of such a variant is shown below).
- Customize the sentiment mapping and handling of unknown sentiments in the data preparation section.
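For example, a hypothetical variant of the notebook's classifier might look like the sketch below (layer sizes, dropout rate, and optimizer are illustrative, not recommendations):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# A slightly deeper variant of the notebook's model; 1024 is the LASER embedding size
model = Sequential([
    Dense(128, input_shape=(1024,), activation='relu'),
    Dropout(0.2),                    # illustrative regularization
    Dense(64, activation='relu'),
    Dense(3, activation='softmax'),  # neutral / positive / negative
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```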

## Additional Notes
- Feel free to experiment with different models, embeddings, and hyperparameters to optimize performance.
- Ensure that the dimensions of the embeddings and the model inputs are compatible (see the check sketched after this list).
- Adapt the code to your specific dataset and use case.
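A minimal compatibility check along these lines, assuming the `X_train_embeddings`, `y_train`, and `model` variables from the notebook, could be:

```python
# LASER sentence embeddings are 1024-dimensional; the model's first layer must expect the same width
assert X_train_embeddings.shape[1] == 1024
assert X_train_embeddings.shape[0] == len(y_train)  # one label per embedding
```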
186 changes: 186 additions & 0 deletions tasks/SentimentAnalysis/SentimentAnalysis.ipynb
@@ -0,0 +1,186 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install laser_encoders\n",
"! pip install chardet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import chardet\n",
"from laser_encoders import LaserEncoderPipeline\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import Dense\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('/content/drive/MyDrive/dataset/train.csv', 'rb') as f:\n",
Contributor:

You seem to be using Google Drive here, but there is no code above that mounts it. This is confusing.

Contributor (author):

Yes, I have to add that.
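For reference, the standard Colab snippet for mounting Google Drive, which would need to run before the notebook reads from /content/drive (it is not yet part of the notebook), is:

```python
from google.colab import drive

# Makes the Drive folder containing train.csv available under /content/drive
drive.mount('/content/drive')
```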

Contributor:

Can we maybe add above some text about where and how to download this dataset?
Currently, those who open the notebook directly have no idea where to get it.

" result = chardet.detect(f.read())\n",
"\n",
"# Use the detected encoding when reading the CSV file\n",
"data = pd.read_csv('/content/drive/MyDrive/dataset/train.csv', encoding=result['encoding'])\n",
"data = data[['sentiment', 'text']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(data.head())\n",
"print(data.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sentiments = []\n",
"texts = []\n",
"\n",
"for index, row in data.iterrows():\n",
" sentiment = row['sentiment'].lower() # Convert to lowercase for case-insensitivity\n",
" if sentiment == 'neutral':\n",
" sentiments.append(1)\n",
" elif sentiment == 'positive':\n",
" sentiments.append(2)\n",
" elif sentiment == 'negative':\n",
" sentiments.append(3)\n",
" else:\n",
" # Handle the case where sentiment is not one of the expected values\n",
" # You may choose to skip this row or handle it differently based on your requirements\n",
" print(f\"Warning: Unknown sentiment '{sentiment}' in row {index}\")\n",
"\n",
" text = row['text']\n",
" texts.append(text)\n",
"\n",
"print(len(sentiments))\n",
"print(len(texts))\n",
"sentiments = sentiments[:300] + sentiments[400:]\n",
"texts = texts[:300] + texts[400:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"label_encoder = LabelEncoder()\n",
"encoded_sentiments = label_encoder.fit_transform(sentiments)\n",
"\n",
"# Split the data into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(texts, encoded_sentiments, test_size=0.2, random_state=42)\n",
"\n",
"# Initialize the LaserEncoder\n",
"encoder = LaserEncoderPipeline(lang=\"eng_Latn\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize empty arrays to store embeddings\n",
"X_train_embeddings = []\n",
"X_test_embeddings = []\n",
"\n",
"# Encode sentences line-wise using tqdm for progress visualization\n",
"print(\"Encoding training sentences:\")\n",
"for sentence in tqdm(X_train):\n",
" embeddings = encoder.encode_sentences([sentence])[0]\n",
" X_train_embeddings.append(embeddings)\n",
"\n",
"print(\"Encoding testing sentences:\")\n",
"for sentence in tqdm(X_test):\n",
" embeddings = encoder.encode_sentences([sentence])[0]\n",
" X_test_embeddings.append(embeddings)\n",
"\n",
"# Convert lists to numpy arrays\n",
"X_train_embeddings = np.array(X_train_embeddings)\n",
"X_test_embeddings = np.array(X_test_embeddings)\n",
"\n",
"# # Encode sentences line-wise\n",
"# X_train_embeddings = np.array([encoder.encode_sentences([sentence])[0] for sentence in X_train])\n",
"# X_test_embeddings = np.array([encoder.encode_sentences([sentence])[0] for sentence in X_test])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build a simple neural network model\n",
"model = Sequential()\n",
"model.add(Dense(64, input_shape=(1024,), activation='relu'))\n",
"model.add(Dense(3, activation='softmax')) # Assuming 3 classes (neutral, positive, negative)\n",
"\n",
"# Compile the model\n",
"model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])\n",
"\n",
"# Train the model\n",
"model.fit(X_train_embeddings, y_train, epochs=20, batch_size=32, validation_split=0.1)\n",
"\n",
"# Evaluate the model on the test set\n",
"accuracy = model.evaluate(X_test_embeddings, y_test)[1]\n",
"print(f\"Accuracy: {accuracy * 100:.2f}%\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Now, you can use the trained model to predict the sentiment of user input\n",
"user_text = input(\"Enter a text: \")\n",
"user_text_embedding = encoder.encode_sentences([user_text])[0]\n",
"user_text_embedding = np.reshape(user_text_embedding, (1, -1))\n",
"\n",
"predicted_sentiment = np.argmax(model.predict(user_text_embedding))\n",
"predicted_sentiment_no = label_encoder.inverse_transform([predicted_sentiment])[0]\n",
"if predicted_sentiment_no == 1:\n",
" predicted_sentiment_label = 'neutral'\n",
"elif predicted_sentiment_no == 2:\n",
" predicted_sentiment_label = 'positive'\n",
"else:\n",
" predicted_sentiment_label = 'negative'\n",
"\n",
"print(f\"Predicted Sentiment: {predicted_sentiment_label}\")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}