NLP Challenge: Sentiment Analysis on the IMDB Dataset of 50K Movie Reviews
Perform a thorough Exploratory Data Analysis of the dataset and report the final performance metrics for your approach. Suggest ways in which you can improve the model.
Exploration:
- Balanced dataset: Check whether the number of positive and negative reviews is similar.
- Word frequencies: Identify frequently appearing words through a word cloud.
- Token frequencies: Identify frequently appearing tokens to understand cleaning requirements (a quick EDA sketch follows this list).
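A minimal sketch of these checks, assuming the data is loaded from the standard Kaggle CSV into a pandas DataFrame with review and sentiment columns (the file and column names are assumptions, not taken from the original code):

import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assumed file/column names for the Kaggle IMDB 50K dataset
df = pd.read_csv('IMDB Dataset.csv')

# Class balance: counts of positive vs. negative reviews
print(df['sentiment'].value_counts())

# Word cloud of frequently appearing words (sampled for speed)
text = ' '.join(df['review'].sample(5000, random_state=42))
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Raw token frequencies: reveals HTML tags, punctuation, etc. that need cleaning
print(Counter(text.lower().split()).most_common(20))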
Preprocessing & Cleaning:
- HTML Tag Removal: Remove all HTML tags (e.g., <br />) from the reviews.
- Lowercasing: Convert all characters to lowercase to ensure uniformity.
- Stop Words Removal: Remove common words that don't contribute much to the sentiment.
- Stemming/Lemmatization: Reduce words to their base or root form.
- Punctuation Removal: Remove punctuation marks.
- Handling Special Characters and Numbers: Decide whether to remove or keep special characters and numbers based on their relevance.
- Text Normalization: Expand contractions (e.g., "don't" to "do not"). A sketch of such a cleaning function follows this list.
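One way these cleaning steps could look in code; this is a sketch rather than the original implementation, and the NLTK stopword list, WordNet lemmatizer, and small contraction map are my assumptions:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# A few common contractions; a fuller mapping (or the contractions package) could be substituted
contraction_map = {"don't": "do not", "can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_review(text):
    text = re.sub(r'<[^>]+>', ' ', text)                 # remove HTML tags
    text = text.lower()                                  # lowercase
    for short, full in contraction_map.items():          # expand contractions
        text = text.replace(short, full)
    text = re.sub(r'[^a-z\s]', ' ', text)                # drop punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # reduce words to their base form
    return ' '.join(tokens)

df['clean_review'] = df['review'].apply(clean_review)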
Modeling:
- Bidirectional RNN (LSTM) with Embeddings (see the sketch after this list)
- Logistic Regression with TF-IDF Vectorization
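A sketch of the Keras-tokenizer BiLSTM under placeholder hyperparameters (the max_words, max_len, and embedding_dim values below are assumptions, not the tuned ones). It uses a trainable embedding layer here; initializing it with the pre-trained Word2Vec vectors is shown after the improvements list. X_train/X_test are the cleaned review texts and y_train/y_test the labels (assumed already encoded as 0/1) from a train/test split:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 20000     # vocabulary size (placeholder)
max_len = 200         # padded sequence length (placeholder)
embedding_dim = 100   # embedding size (placeholder)

# Standard Keras tokenization and padding
keras_tokenizer = Tokenizer(num_words=max_words)
keras_tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(keras_tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_seq = pad_sequences(keras_tokenizer.texts_to_sequences(X_test), maxlen=max_len)

# Stacked bidirectional LSTM on top of an embedding layer
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_words, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_seq, y_train, epochs=3, batch_size=64,
          validation_data=(X_test_seq, y_test))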
Bidirectional RNN with Embeddings:
- Optimize the
max_len
andmax_words
components during tokenization - Optimize the
embedding_dim
in Word2Vec embeddings - Experiment with different Word2Vec embeddings besides the
word2vec-google-news-300
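For reference, the pre-trained Google News vectors can be pulled with gensim's downloader and used to initialize the embedding layer; the variable names follow the BiLSTM sketch above and are illustrative:

import numpy as np
import tensorflow as tf
import gensim.downloader as api

# Load pre-trained 300-dimensional Google News Word2Vec vectors
wv = api.load('word2vec-google-news-300')
embedding_dim = wv.vector_size  # 300

# Build an embedding matrix aligned with the Keras tokenizer's word index
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, idx in keras_tokenizer.word_index.items():
    if idx < max_words and word in wv:
        embedding_matrix[idx] = wv[word]

# Swap this layer into the model above; trainable=False keeps the vectors frozen
embedding_layer = tf.keras.layers.Embedding(
    max_words, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)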
Logistic Regression with TF-IDF Vectorization:
- Optimize the n-gram range (n_grams) used in TF-IDF vectorization (see the sketch below)
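A sketch of this baseline with a small grid search over the n-gram range (the parameter grid and other settings are assumptions); X_train/X_test are the cleaned review texts and y_train/y_test the labels:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50000)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Search over unigrams, unigrams+bigrams, and unigrams+trigrams
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))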
Due to resource limitations, using the pre-trained BERT model for tokenization and feature extraction was taking too long to run, so I decided to continue the project with a standard Keras tokenizer instead. However, the code is included below for anyone who wishes to try it out.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
# Initialize BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize text data
def bert_tokenize(texts, tokenizer, max_len):
return tokenizer(texts, padding='max_length', truncation=True, max_length=max_len, return_tensors='tf')
max_len = 100 # Define the maximum sequence length
# X_train / X_test are the raw review texts and y_train / y_test the labels from the earlier train/test split
X_train_bert = bert_tokenize(X_train.tolist(), bert_tokenizer, max_len)
X_test_bert = bert_tokenize(X_test.tolist(), bert_tokenizer, max_len)
# Define the model: BERT encoder followed by a stacked BiLSTM head
input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
# [0] is the last hidden state (one vector per token)
bert_output = bert_model(input_ids=input_ids, attention_mask=attention_mask)[0]
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(bert_output)
dropout = tf.keras.layers.Dropout(0.5)(bi_lstm)
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(dropout)
output = tf.keras.layers.Dense(1, activation='sigmoid')(bi_lstm)
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit({'input_ids': X_train_bert['input_ids'], 'attention_mask': X_train_bert['attention_mask']},
y_train, epochs=3, batch_size=16,
validation_data=({'input_ids': X_test_bert['input_ids'], 'attention_mask': X_test_bert['attention_mask']}, y_test))
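If the full model above is still too slow to train, one option is to freeze the BERT weights (bert_model.trainable = False) before compiling, so that only the BiLSTM head is trained on top of fixed BERT features.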