TextCoherenceOnSocialMedia

This repository contains the project done during CS535. With social media becoming the number one platform for individuals to reach a global audience, writing coherently is more crucial than ever. However, the colloquialisms of social media make assessing coherence a challenging task. In this project, we automate the generation of a dataset of coherent and incoherent social media posts. We also describe the pre-processing pipeline, a critical step given the varied nature of social media text. Furthermore, we develop a sliding-window convolutional neural network (CNN) architecture to capture the semantics of social media. Finally, we compare our results with existing architectures on both social media and non-social-media corpora.

Refer to reports/EndTerm_Report for detailed information about the project, including a description of the proposed method and the experimental analysis.

Our source code is split into three Colab notebooks:

  1. Text_Coherence_Dataset_Generation: contains code to synthesize the two datasets described in the report (a minimal illustrative sketch of the generation idea appears after this list):
    • one in which Reddit posts and comments are treated as separate documents, and
    • one in which each Reddit post and its top 10 comments are treated as a single document.
  • Note: to run this notebook successfully, make sure the redditDataset folder is in your Google Drive root directory.
  • Use the following link to download the zipped dataset folder: https://www.kaggle.com/ammar111/reddit-top-1000
  • Extract it, rename the folder to redditDataset, and upload it to your Drive before running the steps ahead.
  2. Text_Coherence_Pre-Processing: contains code for data pre-processing and vector-embedding generation as described in the report (a sketch of the pipeline appears after this list).
  • The notebook imports the social media and NTSB training and testing data directly from Kaggle.
  • GloVe embeddings are imported from Google Drive.
  • The following pre-processing steps are applied:
    • Sentence-level cleaning: removing non-alphanumeric characters
    • Word/token-level cleaning: stemming each word and removing stop words
    • Gensim simple_preprocess
    • LDA-based topic modelling
    • GloVe embedding generation for each sentence
  • Output: word embeddings for training and testing on the NTSB and social media corpora.
  • A sample of the generated embeddings is included at the end of the Colab notebook.
  • Note: individual cells can take over an hour to run and can exhaust the Colab RAM, so run only the cells you need.
  3. TextCoherenceModel: contains the code for training and testing the text coherence model (a sketch of a sliding-window CNN classifier appears after this list).
  • Note: you must first run the pre-processing notebook to generate word embeddings for each dataset. This notebook trains and tests models on three datasets:
    • the generated social media dataset that treats posts and comments as separate documents,
    • the generated social media dataset that treats each post and its top 10 comments as a single document, and
    • the Accident Reports (NTSB) dataset, the standard corpus for text coherence analysis.
  • We train a model on each of the three datasets and test every model on all three test sets; see the report for the result analysis.
  • Ensure the word embeddings for the training and testing data are present in your Google Drive. For our notebook, we kept a folder named nlp in Google Drive with three sub-folders: separate, combined, and accident. Each sub-folder contains two files: the word embeddings for the training and the testing dataset.
  • Once the word embeddings are in the required folders with the names mentioned in the notebook, you can run the notebook end to end.
  • Note: training and testing take a few hours to complete.
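The dataset-generation notebook (item 1) is the reference for how coherent and incoherent examples are actually built. As a rough illustration only, the sketch below assumes the common strategy of labelling an original post as coherent and a sentence-shuffled copy as incoherent; the `make_pairs` helper and its arguments are hypothetical and are not code from the notebook.

```python
# Hypothetical sketch only: the real logic lives in Text_Coherence_Dataset_Generation.
# Assumption: an original post is "coherent" (label 1) and a sentence-shuffled
# copy of it is "incoherent" (label 0).
import random
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

def make_pairs(documents, seed=42):
    """Return (text, label) pairs from a list of post/comment strings."""
    rng = random.Random(seed)
    pairs = []
    for doc in documents:
        sentences = sent_tokenize(doc)
        if len(sentences) < 2:            # a single sentence cannot be permuted
            continue
        pairs.append((doc, 1))            # coherent original
        shuffled = sentences[:]
        while shuffled == sentences:      # make sure the order actually changes
            rng.shuffle(shuffled)
        pairs.append((" ".join(shuffled), 0))  # incoherent counterpart
    return pairs
```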
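The pre-processing steps listed for Text_Coherence_Pre-Processing (item 2) can be pictured roughly as below. This is a hedged sketch, not the notebook's code: the GloVe file name, embedding size, and the use of Gensim's Porter stemmer and stop-word list are assumptions.

```python
# Rough sketch of the listed pre-processing steps; the notebook is authoritative.
import re
import numpy as np
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import STOPWORDS

stemmer = PorterStemmer()

def clean_sentence(sentence):
    """Sentence-level cleaning: drop non-alphanumeric characters."""
    return re.sub(r"[^0-9a-zA-Z\s]", " ", sentence)

def tokenize(sentence):
    """Token-level cleaning: simple_preprocess, stop-word removal, stemming."""
    tokens = simple_preprocess(clean_sentence(sentence))
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

def topic_model(token_lists, num_topics=10):
    """LDA-based topic modelling over tokenised documents."""
    dictionary = corpora.Dictionary(token_lists)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    return models.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary)

def load_glove(path="glove.6B.100d.txt"):   # hypothetical file name on Drive
    """Load GloVe vectors into a dict of word -> np.ndarray."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed_sentence(sentence, glove, dim=100):
    """Average the GloVe vectors of a sentence's tokens (stems missing from
    the GloVe vocabulary are simply skipped)."""
    vecs = [glove[t] for t in tokenize(sentence) if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```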
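Finally, for TextCoherenceModel (item 3), a sliding-window CNN over sentence embeddings can be sketched with Keras as follows. The layer sizes, window width, and padding length are illustrative assumptions and are not taken from the notebook or the report.

```python
# Illustrative sliding-window CNN coherence classifier (assumed hyperparameters).
from tensorflow.keras import layers, models

MAX_SENTENCES = 30   # assumed number of (padded) sentences per document
EMBED_DIM = 100      # assumed sentence-embedding size
WINDOW = 3           # assumed number of sentences per sliding window

def build_model():
    inputs = layers.Input(shape=(MAX_SENTENCES, EMBED_DIM))
    # Conv1D slides a window over WINDOW consecutive sentence embeddings.
    x = layers.Conv1D(filters=128, kernel_size=WINDOW, activation="relu")(inputs)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # coherent vs. incoherent
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (hypothetical array names):
# model = build_model()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10)
```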
