Natural Language Processing Project

Overview:

This project applies text analysis techniques to unstructured text spread across multiple documents, with the aim of extracting insights and uncovering hidden themes. The analysis grouped 42 .txt files into 5 topics and classified the overall sentiment of each file. The process included:

  • Data understanding and preparation, including removing punctuation marks, converting all letters to lowercase, stemming, etc.
  • Exploratory data analysis, including word frequency, TF-IDF, word clouds, and bigrams
  • Clustering using k-means, hierarchical clustering, and network graphs
  • Latent semantic analysis for semantic similarity, plus sentiment analysis
  • Topic modelling using the Latent Dirichlet Allocation (LDA) algorithm
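
The preparation and exploratory steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names (`preprocess`, `bigrams`, `tf_idf`, `cosine`) are hypothetical, the suffix-stripping stemmer is a crude stand-in for a real stemmer such as Porter's, and the sample sentences are made up. It assumes whitespace tokenisation and the plain `tf * log(N / df)` weighting.

```python
import math
import string
from collections import Counter


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, and stem crudely."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    stemmed = []
    for token in text.split():
        # Hypothetical stand-in for a real stemmer: strip one common suffix.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample texts, standing in for the project's .txt files.
    raw = [
        "Cats are running, quickly!",
        "Dogs are running and barking.",
        "Stock markets dropped sharply.",
    ]
    docs = [preprocess(text) for text in raw]
    print(Counter(token for doc in docs for token in doc).most_common(3))
    print(bigrams(docs[0]))
    vectors = tf_idf(docs)
    print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The resulting TF-IDF vectors are the kind of input that the clustering and similarity steps would consume. In practice a library implementation would likely replace each of these helpers.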

Output:

The approach used, assumptions, and supporting rationale for each stage of the CRISP-DM framework; results and recommendations, with supporting visualisations and summary data; and an evaluation of the results of the different techniques, with reasons for the final approach.

An appendix including working code.

A blog post reflecting on the use of the techniques of text analysis in the workplace.

Edit on May 39, 2020