This project applies text analysis techniques to unstructured data spread across multiple text documents, with the aim of providing insights and uncovering hidden themes in those documents. As a result, 42 .txt files were grouped into 5 topics, and the overall sentiment of each file was classified. The process included:
- Data understanding and preparation, including removing punctuation marks, converting all letters to lowercase, stemming, etc.
- Exploratory data analysis, including word frequency, TF-IDF, word clouds, and bigrams
- Clustering using k-means, hierarchical clustering, and network graphs
- Latent semantic analysis (e.g. semantic similarity) and sentiment analysis
- Topic modelling utilising the Latent Dirichlet Allocation (LDA) algorithm
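
The preparation and exploratory steps listed above can be illustrated with a minimal sketch in plain Python. It is not the project's working code: the function names (`preprocess`, `bigrams`, `tf_idf`, `cosine_similarity`) and the three-sentence toy corpus are hypothetical, stemming is left out, and the TF-IDF formula is the unsmoothed textbook form, so values will differ from those of libraries that use smoothed variants.

```python
import math
import string
from collections import Counter


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document.

    TF is the relative term frequency within the document and
    IDF is log(N / df), where df is the number of documents
    containing the term (unsmoothed, an assumption of this sketch).
    """
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus standing in for the 42 .txt files.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log!",
    "Stock markets fell sharply today.",
]
tokens = [preprocess(d) for d in docs]
word_freq = Counter(tokens[0])   # word frequency for the first document
pairs = bigrams(tokens[0])       # bigrams for the first document
weights = tf_idf(tokens)         # TF-IDF vectors for all documents
sim_01 = cosine_similarity(weights[0], weights[1])
sim_02 = cosine_similarity(weights[0], weights[2])
```

In this toy corpus the first two sentences share vocabulary and the third does not, so `sim_01` is greater than `sim_02`. Similarity scores of this kind are the sort of input a clustering step such as k-means or hierarchical clustering could work from.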
The project documents the approach used, the assumptions made, and the supporting rationale for each stage of the CRISP-DM framework. It presents results and recommendations, including supporting visualisations and summary data, and evaluates the results of the different techniques, giving reasons for the final approach.
Also included:
- An appendix with working code
- A blog post reflecting on the use of text analysis techniques in the workplace
Edit on May 39, 2020