Jeremy-Yan-Liu/Topic-Modeling

Topic-Modeling

Topic Modeling of New York Times Company News

Description

The News folder contains news data for 31 companies, collected by the author. Each file has three columns: date, headline/title, and summary.
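A minimal sketch of loading one such file. The README does not state the delimiter, so this assumes tab-separated columns with no header row. `NewsItem` and `read_news` are illustrative names, not part of the repository, and the sample rows are made up.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class NewsItem(NamedTuple):
    date: str
    headline: str
    summary: str


def read_news(lines: Iterable[str], delimiter: str = "\t") -> List[NewsItem]:
    """Parse (date, headline, summary) rows; skip rows without exactly three columns."""
    items = []
    # QUOTE_NONE keeps quotation marks inside headlines as literal text.
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue
        items.append(NewsItem(*(field.strip() for field in row)))
    return items


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "2010-01-27\tExample headline one\tExample summary one.\n"
    "2011-10-04\tExample headline two\tExample summary two.\n"
)
news = read_news(sample)
print(len(news), news[0].headline)
```

To read a real file, pass an open text-mode file handle in place of the `StringIO` object and adjust `delimiter` if the columns are separated differently.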

The nltk_data folder contains the stopword list and lemmatizer data used to preprocess the text.
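The preprocessing step might look roughly like the sketch below. It is not the notebook's code: the stopword list is a tiny stand-in for NLTK's, the lemmatizer defaults to a no-op, and `preprocess` is a hypothetical helper name.

```python
import re
from typing import Callable, Iterable, List

# Tiny stand-in list; the repository ships NLTK's stopwords in nltk_data.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "its", "is", "for"})


def preprocess(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords, then lemmatize."""
    stop = set(stopwords)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(token) for token in tokens if token not in stop]


print(preprocess("Apple unveils the new iPhone and its apps"))
# ['apple', 'unveils', 'new', 'iphone', 'apps']
```

With NLTK installed and its data available, `nltk.corpus.stopwords.words("english")` and `nltk.stem.WordNetLemmatizer().lemmatize` could be passed as `stopwords` and `lemmatize`.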

topic_model.ipynb takes apple-incorporated.txt as an example and runs four models: HDP, LSI, LDA, and LDA_Mod (LDA with an optimized number of topics). It uses the coherence value to evaluate the results and builds an interactive visualization of the LDA results with pyLDAvis.
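One way the "optimized number of topics" step could work is to train an LDA model for each candidate topic count and keep the count with the highest coherence. The sketch below assumes that approach; the notebook may differ. `lda_coherence` and `best_num_topics` are hypothetical helper names, and the `"c_v"` measure is an assumption, since the README does not name the one used.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def lda_coherence(texts: List[List[str]], num_topics: int) -> float:
    """Train one Gensim LDA model on tokenized texts and return its coherence."""
    # Imported lazily so the selection helper below runs without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)
    # "c_v" is an assumption; other Gensim measures such as "u_mass" also exist.
    scorer = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence="c_v")
    return scorer.get_coherence()


def best_num_topics(
    candidates: Iterable[int], score: Callable[[int], float]
) -> Tuple[int, Dict[int, float]]:
    """Score every candidate topic count and return the best one with all scores."""
    scores = {k: score(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up scores, only to show the selection step without training any model.
fake_scores = {2: 0.31, 3: 0.42, 4: 0.38}
best, scores = best_num_topics(fake_scores, fake_scores.get)
print(best)  # 3
```

With real data, the call would be along the lines of `best_num_topics(range(2, 11), lambda k: lda_coherence(texts, k))`, where `texts` is a list of token lists. Gensim's `HdpModel` and `LsiModel` can be scored with `CoherenceModel` in the same way.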

Sample Output

1. Coherence Value Comparison

Comments

The coherence value is calculated to evaluate the performance of the four models. It can be computed easily with a built-in function in Gensim; the higher the coherence value, the more human-interpretable the generated topics are. The results for different companies' data were similar, so only one result is shown above. The y-axis represents the coherence value of each model, and the x-axis lists the models. Typically, the HDP model performs more than two times better than the other three models, and LSI is slightly better than LDA or LDA_Mod. LDA and LDA_Mod receive similar coherence values because the optimal number of topics was actually known in advance.
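To give a feel for what a coherence score measures, here is a small pure-Python sketch of the UMass coherence of a single topic. The README does not say which measure the notebook uses; UMass is chosen here only because it is easy to compute by hand. The function name and the toy documents are made up for illustration.

```python
import math
from typing import List, Sequence, Set


def umass_coherence(top_words: Sequence[str], documents: List[Set[str]]) -> float:
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words (l < m).

    D(...) counts the documents that contain all of the given words.
    """
    def doc_freq(*words: str) -> int:
        return sum(1 for doc in documents if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur, to avoid dividing by zero
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Toy documents, represented as sets of tokens.
docs = [
    {"apple", "iphone", "launch"},
    {"apple", "iphone", "sales"},
    {"google", "search", "ads"},
]
coherent = umass_coherence(["apple", "iphone"], docs)
mixed = umass_coherence(["apple", "search"], docs)
print(coherent > mixed)  # True
```

Words that tend to appear in the same documents ("apple", "iphone") score higher than words that never co-occur ("apple", "search"), which is the intuition behind reading a higher coherence value as a more interpretable topic.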

2. Visualization of LDA result

Comments

Apple Inc. is perhaps one of the most prominent technology companies. As the graph on the left shows, the three circles represent the three most relevant topics, and they overlap substantially, which suggests a clear focus in Apple’s news. The term-topic bar chart on the right captures Apple’s main products: “computer”, “iphone”, “mac”, and “ipod”. Its core competency lies in “tech”, “patent”, “software”, and “apps”; its direct competitors include “google” and “microsoft”.

Reference

Thanks to the Topic Analysis example of Gensim and SusanLi's resourceful repository Machine-Learning-with-Python.
