Topic modeling views documents as a collection of various topics. It is not only used for exploring hidden semantic structures in new textual data, but for also complementing supervised machine learning in constructing and checking the accuracy of existing classification models.
In this repository, I chose to utilise Latent Dirichlet Allocation(LDA) for topic modeling. I trained LDA on a subset of the English wikipedia corpus, and found the top 20 topics of the texts found below. Furthurmore, I used the trained LDA to predict topic distributions on several seletected texts within the corpus(short_texts.txt), of which the result can be found below. The main programs can be found in folder main
.
Through this process, I have written some other codes, such as comparing two matrixes using cosine similarity, or making and then compressing a xml file.
- Run
python3 xml1.py
, to extract xml file from bz2, and then extract every 300 text longer than 200 words. - Run
python3 prune
, to prune texts, and create a dictionary. - Run
python3 lda.py
, to perform Latent Dirichlet Analysis(LDA). - Run
python3 document_selector.py
to prepare self selected text for LDA. Text must be namedshort_texts.txt
in seperate folderdata
. - Run
python3 lda_comparison.py
to compare the percentage of each topic in selected text from step 4 against the top 20 topics generated in step 3.
- Convert xml.bz2 file(wiki) to text file, and extract all text bodies. The first 10,000 texts are selected and saved(extracted_texts.txt). (See
xml1.py
) - Remove common and unique words, and non-english syllybus. (See
prune.py
) - Create and saves the dictionary(the_dictionary.dict). (See
prune.py
) - Create sparse vectors representing the token frequency in each text (text_vectors.txt). (See
prune.py
) - Train the Latend Dirichlet Analysis model. (See
lda.py
) - Predict topic distribution given a text. (See
lda_comparison.py
)
Topic 1: 0.008*"book" + 0.008*"name" + 0.007*"an" + 0.007*"titl" + 0.007*"cite" + 0.006*"journal" + 0.006*"univers" + 0.006*"publish" + 0.006*"this" + 0.005*"from"
Topic 2: 0.012*"war" + 0.010*"name" + 0.010*"forc" + 0.007*"armi" + 0.007*"from" + 0.007*"air" + 0.007*"p" + 0.007*"militari" + 0.006*"first" + 0.006*"titl"
Topic 3: 0.018*"name" + 0.017*"journal" + 0.015*"titl" + 0.015*"cite" + 0.010*"year" + 0.010*"page" + 0.008*"from" + 0.007*"volum" + 0.007*"publish" + 0.007*"date"
Topic 4: 0.014*"state" + 0.010*"titl" + 0.010*"cite" + 0.008*"unit" + 0.008*"parti" + 0.008*"name" + 0.008*"nation" + 0.007*"date" + 0.007*"govern" + 0.007*"publish"
Topic 5: 0.013*"use" + 0.010*"comput" + 0.009*"system" + 0.009*"web" + 0.008*"code" + 0.008*"titl" + 0.008*"com" + 0.008*"name" + 0.007*"cite" + 0.007*"an"
Topic 6: 0.030*"math" + 0.012*"sub" + 0.008*"an" + 0.008*"sup" + 0.008*"this" + 0.007*"titl" + 0.006*"number" + 0.006*"f" + 0.006*"mathemat" + 0.006*"first"
Topic 7: 0.036*"languag" + 0.012*"use" + 0.009*"from" + 0.008*"word" + 0.007*"name" + 0.006*"titl" + 0.005*"other" + 0.005*"also" + 0.005*"an" + 0.005*"which"
Topic 8: 0.020*"player" + 0.019*"politician" + 0.016*"footbal" + 0.015*"day" + 0.014*"author" + 0.014*"singer" + 0.012*"actor" + 0.011*"french" + 0.010*"born" + 0.009*"songwrit"
Topic 9: 0.029*"music" + 0.009*"use" + 0.009*"instrument" + 0.008*"from" + 0.007*"play" + 0.006*"danc" + 0.006*"an" + 0.006*"compos" + 0.005*"which" + 0.005*"string"
Topic 10: 0.026*"style" + 0.018*"color" + 0.017*"small" + 0.017*"background" + 0.011*"from" + 0.010*"file" + 0.010*"year" + 0.009*"text" + 0.009*"span" + 0.009*"day"
Topic 11: 0.009*"use" + 0.008*"engin" + 0.007*"com" + 0.007*"titl" + 0.007*"cite" + 0.006*"compani" + 0.006*"an" + 0.006*"product" + 0.006*"from" + 0.006*"econom"
Topic 12: 0.013*"web" + 0.013*"titl" + 0.012*"cite" + 0.011*"name" + 0.010*"citi" + 0.009*"publish" + 0.008*"from" + 0.008*"com" + 0.008*"date" + 0.007*"state"
Topic 13: 0.018*"web" + 0.018*"titl" + 0.017*"game" + 0.017*"cite" + 0.013*"com" + 0.012*"date" + 0.011*"publish" + 0.011*"name" + 0.008*"univers" + 0.008*"leagu"
Topic 14: 0.018*"his" + 0.011*"name" + 0.009*"p" + 0.009*"from" + 0.009*"had" + 0.007*"year" + 0.007*"book" + 0.007*"king" + 0.007*"first" + 0.007*"titl"
Topic 15: 0.023*"film" + 0.016*"com" + 0.016*"titl" + 0.014*"cite" + 0.012*"name" + 0.012*"date" + 0.012*"web" + 0.010*"news" + 0.009*"publish" + 0.008*"his"
Topic 16: 0.018*"journal" + 0.014*"titl" + 0.014*"name" + 0.013*"sub" + 0.013*"cite" + 0.010*"page" + 0.009*"volum" + 0.008*"use" + 0.008*"date" + 0.008*"sup"
Topic 17: 0.014*"com" + 0.013*"titl" + 0.011*"album" + 0.011*"cite" + 0.010*"music" + 0.010*"record" + 0.009*"web" + 0.009*"name" + 0.009*"date" + 0.008*"first"
Topic 18: 0.027*"his" + 0.014*"book" + 0.010*"name" + 0.010*"publish" + 0.009*"first" + 0.009*"categori" + 0.009*"titl" + 0.009*"work" + 0.008*"from" + 0.008*"year"
Topic 19: 0.017*"book" + 0.011*"church" + 0.009*"titl" + 0.009*"name" + 0.009*"from" + 0.008*"publish" + 0.008*"first" + 0.008*"god" + 0.007*"cite" + 0.006*"year"
Topic 20: 0.106*"align" + 0.071*"right" + 0.046*"style" + 0.032*"text" + 0.024*"center" + 0.013*"footbal" + 0.011*"left" + 0.010*"width" + 0.010*"world" + 0.009*"name"
The Federal Reserve Act created a system of private and public entities. There were to be at least eight and no more than twelve private regional Federal Reserve banks. Twelve were established, and each had various branches, a board of directors, and district boundaries. The Federal Reserve Board, consisting of seven members, was created as the governing body of the Fed. Each member is appointed by the President of the U.S and confirmed by the U.S. Senate. In 1935, the Board was renamed and restructured. Also created as part of the Federal Reserve System was a 12-member Federal Advisory Committee and a single new United States currency, the Federal Reserve Note. The Federal Reserve Act created a national currency and a monetary system that could respond effectively to the stresses in the banking system and create a stable financial system. With the goal of creating a national monetary system and financial stability, the Federal Reserve Act also provided many other functions and financial services for the economy, such as check clearing and collection for all members of the Federal Reserve.
Most similiar to topic 4:
0.014*"state" + 0.010*"titl" + 0.010*"cite" + 0.008*"unit" + 0.008*"parti" + 0.008*"name" + 0.008*"nation" + 0.007*"date" + 0.007*"govern" + 0.007*"publish"
With 47.625570439519227% confidence.
The violin is a wooden string instrument in the violin family. It is the smallest and highest-pitched instrument in the family in regular use. Smaller violin-type instruments are known, including the violino piccolo and the kit violin, but these are virtually unused in the 2010s. The violin typically has four strings tuned in perfect fifths, and is most commonly played by drawing a bow across its strings, though it can also be played by plucking the strings with the fingers (pizzicato). Violins are important instruments in a wide variety of musical genres. They are most prominent in the Western classical tradition and in many varieties of folk music. They are also frequently used in genres of folk including country music and bluegrass music and in jazz. Electric violins are used in some forms of rock music; further, the violin has come to be played in many non-Western music cultures, including Indian music and Iranian music. The violin is sometimes informally called a fiddle, particularly in Irish traditional music and bluegrass, but this nickname is also used regardless of the type of music played on it.
Most similiar to topic 17:
0.014*"com" + 0.013*"titl" + 0.011*"album" + 0.011*"cite" + 0.010*"music" + 0.010*"record" + 0.009*"web" + 0.009*"name" + 0.009*"date" + 0.008*"first"
with 52.053422831164042% confidence.
convert_for_lda.py
: represents tokens in the necessary formatquick_sort.py
: align ID(most right collum) in increasing orderfurthur convert_for_lda.py
: represents tokens in the necessary formatsparce_vectors.py
: sparse vector representationmatrix_convert.py, matrix_compare.py
: matrix representationmake_xml.py
: make xml file with text bodiescompress.py
: compresses the file