A comparison of document similarity using five popular algorithms: Jaccard, TF-IDF, Doc2vec, USE (Universal Sentence Encoder), and BERT.
The experiment uses 33,914 New York Times articles and aims to show which algorithm yields the best results out of the box in 2020.
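To make the simplest of these algorithms concrete, here is a minimal sketch of Jaccard similarity over word sets. It is an illustration only and not necessarily how this repository implements it; the `tokenize` helper, the regex-based tokenization, and the sample sentences are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(doc_a, doc_b):
    """Return |A & B| / |A | B| over the two documents' token sets."""
    a, b = tokenize(doc_a), tokenize(doc_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


# Made-up sample sentences, not taken from the data set.
base = "Nissan and Renault try a new way"
other = "Renault chooses a new chief executive"
score = jaccard_similarity(base, other)
print(round(score, 2))
```

The two sentences share 3 of their 10 distinct tokens, so the score is 0.3. A real pipeline would likely also remove stop words and normalize word forms before comparing.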
- By running multiple algorithms on the same data set, some of them among the most recent and popular in the NLP community, it shows which algorithms give the best results and by how much.
- By using full-length articles from a popular media publication as input, it simulates a real-world similarity/recommendation use case.
- By following the URLs, you can see and compare the quality of the document similarity results yourself.
- By using only publicly available pretrained models, you can set it up and start implementing document similarity on your own with very little prior knowledge, and expect similar results.
33,914 New York Times articles published from 2018 to June 2020 were selected. Most of the data was collected from the newspaper's RSS feed.
Five articles were chosen as the base articles for comparison. They represent different categories:
- How My Worst Date Ever Became My Best (lifestyle)
- A Deep-Sea Magma Monster Gets a Body Scan (science)
- Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled (business)
- Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal (sports)
- 2020 Democrats Seek Voters in an Unusual Spot: Fox News (politics)
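Each base article is compared against the rest of the corpus, and the closest matches are kept. The ranking step can be sketched with TF-IDF and cosine similarity in plain Python. This is a minimal sketch, not the repository's code: the function names, the smoothed IDF formula, the choice of `k`, and the four toy documents are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its list of alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(base_index, docs, k=3):
    """Return the k (index, score) pairs most similar to docs[base_index]."""
    vectors = tfidf_vectors(docs)
    base = vectors[base_index]
    scored = [
        (cosine(base, vec), i)
        for i, vec in enumerate(vectors)
        if i != base_index
    ]
    scored.sort(reverse=True)
    return [(i, s) for s, i in scored[:k]]


# Toy corpus: made-up one-line stand-ins for full articles.
docs = [
    "Thiem beats Nadal in the Australian Open quarterfinal",
    "Djokovic wins the Australian Open final",
    "Volcano eruption mapped in three dimensions",
    "Nadal wins a title before the French Open",
]
ranked = top_k_similar(0, docs, k=2)
print(ranked)
```

On this toy corpus the two tennis sentences rank above the volcano sentence. With tens of thousands of full-length articles, a vectorized library implementation would be the practical choice.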
Results are evaluated on overlapping tags, sections, and subsections, on writing format, and on subjective judgement. For a more detailed description, please follow this blog post.
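The objective parts of these criteria can be computed from article metadata. The sketch below assumes that tag overlap is the number of shared tags and that section and subsection overlap are exact matches; the field names (`tags`, `section`, `subsection`) and the sample values are hypothetical and are not taken from the data set.

```python
def overlap_report(base, candidate):
    """Compare two article metadata dicts on tags, section, and subsection."""
    shared_tags = set(base["tags"]) & set(candidate["tags"])
    return {
        "tag_overlap": len(shared_tags),
        "section_overlap": "Y" if base["section"] == candidate["section"] else "N",
        "subsection_overlap": (
            "Y" if base["subsection"] == candidate["subsection"] else "N"
        ),
    }


# Hypothetical metadata, invented for illustration only.
base_article = {
    "tags": ["Tennis", "Australian Open", "Nadal"],
    "section": "Sports",
    "subsection": "Tennis",
}
candidate_article = {
    "tags": ["Tennis", "Australian Open", "Djokovic"],
    "section": "Sports",
    "subsection": "Golf",
}
report = overlap_report(base_article, candidate_article)
print(report)
```

Writing format and the subjective rating still need a human reader, so a helper like this would only fill in part of each results table.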
Overall winner: TF-IDF. It produced the best matches in 2.5 of the 5 comparisons.
A detailed breakdown of each algorithm's predictions can be found in the algorithm folders.
Winner for "How My Worst Date Ever Became My Best" (lifestyle): BERT
Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
---|---|---|---|---|---|---|
Why Are All the Exes Texting? | 1 | Y | Y | Y | Dating | Related |
When Love Seems Too Easy to Trust | 1 | Y | Y | Y | Dating | Related |
He Saved His Last Lesson for Me | 1 | Y | Y | Y | Dating | Related |
Winner for "A Deep-Sea Magma Monster Gets a Body Scan" (science): TF-IDF
Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
---|---|---|---|---|---|---|
A 3D Encounter With a Violent Volcano’s Underbelly | 4 | Y | N | Y | 3D Mapped Volcano | Highly Related |
Pressure, and Mystery, on the Rise | 1 | Y | Y | Y | Iceland's Volcano | Related |
It’s Not Just Hawaii: The U.S. Has 169 Volcanoes That Could Erupt | 2 | N | N | Y | Volcanoes | Related |
Winner for "Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled" (business): TF-IDF
Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
---|---|---|---|---|---|---|
Nissan CEO Says 'No Merit' in Merger With Renault-Nikkei | 3 | N | N | Y | Nissan and Renault | Related |
Carlos Ghosn and the Roots of Nissan’s Decline | 3 | N | N | N | Carlos Ghosn | Related |
Renault Chooses Volkswagen Executive as New C.E.O. | 5 | Y | Y | Y | Renault CEO | Very Related |
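Winner for "Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal" (sports): Jaccard, TF-IDF, and USE (tie)
Results shown for Jaccard: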
Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
---|---|---|---|---|---|---|
Djokovic vs. Federer, a Rivalry for the Ages, Is One-Sided This Time | 3 | Y | Y | Y | Australian Open | Related |
Novak Djokovic Wins the Australian Open | 4 | Y | Y | Y | Dominic vs Novak in Australian Open | Very Related |
With Rome Title, Nadal Back on Track Entering French Open | 1 | N | N | Y | French Open | Unrelated |
Winner for "2020 Democrats Seek Voters in an Unusual Spot: Fox News" (politics): USE
Title | Tag Overlap | Section Overlap | Subsection Overlap | Style Overlap | Theme | Subjective |
---|---|---|---|---|---|---|
Bernie Sanders Had a Problem With MSNBC. Then Came Super Tuesday. | 7 | N | N | Y | Sanders and MSNBC | Very Related |
Democrats, Don’t Abandon Fox News | 7 | N | N | N | Democrats and Fox News | Very Related |
Candidates Running Against, and With, Cable News | 4 | Y | Y | Y | Fox, MSNBC and politics | Related |