Using jiebaR package (SimHash algorithm) #66
@remibacha
library(jiebaR)
#> Loading required package: jiebaRD
simhasher_5 = worker("simhash", topn = 5)
keyword_1 <- c("Simhash", "duplicates")
keyword_2 <- c("Simhash", "quickly")
simhash_1 <- vector_simhash(keyword_1, simhasher_5)
simhash_1
#> $simhash
#> [1] "144150442997195320"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "duplicates"
simhash_2 <- vector_simhash(keyword_2, simhasher_5)
simhash_2
#> $simhash
#> [1] "1730138795753340968"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "quickly"
tobin(simhash_1$simhash)
#> [1] "0000001000000000001000000001000001101101000100000010001000111000"
tobin(simhash_2$simhash)
#> [1] "0001100000000010101100000001000101101101000000000000000000101000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 11
vector_distance(keyword_1, keyword_2, simhasher_5)
#> $distance
#> [1] 11
#>
#> $lhs
#> 11.7392 11.7392
#> "Simhash" "duplicates"
#>
#> $rhs
#> 11.7392 11.7392
#> "Simhash" "quickly"
# only one keyword "Simhash"
simhasher_1 <- worker("simhash", topn = 1)
simhash_1 <- vector_simhash(keyword_1, simhasher_1)
simhash_1
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
simhash_2 <- vector_simhash(keyword_2, simhasher_1)
simhash_2
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
tobin(simhash_1$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
tobin(simhash_2$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 0
vector_distance(keyword_1, keyword_2, simhasher_1)
#> $distance
#> [1] 0
#>
#> $lhs
#> 11.7392
#> "Simhash"
#>
#> $rhs
#> 11.7392
#> "Simhash"
Created on 2018-10-23 by the reprex package (v0.2.0).
hamming_distance: https://en.wikipedia.org/wiki/Hamming_distance
You can modify the user dict in jiebaRD,
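The distance of 11 reported above can be checked by hand from the two `tobin()` strings of the `topn = 5` run. The following is a minimal base-R sketch; `hamming_bits` is a hypothetical helper written for this illustration, not a jiebaR function.

```r
# Binary fingerprints copied from the tobin() output of the topn = 5 run
bin_1 <- "0000001000000000001000000001000001101101000100000010001000111000"
bin_2 <- "0001100000000010101100000001000101101101000000000000000000101000"

# Hypothetical helper: count the positions where two equal-length
# bit strings differ (this count is the Hamming distance)
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming_bits(bin_1, bin_2)
#> [1] 11
```

The result agrees with what `simhash_dist()` and `vector_distance()` return for the same pair.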
Thanks for this example, really helpful! But I still don't get what the figures above the words in lhs and rhs are (e.g. 11.7392). Can you please explain them?
@remibacha jiebaR is designed for Chinese text segmentation, and its default idf dict contains only Chinese words. Maybe the default idf weight for English words is
IDFPATH
#> [1] "E:/R/R-3.5-library/jiebaRD/dict/idf.utf8"
keys = worker("keywords", topn = 2)
keys <= "Simhash is quick, Simhash ia fast"
#> 23.4784 11.7392
#> "Simhash" "fast"
If you want a more accurate tf-idf weight, you need to train on a corpus yourself. Suppose you have a large English corpus: you can use it to train the idf, then, use
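The weights printed above the keywords look consistent with weight = term frequency × idf: "Simhash" occurs twice in the sentence and gets 23.4784, which is 2 × 11.7392, while "fast" occurs once and gets 11.7392. A minimal sketch of that arithmetic follows, assuming 11.7392 is the fallback idf for words missing from the dict (read off the output above) and using a hypothetical tokenisation; it does not reproduce jiebaR's segmenter or stop-word handling.

```r
# Assumption: fallback idf for words not found in the idf dict,
# taken from the keyword output above
fallback_idf <- 11.7392

# Hypothetical tokenisation of the example sentence (stop words dropped by hand)
tokens <- c("Simhash", "quick", "Simhash", "fast")

# Term frequency per token, then tf * idf as the keyword weight
tf <- table(tokens)
weights <- setNames(as.numeric(tf) * fallback_idf, names(tf))

sort(weights, decreasing = TRUE)
```

Under these assumptions "Simhash" ends up with twice the weight of every other token, which matches the pattern in the `worker("keywords", topn = 2)` output.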
I think the main trick is to hash each keyword, together with its weight, into the simhash code. Calculating the Hamming distance between two such codes is very fast, which makes it useful for de-duplicating docs. For more, you can read https://github.com/yanyiwu/simhash/blob/master/README_EN.md (the author's cppjieba is the source of jiebaR). Some introductions: https://github.com/seomoz/simhash-cpp/#architecture and https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
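To make the "hash the keyword and weight" idea concrete, here is a rough base-R sketch of the general simhash scheme: each keyword is hashed to a bit vector, every bit position accumulates +weight for a 1 and -weight for a 0, and the sign of each sum gives one bit of the fingerprint. Everything here is an assumption for illustration: `toy_hash_bits`, `toy_simhash` and `toy_distance` are hypothetical names, the 16-bit hash is a toy, and none of it is the hash or code path that jiebaR or cppjieba actually uses, so the fingerprints will not match the ones shown in this thread.

```r
# Toy string hash (illustration only, NOT jiebaR's hash):
# returns the nbits low-order bits of a simple polynomial hash as a 0/1 vector
toy_hash_bits <- function(word, nbits = 16L) {
  h <- 0
  for (code in utf8ToInt(word)) {
    h <- (h * 31 + code) %% 2^nbits
  }
  (h %/% 2^(0:(nbits - 1L))) %% 2
}

# Weighted bit voting: +weight where the keyword's bit is 1, -weight where it is 0;
# the fingerprint bit is 1 when the accumulated sum is positive
toy_simhash <- function(keywords, weights, nbits = 16L) {
  acc <- numeric(nbits)
  for (i in seq_along(keywords)) {
    bits <- toy_hash_bits(keywords[i], nbits)
    acc <- acc + ifelse(bits == 1, weights[i], -weights[i])
  }
  as.integer(acc > 0)
}

# Hamming distance between two fingerprints
toy_distance <- function(a, b) sum(a != b)

fp1 <- toy_simhash(c("Simhash", "duplicates"), c(11.7392, 11.7392))
fp2 <- toy_simhash(c("Simhash", "quickly"), c(11.7392, 11.7392))
toy_distance(fp1, fp2)

# With a single shared keyword, both fingerprints are identical
one_a <- toy_simhash("Simhash", 11.7392)
one_b <- toy_simhash("Simhash", 11.7392)
toy_distance(one_a, one_b)
#> [1] 0
```

The last call mirrors the `topn = 1` case in the reprex: when both inputs reduce to the same keyword with the same weight, the fingerprints coincide and the distance is 0. With equal weights for every keyword, as happens when all words fall back to the same idf, each keyword gets an equal vote, so a few differing words can flip many bits.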
Hello
Here are 2 texts I would like to check for near-duplicates using the SimHash algorithm (jiebaR package):
I have created a worker called "simhasher":
Then I have computed the distance:
Here is the result:
I need your help on 3 things:
1. The distance is 22. The bigger the distance, the more the 2 texts differ. Here the texts seem REALLY close, so I was expecting the distance to be smaller. Can you please explain this result?
2. What are the figures above the words in lhs and rhs? (e.g. 11.7392, 23.4784)
3. I also checked the worker I have created:
simhasher <= codel
And here is the result I discovered:
What is the simhash here, and why do I need to create it before running the distance function? This part is not really clear to me and not really explained in the package documentation.
Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.