how to speed up the computation? #92
Comments
The benchmark has some conflation in it as it's including the computation time for Virtually nothing (useful) is faster than this algorithm. I've benched many of the libs on https://publicsuffix.org/learn/ and this still outpaces them. Even https://github.com/hrbrmstr/psl is only faster in edge cases (but I needed full reproducibility with A possible specific-to-this-use-case speedup would be to take Also, Also, it is unwise to do a comparison between public-suffix-list function in
library(psl)
library(urltools)
library(microbenchmark)
c(
"https://en.wikipedia.org/wiki/Article",
"https://en.wikipedia.org/wiki/Article",
"https://en.wikipedia.org/wiki/Article",
"https://en.wikipedia.org/wiki/Article"
) -> xdat
xdoms <- urltools::domain(xdat)
xdoms
## [1] "en.wikipedia.org" "en.wikipedia.org" "en.wikipedia.org"
## [4] "en.wikipedia.org"
microbenchmark(
urltools = urltools::suffix_extract(xdoms)$domain,
psl = psl::domain_only(xdoms)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 865.103 1016.3295 1213.65774 1097.6180 1323.802 2442.916 100
## psl 13.397 16.4105 30.73293 23.5315 32.256 287.164 100
##
## REAL WORLD TEST
##
# http://malware-domains.com/files/domains.zip
lots <- suppressWarnings(readr::read_tsv("~/Data/domains.txt", col_names=FALSE, skip=4, col_types="ccccciii")$X3)
ldoms <- urltools::domain(lots)
length(ldoms)
## [1] 26951
head(ldoms)
## [1] "amazon.co.uk.security-check.ga"
## [2] "autosegurancabrasil.com"
## [3] "dadossolicitado-antendimento.sad879.mobi"
## [4] "hitnrun.com.my"
## [5] "maruthorvattomsrianjaneyatemple.org"
## [6] "paypalsecure-2016.sucurecode524154241.arita.ac.tz"
microbenchmark(
urltools = urltools::suffix_extract(ldoms)$domain,
psl = psl::domain_only(ldoms)
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## urltools 52.39128 57.28778 59.78562 58.74512 60.31533 83.47342 100
## psl 54.18679 61.26153 62.52437 62.32959 63.49736 70.26835 100
And, (extrapolating) 22 seconds for 10 million domains requiring a lot of string ops seems … not bad?
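The extrapolation can be reproduced from the benchmark figures above. This is only a rough sketch: it assumes run time grows linearly with the number of domains, which the thread does not verify, and the variable names are illustrative.

```r
# Figures copied from the real-world benchmark above
n_domains <- 26951      # length(ldoms)
mean_ms   <- 59.78562   # mean urltools time for one pass, in milliseconds
target_n  <- 1e7        # 10 million domains

# Assumption: cost per domain stays constant as the input grows
per_domain_us <- mean_ms * 1000 / n_domains          # microseconds per domain
est_seconds   <- (mean_ms / 1000) * (target_n / n_domains)

round(per_domain_us, 2)
round(est_seconds, 1)
```

That works out to roughly 2 microseconds per domain, which is where the figure of about 22 seconds comes from.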
pretty interesting. thanks harbourmaster
@hrbrmstr to be more precise, my point about
I grok the ease of
I am using urltools with sparklyr and the computation is unfortunately pretty slow. I wonder if the 3x slower computation with suffix_extract is expected? Can I somehow improve its speed? Thanks!
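One general-purpose trick that may help when the input contains many repeated hosts (as in the four identical Wikipedia URLs in the small benchmark) is to run the expensive function once per distinct value and map the results back. This is a minimal sketch and not necessarily what was suggested in this thread; `apply_unique` is a hypothetical helper, and `toupper` is only a stand-in so the example runs without urltools installed.

```r
# Hypothetical helper: call a vectorised function once per distinct input,
# then map the results back onto the original positions.
apply_unique <- function(x, f) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

# Stand-in demonstration with a base R function
hosts <- c("en.wikipedia.org", "example.com", "en.wikipedia.org")
res <- apply_unique(hosts, toupper)

# With urltools installed, the same pattern would look like:
# domains <- apply_unique(xdoms, function(d) urltools::suffix_extract(d)$domain)
```

Whether this pays off depends on how many duplicates the data has; with mostly distinct domains the extra `unique()` and `match()` calls only add overhead.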