
how to speed up the computation? #92

Open

randomgambit opened this issue Sep 22, 2018 · 4 comments

@randomgambit

I am using urltools with sparklyr and the computation is unfortunately pretty slow. I wonder whether the ~3x slower computation with suffix_extract is expected? Can I somehow improve its speed?

Thanks!


> data <- c("https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article")

> microbenchmark(url_parse(data)$domain)
Unit: microseconds
                   expr    min      lq     mean   median       uq     max neval
 url_parse(data)$domain 433.84 448.841 493.6815 481.0135 505.2915 947.815   100

> microbenchmark(suffix_extract(domain(data))$domain)
Unit: microseconds
                                expr     min       lq     mean   median       uq      max neval
 suffix_extract(domain(data))$domain 928.077 1029.135 1480.917 1116.575 1340.205 8784.951   100
@hrbrmstr
Collaborator

The benchmark has some conflation in it, as it includes the computation time for domain(data), which I'm not sure is completely legit since url_parse(data)$domain and suffix_extract(domain(data))$domain aren't really equivalent.

Virtually nothing (useful) is faster than this algorithm. I've benchmarked many of the libraries listed at https://publicsuffix.org/learn/ and this still outpaces them. Even https://github.com/hrbrmstr/psl is only faster in edge cases (but I needed full reproducibility with libpsl for a project). Plus, the Guava implementation is pretty janky.

A possible specific-to-this-use-case speedup would be to take finalise_suffixes() (https://github.com/Ironholds/urltools/blob/master/src/suffix.cpp#L26-L91) and make a version that only does the domain component.

Also, sparklyr is really not a factor here, either. It's just R with some Java wrappers, and this example isn't marshalling data back and forth between R and Java.

Also, it is unwise to compare the public-suffix-list functions in psl and urltools using the data in this example. It'll look like psl is faster, but it really isn't:

library(psl)
library(urltools)
library(microbenchmark)

c(
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article"
) -> xdat

xdoms <- urltools::domain(xdat)

xdoms
## [1] "en.wikipedia.org" "en.wikipedia.org" "en.wikipedia.org"
## [4] "en.wikipedia.org"

microbenchmark(
  urltools = urltools::suffix_extract(xdoms)$domain,
  psl = psl::domain_only(xdoms)
)
## Unit: microseconds
##      expr     min        lq       mean    median       uq      max neval
##  urltools 865.103 1016.3295 1213.65774 1097.6180 1323.802 2442.916   100
##       psl  13.397   16.4105   30.73293   23.5315   32.256  287.164   100

##
## REAL WORLD TEST
## 

# http://malware-domains.com/files/domains.zip
lots <- suppressWarnings(readr::read_tsv("~/Data/domains.txt", col_names=FALSE, skip=4, col_types="ccccciii")$X3)

ldoms <- urltools::domain(lots)

length(ldoms)
## [1] 26951

head(ldoms)
## [1] "amazon.co.uk.security-check.ga"                   
## [2] "autosegurancabrasil.com"                          
## [3] "dadossolicitado-antendimento.sad879.mobi"         
## [4] "hitnrun.com.my"                                   
## [5] "maruthorvattomsrianjaneyatemple.org"              
## [6] "paypalsecure-2016.sucurecode524154241.arita.ac.tz"

microbenchmark(
  urltools = urltools::suffix_extract(ldoms)$domain,
  psl = psl::domain_only(ldoms)
)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  urltools 52.39128 57.28778 59.78562 58.74512 60.31533 83.47342   100
##       psl 54.18679 61.26153 62.52437 62.32959 63.49736 70.26835   100

And, extrapolating, ~22 seconds for 10 million domains requiring a lot of string ops seems … not bad?
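
A rough sketch of where that extrapolation comes from, scaling the median urltools timing in the real-world benchmark above (~58.7 ms for 26,951 domains) up to 10 million domains; this is back-of-the-envelope, not a measured result:

# median of ~58.7 ms for 26,951 domains, scaled to 10 million domains
(10e6 / 26951) * 0.0587
## ≈ 21.8 seconds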

@randomgambit
Author

pretty interesting. thanks harbourmaster

@randomgambit
Author

randomgambit commented Sep 22, 2018

@hrbrmstr to be more precise, my point about spark is that when you apply a native R package to every row of a Spark dataframe which has tens of billions of rows, you start worrying about every bit of processing time. In my case, the easiest solution was to use urltools::domain, aggregate by day, and then run suffix_extract. Something along these lines (a hypothetical sketch, deduplicating hosts rather than aggregating by day, but the same idea of running suffix_extract on far fewer rows) is shown below.
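
library(urltools)

urls <- c("https://en.wikipedia.org/wiki/Article",
          "https://fr.wikipedia.org/wiki/Article")   # placeholder data

doms <- urltools::domain(urls)          # cheap host extraction on every row
uniq <- unique(doms)                    # collapse to unique hosts
sfx  <- urltools::suffix_extract(uniq)  # expensive call runs once per unique host
out  <- sfx[match(doms, uniq), ]        # map results back onto the original rows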

@hrbrmstr
Collaborator

I grok the ease of sparklyr, but it comes at a severe cost: marshalling data between the JVM and R. We abandoned both Python and R in favor of the native public-suffix Java libraries at $DAYJOB for exactly the scenario you've identified. The marshalling is not insignificant; it matters little in a small context, but at scale it's quite costly.
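
To make the marshalling concrete, here's a minimal sketch (assuming a local Spark connection and a hypothetical urls table; xdat is the vector from the benchmark above) of the kind of spark_apply() call where each partition is serialised out of the JVM into an R worker and back again; it's that round trip that gets expensive at scale:

library(sparklyr)

sc <- spark_connect(master = "local")
urls_tbl <- copy_to(sc, data.frame(url = xdat), "urls")

# Each partition leaves the JVM, is processed by an R worker running urltools,
# and the result is serialised back into the JVM.
suffixes <- spark_apply(urls_tbl, function(df) {
  data.frame(domain = urltools::suffix_extract(urltools::domain(df$url))$domain)
})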
