
how to speed up the computation? #92

Open

randomgambit opened this issue Sep 22, 2018 · 4 comments

@randomgambit

I am using urltools with sparklyr and the computation is unfortunately pretty slow. I wonder whether the ~3x slower computation with suffix_extract is expected? Can I somehow improve its speed?

Thanks!


> data <- c("https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article",
+           "https://en.wikipedia.org/wiki/Article")

> microbenchmark(url_parse(data)$domain)
Unit: microseconds
                   expr    min      lq     mean   median       uq     max neval
 url_parse(data)$domain 433.84 448.841 493.6815 481.0135 505.2915 947.815   100

> microbenchmark(suffix_extract(domain(data))$domain)
Unit: microseconds
                                expr     min       lq     mean   median       uq      max neval
 suffix_extract(domain(data))$domain 928.077 1029.135 1480.917 1116.575 1340.205 8784.951   100
@hrbrmstr
Collaborator

The benchmark has some conflation in it, as it includes the computation time for domain(data), which I'm not sure is completely legit since url_parse(data)$domain and suffix_extract(domain(data))$domain aren't really equivalent.

Virtually nothing (useful) is faster than this algorithm. I've benchmarked many of the libraries listed at https://publicsuffix.org/learn/ and this still outpaces them. Even https://github.com/hrbrmstr/psl is only faster in edge cases (but I needed full reproducibility with libpsl for a project). Plus, the Guava implementation is pretty janky.

A possible specific-to-this-use-case speedup would be to take finalise_suffixes() (https://github.com/Ironholds/urltools/blob/master/src/suffix.cpp#L26-L91) and make a version that only does the domain component.

Also, sparklyr is really not a factor here, either. It's just R with some Java wrappers, and this example isn't marshalling data back and forth between R and Java.

Also, it is unwise to compare the public-suffix-list functions in psl and urltools using the data in this example. It'll look like psl is faster, but it really isn't:

library(psl)
library(urltools)
library(microbenchmark)

c(
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article",
  "https://en.wikipedia.org/wiki/Article"
) -> xdat

xdoms <- urltools::domain(xdat)

xdoms
## [1] "en.wikipedia.org" "en.wikipedia.org" "en.wikipedia.org"
## [4] "en.wikipedia.org"

microbenchmark(
  urltools = urltools::suffix_extract(xdoms)$domain,
  psl = psl::domain_only(xdoms)
)
## Unit: microseconds
##      expr     min        lq       mean    median       uq      max neval
##  urltools 865.103 1016.3295 1213.65774 1097.6180 1323.802 2442.916   100
##       psl  13.397   16.4105   30.73293   23.5315   32.256  287.164   100

##
## REAL WORLD TEST
## 

# http://malware-domains.com/files/domains.zip
lots <- suppressWarnings(readr::read_tsv("~/Data/domains.txt", col_names=FALSE, skip=4, col_types="ccccciii")$X3)

ldoms <- urltools::domain(lots)

length(ldoms)
## [1] 26951

head(ldoms)
## [1] "amazon.co.uk.security-check.ga"                   
## [2] "autosegurancabrasil.com"                          
## [3] "dadossolicitado-antendimento.sad879.mobi"         
## [4] "hitnrun.com.my"                                   
## [5] "maruthorvattomsrianjaneyatemple.org"              
## [6] "paypalsecure-2016.sucurecode524154241.arita.ac.tz"

microbenchmark(
  urltools = urltools::suffix_extract(ldoms)$domain,
  psl = psl::domain_only(ldoms)
)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  urltools 52.39128 57.28778 59.78562 58.74512 60.31533 83.47342   100
##       psl 54.18679 61.26153 62.52437 62.32959 63.49736 70.26835   100

And, extrapolating, ~22 seconds for 10 million domains requiring a lot of string ops seems … not bad?
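
A rough sketch of where that extrapolation comes from, scaling the median urltools timing in the real-world benchmark above (~58.7 ms for 26,951 domains) up to 10 million domains; this is back-of-the-envelope, not a measured result:

# median of ~58.7 ms for 26,951 domains, scaled to 10 million domains
(10e6 / 26951) * 0.0587
## ≈ 21.8 seconds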

@randomgambit
Author

pretty interesting. thanks harbourmaster

@randomgambit
Author

randomgambit commented Sep 22, 2018

@hrbrmstr to be more precise, my point about spark is that when you apply a native R package to every row of a Spark dataframe which has tens of billions of rows, you start worrying about every bit of processing time. In my case, the easiest solution was to use urltools::domain, aggregate by day, and then run suffix_extract. Something along these lines (a hypothetical sketch, deduplicating hosts rather than aggregating by day, but the same idea of running suffix_extract on far fewer rows) is shown below.
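
library(urltools)

urls <- c("https://en.wikipedia.org/wiki/Article",
          "https://fr.wikipedia.org/wiki/Article")   # placeholder data

doms <- urltools::domain(urls)          # cheap host extraction on every row
uniq <- unique(doms)                    # collapse to unique hosts
sfx  <- urltools::suffix_extract(uniq)  # expensive call runs once per unique host
out  <- sfx[match(doms, uniq), ]        # map results back onto the original rows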

@hrbrmstr
Collaborator

I grok the ease of sparklyr, but it comes at a severe cost: marshalling data between the JVM and R. We abandoned both Python and R in favor of the native public-suffix Java libraries at $DAYJOB for exactly the scenario you've identified. The marshalling is not insignificant; it matters little in a small context, but at scale it's quite costly.
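
To make the marshalling concrete, here's a minimal sketch (assuming a local Spark connection and a hypothetical urls table; xdat is the vector from the benchmark above) of the kind of spark_apply() call where each partition is serialised out of the JVM into an R worker and back again; it's that round trip that gets expensive at scale:

library(sparklyr)

sc <- spark_connect(master = "local")
urls_tbl <- copy_to(sc, data.frame(url = xdat), "urls")

# Each partition leaves the JVM, is processed by an R worker running urltools,
# and the result is serialised back into the JVM.
suffixes <- spark_apply(urls_tbl, function(df) {
  data.frame(domain = urltools::suffix_extract(urltools::domain(df$url))$domain)
})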
