"Memoization" on elements in a vector input. #149
Comments
I've also encountered issues with slow I think your proposed function is more specific and narrow than makes sense for the
I guess what I was thinking is it's just a form of memoization on the elements of the input themselves. So Perhaps both the original x and the unique x can be stored to save the
@wch To follow up, the
To anyone who comes across this and is looking for a solution: I've written a separate package, used in the example below.
# if(!requireNamespace("deduped")) install.packages("deduped")
library(deduped)
N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |>
rep(N_TOTAL/10) |>
sample()
bench::mark(
direct = repeated_paths |> fs::path_dir(),
indirect = repeated_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0
all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")
bench::mark(
direct = all_unique_paths |> fs::path_dir(),
indirect = all_unique_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02
Created on 2023-10-26 with reprex v2.0.2
I'm not sure if memoise is the appropriate place for this, but when I suggested it in purrr, Hadley suggested memoization would be a better approach for the issue. However, memoization currently acts on the entire input to a function, without accounting for repeats in the input.
This issue came about when I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see r-lib/fs#424). Since I mostly use these functions after readr::read_csv(files, id = "file_path"), most of the vector is duplicated. I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This approach is not just helpful for fs::path_ functions.
The most straightforward approach is:
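A minimal sketch of that deduplicate-then-expand idea, assuming f is elementwise (each output depends only on the corresponding input). dedupe_apply is a hypothetical name, and base R unique() and match() are stand-ins chosen for this sketch:

```r
# Hypothetical helper (not part of memoise, fs, or vctrs):
# apply an elementwise function `f` once per distinct value of `x`.
dedupe_apply <- function(x, f, ...) {
  ux <- unique(x)             # distinct inputs
  f(ux, ...)[match(x, ux)]    # compute once per distinct input, then expand to the original order
}

# Six paths, but only two distinct values, so dirname() sees two elements.
paths <- rep(c("base/dir1/inner", "base/dir2/inner"), times = 3)
result <- dedupe_apply(paths, dirname)
```

The result should be identical to dirname(paths). The saving only appears when the input has many repeats, and unique() plus match() add some overhead when it does not.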
I've also submitted a PR into vctrs to speed this up (see r-lib/vctrs#1857 and r-lib/vctrs#1858).
Traditional memoization is typically applied blindly to the whole input, but most programming languages aren't inherently vectorized the way R is. I therefore think it would make sense for memoise to add this as an extra feature, so that results are cached for each unique element of the input. Alternatively, it could be a new function, say memoise_unique, since calculating unique() on every call takes some extra time.
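A rough sketch of what such a memoise_unique could look like, under stated assumptions: f is elementwise, its results are atomic, and each distinct element has a distinct, non-empty string form usable as a cache key. This is an illustration in base R, not an existing memoise function or its API:

```r
# Hypothetical sketch; `memoise_unique` is only the name proposed in this issue.
memoise_unique <- function(f) {
  cache <- new.env(parent = emptyenv())   # one entry per distinct element seen so far
  function(x) {
    ux <- unique(x)
    keys <- as.character(ux)              # assumption: distinct, non-empty string keys
    is_new <- !vapply(keys, exists, logical(1),
                      envir = cache, inherits = FALSE, USE.NAMES = FALSE)
    if (any(is_new)) {
      vals <- f(ux[is_new])               # f only sees elements that are not cached yet
      for (i in seq_along(vals)) {
        assign(keys[is_new][i], vals[[i]], envir = cache)
      }
    }
    # Look up every distinct element, then expand back to the original order.
    unlist(mget(keys, envir = cache), use.names = FALSE)[match(x, ux)]
  }
}

# Count how many elements the wrapped function actually processes.
calls <- 0L
slow_dir <- function(p) {
  calls <<- calls + length(p)
  dirname(p)
}
fast_dir <- memoise_unique(slow_dir)

paths <- rep(c("base/dir1/inner", "base/dir2/inner"), times = 3)
res1 <- fast_dir(paths)                        # processes 2 distinct elements
res2 <- fast_dir(c(paths, "base/dir3/inner"))  # processes only the 1 new element
```

Unlike whole-input memoization, the second call here reuses the cached elements even though the input vector as a whole has changed.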