Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the description of compute.dm() accurate ? #2059

Closed
moodymudskipper opened this issue Oct 25, 2023 · 4 comments · Fixed by #2103
Closed

Is the description of compute.dm() accurate ? #2059

moodymudskipper opened this issue Oct 25, 2023 · 4 comments · Fixed by #2103

Comments

@moodymudskipper
Copy link
Collaborator

?materialize says :

compute() materializes all tables in a dm to new (temporary or permanent) tables on the database.

As far as I can tell temporary tables are the unique option.

Should the description be updated ? Or should we implement the actual permanent table creation feature (if so an overwrite flag would be useful) ?

To build a dm with proper constraints our client needs to do some non trivial operations on raw tables, so the lazy tables that make our clean dm imply some joins or even pivots, this makes it impractical to work with lazy data, a solution to materialise those periodically on the db would be comfortable.

@moodymudskipper
Copy link
Collaborator Author

moodymudskipper commented Oct 25, 2023

As @TSchiefer pointed to me, compute.dm maps to dbplyr:::compute.tbl_sql where we have a temporary = FALSE arg so we can already do this to some extent through the dots. However it will create permanent tables called "dbplyr_01" etc which defies the purpose of having something permanent.

I would suggest that when choosing permanent we create a table with the actual name of the table in the dm.

Fancier would be :

  • To define the names with a glue specification, the tidyverse has uses "{.col}" in several places, we could use "{.table}", we can also provide a named vector.
  • To be able to provide a schema (fully flexible would be to be able to provide schemas table by table but maybe overkill ?).

There is a parallel between db table materialisation and memoization, I wonder if we can make materialisation more like memoization in the sense of invalidating/refreshing the "cached" (materialised) data on demand, but maybe that's out of scope and for another feature. Basically a dm (or a table in a dm) might have 3 states : lazy (a dm table maps to a query), remote (a dm table maps to db table), collected (a dm table maps to an R object), at the moment these 3 states have to live in separate objects but could we have then coexist in a single object and sync on demand ?

If compute.dm() stores the lazy dm as an attribute (or somewhere in the object structure), we can recompute it at any point by calling compute() one more time on the object. A breaking change but the original behavior is a no op.

If collect.dm() stores the lazy dm, and the remote dm if one was computed, we can recollect again, from the remote dm in priority, by calling collect() on the object, again a breaking change but the original behavior, collecting a local dm, is a no op. Calling compute() on a collected dm rerun the materialisation and returns a remote dm, that we can recollect if needed.

@krlmlr
Copy link
Collaborator

krlmlr commented Oct 30, 2023

Thanks. Do we need to do more in the PR to clarify the issue?

@krlmlr
Copy link
Collaborator

krlmlr commented Nov 1, 2023

compute(temporary = FALSE) just doesn't work as expected, at least not with current versions:

options(conflicts.policy = list(warn = FALSE))
library(dm)

con <- DBI::dbConnect(duckdb::duckdb())

dm <- dm(a = data.frame(x = 1), b = data.frame(y = 1))

dm_db <- copy_dm_to(con, dm)

dm_db_2 <- dm_db |> compute(temporary = FALSE)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `data = map(data, compute, ...)`.
#> Caused by warning:
#> ! The `name` argument of `compute()` must be provided when `temporary = FALSE`
#>   as of dbplyr 2.3.3.
#> ℹ The deprecated feature was likely used in the purrr package.
#>   Please report the issue at <https://github.com/tidyverse/purrr/issues>.

dm_db_2$a
#> # Source:   table<dbplyr_001> [1 x 1]
#> # Database: DuckDB v0.9.1 [kirill@Darwin 23.1.0:R 4.3.1/:memory:]
#>       x
#>   <dbl>
#> 1     1

DBI::dbDisconnect(con)

Created on 2023-11-01 with reprex v2.0.2

Copy link
Contributor

github-actions bot commented Nov 1, 2024

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants