New URI path and consolidated fragments #533
-
When I move an array to a new URI path, the original fragments that were consolidated reappear alongside the consolidated fragment. Is this expected? Please see the example below.
library(tiledb)
# Helpers -----------------------------------------------------
create_sparse_db <- function(uri, dups = FALSE) {
  dt <- Sys.Date()
  tm_dom <- c(as.Date("1970-01-01"), as.Date("2100-01-01"))
  # ingest #1
  df1 <- data.frame(x = 1:3, tm = dt + 1:3)
  fromDataFrame(df1, uri, mode = "ingest",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
  Sys.sleep(2)
  # ingest #2
  df2 <- data.frame(x = 4:6, tm = dt + 4:6)
  fromDataFrame(df2, uri, mode = "append",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
  Sys.sleep(2)
  # ingest #3
  df3 <- data.frame(x = 7:9, tm = dt + 7:9)
  fromDataFrame(df3, uri, mode = "append",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
}
get_frag_num <- function(uri) {
  finfo <- tiledb_fragment_info(uri)
  tiledb_fragment_info_get_num(finfo)
}
dump_finfo <- function(uri) {
  finfo <- tiledb_fragment_info(uri)
  tiledb_fragment_info_dump(finfo)
}
# Example -----------------------------------------------------
uri_old <- tempfile()
create_sparse_db(uri_old, dups = TRUE)
## Load and save data
df_old <- tiledb_array(uri_old, return_as = "data.table")[]
## Get number of fragments
get_frag_num(uri_old)
#> [1] 3
## Dump info to console
# dump_finfo(uri_old)
## Consolidate array
array_consolidate(uri_old)
## Get number of fragments after consolidation
get_frag_num(uri_old)
#> [1] 1
## Vacuum uris
# array_vacuum(uri_old)
## Move array to new_uri
uri_new <- tempfile()
tiledb_object_mv(uri_old, uri_new)
#> [1] "C:\\Users\\Constantine\\AppData\\Local\\Temp\\Rtmp0AYhIr\\file685043c477a8"
## Get number of fragments
get_frag_num(uri_new)
#> [1] 4
## Dump info to console
# dump_finfo(uri_new)
## Load and save data
df_new <- tiledb_array(uri_new, return_as = "data.table")[]
# When dups = TRUE, df_new will contain duplicates even though you didn't write any
all.equal(df_old, df_new) # OK (dups = F), NOT OK (dups = T)
#> [1] "Different number of rows"
Created on 2023-04-04 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.3 (2023-03-15 ucrt)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United Kingdom.utf8
#> ctype English_United Kingdom.utf8
#> tz Europe/Istanbul
#> date 2023-04-04
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.2)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.2)
#> data.table 1.14.8 2023-02-17 [1] CRAN (R 4.2.2)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.2)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
#> fs 1.6.1 2023-02-06 [1] CRAN (R 4.2.2)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.3)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.2)
#> lattice 0.20-45 2021-09-22 [3] CRAN (R 4.2.3)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> nanotime 0.3.7 2022-10-24 [1] CRAN (R 4.2.1)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.2)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.1)
#> Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.2)
#> RcppCCTZ 0.2.12 2022-11-06 [1] CRAN (R 4.2.2)
#> RcppSpdlog * 0.0.12 2023-01-07 [1] CRAN (R 4.2.2)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
#> rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.2)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.3)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> spdl 0.0.4 2023-01-08 [1] CRAN (R 4.2.2)
#> styler 1.9.1 2023-03-04 [1] CRAN (R 4.2.2)
#> tiledb * 0.19.0 2023-03-13 [1] CRAN (R 4.2.2)
#> vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.2.3)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.38 2023-03-24 [1] CRAN (R 4.2.3)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
#> zoo 1.8-11 2022-09-17 [1] CRAN (R 4.2.1)
#>
#> [1] C:/Program Files/R/library
#> [2] C:/Users/Constantine/AppData/Local/R/win-library/4.2
#> [3] C:/Program Files/R/R-4.2.3/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Replies: 4 comments 10 replies
-
Hello, and thank you for taking the time to write such a fine and detailed bug report! It approaches the problem correctly. However, right now I cannot replicate the behavior on my Linux system. After consolidating, vacuuming and moving, I have a single fragment at the new URI:
> get_frag_num(uri_old)
[1] 3
> array_consolidate(uri_old)
> get_frag_num(uri_old)
[1] 1
> # not shown: dump_finfo() call
> array_vacuum(uri_old)
> uri_old
[1] "/tmp/Rtmp15cBkR/file18772d2c31b712"
> uri_new <- tempfile()
> tiledb_object_mv(uri_old, uri_new)
[1] "/tmp/Rtmp15cBkR/file18772d79a62aa8"
> get_frag_num(uri_new)
[1] 1
I also get
> all.equal(df_old, df_new) # OK (dups = F), NOT OK (dups = T)
[1] TRUE
so there is something for us to look into. I am running a slightly newer 0.19.0 (as on GitHub) with 2.16.0, but that should not make the difference.
-
That's how I spotted this issue. I had one fragment and expected the array to stay intact after the move (i.e., vacuumed URIs, consolidated fragments). At the new URI the queries became unreasonably slow because all the original fragments came back, so I had to re-consolidate.
The key question remains: why does moving a consolidated array (without vacuuming) bring back the fragments that were consolidated once it sits at the new URI? It's more a matter of understanding the rationale; I couldn't find anything in the docs.
It's also critical if you have a directory with many arrays that you vacuum periodically, and at some point you decide to move or rename the URIs for whatever reason without vacuuming first; it seems that you then need to re-consolidate after moving any array that hasn't been vacuumed. I will raise the question on the Slack channel in due time. Thanks for your quick responses and assistance.
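For anyone hitting the same thing, the workflow described here (consolidate, vacuum, then move) can be sketched as a small wrapper. This is only a sketch under assumptions: `move_array_vacuumed()` is a hypothetical helper name, not a tiledb function, and it simply chains the calls already used in this thread. Vacuuming first gave a single fragment after the move in the Linux run shown earlier; whether it avoids the problem on every platform is not confirmed here.

```r
# Hypothetical helper (not part of tiledb): consolidate, vacuum, then move,
# and compare the fragment count before and after the move.
move_array_vacuumed <- function(uri_old, uri_new) {
  # Merge the existing fragments into a consolidated one
  tiledb::array_consolidate(uri_old)
  # Delete the fragments that consolidation has already merged
  tiledb::array_vacuum(uri_old)

  n_before <- tiledb::tiledb_fragment_info_get_num(
    tiledb::tiledb_fragment_info(uri_old)
  )

  tiledb::tiledb_object_mv(uri_old, uri_new)

  n_after <- tiledb::tiledb_fragment_info_get_num(
    tiledb::tiledb_fragment_info(uri_new)
  )

  # Flag the situation reported in this thread
  if (n_after != n_before) {
    warning("Fragment count changed after move: ", n_before, " -> ", n_after)
  }
  invisible(c(before = n_before, after = n_after))
}
```

Usage would be `move_array_vacuumed(uri_old, uri_new)` in place of the bare `tiledb_object_mv()` call; the warning fires only if the fragment count differs across the move.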
-
Hi @cgiachalis, thanks for posting this. When moving an array to a new folder / URI, the behavior of the new array should be identical to the old array. You should not need to re-consolidate, etc. @KiterLuc will take a look, as this appears to be a core issue.
-
Hi @stavrospapadopoulos, thanks for confirming. I was drafting a message for Slack and refining the example above so that it can be easily understood by anyone. In a nutshell, you confirm that the "new world" must be identical to the "old world". See below.
library(tiledb)
# Helpers -----------------------------------------------------
# Creates sparse array with three writes
create_sparse_db <- function(uri, dups = FALSE) {
  dt <- Sys.Date()
  tm_dom <- c(as.Date("1970-01-01"), as.Date("2100-01-01"))
  # ingest #1
  df1 <- data.frame(x = 1:3, tm = dt + 1:3)
  fromDataFrame(df1, uri, mode = "ingest",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
  Sys.sleep(2)
  # ingest #2
  df2 <- data.frame(x = 4:6, tm = dt + 4:6)
  fromDataFrame(df2, uri, mode = "append",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
  Sys.sleep(2)
  # ingest #3
  df3 <- data.frame(x = 7:9, tm = dt + 7:9)
  fromDataFrame(df3, uri, mode = "append",
                sparse = TRUE,
                col_index = c("tm"),
                tile_domain = list(tm = tm_dom),
                allows_dups = dups)
}
# Gets fragment number and uri location - (trunc_uri: bool to truncate uri)
finfo_table_uris <- function(uri, trunc_uri = TRUE) {
  finfo <- tiledb_fragment_info(uri)
  idx <- tiledb_fragment_info_get_num(finfo) - 1
  lst <- lapply(0:idx, function(.x) {
    tmp <- tiledb::tiledb_fragment_info_uri(finfo, .x)
    data.frame(Fragment = paste0("#", .x + 1),
               URI = ifelse(trunc_uri, sub(".*__fragments", "", tmp), tmp))
  })
  do.call(rbind, lst)
}
# Example -----------------------------------------------------
uri_old <- tempfile()
create_sparse_db(uri_old, dups = TRUE)
## Load and save data
df_old <- tiledb_array(uri_old, return_as = "data.table")[]
# Table of finfo uris before consolidation
finfo_table_uris(uri_old)
#> Fragment URI
#> 1 #1 /__1680621044775_1680621044775_cfbafb0caffd453da3e7da8b40f94b92_18
#> 2 #2 /__1680621047279_1680621047279_87b450f30969440a9ceb5e1586181429_18
#> 3 #3 /__1680621049807_1680621049807_10cc9235ba3f4c25b0313240cd3fcb01_18
## Consolidate array
array_consolidate(uri_old)
# Table of finfo uris after consolidation
finfo_table_uris(uri_old)
#> Fragment URI
#> 1 #1 /__1680621044775_1680621049807_67b9efc0124f4ce3a643a81ce0ae55d0_18
## Move array to new_uri
uri_new <- tempfile()
tiledb_object_mv(uri_old, uri_new)
#> [1] "C:\\Users\\Constantine\\AppData\\Local\\Temp\\RtmpacSaYY\\file69cc2f94104"
# Table of new finfo uris
finfo_table_uris(uri_new)
#> Fragment URI
#> 1 #1 /__1680621044775_1680621044775_cfbafb0caffd453da3e7da8b40f94b92_18
#> 2 #2 /__1680621044775_1680621049807_67b9efc0124f4ce3a643a81ce0ae55d0_18
#> 3 #3 /__1680621047279_1680621047279_87b450f30969440a9ceb5e1586181429_18
#> 4 #4 /__1680621049807_1680621049807_10cc9235ba3f4c25b0313240cd3fcb01_18
## Load and save data
df_new <- tiledb_array(uri_new, return_as = "data.table")[]
# When dups = TRUE, df_new will contain duplicates even though you didn't write any
all.equal(df_old, df_new) # OK (dups = F), NOT OK (dups = T)
#> [1] "Different number of rows"
# Print all data from the Old World
df_old
#> tm x
#> 1: 2023-04-05 1
#> 2: 2023-04-06 2
#> 3: 2023-04-07 3
#> 4: 2023-04-08 4
#> 5: 2023-04-09 5
#> 6: 2023-04-10 6
#> 7: 2023-04-11 7
#> 8: 2023-04-12 8
#> 9: 2023-04-13 9
# Print all data from the New World (it should be identical, strange)
df_new
#> tm x
#> 1: 2023-04-05 1
#> 2: 2023-04-06 2
#> 3: 2023-04-07 3
#> 4: 2023-04-05 1
#> 5: 2023-04-06 2
#> 6: 2023-04-07 3
#> 7: 2023-04-08 4
#> 8: 2023-04-09 5
#> 9: 2023-04-10 6
#> 10: 2023-04-11 7
#> 11: 2023-04-12 8
#> 12: 2023-04-13 9
#> 13: 2023-04-08 4
#> 14: 2023-04-09 5
#> 15: 2023-04-10 6
#> 16: 2023-04-11 7
#> 17: 2023-04-12 8
#> 18: 2023-04-13 9
Created on 2023-04-04 with reprex v2.0.2
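The fragment tables above also suggest a quick way to spot the leftovers without opening the array: the consolidated fragment's name carries a timestamp range that spans the three original writes. Below is a minimal base-R sketch of that check. It assumes fragment names follow the `__<start>_<end>_<uuid>_<version>` pattern seen in the output above, and `find_covered_fragments()` is a hypothetical helper, not something tiledb provides.

```r
# Hypothetical helper (not part of tiledb): flag fragments whose timestamp
# range lies inside the range of another, wider fragment.
# Assumes names look like "__<t_start>_<t_end>_<uuid>_<version>".
find_covered_fragments <- function(frag_names) {
  # Keep only the fragment directory name, dropping any leading path
  base <- basename(frag_names)
  parts <- strsplit(sub("^__", "", base), "_", fixed = TRUE)
  # Millisecond timestamps exceed the integer range, so parse as numeric
  t_start <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  t_end <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))

  covered <- vapply(seq_along(base), function(i) {
    # Only strictly wider fragments can cover fragment i
    wider <- (t_end - t_start) > (t_end[i] - t_start[i])
    any(wider & t_start <= t_start[i] & t_end >= t_end[i])
  }, logical(1))

  data.frame(fragment = base, t_start = t_start, t_end = t_end,
             covered = covered)
}

# Fragment names copied from the finfo_table_uris(uri_new) output above
frags <- c(
  "/__1680621044775_1680621044775_cfbafb0caffd453da3e7da8b40f94b92_18",
  "/__1680621044775_1680621049807_67b9efc0124f4ce3a643a81ce0ae55d0_18",
  "/__1680621047279_1680621047279_87b450f30969440a9ceb5e1586181429_18",
  "/__1680621049807_1680621049807_10cc9235ba3f4c25b0313240cd3fcb01_18"
)
res <- find_covered_fragments(frags)
res[, c("t_start", "t_end", "covered")]
```

With the names from the new URI, fragments #1, #3 and #4 are flagged as covered by #2, which matches the three original writes showing up again next to the consolidated fragment.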