enter_object : allow keeping row if object does not exist #134

Open
ramiromagno opened this issue Jan 23, 2021 · 4 comments
@ramiromagno
Contributor

ramiromagno commented Jan 23, 2021

Feature request

From the documentation details about enter_object():

After using enter_object, all further tidyjson calls happen inside the referenced object (all other JSON data outside the object is discarded). If the object doesn't exist for a given row / index, then that row will be discarded.

Could you give the user the option to not discard?

From the source code of enter_object it does not seem difficult to allow this:

function (.x, ...)
{
    if (!is.tbl_json(.x)) 
        .x <- as.tbl_json(.x)
    path <- path(...)
    json <- json_get(.x)
    json <- purrr::map(json, path %>% as.list)
    tbl_json(.x, json, drop.null.json = TRUE)
}

Could it be changed to this code?

function (.x, ..., drop.null.json = TRUE)
{
    if (!is.tbl_json(.x)) 
        .x <- as.tbl_json(.x)
    path <- path(...)
    json <- json_get(.x)
    json <- purrr::map(json, path %>% as.list)
    tbl_json(.x, json, drop.null.json = drop.null.json)
}
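
To make the effect of such a flag concrete, here is a minimal base R sketch of the extract-then-drop step. It is not tidyjson's implementation: `extract_key()` and the `docs` data are hypothetical stand-ins, and the only assumption is that a missing key extracts as NULL.

```r
# Toy stand-in for the parsed JSON held in the ..JSON column (made-up data).
docs <- list(
  list(id = "PGP000006", associated_pgs_ids = list("PGS000013", "PGS000014")),
  list(id = "PGP000007", associated_pgs_ids = list("PGS000018")),
  list(id = "PGP000008")  # key absent, so extraction yields NULL
)

# Hypothetical helper: pull one key out of every document and optionally
# drop the documents where that key was not found.
extract_key <- function(docs, key, drop_null = TRUE) {
  values <- lapply(docs, function(doc) doc[[key]])
  if (drop_null) {
    is_null <- vapply(values, is.null, logical(1))
    values <- values[!is_null]
  }
  values
}

length(extract_key(docs, "associated_pgs_ids"))                     # 2
length(extract_key(docs, "associated_pgs_ids", drop_null = FALSE))  # 3
```

With the flag switched off the result keeps one entry per input document, which is what makes binding it back to the original rows possible.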

Motivation

Perhaps I am not using tidyjson idiomatically, but I would like to use the code below to extract a JSON array and bind it as a new column. In the example below I have a tbl_json with 3 rows; in the third row the object "associated_pgs_ids" is null (it appears as an empty list() in the data below). As a result I cannot use the function get_column_chr(), because tidyjson::enter_object returns only two rows instead of three, which prevents me from binding this column back to the starting tibble.

get_column_chr()

    get_column_chr <- function(tbl_json, json_object, col = json_object, only_col = TRUE) {
      
      tbl_json %>%
        tidyjson::enter_object({{ json_object }}) %>%
        tidyjson::json_get_column(column_name = {{ col }}) %>%
        dplyr::mutate({{ col }} := purrr::map(.data[[col]], as.character)) %>%
        `if`(only_col, tidyjson::as_tibble(.[col]), .) # as_tibble necessary to drop ..JSON col.
    }

Example code

    library(magrittr)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(tidyjson)
    #> 
    #> Attaching package: 'tidyjson'
    #> The following object is masked from 'package:stats':
    #> 
    #>     filter

    tbl_json1 <-
      structure(
        list(
          ..resource = c(
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json",
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json",
            "https://www.pgscatalog.org/rest/publication/all?offset=0&limit=20&format=json"
          ),
          ..timestamp = structure(
            c(1611422943.87465, 1611422943.87465,
              1611422943.87465),
            tzone = "",
            class = c("POSIXct", "POSIXt")
          ),
          ..page = c(1L, 1L, 1L),
          array.index = 6:8,
          id = c("PGP000006",
                 "PGP000007", "PGP000008"),
          pubmed_id = c("30104762", "30309464",
                        "31184202"),
          publication_date = c("2018-08-13", "2018-10-01",
                               "2019-06-11"),
          publication = c("Nat Genet", "J Am Coll Cardiol",
                          "Circ Genom Precis Med"),
          title = c(
            "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations.",
            "Genomic Risk Prediction of Coronary Artery Disease in 480,000 Adults: Implications for Primary Prevention.",
            "Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians."
          ),
          author_fullname = c("Khera AV", "Inouye M", "Wünnemann F"),
          doi = c(
            "10.1038/s41588-018-0183-z",
            "10.1016/j.jacc.2018.07.079",
            "10.1161/CIRCGEN.119.002481"
          ),
          authors = c(
            "Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S.",
            "Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, Lai FY, Kaptoge S, Brozynska M, Wang T, Ye S, Webb TR, Rutter MK, Tzoulaki I, Patel RS, Loos RJF, Keavney B, Hemingway H, Thompson J, Watkins H, Deloukas P, Di Angelantonio E, Butterworth AS, Danesh J, Samani NJ, UK Biobank CardioMetabolic Consortium CHD Working Group.",
            "Wünnemann F, Sin Lo K, Langford-Avelar A, Busseuil D, Dubé MP, Tardif JC, Lettre G."
          ),
          ..JSON = list(
            list(
              id = "PGP000006",
              title = "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations.",
              doi = "10.1038/s41588-018-0183-z",
              PMID = 30104762L,
              journal = "Nat Genet",
              firstauthor = "Khera AV",
              date_publication = "2018-08-13",
              authors = "Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S.",
              associated_pgs_ids = list(
                "PGS000013",
                "PGS000014",
                "PGS000015",
                "PGS000016",
                "PGS000017"
              )
            ),
            list(
              id = "PGP000007",
              title = "Genomic Risk Prediction of Coronary Artery Disease in 480,000 Adults: Implications for Primary Prevention.",
              doi = "10.1016/j.jacc.2018.07.079",
              PMID = 30309464L,
              journal = "J Am Coll Cardiol",
              firstauthor = "Inouye M",
              date_publication = "2018-10-01",
              authors = "Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, Lai FY, Kaptoge S, Brozynska M, Wang T, Ye S, Webb TR, Rutter MK, Tzoulaki I, Patel RS, Loos RJF, Keavney B, Hemingway H, Thompson J, Watkins H, Deloukas P, Di Angelantonio E, Butterworth AS, Danesh J, Samani NJ, UK Biobank CardioMetabolic Consortium CHD Working Group.",
              associated_pgs_ids = list("PGS000018")
            ),
            list(
              id = "PGP000008",
              title = "Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians.",
              doi = "10.1161/CIRCGEN.119.002481",
              PMID = 31184202L,
              journal = "Circ Genom Precis Med",
              firstauthor = "Wünnemann F",
              date_publication = "2019-06-11",
              authors = "Wünnemann F, Sin Lo K, Langford-Avelar A, Busseuil D, Dubé MP, Tardif JC, Lettre G.",
              associated_pgs_ids = list()
            )
          )
        ),
        row.names = c(NA, 3L),
        class = c("tbl_json",
                  "tbl_df", "tbl", "data.frame")
      )

    get_column_chr <- function(tbl_json, json_object, col = json_object, only_col = TRUE) {
      
      tbl_json %>%
        tidyjson::enter_object({{ json_object }}) %>%
        tidyjson::json_get_column(column_name = {{ col }}) %>%
        dplyr::mutate({{ col }} := purrr::map(.data[[col]], as.character)) %>%
        `if`(only_col, tidyjson::as_tibble(.[col]), .) # as_tibble necessary to drop ..JSON col.
    }

    tbl_json1 %>%
      dplyr::bind_cols(., get_column_chr(., 'associated_pgs_ids', 'pgs_id'))
    #> Error: Can't recycle `..1` (size 3) to match `..2` (size 2).

    tbl_json1[1:2, ] %>%
      dplyr::bind_cols(., get_column_chr(., 'associated_pgs_ids', 'pgs_id'))
    #> # A tbl_json: 2 x 14 tibble with a "JSON" attribute
    #>   ..JSON ..resource ..timestamp         ..page array.index id    pubmed_id
    #>   <chr>  <chr>      <dttm>               <int>       <int> <chr> <chr>    
    #> 1 "{\"i… https://w… 2021-01-23 17:29:03      1           6 PGP0… 30104762 
    #> 2 "{\"i… https://w… 2021-01-23 17:29:03      1           7 PGP0… 30309464 
    #> # … with 7 more variables: publication_date <chr>, publication <chr>,
    #> #   title <chr>, author_fullname <chr>, doi <chr>, authors <chr>, pgs_id <list>
@ramiromagno
Contributor Author

ramiromagno commented Jan 23, 2021

This seems to be working on my side:

enter_object2 <- function (.x, ..., drop.null.json = TRUE) {
  if (!tidyjson::is.tbl_json(.x))
    .x <- tidyjson::as.tbl_json(.x)

  path <- tidyjson:::path(...)
  json <- tidyjson::json_get(.x)
  json <- purrr::map(json, path %>% as.list)
  tidyjson::tbl_json(.x, json, drop.null.json = drop.null.json)
}

It's only a bit risky because I am now depending on the internal function tidyjson:::path(...), but it's the only one.
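
One possible way to avoid the `:::` dependency, sketched here under assumptions: pass the path as a plain character vector and hand it to purrr::map() as a list extractor, as the code above already does after calling path(). The name `enter_object_keep()`, its `path` argument, and the two input documents are hypothetical and not part of tidyjson. The sketch relies only on the exported functions already used in enter_object2().

```r
library(tidyjson)

# Hypothetical variant of enter_object2(): the path is given as a character
# vector, so the internal tidyjson:::path() is not needed.
enter_object_keep <- function(.x, path, drop.null.json = FALSE) {
  if (!tidyjson::is.tbl_json(.x))
    .x <- tidyjson::as.tbl_json(.x)

  json <- tidyjson::json_get(.x)
  # A list passed to purrr::map() acts as an extractor; a missing key gives NULL.
  json <- purrr::map(json, as.list(path))
  tidyjson::tbl_json(.x, json, drop.null.json = drop.null.json)
}

# Made-up input: the second document has no "tags" key.
docs <- c('{"id": 1, "tags": ["a", "b"]}', '{"id": 2}')

kept    <- enter_object_keep(docs, "tags")
dropped <- enter_object_keep(docs, "tags", drop.null.json = TRUE)

nrow(kept)     # one row per input document
nrow(dropped)  # only the documents that have the key
```

The trade-off is that the bare-name syntax of enter_object(), e.g. enter_object(tags), is lost in exchange for not touching package internals.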

@colearendt
Owner

colearendt commented Feb 6, 2021

Awesome!! Thanks for reporting this - this is definitely a confusing part of the package. #121 is where we have tracked this in the past, but you have done much more on the topic than anyone previously!

Would you be interested in PRing your function change and adding some tests? I am inclined to feel drop_null_json would be a better naming convention for the new argument. Alternatively, perhaps drop = TRUE would be a better default. I.e. looking at tidyr::spread, it seems that "unexplained" references to the word "drop" can be contextualized by help docs / etc.

@ramiromagno
Contributor Author

What about .drop = TRUE?

@marklyng

marklyng commented Mar 1, 2024

Any progress on this issue? Also, any thoughts on implementing something similar to spread_all()?
