Parent array dropped when child array is empty #117

msinjin · 2019-12-15T17:23:40Z

[cross posted on stack overflow, getting no love]

It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with NAs?

For example, these fake data have one record with a populated nested array (middles) and two records where middles is empty:

library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with NAs (~ left join) as such:

# A tibble: 3 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick   
2    54 Susan Doe   NA      NA
3    18 Ann   Jones NA      NA

However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):

people %>% 
    spread_all() %>%  
    enter_object("name") %>% gather_array() %>% 
    spread_all() %>% select(-document.id,-array.index) %>% 
    enter_object("middles") %>% gather_array %>% 
    spread_all()  %>% select(-array.index) %>% 
    tbl_df()

# A tibble: 1 x 5
    age first last  middle1 middle2
  <dbl> <chr> <chr> <chr>   <chr>  
1    32 Bob   Smith John    Rick

Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?

The text was updated successfully, but these errors were encountered:

colearendt · 2019-12-15T19:54:53Z

Thanks for reporting, and apologies for missing the SO post 😄

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

library(dplyr)
library(tidyjson)


people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

# does not work as we want
people %>% 
  spread_all() %>%  
  enter_object("name") %>% gather_array() %>% 
  spread_all() %>% select(-document.id,-array.index) %>% 
  enter_object("middles") %>% gather_array %>% 
  spread_all()  %>% select(-array.index) %>% 
  tbl_df()
#> # A tibble: 1 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    32 Bob   Smith John    Rick

# prepare the JSON
prep <- people %>% 
  spread_all() %>% 
  enter_object("name") %>% 
  gather_array() %>% 
  spread_all() %>% 
  select(-array.index)

# extract the sub-object
middles <- prep %>% 
  enter_object("middles") %>%
  gather_array() %>% 
  spread_all() %>% 
  select("document.id", starts_with("middle")) %>%
  as_tibble()

# join together
prep %>% 
  as_tibble() %>%
  left_join(
    middles, by = c("document.id")
    )
#> # A tibble: 3 x 6
#>   document.id   age first last  middle1 middle2
#>         <int> <dbl> <chr> <chr> <chr>   <chr>  
#> 1           1    32 Bob   Smith John    Rick   
#> 2           2    54 Susan Doe   <NA>    <NA>   
#> 3           3    18 Ann   Jones <NA>    <NA>

colearendt · 2019-12-15T20:02:43Z

And another approach that gets close (going tall first and avoiding enter_object()... but we'd really rather have a "list" column that you can then explore, rather than a character).

library(dplyr)
library(tidyr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')
people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_string() %>%
  as_tibble() %>%
  spread(name, string)
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles                             
#>         <int> <dbl>       <int> <chr> <chr> <chr>                               
#> 1           1    32           1 Bob   Smith "list(middle1 = \"John\", middle2 =…
#> 2           2    54           1 Susan Doe   NA                                  
#> 3           3    18           1 Ann   Jones NA

^{Created on 2019-12-15 by the reprex package (v0.2.1)}

colearendt · 2019-12-15T20:41:34Z

I had a few minutes, so I tried adding a simple append_values_list() that makes the latter a bit more plausible... enter_object() is a bit more complex to dig into and will need more thought. Column types are still a little weird here with list-cols.

devtools::install_github("colearendt/tidyjson@add_append_values")

library(dplyr)
library(tidyr)
library(purrr)
library(tidyjson)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

structured <- people %>%
  spread_all() %>%
  enter_object("name") %>%
  gather_array() %>%
  gather_object() %>%
  append_values_list() %>% 
  as_tibble()

structured
#> # A tibble: 9 x 5
#>   document.id   age array.index name    list            
#>         <int> <dbl>       <int> <chr>   <list>          
#> 1           1    32           1 first   <chr [1]>       
#> 2           1    32           1 last    <chr [1]>       
#> 3           1    32           1 middles <named list [2]>
#> 4           2    54           1 first   <chr [1]>       
#> 5           2    54           1 last    <chr [1]>       
#> 6           2    54           1 middles <lgl [1]>       
#> 7           3    18           1 first   <chr [1]>       
#> 8           3    18           1 last    <chr [1]>       
#> 9           3    18           1 middles <lgl [1]>

spread_it_out <- structured %>% spread(name, list)
spread_it_out
#> # A tibble: 3 x 6
#>   document.id   age array.index first     last      middles         
#>         <int> <dbl>       <int> <list>    <list>    <list>          
#> 1           1    32           1 <chr [1]> <chr [1]> <named list [2]>
#> 2           2    54           1 <chr [1]> <chr [1]> <lgl [1]>       
#> 3           3    18           1 <chr [1]> <chr [1]> <lgl [1]>

spread_it_out %>% mutate(first = flatten_chr(first), last = flatten_chr(last))
#> # A tibble: 3 x 6
#>   document.id   age array.index first last  middles         
#>         <int> <dbl>       <int> <chr> <chr> <list>          
#> 1           1    32           1 Bob   Smith <named list [2]>
#> 2           2    54           1 Susan Doe   <lgl [1]>       
#> 3           3    18           1 Ann   Jones <lgl [1]>

^{Created on 2019-12-15 by the reprex package (v0.2.1)}

msinjin · 2019-12-16T03:22:39Z

Thanks for reporting, and apologies for missing the SO post 😄

All good. I really wasn't sure if it was a bug or a feature!

I think this is related to #95 and possibly others. enter_object() is the problem here, as it discards records that do not have that object (And in this case, empty is treated as "does not have").

Interesting. I had considered issue #95 while researching a solution, so that makes sense.

There is a short-term solution in splitting up the pipeline and doing the left_join() yourself. Not ideal I don't think, but functional for now until we have a bit more time to think on this one 😄

I had come up with a similar left_join() work around, which I included in the SO post. It gets the job done but doubles the execution time which isn't great because I'm dealing with GBs of JSON.

I'll try out your other ideas and see how it goes.

Thanks for the quick answer!

colearendt · 2020-06-29T02:15:29Z

Another possible pattern I just added to the SO post.

library(tidyjson)
library(dplyr)
library(tidyr)

people <- c('{"age": 32, "name": [{"first": "Bob",   "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
            '{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
            '{"age": 18, "name": [{"first": "Ann",   "last": "Jones", "middles":[]}]}')

as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
  gather_object("key") %>% 
  {bind_rows(
    filter(., key != "middles") %>% append_values_string("value"),
    filter(., key == "middles") %>% gather_array("middleid") %>% 
      select(-key) %>%
      gather_object("key") %>%
      append_values_string("value")
  )} %>%
  # drop tbl_json
  as_tibble() %>%
  select(-document.id, -nameid, -middleid) %>%
  # could also use tidyr::pivot_wider
  tidyr::spread(key, value)
#> # A tibble: 3 x 5
#>     age first last  middle1 middle2
#>   <dbl> <chr> <chr> <chr>   <chr>  
#> 1    18 Ann   Jones <NA>    <NA>   
#> 2    32 Bob   Smith John    Rick   
#> 3    54 Susan Doe   <NA>    <NA>

^{Created on 2020-06-28 by the reprex package (v0.3.0)}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parent array dropped when child array is empty #117

Parent array dropped when child array is empty #117

msinjin commented Dec 15, 2019

colearendt commented Dec 15, 2019

colearendt commented Dec 15, 2019

colearendt commented Dec 15, 2019 •

edited

Loading

msinjin commented Dec 16, 2019

colearendt commented Jun 29, 2020

Parent array dropped when child array is empty #117

Parent array dropped when child array is empty #117

Comments

msinjin commented Dec 15, 2019

colearendt commented Dec 15, 2019

colearendt commented Dec 15, 2019

colearendt commented Dec 15, 2019 • edited Loading

msinjin commented Dec 16, 2019

colearendt commented Jun 29, 2020

colearendt commented Dec 15, 2019 •

edited

Loading