-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parent array dropped when child array is empty #117
Comments
Thanks for reporting, and apologies for missing the SO post 😄 I think this is related to #95 and possibly others. There is a short-term solution in splitting up the pipeline and doing the
|
And another approach that gets close (going tall first and avoiding library(dplyr)
library(tidyr)
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
people %>%
spread_all() %>%
enter_object("name") %>%
gather_array() %>%
gather_object() %>%
append_values_string() %>%
as_tibble() %>%
spread(name, string)
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <chr> <chr> <chr>
#> 1 1 32 1 Bob Smith "list(middle1 = \"John\", middle2 =…
#> 2 2 54 1 Susan Doe NA
#> 3 3 18 1 Ann Jones NA Created on 2019-12-15 by the reprex package (v0.2.1) |
I had a few minutes, so I tried adding a simple devtools::install_github("colearendt/tidyjson@add_append_values") library(dplyr)
library(tidyr)
library(purrr)
library(tidyjson)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
structured <- people %>%
spread_all() %>%
enter_object("name") %>%
gather_array() %>%
gather_object() %>%
append_values_list() %>%
as_tibble()
structured
#> # A tibble: 9 x 5
#> document.id age array.index name list
#> <int> <dbl> <int> <chr> <list>
#> 1 1 32 1 first <chr [1]>
#> 2 1 32 1 last <chr [1]>
#> 3 1 32 1 middles <named list [2]>
#> 4 2 54 1 first <chr [1]>
#> 5 2 54 1 last <chr [1]>
#> 6 2 54 1 middles <lgl [1]>
#> 7 3 18 1 first <chr [1]>
#> 8 3 18 1 last <chr [1]>
#> 9 3 18 1 middles <lgl [1]>
spread_it_out <- structured %>% spread(name, list)
spread_it_out
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <list> <list> <list>
#> 1 1 32 1 <chr [1]> <chr [1]> <named list [2]>
#> 2 2 54 1 <chr [1]> <chr [1]> <lgl [1]>
#> 3 3 18 1 <chr [1]> <chr [1]> <lgl [1]>
spread_it_out %>% mutate(first = flatten_chr(first), last = flatten_chr(last))
#> # A tibble: 3 x 6
#> document.id age array.index first last middles
#> <int> <dbl> <int> <chr> <chr> <list>
#> 1 1 32 1 Bob Smith <named list [2]>
#> 2 2 54 1 Susan Doe <lgl [1]>
#> 3 3 18 1 Ann Jones <lgl [1]> Created on 2019-12-15 by the reprex package (v0.2.1) |
All good. I really wasn't sure if it was a bug or a feature!
Interesting. I had considered issue #95 while researching a solution, so that makes sense.
I had come up with a similar I'll try out your other ideas and see how it goes. Thanks for the quick answer! |
Another possible pattern I just added to the SO post. library(tidyjson)
library(dplyr)
library(tidyr)
people <- c('{"age": 32, "name": [{"first": "Bob", "last": "Smith", "middles":[{"middle1":"John", "middle2":"Rick"}]}]}',
'{"age": 54, "name": [{"first": "Susan", "last": "Doe", "middles":[]}]}',
'{"age": 18, "name": [{"first": "Ann", "last": "Jones", "middles":[]}]}')
as_tbl_json(people) %>% spread_all() %>% enter_object("name") %>% gather_array("nameid") %>%
gather_object("key") %>%
{bind_rows(
filter(., key != "middles") %>% append_values_string("value"),
filter(., key == "middles") %>% gather_array("middleid") %>%
select(-key) %>%
gather_object("key") %>%
append_values_string("value")
)} %>%
# drop tbl_json
as_tibble() %>%
select(-document.id, -nameid, -middleid) %>%
# could also use tidyr::pivot_wider
tidyr::spread(key, value)
#> # A tibble: 3 x 5
#> age first last middle1 middle2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 18 Ann Jones <NA> <NA>
#> 2 32 Bob Smith John Rick
#> 3 54 Susan Doe <NA> <NA> Created on 2020-06-28 by the reprex package (v0.3.0) |
[cross posted on stack overflow, getting no love]
It seems as though tidyjson uses an inner-join-like behaviour on nested arrays, thus dropping records with empty child arrays. Is there a way to get left-join-like behaviour instead, filling with
NA
s?For example, these fake data have one record with a populated nested array (
middles
) and two records wheremiddles
is empty:From these data I wish to have a dataframe with all the parent records retained and missing child-array information filled with
NA
s (~ left join) as such:However, extracting a nested array with some empty child arrays causes loss of their parent's information (~ inner join):
Is there a way to avoid this; i.e., to retain all the rows even when child arrays are empty?
The text was updated successfully, but these errors were encountered: