Document how to handle grouped line list columns #114

Open
joshwlambert opened this issue Apr 17, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation

Comments

@joshwlambert

It is unclear from the current {incidence2} documentation whether the package, specifically incidence2::incidence(), can handle grouped columns.

An example of such a line list with a grouped column is the Ebola simulated line list in {outbreaks}.

head(outbreaks::ebola_sim_clean$linelist)
#>   case_id generation date_of_infection date_of_onset date_of_hospitalisation
#> 1  d1fafd          0              <NA>    2014-04-07              2014-04-17
#> 2  53371b          1        2014-04-09    2014-04-15              2014-04-20
#> 3  f5c3d8          1        2014-04-18    2014-04-21              2014-04-25
#> 4  6c286a          2              <NA>    2014-04-27              2014-04-27
#> 5  0f58c4          2        2014-04-22    2014-04-26              2014-04-29
#> 6  49731d          0        2014-03-19    2014-04-25              2014-05-02
#>   date_of_outcome outcome gender           hospital       lon      lat
#> 1      2014-04-19    <NA>      f  Military Hospital -13.21799 8.473514
#> 2            <NA>    <NA>      m Connaught Hospital -13.21491 8.464927
#> 3      2014-04-30 Recover      f              other -13.22804 8.483356
#> 4      2014-05-07   Death      f               <NA> -13.23112 8.464776
#> 5      2014-05-17 Recover      f              other -13.21016 8.452143
#> 6      2014-05-07    <NA>      f               <NA> -13.23443 8.468572

Created on 2024-04-17 with reprex v2.1.0

The way I've been passing line list data like this to incidence() is using tidyr::pivot_wider() beforehand.

library(magrittr)
outbreaks::ebola_sim_clean$linelist %>%
  tidyr::pivot_wider(
    names_from = outcome,
    values_from = date_of_outcome
  )
#> # A tibble: 5,829 × 12
#>    case_id generation date_of_infection date_of_onset date_of_hospitalisation
#>    <chr>        <int> <date>            <date>        <date>                 
#>  1 d1fafd           0 NA                2014-04-07    2014-04-17             
#>  2 53371b           1 2014-04-09        2014-04-15    2014-04-20             
#>  3 f5c3d8           1 2014-04-18        2014-04-21    2014-04-25             
#>  4 6c286a           2 NA                2014-04-27    2014-04-27             
#>  5 0f58c4           2 2014-04-22        2014-04-26    2014-04-29             
#>  6 49731d           0 2014-03-19        2014-04-25    2014-05-02             
#>  7 f9149b           3 NA                2014-05-03    2014-05-04             
#>  8 881bd4           3 2014-04-26        2014-05-01    2014-05-05             
#>  9 e66fa4           2 NA                2014-04-21    2014-05-06             
#> 10 20b688           3 NA                2014-05-05    2014-05-06             
#> # ℹ 5,819 more rows
#> # ℹ 7 more variables: gender <fct>, hospital <fct>, lon <dbl>, lat <dbl>,
#> #   `NA` <date>, Recover <date>, Death <date>

Created on 2024-04-17 with reprex v2.1.0

These columns can then be selected using the date_index argument in incidence().

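# Store the pivoted line list so it can be passed to incidence()
linelist <- tidyr::pivot_wider(
  outbreaks::ebola_sim_clean$linelist,
  names_from = outcome,
  values_from = date_of_outcome
)
daily <- incidence2::incidence(
  linelist,
  date_index = c(
    onset = "date_of_onset",
    death = "Death"
  ),
  interval = "daily"
)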

It would be great to have the best way to work with this data documented somewhere in the {incidence2} package, or to add functionality that handles it.
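To make the pivot step concrete, here is a minimal sketch on a toy line list. The `toy` data frame and its values are invented for illustration and are not taken from {outbreaks}; only `tidyr::pivot_wider()` and its `names_from`/`values_from` arguments come from the example above.

```r
library(tidyr)

# Hypothetical toy line list: date_of_outcome is a "grouped" column because
# its meaning depends on the value in the outcome column.
toy <- data.frame(
  case_id = c("a", "b", "c", "d"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-15", "2014-04-21", "2014-04-27")),
  date_of_outcome = as.Date(c("2014-04-19", NA, "2014-04-30", "2014-05-07")),
  outcome = c(NA, NA, "Recover", "Death")
)

# One date column per outcome value; missing outcomes end up in a column
# literally named "NA", as in the tibble printed above.
wide <- pivot_wider(toy, names_from = outcome, values_from = date_of_outcome)
names(wide)
```

Each case keeps a single row, and its outcome date lands only in the column matching its outcome, so the new columns can be named individually in `date_index`.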

@joshwlambert joshwlambert changed the title Document how to handle grouped line list columsn Document how to handle grouped line list columns Apr 17, 2024
@joshwlambert
Author

This issue might also link with the {linelist} package: does that package define a line list standard, and are any of its columns grouped? If so, an as_incidence.linelist() S3 method might be beneficial. @Bisaloo, is this the case, or are all {linelist} tags for ungrouped columns?

@TimTaylor
Collaborator

Cheers @joshwlambert. Yes this is exactly how I'd handle it (outside of incidence). Will add an example along the lines of

outbreaks::ebola_sim_clean$linelist |>
    tidyr::pivot_wider(names_from = outcome, values_from = date_of_outcome) |>
    incidence2::incidence(
        date_index = c(
            onset = "date_of_onset",
            hospitalisation = "date_of_hospitalisation",
            death = "Death"
        ),
        interval = "daily"
    )

The issue we have is that incidence2 is expecting wide data (albeit potentially aggregated) whereas this is a mixture of wide and long. Whilst it may be possible to adapt to allow for long-style "outcome" (and associated date) columns, I think the tidyr approach is so elegant that my preference is just to ensure it is documented.

As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

As an aside, I may also add a grates version to the examples to illustrate the different approaches.
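A grates-based version might look roughly like the sketch below. This is only an illustration of the general approach, not the example that was eventually added to the package: the `toy` data frame is invented, and the choice of `grates::as_isoweek()` with `dplyr::count()` is an assumption about what such an example would use.

```r
library(dplyr)
library(grates)

# Hypothetical onset dates (2014-04-07 is a Monday, the start of ISO week 15)
toy <- data.frame(
  date_of_onset = as.Date(c("2014-04-07", "2014-04-08", "2014-04-15", "2014-04-21"))
)

# Convert dates to ISO weeks with grates, then aggregate with plain dplyr
weekly <- toy |>
  mutate(week = as_isoweek(date_of_onset)) |>
  count(week, name = "cases")
weekly
```

Here the grouping of dates is handled by grates and the counting by dplyr, so no incidence-specific class is involved.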

@TimTaylor TimTaylor added the documentation Improvements or additions to documentation label Apr 17, 2024
@Bisaloo

Bisaloo commented Apr 18, 2024

As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

Couple of thoughts on this:

  • I've considered it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. A different way to say it is that in most cases, it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as there is loss of information when aggregating the data. But maybe this is just a terminology issue.
  • Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value in such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?
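For comparison, passing the outcome date directly to incidence() without pivoting could be sketched as follows. The `toy` data frame is invented for illustration, and using the `groups` argument of `incidence2::incidence()` to split counts by outcome is one possible reading of "pass date_outcome directly", not something the thread settles on.

```r
library(incidence2)

# Hypothetical long-style data: a single outcome date plus an outcome label
toy <- data.frame(
  date_of_outcome = as.Date(c("2014-04-30", "2014-05-07", "2014-05-07", "2014-05-17")),
  outcome = c("Recover", "Death", "Death", "Recover")
)

# Keep the data long: one date index, with the outcome used as a grouping variable
by_outcome <- incidence(
  toy,
  date_index = "date_of_outcome",
  groups = "outcome",
  interval = "daily"
)
by_outcome
```

The trade-off is that the outcome becomes a group rather than a separate count variable, so it cannot sit alongside other date columns such as date_of_onset in the same `date_index` vector in the way the pivoted columns can.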

@TimTaylor
Collaborator

TimTaylor commented Apr 18, 2024

  • I've considered it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. A different way to say it is that in most cases, it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as there is loss of information when aggregating the data. But maybe this is just a terminology issue.

Interesting. I've always viewed as_ methods as (potentially lossy) casts (e.g. as.integer(1.5)). Alternatively you could make the incidence() function generic, but that feels less satisfying (although I cannot put my finger on why).
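The two views of as_ methods can be illustrated with base R alone. This is a small sketch of the distinction being discussed, and it has nothing specific to incidence2 in it.

```r
# Lossy cast: the fractional part is dropped, so the round trip does not
# recover the original value.
x <- 1.5
as_int <- as.integer(x)
back <- as.numeric(as_int)
identical(back, x)

# Lossless conversion: integer -> character -> integer recovers the original.
y <- 2L
round_trip <- as.integer(as.character(y))
identical(round_trip, y)
```

The first comparison is FALSE and the second is TRUE. Aggregating a line list into counts resembles the first case, since the individual rows cannot be recovered from the counts.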

  • Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value in such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?

These are the interesting questions. I think it very much depends on what a typical input linelist (in the non-package sense) looks like. incidence only really handles wide, potentially aggregated, data as this was, in essence, what I inherited spec-wise. If data generally has this form with additional "outcome" and "date of outcome" columns we could perhaps adapt, but I'm loath to do so without more of a formal spec of a linelist.

An aside:
My gut feeling is that incidence2 is currently a package without a reason. I'm ok with this but think it's important to be open here. I tend to push people towards dplyr/data.table in combination with grates as this is more aligned with how I approach things. It could be useful if there were a range of methods (e.g. models for trend fitting) that were incorporated in incidence2 (calling functions from suggested packages) so people could easily go:

data -> incidence2 -> model fits by groups/counts.

but after 2 to 3 years I don't think there is a desire for this, and unless a specific need comes up in ${DAYJOB} it's not something I'll push for.
