Document how to handle grouped line list columns #114

joshwlambert · 2024-04-17T09:15:46Z

It is unclear from the current {incidence2} documentation whether the package, specifically incidence2::incidence(), can handle grouped columns.

An example of such a line list with a grouped column is the Ebola simulated line list in {outbreaks}.

head(outbreaks::ebola_sim_clean$linelist)
#>   case_id generation date_of_infection date_of_onset date_of_hospitalisation
#> 1  d1fafd          0              <NA>    2014-04-07              2014-04-17
#> 2  53371b          1        2014-04-09    2014-04-15              2014-04-20
#> 3  f5c3d8          1        2014-04-18    2014-04-21              2014-04-25
#> 4  6c286a          2              <NA>    2014-04-27              2014-04-27
#> 5  0f58c4          2        2014-04-22    2014-04-26              2014-04-29
#> 6  49731d          0        2014-03-19    2014-04-25              2014-05-02
#>   date_of_outcome outcome gender           hospital       lon      lat
#> 1      2014-04-19    <NA>      f  Military Hospital -13.21799 8.473514
#> 2            <NA>    <NA>      m Connaught Hospital -13.21491 8.464927
#> 3      2014-04-30 Recover      f              other -13.22804 8.483356
#> 4      2014-05-07   Death      f               <NA> -13.23112 8.464776
#> 5      2014-05-17 Recover      f              other -13.21016 8.452143
#> 6      2014-05-07    <NA>      f               <NA> -13.23443 8.468572

Created on 2024-04-17 with reprex v2.1.0

I've been passing line list data like this to incidence() by first reshaping it with tidyr::pivot_wider().

library(magrittr)
outbreaks::ebola_sim_clean$linelist %>%
  tidyr::pivot_wider(
    names_from = outcome,
    values_from = date_of_outcome
  )
#> # A tibble: 5,829 × 12
#>    case_id generation date_of_infection date_of_onset date_of_hospitalisation
#>    <chr>        <int> <date>            <date>        <date>                 
#>  1 d1fafd           0 NA                2014-04-07    2014-04-17             
#>  2 53371b           1 2014-04-09        2014-04-15    2014-04-20             
#>  3 f5c3d8           1 2014-04-18        2014-04-21    2014-04-25             
#>  4 6c286a           2 NA                2014-04-27    2014-04-27             
#>  5 0f58c4           2 2014-04-22        2014-04-26    2014-04-29             
#>  6 49731d           0 2014-03-19        2014-04-25    2014-05-02             
#>  7 f9149b           3 NA                2014-05-03    2014-05-04             
#>  8 881bd4           3 2014-04-26        2014-05-01    2014-05-05             
#>  9 e66fa4           2 NA                2014-04-21    2014-05-06             
#> 10 20b688           3 NA                2014-05-05    2014-05-06             
#> # ℹ 5,819 more rows
#> # ℹ 7 more variables: gender <fct>, hospital <fct>, lon <dbl>, lat <dbl>,
#> #   `NA` <date>, Recover <date>, Death <date>

Created on 2024-04-17 with reprex v2.1.0

These columns can then be selected using the date_index argument in incidence().

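library(incidence2)
# `linelist` is the pivoted data from the previous step
linelist <- tidyr::pivot_wider(
  outbreaks::ebola_sim_clean$linelist,
  names_from = outcome,
  values_from = date_of_outcome
)
daily <- incidence(
  linelist,
  date_index = c(
    onset = "date_of_onset",
    death = "Death"
  ),
  interval = "daily"
)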

It would be great to have the best way of working with this kind of data documented somewhere in the {incidence2} package, or to add functionality that handles it directly.


joshwlambert · 2024-04-17T09:17:52Z

This issue might also relate to the {linelist} package: does that package define a line list standard, and are any of its columns grouped? If so, an as_incidence.linelist() S3 method might be beneficial. @Bisaloo is this the case, or are all {linelist} tags for ungrouped columns?

TimTaylor · 2024-04-17T09:53:51Z

Cheers @joshwlambert. Yes this is exactly how I'd handle it (outside of incidence). Will add an example along the lines of

outbreaks::ebola_sim_clean$linelist |>
    tidyr::pivot_wider(names_from = outcome, values_from = date_of_outcome) |>
    incidence2::incidence(
        date_index = c(
            onset = "date_of_onset",
            hospitalisation = "date_of_hospitalisation",
            death = "Death"
        ),
        interval = "daily"
    )

The issue we have is that incidence2 expects wide data (albeit potentially aggregated), whereas this is a mixture of wide and long. Whilst it may be possible to adapt it to allow for long-style "outcome" (and associated date) columns, I think the tidyr approach is so elegant that my preference is just to ensure it is documented.
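To make the wide/long mixture concrete, here is a minimal sketch on a made-up four-case line list. The `toy` data frame and its values are hypothetical (not taken from {outbreaks}), and the sketch assumes {tidyr} is installed. It shows how the single outcome/date_of_outcome pair becomes one date column per outcome, which is the wide shape described above.

```r
# Hypothetical toy line list: `outcome` is the grouping column and
# `date_of_outcome` holds the date of whichever outcome occurred.
toy <- data.frame(
  case_id = c("a", "b", "c", "d"),
  date_of_onset = as.Date(
    c("2014-04-07", "2014-04-15", "2014-04-21", "2014-04-27")
  ),
  outcome = c("Death", "Recover", "Death", NA),
  date_of_outcome = as.Date(
    c("2014-04-19", "2014-04-30", "2014-05-07", NA)
  )
)

# Spread the long-style outcome column into one date column per outcome.
wide <- tidyr::pivot_wider(
  toy,
  names_from = outcome,
  values_from = date_of_outcome
)

# Still one row per case; the outcome dates now sit in separate columns.
names(wide)
```

Cases with a missing outcome end up in a column named `NA`, as in the reprex output earlier in the thread, and that column can simply be left out of date_index.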

As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

As an aside, I may also add a grates version to the examples to illustrate the different approaches.

Bisaloo · 2024-04-18T07:39:26Z

> As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

Couple of thoughts on this:

I've considered it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. Put differently, in most cases it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as information is lost when aggregating the data. But maybe this is just a terminology issue.
Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value in such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?
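The loss of information on aggregation can be illustrated with a minimal base R sketch. The `toy` line list below is made up for illustration, and stats::aggregate() stands in for whatever aggregation incidence() performs internally, which is not shown here.

```r
# Hypothetical toy line list with one row per case.
toy <- data.frame(
  case_id = c("a", "b", "c"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-07", "2014-04-15"))
)

# Aggregate to counts per onset date (a stand-in for the aggregation step).
counts <- aggregate(case_id ~ date_of_onset, data = toy, FUN = length)
names(counts)[names(counts) == "case_id"] <- "count"

# The counts keep the totals but not which case contributed to which date,
# so the original line list cannot be rebuilt from `counts` alone.
counts
```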

TimTaylor · 2024-04-18T08:16:10Z

> I've considered it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. Put differently, in most cases it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as information is lost when aggregating the data. But maybe this is just a terminology issue.

Interesting. I've always viewed as_ methods as (potentially lossy) casts (e.g. as.integer(1.5)). Alternatively you could make the incidence() function generic, but that feels less satisfying (although I cannot put my finger on why).
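The lossy-cast point can be checked directly in base R. This is only a small sketch of the as.integer(1.5) example mentioned above; the variable names are arbitrary.

```r
x <- 1.5

# as.integer() truncates toward zero, so the fractional part is dropped.
y <- as.integer(x)

# Casting back does not recover the original value: the round trip is lossy.
z <- as.numeric(y)
identical(z, x)
```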

> Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value in such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?

These are the interesting questions. I think it very much depends on what a typical input linelist (in the non-package sense) looks like. incidence only really handles wide, potentially aggregated, data as this was, in essence, what I inherited spec-wise. If data generally has this form with additional "outcome" and "date of outcome" columns we could perhaps adapt, but I'm loath to do so without more of a formal spec of a linelist.

An aside:
My gut feeling is that incidence2 is currently a package without a reason. I'm ok with this but think it's important to be open here. I tend to push people towards dplyr/data.table in combination with grates as this is more aligned with how I approach things. It could be useful if there were a range of methods (e.g. models for trend fitting) that were incorporated in incidence2 (calling functions from suggested packages) so people could easily go:

data -> incidence2 -> model fits by groups/counts.

but after 2 to 3 years I don't think there is a desire for this and unless a specific need comes up in ${DAYJOB} it's not something I'll push for.
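For comparison, the dplyr plus grates route mentioned in the aside might look roughly like the sketch below. It assumes {dplyr} and {grates} are installed, uses a made-up `toy` line list, and picks ISO weeks via grates::as_isoweek() purely as an example grouping; it is not taken from the package documentation.

```r
# Hypothetical toy line list with one row per case.
toy <- data.frame(
  case_id = c("a", "b", "c"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-09", "2014-04-15"))
)

# Convert onset dates to ISO weeks, then count cases per week.
weekly <- toy |>
  dplyr::mutate(week = grates::as_isoweek(date_of_onset)) |>
  dplyr::count(week)

weekly
```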

joshwlambert changed the title ~~Document how to handle grouped line list columsn~~ Document how to handle grouped line list columns Apr 17, 2024

TimTaylor added the documentation label Apr 17, 2024
