Skip to content

Commit

Permalink
wip vignette
Browse files Browse the repository at this point in the history
  • Loading branch information
mgyliu committed Aug 12, 2022
1 parent 59eba27 commit 1fd96bb
Showing 1 changed file with 108 additions and 30 deletions.
138 changes: 108 additions & 30 deletions vignettes/panel-data.Rmd
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
---
title: "Using epipredict on panel data from other contexts"
title: "Using epipredict on non-epidemic panel data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using epipredict on panel data from other contexts}
%\VignetteIndexEntry{Using epipredict on non-epidemic panel data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = F}
```{r setup, include=F}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
Expand All @@ -16,23 +16,22 @@ knitr::opts_chunk$set(
)
```

```{r load-data, include = F}
# TODO: temp - remove me when I figure out how to run this rmd without loading
# perhaps need to commit statcan_employ_subset.rda first?
devtools::load_all()
```

```{r libraries}
library(epiprocess)
library(epipredict)
library(dplyr)
library(stringr)
library(parsnip)
library(recipes)
```

[Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data, contains cross-sectional measurements of subjects over time. The `epipredict` package is most suitable for running forecasters on epidemiological data. However, other datasets with similar structures are also valid candidates for `epipredict` functionality.
[Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data,
contains cross-sectional measurements of subjects over time. The `epipredict`
package is most suitable for running forecasters on epidemiological data.
However, other datasets with similar structures are also valid candidates for
`epipredict` functionality.

```{r employ-stats, include = F}
```{r employ-stats, include=F}
date_format <- "%B %Y"
date_start <- format(as.Date(min(statcan_employ_subset$time_value)), date_format)
date_end <- format(as.Date(max(statcan_employ_subset$time_value)), date_format)
Expand All @@ -41,9 +40,19 @@ uniq_employee_type <- paste(unique(statcan_employ_subset$employee_type), collaps

## Example panel data overview

In this vignette, we will demonstrate using `epipredict` with employment data from Statistics Canada. We will be using [Table 14-10-0220-01: Employment and average weekly earnings (including overtime) for all employees by industry, monthly, seasonally adjusted, Canada](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410022001#data). The full dataset contains monthly employment counts from `r date_start` to `r date_end`, and presents employment data stratified by geographic region, [NAICS industries](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1181553), and employee type. The full dataset also contains metadata that describes the quality of data collected. For demonstration purposes, we make the following modifications to get a subset of the full dataset:

* Only keep level 1 industries (2-digit codes) in the [NAICS hierarchy](https://www23.statcan.gc.ca/imdb/pUtil.pl?Function=getNote&Id=1181553&NT=45) and remove aggregated industry codes.
In this vignette, we will demonstrate using `epipredict` with employment data from
Statistics Canada. We will be using
[Table 14-10-0220-01: Employment and average weekly earnings (including overtime) for all employees by industry, monthly, seasonally adjusted, Canada](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410022001#data).
The full dataset contains monthly employment counts from `r date_start` to `r date_end`,
and presents employment data stratified by geographic region,
[NAICS industries](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1181553),
and employee type. The full dataset also contains metadata that describes the quality
of data collected. For demonstration purposes, we make the following modifications to
get a subset of the full dataset:

* Only keep level 1 industries (2-digit codes) in the
[NAICS hierarchy](https://www23.statcan.gc.ca/imdb/pUtil.pl?Function=getNote&Id=1181553&NT=45)
and remove aggregated industry codes.
* Only keep province-level geographic region (the full data also has "Canada" as a region)
* Only keep "good" or better quality data rows, as indicated by the [`STATUS`](https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol) column

Expand Down Expand Up @@ -84,60 +93,129 @@ statcan_employ_subset_input <- statcan_employ %>%
mutate(naics_industry = factor(naics_industry))
```

To use this data with `epipredict`, we need to convert it into `epi_df` format using `as_epi_df` with additional keys. In our case, the additional keys are `employee_type` and `naics_industry`. Note that in the above modifications, we encoded `time_value` as type `tsibble::yearmonth`. This allows us to set `time_type` to `"yearmonth"` below, and to ensure lag and ahead modifications later on are using the correct time units.
To use this data with `epipredict`, we need to convert it into `epi_df` format using
`as_epi_df` with additional keys. In our case, the additional keys are `employee_type`
and `naics_industry`. Note that in the above modifications, we encoded `time_value`
as type `tsibble::yearmonth`. This allows us to set `time_type` to `"yearmonth"` below,
and to ensure lag and ahead modifications later on are using the correct time units.

```{r convert-to-epidf, eval=F}
statcan_employ_subset <- statcan_employ_subset_input %>%
tsibble::as_tsibble(
index=time_value,
key=c(geo_value, employee_type, naics_industry)) %>%
as_epi_df(time_type = "yearmonth", as_of="2022-07-22")
as_epi_df(
additional_metadata=c(other_keys=list("employee_type", "naics_industry")))
```

```{r data-dim, include=F}
employ_rowcount <- format(nrow(statcan_employ_subset), big.mark=",")
employ_colcount <- length(names(statcan_employ_subset))
```

The data contains `r employ_rowcount` rows and `r employ_colcount` columns. Now, we are ready to use `statcan_employ_subset` with `epipredict`.
The data contains `r employ_rowcount` rows and `r employ_colcount` columns. Now, we are
ready to use `statcan_employ_subset` with `epipredict`.

```{r preview-data, include=T}
head(statcan_employ_subset)
# Rename for simplicity
employ <- statcan_employ_subset
head(employ)
```

In the following sections, we will go over preprocessing the data in the `epi_recipe` framework, fitting 3 types of models from the `parsnip` package, and making future predictions.
In the following sections, we will go over preprocessing the data in the `epi_recipe`
framework, fitting 3 types of models from the `parsnip` package, and making future
predictions.

## Preprocessing

We will create a recipe that adds one `ahead` column and 3 `lag` columns.

```{r make-recipe, include = T}
r <- epi_recipe(statcan_employ_subset) %>%
```{r make-recipe, include=T}
r <- epi_recipe(employ) %>%
step_epi_ahead(ppl_count, ahead = 6) %>% # lag & ahead units in months
step_epi_lag(ppl_count, lag = c(0, 6, 12)) %>%
step_epi_naomit()
r
```

There is one `raw` role which includes our value column `ppl_count`, and two `key` roles which include our additional keys `employee_type` and `naics_industry`. Let's take a look at what these additional columns look like.
There is one `raw` role which includes our value column `ppl_count`, and two `key`
roles which include our additional keys `employee_type` and `naics_industry`. Let's
take a look at what these additional columns look like.

```{r view-preprocessed, include = T}
latest <- statcan_employ_subset %>% filter(time_value >= max(time_value) - 12)
```{r view-preprocessed, include=T}
# Display a sample of the preprocessed data
r %>% prep() %>% bake(new_data = latest) %>% sample_n(5)
baked_sample <- r %>% prep() %>% bake(new_data = employ) %>% sample_n(5)
baked_sample
```

## Model fitting and prediction

First we will look at a simple model: `parsnip::linear_reg()` with default engine `lm`. We can use `epi_workflow` with the above `epi_recipe` to fit a linear model using lags at time $t$ (current), $t-6$ months, and $t-12$ months.
```{r linearreg-wf, include = T}
wf_linreg <- epi_workflow(r, linear_reg()) %>% fit(statcan_employ_subset)
### With direct forecasters

Even though we aren't working with epidemiological data, the `arx_epi_forecaster`
still works out of the box. And specifying `args_list` still works in the same way.

```{r arx-epi, include=T, warning=F}
demo_args_list = arx_args_list(lags = c(0L, 6L, 12L), ahead = 6L)
out_lr <- arx_epi_forecaster(
employ,
outcome = "ppl_count",
predictors = c("ppl_count"),
args_list = demo_args_list)
out_lr$predictions
```

Other changes to the forecaster, like changing the engine, also work as expected.

```{r arx-epi-rf, include=T, warning=F}
out_rf <- arx_epi_forecaster(
employ,
outcome = "ppl_count",
predictors = c("ppl_count"),
trainer = parsnip::rand_forest(mode="regression", trees=100),
args_list = demo_args_list)
out_rf$predictions
```

### Within recipes framework

First we will look at a simple model: `parsnip::linear_reg()` with default engine
`lm`. We can use `epi_workflow` with the above `epi_recipe` to fit an autoregressive
linear model using lags at time $t$ (current), $t-6$ months, and $t-12$ months.

```{r linearreg-wf, include=T}
wf_linreg <- epi_workflow(r, parsnip::linear_reg()) %>% fit(employ)
wf_linreg
```

Now that we have our workflow, we can generate predictions from a subset of our data.
For this demo, we will predict the employment counts from the last 12 months of our dataset.

```{r linearreg-predict, include=T}
latest <- employ %>% filter(time_value >= max(time_value) - 15)
preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
# Display a sample of the prediction values
preds %>% sample_n(5)
```

<!-- TODO: ggplot the above predictions against the actual values -->
<!-- TODO: fit more complex AR model -->
Notice that `predict` returns an `epi_df` with all of the keys that were present in the original dataset.

```{r plot-pred, include=F}
# library(ggplot2)
# joined <- full_join(statcan_employ_subset, preds, by=c("geo_value", "time_value", "employee_type", "naics_industry"))
# joined %>% filter(!is.na(.pred)) %>% select(time_value) %>% unique()
# joined %>% dplyr::filter(
# geo_value %in% c("British Columbia", "Ontario") &
# naics_industry == "Real estate and rental and leasing [53]" &
# employee_type == "All employees") %>%
# ggplot() +
# geom_line(aes(x = time_value, y = ppl_count)) +
# geom_line(aes(x = time_value, y = preds)) +
# facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
# scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
# labs(x = "Date", y = "Number employed")
```

0 comments on commit 1fd96bb

Please sign in to comment.