README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "70%"
)
set.seed(843)
```

# `tidyfast v0.4.0` <img src="man/figures/tidyfast_hex.png" align="right" width="30%" height="30%" />

<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/tidyfast)](https://CRAN.R-project.org/package=tidyfast)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[![Codecov test coverage](https://codecov.io/gh/TysonStanley/tidyfast/branch/master/graph/badge.svg)](https://app.codecov.io/gh/TysonStanley/tidyfast?branch=master)
![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidyfast)
[![R-CMD-check](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/TysonStanley/tidyfast/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

**Note: The expansion of `dtplyr` has made some of the functionality in `tidyfast` redundant. See `dtplyr` for a list of functions that are handled within that framework.**

The goal of `tidyfast` is to provide fast and efficient alternatives to some `tidyr` (and a few `dplyr`) functions using `data.table` under the hood. Each have the prefix of `dt_` to allow for autocomplete in IDEs such as RStudio. These should compliment some of the current functionality in `dtplyr` (but notably does not use the `lazy_dt()` framework of `dtplyr`). This package imports `data.table` and `cpp11` (no other dependencies).

These are, in essence, translations from a more `tidyverse` grammar to `data.table`. Most functions herein are in places where, in my opinion, the `data.table` syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of `data.table`.

The current functions include:

**Nesting and unnesting** (similar to `dplyr::group_nest()` and `tidyr::unnest()`):

- `dt_nest()` for nesting data tables
- `dt_unnest()` for unnesting data tables
- `dt_hoist()` for unnesting vectors in a list-column in a data table

**Pivoting** (similar to `tidyr::pivot_longer()` and `tidyr::pivot_wider()`)

- `dt_pivot_longer()` for fast pivoting using `data.table::melt()`
- `dt_pivot_wider()` for fast pivoting using `data.table::dcast()`

**If Else** (similar to `dplyr::case_when()`):

- `dt_case_when()` for `dplyr::case_when()` syntax with the speed of `data.table::fifelse()`

**Fill** (similar to `tidyr::fill()`)

- `dt_fill()` for filling `NA` values with values before it, after it, or both. This can be done by a grouping variable (e.g. fill in `NA` values with values within an individual).

**Count** and **Uncount** (similar to `tidyr::uncount()` and `dplyr::count()`)

- `dt_count()` for fast counting by group(s)
- `dt_uncount()` for creating full data from a count table

**Separate** (similar to `tidyr::separate()`)

- `dt_separate()` for splitting a single column into multiple based on a match within the column (e.g., column with values like "A.B" could be split into two columns by using the period as the separator where column 1 would have "A" and 2 would have "B"). It is built on `data.table::tstrsplit()`. This is not well tested yet and lacks some functionality of `tidyr::separate()`.

**Adjust `data.table` print options**

- `dt_print_options()` for adjusting the options for `print.data.table()`


## General API

`tidyfast` attempts to convert syntax from `tidyr` with its accompanying grammar to `data.table` function calls. As such, we have tried to maintain the `tidyr` syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in `tidyr` may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.

Each function that takes data (labeled as `dt_` in the package docs) as its first argument automatically coerces it to a data table with `as.data.table()` if it isn't already a data table. Each of these functions will return a data table.


## Installation

You can install the stable version from CRAN with:

``` r
install.packages("tidyfast")
```

or you can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("TysonStanley/tidyfast")
```

```{r, echo=FALSE}
devtools::load_all(here::here())
```


## Examples

The initial versions of the nesting and unnesting functions were shown in a [preprint](https://osf.io/preprints/psyarxiv/u8ekc/). Herein is shown some simple applications and the functions' speed/efficiency.

```{r, eval=FALSE}
library(tidyfast)
```


### Nesting and Unnesting

The following data table will be used for the nesting/unnesting examples.

```{r, message = FALSE, warning = FALSE}
set.seed(84322)

library(data.table)
library(dplyr)       # to compare with case_when()
library(tidyr)       # to compare with fill() and separate()
library(ggplot2)     # figures
library(ggbeeswarm)  # figures

dt <- data.table(
   x = rnorm(1e5),
   y = runif(1e5),
   grp = sample(1L:5L, 1e5, replace = TRUE),
   nested1 = lapply(1:10, sample, 10, replace = TRUE),
   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
   id = 1:1e5)
```

To make all the comparisons herein more equal, we will set the number of threads that `data.table` will use to 1.

```{r}
setDTthreads(1)
```

We can nest this data using `dt_nest()`:

```{r}
nested <- dt_nest(dt, grp)
nested
```

We can also unnest this with `dt_unnest()`:

```{r}
dt_unnest(nested, col = data)
```

When our list columns don't have data tables (as output from `dt_nest()`) we can use the `dt_hoist()` function, that will unnest vectors. It keeps all the other variables that are not list-columns as well.

```{r}
dt_hoist(dt, nested1, nested2)
```

Speed comparisons (similar to those shown in the preprint) are highlighted below. Notably, the timings are without the `nested1` and `nested2` columns of the original `dt` object from above. Also, all `dplyr` and `tidyr` functions use a `tbl` version of the `dt` table.

```{r, echo = FALSE, fig.width=4, fig.height=8, dpi=300, warning=FALSE, message=FALSE}
tbl <- as_tibble(dt) %>% select(x, y, id, grp)
dt2 <- dt[, .(x,y,id,grp)]
nesting <- bench::mark(
  nested1 <- dt_nest(dt2, grp),
  group_nest(tbl, grp),
  check = FALSE,
  iterations = 50) %>% 
  mutate(expression = c("dt_nest", "group_nest"))
nested_tbl <- as_tibble(nested1)
unnesting <- bench::mark(
  dt_unnest(nested1, data),
  unnest(nested_tbl, data),
  check = FALSE,
  iterations = 50) %>% 
  mutate(expression = c("dt_unnest", "unnest"))

nest_unnest <- bind_rows(nesting, unnesting)

theme_set(
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(linetype = "dashed"),
        panel.grid.major.x = element_blank(),
        legend.position = "none")
)


library(ggplot2)
as.data.table(nest_unnest$time) %>% 
  setNames(c("dt_nest", "group_nest", "dt_unnest", "unnest")) %>% 
  dt_pivot_longer(cols = dt_nest:unnest, names_to = "expression", values_to = "time") %>% 
  .[, type := dt_case_when(stringr::str_detect(expression, "unnest") ~ "Unnesting",
                           TRUE ~ "Nesting")] %>% 
  .[, time := as.numeric(time)] %>% 
  ggplot(aes(expression, time, color = expression)) +
  ggbeeswarm::geom_beeswarm(alpha = .6) +
  labs(x = "",
       y = "Time (seconds)") +
  facet_wrap(type~.,  scales = "free", ncol = 1) +
  scale_color_viridis_d(option = "plasma", end = .8) +
  scale_y_log10() +
  NULL

select(nesting, expression, median, mem_alloc)
select(unnesting, expression, median, mem_alloc)
```


## Pivoting

Thanks to [@markfairbanks](https://github.com/markfairbanks), we now have pivoting translations to `data.table::melt()` and `data.table::dcast()`. Consider the following example (similar to the example in `tidyr::pivot_longer()` and `tidyr::pivot_wider()`):

```{r}
billboard <- tidyr::billboard

# note the warning - melt is telling us what 
#   it did with the various data types---logical (where there were just NAs
#   and numeric
longer <- billboard %>%
  dt_pivot_longer(
     cols = c(-artist, -track, -date.entered),
     names_to = "week",
     values_to = "rank"
  )
longer

wider <- longer %>% 
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )
wider[, .(artist, track, wk1, wk2)]
```

Notably, there are some current limitations to these: 1) `tidyselect` techniques do not work across the board (e.g. cannot use `start_with()` and friends) and 2) the functions are new and likely prone to edge-case bugs.

But let's compare some basic speed and efficiency. Because of the `data.table` functions, these are extremely fast and efficient.

```{r first_pivot, echo = FALSE, warning=FALSE, message=FALSE}
bill_dt <- as.data.table(billboard)
longer_timings <- bench::mark(
  dt_pivot_longer = dt_pivot_longer(bill_dt, cols = c(-artist, -track, -date.entered), 
                                    names_to = "week", names_prefix = "wk", values_to = "rank"),
  pivot_longer = pivot_longer(billboard, cols = c(-artist, -track, -date.entered), 
                              names_to = "week", names_prefix = "wk", values_to = "rank"),
  check = FALSE,
  iterations = 40)
```

```{r second_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
longer_tbl <- as_tibble(longer)
wider_timings <- bench::mark(
  dt_pivot_wider = dt_pivot_wider(longer, names_from = week, values_from = rank),
  pivot_wider = pivot_wider(longer_tbl, names_from = week, values_from = rank),
  check = FALSE,
  iterations = 40)
```


```{r third_pivot, echo = FALSE, fig.width=5, fig.height=4, dpi=300, warning=FALSE, message=FALSE}
pivot_timings <- rbind(longer_timings, wider_timings) %>% 
  mutate(type = c("longer", "longer", "wider", "wider")) %>% 
  mutate(expression = as.character(expression))

pivot_timings %>%
  dt_hoist(time) %>% 
  mutate(time = lubridate::seconds(time)) %>% 
  filter(type == "longer") %>% 
  ggplot(aes(x = expression, 
             y = time,
             color = expression)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
  scale_color_viridis_d(option = "plasma", end = .8) +
  scale_y_log10() +
  facet_grid(~type, space = "free", scales = "free")

pivot_timings %>%
  dt_hoist(time) %>% 
  mutate(time = lubridate::seconds(time)) %>% 
  filter(type == "wider") %>% 
  ggplot(aes(x = expression, 
             y = time,
             color = expression)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
  scale_color_viridis_d(option = "plasma", end = .8) +
  scale_y_log10() +
  facet_grid(~type, space = "free", scales = "free")

pivot_timings %>% 
  select(expression, median, mem_alloc)
```


### If Else

Also, the new `dt_case_when()` function is built on the very fast `data.table::fiflese()` but has syntax like unto `dplyr::case_when()`. That is, it looks like:

```{r, eval = FALSE}
dt_case_when(condition1 ~ label1,
             condition2 ~ label2,
             ...)
```

To show that each method, `dt_case_when()`, `dplyr::case_when()`, and `data.table::fifelse()` produce the same result, consider the following example. 

```{r}
x <- rnorm(1e6)

medianx <- median(x)
x_cat <-
  dt_case_when(x < medianx ~ "low",
               x >= medianx ~ "high",
               is.na(x) ~ "unknown")
x_cat_dplyr <-
  case_when(x < medianx ~ "low",
            x >= medianx ~ "high",
            is.na(x) ~ "unknown")
x_cat_fif <-
  fifelse(x < medianx, "low",
  fifelse(x >= medianx, "high",
  fifelse(is.na(x), "unknown", NA_character_)))

identical(x_cat, x_cat_dplyr)
identical(x_cat, x_cat_fif)
```

Notably, `dt_case_when()` is very fast and memory efficient, given it is built on `data.table::fifelse()`.

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.width=6, fig.height=5, dpi=300}
marks <-
  bench::mark(dt_case_when(x < medianx ~ "low",
                           x >= medianx ~ "high",
                           is.na(x) ~ "unknown"),
              case_when(x < medianx ~ "low",
                        x >= medianx ~ "high",
                        is.na(x) ~ "unknown"),
              fifelse(x < medianx, "low",
              fifelse(x >= medianx, "high",
              fifelse(is.na(x), "unknown", NA_character_))),
              iterations = 50)

library(ggbeeswarm)  # for the speed comparison plot

marks$time %>% 
  setNames(c("dt_case_when", "case_when", "fifelse")) %>% 
  data.frame() %>% 
  tidyr::gather() %>% 
  mutate(value = as.numeric(value)) %>% 
  ggplot(aes(key, value, color = key)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
    scale_color_viridis_d(option = "plasma", end = .8) +
    scale_y_log10()

marks %>% 
  select(expression, median, mem_alloc) %>% 
  mutate(expression = c("dt_case_when", "case_when", "fifelse")) %>% 
  arrange(expression)
```


## Fill

A new function is `dt_fill()`, which fulfills the role of `tidyr::fill()` to fill in `NA` values with values around it (either the value above, below, or trying both). This currently relies on the efficient `C++` code from `tidyr` (`fillUp()` and `fillDown()`).

```{r}
x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE)
)

# Original
dt_with_nas

# All defaults
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)

# by id variable called `grp`
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id))

# both down and then up filling by group
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id), 
        .direction = "downup")
```

In its current form, `dt_fill()` is faster than `tidyr::fill()` and uses slightly less memory. Below are the results of filling in the `NA`s within each `id` on a 19 MB data set.

```{r}
x = 1:1e6
dt3 <- data.table(
  x = x,
  y = shift(x, 10L),
  z = shift(x, -10L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE))
df3 <- data.frame(dt3)

marks3 <-
  bench::mark(
    tidyr::fill(dplyr::group_by(df3, id), x, y),
    tidyfast::dt_fill(dt3, x, y, id = list(id)),
    check = FALSE,
    iterations = 50
  )
```


```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks3$time %>% 
  setNames(c("fill", "dt_fill")) %>% 
  data.frame() %>% 
  tidyr::gather() %>% 
  mutate(value = as.numeric(value)) %>% 
  ggplot(aes(key, value, color = key)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
  scale_color_viridis_d(end = .8) +
    scale_y_log10()

marks3 %>% 
  select(expression, median, mem_alloc)
```


## Separate

The `dt_separate()` function is still under heavy development. Its behavior is similar to `tidyr::separate()` but is lacking some functionality currently. For example, `into` needs to be supplied the maximum number of possible columns to separate.

```{r, eval = FALSE}
dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))
#> Error in `[.data.table`(dt, , eval(split_it)) : 
#>   Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.
```

For current functionality, consider the following example. 

```{r}
dt_to_split <- data.table(
  x = paste(letters, LETTERS, sep = ".")
)

dt_separate(dt_to_split, x, into = c("lower", "upper"))
```

```{r, echo = FALSE}
head(dt_separate(dt_to_split, x, into = c("lower", "upper")))
```


Testing with a 4 MB data set with one variable that has columns of "A.B" repeatedly, shows that `dt_separate()` is fast and far more memory efficient compared to `tidyr::separate()`.

```{r, echo = FALSE, warning = FALSE}
dt4 <- data.table(
  col = paste(rep("A", 5e5), rep("B", 5e5), sep = ".")
)
df4 <- data.frame(dt4)

marks4 <-
  bench::mark(
    tidyr::separate(df4, col, into = c("first", "second"), sep = "\\."),
    tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", remove = FALSE),
    tidyfast::dt_separate(dt4, col, into = c("first", "second"), sep = ".", immutable = FALSE, remove = FALSE),
    check = FALSE,
    iterations = 25
  )
```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks4$time %>% 
  setNames(c("separate", "dt_separate", "dt_separate-mutable")) %>% 
  data.frame() %>% 
  tidyr::gather() %>% 
  mutate(value = as.numeric(value)) %>% 
  ggplot(aes(key, value, color = key)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
  scale_color_viridis_d(end = .8) +
  scale_y_log10()

marks4 %>% 
  mutate(expression = c("separate", "dt_separate", "dt_separate-mutable")) %>% 
  select(expression, median, mem_alloc)
```


## Count and Uncount

The `dt_count()` function does essentially what `dplyr::count()` does. Notably, this, unlike the majority of other `dt_` functions, wraps a very simple statement in `data.table`. That is, `data.table` makes getting counts very simple and concise. Nonetheless, `dt_count()` fits the general API of `tidyfast`. To some degree, `dt_uncount()` is also a fairly simple wrapper, although the approach may not be as straightforward as that for `dt_count()`.

The following examples show how count and uncount can work. We'll use the `dt` data table from the nesting examples.

```{r}
counted <- dt_count(dt, grp)
counted
```

```{r}
uncounted <- dt_uncount(counted, N)
uncounted[]
```

These are also quick (not that the `tidyverse` functions were at all slow here).

```{r}
dt5 <- copy(dt)
df5 <- data.frame(dt5)

marks5 <-
  bench::mark(
    counted_tbl <- dplyr::count(df5, grp),
    counted_dt <- tidyfast::dt_count(dt5, grp),
    tidyr::uncount(counted_tbl, n),
    tidyfast::dt_uncount(counted_dt, N),
    check = FALSE,
    iterations = 25
  )
```

```{r, echo = FALSE, fig.width=6, fig.height=5, dpi=300}
marks5$time %>% 
  setNames(c("count", "dt_count", "uncount", "dt_uncount")) %>% 
  data.frame() %>% 
  tidyr::gather() %>% 
  mutate(value = as.numeric(value)) %>% 
  mutate(type = dt_case_when(stringr::str_detect(key, "uncount") ~ "Uncounting",
                             TRUE ~ "Counting")) %>% 
  ggplot(aes(key, value, color = key)) +
  ggbeeswarm::geom_beeswarm() +
  labs(x = "",
       y = "Time (seconds)") +
  scale_color_viridis_d(option = "plasma", end = .8) +
  facet_grid(~type, space = "free", scales = "free") +
  scale_y_log10()
```


## Notes

Please note that the `tidyfast` project is released with a [Contributor Code of Conduct](https://github.com/TysonStanley/tidyfast/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

We want to thank our wonderful contributors:

- [markfairbanks](https://github.com/markfairbanks) for PR #6 providing initial the pivoting functions. Note the [`tidytable`](https://github.com/markfairbanks/tidytable) package that compliments some of `tidyfast`s functionality.


**Complementary Packages:**

- [`dtplyr`](https://dtplyr.tidyverse.org)
- [`tidytable`](https://github.com/markfairbanks/tidytable)
- [`maditr`](https://github.com/gdemin/maditr)