Draft for JOSS publication (#190)
* fix compilation issues

* add table

* more content

* statement of need

* data wrangling example

* data transforms example

* add my orcid

* revise authors and title

* more edits

* reorganize intro

* reknit

* change title

* try to fit describe table

* reknit

* reshape_ -> data_to_

* update data_to_long args

* fix function name

* beautify tables

* Update paper/paper.md

Co-authored-by: Brenton M. Wiernik <[email protected]>

* Address Brenton's comments

* apply @bwiernik changes to Rmd (not only md)

* fix affiliation

* Address Etienne's comments

* To be consistent with the paper author order

@DominiqueMakowski Lemme if this is okay with you.

* Update paper/paper.Rmd

* Update paper/paper.Rmd

* Update paper/paper.Rmd

* Update paper/paper.Rmd

* Update paper/paper.Rmd

* Update paper/paper.Rmd

Co-authored-by: Etienne Bacher <[email protected]>
Co-authored-by: Dominique Makowski <[email protected]>
Co-authored-by: etiennebacher <[email protected]>
Co-authored-by: Brenton M. Wiernik <[email protected]>
5 people authored Aug 7, 2022
1 parent bcf7e3b commit ce95b3a
Showing 6 changed files with 1,502 additions and 14 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -3,12 +3,12 @@ Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.4.1.10
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
person("Dominique", "Makowski", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0001-5375-9967", Twitter = "@Dom_Makowski")),
person("Daniel", "Lüdecke", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0002-8895-3206", Twitter = "@strengejacke")),
person("Indrajeet", "Patil", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
person("Mattan S.", "Ben-Shachar", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0002-4287-4801")),
person("Brenton M.", "Wiernik", , "[email protected]", role = "aut",
128 changes: 116 additions & 12 deletions paper/paper.Rmd
@@ -1,35 +1,42 @@
---
title: "datawizard: An R Package for Easy Data Wrangling"
title: "datawizard: An R Package for Easy Data Preparation and Statistical Transformations"
tags:
- R
- easystats
authors:
- affiliation: 1
name: Dominique Makowski
orcid: 0000-0001-5375-9967
- affiliation: 2
name: Indrajeet Patil
orcid: 0000-0003-1995-6531
- affiliation: 2
name: Dominique Makowski
orcid: 0000-0001-5375-9967
- affiliation: 3
name: Mattan S. Ben-Shachar
orcid: 0000-0002-4287-4801
- affiliation: 4
name: Brenton M. Wiernik
name: Brenton M. Wiernik^[Brenton Wiernik is currently an independent researcher and Research Scientist at Meta, Demography and Survey Science. The current work was done in an independent capacity.]
orcid: 0000-0001-9560-6336
- affiliation: 5
name: Etienne Bacher
orcid: 0000-0002-9271-5075
- affiliation: 6
name: Daniel Lüdecke
orcid: 0000-0002-8895-3206

affiliations:
- index: 1
name: Nanyang Technological University, Singapore
name: esqLABS GmbH, Germany
- index: 2
name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
name: Nanyang Technological University, Singapore
- index: 3
name: Ben-Gurion University of the Negev, Israel
- index: 4
name: Department of Psychology, University of South Florida, USA
name: Independent Researcher
- index: 5
name: University Medical Center Hamburg-Eppendorf, Germany
name: Luxembourg Institute of Socio-Economic Research (LISER), Luxembourg
- index: 6
name: University Medical Center Hamburg-Eppendorf, Germany

date: "`r Sys.Date()`"
bibliography: paper.bib
output: rticles::joss_article
@@ -42,23 +49,120 @@ link-citations: yes
knitr::opts_chunk$set(
collapse = TRUE,
out.width = "100%",
dpi = 450,
dpi = 300,
comment = "#>",
message = FALSE,
warning = FALSE
)
library(datawizard)
set.seed(2016)
```

# Summary

The `{datawizard}` package for the R programming language [@base2021] provides a lightweight toolbox to assist in key steps involved in any data analysis workflow: (1) wrangling the raw data to get it in the needed form, (2) applying preprocessing steps and statistical transformations, and (3) computing statistical summaries of data properties and distributions. Therefore, it can be a valuable tool for R users and developers looking for a lightweight option for data preparation.

# Statement of Need

The `{datawizard}` package is part of `{easystats}`, a collection of R packages designed to make statistical analysis easier (@Ben-Shachar2020, @Lüdecke2020parameters, @Lüdecke2020performance, @Lüdecke2021see, @Lüdecke2019, @Makowski2019, @Makowski2020). As this ecosystem follows a "0-external-hard-dependency" policy, a data manipulation package that relies only on base R needed to be created. In effect, `{datawizard}` provides a data processing backend for this entire ecosystem.
In addition to its usefulness to the `{easystats}` ecosystem, it also provides *an* option for R users and package developers if they wish to keep their (recursive) dependency weight to a minimum (for other options, see @Dowle2021, @Eastwood2021, etc.).

Because `{datawizard}` is also meant to be used and adopted easily by a wide range of users, its workflow and syntax are designed to be similar to `{tidyverse}` (@Wickham2019), a widely used ecosystem of R packages. Thus, users familiar with the `{tidyverse}` can easily translate their knowledge and make full use of `{datawizard}`.

In addition to being a lightweight solution to clean messy data, `{datawizard}` also provides helpers for the other important step of data analysis: applying statistical transformations to the cleaned data while setting up statistical models. This includes various types of data standardization, normalization, rank-transformation, and adjustment. These transformations, although widely used, are not currently collectively implemented in a package in the R ecosystem, so `{datawizard}` can help new R users in finding the transformation they need.

Lastly, `{datawizard}` also provides a toolbox to create detailed summaries of data properties and distributions (e.g., tables of descriptive statistics for each variable). This is a common step in data analysis, but it is not available in base R or many modeling packages, so its inclusion makes `{datawizard}` a one-stop-shop for data preparation tasks.

# Features

## Data Preparation

Raw data is rarely in a state in which it can be fed directly into a statistical model. It often needs to be modified in various ways. For example, columns need to be renamed, certain portions of the data need to be filtered out, some columns need to be reshaped, data scattered across multiple tables needs to be joined, etc.

`{datawizard}` provides various functions for cleaning and preparing data (see Table 1).

| Function | Operation |
| :--------------- | :------------------------------------ |
| `data_filter()` | to select only certain *observations* |
| `data_select()` | to select only a few *variables* |
| `data_extract()` | to extract a single *variable* |
| `data_rename()` | to rename variables |
| `data_to_long()` | to convert data from wide to long |
| `data_to_wide()` | to convert data from long to wide |
| `data_join()` | to join two data frames |
| ... | ... |

Table: A few key functions offered by *datawizard* for data wrangling. For the full list, see the package website: <https://easystats.github.io/datawizard/>
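
To make the operations in Table 1 concrete, the sketch below shows rough base R counterparts for filtering, selecting, extracting, renaming, and joining. It is only an illustration of what each operation does, not how `{datawizard}` implements it; the toy data frames and their column names (`id`, `score`, `group`, `label`) are assumptions made up for this example.

```r
# Toy data; all names here are illustrative assumptions.
df <- data.frame(
  id = 1:4,
  score = c(10, 25, 40, 55),
  group = c("a", "b", "a", "b")
)
labels <- data.frame(
  group = c("a", "b"),
  label = c("control", "treatment")
)

# Filter: keep only certain observations (rows).
filtered <- df[df$score > 20, ]

# Select: keep only a few variables (columns).
selected <- df[, c("id", "score")]

# Extract: pull out a single variable as a vector.
extracted <- df[["score"]]

# Rename: change the name of one variable.
renamed <- df
names(renamed)[names(renamed) == "score"] <- "points"

# Join: combine two data frames on a shared key.
joined <- merge(df, labels, by = "group")
```

The `{datawizard}` functions wrap operations of this kind behind a consistent `data_*()` naming scheme, so the intent of each step is visible in the function name.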

We will look at one example function that converts data in wide format to tidy/long format:

```{r}
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:4,
  X = rnorm(5, 0, 1),
  Y = rnorm(5, 0, 2)
)
stocks
data_to_long(
  stocks,
  select = -c("time"),
  names_to = "stock",
  values_to = "price"
)
```
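
For readers who want to see what the wide-to-long conversion amounts to, here is a minimal base R sketch of the same reshaping. It uses fixed values instead of `rnorm()` so the result is reproducible, and the helper name `value_cols` is an assumption for this illustration. The row order of the result may differ from what `data_to_long()` returns.

```r
# Wide data: one row per date, one column per stock (fixed illustrative values).
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = c(0.5, -1.2, 0.3, 1.1, -0.4),
  Y = c(2.0, -0.6, 1.4, 0.2, -2.2)
)

# Every column except the identifier is stacked into a single value column.
value_cols <- setdiff(names(stocks), "time")

long <- data.frame(
  # Repeat the identifier once per stacked column.
  time = rep(stocks$time, times = length(value_cols)),
  # Record which original column each value came from.
  stock = rep(value_cols, each = nrow(stocks)),
  # Stack the values column by column (all of X, then all of Y).
  price = unlist(stocks[value_cols], use.names = FALSE)
)

long
```

Each of the 5 dates now appears once per stock, giving 10 rows with the columns `time`, `stock`, and `price`.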

## Statistical Transformations

Even after getting the raw data in the needed format, we may need to transform certain variables further to meet requirements imposed by a statistical test.

`{datawizard}` provides a rich collection of such functions for transforming variables (see Table 2).

| Function | Operation |
| :---------------- | :------------------------------------------- |
| `standardize()` | to center and scale data |
| `normalize()` | to scale variables to 0-1 range |
| `adjust()` | to adjust data for effect of other variables |
| `slide()` | to shift numeric value range |
| `ranktransform()` | to convert numeric values to integer ranks |
| ... | ... |

Table: A few key functions offered by *datawizard* for data transformations. For the full list, see the package website: <https://easystats.github.io/datawizard/>
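
As a rough guide to two of the transformations in Table 2, the base R sketch below computes a 0-1 rescaling and a rank transformation by hand. It illustrates the underlying arithmetic only, under the assumption of the textbook min-max formula; the package functions offer further options and may handle edge cases (ties, missing values, bounds) differently.

```r
x <- c(3, 7, 1, 9, 5)

# Min-max rescaling: the smallest value maps to 0 and the largest to 1.
normalized <- (x - min(x)) / (max(x) - min(x))

# Rank transformation: replace each value by its position in sorted order.
ranks <- rank(x)

normalized
ranks
```

Here `normalized` is `0.25 0.75 0.00 1.00 0.50` and `ranks` is `2 4 1 5 3`, since the input has no ties.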

We will look at one example function that standardizes (i.e. centers and scales) data so that it can be expressed in terms of standard deviation:

```{r}
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)
standardize(d, center = c(3, 4), scale = c(2, 4))
```
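
The arithmetic behind this call can be sketched in base R. The version below assumes that user-supplied `center` and `scale` values are applied column by column as `(x - center) / scale`; the object names `centers`, `scales`, `manual`, and `default_z` are made up for this illustration. For comparison, it also computes the conventional z-score that uses each column's own mean and standard deviation.

```r
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

# One centering and one scaling value per column (same values as above).
centers <- c(a = 3, b = 4)
scales <- c(a = 2, b = 4)

# Subtract the center and divide by the scale, column by column.
manual <- as.data.frame(
  Map(function(x, m, s) (x - m) / s, d, centers, scales)
)

# Conventional z-scores: center on the mean, scale by the standard deviation.
default_z <- as.data.frame(
  lapply(d, function(x) (x - mean(x)) / sd(x))
)

manual
default_z
```

With the supplied values, column `a` becomes `-2.5, -2, -1.5, -1, -0.5` and column `b` becomes `-0.25, 0, 0.25, 0.5, 0.75`, whereas the z-scored columns each have mean 0 and standard deviation 1.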

## Summaries of Data Properties and Distributions

The workhorse function to get a comprehensive summary of data properties is `describe_distribution()`, which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, kurtosis, etc.) computed by other functions in `{datawizard}`.

```{r eval=FALSE}
describe_distribution(mtcars)
```

```{r echo=FALSE, eval=TRUE, results="asis"}
library(kableExtra)
options(digits = 3)
kbl(describe_distribution(mtcars), format = "latex", booktabs = TRUE, linesep = "") |>
  kable_styling(latex_options = "scale_down")
```
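
To give a sense of the kind of per-variable table this produces, the base R sketch below computes a handful of the same indices (mean, standard deviation, interquartile range, range, and sample size) for every column of `mtcars`. The helper `summarise_column()` is a hypothetical name introduced here, and the sketch covers only a subset of what `describe_distribution()` reports; individual values may differ slightly depending on how each index is defined.

```r
# Hypothetical helper: a few descriptive indices for one numeric vector.
summarise_column <- function(x) {
  c(
    Mean = mean(x),
    SD = sd(x),
    IQR = IQR(x),
    Min = min(x),
    Max = max(x),
    n = length(x)
  )
}

# Apply the helper to every column; transpose so each variable is a row.
summary_table <- t(sapply(mtcars, summarise_column))

round(summary_table, 2)
```

The result is a matrix with one row per variable in `mtcars` and one column per index, which is the same layout as a descriptive statistics table.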

# Licensing and Availability

*see* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
`{datawizard}` is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored on GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.

# Acknowledgments

*see* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
`{datawizard}` is part of the collaborative [*easystats*](https://easystats.github.io/easystats/) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.

# References
26 changes: 26 additions & 0 deletions paper/paper.bib
@@ -23,6 +23,17 @@ @Article{Lüdecke2020parameters
pages = {2445},
}

@Article{Lüdecke2021see,
title = {{see}: An {R} Package for Visualizing Statistical Models},
author = {Daniel Lüdecke and Indrajeet Patil and Mattan S. Ben-Shachar and Brenton M. Wiernik and Philip Waggoner and Dominique Makowski},
journal = {Journal of Open Source Software},
year = {2021},
volume = {6},
number = {64},
pages = {3393},
doi = {10.21105/joss.03393},
}

@Article{Lüdecke2020performance,
title = {{performance}: An {R} Package for Assessment, Comparison and Testing of Statistical Models},
author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Philip Waggoner and Dominique Makowski},
@@ -112,3 +123,18 @@ @Manual{base2021
url = {https://www.R-project.org/},
}

@Manual{Eastwood2021,
title = {poorman: A Poor Man's Dependency Free Recreation of 'dplyr'},
author = {Nathan Eastwood},
year = {2021},
note = {R package version 0.2.5},
url = {https://CRAN.R-project.org/package=poorman},
}

@Manual{Dowle2021,
title = {data.table: Extension of `data.frame`},
author = {Matt Dowle and Arun Srinivasan},
year = {2021},
note = {R package version 1.14.2},
url = {https://CRAN.R-project.org/package=data.table},
}