Draft for JOSS publication #190

Merged
merged 43 commits into from
Aug 7, 2022
304be58
fix compilation issues
IndrajeetPatil Jul 3, 2022
651b537
add table
IndrajeetPatil Jul 3, 2022
ad884d9
more content
IndrajeetPatil Jul 3, 2022
43e7b74
statement of need
IndrajeetPatil Jul 4, 2022
72213cb
data wrangling example
IndrajeetPatil Jul 4, 2022
4861c42
data transforms example
IndrajeetPatil Jul 4, 2022
8c46cb5
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 4, 2022
8f0d352
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 5, 2022
c504de4
add my orcid
etiennebacher Jul 5, 2022
4133e22
revise authors and title
IndrajeetPatil Jul 5, 2022
1a834de
Merge branch 'master' into 59_joss_paper
IndrajeetPatil Jul 5, 2022
d8aebfe
more edits
IndrajeetPatil Jul 5, 2022
c45a736
reorganize intro
DominiqueMakowski Jul 6, 2022
92c6ab2
reknit
IndrajeetPatil Jul 6, 2022
8126298
change title
IndrajeetPatil Jul 9, 2022
2023f1f
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 10, 2022
9061d38
try to fit describe table
etiennebacher Jul 11, 2022
a408c5b
reknit
IndrajeetPatil Jul 11, 2022
3efdbf1
reshape_ -> data_to_
etiennebacher Jul 11, 2022
400bc47
update data_to_long args
etiennebacher Jul 11, 2022
5c60cfe
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 12, 2022
fd58a47
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 15, 2022
03064af
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 18, 2022
8590a76
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 22, 2022
e143892
fix function name
IndrajeetPatil Jul 22, 2022
c85acc1
beautify tables
IndrajeetPatil Jul 23, 2022
8166e1b
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
256de0e
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
b6a7d9f
Update paper/paper.md
IndrajeetPatil Jul 26, 2022
32e0902
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 26, 2022
de7362a
Address Brenton's comments
IndrajeetPatil Jul 26, 2022
a7ddca4
apply @bwiernik changes to Rmd (not only md)
etiennebacher Jul 26, 2022
33780a3
fix affiliation
etiennebacher Jul 27, 2022
e3757ea
Address Etienne's comments
IndrajeetPatil Jul 27, 2022
39d937c
To be consistent with the paper author order
IndrajeetPatil Jul 27, 2022
c681116
Update paper/paper.Rmd
bwiernik Jul 28, 2022
66c98d9
Update paper/paper.Rmd
bwiernik Jul 28, 2022
c7b2f25
Update paper/paper.Rmd
bwiernik Jul 28, 2022
923e820
Update paper/paper.Rmd
bwiernik Jul 28, 2022
15eee5d
Update paper/paper.Rmd
bwiernik Jul 28, 2022
20d87c1
Update paper/paper.Rmd
bwiernik Jul 28, 2022
eb6c0d1
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 28, 2022
7864ddd
Merge branch 'main' into 59_joss_paper
IndrajeetPatil Jul 30, 2022
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -3,12 +3,12 @@ Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.4.1.10
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
person("Dominique", "Makowski", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0001-5375-9967", Twitter = "@Dom_Makowski")),
person("Daniel", "Lüdecke", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0002-8895-3206", Twitter = "@strengejacke")),
person("Indrajeet", "Patil", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
person("Mattan S.", "Ben-Shachar", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0002-4287-4801")),
person("Brenton M.", "Wiernik", , "[email protected]", role = "aut",
128 changes: 116 additions & 12 deletions paper/paper.Rmd
@@ -1,35 +1,42 @@
---
title: "datawizard: An R Package for Easy Data Wrangling"
title: "datawizard: An R Package for Easy Data Preparation and Statistical Transformations"
tags:
- R
- easystats
authors:
- affiliation: 1
name: Dominique Makowski
orcid: 0000-0001-5375-9967
- affiliation: 2
name: Indrajeet Patil
orcid: 0000-0003-1995-6531
- affiliation: 2
name: Dominique Makowski
orcid: 0000-0001-5375-9967
- affiliation: 3
name: Mattan S. Ben-Shachar
orcid: 0000-0002-4287-4801
- affiliation: 4
name: Brenton M. Wiernik
name: Brenton M. Wiernik^[Brenton Wiernik is currently an independent researcher and Research Scientist at Meta, Demography and Survey Science. The current work was done in an independent capacity.]
orcid: 0000-0001-9560-6336
- affiliation: 5
name: Etienne Bacher
orcid: 0000-0002-9271-5075
- affiliation: 6
name: Daniel Lüdecke
orcid: 0000-0002-8895-3206

affiliations:
- index: 1
name: Nanyang Technological University, Singapore
name: esqLABS GmbH, Germany
- index: 2
name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
name: Nanyang Technological University, Singapore
- index: 3
name: Ben-Gurion University of the Negev, Israel
- index: 4
name: Department of Psychology, University of South Florida, USA
name: Independent Researcher
- index: 5
name: University Medical Center Hamburg-Eppendorf, Germany
name: Luxembourg Institute of Socio-Economic Research (LISER), Luxembourg
- index: 6
name: University Medical Center Hamburg-Eppendorf, Germany

date: "`r Sys.Date()`"
bibliography: paper.bib
output: rticles::joss_article
@@ -42,23 +49,120 @@ link-citations: yes
knitr::opts_chunk$set(
collapse = TRUE,
out.width = "100%",
dpi = 450,
dpi = 300,
comment = "#>",
message = FALSE,
warning = FALSE
)

library(datawizard)
set.seed(2016)
```

# Summary

The `{datawizard}` package for the R programming language [@base2021] provides a lightweight toolbox to assist in key steps involved in any data analysis workflow: (1) wrangling the raw data to get it in the needed form, (2) applying preprocessing steps and statistical transformations, and (3) computing statistical summaries of data properties and distributions. Therefore, it can be a valuable tool for R users and developers looking for a lightweight option for data preparation.

# Statement of Need

The `{datawizard}` package is part of `{easystats}`, a collection of R packages designed to make statistical analysis easier (@Ben-Shachar2020, @Lüdecke2020parameters, @Lüdecke2020performance, @Lüdecke2021see, @Lüdecke2019, @Makowski2019, @Makowski2020). As this ecosystem follows a "0-external-hard-dependency" policy, a data manipulation package that relies only on base R needed to be created. In effect, `{datawizard}` provides the data processing backend for this entire ecosystem.
In addition to its usefulness to the `{easystats}` ecosystem, it also provides *an* option for R users and package developers who wish to keep their (recursive) dependency weight to a minimum (for other options, see @Dowle2021, @Eastwood2021, etc.).

Because `{datawizard}` is also meant to be used and adopted easily by a wide range of users, its workflow and syntax are designed to be similar to `{tidyverse}` (@Wickham2019), a widely used ecosystem of R packages. Thus, users familiar with the `{tidyverse}` can easily translate their knowledge and make full use of `{datawizard}`.

In addition to being a lightweight solution to clean messy data, `{datawizard}` also provides helpers for the other important step of data analysis: applying statistical transformations to the cleaned data while setting up statistical models. This includes various types of data standardization, normalization, rank-transformation, and adjustment. These transformations, although widely used, are not currently collectively implemented in a package in the R ecosystem, so `{datawizard}` can help new R users in finding the transformation they need.

Lastly, `{datawizard}` also provides a toolbox to create detailed summaries of data properties and distributions (e.g., tables of descriptive statistics for each variable). This is a common step in data analysis, but it is not available in base R or many modeling packages, so its inclusion makes `{datawizard}` a one-stop-shop for data preparation tasks.

# Features

## Data Preparation

Raw data is rarely in a state where it can be fed directly into a statistical model. It often needs to be modified in various ways. For example, columns need to be renamed, certain portions of the data need to be filtered out, some columns need to be reshaped, data scattered across multiple tables needs to be joined, etc.

`{datawizard}` provides various functions for cleaning and preparing data (see Table 1).

| Function | Operation |
| :--------------- | :------------------------------------ |
| `data_filter()` | to select only certain *observations* |
| `data_select()` | to select only a few *variables* |
| `data_extract()` | to extract a single *variable* |
| `data_rename()` | to rename variables |
| `data_to_long()` | to convert data from wide to long |
| `data_to_wide()` | to convert data from long to wide |
| `data_join()` | to join two data frames |
| ... | ... |

Table: The table below lists a few key functions offered by *datawizard* for data wrangling. To see the full list, see the package website: <https://easystats.github.io/datawizard/>

Review comment (Member, Author): I am deliberately not being comprehensive because we might add more functions in the future and ... cover all of those.


We will look at one example function that converts data in wide format to tidy/long format:

```{r}
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:4,
X = rnorm(5, 0, 1),
Y = rnorm(5, 0, 2)
)

stocks

data_to_long(
stocks,
select = -c("time"),
names_to = "stock",
values_to = "price"
)
```
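
To make the reshaping step concrete, the same wide-to-long conversion can be sketched in base R alone. This is only an illustration of the idea, not how `data_to_long()` is implemented; the names `value_cols` and `long` are made up for this sketch, and the row order of the result may differ from what `data_to_long()` returns.

```r
set.seed(2016)

stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = rnorm(5, 0, 1),
  Y = rnorm(5, 0, 2)
)

# Columns to be stacked: everything except the identifier column `time`
value_cols <- setdiff(names(stocks), "time")

# Build one (time, stock, price) block per value column, then bind the blocks
long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    time = stocks$time,
    stock = col,
    price = stocks[[col]],
    stringsAsFactors = FALSE
  )
}))

long
```

Each of the five dates now appears once per stock, so the result has ten rows and the three columns `time`, `stock`, and `price`.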

## Statistical Transformations

Even after getting the raw data in the needed format, we may need to transform certain variables further to meet requirements imposed by a statistical test.

`{datawizard}` provides a rich collection of such functions for transforming variables (see Table 2).

| Function | Operation |
| :---------------- | :------------------------------------------- |
| `standardize()` | to center and scale data |
| `normalize()` | to scale variables to 0-1 range |
| `adjust()` | to adjust data for effect of other variables |
| `slide()` | to shift numeric value range |
| `ranktransform()` | to convert numeric values to integer ranks |
| ... | ... |

Table: The table below lists a few key functions offered by *datawizard* for data transformations. To see the full list, see the package website: <https://easystats.github.io/datawizard/>

Review comment (Member, Author): I am deliberately not being comprehensive because we might add more functions in the future and ... cover all of those.


We will look at one example function that standardizes (i.e. centers and scales) data so that it can be expressed in terms of standard deviation:

```{r}
d <- data.frame(
a = c(-2, -1, 0, 1, 2),
b = c(3, 4, 5, 6, 7)
)

standardize(d, center = c(3, 4), scale = c(2, 4))
```
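
As a rough sketch of the arithmetic behind this call, standardizing with explicit reference values means subtracting a center from each column and dividing by a scale. The base R version below assumes that the `center` and `scale` vectors are matched to the columns in order; the helper `manual_standardize()` is hypothetical and not part of `{datawizard}`.

```r
# Hypothetical helper (not part of datawizard): column-wise (x - center) / scale
manual_standardize <- function(data, center, scale) {
  stopifnot(
    length(center) == ncol(data),
    length(scale) == ncol(data)
  )
  out <- data
  for (i in seq_along(data)) {
    out[[i]] <- (data[[i]] - center[i]) / scale[i]
  }
  out
}

d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

# Column `a` uses center 3 and scale 2; column `b` uses center 4 and scale 4
z <- manual_standardize(d, center = c(3, 4), scale = c(2, 4))
z
```

When no reference values are supplied, the column mean and standard deviation would typically play the roles of center and scale.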

## Summaries of Data Properties and Distributions

The workhorse function to get a comprehensive summary of data properties is `describe_distribution()`, which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, kurtosis, etc.) computed by other functions in `{datawizard}`.

Review comment (Member, Author): Not sure if a table is necessary here. We don't have many functions.


```{r eval=FALSE}
describe_distribution(mtcars)
```

```{r echo=FALSE, eval=TRUE, results="asis"}
library(kableExtra)
options(digits = 3)
kbl(describe_distribution(mtcars), format = "latex", booktabs = TRUE, linesep = "") |>
kable_styling(latex_options = "scale_down")
```
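
For orientation, a few of the indices mentioned above (centrality, dispersion, range) can be approximated with base R. This is a simplified sketch and not a substitute for `describe_distribution()`, which reports more indices; the object name `basic_summary` is made up for this illustration.

```r
# One row per variable in mtcars, one column per summary index
basic_summary <- t(sapply(mtcars, function(x) {
  c(
    Mean = mean(x),
    SD = sd(x),
    Min = min(x),
    Max = max(x),
    n = length(x)
  )
}))

round(basic_summary, 2)
```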

# Licensing and Availability

*see* is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored at GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
`{datawizard}` is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored on GitHub (<https://github.com/easystats/datawizard>), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.

# Acknowledgments

*see* is part of the collaborative [*easystats*](https://github.com/easystats/easystats) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.
`{datawizard}` is part of the collaborative [*easystats*](https://easystats.github.io/easystats/) ecosystem. Thus, we thank the [members of easystats](https://github.com/orgs/easystats/people) as well as the users.

# References
26 changes: 26 additions & 0 deletions paper/paper.bib
@@ -23,6 +23,17 @@ @Article{Lüdecke2020parameters
pages = {2445},
}

@Article{Lüdecke2021see,
title = {{see}: An {R} Package for Visualizing Statistical Models},
author = {Daniel Lüdecke and Indrajeet Patil and Mattan S. Ben-Shachar and Brenton M. Wiernik and Philip Waggoner and Dominique Makowski},
journal = {Journal of Open Source Software},
year = {2021},
volume = {6},
number = {64},
pages = {3393},
doi = {10.21105/joss.03393},
}

@Article{Lüdecke2020performance,
title = {{performance}: An {R} Package for Assessment, Comparison and Testing of Statistical Models},
author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Philip Waggoner and Dominique Makowski},
@@ -112,3 +123,18 @@ @Manual{base2021
url = {https://www.R-project.org/},
}

@Manual{Eastwood2021,
title = {poorman: A Poor Man's Dependency Free Recreation of 'dplyr'},
author = {Nathan Eastwood},
year = {2021},
note = {R package version 0.2.5},
url = {https://CRAN.R-project.org/package=poorman},
}

@Manual{Dowle2021,
title = {data.table: Extension of `data.frame`},
author = {Matt Dowle and Arun Srinivasan},
year = {2021},
note = {R package version 1.14.2},
url = {https://CRAN.R-project.org/package=data.table},
}