Add tests and method for list in describe_distribution #105

Merged
merged 11 commits into from
Mar 15, 2022

@etiennebacher etiennebacher commented Mar 13, 2022

This PR is related to #97. Sometimes we want to compare the distributions of several variables without describing the whole data frame, so this PR adds a list method for describe_distribution(). Some examples:

> describe_distribution(list(mtcars$mpg, mtcars$cyl))

Variable   |  Mean |   SD |  IQR |          Range | Skewness | Kurtosis |  n | n_Missing
----------------------------------------------------------------------------------------
mtcars$mpg | 20.09 | 6.03 | 7.53 | [10.40, 33.90] |     0.67 |    -0.02 | 32 |         0
mtcars$cyl |  6.19 | 1.79 | 4.00 |   [4.00, 8.00] |    -0.19 |    -1.76 | 32 |         0

> describe_distribution(list(foo = mtcars$mpg, foo2 = mtcars$cyl))

Variable |  Mean |   SD |  IQR |          Range | Skewness | Kurtosis |  n | n_Missing
--------------------------------------------------------------------------------------
foo      | 20.09 | 6.03 | 7.53 | [10.40, 33.90] |     0.67 |    -0.02 | 32 |         0
foo2     |  6.19 | 1.79 | 4.00 |   [4.00, 8.00] |    -0.19 |    -1.76 | 32 |         0

> describe_distribution(list(foo = mtcars$mpg, mtcars$cyl))

Variable   |  Mean |   SD |  IQR |          Range | Skewness | Kurtosis |  n | n_Missing
----------------------------------------------------------------------------------------
foo        | 20.09 | 6.03 | 7.53 | [10.40, 33.90] |     0.67 |    -0.02 | 32 |         0
mtcars$cyl |  6.19 | 1.79 | 4.00 |   [4.00, 8.00] |    -0.19 |    -1.76 | 32 |         0

> x <- list(mtcars$mpg, mtcars$cyl)
> describe_distribution(x)

Variable |  Mean |   SD |  IQR |          Range | Skewness | Kurtosis |  n | n_Missing
--------------------------------------------------------------------------------------
Var_1    | 20.09 | 6.03 | 7.53 | [10.40, 33.90] |     0.67 |    -0.02 | 32 |         0
Var_2    |  6.19 | 1.79 | 4.00 |   [4.00, 8.00] |    -0.19 |    -1.76 | 32 |         0

Note that if the list is stored in an object (such as x above) and its elements are unnamed, there is no way to recover the expressions that created x, hence "Var_1" and "Var_2" as variable names. I also added tests and some documentation for this.
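
The naming behaviour in the four examples can be sketched roughly as follows. This is only an illustration in base R, not the code added by this PR: label_list_elements() is a hypothetical helper, and the approach (inspecting the unevaluated call with substitute()) is an assumption about one way to get these labels.

```r
# Hypothetical helper, NOT the package's implementation: derive one label per
# list element, mirroring the variable names shown in the examples.
label_list_elements <- function(x) {
  # Capture the expression passed as `x` before doing anything else
  expr <- substitute(x)

  # Fallback labels, used when nothing better is available
  labels <- paste0("Var_", seq_along(x))

  # If the caller wrote list(...) inline, the unevaluated call still holds
  # the expression behind each element (e.g. mtcars$mpg)
  if (is.call(expr) && identical(expr[[1]], as.name("list"))) {
    args <- as.list(expr)[-1]
    labels <- vapply(
      args,
      function(a) paste(deparse(a), collapse = ""),
      character(1)
    )
  }

  # Explicit element names always take precedence
  nms <- names(x)
  if (!is.null(nms)) {
    has_name <- nzchar(nms)
    labels[has_name] <- nms[has_name]
  }

  unname(labels)
}

label_list_elements(list(foo = mtcars$mpg, mtcars$cyl))

x <- list(mtcars$mpg, mtcars$cyl)
label_list_elements(x)
```

When the list is passed as an object, substitute() only sees the symbol x, so the sketch has nothing to deparse and falls back to the generic labels.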

There will probably be some things to change, so I haven't added an entry to NEWS yet. What do you think?

Close #97.

@codecov-commenter
Codecov Report

Merging #105 (d76719c) into master (c4c484d) will increase coverage by 2.78%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage   73.12%   75.90%   +2.78%     
==========================================
  Files          39       40       +1     
  Lines        2013     2042      +29     
==========================================
+ Hits         1472     1550      +78     
+ Misses        541      492      -49     
Impacted Files                  | Coverage Δ
--------------------------------------------------------------
R/describe_distribution.R       | 76.55% <96.66%> (+17.89%) ⬆️
R/weighted_mean_median_sd_mad.R | 91.89% <0.00%> (ø)
R/utils_standardize_center.R    | 77.84% <0.00%> (+9.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@IndrajeetPatil IndrajeetPatil left a comment

Looks good! Just a few minor comments that need to be attended to.

R/describe_distribution.R
row.names(out) <- NULL
out <- out[c("Variable", setdiff(colnames(out), "Variable"))]

class(out) <- unique(c("parameters_distribution", "see_parameters_distribution", class(out)))
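
To see what the lines above do, here is a toy sketch in base R. The data frame is hypothetical (made-up values with the Variable column deliberately placed last); only the three statements under review are taken from the diff.

```r
# Hypothetical stand-in for the intermediate result; values are made up
out <- data.frame(
  Mean = c(20.09, 6.19),
  SD = c(6.03, 1.79),
  Variable = c("foo", "foo2"),
  row.names = c("a", "b")
)

# Reset row names to the default 1..n sequence
row.names(out) <- NULL

# Move "Variable" to the front, keeping the other columns in their order
out <- out[c("Variable", setdiff(colnames(out), "Variable"))]

# Prepend the custom classes; unique() avoids duplicating them if this runs twice
class(out) <- unique(c("parameters_distribution", "see_parameters_distribution", class(out)))

colnames(out)
class(out)
```

Because "data.frame" stays in the class vector, the object still behaves as a data frame when no method exists for the custom classes.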
Have you tried the plotting method in {see} and checked if it works with outputs from list input?

Maybe you can post an example in the PR thread.

@etiennebacher (Member, Author) replied:

I didn't know there was a plotting method for that. I just tried plot(describe_distribution(list(mtcars$mpg, mtcars$cyl))) and it doesn't work. I don't have any experience with ggplot2 programming, so I'm not sure how to handle this. I'll take a look at the {see} source code later; maybe it's easier than I think.

I think you can leave this in for now and create an issue in the {see} repository so that we don't forget about it.

@IndrajeetPatil IndrajeetPatil self-requested a review March 14, 2022 10:52
@IndrajeetPatil IndrajeetPatil left a comment

Looks good to me! Thanks for working on this (and also for adding exhaustive unit tests!).

I will wait for @strengejacke to have a look at this before I merge.

Successfully merging this pull request may close these issues.

Use describe_distribution() with a vector/list of variables