revising describe_distribution output for factor class #46

Open
IndrajeetPatil opened this issue Dec 26, 2020 · 14 comments

@IndrajeetPatil
Member

As awesome as this function is for numeric type variables, I am not sure if this is the best output we can provide for factor type variables.

as.data.frame(parameters::describe_distribution(as.factor(mtcars$am)))
#>   Mean SD Min Max  Skewness Kurtosis  n n_Missing
#> 1   NA NA   0   1 0.4008089 -1.96655 32         0

I was thinking instead we can probably take inspiration from skimr output for factor class:

as.data.frame(skimr::skim(as.factor(mtcars$am)))
#>   skim_type skim_variable n_missing complete_rate factor.ordered
#> 1    factor          data         0             1          FALSE
#>   factor.n_unique factor.top_counts
#> 1               2      0: 19, 1: 13

@IndrajeetPatil
Member Author

IndrajeetPatil commented Jan 5, 2021

@strengejacke, @DominiqueMakowski, @mattansb What do you think?

@IndrajeetPatil
Member Author

For reference, the output could look something like this:

library(tabulator)
library(dplyr)
a <- tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
a %>% tab(varname)
#> # A tibble: 20 x 4
#>    varname     N  prop cum_prop
#>      <int> <int> <dbl>    <dbl>
#>  1       4 50346  0.05     0.05
#>  2       5 50328  0.05     0.1 
#>  3      20 50320  0.05     0.15
#>  4      14 50223  0.05     0.2 
#>  5       2 50208  0.05     0.25
#>  6      19 50101  0.05     0.3 
#>  7       7 50088  0.05     0.35
#>  8      15 50067  0.05     0.4 
#>  9       6 50044  0.05     0.45
#> 10      10 50040  0.05     0.5 
#> 11      12 50036  0.05     0.55
#> 12       1 50015  0.05     0.6 
#> 13      16 50014  0.05     0.65
#> 14      17 49935  0.05     0.7 
#> 15       3 49889  0.05     0.75
#> 16      11 49857  0.05     0.8 
#> 17      13 49726  0.05     0.85
#> 18      18 49632  0.05     0.9 
#> 19       8 49603  0.05     0.95
#> 20       9 49528  0.05     1

Created on 2021-01-21 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2021-01-21                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
#>  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.3)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
#>  data.table    1.13.6  2020-12-30 [1] CRAN (R 4.0.2)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  dplyr       * 1.0.3   2021-01-15 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1   2021-01-12 [1] CRAN (R 4.0.3)
#>  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
#>  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.3)
#>  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
#>  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
#>  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.3)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
#>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  tabulator   * 1.0.0   2021-01-08 [1] CRAN (R 4.0.2)
#>  testthat      3.0.1   2020-12-17 [1] CRAN (R 4.0.3)
#>  tibble        3.0.5   2021-01-15 [1] CRAN (R 4.0.3)
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#>  usethis       2.0.0   2020-12-10 [1] CRAN (R 4.0.3)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.3)
#>  withr         2.4.0   2021-01-16 [1] CRAN (R 4.0.3)
#>  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.3)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#> 
#> [1] /Users/patil/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

@strengejacke
Member

That looks more like a frequency table to me:

a <- tibble::tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
sjmisc::frq(a)
#> 
#> varname <integer>
#> # total N=1000000  valid N=1000000  mean=10.50  sd=5.76
#> 
#> Value |     N | Raw % | Valid % | Cum. %
#> ----------------------------------------
#>     1 | 49907 |  4.99 |    4.99 |   4.99
#>     2 | 50198 |  5.02 |    5.02 |  10.01
#>     3 | 49954 |  5.00 |    5.00 |  15.01
#>     4 | 50005 |  5.00 |    5.00 |  20.01
#>     5 | 50225 |  5.02 |    5.02 |  25.03
#>     6 | 49937 |  4.99 |    4.99 |  30.02
#>     7 | 49870 |  4.99 |    4.99 |  35.01
#>     8 | 49561 |  4.96 |    4.96 |  39.97
#>     9 | 50219 |  5.02 |    5.02 |  44.99
#>    10 | 50506 |  5.05 |    5.05 |  50.04
#>    11 | 50092 |  5.01 |    5.01 |  55.05
#>    12 | 50094 |  5.01 |    5.01 |  60.06
#>    13 | 49818 |  4.98 |    4.98 |  65.04
#>    14 | 49990 |  5.00 |    5.00 |  70.04
#>    15 | 49585 |  4.96 |    4.96 |  75.00
#>    16 | 50085 |  5.01 |    5.01 |  80.00
#>    17 | 50328 |  5.03 |    5.03 |  85.04
#>    18 | 50103 |  5.01 |    5.01 |  90.05
#>    19 | 50068 |  5.01 |    5.01 |  95.05
#>    20 | 49455 |  4.95 |    4.95 | 100.00
#>  <NA> |     0 |  0.00 |    <NA> |   <NA>

Created on 2021-01-21 by the reprex package (v0.3.0)

@IndrajeetPatil
Member Author

Can we re-purpose the code from sjmisc to do the same thing for this function?

@strengejacke
Member

> Can we re-purpose the code from sjmisc to do the same thing for this function?

I still think that would be a frequency table rather than a descriptive version. It may depend on the number of levels whether the output would be compact enough, though.
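
For comparison, a frequency table of this kind can be sketched in base R. This is only an illustration, not the sjmisc implementation; the column names (`value`, `n`, `pct`, `cum_pct`) are made up for this example. The result has one row per level, which is why compactness depends on the number of levels.

```r
# Illustrative sketch only: a base R frequency table for a factor.
# Column names are assumptions chosen for this example.
x <- as.factor(mtcars$am)

counts <- table(x)                      # counts per level (NA excluded by default)
props  <- as.numeric(prop.table(counts))

freq <- data.frame(
  value   = names(counts),
  n       = as.integer(counts),
  pct     = round(100 * props, 2),
  cum_pct = round(100 * cumsum(props), 2),
  stringsAsFactors = FALSE
)

freq                                    # one row per factor level
```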

@IndrajeetPatil
Member Author

> I still think that would be a frequency table rather than a descriptive version.

Sure, that also works. But would you want to introduce a new function, or should this become part of describe_distribution()'s functionality?

@strengejacke
Member

I wouldn't change this either way. Maybe using the mode, as suggested in easystats/parameters#515, could be a good way of dealing with factors.

@bwiernik
Contributor

bwiernik commented Jun 4, 2021

I like the skimr output. Perhaps state the mode and anti-mode and their counts?
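
A rough base R sketch of what reporting the mode and anti-mode with their counts might look like. Here "anti-mode" is taken to mean the least frequent observed level, which is an assumption, and the column names are hypothetical. Ties are resolved by taking the first level in level order.

```r
# Illustrative sketch only: mode and anti-mode of a factor with their counts.
# "Anti-mode" is assumed to mean the least frequent observed level.
x <- as.factor(mtcars$am)

counts <- table(x)
counts <- counts[counts > 0]            # ignore unused levels

mode_summary <- data.frame(
  mode       = names(counts)[which.max(counts)],
  n_mode     = as.integer(max(counts)),
  antimode   = names(counts)[which.min(counts)],
  n_antimode = as.integer(min(counts)),
  stringsAsFactors = FALSE
)

mode_summary
```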

@etiennebacher
Member

Is this issue still on the todo list or was it superseded by #45 and the new function data_tabulate()?

@etiennebacher
Member

bump @IndrajeetPatil @strengejacke

@strengejacke
Member

The problem is that categorical variables have no "distribution" in the sense we use for numeric variables, i.e. summary statistics like centrality or dispersion do not apply. That's why describe_distribution() has a limited output here. I'm not sure how other packages that describe variables deal with this, but either way, it is difficult to "mix" the output for different types.

@etiennebacher
Member

How skimr does it:

library(skimr)

dat <- mtcars[, c("am", "mpg")]
dat$am <- as.factor(dat$am)

skim(dat)
Data summary

Name                     dat
Number of rows           32
Number of columns        2
Column type frequency:
  factor                 1
  numeric                1
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
am                     0              1  FALSE           2  0: 19, 1: 13

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd    p0    p25   p50   p75  p100  hist
mpg                    0              1  20.09  6.03  10.4  15.43  19.2  22.8  33.9  ▃▇▅▁▂

Created on 2023-03-15 with reprex v2.0.2

@etiennebacher
Member

etiennebacher commented Mar 15, 2023

Maybe describe_distribution() for factors/characters could show something similar: n, n_missing, ordered, n_unique, mode (cf. #160).

But should it still be named describe_distribution() then?
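
As a minimal sketch of the columns listed above, assuming `n` counts non-missing values and `mode` is the most frequent level: `describe_factor()` is a hypothetical helper written for this illustration, not an existing function in any easystats package.

```r
# Hypothetical helper (not part of any package): summarise a factor or
# character vector with the columns proposed above.
# Assumption: `n` is the number of non-missing values.
describe_factor <- function(x) {
  x <- as.factor(x)
  counts <- table(x)                    # counts per level, NA excluded
  has_data <- length(counts) > 0 && any(counts > 0)
  data.frame(
    n         = sum(!is.na(x)),
    n_missing = sum(is.na(x)),
    ordered   = is.ordered(x),
    n_unique  = sum(counts > 0),
    # first level wins on ties; NA if there are no observed values
    mode      = if (has_data) names(counts)[which.max(counts)] else NA_character_,
    stringsAsFactors = FALSE
  )
}

res <- describe_factor(mtcars$am)
res
```

Whether such a summary belongs in describe_distribution() or in a separate function is exactly the open naming question here.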
