revising `describe_distribution` output for `factor` class #46

IndrajeetPatil · 2020-12-26T20:53:04Z

As awesome as this function is for numeric variables, I am not sure this is the best output we can provide for factor variables.

as.data.frame(parameters::describe_distribution(as.factor(mtcars$am)))
#>   Mean SD Min Max  Skewness Kurtosis  n n_Missing
#> 1   NA NA   0   1 0.4008089 -1.96655 32         0

Instead, we could take inspiration from the skimr output for the factor class:

as.data.frame(skimr::skim(as.factor(mtcars$am)))
#>   skim_type skim_variable n_missing complete_rate factor.ordered
#> 1    factor          data         0             1          FALSE
#>   factor.n_unique factor.top_counts
#> 1               2      0: 19, 1: 13


IndrajeetPatil · 2021-01-05T21:21:46Z

@strengejacke, @DominiqueMakowski, @mattansb What do you think?

IndrajeetPatil · 2021-01-21T16:37:57Z

For reference, the output could look something like this:

library(tabulator)
library(dplyr)
a <- tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
a %>% tab(varname)
#> # A tibble: 20 x 4
#>    varname     N  prop cum_prop
#>      <int> <int> <dbl>    <dbl>
#>  1       4 50346  0.05     0.05
#>  2       5 50328  0.05     0.1 
#>  3      20 50320  0.05     0.15
#>  4      14 50223  0.05     0.2 
#>  5       2 50208  0.05     0.25
#>  6      19 50101  0.05     0.3 
#>  7       7 50088  0.05     0.35
#>  8      15 50067  0.05     0.4 
#>  9       6 50044  0.05     0.45
#> 10      10 50040  0.05     0.5 
#> 11      12 50036  0.05     0.55
#> 12       1 50015  0.05     0.6 
#> 13      16 50014  0.05     0.65
#> 14      17 49935  0.05     0.7 
#> 15       3 49889  0.05     0.75
#> 16      11 49857  0.05     0.8 
#> 17      13 49726  0.05     0.85
#> 18      18 49632  0.05     0.9 
#> 19       8 49603  0.05     0.95
#> 20       9 49528  0.05     1

Created on 2021-01-21 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2021-01-21                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
#>  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.3)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
#>  data.table    1.13.6  2020-12-30 [1] CRAN (R 4.0.2)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  dplyr       * 1.0.3   2021-01-15 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1   2021-01-12 [1] CRAN (R 4.0.3)
#>  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
#>  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.3)
#>  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
#>  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
#>  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.3)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
#>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  tabulator   * 1.0.0   2021-01-08 [1] CRAN (R 4.0.2)
#>  testthat      3.0.1   2020-12-17 [1] CRAN (R 4.0.3)
#>  tibble        3.0.5   2021-01-15 [1] CRAN (R 4.0.3)
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#>  usethis       2.0.0   2020-12-10 [1] CRAN (R 4.0.3)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.3)
#>  withr         2.4.0   2021-01-16 [1] CRAN (R 4.0.3)
#>  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.3)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#> 
#> [1] /Users/patil/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

strengejacke · 2021-01-21T16:47:31Z

Looks rather like a frequency table to me:

a <- tibble::tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
sjmisc::frq(a)
#> 
#> varname <integer>
#> # total N=1000000  valid N=1000000  mean=10.50  sd=5.76
#> 
#> Value |     N | Raw % | Valid % | Cum. %
#> ----------------------------------------
#>     1 | 49907 |  4.99 |    4.99 |   4.99
#>     2 | 50198 |  5.02 |    5.02 |  10.01
#>     3 | 49954 |  5.00 |    5.00 |  15.01
#>     4 | 50005 |  5.00 |    5.00 |  20.01
#>     5 | 50225 |  5.02 |    5.02 |  25.03
#>     6 | 49937 |  4.99 |    4.99 |  30.02
#>     7 | 49870 |  4.99 |    4.99 |  35.01
#>     8 | 49561 |  4.96 |    4.96 |  39.97
#>     9 | 50219 |  5.02 |    5.02 |  44.99
#>    10 | 50506 |  5.05 |    5.05 |  50.04
#>    11 | 50092 |  5.01 |    5.01 |  55.05
#>    12 | 50094 |  5.01 |    5.01 |  60.06
#>    13 | 49818 |  4.98 |    4.98 |  65.04
#>    14 | 49990 |  5.00 |    5.00 |  70.04
#>    15 | 49585 |  4.96 |    4.96 |  75.00
#>    16 | 50085 |  5.01 |    5.01 |  80.00
#>    17 | 50328 |  5.03 |    5.03 |  85.04
#>    18 | 50103 |  5.01 |    5.01 |  90.05
#>    19 | 50068 |  5.01 |    5.01 |  95.05
#>    20 | 49455 |  4.95 |    4.95 | 100.00
#>  <NA> |     0 |  0.00 |    <NA> |   <NA>

Created on 2021-01-21 by the reprex package (v0.3.0)

IndrajeetPatil · 2021-03-28T09:24:46Z

Can we re-purpose the code from sjmisc to do the same thing for this function?

strengejacke · 2021-04-03T09:53:50Z

> Can we re-purpose the code from sjmisc to do the same thing for this function?

I still think that would be a frequency table rather than a descriptive version. It may depend on the number of levels whether the output would be compact enough, though.
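
For illustration, here is a minimal base-R sketch of such a frequency table. `freq_table()` and its column names are hypothetical and not taken from sjmisc or any easystats package. It returns one row per level, which is why compactness depends on the number of levels:

```r
# Hypothetical helper: a plain frequency table for a factor-like vector.
freq_table <- function(x) {
  counts <- table(x, useNA = "no")          # counts per level, missing values excluded
  prop <- as.numeric(counts) / sum(counts)  # proportion among non-missing values
  data.frame(
    value = names(counts),
    n = as.integer(counts),
    prop = prop,
    cum_prop = cumsum(prop),
    stringsAsFactors = FALSE
  )
}

# One row per level of the factor
freq_table(as.factor(mtcars$am))
```

With a two-level factor this stays compact; a 20-level variable like the one in the examples above would produce 20 rows.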

IndrajeetPatil · 2021-06-03T19:24:36Z

> I still think that would be a frequency table rather than a descriptive version.

Sure, that also works. But will you want to introduce a new function or will this be part of describe_distribution's functionality?

strengejacke · 2021-06-03T22:05:14Z

I wouldn't change this either way. Maybe using the mode, as suggested in easystats/parameters#515, could be a good way of dealing with factors.

bwiernik · 2021-06-04T02:22:02Z

I like the skimr output. Perhaps state the mode and anti-mode and their counts?
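
A rough base-R sketch of what reporting the mode and anti-mode with their counts could look like. `factor_mode_summary()` is a hypothetical name, ties are resolved by taking the first level, and unused levels are ignored. These are assumptions, not an agreed design:

```r
# Hypothetical helper: most and least frequent observed level of a factor.
factor_mode_summary <- function(x) {
  x <- as.factor(x)
  counts <- table(x)            # missing values are dropped by default
  counts <- counts[counts > 0]  # ignore levels that never occur
  if (length(counts) == 0) {
    stop("`x` has no non-missing values.")
  }
  data.frame(
    mode = names(counts)[which.max(counts)],      # first level with the highest count
    mode_n = as.integer(max(counts)),
    antimode = names(counts)[which.min(counts)],  # first level with the lowest count
    antimode_n = as.integer(min(counts)),
    stringsAsFactors = FALSE
  )
}

factor_mode_summary(mtcars$am)
```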

etiennebacher · 2022-09-05T14:46:43Z

Is this issue still on the todo list or was it superseded by #45 and the new function data_tabulate()?

etiennebacher · 2023-03-15T07:46:29Z

bump @IndrajeetPatil @strengejacke

strengejacke · 2023-03-15T07:53:25Z

The problem is that categorical variables have no "distribution" in the sense we use for numeric variables, i.e. summary statistics like centrality or dispersion do not apply. That's why describe_distribution() has a limited output here. I'm not sure how other packages that describe variables deal with this, but either way, it is difficult to "mix" the output for different types.

etiennebacher · 2023-03-15T07:56:05Z

How skimr does it:

library(skimr)

dat <- mtcars[, c("am", "mpg")]
dat$am <- as.factor(dat$am)

skim(dat)


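Data summary

|                        |      |
|:-----------------------|:-----|
| Name                   | dat  |
| Number of rows         | 32   |
| Number of columns      | 2    |
| Column type frequency: |      |
| factor                 | 1    |
| numeric                | 1    |
| Group variables        | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts   |
|:--------------|----------:|--------------:|:--------|---------:|:-------------|
| am            |         0 |             1 | FALSE   |        2 | 0: 19, 1: 13 |

Variable type: numeric

| skim_variable | n_missing | complete_rate |  mean |   sd |   p0 |   p25 |  p50 |  p75 | p100 | hist  |
|:--------------|----------:|--------------:|------:|-----:|-----:|------:|-----:|-----:|-----:|:------|
| mpg           |         0 |             1 | 20.09 | 6.03 | 10.4 | 15.43 | 19.2 | 22.8 | 33.9 | ▃▇▅▁▂ |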

Created on 2023-03-15 with reprex v2.0.2

etiennebacher · 2023-03-15T07:59:26Z

Maybe describe_distribution() for factors/characters could show something similar: n, n_missing, ordered, n_unique, mode (cf. #160).

But should it still be named describe_distribution() then?
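
A minimal sketch of what such a summary could look like in base R. `describe_factor()` and the exact column definitions are hypothetical: here `n` is assumed to count non-missing values, and `mode` is the first level with the highest count. None of this reflects an existing easystats function:

```r
# Hypothetical helper: one-row summary for a factor or character vector.
describe_factor <- function(x) {
  x <- as.factor(x)
  counts <- table(x)  # missing values are dropped by default
  # Most frequent level, or NA if there is nothing to count
  most_frequent <- if (any(counts > 0)) {
    names(counts)[which.max(counts)]
  } else {
    NA_character_
  }
  data.frame(
    n = sum(!is.na(x)),          # assumed: number of non-missing values
    n_missing = sum(is.na(x)),
    ordered = is.ordered(x),
    n_unique = sum(counts > 0),  # number of levels that actually occur
    mode = most_frequent,
    stringsAsFactors = FALSE
  )
}

describe_factor(mtcars$am)
```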


IndrajeetPatil transferred this issue from easystats/parameters Dec 13, 2021

IndrajeetPatil mentioned this issue Dec 13, 2021

Request for function that tabulates factors #45

Closed

IndrajeetPatil mentioned this issue Mar 14, 2022

Add tests and method for list in describe_distribution #105

Merged

etiennebacher mentioned this issue Feb 16, 2023

including mode for the distribution in describe_distribution #160

Closed
