revising describe_distribution output for factor class #46
Comments
@strengejacke, @DominiqueMakowski, @mattansb What do you think? |
For reference, the output can look something like this:
library(tabulator)
library(dplyr)
a <- tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
a %>% tab(varname)
#> # A tibble: 20 x 4
#> varname N prop cum_prop
#> <int> <int> <dbl> <dbl>
#> 1 4 50346 0.05 0.05
#> 2 5 50328 0.05 0.1
#> 3 20 50320 0.05 0.15
#> 4 14 50223 0.05 0.2
#> 5 2 50208 0.05 0.25
#> 6 19 50101 0.05 0.3
#> 7 7 50088 0.05 0.35
#> 8 15 50067 0.05 0.4
#> 9 6 50044 0.05 0.45
#> 10 10 50040 0.05 0.5
#> 11 12 50036 0.05 0.55
#> 12 1 50015 0.05 0.6
#> 13 16 50014 0.05 0.65
#> 14 17 49935 0.05 0.7
#> 15 3 49889 0.05 0.75
#> 16 11 49857 0.05 0.8
#> 17 13 49726 0.05 0.85
#> 18 18 49632 0.05 0.9
#> 19 8 49603 0.05 0.95
#> 20 9 49528 0.05 1
Created on 2021-01-21 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.3 (2020-10-10)
#> os macOS Mojave 10.14.6
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2021-01-21
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2)
#> cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> data.table 1.13.6 2020-12-30 [1] CRAN (R 4.0.2)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.3 2021-01-15 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1 2021-01-12 [1] CRAN (R 4.0.3)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.3)
#> ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.3)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> tabulator * 1.0.0 2021-01-08 [1] CRAN (R 4.0.2)
#> testthat 3.0.1 2020-12-17 [1] CRAN (R 4.0.3)
#> tibble 3.0.5 2021-01-15 [1] CRAN (R 4.0.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 2.0.0 2020-12-10 [1] CRAN (R 4.0.3)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.0 2021-01-16 [1] CRAN (R 4.0.3)
#> xfun 0.20 2021-01-06 [1] CRAN (R 4.0.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Users/patil/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library |
Looks rather like a frequency table to me:
a <- tibble::tibble(varname = sample.int(20, size = 1000000, replace = TRUE))
sjmisc::frq(a)
#>
#> varname <integer>
#> # total N=1000000 valid N=1000000 mean=10.50 sd=5.76
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ----------------------------------------
#> 1 | 49907 | 4.99 | 4.99 | 4.99
#> 2 | 50198 | 5.02 | 5.02 | 10.01
#> 3 | 49954 | 5.00 | 5.00 | 15.01
#> 4 | 50005 | 5.00 | 5.00 | 20.01
#> 5 | 50225 | 5.02 | 5.02 | 25.03
#> 6 | 49937 | 4.99 | 4.99 | 30.02
#> 7 | 49870 | 4.99 | 4.99 | 35.01
#> 8 | 49561 | 4.96 | 4.96 | 39.97
#> 9 | 50219 | 5.02 | 5.02 | 44.99
#> 10 | 50506 | 5.05 | 5.05 | 50.04
#> 11 | 50092 | 5.01 | 5.01 | 55.05
#> 12 | 50094 | 5.01 | 5.01 | 60.06
#> 13 | 49818 | 4.98 | 4.98 | 65.04
#> 14 | 49990 | 5.00 | 5.00 | 70.04
#> 15 | 49585 | 4.96 | 4.96 | 75.00
#> 16 | 50085 | 5.01 | 5.01 | 80.00
#> 17 | 50328 | 5.03 | 5.03 | 85.04
#> 18 | 50103 | 5.01 | 5.01 | 90.05
#> 19 | 50068 | 5.01 | 5.01 | 95.05
#> 20 | 49455 | 4.95 | 4.95 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
Created on 2021-01-21 by the reprex package (v0.3.0)
Can we re-purpose the code from |
I still think that would be a frequency table rather than a descriptive version. It may depend on the number of levels whether the output would be compact enough, though. |
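To make the distinction concrete, here is a minimal base R sketch of what such a frequency table contains. The helper name `freq_table()` is hypothetical and not part of any package discussed here; it only mirrors the `N`, `prop`, and `cum_prop` columns shown above, and it produces one row per level, which is why compactness depends on the number of levels.

```r
# Hypothetical helper (an assumption for illustration, not an existing API):
# one row per observed value, with counts, proportions and cumulative proportions.
freq_table <- function(x) {
  counts <- table(x, useNA = "no")          # missing values are excluded
  prop <- as.numeric(counts) / sum(counts)  # share of valid observations
  data.frame(
    value = names(counts),
    n = as.integer(counts),
    prop = prop,
    cum_prop = cumsum(prop),
    stringsAsFactors = FALSE
  )
}

# Two levels give two rows; a 20-level factor would give 20 rows.
freq_table(factor(mtcars$am))
```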
Sure, that also works. But will you want to introduce a new function or will this be part of |
I wouldn't change this either way. Maybe using the mode, as suggested in easystats/parameters#515, could be a good way of dealing with factors. |
I like the skimr output. Perhaps state the mode and anti-mode and their counts? |
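A rough sketch of what reporting the mode and anti-mode with their counts could look like in base R. The function name `mode_antimode()` and its output layout are assumptions made for illustration only, not an existing or proposed easystats interface.

```r
# Hypothetical helper (an assumption, not part of any easystats package):
# report the most and least frequent levels of a factor with their counts.
mode_antimode <- function(x) {
  x <- as.factor(x)
  counts <- table(x)             # NA values are dropped by default
  counts <- counts[counts > 0]   # ignore unused levels
  data.frame(
    statistic = c("mode", "anti-mode"),
    level = c(names(counts)[which.max(counts)],
              names(counts)[which.min(counts)]),
    n = c(max(counts), min(counts)),
    stringsAsFactors = FALSE
  )
}

# mtcars$cyl has 11 cars with 4 cylinders, 7 with 6, and 14 with 8.
mode_antimode(factor(mtcars$cyl))
```

Ties are resolved in favour of the first level in factor order, because `which.max()` and `which.min()` return the first match; a real implementation would need to decide how to report ties.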
Is this issue still on the todo list or was it superseded by #45 and the new function |
The problem is that categorical variables have no "distribution" in the sense we use for numeric variables, i.e. summary statistics like centrality or dispersion do not apply. That's why |
How
library(skimr)
dat <- mtcars[, c("am", "mpg")]
dat$am <- as.factor(dat$am)
skim(dat)
Data summary Variable type: factor
Variable type: numeric
Created on 2023-03-15 with reprex v2.0.2 |
maybe but should it still be named |
As awesome as this function is for numeric type variables, I am not sure if this is the best output we can provide for factor type variables.
I was thinking instead we can probably take inspiration from skimr output for factor class: