factor lump for ordinal factors #28

rtaph · 2016-08-31T11:56:12Z

Hi Hadley,

Do you have thoughts on creating an analogue (or generic) of fct_lump() for ordinal factors?

The utility I am looking for is the ability to keep contiguous levels together. A simple solution for this could be to lump ordinal levels directionally (either from the left or right), instead of by frequency.

Here is an example of the current behaviour:

set.seed(6)
f <- ordered(sample(letters[1:5], 10, replace = TRUE))
fct_count(f)
#> # A tibble: 5 × 2
#>       f     n
#>   <ord> <int>
#> 1     a     1
#> 2     b     2
#> 3     c     1
#> 4     d     2
#> 5     e     4
fct_lump(f, 3) # non-contiguous lumping
#>  [1] d     e     b     b     e     e     e     d     Other Other
#> Levels: b < d < e < Other

I imagine that in many cases, lumping categories a and c such that they are placed above b, d, and e will not created a desired result.

Here is what I have in mind for ord_lump():

ord_lump(f, 3, from = "left")
#>  [1] Other Other b b Other Other Other Other c a
#> Levels: a < b < c < Other

If you think this is worth it, I can propose a PR. An alternative would be to work the change into fct_lump by adding type = c("freq", "left", "right") into the formals. Let me know what you think.

Cheers,

Rafael

The text was updated successfully, but these errors were encountered:

hadley · 2016-09-09T14:21:57Z

Hmmmm, this seems like it will need quite a different algorithm to fct_lump so it should be a different function. I think you need to spell out the general approach here a bit more before it's worth doing a PR.

sfirke · 2017-02-04T02:14:29Z

I have a related use case where I want to symmetrically collapse ordered factors, grouping the top and bottom n levels. Say, converting responses on a Likert survey item into groups of hi-med-lo. Rafael's proposal might partially meet my needs. It feels like a new function separate from fct_lump.

rtaph · 2017-02-04T13:55:29Z

Sounds like there is some renewed interest in this.

As a general approach, the same formals can be kept as fct_lump, with the exception of "ties.method", which can be dropped since there are no ties in an ordered factor. For positive n or p, the levels would lumped together, starting from the left or right (as specified). For negative n (or p), the n levels would be retained from the left or right, and others lumped. The method should be general enough to apply intuitively on unordered factors too, based on the default ordering given by levels().

The formals would boil down to:

ord_lump(f, n, prop, from = c("left", "right"), other_level = "Other")

with the following behaviour:

(x <- factor(LETTERS[1:8], ordered = TRUE))
#> [1] A B C D E F G H
#> Levels: A < B < C < D < E < F < G < H

ord_lump(x, 3, from = "right")
#> [1] A B C Other Other Other Other Other
#> Levels: A < B < C < Other

ord_lump(x, -3, from = "right")
#> [1] A B C D E Other Other Other
#> Levels: A < B < C < D < E < Other

ord_lump(x, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other < D < E < F < G < H

ord_lump(x, prop = 0.25, from = "left")
#> [1] Other Other C D E F G H
#> Levels: Other < C < D < E < F < G < H

# unordered factor
y <- factor(LETTERS[1:8])
ord_lump(y, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other D E F G H

@sfirke's use case might be tricky to implement intuitively, since the lumping would need to occur from both sides, and a tie-breaking rule chosen when the middle interval is even. Maybe best left as a two-step process relying on a lumping function?

sfirke · 2018-03-05T21:20:26Z

I could see that ord_lump function above working with my use case. If the user can pass both arguments to "from", then can they supply two arguments to "into" (in the style of tidyr::separate, I prefer to "other_level" here)?

Then using data from the above example, here's how I imagine it:

ord_lump(x, -3, from = c("left", "right"), into = c("top", "bottom"))
#> [1] top top top D E bottom bottom bottom
#> Levels: top < D < E < bottom

If that's too complex, I think I could hack what I want together from multiple chained ord_lump as calls proposed above.

dmcalli2 · 2019-10-25T13:32:15Z

I was interested in this behaviour for tabular data, where I want to be able to collapse levels with small counts into the level closest to them (as a privacy/disclosure control) protection and found this discussion. In case someone else comes here with a similar problem, please see below for some crude code.
It takes the nearest level and sums the count across the levels and relabels the new summarised level. It depends on dplyr and purrr.

library(tidyverse)
## Create example data
mydf <- tibble(x = sample(letters[1:3], 50, replace = TRUE), n = rpois(50, 5))
mydf <- mydf %>% 
  group_by(x) %>% 
  mutate(i = seq_along(x)) %>% 
  ungroup() %>% 
  arrange(x, i)


LumpByN <- function(i, n, n_threshold =5, fail_iters = 10){
  ## for an ordered variable, i it lumps together the closest ordered levels when they have
  ## a count of n_threshold or lower
  mydf <- tibble(n = n, i = i) %>% 
    arrange(i)
  
  ## Loop through until no values less than 5 found
  count <- 0
  while(any(mydf$n <=5) & count < fail_iters){
    # print(count)
    lag_lead <- if( (count %%2) ==1) function(x) lag(x, default = FALSE) else function(x) lead(x, default = FALSE)
    
    mydf <- mydf %>% 
      mutate(n_small = n<=n_threshold,
             i_match = if_else(lag_lead(n_small), lag_lead(i), i)) %>% 
      group_by(i_match) %>% 
      summarise(n = sum(n),
                i = paste(i, collapse = " or ")) %>% 
      ungroup() %>% 
      select(-i_match) %>% 
      mutate(i_order = as.integer(str_extract(i, "[0-9]{1,}\\b"))) %>% 
      arrange(i_order)
    
    count <- count +1
  }
  mydf
}

## Apply to a group
mydf_slct <- mydf %>% 
  filter(x == "a")
LumpByN(mydf_slct$i, mydf_slct$n)

## Apply to multiple groups
mydf2 <- mydf %>% 
  group_by(x) %>% 
  nest()
map(mydf2$data, ~ LumpByN(.x$i, .x$n))

hadley added the feature a feature request or enhancement label Feb 10, 2018

sfirke mentioned this issue Mar 5, 2018

should top_levels return a vector? sfirke/janitor#93

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

factor lump for ordinal factors #28

factor lump for ordinal factors #28

rtaph commented Aug 31, 2016 •

edited

Loading

hadley commented Sep 9, 2016

sfirke commented Feb 4, 2017

rtaph commented Feb 4, 2017

sfirke commented Mar 5, 2018

dmcalli2 commented Oct 25, 2019

factor lump for ordinal factors #28

factor lump for ordinal factors #28

Comments

rtaph commented Aug 31, 2016 • edited Loading

hadley commented Sep 9, 2016

sfirke commented Feb 4, 2017

rtaph commented Feb 4, 2017

sfirke commented Mar 5, 2018

dmcalli2 commented Oct 25, 2019

rtaph commented Aug 31, 2016 •

edited

Loading