Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

factor lump for ordinal factors #28

Open
rtaph opened this issue Aug 31, 2016 · 5 comments
Open

factor lump for ordinal factors #28

rtaph opened this issue Aug 31, 2016 · 5 comments
Labels
feature a feature request or enhancement

Comments

@rtaph
Copy link
Contributor

rtaph commented Aug 31, 2016

Hi Hadley,

Do you have thoughts on creating an analogue (or generic) of fct_lump() for ordinal factors?

The utility I am looking for is the ability to keep contiguous levels together. A simple solution for this could be to lump ordinal levels directionally (either from the left or right), instead of by frequency.

Here is an example of the current behaviour:

set.seed(6)
f <- ordered(sample(letters[1:5], 10, replace = TRUE))
fct_count(f)
#> # A tibble: 5 × 2
#>       f     n
#>   <ord> <int>
#> 1     a     1
#> 2     b     2
#> 3     c     1
#> 4     d     2
#> 5     e     4
fct_lump(f, 3) # non-contiguous lumping
#>  [1] d     e     b     b     e     e     e     d     Other Other
#> Levels: b < d < e < Other

I imagine that in many cases, lumping categories a and c such that they are placed above b, d, and e will not created a desired result.

Here is what I have in mind for ord_lump():

ord_lump(f, 3, from = "left")
#>  [1] Other Other b b Other Other Other Other c a
#> Levels: a < b < c < Other

If you think this is worth it, I can propose a PR. An alternative would be to work the change into fct_lump by adding type = c("freq", "left", "right") into the formals. Let me know what you think.

Cheers,

Rafael

@hadley
Copy link
Member

hadley commented Sep 9, 2016

Hmmmm, this seems like it will need quite a different algorithm to fct_lump so it should be a different function. I think you need to spell out the general approach here a bit more before it's worth doing a PR.

@sfirke
Copy link

sfirke commented Feb 4, 2017

I have a related use case where I want to symmetrically collapse ordered factors, grouping the top and bottom n levels. Say, converting responses on a Likert survey item into groups of hi-med-lo. Rafael's proposal might partially meet my needs. It feels like a new function separate from fct_lump.

@rtaph
Copy link
Contributor Author

rtaph commented Feb 4, 2017

Sounds like there is some renewed interest in this.

As a general approach, the same formals can be kept as fct_lump, with the exception of "ties.method", which can be dropped since there are no ties in an ordered factor. For positive n or p, the levels would lumped together, starting from the left or right (as specified). For negative n (or p), the n levels would be retained from the left or right, and others lumped. The method should be general enough to apply intuitively on unordered factors too, based on the default ordering given by levels().

The formals would boil down to:

ord_lump(f, n, prop, from = c("left", "right"), other_level = "Other")

with the following behaviour:

(x <- factor(LETTERS[1:8], ordered = TRUE))
#> [1] A B C D E F G H
#> Levels: A < B < C < D < E < F < G < H

ord_lump(x, 3, from = "right")
#> [1] A B C Other Other Other Other Other
#> Levels: A < B < C < Other

ord_lump(x, -3, from = "right")
#> [1] A B C D E Other Other Other
#> Levels: A < B < C < D < E < Other

ord_lump(x, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other < D < E < F < G < H

ord_lump(x, prop = 0.25, from = "left")
#> [1] Other Other C D E F G H
#> Levels: Other < C < D < E < F < G < H

# unordered factor
y <- factor(LETTERS[1:8])
ord_lump(y, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other D E F G H

@sfirke's use case might be tricky to implement intuitively, since the lumping would need to occur from both sides, and a tie-breaking rule chosen when the middle interval is even. Maybe best left as a two-step process relying on a lumping function?

@hadley hadley added the feature a feature request or enhancement label Feb 10, 2018
@sfirke
Copy link

sfirke commented Mar 5, 2018

I could see that ord_lump function above working with my use case. If the user can pass both arguments to "from", then can they supply two arguments to "into" (in the style of tidyr::separate, I prefer to "other_level" here)?

Then using data from the above example, here's how I imagine it:

ord_lump(x, -3, from = c("left", "right"), into = c("top", "bottom"))
#> [1] top top top D E bottom bottom bottom
#> Levels: top < D < E < bottom

If that's too complex, I think I could hack what I want together from multiple chained ord_lump as calls proposed above.

@dmcalli2
Copy link

I was interested in this behaviour for tabular data, where I want to be able to collapse levels with small counts into the level closest to them (as a privacy/disclosure control) protection and found this discussion. In case someone else comes here with a similar problem, please see below for some crude code.
It takes the nearest level and sums the count across the levels and relabels the new summarised level. It depends on dplyr and purrr.

library(tidyverse)
## Create example data
mydf <- tibble(x = sample(letters[1:3], 50, replace = TRUE), n = rpois(50, 5))
mydf <- mydf %>% 
  group_by(x) %>% 
  mutate(i = seq_along(x)) %>% 
  ungroup() %>% 
  arrange(x, i)


LumpByN <- function(i, n, n_threshold =5, fail_iters = 10){
  ## for an ordered variable, i it lumps together the closest ordered levels when they have
  ## a count of n_threshold or lower
  mydf <- tibble(n = n, i = i) %>% 
    arrange(i)
  
  ## Loop through until no values less than 5 found
  count <- 0
  while(any(mydf$n <=5) & count < fail_iters){
    # print(count)
    lag_lead <- if( (count %%2) ==1) function(x) lag(x, default = FALSE) else function(x) lead(x, default = FALSE)
    
    mydf <- mydf %>% 
      mutate(n_small = n<=n_threshold,
             i_match = if_else(lag_lead(n_small), lag_lead(i), i)) %>% 
      group_by(i_match) %>% 
      summarise(n = sum(n),
                i = paste(i, collapse = " or ")) %>% 
      ungroup() %>% 
      select(-i_match) %>% 
      mutate(i_order = as.integer(str_extract(i, "[0-9]{1,}\\b"))) %>% 
      arrange(i_order)
    
    count <- count +1
  }
  mydf
}

## Apply to a group
mydf_slct <- mydf %>% 
  filter(x == "a")
LumpByN(mydf_slct$i, mydf_slct$n)

## Apply to multiple groups
mydf2 <- mydf %>% 
  group_by(x) %>% 
  nest()
map(mydf2$data, ~ LumpByN(.x$i, .x$n))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants