Feature request: Order factor by (multiple functions of) multiple variables #16
Comments
Changing the |
Not off the top of my head, but ordering factors first by the mean of one variable and then by the min of another variable (usually a date variable), or the other way around, is something I’ve done several times. Doing stuff like this is more common when you have lots of ties on the first variable, so that you’re not really using a second variable just for breaking ties, but because you’re really interested in the ordering for this second variable. I’ve also used different summary functions for the same variable, e.g. first ordering by the median of one variable and then by the standard deviation of the same variable.
And finally, sometimes I have data in long format where a variable is repeated but unique within the factor I’m reordering. Then I sometimes abuse the mean function by reordering first on the mean (which is equivalent to reordering on x[1], since the values are unique within the factor) and then by (perhaps a different summary function on) a different variable.
For example, I had long data on patients receiving blood transfusions, with longitudinal follow-up (one or more measurements/rows). One variable was the number of bags of blood received by each patient (unique within each patient), and I wanted to graph the longitudinal data, where the patients (as separate panels) would be ordered by the number of bags of blood (where many patients received the same number of bags) and then by the maximum value obtained on the measurement variable (some sort of measure of the effect of the blood transfusion). Then ordering first by the mean and then by max was useful. Sometimes reordering the actual data set using other functions can be done as a workaround (followed by |
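A minimal base-R sketch of the transfusion example described above: order patients by number of bags, then by the maximum measurement. The data frame `d` and its column names are made up for illustration, and nothing here is a forcats API.

```r
# Hypothetical long-format data: `bags` is constant within each patient.
d <- data.frame(
  patient = c("p1", "p1", "p2", "p2", "p3", "p3"),
  bags    = c(2, 2, 1, 1, 2, 2),
  value   = c(5, 9, 4, 6, 7, 12)
)

f <- factor(d$patient)

# One summary per level, in the order of levels(f).
bags_by_patient <- tapply(d$bags, f, mean)  # equals the unique value per patient
max_by_patient  <- tapply(d$value, f, max)

# order() sorts by the first key and uses later keys to break ties.
new_order <- order(bags_by_patient, max_by_patient)
d$patient_ord <- factor(f, levels = levels(f)[new_order])

levels(d$patient_ord)
#> [1] "p2" "p1" "p3"
```

Here p1 and p3 tie on bags (2 each), so the maximum value (9 vs. 12) decides their order.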
Thanks - that's useful. That somehow feels a bit big for forcats, and seems like somehow it might be an interaction with dplyr. Let me think about it for a bit. |
Let me just support @huftis; I think he described a common use case very well. Here is another example on stackoverflow. It seems to me that the order of factor levels is most relevant for creating nicer plots. In fact, I'm not sure the order of factor levels has any meaning if the factor itself is not ordered. I understand dplyr as a tool to permanently manipulate my data. Therefore a fct_reordern would be better placed inside forcats, to just temporarily reorder levels (e.g. ggplot(aes(x=fct_reordern(a,...)))). |
I wandered here looking for a 'lexicographic reordering' of factors. In my use case, there is a hierarchy to my factors: say f is a coarse classification, and g is a fine classification. I will make a plot (a bar plot actually) with colors (and x axis) determined by g, but facets determined by f. I want the colors to be essentially in order across the facets and within each facet. I need to reorder f according to a numeric order of corresponding g (there are many ties), and then another variable. First pass at code, which is not terribly general, looks like:
fct_lexi_reord <- function(f, ..., .desc=FALSE) {
  # cannot seem to include this in the function list
  fun <- median
  numbys <- list(...)
  f <- forcats:::check_factor(f)
  stopifnot(rep(length(f),length(numbys))==unlist(lapply(numbys,length)))
  allsumma <- lapply(numbys,function(anx) {
    summary <- tapply(anx, f, fun)
    if (!is.numeric(summary)) {
      stop("`fun` must return a single number per group", call. = FALSE)
    }
    summary
  })
  neworder <- do.call(order,args=c(allsumma,list(decreasing=.desc)))
  lvls_reorder(f, neworder)
}
Again, this is probably not general enough for inclusion in |
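On the `fun` question raised in the code comment: one option (an assumption on my part, not something forcats provides) is to make the summary function a named argument placed after `...`, so callers must spell it out as `.fun = `. A base-R sketch; `lexi_reorder` and `.fun` are hypothetical names, and `as.factor()` stands in for the internal `forcats:::check_factor()`.

```r
lexi_reorder <- function(f, ..., .fun = median, .desc = FALSE) {
  f <- as.factor(f)
  keys <- list(...)
  stopifnot(all(lengths(keys) == length(f)))

  # One per-level summary vector for each key, in the order of levels(f).
  summaries <- lapply(keys, function(x) {
    s <- tapply(x, f, .fun)
    if (!is.numeric(s)) {
      stop("`.fun` must return a single number per group", call. = FALSE)
    }
    as.vector(s)
  })

  # unname() so key names cannot clash with order()'s own arguments.
  new_order <- do.call(order, c(unname(summaries), list(decreasing = .desc)))
  factor(f, levels = levels(f)[new_order])
}

f <- c("a", "b", "c", "a", "b", "c")
x <- c(1, 1, 2, 1, 1, 2)
y <- c(5, 3, 0, 5, 3, 0)
levels(lexi_reorder(f, x, y, .fun = max))
#> [1] "b" "a" "c"
```

Because `.fun` comes after `...`, it is never matched positionally, so any unnamed vector is treated as a sort key.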
Some context: here is a MWE for something like what I am doing, using the above
library(tibble)
library(dplyr)
library(ggplot2)
library(forcats)
set.seed(123)
wines <- tibble::tribble(~color, ~varietal,
                         'white', 'riesling',
                         'white', 'chardonnay',
                         'white', 'sauv blanc',
                         'rose', '2 buck chuck',
                         'red', 'barbera',
                         'red', 'grenache',
                         'red', 'zinfandel',
                         'red', 'merlot',
                         'red', 'pinot noir',
                         'red', 'syrah',
                         'red', 'cab sauv') %>%
  mutate(points=runif(length(color),min=0,max=100))
ph <- wines %>%
  mutate(color_ord=forcats::fct_reorder(color,points)) %>%
  mutate(varietal_ord=fct_lexi_reord(varietal,as.numeric(color_ord),points)) %>%
  ggplot(aes(varietal_ord,points,fill=varietal_ord)) +
  geom_bar(stat='identity') +
  facet_grid(.~color_ord,space='free',scales='free') +
  labs(x='varietal',y='points')
print(ph)
(BTW, awful things happen when you try to flip this to the y axis via
ph <- wines %>%
  mutate(color_ord=forcats::fct_reorder(color,points)) %>%
  mutate(varietal_ord=fct_lexi_reord(varietal,as.numeric(color_ord),points)) %>%
  ggplot(aes(varietal_ord,points,fill=varietal_ord)) +
  geom_bar(stat='identity') +
  coord_flip() +
  facet_grid(color_ord~.,space='free',scales='free') +
  labs(x='varietal',y='points')
print(ph)
But I suppose that is an issue for |
I had a similar problem of ordering a factor by multiple other variables. In this use case, the other variables served as a sort of hierarchical index, where I needed to sort by level 1 first, then by level 2 within level 1 groups.
|
I ran into the same attempt at using
I tend to agree with @hadley that multiple functions seem complex, but I wonder if something like the following could work (borrowing from
Rather than:
name = fct_reordern(name,
                    vars = list(quality, year, weight),
                    funs = list(mean, min, median))
How about:
myorder <- function(quality, year, weight) {
  c(mean(quality), min(year), median(weight))
}
name = fct_reordern(.f = name,
                    .l = list(quality, year, weight),
                    .fun = myorder)
Then, |
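A rough base-R sketch of how the `.l` / `.fun` proposal above could work, assuming `.fun` returns one numeric vector of sort keys per level. `reorder_by_keys` is a hypothetical name and the data are invented; this is not an existing forcats function.

```r
reorder_by_keys <- function(.f, .l, .fun) {
  .f <- as.factor(.f)
  # Row indices belonging to each level.
  idx <- split(seq_along(.f), .f)
  # Call .fun once per level, on that level's slice of every vector in .l.
  keys <- lapply(idx, function(i) do.call(.fun, lapply(.l, `[`, i)))
  # One row per level, one column per sort key.
  key_mat <- do.call(rbind, keys)
  key_cols <- lapply(seq_len(ncol(key_mat)), function(j) key_mat[, j])
  new_order <- do.call(order, key_cols)
  factor(.f, levels = levels(.f)[new_order])
}

myorder <- function(quality, year, weight) {
  c(mean(quality), min(year), median(weight))
}

name    <- c("A", "A", "B", "B", "C", "C", "D", "D")
quality <- c(1, 3, 2, 2, 1, 1, 2, 2)
year    <- c(2001, 2005, 2000, 2003, 2010, 2011, 2000, 2002)
weight  <- c(4, 4, 5, 7, 9, 9, 1, 3)

levels(reorder_by_keys(name, list(quality, year, weight), myorder))
#> [1] "C" "D" "B" "A"
```

C has the lowest mean quality; A, B and D tie at 2, B and D tie again on first year (2000), and median weight puts D before B.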
Will have a go at this for the tidyverse developer day |
I think I ran into a similar problem as I wanted to make a plot where my 'name' variable is ordered first for min of 'value' and then in descending order for 'delta'. Directly nesting
|
As I need this all the time, and it's the oldest open issue in
library(forcats)
fct_reordern <- function(.f, .data, .desc=FALSE, ordered=FALSE) {
  stopifnot(nrow(.data) == length(.f))
  fct_inorder(
    f=
      .f[
        do.call(
          base::order,
          append(
            unname(.data),
            list(method="radix", decreasing=.desc)
          )
        )
      ],
    ordered=ordered
  )
}
mydata <-
  data.frame(
    A=c(3, 3, 2, 1),
    B=c("A", "B", "C", "D"),
    stringsAsFactors=FALSE
  )
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata)
#> [1] D C A B
#> Levels: D C A B
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata, .desc=TRUE)
#> [1] B A C D
#> Levels: B A C D
fct_reordern(.f=c("A", "B", "C", "D"), .data=mydata, .desc=c(FALSE, TRUE))
#> [1] D C B A
#> Levels: D C B A
Created on 2019-11-17 by the reprex package (v0.3.0) |
@billdenney unfortunately I don't follow your explanation; I don't see how this solves the original problem. I also find the implementation rather hard to understand because of the giant nested call inside of |
@hadley, It solves the issue that the factor is ordered by an arbitrary number of variables (given in
Here is an un-nested version of the function that is more in line with
#' @param .f A factor (or character vector)
#' @param .data A tbl with the same number of rows as the length of \code{.f}
#' @param .desc Order in descending order? It may either be a scalar or a
#'   vector with the same length as the number of columns in \code{.data}.
#' @inheritParams fct_inorder
fct_reordern <- function(.f, .data, .desc=FALSE, ordered=NA) {
  stopifnot(nrow(.data) == length(.f))
  stopifnot(length(.desc) %in% c(1, ncol(.data)))
  stopifnot(all(.desc %in% c(TRUE, FALSE)))
  f <- forcats:::check_factor(.f)
  # .data is unnamed so that its names do not clash with named arguments to
  # order(). The radix method is used to support a vector of .desc (other
  # methods only support scalar values for .desc).
  order_args <-
    append(
      unname(.data),
      list(method="radix", decreasing=.desc)
    )
  new_order <- do.call(base::order, order_args)
  f_sorted <- f[new_order]
  fct_inorder(f=f_sorted, ordered=ordered)
} |
I just realized that you were probably getting at something a bit different. Here is a way that you can pass in an arbitrary number of vectors. It has the side benefit that it also indirectly exposes the
library(forcats)
#' @param .f A factor (or character vector)
#' @param ... Arguments passed to \code{base::order()}. (\code{method} may not
#'   be modified, and \code{decreasing} is handled through the \code{.desc}
#'   argument.)
#' @param .desc Order in descending order? It may either be a scalar or a
#'   vector with the same length as the number of vectors passed in
#'   \code{...}.
#' @inheritParams fct_inorder
fct_reordern <- function(.f, ..., .desc=FALSE, ordered=NA) {
  stopifnot(length(.desc) %in% c(1, ...length()))
  stopifnot(all(.desc %in% c(TRUE, FALSE)))
  f <- forcats:::check_factor(.f)
  # The radix method is used to support a vector of .desc (other methods only
  # support scalar values for .desc).
  new_order <- base::order(..., method="radix", decreasing=.desc)
  f_sorted <- f[new_order]
  fct_inorder(f=f_sorted, ordered=ordered)
}
mydata <-
  data.frame(
    A=c(3, 3, 2, 1),
    B=c("A", "B", "C", "D"),
    stringsAsFactors=FALSE
  )
fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B)
#> [1] D C A B
#> Levels: D C A B
fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B, .desc=TRUE)
#> [1] B A C D
#> Levels: B A C D
fct_reordern(.f=c("A", "B", "C", "D"), mydata$A, mydata$B, .desc=c(FALSE, TRUE))
#> [1] D C B A
#> Levels: D C B A
Created on 2019-11-18 by the reprex package (v0.3.0) |
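One thing to watch with the versions above: they return `f[new_order]`, so the elements come back in sorted order rather than in the original row order (visible in the reprex output). That matters if the result is assigned back to a data frame column. A base-R sketch of a variant that changes only the levels; `reorder_levels_by_rows` is a hypothetical name, not part of forcats.

```r
reorder_levels_by_rows <- function(.f, ..., .desc = FALSE) {
  f <- as.character(.f)
  stopifnot(length(.desc) %in% c(1, ...length()))
  # Radix is the order() method that accepts a vector for `decreasing`.
  new_order <- order(..., method = "radix", decreasing = .desc)
  # Levels follow first appearance in the sorted rows; elements stay put.
  factor(f, levels = unique(f[new_order]))
}

res <- reorder_levels_by_rows(
  c("A", "B", "C", "D"),
  c(3, 3, 2, 1),
  c("A", "B", "C", "D"),
  .desc = c(FALSE, TRUE)
)
as.character(res)
#> [1] "A" "B" "C" "D"
levels(res)
#> [1] "D" "C" "B" "A"
```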
Ah, ok, that's starting to make sense to me. I'd suggest dropping the |
But at this point, the overall approach seems reasonable to me, so it's probably easier to move to a PR. |
For plotting and tables, it’s useful to reorder levels of a factor according to other variables, first by one variable, and then by other variables to break any ties. The summary function used for each variable may be different. Example:
I want to order the name factor by 1) average quality, then 2) the first year the product appeared, and then 3) its median weight. If any ties remain, keep the original label order for these ties. In this example, the levels would be ordered ABCD.
To do this reordering, I have to think backwards, reordering by the last tie-breaker first, using either fct_reorder() or reorder():
It would be very convenient to be able to do this in one go, using something like this (I’m dropping the d$ prefix to make the code clearer):
It would be even better if the functions were shown along with the variable names, e.g. something like this (if possible, or perhaps using some form of formula syntax?):
For descending order, perhaps by using a desc() function, like this?
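The "think backwards" workaround described above can be sketched in base R roughly as follows. The data are invented (they are not the example from this issue), and `reorder_once` is a hypothetical helper; it relies on `order()` leaving tied values in their existing order.

```r
# Hypothetical data, two rows per product.
d <- data.frame(
  name    = c("A", "A", "B", "B", "C", "C", "D", "D"),
  quality = c(1, 3, 2, 2, 1, 1, 2, 2),
  year    = c(2001, 2005, 2000, 2003, 2010, 2011, 2000, 2002),
  weight  = c(4, 4, 5, 7, 9, 9, 1, 3)
)

# Reorder levels by a single per-level summary. Ties keep their current
# order, so earlier reorderings survive as tie-breakers.
reorder_once <- function(f, x, fun) {
  s <- tapply(x, f, fun)
  factor(f, levels = levels(f)[order(s)])
}

f <- factor(d$name)
f <- reorder_once(f, d$weight, median)  # last tie-breaker first
f <- reorder_once(f, d$year, min)       # then the second key
f <- reorder_once(f, d$quality, mean)   # primary key last
levels(f)
#> [1] "C" "D" "B" "A"
```

For a descending numeric key, `order(-s)` inside the helper would be one way to do it.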