glmGamPoi scales cubically with the number of design matrix columns #41
Comments
Hey, thanks for documenting this aspect. I am not terribly surprised, though I would be curious whether there is a clever trick to speed this up. I could imagine that there are interesting pointers in the GWAS literature, because they have to fit gigantic linear models all the time :)
From a practical standpoint, if there are many conditions, maybe one could subdivide the dataset (perhaps in an overlapping way, so that the control condition is in every subset) and hope for very similar dispersion estimates? That way, the linear runtime of estimating on all subsets would outperform the cubic runtime on the whole dataset.
Sorry, I'm a bit confused, and that's probably simply because I worked too long today 🙈
Here is an example (using the make_example_dds() helper from the issue code):

```r
library(glmGamPoi)
library(DESeq2)      # assay(), colData(), design() for the DESeqDataSet
library(tidyverse)   # %>%, map(), str_c(), tibble(), dplyr, ggplot2

set.seed(1)
n_genes <- 10
n_replicates <- 3
n_conditions <- 201
dds <- make_example_dds(n_genes, n_replicates, n_conditions)

# full model: all 201 conditions at once
fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds),
              size_factors = rep(1, ncol(dds)))
beta_full <- fit$Beta
overdispersions_full <- fit$overdispersions

# split the samples of conditions 2 to 201 into 10 chunks of 20 conditions each
# and estimate the parameters for each chunk separately, always including the
# control condition, i.e.
#   chunk 1: samples of conditions 1, 2, ..., 21
#   chunk 2: samples of conditions 1, 22, ..., 41
chunk_size <- 20
fit_chunk <- 2:201 %>%
  split(ceiling(seq_along(.) / chunk_size)) %>%
  map(function(conditions){
    dds_chunk <- dds[, dds$condition %in% c("cond1", str_c("cond", conditions))]
    glmGamPoi::glm_gp(assay(dds_chunk), design = design(dds_chunk),
                      col_data = colData(dds_chunk),
                      size_factors = rep(1, ncol(dds_chunk)))
  })

# matrix of Beta estimates: intercept from the first chunk,
# condition coefficients from all chunks
beta_chunk <- cbind(fit_chunk[[1]]$Beta[, 1, drop = FALSE],
                    do.call("cbind", map(fit_chunk, ~ .x$Beta[, -1])))

# overdispersions by averaging the estimates for each gene over all chunks
overdispersions_chunk <- map(fit_chunk, ~ .x$overdispersions) %>%
  do.call("cbind", .) %>%
  rowMeans()

# overdispersions of full vs. chunked model
tibble(
  full = overdispersions_full,
  chunk = overdispersions_chunk
) %>%
  ggplot(aes(full, chunk)) +
  geom_point() +
  geom_abline() +
  labs(title = "overdispersion")

# beta estimates of full vs. chunked model
reshape2::melt(beta_full, value.name = "beta_full") %>%
  left_join(reshape2::melt(beta_chunk, value.name = "beta_chunk"),
            by = c("Var1", "Var2")) %>%
  filter(beta_full > -1e-7) %>%
  ggplot(aes(beta_full, beta_chunk)) +
  geom_point() +
  geom_abline() +
  labs(title = "Beta")
```

The full model takes about 10 seconds on my machine, the chunked version about 0.4 seconds. I followed up on your hint and looked briefly into the GWAS literature. One strategy is to use variable selection to potentially avoid estimating all covariates (see the time and memory requirements section here, and the paper referenced therein).
Hey Frederik, I am very sorry for the late reply, there was too much else going on. You can make the regular function fast by setting
The problem was that I assumed the issue was in estimating the beta coefficients, where I cannot beat the cubic scaling. I did not realize that your design matrix has a convenient form where we know that the coefficients are independent, and we can thus make it scale linearly with the number of coefficients. In fact, estimating the beta coefficients was already fast, but the overdispersion estimate (which I thought scaled linearly) scaled cubically, because calculating the Cox-Reid adjustment (McCarthy et al. 2012) is implemented as a simple matrix multiplication. I could implement a faster version of the Cox-Reid adjustment that uses the known sparsity of the design matrix. However, that would complicate the methods significantly, so I am hesitant. There is a difference between the overdispersion estimates, but it is not too big:
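For intuition only (this is not glmGamPoi's internal code): per gene, the Cox-Reid adjustment of McCarthy et al. (2012) subtracts 0.5 * log det(t(X) %*% W %*% X) from the profile likelihood, where X is the design matrix and W a diagonal weight matrix. Computing that log-determinant densely costs O(p^3) in the number of design columns p, whereas for a pure one-hot condition design the matrix t(X) %*% W %*% X is diagonal and the term reduces to a sum of logs, i.e. linear in p. A rough sketch under that assumption:

```r
# Illustration only, not glmGamPoi's implementation: the per-gene Cox-Reid
# adjustment is 0.5 * log det(t(X) %*% W %*% X), with W = diag(w).
cox_reid_dense <- function(X, w) {
  xtwx <- t(X) %*% (w * X)                                        # p x p Fisher information
  0.5 * as.numeric(determinant(xtwx, logarithm = TRUE)$modulus)   # dense log-det: O(p^3)
}

# For a one-hot design (one 0/1 column per condition, no intercept),
# t(X) %*% W %*% X is diagonal, so the log-determinant is a sum of logs: O(p).
cox_reid_onehot <- function(X, w) {
  0.5 * sum(log(colSums(w * X)))
}

# Tiny check that both versions agree on a one-hot design
X <- model.matrix(~ 0 + factor(rep(1:4, each = 3)))   # 12 samples, 4 conditions
w <- runif(nrow(X), min = 0.5, max = 2)                # made-up GLM weights
c(dense = cox_reid_dense(X, w), onehot = cox_reid_onehot(X, w))
```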
Hi Constantin,
when the number of design matrix columns increases (and thus, for a fixed number of replicates per condition, the number of samples), glmGamPoi scales cubically. Consider the following code, where 3 genes are fitted with 3 replicates per condition and a varying number of conditions:
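A minimal sketch of such a benchmark; the make_example_dds() helper below is only a plausible stand-in, not necessarily its original definition:

```r
# Sketch only: make_example_dds() is a plausible stand-in for the helper
# used in this issue, not necessarily its original definition.
library(glmGamPoi)
library(DESeq2)

make_example_dds <- function(n_genes, n_replicates, n_conditions) {
  condition <- factor(rep(paste0("cond", seq_len(n_conditions)), each = n_replicates))
  counts <- matrix(rnbinom(n_genes * length(condition), mu = 20, size = 1 / 0.1),
                   nrow = n_genes)
  DESeqDataSetFromMatrix(countData = counts,
                         colData = data.frame(condition = condition),
                         design = ~ condition)
}

# time glm_gp() for 3 genes, 3 replicates per condition, increasing number of conditions
timings <- sapply(c(25, 50, 100, 200), function(n_conditions) {
  dds <- make_example_dds(n_genes = 3, n_replicates = 3, n_conditions = n_conditions)
  system.time(
    glm_gp(assay(dds), design = design(dds), col_data = colData(dds),
           size_factors = rep(1, ncol(dds)))
  )["elapsed"]
})
timings   # elapsed times grow much faster than linearly with the number of conditions
```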
Could this be related to the QR decomposition of the design matrix, and therefore not something that can be improved?