Replies: 30 comments 11 replies
-
Hi @chassenr ! Thanks for the question. I'm not quite sure what the sequencing data mean in practice. Could you explain?
-
Hi @JenniNiku Thanks! Cheers,
-
It sounds to me like you want to include an offset to account for the different number of sequences per sample. However, I'm not sure that answers your question. As long as the observation per species in a sample is a count, applying one of the standard count distributions (Poisson, NB, ZIP) shouldn't be a problem.
-
It is more about accounting for the dependence of the counts within a sample on each other, to avoid spurious correlations. Usually I use a centred log-ratio (clr) transformation (e.g. ALDEx2, SPIEC-EASI).
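For readers unfamiliar with the clr transformation mentioned here, a minimal base-R sketch follows. The pseudocount is a simplification: ALDEx2 instead handles zeros via Monte Carlo draws from a Dirichlet, so this is only a rough illustration of the idea, not what those packages actually do.

```r
# Centred log-ratio (clr) transform of a samples-x-taxa count matrix.
# A pseudocount avoids log(0); this is a crude stand-in for the more
# principled zero handling in ALDEx2 / SPIEC-EASI.
clr <- function(counts, pseudocount = 1) {
  logx <- log(counts + pseudocount)
  # Subtract each sample's mean log count (the log of the geometric mean),
  # so every row of the result sums to zero.
  sweep(logx, 1, rowMeans(logx), "-")
}

Y <- matrix(c(10, 30, 60,
               0, 30, 60), nrow = 2, byrow = TRUE)
clr(Y)  # each row sums to zero
```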
-
Would you mind describing your data in detail? At the moment it's not very clear what you're after :).
-
I assume my data would be very similar to the microbial dataset used in the vignette. If no internal standard was used for sequencing, the sequence counts will not reflect actual abundances. I will try to explain the compositionality problem with a simple example.

Sample 1: 3 species (absolute abundance: 10, 30, 60; sequence proportions: 0.1, 0.3, 0.6)

The total number of sequences per sample (the library size or sequencing depth) has no ecological meaning (it is determined by how much DNA you put on the sequencer), but it constrains the sequence counts per OTU, i.e. all OTU counts always have to add up to the library size (100%). In the example this leads to an 'increase' in the proportions of the 2 species in sample 2 because they have to fill up this sequence space. However, this apparent increase is caused by the compositional character of the data and not by a change in actual abundances. Therefore, the counts in an OTU table cannot be used as they are for statistical models, and even taking percentages is insufficient.
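To make the closure effect concrete, here is a small base-R illustration. The numbers for sample 2 are invented for illustration (they are not from the post): the most abundant species declines while the other two stay constant in absolute terms.

```r
# Hypothetical absolute abundances. Sample 2 is invented: species 3
# halves in absolute abundance while species 1 and 2 are unchanged.
abs_abund <- rbind(sample1 = c(10, 30, 60),
                   sample2 = c(10, 30, 30))

# What a sequencer effectively reports: proportions of a fixed total.
props <- abs_abund / rowSums(abs_abund)
round(props, 2)
# Species 1 and 2 did not change in absolute abundance, yet their
# proportions rise in sample 2, purely because each row must sum to 1.
```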
-
The above explanation doesn't clarify much for me, I'm afraid. I don't know much about sequencing or microbial biology (but maybe Jenni knows more?). If you want to model absolute abundances relative to their sequence proportions, an offset should take care of that as far as I can tell. However, in your previous message you wrote that you want to account for dependence between counts of species in a sample. Are you implying that you expect them to be correlated somehow, i.e. in a pseudo-replication kind of way? Alternatively, you can include additional sample-specific intercepts so that the mean abundance of both samples and species is accounted for outside of the ordination, focusing it on composition. Do I understand correctly that you have multiple sequences for each sample? If so, how is that reflected in your data?
-
Hi @chassenr, Have you tried the example analysis from the vignette? I tried to follow the example with my microbial data (ASV table of 32 samples and 15 000 taxa, 10 environmental variables) but calculations took me days and I ended up with errors or nothing in the plots. I assumed it was because of such a large number of ASVs.
-
@kbitenieks, any specific errors you had trouble with? GLLVMs are complex statistical models, and with so many taxa calculations are bound to take a while. However, just having many taxa should not create any errors. Lack of data in some taxa might, though.
-
To @chassenr: so is the interest more in the proportions than in the absolute abundances themselves? For proportion/coverage data, the package has no suitable distribution at the moment. However, we are developing GLLVMs for the beta distribution. I haven't considered the possibility of a Dirichlet at all yet, so I don't know if that would work with these models.

To @kbitenieks: I would recommend starting with a smaller subset of the data where the sparsest columns are left out, and seeing how that works. Also, with large datasets, try to fit a model without calculating standard errors at first, using the argument sd.errors = FALSE, as that part takes a long time for huge datasets.
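Jenni's two suggestions could look roughly like this in code. This is only a sketch: `Y` and `X` stand in for your own count matrix and covariate data, and the sparsity threshold is an arbitrary choice, not a recommendation from the package.

```r
library(gllvm)

# Drop the sparsest taxa first (threshold is arbitrary; adjust to taste):
# keep taxa observed in at least 5 samples.
keep <- colSums(Y > 0) >= 5
Ysub <- Y[, keep]

# Skip standard-error calculation on the first fit; for large datasets
# it is the slow part, and it can be computed later for the final model.
fit <- gllvm(y = Ysub, X = X, family = "negative.binomial",
             num.lv = 2, sd.errors = FALSE)
```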
-
@JenniNiku My point is that sequence counts do not reflect abundance, and are therefore unsuited for statistical ecological models without transformation. As far as I can tell, the data in the vignette are untransformed sequence counts, which are used directly in the GLLVM. Furthermore, the total number of sequences per sample differs between samples, and this difference has no ecological meaning. Given these peculiarities of sequence counts, I question whether it is valid to use such counts in a GLLVM. Or did I miss something in the gllvm package? Introducing a distribution for percentages will also not solve the problem, as sequence percentages are not independent (see compositionality, https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full). This is why the more recent packages for sequence data analysis use log-ratio transformations.

@kbitenieks: Which kind of input data do you use: the full OTU table, rarefied to equal library size, or with rare OTUs filtered out?
-
@chassenr, I would argue that statistical models should be formulated to account for the properties of data, also in the case of sequencing data, so that transformations can be avoided as much as possible (and as much information can be retained in the data as possible). As such, the question to answer is (from my perspective) "what GLLVM formulation accounts for the properties of sequencing data?".
-
As chassenr rightfully pointed out, the sample sizes in an NGS dataset, i.e. the total observations in a sample, have no biological meaning; they are technical artifacts of the sequencing device, a now generally accepted "feature" of NGS data. There are two consequences of this feature: i) the need to normalize for unequal sample sizes, and ii) the need to check whether the resulting data are compositional or not. Traditionally, normalization was done either by scaling to sample sums (relative/proportional data) or by subsampling to an equal sampling size ("rarefying"). Both have their problems and are now discouraged for a number of applications (differential abundance, abundance modelling, correlation analysis), but are probably still fine for beta diversity. Weiss et al. 2017 (Microbiome) is a very good introduction to this topic. However, the problem does not end with adjusting for biologically irrelevant sample sizes. I think the gllvm package would greatly benefit from recommendations on how best to deal with these two problems.
-
Does this reference, "Negative binomial mixed models for analyzing microbiome count data", give any leads? They account for library size by including an offset, as suggested above.
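In gllvm, the offset idea would look roughly like the sketch below, assuming `Y` is the samples-by-taxa count matrix. The log of the row sums (library sizes) enters on the log-link scale, so all remaining terms are interpreted relative to sequencing depth.

```r
library(gllvm)

# Library-size offset on the log link: each taxon's counts are modelled
# relative to the total number of sequences obtained for that sample.
libsize <- rowSums(Y)
fit <- gllvm(y = Y, family = "negative.binomial",
             offset = log(libsize), num.lv = 2)
```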
-
Thanks for the heads up. I think applying offsets is basically the same as using proportional data, isn't it? I remember using this approach for GAMMs with zero-inflated negative binomial distributions on NGS data.
-
@trichterhb I would love to comment on that, but I'm still not completely clear on what you mean by "compositionality". Care to have another try to elaborate?
-
Here is an evaluation of the NB distribution for sequence data: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224909
-
I think the issue is best explained like this: imagine you are measuring soil texture as either loam or silt, in % per unit of soil. Such a dataset only holds relative information: by knowing the loam percentage, you also know the silt percentage, giving you only one degree of freedom. Likewise, the issue with NGS data is that the sample size is ultimately determined by the instrument (which can, e.g., only yield 5 million observations per run). The issues with NGS count data are thus twofold: first, unequal but meaningless sample sizes, for which various normalizations exist (some of which cause compositionality within the sample); second, count totals are fixed by the instrument, which thus intrinsically delivers compositional data. How much weight you give the second point depends on your personal philosophy (no observer in existence can deliver infinite data, and a sequencer delivers many more observations than an eye), but the field of microbial ecology is definitely moving towards adopting this viewpoint.

Since gllvm seems to be a very promising approach, and since an NGS dataset is used as an example if I am not mistaken, it would be extremely helpful to have this method "fit" for recent trends in microbial ecology.
-
I'm a bit late to the game here but, I believe, as @BertvanderVeen pointed out, using an offset and a random or fixed row effect for each sample is in effect modelling the data as a composition, as well as accounting for sequencing-depth differences. But I digress: the property that makes compositional data different is the fact that they live on a simplex. So another set of distributions that could be considered are multivariate categorical distributions. With this in mind, @BertvanderVeen and @JenniNiku, could I make a request to have the NM, GDM and DM distributions considered for the gllvm package? Thanks.

Harrison, J. G., Calder, W. J., Shastry, V., & Buerkle, C. A. (2020). 'Dirichlet-multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data.' Molecular Ecology Resources, 20(2), 481–497. doi: 10.1111/1755-0998.13128.

Kim, J., Zhang, Y., Day, J., & Zhou, H. (2018). 'MGLM: An R Package for Multivariate Categorical Data Analysis.' The R Journal, 10(1), 73–90. doi: 10.32614/rj-2018-015.
-
Thanks for that confirmation, @ch16S. It's an interesting point, though I'm not too familiar with some of the distributions that you mention. Similarly to the binomial distribution, I'm not sure an analytically tractable solution is available for the multinomial distribution, though I might be wrong. Either way, the development of EVA might make that more straightforward. Having said that, at first glance it seems to me that those distributions drop the assumption of independence between observations (being multivariate in nature, that would make sense!). Currently, that assumption is key in the implementation of GLLVMs in the R package, since it allows us to provide (in a relatively straightforward way compared to the alternative) tractable solutions to the marginal log-likelihood of GLLVMs. Not assuming independence (e.g., due to spatially structured residuals) makes the maths a lot more complex (for (E)VA at least)! That's not to say it's not a possibility, I just don't see it happening anytime soon. But @JenniNiku might have a different perspective on that, of course.
-
Interesting points here. I'm not that familiar with all of those distributions, so I would need to get to know them first.
-
Sorry I'm somewhat late to the party here... An appropriate procedure to account for compositionality in a GLLVM is to analyse the raw counts but with a row effect in the model (adding the argument row.eff = "fixed" or row.eff = "random"). This controls for differences in library size so that all remaining terms in the model estimate compositional effects. Bert suggested an offset, which is a similar idea, but it assumes the row effect is known a priori. There are some curly questions about whether the matrix of model coefficients should then be constrained, partitioned into a "main effect" and remaining compositional terms, but that consideration is more about parameterisation of the model than anything else, and is secondary to the concern of which model to actually fit!
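In code, this suggestion is a one-argument change. A sketch, with `Y` standing in for the OTU/ASV count matrix:

```r
library(gllvm)

# Row effects absorb per-sample differences in library size, so the
# latent variables and coefficients describe composition, not totals.
fit_fixed  <- gllvm(y = Y, family = "negative.binomial", row.eff = "fixed")
fit_random <- gllvm(y = Y, family = "negative.binomial", row.eff = "random")
```

The random version treats the sample intercepts as draws from a common distribution, which is usually cheaper in parameters when there are many samples.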
-
Indeed, and that was my second suggestion: a row intercept. Thanks for contributing to the discussion, David.
-
Out of curiosity, are there circumstances where you would use both a row offset and a row effect? And in terms of the curly questions about the row effect, is there an example of code that partitions the coefficients into a main effect and compositional terms?
-
Row offset and row effect: I guess so; the offset would account for effects we know about, while the row effect picks up residual effects not captured by the offset. Code to partition: well, I guess I've done something relevant to this in mvabund::manyglm(..., comp=TRUE). Although the start of this discussion was about sequencing data, for which this particular function would not be very practical (at least, not without recoding it from scratch; comp=TRUE co-opts code written for a different purpose, and it does not come across gracefully).
-
@BertvanderVeen asked for future posts related to this post here, so here goes. Regarding the comment "the fact that the number of sequences obtained is capped/fixed creates the main issues associated with compositionality sensu Gloor": if the library size is capped or fixed, that doesn't mean you can't use row effects to estimate it; on the contrary, that is a reason to include row effects. We want to condition on library size, and if you include a row-effect term in the model for library size, then everything else is interpreted conditional on that term, hence relative to library size.
-
Regarding zero-inflated distributions "my perceived need of a hurdle model to deal with the presence-absence and abundance portions of my dataset separately"...
-
Wow, thanks @dwarton and @BertvanderVeen! This discussion has been incredibly helpful! Related to some of the great points made by @trichterhb, it also seemed to me that the additive log-ratio/centred log-ratio transformations suggested by Gloor were primarily what was needed to address compositionality problems with NGS microbiome data. Given this, I'm having trouble understanding how row-effect approaches fully account for both library-size effects and compositionality, instead of just the effects of library size (the latter matches my expectations better, and is similar to the position in the Nov 5, 2020 post in this discussion by @trichterhb). That is, the compositionality issue described by Gloor is separate from the issue that library size/sequencing depth differs between samples: even if I magically got the same number of sequences for each sample, I would still have the compositionality issues Gloor describes, because the number of sequences obtained from the sequencer is fixed. (Re-linking the paper @trichterhb linked, just for consistency.)
-
While I'll definitely include a library-size row effect in my model for the reasons described by @dwarton, it seems like I could also try to include a term that represents a fixed feature in each of my samples. In my case, I added a small, defined amount of DNA with a known sequence (called a spike-in) to each of my samples. Because this added amount of spike-in is the same for each sample, I can use the counts of this spike-in after sequencing as a reference for the counts of each real taxon I sequenced. Gloor would suggest that I use this reference in a transformation (specifically the additive log ratio, followed by distance-based analyses), but I'd like to have the non-distance-based benefits of gllvm. Instead of applying a transformation, perhaps it would be better to continue to leverage the features of gllvm by including the counts of the spike-in sequence as another term to condition on? This way, I won't have to use a hurdle model and can instead use a count-appropriate distribution, which is probably best for data originally generated as counts (i.e., probably best to avoid transformations). But I guess it's not apparent to me that this approach would have the same effect as the transformations...
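One way the spike-in idea could be sketched is as an offset rather than a transformation; whether this has the same statistical behaviour as Gloor's additive log ratio is exactly the open question above. Everything here is hypothetical: `Y` is the count matrix of real taxa and `spike` the per-sample counts of the spike-in sequence.

```r
library(gllvm)

# Hypothetical sketch: because the same amount of spike-in DNA was added
# to every sample, log(spike) acts as a per-sample reference. With it as
# an offset, each taxon's counts are modelled relative to the spike-in,
# i.e. on an (approximately) absolute rather than compositional scale.
fit <- gllvm(y = Y, family = "negative.binomial",
             offset = log(spike), num.lv = 2)
```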
-
Is there a need for a publication in the space of model-based ordination for compositional data?
-
Hi @JenniNiku ,
a colleague recommended your R package for GLLVMs and I am very curious how you applied this approach to microbial data, specifically sequencing data. As sequencing data are compositional, I was wondering how you accounted for this in your approach, since the typical models used for species count data (e.g. Poisson, NB) are not necessarily suitable for sequencing data?
Thanks!
Cheers,
Christiane