This project analyzes the evolution of sex conflict in new and old genes of Drosophila melanogaster. We will model offspring distributions in knock-down lines of non-essential genes. The project consists of egg count data and Poisson or negative binomial regressions fitted with RStan. Posterior estimates of gene knock-down effects in somatic and germline cells, male and female parentals, or cross combinations will be compared to assess intrahost conflict or sex conflict, respectively, and correlated with gene age and function.
We want to identify genes non-essential for development that have apparent conflicting effects on fertility. This conflict would arise as differing fertility effects between somatic and germline expression or between male and female expression of a candidate gene. We also want to identify if adaptive and non-adaptive genes (for males, females, or both) correlate with gene age, chromosome location, and expression pattern.
To address these questions, we need to estimate the effects of RNAi knock-down experiments on fertility. Using UAS RNAi, we can knock down gene expression in the germline using nos GAL4, in the soma using Act05 GAL4, and in both using Tubp GAL4.
We want to categorize the mean and shape of the offspring distribution for these lines compared to driver controls. To do so, we'll parameterize statistical models with offspring count data. The formulation of these candidate models will differ to account for various biological assumptions, including the dispersion of the offspring distribution, the source of zero-count data, and sex-specific interactions with those parameters. The best candidate model will be identified using formal model selection to see which formulation offers the most explanatory power for our data.
From the best model, we will take the parameters that estimate the effects of our reverse genetics experiments on the offspring distribution. These parameters will be used in our analyses relating gene age, chromosome location, and expression pattern to fertility effects, because we will treat them as the ``true" values of each gene's impact on fertility.
The simplest realistic model we can generate for a single gene knock down experiment assumes that the offspring count
where each
The base model is reasonable as a single-gene model, but it is so simple compared to our data and later elaborations that fitting it would be pointless: when confronted with data and compared to other models, we know it will lose. Therefore, the simplest model that we'll move forward with is one where the knock-down effects are not only gene-specific and driver-specific but also sex-specific. We will also allow the effect of each driver to be sex-specific and genetic background-specific. Our simplest model then is
and any models that we compare to this simplest model will be more elaborate.
We can allow for overdispersion, where an excess of high offspring counts occurs and the variance exceeds what a Poisson distribution predicts, by changing the distribution type and adding an extra parameter. Negative binomial distributions are well suited for this kind of data and introduce the dispersion parameter
We can also say that overdispersion
There are more zero counts than expected in many of the crosses, so we need to account for them in some way. Two common approaches we can compare are zero inflation and a zero hurdle.
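A quick way to see whether zeroes are in excess is to compare the observed fraction of zero counts with the fraction a Poisson distribution with the same mean would predict, which is exp(-mean). The sketch below is only an illustration: the function names and the example counts are made up here and are not part of the project's data or code.

```python
import math

def observed_zero_fraction(counts):
    """Fraction of vials (observations) with zero offspring."""
    return sum(1 for c in counts if c == 0) / len(counts)

def poisson_zero_fraction(counts):
    """Zero fraction predicted by a Poisson with the sample mean: exp(-mean)."""
    mean = sum(counts) / len(counts)
    return math.exp(-mean)

# Hypothetical offspring counts, invented for illustration only.
example_counts = [0, 0, 0, 12, 15, 9, 0, 20, 11, 0]

obs = observed_zero_fraction(example_counts)
exp_zero = poisson_zero_fraction(example_counts)
print(f"observed zero fraction: {obs:.3f}")
print(f"Poisson-expected zero fraction: {exp_zero:.5f}")
```

When the observed fraction is far above the Poisson expectation, as in this made-up example, a plain Poisson model is unlikely to describe the zeroes well. A negative binomial can absorb some excess zeroes through overdispersion, so this check motivates but does not replace formal model comparison.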
Zero-inflation is a mixture distribution that depends on a probability
The probability of observing a zero comes from the model offspring distribution with probability
Zero-hurdle is similar to zero-inflation, but the zeroes all come from a factor external to the distribution, and the offspring count distribution is truncated to exclude zero. We adjust for this truncation by dividing the offspring distribution by the probability of observing a non-zero value, i.e. one minus the probability of a zero under the untruncated distribution.
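The difference between the two approaches can be sketched with their probability mass functions. This is a minimal illustration under stated assumptions: it uses a Poisson as the offspring distribution (a negative binomial would slot in the same way), it takes `theta` to be the probability of an externally caused zero, and the function names are hypothetical rather than taken from the project's Stan code.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zero_inflated_pmf(k, lam, theta):
    """Mixture: with probability theta the count is an extra zero,
    otherwise it is drawn from the Poisson (which can also give zero)."""
    base = (1.0 - theta) * poisson_pmf(k, lam)
    return theta + base if k == 0 else base

def hurdle_pmf(k, lam, theta):
    """Hurdle: all zeroes occur with probability theta; positive counts
    follow a Poisson truncated at zero, i.e. divided by 1 - P(0)."""
    if k == 0:
        return theta
    return (1.0 - theta) * poisson_pmf(k, lam) / (1.0 - poisson_pmf(0, lam))

# Assumed illustrative values, not estimates from the data.
lam, theta = 5.0, 0.3
print("P(0), zero-inflated:", zero_inflated_pmf(0, lam, theta))
print("P(0), hurdle:       ", hurdle_pmf(0, lam, theta))
```

In the zero-inflated form the probability of a zero is larger than `theta`, because the count distribution contributes zeroes of its own. In the hurdle form it is exactly `theta`, which makes the zero process and the positive counts separable.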
In both of these models, we can make multiple values of
The only observed condition in the vial considered to possibly affect the offspring distribution is early mortality of parentals, which we can quickly add by adding a factor
We observed effects of the experiment's start time (time of year). We can simply include various coefficients for either a particular start day where each start day has some value
We'll parameterize the models above using Markov chain Monte Carlo (MCMC) in the RStan package. The resulting ``chains" will each be samples from the posterior distributions of our biological parameters, given our priors and likelihood functions.
Model selection helps to determine the quality of fits made using Bayesian inference and to rank models by the accuracy of their predictions. The best models will be chosen based on the mean and variance of the likelihood scores from our posterior parameter sets, using the ``leave-one-out" cross-validation information criterion (LOO-IC). This criterion penalizes worse and more variable likelihood scores.
We can start with a quick list of the 24 simplest models, which cover every combination of overdispersion (none, sex-agnostic, sex-specific), excess zeroes (zero-hurdle, zero-inflation), parental mortality (included, excluded), and sex-specific mean adjustment (included, excluded). From that list of