n/p ratio clarification #64

Open
samgregoire opened this issue Jun 17, 2024 · 1 comment

samgregoire commented Jun 17, 2024

I ran scpModelWorkflow() on a small SingleCellExperiment object (I only have 20 cells).
The scpModelFilterPlot() output looks like this:

[Image: scpModelFilterPlot() output]

I'm not surprised that I only have a few estimated features as I only have a few cells/observations. However, I'm puzzled by two things:

  • Why is the bar corresponding to features with an n/p ratio of 1 colored as "inestimable"? According to the legend (and what I checked), features with an n/p ratio >= 1 are considered to be estimated.

  • How can I have features with an n/p ratio of 0?
    I thought that n could never be equal to 0 and checked that this was the case.

summary(sapply(metadata(sce)$model@scpModelFitList, "slot", "n"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   3.000   3.695   5.000  21.000 

Indeed, the n/p ratio is never less than 0.5:

np <- 
  sapply(metadata(sce)$model@scpModelFitList, "slot", "n") /
  sapply(metadata(sce)$model@scpModelFitList, "slot", "p")
summary(np)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.5000  0.6667  1.1250     Inf     Inf     Inf 

However, I was surprised to see that a large number of the n/p ratios were infinite, which means that p is equal to 0.

summary(sapply(metadata(sce)$model@scpModelFitList, "slot", "p"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   5.000   3.905   7.000  14.000

On further investigation, I found out that this happens whenever there are only 1 or 2 observations for a specific feature:

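# Indices of the features whose fitted model has p == 0
p0 <- which(sapply(metadata(sce)$model@scpModelFitList, "slot", "p") == 0)

# Number of non-missing observations for each of those features
nobs_p0 <- rep(NA, length(p0))

for(i in seq_along(p0)) {
     nobs_p0[i] <- nrow(colData(sce)[!is.na(assay(sce)[p0[i], ]), ])
}

# Indices of the features with at most 2 observations, compared with p0
nobs <- rowSums(!is.na(assay(sce)))
obs_2 <- which(nobs <= 2)
table(obs_2 == p0)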

TRUE 
3409 

I assume that the 3409 features with an infinite n/p ratio are plotted as 0 in the plot.
Why do the features with 2 observations always have p equal to 0? I suppose it's not that important, since features with only 2 observations are not very informative in bigger datasets.
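
For illustration, here is a minimal sketch of how the infinite ratios can be separated from the finite ones before summarising. The vectors `n` and `p` hold made-up toy values, not the values from this dataset; the only R behaviour relied on is that dividing a positive number by zero returns `Inf` rather than an error.

```r
# Toy values for illustration only (not taken from the dataset in this issue).
n <- c(1, 2, 3, 5, 21)   # hypothetical n per feature
p <- c(0, 0, 5, 7, 14)   # hypothetical p per feature

# Dividing a positive number by zero yields Inf in R, not an error.
np <- n / p

# Flag the features for which the model has no parameter.
has_p0 <- p == 0

# Summarise the ratio only for features with at least one parameter,
# so that the mean and upper quartiles are no longer Inf.
summary(np[!has_p0])

# Count the features that have p == 0.
sum(has_p0)
```

Summarising the two groups separately keeps the `Inf` values from dominating the mean and the upper quartiles.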

@cvanderaa
Member

Hi Sam,
Thanks for pointing out these inconsistencies.

  • Regarding your first point, I will fix this. The legend and docs are right, but the plot is misleading. The cause is a wrong assignment of the edge cases when I cut the histogram into estimable and non-estimable features.

  • Regarding your second point, you did a great job investigating! Indeed, the issue you are raising lies within these lines:

    scp/R/ScpModel-Workflow.R

    Lines 213 to 217 in 5e094c6

    if (nrow(coldata) <= 2) {
        out <- matrix(nrow = nrow(coldata), ncol = 0)
        attr(out, "levels") <- List()
        return(out)
    }
    I did this intentionally because, IMHO, there is no use in modelling data with 2 or fewer data points. Hence I generate an empty model matrix, hence p = 0, hence the feature is ignored. I'm open to discussing whether this needs smarter handling.
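
To make the link between the empty model matrix and the infinite ratio concrete, here is a small base R sketch. It assumes, as the discussion above suggests but the quoted lines do not show, that n is the number of rows and p the number of columns of the model matrix. The `coldata` table is a hypothetical stand-in, and the `List()` call from the quoted code is left out so the example runs without extra packages.

```r
# Hypothetical annotations for a feature observed in only 2 cells.
coldata <- data.frame(batch = c("A", "B"))

# Same shape as the early return quoted above:
# one row per observation, zero columns.
design <- matrix(nrow = nrow(coldata), ncol = 0)

# Assumption: n counts the rows and p counts the columns of the model matrix.
n <- nrow(design)
p <- ncol(design)

# With p equal to 0, the ratio is Inf rather than a value below 1.
ratio <- n / p
is.infinite(ratio)
```

Under that assumption, every feature with 1 or 2 observations ends up with p = 0 and an infinite n/p ratio, which matches the 3409 features counted above.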
