Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Most cells are unassigned and a few of them do not match the filters #99

Open
txemaheredia opened this issue Apr 16, 2024 · 7 comments
Open

Comments

@txemaheredia
Copy link

Hi,

first of all, thank you for this tool. It has really saved our ass in a different experiment where HTO-based demultiplexing failed. Thank you a lot.

I am running now vireo on two 10x sequencing runs, each containing 4 samples (mouse, littermates with WT/mutant genotypes, 3F/1M or 1F/3M in each run).

I created a VCF file from the single cell sequencing on each of the samples with:

samtools view \
                -@ 8 \
                --write-index \
                -D "${barcode_tag}":<(zcat "${barcodes_tsv_filename}") \
                -o "${output_bam_filename}" \
                "${input_bam_filename}"

cellsnp-lite -s $BAM -O $CELLSNP_OUTDIR -p 14 --minMAF 0.1 --minCOUNT 20 --cellTAG None --UMItag UB --maxDEPTH 500000 --gzip --chrom 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,MT,X,Y


#1a - 2nd step - genotyping
cellsnp-lite -s $BAM -b $BARCODES -O $CELLSNP_GENO_OUTDIR -R $CELLSNP_OUTDIR/cellSNP.base.vcf.gz -p 14 --minMAF 0.1 --minCOUNT 20 --gzip --chrom 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,MT,X,Y

This resulted in 60,797 SNV for the first sequencing run (25,647 cells) and 106,960 for the second one (19,207 cells).

Then I run vireo using the same command I had success with in a previous analysis:

N=2
vireo -c $CELLSNP_GENO_OUTDIR -N $N -o $VIREO_OUTDIR

However, the results I got have an overwhelming amount of "unassigned" samples:

vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509690.7, -3508383.5, -3506645.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.279 0.468]]
[[1676287.6 1369376.8 4033486.6]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6456	6757	6349	6085
[vireo] final donor size:
donor0	donor1	donor2	donor3	doublet	unassigned
2204	2358	2216	2013	179	16677
[vireo] All done: 6 min 23.1 sec

Looking a other posts here I also tried to run it with --callAmbientRNAs :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509676.9, -3508213.7, -3507234.5]
[vireo] allelic rate mean and concentrations:
[[0.117 0.281 0.467]]
[[1684690.7 1350985.6 4043474.7]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6631	6408	6236	6371
[vireo] 8151 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 2478.3 sec
[vireo] final donor size:
donor0	donor1	donor2	donor3	doublet	unassigned
2253	2160	2018	2106	169	16941
[vireo] All done: 52 min 13.5 sec

with --callAmbientRNAs -M 200 :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient_nInit200.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509645.1, -3508161.5, -3507201.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.275 0.468]]
[[1661061.8 1355583.4 4062505.7]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6225	6329	6656	6437
[vireo] 8145 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 402.3 sec
[vireo] final donor size:
donor0	donor1	donor2	donor3	doublet	unassigned
2084	2156	2264	2205	168	16770
[vireo] All done: 15 min 6.0 sec

And with --callAmbientRNAs -M 1000 :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient_nInit1000.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509715.5, -3508234.1, -3506775.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.281 0.468]]
[[1686010.2 1369964.7 4023176.1]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6414	6314	6523	6396
[vireo] 8155 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 399.7 sec
[vireo] final donor size:
donor0	donor1	donor2	donor3	doublet	unassigned
2120	2140	2208	2174	163	16842
[vireo] All done: 44 min 58.3 sec

with identical results.

I also explored the results of the first run in R and I see the following distribution:

vireo_1

vireo_2

With most unassigned cells having moderate values of prob_max and prob_doublet < 0.25

Also, the values for the unassigned cells overlap those of donor-assigned cells:


> donor_min_prob_max = min(m[!m$donor_id %in% c("unassigned","doublet"),]$prob_max)
> donor_min_prob_max
[1] 0.9
> dim( m[m$donor_id=="unassigned" & m$prob_max >= donor_min_prob_max ,])
[1] 26  8
> 
> donor_max_LLR = max(m[!m$donor_id %in% c("unassigned","doublet"),]$doublet_logLikRatio)
> donor_max_LLR
[1] -0.734
> dim( m[m$donor_id=="unassigned" & m$doublet_logLikRatio <=donor_max_LLR ,])
[1] 5229    8

26 unassigned cells have prob_max = 0.9, which, after looking into the code (io_utils.py), I don't understand how were they deemed "unassigned" because they all have n_vars > 10.

It is possible that these cells "suffered" a lot during the processing and there is a lot of ambient RNA floating around. How should I deal with all this? Should I just use a lower prob_max threshold and maybe use a different doublet finder software down the line?

PS: running vireo with --noDoublet leaves 4,726 unassigned cells.
Oh, and in the previous runs, 6,371 unassigned cells have their best_singlet not in their own best_doublet list.

@huangyh09
Copy link
Collaborator

Hi, thanks for sending the detailed info (& trust).

The first thing I noticed is the unusual allele rate: [0.117 0.281 0.468], instead of around [0, 0.5, 1]. After reading your commands, one thought came into my mind:

  • when you performed the genotyping with cellsnp-lite, it didn't know which allele is ALT or REF, so it will assign the one with higher frequency as RAF. In your case, for each SNP, you may only have two genotypes 0/0 or 0/1, but rarely also with 1/1.
  • if the genotypes of most variants are only two instead of three, it violates the assumption of Vireo, making the whole results not trustable. Unfortunately, Vireo doesn't support a two-genotype setting (yet).
  • less importantly, there might be a small proportion of variants with 0/1 for all 4 donors, and it's hard to filter.

To fix it, you can try randomly selecting half of the variants and flipping the ALT and REF alleles, for example by

  • option 1: edit the VCF file by changing ALT and REF and re-run cellsnp-lite
  • option 2: simply manipulate the AD matrix (making AD_new[idx_to_flip, :] = DP[idx_to_flip, :] - AD[idx_to_flip, :])

Hope this may help you.

Yuanhua

@txemaheredia
Copy link
Author

Thanks for the suggestions.

I tried a (crude version of) what you suggested flipping the ALT/REF for half the variants and got similarly bad results with a flipped allele rate:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-1900092.9, -1899299.3, -1898714.2]
[vireo] allelic rate mean and concentrations:
[[0.551 0.858 1.   ]]
[[2305018.1 1158633.8 3615499.2]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6444    6317    6217    6669
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
1187    1165    1126    1220    35      20914
[vireo] All done: 6 min 10.8 sec

I have delved a bit into this, and it seems that these samples are first generation hybrids between C57BL6 and FVB mice strains. That means that vireo's underlying allele rate assumption will never be true for these samples.

I tried downloading the strain-specific VCFs from the Mouse Genomes Project and limit the analysis to either the merged SNV set, or the intersection SNV set between both strains.

Merged:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 28266 variants.
[vireo] lower bound ranges [-2385086.2, -2384185.6, -2382473.9]
[vireo] allelic rate mean and concentrations:
[[0.243 0.509 0.693]]
[[ 686241.5 3465441.8  501250.7]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6267	6582	6476	6321
[vireo] final donor size:
donor0	donor1	donor2	donor3	doublet	unassigned
1549	1659	1635	1542	182	19080
[vireo] All done: 3 min 41.6 sec

Decent amount of SNV, bad allelic rates, poor classification.

Intersect:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 280 variants.
[vireo] lower bound ranges [-28465.8, -26967.1, -26680.3]
[vireo] allelic rate mean and concentrations:
[[0.001 0.307 0.897]]
[[ 3680.8 46416.6  4968.6]]
[vireo] donor size before removing doublets:
donor0	donor1	donor2	donor3
6431	6391	6392	6432
[vireo] final donor size:
donor0	donor1	donor2	donor3	unassigned
12	19	20	20	25576
[vireo] All done: 0 min 8.8 sec

Extremely low number of SNV, better allelic ratios, no donor classification power.

Can you think of any way to make vireo work with these kind of samples? Otherwise, do you know of any other similar tool that could make them work?

Thank you very much.

@huangyh09
Copy link
Collaborator

For your last strategy, maybe you can double-check with a similar one used in this paper:

Demultiplexing of 10x Data
Genotyping information for the C3H_HeJ, CAST_EiJ and C57BL_6NJ mouse strains were extracted from the Mouse Genome Project (Keane et al., 2011) dataset. The SNPs were filtered to identify those which w ere heterozygous in at least one of the three strains (25.7 million in total). These were used as candidates to genotype all of the cells in each pool using cellSNP v0.1.7 (Huang et al., 2019), parameters “–minMAF 0.1–minCOUNT 20.” 76,000 to 111,000 informative SNPs were obtained from the pooled scRNA-seq data, these were utilized further in Vireo v0.2.2 (Huang et al., 2019) in the genotype reference free mode with parameters “-N 4 -M 100” to de multiplex the pools. The estimated genotypes for these strains were mapped back to the three known genotypes from the Mouse Genome Project to link the cell lines to their parental mouse strain.

@txemaheredia
Copy link
Author

I've just tried that:

bcftools view  -i 'GT="het"' ${invcf} | bgzip > $outvcf
$ zcat merged_heterozygous_FVB_C57BL6NJ.vcf.gz | grep -v "^#" | wc -l
573115
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 886 variants.
[vireo] lower bound ranges [-72189.6, -70781.1, -69979.7]
[vireo] allelic rate mean and concentrations:
[[0.018 0.33  0.918]]
[[ 18630.4 107414.   10785.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6463    6483    6323    6378
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
87      77      82      54      4       25343
[vireo] All done: 0 min 13.1 sec

Similarly to using only the "intersected" SNV, using these variants give good allelic ratios, but they are simply not enough of them to classify donors.

vireo_1_heterozygous

vireo_2_heterozygous

> m %>% 
     group_by(donor_id) %>% 
     summarize(min = min(n_vars), 
                         mean = mean(n_vars),
                         median = median(n_vars), 
                         max = max(n_vars))
# A tibble: 6 × 5
  donor_id     min  mean median   max
  <chr>      <int> <dbl>  <dbl> <int>
1 donor0        10 20.1    17      46
2 donor1        10 21.9    20      58
3 donor2        10 19.5    17      54
4 donor3        10 22.1    20.5    48
5 doublet       29 36.8    39      40
6 unassigned     0  3.85    3      61

@yuantiaotiao
Copy link

I've just tried that:

bcftools view  -i 'GT="het"' ${invcf} | bgzip > $outvcf
$ zcat merged_heterozygous_FVB_C57BL6NJ.vcf.gz | grep -v "^#" | wc -l
573115
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 886 variants.
[vireo] lower bound ranges [-72189.6, -70781.1, -69979.7]
[vireo] allelic rate mean and concentrations:
[[0.018 0.33  0.918]]
[[ 18630.4 107414.   10785.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6463    6483    6323    6378
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
87      77      82      54      4       25343
[vireo] All done: 0 min 13.1 sec

Similarly to using only the "intersected" SNV, using these variants give good allelic ratios, but they are simply not enough of them to classify donors.

vireo_1_heterozygous

vireo_2_heterozygous

> m %>% 
     group_by(donor_id) %>% 
     summarize(min = min(n_vars), 
                         mean = mean(n_vars),
                         median = median(n_vars), 
                         max = max(n_vars))
# A tibble: 6 × 5
  donor_id     min  mean median   max
  <chr>      <int> <dbl>  <dbl> <int>
1 donor0        10 20.1    17      46
2 donor1        10 21.9    20      58
3 donor2        10 19.5    17      54
4 donor3        10 22.1    20.5    48
5 doublet       29 36.8    39      40
6 unassigned     0  3.85    3      61

Hello, I have also encountered this problem. Have you solved it

@txemaheredia
Copy link
Author

No. We ended up considering these samples a lost cause and we threw them away.

We focused our analysis on a different set of samples that had a "better genetic background" and were only a mixture of 2 animals. We were able to use vireo + souporcell + sex gene information to demultiplex those samples.

@huangyh09
Copy link
Collaborator

No. We ended up considering these samples a lost cause and we threw them away.

Sorry to hear this. If you want to share this data (email me the link [email protected]), I may give it a try when I have time and see if there is anything we can help.

Yuanhua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants