From 943393ea822efe7eb267552f48697269a66de3b8 Mon Sep 17 00:00:00 2001 From: Xichen Wu Date: Thu, 19 Oct 2023 22:49:08 +0200 Subject: [PATCH] restructure the docs --- docs/source/general.md | 150 +++++++++ docs/source/{usage.md => genetic.md} | 436 ++++++++------------------- docs/source/hashing.md | 284 +++++++++++++++++ docs/source/index.md | 3 + docs/source/output.md | 322 -------------------- docs/source/rescue.md | 47 +++ 6 files changed, 614 insertions(+), 628 deletions(-) create mode 100644 docs/source/general.md rename docs/source/{usage.md => genetic.md} (66%) create mode 100644 docs/source/hashing.md delete mode 100644 docs/source/output.md create mode 100644 docs/source/rescue.md diff --git a/docs/source/general.md b/docs/source/general.md new file mode 100644 index 0000000..a6260c5 --- /dev/null +++ b/docs/source/general.md @@ -0,0 +1,150 @@ +# General + +## **Pipeline overview:** +The mode of the pipeline is set by `params.mode`. hadge provides 4 modes in total: genetic, hashing, rescue or donor_match. +- genetic: Genetics-based deconvolution workflow (check [](genetic)) +- hashing: Hashing-based deconvolution workflow (check [](hashing)) +- rescue: genetic + hashing + donor matching (check [](rescue)) +- donor_match: donor matching (check [](rescue)) + +## **Pipeline configuration** + +### Conda environments: + +We provide a `environment.yml` file for each process. But you can also use local Conda environments to run a process: + +``` +// dont forget to enable conda +conda.enable = true +process { + // Use Conda environment files + withName:scSplit { + conda = './conda/scsplit.yml' + } + // Use Conda package names + withName:cellSNP { + conda = 'bioconda::cellsnp-lite' + } + // Use existing Conda environments + withName:summary { + conda = '/path/to/an/existing/env/directory' + } +} + +``` + +### Containers: + +Nextflow also supports a variety of container runtimes, e.g. Docker. To specify a different Docker image for each process: + +``` +process { + withName:foo { + container = 'image_name_1' + } + withName:bar { + container = 'image_name_2' + } +} +// do not forget to enable docker + +docker.enabled = true + +``` + +### Executor and resource specifications: + +- The pipeline can be run either locally or on an HPC. You can set the executor by running the pipeline with `-profile standard` or `-profile cluster`. Of course, you can add other profiles if you want. +- Feel free to add other configurations, e.g. the number of CPUS, the memory allocation, etc. If you are new to Nextflow framework, please visit the [Nextlfow page](https://www.nextflow.io/docs/latest/config.html#). +- As default, the pipeline is run locally with the standard profile, where all processes annotated with the big_mem label are assigned 4 cpus and 16 Gb of memory. + +``` +profiles{ + standard { + process { + executor = 'local' + withLabel: big_mem { + cpus = 4 + memory = 16.GB + } + withLabel: small_mem { + cpus = 2 + memory = 8.GB + } + } + + } + + cluster { + process { + executor = 'slurm' + // queue = ... + withLabel: big_mem { + cpus = 32 + memory = 64.GB + } + withLabel: small_mem { + cpus = 16 + memory = 32.GB + } + } + } +} + +``` +## **Advanced usecases** + +### **Running on multiple samples** + +The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. 
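For illustration, a minimal sample sheet for two samples might look like the sketch below (a sketch only: paths and sample names are placeholders, a comma-separated layout is assumed, and only the columns needed by the methods you plan to run have to be present; the full set of supported columns is listed in the next paragraph):

```
sampleId,rna_matrix_filtered,hto_matrix_filtered,bam,bam_index,barcodes,nsample
sampleA,/data/sampleA/filtered_rna,/data/sampleA/filtered_hto,/data/sampleA/possorted.bam,/data/sampleA/possorted.bam.bai,/data/sampleA/barcodes.tsv,4
sampleB,/data/sampleB/filtered_rna,/data/sampleB/filtered_hto,/data/sampleB/possorted.bam,/data/sampleB/possorted.bam.bai,/data/sampleB/barcodes.tsv,4
```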
Along with the input data, the sample sheet should contain an additional column with a unique sample ID assigned to each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. However, there is a distinction between running on a single sample and running on multiple samples: when processing multiple samples, the pipeline only permits a single value for each process parameter, whereas for a single sample, multiple values separated by commas are allowed. Depending on the methods you want to run, the sample sheet should contain e.g. the following columns: + +- sampleId +- rna_matrix_raw +- rna_matrix_filtered +- hto_matrix_raw +- hto_matrix_filtered +- bam +- bam_index +- barcodes +- nsample +- celldata +- vcf_mixed +- vcf_donor + +### **scverse compatibility** + +To ensure scverse compatibility, the pipeline can generate an AnnData or MuData object after demultiplexing, as specified by `params.generate_anndata` and `params.generate_mudata`. This object contains the scRNA-seq counts from `params.rna_matrix_filtered` and stores the assignment of each demultiplexing method in the `assignment` column of `obs`. Additionally, if `match_donor` is True, the pipeline also produces an AnnData object containing the assignment of the best-matched method pair after donor matching. + +## **Pipeline output** +The output directory of the pipeline is set by `$params.outdir`. By default, the pipeline is run on a single sample; in this case, all pipeline output is saved in the folder `$projectDir/$params.outdir/$params.mode`. When running the pipeline on multiple samples, the pipeline output is found in the folder `$projectDir/$params.outdir/$sampleId/$params.mode`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on. + +The demultiplexing workflow saves its output in `$pipeline_output_folder/[gene/hash]_demulti`. The pipeline also generates some TSV files to summarize the results in the folder `[gene/hash]_summary` under this directory. + +- `[method]_classification.csv`: classification of all trials for a given method + | Barcode | multiseq_1 | multiseq_2 | ... | + |:---------: |:----------: |:----------: |:---: | + | barcode-1 | singlet | singlet | ... | + | barcode-2 | doublet | negative | ... | + | ... | ... | ... | ... | +- `[method]_assignment.csv`: assignment of all trials for a given method + | Barcode | multiseq_1 | multiseq_2 | ... | + |:---------: |:----------: |:----------: |:---: | + | barcode-1 | donor-1 | donor-2 | ... | + | barcode-2 | doublet | negative | ... | + | ... | ... | ... | ... | +- `[method]_params.csv`: specified parameters of all trials for a given method + | Argument | Value | + | :---------: | :----------: | + | seuratObjectPath | Path | + | quantile | 0.7 | + | ... | ... | +- `[workflow]_classification_all.csv`: classification of all trials across different methods + | Barcode | multiseq_1 | htodemux_1 | ... | + |:---------: |:----------: |:----------: |:---: | + | ... | ... | ... | ... | +- `[workflow]_assignment_all.csv`: assignment of all trials across different methods + | Barcode | multiseq_1 | htodemux_1 | ... | + |:---------: |:----------: |:----------: |:---: | + | ... | ... | ... | ... | +- `adata` folder: stores an AnnData object with the filtered scRNA-seq read counts and the assignment of each deconvolution method if `params.generate_anndata` is `True`; see the section "scverse compatibility" above and the loading sketch below.
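The generated object can be inspected with standard scverse tooling. Below is a minimal loading sketch, assuming the `anndata` Python package is available; the file path and file name inside the `adata` folder are placeholders and depend on your run:

```python
import anndata as ad

# Placeholder path: adjust to the adata folder of the workflow you ran
adata = ad.read_h5ad("out/genetic/gene_demulti/gene_summary/adata/sampleA.h5ad")

# The assignment of each demultiplexing method is stored in the `assignment` column of obs
print(adata.obs["assignment"].value_counts())
```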
+- In the `rescue` mode, the pipeline generates some additional output files; see [](rescue) for details. \ No newline at end of file diff --git a/docs/source/usage.md b/docs/source/genetic.md similarity index 66% rename from docs/source/usage.md rename to docs/source/genetic.md index 6577b59..34808c3 100644 --- a/docs/source/usage.md +++ b/docs/source/genetic.md @@ -1,10 +1,20 @@ -# Usage +# Genetics-based deconvolution workflow +Genotype-based deconvolution leverages the unique genetic composition of individual samples so that the pooled cell mixture can be deconvolved. It can be run either with the genotypes of origin or in a genotype-free mode that uses a genomic reference from unmatched donors, for example the 1000 Genomes Project genotypes. The result of this approach is a table of SNP-to-cell assignments that can be used to computationally infer the donors. One limitation of this approach is the need to produce additional data to genotype the individual donors in order to correctly assign the cell mixtures. -## **Input data preparation** +## **Genetics-based deconvolution (gene_demulti) in hadge** -The input data depends heavily on the deconvolution tools. In the following table, you will find the minimal input data required by different tools. +- Pre-processing: Samtools +- Variant-calling: freebayes +- Variant-filtering: BCFtools +- Variant-calling: cellsnp-lite +- Demuxlet +- Freemuxlet +- Vireo +- Souporcell +- scSplit -### Genotype-based deconvolution methods: +## **Input data preparation** +The input data depends heavily on the deconvolution tools. In the following table, you will find the minimal input data required by different tools. | Deconvolution methods | Input data | | --------------------- | ------------------------------------------------------------------------------------ | @@ -59,293 +69,120 @@ When running genotype-based deconvolution methods without genotype reference, yo | Freemuxlet | common_variants_freemuxlet | https://sourceforge.net/projects/cellsnp/files/SNPlist/ | | cellSNP-lite | common_variants_cellsnp | https://sourceforge.net/projects/cellsnp/files/SNPlist/ | -### Hashing-based deconvolution workflow +## **Output** -| Deconvolution method | Input data | Parameter | -| -------------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- | -| HTODemux | - Seurat object with both UMI and hashing count matrix (RDS) | `params.rna_matrix_htodemux`
`params.hto_matrix_htodemux` | -| Multiseq | - Seurat object with both UMI and hashing count matrix (RDS) | `params.rna_matrix_multiseq`
`params.hto_matrix_multiseq` | -| HashSolo | - 10x mtx directory with hashing count matrix (H5) | `params.hto_matrix_hashsolo`
`params.rna_matrix_hashsolo` | -| HashedDrops | - 10x mtx directory with hashing count matrix (Directory) | `params.hto_matrix_hashedDrops` | -| Demuxem | - 10x mtx directory with UMI count matrix (Directory)
- 10x mtx directory with hashing count matrix (Directory) | `params.hto_matrix_demuxem`
`params.rna_matrix_demuxem` | +By default, the pipeline is run on a single sample. In this case, all pipeline output will be saved in the folder `$projectDir/$params.outdir/genetic/gene_demulti`. When running the pipeline on multiple samples, the pipeline output will be found in the folder `"$projectDir/$params.outdir/$sampleId/genetic/gene_demulti`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on. -The parameters `params.[rna/hto]_matrix_[method]` is used to specify whether to use raw or filtered counts for each method. Similary as genotype-based deconvlution methods, hashing methods also utilize common input parameters to store count matrices for better control. +### Samtools -| Input data | Parameter | -| ------------------------------ | ---------------------------- | -| Raw scRNAseq count matrix | `params.rna_matrix_raw` | -| Filtered scRNAseq count matrix | `params.rna_matrix_filtered` | -| Raw HTO count matrix | `params.hto_matrix_raw` | -| Filtered HTO count matrix | `params.hto_matrix_filtered` | +output directory: `$pipeline_output_folder/samtools/samtools_[task_ID/sampleId]` -#### Pre-processing +- `filtered.bam`: processed BAM in a way that reads with any of following patterns be removed: read quality lower than 10, being unmapped segment, being secondary alignment, not passing filters, being PCR or optical duplicate, or being supplementary alignment +- `filtered.bam.bai`: index of filtered bam +- `no_dup.bam`: processed BAM after removing duplicated reads based on UMI +- `sorted.bam`: sorted BAM +- `sorted.bam.bai`: index of sorted BAM + +### cellSNP-lite + +output directory: `$pipeline_output_folder/cellsnp/cellsnp_[task_ID/sampleId]` + +- `cellSNP.base.vcf.gz`: a VCF file listing genotyped SNPs and aggregated AD & DP infomation (without GT) +- `cellSNP.samples.tsv`: a TSV file listing cell barcodes or sample IDs +- `cellSNP.tag.AD.mtx`: a file in mtx format, containing the allele depths of the alternative (ALT) alleles +- `cellSNP.tag.DP.mtx`: a file in mtx format, containing the sum of allele depths of the reference and alternative alleles (REF + ALT) +- `cellSNP.tag.OTH.mtx`: a file in mtx format, containing the sum of allele depths of all the alleles other than REF and ALT. +- `cellSNP.cells.vcf.gz`: a VCF file listing genotyped SNPs and AD & DP & genotype (GT) information for each cell or sample +- `params.csv`: specified parameters in the cellsnp-lite task + +### Freebayes + +- `${region}_${vcf_freebayes}`: a VCF file containing variants called from mixed samples in the given chromosome region + +### Bcftools + +output directory: `$pipeline_output_folder/bcftools/bcftools_[task_ID/sampleId]` + +- `total_chroms.vcf`: a VCF containing variants from all chromosomes +- `sorted_total_chroms.vcf`: sorted VCF file +- `filtered_sorted_total_chroms.vcf`: sorted VCF file containing variants with a quality score > 30 + +### Demuxlet + +output directory: `$pipeline_output_folder/demuxlet/demuxlet_[task_ID/sampleId]` + +- `{demuxlet_out}.best`: result of demuxlet containing the best guess of the sample identity, with detailed statistics to reach to the best guess +- `params.csv`: specified parameters in the Demuxlet task + +Optionally: + +- `{demuxlet_out}.cel`: contains the relation between numerated barcode ID and barcode. Also, it contains the number of SNP and number of UMI for each barcoded droplet. +- `{demuxlet_out}.plp`: contains the overlapping SNP and the corresponding read and base quality for each barcode ID. 
+- `{demuxlet_out}.umi`: contains the position covered by each UMI +- `{demuxlet_out}.var`: contains the position, reference allele and allele frequency for each SNP. + +### Freemuxlet + +output directory: `$pipeline_output_folder/freemuxlet/freemuxlet_[task_ID/sampleId]` + +- `{freemuxlet_out}.clust1.samples.gz`: contains the best guess of the sample identity, with the detailed statistics behind the best guess. +- `{freemuxlet_out}.clust1.vcf.gz`: VCF file for each sample inferred and clustered by Freemuxlet +- `{freemuxlet_out}.lmix`: contains basic statistics for each barcode +- `params.csv`: specified parameters in the Freemuxlet task + +Optionally: + +- `{freemuxlet_out}.cel`: contains the relation between the numerated barcode ID and the barcode. It also contains the number of SNPs and the number of UMIs for each barcoded droplet. +- `{freemuxlet_out}.plp`: contains the overlapping SNPs and the corresponding read and base quality for each barcode ID. +- `{freemuxlet_out}.umi`: contains the position covered by each UMI +- `{freemuxlet_out}.var`: contains the position, reference allele and allele frequency for each SNP. +- `{freemuxlet_out}.clust0.samples.gz`: contains the best sample identity assuming all droplets are singlets +- `{freemuxlet_out}.clust0.vcf.gz`: VCF file for each sample inferred and clustered by Freemuxlet, assuming all droplets are singlets +- `{freemuxlet_out}.ldist.gz`: contains the pairwise Bayes factor for each possible pair of droplets + +### Vireo + +output directory: `$pipeline_output_folder/vireo/vireo_[task_ID/sampleId]` + +- `donor_ids.tsv`: assignment of Vireo with detailed statistics +- `summary.tsv`: summary of the assignment +- `prob_singlet.tsv.gz`: contains the probability of classifying singlets +- `prob_doublet.tsv.gz`: contains the probability of classifying doublets +- `GT_donors.vireo.vcf.gz`: contains the estimated donor genotypes +- `filtered_variants.tsv`: a minimal set of discriminatory variants +- `GT_barcodes.png`: a figure of the identified genotype barcodes +- `fig_GT_distance_estimated.pdf`: a plot showing the estimated genotype distance +- `_log.txt`: Vireo log file +- `params.csv`: specified parameters in the Vireo task + +### scSplit + +output directory: `$pipeline_output_folder/scSplit/scsplit_[task_ID/sampleId]` + +- `alt_filtered.csv`: count matrix of alternative alleles +- `ref_filtered.csv`: count matrix of reference alleles +- `scSplit_result.csv`: barcodes assigned to each of the N+1 clusters (N singlet clusters and 1 doublet cluster); doublets are marked as DBL-n and singlets as SNG-n, where n stands for the cluster number, e.g. SNG-0 means that cluster 0 is a singlet cluster.
+- `scSplit_dist_matrix.csv`: the ALT allele Presence/Absence (P/A) matrix on distinguishing variants for all samples as a reference in assigning sample to clusters, NOT including the doublet cluster, whose sequence number would be different every run (please pay enough attention to this) +- `scSplit_dist_variants.txt`: the distinguishing variants that can be used to genotype and assign sample to clusters +- `scSplit_PA_matrix.csv`: the full ALT allele Presence/Absence (P/A) matrix for all samples, NOT including the doublet cluster, whose sequence number would be different every run (please pay enough attention to this) +- `scSplit_P_s_c.csv`: the probability of each cell belonging to each sample +- `scSplit.log`: log file containing information for current run, iterations, and final Maximum Likelihood and doublet sample +- `params.csv`: specified parameters in the scSplit task + +### Souporcell + +output directory: `$pipeline_output_folder/souporcell/souporcell_[task_ID/sampleId]` + +- `alt.mtx`: count matrix of alternative alleles +- `ref.mtx`: count matrix of reference alleles +- `clusters.tsv`: assignment of Souporcell with the cell barcode, singlet/doublet status, cluster, log_loss_singleton, log_loss_doublet, followed by log loss for each cluster. +- `cluster_genotypes.vcf`: VCF with genotypes for each cluster for each variant in the input vcf from freebayes +- `ambient_rna.txt`: contains the ambient RNA percentage detected +- `params.csv`: specified parameters in the Souporcell task + +## **Parameter** -Similar as in the genetic demultiplexing workflow, we provide a pre-processing step required before running HTODemux and Multiseq to load count matrices into a Seurat object. The input will be automatically loaded from the parameters mentioned above. - -### **Running on multiple samples** - -The pipeline is able to run on multiple samples. In this scenario, the shared parameters for input data are retrieved from a sample sheet using `params.multi_sample`, which is set to None by default. Along with the input data, the sample sheet should contain an additional column for unique sample IDs assigned to each sample. The remaining parameters for each process are specified in the nextflow.config file, just like when demultiplexing a single sample. However, there is a distinction between running on a single sample and running on multiple samples. When processing multiple samples, the pipeline only permits a single value for each process parameter, whereas in the case of a single sample, multiple values separated by commas are allowed. The sample sheet should have e.g. following columns depending on the methods you want to run: - -- sampleId -- na_matrix_raw -- rna_matrix_filtered -- hto_matrix_raw -- hto_matrix_filtered -- bam -- bam_index -- barcodes -- nsample -- celldata -- vcf_mixed -- vcf_donor - -### **scverse compatibility** - -To ensure scverse compatibility, the pipeline provides the option to generate anndata or mudata specifeid by `params.generate_anndata`. If set to True, the pipeline will generate an AnnData object in the folder `[workflow]_summary/adata` during the summary process of two workflows. This object contains the scRNA-seq counts from `params.rna_matrix_filered` and stores the assignment of each demultiplexing method in the `assignment` column of `obs`. Additionlly, if `match_donor` is True, the pipeline also produces an AnnData object in the `data_output` folder which contains the assignment of the best-matched method pair after donor matching. 
- -## **Pipeline configuration** - -### **Conda environments:** - -We provide a `environment.yml` file for each process. But you can also use local Conda environments to run a process: - -``` -// dont forget to enable conda -conda.enable = true -process { - // Use Conda environment files - withName:scSplit { - conda = './conda/scsplit.yml' - } - // Use Conda package names - withName:cellSNP { - conda = 'bioconda::cellsnp-lite' - } - // Use existing Conda environments - withName:summary { - conda = '/path/to/an/existing/env/directory' - } -} - -``` - -### Containers: - -Nextflow also supports a variety of container runtimes, e.g. Docker. To specify a different Docker image for each process: - -``` -process { - withName:foo { - container = 'image_name_1' - } - withName:bar { - container = 'image_name_2' - } -} -// do not forget to enable docker - -docker.enabled = true - -``` - -### Executor and resource specifications: - -- The pipeline can be run either locally or on an HPC. You can set the executor by running the pipeline with `-profile standard` or `-profile cluster`. Of course, you can add other profiles if you want. -- Feel free to add other configurations, e.g. the number of CPUS, the memory allocation, etc. If you are new to Nextflow framework, please visit the [Nextlfow page](https://www.nextflow.io/docs/latest/config.html#). -- As default, the pipeline is run locally with the standard profile, where all processes annotated with the big_mem label are assigned 4 cpus and 16 Gb of memory. - -``` -profiles{ - standard { - process { - executor = 'local' - withLabel: big_mem { - cpus = 4 - memory = 16.GB - } - withLabel: small_mem { - cpus = 2 - memory = 8.GB - } - } - - } - - cluster { - process { - executor = 'slurm' - // queue = ... - withLabel: big_mem { - cpus = 32 - memory = 64.GB - } - withLabel: small_mem { - cpus = 16 - memory = 32.GB - } - } - } -} - -``` - -## Parameters - -### General - -| | | -| :--------------: | :-------------------------------------------------------------: | -| outdir | Output directory of the pipeline | -| mode | Mode of the pipeline: genetic, hashing, rescue or donor_match | -| generate_anndata | Whether to generate anndata after demultiplexing. Default: True | -| generate_mudata | Whether to generate mudata after demultiplexing. Default: False | - -### Hashing-based: Preprocessing - -| | | -| ------------- | ----------------------------------------------------------------------------------------------- | -| ndelim | For the initial identity calss for each cell, delimiter for the cell's column name. Default: \_ | -| sel_method | The selection method used to choose top variable features. Default: mean.var.plot | -| n_features | Number of features to be used when finding variable features. Default: 2000 | -| assay | Assay name for HTO modality. Default: HTO | -| norm_method | Method for normalization of HTO data. Default: CLR | -| margin | If performing CLR normalization, normalize across features (1) or cells (2). Default: 2 | -| preprocessOut | Name of the output Seurat object. Default: preprocessed | - -### Hashing-based: HTODemux - -| | | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------ | -| htodemux | Whether to perform Multiseq. Default: True | -| rna_matrix_htodemux | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered | -| hto_matrix_htodemux | Whether to use raw or filtered HTO count matrix. 
Default: filtered | -| assay | Name of the hashtag assay. Default: HTO | -| quantile_htodemux | The quantile of inferred 'negative' distribution for each hashtag, over which the cell is considered 'positive'. Default: 0.99 | -| kfunc | Clustering function for initial hashtag grouping. Default: clara. | -| nstarts | nstarts value for k-means clustering when kfunc=kmeans. Default: 100 | -| nsamples | Number of samples to be drawn from the dataset used for clustering when kfunc= clara. Default: 100 | -| seed | Sets the random seed. Default: 42 | -| init | Initial number of clusters for hashtags. Default: NULL, which means the # of hashtag oligo names + 1 to account for negatives. | -| objectOutHTO | Name of the output Seurat object. Default: htodemux | -| assignmentOutHTO | Prefix of the output CSV files. Default: htodemux | -| ridgePlot | Whether to generate a ridge plot to visualize enrichment for all HTOs. Default: TRUE | -| ridgeNCol | Number of columns in the ridge plot. Default: 3 | -| featureScatter | Whether to generate a scatter plot to visualize pairs of HTO signals. Default: FALSE | -| scatterFeat1 | First feature to plot. Default: None | -| scatterFeat2 | Second feature to plot. Default: None | -| vlnplot | Whether to generate a violin plot, e.g. to compare number of UMIs for singlets, doublets and negative cells. Default: TRUE | -| vlnFeatures | Features to plot. Default: nCount_RNA | -| vlnLog | Whether to plot the feature axis on log scale. Default: TRUE | -| tsne | Whether to generate a 2D tSNE embedding for HTOs. Default: TRUE | -| tsneIdents | Subset Seurat object based on identity class. Default: Negative | -| tsneInvert | Whether to keep or remove the identity class. Default: TRUE | -| tsneVerbose | Whether to print the top genes associated with high/low loadings for the PCs when running PCA. Default: FALSE | -| tsneApprox | Whether to use truncated singular value decomposition to approximate PCA. Default: FALSE | -| tsneDimMax | Number of dimensions to use as input features when running t-SNE dimensionality reduction. Default: 2 | -| tsnePerplexity | Perplexity when running t-SNE dimensionality reduction. Default: 100 | -| heatmap | Whether to generate an HTO heatmap. Default: TRUE | -| heatmapNcells | Number of cells to plot. Default: 5000 | - -### Hashing-based: Multiseq - -| | | -| ------------------- | ------------------------------------------------------------------------------------------------------- | -| multiseq | Whether to perform Multiseq. Default: True | -| rna_matrix_multiseq | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered | -| hto_matrix_multiseq | Whether to use raw or filtered HTO count matrix. Default: filtered | -| assay | Name of the hashtag assay, same as used for HTODemux. Default: HTO | -| quantile_multi | The quantile to use for classification. Default: 0.7 | -| autoThresh | Whether to perform automated threshold finding to define the best quantile. Default: TRUE | -| maxiter | nstarts value for k-means clustering when kfunc=kmeans. Default: 100 | -| qrangeFrom | The minimal possible quantile value to try if autoThresh=TRUE. Default: 0.1 | -| qrangeTo | The minimal possible quantile value to try if autoThresh=TRUE. Default: 0.9 | -| qrangeBy | The constant difference of a range of possible quantile values to try if autoThresh=TRUE. Default: 0.05 | -| verbose_multiseq | Wether to print the output. Default: TRUE | -| assignmentOutMulti | Prefix of the output CSV files. 
Default: multiseq | -| objectOutMulti | Name of the output Seurat object. Default: multiseq | - -### Hashing-based: Solo - -| | | -| -------------------------- | ------------------------------------------------------------------------------------------------ | -| solo | Whether to perform Solo. Default: True | -| rna_matrix_solo | Input folder to RNA expression matrix in 10x format. | -| max_epochs | Number of epochs to train for. Default: 400 | -| lr | Learning rate for optimization. Default: 0.001 | -| train_size | Size of training set in the range between 0 and 1. Default: 0.9 | -| validation_size | Size of the test set. Default: 0.1 | -| batch_size | Minibatch size to use during training. Default: 128 | -| early_stopping | Adds callback for early stopping on validation_loss. Default: True | -| early_stopping_patience | Number of times early stopping metric can not improve over early_stopping_min_delta. Default: 30 | -| early_stopping_min_delta | Threshold for counting an epoch towards patience train(). Default: 10 | -| soft | Return probabilities instead of class label. Default: False | -| include_simulated_doublets | Return probabilities for simulated doublets as well. | -| assignmentOutSolo | Prefix of the output CSV files. Default: solo_predict | - -### Hashing-based: HashSolo - -| | | -| ------------------------ | -------------------------------------------------------------------------------------------- | -| hashsolo | Whether to perform HashSolo. Default: True | -| rna_matrix_hashsolo | Whether to use raw or filtered scRNA-seq count matrix. Default: raw | -| hto_matrix_hashsolo | Whether to use raw or filtered HTO count matrix if use_rna_data is set to True. Default: raw | -| priors_negative | Prior for the negative hypothesis. Default: 1/3 | -| priors_singlet | Prior for the singlet hypothesis. Default: 1/3 | -| priors_doublet | Prior for the doublet hypothesis. Default: 1/3 | -| pre_existing_clusters | Column in the input data for how to break up demultiplexing. Default: None | -| use_rna_data | Whether to use RNA counts for deconvolution. Default: False | -| number_of_noise_barcodes | Number of barcodes to use to create noise distribution. Default: None | -| assignmentOutHashSolo | Prefix of the output CSV files. Default: hashsolo | -| plotOutHashSolo | Prefix of the output figures. Default: hashsolo | - -### Hashing-based: DemuxEm - -| | | -| -------------------- | ----------------------------------------------------------------------------------------------------------------------------- | -| demuxem | Whether to perform Demuxem. Default: True | -| rna_matrix_demuxem | Whether to use raw or filtered scRNA-seq count matrix. Default: raw | -| hto_matrix_demuxem | Whether to use raw or filtered HTO count matrix. Default: raw | -| threads_demuxem | Number of threads to use. Must be a positive integer. Default: 1 | -| alpha_demuxem | The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. Default: 0.0 | -| alpha_noise | The Dirichlet prior concenration parameter on the background noise. Default: 1.0 | -| min_num_genes | Filter cells/nuclei with at least specified number of expressed genes. Default: 100 | -| min_num_umis | Filter cells/nuclei with at least specified number of UMIs. Default: 100 | -| min_signal | Any cell/nucleus with less than min_signal hashtags from the signal will be marked as unknown. Default: 10 | -| tol | Threshold used for the EM convergence. 
Default: 1e-6 | -| generate_gender_plot | Generate violin plots using gender-specific genes (e.g. Xist). Value is a comma-separated list of gene names. Default: None | -| random_state | Random seed set for reproducing results. Default: 0 | -| objectOutDemuxem | Prefix of the output files. Default: demuxem_res | - -### Hashing-based: HashedDrops - -| | | -| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| hashedDrops | Whether to perform hashedDrops. Default: True | -| hto_matrix_hashedDrops | Whether to use raw or filtered HTO count matrix. Default: raw | -| lower | The lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Default: 100 | -| niters | The number of iterations to use for the Monte Carlo p-value calculations. Default: 10000 | -| testAmbient | Whether results should be returned for barcodes with totals less than or equal to lower. Default: TRUE | -| ignore_hashedDrops | The lower bound on the total UMI count, at or below which barcodes will be ignored. Default: NULL | -| alpha_hashedDrops | The scaling parameter for the Dirichlet-multinomial sampling scheme. Default: NULL | -| round | Whether to check for non-integer values in m and, if present, round them for ambient profile estimation. Default: TRUE | -| byRank | If set, this is used to redefine lower and any specified value for lower is ignored. Default: NULL | -| isCellFDR | FDR Threshold to filter the cells for empty droplet detection. Default: 0.01 | -| objectOutEmptyDrops | Prefix of the emptyDroplets output RDS object. Default: emptyDroplets | -| assignmentOutEmptyDrops | Prefix of the emptyDroplets output CSV file. Default: emptyDroplets | -| ambient | Whether to use the relative abundance of each HTO in the ambient solution from emptyDrops, set TRUE only when testAmbient=TRUE. Default: FALSE | -| minProp | The ambient profile when ambient=NULL. Default: 0.05 | -| pseudoCount | The minimum pseudo-count when computing logfold changes. Default: 5 | -| constantAmbient | Whether a constant level of ambient contamination should be used to estimate LogFC2 for all cells. Default: FALSE | -| doubletNmads | The number of median absolute deviations (MADs) to use to identify doublets. Default: 3 | -| doubletMin | The minimum threshold on the log-fold change to use to identify doublets. Default: 2 | -| doubletMixture | Wwhether to use a 2-component mixture model to identify doublets. Default: FALSE | -| confidentNmads | The number of MADs to use to identify confidently assigned singlets. Default: 3 | -| confidenMin | The minimum threshold on the log-fold change to use to identify singlets. Default: 2 | -| combinations | An integer matrix specifying valid combinations of HTOs. Each row corresponds to a single sample and specifies the indices of rows in x corresponding to the HTOs used to label that sample. Default: NULL | -| objectOutHashedDrops | Prefix of the hashedDrops output RDS object. Default: hashedDrops | -| assignmentOutHashedDrops | Prefix of the hashedDrops output CSV file. 
Default: hashedDrops | - -### Genotype-based: Demuxlet and dsc-pileup +### Demuxlet and dsc-pileup | | | | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | @@ -382,7 +219,7 @@ profiles{ | doublet-prior | Prior of doublet. Default: 0.5 | | demuxlet_out | Prefix out the demuxlet and dsc-pileup output files. Default: demuxlet_res | -### Genotype-based: Freemuxlet and dsc-pileup +### Freemuxlet and dsc-pileup | | | | -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -419,7 +256,7 @@ profiles{ | keep_init_missing | Keep missing cluster assignment as missing in the initial iteration. Default: False | | freemuxlet_out | Prefix out the freemuxlet and dsc-pileup output files. Default: freemuxlet_out | -### Genotype-based: Vireo +### Vireo | | | | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -444,7 +281,7 @@ profiles{ | nproc | Number of subprocesses for computing, sacrifices memory for speedups. Default: 4 | | vireo_out | Dirtectory for output files. Default: vireo_out | -### Genotype-based: scSplit +### scSplit | | | | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -467,7 +304,7 @@ profiles{ | sample_geno | Whether to generate sample genotypes based on the split result. Default: True | | scsplit_out | Dirtectory for scSplit output files. Default: scsplit_out | -### Genotype-based: Souporcell +### Souporcell | | | | ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -492,7 +329,7 @@ profiles{ | ignore | Set to True to ignore data error assertions. Default: False | | souporcell_out | Dirtectory for Souporcell output files. Default: souporcell_out | -### Genotype-based: cellSNP-lite +### cellSNP-lite | | | | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -521,7 +358,7 @@ profiles{ | countORPHAN | If use, do not skip anomalous read pairs. Default: False | | cellsnp_out | Dirtectory for cellSNP-lite output files. Default: cellSNP_out | -### Genotype-based: Freebayes +### Freebayes | | | | ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -599,17 +436,4 @@ profiles{ | read_dependence_factor | Incorporate non-independence of reads by scaling successive observations by this factor during data likelihood calculations. Default: 0.9 | | genotype_qualities | Calculate the marginal probability of genotypes and report as GQ in each sample field in the VCF output Default: False | | debug | Print debugging output. 
Default: False | -| dd | Print more verbose debugging output (requires "make DEBUG"). Default: False | - -### Donor matching - -| | | -| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| match_donor | Whether to match donors. Default: True | -| demultiplexing_result | A CSV file with demultiplexing assignment when running in donor_match mode. In other modes, the input is passed by the pipeline automatically. Default: None | -| match_donor_method1 | The method name to match donors. If None, all genotype-based methods are compared. Default: None | -| match_donor_method2 | The method name to match donors. If None, all hashing-based methods are compared. Default: None | -| findVariants | Whether to extract a subset of informative variants when best genotype-based method for donor matching is vireo. `default`: subset as described in paper; `vireo`: subset by Vireo; `True`: subset using both methods; `False`: not extracting variants. Default: False | -| variant_count | The threshold for the minimal read depth of a variant in the cell group when subseting the informative variants by default. Default: 10 | -| variant_pct | The threshold for the minimal frequency of the alternative or reference allele to determine the dominant allele of a variant in the cell group when subseting the informative variants by default. Default: 0.9 | -| vireo_parent_dir | A parent folder which contains the output folder of vireo in the format of `vireo_[taskID/sampleId]` generated by hadge pipeline when running in donor_match mode. In other modes, the input is passed by the pipeline automatically. Default: None | +| dd | Print more verbose debugging output (requires "make DEBUG"). Default: False | \ No newline at end of file diff --git a/docs/source/hashing.md b/docs/source/hashing.md new file mode 100644 index 0000000..64e0b79 --- /dev/null +++ b/docs/source/hashing.md @@ -0,0 +1,284 @@ +# Hashing demultiplexing + +## **Hashing-based deconvolution (hash_demulti) in hadge** +- Pre-processing +- Multiseq +- HTODemux +- HashedDrops +- DemuxEM +- HashSolo +- Demuxmix +- GMM-Demux +- BFF + +## **Input data preparation** + +The input data depends heavily on the deconvolution tools. In the following table, you will find the minimal input data required by different tools. + +| Deconvolution method | Input data | Parameter | +| -------------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- | +| HTODemux | - Seurat object with both UMI and hashing count matrix (RDS) | `params.rna_matrix_htodemux`
`params.hto_matrix_htodemux` | +| Multiseq | - Seurat object with both UMI and hashing count matrix (RDS) | `params.rna_matrix_multiseq`
`params.hto_matrix_multiseq` | +| HashSolo | - 10x mtx directory with hashing count matrix (H5) | `params.hto_matrix_hashsolo`
`params.rna_matrix_hashsolo` | +| HashedDrops | - 10x mtx directory with hashing count matrix (Directory) | `params.hto_matrix_hashedDrops` | +| Demuxem | - 10x mtx directory with UMI count matrix (Directory)
- 10x mtx directory with hashing count matrix (Directory) | `params.hto_matrix_demuxem`
`params.rna_matrix_demuxem` | + +Similarly to the genotype-based deconvolution methods, the hashing methods share some common input. The common input parameters `params.[rna/hto]_matrix_[raw/filtered]` store the count matrices for better control, and `params.[rna/hto]_matrix_[method]` specifies whether to use raw or filtered counts for each method. + +| Input data | Parameter | +| ------------------------------ | ---------------------------- | +| Raw scRNAseq count matrix | `params.rna_matrix_raw` | +| Filtered scRNAseq count matrix | `params.rna_matrix_filtered` | +| Raw HTO count matrix | `params.hto_matrix_raw` | +| Filtered HTO count matrix | `params.hto_matrix_filtered` | + +#### Pre-processing +As in the genetic demultiplexing workflow, we provide a pre-processing step required before running HTODemux and Multiseq to load the count matrices into a Seurat object. The input will be automatically loaded from the parameters set above. + +## **Output** +By default, the pipeline is run on a single sample. In this case, all pipeline output will be saved in the folder `$projectDir/$params.outdir/hashing/hash_demulti`. When running the pipeline on multiple samples, the pipeline output will be found in the folder `$projectDir/$params.outdir/$sampleId/hashing/hash_demulti`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on. + +### Pre-processing + +output directory: `$pipeline_output_folder/preprocess/preprocess_[task_ID/sampleId]` + +- `${params.preprocessOut}.rds`: pre-processed data in an RDS object +- `params.csv`: specified parameters in the hashing pre-processing task + +### HTODemux + +output directory: `$pipeline_output_folder/htodemux/htodemux_[task_ID/sampleId]` + +- `${params.assignmentOutHTO}_assignment_htodemux.csv`: the assignment of HTODemux +- `${params.assignmentOutHTO}_classification_htodemux.csv`: the classification of HTODemux as singlet, doublet and negative droplets +- `${params.objectOutHTO}.rds`: the result of HTODemux in an RDS object +- `params.csv`: specified parameters in the HTODemux task + +Optionally: + +- `ridge.jpeg`: a ridge plot showing the enrichment of selected HTOs +- `featureScatter.jpeg`: a scatter plot showing the signal of two selected HTOs +- `violinPlot.jpeg`: a violin plot showing selected features +- `tSNE.jpeg`: a 2D plot based on tSNE embedding of HTOs +- `heatMap.jpeg`: a heatmap of hashtag oligo signals across singlets, doublets and negative cells +- `visual_params.csv`: specified parameters for visualisation of the HTODemux result + +### Multiseq + +output directory: `$pipeline_output_folder/multiseq/multiseq_[task_ID/sampleId]` + +- `${params.assignmentOutMulti}_res.csv`: the assignment of Multiseq +- `${params.objectOutMulti}.rds`: the result of Multiseq in an RDS object +- `params.csv`: specified parameters in the Multiseq task + +### Demuxem + +output directory: `$pipeline_output_folder/demuxem/demuxem_[task_ID/sampleId]` + +- `${params.objectOutDemuxem}_demux.zarr.zip`: RNA expression matrix with demultiplexed sample identities in Zarr format +- `${params.objectOutDemuxem}.out.demuxEM.zarr.zip`: DemuxEM-calculated results in Zarr format, containing two datasets, one for HTO and one for RNA +- `${params.objectOutDemuxem}.ambient_hashtag.hist.pdf`: A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets +- `${params.objectOutDemuxem}.background_probabilities.bar.pdf`: A bar plot visualizing the estimated hashtag background probability distribution
+- `${params.objectOutDemuxem}.real_content.hist.pdf`: A histogram plot depicting hashtag distributions of not-real-cells and real-cells, as defined by the total number of expressed genes in the RNA assay +- `${params.objectOutDemuxem}.rna_demux.hist.pdf`: This figure consists of two plots. The first is a horizontal bar plot depicting the percentage of RNA barcodes with at least one HTO count. The second is a histogram depicting the RNA UMI distribution for singlets, doublets and unknown cells. +- `${params.objectOutDemuxem}.gene_name.violin.pdf`: Violin plots depicting gender-specific gene expression across samples. +- `${params.objectOutDemuxem}_summary.csv`: the classification of Demuxem +- `${params.objectOutDemuxem}_obs.csv`: the assignment of Demuxem +- `params.csv`: specified parameters in the Demuxem task + +Optionally: + +- `{params.objectOutDemuxem}.{gene_name}.violin.pdf`: violin plots using the specified gender-specific genes + +### HashSolo + +output directory: `$pipeline_output_folder/hashsolo/hashsolo_[task_ID/sampleId]` + +- `${params.assignmentOutHashSolo}_res.csv`: the assignment of HashSolo +- `${params.plotOutHashSolo}.jpg`: plot of HashSolo demultiplexing results for QC checks +- `params.csv`: specified parameters in the HashSolo task + +### HashedDrops + +output directory: `$pipeline_output_folder/hashedDrops/hashedDrops_[task_ID/sampleId]` + +- `${params.objectOutEmptyDrops}.rds`: the result of emptyDrops in an RDS object +- `${params.assignmentOutEmptyDrops}.csv`: the result of emptyDrops in a CSV file +- `plot_emptyDrops.png`: a diagnostic plot comparing the total count against the negative log-probability +- `${params.objectOutHashedDrops}.rds`: the result of hashedDrops in an RDS object +- `${params.assignmentOutHashedDrops}_res.csv`: the assignment of HashedDrops +- `${params.objectOutHashedDrops}_LogFC.png`: a diagnostic plot comparing the log-fold change between the second HTO's abundance and the ambient contamination +- `params.csv`: specified parameters in the HashedDrops task + +### Demuxmix + +output directory: `$pipeline_output_folder/demuxmix/demuxmix_[task_ID/sampleId]` + +- `${params.assignmentOutDemuxmix}_assignment_demuxmix.csv`: the assignment and classification results produced by Demuxmix +- `params.csv`: specified parameters in the Demuxmix task + +### GMM-Demux + +output directory: `$pipeline_output_folder/gmm_demux/gmm_demux_[task_ID/sampleId]` + +- `features.tsv.gz`: part of the default output, the non-MSM droplets (SSDs) stored in MTX format +- `barcodes.tsv.gz`: part of the default output, the non-MSM droplets (SSDs) stored in MTX format +- `matrix.mtx.gz`: part of the default output, the non-MSM droplets (SSDs) stored in MTX format +- `GMM_full.csv`: The classification file containing the label of each droplet as well as the probability of the classification.
+- `GMM_full.config`: Used to assign each classification to a donor using the numbers listed in the config file +- `gmm_demux_${task.index}_report.txt`: Specify the file to store summary report, produced only if GMM can find a viable solution that satisfies the droplet formation model +- `params.csv`: specified parameters in the GMM-Demux task + +### BFF + +output directory: `$pipeline_output_folder/bff/bff_[task_ID/sampleId]` + +- `${params.assignmentOutBff}_assignment_demuxmix.csv`: the assignment and classification results produced by BFF +- `params.csv`: specified parameters in the BFF task + +## **Parameter** + +### Preprocessing + +| | | +| ------------- | ----------------------------------------------------------------------------------------------- | +| ndelim | For the initial identity calss for each cell, delimiter for the cell's column name. Default: \_ | +| sel_method | The selection method used to choose top variable features. Default: mean.var.plot | +| n_features | Number of features to be used when finding variable features. Default: 2000 | +| assay | Assay name for HTO modality. Default: HTO | +| norm_method | Method for normalization of HTO data. Default: CLR | +| margin | If performing CLR normalization, normalize across features (1) or cells (2). Default: 2 | +| preprocessOut | Name of the output Seurat object. Default: preprocessed | + +### HTODemux + +| | | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| htodemux | Whether to perform Multiseq. Default: True | +| rna_matrix_htodemux | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered | +| hto_matrix_htodemux | Whether to use raw or filtered HTO count matrix. Default: filtered | +| assay | Name of the hashtag assay. Default: HTO | +| quantile_htodemux | The quantile of inferred 'negative' distribution for each hashtag, over which the cell is considered 'positive'. Default: 0.99 | +| kfunc | Clustering function for initial hashtag grouping. Default: clara. | +| nstarts | nstarts value for k-means clustering when kfunc=kmeans. Default: 100 | +| nsamples | Number of samples to be drawn from the dataset used for clustering when kfunc= clara. Default: 100 | +| seed | Sets the random seed. Default: 42 | +| init | Initial number of clusters for hashtags. Default: NULL, which means the # of hashtag oligo names + 1 to account for negatives. | +| objectOutHTO | Name of the output Seurat object. Default: htodemux | +| assignmentOutHTO | Prefix of the output CSV files. Default: htodemux | +| ridgePlot | Whether to generate a ridge plot to visualize enrichment for all HTOs. Default: TRUE | +| ridgeNCol | Number of columns in the ridge plot. Default: 3 | +| featureScatter | Whether to generate a scatter plot to visualize pairs of HTO signals. Default: FALSE | +| scatterFeat1 | First feature to plot. Default: None | +| scatterFeat2 | Second feature to plot. Default: None | +| vlnplot | Whether to generate a violin plot, e.g. to compare number of UMIs for singlets, doublets and negative cells. Default: TRUE | +| vlnFeatures | Features to plot. Default: nCount_RNA | +| vlnLog | Whether to plot the feature axis on log scale. Default: TRUE | +| tsne | Whether to generate a 2D tSNE embedding for HTOs. Default: TRUE | +| tsneIdents | Subset Seurat object based on identity class. Default: Negative | +| tsneInvert | Whether to keep or remove the identity class. 
Default: TRUE | +| tsneVerbose | Whether to print the top genes associated with high/low loadings for the PCs when running PCA. Default: FALSE | +| tsneApprox | Whether to use truncated singular value decomposition to approximate PCA. Default: FALSE | +| tsneDimMax | Number of dimensions to use as input features when running t-SNE dimensionality reduction. Default: 2 | +| tsnePerplexity | Perplexity when running t-SNE dimensionality reduction. Default: 100 | +| heatmap | Whether to generate an HTO heatmap. Default: TRUE | +| heatmapNcells | Number of cells to plot. Default: 5000 | + +### Multiseq + +| | | +| ------------------- | ------------------------------------------------------------------------------------------------------- | +| multiseq | Whether to perform Multiseq. Default: True | +| rna_matrix_multiseq | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered | +| hto_matrix_multiseq | Whether to use raw or filtered HTO count matrix. Default: filtered | +| assay | Name of the hashtag assay, same as used for HTODemux. Default: HTO | +| quantile_multi | The quantile to use for classification. Default: 0.7 | +| autoThresh | Whether to perform automated threshold finding to define the best quantile. Default: TRUE | +| maxiter | nstarts value for k-means clustering when kfunc=kmeans. Default: 100 | +| qrangeFrom | The minimal possible quantile value to try if autoThresh=TRUE. Default: 0.1 | +| qrangeTo | The minimal possible quantile value to try if autoThresh=TRUE. Default: 0.9 | +| qrangeBy | The constant difference of a range of possible quantile values to try if autoThresh=TRUE. Default: 0.05 | +| verbose_multiseq | Wether to print the output. Default: TRUE | +| assignmentOutMulti | Prefix of the output CSV files. Default: multiseq | +| objectOutMulti | Name of the output Seurat object. Default: multiseq | + +### Solo + +| | | +| -------------------------- | ------------------------------------------------------------------------------------------------ | +| solo | Whether to perform Solo. Default: True | +| rna_matrix_solo | Input folder to RNA expression matrix in 10x format. | +| max_epochs | Number of epochs to train for. Default: 400 | +| lr | Learning rate for optimization. Default: 0.001 | +| train_size | Size of training set in the range between 0 and 1. Default: 0.9 | +| validation_size | Size of the test set. Default: 0.1 | +| batch_size | Minibatch size to use during training. Default: 128 | +| early_stopping | Adds callback for early stopping on validation_loss. Default: True | +| early_stopping_patience | Number of times early stopping metric can not improve over early_stopping_min_delta. Default: 30 | +| early_stopping_min_delta | Threshold for counting an epoch towards patience train(). Default: 10 | +| soft | Return probabilities instead of class label. Default: False | +| include_simulated_doublets | Return probabilities for simulated doublets as well. | +| assignmentOutSolo | Prefix of the output CSV files. Default: solo_predict | + +### HashSolo + +| | | +| ------------------------ | -------------------------------------------------------------------------------------------- | +| hashsolo | Whether to perform HashSolo. Default: True | +| rna_matrix_hashsolo | Whether to use raw or filtered scRNA-seq count matrix. Default: raw | +| hto_matrix_hashsolo | Whether to use raw or filtered HTO count matrix if use_rna_data is set to True. Default: raw | +| priors_negative | Prior for the negative hypothesis. 
Default: 1/3 |
+| priors_singlet | Prior for the singlet hypothesis. Default: 1/3 |
+| priors_doublet | Prior for the doublet hypothesis. Default: 1/3 |
+| pre_existing_clusters | Column in the input data for how to break up demultiplexing. Default: None |
+| use_rna_data | Whether to use RNA counts for deconvolution. Default: False |
+| number_of_noise_barcodes | Number of barcodes to use to create noise distribution. Default: None |
+| assignmentOutHashSolo | Prefix of the output CSV files. Default: hashsolo |
+| plotOutHashSolo | Prefix of the output figures. Default: hashsolo |
+
+### DemuxEm
+
+| | |
+| -------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| demuxem | Whether to perform DemuxEm. Default: True |
+| rna_matrix_demuxem | Whether to use the raw or filtered scRNA-seq count matrix. Default: raw |
+| hto_matrix_demuxem | Whether to use the raw or filtered HTO count matrix. Default: raw |
+| threads_demuxem | Number of threads to use. Must be a positive integer. Default: 1 |
+| alpha_demuxem | The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. Default: 0.0 |
+| alpha_noise | The Dirichlet prior concentration parameter on the background noise. Default: 1.0 |
+| min_num_genes | Only demultiplex cells/nuclei with at least the specified number of expressed genes. Default: 100 |
+| min_num_umis | Only demultiplex cells/nuclei with at least the specified number of UMIs. Default: 100 |
+| min_signal | Any cell/nucleus with less than min_signal hashtags from the signal will be marked as unknown. Default: 10 |
+| tol | Threshold used for the EM convergence. Default: 1e-6 |
+| generate_gender_plot | Generate violin plots using gender-specific genes (e.g. Xist). Value is a comma-separated list of gene names. Default: None |
+| random_state | Random seed set for reproducing results. Default: 0 |
+| objectOutDemuxem | Prefix of the output files. Default: demuxem_res |
+
+### HashedDrops
+
+| | |
+| ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| hashedDrops | Whether to perform hashedDrops. Default: True |
+| hto_matrix_hashedDrops | Whether to use the raw or filtered HTO count matrix. Default: raw |
+| lower | The lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Default: 100 |
+| niters | The number of iterations to use for the Monte Carlo p-value calculations. Default: 10000 |
+| testAmbient | Whether results should be returned for barcodes with totals less than or equal to lower. Default: TRUE |
+| ignore_hashedDrops | The lower bound on the total UMI count, at or below which barcodes will be ignored. Default: NULL |
+| alpha_hashedDrops | The scaling parameter for the Dirichlet-multinomial sampling scheme. Default: NULL |
+| round | Whether to check for non-integer values in m and, if present, round them for ambient profile estimation. Default: TRUE |
+| byRank | If set, this is used to redefine lower and any specified value for lower is ignored. Default: NULL |
+| isCellFDR | FDR threshold to filter the cells for empty droplet detection. Default: 0.01 |
+| objectOutEmptyDrops | Prefix of the emptyDroplets output RDS object. Default: emptyDroplets |
+| assignmentOutEmptyDrops | Prefix of the emptyDroplets output CSV file. Default: emptyDroplets |
+| ambient | Whether to use the relative abundance of each HTO in the ambient solution from emptyDrops; set to TRUE only when testAmbient=TRUE. Default: FALSE |
+| minProp | Numeric scalar used to infer the ambient profile when ambient=NULL. Default: 0.05 |
+| pseudoCount | The minimum pseudo-count when computing log-fold changes. Default: 5 |
+| constantAmbient | Whether a constant level of ambient contamination should be used to estimate LogFC2 for all cells. Default: FALSE |
+| doubletNmads | The number of median absolute deviations (MADs) to use to identify doublets. Default: 3 |
+| doubletMin | The minimum threshold on the log-fold change to use to identify doublets. Default: 2 |
+| doubletMixture | Whether to use a 2-component mixture model to identify doublets. Default: FALSE |
+| confidentNmads | The number of MADs to use to identify confidently assigned singlets. Default: 3 |
+| confidenMin | The minimum threshold on the log-fold change to use to identify singlets. Default: 2 |
+| combinations | An integer matrix specifying valid combinations of HTOs. Each row corresponds to a single sample and specifies the indices of rows in x corresponding to the HTOs used to label that sample. Default: NULL |
+| objectOutHashedDrops | Prefix of the hashedDrops output RDS object. Default: hashedDrops |
+| assignmentOutHashedDrops | Prefix of the hashedDrops output CSV file. Default: hashedDrops |
diff --git a/docs/source/index.md b/docs/source/index.md index 9ca3125..e311e68 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -81,6 +81,9 @@ The output of the donor-matching process can be found in the folder `donor_match :maxdepth: 3 usage output +general +genetic +hashing ``` # Indices and tables
diff --git a/docs/source/output.md b/docs/source/output.md deleted file mode 100644 index 71c7b40..0000000 --- a/docs/source/output.md +++ /dev/null @@ -1,322 +0,0 @@ -# Output - -This document describes the output produced by each process of the pipeline. - -## Pipeline overview - -
Modes - -- genetic: Genetics-based deconvolution workflow -- hashing: Hashing-based deconvolution workflow -- rescue: genetic + hashing + donor matching - -
-
-
Workflows -
Hashing-based deconvolution (hash_demulti) - -- Pre-processing -- Multiseq -- HTODemux -- HashedDrops -- DemuxEM -- HashSolo -- Demuxmix -- GMM-Demux -- BFF - -
- -
Genetics-based deconvolution (gene_demulti) - -- Pre-processing: Samtools -- Variant-calling: freebayes -- Variant-filtering: BCFtools -- Variant-calling: cellsnp-lite -- Demuxlet -- Freemuxlet -- Vireo -- Souporcell -- scSplit - -
-
- -
- -By default, the pipeline is run on a single sample. In this case, all pipeline output will be saved in the folder `$projectDir/$params.outdir/$params.mode`. When running the pipeline on multiple samples, the pipeline output will be found in the folder `"$projectDir/$params.outdir/$sampleId/$params.mode/`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on. - -
- -## Hashing-based deconvolution workflow - -Running on a single sample, the output of hashing-based deconvolution workflow is saved in the folder `$pipeline_output_folder/hash_demulti`. - -### Pre-processing - -output directory: `preprocess/preprocess_[task_ID/sampleId]` - -- `${params.preprocessOut}.rds`: pre-processed data in an RDS object -- `params.csv`: specified parameters in the hashing pre-processing task - -### HTODemux - -output directory: `htodemux/htodemux_[task_ID/sampleId]` - -- `${params.assignmentOutHTO}_assignment_htodemux.csv`: the assignment of HTODemux -- `${params.assignmentOutHTO}_classification_htodemux.csv`: the classification of HTODemux as singlet, doublet and negative droplets -- `${params.objectOutHTO}.rds`: the result of HTODemux in an RDS object -- `params.csv`: specified parameters in the HTODemux task - -Optionally: - -- `ridge.jpeg`: a ridge plot showing the enrichment of selected HTOs -- `featureScatter.jpeg`: a scatter plot showing the signal of two selected HTOs -- `violinPlot.jpeg`: a violin plot showing selected features -- `tSNE.jpeg`: a 2D plot based on tSNE embedding of HTOs -- `heatMap.jpeg`: a heatmap of hashtag oligo signals across singlets, doublets and negative cells -- `visual_params.csv`: specified parameters for visualisation of the HTODemux result - -### Multiseq - -output directory: `multiseq/multiseq_[task_ID/sampleId]` - -- `${params.assignmentOutMulti}_res.csv`: the assignment of Multiseq -- `${params.objectOutMulti}.rds`: the result of Multiseq in an RDS object -- `params.csv`: specified parameters in the Multiseq task - -### Demuxem - -output directory: `demuxem/demuxem_[task_ID/sampleId]` - -- `${params.objectOutDemuxem}_demux.zarr.zip`: RNA expression matrix with demultiplexed sample identities in Zarr format -- `${params.objectOutDemuxem}.out.demuxEM.zarr.zip`: DemuxEM-calculated results in Zarr format, containing two datasets, one for HTO and one for RNA -- `${params.objectOutDemuxem}.ambient_hashtag.hist.pdf`: A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets -- `${params.objectOutDemuxem}.background_probabilities.bar.pdf}`: A bar plot visualizing the estimated hashtag background probability distribution -- `${params.objectOutDemuxem}.real_content.hist.pdf`: A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay -- `${params.objectOutDemuxem}.rna_demux.hist.pdf`: This figure consists of two plots. The first one is a horizontal bar plot depicting the percentage of RNA barcodes with at least one HTO count. The second plot is a histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells. -- `${params..objectOutDemuxem}.gene_name.violin.pdf`: Violin plots depicting gender-specific gene expression across samples. 
-- `${params.objectOutDemuxem}_summary.csv`: the classification of Demuxem -- `${params.objectOutDemuxem}_obs.csv`: the assignment of Demuxem -- `params.csv`: specified parameters in the Demuxem task - -Optionally: - -- `{params.objectOutDemuxem}.{gene_name}.violin.pdf`: violin plots using specified gender-specific gene - -### Solo - -output directory: `solo/solo_[task_ID/sampleId]` - -- `${params.assignmentOutSolo}_res.csv`: the assignment of Solo -- `params.csv`: specified parameters in the Solo task - -### HashSolo - -output directory: `hashsolo/hashsolo_[task_ID/sampleId]` - -- `${params.assignmentOutHashSolo}_res.csv`: the assignment of HashSolo -- `${params.plotOutHashSolo}.jpg`: plot of HashSolo demultiplexing results for QC checks -- `params.csv`: specified parameters in the HashSolo task - -### HashedDrops - -output directory: `hashedDrops/hashedDrops_[task_ID/sampleId]` - -- `${params.objectOutEmptyDrops}.rds`: the result of emptyDrops in an RDS object -- `${params.assignmentOutEmptyDrops}.csv`: the result of emptyDrops in a csv file -- `plot_emptyDrops.png`: a diagnostic plot comparing the total count against the negative log-probability -- `${params.objectOutHashedDrops}.rds`: the result of hashedyDrops in an RDS object -- `${params.assignmentOutHashedDrops}_res.csv`: the assignment of HashSolo -- `${params.objectOutHashedDrops}_LogFC.png`: a diagnostic plot comparing the log-fold change between the second HTO's abundance and the ambient contamination -- `params.csv`: specified parameters in the HashedDrops task - -### Demuxmix - -output directory: `demuxmix/demuxmix_[task_ID/sampleId]` - -- `${params.assignmentOutDemuxmix}_assignment_demuxmix.csv`: the assignment and classification results produced by Demuxmix -- `params.csv`: specified parameters in the Demuxmix task - -### GMM-Demux - -output directory: `gmm_demux/gmm_demux_[task_ID/sampleId]` - -- `features.tsv.gz`: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format. -- `barcodes.tsv.gz`: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format. -- `matrix.mtx.gz`: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format. -- `GMM_full.csv`: The classification file containing the label of each droplet as well as the probability of the classification. -- `GMM_full.config`: Used to assign each classification to a donor using the numbers listed in the config file -- `gmm_demux_${task.index}_report.txt`: Specify the file to store summary report, produced only if GMM can find a viable solution that satisfies the droplet formation model -- `params.csv`: specified parameters in the GMM-Demux task - -### BFF - -output directory: `bff/bff_[task_ID/sampleId]` - -- `${params.assignmentOutBff}_assignment_demuxmix.csv`: the assignment and classification results produced by BFF -- `params.csv`: specified parameters in the BFF task - -## Genetics-based deconvolution workflow - -The output of genetics-based deconvolution workflow is saved in the folder `$pipeline_output_folder/gene_demulti`. 
- -### Samtools - -output directory: `samtools/samtools_[task_ID/sampleId]` - -- `filtered.bam`: processed BAM in a way that reads with any of following patterns be removed: read quality lower than 10, being unmapped segment, being secondary alignment, not passing filters, being PCR or optical duplicate, or being supplementary alignment -- `filtered.bam.bai`: index of filtered bam -- `no_dup.bam`: processed BAM after removing duplicated reads based on UMI -- `sorted.bam`: sorted BAM -- `sorted.bam.bai`: index of sorted BAM - -### cellSNP-lite - -output directory: `cellsnp/cellsnp_[task_ID/sampleId]` - -- `cellSNP.base.vcf.gz`: a VCF file listing genotyped SNPs and aggregated AD & DP infomation (without GT) -- `cellSNP.samples.tsv`: a TSV file listing cell barcodes or sample IDs -- `cellSNP.tag.AD.mtx`: a file in mtx format, containing the allele depths of the alternative (ALT) alleles -- `cellSNP.tag.DP.mtx`: a file in mtx format, containing the sum of allele depths of the reference and alternative alleles (REF + ALT) -- `cellSNP.tag.OTH.mtx`: a file in mtx format, containing the sum of allele depths of all the alleles other than REF and ALT. -- `cellSNP.cells.vcf.gz`: a VCF file listing genotyped SNPs and AD & DP & genotype (GT) information for each cell or sample -- `params.csv`: specified parameters in the cellsnp-lite task - -### Freebayes - -- `${region}_${vcf_freebayes}`: a VCF file containing variants called from mixed samples in the given chromosome region - -### Bcftools - -output directory: `bcftools/bcftools_[task_ID/sampleId]` - -- `total_chroms.vcf`: a VCF containing variants from all chromosomes -- `sorted_total_chroms.vcf`: sorted VCF file -- `filtered_sorted_total_chroms.vcf`: sorted VCF file containing variants with a quality score > 30 - -### Demuxlet - -output directory: `demuxlet/demuxlet_[task_ID/sampleId]` - -- `{demuxlet_out}.best`: result of demuxlet containing the best guess of the sample identity, with detailed statistics to reach to the best guess -- `params.csv`: specified parameters in the Demuxlet task - -Optionally: - -- `{demuxlet_out}.cel`: contains the relation between numerated barcode ID and barcode. Also, it contains the number of SNP and number of UMI for each barcoded droplet. -- `{demuxlet_out}.plp`: contains the overlapping SNP and the corresponding read and base quality for each barcode ID. -- `{demuxlet_out}.umi`: contains the position covered by each umi -- `{demuxlet_out}.var`: contains the position, reference allele and allele frequency for each SNP. - -### Freemuxlet - -output directory: `freemuxlet/freemuxlet_[task_ID/sampleId]` - -- `{freemuxlet_out}.clust1.samples.gz`: contains the best guess of the sample identity, with detailed statistics to reach to the best guess. -- `{freemuxlet_out}.clust1.vcf.gz`: VCF file for each sample inferred and clustered from freemuxlet -- `{freemuxlet_out}.lmix`: contains basic statistics for each barcode -- `params.csv`: specified parameters in the Freemuxlet task - -Optionally: - -- `{freemuxlet_out}.cel`: contains the relation between numerated barcode ID and barcode. Also, it contains the number of SNP and number of UMI for each barcoded droplet. -- `{freemuxlet_out}.plp`: contains the overlapping SNP and the corresponding read and base quality for each barcode ID. -- `{freemuxlet_out}.umi`: contains the position covered by each umi -- `{freemuxlet_out}.var`: contains the position, reference allele and allele frequency for each SNP. 
-- `{freemuxlet_out}.clust0.samples.gz`: contains the best sample identity assuming all droplets are singlets -- `{freemuxlet_out}.clust0.vcf.gz}`: VCF file for each sample inferred and clustered from freemuxlet assuming all droplets are singlets -- `{freemuxlet_out}.ldist.gz`: contains the pairwise Bayes factor for each possible pair of droplets - -### Vireo - -output directory: `vireo/vireo_[task_ID/sampleId]` - -- `donor_ids.tsv`: assignment of Vireo with detailed statistics -- `summary.tsv`: summary of assignment -- `prob_singlet.tsv.gz`: contains probability of classifing singlets -- `prob_doublet.tsv.gz`: contains probability of classifing doublets -- `GT_donors.vireo.vcf.gz`: contains estimated donor genotypes -- `filtered_variants.tsv`: a minimal set of discriminatory variants -- `GT_barcodes.png`: a figure for the identified genotype barcodes -- `fig_GT_distance_estimated.pdf`: a plog showing estimated genotype distance -- `_log.txt`: vireo log file -- `params.csv`: specified parameters in the Vireo task - -### scSplit - -output directory: `scSplit/scsplit_[task_ID/sampleId]` - -- `alt_filtered.csv`: count matrix of alternative alleles -- `ref_filtered.csv`: count matrix of reference alleles -- `scSplit_result.csv`: barcodes assigned to each of the N+1 cluster (N singlets and 1 doublet cluster), doublet marked as DBL- (n stands for the cluster number), e.g SNG-0 means the cluster 0 is a singlet cluster. -- `scSplit_dist_matrix.csv`: the ALT allele Presence/Absence (P/A) matrix on distinguishing variants for all samples as a reference in assigning sample to clusters, NOT including the doublet cluster, whose sequence number would be different every run (please pay enough attention to this) -- `scSplit_dist_variants.txt`: the distinguishing variants that can be used to genotype and assign sample to clusters -- `scSplit_PA_matrix.csv`: the full ALT allele Presence/Absence (P/A) matrix for all samples, NOT including the doublet cluster, whose sequence number would be different every run (please pay enough attention to this) -- `scSplit_P_s_c.csv`: the probability of each cell belonging to each sample -- `scSplit.log`: log file containing information for current run, iterations, and final Maximum Likelihood and doublet sample -- `params.csv`: specified parameters in the scSplit task - -### Souporcell - -output directory: `souporcell/souporcell_[task_ID/sampleId]` - -- `alt.mtx`: count matrix of alternative alleles -- `ref.mtx`: count matrix of reference alleles -- `clusters.tsv`: assignment of Souporcell with the cell barcode, singlet/doublet status, cluster, log_loss_singleton, log_loss_doublet, followed by log loss for each cluster. -- `cluster_genotypes.vcf`: VCF with genotypes for each cluster for each variant in the input vcf from freebayes -- `ambient_rna.txt`: contains the ambient RNA percentage detected -- `params.csv`: specified parameters in the Souporcell task - -## Merging results - -After each demultiplexing workflow, the pipeline will generate some TSV files to summarize the results in the folder `$pipeline_output_folder/[workflow]/[workflow]_summary`. - -- `[method]_classification.csv`: classification of all trials for a given method - | Barcode | multiseq_1 | multiseq_2 | ... | - |:---------: |:----------: |:----------: |:---: | - | barcode-1 | singlet | singlet | ... | - | barcode-2 | doublet | negative | ... | - | ... | ... | ... | ... | -- `[method]_assignment.csv`: assignment of all trials for a given method - | Barcode | multiseq_1 | multiseq_2 | ... 
| - |:---------: |:----------: |:----------: |:---: | - | barcode-1 | donor-1 | donor-2 | ... | - | barcode-2 | doublet | negative | ... | - | ... | ... | ... | ... | -- `[method]_params.csv`: specified paramters of all trials for a given method - | Argument | Value | - | :---------: | :----------: | - | seuratObejctPath | Path | - | quantile | 0.7 | - | ... | ... | -- `[workflow]_classification_all.csv`: classification of all trials across different methods - | Barcode | multiseq_1 | htodemux_1 | ... | - |:---------: |:----------: |:----------: |:---: | - | ... | ... | ... | ... | -- `[workflow]_assignment_all.csv`: save the assignment of all trials across different methods - | Barcode | multiseq_1 | htodemux_1 | ... | - |:---------: |:----------: |:----------: |:---: | - | ... | ... | ... | ... | -- `adata` folder: stores Anndata object with filtered scRNA-seq read counts and assignment of each deconvolution method if `params.generate_anndata` is `True`. -- In the `rescue` mode, the pipeline merges the results of hashing and genetic demultiplexing tools into and `assignment_all_genetic_and_hash.csv` in the `$pipeline_output_folder/summary` folder. - -## Donor matching - -- Folder`[method1]_[task_ID/sampleId]_vs_[method2]_[task_ID/sampleId]` with: - - `correlation_res.csv`: correlation scores of donor matching - - `concordance_heatmap.png`: a heatmap visualising the the correlation scores - - `donor_match.csv`: a map between hashtag and donor identity. - - `all_assignment_after_match.csv`: assignment of all cell barcodes after donor matching - - `intersect_assignment_after_match.csv`: assignment of joint singlets after donor matching -- General output in the `$pipeline_output_folder/donor_match` folder: - - `all_assignment_after_match.csv`: assignment of all cell barcodes based on the donor matching of the optimal match - - `donor_match.csv`: a map between hashtags and donor identities based on the donor matching of the optimal match - - `score_record.csv`: a CSV file storing the matching score and the number of matched donors for each method pair -- Folder `data_output` with: - - an Anndata object which contains the filtered scRNA-seq counts from `params.rna_matrix_filered` and the assignment of the best-matched method pair after donor matching -- Folder `donor_match/donor_match_[best_method1]_[best_method2]`: Optionally, if `best_method1` is `vireo` for the optimal match `best_method1` and `best_method2` among all trials and identification of donor-specific or discriminatory variants is enabled: - - `donor_specific_variants.csv`: a list of donor-specific variants - - `donor_specific_variants_upset.png`: An upset plot showing the number of donor-specific variants - - `donor_genotype_subset_by_default_matched.vcf`: Donor genotypes of donor-specific variants - - `donor_genotype_subset_by_vireo.vcf`: Donor genotypes of a set of discriminatory variants filtered by Vireo diff --git a/docs/source/rescue.md b/docs/source/rescue.md new file mode 100644 index 0000000..cbcace6 --- /dev/null +++ b/docs/source/rescue.md @@ -0,0 +1,47 @@ +# Combining results: rescue mode +The joint call of hashing and genetic deconvolution methods has been shown to be beneficial for cell recovery rate and calling accuracy. hadge provides a rescue mode to run both genotype- and hashing-based approaches jointly to rescue problematic hashing experiments in cases where donors are genetically distinct. In this scenario, samples of both hashing and genetic multiplexing experiments are deconvoluted simultaneously. 
Furthermore, hadge allows for the automatic determination of the best combination of hashing- and SNP-based donor deconvolution tools.
+
+## **Parameter**
+
+| | |
+| --------------------- | ---------------------------------------------------------- |
+| match_donor | Whether to match donors. Default: True |
+| demultiplexing_result | A CSV file with the demultiplexing assignments; only required when running in donor_match mode. In other modes, the input is passed by the pipeline automatically. Default: None |
+| match_donor_method1 | The method name to match donors. If None, all genotype-based methods are compared. Default: None |
+| match_donor_method2 | The method name to match donors. If None, all hashing-based methods are compared. Default: None |
+| findVariants | Whether to extract a subset of informative variants when the best genotype-based method for donor matching is Vireo. `default`: subset as described in the paper; `vireo`: subset by Vireo; `True`: subset using both methods; `False`: do not extract variants. Default: False |
+| variant_count | The threshold for the minimal read depth of a variant in the cell group when subsetting the informative variants with the `default` method. Default: 10 |
+| variant_pct | The threshold for the minimal frequency of the alternative or reference allele used to determine the dominant allele of a variant in the cell group when subsetting the informative variants with the `default` method. Default: 0.9 |
+| vireo_parent_dir | A parent folder containing the Vireo output folder in the format `vireo_[taskID/sampleId]`, as generated by the hadge pipeline; only required when running in donor_match mode. In other modes, the input is passed by the pipeline automatically. Default: None |
+
+## **Output**
+
+By default, the pipeline is run on a single sample. In this case, all pipeline output will be saved in the folder `$projectDir/$params.outdir/rescue`. When running the pipeline on multiple samples, the pipeline output will be found in the folder `$projectDir/$params.outdir/$sampleId/rescue`. To simplify this, we'll refer to this folder as `$pipeline_output_folder` from now on.
+
+In rescue mode, the genotype- and hashing-based demultiplexing workflows run in parallel. They save their output in `$pipeline_output_folder/[gene/hash]_demulti`. Before running the donor-matching process, the pipeline merges the results of the two workflows into `classification_all_genetic_and_hash.csv` and `assignment_all_genetic_and_hash.csv` in the `$pipeline_output_folder/summary` folder.
+
+The following additional output can be found in `$pipeline_output_folder/donor_match`.
+
+### Optional output: Donor matching
+- Folder `[method1]_[task_ID/sampleId]_vs_[method2]_[task_ID/sampleId]` with:
+  - `correlation_res.csv`: correlation scores of donor matching
+  - `concordance_heatmap.png`: a heatmap visualising the correlation scores
+  - `donor_match.csv`: a map between hashtags and donor identities
+  - `all_assignment_after_match.csv`: assignment of all cell barcodes after donor matching
+  - `intersect_assignment_after_match.csv`: assignment of joint singlets after donor matching
+- General output in the `$pipeline_output_folder/donor_match` folder:
+  - `all_assignment_after_match.csv`: assignment of all cell barcodes based on the donor matching of the optimal match
+  - `donor_match.csv`: a map between hashtags and donor identities based on the donor matching of the optimal match
+  - `score_record.csv`: a CSV file storing the matching score and the number of matched donors for each method pair
+
+### Optional output: scverse compatibility
+Folder `data_output` with:
+  - an AnnData object which contains the filtered scRNA-seq counts from `params.rna_matrix_filtered` and the assignment of the best-matched method pair after donor matching
+
+### Optional output: Extracting donor-specific variants
+Only when 1) `best_method1` of the optimal match (`best_method1` and `best_method2`) is `vireo` and 2) identification of donor-specific or discriminatory variants is enabled, the folder `donor_match/donor_match_[best_method1]_[best_method2]` additionally contains:
+  - `donor_specific_variants.csv`: a list of donor-specific variants
+  - `donor_specific_variants_upset.png`: an UpSet plot showing the number of donor-specific variants
+  - `donor_genotype_subset_by_default_matched.vcf`: donor genotypes of donor-specific variants
+  - `donor_genotype_subset_by_vireo.vcf`: donor genotypes of a set of discriminatory variants filtered by Vireo
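+
+If donor matching has to be repeated, for example with a different method pair, it can be re-run on its own in donor_match mode using the files described above. The snippet below is a minimal, illustrative sketch of such a configuration; the paths and the trial names `vireo_1` and `htodemux_1` are placeholders and have to be adapted to your own run.
+
+```
+params {
+    mode = "donor_match"
+    // merged assignment table produced by a previous rescue run (illustrative path)
+    demultiplexing_result = "result/rescue/summary/assignment_all_genetic_and_hash.csv"
+    // parent folder containing the vireo_[taskID/sampleId] output of the same run (illustrative path)
+    vireo_parent_dir = "result/rescue/gene_demulti/vireo"
+    // restrict donor matching to one method pair; leave unset to compare all pairs
+    match_donor_method1 = "vireo_1"
+    match_donor_method2 = "htodemux_1"
+}
+```
+
+With such a configuration, the pipeline skips demultiplexing and only performs donor matching on the supplied results.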