Lightweight pipeline for discovering new features in single-cell and spatial transcriptomics data.
This is a lightweight version of the workflow for TAR-scRNA-seq by Michael Wang. uTAR_lite
just generates a count matrix for unannotated transcriptionally active regions (uTARs). We removed a few of the dependencies and sped up the runtimes.
Right now, it only works for datasets which have already been aligned with STARsolo. All output files are saved into the output directory (/path/to/sampleID/TAR
).
#TODO
See ./vignettes
for example notebooks on how to use uTARs with your data
- Add in .bam deduplication prior to HMM (saves run time, improves scalability, probably also improves sensitivity)
- Rewrite parameterization of THRESH (minnumber of UMIs, not relative to total reads)
This workflow requires the following packages listed below. Please ensure that tool can be called from the command line (i.e. the paths to each tool is in your path variable). We recommend using conda to organize your packages. We also included a .yml (scTAR_STARsolo.yml
) file which can be used to initialize a conda environment with all of the required dependencies.
1. Snakemake
2. umi_tools
We replaced DropSeqUtils
& Picard
with umi_tools
.
conda install -c bioconda -c conda-forge umi_tools==1.1.2
Please also ensure that you have downloaded the following R packages. They will be used throughout the pipeline.
4. Samtools
conda install -c bioconda samtools
This tool is used to convert gtf annotation files to refFlat format.
conda install -c bioconda ucsc-gtftogenepred
6. Bedtools
Please make sure this tool is available in your working environment.
conda install -c bioconda bedtools
- Be sure that the aligned /bam files hav the
CB
and
Run the following command in your command line.
git clone [email protected]:mckellardw/uTAR_lite.git
We recommend using conda to manage packages, see above.
4. Align .fastq files with STARsolo.
Place each STAR output into the same parent directory, so that it follows this file tree: (DATADIR
in STARsolo_config.yaml
).
DATADIR
├── sample_A
│ ├── STARsolo
│ │ ├── Aligned.sortedByCoord.dedup.out.bam
│ │ ├── Aligned.sortedByCoord.dedup.out.bam.bai
│ │ ├── Aligned.sortedByCoord.out.bam
│ │ ├── Aligned.sortedByCoord.out.bam.bai
│ │ ├── Solo.out
│ │ │ ├── Gene
│ │ │ │ ├── filtered
│ │ │ │ │ ├── barcodes.tsv.gz
│ │ │ │ │ ├── features.tsv.gz
│ │ │ │ │ ├── matrix.mtx.gz
├── sample_B
│ ├── STARsolo
│ │ ├── Aligned.sortedByCoord.dedup.out.bam
│ │ ├── Aligned.sortedByCoord.dedup.out.bam.bai
│ │ ├── Aligned.sortedByCoord.out.bam
│ │ ├── Aligned.sortedByCoord.out.bam.bai
│ │ ├── Solo.out
│ │ │ ├── Gene
│ │ │ │ ├── filtered
│ │ │ │ │ ├── barcodes.tsv.gz
│ │ │ │ │ ├── features.tsv.gz
│ │ │ │ │ ├── matrix.mtx.gz
Please change the variable names in the config.yaml
as required for your analysis. This includes the following changes:
- Samples: STARsolo output directory names; each should be a directory in
DATADIR
- GENOME_FASTA: path to the .fasta (genome sequence) file used to build STAR reference
- GENOME_GTF: path to the .gtf (gene annotations) file used to build STAR reference
- DATADIR: Path to where the STARsolo outputs (
.../DATADIR/sample_A
) are stored. Also note that outputs will be stored in each sample's directory, individually. See above for the expected file tree. - TMPDIR: Directory to store temporary files.
- GTFTOGENEPRED: Path to gtfToGenePred tool. NOTE This step will only need to run once for each reference genome you use, and the resulting .refFlat file ("genes.refFlat") will be stored in the same directory as
GENOME_GTF
. - CORES: Number of cores used in each step of the pipeline. To run multiple samples in parallel, please specify total number of cores in the snakemake command (i.e. "snakemake -j {total cores}").
- MERGEBP: Number of bases to merge in groHMM. Smaller numbers creates more TARs but takes longer to run. We recommend keeping the default value of 500.
- THRESH: Used to set TARs coverage threshold. This is sequence depth dependent. Default coverage threshold set at 1 in 10,000,000 uniquely aligned reads. For example, if there are 500,000,000 total aligned reads, TARs with at least 50 reads are kept when THRESH is set to 10,000,000. A higher THRESH value increases the number of TARs kept after filtering. We recommend keeping the default value of 10000000.
From the command line, cd
into the directory .../TAR-SCRNA-seq/from_STARsolo
Please ensure the Snakefile and STARsolo_config.yaml files as well as the scripts folder are in the directory where you intend to run the pipeline.
#TODO- update output tree
All output files are stored in a directory inside of the STARsolo output .../DATADIR/sample_A/TAR/
- RefFlat format of TAR features with and without consideration of directionality stored in "TAR_reads.bed.gz.withDir.genes.refFlat" and "TAR_reads.bed.gz.noDir.refFlat.refFlat".
- Digital expression matrix for TAR features, with consideration of TAR directionality relative to annotated gene features, is stored in "TAR_expression_matrix_withDir.txt.gz". Additional rules within the Snakefile are provided to generate a matrix that does not consider strandedness- just uncomment them.
- A list of differentially expressed genes and uTARs in "results_out/{sample}/{sample}_diffMarkers.txt".
- A list of differentially expressed uTARs and their labels based on BLASTn results in "results_out/{sample}/{sample}_TAR_diff_uTAR_Features_Labeled.txt".
- Results of the BLASTn analysis for differentially expressed uTARs in "TAR_blastResults.txt".
Please cite our Nature Communications paper: Wang, M. F. Z. et al. Uncovering transcriptional dark matter via gene annotation independent single-cell RNA sequencing analysis. Nature Communications 12, 2158 (2021).