-
Notifications
You must be signed in to change notification settings - Fork 5
Home
The following commands exemplify a full run of SCAN2, including cross sample panel building and indel calling, on a SLURM cluster.
You will need:
- the BAM files of interest (your scWGS samples and at least one bulk sample)
- BAM files from scWGS data (including a bulk sample) from a completely different individual. These files are necessary to construct a cross-sample panel for filtering indel calls that SCAN2 will make. Sometimes, you might want to call sSNVs and sIDs from these BAM files too, but you definitely need BAM files from a different individual to construct a cross-sample filter. For our purposes, we used a set of 52 neurons assayed by PTA and scWGS from Luquette et al 2021.
- a CSV file (you can title it
panel_meta.csv
) containing metadata information on #1 and #2: It would look like so for 4 samples: two scWGS from PTA, one scWGS from MDA, and one bulk sample
donor,sample,amp
d1,single_cell_1,PTA
d1,single_cell_2,PTA
d2,single_cell_3,MDA
d1,hbulk,bulk
- The required sequence/variant references installed per the instructions in the SCAN2 github repository.
- Ideally, you are running this entire single-cell mutation calling workflow on a cluster with a job scheduler to allow for job parallelization (eg: SLURM). The instructions that follow will provide options for running on our local SLURM cluster.
- A list of genomic intervals over which you can split the GATK runs. This is strongly recommended so that you can parallelize the jobs effectively. For typical runs of <5 single cells, dividing chrs 1-22 of the human genome into 300 windows works well. For much larger projects (and especially when building a cross-sample panel with 50-100 single cells and bulks), we often divide chrs 1-22 into ~30,000 windows of ~100 kb each. These windows can be generated using standard tools such as
bedtools makewindows -g <genome file> -n <number of desired windows>
. Ensure that chromosomes are named in the same way as the aligned BAMs (i.e., with or without a 'chr' prefix). DO NOT INCLUDE SEX CHROMOSOMES IN YOUR PANEL. THESE ARE CURRENTLY UNSUPPORTED.
An automated version of the below will be provided in a script, but here are the basic instructions below. Suppose that we go back to the panel_meta.csv
from the Prerequisites
step. We can set up a panel as below
#!/bin/bash
scan2 -d panel init
scan2 -d panel config \
--verbose \
--analysis makepanel \
--regions-file /path/to/panel_regions_300windows.txt \
--verbose \
--analysis makepanel \
--ref /path/to/reference/genome/fasta/file.fasta \
--dbsnp /path/to/dbsnp/germline/variants.vcf \
--shapeit-refpanel /path/to/1000GP_Phase3 \
--makepanel-metadata panel_meta.csv \
--bulk-bam /path/to/bulk.bam \
--bam /path/to/single_cell_1.bam \
--bam /path/to/single_cell_2.bam \
--bam /path/to/single_cell_3.bam
scan2 -d panel validate
scan2 -d panel \
makepanel --joblimit 300 \
--drmaa ' -p park -A park_contrib -c {threads} --mem={resources.mem} -t 12:00:00 -o %logdir/slurm-%A.log' \
--snakemake-args ' --keep-going --max-status-checks-per-second 0.1' # --dryrun' \
The first command initializes your job and creates a folder called panel
. The second command configures the entire run (essentially, by creating a file called panel/scan.yaml
which will contain all of the parameters for this run). The third command will make sure the job has been specified correctly. The fourth command will actually run the job. Note a few things about this command.
-
--joblimit 300
will specify the max number of jobs that will be submitted at once. Here, at most 300 jobs will be submitted at once. You could increase this number, but be wary about how many cores your queue can access. - The string after
--drmaa
will help submit your jobs to SLURM (our local cluster supports DRMAA, which will manage how those jobs are submitted). NOTE: Snakemake's authors now recommend against using DRMAA (and--drmaa
); instead,--cluster
should be used. In the commands below,--cluster
can replace all instances of--drmaa
. -
-c {threads}
and--mem={resources.mem}
are crucial. They link up snakemake's resource description with SLURM's sbatch for job runs. -
-t 12:00:00
gives each job a 12-hour time limit. You could go up to 24 hours (i.e.-t 24:00:00
), but generally, shorter jobs are scheduled more easily than 24-hour jobs.
If you want to see the set of operations for the job and not actually execute the operations, you can replace the last command with the following:
scan2 -d panel \
makepanel --joblimit 1000 \
--snakemake-args ’ --dryrun'
This will just do a dry run, i.e. print out all of the jobs that will be run and exit.
Once you successfully run the job, you will end up with two new files
$ls panel/panel/panel.tab.gz{,.tbi}
panel/panel/panel.tab.gz panel/panel/panel.tab.gz.tbi
This is your panel used for filtering. Make sure that you keep the panel.tab.gz.tbi
file with the panel.tab.gz
file.
The SCAN2 repository gives the basic instructions for how to run the demo. These instructions are modified, taking into account the cross-sample panel that you have just created and the other steps needed for parallelizing jobs. Let's say that we want to do analyses on just single_cell_1
and single_cell_2
. Run the below commands in the same directory as where you created the panel
folder (i.e. the run_job
and panel
folders should sit in the same directory).
scan2 -d run_job init
cd run_job
# Configure analysis parameters. Can be run multiple times to change
# parameters.
scan2 config \
--verbose \
--ref /path/to/reference/genome/fasta/file.fasta \
--dbsnp /path/to/dbsnp/germline/variants.vcf \
--shapeit-refpanel /path/to/1000GP_Phase3 \
--regions-file /path/to/panel_regions_300windows.txt \
--bulk-bam /path/to/hunamp.chr22.bam \
--sc-bam /path/to/single_cell_1.bam \
--sc-bam /path/to/single_cell_2.bam \
--abmodel-n-cores 10 \
--cross-sample-panel panel/panel/panel.tab.gz
scan2 validate
scan2 run \
--joblimit 300 \
--drmaa ' -p park -A park_contrib -c {threads} --mem={resources.mem} -t 12:00:00 -o %logdir/slurm-%A.log' \
To explain some of these steps
-
--abmodel-n-cores 10
is suggested for the # of cores to use for fitting the allele balance model. You could go up to 20, but 10 works for our machines much better. - Note that we specified our
--regions-file /path/to/panel_regions_300windows.txt
as the regions over which we do GATK mutation calls. - Note the
--drmaa
string - Note that the cross-sample panel that we constructed in Step 1 is provided as
--cross-sample-panel panel/panel/panel.tab.gz
.
The rescue
step will use signature-based rescue described in the SCAN2 paper for rescuing mutations. Unlike steps 1 and 2, which do take a while to run, step 3 runs relatively quickly. Run this, like before, in the same directory where panel
and run_job
are.
scan2 -d rescue init
scan2 -d rescue config \
--verbose \
--analysis rescue \
--rescue-target-fdr 0.01 \
--scan2-object single_cell_1 run_job/call_mutations/single_cell_1/scan2_object.rda \
--scan2-object single_cell_2 run_job/call_mutations/single_cell_2/scan2_object.rda
scan2 -d rescue validate
scan2 -d rescue rescue --joblimit 20
Some notes:
- if you wanted to load mutations from another SCAN2 run (say
other_muts.csv
), you can add them like so
scan2 -d rescue config \
--verbose \
--analysis rescue \
--add-muts other_muts.csv \
--rescue-target-fdr 0.01 \
--scan2-object single_cell_1 run_job/call_mutations/single_cell_1/scan2_object.rda \
--scan2-object single_cell_2 run_job/call_mutations/single_cell_2/scan2_object.rda
- You can parallelize this job if you want, but it runs within a few minutes, so in theory you could just let this job proceed on your terminal without submitting as an
sbatch
job.
The panel
and run
steps are the most time-consuming steps in this workflow, particularly the GATK steps. It helps to think about how many compute-hours you will require to run your jobs and, with parallelization, how much real-world time your runs will take. Here's a breakdown for a run of 40 PTA single cells that we are interested in analyzing, along with 52 PTA neurons from Luquette et al 2021 to help us construct our cross-sample panel;
- We estimate that these 92 cells will take about 80,000 compute hours to construct the cross-sample panel.
- The core SCAN2 run has three major time-consuming units a. There are two GATK runs that occur during the actual SCAN2 run. For the 40 cells of interest, each GATK job will take about 40,000 compute hours, so we expect that the GATK runs in the actual SCAN2 run itself will take about 80,000 more compute hours. b. After the GATK runs, the allele balance model is fit, taking about 6 compute hours per chromosome per cell. This brings us up to about 5,000 more compute hours. c. The rest of the pipeline in the core will take about the same number of hours to finish, or another 5,000 compute hours
- The signature-based rescue takes only a few minutes.
In all, we estimate a run of 170,000 compute hours. We suggested --joblimit 300
, which means that the above jobs would take about 23-24 days to run on SLURM with DRMAA, on our local cluster. If you used --joblimit 800
(rather large # of jobs), we are looking at about 8-9 days.