-
Notifications
You must be signed in to change notification settings - Fork 22
Walkthrough scripts only
This page runs through variant calling two samples from the ENA. It uses clockwork scripts only - ie it does not use a database or nextflow.
We will use two Mycobacterium tuberculosis samples from the ENA. This walkthrough will:
- Get the Clockwork code.
- Get and index reference genomes.
- Download the reads for the two samples.
- Run the decontamination pipeline on the reads.
- Run the variant calling pipeline on each sample.
Note: decontamination is optional (it is what we ran for processing CRyPTIC Mtb samples). You can skip it and just run variant calling if that makes sense for your data. There is nothing particularly special required for the input of variant calling, it just needs to be paired reads in FASTQ format.
The simplest way to run Clockwork is using a container, to avoid installing its numerous dependencies. This walkthrough uses a singularity image from the Clockwork Releases page.
To use the singularity image, every command has to be run inside the container. This means every command looks like this:
singularity exec clockwork.img COMMAND
where clockwork.img
is the singularity image file, and COMMAND
is the
command we would like to run.
Two references will be downloaded:
- A large collection of genomes to use for read decontamination.
- The standard H37Rv reference genome for mapping and variant calling.
Note that this walkthrough is for M. tuberculosis, where we recommend decontaminating reads to improve the quality of variant calls. If you are working on a different species, then it is up to you whether or not and how you decontaminate your reads. You can provide your own files by following the instructions on preparing remove contamination reference data.
To download the decontamination reference, run:
singularity exec clockwork.img download_tb_reference_files.pl Ref.download
It will create a new directory called Ref.download
, putting the required files
in there. Note that this could take up to one hour to run, depending on your internet
connection speed. It downloads a large number of genomes (including human),
and then processes them. If the script crashes, then rerun the same command
and it will continue, skipping any genomes that have already been downloaded.
Index the decontamination reference with this command:
singularity exec clockwork.img clockwork reference_prepare \
--contam_tsv Ref.download/remove_contam.tsv \
--outdir Ref.remove_contam \
Ref.download/remove_contam.fa.gz
That command made a new directory Ref.remove_contam
, containing files that
will be used later as input to the decontamination script.
We also need to make index files for the H37Rv reference, used for
variant calling. The genome was downloaded by the script we ran earlier
download_tb_reference_files.pl
and is here:
Ref.download/NC_000962.3.fa
.
Make the directory of indexed files to use later with the variant calling
pipeline:
singularity exec clockwork.img clockwork reference_prepare \
--outdir Ref.H37Rv \
Ref.download/NC_000962.3.fa
The clockwork container includes the script enaDataGet
from
the enaBrowserTools repository.
Use this to download the first sample SAMEA2534128:
singularity exec clockwork.img enaDataGet SAMEA2534128
The output in your terminal while it runs should look like this:
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551697/ERR551697_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551697/ERR551697_2.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551698/ERR551698_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551698/ERR551698_2.fastq.gz
Completed
Sample SAMEA2534128 contains two read runs, which we will have to handle later
when using all the reads to variant call. This is why there were four
FASTQ files downloaded. The script made a new directory called
SAMEA2534128
that contains the downloaded FASTQ files.
Running ls SAMEA2534128
shows:
ERR551697_1.fastq.gz ERR551697_2.fastq.gz ERR551698_1.fastq.gz ERR551698_2.fastq.gz
The second sample is SAMEA2534421
. Download the reads:
singularity exec clockwork.img enaDataGet -f fastq SAMEA2534421
Sample SAMEA2534421 only has one pair of FASTQ files:
ERR552113_1.fastq.gz ERR552113_2.fastq.gz
Decontaminating one sample comprises two stages:
- map the reads, making a sorted by name SAM file.
- make decontaminated FASTQ files from the SAM file.
There is a script to run for each of those two stages.
Make a SAM file for SAMEA2534421:
singularity exec clockwork.img clockwork map_reads \
--unsorted_sam SAMEA2534421 Ref.remove_contam/ref.fa SAMEA2534421.sam \
SAMEA2534421/ERR552113_1.fastq.gz SAMEA2534421/ERR552113_2.fastq.gz
SAMEA2534128 has two pairs of FASTQ files. We can give the script all four FASTQ files, and it will make a single SAM file of all reads:
singularity exec clockwork.img clockwork map_reads \
--unsorted_sam SAMEA2534128 Ref.remove_contam/ref.fa SAMEA2534128.sam \
SAMEA2534128/ERR551697_1.fastq.gz SAMEA2534128/ERR551697_2.fastq.gz \
SAMEA2534128/ERR551698_1.fastq.gz SAMEA2534128/ERR551698_2.fastq.gz
Make the decontaminated FASTQ files from the two SAM files we just made
SAMEA2534421.sam, SAMEA2534128.sam
.
singularity exec clockwork.img clockwork remove_contam \
Ref.remove_contam/remove_contam_metadata.tsv \
SAMEA2534421.sam SAMEA2534421.decontam.counts.tsv \
SAMEA2534421.decontam_1.fq.gz SAMEA2534421.decontam_2.fq.gz
singularity exec clockwork.img clockwork remove_contam \
Ref.remove_contam/remove_contam_metadata.tsv \
SAMEA2534128.sam SAMEA2534128.decontam.counts.tsv \
SAMEA2534128.decontam_1.fq.gz SAMEA2534128.decontam_2.fq.gz
We have made decontaminated FASTQ files for each sample. Additionally, the
script wrote a tab-delimited file of numbers of reads matching the
sets of genomes in the contamination reference. For example, this is the
output for sample SAMEA2534421, obtained by running the command
column -t SAMEA2534421.decontam.counts.tsv
:
Name Is_contam Reads
Bacteria 1 12
Human 1 943
NTM 1 1094
TB 0 1656682
Virus 1 0
Unmapped 0 55241
Reads_kept_after_remove_contam 0 1713150
The reads are nearly all TB, with 1656682 reads matching the TB genome, 55241 unmapped, and the rest are the bacteria (12), human (943) or NTMs (1094). The pipeline keeps all reads that match the H37Rv genome or are unmapped. Note that it always keeps or discards reads in their pairs, so that there are no unpaired reads retained.
Run variant calling on the decontaminated reads from sample SAMEA2534421:
singularity exec clockwork.img clockwork variant_call_one_sample \
--sample_name SAMEA2534421 \
Ref.H37Rv Var_call_SAMEA2534421 \
SAMEA2534421.decontam_1.fq.gz SAMEA2534421.decontam_2.fq.gz
Run variant calling on the decontaminated reads from sample SAMEA2534128:
singularity exec clockwork.img clockwork variant_call_one_sample \
--sample_name SAMEA2534128 \
Ref.H37Rv Var_call_SAMEA2534128 \
SAMEA2534128.decontam_1.fq.gz SAMEA2534128.decontam_2.fq.gz
The variant calling pipeline runs these stages:
- Trim adapter and low quality ends of reads using Trimmomatic
- Map the reads to H37Rv using minimap2.
- Remove optical PCR duplicates using samtools. At this point we have
the mapped reads in a sorted, indexed BAM file. If you want to keep this file
then add the option
--keep_bam
when running the variant calling script. Otherwise it is deleted to save disk space. - Variant calls are made independently using Cortex and bcftools.
- The two call sets from 4 are adjudicated to make a final call set using Minos.
The output files are:
-
final.vcf
- this is likely the only file you want. It is a VCF file of the final call set. -
cortex.vcf
andsamtools.vcf
- these are the two calls sets made in stage 4, used as input to Minos to make the final call set. -
map.bam
- the mapped reads in a sorted BAM file (only present if the option--keep_bam
was used). -
final.gvcf
- a GVCF file. At variant sites, the record from minos is used. At all other positions, the VCF record from samtools is used. -
final.gvcf.fasta
- a FASTA file generated from the GVCF, containing the inferred genome. Insertions in the sample sequence are removed, and deletions are replaced with Ns. This means that the sequence has the same length as the reference sequence. Low depth (<5) and heterozygous (FRS < 0.9) are replaced with Ns.