Skip to content

Walkthrough scripts only

martinghunt edited this page Nov 7, 2023 · 3 revisions

Walkthrough: variant calling using clockwork scripts

This page runs through variant calling two samples from the ENA. It uses clockwork scripts only - ie it does not use a database or nextflow.

We will use two Mycobacterium tuberculosis samples from the ENA. This walkthrough will:

  1. Get the Clockwork code.
  2. Get and index reference genomes.
  3. Download the reads for the two samples.
  4. Run the decontamination pipeline on the reads.
  5. Run the variant calling pipeline on each sample.

Note: decontamination is optional (it is what we ran for processing CRyPTIC Mtb samples). You can skip it and just run variant calling if that makes sense for your data. There is nothing particularly special required for the input of variant calling, it just needs to be paired reads in FASTQ format.

Get the Clockwork code

The simplest way to run Clockwork is using a container, to avoid installing its numerous dependencies. This walkthrough uses a singularity image from the Clockwork Releases page.

To use the singularity image, every command has to be run inside the container. This means every command looks like this:

singularity exec clockwork.img COMMAND

where clockwork.img is the singularity image file, and COMMAND is the command we would like to run.

Get and index reference genomes

Two references will be downloaded:

  1. A large collection of genomes to use for read decontamination.
  2. The standard H37Rv reference genome for mapping and variant calling.

Note that this walkthrough is for M. tuberculosis, where we recommend decontaminating reads to improve the quality of variant calls. If you are working on a different species, then it is up to you whether or not and how you decontaminate your reads. You can provide your own files by following the instructions on preparing remove contamination reference data.

To download the decontamination reference, run:

singularity exec clockwork.img download_tb_reference_files.pl Ref.download

It will create a new directory called Ref.download, putting the required files in there. Note that this could take up to one hour to run, depending on your internet connection speed. It downloads a large number of genomes (including human), and then processes them. If the script crashes, then rerun the same command and it will continue, skipping any genomes that have already been downloaded.

Index the decontamination reference with this command:

singularity exec clockwork.img clockwork reference_prepare \
  --contam_tsv Ref.download/remove_contam.tsv \
  --outdir Ref.remove_contam \
  Ref.download/remove_contam.fa.gz

That command made a new directory Ref.remove_contam, containing files that will be used later as input to the decontamination script.

We also need to make index files for the H37Rv reference, used for variant calling. The genome was downloaded by the script we ran earlier download_tb_reference_files.pl and is here: Ref.download/NC_000962.3.fa. Make the directory of indexed files to use later with the variant calling pipeline:

singularity exec clockwork.img clockwork reference_prepare \
  --outdir Ref.H37Rv \
  Ref.download/NC_000962.3.fa

Download the reads

The clockwork container includes the script enaDataGet from the enaBrowserTools repository. Use this to download the first sample SAMEA2534128:

singularity exec clockwork.img enaDataGet SAMEA2534128

The output in your terminal while it runs should look like this:

Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551697/ERR551697_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551697/ERR551697_2.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551698/ERR551698_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/ERR551/ERR551698/ERR551698_2.fastq.gz
Completed

Sample SAMEA2534128 contains two read runs, which we will have to handle later when using all the reads to variant call. This is why there were four FASTQ files downloaded. The script made a new directory called SAMEA2534128 that contains the downloaded FASTQ files. Running ls SAMEA2534128 shows:

ERR551697_1.fastq.gz  ERR551697_2.fastq.gz  ERR551698_1.fastq.gz  ERR551698_2.fastq.gz

The second sample is SAMEA2534421. Download the reads:

singularity exec clockwork.img enaDataGet -f fastq SAMEA2534421

Sample SAMEA2534421 only has one pair of FASTQ files:

ERR552113_1.fastq.gz  ERR552113_2.fastq.gz

Decontaminate the reads

Decontaminating one sample comprises two stages:

  1. map the reads, making a sorted by name SAM file.
  2. make decontaminated FASTQ files from the SAM file.

There is a script to run for each of those two stages.

Make a SAM file for SAMEA2534421:

singularity exec clockwork.img clockwork map_reads \
  --unsorted_sam SAMEA2534421 Ref.remove_contam/ref.fa SAMEA2534421.sam \
  SAMEA2534421/ERR552113_1.fastq.gz SAMEA2534421/ERR552113_2.fastq.gz

SAMEA2534128 has two pairs of FASTQ files. We can give the script all four FASTQ files, and it will make a single SAM file of all reads:

singularity exec clockwork.img clockwork map_reads \
  --unsorted_sam SAMEA2534128 Ref.remove_contam/ref.fa SAMEA2534128.sam \
  SAMEA2534128/ERR551697_1.fastq.gz SAMEA2534128/ERR551697_2.fastq.gz \
  SAMEA2534128/ERR551698_1.fastq.gz SAMEA2534128/ERR551698_2.fastq.gz

Make the decontaminated FASTQ files from the two SAM files we just made SAMEA2534421.sam, SAMEA2534128.sam.

singularity exec clockwork.img clockwork remove_contam \
  Ref.remove_contam/remove_contam_metadata.tsv \
  SAMEA2534421.sam SAMEA2534421.decontam.counts.tsv \
  SAMEA2534421.decontam_1.fq.gz SAMEA2534421.decontam_2.fq.gz

singularity exec clockwork.img clockwork remove_contam \
  Ref.remove_contam/remove_contam_metadata.tsv \
  SAMEA2534128.sam SAMEA2534128.decontam.counts.tsv \
  SAMEA2534128.decontam_1.fq.gz SAMEA2534128.decontam_2.fq.gz

We have made decontaminated FASTQ files for each sample. Additionally, the script wrote a tab-delimited file of numbers of reads matching the sets of genomes in the contamination reference. For example, this is the output for sample SAMEA2534421, obtained by running the command column -t SAMEA2534421.decontam.counts.tsv:

Name                            Is_contam  Reads
Bacteria                        1          12
Human                           1          943
NTM                             1          1094
TB                              0          1656682
Virus                           1          0
Unmapped                        0          55241
Reads_kept_after_remove_contam  0          1713150

The reads are nearly all TB, with 1656682 reads matching the TB genome, 55241 unmapped, and the rest are the bacteria (12), human (943) or NTMs (1094). The pipeline keeps all reads that match the H37Rv genome or are unmapped. Note that it always keeps or discards reads in their pairs, so that there are no unpaired reads retained.

Variant calling

Run variant calling on the decontaminated reads from sample SAMEA2534421:

singularity exec clockwork.img clockwork variant_call_one_sample \
  --sample_name SAMEA2534421 \
  Ref.H37Rv Var_call_SAMEA2534421 \
  SAMEA2534421.decontam_1.fq.gz SAMEA2534421.decontam_2.fq.gz

Run variant calling on the decontaminated reads from sample SAMEA2534128:

singularity exec clockwork.img clockwork variant_call_one_sample \
  --sample_name SAMEA2534128 \
  Ref.H37Rv Var_call_SAMEA2534128 \
  SAMEA2534128.decontam_1.fq.gz SAMEA2534128.decontam_2.fq.gz

The variant calling pipeline runs these stages:

  1. Trim adapter and low quality ends of reads using Trimmomatic
  2. Map the reads to H37Rv using minimap2.
  3. Remove optical PCR duplicates using samtools. At this point we have the mapped reads in a sorted, indexed BAM file. If you want to keep this file then add the option --keep_bam when running the variant calling script. Otherwise it is deleted to save disk space.
  4. Variant calls are made independently using Cortex and bcftools.
  5. The two call sets from 4 are adjudicated to make a final call set using Minos.

The output files are:

  • final.vcf - this is likely the only file you want. It is a VCF file of the final call set.
  • cortex.vcf and samtools.vcf - these are the two calls sets made in stage 4, used as input to Minos to make the final call set.
  • map.bam - the mapped reads in a sorted BAM file (only present if the option --keep_bam was used).
  • final.gvcf - a GVCF file. At variant sites, the record from minos is used. At all other positions, the VCF record from samtools is used.
  • final.gvcf.fasta - a FASTA file generated from the GVCF, containing the inferred genome. Insertions in the sample sequence are removed, and deletions are replaced with Ns. This means that the sequence has the same length as the reference sequence. Low depth (<5) and heterozygous (FRS < 0.9) are replaced with Ns.