Skip to content

Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing

License

Notifications You must be signed in to change notification settings

aquaskyline/Clairvoyante

Repository files navigation

Clairvoyante - A deep neural network based variant caller

License
Contact: Ruibang Luo
Email: [email protected]


The successor of Clairvoyante

Is called Clair. Clair is published in Nature Machine Intelligence. A preprint is available in bioRxiv


Introduction

Identifying the variants of DNA sequences sensitively and accurately is an important but challenging task in the field of genomics. This task is particularly difficult when dealing with Single Molecule Sequencing, the error rate of which is still tens to hundreds of times higher than Next Generation Sequencing. With the increasing prevalence of Single Molecule Sequencing, an efficient variant caller will not only expedite basic research but also enable various downstream applications. To meet this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type, zygosity, alternative allele and Indel length. On NA12878, Clairvoyante achieved 99.73%, 97.68% and 95.36% accuracy on known variants, and achieved 98.65%, 92.57%, 87.26% F1 score on the whole genome, in Illumina, PacBio, and Oxford Nanopore data, respectively. Training Clairvoyante with a sample and call variant on another shows that Clairvoyante is sample agnostic and general for variant calling. A slim version of Clairvoyante with reduced model parameters produced a much lower F1, suggesting the full model's power in disentangling subtle details in read alignment. Clairvoyante is the first method for Single Molecule Sequencing to finish a whole genome variant calling in two hours on a 28 CPU-core machine, with top-tier accuracy and sensitivity. A toolset was developed to train, utilize and visualize the Clairvoyante model easily, and is publically available here is this repo.

Clairvoyante is published in Nature Communications

A PyTorch version of Clairvoyante

A PyTorch implementation of Clairvoyante is available at HKU-BAL/Clairvoyante-PyTroch.

Gallery

Network topology

Network Topology

Tensor examples

Tensor examples

Activations of the conv1 hidden layer to a non-variant tensor

conv1


Installation

Step by step

git clone --depth=1 https://github.com/aquaskyline/Clairvoyante.git
cd Clairvoyante
curl http://www.bio8.cs.hku.hk/trainedModels.tbz | tar -jxf -

Make sure you have Tensorflow ≥ 1.0.0 installed, the following commands install the lastest CPU version of Tensorflow as well as other dependencies:

pip install tensorflow==1.9.0
pip install blosc
pip install intervaltree==2.1.0
pip install numpy

To check the version of Tensorflow you have installed:

python -c 'import tensorflow as tf; print(tf.__version__)'

To do variant calling using trained models, CPU will suffice. Clairvoyante uses all available CPU cores by default in callVar.py, 4 threads by default in callVarBam.py. The number of threads to be used can be controlled using the parameter --threads. To train a new model, a high-end GPU and the GPU version of Tensorflow is needed. To install the GPU version of tensorflow:

pip install tensorflow-gpu

The installation of the blosc library might fail if your CPU doesn't support the AVX2 instruction set. Alternatively, you can compile and install from the latest source code available in GitHub with the "DISABLE_BLOSC_AVX2" environment variable set.

Clairvoyante was written in Python2 (tested on Python 2.7.10 in Linux and Python 2.7.13 in MacOS). For Python3.X user, run python port23.py which does a quickfix using 2to3 and inserts package namespace to sys.path to tackle python2-style import problem under python3.

Using pure Python interpreter on Clairvoyante is slow. Please refer to the Speed up with PyPy section for speed up.

Using bioconda

conda create -n clairvoyante-conda-env -c bioconda clairvoyante
conda activate clairvoyante-conda-env
clairvoyante.py

The conda environment has the Pypy intepreter installed, but doesn't have the required package intervaltree. The reason why the package is not installed by default is because it is not available in any conda repositories. To install the intervaltree for Pypy, after activation, please use the following commands:

wget https://bootstrap.pypa.io/get-pip.py
pypy get-pip.py
pypy -m pip install --no-cache-dir intervaltree==2.1.0

Download the models to a folder.

curl http://www.bio8.cs.hku.hk/trainedModels.tbz | tar -jxf -

The commands above install the CPU version of TensorFlow in the virtual environment thus support only variant calling. To train a model, please install the GPU version of Tensorflow in the virtual environment:

conda remove tensorflow
conda install tensorflow-gpu

Use conda deactivate to exit the virtual environment.
Use conda activate clairvoyante-conda-env to re-enter the virtual environment.

Speed up with PyPy

Without a change to the code, using PyPy python interpreter on some tensorflow independent modules such as dataPrepScripts/ExtractVariantCandidates.py and dataPrepScripts/CreateTensor.py gives a 5-10 times speed up. Pypy python interpreter can be installed by apt-get, yum, Homebrew, MacPorts, etc. If you have no root access to your system, the official website of Pypy provides a portable binary distribution for Linux. Following is a rundown extracted from Pypy's website (pypy-5.8 in this case) on how to install the binaries.

wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.8-1-linux_x86_64-portable.tar.bz2
tar -jxf pypy-5.8-1-linux_x86_64-portable.tar.bz2
cd pypy-5.8-linux_x86_64-portable/bin
./pypy -m ensurepip
./pip install -U pip wheel intervaltree
# Use pypy as an inplace substitution of python to run the scripts in dataPrepScripts/

Alternatively, if you can use apt-get or yum in your system, please install both pypy and pypy-dev packages. And then install the pip for pypy.

sudo apt-get install pypy pypy-dev
wget https://bootstrap.pypa.io/get-pip.py
sudo pypy get-pip.py 
sudo pypy -m pip install intervaltree 

To guarantee a good user experience (good speed), pypy must be installed to run callVarBam.py (call variants from BAM), and callVarBamParallel.py that generate parallelizable commands to run callVarBam.py.
Tensorflow is optimized using Cython thus not compatible with pypy. For the list of scripts compatible to pypy, please refer to the Folder Stucture and Program Descriptions section.
Pypy is an awesome Python JIT intepreter, you can donate to the project.


Quick Start with Variant Calling

You have a slow way and a quick way to get some demo variant calls. The slow way generates required files from VCF and BAM files. The fast way downloads the required files.

Download testing dataset

I have plenty of time

wget 'http://www.bio8.cs.hku.hk/testingData.tar'
tar -xf testingData.tar
cd dataPrepScripts
sh PrepDataBeforeDemo.sh

I need some results now

wget 'http://www.bio8.cs.hku.hk/training.tar'
tar -xf training.tar

Call variants

Call variants from at known variant sites using a BAM file and a trained model

cd training
python ../clairvoyante/callVarBam.py \
       --chkpnt_fn ../trainedModels/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500 \
       --bam_fn ../testingData/chr21/chr21.bam \
       --ref_fn ../testingData/chr21/chr21.fa \
       --bed_fn ../testingData/chr21/chr21.bed \
       --call_fn chr21_calls.vcf \
       --ctgName chr21
less chr21_calls.vcf

Call variants from the tensors of candidate variant and a trained model

cd training
python ../clairvoyante/callVar.py --chkpnt_fn ../trainedModels/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500 --tensor_fn tensor_can_chr21 --call_fn tensor_can_chr21.vcf
less tensor_can_chr21.vcf

How to call variant directly from BAM

Input example

Input File name
BAM GIAB_v3.2.2_Illumina_50x_GRCh38_HG001.bam
Reference Genome GRCh38_full_analysis_set_plus_decoy_hla.fa
BED for where to call variants GRCh38_high_confidence_interval.bed
  • If no BED file was provided, Clairvoyante will call variants on the whole genome

Commands

python clairvoyante/callVarBamParallel.py \
       --chkpnt_fn trainedModels/fullv3-illumina-novoalign-hg001+hg002-hg38/learningRate1e-3.epoch500 \
       --ref_fn GRCh38_full_analysis_set_plus_decoy_hla.fa \
       --bam_fn GIAB_v3.2.2_Illumina_50x_GRCh38_HG001.bam \
       --bed_fn GRCh38_high_confidence_interval.bed \
       --sampleName HG001 \
       --output_prefix hg001 \
       --threshold 0.125 \
       --minCoverage 4 \
       --tensorflowThreads 4 \
       > commands.sh
export CUDA_VISIBLE_DEVICES=""
cat commands.sh | parallel -j4
vcfcat hg001*.vcf | vcfstreamsort | bgziptabix hg001.vcf.gz
  • parallel -j4 will run 4 commands in parallel. Each command using at most --tensorflowThreads 4 threads. vcfcat, vcfstreamsort and bgziptabix are a part of vcflib.
  • If you don't have parallel installed on your computer, try awk '{print "\""$0"\""}' commands.sh | xargs -P4 -L1 sh -c.
  • CUDA_VISIBLE_DEVICES="" makes GPUs invisible to Clairvoyante so it will use CPU only. Please notice that unless you want to run commands.sh in serial, you cannot use GPU because one running copy of Clairvoyante will occupy all available memory of a GPU. While the bottleneck of callVarBam.py is at the CPU only CreateTensor.py script, the effect of GPU accelerate is insignificant (roughly about 15% faster). But if you have multiple GPU cards in your system, and you want to utilize them in variant calling, you may want split the commands.sh in to parts, and run the parts by firstly export CUDA_VISIBLE_DEVICES="$i", where $i is an integer from 0 identifying the seqeunce of the GPU to be used.

VCF Output Format

clairvoyante/callVar.py outputs variants in VCF format with version 4.1 specifications.
Clairvoyante can predict the exact length of insertions and deletions shorter than or equal to 4bp. For insertions and deletions with a length between 5bp to 15bp, callVar guesses the length from input tensors. The indels with guessed length are denoted with a LENGUESS info tag. Although the guessed indel length might be incorrect, users can still benchmark Clairvoyante's sensitivity by matching the indel positions to other callsets. For indels longer than 15bp, callVar.py outputs them as SV without providing an alternative allele. To fit into a different usage scenario, Clairvoyante allows users to extend its model easily to support exact length prediction on longer indels by adding categories to the model output. However, this requires additional training data on the new categories. Users can also increase the length limit from where an indel is outputted as a SV by increasing the parameter flankingBaseNum from 16bp to a higher value. This extends the flanking bases to be considered with a candidate variant.


Build a Model

Quick start with a model training demo

wget 'http://www.bio8.cs.hku.hk/testingData.tar'
tar -xf testingData.tar
cd clairvoyante
python demoRun.py

Jupyter notebook interactive training

Please visit: jupyter_nb/demo.ipynb


Visualization

Interactively visualizing Input Tensors, Activation in Hidden Layers and Comparing Predicted Results (Suggested)

jupyter_nb/visualization.ipynb

Output tensors and layer activations to PNG figures

getTensorAndLayerPNG.py. You will need to input a model, a file with one or more tensors created by CreateTensor.py and optionally a file with the truth variants created by GetTruth.py. The script will create 7 PNG figures for each line of tensor, including 1) input tensors; 2) conv1 layer activations; 3) conv2 layer activations; 4) conv3 layer activations; 5) fc4 layer activations; 6) fc5 layer activations and 7) output predictions and the truth.

Tensorboard

The --olog_dir option provided in the training scripts outputs a folder of log files readable by the Tensorboard. It can be used to visualize the dynamics of parameters during training at each epoch.
You can also use the PCA and t-SNE algorithms provided by TensorBoard in the Embedding page to analyze the predictions made by a model. clairvoyante/getEmbedding.py helps you to prepare a folder for the purpose.


Folder Stucture and Program Descriptions

You can also run the program to get the parameter details.

dataPrepScripts/ Data Preparation Scripts. Outputs are gzipped unless using standard output. Scripts in this folder are compatible with pypy.
ExtractVariantCandidates.py Extract the position of variant candidates. Input: BAM; Reference FASTA. Important options: --threshold "Minimum alternative allelic fraction to report a candidate"; --minCoverage "Minimum coverage to report a candidate".
GetTruth.py Extract the variants from a truth VCF. Input: VCF.
CreateTensor.py Create tensors for candidates or truth variants. Input: A candidate list; BAM; Reference FASTA. Important option: --considerleftedge "Count the left-most base-pairs of a read for coverage even if the starting position of a read is after the starting position of a tensor. Enable if you are 1) using reads shorter than 100bp, 2) using a tensor with flanking length longer than 16bp, and 3) you are using amplicon sequencing or other sequencing technologies, in which reads starting positions are random is not a basic assumption".
PairWithNonVariants.py Pair truth variant tensors with non-variant tensors. Input: Truth variants tensors; Candidate variant tensors. Important options: --amp x "1-time truth variants + x-time non-variants".
ChooseItemInBed.py Helper script. Output the items overlapping with input the BED file.
CountNumInBed.py Helper script. Count the number of items overlapping with the input BED file.
param.py Global parameters for the scripts in the folder.
PrepDataBeforeDemo.sh A Demo showing how to prepare data for model training.
PrepDataBeforeDemo.pypy.sh The same demo but using pypy in place of python. pypy is highly recommended. It's easy to install, and makes the scripts run 5-10 times faster.
CombineMultipleDatasetsForTraining.sh Use chr21 and chr22 to exemplify how to maunually combine multiple datasets for model training.
CombineMultipleDatasetsForTraining.py A helper script for combining multiple datasets. You still need to run PairWithNonVariants.py and tensor2Bin.py after this script.
clairvoyante/ Model Training and Variant Caller Scripts. Scripts in this folder are NOT compatible with pypy. Please run with python.
callVar.py Call variants using candidate variant tensors.
callVarBam.py Call variants directly from a BAM file.
callVarBamParallel.py Generate callVarBam.py commands that can be run in parallel. A BED file is required to specify the regions for variant calling. --refChunkSize set the genome chuck size per job.
demoRun.py A Demo showing how to train a model from scratch.
evaluate.py Evaluate a model in terms of base change, zygosity, variant type and indel length.
param.py Hyperparameters for model training and other global parameters for the scripts in the folder.
tensor2Bin.py Create a compressed binary tensors file to facilitate and speed up future usage. Input: Mixed tensors by PairWithNonVariants.py; Truth variants by GetTruth.py and a BED file marks the high confidence regions in the reference genome.
train.py Training a model using adaptive learning rate decay. By default, the learning rate will decay for three times. Input a binary tensors file created by Tensor2Bin.py is highly recommended.
trainNonstop.py Helper script. Train a model continuously using the same learning rate and l2 regularization lambda.
trainWithoutValidationNonstop.py Helper script. Train a model continuously using the same learning rate and l2 regularization lambda. Take all the input tensors as training data and do not calculate loss in the validation data.
calTrainDevDiff.py Helper script. Calculate the training loss and validation loss on a trained model.
getTensorAndLayerPNG.py Create high resolution PNG figures to visualize input tensor, layer activations and output.
getEmbedding.py Prepare a folder readable by Tensorboard for visualizing predicted results.
clairvoyante_v3.py Clairvoyante network topology v3.
clairvoyante_v3_slim.py Clairvoyante network topology v3 slim. With ten times fewer parameters than the full network, it trains about 1.5 times faster than the full network. It performs only about 1% less in precision and recall rates for Illumina data.
utils_v2.py Helper functions to the network.

GIAB provides a BED file that marks the high confidence regions in the reference. The models perform better by using only the truth variants in these regions for training. If you don't have such a BED file, you can use a BED file that covers the whole genome.


About the Trained Models

The trained models are in the trainedModels/ folder.

Folder Tech Aligner Ref Sample
fullv3-illumina-novoalign
-hg001+hg002-hg38
Illumina HiSeq25001 Novoalign 3.02.07 hg38 NA12878+NA24385
fullv3-illumina-novoalign
-hg001-hg38
Illumina HiSeq25001 Novoalign 3.02.07 hg38 NA12878
fullv3-illumina-novoalign
-hg002-hg38
Illumina HiSeq25001 Novoalign 3.02.07 hg38 NA24385
fullv3-pacbio-ngmlr
-hg001+hg002+hg003+hg004-hg19
mainly PacBio P6-C42 NGMLR 0.2.6 hg19 NA12878+NA24385+NA24149
+NA24143
fullv3-pacbio-ngmlr
-hg001+hg002-hg19
PacBio P6-C42 NGMLR 0.2.6 hg19 NA12878+NA24385
fullv3-pacbio-ngmlr
-hg001-hg19
PacBio P6-C42 NGMLR 0.2.6 hg19 NA12878
fullv3-pacbio-ngmlr
-hg002-hg19
PacBio P6-C42 NGMLR 0.2.6 hg19 NA24385
fullv3-ont-ngmlr
-hg001-hg19
Oxford Nanopore Minion R9.43 NGMLR 0.2.6 hg19 NA12878

1 Also using Illumina TruSeq (LT) DNA PCR-Free Sample Prep Kits. Zook et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. 2016

2 Pendelton et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. 2015

3 Nanopore WGS Consortium

* Each folder contains one or more models. Each model contains three files suffixed data-00000-of-00001, index and meta, respectively. Only the prefix is needed when using the model with Clairvoyante. Using the prefix learningRate1e-3.epoch999.learningRate1e-4.epoch1499 as an example, it means that the model has trained for 1000 epochs at learning rate 1e-3, then another 500 epochs at learning rate 1e-4. Lambda for L2 regularization was set the same as learning rate.


Performance of GPUs in model training

Equiptment Seconds per Epoch per 11M Variant Tensors
Tesla V100 90
GTX 1080 Ti 170
GTX 980 350
GTX Titan 520
Tesla K40 (-ac 3004,875) 580
Tesla K40 620
Tesla K80 (one socket) 600
GTX 680 780
Intel Xeon E5-2680 v4 28-core 2900

About Setting the Alternative Allele Frequency Cutoff

Different from model training, in which all genome positions are candidates but randomly subsampled for training, variant calling using a trained model will require the user to define a minimal alternative allele frequency cutoff for a genome position to be considered as a candidate for variant calling. For all sequencing technologies, the lower the cutoff, the lower the speed. Setting a cutoff too low will increase the false positive rate significantly, while too high will increase the false negative rate significantly. The option --threshold controls the cutoff in these three scripts callVarBam.py, callVarBamParallel.py and ExtractVariantCandidates.py. The suggested cutoff is listed below for different sequencing technologies. A higher cutoff will increase the accuracy of datasets with poor sequencing quality, while a lower cutoff will increase the sensitivity in applications like clinical research. Setting a lower cutoff and further filter the variants by their quality is also a good practice.

Sequencing Technology Alt. AF Cutoff
Illumina 0.125
PacBio P6-C4 0.2
ONT R9.4 0.25

About the Testing Data

The testing dataset 'testingData.tar' includes:

  1. the Illumina alignments of chr21 and chr22 on GRCh38 from GIAB Github, downsampled to 50x.
  2. the truth variants v3.3.2 from GIAB.

Limitations

On variants with two alternative alleles (GT: 1/2)

Clairvoyante network version 3 can only output one of the two possible alternative alleles at a position. We will further extend the network to support genome variants with two alternative alleles.

On training

In rare cases, the model training will stuck early at a local optimal and cannot be further optimized without a higher learning rate. As we observed ,the problem only happens at the very beginning of model training, and can be predicated if the loss remains stable in the first few training epochs.