Merge pull request #44 from bioinform/updates_to_v0.2.1

Updates to v0.2.1
bioinform · Jul 11, 2019 · 95c218f · 95c218f
2 parents 7b39627 + 1bfc138
commit 95c218f
Show file tree

Hide file tree

Showing 23 changed files with 374 additions and 186 deletions.
diff --git a/README.md b/README.md
@@ -5,12 +5,17 @@ NeuSomatic is based on deep convolutional neural networks for accurate somatic m
 For more information contact us at [email protected]
 
 ## Publication
-If you use NeuSomatic in your work, please cite the following preprint:
+If you use NeuSomatic in your work, please cite the following papers:
 
 Sayed Mohammad Ebrahim Sahraeian, Ruolin Liu, Bayo Lau, Karl Podesta, Marghoob Mohiyuddin, Hugo Y. K. Lam, <br/> 
 [Deep convolutional neural networks for accurate somatic mutation detection. Nature Communications 10: 1041, (2019). <br/> 
 doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-019-09027-x)
 
+Sayed Mohammad Ebrahim Sahraeian, Li Tai Fang, Marghoob Mohiyuddin, Huixiao Hong, Wenming Xiao, <br/> 
+[Robust cancer mutation detection with deep learning models derived from tumor-normal sequencing data. bioRxiv (2019): 667261. <br/> 
+doi: https://doi.org/10.1101/667261](https://doi.org/10.1101/667261)
+
+
 ## Example Input Matrix
 ![Example input](resources/toy_example.png)
 
@@ -29,14 +34,14 @@ doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-
 
 ## Availability
 
-NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.0.1 to enable GPU acceleration for training/testing.
+NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.1.0 to enable GPU acceleration for training/testing.
 
 NeuSomatic first scans the genome to identify candidate variants and extract alignment information. 
 The binary for this step can be obtained at `neusomatic/bin` folder by running `./build.sh` (which requires cmake 3.13.2 and g++ 5.4.0).
 
 Python 3.7 and the following Python packages must be installed:
-* pytorch 1.0.1
-* torchvision 0.2.1
+* pytorch 1.1.0
+* torchvision 0.3.0
 * pybedtools 0.8.0
 * pysam 0.15.2
 * zlib 1.2.11
@@ -46,7 +51,7 @@ Python 3.7 and the following Python packages must be installed:
 * biopython 1.73
 
 It also depends on the following packages:
-* cudatoolkit 8.0 (if you want to use GPU)
+* cudatoolkit 9.0 (if you want to use GPU)
 * tabix 0.2.6
 * bedtools 2.27.1
 * samtools 1.9
@@ -55,7 +60,7 @@ You can install these packages using [anaconda](https://www.anaconda.com/downloa
 ```
 conda install zlib=1.2.11 numpy=1.15.4 scipy=1.2.0 cmake=3.13.2 imageio=2.5.0
 conda install pysam=0.15.2 pybedtools=0.8.0 samtools=1.9 tabix=0.2.6 bedtools=2.27.1 biopython=1.73 -c bioconda
-conda install pytorch=1.0.1 torchvision=0.2.1 cudatoolkit=8.0 -c pytorch
+conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=9.0 -c pytorch
 ```
 Then you can export the conda paths as:
 ```
@@ -88,13 +93,6 @@ For calling mode, the following inputs are required:
 * call region `.bed` file
 * trained model `.pth` file
 
-Reads in input `.bam` file should be sorted, indexed and have MD tags. If you are not sure that all you reads have MD tags, you should run the following command for both tumor and normal alignments:
-
-```
-samtools calmd -@ num_threads -b alignment.bam reference.fasta  > alignment.md.bam 
-samtools index alignment.md.bam
-```
-
 For the region `.bed` files, if you don't have any preferred target regions for training/calling, you can use the whole genome as the target region. Example bed files of major chromosomes for human hg38, hg19, and b37 references can be found at [resources](resources).
 
 ## Quick Test
@@ -236,20 +234,25 @@ You can then used the synthetic tumor/normal pair and the known *in silico* spik
 
 ## Trained Network Models
 We provide a set of trained NeuSomatic network models for general purpose usage. Users should note that these models are trained for sepcific settings and are not supposed to work perfectly for all circumestances.
+
+The SEQC-II pretrained models are the recommended NeuSomatic models and are analyzed in detail in [Sahraeian et al. 2019](https://doi.org/10.1101/667261).
+
 The following models can be found at `neusomatic/models` folder:
 
 
 ### Latest models
 Model                                              | Mode         | Training Information                                                        
 ---------------------------------------------------|---------------|-----------------------------------------------------------------------
-`NeuSomatic_v0.1.3_standalone_Dream3.pth` |  Stand-alone  | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x) 
-`NeuSomatic_v0.1.3_ensemble_Dream3.pth`   |  Ensemble     | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x)
-
-
+`NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` |  Stand-alone  | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-220x)
+`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth`   |  Ensemble     | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-220x)
+`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` |  Stand-alone  | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-390x)
+`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` |  Ensemble     | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-390x)
 
 ### Older models
 Model                                              | Mode         | Training Information                                                        
 ---------------------------------------------------|---------------|-----------------------------------------------------------------------
+`NeuSomatic_v0.1.3_standalone_Dream3.pth` |  Stand-alone  | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x)
+`NeuSomatic_v0.1.3_ensemble_Dream3.pth`   |  Ensemble     | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x)
 `NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth` |  Stand-alone  | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM,  ~30x) 
 `NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth`   |  Ensemble     | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM,  ~30x) 
 `NeuSomatic_v0.1.0_standalone_WEX_100purity.pth`   |  Stand-alone  | WEX (100% tumor and normal purities, Illumina, BWA-MEM,  ~125x)

diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -1,13 +1,13 @@
 FROM ubuntu:16.04
 
-ENV NEUSOMATIC_VERSION 0.2.0
+ENV NEUSOMATIC_VERSION 0.2.1
 ENV ZLIB_VERSION 1.2.11
 ENV NUMPY_VERSION 1.15.4
 ENV SCIPY_VERSION 1.2.0
 ENV IMAGEIO_VERSION 2.5.0
-ENV PYTORCH_VERSION 1.0.1
-ENV TORCHVISION_VERSION 0.2.1
-ENV CUDATOOLKIT_VERSION 8.0
+ENV PYTORCH_VERSION 1.1.0
+ENV TORCHVISION_VERSION 0.3.0
+ENV CUDATOOLKIT_VERSION 9.0
 ENV CMAKE_VERSION 3.13.2
 ENV PYBEDTOOLS_VERSION 0.8.0
 ENV PYSAM_VERSION 0.15.2

diff --git a/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh b/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh
@@ -109,7 +109,7 @@ num_lines=`cat ${SELECTOR} | wc -l`
 input_bed=${SELECTOR}
 if [[ $(( $total_bases / $num_lines )) -gt 50000 ]]
 then
-    echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.0 \\" >> $out_script
+    echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.2 \\" >> $out_script
     echo "/opt/somaticseq/utilities/split_mergedBed.py \\" >> $out_script
     echo "-infile /mnt/${SELECTOR} -outfile /mnt/${outdir}/split_regions.bed" >> $out_script
     echo "" >> $out_script

diff --git a/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh b/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh
@@ -187,10 +187,10 @@ echo "" >> $out_script
 echo 'echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2' >> $out_script
 echo "" >> $out_script
 
-echo "docker pull lethalfang/somaticseq:2.7.0" >> $out_script
+echo "docker pull lethalfang/somaticseq:2.7.2" >> $out_script
 echo "" >> $out_script
 
-echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.0 \\" >> $out_script
+echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.2 \\" >> $out_script
 echo "/opt/somaticseq/SomaticSeq.Wrapper.sh \\" >> $out_script
 echo "--output-dir       /mnt/${outdir} \\" >> $out_script
 echo "--genome-reference /mnt/${HUMAN_REFERENCE} \\" >> $out_script

diff --git a/neusomatic/include/Options.h b/neusomatic/include/Options.h
@@ -24,10 +24,15 @@ namespace neusomatic {
     {"num_threads",                     required_argument,      0,        't'},
     {"max_depth",                       required_argument,      0,        'd'},
     {"include_secondary",               no_argument,            0,        's'},
+    {"filter_duplicate",                no_argument,            0,        'D'},
+    {"filter_QCfailed",                 no_argument,            0,        'Q'},
+    {"filter_improper_pair",            no_argument,            0,        'E'},
+    {"filter_mate_unmapped",            no_argument,            0,        'F'},
+    {"filter_improper_orientation",     no_argument,            0,        'G'},
     {0, 0, 0, 0} // terminator
   };
 
-  const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cys";
+  const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cysDQEFG";
 
   void print_help()
   {
@@ -43,11 +48,16 @@ namespace neusomatic {
     std::cerr<< "-a/--min_af,                          minimum allele freq,                                                   default 0.1.\n";
     std::cerr<< "-f/--out_vcf_file,                    output vcf file path,                                                  required.\n";
     std::cerr<< "-o/--out_count_file,                  output count file path,                                                required.\n";
-    std::cerr<< "-w/--window_size,                     window size to scan the variants,                                      default is 15\n";
-    std::cerr<< "-y/--fully_contained,                 if this option is on. A read has to be fully contained in the region,  default is False\n";
-    //std::cerr<< "-t/--num_threads,                     number or thread used for building the count matrix,                   default is 4\n";
-    std::cerr<< "-d/--max_depth,                       maximum depth for building the count matrix,                           default is 40,000\n";
-    std::cerr<< "-s/--include_secondary,               consider secondary alignments,                                         default is False\n";
+    std::cerr<< "-w/--window_size,                     window size to scan the variants,                                      default is 15.\n";
+    std::cerr<< "-y/--fully_contained,                 if this option is on. A read has to be fully contained in the region,  default is False.\n";
+    // std::cerr<< "-t/--num_threads,                     number or thread used for building the count matrix,                   default is 4.\n";
+    std::cerr<< "-d/--max_depth,                       maximum depth for building the count matrix,                           default is 40,000.\n";
+    std::cerr<< "-s/--include_secondary,               consider secondary alignments,                                         default is False.\n";
+    std::cerr<< "-D/--filter_duplicate,                filter duplicate reads if the flag is set,                             default is False.\n";
+    std::cerr<< "-Q/--filter_QCfailed,                 filter QC failed reads if the flag is set,                             default is False.\n";
+    std::cerr<< "-E/--filter_improper_pair,            filter improper pairs if the flag is set,                              default is False.\n";
+    std::cerr<< "-F/--filter_mate_unmapped,            filter reads with unmapeed mates if the flag is set,                   default is False.\n";
+    std::cerr<< "-G/--filter_improper_orientation,     filter reads with improper orientation (not FR) or different chrom,    default is False.\n";
   }
 
   int parseInt(const char* optarg, int lower, const char *errmsg, void (*print_help)()) {
@@ -138,6 +148,21 @@ namespace neusomatic {
         case 's':
           opt.include_secondary() = true;
           break;
+        case 'D':
+          opt.filter_duplicate() = true;
+          break;
+        case 'Q':
+          opt.filter_QCfailed() = true;
+          break;
+        case 'E':
+          opt.filter_improper_pair() = true;
+          break;
+        case 'F':
+          opt.filter_mate_unmapped() = true;
+          break;
+        case 'G':
+          opt.filter_improper_orientation() = true;
+          break;
         case 'a':
           opt.min_allele_freq() = parseFloat(optarg, 0.0, 1.0, "-a/--min_af must be between 0 and 1", print_help);
           break;
@@ -263,6 +288,47 @@ struct Options {
   decltype(auto) include_secondary() const {
     return (include_secondary_);
   }
+
+  decltype(auto) filter_duplicate() const {
+    return (filter_duplicate_);
+  }
+
+  decltype(auto) filter_duplicate() {
+    return (filter_duplicate_);
+  }
+
+  decltype(auto) filter_QCfailed() const {
+    return (filter_QCfailed_);
+  }
+
+  decltype(auto) filter_QCfailed() {
+    return (filter_QCfailed_);
+  }
+
+  decltype(auto) filter_improper_pair() const {
+    return (filter_improper_pair_);
+  }
+
+  decltype(auto) filter_improper_pair() {
+    return (filter_improper_pair_);
+  }
+
+  decltype(auto) filter_mate_unmapped() const {
+    return (filter_mate_unmapped_);
+  }
+
+  decltype(auto) filter_mate_unmapped() {
+    return (filter_mate_unmapped_);
+  }
+
+  decltype(auto) filter_improper_orientation() const {
+    return (filter_improper_orientation_);
+  }
+
+  decltype(auto) filter_improper_orientation() {
+    return (filter_improper_orientation_);
+  }
+
 private:
   unsigned verbosity_ = 0;
   std::string bam_in_;
@@ -276,9 +342,14 @@ struct Options {
   float min_allele_freq_ = 0.01;
   int min_mapq_ = 0;
   int window_size_ = 500;
-  int num_threads_ = 4;
-  int max_depth_ = 40000;
+  int num_threads_ = 1;
+  int max_depth_ = 5000000;
   bool include_secondary_ = false;
+  bool filter_duplicate_ = false;
+  bool filter_QCfailed_ = false;
+  bool filter_improper_pair_ = false;
+  bool filter_mate_unmapped_ = false;
+  bool filter_improper_orientation_ = false;
 };
 }//namespace neusomatic
 

diff --git a/neusomatic/include/targeted_loading.hpp b/neusomatic/include/targeted_loading.hpp
@@ -50,16 +50,40 @@ class CaptureLayout {
     std::vector<BamRecord> records;
     BamRecord rec;
     while (bam_reader_.GetNextRecord(rec)) {
-      bool good = false;
+      bool good = true;
       if (full_contained) {
-        if (rec.Position() <= curr_ginv.left() && rec.PositionEnd() + 1 >= curr_ginv.right() && rec.MapQuality()>=opts_.min_mapq() && (!rec.SecondaryFlag())) {
-          good = true;
+        if (rec.Position() > curr_ginv.left() || rec.PositionEnd() + 1 < curr_ginv.right()) {
+          good = false;
         }
-      } else {
-        if (rec.Position() < curr_ginv.right() && curr_ginv.left() <= rec.PositionEnd() && rec.MapQuality()>=opts_.min_mapq() && (opts_.include_secondary() || !rec.SecondaryFlag())) { // overlapped
-          good = true;
+      }
+      else {
+        if (rec.Position() >= curr_ginv.right() || curr_ginv.left() > rec.PositionEnd()) { // overlapped
+          good = false;
         }
       }
+
+      if (rec.MapQuality() < opts_.min_mapq()) { // mapq filter
+        good = false; 
+      }
+      if (!opts_.include_secondary() && rec.SecondaryFlag()) { // secondary flag
+        good = false; 
+      }
+      if (opts_.filter_duplicate() && rec.DuplicateFlag()) { // duplicate flag
+        good = false;
+      }
+      if (opts_.filter_QCfailed() && rec.QCFailFlag()) { // QC failed flag
+        good = false;
+      }
+      if (opts_.filter_improper_pair() && !rec.ProperPair()) { // Proper pair flag. Set by aligner
+        good = false;
+      }
+      if (opts_.filter_mate_unmapped() && !rec.MateMappedFlag()) { // Mate is ummapped
+        good = false;
+      }
+      if (opts_.filter_improper_orientation() && !rec.ProperOrientation()) { // pair read has proper orientation (FR) and on same chrom.
+        good = false;
+      }
+
       if (good) {
         if(rec.GetCigar().size() == 0) {
           std::cerr << "warning: " << rec.Qname() << " has no cigar\n";

diff --git a/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth
diff --git a/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth
diff --git a/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth
diff --git a/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth
diff --git a/neusomatic/python/_version.py b/neusomatic/python/_version.py
@@ -1 +1 @@
-__version__ = "0.2.0"
+__version__ = "0.2.1"