Skip to content

Commit

Permalink
Merge pull request #44 from bioinform/updates_to_v0.2.1
Browse files Browse the repository at this point in the history
Updates to v0.2.1
  • Loading branch information
msahraeian authored Jul 11, 2019
2 parents 7b39627 + 1bfc138 commit 95c218f
Show file tree
Hide file tree
Showing 23 changed files with 374 additions and 186 deletions.
37 changes: 20 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,17 @@ NeuSomatic is based on deep convolutional neural networks for accurate somatic m
For more information contact us at [email protected]

## Publication
If you use NeuSomatic in your work, please cite the following preprint:
If you use NeuSomatic in your work, please cite the following papers:

Sayed Mohammad Ebrahim Sahraeian, Ruolin Liu, Bayo Lau, Karl Podesta, Marghoob Mohiyuddin, Hugo Y. K. Lam, <br/>
[Deep convolutional neural networks for accurate somatic mutation detection. Nature Communications 10: 1041, (2019). <br/>
doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-019-09027-x)

Sayed Mohammad Ebrahim Sahraeian, Li Tai Fang, Marghoob Mohiyuddin, Huixiao Hong, Wenming Xiao, <br/>
[Robust cancer mutation detection with deep learning models derived from tumor-normal sequencing data. bioRxiv (2019): 667261. <br/>
doi: https://doi.org/10.1101/667261](https://doi.org/10.1101/667261)


## Example Input Matrix
![Example input](resources/toy_example.png)

Expand All @@ -29,14 +34,14 @@ doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-

## Availability

NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.0.1 to enable GPU acceleration for training/testing.
NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.1.0 to enable GPU acceleration for training/testing.

NeuSomatic first scans the genome to identify candidate variants and extract alignment information.
The binary for this step can be obtained at `neusomatic/bin` folder by running `./build.sh` (which requires cmake 3.13.2 and g++ 5.4.0).

Python 3.7 and the following Python packages must be installed:
* pytorch 1.0.1
* torchvision 0.2.1
* pytorch 1.1.0
* torchvision 0.3.0
* pybedtools 0.8.0
* pysam 0.15.2
* zlib 1.2.11
Expand All @@ -46,7 +51,7 @@ Python 3.7 and the following Python packages must be installed:
* biopython 1.73

It also depends on the following packages:
* cudatoolkit 8.0 (if you want to use GPU)
* cudatoolkit 9.0 (if you want to use GPU)
* tabix 0.2.6
* bedtools 2.27.1
* samtools 1.9
Expand All @@ -55,7 +60,7 @@ You can install these packages using [anaconda](https://www.anaconda.com/downloa
```
conda install zlib=1.2.11 numpy=1.15.4 scipy=1.2.0 cmake=3.13.2 imageio=2.5.0
conda install pysam=0.15.2 pybedtools=0.8.0 samtools=1.9 tabix=0.2.6 bedtools=2.27.1 biopython=1.73 -c bioconda
conda install pytorch=1.0.1 torchvision=0.2.1 cudatoolkit=8.0 -c pytorch
conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=9.0 -c pytorch
```
Then you can export the conda paths as:
```
Expand Down Expand Up @@ -88,13 +93,6 @@ For calling mode, the following inputs are required:
* call region `.bed` file
* trained model `.pth` file

Reads in input `.bam` file should be sorted, indexed and have MD tags. If you are not sure that all you reads have MD tags, you should run the following command for both tumor and normal alignments:

```
samtools calmd -@ num_threads -b alignment.bam reference.fasta > alignment.md.bam
samtools index alignment.md.bam
```

For the region `.bed` files, if you don't have any preferred target regions for training/calling, you can use the whole genome as the target region. Example bed files of major chromosomes for human hg38, hg19, and b37 references can be found at [resources](resources).

## Quick Test
Expand Down Expand Up @@ -236,20 +234,25 @@ You can then used the synthetic tumor/normal pair and the known *in silico* spik

## Trained Network Models
We provide a set of trained NeuSomatic network models for general purpose usage. Users should note that these models are trained for sepcific settings and are not supposed to work perfectly for all circumestances.

The SEQC-II pretrained models are the recommended NeuSomatic models and are analyzed in detail in [Sahraeian et al. 2019](https://doi.org/10.1101/667261).

The following models can be found at `neusomatic/models` folder:


### Latest models
Model | Mode | Training Information
---------------------------------------------------|---------------|-----------------------------------------------------------------------
`NeuSomatic_v0.1.3_standalone_Dream3.pth` | Stand-alone | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x)
`NeuSomatic_v0.1.3_ensemble_Dream3.pth` | Ensemble | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x)


`NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` | Stand-alone | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-220x)
`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` | Ensemble | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-220x)
`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` | Stand-alone | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-390x)
`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` | Ensemble | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-390x)

### Older models
Model | Mode | Training Information
---------------------------------------------------|---------------|-----------------------------------------------------------------------
`NeuSomatic_v0.1.3_standalone_Dream3.pth` | Stand-alone | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x)
`NeuSomatic_v0.1.3_ensemble_Dream3.pth` | Ensemble | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x)
`NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth` | Stand-alone | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM, ~30x)
`NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth` | Ensemble | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM, ~30x)
`NeuSomatic_v0.1.0_standalone_WEX_100purity.pth` | Stand-alone | WEX (100% tumor and normal purities, Illumina, BWA-MEM, ~125x)
Expand Down
8 changes: 4 additions & 4 deletions docker/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
FROM ubuntu:16.04

ENV NEUSOMATIC_VERSION 0.2.0
ENV NEUSOMATIC_VERSION 0.2.1
ENV ZLIB_VERSION 1.2.11
ENV NUMPY_VERSION 1.15.4
ENV SCIPY_VERSION 1.2.0
ENV IMAGEIO_VERSION 2.5.0
ENV PYTORCH_VERSION 1.0.1
ENV TORCHVISION_VERSION 0.2.1
ENV CUDATOOLKIT_VERSION 8.0
ENV PYTORCH_VERSION 1.1.0
ENV TORCHVISION_VERSION 0.3.0
ENV CUDATOOLKIT_VERSION 9.0
ENV CMAKE_VERSION 3.13.2
ENV PYBEDTOOLS_VERSION 0.8.0
ENV PYSAM_VERSION 0.15.2
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ num_lines=`cat ${SELECTOR} | wc -l`
input_bed=${SELECTOR}
if [[ $(( $total_bases / $num_lines )) -gt 50000 ]]
then
echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.0 \\" >> $out_script
echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.2 \\" >> $out_script
echo "/opt/somaticseq/utilities/split_mergedBed.py \\" >> $out_script
echo "-infile /mnt/${SELECTOR} -outfile /mnt/${outdir}/split_regions.bed" >> $out_script
echo "" >> $out_script
Expand Down
4 changes: 2 additions & 2 deletions ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh
Original file line number Diff line number Diff line change
Expand Up @@ -187,10 +187,10 @@ echo "" >> $out_script
echo 'echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2' >> $out_script
echo "" >> $out_script

echo "docker pull lethalfang/somaticseq:2.7.0" >> $out_script
echo "docker pull lethalfang/somaticseq:2.7.2" >> $out_script
echo "" >> $out_script

echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.0 \\" >> $out_script
echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.2 \\" >> $out_script
echo "/opt/somaticseq/SomaticSeq.Wrapper.sh \\" >> $out_script
echo "--output-dir /mnt/${outdir} \\" >> $out_script
echo "--genome-reference /mnt/${HUMAN_REFERENCE} \\" >> $out_script
Expand Down
87 changes: 79 additions & 8 deletions neusomatic/include/Options.h
Original file line number Diff line number Diff line change
Expand Up @@ -24,10 +24,15 @@ namespace neusomatic {
{"num_threads", required_argument, 0, 't'},
{"max_depth", required_argument, 0, 'd'},
{"include_secondary", no_argument, 0, 's'},
{"filter_duplicate", no_argument, 0, 'D'},
{"filter_QCfailed", no_argument, 0, 'Q'},
{"filter_improper_pair", no_argument, 0, 'E'},
{"filter_mate_unmapped", no_argument, 0, 'F'},
{"filter_improper_orientation", no_argument, 0, 'G'},
{0, 0, 0, 0} // terminator
};

const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cys";
const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cysDQEFG";

void print_help()
{
Expand All @@ -43,11 +48,16 @@ namespace neusomatic {
std::cerr<< "-a/--min_af, minimum allele freq, default 0.1.\n";
std::cerr<< "-f/--out_vcf_file, output vcf file path, required.\n";
std::cerr<< "-o/--out_count_file, output count file path, required.\n";
std::cerr<< "-w/--window_size, window size to scan the variants, default is 15\n";
std::cerr<< "-y/--fully_contained, if this option is on. A read has to be fully contained in the region, default is False\n";
//std::cerr<< "-t/--num_threads, number or thread used for building the count matrix, default is 4\n";
std::cerr<< "-d/--max_depth, maximum depth for building the count matrix, default is 40,000\n";
std::cerr<< "-s/--include_secondary, consider secondary alignments, default is False\n";
std::cerr<< "-w/--window_size, window size to scan the variants, default is 15.\n";
std::cerr<< "-y/--fully_contained, if this option is on. A read has to be fully contained in the region, default is False.\n";
// std::cerr<< "-t/--num_threads, number or thread used for building the count matrix, default is 4.\n";
std::cerr<< "-d/--max_depth, maximum depth for building the count matrix, default is 40,000.\n";
std::cerr<< "-s/--include_secondary, consider secondary alignments, default is False.\n";
std::cerr<< "-D/--filter_duplicate, filter duplicate reads if the flag is set, default is False.\n";
std::cerr<< "-Q/--filter_QCfailed, filter QC failed reads if the flag is set, default is False.\n";
std::cerr<< "-E/--filter_improper_pair, filter improper pairs if the flag is set, default is False.\n";
std::cerr<< "-F/--filter_mate_unmapped, filter reads with unmapeed mates if the flag is set, default is False.\n";
std::cerr<< "-G/--filter_improper_orientation, filter reads with improper orientation (not FR) or different chrom, default is False.\n";
}

int parseInt(const char* optarg, int lower, const char *errmsg, void (*print_help)()) {
Expand Down Expand Up @@ -138,6 +148,21 @@ namespace neusomatic {
case 's':
opt.include_secondary() = true;
break;
case 'D':
opt.filter_duplicate() = true;
break;
case 'Q':
opt.filter_QCfailed() = true;
break;
case 'E':
opt.filter_improper_pair() = true;
break;
case 'F':
opt.filter_mate_unmapped() = true;
break;
case 'G':
opt.filter_improper_orientation() = true;
break;
case 'a':
opt.min_allele_freq() = parseFloat(optarg, 0.0, 1.0, "-a/--min_af must be between 0 and 1", print_help);
break;
Expand Down Expand Up @@ -263,6 +288,47 @@ struct Options {
decltype(auto) include_secondary() const {
return (include_secondary_);
}

decltype(auto) filter_duplicate() const {
return (filter_duplicate_);
}

decltype(auto) filter_duplicate() {
return (filter_duplicate_);
}

decltype(auto) filter_QCfailed() const {
return (filter_QCfailed_);
}

decltype(auto) filter_QCfailed() {
return (filter_QCfailed_);
}

decltype(auto) filter_improper_pair() const {
return (filter_improper_pair_);
}

decltype(auto) filter_improper_pair() {
return (filter_improper_pair_);
}

decltype(auto) filter_mate_unmapped() const {
return (filter_mate_unmapped_);
}

decltype(auto) filter_mate_unmapped() {
return (filter_mate_unmapped_);
}

decltype(auto) filter_improper_orientation() const {
return (filter_improper_orientation_);
}

decltype(auto) filter_improper_orientation() {
return (filter_improper_orientation_);
}

private:
unsigned verbosity_ = 0;
std::string bam_in_;
Expand All @@ -276,9 +342,14 @@ struct Options {
float min_allele_freq_ = 0.01;
int min_mapq_ = 0;
int window_size_ = 500;
int num_threads_ = 4;
int max_depth_ = 40000;
int num_threads_ = 1;
int max_depth_ = 5000000;
bool include_secondary_ = false;
bool filter_duplicate_ = false;
bool filter_QCfailed_ = false;
bool filter_improper_pair_ = false;
bool filter_mate_unmapped_ = false;
bool filter_improper_orientation_ = false;
};
}//namespace neusomatic

Expand Down
36 changes: 30 additions & 6 deletions neusomatic/include/targeted_loading.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -50,16 +50,40 @@ class CaptureLayout {
std::vector<BamRecord> records;
BamRecord rec;
while (bam_reader_.GetNextRecord(rec)) {
bool good = false;
bool good = true;
if (full_contained) {
if (rec.Position() <= curr_ginv.left() && rec.PositionEnd() + 1 >= curr_ginv.right() && rec.MapQuality()>=opts_.min_mapq() && (!rec.SecondaryFlag())) {
good = true;
if (rec.Position() > curr_ginv.left() || rec.PositionEnd() + 1 < curr_ginv.right()) {
good = false;
}
} else {
if (rec.Position() < curr_ginv.right() && curr_ginv.left() <= rec.PositionEnd() && rec.MapQuality()>=opts_.min_mapq() && (opts_.include_secondary() || !rec.SecondaryFlag())) { // overlapped
good = true;
}
else {
if (rec.Position() >= curr_ginv.right() || curr_ginv.left() > rec.PositionEnd()) { // overlapped
good = false;
}
}

if (rec.MapQuality() < opts_.min_mapq()) { // mapq filter
good = false;
}
if (!opts_.include_secondary() && rec.SecondaryFlag()) { // secondary flag
good = false;
}
if (opts_.filter_duplicate() && rec.DuplicateFlag()) { // duplicate flag
good = false;
}
if (opts_.filter_QCfailed() && rec.QCFailFlag()) { // QC failed flag
good = false;
}
if (opts_.filter_improper_pair() && !rec.ProperPair()) { // Proper pair flag. Set by aligner
good = false;
}
if (opts_.filter_mate_unmapped() && !rec.MateMappedFlag()) { // Mate is ummapped
good = false;
}
if (opts_.filter_improper_orientation() && !rec.ProperOrientation()) { // pair read has proper orientation (FR) and on same chrom.
good = false;
}

if (good) {
if(rec.GetCigar().size() == 0) {
std::cerr << "warning: " << rec.Qname() << " has no cigar\n";
Expand Down
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
2 changes: 1 addition & 1 deletion neusomatic/python/_version.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.2.0"
__version__ = "0.2.1"
Loading

0 comments on commit 95c218f

Please sign in to comment.