diff --git a/README.md b/README.md index 3c98fb4..dc0f7fe 100644 --- a/README.md +++ b/README.md @@ -5,12 +5,17 @@ NeuSomatic is based on deep convolutional neural networks for accurate somatic m For more information contact us at bioinformatics.red@roche.com ## Publication -If you use NeuSomatic in your work, please cite the following preprint: +If you use NeuSomatic in your work, please cite the following papers: Sayed Mohammad Ebrahim Sahraeian, Ruolin Liu, Bayo Lau, Karl Podesta, Marghoob Mohiyuddin, Hugo Y. K. Lam,
[Deep convolutional neural networks for accurate somatic mutation detection. Nature Communications 10: 1041, (2019).
doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-019-09027-x) +Sayed Mohammad Ebrahim Sahraeian, Li Tai Fang, Marghoob Mohiyuddin, Huixiao Hong, Wenming Xiao,
+[Robust cancer mutation detection with deep learning models derived from tumor-normal sequencing data. bioRxiv (2019): 667261.
+doi: https://doi.org/10.1101/667261](https://doi.org/10.1101/667261) + + ## Example Input Matrix ![Example input](resources/toy_example.png) @@ -29,14 +34,14 @@ doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467- ## Availability -NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.0.1 to enable GPU acceleration for training/testing. +NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been successfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.1.0 to enable GPU acceleration for training/testing. NeuSomatic first scans the genome to identify candidate variants and extract alignment information. The binary for this step can be obtained at `neusomatic/bin` folder by running `./build.sh` (which requires cmake 3.13.2 and g++ 5.4.0). Python 3.7 and the following Python packages must be installed: -* pytorch 1.0.1 -* torchvision 0.2.1 +* pytorch 1.1.0 +* torchvision 0.3.0 * pybedtools 0.8.0 * pysam 0.15.2 * zlib 1.2.11 @@ -46,7 +51,7 @@ Python 3.7 and the following Python packages must be installed: * biopython 1.73 It also depends on the following packages: -* cudatoolkit 8.0 (if you want to use GPU) +* cudatoolkit 9.0 (if you want to use GPU) * tabix 0.2.6 * bedtools 2.27.1 * samtools 1.9 @@ -55,7 +60,7 @@ You can install these packages using [anaconda](https://www.anaconda.com/downloa ``` conda install zlib=1.2.11 numpy=1.15.4 scipy=1.2.0 cmake=3.13.2 imageio=2.5.0 conda install pysam=0.15.2 pybedtools=0.8.0 samtools=1.9 tabix=0.2.6 bedtools=2.27.1 biopython=1.73 -c bioconda -conda install pytorch=1.0.1 torchvision=0.2.1 cudatoolkit=8.0 -c pytorch +conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=9.0 -c pytorch ``` Then you can export the conda paths as: ``` @@ -88,13 +93,6 @@ For calling mode, the following inputs are required: * call region `.bed` file * trained model `.pth` file -Reads in input `.bam` file should be sorted, indexed and have MD tags. If you are not sure that all you reads have MD tags, you should run the following command for both tumor and normal alignments: - -``` -samtools calmd -@ num_threads -b alignment.bam reference.fasta > alignment.md.bam -samtools index alignment.md.bam -``` - For the region `.bed` files, if you don't have any preferred target regions for training/calling, you can use the whole genome as the target region. Example bed files of major chromosomes for human hg38, hg19, and b37 references can be found at [resources](resources). ## Quick Test @@ -236,20 +234,25 @@ You can then used the synthetic tumor/normal pair and the known *in silico* spik ## Trained Network Models We provide a set of trained NeuSomatic network models for general purpose usage. Users should note that these models are trained for sepcific settings and are not supposed to work perfectly for all circumestances. + +The SEQC-II pretrained models are the recommended NeuSomatic models and are analyzed in detail in [Sahraeian et al. 2019](https://doi.org/10.1101/667261). 
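+
+Each `.pth` model file is a standard PyTorch checkpoint dictionary. As a quick sanity check for a downloaded model (a minimal sketch based on the metadata that `train.py` saves in this release; the path below is only an example), you can inspect a checkpoint before calling:
+
+```python
+import torch
+
+# Any model under neusomatic/models can be inspected the same way.
+ckpt = torch.load(
+    "neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth",
+    map_location="cpu")
+
+# Metadata saved alongside the network weights ("state_dict") by train.py.
+print(ckpt["tag"], ckpt["epoch"], ckpt["coverage_thr"])
+# "normalize_channels" is new in this release; call.py treats checkpoints
+# that lack the key as normalize_channels=False.
+print(ckpt.get("normalize_channels", False))
+```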
+ The following models can be found at `neusomatic/models` folder: ### Latest models Model | Mode | Training Information ---------------------------------------------------|---------------|----------------------------------------------------------------------- -`NeuSomatic_v0.1.3_standalone_Dream3.pth` | Stand-alone | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x) -`NeuSomatic_v0.1.3_ensemble_Dream3.pth` | Ensemble | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x) - - +`NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` | Stand-alone | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-220x) +`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` | Ensemble | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-220x) +`NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth` | Stand-alone | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-390x) +`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` | Ensemble | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` model. 
Illumina HiSeq and NovaSeq, BWA-MEM, ~40x-390x) ### Older models Model | Mode | Training Information ---------------------------------------------------|---------------|----------------------------------------------------------------------- +`NeuSomatic_v0.1.3_standalone_Dream3.pth` | Stand-alone | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x) +`NeuSomatic_v0.1.3_ensemble_Dream3.pth` | Ensemble | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM, ~30x) `NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth` | Stand-alone | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM, ~30x) `NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth` | Ensemble | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM, ~30x) `NeuSomatic_v0.1.0_standalone_WEX_100purity.pth` | Stand-alone | WEX (100% tumor and normal purities, Illumina, BWA-MEM, ~125x) diff --git a/docker/Dockerfile b/docker/Dockerfile index ee811b0..847941b 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -1,13 +1,13 @@ FROM ubuntu:16.04 -ENV NEUSOMATIC_VERSION 0.2.0 +ENV NEUSOMATIC_VERSION 0.2.1 ENV ZLIB_VERSION 1.2.11 ENV NUMPY_VERSION 1.15.4 ENV SCIPY_VERSION 1.2.0 ENV IMAGEIO_VERSION 2.5.0 -ENV PYTORCH_VERSION 1.0.1 -ENV TORCHVISION_VERSION 0.2.1 -ENV CUDATOOLKIT_VERSION 8.0 +ENV PYTORCH_VERSION 1.1.0 +ENV TORCHVISION_VERSION 0.3.0 +ENV CUDATOOLKIT_VERSION 9.0 ENV CMAKE_VERSION 3.13.2 ENV PYBEDTOOLS_VERSION 0.8.0 ENV PYSAM_VERSION 0.15.2 diff --git a/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh b/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh index 176eef2..975f27e 100755 --- a/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh +++ b/ensemble_docker_pipelines/mutation_callers/submit_VarDictJava.sh @@ -109,7 +109,7 @@ num_lines=`cat ${SELECTOR} | wc -l` input_bed=${SELECTOR} if [[ $(( $total_bases / $num_lines )) -gt 50000 ]] then - echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.0 \\" >> $out_script + echo "docker run --rm -v /:/mnt -u $UID --memory ${MEM}G lethalfang/somaticseq:2.7.2 \\" >> $out_script echo "/opt/somaticseq/utilities/split_mergedBed.py \\" >> $out_script echo "-infile /mnt/${SELECTOR} -outfile /mnt/${outdir}/split_regions.bed" >> $out_script echo "" >> $out_script diff --git a/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh b/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh index ab3d41f..0a83b4f 100755 --- a/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh +++ b/ensemble_docker_pipelines/mutation_callers/submit_Wrapper.sh @@ -187,10 +187,10 @@ echo "" >> $out_script echo 'echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2' >> $out_script echo "" >> $out_script -echo "docker pull lethalfang/somaticseq:2.7.0" >> $out_script +echo "docker pull lethalfang/somaticseq:2.7.2" >> $out_script echo "" >> $out_script -echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.0 \\" >> $out_script +echo "docker run --rm -v /:/mnt -u $UID --memory 24g lethalfang/somaticseq:2.7.2 \\" >> $out_script echo "/opt/somaticseq/SomaticSeq.Wrapper.sh \\" >> $out_script echo "--output-dir /mnt/${outdir} \\" >> $out_script echo "--genome-reference /mnt/${HUMAN_REFERENCE} \\" >> $out_script diff --git a/neusomatic/include/Options.h b/neusomatic/include/Options.h index 
92470ce..2260673 100644 --- a/neusomatic/include/Options.h +++ b/neusomatic/include/Options.h @@ -24,10 +24,15 @@ namespace neusomatic { {"num_threads", required_argument, 0, 't'}, {"max_depth", required_argument, 0, 'd'}, {"include_secondary", no_argument, 0, 's'}, + {"filter_duplicate", no_argument, 0, 'D'}, + {"filter_QCfailed", no_argument, 0, 'Q'}, + {"filter_improper_pair", no_argument, 0, 'E'}, + {"filter_mate_unmapped", no_argument, 0, 'F'}, + {"filter_improper_orientation", no_argument, 0, 'G'}, {0, 0, 0, 0} // terminator }; - const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cys"; + const char *short_options = "v:b:L:r:q:f:o:w:a:t:d:cysDQEFG"; void print_help() { @@ -43,11 +48,16 @@ namespace neusomatic { std::cerr<< "-a/--min_af, minimum allele freq, default 0.1.\n"; std::cerr<< "-f/--out_vcf_file, output vcf file path, required.\n"; std::cerr<< "-o/--out_count_file, output count file path, required.\n"; - std::cerr<< "-w/--window_size, window size to scan the variants, default is 15\n"; - std::cerr<< "-y/--fully_contained, if this option is on. A read has to be fully contained in the region, default is False\n"; - //std::cerr<< "-t/--num_threads, number or thread used for building the count matrix, default is 4\n"; - std::cerr<< "-d/--max_depth, maximum depth for building the count matrix, default is 40,000\n"; - std::cerr<< "-s/--include_secondary, consider secondary alignments, default is False\n"; + std::cerr<< "-w/--window_size, window size to scan the variants, default is 15.\n"; + std::cerr<< "-y/--fully_contained, if this option is on, a read has to be fully contained in the region, default is False.\n"; + // std::cerr<< "-t/--num_threads, number of threads used for building the count matrix, default is 4.\n"; + std::cerr<< "-d/--max_depth, maximum depth for building the count matrix, default is 40,000.\n"; + std::cerr<< "-s/--include_secondary, consider secondary alignments, default is False.\n"; + std::cerr<< "-D/--filter_duplicate, filter duplicate reads if the flag is set, default is False.\n"; + std::cerr<< "-Q/--filter_QCfailed, filter QC failed reads if the flag is set, default is False.\n"; + std::cerr<< "-E/--filter_improper_pair, filter improper pairs if the flag is set, default is False.\n"; + std::cerr<< "-F/--filter_mate_unmapped, filter reads with unmapped mates if the flag is set, default is False.\n"; + std::cerr<< "-G/--filter_improper_orientation, filter reads with improper orientation (not FR) or with mate on a different chromosome, default is False.\n"; } int parseInt(const char* optarg, int lower, const char *errmsg, void (*print_help)()) { @@ -138,6 +148,21 @@ namespace neusomatic { case 's': opt.include_secondary() = true; break; + case 'D': + opt.filter_duplicate() = true; + break; + case 'Q': + opt.filter_QCfailed() = true; + break; + case 'E': + opt.filter_improper_pair() = true; + break; + case 'F': + opt.filter_mate_unmapped() = true; + break; + case 'G': + opt.filter_improper_orientation() = true; + break; case 'a': opt.min_allele_freq() = parseFloat(optarg, 0.0, 1.0, "-a/--min_af must be between 0 and 1", print_help); break; @@ -263,6 +288,47 @@ struct Options { decltype(auto) include_secondary() const { return (include_secondary_); } + + decltype(auto) filter_duplicate() const { + return (filter_duplicate_); + } + + decltype(auto) filter_duplicate() { + return (filter_duplicate_); + } + + decltype(auto) filter_QCfailed() const { + return (filter_QCfailed_); + } + + decltype(auto) filter_QCfailed() { + return (filter_QCfailed_); + } + + decltype(auto) 
filter_improper_pair() const { + return (filter_improper_pair_); + } + + decltype(auto) filter_improper_pair() { + return (filter_improper_pair_); + } + + decltype(auto) filter_mate_unmapped() const { + return (filter_mate_unmapped_); + } + + decltype(auto) filter_mate_unmapped() { + return (filter_mate_unmapped_); + } + + decltype(auto) filter_improper_orientation() const { + return (filter_improper_orientation_); + } + + decltype(auto) filter_improper_orientation() { + return (filter_improper_orientation_); + } + private: unsigned verbosity_ = 0; std::string bam_in_; @@ -276,9 +342,14 @@ struct Options { float min_allele_freq_ = 0.01; int min_mapq_ = 0; int window_size_ = 500; - int num_threads_ = 4; - int max_depth_ = 40000; + int num_threads_ = 1; + int max_depth_ = 5000000; bool include_secondary_ = false; + bool filter_duplicate_ = false; + bool filter_QCfailed_ = false; + bool filter_improper_pair_ = false; + bool filter_mate_unmapped_ = false; + bool filter_improper_orientation_ = false; }; }//namespace neusomatic diff --git a/neusomatic/include/targeted_loading.hpp b/neusomatic/include/targeted_loading.hpp index da166d0..ae83420 100644 --- a/neusomatic/include/targeted_loading.hpp +++ b/neusomatic/include/targeted_loading.hpp @@ -50,16 +50,40 @@ class CaptureLayout { std::vector<BamRecord> records; BamRecord rec; while (bam_reader_.GetNextRecord(rec)) { - bool good = false; + bool good = true; if (full_contained) { - if (rec.Position() <= curr_ginv.left() && rec.PositionEnd() + 1 >= curr_ginv.right() && rec.MapQuality()>=opts_.min_mapq() && (!rec.SecondaryFlag())) { - good = true; + if (rec.Position() > curr_ginv.left() || rec.PositionEnd() + 1 < curr_ginv.right()) { + good = false; } - } else { - if (rec.Position() < curr_ginv.right() && curr_ginv.left() <= rec.PositionEnd() && rec.MapQuality()>=opts_.min_mapq() && (opts_.include_secondary() || !rec.SecondaryFlag())) { // overlapped - good = true; + } + else { + if (rec.Position() >= curr_ginv.right() || curr_ginv.left() > rec.PositionEnd()) { // overlapped + good = false; } } + + if (rec.MapQuality() < opts_.min_mapq()) { // mapq filter + good = false; + } + if (!opts_.include_secondary() && rec.SecondaryFlag()) { // secondary flag + good = false; + } + if (opts_.filter_duplicate() && rec.DuplicateFlag()) { // duplicate flag + good = false; + } + if (opts_.filter_QCfailed() && rec.QCFailFlag()) { // QC failed flag + good = false; + } + if (opts_.filter_improper_pair() && !rec.ProperPair()) { // Proper pair flag. Set by aligner + good = false; + } + if (opts_.filter_mate_unmapped() && !rec.MateMappedFlag()) { // Mate is unmapped + good = false; + } + if (opts_.filter_improper_orientation() && !rec.ProperOrientation()) { // read pair has proper orientation (FR) and is on the same chrom. 
+ good = false; + } + if (good) { if(rec.GetCigar().size() == 0) { std::cerr << "warning: " << rec.Qname() << " has no cigar\n"; diff --git a/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth new file mode 100644 index 0000000..3adef0a Binary files /dev/null and b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth differ diff --git a/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth new file mode 100644 index 0000000..42c7ce1 Binary files /dev/null and b/neusomatic/models/NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth differ diff --git a/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth new file mode 100644 index 0000000..eeaae44 Binary files /dev/null and b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth differ diff --git a/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth new file mode 100644 index 0000000..69d6ef7 Binary files /dev/null and b/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth differ diff --git a/neusomatic/python/_version.py b/neusomatic/python/_version.py index d3ec452..3ced358 100755 --- a/neusomatic/python/_version.py +++ b/neusomatic/python/_version.py @@ -1 +1 @@ -__version__ = "0.2.0" +__version__ = "0.2.1" diff --git a/neusomatic/python/call.py b/neusomatic/python/call.py index 7fae07a..d2fc754 100755 --- a/neusomatic/python/call.py +++ b/neusomatic/python/call.py @@ -23,7 +23,7 @@ import torchvision from network import NeuSomaticNet -from dataloader import NeuSomaticDataset +from dataloader import NeuSomaticDataset, matrix_transform from utils import get_chromosomes_order, prob2phred from merge_tsvs import merge_tsvs @@ -32,7 +32,8 @@ torch._utils._rebuild_tensor_v2 except AttributeError: def _rebuild_tensor_v2(storage, storage_offset, size, stride, requires_grad, backward_hooks): - tensor = torch._utils._rebuild_tensor(storage, storage_offset, size, stride) + tensor = torch._utils._rebuild_tensor( + storage, storage_offset, size, stride) tensor.requires_grad = requires_grad tensor._backward_hooks = backward_hooks return tensor @@ -89,7 +90,8 @@ def call_variants(net, vartype_classes, call_loader, out_dir, model_tag, use_cud file_name = "{}/matrices_{}/{}.png".format( out_dir, model_tag, path) if not os.path.exists(file_name): - imwrite(file_name, np.array(non_transformed_matrices[i, :, :, 0:3])) + imwrite(file_name, np.array( + non_transformed_matrices[i, :, :, 0:3])) true_path[path] = file_name final_preds[path] = [vartype_classes[predicted[i]], pos_pred[i], len_pred[i], list(map(lambda x: round(x, 4), F.softmax( @@ -97,9 +99,9 @@ def call_variants(net, vartype_classes, call_loader, out_dir, model_tag, use_cud list(map(lambda x: round(x, 4), F.softmax( outputs3[i, :], 0).data.cpu().numpy())), list(map(lambda x: round(x, 4), - outputs1.data.cpu()[i].numpy())), + outputs1.data.cpu()[i].numpy())), list(map(lambda x: round(x, 4), - outputs3.data.cpu()[i].numpy()))] + outputs3.data.cpu()[i].numpy()))] else: none_preds[path] = [vartype_classes[predicted[i]], pos_pred[i], len_pred[i], list(map(lambda x: round(x, 4), F.softmax( @@ -107,9 +109,9 @@ def call_variants(net, vartype_classes, call_loader, out_dir, model_tag, use_cud list(map(lambda 
x: round(x, 4), F.softmax( outputs3[i, :], 0).data.cpu().numpy())), list(map(lambda x: round(x, 4), - outputs1.data.cpu()[i].numpy())), + outputs1.data.cpu()[i].numpy())), list(map(lambda x: round(x, 4), - outputs3.data.cpu()[i].numpy()))] + outputs3.data.cpu()[i].numpy()))] if (iii % 10 == 0): logger.info("Called {} candidates in this batch.".format(j)) logger.info("Called {} candidates in this batch.".format(j)) @@ -384,6 +386,7 @@ def write_vcf(vcf_records, output_vcf, chroms_order, pass_threshold, lowqual_thr ov.write(line) lines.append(line) + def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, batch_size, max_load_candidates, pass_threshold, lowqual_threshold, ensemble, @@ -402,8 +405,7 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, chroms = rf.references vartype_classes = ['DEL', 'INS', 'NONE', 'SNP'] - data_transform = transforms.Compose( - [transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) + data_transform = matrix_transform((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) num_channels = 119 if ensemble else 26 net = NeuSomaticNet(num_channels) if use_cuda: @@ -412,7 +414,6 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, else: logger.info("CPU calling!") - if torch.cuda.device_count() > 1: logger.info("We use {} GPUs!".format(torch.cuda.device_count())) net = nn.DataParallel(net) @@ -432,6 +433,10 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, shutil.rmtree(matrices_dir) os.mkdir(matrices_dir) coverage_thr = pretrained_dict["coverage_thr"] + if "normalize_channels" in pretrained_dict: + normalize_channels = pretrained_dict["normalize_channels"] + else: + normalize_channels = False model_dict = net.state_dict() @@ -454,31 +459,37 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, # 3. 
load the new state dict net.load_state_dict(pretrained_state_dict) - new_split_tsvs_dir = os.path.join(out_dir,"split_tsvs") + new_split_tsvs_dir = os.path.join(out_dir, "split_tsvs") if os.path.exists(new_split_tsvs_dir): - logger.warning("Remove split candidates directory: {}".format(new_split_tsvs_dir)) + logger.warning( + "Remove split candidates directory: {}".format(new_split_tsvs_dir)) shutil.rmtree(new_split_tsvs_dir) os.mkdir(new_split_tsvs_dir) Ls = [] - candidates_tsv_=[] + candidates_tsv_ = [] split_i = 0 for candidate_file in candidates_tsv: idx = pickle.load(open(candidate_file + ".idx", "rb")) if len(idx) > max_load_candidates / 2: - logger.info("Splitting {} of lenght {}".format(candidate_file, len(idx))) - new_split_tsvs_dir_i=os.path.join(new_split_tsvs_dir,"split_{}".format(split_i)) + logger.info("Splitting {} of length {}".format( + candidate_file, len(idx))) + new_split_tsvs_dir_i = os.path.join( + new_split_tsvs_dir, "split_{}".format(split_i)) if os.path.exists(new_split_tsvs_dir_i): - logger.warning("Remove split candidates directory: {}".format(new_split_tsvs_dir_i)) + logger.warning("Remove split candidates directory: {}".format( + new_split_tsvs_dir_i)) shutil.rmtree(new_split_tsvs_dir_i) os.mkdir(new_split_tsvs_dir_i) - candidate_file_splits = merge_tsvs(input_tsvs=[candidate_file], + candidate_file_splits = merge_tsvs(input_tsvs=[candidate_file], out=new_split_tsvs_dir_i, - candidates_per_tsv=max(1, max_load_candidates/2), + candidates_per_tsv=max( + 1, max_load_candidates / 2), max_num_tsvs=100000, overwrite_merged_tsvs=True, keep_none_types=True) for candidate_file_split in candidate_file_splits: - idx_split = pickle.load(open(candidate_file_split + ".idx", "rb")) + idx_split = pickle.load( + open(candidate_file_split + ".idx", "rb")) candidates_tsv_.append(candidate_file_split) Ls.append(len(idx_split) - 1) split_i += 1 @@ -499,7 +510,8 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, max_load_candidates=max_load_candidates, transform=data_transform, is_test=True, num_threads=num_threads, - coverage_thr=coverage_thr) + coverage_thr=coverage_thr, + normalize_channels=normalize_channels) call_loader = torch.utils.data.DataLoader(call_set, batch_size=batch_size, shuffle=True, pin_memory=True, @@ -524,9 +536,9 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, all_vcf_records = dict(all_vcf_records) all_vcf_records_none = dict(all_vcf_records_none) - if os.path.exists(new_split_tsvs_dir): - logger.warning("Remove split candidates directory: {}".format(new_split_tsvs_dir)) + logger.warning( + "Remove split candidates directory: {}".format(new_split_tsvs_dir)) shutil.rmtree(new_split_tsvs_dir) logger.info("Prepare Output VCF") @@ -544,6 +556,9 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads, if os.path.exists(matrices_dir): logger.warning("Remove matrices directory: {}".format(matrices_dir)) shutil.rmtree(matrices_dir) + + logger.info("Calling is Done.") + return output_vcf if __name__ == '__main__': diff --git a/neusomatic/python/dataloader.py b/neusomatic/python/dataloader.py index 06ece78..92a5f46 100755 --- a/neusomatic/python/dataloader.py +++ b/neusomatic/python/dataloader.py @@ -22,6 +22,19 @@ type_class_dict = {"DEL": 0, "INS": 1, "NONE": 2, "SNP": 3} +class matrix_transform(): + + def __init__(self, mean, std): + self.mean = mean + self.std = std + + def __call__(self, matrix): + matrix_ = matrix.clone() + for t, m, s in zip(matrix_, self.mean, self.std): 
+ t.sub_(m).div_(s) + return matrix_ + + def extract_zlib(zlib_compressed_im): return np.fromstring(zlib.decompress(zlib_compressed_im), dtype="uint8").reshape((5, 32, 23)) @@ -116,6 +129,7 @@ def __init__(self, roots, max_load_candidates, transform=None, loader=candidate_loader_tsv, is_test=False, num_threads=1, disable_ensemble=False, data_augmentation=False, nclasses_t=4, nclasses_l=4, coverage_thr=100, + normalize_channels=False, max_opended_tsv=-1): soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE) @@ -125,6 +139,7 @@ def __init__(self, roots, max_load_candidates, transform=None, else: max_opended_tsv = min(max_opended_tsv, soft) self.max_opended_tsv = max_opended_tsv + self.normalize_channels = normalize_channels self.da_shift_p = 0.3 self.da_base_p = 0.05 self.da_rev_p = 0.1 @@ -368,6 +383,9 @@ def __getitem__(self, index): # add COV channel matrix_ = np.zeros((matrix.shape[0], matrix.shape[1], 26 + len(anns))) matrix_[:, :, 0:23] = matrix + if self.normalize_channels: + matrix_[:, :, 3:23:2] *= (matrix_[:, :, 1:2] / 255.0) + matrix_[:, :, 4:23:2] *= (matrix_[:, :, 2:3] / 255.0) matrix = matrix_ matrix[:, center, 23] = np.max(matrix[:, :, 0]) matrix[:, :, 24] = (min(tumor_cov, self.coverage_thr) / @@ -386,11 +404,11 @@ def __getitem__(self, index): non_transformed_matrix = np.array(orig_matrix).astype(np.uint8) else: non_transformed_matrix = [] - + matrix = torch.from_numpy(matrix.transpose((2, 0, 1))) matrix = matrix.float().div(255) if self.transform is not None: - matrix = self.transform(matrix) + matrix = self.transform(matrix) var_pos = [length, center] var_pos = torch.Tensor(var_pos) diff --git a/neusomatic/python/filter_candidates.py b/neusomatic/python/filter_candidates.py index 9c2c310..df6b126 100755 --- a/neusomatic/python/filter_candidates.py +++ b/neusomatic/python/filter_candidates.py @@ -17,8 +17,8 @@ def filter_candidates(candidate_record): candidates_vcf, filtered_candidates_vcf, reference, dbsnp, min_dp, max_dp, good_ao, \ - min_ao, snp_min_af, snp_min_bq, snp_min_ao, ins_min_af, del_min_af, \ - del_merge_min_af, ins_merge_min_af, merge_r = candidate_record + min_ao, snp_min_af, snp_min_bq, snp_min_ao, ins_min_af, del_min_af, \ + del_merge_min_af, ins_merge_min_af, merge_r = candidate_record thread_logger = logging.getLogger( "{} ({})".format(filter_candidates.__name__, multiprocessing.current_process().name)) try: @@ -80,7 +80,7 @@ def filter_candidates(candidate_record): afs = list(map(lambda x: x[6] / float(x[5] + x[6]), ins)) max_af = max(afs) ins = list(filter(lambda x: x[6] / float(x[5] + - x[6]) >= (max_af * merge_r), ins)) + x[6]) >= (max_af * merge_r), ins)) chrom, pos, ref = ins[0][0:3] dp = max(map(lambda x: x[4], ins)) ro = max(map(lambda x: x[5], ins)) @@ -174,7 +174,7 @@ def filter_candidates(candidate_record): good_records.extend(dels) else: afs = list(map(lambda x: x[6] / - float(x[5] + x[6]), dels)) + float(x[5] + x[6]), dels)) max_af = max(afs) merge_r_thr = merge_r * max_af dels = list(filter( @@ -331,7 +331,7 @@ def filter_candidates(candidate_record): default=0.5) parser.add_argument('--min_dp', type=float, help='min depth', default=5) parser.add_argument('--max_dp', type=float, - help='max depth', default=40000) + help='max depth', default=100000) args = parser.parse_args() logger.info(args) diff --git a/neusomatic/python/generate_dataset.py b/neusomatic/python/generate_dataset.py index 95cc6e7..4862a66 100755 --- a/neusomatic/python/generate_dataset.py +++ b/neusomatic/python/generate_dataset.py @@ -190,9 +190,10 @@ def 
align_tumor_normal_matrices(record, tumor_matrix_, bq_tumor_matrix_, mq_tumo raise(RuntimeError( "tumor_col_pos_map and normal_col_pos_map have different keys.")) - pT = list(map(lambda x: tumor_col_pos_map[x], sorted(tumor_col_pos_map.keys()))) + pT = list(map(lambda x: tumor_col_pos_map[ + x], sorted(tumor_col_pos_map.keys()))) pN = list(map(lambda x: normal_col_pos_map[ - x], sorted(normal_col_pos_map.keys()))) + x], sorted(normal_col_pos_map.keys()))) if pT[0] != pN[0]: logger.error("record: {}".format(record)) @@ -572,7 +573,7 @@ def prepare_info_matrices_tabix(ref_file, tumor_count_bed, normal_count_bed, rec def prep_data_single_tabix(input_record): ref_file, tumor_count_bed, normal_count_bed, record, vartype, rlen, rcenter, ch_order, \ - matrix_base_pad, matrix_width, min_ev_frac_per_col, min_cov, ann, chrom_lengths = input_record + matrix_base_pad, matrix_width, min_ev_frac_per_col, min_cov, ann, chrom_lengths = input_record thread_logger = logging.getLogger( "{} ({})".format(prep_data_single_tabix.__name__, multiprocessing.current_process().name)) @@ -652,7 +653,8 @@ def prep_data_single_tabix(input_record): candidate_mat, 255)).astype(np.uint8) tag = "{}.{}.{}.{}.{}.{}.{}.{}.{}".format(ch_order, pos, ref[0:55], alt[ 0:55], vartype, center, rlen, tumor_cov, normal_cov) - candidate_mat = base64.b64encode(zlib.compress(candidate_mat)).decode('ascii') + candidate_mat = base64.b64encode( + zlib.compress(candidate_mat)).decode('ascii') return tag, candidate_mat, record[0:4], ann except Exception as ex: thread_logger.error(traceback.format_exc()) @@ -855,8 +857,8 @@ def find_records(input_record): pybedtools.BedTool(ensemble_bed).intersect( split_bed, u=True).saveas(split_ensemble_bed_file) pybedtools.BedTool(split_ensemble_bed_file).window(split_pred_vcf_file, w=5, v=True).each( - lambda x: pybedtools.Interval(x[0], int(x[1]), int(x[1]) + 1, x[3], x[4], - ".", otherfields=[".".encode('utf-8'), ".".encode('utf-8'), ".".encode('utf-8'), ".".encode('utf-8')])).saveas(split_missed_ensemble_bed_file) + lambda x: pybedtools.Interval(x[0], int(x[1]), int(x[1]) + 1, x[3], x[4], + ".", otherfields=[".".encode('utf-8'), ".".encode('utf-8'), ".".encode('utf-8'), ".".encode('utf-8')])).saveas(split_missed_ensemble_bed_file) concatenate_vcfs( [split_pred_vcf_file, split_missed_ensemble_bed_file], split_pred_with_missed_file) pred_with_missed = pybedtools.BedTool(split_pred_with_missed_file).each( @@ -1051,7 +1053,8 @@ def find_records(input_record): mt = merge_records(fasta_file, t_) if mt: mt2, eqs2 = push_lr(fasta_file, mt, 2) - eqs2 = list(map(lambda x: "_".join(map(str, x[0:4])), eqs2)) + eqs2 = list( + map(lambda x: "_".join(map(str, x[0:4])), eqs2)) for idx_jj, jj in enumerate(js): for n_merge_j in range(0, len(js) - idx_jj): r_j = js[idx_jj:idx_jj + n_merge_j + 1] @@ -1257,36 +1260,37 @@ def extract_ensemble(work, ensemble_tsv): header_pos = [] order_header = [] COV = 50 - expected_features = ["if_MuTect", "if_VarScan2", "if_JointSNVMix2", - "if_SomaticSniper", "if_VarDict", "MuSE_Tier", "if_LoFreq", "if_Scalpel", "if_Strelka", - "if_TNscope", "Strelka_Score", "Strelka_QSS", "Strelka_TQSS", "VarScan2_Score", "SNVMix2_Score", - "Sniper_Score", "VarDict_Score", "if_dbsnp", "COMMON", "if_COSMIC", "COSMIC_CNT", - "Consistent_Mates", "Inconsistent_Mates", "N_DP", "nBAM_REF_MQ", "nBAM_ALT_MQ", - "nBAM_Z_Ranksums_MQ", "nBAM_REF_BQ", "nBAM_ALT_BQ", "nBAM_Z_Ranksums_BQ", "nBAM_REF_NM", - "nBAM_ALT_NM", "nBAM_NM_Diff", "nBAM_REF_Concordant", "nBAM_REF_Discordant", - "nBAM_ALT_Concordant", 
"nBAM_ALT_Discordant", "nBAM_Concordance_FET", "N_REF_FOR", "N_REF_REV", - "N_ALT_FOR", "N_ALT_REV", "nBAM_StrandBias_FET", "nBAM_Z_Ranksums_EndPos", - "nBAM_REF_Clipped_Reads", "nBAM_ALT_Clipped_Reads", "nBAM_Clipping_FET", "nBAM_MQ0", - "nBAM_Other_Reads", "nBAM_Poor_Reads", "nBAM_REF_InDel_3bp", "nBAM_REF_InDel_2bp", - "nBAM_REF_InDel_1bp", "nBAM_ALT_InDel_3bp", "nBAM_ALT_InDel_2bp", "nBAM_ALT_InDel_1bp", - "M2_NLOD", "M2_TLOD", "M2_STR", "M2_ECNT", "SOR", "MSI", "MSILEN", "SHIFT3", - "MaxHomopolymer_Length", "SiteHomopolymer_Length", "T_DP", "tBAM_REF_MQ", "tBAM_ALT_MQ", - "tBAM_Z_Ranksums_MQ", "tBAM_REF_BQ", "tBAM_ALT_BQ", "tBAM_Z_Ranksums_BQ", "tBAM_REF_NM", - "tBAM_ALT_NM", "tBAM_NM_Diff", "tBAM_REF_Concordant", "tBAM_REF_Discordant", - "tBAM_ALT_Concordant", "tBAM_ALT_Discordant", "tBAM_Concordance_FET", "T_REF_FOR", - "T_REF_REV", "T_ALT_FOR", "T_ALT_REV", "tBAM_StrandBias_FET", "tBAM_Z_Ranksums_EndPos", - "tBAM_REF_Clipped_Reads", "tBAM_ALT_Clipped_Reads", "tBAM_Clipping_FET", "tBAM_MQ0", - "tBAM_Other_Reads", "tBAM_Poor_Reads", "tBAM_REF_InDel_3bp", "tBAM_REF_InDel_2bp", - "tBAM_REF_InDel_1bp", "tBAM_ALT_InDel_3bp", "tBAM_ALT_InDel_2bp", "tBAM_ALT_InDel_1bp", - "InDel_Length"] + expected_features = ["if_MuTect", "if_VarScan2", "if_JointSNVMix2", + "if_SomaticSniper", "if_VarDict", "MuSE_Tier", "if_LoFreq", "if_Scalpel", "if_Strelka", + "if_TNscope", "Strelka_Score", "Strelka_QSS", "Strelka_TQSS", "VarScan2_Score", "SNVMix2_Score", + "Sniper_Score", "VarDict_Score", "if_dbsnp", "COMMON", "if_COSMIC", "COSMIC_CNT", + "Consistent_Mates", "Inconsistent_Mates", "N_DP", "nBAM_REF_MQ", "nBAM_ALT_MQ", + "nBAM_Z_Ranksums_MQ", "nBAM_REF_BQ", "nBAM_ALT_BQ", "nBAM_Z_Ranksums_BQ", "nBAM_REF_NM", + "nBAM_ALT_NM", "nBAM_NM_Diff", "nBAM_REF_Concordant", "nBAM_REF_Discordant", + "nBAM_ALT_Concordant", "nBAM_ALT_Discordant", "nBAM_Concordance_FET", "N_REF_FOR", "N_REF_REV", + "N_ALT_FOR", "N_ALT_REV", "nBAM_StrandBias_FET", "nBAM_Z_Ranksums_EndPos", + "nBAM_REF_Clipped_Reads", "nBAM_ALT_Clipped_Reads", "nBAM_Clipping_FET", "nBAM_MQ0", + "nBAM_Other_Reads", "nBAM_Poor_Reads", "nBAM_REF_InDel_3bp", "nBAM_REF_InDel_2bp", + "nBAM_REF_InDel_1bp", "nBAM_ALT_InDel_3bp", "nBAM_ALT_InDel_2bp", "nBAM_ALT_InDel_1bp", + "M2_NLOD", "M2_TLOD", "M2_STR", "M2_ECNT", "SOR", "MSI", "MSILEN", "SHIFT3", + "MaxHomopolymer_Length", "SiteHomopolymer_Length", "T_DP", "tBAM_REF_MQ", "tBAM_ALT_MQ", + "tBAM_Z_Ranksums_MQ", "tBAM_REF_BQ", "tBAM_ALT_BQ", "tBAM_Z_Ranksums_BQ", "tBAM_REF_NM", + "tBAM_ALT_NM", "tBAM_NM_Diff", "tBAM_REF_Concordant", "tBAM_REF_Discordant", + "tBAM_ALT_Concordant", "tBAM_ALT_Discordant", "tBAM_Concordance_FET", "T_REF_FOR", + "T_REF_REV", "T_ALT_FOR", "T_ALT_REV", "tBAM_StrandBias_FET", "tBAM_Z_Ranksums_EndPos", + "tBAM_REF_Clipped_Reads", "tBAM_ALT_Clipped_Reads", "tBAM_Clipping_FET", "tBAM_MQ0", + "tBAM_Other_Reads", "tBAM_Poor_Reads", "tBAM_REF_InDel_3bp", "tBAM_REF_InDel_2bp", + "tBAM_REF_InDel_1bp", "tBAM_ALT_InDel_3bp", "tBAM_ALT_InDel_2bp", "tBAM_ALT_InDel_1bp", + "InDel_Length"] with open(ensemble_tsv) as s_f: for line in s_f: if line[0:5] == "CHROM": header_pos = line.strip().split()[0:5] header = line.strip().split()[5:105] - header_en = list(filter(lambda x:x[1] in expected_features, enumerate(line.strip().split()[5:]))) + header_en = list(filter( + lambda x: x[1] in expected_features, enumerate(line.strip().split()[5:]))) header = list(map(lambda x: x[1], header_en)) - if set(expected_features)-set(header): + if set(expected_features) - set(header): logger.error("The following features 
are missing from ensemble file: {}".format( list(set(expected_features) - set(header)))) raise Exception @@ -1299,7 +1303,7 @@ def extract_ensemble(work, ensemble_tsv): ensemble_pos.append(fields[0:5]) ensemble_data.append(list(map(lambda x: float( x.replace("False", "0").replace("True", "1")), fields[5:]))) - ensemble_data = np.array(ensemble_data)[:,order_header] + ensemble_data = np.array(ensemble_data)[:, order_header] cov_features = list(map(lambda x: x[0], filter(lambda x: x[1] in [ "Consistent_Mates", "Inconsistent_Mates", "N_DP", @@ -1314,16 +1318,16 @@ def extract_ensemble(work, ensemble_tsv): "tBAM_REF_InDel_1bp", "tBAM_ALT_InDel_3bp", "tBAM_ALT_InDel_2bp", "tBAM_ALT_InDel_1bp", ], enumerate(header)))) mq_features = list(map(lambda x: x[0], filter(lambda x: x[1] in [ - "nBAM_REF_MQ", "nBAM_ALT_MQ", "tBAM_REF_MQ", "tBAM_ALT_MQ"], enumerate(header)))) + "nBAM_REF_MQ", "nBAM_ALT_MQ", "tBAM_REF_MQ", "tBAM_ALT_MQ"], enumerate(header)))) bq_features = list(map(lambda x: x[0], filter(lambda x: x[1] in [ - "nBAM_REF_BQ", "nBAM_ALT_BQ", "tBAM_REF_BQ", "tBAM_ALT_BQ"], enumerate(header)))) + "nBAM_REF_BQ", "nBAM_ALT_BQ", "tBAM_REF_BQ", "tBAM_ALT_BQ"], enumerate(header)))) nm_diff_features = list(map(lambda x: x[0], filter( lambda x: x[1] in ["nBAM_NM_Diff", "tBAM_NM_Diff"], enumerate(header)))) ranksum_features = list(map(lambda x: x[0], filter(lambda x: x[1] in ["nBAM_Z_Ranksums_MQ", "nBAM_Z_Ranksums_BQ", - "nBAM_Z_Ranksums_EndPos", "tBAM_Z_Ranksums_BQ", "tBAM_Z_Ranksums_MQ", "tBAM_Z_Ranksums_EndPos", ], enumerate(header)))) + "nBAM_Z_Ranksums_EndPos", "tBAM_Z_Ranksums_BQ", "tBAM_Z_Ranksums_MQ", "tBAM_Z_Ranksums_EndPos", ], enumerate(header)))) zero_to_one_features = list(map(lambda x: x[0], filter(lambda x: x[1] in ["if_MuTect", "if_VarScan2", "if_SomaticSniper", "if_VarDict", - "MuSE_Tier", "if_Strelka"] + ["nBAM_Concordance_FET", "nBAM_StrandBias_FET", "nBAM_Clipping_FET", - "tBAM_Concordance_FET", "tBAM_StrandBias_FET", "tBAM_Clipping_FET"] + ["if_dbsnp", "COMMON"] + ["M2_STR"], enumerate(header)))) + "MuSE_Tier", "if_Strelka"] + ["nBAM_Concordance_FET", "nBAM_StrandBias_FET", "nBAM_Clipping_FET", + "tBAM_Concordance_FET", "tBAM_StrandBias_FET", "tBAM_Clipping_FET"] + ["if_dbsnp", "COMMON"] + ["M2_STR"], enumerate(header)))) stralka_scor = list(map(lambda x: x[0], filter( lambda x: x[1] in ["Strelka_Score"], enumerate(header)))) stralka_qss = list(map(lambda x: x[0], filter( @@ -1335,7 +1339,7 @@ def extract_ensemble(work, ensemble_tsv): vardict_score = list(map(lambda x: x[0], filter( lambda x: x[1] in ["VarDict_Score"], enumerate(header)))) m2_lod = list(map(lambda x: x[0], filter(lambda x: x[1] in [ - "M2_NLOD", "M2_TLOD"], enumerate(header)))) + "M2_NLOD", "M2_TLOD"], enumerate(header)))) sniper_score = list(map(lambda x: x[0], filter( lambda x: x[1] in ["Sniper_Score"], enumerate(header)))) m2_ecent = list(map(lambda x: x[0], filter( @@ -1405,14 +1409,12 @@ def generate_dataset(work, truth_vcf_file, mode, tumor_pred_vcf_file, region_be if not os.path.exists(work): os.mkdir(work) - original_tempdir = pybedtools.get_tempdir() pybedtmp = os.path.join(work, "pybedtmp") if not os.path.exists(pybedtmp): os.mkdir(pybedtmp) pybedtools.set_tempdir(pybedtmp) - if mode == "train" and not truth_vcf_file: raise(RuntimeError("--truth_vcf is needed for 'train' mode")) diff --git a/neusomatic/python/long_read_indelrealign.py b/neusomatic/python/long_read_indelrealign.py index 7b4ee36..0bb8e09 100755 --- a/neusomatic/python/long_read_indelrealign.py +++ 
b/neusomatic/python/long_read_indelrealign.py @@ -127,7 +127,7 @@ def fix_record(self, record, ref_seq): for region_start, region_end, start_idx, end_idx, del_start, del_end, \ pos_start, pos_end, new_cigar, excess_start, excess_end in self.realignments: c_array = np.array(list(map(lambda x: [x[0], x[1][1] if x[1][0] - != CIGAR_DEL else 0], enumerate(cigartuples)))) + != CIGAR_DEL else 0], enumerate(cigartuples)))) c_map = np.repeat(c_array[:, 0], c_array[:, 1]) c_i = c_map[start_idx - bias] @@ -241,7 +241,7 @@ def get_cigar_stat(cigartuple, keys=[]): def find_NM(record, ref_seq): logger = logging.getLogger(find_NM.__name__) positions = np.array(list(map(lambda x: x if x else -1, - (record.get_reference_positions(full_length=True))))) + (record.get_reference_positions(full_length=True))))) sc_start = (record.cigartuples[0][0] == CIGAR_SOFTCLIP) * record.cigartuples[0][1] sc_end = (record.cigartuples[-1][0] == @@ -299,12 +299,13 @@ def prepare_fasta(work, region, input_bam, ref_fasta_file, include_ref, split_i) if record.is_supplementary and "SA" in dict(record.tags): sas = dict(record.tags)["SA"].split(";") sas = list(filter(None, sas)) - sas_cigs = list(map(lambda x: x.split(",")[3], sas)) + sas_cigs = list( + map(lambda x: x.split(",")[3], sas)) if record.cigarstring in sas_cigs: continue positions = np.array(list(map(lambda x: x if x else -1, - (record.get_reference_positions( - full_length=True))))) + (record.get_reference_positions( + full_length=True))))) if not record.cigartuples: continue sc_start = (record.cigartuples[0][0] == @@ -328,8 +329,8 @@ def prepare_fasta(work, region, input_bam, ref_fasta_file, include_ref, split_i) if max(end_idx - start_idx, pos_end - pos_start) >= (region.span() * 0.75): c_array = np.array(list(map(lambda x: [x[0], x[1][1] if x[1][0] - != CIGAR_DEL else 0], - enumerate(record.cigartuples)))) + != CIGAR_DEL else 0], + enumerate(record.cigartuples)))) c_map = np.repeat(c_array[:, 0], c_array[:, 1]) c_i = c_map[start_idx] c_e = c_map[end_idx] @@ -346,8 +347,9 @@ def prepare_fasta(work, region, input_bam, ref_fasta_file, include_ref, split_i) mn, mx = min(non_ins_positions), max( non_ins_positions) ref_array = np.array(list(map(lambda x: - NUC_to_NUM[x.upper()], - list(refseq))))[non_ins_positions - mn] + NUC_to_NUM[ + x.upper()], + list(refseq))))[non_ins_positions - mn] q_array = np.array(list(map(lambda x: NUC_to_NUM[x.upper()], list(q_seq))))[ non_ins] cigar_stat = get_cigar_stat( @@ -369,7 +371,7 @@ def prepare_fasta(work, region, input_bam, ref_fasta_file, include_ref, split_i) def split_bam_to_chunks(work, region, input_bam, chunk_size=200, - chunk_scale=1.5): + chunk_scale=1.5): logger = logging.getLogger(split_bam_to_chunks.__name__) records = [] with pysam.Samfile(input_bam, "rb") as samfile: @@ -478,9 +480,9 @@ def extract_new_cigars(region, info_file, out_fasta_file): raise Exception alignment = list(map(lambda x: x[1], sorted(map(lambda x: [int(x[0]), list(map(lambda x: 0 if x == "-" - else 1, x[1].seq))], - records.items()), - key=lambda x: x[0]))) + else 1, x[1].seq))], + records.items()), + key=lambda x: x[0]))) ref_seq = np.array(alignment[0]) pos_ref = np.cumsum(alignment[0]) - 1 alignment = np.array(alignment[1:]) - ref_seq @@ -563,7 +565,8 @@ def get_final_msa(region, msa_0, consensus, out_fasta_file_1, out_fasta_file_fin ii = int(i) msa_1[ii] = list(map(lambda x: nuc_to_num_convert(x), record.seq)) msa_1 = np.array(msa_1, dtype=int) - consensus_array = np.array(list(map(lambda x: nuc_to_num_convert(x), consensus))) + consensus_array = 
np.array( + list(map(lambda x: nuc_to_num_convert(x), consensus))) consensus_cumsum = np.cumsum(consensus_array > 0) new_cols = np.where(msa_1[1, :] == 0)[0] new_cols -= np.arange(len(new_cols)) @@ -901,9 +904,9 @@ def TrimREFALT(ref, alt, pos): def run_realignment(input_record): work, ref_fasta_file, target_region, pad, chunk_size, chunk_scale, \ - snp_min_af, del_min_af, ins_min_af, len_chr, input_bam, \ - match_score, mismatch_penalty, gap_open_penalty, gap_ext_penalty, \ - msa_binary, get_var = input_record + snp_min_af, del_min_af, ins_min_af, len_chr, input_bam, \ + match_score, mismatch_penalty, gap_open_penalty, gap_ext_penalty, \ + msa_binary, get_var = input_record thread_logger = logging.getLogger( "{} ({})".format(run_realignment.__name__, multiprocessing.current_process().name)) @@ -1261,9 +1264,10 @@ def long_read_indelrealign(work, input_bam, output_bam, output_vcf, region_bed_f realign_bed_file = os.path.join(work, "realign.bed") realign_entries.sort() relaign_bed = pybedtools.BedTool(map(lambda x: pybedtools.Interval(x[0], x[1], - x[2],x[3], - otherfields=list(map(lambda y:str(y).encode('utf-8'), x[4:]))), - realign_entries)).saveas(realign_bed_file) + x[2], x[ + 3], + otherfields=list(map(lambda y: str(y).encode('utf-8'), x[4:]))), + realign_entries)).saveas(realign_bed_file) relaign_bed = pybedtools.BedTool(realign_bed_file) diff --git a/neusomatic/python/postprocess.py b/neusomatic/python/postprocess.py index 3a878ce..5085f65 100755 --- a/neusomatic/python/postprocess.py +++ b/neusomatic/python/postprocess.py @@ -106,7 +106,7 @@ def add_vcf_info(work, reference, merged_vcf, candidates_vcf, ensemble_tsv, scores[tag] = [x[5], x[6], x[7], x[9]] tags = sorted(fina_info_tag.keys(), key=lambda x: list(map(int, x.split("-")[0:2] - ))) + ))) with open(output_vcf, "w") as o_f: o_f.write("##fileformat=VCFv4.2\n") o_f.write("##NeuSomatic Version={}\n".format(__version__)) @@ -162,7 +162,6 @@ def postprocess(work, reference, pred_vcf_file, output_vcf, candidates_vcf, ense if not os.path.exists(work): os.mkdir(work) - original_tempdir = pybedtools.get_tempdir() pybedtmp = os.path.join(work, "pybedtmp_postprocess") if not os.path.exists(pybedtmp): diff --git a/neusomatic/python/preprocess.py b/neusomatic/python/preprocess.py index a59e895..81f2efc 100755 --- a/neusomatic/python/preprocess.py +++ b/neusomatic/python/preprocess.py @@ -39,7 +39,9 @@ def split_dbsnp(record): def process_split_region(tn, work, region, reference, mode, alignment_bam, dbsnp, scan_window_size, scan_maf, min_mapq, - filtered_candidates_vcf, min_dp, max_dp, good_ao, min_ao, snp_min_af, snp_min_bq, snp_min_ao, + filtered_candidates_vcf, min_dp, max_dp, + filter_duplicate, + good_ao, min_ao, snp_min_af, snp_min_bq, snp_min_ao, ins_min_af, del_min_af, del_merge_min_af, ins_merge_min_af, merge_r, scan_alignments_binary, restart, num_threads, calc_qual, regions=[], dbsnp_regions=[]): @@ -48,7 +50,7 @@ def process_split_region(tn, work, region, reference, mode, alignment_bam, dbsnp logger.info("Scan bam.") scan_outputs = scan_alignments(work, scan_alignments_binary, alignment_bam, region, reference, num_threads, scan_window_size, scan_maf, - min_mapq, max_dp, restart=restart, split_region_files=regions, + min_mapq, max_dp, filter_duplicate, restart=restart, split_region_files=regions, calc_qual=calc_qual) if filtered_candidates_vcf: logger.info("Filter candidates.") @@ -200,7 +202,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, ins_min_af, del_min_af, del_merge_min_af, 
ins_merge_min_af, merge_r, truth_vcf, tsv_batch_size, matrix_width, matrix_base_pad, min_ev_frac_per_col, - ensemble_tsv, long_read, restart, first_do_without_qual, num_threads, + ensemble_tsv, long_read, restart, first_do_without_qual, + filter_duplicate, + num_threads, scan_alignments_binary,): logger = logging.getLogger(preprocess.__name__) @@ -208,7 +212,6 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, if restart or not os.path.exists(work): os.mkdir(work) - original_tempdir = pybedtools.get_tempdir() pybedtmp = os.path.join(work, "pybedtmp_preprocess") if not os.path.exists(pybedtmp): @@ -223,11 +226,12 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, raise Exception("No normal BAM file {}".format(normal_bam)) if not os.path.exists(tumor_bam + ".bai"): logger.error("Aborting!") - raise Exception("No tumor .bai index file {}".format(tumor_bam + ".bai")) + raise Exception( + "No tumor .bai index file {}".format(tumor_bam + ".bai")) if not os.path.exists(normal_bam + ".bai"): logger.error("Aborting!") - raise Exception("No normal .bai index file {}".format(normal_bam + ".bai")) - + raise Exception( + "No normal .bai index file {}".format(normal_bam + ".bai")) ensemble_bed = None if ensemble_tsv: @@ -250,7 +254,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, tumor_outputs_without_q = process_split_region("tumor", work_tumor_without_q, region_bed, reference, mode, tumor_bam, dbsnp, scan_window_size, scan_maf, min_mapq, - filtered_candidates_vcf_without_q, min_dp, max_dp, good_ao, min_ao, + filtered_candidates_vcf_without_q, min_dp, max_dp, + filter_duplicate, + good_ao, min_ao, snp_min_af, -10000, snp_min_ao, ins_min_af, del_min_af, del_merge_min_af, ins_merge_min_af, merge_r, @@ -273,7 +279,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, logger.info("Scan tumor bam (and extracting quality scores).") tumor_outputs = process_split_region("tumor", work_tumor, region_bed, reference, mode, tumor_bam, dbsnp, scan_window_size, scan_maf, min_mapq, - filtered_candidates_vcf, min_dp, max_dp, good_ao, min_ao, + filtered_candidates_vcf, min_dp, max_dp, + filter_duplicate, + good_ao, min_ao, snp_min_af, snp_min_bq, snp_min_ao, ins_min_af, del_min_af, del_merge_min_af, ins_merge_min_af, merge_r, @@ -300,7 +308,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, logger.info("Scan normal bam (and extracting quality scores).") normal_counts, _, _, _ = process_split_region("normal", work_normal, region_bed, reference, mode, normal_bam, None, scan_window_size, 0.2, min_mapq, - None, min_dp, max_dp, good_ao, min_ao, snp_min_af, snp_min_bq, snp_min_ao, + None, min_dp, max_dp, + filter_duplicate, + good_ao, min_ao, snp_min_af, snp_min_bq, snp_min_ao, ins_min_af, del_min_af, del_merge_min_af, ins_merge_min_af, merge_r, scan_alignments_binary, restart, num_threads, @@ -330,7 +340,6 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, logger.info("Preprocessing is Done.") - if __name__ == '__main__': FORMAT = '%(levelname)s %(asctime)-15s %(name)-20s %(message)s' logging.basicConfig(level=logging.INFO, format=FORMAT) @@ -360,7 +369,7 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, help='minimum mapping quality', default=1) parser.add_argument('--min_dp', type=float, help='min depth', default=5) parser.add_argument('--max_dp', type=float, - help='max depth', default=40000) + 
help='max depth', default=100000) parser.add_argument('--good_ao', type=float, help='good alternate count (ignores maf)', default=10) parser.add_argument('--min_ao', type=float, @@ -402,6 +411,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, parser.add_argument('--first_do_without_qual', help='Perform initial scan without calculating the quality stats', action="store_true") + parser.add_argument('--filter_duplicate', + help='filter duplicate reads when preparing pileup information', + action="store_true") parser.add_argument('--num_threads', type=int, help='number of threads', default=1) parser.add_argument('--scan_alignments_binary', type=str, @@ -417,7 +429,9 @@ def preprocess(work, mode, reference, region_bed, tumor_bam, normal_bam, dbsnp, args.ins_min_af, args.del_min_af, args.del_merge_min_af, args.ins_merge_min_af, args.merge_r, args.truth_vcf, args.tsv_batch_size, args.matrix_width, args.matrix_base_pad, args.min_ev_frac_per_col, - args.ensemble_tsv, args.long_read, args.restart, args.first_do_without_qual, args.num_threads, + args.ensemble_tsv, args.long_read, args.restart, args.first_do_without_qual, + args.filter_duplicate, + args.num_threads, args.scan_alignments_binary) except Exception as e: logger.error(traceback.format_exc()) diff --git a/neusomatic/python/scan_alignments.py b/neusomatic/python/scan_alignments.py index 820adb4..79c02df 100755 --- a/neusomatic/python/scan_alignments.py +++ b/neusomatic/python/scan_alignments.py @@ -26,8 +26,12 @@ def run_scan_alignments(record): work, reference, scan_alignments_binary, split_region_file, \ - input_bam, window_size, maf, min_mapq, max_dp, calc_qual, num_threads = record + input_bam, window_size, maf, min_mapq, max_dp, filter_duplicate, calc_qual = record + if filter_duplicate: + filter_duplicate_str = "--filter_duplicate" + else: + filter_duplicate_str = "" thread_logger = logging.getLogger( "{} ({})".format(run_scan_alignments.__name__, multiprocessing.current_process().name)) try: @@ -38,9 +42,9 @@ def run_scan_alignments(record): os.mkdir(work) if len(pybedtools.BedTool(split_region_file)) > 0: cmd = "{} --ref {} -b {} -L {} --out_vcf_file {}/candidates.vcf --out_count_file {}/count.bed \ - --window_size {} --min_af {} --min_mapq {} --max_depth {} --num_thread {}".format( + --window_size {} --min_af {} --min_mapq {} --max_depth {} {}".format( scan_alignments_binary, reference, input_bam, split_region_file, - work, work, window_size, maf, min_mapq, max_dp, num_threads) + work, work, window_size, maf, min_mapq, max_dp*window_size/100.0, filter_duplicate_str) if calc_qual: cmd += " --calculate_qual_stat" run_shell_command(cmd, stdout=os.path.join(work, "scan.out"), @@ -66,7 +70,7 @@ def run_scan_alignments(record): def scan_alignments(work, scan_alignments_binary, input_bam, regions_bed_file, reference, - num_threads, window_size, maf, min_mapq, max_dp, restart=True, + num_threads, window_size, maf, min_mapq, max_dp, filter_duplicate, restart=True, split_region_files=[], calc_qual=True): logger = logging.getLogger(scan_alignments.__name__) @@ -120,7 +124,7 @@ def scan_alignments(work, scan_alignments_binary, input_bam, shutil.rmtree(work_) map_args.append((os.path.join(work, "work.{}".format(i)), reference, scan_alignments_binary, split_region_file, - input_bam, window_size, maf, min_mapq, max_dp, calc_qual, 1)) + input_bam, window_size, maf, min_mapq, max_dp, filter_duplicate, calc_qual)) not_done.append(i) else: all_outputs[i] = [os.path.join(work, "work.{}".format(i), "candidates.vcf"), 
@@ -170,7 +174,10 @@ def scan_alignments(work, scan_alignments_binary, input_bam, parser.add_argument('--min_mapq', type=int, help='minimum mapping quality', default=1) parser.add_argument('--max_dp', type=float, - help='max depth', default=40000) + help='max depth', default=100000) + parser.add_argument('--filter_duplicate', + help='filter duplicate reads when preparing pileup information', + action="store_true") parser.add_argument('--num_threads', type=int, help='number of threads', default=1) args = parser.parse_args() @@ -180,7 +187,7 @@ def scan_alignments(work, scan_alignments_binary, input_bam, outputs = scan_alignments(args.work, args.scan_alignments_binary, args.input_bam, args.regions_bed_file, args.reference, args.num_threads, args.window_size, args.maf, - args.min_mapq, args.max_dp) + args.min_mapq, args.max_dp, args.filter_duplicate) except Exception as e: logger.error(traceback.format_exc()) logger.error("Aborting!") diff --git a/neusomatic/python/train.py b/neusomatic/python/train.py index c5ecf99..b85e9af 100755 --- a/neusomatic/python/train.py +++ b/neusomatic/python/train.py @@ -21,7 +21,7 @@ import pickle from network import NeuSomaticNet -from dataloader import NeuSomaticDataset +from dataloader import NeuSomaticDataset, matrix_transform from merge_tsvs import merge_tsvs type_class_dict = {"DEL": 0, "INS": 1, "NONE": 2, "SNP": 3} @@ -32,12 +32,14 @@ torch._utils._rebuild_tensor_v2 except AttributeError: def _rebuild_tensor_v2(storage, storage_offset, size, stride, requires_grad, backward_hooks): - tensor = torch._utils._rebuild_tensor(storage, storage_offset, size, stride) + tensor = torch._utils._rebuild_tensor( + storage, storage_offset, size, stride) tensor.requires_grad = requires_grad tensor._backward_hooks = backward_hooks return tensor torch._utils._rebuild_tensor_v2 = _rebuild_tensor_v2 + def make_weights_for_balanced_classes(count_class_t, count_class_l, nclasses_t, nclasses_l, none_count=None): logger = logging.getLogger(make_weights_for_balanced_classes.__name__) @@ -57,7 +59,8 @@ def make_weights_for_balanced_classes(count_class_t, count_class_l, nclasses_t, for i in range(nclasses_t): w_t[i] = (1 - (float(count_class_t[i]) / float(N))) / float(nclasses_t) w_t = np.array(w_t) - logger.info("weight type classes: {}".format(list(zip(vartype_classes, w_t)))) + logger.info("weight type classes: {}".format( + list(zip(vartype_classes, w_t)))) logger.info("count length classes: {}".format(list( zip(range(nclasses_l), count_class_l)))) @@ -100,7 +103,10 @@ def test(net, epoch, validation_loader, use_cuda): for i, _ in enumerate(paths[0]): preds[i] = [vartype_classes[predicted[i]], pos_pred[i], len_pred[i]] - compare_labels = (predicted == labels).squeeze() + if labels.size()[0] > 1: + compare_labels = (predicted == labels).squeeze() + else: + compare_labels = (predicted == labels) false_preds = np.where(compare_labels.numpy() == 0)[0] if len(false_preds) > 0: for i in false_preds: @@ -118,7 +124,11 @@ def test(net, epoch, validation_loader, use_cuda): label = predicted[i] class_p_total[label] += 1 - compare_len = (len_pred == var_len_s).squeeze() + if var_len_s.size()[0] > 1: + compare_len = (len_pred == var_len_s).squeeze() + else: + compare_len = (len_pred == var_len_s) + for i in range(len(var_len_s)): len_ = var_len_s[i] len_class_correct[len_] += compare_len[i].data.cpu().numpy() @@ -166,7 +176,7 @@ def __iter__(self): if self.current_none_id > (len(self.none_indices) - self.none_count): this_round_nones = self.none_indices[self.current_none_id:] 
diff --git a/neusomatic/python/train.py b/neusomatic/python/train.py
index c5ecf99..b85e9af 100755
--- a/neusomatic/python/train.py
+++ b/neusomatic/python/train.py
@@ -21,7 +21,7 @@ import pickle

 from network import NeuSomaticNet
-from dataloader import NeuSomaticDataset
+from dataloader import NeuSomaticDataset, matrix_transform
 from merge_tsvs import merge_tsvs

 type_class_dict = {"DEL": 0, "INS": 1, "NONE": 2, "SNP": 3}
@@ -32,12 +32,14 @@
     torch._utils._rebuild_tensor_v2
 except AttributeError:
     def _rebuild_tensor_v2(storage, storage_offset, size, stride, requires_grad, backward_hooks):
-        tensor = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
+        tensor = torch._utils._rebuild_tensor(
+            storage, storage_offset, size, stride)
         tensor.requires_grad = requires_grad
         tensor._backward_hooks = backward_hooks
         return tensor
     torch._utils._rebuild_tensor_v2 = _rebuild_tensor_v2

+
 def make_weights_for_balanced_classes(count_class_t, count_class_l, nclasses_t, nclasses_l,
                                       none_count=None):
     logger = logging.getLogger(make_weights_for_balanced_classes.__name__)
@@ -57,7 +59,8 @@ def make_weights_for_balanced_classes(count_class_t, count_class_l, nclasses_t,
     for i in range(nclasses_t):
         w_t[i] = (1 - (float(count_class_t[i]) / float(N))) / float(nclasses_t)
     w_t = np.array(w_t)
-    logger.info("weight type classes: {}".format(list(zip(vartype_classes, w_t))))
+    logger.info("weight type classes: {}".format(
+        list(zip(vartype_classes, w_t))))
     logger.info("count length classes: {}".format(list(
         zip(range(nclasses_l), count_class_l))))
@@ -100,7 +103,10 @@ def test(net, epoch, validation_loader, use_cuda):
             for i, _ in enumerate(paths[0]):
                 preds[i] = [vartype_classes[predicted[i]], pos_pred[i],
                             len_pred[i]]
-            compare_labels = (predicted == labels).squeeze()
+            if labels.size()[0] > 1:
+                compare_labels = (predicted == labels).squeeze()
+            else:
+                compare_labels = (predicted == labels)
             false_preds = np.where(compare_labels.numpy() == 0)[0]
             if len(false_preds) > 0:
                 for i in false_preds:
@@ -118,7 +124,11 @@ def test(net, epoch, validation_loader, use_cuda):
                     label = predicted[i]
                     class_p_total[label] += 1
-            compare_len = (len_pred == var_len_s).squeeze()
+            if var_len_s.size()[0] > 1:
+                compare_len = (len_pred == var_len_s).squeeze()
+            else:
+                compare_len = (len_pred == var_len_s)
+
             for i in range(len(var_len_s)):
                 len_ = var_len_s[i]
                 len_class_correct[len_] += compare_len[i].data.cpu().numpy()
@@ -166,7 +176,7 @@ def __iter__(self):
         if self.current_none_id > (len(self.none_indices) - self.none_count):
             this_round_nones = self.none_indices[self.current_none_id:]
             self.none_indices = list(map(lambda i: self.none_indices[i],
-                                    torch.randperm(len(self.none_indices)).tolist()))
+                                         torch.randperm(len(self.none_indices)).tolist()))
             self.current_none_id = self.none_count - len(this_round_nones)
             this_round_nones += self.none_indices[0:self.current_none_id]
         else:
@@ -188,7 +198,9 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
                      lr_drop_ratio, momentum, boost_none, none_count_scale,
                      max_load_candidates, coverage_thr, save_freq, ensemble,
                      merged_candidates_per_tsv, merged_max_num_tsvs, overwrite_merged_tsvs,
-                     trian_split_len, use_cuda):
+                     train_split_len,
+                     normalize_channels,
+                     use_cuda):
     logger = logging.getLogger(train_neusomatic.__name__)

     logger.info("----------------Train NeuSomatic Network-------------------")
@@ -201,9 +213,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     if not use_cuda:
         torch.set_num_threads(num_threads)

-    data_transform = transforms.Compose(
-        [transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
-    )
+    data_transform = matrix_transform((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     num_channels = 119 if ensemble else 26
     net = NeuSomaticNet(num_channels)
     if use_cuda:
@@ -232,6 +242,12 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
             coverage_thr = pretrained_dict["coverage_thr"]
             logger.info(
                 "Override coverage_thr from pretrained checkpoint: {}".format(coverage_thr))
+        if "normalize_channels" in pretrained_dict:
+            normalize_channels = pretrained_dict["normalize_channels"]
+        else:
+            normalize_channels = False
+        logger.info(
+            "Override normalize_channels from pretrained checkpoint: {}".format(normalize_channels))
         prev_epochs = sofar_epochs + 1
         model_dict = net.state_dict()
         # 1. filter out unnecessary keys
@@ -279,7 +295,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     for i, (L, tsv) in enumerate(zip(Ls, candidates_tsv)):
         current_L += L
         current_split_tsvs.append(tsv)
-        if current_L >= trian_split_len or (i == len(candidates_tsv) - 1 and current_L > 0):
+        if current_L >= train_split_len or (i == len(candidates_tsv) - 1 and current_L > 0):
             logger.info("tsvs in split {}: {}".format(
                 len(train_split_tsvs), current_split_tsvs))
             train_split_tsvs.append(current_split_tsvs)
@@ -298,18 +314,19 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
                                       max_load_candidates=int(
                                           max_load_candidates * len(tsvs) / float(len(candidates_tsv))),
                                       transform=data_transform, is_test=False,
-                                      num_threads=num_threads, coverage_thr=coverage_thr)
+                                      num_threads=num_threads, coverage_thr=coverage_thr,
+                                      normalize_channels=normalize_channels)
         train_sets.append(train_set)
         none_indices = train_set.get_none_indices()
         var_indices = train_set.get_var_indices()
         if none_indices:
             none_indices = list(map(lambda i: none_indices[i],
-                               torch.randperm(len(none_indices)).tolist()))
+                                    torch.randperm(len(none_indices)).tolist()))
         logger.info(
             "Non-somatic candidates in split {}: {}".format(split_i, len(none_indices)))
         if var_indices:
             var_indices = list(map(lambda i: var_indices[i],
-                              torch.randperm(len(var_indices)).tolist()))
+                                   torch.randperm(len(var_indices)).tolist()))
         logger.info("Somatic candidates in split {}: {}".format(
             split_i, len(var_indices)))
         none_count = max(min(len(none_indices), len(
@@ -330,7 +347,8 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     validation_set = NeuSomaticDataset(roots=validation_candidates_tsv,
                                        max_load_candidates=max_load_candidates,
                                        transform=data_transform, is_test=True,
-                                       num_threads=num_threads, coverage_thr=coverage_thr)
+                                       num_threads=num_threads, coverage_thr=coverage_thr,
+                                       normalize_channels=normalize_channels)
     validation_loader = torch.utils.data.DataLoader(validation_set,
                                                     batch_size=batch_size, shuffle=True,
                                                     num_workers=num_threads, pin_memory=True)
@@ -372,7 +390,8 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
         torch.save({"state_dict": net.state_dict(),
                     "tag": tag,
                     "epoch": curr_epoch,
-                    "coverage_thr": coverage_thr},
+                    "coverage_thr": coverage_thr,
+                    "normalize_channels": normalize_channels},
                    '{}/models/checkpoint_{}_epoch{}.pth'.format(out_dir, tag, curr_epoch))

     if len(train_sets) == 1:
@@ -384,7 +403,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     # loop over the dataset multiple times
     n_epoch = 0
     for epoch in range(max_epochs - prev_epochs):
-        n_epoch +=1
+        n_epoch += 1
         running_loss = 0.0
         i_ = 0
         for split_i, train_set in enumerate(train_sets):
@@ -437,6 +456,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
                             "tag": tag,
                             "epoch": curr_epoch,
                             "coverage_thr": coverage_thr,
+                            "normalize_channels": normalize_channels,
                             }, '{}/models/checkpoint_{}_epoch{}.pth'.format(out_dir, tag, curr_epoch))
                 if validation_candidates_tsv:
                     test(net, curr_epoch, validation_loader, use_cuda)
@@ -453,11 +473,17 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     torch.save({"state_dict": net.state_dict(),
                 "tag": tag,
                 "epoch": curr_epoch,
-                "coverage_thr": coverage_thr}, '{}/models/checkpoint_{}_epoch{}.pth'.format(
+                "coverage_thr": coverage_thr,
+                "normalize_channels": normalize_channels,
+                }, '{}/models/checkpoint_{}_epoch{}.pth'.format(
                     out_dir, tag, curr_epoch))
     if validation_candidates_tsv:
         test(net, curr_epoch, validation_loader, use_cuda)
     logger.info("Total Epochs: {}".format(curr_epoch))
+    logger.info("Total Epochs: {}".format(curr_epoch))
+
+    logger.info("Training is Done.")
+    return '{}/models/checkpoint_{}_epoch{}.pth'.format(out_dir, tag, curr_epoch)


 if __name__ == '__main__':
@@ -473,7 +499,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     parser.add_argument('--out', type=str,
                         help='output directory', required=True)
     parser.add_argument('--checkpoint', type=str,
-                        help='pretrianed network model checkpoint path', default=None)
+                        help='pretrained network model checkpoint path', default=None)
     parser.add_argument('--validation_candidates_tsv', nargs="*",
                         help=' validation candidate tsv files', default=[])
     parser.add_argument('--ensemble',
@@ -510,7 +536,7 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
     parser.add_argument('--overwrite_merged_tsvs',
                         help='if OUT/merged_tsvs/ folder exists overwrite the merged tsvs',
                         action="store_true")
-    parser.add_argument('--trian_split_len', type=int,
+    parser.add_argument('--train_split_len', type=int,
                         help='Maximum number of candidates used in each split of training (>=merged_candidates_per_tsv)',
                         default=10000000)
     parser.add_argument('--coverage_thr', type=int,
@@ -519,6 +545,10 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
                         Will be overridden if pretrained model is provided\
                         For ~50x WGS, coverage_thr=100 should work. \
                         For higher coverage WES, coverage_thr=300 should work.', default=100)
+    parser.add_argument('--normalize_channels',
+                        help='normalize BQ, MQ, and other bam-info channels by frequency of observed alleles. \
+                        Will be overridden if pretrained model is provided',
+                        action="store_true")
     args = parser.parse_args()
     logger.info(args)
@@ -535,7 +565,8 @@ def train_neusomatic(candidates_tsv, validation_candidates_tsv, out_dir, checkpo
                                      args.max_load_candidates, args.coverage_thr, args.save_freq,
                                      args.ensemble,
                                      args.merged_candidates_per_tsv, args.merged_max_num_tsvs,
-                                     args.overwrite_merged_tsvs, args.trian_split_len,
+                                     args.overwrite_merged_tsvs, args.train_split_len,
+                                     args.normalize_channels,
                                      use_cuda)
     except Exception as e:
         logger.error(traceback.format_exc())
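Two train.py changes are worth calling out. First, the `labels.size()[0] > 1` guards in `test()` fix a batch-size-1 edge case: `.squeeze()` on a length-1 comparison tensor collapses it to a 0-dim tensor, which can no longer be indexed in the per-sample accuracy loops. Second, checkpoints now persist `normalize_channels` alongside `coverage_thr`, so a pretrained model's setting overrides the command-line flag. A minimal repro of the squeeze pitfall (standalone sketch, not NeuSomatic code):

```
import torch

# A "batch" of one prediction/label pair, as in a last validation batch.
predicted = torch.tensor([2])
labels = torch.tensor([2])

flat = (predicted == labels).squeeze()  # 0-dim tensor: torch.Size([])
try:
    flat[0]  # indexing a 0-dim tensor raises
except IndexError as err:
    print("unguarded squeeze breaks on batch size 1:", err)

# The guarded form keeps a 1-dim tensor when the batch has a single sample:
compare = (predicted == labels).squeeze() if labels.size()[0] > 1 else (predicted == labels)
print(compare.shape)  # torch.Size([1]); compare[0] now works
```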
diff --git a/test/NeuSomatic_ensemble.vcf b/test/NeuSomatic_ensemble.vcf
index 8fdc9a7..2302afe 100644
--- a/test/NeuSomatic_ensemble.vcf
+++ b/test/NeuSomatic_ensemble.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.2
-##NeuSomatic Version=0.2.0
+##NeuSomatic Version=0.2.1
 ##INFO=
 ##INFO=
 ##INFO=
@@ -15,10 +15,10 @@
 ##FORMAT=
 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
 22 21330787 . C T 26.9917 PASS SCORE=0.9980;DP=396;RO=306;AO=88;AF=0.2234 GT:DP:RO:AO:AF 0/1:396:306:88:0.2234
-22 21332122 . G A 28.8618 PASS SCORE=0.9987;DP=285;RO=223;AO=62;AF=0.2175 GT:DP:RO:AO:AF 0/1:285:223:62:0.2175
-22 21334924 . G C 17.5885 PASS SCORE=0.9826;DP=106;RO=83;AO=23;AF=0.217 GT:DP:RO:AO:AF 0/1:106:83:23:0.217
+22 21332122 . G A 28.5402 PASS SCORE=0.9986;DP=285;RO=223;AO=62;AF=0.2175 GT:DP:RO:AO:AF 0/1:285:223:62:0.2175
+22 21334924 . G C 17.5639 PASS SCORE=0.9825;DP=106;RO=83;AO=23;AF=0.217 GT:DP:RO:AO:AF 0/1:106:83:23:0.217
 22 21335259 . C A 19.7149 PASS SCORE=0.9893;DP=249;RO=200;AO=49;AF=0.1968 GT:DP:RO:AO:AF 0/1:249:200:49:0.1968
 22 21384516 . C T 27.6969 PASS SCORE=0.9983;DP=95;RO=68;AO=27;AF=0.2842 GT:DP:RO:AO:AF 0/1:95:68:27:0.2842
-22 21982892 . C T 21.4946 PASS SCORE=0.9929;DP=158;RO=113;AO=45;AF=0.2848 GT:DP:RO:AO:AF 0/1:158:113:45:0.2848
+22 21982892 . C T 21.5561 PASS SCORE=0.9930;DP=158;RO=113;AO=45;AF=0.2848 GT:DP:RO:AO:AF 0/1:158:113:45:0.2848
 22 21983260 . A G 31.5494 PASS SCORE=0.9993;DP=118;RO=74;AO=44;AF=0.3729 GT:DP:RO:AO:AF 0/1:118:74:44:0.3729
 22 21989959 . AAG A 33.0106 PASS SCORE=0.9995;DP=139;RO=107;AO=32;AF=0.2302 GT:DP:RO:AO:AF 0/1:139:107:32:0.2302
diff --git a/test/NeuSomatic_standalone.vcf b/test/NeuSomatic_standalone.vcf
index ac82554..a7dd79f 100644
--- a/test/NeuSomatic_standalone.vcf
+++ b/test/NeuSomatic_standalone.vcf
@@ -1,5 +1,5 @@
 ##fileformat=VCFv4.2
-##NeuSomatic Version=0.2.0
+##NeuSomatic Version=0.2.1
 ##INFO=
 ##INFO=
 ##INFO=
@@ -16,7 +16,7 @@
 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
 22 21330787 . C T 33.0111 PASS SCORE=0.9995;DP=396;RO=306;AO=88;AF=0.2234 GT:DP:RO:AO:AF 0/1:396:306:88:0.2234
 22 21332122 . G A 36.9903 PASS SCORE=0.9998;DP=285;RO=223;AO=62;AF=0.2175 GT:DP:RO:AO:AF 0/1:285:223:62:0.2175
-22 21334924 . G C 12.9395 PASS SCORE=0.9492;DP=106;RO=83;AO=23;AF=0.217 GT:DP:RO:AO:AF 0/1:106:83:23:0.217
+22 21334924 . G C 12.9061 PASS SCORE=0.9488;DP=106;RO=83;AO=23;AF=0.217 GT:DP:RO:AO:AF 0/1:106:83:23:0.217
 22 21335259 . C A 25.0876 PASS SCORE=0.9969;DP=249;RO=200;AO=49;AF=0.1968 GT:DP:RO:AO:AF 0/1:249:200:49:0.1968
 22 21384516 . C T 32.2191 PASS SCORE=0.9994;DP=95;RO=68;AO=27;AF=0.2842 GT:DP:RO:AO:AF 0/1:95:68:27:0.2842
 22 21982892 . C T 29.5872 PASS SCORE=0.9989;DP=158;RO=113;AO=45;AF=0.2848 GT:DP:RO:AO:AF 0/1:158:113:45:0.2848
diff --git a/test/docker_test.sh b/test/docker_test.sh
index a3fafba..6118711 100755
--- a/test/docker_test.sh
+++ b/test/docker_test.sh
@@ -10,16 +10,16 @@
 if [ ! -f Homo_sapiens.GRCh37.75.dna.chromosome.22.fa ]
 then
 	if [ ! -f Homo_sapiens.GRCh37.75.dna.chromosome.22.fa.gz ]
 	then
-		docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+		docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 		"cd /mnt/example/ && wget ftp://ftp.ensembl.org/pub/release-75//fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa.gz"
 	fi
-	docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+	docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"cd /mnt/example/ && gunzip -f Homo_sapiens.GRCh37.75.dna.chromosome.22.fa.gz"
 fi
 if [ ! -f Homo_sapiens.GRCh37.75.dna.chromosome.22.fa.fai ]
 then
-	docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+	docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"samtools faidx /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa"
 fi
 rm -rf work_standalone
@@ -27,7 +27,7 @@

 #Stand-alone NeuSomatic test
-docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"python /opt/neusomatic/neusomatic/python/preprocess.py \
 	--mode call \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
@@ -45,7 +45,7 @@
 	--num_threads 1 \
 	--scan_alignments_binary /opt/neusomatic/neusomatic/bin/scan_alignments"

-docker run -v ${test_dir}:/mnt -u $UID --memory 30G --shm-size 8G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G --shm-size 8G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"CUDA_VISIBLE_DEVICES= python /opt/neusomatic/neusomatic/python/call.py \
 	--candidates_tsv /mnt/example/work_standalone/dataset/*/candidates*.tsv \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
@@ -54,7 +54,7 @@
 	--num_threads 1 \
 	--batch_size 100"

-docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"python /opt/neusomatic/neusomatic/python/postprocess.py \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
 	--tumor_bam /mnt/tumor.bam \
@@ -66,7 +66,7 @@
 rm -rf /mnt/example/work_ensemble

 #Ensemble NeuSomatic test
-docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"python /opt/neusomatic/neusomatic/python/preprocess.py \
 	--mode call \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
@@ -85,7 +85,7 @@
 	--ensemble_tsv /mnt/ensemble.tsv \
 	--scan_alignments_binary /opt/neusomatic/neusomatic/bin/scan_alignments"
-docker run -v ${test_dir}:/mnt -u $UID --memory 30G --shm-size 8G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G --shm-size 8G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"CUDA_VISIBLE_DEVICES= python /opt/neusomatic/neusomatic/python/call.py \
 	--candidates_tsv /mnt/example/work_ensemble/dataset/*/candidates*.tsv \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
@@ -95,7 +95,7 @@
 	--ensemble \
 	--batch_size 100"

-docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.0 /bin/bash -c \
+docker run -v ${test_dir}:/mnt -u $UID --memory 30G msahraeian/neusomatic:0.2.1 /bin/bash -c \
 	"python /opt/neusomatic/neusomatic/python/postprocess.py \
 	--reference /mnt/example/Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
 	--tumor_bam /mnt/tumor.bam \