Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failed at the last step #53

Open
cbiitbian opened this issue Nov 24, 2019 · 23 comments
Open

failed at the last step #53

cbiitbian opened this issue Nov 24, 2019 · 23 comments

Comments

@cbiitbian
Copy link

example sample ran well, but when run with a pair of real data set, neusomatic failed at the last step with error:

NeuSomatic stand-alone FAILED: Files ../example/work_standalone/NeuSomatic_standalone.vcf and ../NeuSomatic_standalone.vcf Are Different!

NeuSomatic ensemble FAILED: Files ../example/work_ensemble/NeuSomatic_ensemble.vcf and ../NeuSomatic_ensemble.vcf Are Different!

Please advise what could be wrong

@cbiitbian
Copy link
Author

I see what is wrong here. The last step of comparison is for the testing sample to make sure testing process run successfully.
Can I substitute the input files with my sample sets including normal bam, tumor bam, bed and reference genome and consider the result vcf at /example/work_standalone/NeuSomatic_standalone.vcf and /example/work_ensemble/NeuSomatic_ensemble.vcf are the variant results for my data?

@msahraeian
Copy link
Contributor

@cbiitbian happy to see your interest in NeuSomatic. As you pointed out the test script will only succeed if it is used on test data. For your real application, you should follow the instructions in README. The inputs/models/parameters to use may need modification depending on your data.
Particularly,
1- I would recommend to use newer trained models (NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth for standalone and NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth for ensemble), more information here. Depending how different your data is from our training set you may be interested to train the network too.
2- For ensemble mode, you need to have somatic calls from five other tools (MuTect2, VarDict, Strelka2, SomaticSniper, and MuSE) for your samples, you can find instruction for that here
3- The reference/regions/minimum AFs can also be adjusted as explained https://github.com/bioinform/neusomatic#example-usage. As discussed there, for inference, you need to follow three steps (preprocess/call/postprocess).

Please let me know if you need more information

@cbiitbian
Copy link
Author

cbiitbian commented Nov 26, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian

  1. Yes, it should be the right steps to go. You may also set the number of threads based on your resources. Another important note is that on the test script I used min AF of 5%. If it is too high for your sample, you can modify the parameters. For our new SEQC-II robustness analysis paper, we use the following setting: --scan_maf 0.01 --min_mapq 10 --snp_min_af 0.03 --snp_min_bq 15 --snp_min_ao 3 --ins_min_af 0.02 --del_min_af 0.02.
  2. Each variant call is assigned a quality score in the VCF. Based on this score we categorized the calls to three classes PASS (score>=0.7), LowQUAL (0.4<=score<0.7), and REJECT (score<0.4). This is the default limits we use. You can modify these limits in the postprocessing step using input parameters. You should then use the PASS calls for analysis.
  3. For real samples in NatureComm paper, we used DREAM challenge model NeuSomatic_v0.1.3_standalone_Dream3.pth to infer. There, we used 0.97 pass threshold for CLL1 and COLO-829 samples. For the DREAM challenge stage 3 we also used DREAM challenge model to infer with 0.7 pass threshold. For the SEQC-II robustness analysis paper, we used the default 0.7 pass threshold for whole analysis. With the new SEQC trained models, 0.7 pass threshold should give you similar accuracy on CLL1 and COLO samples.
  4. You can find the information on how to use BAMSurgeon to spike in mutations here.

Also note that the best performance is usually achieved in the ensemble mode.

@cbiitbian
Copy link
Author

cbiitbian commented Nov 28, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian The NeuSomatic_v0.1.0_standalone_WEX_100purity.pth is the model we used in original paper for WES. We have also applied our WGS trained model NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth on WES data in the SEQC paper and it seems to perform good.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 1, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian I cannot see the attachment. Would you please attach the scan.err file again?

@cbiitbian
Copy link
Author

cbiitbian commented Dec 2, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian it is not clear to me. Can you make sure you alignment .bam files have index files .bam.bai with them. Also please make sure the reference file used to align reads matches the one you use here. If it didn't help, can you restrict your bam to a smaller region and send it to me (if the code fails on the restricted bam).

@cbiitbian
Copy link
Author

cbiitbian commented Dec 3, 2019 via email

@cbiitbian
Copy link
Author

cbiitbian commented Dec 3, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian if it is a public sample, it would be good if you could share with me the whole bam file and the command line parameters you use and I will take a look.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 6, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian I think there should be sth wrong with how you run NeuSomatic. I cannot see your attachments. Would you please send them again. You can attach them to the Github message.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 8, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian are you using only PASS calls during evaluation? And if your sample is WEX you should intersect the final vcf with the exons bed.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 20, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian you don't need to output to the same directory. For each step there is an input parameter that specifies the output location. for preprocess.py you can specify the work folder with --work, for call.py you can specify the output folder that with --out, and for postprocess.py you can specify the output vcf using --output_vcf and the work folder with --work.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 24, 2019 via email

@cbiitbian
Copy link
Author

cbiitbian commented Dec 24, 2019 via email

@msahraeian
Copy link
Contributor

@cbiitbian I think there is something wrong here. Are you using HapMap samples as tumor-normal pairs? As HapMap data is public, it would be great if you could share with me the ID of samples you are using as well as your neusomatic run scripts and I can make sure you are using the right settings.

@cbiitbian
Copy link
Author

cbiitbian commented Dec 27, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants