Use read mapper as alternative to PrimerSearch
#29
I've been using Jim Kent's stand-alone isPcr (https://anaconda.org/bioconda/ispcr). It does seem consistently much faster, but how much faster will depend on the typical FASTA file size (the first example here is 4.9 MB, one bacterial chromosome), the number of primers (here 141), and how strict you want the matching to be.

This drops to 141 results with a mismatch percentage of 9 or lower. The tool is a lot faster with stricter settings. However, timing best of 5 on the same machine, with the default settings:
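To try the same comparison, the primer pairs need to be in the three-column query format that isPcr reads (pair name, forward primer, reverse primer, one pair per line). A minimal sketch of writing that file; the pair name and primer sequences here are made-up examples, not real primers:

```python
# Sketch: write primer pairs in the whitespace-separated query format
# that Jim Kent's isPcr expects (name, forward primer, reverse primer).
# The sequences below are illustrative placeholders only.
import io


def write_ispcr_query(primer_pairs, handle):
    """Write primer pairs as isPcr query lines: name<TAB>fwd<TAB>rev."""
    for name, (fwd, rev) in sorted(primer_pairs.items()):
        handle.write(f"{name}\t{fwd}\t{rev}\n")


pairs = {
    "pair_001": ("ACGTACGTACGTACGTAC", "TTGCAAGGCTTGCAAGGC"),
}
buf = io.StringIO()
write_ispcr_query(pairs, buf)
print(buf.getvalue(), end="")
```

In practice the handle would be a real file passed to `isPcr genome.fa primers.txt output`; `StringIO` is used here only to keep the sketch self-contained.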
Thanks @peterjc - that's a considerable speed-up, and worth considering as a drop-in replacement for PrimerSearch - it could have a significant effect on runtime. A quick Google suggests it's not actively maintained, and the documentation is just the usage string. There doesn't appear to be an M1-compiled version yet, which is a potential question for future application/testing (but may be inconsequential - I don't have an M1 to test compilation on).

We're not treating PrimerSearch as a gold standard, so I don't think it's essential that the results are identical to ispcr (but if they were very different there would be other questions). So long as they each capture plausible cross-hybridisation, their performance should be good enough to generate candidates for in vitro validation.
I recently emailed Jim Kent about a trivial bug (one of the documented options is not implemented), partly to see if he replies. Putting this in a positive light, isPcr is a mature, stable tool ;) I would expect the bioconda community to tackle M1 compilation, but I have not looked into that. Right now I'm running a larger test case - a much bigger genome and lots more primers.
Large genome example: 173 contigs in a 116 MB FASTA file, with 2634 candidate primers (only 2300 unique), simple test script:
Best of 5 with ispcr:
Single test with primersearch:
I had run this with primersearch on the same cluster (but likely a different node), and that run also took about 1h30. So, ispcr did not return as many hits, but did so in less than 1/1000 of the time. In the bacterial test the speed-up wasn't quite as dramatic, needing approximately 1/100 of the time. There are subtleties in the differences between the two outputs, i.e. switching in silico PCR tool could be a non-trivial behavioural change, but it is worth considering if this is a computational bottleneck.
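Comparing hit counts between the tools means tallying amplicons per primer pair from isPcr's output. A minimal sketch, assuming isPcr's default FASTA output with headers of the form `>target:start+end pair_name ...` (check this against real output from your isPcr version before relying on it):

```python
# Sketch: count isPcr hits per primer pair from its default FASTA output.
# ASSUMPTION: headers look like ">target:start+end pair_name ...";
# adjust the field index if your isPcr version formats headers differently.
from collections import Counter


def count_hits(lines):
    """Tally amplicons per primer pair from FASTA header lines."""
    hits = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if len(fields) >= 2:
                hits[fields[1]] += 1  # second field taken as the pair name
    return hits


example = [
    ">contig1:100+600 pair_001 501bp",
    "ACGTACGT",
    ">contig7:2200+2700 pair_001 501bp",
    "ACGTACGT",
    ">contig2:50+300 pair_002 251bp",
    "ACGTACGT",
]
print(count_hits(example))  # pair_001 amplifies twice in this toy input
```

A pair amplifying at more than one locus (like `pair_001` above) is exactly the kind of cross-hybridisation candidate this comparison is meant to surface.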
😁 That is true!
I didn't go so far as to inspect the recipe to see if there's a (re-)compilation, or if the executable is pulled down as-is.
That is a striking speed increase!
I have no problem with having ispcr as an option (maybe along with other tools/approaches, too). The speed increase alone will make some jobs tractable - especially on limited hardware - that would not otherwise be accessible.
In my experience the problem we've had - even with PrimerSearch looking for the potential to amplify some sites with "wobble" in the match - is that we under-predict the extent of cross-amplification. The balance to strike here is between speed of computation and waste of time/material in testing candidate primer sets that could be excluded with less effort. Generally, computation is cheap and you can get on with other things while the code runs. Lab staff and consumables are more expensive in terms of time and cash.

With that in mind, a mismatch of zero is, more or less, equivalent to a grep for the primer sequences. We would naturally expect this to be computationally efficient. To the extent that it always finds cross-hybridisation with exact matches, it can rule out some primer sets. But it is likely to let through a number of sets that amplify despite having numerous mismatches ("false positives", in the sense of a "positive" being a discriminatory primer set). The computational challenge becomes harder as we try to more closely approximate the in vitro behaviour of the primers and reduce our "false positives" closer to zero.

I'd like the balance to be in the hands of the user so, as I say, I've no objection to having several options for cross-hybridisation detection, including ispcr (with zero mismatches if the user wants it), but I felt I should make the case for why faster isn't necessarily better. I imagine that a deep-learning classifier, trained on a sufficient number of in vitro examples, would do a much better job than either tool… I wonder if anyone would fund us to do that? ;)
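To make the "mismatch of zero is more or less a grep" point concrete: an exact-match check only needs substring search on the genome for each primer and its reverse complement. A minimal sketch with toy sequences:

```python
# Sketch: the zero-mismatch case really is just string search for each
# primer on both strands, as argued above. Toy sequences only - this
# ignores amplicon orientation, distance, and everything else a real
# in silico PCR tool checks.

COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMP)[::-1]


def exact_hit(genome, primer):
    """True if the primer occurs exactly in the genome on either strand."""
    return primer in genome or revcomp(primer) in genome


genome = "TTTACGTACGTAAAGGGCCC"
print(exact_hit(genome, "ACGTACGT"))   # True: forward-strand exact match
print(exact_hit(genome, "GGGCCCTTT"))  # True: matches via reverse complement
print(exact_hit(genome, "AAAAAAAA"))   # False: no match on either strand
```

The limitation is exactly the one described above: this finds only perfect matches, so primer sets that would amplify despite several mismatches slip through.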
It should be possible to restrict the maximum amplicon length with ispcr. When doing metabarcoding there's a natural restriction on the expected amplicon size that conveniently lets us set a hard limit.
I've come round to the view that giving the user the option to use their choice of tool, and to prioritise either speed or accuracy according to their means and needs, is a better option than switching. But ispcr would clearly massively speed up the process and, if you've a robotised lab, you might not mind testing all the candidates a slower tool would rule out.
Indeed.
It should be possible to speed up the in silico hybridisation step by using bwa or similar to map candidate primers. The output of this step could be made to resemble PrimerSearch output, to minimise the extra effort required to process the result.
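A minimal sketch of the "make it resemble PrimerSearch output" idea: take paired hits from a mapper and print them in a PrimerSearch-like report. The `Hit` record and the exact report layout here are assumptions for illustration - they should be checked against real EMBOSS primersearch output before any downstream parser relies on them:

```python
# Sketch: format paired mapper hits as a PrimerSearch-style report, so
# existing result-processing code needs minimal changes. ASSUMPTION: the
# Hit fields and the line layout below only approximate real EMBOSS
# primersearch output - verify against genuine output before use.
from collections import namedtuple

Hit = namedtuple("Hit", "target fwd_pos rev_pos fwd_mm rev_mm length")


def primersearch_style(name, hits):
    """Render hits for one primer pair as PrimerSearch-like text."""
    lines = [f"Primer name {name}", ""]
    for i, h in enumerate(hits, 1):
        lines.append(f"Amplimer {i}")
        lines.append(f"\tSequence: {h.target}")
        lines.append(f"\tforward strand hit at {h.fwd_pos} "
                     f"with {h.fwd_mm} mismatches")
        lines.append(f"\treverse strand hit at [{h.rev_pos}] "
                     f"with {h.rev_mm} mismatches")
        lines.append(f"\tAmplimer length: {h.length} bp")
    return "\n".join(lines)


report = primersearch_style("pair_001", [Hit("contig1", 101, 75, 0, 1, 512)])
print(report)
```

Keeping the adapter this thin means the choice of mapper (bwa or otherwise) stays an implementation detail behind a stable report format.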