Check InterProScan seqtype #5891
@neoformit please LMK if you have any comments on the UX or anything else.
Looks good to me, is it possible to add "detected nucleotide when you selected protein" to the error message, just to be a bit more specific for the user?
@neoformit that would be better, but the values of the seqtype variable are "p" and "n" so it makes the command section a bit nastier. See 7a25359 - worth it?
Since you already have the
Thanks! @TomHarrop can you look at the failing lint, please? How is interproscan failing and how fast? We could also add a
Some thoughts for the future, because this is now coming up more and more often:
Thanks @bgruening . The lint failure is because of InterProScan's non-standard version numbering.
I believe it doesn't fail but creates huge jobs that run for days, and there is a similar problem with blastp. My understanding is that @igormakunin has to cancel jobs manually to keep the queue moving. Hopefully he can comment.
I don't think there is any technical difference in FASTA that we can rely on, e.g. "ATGC" is a valid peptide string.
This seems like a good option to me. I could look into that if you point me in the right direction.
As above we might not detect them reliably, but I don't know how the sniffer works under the hood.
Ages ago I started developing a Python lib. It has validators for DNA, protein, sequence count etc. and some common formatting for FASTA files. It could also be implemented for FASTQ, but I haven't done that yet. It writes to stderr, which results in user-friendly error messages in Galaxy. Maybe there are existing libraries, but I don't know of one that does exactly this. It wraps BioPython etc. under the hood.
This seems like a nice solution, but the logic for checking and raising error messages still has to be re-implemented in every wrapper, right?
@TomHarrop how do we proceed here? Should we merge what you have here?
@bgruening yes please. @igormakunin is still seeing jobs where the input doesn't match the selected data type. It's not the ideal fix, because the data still gets copied to the runner before the check runs, but I'm not aware of any other fixes that have been made.
Can you rebase and fix the linting? Thanks!
Yikes, what happened here? Don't merge this, I must have rebased off the wrong branch.
75da91c to 7a25359
Thanks @bgruening, this should be OK now.
@abretaud is that ok with you? +1 from my side.
FOR CONTRIBUTOR:
This tool is causing a bit of support work when users input nucleotide sequences but select protein as the sequence type.
This greps for anything that is not an IUPAC nucleotide and checks the result against the selected seqtype.
It won't always be correct because a protein consisting only of amino acid residues that are in the nucleotide alphabet is valid (but probably not common). It could also take a long time to search through valid nucleotide input. I could limit it to the first 1000 lines or something?
Ping @igormakunin
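
For reference, the check described above could be sketched in Python roughly as follows. This is not the code in the PR (which greps in the tool's command section); the function names, the `SEQTYPE_LABELS` mapping, and the `max_lines` default of 1000 are assumptions made for illustration. It treats any character outside the IUPAC nucleotide alphabet as evidence of protein input, so gap or stop characters would also count as non-nucleotide.

```python
import re
from itertools import islice

# Any character that is not an IUPAC nucleotide code (case-insensitive).
NON_NUCLEOTIDE = re.compile(r"[^ACGTURYSWKMBDHVN]", re.IGNORECASE)

# Hypothetical mapping from the tool's seqtype values to readable labels.
SEQTYPE_LABELS = {"p": "protein", "n": "nucleotide"}


def detect_seqtype(lines, max_lines=1000):
    """Guess the sequence type from the first max_lines lines of FASTA input.

    Returns "p" as soon as a sequence line contains a non-nucleotide
    character, otherwise "n". A protein made up only of residues that are
    also nucleotide codes (e.g. "ATGC") is reported as "n".
    """
    for line in islice(lines, max_lines):
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        if NON_NUCLEOTIDE.search(line):
            return "p"
    return "n"


def check_seqtype(lines, selected, max_lines=1000):
    """Return an error message if the guess disagrees with `selected`, else None."""
    detected = detect_seqtype(lines, max_lines)
    if detected != selected:
        return (
            f"Detected {SEQTYPE_LABELS[detected]} input when you selected "
            f"{SEQTYPE_LABELS[selected]}."
        )
    return None


if __name__ == "__main__":
    fasta = [">seq1", "ATGCATGCNNRY"]
    print(check_seqtype(fasta, "p"))
```

Capping the scan at `max_lines` bounds the cost on large, valid nucleotide input, at the price of missing a mismatch that only shows up later in the file.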