Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check InterProScan seqtype #5891

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

TomHarrop
Copy link
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

This tool is causing a bit of support work when users input nucleotide sequences but select protein as the sequence type.

This greps for anything that is not an IUPAC nucleotide and checks the result against the selected seqtype.

It won't always be correct because a protein consisting only of amino acid residues that are in the nucleotide alphabet is valid (but probably not common). It could also take a long time to search through valid nucleotide input. I could limit it to the first 1000 lines or something?

Ping @igormakunin

@TomHarrop
Copy link
Contributor Author

screenshot-localhost_9090-2024 03 20-13_31_32


screenshot-localhost_9090-2024 03 20-13_31_09


@neoformit please LMK if you have any comments on the UX or anything else.

@neoformit
Copy link
Contributor

Looks good to me, is it possible to add "detected nucleotide when you selected protein" to the error message, just to be a bit more specific for the user?

@TomHarrop
Copy link
Contributor Author

TomHarrop commented Mar 20, 2024

@neoformit that would be better, but the values of the seqtype variable are "p" and "n" so it makes the command section a bit nastier. See 7a25359 - worth it?

@neoformit
Copy link
Contributor

Since you already have the if grep -q '[^[:space:]]' <<< \${match}; then line, yeah I think this is worth it. Looks good, thanks Tom!

@bgruening
Copy link
Member

Thanks! @TomHarrop can you look at the failing lint, please?

How is interproscan failing and how fast? We could also add a stdio, catch this error and provide your better error description.

Some thoughts for the future, because this is now coming up more and more often:

  • do we need separate datatypes for nt, and prot
  • do we need a library/script that checks the content and can be shared?
  • should this be part of the fasta sniffer (it will have performance implications)

@TomHarrop
Copy link
Contributor Author

Thanks @bgruening . The lint failure is because of InterProScan's non-standard version numbering.

How is interproscan failing and how fast? We could also add a stdio, catch this error and provide your better error description.

I believe it doesn't fail but creates huge jobs that run for days, and there is a similar problem with blastp. My understanding is that @igormakunin has to cancel jobs manually to keep the queue moving. Hopefully he can comment.


do we need separate datatypes for nt, and prot

I don't think there is any technical difference in FASTA that we can rely on, e.g. "ATGC" is a valid peptide string.


do we need a library/script that checks the content and can be shared?

This seems like a good option to me. I could look into that if you point me in the right direction.


should this be part of the fasta sniffer (it will have performance implications)

As above we might not detect them reliably, but I don't know how the sniffer works under the hood.

@neoformit
Copy link
Contributor

do we need a library/script that checks the content and can be shared?

Ages ago I did start developing a Python lib fastkit for that reason: https://github.com/neoformit/fastkit

It has a validator for DNA, protein, sequence count etc. and some common formatting for FASTA files. Could also implement for FASTQ but haven't yet. It raises stderr that results in user-friendly error messages in Galaxy.

Maybe there are existing libraries but I don't know of one that does this exactly. It wraps BioPython etc. under the hood.

@neoformit
Copy link
Contributor

should this be part of the fasta sniffer (it will have performance implications)

This seems like a nice solution but the logic for checking and raising error messages still has to be re-implemented in every wrapper, right?

@bgruening
Copy link
Member

@TomHarrop how do we proceed here? Should we merge what you have here?

@TomHarrop
Copy link
Contributor Author

TomHarrop commented Oct 13, 2024

@bgruening yes please. @igormakunin is still seeing jobs where the input doesn't match the selected data type.

It's not the ideal fix because the data still gets copied to the runner before the check runs but I'm not aware of any other fixes that have been made.

@bgruening
Copy link
Member

Can you rebase and fix the linting ? Thanks!

@TomHarrop
Copy link
Contributor Author

Yikes what happened here?

Don't merge this, I must have rebased off the wrong branch.

@TomHarrop TomHarrop changed the title Check InterProScan seqtype DON'T MERGE Check InterProScan seqtype Oct 14, 2024
@TomHarrop TomHarrop changed the title DON'T MERGE Check InterProScan seqtype Check InterProScan seqtype Oct 14, 2024
@TomHarrop TomHarrop marked this pull request as ready for review October 14, 2024 05:44
@TomHarrop
Copy link
Contributor Author

Thanks @bgruening this should be OK now.

@bgruening
Copy link
Member

@abretaud is that ok with you?

+1 from my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants