Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bump msconvert #721

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

bernt-matthias
Copy link
Collaborator

just stumbled over msconvert and thought it might be a good idea to update it.

@bernt-matthias bernt-matthias requested a review from chambm July 13, 2023 17:10
Copy link
Contributor

@chambm chambm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this. IIRC I stopped updating because it was a pain to update the test reference files.

@bgruening
Copy link
Member

@chambm linting still fails because of booleans and one data output name

@bernt-matthias
Copy link
Collaborator Author

Will fix it.

@@ -523,20 +550,20 @@

<xml name="msconvertOutput">
<outputs>
<data format="mzml" name="output" label="${($input.name[:-4] if $input.name.endswith('.tar') else $input.name).rsplit('.',1)[0]}.${output_type}" >
<filter>general_options['multi_run_output']['do_multi_run_output'] == False</filter>
<data format="mzml" name="output" label="${tool.name} on ${on_string}" >
Copy link
Contributor

@chambm chambm Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never liked these <tool> on <dataset> names in Galaxy nor that you can only override them by using a workflow. Even in a workflow the renaming isn't very flexible. That's why I had this python code here to name the dataset the same way it would be if you downloaded the file on your computer, ran msconvert locally, then uploaded the result (that was the idea anyway). If I run msconvert on mydatafile.raw to get an mzML, I should get a dataset named mydatafile.mzML, not "msconvert on mydatafile.raw". If lint is erroring on this output renaming, I think the lint is being too aggressive.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had an analogous discussion here #706 (comment) and the solution is to use collections.

We are currently adapting the W4M tools accordingly: workflow4metabolomics/tools-metabolomics#238 workflow4metabolomics/tools-metabolomics#239 .. let me check if the msconvert tool uses name somewhere.

I think the lint is being too aggressive.

We discussed this here: galaxyproject/galaxy#15933 (comment)

The problem is that it's hard to check if the filters are mutually exclusive. So we decided to enforce being explicit and forbid outputs of the same name.

Copy link
Contributor

@mvdbeek mvdbeek Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who decided this ? This sounds like a bad idea. warning is the right level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never liked these <tool> on <dataset> names in Galaxy nor that you can only override them by using a workflow

Yes, 💯, such a bad idea to attempt to capture provenance. I'd say though that input.name is also not exactly the right thing to do. I'd just give it a reasonable default, like ${tool.name}: raw data or whatever is appropriate. If passed a collection the element_identifier will be kept throughout the analysis.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With ${tool.name}: raw data all msconvert jobs in a history would have the same name ... and I would prefer any of the two other possibilities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you never need to demultiplex but need to keep the filename a collection with a single element would work, if you do need to demultiplex, the tool that does the multiplexing only needs to output a collection with the demultiplexed samples, like you've said. I guess I don't understand why you need to address a portion of an input dataset, and why this works when you don't use a dataset collection but fails if you do ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming. TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc. File extensions are an imperfect but very popular and established solution to the real problem of being able to look at a filename and have an idea of what its contents/purpose is. IMO tool wrappers for command-line tools that use file extensions descriptively should preserve that behavior, whether in single datasets or collections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming.

What should the name in that case be ? A common way to deal with this on the command line is to have directories

├── step0
│   └── sample_1.fastq
├── step1
│   └── sample_1.bam
└── step2
    └── sample_1.vcf

the collection gives you a sort of container where the container describes what's inside it.
If you'd treat the dataset name as the immutable thing in the analysis you'd end up with a lot of datasets with the same name (and maybe different extensions) ... that would probably be a little confusing to users.

TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc.

this seems straightforward to change in the new history .. you'd like to always see the datatype extension, even in the summary view, right ? I think that'd be pretty cool.

Screenshot 2023-07-24 at 18 05 46

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that every automatic naming scheme has its pros and cons and there will always be users who like one more than the other.

I admit that I disliked Galaxy's default naming scheme at first, but I got used to it .. which does not mean that it's good :)

It really should work just like the command-line

The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).

So in some sense, Galaxy works exactly as the command line .. it just uses other names than you would. One thing I like about Galaxy is the consistency of the naming scheme even if it may not be the best in all cases.

pretty easy to reconstruct the on section in the client

I found it always very convenient to be able to reconstruct this in most cases without clicking. Butm maybe that's the main point: which information do users want to be visible (without the need for additional clicks):

  • which tool generated the data with which input (data sets)
  • sample name + file type

Both will "fail" for instance for an analysis of the parameter space of a tool. Then you would like to have the value of a specific parameter in the dataset name, e.g. BWA with k=23.

What came to my mind, was how about per user / per history configurable naming schemes?

... here's also WIFF files which contain multiple runs ...

That's just another can of worms. Seems that some proteomics datatypes include references to other files. This won't work in Galaxy (out of the box), since Galaxy stores the datasets using completely different names (using some hash as a name).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).

In most cases in proteomics, the software always picks a default output name based on the input filename (usually by changing file extension). That's why I think it should be the default naming scheme for Galaxy wrappers that wrap such tools. But of course you're right it has pros and cons.

@bgruening
Copy link
Member

Maybe it's time to update the CI

@bernt-matthias bernt-matthias marked this pull request as draft August 7, 2023 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants