bump msconvert #721

bernt-matthias · 2023-07-13T17:10:10Z

just stumbled over msconvert and thought it might be a good idea to update it.

chambm

Thanks for updating this. IIRC I stopped updating because it was a pain to update the test reference files.

bgruening · 2023-07-15T23:25:27Z

@chambm linting still fails because of booleans and one data output name

bernt-matthias · 2023-07-16T05:10:47Z

Will fix it.

chambm · 2023-07-17T15:40:03Z

tools/msconvert/msconvert_macros.xml

@@ -523,20 +550,20 @@

  <xml name="msconvertOutput">
    <outputs>
-      <data format="mzml" name="output" label="${($input.name[:-4] if $input.name.endswith('.tar') else $input.name).rsplit('.',1)[0]}.${output_type}" >
-        <filter>general_options['multi_run_output']['do_multi_run_output'] == False</filter>
+      <data format="mzml" name="output" label="${tool.name} on ${on_string}" >


I've never liked these <tool> on <dataset> names in Galaxy nor that you can only override them by using a workflow. Even in a workflow the renaming isn't very flexible. That's why I had this python code here to name the dataset the same way it would be if you downloaded the file on your computer, ran msconvert locally, then uploaded the result (that was the idea anyway). If I run msconvert on mydatafile.raw to get an mzML, I should get a dataset named mydatafile.mzML, not "msconvert on mydatafile.raw". If lint is erroring on this output renaming, I think the lint is being too aggressive.

We had an analogous discussion here #706 (comment) and the solution is to use collections.

We are currently adapting the W4M tools accordingly: workflow4metabolomics/tools-metabolomics#238 workflow4metabolomics/tools-metabolomics#239 .. let me check if the msconvert tool uses name somewhere.

I think the lint is being too aggressive.

We discussed this here: galaxyproject/galaxy#15933 (comment)

The problem is that it's hard to check if the filters are mutually exclusive. So we decided to enforce being explicit and forbid outputs of the same name.

Who decided this ? This sounds like a bad idea. warning is the right level.

I've never liked these <tool> on <dataset> names in Galaxy nor that you can only override them by using a workflow

Yes, 💯, such a bad idea to attempt to capture provenance. I'd say though that input.name is also not exactly the right thing to do. I'd just give it a reasonable default, like ${tool.name}: raw data or whatever is appropriate. If passed a collection the element_identifier will be kept throughout the analysis.

With ${tool.name}: raw data all msconvert jobs in a history would have the same name ... and I would prefer any of the two other possibilities.

If you never need to demultiplex but need to keep the filename a collection with a single element would work, if you do need to demultiplex, the tool that does the multiplexing only needs to output a collection with the demultiplexed samples, like you've said. I guess I don't understand why you need to address a portion of an input dataset, and why this works when you don't use a dataset collection but fails if you do ?

I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming. TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc. File extensions are an imperfect but very popular and established solution to the real problem of being able to look at a filename and have an idea of what its contents/purpose is. IMO tool wrappers for command-line tools that use file extensions descriptively should preserve that behavior, whether in single datasets or collections.

I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming.

What should the name in that case be ? A common way to deal with this on the command line is to have directories

├── step0 │ └── sample_1.fastq ├── step1 │ └── sample_1.bam └── step2 └── sample_1.vcf

the collection gives you a sort of container where the container describes what's inside it.
If you'd treat the dataset name as the immutable thing in the analysis you'd end up with a lot of datasets with the same name (and maybe different extensions) ... that would probably be a little confusing to users.

TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc.

this seems straightforward to change in the new history .. you'd like to always see the datatype extension, even in the summary view, right ? I think that'd be pretty cool.

I think that every automatic naming scheme has its pros and cons and there will always be users who like one more than the other.

I admit that I disliked Galaxy's default naming scheme at first, but I got used to it .. which does not mean that it's good :)

It really should work just like the command-line

The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).

So in some sense, Galaxy works exactly as the command line .. it just uses other names than you would. One thing I like about Galaxy is the consistency of the naming scheme even if it may not be the best in all cases.

pretty easy to reconstruct the on section in the client

I found it always very convenient to be able to reconstruct this in most cases without clicking. Butm maybe that's the main point: which information do users want to be visible (without the need for additional clicks):

which tool generated the data with which input (data sets)

sample name + file type

Both will "fail" for instance for an analysis of the parameter space of a tool. Then you would like to have the value of a specific parameter in the dataset name, e.g. BWA with k=23.

What came to my mind, was how about per user / per history configurable naming schemes?

... here's also WIFF files which contain multiple runs ...

That's just another can of worms. Seems that some proteomics datatypes include references to other files. This won't work in Galaxy (out of the box), since Galaxy stores the datasets using completely different names (using some hash as a name).

The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).

In most cases in proteomics, the software always picks a default output name based on the input filename (usually by changing file extension). That's why I think it should be the default naming scheme for Galaxy wrappers that wrap such tools. But of course you're right it has pros and cons.

bgruening · 2023-08-07T07:57:48Z

Maybe it's time to update the CI

make booleans -> selects

bernt-matthias requested a review from chambm July 13, 2023 17:10

chambm approved these changes Jul 14, 2023

View reviewed changes

chambm reviewed Jul 17, 2023

View reviewed changes

bernt-matthias mentioned this pull request Jul 18, 2023

[23.0] Allow duplicate labels in linter if outputs contain filters galaxyproject/galaxy#15933

Merged

4 tasks

bernt-matthias closed this Aug 7, 2023

bernt-matthias reopened this Aug 7, 2023

bernt-matthias force-pushed the bump/msconvert branch from ab42080 to c58d281 Compare August 7, 2023 12:50

bernt-matthias marked this pull request as draft August 7, 2023 12:50

bernt-matthias added 7 commits August 7, 2023 14:53

bump msconvert

155bf27

add missing homepage URL

f0f2954

rename README

cc384e3

fix tests

e0ebded

fix conditionals

413ec66

make booleans -> selects

rename labels

74a2c34

fix tests

c58d281

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bump msconvert #721

bump msconvert #721

bernt-matthias commented Jul 13, 2023

chambm left a comment

bgruening commented Jul 15, 2023

bernt-matthias commented Jul 16, 2023

chambm Jul 17, 2023 •

edited

Loading

bernt-matthias Jul 17, 2023

mvdbeek Jul 24, 2023 •

edited

Loading

mvdbeek Jul 24, 2023

bernt-matthias Jul 24, 2023

mvdbeek Jul 24, 2023

chambm Jul 24, 2023

mvdbeek Jul 24, 2023

bernt-matthias Jul 24, 2023

chambm Jul 24, 2023

bgruening commented Aug 7, 2023

bump msconvert #721

Are you sure you want to change the base?

bump msconvert #721

Conversation

bernt-matthias commented Jul 13, 2023

chambm left a comment

Choose a reason for hiding this comment

bgruening commented Jul 15, 2023

bernt-matthias commented Jul 16, 2023

chambm Jul 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mvdbeek Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bgruening commented Aug 7, 2023

chambm Jul 17, 2023 •

edited

Loading

mvdbeek Jul 24, 2023 •

edited

Loading