Update biobox_add_taxid wrapper #6344
I changed the column selection parameter from data_column to integer, because you can have multiple files as input that all share the same column. With data_column this does not work: my workflow generates a collection, and the tool stops there because it has no dataset to reference for the column, or at least it still throws an error if you try to use it this way. |
@bgruening could you merge this quickly so that it is in the bot update scope? I really need it this weekend! |
Can you please explain this in more detail? This change seems backwards, and maybe there is a Galaxy bug to fix. |
Yes, I can explain it in more detail. In this workflow https://usegalaxy.eu/u/santinof/w/gtdb-tk-subworkflow-1 it is possible that GTDB-Tk outputs 2 summary files. These 2 files then run through 2 other tools. The last one is Names2taxID, whose output is needed as input for this tool. The problem is that you have to set the column in which the names appear in the Names2taxID output, but since there are 2 files, Galaxy cannot refer to a single file with the data_column type. So I changed it to use an integer as a workaround: both files have the same format, so they share the same column, which has to be stated. Here is the error message:
And here is a history as an example: I hope this explains the change; if not, I can give more details about it! |
Okay, maybe this change doesn't need to be made. I find it strange that when I got only 1 summary file from GTDB-Tk, I ended up with a list of lists instead of 2 files in a list. I only noticed this now because I let my workflow run up to the error so that I could work with the data manually to get some results. There was one history where this error did not appear, since GTDB-Tk yielded a list with 2 files and not a list of lists. I will now try the workflow with the flatten tool to see whether that removes the error, and then I will either close the PR or give more details here. |
The workflow still hit the error, this time in both runs: https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batcami-low-3 Here is the history where the tool did work; the only difference was that the output from Names2taxID was not flattened before being passed to biobox add taxid. https://usegalaxy.eu/u/santinof/h/mag-benchmark-workflow-without-batmarine-sample-0-2 |
It should not happen that the flatten tool runs on a collection that has not been created yet, should it? In the linked history where the error did not appear, flatten seems to have worked, but only there. After Names2taxID had run, I tried flatten again, and there you can see that it works, so I think there is a bug in Galaxy with flatten? For my workflow I am trying a workaround: using a subworkflow to see whether flatten works there, since it is then forced to wait for the result. |
Hmm, I can only assume this, but the input in the history you provided is list(samples):list(summary files). I assume that, as such, the tool wants to get the column from the first level (which is a collection, not a file). Maybe we could just merge the summary files (one is for archaea and one for bacteria, right?) to overcome the difficult-to-handle collection structure? |
In general, I am wondering how the logic of |
Can you also explain why there can be multiple inputs here: https://github.com/galaxyproject/tools-iuc/blob/303002db06287fb25306020c4391626842f52162/tools/cami_amber/biobox_add_taxid.xml#L86C23-L86C115 |
Correct. That is why I want to use the flatten tool, to have all datasets on one level, but the explanation above shows that this tool runs without waiting for the needed outputs. Even if we merge them, in the list:list situation it will still yield this error. You can see this in the CAMI error workflow linked above: there Names2taxID has only 1 file as output, but still in the list:list datatype, which means it does not work.
From what I tested, the data_column param type can still be used when you have multiple files. The only problem that can occur is that the multiple files do not share a specific format, which means that the chosen column is not the same across all files. The error is still shown when trying it manually, but Galaxy still runs the tool. You can see this in the history without the error (marine-sample-0-2) linked above. There you can try to run biobox add taxid to see the "error" message in the column parameter GUI. |
Can you name the input that I should explain in more detail? :) |
So the main problem here is that you have nested lists. Is this expected or a potential problem of the tools running upstream in the workflow? I do not understand yet: does flattening the collection not help? |
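To make the nested-list shape discussed here concrete, the following is a minimal Python sketch. It does not use the Galaxy API; it only models a list:list collection as a plain dict of dicts and shows what flattening to a single level means. The function name `flatten_collection`, the sample identifiers, and the file names are all hypothetical and chosen for illustration.

```python
from typing import Dict


def flatten_collection(nested: Dict[str, Dict[str, str]], sep: str = "_") -> Dict[str, str]:
    """Flatten a two-level mapping (outer id -> inner id -> dataset) into one level.

    The new element identifier joins the outer and inner identifiers with `sep`,
    so every dataset stays uniquely addressable after flattening.
    """
    flat: Dict[str, str] = {}
    for outer_id, inner in nested.items():
        for inner_id, dataset in inner.items():
            flat[f"{outer_id}{sep}{inner_id}"] = dataset
    return flat


# Hypothetical list(samples):list(summary files) structure.
# The second sample has only one summary file but is still nested.
nested = {
    "sample_0": {
        "bacteria": "sample_0.bacteria.summary.tsv",
        "archaea": "sample_0.archaea.summary.tsv",
    },
    "sample_1": {
        "bacteria": "sample_1.bacteria.summary.tsv",
    },
}

flat = flatten_collection(nested)
for element_id, dataset in flat.items():
    print(element_id, dataset)
```

Note that a sample with a single summary file is still one level deeper than a plain list, which matches the observation that one output file can still arrive as list:list.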
Correct, and it is not expected, since I only want a list as input. How this happens I cannot explain, but this is why I built in the flatten tool, to eliminate the nested list. Now to the real problem: it seems, as shown here, that flatten is executed right after the job is created, which does not follow the workflow logic, since it should have waited until Names2taxID had finished, because that is its input. I am now trying to work around that: I split my subworkflow into 2 workflows so that flattening happens in the second one and is (hopefully) forced to wait until all outputs from the first subworkflow are created. I hope this helps in understanding the problem a bit better? |
That might also be a problem, but I think your primary problem is that an upstream tool produces a nested list and you/we need to understand why. |
Okay, now I know how the nested list is generated. It comes from the batch mode of a different tool, which is expected, since GTDB-Tk can produce 2 files, which have to be handled upstream. Now I have a question, since I could not find one: is there a tool to merge 2 TSV files into one file, where the content is merged by row and not by column? This might solve the problem; the alternative is to change this tool so that the param is an integer and not data_column. |
There are quite a few tools to concatenate files (one below the other), e.g. https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cat%2F9.3%2Bgalaxy1&version=latest For pasting (adding new columns): https://usegalaxy.eu/?tool_id=Paste1&version=latest I will close this here. Feel free to reopen if you still think it's a bug. Otherwise we can continue the discussion on Gitter
or of course at https://help.galaxyproject.org/ |
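For readers who want to see what "one below the other" means for two summary tables, here is a minimal Python sketch of row-wise TSV concatenation. It is not the implementation of the Galaxy tools linked above. It assumes that every input has the same header row, which is kept only once; the function name `concat_tsv_rows`, the column names, and the sample rows are illustrative assumptions.

```python
import csv
import io
from typing import Iterable, List, Optional


def concat_tsv_rows(tables: Iterable[str]) -> str:
    """Concatenate TSV texts row-wise, writing the shared header only once.

    Assumption: every non-empty input starts with the same header row.
    A differing header raises ValueError instead of silently misaligning columns.
    """
    header: Optional[List[str]] = None
    out_rows: List[List[str]] = []
    for text in tables:
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if not rows:
            continue  # skip empty inputs
        if header is None:
            header = rows[0]
            out_rows.append(header)
        elif rows[0] != header:
            raise ValueError("input tables do not share the same header")
        out_rows.extend(rows[1:])
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(out_rows)
    return buffer.getvalue()


# Illustrative content only; real summary files have more columns.
bacteria = "user_genome\tclassification\nbin_1\td__Bacteria\n"
archaea = "user_genome\tclassification\nbin_2\td__Archaea\n"

merged = concat_tsv_rows([bacteria, archaea])
print(merged, end="")
```

Because the merged table keeps a single header and the same column order, the name column sits at the same position for every row, so one column index is valid for the whole file.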
@SantaMcCloud there was also a fix, maybe related, in Galaxy, so check out latest EU. |