-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add beacon2 tools (vcf2bff.pl, pxf2bff, csv2xlsx and bff-validator) into galaxy #5442
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of stuff in your test-data directory, can you look at this and remove all unneeded stuff? Thanks.
tools/beacon2/macros.xml
Outdated
</xrefs> | ||
</xml> | ||
<xml name="vcf2bff_param"> | ||
<param argument="--input" type="data" format="tabular.gz" label="Annotated vcf file" help="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not have this in the main tool? this is not shared with any other tool isn't it?
tools/beacon2/macros.xml
Outdated
<option value="hash">hash</option> | ||
<option value="json">json</option> | ||
</param> | ||
<param argument="--project-dir" type="text" label="Beacon project dir" value="" help="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mh, this will not work I fear. What is expected to be there? User don't know where the data is stored...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it like this as I don't know where Galaxy tends to keep things can I use pwd as a parameter? I used it to test locally and worked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of data is in there? Is that static data that never changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding is correct, it is to point out the directory where the inputs are/where the output should go to. I wasn't sure how to do that for Galaxy as the work will be carried out remotely in history, so I thought about the pwd, and surprisingly that worked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But pwd
is doing nothing. Its a random string. If you want to specify the current working dir you can use ./
or \$PWD
. What is strange is that it still turned green. So either this parameter is not used or your tests are not working :(
I changed it now to ./
to see if this also works.
tools/beacon2/vcf2bff.xml
Outdated
<option value="hash">hash</option> | ||
<option value="json">json</option> | ||
</param> | ||
<param argument="--dataset-id" type="text" label="Dataset ID" value="" help="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add help text to the parameters? I don't know what should be included here. How should a user know such an ID?
tools/beacon2/vcf2bff.xml
Outdated
<option value="json">json</option> | ||
</param> | ||
<param argument="--dataset-id" type="text" label="Dataset ID" value="" help="" /> | ||
<param argument="--genome" type="text" label="Reference genome" value="" help="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well ... maybe we can offer a select parameter with the most common reference genomes and IDs?
tools/beacon2/vcf2bff.xml
Outdated
b) Standard JSON [json] | ||
c) Perl hash data structure [hash] | ||
|
||
HOW TO RUN VCF2BFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit useless. Can you please enhance the tool help a bit.
@khaled196 please clean up the test-data directory, I don't think those files are needed. |
…and fix some labellings
Hi @bgruening I am working on wrapping the tools needed for the users to prepare for data before sharing them or inquiring about them on MongoDB. I have created the XML wrapper files for them. Among those tools vcf2bff.xml, pxf2bff.xml, csv2xlsx.xml and bff_validator.xml The tools vcf2bff.xml, pxf2bff.xml, and csv2xlsx.xml work fine on Galaxy (tested on Planemo). I just waiting to meet with someone from the beacon team to add the useful labelling and documentation for them. bff_validator.xml is the stubborn one. It failed without a mentioned reason in the JSON file. The trick with this tool it uses a directory deref_schemas as an input, maybe the way I am calling it within the wrapped is not correct. However, I don't have an interoperable error to debug it. |
tools/beacon2/vcf2bff.xml
Outdated
--genome '$genome' | ||
]]></command> | ||
<inputs> | ||
<param argument="--input" type="data" format="tabular.gz" label="Annotated VCF file" help="The output genomic variations VCF file of bcftools, snpeff, snpsift" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khaled196 why is the format here tabular.gz and not VCF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank You @bgruening for looking into it.
Originally it was VCF. However, because the file is compressed every time I tested the tool it complained about the gz(zipped) part. I searched for VCF.gz format, but I couldn't find it in the file types that are supported in Galaxy and I found the tabular.gz which I thing VCF is a tabular also, so I thought that should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you are searching for vcf_bgzip
https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/datatypes_conf.xml.sample#L501
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bgruening . I had an error before because I changed it in the input format and the test ftype.
If I change the Ftype from tabular.gz to vcf_bgzip I will get an error.
Error Massage
"Input staging problem: Job in error state.. tool_id: upload1, exit_code: None, stderr: .",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgruening won't the user get an error if the format is left as vcf_bgzip in that case?
The tool takes the input file in the compressed format, processes it and gives the output in the compressed format, so it doesn't work if the input file is uncompressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can test this with planemo serve ...
But if the format is a VCF its not a tabular and we should fix the real problem that you have. Which I currently do not understand.
Oh I see the tests are passing ... where do you see the this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested that locally on my VM. the error when I change both the input and the test into vcf_bgzip
param argument="--input" type="data" format="vcf_bgzip" label="Annotated vcf file" help="The output genomic variations VCF file of bcftools, snpeff, snpsift" /
param name="input" ftype="vcf_bgzip" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" /
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have prefixed the tool ID and tried to shorten the description of the tool as well as trying to make clear that those are not general CSV to XLSX converter tools, but specific to Beacon.
Is that ok for you? |
Thank you @bgruening . If possible I want to ask for another help. The tool I have a problem with is the bff-validator. I deleted it to merge the repo. After merging the pull request, I will open a pull request for it. If you can take a look at it. |
tools/beacon2/vcf2bff.xml
Outdated
</outputs> | ||
<tests> | ||
<test expect_num_outputs="1"> | ||
<param name="input" ftype="tabular.gz" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<param name="input" ftype="tabular.gz" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" /> | |
<param name="input" ftype="vcf_bgzip" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" /> |
Maybe we have a confusion here. Is this a proper Bgzip compressed file? Or something else?
e.g. samtools/bcftools#668
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you see a few vcf.gz files: https://github.com/galaxyproject/tools-iuc/tree/5ab93133efa5de410e992919c4e841807fb43b55/tools/clair3/test-data
that are known to work on other variant calling tools ... maybe you can try them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VCF.gz file is not normal by itself, it is the output of prepossessing it with bcftools, snpeff and snpsift. Kindly find out how they created that file. run_vcf2bff.sh
Test it with htsfile and return this.
beacon2/test-data$ htsfile test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz
test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz: VCF version 4.1 gzip-compressed var iant calling data
While testing the vcf.gz files in the link dir returned
(.venv) (new_beacon) khaled@galaxy-planemo-khaled-2:~/khaled/galaxy_beacon/tools-iuc/tools/beacon2/test-data$ htsfile merge_output_1.vcf.gz
merge_output_1.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgruening sorted the file format using the steps in the link above, but now it returns an empty JSON file.
tools/beacon2/vcf2bff.xml
Outdated
<param argument="--genome" type="text" label="Reference genome" value="" help="Select the reference genome used the annotate the data to create the genomicVariations dataset examples for reference genomes are hs37, hg37 and hg38" /> | ||
</inputs> | ||
<outputs> | ||
<data name="genomicVariationsVcf" format="json" label="${tool.name} on ${on_string}: genomicVariationsVcf file" from_work_dir="genomicVariationsVcf.json.gz" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You expect here a normal json, but in the test you are comparing the uncompressed json to a compressed json.
We do not have currently a notion of compressed json.
Maybe its time to add this feature to Galaxy core by adding auto_compressed_types="gz,bz2"
to the datatype defintion.
But for the time being you are probably better off to not push back a compressed json to galaxy. Just uncompress it as last step in your command section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgruening I am trying to do this, but I am sure I am missing something.
First, the error of having an empty comes from the test data, after converting them from gzip comprised format into bgzf format it gives an empty file for output, I test the input data locally.
Regarding the unzipped part, I am sure I am missing something there as it gives an unrelated output tile.
tools/beacon2/vcf2bff.xml
Outdated
<param name="format" value="bff"/> | ||
<param name="dataset_id" value="beacon"/> | ||
<param name="genome" value="hg19"/> | ||
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" /> | |
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" ftype="json" /> |
This than matches your output defition.
…d compress it from the command part
@bgruening I am done with the changes to the script. the conclusion is that if I change the file format to bgzip (vcf_bgzip) it will be considered as a false file from the tool, and it will not work as it is supposed to. So, I reused the tabular.gz, it is not the best way to do it, but it works in the same way as the tool locally. the old gzip format works. the vcf_bgzip Doesn't work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @khaled196
FOR CONTRIBUTOR: