Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add beacon2 tools (vcf2bff.pl, pxf2bff, csv2xlsx and bff-validator) into galaxy #5442

Merged
merged 38 commits into from
Oct 1, 2023
Merged

Conversation

khaled196
Copy link
Contributor

@khaled196 khaled196 commented Aug 11, 2023

FOR CONTRIBUTOR:

  • - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • - License permits unrestricted use (educational + commercial)
  • - This PR adds a new tool or tool collection
  • - This PR updates an existing tool or tool collection
  • - This PR does something else (explain below)

@khaled196 khaled196 changed the title Add files via upload Add vcf2bff beacon2 tool into galaxy Aug 11, 2023
Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of stuff in your test-data directory, can you look at this and remove all unneeded stuff? Thanks.

</xrefs>
</xml>
<xml name="vcf2bff_param">
<param argument="--input" type="data" format="tabular.gz" label="Annotated vcf file" help="" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not have this in the main tool? this is not shared with any other tool isn't it?

<option value="hash">hash</option>
<option value="json">json</option>
</param>
<param argument="--project-dir" type="text" label="Beacon project dir" value="" help="" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh, this will not work I fear. What is expected to be there? User don't know where the data is stored...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it like this as I don't know where Galaxy tends to keep things can I use pwd as a parameter? I used it to test locally and worked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of data is in there? Is that static data that never changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening

If my understanding is correct, it is to point out the directory where the inputs are/where the output should go to. I wasn't sure how to do that for Galaxy as the work will be carried out remotely in history, so I thought about the pwd, and surprisingly that worked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But pwd is doing nothing. Its a random string. If you want to specify the current working dir you can use ./ or \$PWD. What is strange is that it still turned green. So either this parameter is not used or your tests are not working :(

I changed it now to ./ to see if this also works.

<option value="hash">hash</option>
<option value="json">json</option>
</param>
<param argument="--dataset-id" type="text" label="Dataset ID" value="" help="" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add help text to the parameters? I don't know what should be included here. How should a user know such an ID?

<option value="json">json</option>
</param>
<param argument="--dataset-id" type="text" label="Dataset ID" value="" help="" />
<param argument="--genome" type="text" label="Reference genome" value="" help="" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well ... maybe we can offer a select parameter with the most common reference genomes and IDs?

b) Standard JSON [json]
c) Perl hash data structure [hash]

HOW TO RUN VCF2BFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit useless. Can you please enhance the tool help a bit.

@bgruening
Copy link
Member

@khaled196 please clean up the test-data directory, I don't think those files are needed.

@khaled196 khaled196 requested a review from bgruening September 1, 2023 10:21
@khaled196
Copy link
Contributor Author

Hi @bgruening

I am working on wrapping the tools needed for the users to prepare for data before sharing them or inquiring about them on MongoDB. I have created the XML wrapper files for them.

Among those tools vcf2bff.xml, pxf2bff.xml, csv2xlsx.xml and bff_validator.xml

The tools vcf2bff.xml, pxf2bff.xml, and csv2xlsx.xml work fine on Galaxy (tested on Planemo). I just waiting to meet with someone from the beacon team to add the useful labelling and documentation for them.

bff_validator.xml is the stubborn one. It failed without a mentioned reason in the JSON file.

The trick with this tool it uses a directory deref_schemas as an input, maybe the way I am calling it within the wrapped is not correct. However, I don't have an interoperable error to debug it.

@khaled196 khaled196 changed the title Add vcf2bff beacon2 tool into galaxy Add beacon2 tools (vcf2bff.pl, pxf2bff, csv2xlsx and bff-validator) into galaxy Sep 6, 2023
--genome '$genome'
]]></command>
<inputs>
<param argument="--input" type="data" format="tabular.gz" label="Annotated VCF file" help="The output genomic variations VCF file of bcftools, snpeff, snpsift" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaled196 why is the format here tabular.gz and not VCF?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You @bgruening for looking into it.

Originally it was VCF. However, because the file is compressed every time I tested the tool it complained about the gz(zipped) part. I searched for VCF.gz format, but I couldn't find it in the file types that are supported in Galaxy and I found the tabular.gz which I thing VCF is a tabular also, so I thought that should be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@khaled196 khaled196 Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bgruening . I had an error before because I changed it in the input format and the test ftype.

If I change the Ftype from tabular.gz to vcf_bgzip I will get an error.

Error Massage

"Input staging problem: Job in error state.. tool_id: upload1, exit_code: None, stderr: .",

Copy link
Contributor Author

@khaled196 khaled196 Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening won't the user get an error if the format is left as vcf_bgzip in that case?

The tool takes the input file in the compressed format, processes it and gives the output in the compressed format, so it doesn't work if the input file is uncompressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can test this with planemo serve ...

But if the format is a VCF its not a tabular and we should fix the real problem that you have. Which I currently do not understand.

Oh I see the tests are passing ... where do you see the this error?

Copy link
Contributor Author

@khaled196 khaled196 Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that locally on my VM. the error when I change both the input and the test into vcf_bgzip

param argument="--input" type="data" format="vcf_bgzip" label="Annotated vcf file" help="The output genomic variations VCF file of bcftools, snpeff, snpsift" /
param name="input" ftype="vcf_bgzip" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" /

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have prefixed the tool ID and tried to shorten the description of the tool as well as trying to make clear that those are not general CSV to XLSX converter tools, but specific to Beacon.

@bgruening
Copy link
Member

Is that ok for you?

@khaled196
Copy link
Contributor Author

khaled196 commented Sep 16, 2023

Thank you @bgruening . If possible I want to ask for another help. The tool I have a problem with is the bff-validator. I deleted it to merge the repo. After merging the pull request, I will open a pull request for it. If you can take a look at it.

</outputs>
<tests>
<test expect_num_outputs="1">
<param name="input" ftype="tabular.gz" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<param name="input" ftype="tabular.gz" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" />
<param name="input" ftype="vcf_bgzip" value="test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz" />

Maybe we have a confusion here. Is this a proper Bgzip compressed file? Or something else?
e.g. samtools/bcftools#668

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you see a few vcf.gz files: https://github.com/galaxyproject/tools-iuc/tree/5ab93133efa5de410e992919c4e841807fb43b55/tools/clair3/test-data

that are known to work on other variant calling tools ... maybe you can try them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VCF.gz file is not normal by itself, it is the output of prepossessing it with bcftools, snpeff and snpsift. Kindly find out how they created that file. run_vcf2bff.sh

Test it with htsfile and return this.

beacon2/test-data$ htsfile test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz
test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz: VCF version 4.1 gzip-compressed var iant calling data

While testing the vcf.gz files in the link dir returned

(.venv) (new_beacon) khaled@galaxy-planemo-khaled-2:~/khaled/galaxy_beacon/tools-iuc/tools/beacon2/test-data$ htsfile merge_output_1.vcf.gz
merge_output_1.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening sorted the file format using the steps in the link above, but now it returns an empty JSON file.

<param argument="--genome" type="text" label="Reference genome" value="" help="Select the reference genome used the annotate the data to create the genomicVariations dataset examples for reference genomes are hs37, hg37 and hg38" />
</inputs>
<outputs>
<data name="genomicVariationsVcf" format="json" label="${tool.name} on ${on_string}: genomicVariationsVcf file" from_work_dir="genomicVariationsVcf.json.gz" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You expect here a normal json, but in the test you are comparing the uncompressed json to a compressed json.

We do not have currently a notion of compressed json.

https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/datatypes_conf.xml.sample#L554

Maybe its time to add this feature to Galaxy core by adding auto_compressed_types="gz,bz2" to the datatype defintion.

Like here: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/datatypes_conf.xml.sample#L138C33-L138C63

But for the time being you are probably better off to not push back a compressed json to galaxy. Just uncompress it as last step in your command section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening I am trying to do this, but I am sure I am missing something.

First, the error of having an empty comes from the test data, after converting them from gzip comprised format into bgzf format it gives an empty file for output, I test the input data locally.

Regarding the unzipped part, I am sure I am missing something there as it gives an unrelated output tile.

<param name="format" value="bff"/>
<param name="dataset_id" value="beacon"/>
<param name="genome" value="hg19"/>
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" />
<output name="genomicVariationsVcf" file="genomicVariationsVcf.json.gz" ftype="json" />

This than matches your output defition.

@khaled196 khaled196 requested a review from bgruening September 19, 2023 09:29
@khaled196
Copy link
Contributor Author

khaled196 commented Sep 20, 2023

@bgruening I am done with the changes to the script. the conclusion is that if I change the file format to bgzip (vcf_bgzip) it will be considered as a false file from the tool, and it will not work as it is supposed to. So, I reused the tabular.gz, it is not the best way to do it, but it works in the same way as the tool locally.

the old gzip format works.
test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz: VCF version 4.1 gzip-compressed variant calling data

the vcf_bgzip Doesn't work
test_1000G.norm.ann.dbnsfp.clinvar.cosmic.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @khaled196

@bgruening bgruening merged commit dcaf804 into galaxyproject:main Oct 1, 2023
12 checks passed
@bgruening bgruening deleted the beacon2 branch October 1, 2023 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants