Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle WDL structs #97

Closed
wants to merge 5 commits into from
Closed

Conversation

kinow
Copy link
Member

@kinow kinow commented Oct 30, 2021

Closes #72

Related to #68 too.

N.B.: pending tests, but a pre-review before I write the tests to confirm it's going in the right direction is welcome 😬

This change has minor cosmetics changes (unused imports, typos), and addresses the #72 as follows:

  • In the ANTLR parser visitor we create a list of the structs found. This is a list of a custom data class Struct, but happy to create a dict/tuple/normal object/etc if it's better
  • In the translator, after the tree is “visited”, we convert the list of Struct into a dictionary with key=type_id (e.g. References) and the value=cwl.RecordSchema
  • In the translator's get_input, we pass this dictionary as a new parameter
  • The if/elselogic inget_input` we look up the type in the dictionary of records, if a value is found, it will be the type used
  • The get_output has been renamed to get_expression_output as it was not doing the same thing as get_input, I think.
  • Moved the code that was creating CWL outputs in the main function into a new get_output function, that looks similar to get_input. It handles the structs in the same manner.
Resulting workflow using the `warp`'s `BuildIndices.wdl`:
class: CommandLineTool
id: BuildIntervalList
inputs:
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: organism_prefix
    type: string
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: organism_prefix
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: rsem_index
    type: File
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: genome_fa
    type: File
  - id: organism
    type: string
  - id: genome_short_string
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: dbsnp_version
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
outputs:
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: annotation_gtf_modified_introns
    type: File
    outputBinding:
        glob: ''
  - id: modified_references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: refflat
    type: File
    outputBinding:
        glob: ''
  - id: interval_list
    type: File
    outputBinding:
        glob: ''
  - id: pipeline_version
    type: string
    outputBinding:
        glob: 0.1.0
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_star_index
    type: File
    outputBinding:
        glob: ''
  - id: rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_from_rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_snp_haplotype_splicing_index
    type: File
    outputBinding:
        glob: ''
  - id: refflat
    type: File
    outputBinding:
        glob: ''
  - id: interval_list
    type: File
    outputBinding:
        glob: ''
  - id: genome_fa
    type: File
    outputBinding:
        glob: ''
  - id: annotation_gtf
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_genome_fa
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_annotation_gtf
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_annotation_gtf_introns
    type: File
    outputBinding:
        glob: ''
requirements:
  - class: DockerRequirement
    dockerPull: quay.io/humancellatlas/secondary-analysis-umitools:0.0.1
  - class: InitialWorkDirRequirement
    listing:
      - entryname: example.sh
        entry: |4+

            set -eo pipefail


            # index the fasta file

            samtools faidx $(inputs.references.genome_fa)
            cut -f1,2 $(inputs.references.genome_fa).fai > sizes.genome

            awk -F '\t'  '{  printf "@SQ\tSN:%s\tLN:%s\n", $1, $2 }' sizes.genome  >> $(inputs.interval_list_name)

            grep 'gene_type "rRNA"' $(inputs.references.annotation_gtf) |
                awk '$3 == "transcript"' |
            cut -f1,4,5,7,9 |
            perl -lane '
                /transcript_id "([^"]+)"/ or die "no transcript_id on $.";
                print join "\t", (@F[0,1,2,3], $1)
            ' |
            sort -k1V -k2n -k3n  >> $(inputs.interval_list_name)

  - class: InlineJavascriptRequirement
  - class: NetworkAccess
    networkAccess: true
  - class: ResourceRequirement
    ramMin: 8192
  - class: ResourceRequirement
    coresMin: ''
cwlVersion: v1.2
baseCommand:
  - bash
  - example.sh

☝️

Thanks!
Bruno

wdl2cwl/WdlV1_1ParserVisitor.py Outdated Show resolved Hide resolved
struct = Struct(name=str(ctx.Identifier()))
for decl in ctx.unbound_decls():
struct.fields.append([str(decl.Identifier()), decl.wdl_type().getText()])
self.structs.append(struct)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the core of this change depends on saving the structs when visiting the WDL abstract tree. An alternative would be to fetch them when/if necessary from the tree (not from ast), but I opted for this approach. Happy to change if needed.

@@ -396,7 +394,7 @@ def get_command(
return new_command


def get_output(expression: str, input_names: List[str]) -> str:
def get_expression_output(expression: str, input_names: List[str]) -> str:
"""Get expression for outputs."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused at first as I thought get_output was to get_input, but apparently it did something slightly different (I am not familiar with expressions for outputs as in the docs). I thought it would be OK to rename it and create a get_output that does something similar to get_input?

wdl2cwl/main.py Outdated
inputs.append(cwl.CommandInputParameter(id=input_name, type=record_schemas.get(i[0])))

else:
raise ValueError(f"Failed to get input type: {i[0]}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now raising an error if we know we cannot handle the input type (also the output in another function).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some unit tests failed, so I reverted this change. Added the elif that checks if it is a struct before the else block, this last one is handling the types in the same way as before. Only addition will be the extra elif key in structs...

raise ValueError(f"Failed to get output type: {i[0]}")

return outputs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ☝️ came from the convert method.

@kinow
Copy link
Member Author

kinow commented Oct 30, 2021

(fixing linter & some unit tests…)

Copy link
Member

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exciting!

Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

@kinow kinow force-pushed the references branch 2 times, most recently from fa15194 to f7db6d7 Compare October 30, 2021 02:59
@codecov
Copy link

codecov bot commented Oct 30, 2021

Codecov Report

Merging #97 (dcd022b) into main (96c7c6a) will decrease coverage by 0.64%.
The diff coverage is 74.35%.

❗ Current head dcd022b differs from pull request most recent head 5d428aa. Consider uploading reports for the commit 5d428aa to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main      #97      +/-   ##
==========================================
- Coverage   94.78%   94.13%   -0.65%     
==========================================
  Files           4        4              
  Lines        1380     1399      +19     
==========================================
+ Hits         1308     1317       +9     
- Misses         72       82      +10     
Impacted Files Coverage Δ
wdl2cwl/WdlV1_1ParserVisitor.py 68.87% <40.00%> (-1.26%) ⬇️
wdl2cwl/main.py 98.15% <86.20%> (-1.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96c7c6a...5d428aa. Read the comment docs.

@kinow kinow force-pushed the references branch 2 times, most recently from 2c87b2a to 22bb6f7 Compare October 30, 2021 03:06
@kinow
Copy link
Member Author

kinow commented Oct 30, 2021

This is exciting!

Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

Hi @mr-c !

Yes, please! I was actually going to investigate later how the project is tested, whether workflows are executed, or if we need to write unit tests :)

If there's an existing test somewhere, or notes on how to write tests, I can take a look and see if I can follow/use as reference.

Thanks!
Bruno

@mr-c
Copy link
Member

mr-c commented Oct 30, 2021

This is exciting!
Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

Hi @mr-c !

Hoopla!

Yes, please! I was actually going to investigate later how the project is tested, whether workflows are executed, or if we need to write unit tests :)

If there's an existing test somewhere, or notes on how to write tests, I can take a look and see if I can follow/use as reference.

I just merged #94

To make it easier on yourself, feel free to substitute a fake command in the WDL description

Thanks! Bruno

kinow added 4 commits October 30, 2021 16:27
Also rename the get_output function as it was not equivalent to get_input.

Then move code that was creating CWL outputs to get_output (more similar to get_input).
@kinow
Copy link
Member Author

kinow commented Oct 30, 2021

I think I understood how to write a test. I have one almost ready, just pending the expected output and learning how to run it to test locally.

But it has already identified one issue with my code :-) I was using the second token of References references as the parameter type, but it actually needs to be a constant "record".

@mr-c I will push a WIP commit with my test, but will fix it later. I think I found a couple other issues that are not related to structs. Will see if there are any existing issues, if not will submit new ones 👍

@mr-c
Copy link
Member

mr-c commented Oct 30, 2021

I think I understood how to write a test. I have one almost ready, just pending the expected output and learning how to run it to test locally.

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

For 1) see https://github.com/common-workflow-lab/wdl-cwl-translator#adding-test-cases and for 2) see 96c7c6a#diff-002fd7337e44ea597c874d9b564acf1cfe36a4a7c2fcf3d0fe9cf41c748214cfR8

@kinow
Copy link
Member Author

kinow commented Oct 30, 2021

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

Ah! My bad, I started reading the readme a couple days ago but never finished. Went as far as the part about testing with an existing WDL 😅 I should have revisited it later.

So what I have is a test like 2). Which option would be more appropriate for this PR? Thanks!

@kinow
Copy link
Member Author

kinow commented Oct 30, 2021

Created the other two issues @mr-c . Done for the day, but learned a little more about CWL & WDL. My second issue could be something really silly that I didn't understand when using the translator, feel free to point if I missed anything 👍

Thanks!
Bruno

@mr-c
Copy link
Member

mr-c commented Oct 30, 2021

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

Ah! My bad, I started reading the readme a couple days ago but never finished. Went as far as the part about testing with an existing WDL 😅 I should have revisited it later.

So what I have is a test like 2). Which option would be more appropriate for this PR? Thanks!

Option 1 is mandatory. Option 2 is appreciated!

@mr-c
Copy link
Member

mr-c commented Jan 27, 2022

Closing as this needs to be redone from scratch given the refactoring. Hopefully it will be easier the 2nd time!

@mr-c mr-c closed this Jan 27, 2022
@kinow kinow deleted the references branch January 27, 2022 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WDL structs are not handled
2 participants