Handle WDL structs #97

kinow · 2021-10-30T02:13:26Z

Closes #72

Related to #68 too.

N.B.: pending tests, but a pre-review before I write the tests to confirm it's going in the right direction is welcome 😬

This change has minor cosmetics changes (unused imports, typos), and addresses the #72 as follows:

In the ANTLR parser visitor we create a list of the structs found. This is a list of a custom data class Struct, but happy to create a dict/tuple/normal object/etc if it's better
In the translator, after the tree is “visited”, we convert the list of Struct into a dictionary with key=type_id (e.g. References) and the value=cwl.RecordSchema
In the translator's get_input, we pass this dictionary as a new parameter
The if/elselogic inget_input` we look up the type in the dictionary of records, if a value is found, it will be the type used
The get_output has been renamed to get_expression_output as it was not doing the same thing as get_input, I think.
Moved the code that was creating CWL outputs in the main function into a new get_output function, that looks similar to get_input. It handles the structs in the same manner.

Resulting workflow using the `warp`'s `BuildIndices.wdl`:

class: CommandLineTool
id: BuildIntervalList
inputs:
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: organism_prefix
    type: string
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: organism_prefix
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: rsem_index
    type: File
  - id: gtf_version
    type: string
  - id: organism
    type: string
  - id: genome_fa
    type: File
  - id: organism
    type: string
  - id: genome_short_string
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: gtf_version
    type: string
  - id: dbsnp_version
    type: string
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
outputs:
  - id: references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: annotation_gtf_modified_introns
    type: File
    outputBinding:
        glob: ''
  - id: modified_references
    type:
        fields:
          - name: genome_fa
            type: File
          - name: annotation_gtf
            type: File
        type: References
  - id: rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: refflat
    type: File
    outputBinding:
        glob: ''
  - id: interval_list
    type: File
    outputBinding:
        glob: ''
  - id: pipeline_version
    type: string
    outputBinding:
        glob: 0.1.0
  - id: star_index
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_star_index
    type: File
    outputBinding:
        glob: ''
  - id: rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_from_rsem_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_index
    type: File
    outputBinding:
        glob: ''
  - id: hisat2_snp_haplotype_splicing_index
    type: File
    outputBinding:
        glob: ''
  - id: refflat
    type: File
    outputBinding:
        glob: ''
  - id: interval_list
    type: File
    outputBinding:
        glob: ''
  - id: genome_fa
    type: File
    outputBinding:
        glob: ''
  - id: annotation_gtf
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_genome_fa
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_annotation_gtf
    type: File
    outputBinding:
        glob: ''
  - id: snSS2_annotation_gtf_introns
    type: File
    outputBinding:
        glob: ''
requirements:
  - class: DockerRequirement
    dockerPull: quay.io/humancellatlas/secondary-analysis-umitools:0.0.1
  - class: InitialWorkDirRequirement
    listing:
      - entryname: example.sh
        entry: |4+

            set -eo pipefail


            # index the fasta file

            samtools faidx $(inputs.references.genome_fa)
            cut -f1,2 $(inputs.references.genome_fa).fai > sizes.genome

            awk -F '\t'  '{  printf "@SQ\tSN:%s\tLN:%s\n", $1, $2 }' sizes.genome  >> $(inputs.interval_list_name)

            grep 'gene_type "rRNA"' $(inputs.references.annotation_gtf) |
                awk '$3 == "transcript"' |
            cut -f1,4,5,7,9 |
            perl -lane '
                /transcript_id "([^"]+)"/ or die "no transcript_id on $.";
                print join "\t", (@F[0,1,2,3], $1)
            ' |
            sort -k1V -k2n -k3n  >> $(inputs.interval_list_name)

  - class: InlineJavascriptRequirement
  - class: NetworkAccess
    networkAccess: true
  - class: ResourceRequirement
    ramMin: 8192
  - class: ResourceRequirement
    coresMin: ''
cwlVersion: v1.2
baseCommand:
  - bash
  - example.sh

☝️

Thanks!
Bruno

wdl2cwl/WdlV1_1ParserVisitor.py

kinow · 2021-10-30T02:16:31Z

wdl2cwl/WdlV1_1ParserVisitor.py

+        struct = Struct(name=str(ctx.Identifier()))
+        for decl in ctx.unbound_decls():
+            struct.fields.append([str(decl.Identifier()), decl.wdl_type().getText()])
+        self.structs.append(struct)


So the core of this change depends on saving the structs when visiting the WDL abstract tree. An alternative would be to fetch them when/if necessary from the tree (not from ast), but I opted for this approach. Happy to change if needed.

kinow · 2021-10-30T02:17:33Z

wdl2cwl/main.py

@@ -396,7 +394,7 @@ def get_command(
    return new_command


-def get_output(expression: str, input_names: List[str]) -> str:
+def get_expression_output(expression: str, input_names: List[str]) -> str:
    """Get expression for outputs."""


I was confused at first as I thought get_output was to get_input, but apparently it did something slightly different (I am not familiar with expressions for outputs as in the docs). I thought it would be OK to rename it and create a get_output that does something similar to get_input?

kinow · 2021-10-30T02:18:00Z

wdl2cwl/main.py

+            inputs.append(cwl.CommandInputParameter(id=input_name, type=record_schemas.get(i[0])))
+
+        else:
+            raise ValueError(f"Failed to get input type: {i[0]}")


Now raising an error if we know we cannot handle the input type (also the output in another function).

Some unit tests failed, so I reverted this change. Added the elif that checks if it is a struct before the else block, this last one is handling the types in the same way as before. Only addition will be the extra elif key in structs...

kinow · 2021-10-30T02:18:36Z

wdl2cwl/main.py

+            raise ValueError(f"Failed to get output type: {i[0]}")
+
+    return outputs
+


This ☝️ came from the convert method.

kinow · 2021-10-30T02:22:54Z

(fixing linter & some unit tests…)

mr-c

This is exciting!

Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

codecov · 2021-10-30T03:00:21Z

Codecov Report

Merging #97 (dcd022b) into main (96c7c6a) will decrease coverage by 0.64%.
The diff coverage is 74.35%.

❗ Current head dcd022b differs from pull request most recent head 5d428aa. Consider uploading reports for the commit 5d428aa to get more accurate results

@@            Coverage Diff             @@
##             main      #97      +/-   ##
==========================================
- Coverage   94.78%   94.13%   -0.65%     
==========================================
  Files           4        4              
  Lines        1380     1399      +19     
==========================================
+ Hits         1308     1317       +9     
- Misses         72       82      +10

Impacted Files	Coverage Δ
wdl2cwl/WdlV1_1ParserVisitor.py	`68.87% <40.00%> (-1.26%)`	⬇️
wdl2cwl/main.py	`98.15% <86.20%> (-1.04%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96c7c6a...5d428aa. Read the comment docs.

kinow · 2021-10-30T03:17:27Z

This is exciting!

Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

Hi @mr-c !

Yes, please! I was actually going to investigate later how the project is tested, whether workflows are executed, or if we need to write unit tests :)

If there's an existing test somewhere, or notes on how to write tests, I can take a look and see if I can follow/use as reference.

Thanks!
Bruno

mr-c · 2021-10-30T03:23:06Z

This is exciting!
Do you want help assembling test data for running BuildIntervalList to affirm the resulting CWL description produces the same output?

Hi @mr-c !

Hoopla!

Yes, please! I was actually going to investigate later how the project is tested, whether workflows are executed, or if we need to write unit tests :)

If there's an existing test somewhere, or notes on how to write tests, I can take a look and see if I can follow/use as reference.

I just merged #94

To make it easier on yourself, feel free to substitute a fake command in the WDL description

Thanks! Bruno

Also rename the get_output function as it was not equivalent to get_input. Then move code that was creating CWL outputs to get_output (more similar to get_input).

kinow · 2021-10-30T03:43:57Z

I think I understood how to write a test. I have one almost ready, just pending the expected output and learning how to run it to test locally.

But it has already identified one issue with my code :-) I was using the second token of References references as the parameter type, but it actually needs to be a constant "record".

@mr-c I will push a WIP commit with my test, but will fix it later. I think I found a couple other issues that are not related to structs. Will see if there are any existing issues, if not will submit new ones 👍

mr-c · 2021-10-30T03:49:02Z

I think I understood how to write a test. I have one almost ready, just pending the expected output and learning how to run it to test locally.

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

For 1) see https://github.com/common-workflow-lab/wdl-cwl-translator#adding-test-cases and for 2) see 96c7c6a#diff-002fd7337e44ea597c874d9b564acf1cfe36a4a7c2fcf3d0fe9cf41c748214cfR8

kinow · 2021-10-30T03:51:55Z

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

Ah! My bad, I started reading the readme a couple days ago but never finished. Went as far as the part about testing with an existing WDL 😅 I should have revisited it later.

So what I have is a test like 2). Which option would be more appropriate for this PR? Thanks!

kinow · 2021-10-30T04:14:34Z

Created the other two issues @mr-c . Done for the day, but learned a little more about CWL & WDL. My second issue could be something really silly that I didn't understand when using the translator, feel free to point if I missed anything 👍

Thanks!
Bruno

mr-c · 2021-10-30T04:15:01Z

Sorry, I should have been more clear. We have two major classes of tests: 1) comparing with stored conversion results and 2) actually running the CWL version with real inputs and comparing with known good outputs

Ah! My bad, I started reading the readme a couple days ago but never finished. Went as far as the part about testing with an existing WDL 😅 I should have revisited it later.

So what I have is a test like 2). Which option would be more appropriate for this PR? Thanks!

Option 1 is mandatory. Option 2 is appreciated!

mr-c · 2022-01-27T10:06:22Z

Closing as this needs to be redone from scratch given the refactoring. Hopefully it will be easier the 2nd time!

kinow commented Oct 30, 2021

View reviewed changes

mr-c reviewed Oct 30, 2021

View reviewed changes

kinow force-pushed the references branch 2 times, most recently from fa15194 to f7db6d7 Compare October 30, 2021 02:59

kinow force-pushed the references branch 2 times, most recently from 2c87b2a to 22bb6f7 Compare October 30, 2021 03:06

kinow added 4 commits October 30, 2021 16:27

Fix typo in comment

3ffa8ff

Handle structs in input and output.

2e7ad33

Also rename the get_output function as it was not equivalent to get_input. Then move code that was creating CWL outputs to get_output (more similar to get_input).

Remove unused imports

4c914d1

Fix pydocstyle issues

7090bea

kinow force-pushed the references branch from dcd022b to 7090bea Compare October 30, 2021 03:28

WIP add unit test, and fix parameter type (must be constant record)

5d428aa

This was referenced Oct 30, 2021

WDL runtime requirement cpu not being parsed correctly #98

Closed

Support multiple tasks in the same WDL pipeline #99

Closed

mr-c closed this Jan 27, 2022

kinow deleted the references branch January 27, 2022 11:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle WDL structs #97

Handle WDL structs #97

kinow commented Oct 30, 2021 •

edited

Loading

kinow Oct 30, 2021

kinow Oct 30, 2021

kinow Oct 30, 2021

kinow Oct 30, 2021

kinow Oct 30, 2021

kinow commented Oct 30, 2021

mr-c left a comment

codecov bot commented Oct 30, 2021 •

edited

Loading

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021 •

edited

Loading

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021

kinow commented Oct 30, 2021

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021

mr-c commented Jan 27, 2022

		raise ValueError(f"Failed to get output type: {i[0]}")

		return outputs

Handle WDL structs #97

Handle WDL structs #97

Conversation

kinow commented Oct 30, 2021 • edited Loading

kinow Oct 30, 2021

Choose a reason for hiding this comment

kinow Oct 30, 2021

Choose a reason for hiding this comment

kinow Oct 30, 2021

Choose a reason for hiding this comment

kinow Oct 30, 2021

Choose a reason for hiding this comment

kinow Oct 30, 2021

Choose a reason for hiding this comment

kinow commented Oct 30, 2021

mr-c left a comment

Choose a reason for hiding this comment

codecov bot commented Oct 30, 2021 • edited Loading

Codecov Report

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021 • edited Loading

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021

kinow commented Oct 30, 2021

kinow commented Oct 30, 2021

mr-c commented Oct 30, 2021

mr-c commented Jan 27, 2022

kinow commented Oct 30, 2021 •

edited

Loading

codecov bot commented Oct 30, 2021 •

edited

Loading

mr-c commented Oct 30, 2021 •

edited

Loading