Add test scripts #70

Merged · 17 commits · Oct 8, 2024
28 changes: 28 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,28 @@
name: Run test script

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  run-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install cwltool
        run: pip install cwltool

      - name: Run tests
        run: ./test.sh
1 change: 1 addition & 0 deletions .gitignore
@@ -161,3 +161,4 @@ domains/proteomics/UseCases-Demo/*
# Exclude generated workflow outputs
examples/cwl-workflows/example*/workflow_*_output/
examples/cwl-workflows/example*/benchmarks.json
**/test/output/
12 changes: 9 additions & 3 deletions README.md
@@ -3,12 +3,18 @@

This repository contains the domain and tool descriptions needed to generate and execute workflows in the [Workflomics](https://github.com/Workflomics/workflomics-frontend) interface.

To add new domains or tools to the `Workflomics` environment, CWL files and other files needed to run these tools should be added to this repository. See the [domain annotation guide](https://workflomics.readthedocs.io/en/domain-creation/developer-guide/domain-development.html) for more information.
To add new domains or tools to the `Workflomics` environment, CWL files and other files needed to run these tools should be added to this repository. For a detailed description of the required steps, see the [Workflomics documentation](https://workflomics.readthedocs.io/en/latest/index.html), under 'Domain Expert Guide'.

The repository is organized in the following way:

- `domains/`: Contains the domain descriptions. The descriptions are used by the Workflomics environment to generate workflows. Each domain comprises a set of tools (e.g., described in a `tools.json` file) and a configuration file (e.g., `config.json`) that specifies the domain-specific parameters. See the [documentation of the APE engine](https://ape-framework.readthedocs.io/en/latest/docs/specifications/setup.html#configuration-file) to learn more about the configuration file.
- `cwl-tools/`: Contains the CWL CommandLineTool descriptions of the tools used in the workflows (similar to the [bio-cwl-tools](https://github.com/common-workflow-library/bio-cwl-tools) repo). The CWL files are used by the Workflomics environment to execute each step of the workflow. Within the Workflomics ecosystem these workflows are executed using the [Workflomics Benchmarker](https://github.com/Workflomics/workflomics-benchmarker) which utilizes [cwltool](https://github.com/common-workflow-language/cwltool).
- `domains/`: Contains the domain descriptions. The descriptions are used by the Workflomics environment to generate workflows. Each domain comprises a set of tools (e.g., described in a `tools.json` file) and a configuration file (e.g., `config.json`) that specifies the domain-specific parameters. See the [domain annotation guide](https://workflomics.readthedocs.io/en/latest/domain-expert-guide/domain-development.html) to learn more about these files.

- `cwl-tools/`: Contains the CWL CommandLineTool descriptions of the tools used in the workflows (similar to the [bio-cwl-tools](https://github.com/common-workflow-library/bio-cwl-tools) repo). The CWL files are used by the Workflomics environment to execute each step of the workflow. Within the Workflomics ecosystem these workflows are executed using the [Workflomics Benchmarker](https://github.com/Workflomics/workflomics-benchmarker) which utilizes [cwltool](https://github.com/common-workflow-language/cwltool). For more information about adding new tools, see the [adding tools section](https://workflomics.readthedocs.io/en/latest/domain-expert-guide/adding-tools.html) of the documentation.

- `examples/`: Contains example workflows that can be executed using the [Workflomics Benchmarker](https://github.com/Workflomics/workflomics-benchmarker). The workflows, generated by the Workflomics platform, are written in the [Common Workflow Language (CWL)](https://www.commonwl.org/).

When using the Workflomics web interface, workflows reference this repository directly. The files are downloaded during workflow execution, so you don't need to clone this repository for normal usage in the Workflomics environment.

## Testing

To test the CWL annotations, run `test_cwl_annotations.sh` from the repository root. This script runs the test scripts in the `test` directory of each tool, testing whether the CWL annotations pass as stand-alone workflow steps. This requires `cwltool` and `docker` to be installed.
42 changes: 42 additions & 0 deletions cwl-tools/Sage-proteomics/Sage-proteomics.cwl
@@ -0,0 +1,42 @@
cwlVersion: v1.0
label: Sage
class: CommandLineTool
baseCommand: ["/bin/bash", "-c"]
arguments:
  - valueFrom: >
      "sage -o /data/output -f $(inputs.Sage_in_2.path) \
      $(inputs.Configuration.path) $(inputs.Sage_in_1.path) && \
      /data/sage_TSV_to_mzIdentML.sh /data/output/results.sage.tsv"
    shellQuote: false
requirements:
  ShellCommandRequirement: {}
  DockerRequirement:
    dockerPull: ghcr.io/lazear/sage:v0.14.7
    dockerOutputDirectory: /data
  InitialWorkDirRequirement:
    listing:
      - class: File
        location: sage_TSV_to_mzIdentML.sh
        basename: sage_TSV_to_mzIdentML.sh

inputs:
  Sage_in_1:
    type: File
    format: "http://edamontology.org/format_3244"  # mzML
  Sage_in_2:
    type: File
    format: "http://edamontology.org/format_1929"  # FASTA
  Configuration:
    type: File
    format: "http://edamontology.org/format_3464"  # JSON
    default:
      class: File
      format: "http://edamontology.org/format_3464"  # JSON
      location: https://raw.githubusercontent.com/Workflomics/tools-and-domains/main/cwl-tools/Sage-proteomics/config.json

outputs:
  Sage_out:
    type: File
    format: "http://edamontology.org/format_3247"  # mzIdentML
    outputBinding:
      glob: /data/output/results.sage.mzid
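The `Configuration` input above falls back to the repository-hosted `config.json` via its `default` block, so a job file only needs to mention it when overriding the search parameters. A hypothetical override entry in a job file (the local `my-config.json` path is illustrative, not part of this PR) could look like:

```yaml
Configuration:
  class: File
  format: http://edamontology.org/format_3464
  path: ./my-config.json  # hypothetical local Sage configuration
```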
24 changes: 24 additions & 0 deletions cwl-tools/Sage-proteomics/Sage-proteomics.json
@@ -0,0 +1,24 @@
{"functions": [{
  "outputs": [{
    "format_1915": ["http://edamontology.org/format_3247"],
    "data_0006": ["http://edamontology.org/data_0945"]
  }],
  "biotoolsID": "Sage-proteomics",
  "inputs": [
    {
      "format_1915": ["http://edamontology.org/format_3244"],
      "data_0006": ["http://edamontology.org/data_0943"]
    },
    {
      "format_1915": ["http://edamontology.org/format_1929"],
      "data_0006": ["http://edamontology.org/data_2976"]
    }
  ],
  "taxonomyOperations": [
    "http://edamontology.org/operation_3631",
    "http://edamontology.org/operation_3633",
    "http://edamontology.org/operation_2428"
  ],
  "label": "Sage",
  "id": "Sage-proteomics"
}]}
61 changes: 61 additions & 0 deletions cwl-tools/Sage-proteomics/config.json
@@ -0,0 +1,61 @@
{
  "database": {
    "bucket_size": 8192,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": 7,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P"
    },
    "fragment_min_mz": 150.0,
    "fragment_max_mz": 2000.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": ["b", "y"],
    "min_ion_index": 2,
    "max_variable_mods": 3,
    "static_mods": {
      "C": 57.0215
    },
    "variable_mods": {
      "M": 15.994
    },
    "decoy_tag": "rev_",
    "generate_decoys": true
  },
  "quant": {
    "lfq": true,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.6,
      "ppm_tolerance": 5.0
    }
  },
  "precursor_tol": {
    "ppm": [-20.0, 20.0]
  },
  "fragment_tol": {
    "ppm": [-20.0, 20.0]
  },
  "isotope_errors": [0, 2],
  "deisotope": true,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "min_matched_peaks": 4,
  "predict_rt": true
}
96 changes: 96 additions & 0 deletions cwl-tools/Sage-proteomics/sage_TSV_to_mzIdentML.sh
@@ -0,0 +1,96 @@
#!/bin/bash
#
# This is a quick-and-dirty converter of Sage TSV output to mzIdentML, with a separate Peptide id entry for each PSM, not each unique peptide.
# This may break some third-party software, and should be fixed in future versions. The software version is assumed to be the latest version
# of Sage (currently 0.14.5). There are currently (February 2024) no CV terms for Sage. Sage outputs "hyperscore". Assume that this is X!Tandem
# hyperscore for now.
#

# Check for input file
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 results.sage.tsv (or name of Sage results TSV file, if changed)"
    exit 1
fi

# Get the current date and time (of the conversion to mzIdentML)
creationDate=`date -I'ns' | tr ',' '.' | awk 'sub("00\+.+","Z")'`

echo "creationDate" "$creationDate"

# Check Sage version (assumes sage is on the PATH)
if [ "$(which sage)" ]; then
    version=`sage --version | cut -f2 -d ' '`
else
    version="UNKNOWN"
fi

echo "Sage version" "$version"

inputfile=$1
outputfile="${inputfile%.tsv}.mzid"

# Start and sequence collection of the mzIdentML file
awk -v version="$version" -v creationDate="$creationDate" '
BEGIN {
    print("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    printf("<MzIdentML xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" id=\"\" xsi:schemaLocation=\"http://psidev.info/psi/pi/mzIdentML/1.1 http://www.psidev.info/files/mzIdentML1.2.0.xsd\" creationDate=\"%s\" version=\"1.2.0\" xmlns=\"http://psidev.info/psi/pi/mzIdentML/1.2\">", creationDate);
    print(" <cvList>");
    print(" <cv fullName=\"Proteomics Standards Initiative Mass Spectrometry Vocabularies\" version=\"4.1.99\" uri=\"https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo\" id=\"PSI-MS\" />");
    print(" <cv fullName=\"UNIMOD\" uri=\"https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo\" id=\"UNIMOD\" />");
    print(" <cv fullName=\"UNIT-ONTOLOGY\" uri=\"http://obo.cvs.sourceforge.net/viewvc/obo/obo/ontology/phenotype/unit.obo\" id=\"UO\" />");
    print(" </cvList>");
    print(" <AnalysisSoftwareList>");
    printf(" <AnalysisSoftware id=\"Sage\" version=\"%s\" uri=\"https://github.com/lazear/sage\">\n", version);
    print(" <SoftwareName>");
    print(" <cvParam name=\"X!Tandem\" cvRef=\"PSI-MS\" accession=\"MS:1001456\" />"); # Pretend Sage hyperscores are X!Tandem hyperscores
    print(" </SoftwareName>");
    print(" </AnalysisSoftware>");
    print(" </AnalysisSoftwareList>");
    print(" <SequenceCollection>");
}
(NR>1) {
    printf(" <Peptide id=\"%s_%09d\">\n", $2, NR-1);
    printf(" <PeptideSequence>%s</PeptideSequence>\n", $2);
    printf(" </Peptide>\n");
    printf(" <PeptideEvidence id=\"%s_%09d_%s\" dBSequence_ref=\"DBSeq_%s\" peptide_ref=\"%s_%09d\" />\n", $2, NR-1, $3, $3, $2, NR-1);
    printf(" <DBSequence id=\"DBSeq_%s\" searchDatabase_ref=\"SearchDB_1\" accession=\"%s\">\n", $3, $3);
    printf(" <Seq>UNKNOWN</Seq>\n");
    printf(" <cvParam name=\"protein description\" value=\"\" cvRef=\"PSI-MS\" accession=\"MS:1001088\" />\n");
    print(" </DBSequence>");
}
END {
    print(" </SequenceCollection>");
}
' "$inputfile" > "$outputfile"

# Analysis data and end of the mzIdentML file
awk '
BEGIN {
    print(" <DataCollection>");
    print(" <AnalysisData>");
    print(" <SpectrumIdentificationList id=\"SIL_1\">");
}
(NR>1) {
    sub("scan=", "", $8)
    printf(" <SpectrumIdentificationResult id=\"SIR_%i\" spectrumID=\"index=%i\" spectraData_ref=\"SD_%i\" name=\"%s\">\n", NR-2, NR-2, NR-1, $8);
    printf(" <SpectrumIdentificationItem id=\"SII_%i_%i\" chargeState=\"%i\" experimentalMassToCharge=\"%f\" calculatedMassToCharge=\"%f\" peptide_ref=\"%s_%09d\" rank=\"%i\">\n", NR-2, 1, $13, $11, $12, $2, NR-1, $9);
    printf(" <PeptideEvidenceRef peptideEvidence_ref=\"%s_%09d_%s\" />\n", $2, NR-1, $3);
    printf(" <cvParam name=\"X!Tandem:hyperscore\" value=\"%f\" cvRef=\"PSI-MS\" accession=\"MS:1001331\" />\n", $20); # Pretend to be X!Tandem hyperscore
    print(" </SpectrumIdentificationItem>");
    print(" </SpectrumIdentificationResult>");
}
END {
    print(" </SpectrumIdentificationList>");
    print(" </AnalysisData>");
    print(" </DataCollection>");
    print("</MzIdentML>");
}
' "$inputfile" >> "$outputfile"

echo "Conversion completed: $outputfile"
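The `.mzid` name in the completion message is derived purely with Bash parameter expansion; a minimal standalone sketch of that step:

```shell
# Strip a trailing .tsv suffix and append .mzid, as the converter does.
inputfile="results.sage.tsv"
outputfile="${inputfile%.tsv}.mzid"
echo "$outputfile"  # prints: results.sage.mzid
```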
10 changes: 10 additions & 0 deletions cwl-tools/Sage-proteomics/test/debug-in-docker.sh
@@ -0,0 +1,10 @@
#!/bin/bash

docker run -it --rm \
    --entrypoint /bin/bash \
    --mount=type=bind,source=/Users/peter/repos/bakeoff/containers/cwl-tools/Sage/test/data,target=/data/ \
    ghcr.io/lazear/sage:v0.14.7

# sage -o /data/output -f /data/small.fasta /data/config.json /data/small.mzML

# docker run -it --rm -v ${PWD}:/data sage:latest /app/sage -o /data /data/config.json
8 changes: 8 additions & 0 deletions cwl-tools/Sage-proteomics/test/input.yml
@@ -0,0 +1,8 @@
Sage_in_1:
  class: File
  format: http://edamontology.org/format_3244
  path: https://raw.githubusercontent.com/Workflomics/DemoKit/refs/heads/main/data/inputs/small.mzML
Sage_in_2:
  class: File
  format: http://edamontology.org/format_1929
  path: https://raw.githubusercontent.com/Workflomics/DemoKit/refs/heads/main/data/inputs/small.fasta
3 changes: 3 additions & 0 deletions cwl-tools/Sage-proteomics/test/run-cwl.sh
@@ -0,0 +1,3 @@
#!/bin/bash

cwltool --outdir output ../Sage-proteomics.cwl ./input.yml
54 changes: 54 additions & 0 deletions test_cwl_annotations.sh
@@ -0,0 +1,54 @@
#!/bin/bash

# This script runs the test scripts in the 'test' directory of each tool,
# testing whether the CWL annotations pass as stand-alone workflow steps.
# This requires `cwltool` and `docker` to be installed.

# Set the base directory
base_dir="cwl-tools"

# Create an array and counters for passed and failed tests and missing scripts
declare -a failed_tests
passed_tests_count=0
failed_tests_count=0
missing_script_count=0

# Iterate over all directories (tool names) in the base directory
for tool_dir in "$base_dir"/*/; do
    # Define the path to the 'test' directory for the current tool
    script_dir="${tool_dir}test"

    # Check if the 'run-cwl.sh' script exists
    if [ -f "$script_dir/run-cwl.sh" ]; then
        echo "Testing $script_dir..."

        (cd "$script_dir" && bash "./run-cwl.sh")
        if [ $? -eq 0 ]; then
            ((passed_tests_count++))  # Increment the passed tests counter
        else
            failed_tests+=("$script_dir/run-cwl.sh")
            ((failed_tests_count++))  # Increment the failed tests counter
        fi
    else
        echo "❗ Script not found in directory: $script_dir"
        ((missing_script_count++))  # Increment the missing script counter
    fi
done

# Print the missing test script count
if [ $missing_script_count -gt 0 ]; then
    echo "❗ $missing_script_count tools did not have a test script"
fi

# Print a summary of the test results
if [ $failed_tests_count -eq 0 ]; then
    echo "✅ $passed_tests_count tests passed successfully"
    exit 0
else
    echo "The following tests failed:"
    for test in "${failed_tests[@]}"; do
        echo "🚨 $test"
    done
    exit 1
fi
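The pass/fail bookkeeping in the script boils down to a Bash counter-and-array pattern; here is a self-contained sketch with hard-coded exit codes standing in for three test runs (the `run-` names are illustrative):

```shell
# Tally fake exit codes the same way the test runner tallies results.
declare -a failed_tests
passed=0
failed=0
for rc in 0 1 0; do
    if [ "$rc" -eq 0 ]; then
        passed=$((passed + 1))          # count a passing run
    else
        failed_tests+=("run-$failed")   # remember which run failed
        failed=$((failed + 1))
    fi
done
echo "$passed passed, $failed failed"   # prints: 2 passed, 1 failed
```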