# Introduction
At the Fred Hutch we have configured software from the Broad called Cromwell that allows us to run WDL workflows on our local cluster and then easily port them to other cloud-based compute infrastructure when desired. This lets us simplify workflow testing and design, leverage WDL for smaller-scale work that does not need the cloud, and gives users of all kinds a way to manage their workflows over time. This guide is intended to be a hands-on introduction to getting started with Cromwell at the Hutch to run WDL workflows. Once you know how to run WDLs at the Fred Hutch, we hope you'll be able to start designing your own WDLs and converting existing processes and analyses into the specification in order to run them more effectively.

## What is WDL?

[WDL](https://wdl-docs.readthedocs.io/en/1.0.0/) is an open specification for a workflow description language that originated at the Broad but has grown to a much wider audience over time. WDL workflows are run using an engine: software that interprets and runs your WDL on various high performance computing resources, such as SLURM (the Fred Hutch local cluster), AWS (Amazon Web Services), Google Cloud, and Azure. While this guide won't go into the details of WDL syntax, there are resources linked in our Summary section, and we are actively developing additional courses on the topic.

## What is Cromwell?
Cromwell is a workflow engine (sometimes called a workflow manager) developed by the Broad that manages the individual tasks involved in multi-step workflows, tracks job metadata, provides an API interface, and allows users to manage multiple workflows simultaneously. Cromwell isn't the only WDL "engine" that exists, but it is the tool that has been configured for use on the Fred Hutch gizmo cluster in order to make running workflows here very simple.


## Using Cromwell
In general, Cromwell works best when run in server mode, which means that users run a Cromwell server as a job on our local SLURM cluster, connected to a database dedicated to Cromwell workflow tracking. This Cromwell server job then behaves as the workflow coordinator for that user, allowing the user to send instructions for multiple workflows running simultaneously. The Cromwell server parses these workflow instructions, finds and copies the relevant input files, sends the tasks to `Gizmo` to be processed, coordinates the results of those tasks, and records all of the metadata about what is happening in its database.

This means that individual users can:
- run multiple independent workflows at the same time using one Cromwell server,
- use cached results when a previously run task is identical to the current one,
- track the status of workflows and tasks while they are running via multiple methods (for example, the REST API sketched below),
- customize the locations of input data, intermediate data, and workflow outputs to data storage resources appropriate to the data type (re: cost, backup, and accessibility),
- query the Cromwell database for information about workflows run in the past, including where their outputs were saved and a variety of other workflow- and task-level metadata.
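
Here is a minimal sketch of that REST API interaction using `curl`, assuming Cromwell's standard REST endpoints. The node and port are placeholders (they come from the node:port information your server reports when it starts), and the workflow file names are hypothetical:

```
# Submit a workflow plus its inputs to your Cromwell server
# (placeholders: <node>, <port>, and the file names)
curl -X POST "http://<node>:<port>/api/workflows/v1" \
  -F workflowSource=@myWorkflow.wdl \
  -F workflowInputs=@myWorkflow-inputs.json

# Check on a workflow later using the id returned by the submission above
curl "http://<node>:<port>/api/workflows/v1/<workflow-id>/status"
```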



# Getting Started with Cromwell

In `cromUserConfig.txt` there are some variables that allow users to share a similar server setup while customizing it for their own use.
The following text is also in this repo but these are the customizations you'll need to decide on for your server.
```
################## WORKING DIRECTORY AND PATH CUSTOMIZATIONS ###################
## Where do you want the working directory to be for Cromwell?
## Note: starting the server will create a subdirectory called "cromwell-executions" in the directory you specify here.
## Please include the leading and trailing slashes!!! You likely will want to include your username in the path,
## just in case others in your lab are ALSO using Cromwell, to reduce confusion.
### Suggestion: /fh/scratch/delete90/pilastname_f/username/
SCRATCHDIR=/fh/scratch/delete90/...
## Where do you want logs about individual workflows (not jobs) to be written?
## Note: this is a default for the server and can be overwritten for a given workflow in workflow-options.
## Most of the time workflow troubleshooting occurs without having to refer to these logs, but the ability to generate them can be useful.
### Suggestion: ~/cromwell-home/workflow-logs
WORKFLOWLOGDIR=~/cromwell-home/workflow-logs
## Where do you want to save Cromwell server logs for troubleshooting Cromwell itself?
## You'll want this handy in the beginning: when Cromwell cannot start up, this is where you'll go to do your troubleshooting.
### Suggestion: ~/cromwell-home/server-logs
SERVERLOGDIR=~/cromwell-home/server-logs
################ DATABASE CUSTOMIZATIONS #################
## DB4Sci MariaDB details (replace each `...` with your own values, as unquoted text):
CROMWELLDBPORT=...
CROMWELLDBNAME=...
CROMWELLDBUSERNAME=...
CROMWELLDBPASSWORD=...
## Number of cores for your Cromwell server itself - usually 4 is sufficient.
## Increase if you want to run many complex workflows simultaneously or notice your server is slowing down.
## Keep in mind these CPUs count toward your lab's allocation, so keep the request fairly minimal.
NCORES=4
## Length of time you want the server to run for.
## Note: when a server goes down, any jobs it has already sent will continue to run. When you start up a server the next time
## using the same database, the new server will pick up wherever the previous workflows left off. "7-0" is 7 days, zero hours.
SERVERTIME="7-0"
```


> Note: For this server, you will want multiple cores to allow it to multi-task. If you notice issues, the resource request for the server job itself might be a good place to start adjusting, in conjunction with some guidance from SciComp or the folks in the Slack [Question and Answer channel](https://fhbig.slack.com/archives/CD3HGJHJT).

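
To make this concrete, a hypothetical filled-in `cromUserConfig.txt` might look like the sketch below. Every value shown (paths, database name, port, credentials) is a made-up placeholder; substitute the details from your own DB4Sci database and scratch space:

```
SCRATCHDIR=/fh/scratch/delete90/pilastname_f/username/
WORKFLOWLOGDIR=~/cromwell-home/workflow-logs
SERVERLOGDIR=~/cromwell-home/server-logs
CROMWELLDBPORT=32222
CROMWELLDBNAME=cromwelldb_username
CROMWELLDBUSERNAME=username
CROMWELLDBPASSWORD=not-a-real-password
NCORES=4
SERVERTIME="7-0"
```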
### Kick off your Cromwell server


## Starting up your server in the future
Good news! The above instructions are a one time event. In the future, when you want to start up a Cromwell server to do some computing work, all you'll have to do is:

1. Get onto Rhino in Terminal
2. Change to the `cromwell-home` directory you made
3. Enter: `./cromwell.sh cromUserConfig.txt` and you're off to the races!

Congrats, you've started your first Cromwell server!!


# Using Cromwell at Fred Hutch
Good news! Once you've worked through the Getting Started section, you won't have to do that again. Ongoing use of Cromwell at the Hutch is a bit more straightforward. In this chapter we'll discuss the steps for using Cromwell on an ongoing basis, describe the Fred Hutch specific configuration details, and provide some test workflows you can use to try out the interfaces to Cromwell we have at the Hutch.

## Everyday Usage
To get started using Cromwell, you'll first do these steps:

1. Log into Rhino
2. Go to your `cromwell-home` directory
3. Kick off a server job using the command: `./cromwell.sh cromUserConfig.txt`
4. Wait for a successful response and the node:port information for your server!

That's it! Now your Cromwell server will run for a week by default (unless you have set a different server length in `cromUserConfig.txt`). It will be accessible to submit workflows to and execute them whenever you want through multiple mechanisms that we'll describe in the next chapters. Next week you can simply repeat the above to restart your server and it'll be ready again!
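
Put together, a typical weekly session to bring your server back up might look like this sketch (the username is hypothetical and the paths assume you set up `cromwell-home` in your home directory; adjust them to match your own setup):

```
# Log into Rhino (use your own HutchNet ID)
ssh username@rhino
# Change to the directory you made during Getting Started
cd ~/cromwell-home
# Kick off the Cromwell server job and note the node:port it reports
./cromwell.sh cromUserConfig.txt
```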

Don't worry if you have a workflow still running at the end of the week when your server job ends: when you start a new server job, it will automatically check the current status of any previously running workflows, then pick up and finish anything that is left to do. While you can adjust your configuration file so the Cromwell server runs for more than 7 days, we've found that servers tend to run much faster when they are occasionally "rebooted" like this, and it is also more polite to your lab members not to always have a server running that isn't busy coordinating a workflow.


## Test Workflows
Once you have a server up and running, you'll want to check out our [Test Workflow GitHub repo](https://github.com/FredHutch/wdl-test-workflows) and run through the tests specified in the markdowns there. The next chapters will guide you through the most common mechanisms for submitting workflows to your server, so you'll want to clone this repo to your local computer to have the files handy. The test workflows are also useful templates to start editing from when you craft your first custom workflow later.

> Note: For test workflows that use Docker containers, know that the first time you run them, you may notice that jobs aren't being sent very quickly. That is because, for our cluster, we need to convert those Docker containers to something that can be run by Singularity. The first time a Docker container is used it must be converted, but after that Cromwell will use the cached version of the Docker container and jobs will be submitted more quickly.

## Runtime Variables

Cromwell can help run WDL workflows on a variety of computing resources such as SLURM clusters (like the Fred Hutch cluster), as well as AWS, Google and Azure cloud computing systems. Using WDL workflows allows users to focus on their workflow contents rather than the intricacies of a particular computing platform. However, there are optimizations of how those workflows run that may be specific to each computing platform or to individual tasks in your workflow. Writing your workflow as a WDL allows you to request only the resources each individual task will use each time a job is submitted to the Gizmo cluster. This maximizes the utilization of the computing resources you request and lets you run workflows much faster than making a single SLURM job request and working within that allocation (such as via a `grabnode` process or a single bash script).

We'll discuss some of the available customizations to help you run WDLs on our cluster that still allow those workflows to be portable to other computing platforms.


### Standard Runtime Variables

These runtime variables can be used on any computing platform; the values given here are the defaults for our Fred Hutch configuration, where a default is set.

- `cpu: 1`
  - An integer number of CPUs you want for the task.
- `memory: 2000`
  - An integer number of MB of memory you want to use for the task. Other accepted formats include `memory: "2GB"` and `memory: taskMemory + "GB"` (in the latter case the memory to use is a variable called `taskMemory` that is specified in the task itself).
- `docker: `
  - A specific Docker container to use for the task. An example of the value for this variable is `"ubuntu:latest"`. No default container is specified in our configuration; only set this variable when a task should run inside a Docker container, in which case you'll want to specify both the container name and a specific version. If it is left unset or left out of a task's runtime block completely, the Fred Hutch configuration will run the task as a regular job and not use Docker containers at all. For the custom Hutch configuration, Docker containers can be specified and the necessary conversions (to Singularity) will be performed by Cromwell (not the user).

> Note: when Docker is used, soft links cannot be used in our filesystem, so workflows using very large datasets may run slightly slower because Cromwell must copy files rather than link to them.
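
For example, a task's `runtime` block that combines these standard variables might look like the following sketch (the values are illustrative, not recommendations):

```
runtime {
    cpu: 4
    memory: "8GB"
    docker: "ubuntu:latest"
}
```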

### Fred Hutch Custom Runtime Variables
For the `gizmo` cluster, the following custom runtime variables are available (below we show each variable with its current default value). You can change these variables in the `runtime` block of individual tasks in a WDL file. Note that these variables will not be understood by Cromwell or other WDL engines when the workflow is run somewhere other than the Fred Hutch cluster!

>Note: when values are specified in the runtime blocks of individual tasks in a workflow, those values will override these defaults for that task only!!

- `walltime: "18:00:00"`
  - A string ("HH:MM:SS") that specifies how much time you want to request for the task. You can also specify more than 1 day, e.g. "1-12:00:00" is 1 day + 12 hours.
- `partition: "campus-new"`
  - Which cluster partition to use. The default is `campus-new`; other options currently include `restart` and `short`, but check [SciWiki](https://sciwiki.fredhutch.org/scicomputing/) for updated information.
- `modules: ""`
  - A space-separated list of the environment modules you'd like to load (in that order) prior to running the task. See below for more about software modules.
- `dockerSL: `
  - This is a custom configuration for the Hutch that allows users to use Docker with soft links, restricted to specific locations in Scratch. It is helpful when working with very large files. An example of the value for this variable is `"ubuntu:latest"`. Just like the `docker:` runtime variable, only specify this if you want the task to run in a container (otherwise the default will be a non-containerized job).
- `account: `
  - This allows users who run jobs for multiple PI accounts to specify which account to use for each task, to manage cluster allocations. An example of the value for this variable is `"paguirigan_a"`, following the pilastname_f pattern.
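
As an illustration, a task that needs more walltime, a specific partition, and a particular PI account might use a `runtime` block like this sketch (all values here are hypothetical):

```
runtime {
    cpu: 2
    memory: "16GB"
    walltime: "1-00:00:00"
    partition: "campus-new"
    account: "pilastname_f"
}
```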

## Managing Software Environments


### Modules
At Fred Hutch we have a huge array of pre-curated software modules installed on our SLURM cluster, which you can [read about in SciWiki](https://sciwiki.fredhutch.org/scicomputing/compute_scientificSoftware/). The custom configuration of our Cromwell server allows users to specify one or more modules to use for individual tasks in a workflow. The desired module(s) can be requested for a task in the `runtime` block of your calls like this:

```
runtime {
modules: "GATK/4.2.6.1-GCCcore-11.2.0 SAMtools/1.16.1-GCC-11.2.0"
}
```

In this example we specify two modules, separated by a space (with quotes surrounding the whole list). The GATK module will be loaded first, followed by the SAMtools module. You'll also note the "toolchain" used to build each module is the same ("GCC-11.2.0"). When you load more than one module for a single task, it is important to ensure that they are compatible with each other; choose versions built with the same toolchain if you can.
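
To find the exact module name and version strings to use in that list, you can search the module system from a Rhino session. This is a minimal sketch assuming the standard Lmod `module` commands that back our module system:

```
# List available versions of a tool on the cluster
module avail GATK
# Show more detail, including versions and how to load them
module spider SAMtools
```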

### Docker
If you want to move your WDL workflow to the cloud in the future, you'll want to leverage Cromwell's ability to run your tasks in Docker containers. Users can specify Docker containers in runtime blocks. Cromwell will maintain a local cache of previously used containers, facilitating the pull and conversion of Docker containers for use. This behavior helps us avoid rate-limiting by DockerHub and improves the speed of your workflows. We will dig into Docker containers more in the next class.


Now you're ready to start learning about how to submit our test workflows to your Cromwell server!