Developer Documentation

Mike Lloyd edited this page Jul 27, 2023 · 6 revisions

Developer Information

Repository Structure:

bin: This folder holds scripts that are shared among pipelines or specific to pipelines. These scripts are typically Python, Perl, or other languages that will be accessed by various containers.

config: This folder holds all pipeline configuration (config) files. These config files hold the default values for the pipeline and are a good place to see what parameter values are named or set to.

modules: This folder holds the module .nf files (for example, gatk.nf). Each file defines processes along with their containers and publish directories.

main.nf: This file contains the logic block related to starting individual pipelines.

nextflow.config: The global configuration file used by all workflows. This file is reserved for configuration of things such as HPC batch job submission and Singularity execution.

Example Module Process (modules/quality_stats.nf)

Line 1: Names the process QUALITY_STATISTICS. By convention the process is in all capital letters.

Line 3: sets the tag for this process to sampleID. A tag is required. By convention it is set to sampleID.

sampleID is defined by each workflow for each FASTQ file (or pair of files) being processed. It is usually set by Nextflow as input files are first parsed (see workflow example below). The tag directive allows for association of each process with a custom label, so that it will be easier to identify them in the log file or in the trace execution report.

Lines 5-8: define the compute parameters (i.e., cpus, memory [RAM], wallclock) to be used by the process, and any additional cluster options (e.g., -q batch). This information is required for each process and should be reasonable relative to the expected requirements of the process. It is important to profile processes before setting these values, because arbitrarily large values will greatly constrain how jobs are issued and queued.

Line 10: defines the container to be used by the process. In this case the container quay.io/jaxcompsci/python-bz2file:2.7.18 is an external quay.io container. It is best practice to refer to containers as links to repositories such as Docker Hub or Quay. Using local containers is NOT acceptable. In the majority of cases, a container that has the software necessary for a Nextflow process already exists. Biocontainers is a helpful resource for finding publicly available containers.

When a pipeline is run, if a container does not already exist in the “singularity cache” directory then the container will be pulled and stored there. If the container already exists in the cache directory, that container will be used for execution. The cache location is defined for JAX users in the global nextflow.config file as “/projects/omics_share/meta/containers”.
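The directives described so far might be laid out as in the following minimal sketch. The process name EXAMPLE_STATS, the resource values, and the script body are placeholders chosen for illustration and are not the contents of modules/quality_stats.nf; only the container name and the -q batch option are taken from the text above.

```nextflow
nextflow.enable.dsl = 2

process EXAMPLE_STATS {
  // Label each task with its sample so it is easy to find in logs and trace reports.
  tag "$sampleID"

  // Compute resources: placeholder values; profile the real tool before choosing them.
  cpus 1
  memory 2.GB
  time 1.h
  clusterOptions '-q batch'

  // Container referenced from a public registry rather than a local image.
  container 'quay.io/jaxcompsci/python-bz2file:2.7.18'

  input:
  val(sampleID)

  output:
  stdout

  script:
  """
  echo "processing ${sampleID}"
  """
}

workflow {
  // Hypothetical sample names stand in for real input files.
  EXAMPLE_STATS(Channel.of('sampleA', 'sampleB'))
  EXAMPLE_STATS.out.view()
}
```

The container directive only takes effect when a container engine such as Singularity is enabled in the configuration; without one, the sketch runs the echo command directly on the host.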

Line 12: defines the publishDir. This uses two options from the global nextflow.config: params.pubdir and params.organize_by. The former (params.pubdir) is the location to which pipeline output files are copied. It is set by the user at runtime and has no default. The directory is created at the specified location. We highly suggest using output directories with ample space to collect final files.

The second option, organize_by, is set to “sample” by default, so that each output folder is named for a sampleID and all subsequent analysis outputs for that sample are stored in it. The alternative is to set this to “analysis”, where each analysis has a folder and each sample's output for that analysis is placed in it. For example, after trimming the sequences you may want a single folder with all of the trimmed sequences, rather than sample-specific folders that you need to traverse to find them.

Notes on output:

  1. Copy, NOT move, is the preferred method to publish output.

    • e.g., publishDir "${params.pubdir}/${ params.organize_by=='sample' ? sampleID : 'rsem' }", pattern: "*stats", mode:'copy'
      where mode:'copy' specifies that the output is copied to the output directory.
  2. Outputs can be organized into sub-directories.

    • e.g., publishDir "${params.pubdir}/${ params.organize_by=='sample' ? sampleID+'/stats' : 'quality_stats' }", pattern: "*fastq.gz_stat", mode:'copy'
      where sampleID+'/stats' specifies that the output is placed in a stats subdirectory when output is organized by sample.
  3. Certain outputs can be defined as 'intermediate' and will only be saved when requested by the user.

    • e.g., publishDir "${params.pubdir}/${ params.organize_by=='sample' ? sampleID : 'bwa_mem' }", pattern: "*.sam", mode:'copy', enabled: params.keep_intermediate
      where enabled: params.keep_intermediate specifies that the output is only saved when keep_intermediate is true.
  4. Not all outputs in the output stream from a tool or module need to be published.

Lines 14-15: capture inputs. The majority of inputs are passed to the module as tuples with a sampleID, which is a value, followed by the file(s) of interest. In general, one file per input is preferred so that parsing and decision making stay transparent. However, in this case we have a reads channel that may contain paired-end or single-end reads. The module is coded to allow for either PE or SE data, and contains a logic statement in the script section to process the data based on type (see lines 24-31). The use of logic and parameter passing is discussed further below.

Lines 17-19: describe the outputs. Again, outputs are preferred to be a tuple with sampleID and a single file of interest where possible. However, this may not always be the case. Using the emit argument at the end of an output allows simple syntax in the workflow, so that information is transparently passed from one process to another. This will be shown and discussed further in the example workflow below, but briefly, outputs that are named with emit are accessible to downstream processes with the following syntax: PROCESS_NAME.out.emit_name, where PROCESS_NAME is the name of the process. In this case, to get the trimmed fastq files we would write QUALITY_STATISTICS.out.trimmed_fastq, where trimmed_fastq was set by emit: trimmed_fastq.
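As a rough illustration of tuple inputs and named outputs, the sketch below declares two outputs with emit and passes one of them to a second process. The process names TRIM_SKETCH and COUNT_LINES, the stats emit name, and the echo commands are hypothetical stand-ins for real tools. The sketch uses the path qualifier, which is the newer equivalent of the file qualifier.

```nextflow
nextflow.enable.dsl = 2

process TRIM_SKETCH {
  tag "$sampleID"

  input:
  tuple val(sampleID), val(n_reads)

  output:
  // Each output is a tuple of sampleID plus one file, named with emit.
  tuple val(sampleID), path("${sampleID}.stat"), emit: stats
  tuple val(sampleID), path("${sampleID}.trimmed"), emit: trimmed_fastq

  script:
  """
  echo "reads: ${n_reads}" > ${sampleID}.stat
  echo "placeholder for trimmed reads" > ${sampleID}.trimmed
  """
}

process COUNT_LINES {
  tag "$sampleID"

  input:
  tuple val(sampleID), path(trimmed)

  output:
  tuple val(sampleID), path("${sampleID}.count"), emit: line_count

  script:
  """
  wc -l < ${trimmed} > ${sampleID}.count
  """
}

workflow {
  // Hypothetical samples; a number stands in for the read files.
  samples = Channel.of(['sampleA', 100], ['sampleB', 250])

  TRIM_SKETCH(samples)
  // Only the output named trimmed_fastq is passed downstream.
  COUNT_LINES(TRIM_SKETCH.out.trimmed_fastq)
  COUNT_LINES.out.line_count.view()
}
```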

Lines 21-36: are the script section. This section is where the tool or code is called and final logic statements are executed. The command call is fully dependent on the container that is to be used in the process. To determine the required command, command path, and command options, it is recommended to test the container and commands prior to coding the script step. In this example, Python and all dependencies are already present in the container path.

Line 22: calls the log.info statement, which writes a message to the .nextflow.log as well as the slurm-*jobid*.out files. This is a required command.

Lines 24-31: is a logic block that references a global parameter: params.read_type. This logic sets a parameter used in the command that follows.

In this case it is expected that the global parameter read_type will be set by a configuration file (see for example config/rnaseq.config), or as a flag at run time by the user. This logic block sets the variable inputfq, which is a list of reads and will have 1 or 2 read files depending on whether fq_reads has 1 or 2 files. A second variable, mode_HQ, is also set in this logic block; it takes a slightly different command structure for SE or PE data. The use of such logic blocks, passed parameters, and variables gives flexibility to the module process. In this case, it allows one process to handle either paired-end or single-end data inputs.

Note: You can pass strings to the process from the workflow, which provides another way to modify commands within a process. An empty variable can be used as a placeholder in a process for any additional optional parameters not captured in the command block.
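A logic block of this kind could look like the sketch below, which only echoes the command it would run. The process name BUILD_COMMAND, the two flag values, the file names, and the default for params.read_type are all hypothetical; the real flags depend on the script being called.

```nextflow
nextflow.enable.dsl = 2

// Assumed default; in the pipelines this is set in a config file or at run time.
params.read_type = 'PE'

process BUILD_COMMAND {
  tag "$sampleID"

  input:
  tuple val(sampleID), val(fq_reads)

  output:
  stdout

  script:
  // Start from the single-end case, then override for paired-end data.
  def inputfq = "${fq_reads[0]}"
  def mode_HQ = '--single-end'      // hypothetical flag
  if (params.read_type == 'PE') {
    inputfq = "${fq_reads[0]} ${fq_reads[1]}"
    mode_HQ = '--paired-end'        // hypothetical flag
  }
  """
  echo "filter_trim ${mode_HQ} ${inputfq}"
  """
}

workflow {
  // Hypothetical file names; plain strings are used so the sketch runs without data.
  reads = Channel.of(['sampleA', ['sampleA_R1.fastq.gz', 'sampleA_R2.fastq.gz']])
  BUILD_COMMAND(reads)
  BUILD_COMMAND.out.view()
}
```

Running the sketch with --read_type SE on the command line switches the echoed command to the single-end form.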

For example, a variable such as command_options can be added as an expected process input:

process EXAMPLE_PROCESS {
  ...
  input:
  tuple val(sampleID), file(input)
  val(command_options)
  ...
  script:
  ...
  """
  python ${params.filter_trim} ${command_options} ${params.min_pct_hq_reads}  $inputfq
  """
}

When this process is called from a workflow:
EXAMPLE_PROCESS(input_channel, '-a additional_command_flags -b options -c')
the string '-a additional_command_flags -b options -c' will be set to the variable command_options in EXAMPLE_PROCESS.

Lines 33-36: contains the code that executes the tool. In this case, a Python script set by another global parameter ${params.filter_trim}.

The module script is bracketed above and below by three double quotes ("""). These quotes mark the script section and allow Nextflow variables to be used in the block. Nextflow also accepts three single quotes ('''), in which case Nextflow variables are not interpolated and the block is passed to the shell as written. However, it is preferable to keep scripts in the bin/ folder and call them from modules, rather than embedding scripts in the modules themselves. Nextflow variables in a script should be written with a dollar sign and enclosed in curly braces, e.g., ${params.variable} or ${variable}. You can call any global, local, or introspective variable here.
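The difference between Nextflow variables and shell variables inside a double-quoted script block can be seen in this small sketch. The process name QUOTES_DEMO and the parameter params.greeting are made up for the example.

```nextflow
nextflow.enable.dsl = 2

params.greeting = 'hello'   // hypothetical parameter for this sketch

process QUOTES_DEMO {
  output:
  stdout

  script:
  // ${params.greeting} is filled in by Nextflow before the script runs.
  // The escaped \$HOME is left untouched for the shell to expand at run time.
  """
  echo "${params.greeting} from Nextflow"
  echo "home directory: \$HOME"
  """
}

workflow {
  QUOTES_DEMO()
  QUOTES_DEMO.out.view()
}
```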

Example Pipeline Workflow (workflows/rnaseq.nf)

The above is the rnaseq.nf workflow file. A workflow calls modules to perform a series of desired tasks.

Line 1: defines a default location for Nextflow in cases where the script is directly executed rather than submitted to an HPC system.

Line 2: establishes that Nextflow DSL2 (domain specific language 2) is used. DSL2 is an expanded version of Nextflow, which allows for the use of modules in the workflow code. This is a required statement to use DSL2 features.

Lines 5-21: import the modules for the workflow. The import syntax relies on the verb include, which is followed by curly brackets and the name of a process, and the location of the module file in which the process resides. e.g.:

include {PICARD_ADDORREPLACEREADGROUPS;  
         PICARD_REORDERSAM;  
         PICARD_COLLECTRNASEQMETRICS;  
         PICARD_SORTSAM} from '../modules/picard'  

where a semicolon “;” separates process names, and the from verb outside the closing bracket specifies the relative path to the module .nf file that contains the process. Notice that Nextflow will resolve the extension for the module, and you do not need to specify .nf.

IMPORTANT NOTE: Nextflow can run a named process only once within a workflow. If you need to use a process multiple times, you must alias the name of that process. For example:

include {PROCESS;
         PROCESS as ANOTHER_PROCESS;
         PROCESS as YET_ANOTHER_PROCESS} from '../modules/my_module'

The above syntax will allow you to use said PROCESS three times, once for each name or aliased name. That is PROCESS can be used once, ANOTHER_PROCESS can be used once, and YET_ANOTHER_PROCESS can be used once. This constraint also applies to input/output channels that processes use (discussed below).

Lines 24-27: provide the syntax for outputting the help documentation, when help is invoked in the nextflow run statement.

Line 30: emits a log for the run.

Lines 36-48: contain a logic block for parsing samples that are split across lanes. If data are paired, 2 files are expected to match a sampleID string. The sampleID string is determined by a simple '_' split and capture.
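The idea of deriving a sampleID from a '_' split and collecting the lanes that share it can be sketched as follows. This is not the logic from workflows/rnaseq.nf, which is not reproduced here; the file names are hypothetical and plain strings are used so the sketch runs without data.

```nextflow
nextflow.enable.dsl = 2

workflow {
  // Hypothetical file names: sampleA was sequenced on two lanes, sampleB on one.
  file_names = Channel.of(
    'sampleA_L001_R1.fastq.gz',
    'sampleA_L002_R1.fastq.gz',
    'sampleB_L001_R1.fastq.gz'
  )

  // Take everything before the first underscore as the sampleID,
  // then collect all files that share a sampleID into one tuple.
  by_sample = file_names
    .map { name -> [ name.split('_')[0], name ] }
    .groupTuple()

  by_sample.view()
}
```

Each emitted item pairs a sampleID with the list of file names that matched it, which is the shape a lane-concatenation step would need as input.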

Lines 50-56: contain a logic block for dealing with SE vs PE inputs. If data are paired, 2 files are expected to match a sampleID string. If data are single end, only 1 file is expected.

This line: read_ch = Channel.fromFilePairs("${params.sample_folder}/*${params.extension}",checkExists:true, size:1 ) sets the variable read_ch to a channel of tuples of the form [sampleID, [Read 1 file path]].

This line: read_ch = Channel.fromFilePairs("${params.sample_folder}/${params.pattern}${params.extension}",checkExists:true ) sets the variable read_ch to a channel of tuples of the form [sampleID, [Read 1 file path, Read 2 file path]].

Line 59: checks if read_ch has files. If the user specified an empty directory, or a directory without fastq files, or did not set the proper matching string or file ending, this line gracefully exits the pipeline with an error message.
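The SE/PE channel setup and the empty-input check might be combined as in the sketch below. The parameter defaults, the error message, and the use of error inside ifEmpty are assumptions for illustration, and the sketch relies on ifEmpty alone to report missing input. It expects a folder of FASTQ files at the assumed location.

```nextflow
nextflow.enable.dsl = 2

// Assumed values for illustration; the real defaults live in the pipeline config files.
params.sample_folder = './fastq'
params.pattern       = '*_R{1,2}*'
params.extension     = '.fastq.gz'
params.read_type     = 'PE'

workflow {
  // Paired-end data: two files per sampleID. Single-end data: one file per sampleID.
  def paired    = params.read_type == 'PE'
  def glob      = paired ? "${params.sample_folder}/${params.pattern}${params.extension}" : "${params.sample_folder}/*${params.extension}"
  def pair_size = paired ? 2 : 1

  read_ch = Channel
    .fromFilePairs(glob, size: pair_size)
    // Stop with a readable message when nothing matched.
    .ifEmpty { error "No files found in ${params.sample_folder} matching ${params.pattern}${params.extension}" }

  // Each item has the form [sampleID, [file, ...]].
  read_ch.view()
}
```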

Line 62: sets a reference data parameter.

Lines 65-124: are the main workflow.

The workflow consists of linking process calls:

For example: QUALITY_STATISTICS(read_ch) calls the process outlined in the section above. The output from this process is specified by the emit statement(s) within the process: QUALITY_STATISTICS.out.trimmed_fastq. That output is used as the input for the next process: RSEM_ALIGNMENT_EXPRESSION(QUALITY_STATISTICS.out.trimmed_fastq, rsem_ref_files). This linking of processes, output to input to output to input, continues until all required steps in the workflow are completed.

A few additional notes on this workflow, and workflow standards in general:

Line 88: contains a join statement between the output from READ_GROUPS and RSEM_ALIGNMENT_EXPRESSION. The join is required to re-pair the outputs from each step by sampleID. By default, join connects tuples (output streams) on index 0. In all cases for the NGS pipelines, index 0 is sampleID. IMPORTANT NOTE: if you fail to join output streams in this way, sample swaps can occur.

Line 85: READ_GROUPS(QUALITY_STATISTICS.out.trimmed_fastq, "picard") passes the trimmed_fastq from QUALITY_STATISTICS to READ_GROUPS, and also passes the string "picard" as an input. This type of string addition allows for flexible logic and commands in the READ_GROUPS (and other) processes. See the additional notes in the module section above on using strings as variables within the process itself.
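The effect of join can be seen with two small channels that emit the same samples in different orders. The channel names and file names below are hypothetical placeholders, not the actual outputs of READ_GROUPS or RSEM_ALIGNMENT_EXPRESSION.

```nextflow
nextflow.enable.dsl = 2

workflow {
  // Two output streams that finished in different orders (hypothetical values).
  alignments  = Channel.of(['sampleA', 'sampleA.bam'], ['sampleB', 'sampleB.bam'])
  read_groups = Channel.of(['sampleB', 'sampleB_rg.txt'], ['sampleA', 'sampleA_rg.txt'])

  // join matches items on index 0 (sampleID), regardless of arrival order.
  paired = alignments.join(read_groups)

  // Each item has the form [sampleID, bam, read group file].
  paired.view()
}
```

Because matching is done on the sampleID key rather than on arrival order, the BAM for sampleA is always paired with the read group file for sampleA.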

Line 74: is a logic block that performs different processes if a pipeline is run with "${params.gen_org}" == 'human'. The use of logic blocks of this nature provides flexibility to the workflow when multiple differing sets of related analyses are required based on species, strain or other analysis factors.
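A species-dependent branch of this kind could be sketched as below. The process names COMMON_STEP and HUMAN_ONLY_STEP, the echo commands, and the default value of params.gen_org are hypothetical and stand in for the real analyses.

```nextflow
nextflow.enable.dsl = 2

params.gen_org = 'mouse'   // assumed default for this sketch

process COMMON_STEP {
  tag "$sampleID"

  input:
  val(sampleID)

  output:
  val(sampleID)

  script:
  """
  echo "common step for ${sampleID}"
  """
}

process HUMAN_ONLY_STEP {
  tag "$sampleID"

  input:
  val(sampleID)

  output:
  stdout

  script:
  """
  echo "human-specific step for ${sampleID}"
  """
}

workflow {
  COMMON_STEP(Channel.of('sampleA', 'sampleB'))

  // The extra step is only added to the run when the organism is human.
  if (params.gen_org == 'human') {
    HUMAN_ONLY_STEP(COMMON_STEP.out)
    HUMAN_ONLY_STEP.out.view()
  }
}
```

Running the sketch with --gen_org human adds the second process to the run; with the assumed default, only COMMON_STEP is executed.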