


# Documentation for Version 1.0.0

## What's New

gkno has been updated to make use of the NetworkX Python library for graphs. Pipelines are now constructed as a graph, as discussed later in the documentation.

## Coming soon

Active integration of gkno with iobio, the web-based, real-time, visually driven analysis suite, is underway. Keep your eyes open for high-level integration later this year.


## Overview

As next-generation DNA sequencing becomes increasingly commonplace, the demand for powerful, sophisticated, yet easy to use analysis software has increased dramatically. The Marth lab at Boston College is at the forefront of genomic software development, addressing a large fraction of the analysis problems from read mapping to variant analysis. To best serve the research community, the _gkno_ package has been developed to address the following requirements of next-generation data analysis:
  1. A unified launcher bringing together software tools into a single environment,
  2. a general framework for generating linked lists of tools allowing integrated pipelines to be constructed,
  3. and a web environment providing easy access to documentation and tutorials, user forums, blogs and bug reporting.

The web environment and the Twitter feed @gknoProject keep people up to date on the work being performed with gkno, as well as useful information that users post in the forum. The documentation and tutorials provide clear instructions on how to download and execute gkno, along with more in-depth information about the included tools, pipelines and configuration files.

A core goal of the package is to enable inexperienced users to simply download and execute predetermined analysis pipelines in order to generate sensible results for their research projects. The intricacies of the pipelines (including the choice of processing tools and sensible parameter sets) are all hidden in configuration files, and only advanced users need interrogate them.

## Terminology

Throughout this documentation, the terminology used is generally closely related to Python objects. In descriptions of json files, what in json terminology are termed objects are referred to as dictionaries, and json arrays are termed lists. When Python libraries parse the json files, these are the Python objects in which the values are stored.

## Installing gkno

### Step 1

The following tools are necessary to obtain and install gkno. They are probably already present on most Unix-style systems, but if not, are available through the system's package manager.

  • git: required to clone the gkno repository.

  • Mac users can simply install Xcode. This is necessary anyway for other dependencies (gcc/g++), and already provides git "out of the box."

### Step 2

Type the following commands:

git clone https://github.com/gkno/gkno_launcher.git

cd gkno_launcher

./gkno build

The build step begins by checking the user's system for the following additional dependencies:

  • ant
  • cmake
  • gcc / g++
  • java / javac
  • make

These mostly consist of support for building and running the component tools. If any of these are missing or below the required minimum version, gkno will print a message indicating which tool(s) need to be installed or updated. After all dependencies are satisfied, gkno will initialize all of its internal components by fetching and compiling software tools and then downloading default (tutorial) resource data.

Upon successful completion, the executable ./gkno can now be used to run any of the tools and pipelines in gkno. This executable is also used to manage data resources for pipelines.

### Step 3

The following command:

./gkno run-test

will run a basic pipeline on tutorial data. In addition to checking that the internal components were built properly, this command provides the user a first look at gkno "in action" as it processes a pipeline.

## gkno launcher description

The gkno launcher is designed to bring the wealth of next-generation DNA sequencing analysis software into a single, easy to use command line. The power of the launcher is the ability to bring together multiple tools into a single analysis pipeline with the minimum of required user input. A pipeline is defined in a configuration file that can be quickly and easily constructed and is then available for repeated use. When the command line is executed, gkno generates a makefile that is automatically executed (unless specified otherwise by the user) using the GNU make framework. This system ensures that each tool is aware of its file dependencies and includes rules to determine how all of the necessary files are to be created. If a tool fails, any files created in the failed step are deleted and the user is informed of where the problem occurred. This ensures that no partially constructed files are made available to the user, which could lead to analysis based on incomplete data. In addition, having identified and fixed the problem, rerunning the pipeline will restart at the latest possible point: files that were successfully generated in the first run will not be unnecessarily regenerated.

### Tool mode

_gkno_ provides the user access to all of its constituent tools. Each tool in _gkno_ is described by a configuration file in _json_ format. This file describes the executable commands, the tool location, all of the allowed command line arguments, the expected parameters, data types and default values. Wherever possible, equivalent options are given the same arguments across tools, providing commonality between the command lines for all tools and making it straightforward to switch between different tools. In general, the user should have no need to deal with the configuration files, but a complete description of their format is given in the '_Configuration files_' section. A list of all the available tools can be seen by typing:

gkno --help

In order to run a tool, the user simply needs to specify the name of the tool to run. In order to get extra information (e.g. the available command line arguments), help can be displayed by typing:

gkno <tool> --help

### Pipeline mode

The _gkno_ launcher can be used to launch any of the available pipelines. Including the term _pipe_ as the first argument instructs _gkno_ to operate in the pipeline mode. To see a list of all available pipelines, type:

gkno pipe --help

In order to see all of the available command line arguments for a particular pipeline, the following command line can be used:

gkno pipe <pipeline name> --help

Executing the command line above lists all of the arguments available as part of the specified pipeline. The pipeline arguments are not, however, the complete set of arguments available to all of the constituent tools. If the user wishes to set a parameter in one of the pipeline's tools, but this is not an available pipeline command line argument, the argument can still be accessed. To set arguments for a specific tool, the pipeline task can be supplied as an argument and the task-specific arguments are then enclosed in square brackets. For example, consider the pipeline build-moblist-reference. This pipeline uses the tool mosaik-jump for the task build-jump-database; the tool accepts the argument --iupac, but there is no pipeline argument to set it. The available arguments can be seen by typing:

gkno pipe build-moblist-reference --help

If this argument was required, the following command line would set it:

gkno pipe build-moblist-reference --build-jump-database [--iupac]

All of the arguments for the task (in this example, build-jump-database) are contained within the square brackets. The pipelines are designed in such a way that the commonly accessed arguments for each of the constituent tools are accessible via the standard command line, but advanced options may require this syntax.

### Admin mode

_gkno_ provides an "admin" mode with various features for updating _gkno_ and managing resources. The following commands are considered "admin" operations:
  • gkno build - initialize gkno & build component tools. See installation section.
  • gkno update - update component tools and check for available resource updates.
  • gkno add-resource - add genome resource data.
  • gkno remove-resource - remove genome resource data.
  • gkno update-resource - update genome resource data.

See the 'Resource management' section for more information on these commands.

### GNU make

The _gkno_ package uses the _GNU make_ system to execute tools and pipelines. On execution of a _gkno_ pipeline, a _makefile_ is generated. The general framework of the _makefile_ is a list of blocks describing which files are required by a '_rule_' and which files are output when the '_rule_' is executed. The _rule_ is itself one or more command lines. When executed (using the command ``make --file <makefile name>``), _make_ searches for the final required output files and all of the _dependencies_, i.e. the files that are required to make the output files. If the final files do not exist, or any of the dependencies are missing or were created more recently than the output, _make_ will try to execute the rule. In the absence of some of the dependencies, _make_ will search for a _rule_ describing how to generate each missing dependency, and so on.

The important thing to note is that after the pipeline has been executed, it can be rerun at any point by using the make --file <makefile name> command. If all files generated by the pipeline exist and none of the input files are newer (i.e. have been modified) than the output files, no tools will be executed. If any files have been modified or deleted, the pipeline will begin execution at the point where these files are relevant. Already existing files will not be recreated.

If the same pipeline is being run multiple times, this can be important. Consider a Mosaik based alignment pipeline, whose first tasks prepare genome reference files. Once the reference files exist, the provided sequence reads are then aligned to the reference. If the pipeline is rerun for a different set of sequence reads, there is no need to regenerate all of the reference files, since these will be unchanged from the first run of the pipeline. So, when the pipeline is run for the second time, it will start with the read alignment tasks and use the already existing reference files.

See the 'Using GNU make' tutorial for worked examples of using the GNU make framework.

### Logging

gkno usage is logged in order to keep track of which tools and pipelines are most commonly used in the community. Every time gkno is launched, an ID of the form tool/<tool name> or pipe/<pipeline name> is generated and sent back to the Marth lab. No information about the user, location etc. is tracked; just the tool or pipeline executed.

## Configuration files

The Python code describing the gkno launcher does not include any hard-coded information about any tools or pipelines. Instead, each tool and pipeline is described by a configuration file in json format.

This section of the documentation describes the format of the json configuration files in some detail and is not intended for the user just wanting to get started with the gkno package. For a more hands-on description of how to use gkno or modify specific aspects of the configuration files, specific tutorials with worked examples have been developed. These are included in the documentation, but are also available on the gkno website under the Tutorials tab.

All of the configuration files are validated and processed using a separate Python library included with gkno. For the purposes of this documentation, when reference is made to the underlying Python code, this includes both the code contained in gkno and that contained in configurationClass. For users wanting to interrogate the code base, note that all functionality directly relating to the configuration files is handled by this separate class.

### Tool configuration files

The tool configuration files describe all of the information necessary to run each of the individual tools. There are many occasions where a single tool actually has multiple configuration files. Consider the tool _bamtools_; this tool comprises multiple modes and the command line arguments depend on the mode being used. For example, the command ``bamtools sort`` has only two possible arguments: the input _bam_ file and an optional flag. The command ``bamtools filter`` also has an argument for the input _bam_ file, but there are several other optional arguments. Instead of complicating the tool configuration files by building in logic that allows certain arguments depending on others, separate configuration files exist for each distinct mode of operation. Looking at the help (``gkno --help``) reveals that there are multiple different tools of the form _bamtools-<mode>_. Each of these configuration files contains the arguments relevant to the particular tool mode and no others.

The tool configuration file consists of a number of required and optional fields, summarised in the lists below; a skeleton example follows the optional fields.

#### Required fields

  • arguments: a dictionary of all the valid command line arguments for this tool. See the 'Tool arguments' section for more details.
  • category: the category to which the tool belongs, for example 'align' or 'alignment processing'. These categories are used in the plots that can be created of gkno pipelines.
  • description: a brief description of the tool and its role. This text appears in the pipeline help and so its inclusion is necessary in order to ensure clarity.
  • executable: the name of the executable file.
  • help: the help command for this tool (usually --help or -h).
  • parameter sets: define values to apply to the tool arguments. This is a very useful feature and is dealt with in detail in section 'Parameter sets'.
  • path: the location of the executable file within the gkno package.
  • tools: a list of the names of the tools whose compilation is required for this tool to execute. The values in the list must be the names of tools in the gkno package.

#### Optional fields

  • argument delimiter: modifies the format of the argument/value pair on the command line. See the section 'Defining argument delimiters' for more details.
  • argument order: the command lines for some tools do not use arguments, but the values on the command line are required to be in a specific order. For these tools, the argument order field lists all of the command line arguments in the order they must appear on the command line. See section 'Argument order' for more details.
  • experimental: a flag that identifies the tool as experimental. This means that the tool is identified in the help as a tool that should be used with caution.
  • help group: used to group tools together in the gkno help messages. If not set, the tool is included in the 'General' category.
  • hide tool: hides the tool from the list of available tools. See the section 'Hiding tools from the user'.
  • input is stream: some tools only operate on the stream and, as such, do not have command line arguments for the input files as the stream is assumed (ogap and bamleftalign are examples of such tools). By setting input is stream in the tool configuration file, gkno will ensure that files are piped to the tool.
  • modifier: modifies the executable command with a suffix.
  • precommand: modifies the executable command with a prefix.
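
Putting these together, a skeleton of a tool configuration file for a hypothetical bamtools sort mode is sketched below. The path, category and argument contents here are illustrative only; consult one of the configuration files shipped with the gkno package for the exact values and structure.

```json
{
  "description" : "Sort a bam file.",
  "category" : "alignment processing",
  "executable" : "bamtools",
  "modifier" : "sort",
  "path" : "bamtools/bin",
  "help" : "--help",
  "tools" : ["bamtools"],
  "parameter sets" : [],
  "arguments" : {
    "inputs" : [],
    "outputs" : []
  }
}
```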

Some of these fields can themselves include a number of options and require explanation. These are covered in more detail below.

#### Tool arguments

The bulk of the tool configuration file is the definition of all the command line arguments available for the tool. The arguments for each tool are organised into different groups; each named group is a list of dictionaries. Each dictionary contains all of the required and optional information for a specific argument. All input files need to be in the _inputs_ group and all output files in the _outputs_ group. _gkno_ determines whether arguments point to files by their presence in these groups, so it is essential that this convention is followed. Outside of these two, there can be as many groups with any names (although each group name can only be used once within the configuration file). Each dictionary within a group contains a combination of the following fields; a minimal example follows the required fields.

#### Required fields

  • command line argument: the argument that the tool expects to receive. The long form argument and short form argument fields define the argument for the gkno command line, but the argument expected by the tool is often different and so is defined here.
  • data type: the expected data type associated with this argument. This can be one of the following: string, int, float, bool or flag. On the command line, all arguments will expect a value to be provided unless the data type is set to flag.
  • description: a brief description of the command line argument used in the help messages.
  • extensions: a list of the allowed extensions for the file associated with this argument (including the preceding '.'). If this argument is not associated with a file, this should be set to no extension.
  • long form argument: a long form version of the command line argument.
  • short form argument: a short form version of the command line argument. For example, the argument could be --fastq and the short form would likely be -f.
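
For example, the required fields for a hypothetical fastq input argument (placed in the inputs group) might look like the following sketch; the command line argument value here is illustrative.

```json
{
  "description" : "The input fastq file.",
  "long form argument" : "--fastq",
  "short form argument" : "-f",
  "command line argument" : "-fq",
  "data type" : "string",
  "extensions" : [".fastq"]
}
```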

#### Optional fields

  • allow multiple values: a Boolean, which, if set to true, instructs gkno that this command can appear on the command line multiple times. For example, if multiple inputs can be defined and the input file command is --in, using the command line gkno tool --in a --in b will result in a and b being stored in a list. The command line in the makefile will then include this argument multiple times, for each supplied input. If the Boolean was not set to true (the default) and an argument is specified multiple times, gkno will terminate with a warning, rather than picking one of the supplied values or including all values on the command line.
  • argument list: instructs gkno that this is not an argument understood by the tool, but allows the user to define a file containing a list of values. This is a dictionary and must contain the following attributes:
    • use argument: indicates the argument (must be a valid argument for the tool) that will be used for including all the values included in the list.
    • mode: defines how the values are to be used. Currently, the allowed modes are:
      • multiple makefiles: indicates that the tool will be run multiple times; once for each value in the list. A separate makefile will be generated for each run. A common use case would be a list of regions supplied to a tool. The tool will then be run for each region, allowing parallelisation across all regions.
      • single makefile: as with multiple makefiles, except that each individual run of the tool will be included in the same makefile.
      • repeat argument: indicates that the tool will only be run once and that each value in the list will be applied on the command line together. An example use case would be supplying a list of input files that are all supplied to the tool at once (e.g. for merging files).
  • directory: instructs gkno that this argument points to a directory.
  • filename extensions: lists the extensions of the filenames produced by this argument. This is only used for arguments that use filename stubs, so the is filename stub value should also be set. Handling filename stubs is dealt with in the 'Handling filename stubs' section.
  • hide in help: a Boolean which, if set, ensures that this argument is not displayed when help on the tool is requested. For occasions where this is useful, see the section 'Defining additional input/output files'.
  • if input is stream: applied to an argument for an input file, this field describes the behaviour of the command line argument if the input is a stream rather than a file. The accompanying value can be either do not include or replace.
    • If the value is do not include, the argument will be omitted from the command line. Internally, gkno still keeps track of filenames in order to define filenames further along the pipeline, but the tool command line will no longer include this argument.
    • If the value is replace, this argument must also include the field replace argument with. This is a dictionary containing the two keys argument and value. argument is associated with a string: the text that the argument is replaced with. value is the value that accompanies the argument (if this is blank, it can be set to no value). For example, if the tool freebayes is fed a stream rather than an input bam file, the input argument --bam <file> needs to be replaced with --stdin. This is accomplished by including the following in the configuration file under arguments in the --bam section:

"if input is stream" : "replace",
"replace argument with" : {
  "argument" : "--stdin",
  "value" : "no value"
}

  • is filename stub: identifies the argument as defining a filename stub, i.e. multiple files with this value are created with extensions defined by the filename extensions field. See the section 'Handling filename stubs' for more details on handling filename stubs in tools and pipelines.
  • is stream: if set to true, identifies the argument as the one holding files to stream to the tool. If no stream is piped to the tool, and the tool expects a stream, the argument with the is stream field set will be used to generate a stream for the tool.
  • if output to stream: allows modification of the argument (this is specifically for arguments defining output files) if the desired output is to a stream rather than a file. Currently, the only allowed value for this field is do not include. This is only of relevance if the tool is included in a pipeline as part of a set of piped tasks, and it isn't the final task in the pipe stream. In this situation, the command line will be modified to ensure that the pipes link the tools together successfully.
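
For example, an output argument that should simply be dropped from the command line when its task streams its output would include:

```json
"if output to stream" : "do not include"
```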

  • modify argument name on command line: allows modification of the argument before being written to the command line. Some tools have command line constructions that mean that there are no actual arguments (just values) on the command line, or instead of defining an output file, the output is sent to stdout etc. In order to standardise the gkno interface, all of the arguments are still defined in the configuration file, however, when it comes to constructing the command lines in the makefile, the individual tools' formats need to be respected. The modify argument name on command line can take one of the following forms:

  • hide: when constructing the command line, hide the argument and only write the value. For example, if a command line should have the form tool [input file] [output file], the configuration file may specify arguments --in and --out associated with the input and output files respectively. Other tasks in pipelines can then link to these arguments without any problems. If modify argument name on command line is left unset, the command line would take the form: tool --in [input file] --out [output file] which would be inconsistent with that expected by the tool. By including: "modify argument name on command line" : "hide" in the configuration file for both --in and --out, the command line would then take the required form.

  • stdout: as with hide, but instead of just omitting the argument, the argument is replaced with the stdout '>' operator.

  • stderr: as with stdout, but instead of using the operator '>', the stderr operator '2>' is used.

  • omit: nothing will be written to the command line for this argument. The argument is essentially a placeholder that allows linkage etc.

  • is filename stub: for some tools, the input or output is defined on the command line without any extensions. The tool itself takes this stub and determines the full filenames internally. Arguments of this type have the is filename stub field set to true (see the sketch after this list). When set, the following field is also required:

  • filename extensions: a list of the output extensions that will be generated by the tool (including any preceding '.').

  • construct filename: instructions on how to construct the filename if it hasn't been explicitly set. This section is discussed in the 'Construct filenames' section.

  • required: A Boolean indicating if the file associated with this argument is required for successful operation of the tool. If required is set to true and the file is not provided, gkno will terminate highlighting that this file is missing. If not present, it is assumed that this file is not required.

  • required if stream: As required, but specifically set to handle streaming inputs. By default, this is set to None, but if set, allows different behaviour when handling streams if necessary.
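
As a sketch, a filename stub output argument would combine the two fields described above as follows (the extensions here are hypothetical):

```json
"is filename stub" : true,
"filename extensions" : [".dat", ".idx"]
```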

#### Construct filenames

In order to minimise the amount of information that the user is required to provide, many of the output files for the tools do not need to be explicitly specified. To allow this to happen, _gkno_ needs to know how to generate a filename in the absence of a specific definition. Any output file that can be generated by _gkno_ includes __construct filename__ in the argument definition. This field is accompanied by a dictionary of instructions on how to construct the filename. There are a number of different ways in which the filename can be constructed, but the dictionary must always contain the field __method__ describing the method of constructing the name. Descriptions of the different methods follow.

##### From tool argument

If __method__ is set to __from tool argument__, the filename will be constructed using the value associated with a different argument for the tool. In this case, the following fields are required in the __construct filename__ block:

  • __modify extension__: can take one of a number of values. If set to __append__, the extension defined for this argument will be appended to the value; the original extension will not be replaced. If set to __omit__, the original extension will be removed, but no extension will take its place. If set to __replace__, the original extension will be replaced with the extension defined for this argument (in the case that there is a list of allowed extensions for this argument, the first value in the list will be used). Finally, __retain__ will retain the original extension of the file being used to construct the filename.
  • __use argument__: this must be accompanied by a valid argument for the same tool. The value associated with this argument will be used to construct the filename.

In addition to the above required fields, the field modify text can also be included to define additional changes to be made. The modify text field is accompanied by a list of dictionaries, where each dictionary is permitted one and only one key/value pair describing one operation. When making changes to the filename, the instructions are executed in the order in which they appear in the list. The allowed instructions are:

  • add argument values: accompanied by a list. This list consists of one or more valid arguments for the tool. The values associated with the arguments in the list will be appended to the filename prior to any extensions.
  • add text: accompanied by a list of strings (usually only one since multiple strings in the list will just be concatenated). This string will be added at the end of the filename, but prior to any extensions.
  • remove text: accompanied by a list of strings (again, usually only one). The defined text will be removed from the filename.

As an example, consider the hypothetical example illustrated below.

"extensions" : [".out", ".out.gz"],  
"construct filename" : {  
  "method" : "from tool argument",  
  "use argument" : "--in",  
  "modify extension" : "replace",
  "modify text" : [  
    {  
      "remove text" : ["_1"]  
    },  
    {  
      "add text" : ["_"]  
    },  
    {  
      "add argument values" : ["--value"]  
    },  
    {  
      "add additional text" : "_test"  
    }  
  ]
}

Construction would proceed by checking that the argument --in is a valid argument for the tool. The extension associated with --in would also be determined. For this example, let's assume that the value associated with --in is input_1.in. The modify text instructions are then processed in order, starting with the remove text instruction. The filename is checked to ensure that it ends with _1, and then this is removed to give input.in. Next, the text _ is added giving input_.in, and then the tool is checked to ensure that --value is a valid tool argument. Assuming that it is, the associated value (let's assume it is 10) is appended to the name to give input_10.in. Now the string associated with the final add text is added to give input_10_test.in. Finally, the instructions demand that the extension is replaced. In this case, the extension for --in is removed and replaced with the extension provided for the argument being constructed: the new extension can be .out or .out.gz. gkno chooses the first value in the list, so the final value associated with this argument is input_10_test.out.

#### Define name

If method is set to define name, the filename is defined based on the contents of the construct filename block. When using this method, the following additional fields are required inside the construct filename block (a sketch follows the lists below):

  • add extension: is a Boolean, which if set to true will add the extension for this argument to the final value.
  • filename: is the string to be used for the file, excluding the extension.

In addition to the required fields, the following optional fields can also be defined:

  • directory argument: accompanied by a valid argument for the tool that defines a directory. If this is set, the final filename will be prepended with the value associated with the directory argument followed by a '/'.
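
A sketch using this method, with an illustrative filename and directory argument, might be:

```json
"construct filename" : {
  "method" : "define name",
  "filename" : "merged_reference",
  "add extension" : true,
  "directory argument" : "--out-directory"
}
```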
#### Defining additional input/output files

Some tools generate output files or depend on input files that are not associated with any command line argument. For the _GNU make_ system to work properly, it is necessary that **all** of the outputs and dependencies are known. For a concrete example, consider the indexing of a _bam_ file using the _bamtools_ software. The command line to index a file is ``gkno bamtools-index --in <file>.bam``. No output file is specified on the command line. In fact, _bamtools_ does not even have an argument to specify the name of the output index file; it is assumed to be ``<file>.bam.bai``. Similarly, tools that use the index file generally do not have an argument to specify it; they just assume that a file of the form ``<file>.bam.bai`` exists. In order to ensure that the tool outputs and dependencies are correctly handled (this becomes especially important in pipelines), all input and output files require an argument in the configuration file - even if the tool itself does not expect one. The argument ``--out`` for _bamtools-index_ is shown below as an illustration.
{
  "description" : "the index file.",
  "long form argument" : "--out",
  "short form argument" : "-o",
  "command line argument" : "-out",
  "input" : false,
  "output" : true,
  "required" : true,
  "data type" : "string",
  "extensions" : [".bai"],
  "hide in help" : true,
  "include on command line" : false,
  "construct filename" : {
    "method": "from tool argument",
    "use argument" : "--in",
    "modify extension" : "append"
  }
}

This argument does not need to be manually set and since the field hide in help is set to true, the user will not know that the argument exists at all. In addition, the include on command line field is set to false, so when the makefile is constructed, this argument will be ignored. However, the argument is required and its value is constructed as the value from --in, with the extension .bai appended as required. By including this argument in the configuration file, the outputs from this tool will be set. In addition, when working with pipelines, other tools can link to this index file. See the section 'Additional dependencies' for further information.

#### Defining argument delimiters

The standard format of a command line argument for the majority of tools (including _gkno_ tools and pipelines) is ``--argument <value>``, with the equivalent short form version ``-short_form <value>``. When the argument is a flag, the value is omitted. While common, not all tools conform to this format. For example, there are tools that use the format ``argument=<value>``. If this is the case, the _argument delimiter_ can be set to ensure that the argument format appears correctly in the _makefile_. For the example format, the _argument delimiter_ block would be of the form:
"argument delimiter" : "="

If the argument delimiter block is omitted, the default value is a single space.

#### Hiding tools from the user

There are some tools included in the _gkno_ package that have peculiar command lines or are only intended for use in a piped stream. For example, the _ogap_ tool expects to have a _bam_ file piped into it and outputs a _bam_ file to the stream. It is not straightforward to use these tools from the _gkno_ command line, so it is desirable to hide them from view. For example, in the list of available tools (``gkno --help``), _ogap_ and _bamleftalign_ are not visible. While these can't be seen as available tools, they can still be used in constructing pipelines like any other tool. To hide a tool, the _hide tool_ block takes the form:
"hide tool" : true

If this block is omitted, the tool is assumed to be visible.

#### Modifiers to the executable command

Some of the tools included in the _gkno_ package appear in a command line with modifiers before or after the actual executable file. Tools that use _java_ may require some additional text before the executable file, for example ``java -Xmx4g -jar``. The _executable_ block defines the name of the executable file and is used to check that the executable actually exists, so it cannot be modified to include this additional text. In order to ensure that the command line is correctly constructed, the _precommand_ block can be used to define this additional text. For the _java_ example, the tool configuration file would include the block:
"precommand" : "java -Xmx4g -jar"

The makefile would then correctly construct the executable command. There are also cases where text needs to be added after the executable. bamtools is a suite of tools that operates on bam files. The tool is constructed such that the executable file is called bamtools, but then the specific operation within bamtools needs to be defined. If the desired operation is the sorting of a bam file, the command line would have the form bamtools sort [arguments]. The text sort can be defined using the modifier section. For this example, the modifier block would have the form:

"modifier" : "sort"

If these sections are omitted from the configuration file, the default operation is to include the executable file only in the command line, followed by the defined arguments.

#### Argument order

Some tools do not have arguments at all on the command line; instead, the values are supplied in a specific order. For example, a tool command line can be of the form:

tool <option 1> <option 2> [input file] [output file]

The tool configuration file will provide command line arguments for each of these options and files, so that the gkno command line is consistent with all other tools. However, when the command line is written out in the makefile, the above syntax must be replicated. Within the argument definitions, the field modify argument name on command line will be set to hide, ensuring that the arguments are not included, only the associated values. In order to ensure that the values are written out in the correct order, the argument order field defines a list including all of the arguments for this tool, in the order they should appear on the command line. So for the example command line above, the argument order will be defined as:

"argument order" : [
  "--option1",
  "--option2",
  "--input",
  "--output"
]
### Pipeline configuration files

The pipeline configuration files contain all of the information necessary to define a pipeline. This includes all of the tools that are used in the pipeline, command line arguments for the pipeline and how to link all of the tools together. As with the [tool configuration files](#tool_config), there are a number of required fields, which are all described below.

#### Required fields

  • description: a brief description of the pipeline.
  • parameter sets: definitions of values for the pipeline arguments. See the [Parameter sets](#parameter_sets) section for more details.
  • nodes: describes the details of the pipeline, including pipeline arguments and the logical connection of tools. This is described in detail in the [Pipeline nodes](#pipeline_config_nodes) section.
  • tasks: defines the tasks in the pipeline with the necessary information about each task. This is discussed in detail in the section 'Pipeline tasks'.

#### Optional fields

  • experimental: a flag that identifies the pipeline as experimental. This means that the pipeline is identified in the help as one that should be used with caution.

Each of these individual components of the pipeline configuration file is discussed in detail in the following sections. Tutorials are provided that give worked examples of how to construct a basic pipeline configuration file (Building and modifying pipelines) and how to add further options. Please refer to these for examples.
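
As an overview, a minimal pipeline configuration file combining these fields might have the following shape. This is a sketch using the freebayes/vcflib example developed below; the contents of the parameter sets section are left empty here and are described in the 'Parameter sets' section.

```json
{
  "description" : "Call variants and filter the results.",
  "parameter sets" : [],
  "tasks" : {
    "variant-call" : {
      "tool" : "freebayes",
      "output to stream" : true
    },
    "filter-variants" : {
      "tool" : "vcflib-filter"
    }
  },
  "nodes" : [
    {
      "ID" : "input",
      "description" : "The input bam file.",
      "long form argument" : "--bam",
      "short form argument" : "-b",
      "tasks" : {
        "variant-call" : "--bam"
      }
    }
  ]
}
```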

#### Pipeline tasks

A pipeline is basically a set of tools that are executed in a specific order, passing files between them. The __tasks__ section of the configuration file defines a name for each task in the pipeline. Each task is an operation to be performed and must have a unique name. For each task, the tool used to perform the task is specified using the required __tool__ field. Additionally, the task can be identified as outputting to a stream rather than a file using the optional __output to stream__ field. If this is set, _gkno_ will check that the tool is able to output to a stream and that the next task in the pipeline is capable of accepting a streaming input. As an example, consider a simple pipeline that calls variants with _freebayes_, then streams the output _vcf_ file into _vcflib_ for filtering. The __tasks__ section would take the form:
"tasks" : {
  "variant-call" : {
    "tool" : "freebayes",
    "output to stream" : true
  },
  "filter-variants" : {
    "tool" : "vcflib-filter"
  }
}

The order in which the tasks are defined is unimportant, since the order in which the tasks are executed is determined by the flow of files through the pipeline. However, it is typical to list the tasks in the order in which they are expected to run.

#### Pipeline nodes

The __nodes__ section consists of a list of dictionaries. Each one of these dictionaries has a set of required and optional fields as outlined below. The nodes are used to define arguments that can be used on the command line for the pipeline and to define which tasks/arguments the assigned values point to. In addition, the nodes define which tasks share which arguments and how the information passes through the pipeline. The allowed fields are listed below, and those that require further explanation have separate sections in this documentation.

##### Required fields

  • description: describes the values associated with the node. If the node is assigned arguments, the description is what appears in the help message for this argument.
  • ID: a unique identifier for the node.
  • tasks: is a dictionary of task/argument pairs. It is from this that the pipeline data flow is derived and so this is extremely important. The tasks section is described in more detail in section 'Pipeline task nodes'.

##### Optional fields

  • delete files: a Boolean that, if set to true, instructs gkno to delete files associated with this node. gkno will determine when the files can be deleted. See section 'Deleting intermediate files' for more details on how to handle intermediate files.
  • evaluate command: replaces the value for an argument with a command to evaluate at execution time. See 'Evaluating commands at execution time' for details.
  • extensions: is only used in special cases where, for example, a task input is the output of a previous task, but the previous task output is a filename stub. This is looked at in more detail in the section 'Handling filename stubs'.
  • greedy tasks: similar to the required tasks field above, but instructs gkno that the contained task arguments are greedy. This is only a concern if multiple sets of input files have been provided to the pipeline, and indicates that all of the sets of files passing through the pipeline should be used together for this task. This is covered in more detail in the section 'Handling multiple data sets'.
  • long form argument: defines the long form of the command line argument. Other fields in the node will connect this argument to arguments associated with tasks within the pipeline. It is conventional to use arguments that mirror the arguments in the tools. For example, this node might allow the user to define a file, say file.fastq, that is used by several different tasks in the pipeline. An attempt has been made to standardise the command line arguments across all the tools, so those tools will hopefully all use an argument like --fastq. In this case, the argument in the pipeline should also be set to --fastq.
  • required: indicates if an argument is required. See the section 'Required pipeline arguments' for details and examples of when this is necessary.
  • short form argument: the short form of the command line argument.
#### Pipeline nodes: tasks

The __tasks__ and __greedy tasks__ fields tell _gkno_ which tasks share the same information and, in some cases, connect this with a pipeline argument that allows the user to set the value. The format is a list of task/argument pairs that use the same data. This is demonstrated in the following example. A pipeline consists of a set of tasks and two of those tasks use the same input file. The command line argument for each of the tools for this input is ``-in``, as defined in the individual tool configuration files. In order to maintain consistency with these tools, the pipeline argument in this node is set as ``--in``. The node then has the form:

```javascript
{
  "ID" : "input",
  "description" : "Input file.",
  "long form argument" : "--in",
  "short form argument" : "-i",
  "tasks" : {
    "task-1" : "--in",
    "task-2" : "--in"
  }
}
```

The pipeline help will show that there is a command ``--in`` available for setting the name of the input file. When _gkno_ is executed, a graph of all tasks is created and a node exists that describes the input file defined by the ``--in`` command. This node will have two edges; one connects the node to the _task-1_ node and the second connects the node to the _task-2_ node. In this way, the defined input can be used in as many tasks as required in the pipeline.

There are cases where there are no defined long or short form arguments. When this happens, the user doesn't have the opportunity to set these values directly (although the syntax described in 'Pipeline mode' to set the values of tasks in the pipeline is still valid). This is usually used to link the output of a task with the input of another task (or tasks). In this case, the configuration file node would have the form:

{
  "ID" : "link",
  "description" : "Linking tasks",
  "tasks" : {
    "task-1" : "--out",
    "task-2" : "--in"
  }
}

In this case, the file output by task-1 would be used as the input for task-2. Since there is no argument associated with this node, nothing about it would be shown in the pipeline help.

The greedy tasks field has the same form as tasks, but is used for cases where there are multiple data sets being processed by the pipeline. This is dealt with in the section 'Handling multiple data sets'.

#### Pipeline workflow

The pipeline workflow is the order in which the tools are executed. This is determined by _gkno_ by performing a topological sort on the pipeline graph. After all tasks have been assigned to the graph as task nodes, all arguments are given option nodes and files are assigned file nodes. All option nodes are joined to the relevant task node by an edge (all option nodes are predecessors to the tasks, since they provide information to the task). File nodes can either precede or succeed the task node depending on whether they are input or output files for the task. Performing a topological sort provides a non-unique order in which the tasks are executed, but it is assured that tasks that depend on the output of other tasks will appear after them in the workflow. While this workflow is unimportant for the produced _makefile_, it is useful for giving a human-readable flow of tasks that helps the user understand the role of the pipeline. As an example, the pipeline _build-moblist-reference_ takes two _fasta_ files as input, merges them and then generates a reference file in _Mosaik_ native format as well as a set of _Mosaik_ jump database files. The pipeline graph is illustrated in the following figure:

*Figure: pipeline graph for the build-moblist-reference pipeline.*

Performing a topological sort on this graph yields the workflow:

1. merge-fasta
2. build-reference
3. build-jump-database
4. create-sequence-dictionary
5. index-fasta

From the graph, it is clear that step 1 must occur first. After that, the only consideration is that step 3 must occur after step 2, but steps 2, 4 and 5 could be performed in any order. This is why the workflow is non-unique.

#### Setting required pipeline arguments

If an argument is required by a tool within the pipeline, not setting the pipeline argument that points to the particular task argument will result in an error. This is because the pipeline will fail to execute if any of the constituent tools do not have all their required parameters. There are also cases where an argument for a tool is optional, but in the context of a pipeline the argument needs to be set. In this case, including the field ``"required" : true`` in the pipeline configuration node will ensure that _gkno_ terminates with an error if the argument isn't set.

As an example, consider a pipeline that processes paired end reads and two fastq files are expected. Consider an aligner with a required argument --fastq. This is required since no alignment can take place without some reads. A second argument --fastq2 is optional. If set, the aligner will work in a paired end mode, otherwise it will assume all reads are single ended. In a paired end read pipeline, it is necessary that both --fastq and --fastq2 are set. In this case, the pipeline configuration file will include nodes linking to each of these tool arguments as follows:

...
{
  "ID" : "first mate",
  "description" : "The file containing the sequence reads for the first mate",
  "long form argument" : "--fastq",
  "short form argument" : "-q",
  "tasks" : {
    "aligner" : "--fastq"
  }
},
{
  "ID" : "second mate",
  "description" : "The file containing the sequence reads for the second mate",
  "long form argument" : "--fastq2",
  "short form argument" : "-q2",
  "required" : true,
  "tasks" : {
    "aligner" : "--fastq2"
  }
}

The node with the ID 'first mate' does not need to be specified as required, since it is already set as required in the aligner's own configuration file. Since the tool configuration file does not list --fastq2 as required, it needs to be identified as such for the purposes of this pipeline.

#### Deleting intermediate files

There are many occasions where tasks in a pipeline produce output files that do not need to be kept. In fact, there are many cases where lots of intermediate steps create files that, if kept, could fill up all the available storage. The solution is to identify which files are not required by the user and then delete them at the earliest opportunity. Consider the case where _task A_ generates an output used by _task B_. Once _task B_ has consumed the file, it is no longer of use to the pipeline and the user has no desire to keep the file. In this case, the arguments for the two tasks are linked in a pipeline node as follows:
{
  "ID" : "example",
  "description" : "delete files example",
  "delete files" : true,
  "tasks" : {
    "task A" : "--out",
    "task B" : "--in"
  }
}

The tasks section above instructs the pipeline to link the output of task A to task B. By including the field "delete files" : true, gkno is instructed to remove the file when the pipeline no longer needs it. gkno does not wait until the pipeline has been completed to delete the file, as this could lead to the accumulated intermediate files overwhelming the available storage. The file is also listed at the top of the makefile in the .INTERMEDIATES section. This ensures that if the pipeline is rerun, this file will not be regenerated unless earlier files have been modified and task B needs to be rerun.

#### Evaluating commands at execution time

There are times where the parameter to be given to a task is unknown when writing out the _gkno_ command line. Instead, the value is to be extracted from a file created by the pipeline, for example. The pipeline configuration file allows a command to be defined for an argument. If a value is given (either on the command line, or from a parameter set), this is used, but in the absence of a value, the command will be executed at run time to determine the value. To do this, the __evaluate command__ dictionary needs to be included in the pipeline configuration file node associated with the command(s) in question. This dictionary contains the field __command__ in which the command to be executed is defined. Any files to which the command points are replaced in this command with a unique ID. In addition, the __add values__ list is included to define these files. For example, consider the example argument ``--in`` for task ``taskA``. When the task is executed, a file generated by another task in the pipeline (``taskB``, argument ``--out``) is interrogated using the bash command ``head -1 <file>``. This command (reading the first line of the file) yields the value to use. This would be represented in the configuration file as follows:
{
  "ID" : "example",
  "description" : "evaluate command example",
  "long form argument" : "--in",
  "short form argument" : "-i",
  "tasks" : {
    "taskA" : "--in"
  },
  "evaluate command" : {
    "command" : "shell head -1 FILE1",
    "add values" : [
      {
        "ID" : "FILE1",
        "task" : "taskB",
        "argument" : "--out"
      }
    ]
  }
}

This node describes the pipeline command line argument --in. If a value for this is given on the command line, that value will be used. If there is no value supplied, the defined command will be used instead. The command can contain as many unique ID strings as required; each ID is then defined in the add values list. Each dictionary in that list must contain the three fields ID, task and argument, and the ID must be present in the command. In this example, the task taskB generates the file file.txt from the argument --out, so the generated makefile will have the following line in the command line for taskA:

--in $(shell head -1 file.txt)

When the command line for taskA is executed, file.txt will be interrogated and the resulting value used.

### Handling filename stubs

### gkno command line arguments

In addition to the arguments for the tool/pipeline being run, there is a set of arguments that can always be set. These are all optional, provide general gkno functionality, and are summarised below.
  • --debug (-db): prints out messages throughout operation detailing tasks that have been completed. This is useful for identifying sources of error.
  • --do-not-execute (-dne): a flag defining whether gkno should execute the scripts after creating them. If not specified, gkno will automatically execute the makefile. This behaviour is overridden if multiple makefiles are created or if required files/executables are missing.
  • --do-not-log-usage (-dnl): is a flag that ensures that this usage isn't logged. This is usually used in development to avoid skewing usage stats.
  • --draw-pipeline-graph (-dpg): defines the name of a file to output a .dot format file that can be plotted using graphviz.
  • --export-parameter-set (-ep): tells gkno to generate a new parameter set in the parameter set configuration file. See the specific tutorial for further information on this option.
  • --input-path (-ip): the input path for all input files if the path is unspecified. If the path is specified, this path is obviously used, otherwise the assumption is that the files reside in the current working directory. Setting --input-path will force gkno to assume all unspecified input files (except for resource files, see below) are available in the path specified by --input-path.
  • --parameter-set (-ps): define the parameter set to use. This sets a number of parameters for the pipeline.
  • --internal-loop (-il): only selectable for pipelines with a defined internal loop. This command defines the json file defining multiple sets of input files/parameters.
  • --multiple-runs (-mr): informs gkno that a json file is provided containing multiple sets of input files/parameters. See the specific tutorial for further information on this option.
  • --no-hard-warnings (-nhw): a flag which, if set, removes the requirement for the user to press 'Enter' to acknowledge an error/warning. There are very few cases where the user's response is required, so it is recommended that this behaviour is not turned off.
  • --number-jobs (-nj): the number of parallel jobs to execute. This is only applicable when running a pipeline utilising an internal loop.
  • --output-path (-op): similar to the input path. All output files are output to the --output-path unless a path is provided with the filename.
  • --task-stdout (-ts): if set, each task will generate its own stdout and stderr file. The default behaviour is to produce a single stdout/stderr file for the pipeline.
  • --timing (-tm): includes the time command for each task in the pipeline. On completion, the timing information for each task is written to stderr.
  • --verbose (-vb): is a flag used to tell gkno whether to output verbose information to screen as gkno runs.
### Handling multiple data sets

#### Tools outputting to stream (optional)

There are times where it is preferable to link several tasks together with pipes, so that each task sends its output to the stream and the following task accepts the stream as input. For cases where the intermediate files are not required, this can save a lot of storage, since the information is simply passed on the stream. The downside to this process is that if a task in the stream fails, _GNU make_ cannot give information on the specific task that failed. As such, this option should only be used if there is a high degree of confidence in the tasks.

If included, this section is just a list of tasks that output to the stream. In the makefile, all tasks contained in this list output to the stream and so will be linked to the next task in the pipeline workflow by a pipe. If consecutive tasks appear in this list, there will be multiple tasks linked by pipes in a single command line.
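
As an illustration only, such a list might look like the sketch below; the exact field name should be checked against a real pipeline configuration file, and the task names here are hypothetical.

```json
"tools outputting to stream" : [
  "variant-call",
  "filter-variants"
]
```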

For further information on this feature and worked examples demonstrating its use, see the relevant pipeline tutorial.

#### Internal loops (optional)

It is often the case that a user will want to run a pipeline multiple times for a set of input data. There are two distinct use cases for these pipelines. The first is that the entire pipeline needs to be run from start to finish for a specific set of input file(s) and the user has more than one such set of input files. _gkno_ can be set to accept a _json_ file containing the input parameters for each required pipeline execution and a _makefile_ will be created for each set of files. These files can be executed serially or sent to a cluster environment for execution. This use case is covered in more detail in the [_Performing multiple runs of a pipeline_](#tutorial_multiple_runs) section.

The second use case uses what we term internal loops. The following figure demonstrates a possible pipeline configuration:

'Task A' is the first task to be executed and its output is used as input to 'Task B', which in turn is executed and produces output files. 'Task C' requires as input the output from 'Task B', but also some files defined by the user. The internal loop refers to the set of tasks (in this case, tasks C and D) that are run multiple times for different input files, but are all independent of each other. As soon as 'Task B' is complete, as many jobs as required can be spawned in parallel to execute tasks C and D for all defined input files. 'Task E' requires as input the outputs of all of the 'Task D' runs in the internal loop, so it is not an independent task and has to wait until all of the 'Task D' runs have been completed.

A real example of a pipeline with this format is fastq-vcf. The first tasks in the pipeline are concerned with preparation of the reference sequence and must be completed prior to any read alignments. The internal loop consists of aligning fastq files and some post-processing (e.g. sorting and indexing) of the bam files. Finally, variant calling is performed using all of the bam files and thus is dependent on all of the tasks in the internal loop being complete. If the user has, for example, three pairs of fastq files, using the internal loop allows the three alignments to be performed in parallel (the number of parallel jobs is controlled using the --number-jobs (-nj) command line parameter).

To include an internal loop in a pipeline, the "internal loop" section needs to be included in the pipeline configuration file. This section is simply a list of the tools (in the order that they appear in the "workflow") that should be included in the internal loop. With this section defined, the functionality can be accessed using the --internal-loop (-il) command line argument and providing a json file with the input files/parameters for each iteration of the internal loop. For details on using the internal loop, see the Running a pipeline using the internal loop section.
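
Based on the description above, the section is a list of task names in workflow order. For a fastq-vcf style pipeline this might look like the following sketch (task names illustrative):

```json
"internal loop" : [
  "align-reads",
  "sort-bam",
  "index-bam"
]
```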

## Parameter sets

Parameter sets are used to store default parameters to reduce the number of arguments that need to be set on the command line. They are user configurable, are available for both tool and pipeline configuration files, and are discussed in more detail in the Parameter sets section.

## Resource management

For sequenced genomes there are various files, used by gkno's component tools, that are typically independent of any actual experimental data that will be processed. These include the genome's reference sequence(s), known variants, etc. The gkno project calls this collection a genome resource.

In addition, the resource data for any particular organism is not static; it goes through modifications as its reference is corrected, new variants are catalogued, and so on. Snapshots of this process are often released over time (e.g. human build 35, 36, 37, ...). The gkno project refers to such a snapshot as a release.

^^ FIXME ^^ How to describe release update process? (not exactly tied to genome release schedule)

gkno's resource-management commands allow the user to manage multiple resources as well as multiple releases for each, if needed. For example:

  • Resource A
    • Release-1
    • Release-2
  • Resource B
    • Release-1
  • Resource C
    • Release-4

Note - Any resources & releases referred to in this section are only those that have been bundled and made available by the gkno project. Users may certainly provide their own data files and run analysis using them, but will not be able to manage those files using gkno's resource commands.

### Add a resource
gkno add-resource

will display a list of all genome resources that gkno is hosting and can fetch. Any resources preceded by a '*' have already been added.

gkno add-resource <organism>

will download that organism's current release. The files will be stored under the organism's directory, in a subdirectory matching the release name (e.g. resources/homo_sapiens/build_37). In addition, a symlink (shortcut) named "current" will be created in the organism's directory that points to the current release. This allows the user to refer to "resources/<organism>/current" as the resource path in pipeline scripts and always use the most up-to-date release data. This organism is also considered "tracked" for later updates, see below.

gkno add-resource <organism> --release

will display a list of all available releases for that genome. Any releases preceded by a '*' have already been added.

gkno add-resource <organism> --release <release_name>

will download a particular genome release. The files will be stored under the organism's directory, in a subdirectory matching the release name (e.g. resources/homo_sapiens/build_36.2). The "current" release symlink is not created or moved.

### Update a resource

Running gkno update will check all tracked resources for new releases. The gkno team realizes that a user may not always wish to automatically update data. Therefore, if any updates are found, a summary message is printed to the screen. The user may type the following command later, at their discretion:

gkno update-resource <organism>

This actually performs the update - downloading the new release files and moving that organism's "current" release symlink to point to it.

The new release may be fetched without moving the "current" symlink by using the named-release version of gkno add-resource described above.

### Remove a resource
gkno remove-resource

will display a list of all organisms with resource files that can be removed.

gkno remove-resource <organism>

will remove all releases for an organism.

gkno remove-resource <organism> --release

will display a list of releases for that organism that can be removed.

gkno remove-resource <organism> --release <release_name>

will remove a particular genome release.

## Available tools

The toolkit is dynamic and extra tools can be added by the Marth lab or others (in collaboration with the Marth lab). A list of currently available tools, along with a brief description and links to references, is included below.

### Mosaik

Mosaik is the Marth lab's sequence read alignment software and comprises multiple elements, each of which is described below.

**MosaikBuild**

MosaikBuild is used to convert a fasta format reference file into a native format used by the alignment software. Sequence reads themselves also require conversion into a format that the aligner can read. This is also achieved using MosaikBuild.

**MosaikJump**

A hash-based algorithm is used to perform alignments within Mosaik. To facilitate this, a jump database is required. This database is generated using the MosaikJump utility.

**MosaikAligner**

MosaikAligner description.

### Bamtools

Bamtools description.

### Freebayes

Freebayes description.