Update README and small fixes #130

Merged on Jun 19, 2024 (16 commits)
16 changes: 13 additions & 3 deletions .github/workflows/ci.yml
@@ -26,6 +26,10 @@ jobs:
NXF_VER:
- "24.04.1"
- "latest-everything"
+ ANALYSIS:
+ - "test"
+ - "test_pdb"
+ - "test_parameters"
steps:
- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
@@ -40,9 +44,9 @@ jobs:

- name: Run pipeline with test data
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
+ nextflow run ${GITHUB_WORKSPACE} -profile ${{ matrix.ANALYSIS }},docker --outdir ./results

- parameters:
+ parameters_stub:
name: Test workflow parameters
if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/multiplesequencealign') }}"
runs-on: ubuntu-latest
@@ -51,6 +55,12 @@ jobs:
NXF_VER:
- "24.04.1"
- "latest-everything"
+ PARAMS:
+ - "--skip_stats"
+ - "--skip_eval"
+ - "--skip_compression"
+ - "--skip_shiny"

steps:
- name: Check out pipeline code
uses: actions/checkout@v4
@@ -62,4 +72,4 @@ jobs:

- name: Test workflow parameters
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_parameters,docker --outdir ./results
+ nextflow run -stub-run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.PARAMS }} --outdir ./results
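
As a rough illustration of what the two matrices in this workflow expand to, the Python sketch below enumerates the commands each job would run. It assumes the only matrix axes are the ones visible in these hunks (`NXF_VER` with `ANALYSIS`, and `NXF_VER` with `PARAMS`); the function names are hypothetical, and the real expansion is performed by GitHub Actions, not by this code.

```python
from itertools import product

# Axis values copied from the workflow hunks above.
NXF_VERSIONS = ["24.04.1", "latest-everything"]
ANALYSES = ["test", "test_pdb", "test_parameters"]
SKIP_PARAMS = ["--skip_stats", "--skip_eval", "--skip_compression", "--skip_shiny"]

# Left unexpanded on purpose: the runner substitutes the real path.
WORKSPACE = "${GITHUB_WORKSPACE}"


def profile_commands():
    """One full pipeline run per (Nextflow version, test profile) pair."""
    return [
        (ver, f"nextflow run {WORKSPACE} -profile {analysis},docker --outdir ./results")
        for ver, analysis in product(NXF_VERSIONS, ANALYSES)
    ]


def stub_commands():
    """One stub run per (Nextflow version, skip flag) pair."""
    return [
        (ver, f"nextflow run -stub-run {WORKSPACE} -profile test,docker {flag} --outdir ./results")
        for ver, flag in product(NXF_VERSIONS, SKIP_PARAMS)
    ]


if __name__ == "__main__":
    for ver, cmd in profile_commands() + stub_commands():
        print(f"NXF_VER={ver}  {cmd}")
```

Under these assumptions the first job fans out into six full runs and the second into eight stub runs, which is why the skip flags are exercised with `-stub-run` instead of a full execution.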
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -2,3 +2,4 @@ repository_type: pipeline
nf_core_version: "2.14.1"
lint:
multiqc_config: False
+ files_exist: conf/igenomes.config
69 changes: 26 additions & 43 deletions README.md
@@ -21,43 +21,25 @@

**nf-core/multiplesequencealign** is a pipeline to deploy and systematically evaluate Multiple Sequence Alignment (MSA) methods.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/proteinfold/results).

![Alt text](docs/images/nf-core-msa_metro_map.png?raw=true "nf-core-msa metro map")

1. **Collect Input Information**: computation of summary statistics on the input fasta file, such as the average sequence similarity across the input sequences, their length, etc. Skip by passing `--skip_stats` as a parameter.
2. **Guide Tree**: (Optional, depends on alignment tools requirement) Renders a guide tree.
3. **Align**: Runs one or multiple MSA tools in parallel.
4. **Evaluate**: The obtained alignments are evaluated with different metrics: Sum Of Pairs (SoP), Total Column score (TC), iRMSD, Total Consistency Score (TCS), etc. Skip by passing `--skip_eval` as a parameter.
5. **Compress**: As MSAs can be very large, by default all tools in the pipeline produce compressed output. For most of them, the compression happens through a pipe, such that uncompressed data never hits the disk. This compression can be turned off by passing `--no_compression` as a parameter.

Available GUIDE TREE methods:

- CLUSTALO
- FAMSA
- MAGUS
In a nutshell, the pipeline performs the following steps:

Available ALIGN methods:

- CLUSTALO
- FAMSA
- KALIGN
- LEARNMSA
- MAFFT
- MAGUS
- MUSCLE5
- MTMALIGN
- T-COFFEE
- 3DCOFFEE
1. **Input files summary**: (Optional) computation of summary statistics on the input files, such as the average sequence similarity across the input sequences, their length, plddt extraction if available, etc.
2. **Guide Tree**: (Optional) Renders a guide tree with a chosen tool (list available below). Some aligners use guide trees to define the order in which the sequences are aligned.
3. **Align**: (Required) Aligns the sequences with a chosen tool (list available below).
4. **Evaluate**: (Optional) Evaluates the generated alignments with different metrics: Sum Of Pairs (SoP), Total Column score (TC), iRMSD, Total Consistency Score (TCS), etc.
5. **Report**: Reports the collected information of the runs in a shiny app and a summary table in MultiQC.

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:
#### 1. SAMPLESHEET

The sample sheet defines the input data that the pipeline will process.
It should look like this:

`samplesheet.csv`:

@@ -67,33 +49,30 @@ seatoxin,seatoxin.fa,seatoxin-ref.fa,seatoxin_structures
toxin,toxin.fa,toxin-ref.fa,toxin_structures
```

Each row represents a set of sequences (in this case the seatoxin and toxin protein families) to be processed.

`id` is the name of the set of sequences. It can correspond to the protein family name or to an internal id.
Each row represents a set of sequences (in this case the seatoxin and toxin protein families) to be aligned and the associated (if available) reference alignments and protein structure files.

The column `fasta` contains the path to the fasta file that contains the sequences.
> [!NOTE]
> The only required input is the id column and either fasta or structures.

The column `reference` is optional and contains the path to the reference alignment. It is used for certain evaluation steps. It can be left empty.
#### 2. TOOLSHEET

The column `structures` is also optional and contains the path to the folder that contains the protein structures for the sequences to be aligned. It is used for structural aligners and certain evaluation steps. It can be left empty.
Each line of the toolsheet defines a combination of guide tree and multiple sequence aligner to run with the respective arguments to be used.

Then, you should prepare a toolsheet which defines which tools to run as follows:
It should look as follows:

`toolsheet.csv`:

```csv
tree,args_tree,aligner,args_aligner,
- FAMSA, -gt upgma -partree, FAMSA,
- , ,TCOFFEE, -output fasta_aln
+ FAMSA, -gt upgma -medoidtree, FAMSA,
+ , ,TCOFFEE,
FAMSA,,REGRESSIVE,
```

`tree` is the tool used to build the tree.

Arguments to the tree tool can be provided using `args_tree`.

The `aligner` column contains the tool to run the alignment.
> [!NOTE]
> The only required input is aligner.

Finally, the arguments to the aligner tool can be set by using the `args_aligner` column.
#### 3. RUN THE PIPELINE

Now, you can run the pipeline using:

@@ -117,6 +96,10 @@ To see the results of an example test run with a full size dataset refer to the
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/multiplesequencealign/output).

## Extending the pipeline

For details on how to add your favourite guide tree/MSA/evaluation step in nf-core/multiplesequencealign please refer to [extending documentation](https://github.com/luisas/multiplesequencealign/blob/luisa_patch/docs/extending.md).

## Credits

nf-core/multiplesequencealign was originally written by Luisa Santus ([@luisas](https://github.com/luisas)) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from The Comparative Bioinformatics Group at The Centre for Genomic Regulation, Spain.
3 changes: 2 additions & 1 deletion assets/schema_input.json
@@ -28,6 +28,7 @@
"type": "string"
}
},
- "required": ["id", "fasta"]
+ "required": ["id"],
+ "anyOf": [{ "required": ["fasta"] }, { "required": ["structures"] }]
}
}
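
To make the effect of the new schema rule concrete, here is a minimal Python sketch of the same check applied to one samplesheet row: `id` is always required, and at least one of `fasta` or `structures` must be given. The helper names `has_value` and `row_is_valid` are hypothetical, and treating empty or whitespace-only cells as missing is an assumption about how the samplesheet is parsed, not something the schema itself states.

```python
def has_value(row, key):
    """True if the row has a non-empty value for key.

    Assumption: empty CSV cells count as missing.
    """
    value = row.get(key)
    return value is not None and str(value).strip() != ""


def row_is_valid(row):
    """Mirror of: required ["id"] plus anyOf [fasta, structures]."""
    return has_value(row, "id") and (
        has_value(row, "fasta") or has_value(row, "structures")
    )


if __name__ == "__main__":
    rows = [
        {"id": "seatoxin", "fasta": "seatoxin.fa"},
        {"id": "toxin", "structures": "toxin_structures"},
        {"id": "toxin", "reference": "toxin-ref.fa"},
    ]
    for row in rows:
        print(row, "->", "ok" if row_is_valid(row) else "rejected")
```

A row with only `id` and `reference` is rejected, while a structures-only row, which the previous `"required": ["id", "fasta"]` rule would have refused, is now accepted.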