Chaining proteinfold #176

Merged
merged 39 commits into from
Nov 27, 2024
29b2a4a
add matching ids
luisas Nov 18, 2024
ef90b74
clean
luisas Nov 18, 2024
8871516
update
luisas Nov 18, 2024
c65ec4b
Update input
luisas Nov 22, 2024
66ee7b2
Fix docs
luisas Nov 22, 2024
a315e30
Update Readme
luisas Nov 22, 2024
c0e5f29
Add documentation chaining pipelines
luisas Nov 22, 2024
4d11ded
update docs
luisas Nov 25, 2024
c4e5c1a
update schema
luisas Nov 25, 2024
6dd7aa8
update schema
luisas Nov 25, 2024
daeaa65
fix some lintin
luisas Nov 25, 2024
757a111
update modules
luisas Nov 25, 2024
63ffeff
merge dev
luisas Nov 25, 2024
67bc726
fix lint
luisas Nov 25, 2024
a3e72a9
update metromap
luisas Nov 25, 2024
69a04d7
fix config
luisas Nov 25, 2024
6eff7df
fix conf
luisas Nov 25, 2024
a5272c6
up
luisas Nov 25, 2024
5f2e4d7
up
luisas Nov 25, 2024
91a7e63
up
luisas Nov 25, 2024
e266fc1
up
luisas Nov 25, 2024
010a447
update modules
luisas Nov 26, 2024
0c7fa46
update docs
luisas Nov 26, 2024
f8e62bd
fix lint
luisas Nov 26, 2024
3b6ff65
update
luisas Nov 26, 2024
46dcc9d
up
luisas Nov 26, 2024
6c348d4
up
luisas Nov 26, 2024
386d973
Update README.md
luisas Nov 27, 2024
e068acc
Update docs/usage/chaining_with_proteinfold.md
luisas Nov 27, 2024
9928909
Update docs/usage/chaining_with_proteinfold.md
luisas Nov 27, 2024
052321d
Update workflows/multiplesequencealign.nf
luisas Nov 27, 2024
121e75f
Update workflows/multiplesequencealign.nf
luisas Nov 27, 2024
21037be
Update docs/usage/chaining_with_proteinfold.md
luisas Nov 27, 2024
54a0030
Update workflows/multiplesequencealign.nf
luisas Nov 27, 2024
802741a
Update workflows/multiplesequencealign.nf
luisas Nov 27, 2024
54e2f50
Update workflows/multiplesequencealign.nf
luisas Nov 27, 2024
66bf019
up
luisas Nov 27, 2024
91026c6
update docs
luisas Nov 27, 2024
ccb04b0
Update docs/usage/chaining_with_proteinfold.md
luisas Nov 27, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ Initial release of nf-core/multiplesequencealign, created with the [nf-core](htt
- [[#147](https://github.com/nf-core/multiplesequencealign/pull/147)] - Add small testing profile + some fixes of the shiny app.
- [[#148](https://github.com/nf-core/multiplesequencealign/pull/148)] - Add UPP module.
- [[#150](https://github.com/nf-core/multiplesequencealign/pull/150)] - Update modules and readme for pre-release.
- [[#174](https://github.com/nf-core/multiplesequencealign/issues/174)] - Add the chaining of proteinfold output to MSA input.

### `Fixed`

Expand Down
17 changes: 7 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,9 +38,8 @@ The pipeline performs the following steps:

## Usage

:::note
If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
:::
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

#### 1. SAMPLESHEET

Expand All @@ -50,16 +49,15 @@ It should look like this:
`samplesheet.csv`:

```csv
id,fasta,reference,dependencies,template
id,fasta,reference,optional_data,template
seatoxin,seatoxin.fa,seatoxin-ref.fa,seatoxin_structures,seatoxin_template.txt
toxin,toxin.fa,toxin-ref.fa,toxin_structures,toxin_template.txt
```

Each row represents a set of sequences to be aligned (in this case the seatoxin and toxin protein families), together with the associated reference alignments and dependency files, if available. Dependency files can be anything from protein structures to any other information you want to use in your favourite MSA tool.

:::note
The only required input is the id column and either fasta or dependencies.
:::
> [!NOTE]
> The only required input is the id column and either fasta or optional_data.

#### 2. TOOLSHEET

Expand All @@ -78,9 +76,8 @@ FAMSA, -gt upgma -medoidtree, FAMSA,
FAMSA,,REGRESSIVE,
```

:::note
The only required input is aligner.
:::
> [!NOTE]
> The only required input is `aligner`.

#### 3. RUN THE PIPELINE

Expand Down
4 changes: 2 additions & 2 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
"type": "string",
"default": ""
},
"dependencies": {
"optional_data": {
"type": "string",
"default": ""
},
Expand All @@ -33,6 +33,6 @@
}
},
"required": ["id"],
"anyOf": [{ "required": ["fasta"] }, { "required": ["dependencies"] }]
"anyOf": [{ "required": ["fasta"] }, { "required": ["optional_data"] }]
}
}
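The updated schema constraint can be read as a row-level rule: `id` is always required, and at least one of `fasta` or `optional_data` must be present. Below is a minimal Python sketch of that rule, assuming that empty samplesheet cells are treated as missing. `row_is_valid` is a hypothetical helper written for illustration only; the pipeline itself relies on its schema validation, not on this code.

```python
import csv
import io


def row_is_valid(row: dict) -> bool:
    """Mirror the schema rule: `id` required, plus `fasta` or `optional_data`.

    Assumption: empty or whitespace-only cells count as missing.
    """
    def present(key: str) -> bool:
        return bool((row.get(key) or "").strip())

    return present("id") and (present("fasta") or present("optional_data"))


# Illustrative samplesheet; the file names are placeholders.
samplesheet = """id,fasta,reference,optional_data,template
seatoxin,seatoxin.fa,seatoxin-ref.fa,seatoxin_structures,seatoxin_template.txt
toxin,,,toxin_structures,
broken,,,,
"""

rows = list(csv.DictReader(io.StringIO(samplesheet)))
validity = {row["id"]: row_is_valid(row) for row in rows}
print(validity)
```

In this toy samplesheet, the `toxin` row passes because it supplies `optional_data` even though `fasta` is empty, while the `broken` row fails because it supplies neither.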
Binary file modified docs/images/nf-core-msa_metro_map.png
18 changes: 9 additions & 9 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,23 +100,23 @@ The sample sheet defines the **input data** that the pipeline will process.
It should look like this:

```csv title="samplesheet.csv"
id,fasta,reference,dependencies,template
id,fasta,reference,optional_data,template
seatoxin,seatoxin.fa,seatoxin-ref.fa,seatoxin_structures,seatoxin_template.txt
toxin,toxin.fa,toxin-ref.fa,toxin_structures,toxin_template.txt
```

Each row represents a set of sequences (in this case the seatoxin and toxin protein families) to be processed.

| Column | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Required. Name of the set of sequences. It can correspond to the protein family name or to an internal id. It must be unique. |
| `fasta` | Required (At least one of fasta or dependencies must be provided). Full path to the fasta file that contains the sequence to be aligned. |
| `reference` | Optional. Full path to the reference alignment. It is used for the reference-based evaluation steps. It can be left empty. |
| `dependencies` | Required (At least one of fasta or dependencies must be provided). Full path to the folder that contains the dependency files (e.g. protein structures) for the sequences to be aligned. Currently, it is used for structural aligners and structure-based evaluation steps. It can be left empty. |
| `template` | Optional. Files that define the mapping between the input sequence and the dependency files (e.g. protein structures) to be used. Used by 3D-Coffee. If not specified, they will be automatically generated assuming that the sequence name provided in the fasta is the same as the file name of the corresponding PDB file. E.g. if you set (default) the parameter templates_suffix to .pdb, then: ">MyProteinName" in the fasta file and "MyProteinName.pdb" for the corresponding protein structure. For more information on how to generate a template file manually, please look at the T-Coffee [documentation](https://tcoffee.readthedocs.io/en/latest/tcoffee_main_documentation.html). |
| Column | Description |
| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Required. Name of the set of sequences. It can correspond to the protein family name or to an internal id. It must be unique. |
| `fasta` | Required (At least one of fasta or optional_data must be provided). Full path to the fasta file that contains the sequence to be aligned. |
| `reference` | Optional. Full path to the reference alignment. It is used for the reference-based evaluation steps. It can be left empty. |
| `optional_data` | Required (At least one of fasta or optional_data must be provided). Full path to the folder that contains the dependency files (e.g. protein structures) for the sequences to be aligned. Currently, it is used for structural aligners and structure-based evaluation steps. It can be left empty. |
| `template` | Optional. Files that define the mapping between the input sequence and the dependency files (e.g. protein structures) to be used. Used by 3D-Coffee. If not specified, they will be automatically generated assuming that the sequence name provided in the fasta is the same as the file name of the corresponding PDB file. E.g. if you set (default) the parameter templates_suffix to .pdb, then: ">MyProteinName" in the fasta file and "MyProteinName.pdb" for the corresponding protein structure. For more information on how to generate a template file manually, please look at the T-Coffee [documentation](https://tcoffee.readthedocs.io/en/latest/tcoffee_main_documentation.html). |

:::note
You can have some samples with dependencies and/or references and some without. The pipeline will run the modules requiring dependencies/references only on the samples for which you have provided the required information and the others will be just skipped.
You can have some samples with optional_data and/or references and some without. The pipeline will run the modules requiring optional_data/references only on the samples for which you have provided the required information, and will skip the others.
:::
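The automatic template generation described in the table relies on each sequence name in the fasta matching the file name of its structure (for example `>MyProteinName` and `MyProteinName.pdb`). The following Python sketch shows one way to check that convention before a run. It is not part of the pipeline: `fasta_names` and `missing_structures` are hypothetical helpers, and taking the first whitespace-delimited token of the header as the sequence name is an assumption.

```python
import tempfile
from pathlib import Path


def fasta_names(fasta_text: str) -> list[str]:
    """Return the sequence name of each '>' header line.

    Assumption: the name is the first whitespace-delimited token.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def missing_structures(fasta_text: str, structures_dir: Path, suffix: str = ".pdb") -> list[str]:
    """List sequence names with no `<name><suffix>` file in structures_dir."""
    available = {
        p.name.removesuffix(suffix)
        for p in structures_dir.iterdir()
        if p.name.endswith(suffix)
    }
    return [name for name in fasta_names(fasta_text) if name not in available]


# Toy example: one of the two sequences has a matching structure file.
with tempfile.TemporaryDirectory() as tmp:
    structures = Path(tmp)
    (structures / "MyProteinName.pdb").write_text("")
    fasta = ">MyProteinName\nMKV\n>OtherProtein\nMLA\n"
    missing = missing_structures(fasta, structures)
    print(missing)
```

Any name reported as missing would need either a matching structure file or a manually written template file.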

## Toolsheet input
Expand Down
49 changes: 49 additions & 0 deletions docs/usage/chaining_with_proteinfold.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Using nf-core/proteinfold to generate the input protein structures

Structural aligners leverage protein structural information to render the MSA.

You can provide your PDB structures via the samplesheet, as outlined in the primary usage documentation. However, if you do not already have protein structures available, you may opt to use protein structure prediction tools to create these models.

To facilitate this, we offer seamless integration with the nf-core/proteinfold pipeline, enabling you to generate the protein structures required for this workflow.

To do so, you only need to build one samplesheet file, in the exact format required by the nf-core/multiplesequencealign pipeline.
This samplesheet is also compatible with nf-core/proteinfold, which will predict the structures and output them in the format required by the nf-core/multiplesequencealign pipeline.

To run the two pipelines in sequence, you can use the following commands.

> [!NOTE]
> Please refer to the [proteinfold documentation](https://nf-co.re/proteinfold/1.1.1/) to choose the parameters that suit your run.

Here we show how to run proteinfold in its local ColabFold mode, but the chaining works for all proteinfold modes.

```bash
nextflow run nf-core/proteinfold \
--input ./samplesheet.csv \
--outdir ./proteinfold_results \
--split_fasta \
-r dev \
--mode colabfold \
--colabfold_server local \
--colabfold_db <null (default) | PATH> \
--num_recycle 3 \
--use_amber <true/false> \
--colabfold_model_preset "AlphaFold2-ptm" \
--use_gpu <true/false> \
--db_load_mode 0 \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>


nextflow run nf-core/multiplesequencealign \
--input ./samplesheet.csv \
--tools ./toolsheet.csv \
--optional_data_dir ./proteinfold_results/*/*/top_ranked_structures \
--outdir ./results \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>

```

> [!NOTE]
> The one important parameter NOT to forget in proteinfold when chaining is `--split_fasta`. It allows a multi-FASTA file to be used as input for monomer predictions, which the MSA pipeline needs. The rest of the proteinfold parameters can and should be tuned to your preferences for your proteinfold run. Please refer to the proteinfold documentation for details.

> [!WARNING]
> This is currently an experimental feature and is only available in the dev branch of proteinfold, so do not forget `-r dev`. The feature will soon be available with the next release of nf-core/proteinfold.
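
To illustrate why a multi-FASTA input has to be split for monomer predictions, here is a conceptual Python sketch that separates a multi-FASTA into one record per sequence. It is not proteinfold's implementation of `--split_fasta`: `split_multifasta` is a hypothetical helper, and using the first header token as the record name is an assumption.

```python
def split_multifasta(fasta_text: str) -> dict[str, str]:
    """Split multi-FASTA text into {name: sequence}.

    Assumptions: the name is the first token of the header line, and a
    repeated name overwrites the earlier record.
    """
    records: dict[str, list[str]] = {}
    current = None
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else None
            if current is not None:
                records[current] = []
        elif current is not None:
            # Sequence lines may wrap, so collect them and join at the end.
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Toy input with two sequences, the first wrapped over two lines.
fasta = ">seq1 toxin A\nMKV\nLLA\n>seq2\nGGH\n"
records = split_multifasta(fasta)
print(records)
```

Each resulting record corresponds to one monomer that could be predicted on its own, with the record name serving as the identifier that the structure file would need to carry.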
File renamed without changes.
12 changes: 6 additions & 6 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
"nf-core": {
"clustalo/align": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"git_sha": "2a8530b890878747f5063a894bad9fb2abd5c071",
"installed_by": ["modules"]
},
"clustalo/guidetree": {
Expand Down Expand Up @@ -99,12 +99,12 @@
},
"tcoffee/alncompare": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"git_sha": "ffa000ab3c33df25a165b5f9a039c4cbb665a77b",
"installed_by": ["modules"]
},
"tcoffee/consensus": {
"branch": "master",
"git_sha": "66b22564bc1bc0db7292f2073cdef954ead773e7",
"git_sha": "023e51187884ea6cc7290767486f551565f1b77a",
"installed_by": ["modules"]
},
"tcoffee/irmsd": {
Expand Down Expand Up @@ -143,17 +143,17 @@
"nf-core": {
"utils_nextflow_pipeline": {
"branch": "master",
"git_sha": "3aa0aec1d52d492fe241919f0c6100ebf0074082",
"git_sha": "c2b22d85f30a706a3073387f30380704fcae013b",
"installed_by": ["subworkflows"]
},
"utils_nfcore_pipeline": {
"branch": "master",
"git_sha": "1b6b9a3338d011367137808b49b923515080e3ba",
"git_sha": "1b89f75f1aa2021ec3360d0deccd0f6e97240551",
"installed_by": ["subworkflows"]
},
"utils_nfschema_plugin": {
"branch": "master",
"git_sha": "bbd5a41f4535a8defafe6080e00ea74c45f4f96c",
"git_sha": "2fd2cd6d0e7b273747f32e465fdc6bcc3ae0814e",
"installed_by": ["subworkflows"]
}
}
Expand Down
20 changes: 17 additions & 3 deletions modules/nf-core/clustalo/align/main.nf
