Plot contig boundaries #5

Merged: 5 commits, Jun 28, 2024
158 changes: 81 additions & 77 deletions assets/tube_map.svg
8 changes: 8 additions & 0 deletions conf/modules.config
@@ -18,6 +18,14 @@ process {
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

    withName: SEQTK_CUTN_TARGET {
        ext.args = { "-n 10" }
    }

    withName: SEQTK_CUTN_QUERY {
        ext.args = { "-n 10" }
    }

    withName: 'LAST_LASTDB' {
        // See https://gitlab.com/mcfrith/last/-/blob/main/doc/lastdb.rst for details
        // -R01: uppercase all sequences and then lowercase simple repeats
53 changes: 38 additions & 15 deletions docs/output.md
@@ -2,31 +2,54 @@

## Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the Last_dotplot report, which summarises results at the end of the pipeline.
This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

## Pipeline overview

## Outputs

Each _query_ genome is aligned to the _target_ genome, and each alignment is visualised with dot plots. The output file names are constructed by concatenating the _target_ and _query_ sample identifiers with a `___` separator (three underscores), to facilitate re-extraction of the IDs from file names. The file suffixes are as follows:

- `.train` contains the alignment parameters computed by `last-train` (optional)
- `m2m_aln` is the _**many-to-many**_ alignment between the _target_ and _query_ genomes (optional, through the `--m2m` option).
- `m2m_plot` (optional)
- `m2o_aln` is the _**many-to-one**_ alignment, in which regions of the _target_ genome are matched at most once by the _query_ genome.
- `m2o_plot` (optional)
- `o2o_aln` is the _**one-to-one**_ alignment between the _target_ and _query_ genomes.
- `o2o_plot` (optional)
- `o2m_aln` is the _**one-to-many**_ alignment between the _target_ and _query_ genomes (optional).
- `o2m_plot` (optional)

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Alignments](#alignments) - Alignment of the _query_ genomes to the _target_ genome
- [Dot plots](#dot-plots) - Visualisation of the alignments between the _query_ genomes and the _target_ genome
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

Each _query_ genome is aligned to the _target_ genome, and each alignment is visualised with dot plots. The output file names are constructed by concatenating the _target_ and _query_ sample identifiers with a `___` separator (three underscores), to facilitate re-extraction of the IDs from file names.
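
The naming convention can be illustrated with a minimal Python sketch that recovers the two identifiers from a file name. It is not part of the pipeline: the function name `split_ids` and the example file name are hypothetical, and the sketch assumes that sample identifiers contain neither a dot nor the `___` separator.

```python
from pathlib import PurePath

SEPARATOR = "___"  # three underscores, as described in the text


def split_ids(file_name: str) -> tuple[str, str]:
    """Return (target_id, query_id) parsed from an output file name.

    Assumption (not stated by the pipeline): sample identifiers contain
    no "." and no "___", so the first dot starts the file suffix.
    """
    base = PurePath(file_name).name
    # Drop the suffix, e.g. ".o2o_aln.maf.gz" or ".train".
    stem = base.split(".", 1)[0]
    target, sep, query = stem.partition(SEPARATOR)
    if not sep or not target or not query:
        raise ValueError(f"no '{SEPARATOR}'-separated identifiers in {file_name!r}")
    return target, query


# Hypothetical file name, for illustration only.
target_id, query_id = split_ids("last/human___mouse.o2o_aln.maf.gz")
```

Using a separator that is unlikely to appear inside a sample identifier is what makes this split unambiguous; a single underscore would not allow it.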

### Alignments

<details markdown="1">
<summary>Output files</summary>

- `last/`
  - `*.train` contains the alignment parameters computed by `last-train` (optional)
  - `*.m2m_aln.maf.gz` is the _**many-to-many**_ alignment between the _target_ and _query_ genomes (optional, through the `--m2m` option).
  - `*.m2o_aln.maf.gz` is the _**many-to-one**_ alignment, in which regions of the _target_ genome are matched at most once by the _query_ genome.
  - `*.o2o_aln.maf.gz` is the _**one-to-one**_ alignment between the _target_ and _query_ genomes.
  - `*.o2m_aln.maf.gz` is the _**one-to-many**_ alignment between the _target_ and _query_ genomes (optional).

</details>

Genomes are aligned with [`lastal`](https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst) after alignment parameters have been determined with [`last-train`](https://gitlab.com/mcfrith/last/-/blob/main/doc/last-train.rst). _**Many-to-many**_ alignments are progressively converted to _**one-to-one**_ with [`last-split`](https://gitlab.com/mcfrith/last/-/blob/main/doc/last-split.rst).
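
As a rough illustration of how the compressed MAF outputs can be inspected, the sketch below counts the alignment blocks in one file. It is not part of the pipeline; it assumes the usual MAF layout in which every alignment block begins with a line whose first field is `a`, and the function name `count_maf_blocks` and the path in the usage comment are hypothetical.

```python
import gzip


def count_maf_blocks(path: str) -> int:
    """Count alignment blocks in a gzip-compressed MAF file."""
    blocks = 0
    # "rt" opens the gzip stream in text mode so lines are str, not bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Assumption: each block starts with an "a" line, optionally
            # followed by key=value fields such as a score.
            if line.startswith("a ") or line.rstrip("\n") == "a":
                blocks += 1
    return blocks


# Example call with a hypothetical path:
# count_maf_blocks("last/human___mouse.o2o_aln.maf.gz")
```

Running it on each of the `*_aln.maf.gz` files gives a quick summary of how many blocks each filtering step retains.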

### Dot plots

<details markdown="1">
<summary>Output files</summary>

- `last/`
  - `*.m2m_plot` (optional)
  - `*.m2o_plot` (optional)
  - `*.o2o_plot` (optional)
  - `*.o2m_plot` (optional)

</details>

Dot plots representing the pairwise genome alignments, produced with the [`last-dotplot`](https://gitlab.com/mcfrith/last/-/blob/main/doc/last-dotplot.rst) tool.

Poly-N regions longer than 9 bases in each genome sequence are marked in pale red in the dot plots. These often indicate contig boundaries in scaffolds. The regions are detected with `seqtk cutN`, and its output is provided in the `seqtk` directory.
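
To make the "longer than 9 bases" threshold concrete, here is a simplified Python sketch that reports runs of at least 10 consecutive `N` characters in a sequence. It is only an approximation for illustration and not how `seqtk cutN` is implemented; the function name `poly_n_intervals` and the toy sequence are hypothetical.

```python
import re

MIN_RUN = 10  # "longer than 9 bases", matching the "-n 10" setting


def poly_n_intervals(sequence: str, min_run: int = MIN_RUN) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals of N runs.

    Simplification: only uninterrupted runs of N (upper or lower case)
    of at least `min_run` bases are reported.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


# Toy scaffold: a 9-base run (ignored) and a 10-base run (reported).
scaffold = "ACGT" + "N" * 9 + "ACGT" + "N" * 10 + "AC"
gaps = poly_n_intervals(scaffold)  # [(17, 27)]
```

The half-open coordinates follow the BED convention, which makes the intervals easy to compare with tools that report regions in that format.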

### MultiQC

<details markdown="1">
2 changes: 1 addition & 1 deletion modules.json
@@ -17,7 +17,7 @@
},
"last/dotplot": {
"branch": "master",
"git_sha": "3fa9017b55b9c26e1c327ca189d3942b55f4d496",
"git_sha": "23a928df77b20861eac09ca998029ad47a7155cb",
"installed_by": ["modules"]
},
"last/lastal": {
7 changes: 6 additions & 1 deletion modules/nf-core/last/dotplot/main.nf

Some generated files are not rendered by default.

8 changes: 8 additions & 0 deletions modules/nf-core/last/dotplot/meta.yml

Some generated files are not rendered by default.
