Check if flowcell id matches for paired samples (#1664)
I noticed [this comment](https://github.com/nf-core/sarek/blob/5cc30494a6b8e7e53be64d308b582190ca7d2585/workflows/sarek/main.nf#L946)
about checking the flowcell ID for paired samples while constructing
GATK read groups. I was adapting the read group code for a custom
pipeline and attempted a quick fix, so I thought I'd contribute it back
to sarek.

> While constructing the read group from paired fastq samples, check
that the flowcell ID is the same for the first reads in fastq_1 and
fastq_2. Otherwise, exit with an error that reports the problematic
sample and file paths.
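
The check described above can be sketched in plain Groovy. This is only an illustration under stated assumptions: `flowcellFromHeader` and `assertSameFlowcell` are hypothetical names, not functions from sarek, and the header is assumed to follow the Illumina CASAVA 1.8+ layout, where the flowcell ID is the third colon-separated field. The real pipeline reads the first header from each fastq file; here the headers are passed in as strings to keep the example self-contained.

```groovy
// Hypothetical helper (not the sarek implementation): extract the flowcell ID
// from an Illumina-style FASTQ header. Assumed layout (CASAVA 1.8+):
// @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>
String flowcellFromHeader(String header) {
    // Drop the leading '@', keep the part before the first whitespace, split on ':'
    def fields = header.replaceFirst(/^@/, '').split(/\s+/)[0].split(':')
    // Fewer than 7 fields: layout not recognised, so report no flowcell ID
    return fields.size() >= 7 ? fields[2] : null
}

// Hypothetical check: fail if both reads do not come from the same flowcell.
// As in the PR, the comparison is skipped when no flowcell ID can be read from read 1.
void assertSameFlowcell(String sampleId, String header1, String header2) {
    def fc1 = flowcellFromHeader(header1)
    def fc2 = flowcellFromHeader(header2)
    if (fc1 && fc1 != fc2) {
        throw new IllegalStateException(
            "Flowcell ID does not match for paired reads of sample ${sampleId}: ${fc1} vs ${fc2}".toString())
    }
}

// Made-up example headers for a matching pair
def r1 = '@A00123:45:HXXXXDSXX:1:1101:1000:2000 1:N:0:ACGT'
def r2 = '@A00123:45:HXXXXDSXX:1:1101:1000:2000 2:N:0:ACGT'
assert flowcellFromHeader(r1) == 'HXXXXDSXX'
assertSameFlowcell('sample1', r1, r2)   // matching flowcells: no exception
```

Headers that do not follow the assumed layout yield `null`, so the check is skipped for them rather than failing.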

Incidentally, while researching read groups I came across the following
recommendations: https://support.sentieon.com/appnotes/read_groups/.
Would it be worth updating some of the fields to match these guidelines?

<!--
# nf-core/sarek pull request

Many thanks for contributing to nf-core/sarek!

Please fill in the appropriate checklist below (delete whatever is not
relevant).
These are the most common things requested on pull requests (PRs).

Remember that PRs should be made against the dev branch, unless you're
preparing a pipeline release.

Learn more about contributing:
[CONTRIBUTING.md](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)
-->

## PR checklist

- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add
tests!
- => Only tested this manually, but happy to add a proper test if you
could give me a starting point. Is there already an existing test for
samplesheet validation that I can add this to? I guess I will need to
add "corrupt" fastq files to the nf-core test repo?
- [ ] If you've added a new tool - have you followed the pipeline
conventions in the [contribution
docs](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)
- [ ] If necessary, also make a PR on the nf-core/sarek _branch_ on the
[nf-core/test-datasets](https://github.com/nf-core/test-datasets)
repository.
- [x] Make sure your code lints (`nf-core lint`).
- [x] Ensure the test suite passes (`nextflow run . -profile test,docker
--outdir <OUTDIR>`).
- [x] Check for unexpected warnings in debug mode (`nextflow run .
-profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
    - => will do this after submitting the PR so that I can link to it.
- [ ] `README.md` is updated (including new tool citations and
authors/contributors).
    - => should I do this even for such a minor contribution?

---------

Co-authored-by: Maxime U Garcia <[email protected]>
3 people authored Oct 30, 2024
1 parent 8ea4af9 commit 74db9d3
Showing 2 changed files with 8 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [1653](https://github.com/nf-core/sarek/pull/1653) - Updates `sarek_subway` files with `lofreq`
 - [1660](https://github.com/nf-core/sarek/pull/1642) - Add `--length_required` for minimal reads length with `FASTP`
 - [1663](https://github.com/nf-core/sarek/pull/1663) - Massive conda modules update
+- [1664](https://github.com/nf-core/sarek/pull/1664) - Check if flowcell ID matches for read pair

 ### Changed

10 changes: 7 additions & 3 deletions workflows/sarek/main.nf
@@ -944,11 +944,15 @@ workflow SAREK {
 // Add readgroup to meta and remove lane
 def addReadgroupToMeta(meta, files) {
     def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
+    def flowcell = flowcellLaneFromFastq(files[0])
+
+    // Check if flowcell ID matches
+    if ( flowcell && flowcell != flowcellLaneFromFastq(files[1]) ){
+        error("Flowcell ID does not match for paired reads of sample ${meta.id} - ${files}")
+    }

-    // Here we're assuming that fastq_1 and fastq_2 are from the same flowcell:
     // If we cannot read the flowcell ID from the fastq file, then we don't use it
-    def sample_lane_id = flowcellLaneFromFastq(files[0]) ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
-    // TO-DO: Would it perhaps be better to also call flowcellLaneFromFastq(files[1]) and check that we get the same flowcell-id?
+    def sample_lane_id = flowcell ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"

     // Don't use a random element for ID, it breaks resuming
     def read_group = "\"@RG\\tID:${sample_lane_id}\\t${CN}PU:${meta.lane}\\tSM:${meta.patient}_${meta.sample}\\tLB:${meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\""
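
To see how the flowcell ID affects the resulting read group, the string construction in this hunk can be reproduced as a standalone Groovy sketch. The function name `buildReadGroup` and all input values are hypothetical, and `meta` and `params` are plain maps standing in for the pipeline's objects. One simplification to note: the sketch interpolates the `flowcell` argument directly, whereas the pipeline code uses `meta.flowcell`.

```groovy
// Hypothetical standalone version of the read-group construction shown in the diff.
// `meta` and `params` are plain maps here; in the pipeline they come from Nextflow.
String buildReadGroup(Map meta, Map params, String flowcell) {
    // Optional sequencing-centre field, only emitted when seq_center is set
    def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
    // Prefix the ID with the flowcell only when one could be read from the fastq
    def sampleLaneId = flowcell ? "${flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
    // '\\t' yields a literal backslash followed by 't', as in the original code
    return "\"@RG\\tID:${sampleLaneId}\\t${CN}PU:${meta.lane}\\tSM:${meta.patient}_${meta.sample}\\tLB:${meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\"".toString()
}

// Made-up example values
def meta = [patient: 'P1', sample: 'S1', lane: 'L001']
def params = [seq_center: null, seq_platform: 'ILLUMINA', fasta: 'genome.fasta']

def withFlowcell = buildReadGroup(meta, params, 'HXXXXDSXX')
def withoutFlowcell = buildReadGroup(meta, params, null)

assert withFlowcell.contains('ID:HXXXXDSXX.S1.L001')
assert withoutFlowcell.contains('ID:S1.L001')
```

Because the ID is built only from sample metadata and the flowcell, it stays the same between runs, which is what the "breaks resuming" comment in the original code asks for.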