From 74db9d3f332a7fcbb2d6053612cff554e61865ed Mon Sep 17 00:00:00 2001
From: Pieter Moris <13552343+pmoris@users.noreply.github.com>
Date: Wed, 30 Oct 2024 11:27:16 +0100
Subject: [PATCH] Check if flowcell id matches for paired samples (#1664)

I noticed [this
comment](https://github.com/nf-core/sarek/blob/5cc30494a6b8e7e53be64d308b582190ca7d2585/workflows/sarek/main.nf#L946)
about checking the flowcell ID for paired samples while constructing
GATK read groups. I was adapting the read group code for a custom
pipeline and attempted a quick fix, so I thought I'd contribute it back
to sarek.

> While constructing the read group from paired fastq files, check that
> the flowcell ID read from the first read of fastq_1 matches the one
> read from fastq_2. If they differ, exit with an error that reports the
> problematic sample and file paths.
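
For illustration, here is a minimal standalone Groovy sketch of that
check, working on header strings rather than gzipped files. The helper
names `flowcellFromHeader` and `flowcellMismatch` and the example
headers are made up for this sketch and are not part of sarek. It
assumes Illumina-style headers in which the flowcell ID is the third of
at least seven colon-separated fields, or the first of exactly five.

```groovy
// Hypothetical helper (not sarek code): extract the flowcell ID from a
// FASTQ header line. Returns null when the layout is not recognised.
def flowcellFromHeader(String header) {
    if (!header?.startsWith('@')) return null
    def fields = header.substring(1).split(':')
    if (fields.size() >= 7) return fields[2]   // @<instrument>:<run>:<flowcell>:<lane>:...
    if (fields.size() == 5) return fields[0]   // @<flowcell>:<lane>:...
    return null
}

// Hypothetical check: report a mismatch only when an ID could be read from
// the first header and the second header yields something different.
def flowcellMismatch(String header1, String header2) {
    def flowcell = flowcellFromHeader(header1)
    return flowcell && flowcell != flowcellFromHeader(header2)
}

// Made-up example headers
def r1    = '@A00123:45:HXXXXXXXX:1:1101:1000:2000 1:N:0:ACGT'
def r2    = '@A00123:45:HXXXXXXXX:1:1101:1000:2000 2:N:0:ACGT'
def other = '@A00123:46:HYYYYYYYY:1:1101:1000:2000 2:N:0:ACGT'

assert !flowcellMismatch(r1, r2)            // same flowcell: accepted
assert flowcellMismatch(r1, other)          // different flowcell: flagged
assert !flowcellMismatch('@read1', '@read2') // no readable ID: check is skipped
```

As in the patch, nothing is reported when no flowcell ID can be read
from the first file, so fastq files with non-standard headers are not
rejected.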

Incidentally, while researching read groups I came across the following
recommendations: https://support.sentieon.com/appnotes/read_groups/.
Would it be worth updating some of the fields to match these guidelines?

## PR checklist

- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add
tests!
- => Only tested this manually, but happy to add a proper test if you
could give me a starting point. Is there already an existing test for
samplesheet validation that I can add this to? I guess I will need to
add "corrupt" fastq files to the nf-core test repo?
- [ ] If you've added a new tool, have you followed the pipeline
conventions in the [contribution
docs](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)?
- [ ] If necessary, also make a PR on the nf-core/sarek _branch_ on the
[nf-core/test-datasets](https://github.com/nf-core/test-datasets)
repository.
- [x] Make sure your code lints (`nf-core lint`).
- [x] Ensure the test suite passes (`nextflow run . -profile test,docker
--outdir <OUTDIR>`).
- [x] Check for unexpected warnings in debug mode (`nextflow run .
-profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
    - => will do this after submitting the PR so that I can link to it.
- [ ] `README.md` is updated (including new tool citations and
authors/contributors).
    - => should I do this even for such a minor contribution?

---------

Co-authored-by: Maxime U Garcia <max.u.garcia@gmail.com>
Co-authored-by: Maxime U Garcia <maxime.garcia@seqera.io>
---
 CHANGELOG.md            |  1 +
 workflows/sarek/main.nf | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index e14914386..0a90f9db6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [1653](https://github.com/nf-core/sarek/pull/1653) - Updates `sarek_subway` files with `lofreq`
 - [1660](https://github.com/nf-core/sarek/pull/1642) - Add `--length_required` for minimal reads length with `FASTP`
 - [1663](https://github.com/nf-core/sarek/pull/1663) - Massive conda modules update
+- [1664](https://github.com/nf-core/sarek/pull/1664) - Check if flowcell ID matches for read pair
 
 ### Changed
 
diff --git a/workflows/sarek/main.nf b/workflows/sarek/main.nf
index f60bc3d93..6ece09f9a 100644
--- a/workflows/sarek/main.nf
+++ b/workflows/sarek/main.nf
@@ -944,11 +944,15 @@ workflow SAREK {
 // Add readgroup to meta and remove lane
 def addReadgroupToMeta(meta, files) {
     def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
+    def flowcell = flowcellLaneFromFastq(files[0])
+
+    // Check if flowcell ID matches
+    if ( flowcell && flowcell != flowcellLaneFromFastq(files[1]) ){
+        error("Flowcell ID does not match for paired reads of sample ${meta.id} - ${files}")
+    }
 
-    // Here we're assuming that fastq_1 and fastq_2 are from the same flowcell:
     // If we cannot read the flowcell ID from the fastq file, then we don't use it
-    def sample_lane_id = flowcellLaneFromFastq(files[0]) ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
-    // TO-DO: Would it perhaps be better to also call flowcellLaneFromFastq(files[1]) and check that we get the same flowcell-id?
+    def sample_lane_id = flowcell ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
 
     // Don't use a random element for ID, it breaks resuming
     def read_group = "\"@RG\\tID:${sample_lane_id}\\t${CN}PU:${meta.lane}\\tSM:${meta.patient}_${meta.sample}\\tLB:${meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\""