From 74db9d3f332a7fcbb2d6053612cff554e61865ed Mon Sep 17 00:00:00 2001
From: Pieter Moris <13552343+pmoris@users.noreply.github.com>
Date: Wed, 30 Oct 2024 11:27:16 +0100
Subject: [PATCH] Check if flowcell id matches for paired samples (#1664)

I noticed [this comment](https://github.com/nf-core/sarek/blob/5cc30494a6b8e7e53be64d308b582190ca7d2585/workflows/sarek/main.nf#L946) about checking the flowcell ID for paired samples while constructing GATK read groups. I was adapting the read group code for a custom pipeline and attempted a quick fix, so I thought I'd contribute it back to sarek.

> While constructing the read group from paired FASTQ samples, check that the flowcell ID is the same for the first reads of fastq_1 and fastq_2. Otherwise, exit with an error that reports the problematic sample and file paths.

Incidentally, while researching read groups I came across the following recommendations: https://support.sentieon.com/appnotes/read_groups/. Would it be worth updating some of the fields to match these guidelines?

## PR checklist

- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
  - => Only tested this manually, but happy to add a proper test if you could give me a starting point. Is there already an existing test for samplesheet validation that I can add this to? I guess I will need to add "corrupt" fastq files to the nf-core test repo?
- [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/sarek/tree/master/.github/CONTRIBUTING.md)
- [ ] If necessary, also make a PR on the nf-core/sarek _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
- [x] Make sure your code lints (`nf-core lint`).
- [x] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir `).
- [x] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir `).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
  - => will do this after submitting the PR so that I can link to it.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
  - => should I do this even for such a minor contribution?

---------

Co-authored-by: Maxime U Garcia
Co-authored-by: Maxime U Garcia
---
 CHANGELOG.md            |  1 +
 workflows/sarek/main.nf | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e14914386..0a90f9db6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [1653](https://github.com/nf-core/sarek/pull/1653) - Updates `sarek_subway` files with `lofreq`
 - [1660](https://github.com/nf-core/sarek/pull/1642) - Add `--length_required` for minimal reads length with `FASTP`
 - [1663](https://github.com/nf-core/sarek/pull/1663) - Massive conda modules update
+- [1664](https://github.com/nf-core/sarek/pull/1664) - Check if flowcell ID matches for read pair
 
 ### Changed
 
diff --git a/workflows/sarek/main.nf b/workflows/sarek/main.nf
index f60bc3d93..6ece09f9a 100644
--- a/workflows/sarek/main.nf
+++ b/workflows/sarek/main.nf
@@ -944,11 +944,15 @@ workflow SAREK {
 // Add readgroup to meta and remove lane
 def addReadgroupToMeta(meta, files) {
     def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
+    def flowcell = flowcellLaneFromFastq(files[0])
+
+    // Check if flowcell ID matches
+    if ( flowcell && flowcell != flowcellLaneFromFastq(files[1]) ){
+        error("Flowcell ID does not match for paired reads of sample ${meta.id} - ${files}")
+    }
 
-    // Here we're assuming that fastq_1 and fastq_2 are from the same flowcell:
     // If we cannot read the flowcell ID from the fastq file, then we don't use it
-    def sample_lane_id = flowcellLaneFromFastq(files[0]) ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
-    // TO-DO: Would it perhaps be better to also call flowcellLaneFromFastq(files[1]) and check that we get the same flowcell-id?
+    def sample_lane_id = flowcell ? "${meta.flowcell}.${meta.sample}.${meta.lane}" : "${meta.sample}.${meta.lane}"
 
     // Don't use a random element for ID, it breaks resuming
     def read_group = "\"@RG\\tID:${sample_lane_id}\\t${CN}PU:${meta.lane}\\tSM:${meta.patient}_${meta.sample}\\tLB:${meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\""