fq lint module update: exit on failed validation #7000

oligomyeggo · 2024-11-15T13:56:24Z

I am proposing a change to the nf-core fq lint module, such that the process will exit if a FastQ file fails validation. This will allow us to add this module to nf-core pipelines like rnaseq (see relevant PR here) and validate FastQ files early in the workflow, preventing pipelines from continuing and running more computationally heavy steps on corrupted FastQ files.

subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness/main.nf

pinin4fjords · 2024-11-15T16:49:10Z

modules/nf-core/fq/lint/main.nf

@@ -29,5 +30,10 @@ process FQ_LINT {
    "${task.process}":
        fq: \$(echo \$(fq lint --version | sed 's/fq-lint //g'))
    END_VERSIONS
+
+    if ! tail -n 1 ${prefix}.fq_lint.txt | grep -q 'fq-lint end'; then


We shouldn't change the module. Generally we try to let the tools work as they do, without overlaying additional logic.

We can add some logic to the subworkflow + workflow that would filter out any libraries failing linting. You may already see logic in rnaseq that we use for trimming, strand failures etc, and we can use the same mechanism.

But my understanding is that fq_lint will return a non-zero error code on a failure? That should trigger a stop without this. Am I mistaken?

oligomyeggo · 2024-11-15

@pinin4fjords Yeah, kind of. fq lint has different validators that are each assigned a different code, and if a FastQ file fails the linting, you will see in the log which code it failed on. You can't grep for the codes themselves in the log, as the log states upfront which validators are enabled, by code. That's why I had to grep for "fq-lint end" and check whether it is missing. That's the last line you'll see in the linting log if the linting was successful, so failed FastQ files won't have it in their log.

So, fq lint doesn't actually put out an error code itself. If we left the current module alone as is, then that process will successfully complete on any FastQ file, even corrupt ones, which defeats the purpose of adding the linting into the pipeline.

I am definitely open to trying out some different logic to filter out samples that failed linting. Are you thinking those samples would just be filtered out, and the pipeline would keep going with the good samples, or that the pipeline would exit if it finds a sample that failed linting? We find that our users often want the pipeline to stop so that they can try reuploading (or redownloading from source and then reuploading) the offending FastQ file, which has worked in a lot of cases. Maybe there could be another option that lets a user choose whether the pipeline exits when a sample fails linting or continues with the successful samples?

pinin4fjords · 2024-11-17 (edited)

What I mean is the following. Assume a bad fastq file, foo.fq, with content like:

    @SEQ_ID_1
    GATTTGGGGTTTTCCCAGTT
    THIS SHOULD NOT BE HERE
    IIIIIIIIIIIIIIIIIIII
    @SEQ_ID_2
    AGCTAGCTAGGCTAGCTAAG
    +
    JJJJJJJJJJJJJJJJJJJJ

The output of fq lint foo.fq is:

    2024-11-17T17:03:29.770623Z INFO fq::commands::lint: fq-lint start
    2024-11-17T17:03:29.770645Z INFO fq::commands::lint: validating single end read
    2024-11-17T17:03:29.770651Z INFO fq::validators: disabled validators: []
    2024-11-17T17:03:29.770658Z INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]
    2024-11-17T17:03:29.770668Z INFO fq::validators: enabled paired read validators: []
    2024-11-17T17:03:29.770671Z INFO fq::commands::lint: starting validation
    /data/foo.fq:3:1: [S001] PlusLineValidator: missing + prefix

This produces a non-zero error code:

    $ echo $?
    1

Because Nextflow runs processes with set -e turned on, the task will fail when it encounters that issue. Your addition will have no effect.
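A minimal sketch of that behaviour, assuming bash is available. Here `false` is only a stand-in for a failing `fq lint` call, the echoed messages are hypothetical, and the `-ue` flags are meant to approximate how the task script is run rather than reproduce Nextflow's wrapper exactly:

```shell
#!/usr/bin/env bash
# Hypothetical task script: `false` plays the role of `fq lint` exiting with status 1.
task_script='
echo "lint start"
false                          # stand-in for the failing fq lint call
echo "custom check reached"    # never runs, because -e stops the script above
'

# Run the script with -e (exit on first error) and -u (error on unset variables).
# The && / || chain records the exit status without aborting this outer script.
output=$(bash -ue -c "$task_script") && status=0 || status=$?

echo "exit status: $status"
echo "output: $output"
```

The line after the failing command never executes, which is why a check appended after the lint call cannot change the outcome.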

I see that fq has its own codes, but that wasn't what I meant.

But the other point I mentioned is important. Modules in nf-core should reflect the native behaviour of the tool as much as possible, otherwise users don't know what to expect. We shouldn't layer in additional errors etc as you were doing here (even if it worked). If this WAS a problem (and again, I don't think it is, because of set -e), we would need to introduce any custom logic at the pipeline level, perhaps as a 'local' module that parsed the error logs.

oligomyeggo · 2024-11-18

Oh, ok, I follow now! Let me take out the custom code from the module.

I think I got mixed up because I noticed the pipeline was trying to re-run failed linting jobs, which I didn't want unless the linting job was failing due to a pipeline resource error. How are those two scenarios handled in Nextflow? If the linting completes but the FastQ file was bad and it produces a non-zero exit code, I would want the pipeline to fail. However, if the linting doesn't complete because of a resource issue or something similar, then I would want the pipeline to retry that task.

pinin4fjords · 2024-11-18

For the specific rnaseq example, we'd just make sure to set the appropriate label so the process gets enough resources to start.

But to the general point, you can use dynamic retry strategies to catch exit codes reflecting e.g. OOM:

errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }

Actually you could PR against this, if you feel the current value isn't appropriate.
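To illustrate why a rule like the one above separates the two cases, here is a rough shell sketch. The `classify` function is a hypothetical mirror of the closure, not part of Nextflow, and the two child processes are stand-ins: one is killed with SIGKILL (as happens on an out-of-memory kill), the other exits with status 1 like the failed lint shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical mirror of the closure: statuses 137..140 retry, anything else terminates.
classify() {
    if [ "$1" -ge 137 ] && [ "$1" -le 140 ]; then
        echo retry
    else
        echo terminate
    fi
}

# A process killed by SIGKILL reports 128 + 9 = 137 to its parent shell.
bash -c 'kill -KILL $$' && killed_status=0 || killed_status=$?

# A validation failure is an ordinary non-zero exit, here status 1.
bash -c 'exit 1' && lint_status=0 || lint_status=$?

echo "killed task:  $killed_status -> $(classify "$killed_status")"
echo "failed lint:  $lint_status -> $(classify "$lint_status")"
```

Under these assumptions a bad FastQ file falls outside the retry range and stops the run, while a task killed for exceeding resources is retried.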

oligomyeggo · 2024-11-15T18:14:56Z

@pinin4fjords - sorry, I accidentally clicked "Resolve conversation" on the topic of adding the module output to the subworkflow! I added that into my last commit, if that is what you meant/had in mind?

pinin4fjords

Looks good, thanks!

pinin4fjords · 2024-11-29T16:11:10Z

@oligomyeggo sorry, I was pulled into other things for a couple of weeks. We can get this merged soon.

Before we do, something came up on the workflow: maybe we could run this linting after every FASTQ manipulation, to check that no problems were introduced?

We'd just have to import the module multiple times, e.g.:

include { FQ_LINT as FQ_LINT_AFTER_TRIMMING } from '../../../modules/nf-core/fq/lint/main'

~~Would you like to do that, or shall I?~~ I think you'll have to due to your branch permissions

pinin4fjords · 2024-11-29T16:19:34Z

Never mind, this is actually a bit fiddly, I'm going to do it on a new PR

pinin4fjords · 2024-11-29T18:47:14Z

Thanks for the work! Closing in favour of #7123

oligomyeggo added 2 commits November 15, 2024 06:51

Have fq lint module exit on failed validation

b3c6ecb

Update subworkflow

4e8ef40

oligomyeggo mentioned this pull request Nov 15, 2024

Initial commit to add fq lint module nf-core/rnaseq#1453

Closed

11 tasks

pinin4fjords requested changes Nov 15, 2024

Add module output to subworkflow

cc79afc

Remove custom changes from fq lint main.nf

98a2105

pinin4fjords approved these changes Nov 18, 2024

pinin4fjords closed this Nov 29, 2024
