
Joining channels using maps as the key to join by can fail on resume #3175

Closed
Midnighter opened this issue Sep 3, 2022 · 12 comments · Fixed by #3740

Comments

@Midnighter

Bug report

Expected behavior and actual behavior

When joining two channels, for example, on the first element which is often a map with sample meta information, I would expect to be able to resume my pipeline and all samples to be processed. Instead, it can happen that a large part of my samples are dropped in the join (presumably the elements suddenly mismatch).
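For reference, a minimal sketch of the general pattern being described (all names and values here are hypothetical, not taken from the actual pipeline):

    // Two channels keyed by a meta map in position 0, joined on that map.
    reads = Channel.of(
      [ [id: 'sampleA'], 'a_R1.fastq.gz' ],
      [ [id: 'sampleB'], 'b_R1.fastq.gz' ]
    )
    counts = Channel.of(
      [ [id: 'sampleA'], 101 ],
      [ [id: 'sampleB'], 202 ]
    )
    reads
      | join(counts)   // by default joins on the first element, here the meta map
      | view

On a fresh run every sample pairs up; the report here is that after -resume some samples silently disappear from the join.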

Steps to reproduce the problem

This is a bit tricky to reproduce since it may require large inputs. I've written about it in more detail elsewhere, but the examples there do not reproduce the problem. However, several nf-core pipelines do seem to be affected.

I have a pipeline where this consistently happens on resume, but I cannot share the data (FASTQ pairs) that would be needed to reproduce it.

Program output

The output will look something like the following. Note that the outputs of MINIO and FASTQ_READCOUNT are joined, and that before being interrupted the pipeline had already processed more than 20 samples.

[18/09acfc] process > MINIO (6061)                                  [100%] 78 of 78, cached: 78
[91/22cd44] process > FASTQ_READCOUNT (6061)                        [100%] 78 of 78, cached: 78
[e4/d5f672] process > QUALITY_CONTROL:FASTP (6061)                  [100%] 20 of 20, cached: 20
[97/92efc0] process > QUALITY_CONTROL:FASTQC (6061)                 [100%] 20 of 20, cached: 20

Environment

  • Nextflow version: 22.04.5 build 5708
  • Java version: openjdk version "11.0.16.1" 2022-08-12
  • Operating system: Linux 5.19.5-arch1-1
  • Bash version: 5.1.16
@pditommaso
Member

Without further evidence, I'm inclined to think that's caused by a wrong pattern used by the pipeline. Is the map modified when passed between different processes?

@Midnighter
Author

I will try to come up with a minimal example, but what do you mean by a wrong pattern? On a single, complete run, the pipeline finishes as expected.

@pditommaso
Member

The problem can arise when the map content is modified across different processes.
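For illustration, a minimal sketch of that anti-pattern (hypothetical names): a meta map that is shared by reference between two branches and mutated in place in one of them.

    samples = Channel.of(
      [ [id: 'sampleA'], 'a.fastq.gz' ],
      [ [id: 'sampleB'], 'b.fastq.gz' ]
    )
    trimmed = samples.map { meta, reads ->
      meta.trimmed = true          // in-place mutation of the shared map
      [ meta, reads ]
    }
    counted = samples.map { meta, reads -> [ meta, reads.size() ] }
    trimmed
      | join(counted)              // whether the keys still line up depends on
      | view                       // when the mutation runs relative to the join

Because the join key can change between the moment one element is buffered and the moment its partner arrives, the result depends on timing, and a resumed run (where cached tasks complete almost instantly) can behave differently from a fresh one.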

@Midnighter
Author

I finally managed to create a reproducible example. It takes a bit to set up since I chose to download a number of sequencing files, but I hope it's bearable. As one of my solutions in that repo shows, I think that using string keys correctly (#3108) would be an adequate solution to the problem.

@bentsherman linked a pull request Mar 9, 2023 that will close this issue
@bentsherman
Member

@Midnighter, I looked at your minimal example and I think your solutions are both correct. If you want to use a map as a join key, you should either (1) copy the maps whenever they are modified or (2) join on a key within the map (which requires you to temporarily add the map key to the tuple before the join).
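A rough sketch of option (2), using hypothetical channels left and right whose elements look like [ meta, value ], and prepending the key so that the default join position applies:

    left
      | map { meta, value -> [ meta.id, meta, value ] }          // prepend the scalar key
      | join( right.map { meta, value -> [ meta.id, value ] } )  // join on meta.id
      | map { id, meta, lval, rval -> [ meta, lval, rval ] }     // drop the temporary key
      | view

Option (1) simply means never mutating a meta map in place; see the note about + and clone() further down.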

I think ultimately this issue has the same root cause as #2660 , so the solution is a matter of best practice rather than changing something in Nextflow.

@Midnighter
Author

To me, the ability to specify that I want to join by the map in position 0 and look only at the keys 'id' and 'tool' would be very desirable to have as part of the existing operators in Nextflow. However, I can understand that you would be hesitant to expand their scope. In that case, I will be happy to try and continue developing a plugin that can provide such operators.

@bentsherman
Member

What if we just let by be a closure? That usually works: by: { it[0].id }

@Midnighter
Author

So you could also do by: { [it[0].id, it[0].tool, it[2].key] }? That's brilliant 😃

@Midnighter
Author

And here I thought you already implemented the closure 😆

@bentsherman
Member

Reopening since this issue should instead be resolved by allowing by to be a closure.

@bentsherman reopened this Mar 10, 2023
@LouisLeNezet

Hi!
That would be a super nice feature to have!
Especially in the nf-core community, where we rely a lot on maps :)

@bentsherman
Member

On further reflection, it seems that allowing by to be a closure doesn't really work. Consider this join example:

    left  = Channel.of(
      [ [id: 'X', name: 'foo'], 1],
      [ [id: 'Y', name: 'bar'], 2],
      [ [id: 'Z', name: 'baz'], 3]
    )
    right = Channel.of(
      [ [id: 'Z', name: 'foo'], 6],
      [ [id: 'Y', name: 'bar'], 5],
      [ [id: 'X', name: 'baz'], 4]
    )
    left
      | join(right, by: { it[0].id })
      | view

We try to join on the id key in the first element, but the names don't match. In other words, it doesn't make sense to join on a nested value when you could have other nested values that don't match. Furthermore, when the join key is a closure rather than a specific position in the tuple, you don't know which keys are duplicated after the join. I could construct similar examples for combine and groupTuple. I guess you can't solve everything with a closure 😆

With that said, I'm going to close this issue. To reiterate, if you want to join on a map key, you can just do it, but keep in mind that you should use the + operator or clone() to modify maps without modifying other references throughout your pipeline. Otherwise you will have issues whenever you resume the pipeline, as reported originally.
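For instance, a minimal copy-on-modify sketch (the channel and key names are hypothetical):

    annotated = samples.map { meta, reads ->
      def newMeta = meta + [ trimmed: true ]   // '+' returns a fresh map
      // alternatively: def newMeta = meta.clone(); newMeta.trimmed = true
      [ newMeta, reads ]
    }

The original meta map held by other channels and operators is left untouched, so join keys stay stable across fresh and resumed runs.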
