
Joining channels using maps as the key to join by can fail on resume #3175

Closed
Midnighter opened this issue Sep 3, 2022 · 12 comments · Fixed by #3740

Comments

@Midnighter

Bug report

Expected behavior and actual behavior

When joining two channels, for example, on the first element which is often a map with sample meta information, I would expect to be able to resume my pipeline and all samples to be processed. Instead, it can happen that a large part of my samples are dropped in the join (presumably the elements suddenly mismatch).
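For reference, a minimal sketch of the general pattern being described (all names and values here are hypothetical, not taken from the actual pipeline):

    // Two channels keyed by a meta map in position 0, joined on that map.
    reads = Channel.of(
      [ [id: 'sampleA'], 'a_R1.fastq.gz' ],
      [ [id: 'sampleB'], 'b_R1.fastq.gz' ]
    )
    counts = Channel.of(
      [ [id: 'sampleA'], 101 ],
      [ [id: 'sampleB'], 202 ]
    )
    reads
      | join(counts)   // by default joins on the first element, here the meta map
      | view

On a fresh run every sample pairs up; the report here is that after -resume some samples silently disappear from the join.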

Steps to reproduce the problem

This is a bit tricky to reproduce since it may require large inputs. I've written about it in more detail elsewhere, but the examples there do not reproduce the problem. However, several nf-core pipelines do seem to be affected.

I have a pipeline where this consistently happens on resume, but I cannot share the data (FASTQ pairs) that would be needed to reproduce it.

Program output

The output will look something like the following. Note that the outputs of MINIO and FASTQ_READCOUNT are joined, and that before being interrupted the pipeline had already processed more than 20 samples.

[18/09acfc] process > MINIO (6061)                                  [100%] 78 of 78, cached: 78
[91/22cd44] process > FASTQ_READCOUNT (6061)                        [100%] 78 of 78, cached: 78
[e4/d5f672] process > QUALITY_CONTROL:FASTP (6061)                  [100%] 20 of 20, cached: 20
[97/92efc0] process > QUALITY_CONTROL:FASTQC (6061)                 [100%] 20 of 20, cached: 20

Environment

  • Nextflow version: 22.04.5 build 5708
  • Java version: openjdk version "11.0.16.1" 2022-08-12
  • Operating system: Linux 5.19.5-arch1-1
  • Bash version: 5.1.16
@pditommaso
Member

Without further evidence, I'm inclined to think that's caused by a wrong pattern used by the pipeline. Is the map modified when passed between different processes?

@Midnighter
Author

I will try to come up with a minimal example, but what do you mean by a wrong pattern? On a single, complete run, the pipeline finishes as expected.

@pditommaso
Member

The problem can arise when the map content is modified across different processes.
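For illustration, a minimal sketch of that anti-pattern (hypothetical names): a meta map that is shared by reference between two branches and mutated in place in one of them.

    samples = Channel.of(
      [ [id: 'sampleA'], 'a.fastq.gz' ],
      [ [id: 'sampleB'], 'b.fastq.gz' ]
    )
    trimmed = samples.map { meta, reads ->
      meta.trimmed = true          // in-place mutation of the shared map
      [ meta, reads ]
    }
    counted = samples.map { meta, reads -> [ meta, reads.size() ] }
    trimmed
      | join(counted)              // whether the keys still line up depends on
      | view                       // when the mutation runs relative to the join

Because the join key can change between the moment one element is buffered and the moment its partner arrives, the result depends on timing, and a resumed run (where cached tasks complete almost instantly) can behave differently from a fresh one.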

@Midnighter
Author

I finally managed to create a reproducible example. It takes a bit to set up since I chose to download a number of sequencing files, but I hope it's bearable. As one of my solutions in that repo shows, I think that using string keys correctly (#3108) would be an adequate solution to the problem.

@bentsherman linked a pull request Mar 9, 2023 that will close this issue
@bentsherman
Member

@Midnighter, I looked at your minimal example and I think your solutions are both correct. If you want to use a map as a join key, you should either (1) copy the maps whenever they are modified or (2) join on a key within the map (which requires you to temporarily add the map key to the tuple before the join).
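A rough sketch of option (2), using hypothetical channels left and right whose elements look like [ meta, value ], and prepending the key so that the default join position applies:

    left
      | map { meta, value -> [ meta.id, meta, value ] }          // prepend the scalar key
      | join( right.map { meta, value -> [ meta.id, value ] } )  // join on meta.id
      | map { id, meta, lval, rval -> [ meta, lval, rval ] }     // drop the temporary key
      | view

Option (1) simply means never mutating a meta map in place; see the note about + and clone() further down.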

I think ultimately this issue has the same root cause as #2660 , so the solution is a matter of best practice rather than changing something in Nextflow.

@Midnighter
Author

To me, the ability to specify that I want to join by the map in position 0 and look only at the keys 'id' and 'tool' would be very desirable to have as part of the existing operators in Nextflow. However, I can understand that you would be hesitant to expand their scope. In that case, I will be happy to try and continue developing a plugin that can provide such operators.

@bentsherman
Member

What if we just let by be a closure? That usually works: by: { it[0].id }

@Midnighter
Author

So you could also do by: { [it[0].id, it[0].tool, it[2].key] }? That's brilliant 😃

@Midnighter
Author

And here I thought you already implemented the closure 😆

@bentsherman
Member

Reopening since this issue should instead be resolved by allowing by to be a closure.

@bentsherman reopened this Mar 10, 2023
@LouisLeNezet

Hi!
That would be a super nice feature to have!
Especially in the nf-core community, where we rely a lot on maps :)

@bentsherman
Member

On further reflection, it seems that allowing by to be a closure doesn't really work. Consider this join example:

    left  = Channel.of(
      [ [id: 'X', name: 'foo'], 1],
      [ [id: 'Y', name: 'bar'], 2],
      [ [id: 'Z', name: 'baz'], 3]
    )
    right = Channel.of(
      [ [id: 'Z', name: 'foo'], 6],
      [ [id: 'Y', name: 'bar'], 5],
      [ [id: 'X', name: 'baz'], 4]
    )
    left
      | join(right, by: { it[0].id })
      | view

We try to join on the id key in the first element, but the names don't match. In other words, it doesn't make sense to join on a nested value when you could have other nested values that don't match. Furthermore, when the join key is a closure rather than a specific position in the tuple, you don't know which keys are duplicated after the join. I could construct similar examples for combine and groupTuple. I guess you can't solve everything with a closure 😆

With that said, I'm going to close this issue. To reiterate, if you want to join on a map key, you can just do it, but keep in mind that you should use the + operator or clone() to modify maps without modifying other references throughout your pipeline. Otherwise you will have issues whenever you resume the pipeline, as reported originally.
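For instance, a minimal copy-on-modify sketch (the channel and key names are hypothetical):

    annotated = samples.map { meta, reads ->
      def newMeta = meta + [ trimmed: true ]   // '+' returns a fresh map
      // alternatively: def newMeta = meta.clone(); newMeta.trimmed = true
      [ newMeta, reads ]
    }

The original meta map held by other channels and operators is left untouched, so join keys stay stable across fresh and resumed runs.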
