Joining channels using maps as the key to join by can fail on resume #3175
Comments
Without further evidence, I'm inclined to think that's caused by a wrong pattern used by the pipeline. Is the map modified when passed around different processes?
I will try to come up with a minimal example but what do you mean by a wrong pattern? On a single, complete run, the pipeline finishes as expected.
The problem can arise when the map content is modified across different processes.
I finally managed to create a reproducible example. It takes a bit to set up since I chose to download a number of sequencing files, but I hope it's bearable. As one of my solutions in that repo shows, I think that using string keys correctly (#3108) will be an adequate solution to the problem.
@Midnighter, I looked at your minimal example and I think your solutions are both correct. If you want to use a map as a join key, you should either (1) copy the maps whenever they are modified or (2) join on a key within the map (which requires you to temporarily prepend that key to the tuple before the join). I think ultimately this issue has the same root cause as #2660, so the solution is a matter of best practice rather than changing something in Nextflow.
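The two workarounds above could look roughly like the following. This is a minimal sketch, not code from the reproducible example: the channel contents, the `checked` key, and the `annotated` channel name are made up for illustration.

```groovy
// Illustrative channels; ids, tool names, and values are made up.
left = Channel.of(
    [ [id: 'X', tool: 'foo'], 1 ],
    [ [id: 'Y', tool: 'bar'], 2 ]
)
right = Channel.of(
    [ [id: 'Y', tool: 'bar'], 5 ],
    [ [id: 'X', tool: 'foo'], 4 ]
)

// (1) Copy instead of mutating: `meta + [...]` returns a new map and
//     leaves the original map object untouched.
//     (`checked` is a hypothetical key used only for this sketch.)
annotated = right.map { meta, value -> [ meta + [checked: true], value ] }

// (2) Join on a key within the map: prepend the key to each tuple,
//     join on it (index 0 is the default), then drop it again.
left
    .map { meta, value -> [ meta.id, meta, value ] }
    .join( annotated.map { meta, value -> [ meta.id, value ] } )
    .map { id, meta, lvalue, rvalue -> [ meta, lvalue, rvalue ] }
    .view()
```

Joining on a plain string such as `meta.id` keeps the match independent of whatever else the map holds, so adding or changing other keys between runs should not affect which tuples pair up.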
To me, the ability to specify that I want to join by the map in position 0 and look only at the keys 'id' and 'tool' would be very desirable to have as part of the existing operators in Nextflow. However, I can understand that you would be hesitant to expand their scope. In that case, I will be happy to try and continue developing a plugin which can provide such operators.
What if we just let
So you could also do
And here I thought you already implemented the closure 😆
Reopening since this issue should instead be resolved by allowing
Hi!
On further reflection, it seems that allowing

    left = Channel.of(
        [ [id: 'X', name: 'foo'], 1],
        [ [id: 'Y', name: 'bar'], 2],
        [ [id: 'Z', name: 'baz'], 3]
    )
    right = Channel.of(
        [ [id: 'Z', name: 'foo'], 6],
        [ [id: 'Y', name: 'bar'], 5],
        [ [id: 'X', name: 'baz'], 4]
    )
    left
        | join(right, by: { it[0].id })
        | view

We try to join on the

With that said, I'm going to close this issue. To reiterate, if you want to join on a map key, you can just do it, but keep in mind that you should use the
Bug report
Expected behavior and actual behavior
When joining two channels, for example, on the first element which is often a map with sample meta information, I would expect to be able to resume my pipeline and all samples to be processed. Instead, it can happen that a large part of my samples are dropped in the join (presumably the elements suddenly mismatch).
Steps to reproduce the problem
This is a bit tricky to reproduce since it may require large input. I've written about it in more detail but the examples there do not reproduce the problem. However, several nf-core pipelines do seem to be affected.
I have a pipeline where this consistently happens on resume but I cannot share the data (FASTQ pairs) with you to reproduce this.
Program output
The output will look something like the following. Be aware that the outputs of MINIO and FASTQ_READCOUNT are joined and that, before being interrupted, the pipeline had already processed more than 20 samples.
Environment