Channel topics [EXPERIMENTAL] #2842

pditommaso · 2022-05-01T20:57:08Z

pditommaso
May 1, 2022
Maintainer

Nextflow channels allow connecting one or more processes producing some output data with one or more processes consuming such data. For example:

workflow {
  A()
  B(A.out)
  C(A.out)
}

In the snippet above, processes B and C both consume the data produced by process A. Conversely:

workflow {
  A()
  B()
  C(A.out, B.out)
}

In the example above, process C consumes the outputs of processes A and B.

This approach covers the most common use cases; however, it requires declaring the expected inputs and outputs of each process beforehand. For example, if process C needs to collect the inputs from three processes, its input declaration has to be changed accordingly.

An alternative could be to declare just one input and then mix the input channels together into a single one, provided they contain homogeneous data. For example:

workflow {
  A()
  B1()
  B2()
  C( A.out.mix(B1.out).mix(B2.out) )
}

In this case, the C process can receive an arbitrary number of input channels; however, the relationship still needs to be known beforehand, and the resulting channel expression can be verbose and error-prone to write.

This becomes even clearer in this recurring pattern in nf-core pipelines.

Topics to the rescue

A solution to this problem could be the introduction of a topic channel. A topic channel is a shared channel, identified by a name, to which multiple processes can write some output.

For example:

process A {
  output:
   path '*.txt', topic: 'foo'
  '''
   some_command > a.txt
  '''
}

process B {
  output:
   path '*.txt', topic: 'foo'
  '''
   some_command > b.txt
  '''
}

The above example shows two processes, A and B, each declaring an output with the same topic, foo.

The topic channel holds the output of all processes having the same topic name and it can be accessed via the new channel.topic(NAME) factory method.

For example:

workflow {
   A()
   B()
   C( channel.topic('foo') )  
}

If necessary, the output of each process can still be accessed independently using the usual notation, e.g. A.out and B.out.
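
Putting the pieces above together, here is a minimal, self-contained sketch (not a tested pipeline). It assumes the experimental `topic:` output option and the `channel.topic()` factory behave as described in this post. The process bodies, the file names and the use of `collect()` are illustrative assumptions, not part of the proposal.

```nextflow
// Sketch only: relies on the experimental topic syntax proposed in this discussion.

process A {
  output:
   path '*.txt', topic: 'foo'

  script:
  '''
  echo "from A" > a.txt
  '''
}

process B {
  output:
   path '*.txt', topic: 'foo'

  script:
  '''
  echo "from B" > b.txt
  '''
}

process C {
  input:
   path reports   // every file published to the 'foo' topic

  output:
   stdout

  script:
  '''
  cat *.txt
  '''
}

workflow {
   A()
   B()

   // Shared channel: carries the outputs of every process tagged with topic 'foo'.
   // collect() is an assumption here; it gathers the files so that C runs once.
   C( channel.topic('foo').collect() )

   // The per-process output is still reachable in the usual way.
   A.out.view { f -> "A only: ${f.name}" }
   C.out.view()
}
```

Note that C never names A or B: adding a third producer would only require tagging its output with the same topic.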

Conclusion

Topic channels give Nextflow the ability to implement a pub/sub messaging model, in which multiple publishers send messages over a shared channel that is received by multiple subscribers.

This model further decouples workflow composition, making it possible to connect tasks that are only known at runtime.

drpatelh · 2022-05-01T21:12:10Z

drpatelh
May 1, 2022
Collaborator

Awesome!! How about channel.group / channel.category / channel.classifier ?

channel.topic doesn't sit quite right in this context.

If one or more of the channels in a "topic" are empty would the process still be triggered? ifEmpty([]) is used in the nf-core snippet as a workaround to have optional inputs.

1 reply

maxulysse May 2, 2022

I like channel.category too.
And definitely love this feature <3

manuelesimi · 2022-05-02T03:56:10Z

manuelesimi
May 2, 2022
Collaborator

This proposed feature is very welcome. I don't know how many times I have had to mimic this behavior by mixing channels.

As much as I understand that topic is inherited from messaging systems, I also don't think it's the most appropriate name. It could be misleading for users because both channel and topic suggest a single destination for the emitted values. Something representing collaborative work (multiple processes writing to the same place) should be used here.

I, for one, would vote for board.

Anyway, great job!!

0 replies

pditommaso · 2022-05-02T07:57:29Z

pditommaso
May 2, 2022
Maintainer Author

Happy to see your comments! Let's try to focus on the feature itself more than the naming.

Do you think this could really be used to replace the mass of input declarations required for multiqc with a single topic channel?

Re ifEmpty([]): it should not be required anymore, since that was only needed when one channel had some data and another had none. With the topic approach, the data would only be delivered by a single channel.

I've uploaded a snapshot version so that you can try it yourself:

NXF_VER=22.05.0-SNAPSHOT nextflow info

0 replies

Midnighter · 2022-05-02T08:00:35Z

Midnighter
May 2, 2022

I very much like the idea of topic channels and I think they will be very valuable.

I don't think the currently proposed implementation is necessarily the best one, because it forces me to do pipeline design at the module level. You gave the example of nf-core. nf-core has hundreds of modules that are supposed to be plug and play in any pipeline (not only nf-core ones). Of course, one can design by convention and say certain topics are expected, but it makes more sense to me to make the topic configurable. Since the topic is a string, maybe you already intend for this to be possible.

So it could be something like

  output:
   path '*.txt', topic: "${params.foo_topic}"

Another idea could be to set this at the workflow level, maybe something like:

A.out.topic('foo')

6 replies

pditommaso May 2, 2022
Maintainer Author

  output:
   path '*.txt', topic: "${params.foo_topic}"

This is already possible. Actually, this is enough:

  output:
   path '*.txt', topic: params.foo_topic

drpatelh May 2, 2022
Collaborator

Yep, but what if you wanted to change the entries within a topic channel within the workflow context?

Say I only want to use a subset of topics in different workflow contexts?

Midnighter May 2, 2022

Additional questions:

What if you want to use an output in multiple topic channels? Will that be possible?
Will it be possible to set both topic and named output?
```
output:
path '*.txt', emit: result, topic: 'foo'
```

pditommaso May 2, 2022
Maintainer Author

> Say I only want to use a subset of topics in different workflow contexts?

I want to see a concrete use case.

> What if you want to use an output in multiple topic channels? Will that be possible?

Currently it would require replicating the output declaration; another option could be to allow topic to accept a list of names.

> Will it be possible to set both topic and named output?

Yes. One thought here was to have emit default to the topic name when omitted.

ewels May 3, 2022
Maintainer

> Say I only want to use a subset of topics in different workflow contexts?

So the use case we were just discussing is if for example you want to run a process 3 times, but you only want MultiQC to report logs from 1 of those. With a static topic: 'multiqc' it would be impossible to prevent MultiQC from reporting on all 3 runs.

One potential workaround could be to use a combination of suggestions from @Midnighter - multiple topic channels and dynamic topics. Then the module could do something like:

path '*.txt', emit: result, topic: ['foo', params.workflow_name]

And then be able to have some kind of combinatorial logic when creating the channel, eg:

channel.topic('foo' + 'mysubworkflow')

This would allow simple usage of channel.topic('foo') in most cases but also allow logic to be introduced at pipeline level for filtering.

> Another idea could be to set this at the workflow level, maybe something like:
> A.out.topic('foo')

I only just saw that you wrote this @Midnighter - I suggested the same thing just now to @pditommaso and I thought that he was going to faint 😆 But yeah, I agree that this would be another simple yet powerful way to pull in subsets of a topic 👍🏻

ewels · 2022-05-03T15:16:53Z

ewels
May 3, 2022
Maintainer

Clashing input file paths

One potential issue with topics may be that it'll be difficult to keep input file paths unique. This is one of the reasons that nf-core/rnaseq uses a local copy of the MultiQC module: it gives directory prefixes to each input channel to avoid clashes.

An idea to generalise this would be to use wildcards in the staging-in directory path.

So instead of:

    path ('fastqc/*')
    path ('trimgalore/fastqc/*')
    path ('trimgalore/*')
    path ('sortmerna/*')
    //...

Could use (straight from the docs):

    path ('dir??/*')

This would give staged file names:

> named as the source file, created in a progressively indexed subdirectory e.g. dir01/, dir02/, etc.

@drpatelh I wonder if we could investigate using this right away, as it would be a general improvement for us aside from the whole idea of topics.
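
A minimal sketch of that staging pattern, assuming a hypothetical `params.reports` glob and a stand-in process named `REPORT` (neither comes from the nf-core module). It only lists the staged paths, so the directory layout produced by the `dir??/*` pattern can be inspected.

```nextflow
params.reports = 'results/**/*.txt'   // hypothetical location of upstream reports

process REPORT {
  input:
   path 'dir??/*'   // each staged file keeps its own name, inside dir01/, dir02/, ...

  output:
   stdout

  script:
  '''
  ls -1 dir*/*
  '''
}

workflow {
   // Gather every report into one list so that REPORT runs a single time.
   reports = channel.fromPath(params.reports).collect()

   REPORT(reports)
   REPORT.out.view()
}
```

Because every file lands in its own indexed subdirectory, two inputs that share a file name should no longer collide in the task work directory.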

8 replies

ewels May 9, 2022
Maintainer

Yeah, but they'll be uniquely named! 😆 Do you think having so many folders will break things?

pditommaso May 9, 2022
Maintainer Author

it may work for multiqc, but I think we need a better abstraction

ewels May 9, 2022
Maintainer

Ah right, but it may only be MultiQC where it's relevant / we have this set of requirements. I think the staging in is likely to be fairly tool dependent, so that was kind of my point with this comment - we can probably already solve the MultiQC-clashing-filenames issue with the tools that Nextflow gives us. So it's a separate issue to the channel topics.

I think it's dangerous to try to do too much abstraction around this, as tools often have specific requirements / expectations for input file structures. So better to not mess around with input filenames by default and let people do this kind of thing 👆🏻 where necessary.

pditommaso May 9, 2022
Maintainer Author

My thought is that in some cases it can be useful to preserve the relative path captured in the output definition.

ewels Jun 7, 2022
Maintainer

@matthdsm put together a PR for my suggestion here: nf-core/modules#1735

bentsherman · 2022-07-28T20:38:14Z

bentsherman
Jul 28, 2022
Maintainer

I agree that channel topics seem to break the separation of concerns between module design and workflow design. Given that the nf-core multiqc module seems to be the primary motivating case, I'm looking at this module and wondering, why not just have one input channel that is a mix of all the multiqc inputs? Isn't multiqc just creating a report out of whatever inputs are present?

I presume that it is useful to explicitly enumerate all of the directory names that multiqc recognizes, but I can't tell if that's actually important to what multiqc is doing. Even if it is, you could also have one input channel and filter it based on an allowlist of directory names.

Overall, it seems to me that a channel topic is a sort of implicit and dynamic mix operator. On the one hand, you can "add channels to the mix" simply by adding a label to an output channel; on the other hand, you have to coordinate across modules and workflows to create this mix. I wonder if using the mix operator in a workflow is just the best practice here.
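
For comparison, the explicit alternative described above (one mixed input channel, filtered against an allowlist of directory names) could look roughly like the sketch below. The directory names and the `results/...` paths are hypothetical stand-ins; in a real pipeline the two source channels would be process outputs.

```nextflow
workflow {
   // Hypothetical allowlist of directory names the report step should accept
   def allowed = ['fastqc', 'trimgalore', 'sortmerna']

   // Stand-in sources (assumed paths); normally these would be process outputs
   fastqc_ch     = channel.fromPath('results/fastqc/*')
   trimgalore_ch = channel.fromPath('results/trimgalore/*')

   fastqc_ch
      .mix(trimgalore_ch)                          // explicit mix, declared in the workflow
      .filter { f -> f.parent.name in allowed }    // keep files from allowlisted directories only
      .collect()                                   // one list, suitable for a single report task
      .view()
}
```

The trade-off is visible here: the wiring stays in the workflow, but every new producer has to be added to the `mix` chain by hand.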

2 replies

ewels Jul 28, 2022
Maintainer

Yeah, the directory names are generally only needed to help avoid filename clashes, I think. MultiQC nearly always finds files from the log filename itself or its contents (with one or two rare exceptions).

bentsherman Jul 29, 2022
Maintainer

I see, so the multiple input channels are used to redirect different inputs to different subfolders automatically. And the path input is exactly how you'd want to stage input files in this way.

I still wonder if we are over-optimizing a solution for this one case, when there might be other solutions that would be more broadly applicable. For example, we could add support for named input channels with default arguments, so that you don't have to specify an ifEmpty for every input, or even specify every input if you don't want to. That seems to be the paradigm of the multiqc module -- it has a bunch of named optional arguments.

bentsherman · 2023-10-30T15:44:46Z

bentsherman
Oct 30, 2023
Maintainer

Revisiting this discussion... I was initially focused on topics as a solution to multiqc's many optional arguments. But now I see it was more about multiqc collecting the tool versions from all upstream processes.

After thinking through the implementation details of a versions directive (#4386), I think channel topics might be the best way to facilitate the dataflow. We can use a versions and/or trace directive to define whatever metadata we need, such as tool versions, but in order to provide that metadata to multiqc, we must do one of the following:

1. add a custom .versions / .trace property to processes (like .out but for metadata) and use the existing channel logic to feed metadata to MULTIQC
2. add a workflow.trace variable which contains the metadata of all previous runs; it should only be used in the workflow.onComplete handler but technically could be used in the MULTIQC process at the end
3. use a channel topic to collect specific task metadata (e.g. a versions topic) and have MULTIQC subscribe to that topic as an input channel

While (1) works and already removes a lot of nf-core boilerplate from processes, it doesn't remove all the channel logic required to collect the tool versions for MULTIQC. On the other hand, I'm not sure how a process would emit this trace metadata aside from a custom output property... I guess you could declare an output channel like so:

output:
path '.command.trace', emit: trace, topic: 'trace'

0 replies
