Can the ConsumerStream implement Clone? #4267

Open
sagoez opened this issue Nov 19, 2024 · 5 comments

sagoez commented Nov 19, 2024

Hello,

I frequently encounter the need to write code like the following (whether using Fluvio or Kafka):

async fn process_chunk_with_shutdown<'a>(
    &self,
    ...
) -> Result<Unit, Error> {
    match chunk {
        Ok(chunk) => {
            tokio::select! {
                _ = subsys.on_shutdown_requested() => {
                    // Perform shutdown-related actions here.
                },
                _ = self.process_chunk(ctx, *target, chunk.as_ref()) => {
                    ...
                },
            }
            Ok(())
        }
        Err(e) => ...
    }
}

async fn consume_chunks<'a>(
    &self,
    ...
) -> Result<Unit, Error> {
    let mut stream = ...

    stream
        .try_ready_chunks(consumer_batch_size)
        .map(|chunk| {
            async move {
                self.process_chunk_with_shutdown(chunk, ctx, target, subsys)
                    .await
            }
        })
        .buffered(consumer_batch_size)
        .collect::<Vec<_>>()
        .await;

    Ok(())
}

In essence, I'm trying to read from a stream and process its chunks. However, with Fluvio I'm currently facing a limitation: the ConsumerStream isn't clonable, so I can't wrap it in an Arc and/or pass it by reference to a stream combinator that consumes it (such as try_ready_chunks).

As far as I can tell, the only way to consume records via the Rust client while manually committing offsets is to process them one by one [next].
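
For concreteness, a rough sketch of that one-by-one approach (names such as ConsumerConfigExtBuilder, OffsetManagementStrategy, offset_commit, and offset_flush are taken from the fluvio docs; exact signatures may vary between versions):

use fluvio::consumer::{ConsumerConfigExtBuilder, ConsumerStream, OffsetManagementStrategy};
use fluvio::{Fluvio, Offset};
use futures_util::StreamExt;

async fn consume_one_by_one() -> anyhow::Result<()> {
    let fluvio = Fluvio::connect().await?;
    let config = ConsumerConfigExtBuilder::default()
        .topic("my-topic")
        .offset_consumer("my-consumer")
        .offset_start(Offset::beginning())
        .offset_strategy(OffsetManagementStrategy::Manual)
        .build()?;

    // `consumer_with_config` returns an `impl ConsumerStream`, which is not
    // `Clone`, so it cannot be handed to combinators that take ownership
    // while still being available for offset commits.
    let mut stream = fluvio.consumer_with_config(config).await?;

    while let Some(Ok(record)) = stream.next().await {
        // process `record` here, one at a time ...
        let _ = record;
        stream.offset_commit()?;
    }
    stream.offset_flush().await?;
    Ok(())
}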

Would it be reasonable to implement Clone for ConsumerStream, or perhaps provide an operator that allows batch processing? If I'm missing something here, I’d greatly appreciate clarification. I'd also be happy to contribute if there’s something I can help with!

Thanks in advance for your time and support!

sehz commented Nov 20, 2024

Both MultiplePartitionConsumer and PartitionConsumer implement Clone already. Am I missing something?

Also, there is already an API that returns a stream of batches. You just need to use PartitionConsumer (https://docs.rs/fluvio/0.24.0/fluvio/consumer/struct.PartitionConsumer.html#method.stream_batches_with_config), since batches only work at the partition level.
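
For anyone landing here, a rough usage sketch of that API, assuming the signature shown in the 0.24 docs (an Offset plus a ConsumerConfig, yielding a stream of Batch results, with records accessible via Batch::records); the topic name and partition id are placeholders:

use fluvio::consumer::ConsumerConfig;
use fluvio::{Fluvio, Offset};
use futures_util::StreamExt;

async fn consume_batches() -> anyhow::Result<()> {
    let fluvio = Fluvio::connect().await?;
    // batches are a per-partition concept, hence the explicit partition id
    let consumer = fluvio.partition_consumer("my-topic", 0).await?;
    let config = ConsumerConfig::builder().build()?;

    let mut batch_stream = consumer
        .stream_batches_with_config(Offset::beginning(), config)
        .await?;

    while let Some(Ok(batch)) = batch_stream.next().await {
        // each batch carries multiple records from a single partition
        for record in batch.records() {
            // process `record` here ...
            let _ = record;
        }
    }
    Ok(())
}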

sagoez commented Nov 20, 2024

Hi @sehz, thanks for getting back to me! The reason I didn’t mention those APIs is that they’re all deprecated. The new recommended approach doesn’t allow for cloning, which is why I didn’t explore it further.

I ended up implementing the code using PartitionConsumer. I was hoping to find a way to consume in batches at the topic level without having to worry too much about partition-specific details.

sehz commented Nov 20, 2024

Out of curiosity, why do you want to process at the batch level? It is a low-level optimization that shouldn't be done at the application layer, similar to how an application shouldn't deal with file blocks. A topic is not actually a physical representation in Fluvio (similar to Kafka). Topics are made of partitions, and the SPU really works with partitions (to be precise, replicas). So when you consume a topic, you are consuming records coming from different partitions, where a batch doesn't make sense.

@fraidev, let's un-deprecate the https://docs.rs/fluvio/0.24.0/fluvio/consumer/struct.PartitionConsumer.html#method.stream_batches_with_config API, since we don't have an alternative way to get batches with the new API. We still need to keep this.

sagoez commented Nov 21, 2024

Thank you @sehz for the response and explanation! I'm familiar with the construct and how it works; I've worked with Kafka for a while, so coming to Fluvio I felt at home. That said, the use case I presented is, to my knowledge, quite standard. Working at the batch level can be valuable, particularly in scenarios where strict ordering isn't important (for example, cases where you don't even need a seq_nr but just need to react to individual events in real time). Batches allow you to maximize the number of elements processed concurrently.

The reason this is particularly important (to me at least) is that in systems where you don't block but "keep moving forward" (simply sending unprocessable events to a dead letter queue), processing events individually rather than in batches can become a significant bottleneck.
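
As an illustration of that pattern with plain futures combinators (nothing Fluvio-specific; process, send_to_dlq, and the concurrency limit of 32 are placeholders):

use futures_util::{stream, StreamExt};

async fn process(event: u64) -> Result<(), String> {
    // application-specific handling would go here
    if event % 10 == 0 {
        Err(format!("unprocessable event {event}"))
    } else {
        Ok(())
    }
}

async fn send_to_dlq(err: String) {
    // placeholder for a real dead-letter-queue producer
    eprintln!("DLQ: {err}");
}

async fn run() {
    stream::iter(0u64..100)
        .map(process)
        // up to 32 events in flight at once; completion order is not
        // preserved, which is fine when strict ordering isn't required
        .buffer_unordered(32)
        .for_each(|result| async {
            if let Err(e) = result {
                send_to_dlq(e).await;
            }
        })
        .await;
}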

sehz commented Nov 21, 2024

Understood. Thanks for the feedback!
