Consumer not reconnecting on lost session on proxy connectivity #1233
I suspect that this started in version 2.7.0, due to this change:
What I do not understand is that it hangs indefinitely. If a partition is lost, the stream should be interrupted. This should immediately cause an error at the place where the stream is run.
2.6 resumes the stream, so it must be a 2.7+ issue.
Before 2.7.0 a lost partition was treated as a revoked partition. Since the partition is already assigned to another node, this potentially leads to duplicate processing of records.

Zio-kafka 2.7.0 assumes that a lost partition is a fatal event. It leads to an interrupt in the stream that handles the partition. The other streams are ended, and the consumer closes with an error. Usually, a full program restart is needed to resume consuming.

It should be noted that stream processing is not interrupted immediately; the interrupt is only observed when the stream requests new records. Unfortunately, we have not found a clean way to interrupt the stream consumer directly. Meanwhile, from bug reports, we understand that partitions are usually lost when no records have been received for a long time. In conclusion, 1) it is not possible to immediately interrupt user stream processing, and 2) it is most likely not needed anyway because the stream is awaiting new records.

With this change, a lost partition no longer leads to an interrupt. Instead, we first drain the stream's internal queue (just to be sure, it is probably already empty), and then we end it gracefully (that is, without error). Other streams are not affected, and the consumer continues to work.

When `rebalanceSafeCommits` is enabled, lost partitions do _not_ participate like revoked partitions do. So lost partitions cannot hold up a rebalance.

Fixes #1233 and #1250.
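For reference, the distinction between "revoked" and "lost" comes from the underlying Apache Kafka consumer, which reports the two cases through different rebalance-listener callbacks. A minimal sketch of that distinction using the plain kafka-clients API (illustrative only, not zio-kafka's internal listener):

```scala
import java.util.{Collection => JCollection}
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition

// Sketch only: shows where "revoked" and "lost" diverge in the kafka-clients API.
final class LoggingRebalanceListener extends ConsumerRebalanceListener {

  // Normal, graceful handover: this consumer still owns the partitions and may
  // commit offsets for them before giving them up.
  override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit =
    println(s"Revoked (graceful handover): $partitions")

  // Ownership is already gone, e.g. after a session timeout behind a broken
  // proxy connection; committing offsets for these partitions is no longer safe.
  override def onPartitionsLost(partitions: JCollection[TopicPartition]): Unit =
    println(s"Lost (ownership already gone): $partitions")

  override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
    println(s"Assigned: $partitions")
}
```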
@fd8s0 Since you seem to be able to reproduce the issue quite well, could you confirm that zio-kafka 2.7.5 fixes this issue? There are a few upgrades of kafka-clients in between. Would you be willing to try to reproduce the issue with the older kafka-clients, by adding an override along the lines of the sketch below to your dependencies? I'm not 100% sure if you can just override the version like that, but let's try.
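A sketch of such an override in sbt (the exact snippet from this comment was not preserved; `dependencyOverrides` is the standard sbt mechanism, and the version shown is only an example):

```scala
// build.sbt: forces the transitive kafka-clients version without changing
// the zio-kafka dependency itself; the version here is only an example.
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "3.4.1"
```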
If we could pinpoint the kafka-clients version that introduces the issue, that would be even nicer.
@svroonland It still seems broken in 2.7.5 for me. I rolled back kafka-clients version by version, all the way back to 3.4.1, while on zio-kafka 2.7.5, and in no case does it work like it does in zio-kafka 2.6. If you're having trouble replicating the issue I can try to help. I don't share my exact setup because I'm relying on a zookeeper instance embedded inside hbase and it's a bit off-topic. I'm adding this to a docker compose file alongside the kafka server.
Connect to port 9192 with the client, stop the proxy container for over a minute, then start it again.
Thanks, I'm able to replicate this behavior now.
@erikvanoosten Are you sure this was intended to be closed by #1252?
Yes!
Re-opened after discussion. We first want to see if #1252 really helped solve this issue.
Fixes #1288. See also #1233 and #1250.

When all partitions are lost after some connection issue to the broker, the streams for the lost partitions are ended, but polling also stops, due to the conditions in `Runloop.State#shouldPoll`. This PR fixes that by removing the lost partition streams from the `assignedStreams` in the state, thereby no longer disabling polling.

Also adds a warning that is logged whenever the assigned partitions (according to the apache kafka consumer) are different from the assigned streams, which helps to identify other issues or any future regressions of this issue.

~~Still needs a good test; the `MockConsumer` used in other tests unfortunately does not allow simulating lost partitions, and the exact behavior of the kafka client in this situation is hard to predict.~~ Includes a test that fails when undoing the change to Runloop.
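To make the failure mode concrete, here is a deliberately simplified model of the condition described above. This is an illustration, not the actual `Runloop` code; the field names and the exact predicate are assumptions:

```scala
import org.apache.kafka.common.TopicPartition

// Simplified, hypothetical model of the run loop state, only to illustrate why
// lingering "lost" streams can stop polling for good.
final case class RunloopState(
    subscribed: Boolean,
    pendingRequests: Int,                // streams currently asking for more records
    assignedStreams: Set[TopicPartition] // partitions we still hold a stream for
) {
  // Poll only while subscribed and either some stream wants records or there are
  // no streams at all (e.g. while waiting for a new assignment). If ended "lost"
  // streams stay in `assignedStreams` and never request records again, this
  // stays false forever and the consumer silently stops polling.
  def shouldPoll: Boolean =
    subscribed && (pendingRequests > 0 || assignedStreams.isEmpty)

  // The spirit of the fix: forget streams for lost partitions so that, once all
  // partitions are lost, `assignedStreams` becomes empty and polling resumes.
  def onPartitionsLost(lost: Set[TopicPartition]): RunloopState =
    copy(assignedStreams = assignedStreams -- lost)
}
```

With all partitions lost and their ended streams still counted as assigned, `shouldPoll` can never become true again; dropping them restores `assignedStreams.isEmpty`, and with it polling.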
Issue was reproduced and (very likely) fixed by #1350 in v2.8.3. I'll try to reproduce with the proxy at some later time, but feel free to beat me to it.
This behaviour doesn't seem to apply when not using a kafka proxy. Connecting directly, we observe that the stream always recomposes itself.
Example code:
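(The snippet itself was not preserved in this thread; the following is a minimal zio-kafka consumer along those lines, with placeholder bootstrap server, group id and topic, the port matching the proxied listener mentioned above.)

```scala
import zio._
import zio.kafka.consumer._
import zio.kafka.serde.Serde

object ExampleConsumer extends ZIOAppDefault {

  // Placeholders: broker address (via the proxy on port 9192), group id and topic.
  private val settings =
    ConsumerSettings(List("localhost:9192")).withGroupId("example-group")

  private val consumerLayer: ZLayer[Any, Throwable, Consumer] =
    ZLayer.scoped(Consumer.make(settings))

  // Consume, log each record, and commit offsets in batches.
  override val run =
    Consumer
      .plainStream(Subscription.topics("example-topic"), Serde.string, Serde.string)
      .tap(record => ZIO.logInfo(s"Consumed: ${record.value}"))
      .map(_.offset)
      .aggregateAsync(Consumer.offsetBatches)
      .mapZIO(_.commit)
      .runDrain
      .provide(consumerLayer)
}
```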
In version 2.3.2 it goes something like this:
In version 2.7.4 it goes: