Potential issues with new flush() #3633
Comments
I had a chance to investigate this further. The "surprising exception" was due to an implementation bug in kafkacrypto. However, the hang with flush, where no messages are actually sent, persists. I have been unable to come up with a consistent "small" reproducer, but was able to trigger it with debug=all set for the producer. That log is attached. My analysis:
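For reference, a minimal sketch (assuming confluent-kafka-python) of enabling the debug=all setting mentioned above on a producer; the broker address is a placeholder:

```python
from confluent_kafka import Producer

# Placeholder broker address. "debug": "all" enables every librdkafka debug
# context, producing the kind of verbose log attached above.
producer = Producer({
    "bootstrap.servers": "broker:9092",
    "debug": "all",
})
```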
Importantly, after this, there are no messages indicating that any metadata requests are actually sent to the broker, there are no further broker wakeups, and the confluent-kafka-python flush call has not returned. This looks like the broker thread gets stuck, presumably due to some kind of livelock/deadlock. Unfortunately, I don't know enough about the internals of librdkafka to know whether this is happening within librdkafka, or due to some interaction with the GIL in Python (since all of this is called via confluent-kafka-python). Perhaps @edenhill knows?
Hi, sorry for reviving this, but did you ever figure this out? We're using confluent-kafka, which also uses librdkafka, and are running into a very similar situation where we're stuck in the 1-second loop described above.
Sorry for the delay, I completely missed this issue. Will look into it. flush() will trigger callbacks: typically delivery report callbacks, but also error, throttle, and stats callbacks. Is there anything in your code that has started decommissioning other parts when close()/flush() is called?
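For illustration, a minimal sketch (not the reporter's code; broker address, topic, and callback bodies are placeholders) of the kinds of callbacks flush() can invoke in confluent-kafka-python:

```python
from confluent_kafka import Producer

def on_delivery(err, msg):
    # Invoked from poll()/flush() once per produced message.
    if err is not None:
        print(f"delivery failed: {err}")

def on_error(err):
    # Invoked for client-level errors.
    print(f"client error: {err}")

producer = Producer({
    "bootstrap.servers": "broker:9092",   # placeholder
    "error_cb": on_error,
    "stats_cb": lambda stats_json: None,  # statistics callback (JSON string)
    "statistics.interval.ms": 60000,
})

producer.produce("example-topic", b"value", callback=on_delivery)
producer.flush(10)  # serves the callbacks above while waiting for delivery
```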
Not that I know of; our code is fairly straightforward and basically amounts to this:
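A hypothetical sketch of that kind of handler (broker address, topic, and event shape are placeholders, not the reporter's actual code):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder

def handler(event, context):
    # Produce whatever records the invocation carries, then flush so nothing
    # is left queued when the Lambda execution environment is frozen.
    for record in event.get("records", []):  # placeholder event shape
        producer.produce("example-topic", record.encode("utf-8"))
    producer.flush()
```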
It also calls flush() several times when nothing was pushed, but as it's a synchronous operation, I don't believe an issue should result from that... The one thing that may be different from @tmcqueen-materials, though, is that we're running through AWS Lambda. What's interesting is that when we run this on an EC2 instance instead, the issue doesn't seem to pop up... The relevant logs in my case:
The last 4 lines repeat until the Lambda timeout is hit.
Hi, I just wanted to note that we're experiencing this same issue as well. One additional data point: the problem appeared when we upgraded confluent-kafka, and I have some nodes handling the same workloads still running the older version for comparison. The "write fails" graph is our code hitting the issue reported here, where it stalls on resolving metadata and eventually receives a local timeout error. Both sets of nodes are running against the same Kafka cluster.
To answer @zero4573: no, we never figured it out because we were never able to build a "small" reproducer, and only hit this on production workloads, where our alternative Python flush implementation (avoiding librdkafka's flush) resolved it. We did some stack dumps, etc., but never got far enough to figure out what was going on inside librdkafka. From Python's perspective, we always seemed hung as follows:
This basically means one thread is waiting for librdkafka's flush to return, and the other thread is waiting for librdkafka's poll to return [both called from the same Python confluent_kafka producer object]. But a reproducer that simply raced two threads calling flush and poll on the same producer object at the same time (sketched below) was not enough to reproduce the observed deadlock/livelock (possibly because we could not get the ordering right, or it can only happen before metadata is retrieved). We never got far enough to build librdkafka with symbols to make sense of the stack traces from the librdkafka threads.
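A hypothetical sketch of that kind of two-thread reproducer attempt (broker address, topic, and payload are placeholders); per the comment above, this alone did not trigger the hang:

```python
import threading
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder

def flusher():
    # Thread 1: produce a message and block in flush() until delivery.
    producer.produce("example-topic", b"payload")  # placeholder topic/payload
    producer.flush()

def poller():
    # Thread 2: concurrently serve callbacks via poll() on the same producer.
    for _ in range(100):
        producer.poll(1.0)

t1 = threading.Thread(target=flusher)
t2 = threading.Thread(target=poller)
t1.start(); t2.start()
t1.join(); t2.join()
```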
This prompted us to test things out again. We are still able to reproducibly hit the issue with v1.8.2. However, we find that we are again (at least with light testing) unable to hit the issue in v1.9.0-RC3. So it is possible some change between v1.8.2 and v1.9.0-RC3 fixed this issue. Unfortunately we won't have time to bisect anytime soon, but perhaps that helps others.
That does help. We've pivoted away from pushing directly to Kafka through Lambda, and instead elected to push to SQS and have an intermediary ECS process read/push from there so that delays aren't fatal. Will keep an eye out for v1.9.0.
Alright, after running tests in triplicate, the first commit to resolve this issue is commit 73d9a63. I doubt it fixed whatever underlying bug/race causes the issue discussed here; more likely, this commit makes that race very hard (impossible?) to hit in practice. Edit: More rigorous testing shows that this commit only sometimes fixes the issue -- in 6/8 "identical" test runs so far. So there must be some underlying bug/race. I have no quantitative data, but it appears to happen more frequently when the first message produce event occurs not immediately but after "several minutes" [after, probably coincidentally, the same process has had to wait for a consumer of a different topic on the same broker to rebalance and start ingesting messages again].
The mystery deepens! While commit 73d9a63 seems to "often" resolve the issue, after applying the immediately following commit be4e096, I am unable to hit the issue in many trials (16 and counting). In other words, 73d9a63 and be4e096 together seem to reliably resolve the symptoms. This really suggests to me it is some kind of race/deadlock/thread coordination issue. @edenhill Let me know if you'd like this closed (though I suspect an underlying bug is likely still present).
I can also confirm that we saw the issue disappear with an upgrade to v1.9 on a subset of machines. The higher latencies we saw (posted above) also returned to prior levels with the upgrade. So we'll wait for an official release and deploy that everywhere.
Thank you all for the detailed troubleshooting and analysis. I'm happy to hear that v1.9.0 seems to fix the issue, but as @tmcqueen-materials points out, it is not evident why the two bisected commits would, and it is likely a race condition elsewhere. What client callbacks are you setting up in the producer?
Description
EDITED 12/30/2021: The origin of the weird exception was identified and corrected. However, there still appears to be a triggerable deadlock with the new flush implementation, at least when called from Python via confluent-kafka-python (see comment).
I want to report a potential regression in version 1.8.0+. I am the maintainer of KafkaCrypto and one of our users was reporting an issue where key management messages were not being produced, leading to an inability to decrypt messages. Investigation of logs revealed that the KafkaCrypto management thread was dying due to the following exception:
This is a highly surprising error: FutureRecordMetadata inherits from Future, which defines failure, so this error should not be possible, and yet it occurs.
Bisecting on released versions of confluent-kafka and librdkafka revealed the problematic behavior only occurs on librdkafka 1.8.0-1.8.2 (including latest master as of this writing), irrespective of the version of confluent-kafka.
Inspecting the changes between librdkafka 1.7.0 and 1.8.0 identified commit 22707a3 as the source of the problem. This was further confirmed by reimplementing producer flush() based on poll() in KafkaCrypto (see the sketch below), which enables it to function with all recent versions of librdkafka, including 1.8.0-1.8.2.
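A minimal sketch of such a poll-based flush replacement, assuming a confluent_kafka.Producer; this illustrates the approach and is not KafkaCrypto's exact code (the helper name is hypothetical):

```python
import time

def flush_via_poll(producer, timeout=30.0):
    """Wait for outstanding messages by polling instead of calling flush().

    `flush_via_poll` is a hypothetical helper name; `producer` is assumed to
    be a confluent_kafka.Producer.
    """
    deadline = time.monotonic() + timeout
    while len(producer) > 0 and time.monotonic() < deadline:
        # poll() serves delivery callbacks and lets librdkafka make progress.
        producer.poll(0.1)
    return len(producer)  # messages still queued/in flight; 0 on success
```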
librdkafka logs (with debug="broker,topic,msg") were collected for 3 scenarios. The places where the logs diverge are summarized here:
So something seems to be going wrong with acquiring topic metadata in the broker thread.
Based on the above, I suspect there are potentially two separate issues here:
How to reproduce
This was found in a complex user environment. It appears to trigger reliably when using KafkaCrypto in a consumer with a polling loop, subscribed to a topic with a large number of large messages to consume (KafkaCrypto will be both producing and consuming key management messages in a daemon thread started before these lines):
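A hypothetical sketch of that kind of consumer polling loop, shown with confluent-kafka-python directly rather than KafkaCrypto's wrappers; broker address, group id, and topic are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # placeholder
    "group.id": "example-group",         # placeholder
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["large-message-topic"])  # placeholder topic

# Polling loop; in the real setup a KafkaCrypto daemon thread is already
# producing and consuming key management messages in the background.
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        continue
    # ... decrypt and process msg.value() here ...
```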
No librdkafka defaults on the KafkaCrypto producer are changed. The consumer sets these parameters:
Though I suspect these do not matter, since the issue occurs during a producer flush.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
Provide logs (with debug=.. as necessary) from librdkafka

Edited to add log snippets.