Potential issues with new flush() #3633
Comments
I had a chance to investigate this further. The "surprising exception" was due to an implementation bug in kafkacrypto. However, the hang with flush, where no messages are actually sent, persists. I have been unable to come up with a consistent "small" reproducer, but was able to trigger it with debug=all set for the producer. That log is attached. My analysis:
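For reference, a minimal sketch (assuming confluent-kafka-python) of enabling the debug=all setting mentioned above on a producer; the broker address is a placeholder:

```python
from confluent_kafka import Producer

# Placeholder broker address. "debug": "all" enables every librdkafka debug
# context, producing the kind of verbose log attached above.
producer = Producer({
    "bootstrap.servers": "broker:9092",
    "debug": "all",
})
```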
Importantly, after this, there are no messages indicating that any metadata requests are actually sent to the broker, there are no further broker wakeups, and the confluent-kafka-python flush call has not returned. This looks like the broker thread gets stuck, presumably due to some kind of livelock/deadlock. Unfortunately, I don't know enough about the internals of librdkafka to know whether this is happening within librdkafka, or due to some interaction with the GIL in Python (since all of this is called via confluent-kafka-python). Perhaps @edenhill knows?
Hi, sorry for reviving this, but did you ever figure this out? We're using confluent-kafka, which also uses librdkafka, and are running into a very similar situation where we're stuck in the 1-second loop described above.
Sorry for the delay, I completely missed this issue. Will look into it. flush() will trigger callbacks: typically delivery report callbacks, but also error, throttle, and stats callbacks. Is there anything in your code that has started decommissioning other parts when close()/flush() is called?
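For illustration, a minimal sketch (not the reporter's code; broker address, topic, and callback bodies are placeholders) of the kinds of callbacks flush() can invoke in confluent-kafka-python:

```python
from confluent_kafka import Producer

def on_delivery(err, msg):
    # Invoked from poll()/flush() once per produced message.
    if err is not None:
        print(f"delivery failed: {err}")

def on_error(err):
    # Invoked for client-level errors.
    print(f"client error: {err}")

producer = Producer({
    "bootstrap.servers": "broker:9092",   # placeholder
    "error_cb": on_error,
    "stats_cb": lambda stats_json: None,  # statistics callback (JSON string)
    "statistics.interval.ms": 60000,
})

producer.produce("example-topic", b"value", callback=on_delivery)
producer.flush(10)  # serves the callbacks above while waiting for delivery
```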
Not that I know of; our code is fairly straightforward and basically amounts to this:
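A hypothetical sketch of that kind of handler (broker address, topic, and event shape are placeholders, not the reporter's actual code):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder

def handler(event, context):
    # Produce whatever records the invocation carries, then flush so nothing
    # is left queued when the Lambda execution environment is frozen.
    for record in event.get("records", []):  # placeholder event shape
        producer.produce("example-topic", record.encode("utf-8"))
    producer.flush()
```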
It also calls flush() several times when nothing was pushed, but as it's a synchronous operation, I don't believe an issue should result from that... The one thing that may be different from @tmcqueen-materials, though, is that we're running through AWS Lambda. What's interesting is that when we run this on an EC2 instance instead, the issue doesn't seem to pop up... The relevant logs in my case:
The last 4 lines repeat until the Lambda timeout is hit.
Hi, I just wanted to note that we're experiencing this same issue as well. One additional data point: the problem appeared when we upgraded confluent-kafka, and I have some nodes handling the same workloads still running the older version for comparison. The "write fails" graph is our code hitting the issue reported here, where it stalls on resolving metadata and eventually receives a local timeout error. Both sets of nodes are running against the same Kafka cluster.
To answer @zero4573: no, we never figured it out because we were never able to build a "small" reproducer, and only hit this on production workloads, where our alternative Python flush implementation (avoiding librdkafka's flush) resolved it. We did some stack dumps, etc., but never got far enough to figure out what was going on inside librdkafka. From Python's perspective, we always seemed hung as follows:
This basically means one thread is waiting for librdkafka's flush to return, and the other thread is waiting for librdkafka's poll to return [both called from the same Python confluent_kafka producer object]. But a reproducer that simply raced two threads calling flush and poll on the same producer object at the same time (sketched below) was not enough to reproduce the observed deadlock/livelock (possibly because we could not get the ordering right, or it can only happen before metadata is retrieved). We never got far enough to build librdkafka with symbols to make sense of the stack traces from the librdkafka threads.
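A hypothetical sketch of that kind of two-thread reproducer attempt (broker address, topic, and payload are placeholders); per the comment above, this alone did not trigger the hang:

```python
import threading
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder

def flusher():
    # Thread 1: produce a message and block in flush() until delivery.
    producer.produce("example-topic", b"payload")  # placeholder topic/payload
    producer.flush()

def poller():
    # Thread 2: concurrently serve callbacks via poll() on the same producer.
    for _ in range(100):
        producer.poll(1.0)

t1 = threading.Thread(target=flusher)
t2 = threading.Thread(target=poller)
t1.start(); t2.start()
t1.join(); t2.join()
```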
This prompted us to test things out again. We are still able to reproducibly hit the issue with v1.8.2. However, we find that we are again (at least with light testing) unable to hit the issue in v1.9.0-RC3. So it is possible some change between v1.8.2 and v1.9.0-RC3 fixed this issue. Unfortunately we won't have time to bisect anytime soon, but perhaps that helps others.
That does help. We've pivoted away from pushing directly to Kafka through Lambda, and instead elected to push to SQS and have an intermediary ECS process read/push from there so that delays aren't fatal. Will keep an eye out for v1.9.0.
Alright, after running tests in triplicate, the first commit to resolve this issue is commit 73d9a63. I doubt it fixed whatever underlying bug/race causes the issue discussed here; more likely, this commit makes that race very hard (impossible?) to hit in practice. Edit: More rigorous testing shows that this commit only sometimes fixes the issue -- in 6/8 "identical" test runs so far. So there must be some underlying bug/race. I have no quantitative data, but it appears to happen more frequently when the first message produce event occurs not immediately but after "several minutes" [after, probably coincidentally, the same process has had to wait for a consumer of a different topic on the same broker to rebalance and start ingesting messages again].
The mystery deepens! While commit 73d9a63 seems to "often" resolve the issue, after applying the immediately following commit be4e096, I am unable to hit the issue in many trials (16 and counting). In other words, 73d9a63 and be4e096 together seem to reliably resolve the symptoms. This really suggests to me it is some kind of race/deadlock/thread coordination issue. @edenhill Let me know if you'd like this closed (though I suspect an underlying bug is likely still present).
I can also confirm that we saw the issue disappear with an upgrade to v1.9 on a subset of machines. The higher latencies we saw (posted above) also returned to prior levels with the upgrade. So we'll wait for an official release and deploy that everywhere.
Thank you all for the detailed troubleshooting and analysis. I'm happy to hear that v1.9.0 seems to fix the issue, but as @tmcqueen-materials points out, it is not evident why the two bisected commits would, and it is likely a race condition elsewhere. What client callbacks are you setting up in the producer?
Description
EDITED 12/30/2021: The origin of the weird exception was identified and corrected. However, there still appears to be a triggerable deadlock with the new flush implementation, at least when called from Python via confluent-kafka-python (see comment).
I want to report a potential regression in version 1.8.0+. I am the maintainer of KafkaCrypto and one of our users was reporting an issue where key management messages were not being produced, leading to an inability to decrypt messages. Investigation of logs revealed that the KafkaCrypto management thread was dying due to the following exception:
This is a highly surprising error: FutureRecordMetadata inherits from Future, which defines failure, so this error should not be possible, and yet it occurs.
Bisecting on released versions of confluent-kafka and librdkafka revealed the problematic behavior only occurs on librdkafka 1.8.0-1.8.2 (including latest master as of this writing), irrespective of the version of confluent-kafka.
Inspecting the changes between librdkafka 1.7.0 and 1.8.0 identified commit 22707a3 as the source of the problem. This was further confirmed by reimplementing producer flush() based on poll() in KafkaCrypto (see the sketch below), which enables it to function with all recent versions of librdkafka, including 1.8.0-1.8.2.
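A minimal sketch of such a poll-based flush replacement, assuming a confluent_kafka.Producer; this illustrates the approach and is not KafkaCrypto's exact code (the helper name is hypothetical):

```python
import time

def flush_via_poll(producer, timeout=30.0):
    """Wait for outstanding messages by polling instead of calling flush().

    `flush_via_poll` is a hypothetical helper name; `producer` is assumed to
    be a confluent_kafka.Producer.
    """
    deadline = time.monotonic() + timeout
    while len(producer) > 0 and time.monotonic() < deadline:
        # poll() serves delivery callbacks and lets librdkafka make progress.
        producer.poll(0.1)
    return len(producer)  # messages still queued/in flight; 0 on success
```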
librdkafka logs (with debug="broker,topic,msg") were collected for 3 scenarios. The places where the logs diverge are summarized here:
So something seems to be going wrong with acquiring topic metadata in the broker thread.
Based on the above, I suspect there are potentially two separate issues here:
How to reproduce
This was found in a complex user environment. It appears to trigger reliably when using KafkaCrypto in a consumer with a polling loop, subscribed to a topic with a large number of large messages to consume (KafkaCrypto will be both producing and consuming key management messages in a daemon thread started before these lines):
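A hypothetical sketch of that kind of consumer polling loop, shown with confluent-kafka-python directly rather than KafkaCrypto's wrappers; broker address, group id, and topic are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # placeholder
    "group.id": "example-group",         # placeholder
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["large-message-topic"])  # placeholder topic

# Polling loop; in the real setup a KafkaCrypto daemon thread is already
# producing and consuming key management messages in the background.
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        continue
    # ... decrypt and process msg.value() here ...
```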
No librdkafka defaults on the KafkaCrypto producer are changed. The consumer sets these parameters:
Though I suspect these do not matter, since the issue occurs during a producer flush.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
Provide logs (with debug=.. as necessary) from librdkafka

Edited to add log snippets.