
Avoid 100% CPU usage while socket is closed #2091

Open · orange-kao wants to merge 10 commits into master from orange-cpu-usage-while-socket-closed

Conversation

@orange-kao commented Jul 23, 2020

After stopping and restarting the Kafka service, kafka-python may use 100% CPU because it busy-retries on a socket that has been closed. This change fixes the issue by unregistering the socket when its fd is negative.


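For context, here is a minimal standalone sketch (not kafka-python code) of the failure mode and the guard this change adds: a socket that is closed while still registered reports fileno() == -1, and the fix is to unregister any such object rather than letting the poll loop spin on it. Selector behavior varies by platform, so treat this only as an illustration.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ)

r.close()                      # fd becomes -1, but the object stays registered
assert r.fileno() == -1

# The guard this PR adds, applied here to the whole registration map:
# drop any registered object whose fd has gone negative.
for key in list(sel.get_map().values()):
    if key.fileobj.fileno() < 0:
        sel.unregister(key.fileobj)
```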

@ghost commented Jul 29, 2020

It would also be possible to call self._selector.unregister(key.fileobj) instead. The interesting part is where this negative fd is actually coming from; that may be the root cause of this issue, but it might also be tricky to find.

@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from d72a192 to 80e1efa on July 31, 2020 at 00:45
@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from 80e1efa to 09b5574 on July 31, 2020 at 01:08
@orange-kao (Author) commented

Commit updated. Unregister socket instead of sleep.

@jeffwidman (Collaborator) commented

Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...

Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...

@gabriel-tincu (Contributor) commented

> Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...
>
> Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...

@jeffwidman To the best of my knowledge, the scenario where this happens is a client connected to a cluster that rapidly drops and accepts new members. A long-ish metadata refresh interval, combined with the client's very long grace period for killing idle connections, can lead to the least_loaded_node method returning nodes that are no longer actually there, mapped to dead sockets. Needless to say, setting up this scenario would require some automation (one broker leaves, another arrives, and least_loaded_node is forced to select the old one, or we just use that node id).
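A hypothetical probe of that scenario (not part of this PR; the bootstrap address is a placeholder) would poll the client while brokers are being restarted and watch whether least_loaded_node() keeps handing back a node whose connection is already dead:

```python
import time
from kafka.client_async import KafkaClient

client = KafkaClient(bootstrap_servers="localhost:9092")

while True:
    node_id = client.least_loaded_node()
    if node_id is not None and client.is_disconnected(node_id):
        # Occasional hits are normal; sustained hits while CPU spins would
        # support the stale-node scenario described above.
        print("least_loaded_node returned a disconnected node:", node_id)
    client.poll(timeout_ms=500)
    time.sleep(0.5)
```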

@@ -634,6 +634,9 @@ def _poll(self, timeout):
self._sensors.select_time.record((end_select - start_select) * 1000000000)

for key, events in ready:
if key.fileobj.fileno() < 0:
self._selector.unregister(key.fileobj)
Owner commented on this hunk:

I think this needs to be more robust: we want to close the conn here if it is a BrokerConnection, which would then trigger an unregister. But it could also be the _wake_r socketpair, in which case we need to reset/rebuild the wake socketpair.
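A rough sketch of that more robust handling, assuming connection sockets are registered with the BrokerConnection as key.data (and using a hypothetical _rebuild_wake_socketpair helper that does not exist in kafka-python):

```python
from kafka.conn import BrokerConnection

# Sketch of the suggested handling inside KafkaClient._poll (not the actual patch).
for key, events in ready:
    if key.fileobj.fileno() < 0:
        conn = key.data
        if isinstance(conn, BrokerConnection):
            # Closing the connection triggers the state-change callback,
            # which then unregisters the dead socket.
            conn.close()
        else:
            # Otherwise this is the _wake_r end of the wake socketpair:
            # drop the dead descriptor and rebuild the pair.
            self._selector.unregister(key.fileobj)
            self._rebuild_wake_socketpair()  # hypothetical helper
        continue
    # ... existing event handling ...
```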

@dnj12345 commented

> Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...
> Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...
>
> @jeffwidman To the best of my knowledge, the scenario where this happens is a client connected to a cluster that rapidly drops and accepts new members. A long-ish metadata refresh interval, combined with the client's very long grace period for killing idle connections, can lead to the least_loaded_node method returning nodes that are no longer actually there, mapped to dead sockets. Needless to say, setting up this scenario would require some automation (one broker leaves, another arrives, and least_loaded_node is forced to select the old one, or we just use that node id).

I am also seeing this issue. When the broker goes down, the producer's CPU usage shoots up. In htop my producer usually sits below 5% under normal conditions; once the broker goes down, it jumps to around 50%. I am writing a constant stream of records to the bus. The producer.send call does not raise an exception, which would have let me close the producer.
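Since KafkaProducer.send() returns a future rather than raising immediately, one way to notice delivery failures (a sketch; the topic name and bootstrap address are placeholders) is to attach an errback or block on the future:

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def on_send_error(exc):
    # Invoked when the request ultimately fails (e.g. broker unreachable);
    # the application could stop producing and close the producer here.
    print("send failed:", exc)

future = producer.send("my-topic", b"payload")
future.add_errback(on_send_error)

# Or block on the result and let failures surface as exceptions:
try:
    metadata = future.get(timeout=10)
except KafkaError as exc:
    print("send failed:", exc)
```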

wbarnha and others added 9 commits March 7, 2024 10:31
…terations for Kafka 0.8.2 and Python 3.12 (dpkp#159)

* skip failing tests for PyPy since they work locally

* Reconfigure tests for PyPy and 3.12

* Skip partitioner tests in test_partitioner.py if 3.12 and 0.8.2

* Update test_partitioner.py

* Update test_producer.py

* Timeout tests after ten minutes

* Set 0.8.2.2 to be experimental from hereon

* Formally support PyPy 3.9
* Test Kafka 0.8.2.2 using Python 3.11 in the meantime

* Override PYTHON_LATEST conditionally in python-package.yml

* Update python-package.yml

* add python annotation to kafka version test matrix

* Update python-package.yml

* try python 3.10
* Remove support for EOL'ed versions of Python

* Update setup.py
Too many MRs to review... so little time.
@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from 4c44bfb to 1511271 on July 23, 2024 at 04:27