
Avoid 100% CPU usage while socket is closed #2091

Open · orange-kao wants to merge 10 commits into master from orange-cpu-usage-while-socket-closed

Conversation

@orange-kao commented Jul 23, 2020

After stopping and restarting the Kafka service, kafka-python may use 100% CPU because it busy-retries on a socket that has been closed. This change fixes the issue by unregistering the socket when its fd is negative.


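For context, here is a minimal standalone sketch (not kafka-python code) of the failure mode and the guard this change adds: a socket that is closed while still registered reports fileno() == -1, and the fix is to unregister any such object rather than letting the poll loop spin on it. Selector behavior varies by platform, so treat this only as an illustration.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ)

r.close()                      # fd becomes -1, but the object stays registered
assert r.fileno() == -1

# The guard this PR adds, applied here to the whole registration map:
# drop any registered object whose fd has gone negative.
for key in list(sel.get_map().values()):
    if key.fileobj.fileno() < 0:
        sel.unregister(key.fileobj)
```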

@ghost commented Jul 29, 2020

It would also be possible to call self._selector.unregister(key.fileobj) instead. The interesting part is where this negative fd is actually coming from; that may be the root cause of this issue, but it might also be tricky to find.

@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from d72a192 to 80e1efa on July 31, 2020 at 00:45
@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from 80e1efa to 09b5574 on July 31, 2020 at 01:08
@orange-kao (Author) commented

Commit updated. Unregister socket instead of sleep.

@jeffwidman (Collaborator) commented

Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...

Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...

@gabriel-tincu (Contributor) commented

> Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...
>
> Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...

@jeffwidman To the best of my knowledge, the scenario where this happens is a client connected to a cluster that rapidly drops and accepts new members. A long-ish metadata refresh interval, combined with the client's very long grace period for killing idle connections, can lead to the least_loaded_node method returning nodes that are no longer actually there, mapped to dead sockets. Needless to say, setting up this scenario would require some automation (one broker leaves, another arrives, and least_loaded_node is forced to select the old one, or we just use that node id).
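A hypothetical probe of that scenario (not part of this PR; the bootstrap address is a placeholder) would poll the client while brokers are being restarted and watch whether least_loaded_node() keeps handing back a node whose connection is already dead:

```python
import time
from kafka.client_async import KafkaClient

client = KafkaClient(bootstrap_servers="localhost:9092")

while True:
    node_id = client.least_loaded_node()
    if node_id is not None and client.is_disconnected(node_id):
        # Occasional hits are normal; sustained hits while CPU spins would
        # support the stale-node scenario described above.
        print("least_loaded_node returned a disconnected node:", node_id)
    client.poll(timeout_ms=500)
    time.sleep(0.5)
```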

@@ -634,6 +634,9 @@ def _poll(self, timeout):
self._sensors.select_time.record((end_select - start_select) * 1000000000)

for key, events in ready:
if key.fileobj.fileno() < 0:
self._selector.unregister(key.fileobj)
Owner commented on this hunk:

I think this needs to be more robust: we want to close the conn here if it is a BrokerConnection, which would then trigger an unregister. But it could also be the _wake_r socketpair, in which case we need to reset/rebuild the wake socketpair.
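A rough sketch of that more robust handling, assuming connection sockets are registered with the BrokerConnection as key.data (and using a hypothetical _rebuild_wake_socketpair helper that does not exist in kafka-python):

```python
from kafka.conn import BrokerConnection

# Sketch of the suggested handling inside KafkaClient._poll (not the actual patch).
for key, events in ready:
    if key.fileobj.fileno() < 0:
        conn = key.data
        if isinstance(conn, BrokerConnection):
            # Closing the connection triggers the state-change callback,
            # which then unregisters the dead socket.
            conn.close()
        else:
            # Otherwise this is the _wake_r end of the wake socketpair:
            # drop the dead descriptor and rebuild the pair.
            self._selector.unregister(key.fileobj)
            self._rebuild_wake_socketpair()  # hypothetical helper
        continue
    # ... existing event handling ...
```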

@dnj12345 commented

> Agreed, can you look at where this negative FD might be coming from? If you can consistently repro it, then the eBPF tools would probably make it a lot easier to track down...
> Or if you have a way to consistently repro it in a test case, I'd be willing to take a look...
>
> @jeffwidman To the best of my knowledge, the scenario where this happens is a client connected to a cluster that rapidly drops and accepts new members. A long-ish metadata refresh interval, combined with the client's very long grace period for killing idle connections, can lead to the least_loaded_node method returning nodes that are no longer actually there, mapped to dead sockets. Needless to say, setting up this scenario would require some automation (one broker leaves, another arrives, and least_loaded_node is forced to select the old one, or we just use that node id).

I am also seeing this issue. When the broker goes down, the producer's CPU usage shoots up. In htop my producer usually sits below 5% under normal conditions; once the broker goes down, it jumps to around 50%. I am writing a constant stream of records to the bus. The producer.send call does not raise an exception, which would have let me close the producer.
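Since KafkaProducer.send() returns a future rather than raising immediately, one way to notice delivery failures (a sketch; the topic name and bootstrap address are placeholders) is to attach an errback or block on the future:

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def on_send_error(exc):
    # Invoked when the request ultimately fails (e.g. broker unreachable);
    # the application could stop producing and close the producer here.
    print("send failed:", exc)

future = producer.send("my-topic", b"payload")
future.add_errback(on_send_error)

# Or block on the result and let failures surface as exceptions:
try:
    metadata = future.get(timeout=10)
except KafkaError as exc:
    print("send failed:", exc)
```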

wbarnha and others added 9 commits March 7, 2024 10:31
…terations for Kafka 0.8.2 and Python 3.12 (dpkp#159)

* skip failing tests for PyPy since they work locally

* Reconfigure tests for PyPy and 3.12

* Skip partitioner tests in test_partitioner.py if 3.12 and 0.8.2

* Update test_partitioner.py

* Update test_producer.py

* Timeout tests after ten minutes

* Set 0.8.2.2 to be experimental from hereon

* Formally support PyPy 3.9
* Test Kafka 0.8.2.2 using Python 3.11 in the meantime

* Override PYTHON_LATEST conditionally in python-package.yml

* Update python-package.yml

* add python annotation to kafka version test matrix

* Update python-package.yml

* try python 3.10
* Remove support for EOL'ed versions of Python

* Update setup.py
Too many MRs to review... so little time.
@orange-kao force-pushed the orange-cpu-usage-while-socket-closed branch from 4c44bfb to 1511271 on July 23, 2024 at 04:27