Sometimes, kafka_exporter cannot connect to the broker and generates too many sockets. #54
+1 we are seeing the same thing in our cluster.
+1
+1
This seems relevant? They recommend calling Broker.Open() before each client call.
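For reference, a minimal sketch of that "open the broker before each call" pattern against the sarama API; the helper and variable names here are mine, for illustration only, not the exporter's actual code:

package example

import (
    "fmt"

    "github.com/Shopify/sarama"
)

// ensureBrokerOpen re-opens a broker connection before issuing a request.
// sarama returns ErrAlreadyConnected when the connection is still alive,
// which is not a real failure and is therefore ignored.
func ensureBrokerOpen(broker *sarama.Broker, client sarama.Client) error {
    if connected, _ := broker.Connected(); connected {
        return nil
    }
    if err := broker.Open(client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
        return fmt.Errorf("cannot open broker %s: %v", broker.Addr(), err)
    }
    return nil
}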
I would also like to add that not only do we get "Cannot get current offset" but also "Cannot get describe groups". Both cases show the …
I'm not sure what I'm supposed to know about this app specifically, but it looks like they're already calling …
Thanks @eapache for taking a look. Is it better to create a single client for each loop iteration, or share the client across all iterations?
The code as written looks correct. It's possible there's something I'm overlooking, or it's possible you've managed to uncover a really subtle concurrency bug in Sarama. What does the race-detector say? Is anything interesting coming from the Sarama logs?
I have the same issue; I found the reason in the Kafka server log:
@hukaixuan We still experience it on Kafka 1.0.0, so it's still a bug. Also, it seems temporarily resolved after restarting the process; after a few hours, I imagine the symptoms will appear again.
Hi, it seems that kafka_exporter is having a problem with metadata refresh. Things are always the same for me: I get holes in my timeseries when the metadata refresh kicks in. There was a suggestion to create a metadata cache. I can also suggest fetching metadata asynchronously, in a separate goroutine. Please have a look at my logs; they are from our sandbox, so let me know if I can help you troubleshoot. (1h timestamp difference)
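To illustrate the asynchronous idea, here is a rough sketch of refreshing metadata from a separate goroutine on its own ticker, so a slow refresh never blocks the scrape path (the names and the stop channel are my assumptions, not the exporter's code):

package example

import (
    "log"
    "time"

    "github.com/Shopify/sarama"
)

// refreshMetadataLoop refreshes cluster metadata from a background goroutine
// so that the scrape path never has to wait for a synchronous refresh.
func refreshMetadataLoop(client sarama.Client, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // RefreshMetadata with no arguments refreshes all topics.
            if err := client.RefreshMetadata(); err != nil {
                log.Printf("background metadata refresh failed: %v", err)
            }
        case <-stop:
            return
        }
    }
}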
BTW, why is it always fetching metadata from the same broker?
Hi guys! Any news? For us it is a showstopper :(
@danielqsj can you please take a look at this issue? Thanks
Hi! Has anyone tried whether this PR helps with this issue? I really think it is all about metadata handling. @danielqsj would you please give us at least a short answer?
@agolomoodysaada @bojleros #75 merged, can you try the newest image and see whether things are getting better?
Deployed it. Will test for a few days and report back. I think others should test it as well.
Sure, I'll post my results as soon as we have some.
@danielqsj @jorgelbg @agolomoodysaada Another thing is this i/o timeout and its handling. I think such an error can happen, but kafka_exporter should be able to make at least 1 or 2 fast retries (to one of the other brokers) before refreshing its metadata and generating a gap. Another finding is that kafka_exporter keeps polling only one of the brokers for metadata changes instead of round-robining across all brokers.
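What I have in mind is roughly the sketch below: try the same request against the other known brokers before giving up and producing a gap (purely illustrative, not existing exporter code):

package example

import (
    "fmt"

    "github.com/Shopify/sarama"
)

// getMetadataWithRetry asks each known broker for metadata in turn instead of
// always going back to the same one, so a single slow or dead broker does not
// immediately produce a gap in the metrics.
func getMetadataWithRetry(client sarama.Client, req *sarama.MetadataRequest) (*sarama.MetadataResponse, error) {
    var lastErr error
    for _, broker := range client.Brokers() {
        if err := broker.Open(client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
            lastErr = err
            continue
        }
        resp, err := broker.GetMetadata(req)
        if err == nil {
            return resp, nil
        }
        lastErr = err
    }
    return nil, fmt.Errorf("metadata request failed on all brokers, last error: %v", lastErr)
}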
Hi, except for Friday I can't see any gaps over the weekend. I'd propose to assume that this PR makes things better. I'll let you know if anything bad happens.
Hi, we have installed kafka_exporter on prod just to get additional metrics, without alerting. The bad news is that I still see occasional gaps in my metrics. @agolomoodysaada how does it work for you? Did you change the metadata refresh interval?
Not sure if it's the right place, but I observe something similar after upgrading Prometheus from 2.5.0 to 2.7.1. I have an odd setup where a single exporter 1.2.0 (Kafka 1.1.1) is scraped by 6 servers simultaneously, and after the update 30-40% of kafka_consumergroup_lag metric values are missing, depending on the RTT between Prometheus and the exporter (I have 0.5, 50 and 90 ms). My guess is that it's a race condition, because commenting out the defer broker.Close() fixes the issue right away. Add: I don't observe any significant fd consumption.
@alexey-dushechkin I think I've come to the same conclusion. I'm running the Kafka exporter as well, with multiple Prometheus instances scraping the target (4 in my case). After enabling the extended logging for sarama I see a lot of consecutive connections and disconnections (from the brokers) in the logs, within the same second:
Also, at the same time, in the normal logs I see the following:
This means that from the call to … In theory we don't need to disconnect after each request to the broker. Although Kafka has the … I've deployed a patched version this morning and will post results here tomorrow.
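In case it helps to visualize the idea (my names, simplified, not the actual patch): reuse the broker connection and only reopen it when it has actually dropped, instead of the open → request → defer Close() cycle on every scrape.

package example

import (
    "github.com/Shopify/sarama"
)

// brokerRequest reuses an already-open broker connection instead of closing it
// after every scrape; it only reopens the connection when it has dropped.
func brokerRequest(client sarama.Client, broker *sarama.Broker, req *sarama.OffsetFetchRequest) (*sarama.OffsetFetchResponse, error) {
    if connected, _ := broker.Connected(); !connected {
        if err := broker.Open(client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
            return nil, err
        }
    }
    // No defer broker.Close() here: the connection stays open and is reused by
    // the next scrape, which avoids the connect/disconnect churn in the logs.
    return broker.FetchOffset(req)
}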
Currently I only have a single Prometheus talking to multiple kafka_exporters, so for me it should not be a concurrency problem.
Yes, there is such a parameter. I don't remember the exact Kafka version that introduced it; I only remember that our versions are newer ones. But I also see some stalled partition connections (10m+) from time to time, so I am not absolutely sure it is working correctly.
@bojleros From your logs, it caught my attention that your exporter is actually timing out when connecting to the broker:
Also, this is what you get directly from the exporter:
This looks like it really can't connect to that particular broker. Could you try to pass multiple brokers at startup time? Something like:
I've tested locally by passing one existing broker and one dummy broker, and it seems to work. The only issue that I've encountered is that it will try to connect to the non-healthy broker first on each request (if it was selected initially as the "main" broker). This retry mechanism is provided by sarama directly, but you need to pass multiple brokers on the command line.
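For what it's worth, under the hood sarama just takes the list of bootstrap addresses and moves on to the next one when a dial fails; a minimal sketch of that behaviour (the broker addresses and retry values below are placeholders, not a recommendation):

package example

import (
    "time"

    "github.com/Shopify/sarama"
)

// newClientWithFallback builds a sarama client from several bootstrap brokers
// so that a dead "main" broker only costs one failed dial before the next
// address in the list is tried.
func newClientWithFallback() (sarama.Client, error) {
    config := sarama.NewConfig()
    // A couple of fast retries before a metadata refresh is declared failed.
    config.Metadata.Retry.Max = 3
    config.Metadata.Retry.Backoff = 250 * time.Millisecond

    brokers := []string{"kafka-0:9092", "kafka-1:9092", "kafka-2:9092"}
    return sarama.NewClient(brokers, config)
}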
@jorgelbg That's what my command line looks like; every single broker in my cluster is on that list. Since my clusters are rather small, I include every broker on the list. I don't treat it like the '--bootstrap-server' option known from the Kafka tools.
I've been running a patched version without the … We have fewer gaps now. I've found occasional disconnections from the broker that I'm connecting to, but it reconnected automatically after that and no gap was present at the same time. Memory usage is more stable now (as you can see), although it has been increasing. This job was running with plenty of RAM (200MB), so perhaps it was not triggering any GC. I've deployed a new job with 100MB of RAM, will leave it running for some time and check later on. It would be interesting to see if the RAM usage of your exporter follows the same pattern @alexey-dushechkin.
My limits are the following:
There seems to be plenty of RAM and I see no OOM at all. I use the following to generate this graph:
It is possible that memory usage is related to the broker count, number of topics, etc., but first, are we using the same metrics? How many brokers do you run @jorgelbg? PS. My kafka_exporter runs on a k8s cluster, but I also have node_exporters on each Kafka broker. No gaps in node_exporter metrics at all. This is our tool: https://github.com/msales/kage . I am posting it since I am not a Go developer, but I hope it can be helpful.
@forget6 would you mind running the exporter with the debug log level? It would be interesting to see why it is failing to connect to the broker. Are you running master or a released version? Unfortunately, I haven't been able to reproduce this issue in my setup.
Maybe somewhat related, I recently tried to deploy kafka_exporter in a new environment and noticed that the metrics took a really long time to collect (in my case about 11 seconds, with a dozen or so topics, the largest having 50 partitions). Trying to figure out why brought me to this issue. I took a dive into the exporter and learned that it was due to the latency between the exporter and the Kafka cluster.
@jadams-gr8 I would be interested in looking at your changes; perhaps they can be merged. I understand that your latency was probably very high due to specifics of your infrastructure or some other particularities, but I would be interested in testing your changes. I'm running the current version and monitoring 5 different Kafka clusters (the biggest one with ~270 active topics). Right now in master, the connection to Kafka is closed after each scrape interval, which caused some issues, see #54 (comment). Did you try with PR #90 by any chance? If you can make your fork available I can give it a try, and perhaps you can contribute your changes as a PR.
I have not tried the patch, but I don't believe it is particularly relevant to the specific timing issue I had. My repo is available here: https://github.com/sysadmind/kafka_exporter/ I think my changes may be of use to you, especially if some of your topics have a lot of partitions, because they add more concurrency to reduce waiting on network latency. I'd be open to submitting my changes as a PR if you can validate that they work for you (maybe even with your patch on top). I didn't want to open a PR yet since I'm not sure my work won't break things for other users, and this repo looks to be lacking much activity recently. My latency is because the exporter is actually running in a different datacenter than my Kafka cluster; it's a development environment and some things are transitioning between datacenters. I could have moved my exporter, but my current Prometheus deployment makes that harder than it sounds on the surface.
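As an illustration of that kind of concurrency (a sketch of the general idea, not the exact code in my repo), fetching per-partition offsets in parallel looks roughly like this:

package example

import (
    "sync"

    "github.com/Shopify/sarama"
)

// newestOffsets queries the newest offset of every partition of a topic in
// parallel, so the network round-trip is paid once per topic rather than once
// per partition.
func newestOffsets(client sarama.Client, topic string, partitions []int32) map[int32]int64 {
    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        offsets = make(map[int32]int64, len(partitions))
    )
    for _, partition := range partitions {
        wg.Add(1)
        go func(p int32) {
            defer wg.Done()
            offset, err := client.GetOffset(topic, p, sarama.OffsetNewest)
            if err != nil {
                return // skip partitions that fail; the caller exports what it got
            }
            mu.Lock()
            offsets[p] = offset
            mu.Unlock()
        }(partition)
    }
    wg.Wait()
    return offsets
}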
Same issue with kafka_exporter 1.2.0 and Kafka 2.2.0, but unfortunately a restart did not solve the issue. Any update or release about this? After switching on the debug option, there are too many errors like the ones below after running for a while:
I had a similar error on my kafka exporter (kafka exporter 1.2).
I used the Strimzi chart to deploy strimzi-kafka-operator:0.16.2 and then my Kafka cluster. I also fixed it by reloading the kafka exporter.
Greetings, I have been experiencing the same issues for the last few weeks. A restart of the …
I also tried increasing the … Any news on an update/fix for this? Thanks! 🙏
I am seeing this too, with the latest exporter image. Any solutions or workarounds? This is the only Kafka exporter out there I could find, so it would be nice if this were fixed.
I hit this too. When I started writing data into Kafka, the Kafka exporter became unresponsive.
Also seeing this. IBM/sarama#1857 (comment) may be related?
Adding one more: I was already opening the broker connection before the function call:

err = broker.Open(t.client.Config())
if err != nil && err != sarama.ErrAlreadyConnected {
    return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
}
lag, err := k.consumerGroupLag(id, topic, partition, broker)

and inside consumerGroupLag:

func (t *kafkaWatch) consumerGroupLag(id string, topic string, partition int32, broker *sarama.Broker) (int64, error) {
    // ... do something with the client (t.client.GetOffset) ...

    // one more broker open needed
    err := broker.Open(t.client.Config())
    if err != nil && err != sarama.ErrAlreadyConnected {
        return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
    }
    offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
No, this also did not work :( I'm still getting the "broker not connected" error in … This happens when there is a huge number of requests to Kafka for many topics in parallel. Slowing it down makes the problem disappear.
@alok87 did you try the latest release? Here I've merged some other perf PRs, plus added a way to limit getTopicMetrics goroutines ... if it helps ...
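The limiting can be done with a plain buffered-channel semaphore around the per-topic work, roughly like the sketch below (illustrative, not necessarily the exact code in the fork):

package example

import "sync"

// forEachTopicLimited runs fn for every topic but never with more than limit
// goroutines in flight, so a large cluster cannot fan out into thousands of
// simultaneous broker requests.
func forEachTopicLimited(topics []string, limit int, fn func(topic string)) {
    sem := make(chan struct{}, limit)
    var wg sync.WaitGroup
    for _, topic := range topics {
        wg.Add(1)
        sem <- struct{}{} // blocks while `limit` calls are already running
        go func(t string) {
            defer wg.Done()
            defer func() { <-sem }()
            fn(t)
        }(topic)
    }
    wg.Wait()
}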
Deployed your optimized fork https://github.com/alesj/kafka_exporter/tree/fork1 and the optimization seems to work! @danielqsj please take these optimizations in, for large-cluster support!
@alesj Your fork has resolved our issue where the exporter's memory usage would sometimes spike, resulting in an OOM. Thank you! Memory usage is now much lower and stays low.
@danielqsj any plan to take this in? On big clusters with many consumer groups, the master of this repo does not work.
Cherry-picked from #228 and released v1.4.1. @alok87, @atrbgithub and everyone else, please give it a try.
Thanks @danielqsj
Update: it's working on a 1000+ consumer-group cluster in prod. I will update if any issue happens. It should not. Thanks @danielqsj and @skonmeme
Usually it works well.
Sometimes it cannot connect to the brokers, generates too many sockets (up to about 1,000 open sockets), and produces the following error messages.
Whenever I restart kafka_exporter, it works well again.
Jul 20 12:36:38 172.27.115.150 kafka_exporter[20251]: time="2018-07-20T12:36:38+09:00" level=error msg="Can't get current offset of topic test partition %!s(int32=10): kafka: broker not connected" source="kafka_exporter.go:271"
...