
Sometimes, kafka_exporter cannot connect to the broker and generates too many sockets. #54

Closed
skonmeme opened this issue Jul 20, 2018 · 67 comments

Comments

@skonmeme

Usually it works well.

Sometimes it cannot connect to the brokers, opens too many sockets (up to about 1,000), and generates the following error messages.

Whenever I restart kafka_exporter, it works well again.


Jul 20 12:36:38 172.27.115.150 kafka_exporter[20251]: time="2018-07-20T12:36:38+09:00" level=error msg="Can't get current offset of topic test partition %!s(int32=10): kafka: broker not connected" source="kafka_exporter.go:271"
...

(screenshot attached: 2018-07-20, 12:42 PM)

@dcarrier

+1, we are seeing the same thing in our cluster.

@eastcirclek

+1

1 similar comment
@sirpkt

sirpkt commented Aug 3, 2018

+1

@jpbelanger-mtl

We had the same issue after 87 days of uptime. Restarting fixed it... during that time, metrics occasionally failed. It got worse over time until it started triggering our alerts.

(screenshot attached)

@agolomoodysaada

This seems relevant?
IBM/sarama#853

They recommend calling Broker.Open() before each client call.
@eapache could you please take a look at this one?
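For reference, a minimal sketch of that pattern with sarama (not the exporter's actual code; the broker, client, and request values are illustrative, and the import path shown is the current IBM one, formerly github.com/Shopify/sarama):

```go
package sketch

import (
	"log"

	"github.com/IBM/sarama"
)

// ensureOpen (re)opens the broker connection; sarama returns
// ErrAlreadyConnected when the broker is already open, which is harmless here.
func ensureOpen(broker *sarama.Broker, conf *sarama.Config) error {
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		return err
	}
	return nil
}

func listGroups(client sarama.Client, broker *sarama.Broker) {
	// Open (or confirm) the connection right before each broker-level call.
	if err := ensureOpen(broker, client.Config()); err != nil {
		log.Printf("cannot open broker: %v", err)
		return
	}
	resp, err := broker.ListGroups(&sarama.ListGroupsRequest{})
	if err != nil {
		log.Printf("cannot list groups: %v", err)
		return
	}
	for group := range resp.Groups {
		log.Println(group)
	}
}
```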

@agolomoodysaada

agolomoodysaada commented Aug 27, 2018

I would also like to add that we get not only "Cannot get current offset" but also "Cannot get describe groups". Both cases show the "broker not connected" error.

@eapache

eapache commented Aug 28, 2018

I'm not sure what I'm supposed to know about this app specifically, but it looks like they're already calling broker.Open(): https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L353

@agolomoodysaada

agolomoodysaada commented Aug 28, 2018

Thanks @eapache for taking a look.
The exporter is basically an HTTP endpoint that gets called every interval (say 10s) and returns metrics from Kafka. On each HTTP request, we loop over each consumer group from __consumer_offsets and retrieve different metrics from there. There appears to be a socket leak in this loop, or some random disconnects in the client.

Is it better to create a single client for each loop iteration, or to share the client across all iterations?
Are we missing anything in how the client is being used?
Is there some cleanup that we're not doing?
If two concurrent HTTP requests come in, would there be shared state in the current implementation?

@eapache

eapache commented Aug 29, 2018

The code as written looks correct. It's possible there's something I'm overlooking, or it's possible you've managed to uncover a really subtle concurrency bug in Sarama. What does the race-detector say? Is anything interesting coming from the Sarama logs?

@hukaixuan

I have the same issue and found the reason in the Kafka server log:
ERROR Closing socket for ... because of error (kafka.network.Processor) kafka.network.InvalidRequestException: Error getting request for apiKey: 2 and apiVersion: 1
The Kafka version is 0.10.0.0, but kafka_exporter's default Kafka version is 1.0.0. Fixed it by adding --kafka.version=0.10.0.0.
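For anyone hitting those apiVersion errors: this is a protocol-version mismatch. A minimal sketch of what the flag corresponds to on the sarama side, assuming the exporter maps --kafka.version onto sarama's config.Version (the broker address is illustrative):

```go
package sketch

import "github.com/IBM/sarama"

// newClient pins sarama's protocol version to the real broker version.
// With a newer default (e.g. 1.0.0), sarama sends request versions that a
// 0.10.0.0 broker rejects with InvalidRequestException, as in the log above.
func newClient() (sarama.Client, error) {
	conf := sarama.NewConfig()
	conf.Version = sarama.V0_10_0_0 // must match the broker's actual version
	return sarama.NewClient([]string{"kafka:9092"}, conf)
}
```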

@agolomoodysaada

@hukaixuan We still experience it on Kafka 1.0.0, so it's still a bug. Also, it only seems resolved temporarily after restarting the process; after a few hours, I imagine the symptoms will appear again.

@bojleros

bojleros commented Dec 3, 2018

Hi,

It seems that kafka exporter is having a problem with metadata refresh:

(screenshot attached)

Things are always the same for me: I get holes in my time series when the metadata refresh kicks in. There was a suggestion to create a metadata cache; I would also suggest fetching metadata asynchronously, in a separate goroutine. Please have a look at my logs. They are from our sandbox, so let me know if I can help with troubleshooting.

(the log timestamps are offset by 1 hour)

[sarama] 2018/12/03 05:10:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:10:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:10:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:11:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:11:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:11:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:11:46 Failed to connect to broker broker2:9092: dial tcp 172.20.2.169:9092: i/o timeout
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [__consumer_offsets] from broker broker2:9092
[sarama] 2018/12/03 05:11:47 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:12:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:12:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:12:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:12:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:13:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:13:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:13:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:13:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:14:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:14:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:14:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:14:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:15:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:15:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:15:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:15:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:16:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:16:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:16:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:16:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker1:9092
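To make the asynchronous-refresh suggestion above concrete, here is a minimal sketch (not the exporter's actual code) of refreshing metadata in a background goroutine so scrape handlers never block on it; the interval handling is illustrative:

```go
package sketch

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

// refreshMetadataLoop refreshes all-topic metadata in the background, so a
// slow or timing-out refresh never runs inside the scrape path itself.
func refreshMetadataLoop(client sarama.Client, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := client.RefreshMetadata(); err != nil {
				log.Printf("metadata refresh failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}
```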

@bojleros

bojleros commented Dec 3, 2018

BTW, why is it always fetching metadata from the same broker?

@bojleros

Hi guys! Any news? For us it is a showstopper :(

@agolomoodysaada

@danielqsj can you please take a look at this issue? Thanks

@bojleros

Hi! Has anyone tried whether this PR helps with this issue? I really think it is all about metadata handling:

#75

@danielqsj Would you please give us at least a short answer?

@danielqsj
Owner

@agolomoodysaada @bojleros #75 is merged; can you try the newest image and see whether things are getting better?

@agolomoodysaada

agolomoodysaada commented Jan 31, 2019

Deployed it. Will test for a few days and report back. I think others should test it as well

@bojleros

bojleros commented Feb 1, 2019

Sure, I'll post my results as soon as we have some.

@bojleros

bojleros commented Feb 1, 2019

@danielqsj @jorgelbg @agolomoodysaada
Hi, I see there is now a metadata refresh every 30s, and that's fine; however, I still get gaps from time to time.

The other thing is the i/o timeout and its handling. Such an error can happen, but kafka_exporter should be able to make at least 1 or 2 fast retries (against one of the other brokers) before refreshing its metadata and producing a gap. Another finding is that kafka_exporter keeps polling only one broker for metadata changes instead of round-robining across all brokers.

[sarama] 2019/02/01 10:42:26 client/metadata fetching metadata for all topics from broker event-sea-2:9092
time="2019-02-01T10:42:41Z" level=info msg="Refreshing client metadata" source="kafka_exporter.go:233"
[sarama] 2019/02/01 10:42:41 client/metadata fetching metadata for all topics from broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Failed to connect to broker event-sea-2:9092: dial tcp 172.a.a.a:9092: i/o timeout
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-2:9092 (registered as #102)
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-0:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-1:9092
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-2:9092 (registered as #102)
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-0:9092 (registered as #100)
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-1:9092 (registered as #101)
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-1:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-0:9092
[sarama] 2019/02/01 10:42:56 client/metadata fetching metadata for all topics from broker event-sea-2:9092
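A minimal sketch of the retry idea from the comment above: retry a failed offset query a couple of times, refreshing metadata in between so the client can fail over to another broker, before giving up and leaving a gap (retry count and backoff are illustrative, not the exporter's code):

```go
package sketch

import (
	"time"

	"github.com/IBM/sarama"
)

// getOffsetWithRetry retries an offset lookup a few times, refreshing the
// topic's metadata between attempts, instead of dropping the sample on the
// first dial timeout.
func getOffsetWithRetry(client sarama.Client, topic string, partition int32) (int64, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		offset, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
		if err == nil {
			return offset, nil
		}
		lastErr = err
		// A refresh may move the partition-leader lookup to a healthy broker.
		_ = client.RefreshMetadata(topic)
		time.Sleep(200 * time.Millisecond)
	}
	return -1, lastErr
}
```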

@bojleros

bojleros commented Feb 4, 2019

Hi, apart from Friday I can't see any gaps over the weekend. I'd propose to assume that this PR makes things better. I'll let you know if anything bad happens.

@bojleros

bojleros commented Feb 7, 2019

Hi, we have installed kafka-exporter in prod just to get additional metrics, without alerting. The bad news is that I still see occasional gaps in my metrics. @agolomoodysaada, how does it work for you? Did you change the metadata refresh interval?

@rzerda

rzerda commented Feb 11, 2019

Not sure if it's the right place, but I observe something similar after upgrading Prometheus from 2.5.0 to 2.7.1. I have an odd setup where a single exporter 1.2.0 (Kafka 1.1.1) is scraped by 6 servers simultaneously, and after the update 30-40% of kafka_consumergroup_lag metric values are missing, depending on the RTT between Prometheus and the exporter (I have 0.5, 50 and 90 ms).

My guess is that it's a race condition, because commenting out the defer broker.Close() fixes the issue right away.

Addendum: I don't observe any significant fd consumption.
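To illustrate the suspected race (a sketch, not the exporter's code): if two concurrent scrapes share the same *sarama.Broker and each defers Close, one goroutine can close the connection while the other is mid-request, which then surfaces as "broker not connected". The broker address here is made up:

```go
package main

import (
	"log"
	"sync"

	"github.com/IBM/sarama"
)

// scrape simulates one collection pass: open the shared broker, query it,
// and close it on the way out. Whichever goroutine closes first can yank the
// connection out from under the other one.
func scrape(wg *sync.WaitGroup, broker *sarama.Broker, conf *sarama.Config) {
	defer wg.Done()
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		log.Printf("open: %v", err)
		return
	}
	defer broker.Close() // the line the comment above removed

	if _, err := broker.ListGroups(&sarama.ListGroupsRequest{}); err != nil {
		log.Printf("list groups: %v", err) // often "broker not connected"
	}
}

func main() {
	conf := sarama.NewConfig()
	broker := sarama.NewBroker("kafka0:9092") // illustrative address

	var wg sync.WaitGroup
	wg.Add(2)
	go scrape(&wg, broker, conf)
	go scrape(&wg, broker, conf)
	wg.Wait()
}
```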

@jorgelbg
Contributor

@alexey-dushechkin I think I've come to the same conclusion.

I'm running the Kafka exporter as well, with multiple Prometheus servers scraping the target (4 in my case). After enabling the extended sarama logging I see a lot of consecutive connections and disconnections (from the brokers) in the logs, within the same second:

[sarama] 2019/02/11 09:33:59 Connected to broker at kafka19:9092 (registered as #19)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka0:9092 (registered as #0)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka4:9092 (registered as #4)
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka3:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka4:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka18:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka19:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka10:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka8:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka9:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka7:9092
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka10:9092 (registered as #10)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka8:9092 (registered as #8)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka19:9092 (registered as #19)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka9:9092 (registered as #9)
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka10:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka8:9092

At the same time, in the normal logs I see the following:

time="2019-02-11T09:34:17Z" level=error msg="Cannot get consumer group: kafka: broker not connected" source="kafka_exporter.go:372"

This means that between the call to broker.Open() and the call to broker.ListGroups(), the connection was closed.

In theory we don't need to disconnect after each request to the broker. Although recent Kafka versions have connections.max.idle.ms (defaulting to 10m, I think), the connection is going to be used again long before that (unless your scrape_interval is set to >10m).

I've deployed a patched version this morning and will post results here tomorrow.
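The reuse idea could look roughly like this (a sketch under the assumptions above, not necessarily what the patch does): only (re)open the broker when it is actually disconnected, and never close it between scrapes:

```go
package sketch

import "github.com/IBM/sarama"

// ensureConnected reuses an existing broker connection and only dials when
// the broker really is disconnected, instead of an open/close pair per scrape.
func ensureConnected(broker *sarama.Broker, conf *sarama.Config) error {
	connected, err := broker.Connected()
	if err != nil {
		return err
	}
	if connected {
		return nil
	}
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		return err
	}
	return nil
}
```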

@bojleros

Currently I only have a single Prometheus talking to multiple kafka-exporters, so for me it should not be a concurrency problem.

In theory we don't need to disconnect after each request to the broker. Although Kafka has the connections.max.idle.ms on recent versions (defaults to 10m, I think), the connection is going to be used long before that (unless you have a scrape_interval set to >10m).

Yes, there is such a parameter. I do not remember the exact Kafka version that introduced it; I only remember that our versions are newer. But I also see some stalled partition connections (10m+) from time to time, so I am not absolutely sure it is working correctly.

@jorgelbg
Contributor

@bojleros From your logs, what caught my attention is that your exporter is actually timing out when connecting to the broker:

[sarama] 2019/02/01 10:42:41 Failed to connect to broker event-sea-2:9092: dial tcp 172.a.a.a:9092: i/o timeout

This is also what you get directly from the exporter:

time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"

This looks like it really can't connect to that particular broker. Could you try to pass multiple brokers at startup time? Something like:

--kafka.server=kafka1:9092 --kafka.server=kafka2:9092

I've tested locally by passing one existing broker and one dummy broker, and it seems to work. The only issue I've encountered is that on each request it will first try to reconnect to the unhealthy broker (if it was initially selected as the "main" broker). This retry mechanism is provided by sarama directly. But you need to pass multiple brokers on the command line.

@bojleros

bojleros commented Feb 13, 2019

@jorgelbg That is what my command line looks like. Every single broker in my cluster is on that list; since my clusters are rather small, I include every broker. I do not treat it like the '--bootstrap-server' known from the Kafka tools.

    spec:
      containers:
      - args:
        - --log.enable-sarama
        - --log.level=debug
        - --kafka.version=1.1.0
        - --kafka.server=kafka-sea-0:9092
        - --kafka.server=kafka-sea-1:9092
        - --kafka.server=kafka-sea-2:9092
        - --kafka.server=kafka-sea-3:9092
        - --kafka.server=kafka-sea-4:9092
        - --kafka.server=kafka-sea-5:9092

@jorgelbg
Contributor

I've been running a patched version without the broker.Close() (after the broker is consumed) for a couple of days. On the metrics side it looks a lot better than before:

(screenshot attached)

We have fewer gaps now. I've found occasional disconnections from the broker I'm connecting to, but it reconnected automatically and no gap appeared at the same time.

Memory usage is more stable now (as you can see), although it has been increasing. This job was running with plenty of RAM (200MB), so perhaps it was not triggering any GC. I've deployed a new job with 100MB of RAM; I will leave it running for some time and check later on.

It would be interesting to see if the RAM usage of your exporter follows the same pattern @alexey-dushechkin.

(screenshot attached)

@bojleros

bojleros commented Feb 14, 2019

My limits are the following:

    Limits:
      memory:  256Mi
    Requests:
      cpu:        50m
      memory:     32Mi

(screenshot attached)

There seems to be plenty of RAM and I see no OOM at all. I use the following query to generate this graph:

sum (container_memory_working_set_bytes{pod_name=~"kafka-exporter.*"}) by (pod_name)

It is possible that memory usage is related to the broker count, the number of topics, etc., but first, are we using the same metric? How many brokers do you run, @jorgelbg?

PS: My kafka_exporter runs on a k8s cluster, but I also have node_exporters on each Kafka broker. No gaps in the node_exporter metrics at all.

This is our tool: https://github.com/msales/kage . I am posting it since I am not a Go developer, but I hope it can be helpful.

@jorgelbg
Contributor

@forget6 would you mind running the exporter with the debug log level? It would be interesting to see why it is failing to connect to the broker. Are you running master or a released version? Unfortunately, I haven't been able to reproduce this issue in my setup.

@jadams-gr8

Maybe somewhat related: I recently tried to deploy kafka_exporter in a new environment and noticed that the metrics took a really long time to collect (in my case about 11 seconds, with a dozen or so topics, the largest having 50 partitions). Trying to figure out why brought me to this issue. I took a dive into the exporter and learned that it was due to the latency between the exporter and the Kafka cluster.

In most of my deployments there are only a few ms between the exporter and Kafka, but in this scenario there was about 75ms. Each partition's offsets need to be queried from Kafka live, once for the current offset and once for the oldest offset. This must be done for each partition of each topic, so there are a large number of network round trips.

I wound up forking the repo to try and solve this for my use case. I did some refactoring and was able to get more concurrency on the network calls, which brought my collection time down to 5 seconds. Adding a rudimentary connection pool (the sarama.Client instances process commands in order, according to the docs) brought that down further to 3 seconds. That time is now split relatively equally between the topic metrics and the consumer group metrics. 3 seconds is good enough for my case, so I stopped optimizing at that point.

Not sure if the latency and serialized queries contribute to the other commenters' scenarios, but it's probably worth a look. This exporter has the potential to create a lot of network traffic, especially as the number of topics and partitions increases.
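For readers who want the gist of the concurrency change described above, a rough sketch (not the fork's actual code): query the newest and oldest offsets for all partitions of a topic in parallel, so the per-partition round trips overlap instead of serializing. sarama's Client is documented as safe for concurrent use:

```go
package sketch

import (
	"sync"

	"github.com/IBM/sarama"
)

type partitionOffsets struct {
	partition int32
	newest    int64
	oldest    int64
	err       error
}

// topicOffsets fans out one goroutine per partition so the two offset lookups
// per partition overlap across the network instead of running one by one.
func topicOffsets(client sarama.Client, topic string) ([]partitionOffsets, error) {
	partitions, err := client.Partitions(topic)
	if err != nil {
		return nil, err
	}
	results := make([]partitionOffsets, len(partitions))
	var wg sync.WaitGroup
	for i, p := range partitions {
		wg.Add(1)
		go func(i int, p int32) {
			defer wg.Done()
			r := partitionOffsets{partition: p}
			r.newest, r.err = client.GetOffset(topic, p, sarama.OffsetNewest)
			if r.err == nil {
				r.oldest, r.err = client.GetOffset(topic, p, sarama.OffsetOldest)
			}
			results[i] = r
		}(i, p)
	}
	wg.Wait()
	return results, nil
}
```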

@jorgelbg
Contributor

jorgelbg commented Sep 6, 2019

@jadams-gr8 I would be interested in looking at your changes; perhaps they can be merged. I understand that your latency was very high, probably due to specifics of your infrastructure or some other particularities, but I would be interested in testing your changes.

I'm running the current version and monitoring 5 different Kafka clusters (the biggest one with ~270 active topics). Right now in master, the connection to Kafka is closed after each scrape, which caused some issues, see #54 (comment).

Did you try with the PR #90 by any chance? If you can make your fork available I can give it a try and perhaps you can contribute your changes as a PR.

@sysadmind
Contributor

sysadmind commented Sep 6, 2019

I have not tried the patch, but I don't believe that patch to be particularly relevant to the specific time issue that I had. My repo is available here: https://github.com/sysadmind/kafka_exporter/

I think my changes may be of use to you, especially if some of your topics have a lot of partitions, because they add more concurrency to reduce waiting on network latency. I'd be open to submitting my changes as a PR if you can validate that they work for you (maybe even with your patch on top of them). I didn't want to open a PR yet since I'm not sure my work won't break things for other users, and this repo seems to have had little activity recently.

My latency is because the exporter is actually running in a different datacenter than my Kafka cluster; it's a development environment and some things are transitioning between datacenters. I could have moved my exporter, but my current Prometheus deployment makes that harder than it sounds on the surface.

@arkyhuang

arkyhuang commented Jan 2, 2020

Same issue with kafka_exporter 1.2.0 and Kafka 2.2.0, but unfortunately a restart did not solve the issue. Any update or release for it?

After switching on the debug option, there are too many errors like the one below after running for a while:
Jan 03 13:54:58 kafka1 bash[25994]: 2020/01/03 13:54:58 http: Accept error: accept tcp [::]:9308: accept4: too many open files; retrying in 1s
Also:
ls -l /proc/25994/fd|wc -l
1025

@Amojow

Amojow commented Jan 30, 2020

I had a similar error with my kafka exporter (kafka_exporter 1.2):

time="2020-01-30T10:58:03Z" level=error msg="Cannot get describe groups: kafka: broker not connected" source="kafka_exporter.go:373"

I used the Strimzi chart to deploy strimzi-kafka-operator:0.16.2 and then my Kafka cluster.
Kubernetes v1.14.3

I also fixed it by reloading the kafka exporter.

@prasus

prasus commented Aug 12, 2020

Greetings, I have been experiencing the same issues for the last few weeks. A restart of the kafka_exporter pod sometimes solves the problem, but it recurs intermittently, leading to missing data in our Grafana dashboards.

time="2020-08-12T04:06:47Z" level=error msg="Cannot get oldest offset of topic xxxxx partition 31: kafka: broker not connected" source="kafka_exporter.go:296"
time="2020-08-12T04:06:50Z" level=error msg="Cannot get current offset of topic xxxxx partition 129: kafka: broker not connected" source="kafka_exporter.go:284"
time="2020-08-12T04:37:15Z" level=error msg="Cannot get consumer group: kafka: broker not connected" source="kafka_exporter.go:361"

I also tried increasing the scrape_timeout value on the Prometheus service monitor, but it doesn't help much.

Any news on the update/fix about this? Thanks! 🙏

@talonx

talonx commented Oct 9, 2020

I am seeing this too, with the latest exporter image. Any solutions or workarounds? This is the only Kafka exporter out there I could find - so it would be nice if this were fixed.

@ximis

ximis commented Nov 20, 2020

I hit this too. When I started writing data into Kafka, Kafka Exporter became unresponsive.

@atrbgithub

Also seeing this.

IBM/sarama#1857 (comment) may be related?

@alok87

alok87 commented Feb 19, 2021

Adding one more broker.Open() fixed the issue.

I was already opening the broker connection before the function call.

// Caller: open the broker before the call.
err = broker.Open(t.client.Config())
if err != nil && err != sarama.ErrAlreadyConnected {
    return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
}

k.consumerGroupLag(id, topic, partition, broker)

func (t *kafkaWatch) consumerGroupLag(id string, topic string, partition int32, broker *sarama.Broker) {
    // ... do something with the client ... (t.client.GetOffset)

    // one more broker.Open() needed before using the broker directly
    err := broker.Open(t.client.Config())
    if err != nil && err != sarama.ErrAlreadyConnected {
        return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
    }

    offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
    // ...
}

@alok87

alok87 commented Feb 19, 2021

No, this also did not work :( I am still getting "broker not connected" from broker.FetchOffset(&offsetFetchRequest).

This happens when there is a huge number of parallel requests to Kafka for many topics. Slowing it down makes the problem disappear.

@alesj
Contributor

alesj commented Apr 1, 2021

@alok87 did you try with the latest release?

Here I've merged another perf PR, plus added a way to limit the getTopicMetrics goroutines.

... if it helps ...
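The general idea behind limiting those goroutines is a bounded worker pool; a minimal sketch with a buffered-channel semaphore (the limit and the function name are illustrative, not necessarily how the fork implements it):

```go
package sketch

import "sync"

// forEachTopicLimited runs work for every topic while never letting more than
// limit goroutines run at once, which keeps the fan-out to the brokers bounded.
func forEachTopicLimited(topics []string, limit int, work func(topic string)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for _, topic := range topics {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(topic string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(topic)
		}(topic)
	}
	wg.Wait()
}
```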

@alok87

alok87 commented Apr 21, 2021

Thanks @alesj.

@alesj Yes, with v1.3.0 I am still troubled by

time="2021-04-21T15:16:22Z" level=error msg="Cannot get offset of group CG: kafka: broker not connected" source="kafka_exporter.go:412"

Trying your fork! Do you have an image I can use?

@alok87

alok87 commented Apr 21, 2021

Deployed your optimized fork https://github.com/alesj/kafka_exporter/tree/fork1
Here is the image for it: practodev/kafka-exporter:fork1
We also have a big cluster, 1000s of consumer groups!

This optimization seems to work!

@danielqsj please take these optimizations in, for large cluster support!

alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 5, 2021
alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 7, 2021
@atrbgithub

@alesj Your fork has resolved our issue where the exporter's memory usage would sometimes spike, resulting in an OOM. Thank you! Memory usage is now much lower and stays low.

alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 17, 2021
@alok87

alok87 commented Sep 8, 2021

@danielqsj any plan to take this in? For big clusters with many consumer groups, the master of this repo does not work.

@danielqsj
Owner

Cherry-picked from #228 and released v1.4.1. @alok87, @atrbgithub and @ALL, please give it a try.

@alok87

alok87 commented Sep 8, 2021

Thanks @danielqsj
I have deployed danielqsj/kafka-exporter:v1.4.1. I'll let you know how it goes.

@alok87

alok87 commented Sep 8, 2021

Update: it's working in a 1000+ consumer-group cluster in prod. I will update if any issue happens; it should not. Thanks @danielqsj and @skonmeme.

@alok87

alok87 commented Sep 8, 2021

Update: had to revert v1.4.1; the target endpoint became very slow to respond after this, and the target was down.
(screenshot attached: 2021-09-08, 4:26 PM)

@danielqsj
