
Sometimes, kafka_exporter cannot connect to the broker and generates too many sockets. #54

Closed
skonmeme opened this issue Jul 20, 2018 · 67 comments

Comments

@skonmeme

Usually it works well.

Sometimes it cannot connect to the brokers, opens too many sockets (up to about 1,000), and generates the following error messages.

Whenever I restart kafka_exporter, it works well again.


Jul 20 12:36:38 172.27.115.150 kafka_exporter[20251]: time="2018-07-20T12:36:38+09:00" level=error msg="Can't get current offset of topic test partition %!s(int32=10): kafka: broker not connected" source="kafka_exporter.go:271"
...

(screenshot attached: 2018-07-20, 12:42 PM)

@dcarrier

+1, we are seeing the same thing in our cluster.

@eastcirclek

+1

1 similar comment
@sirpkt

sirpkt commented Aug 3, 2018

+1

@jpbelanger-mtl

We had the same issue after 87 days of uptime. Restarting fixed it... during that time, metrics occasionally failed. It got worse over time until it started triggering our alerts.

(screenshot attached)

@agolomoodysaada

This seems relevant?
IBM/sarama#853

They recommend calling Broker.Open() before each client call.
@eapache could you please take a look at this one?
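For reference, a minimal sketch of that pattern with sarama (not the exporter's actual code; the broker, client, and request values are illustrative, and the import path shown is the current IBM one, formerly github.com/Shopify/sarama):

```go
package sketch

import (
	"log"

	"github.com/IBM/sarama"
)

// ensureOpen (re)opens the broker connection; sarama returns
// ErrAlreadyConnected when the broker is already open, which is harmless here.
func ensureOpen(broker *sarama.Broker, conf *sarama.Config) error {
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		return err
	}
	return nil
}

func listGroups(client sarama.Client, broker *sarama.Broker) {
	// Open (or confirm) the connection right before each broker-level call.
	if err := ensureOpen(broker, client.Config()); err != nil {
		log.Printf("cannot open broker: %v", err)
		return
	}
	resp, err := broker.ListGroups(&sarama.ListGroupsRequest{})
	if err != nil {
		log.Printf("cannot list groups: %v", err)
		return
	}
	for group := range resp.Groups {
		log.Println(group)
	}
}
```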

@agolomoodysaada

agolomoodysaada commented Aug 27, 2018

I would also like to add that we get not only "Cannot get current offset" but also "Cannot get describe groups". Both cases show the "broker not connected" error.

@eapache

eapache commented Aug 28, 2018

I'm not sure what I'm supposed to know about this app specifically, but it looks like they're already calling broker.Open(): https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L353

@agolomoodysaada

agolomoodysaada commented Aug 28, 2018

Thanks @eapache for taking a look.
The exporter is basically an HTTP endpoint that gets called every interval (say 10s) and returns metrics from Kafka. On each HTTP request, we loop over each consumer group from __consumer_offsets and retrieve different metrics from there. There appears to be a socket leak in this loop, or some random disconnects in the client.

Is it better to create a single client for each loop iteration, or to share the client across all iterations?
Are we missing anything in how the client is being used?
Is there some cleanup that we're not doing?
If two concurrent HTTP requests come in, would there be shared state in the current implementation?

@eapache

eapache commented Aug 29, 2018

The code as written looks correct. It's possible there's something I'm overlooking, or it's possible you've managed to uncover a really subtle concurrency bug in Sarama. What does the race-detector say? Is anything interesting coming from the Sarama logs?

@hukaixuan

I have the same issue and found the reason in the Kafka server log:
ERROR Closing socket for ... because of error (kafka.network.Processor) kafka.network.InvalidRequestException: Error getting request for apiKey: 2 and apiVersion: 1
The Kafka version is 0.10.0.0, but kafka_exporter's default Kafka version is 1.0.0. Fixed it by adding --kafka.version=0.10.0.0.
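For anyone hitting those apiVersion errors: this is a protocol-version mismatch. A minimal sketch of what the flag corresponds to on the sarama side, assuming the exporter maps --kafka.version onto sarama's config.Version (the broker address is illustrative):

```go
package sketch

import "github.com/IBM/sarama"

// newClient pins sarama's protocol version to the real broker version.
// With a newer default (e.g. 1.0.0), sarama sends request versions that a
// 0.10.0.0 broker rejects with InvalidRequestException, as in the log above.
func newClient() (sarama.Client, error) {
	conf := sarama.NewConfig()
	conf.Version = sarama.V0_10_0_0 // must match the broker's actual version
	return sarama.NewClient([]string{"kafka:9092"}, conf)
}
```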

@agolomoodysaada

@hukaixuan We still experience it on Kafka 1.0.0, so it's still a bug. Also, it only seems resolved temporarily after restarting the process; after a few hours, I imagine the symptoms will appear again.

@bojleros

bojleros commented Dec 3, 2018

Hi,

It seems that kafka exporter is having a problem with metadata refresh:

(screenshot attached)

Things are always the same for me: I get holes in my time series when the metadata refresh kicks in. There was a suggestion to create a metadata cache; I would also suggest fetching metadata asynchronously, in a separate goroutine. Please have a look at my logs. They are from our sandbox, so let me know if I can help with troubleshooting.

(the log timestamps are offset by 1 hour)

[sarama] 2018/12/03 05:10:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:10:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:10:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:10:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:11:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:11:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:11:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:11:46 Failed to connect to broker broker2:9092: dial tcp 172.20.2.169:9092: i/o timeout
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [_one_of_our_topics_] from broker broker2:9092
[sarama] 2018/12/03 05:11:46 client/metadata fetching metadata for [__consumer_offsets] from broker broker2:9092
[sarama] 2018/12/03 05:11:47 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:11:47 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:12:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:12:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:12:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:12:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:12:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:13:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:13:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:13:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:13:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:13:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:14:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:14:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:14:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:14:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:14:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:15:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:15:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:15:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:15:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker1:9092
[sarama] 2018/12/03 05:15:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:16:16 client/metadata fetching metadata for all topics from broker broker2:9092
[sarama] 2018/12/03 05:16:17 Connected to broker at broker1:9092 (registered as #556)
[sarama] 2018/12/03 05:16:17 Connected to broker at broker0:9092 (registered as #555)
[sarama] 2018/12/03 05:16:17 Connected to broker at broker2:9092 (registered as #557)
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker0:9092
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker2:9092
[sarama] 2018/12/03 05:16:17 Closed connection to broker broker1:9092
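To make the asynchronous-refresh suggestion above concrete, here is a minimal sketch (not the exporter's actual code) of refreshing metadata in a background goroutine so scrape handlers never block on it; the interval handling is illustrative:

```go
package sketch

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

// refreshMetadataLoop refreshes all-topic metadata in the background, so a
// slow or timing-out refresh never runs inside the scrape path itself.
func refreshMetadataLoop(client sarama.Client, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := client.RefreshMetadata(); err != nil {
				log.Printf("metadata refresh failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}
```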

@bojleros

bojleros commented Dec 3, 2018

BTW, why is it always fetching metadata from the same broker?

@bojleros

Hi guys! Any news? For us it is a showstopper :(

@agolomoodysaada

@danielqsj can you please take a look at this issue? Thanks

@bojleros

Hi! Has anyone tried whether this PR helps with this issue? I really think it is all about metadata handling:

#75

@danielqsj Would you please give us at least a short answer?

@danielqsj
Owner

@agolomoodysaada @bojleros #75 is merged; can you try the newest image and see whether things are getting better?

@agolomoodysaada

agolomoodysaada commented Jan 31, 2019

Deployed it. Will test for a few days and report back. I think others should test it as well

@bojleros

bojleros commented Feb 1, 2019

Sure, I'll post my results as soon as we have some.

@bojleros

bojleros commented Feb 1, 2019

@danielqsj @jorgelbg @agolomoodysaada
Hi, I see there is now a metadata refresh every 30s, and that's fine; however, I still get gaps from time to time.

The other thing is the i/o timeout and its handling. Such an error can happen, but kafka_exporter should be able to make at least 1 or 2 fast retries (against one of the other brokers) before refreshing its metadata and producing a gap. Another finding is that kafka_exporter keeps polling only one broker for metadata changes instead of round-robining across all brokers.

[sarama] 2019/02/01 10:42:26 client/metadata fetching metadata for all topics from broker event-sea-2:9092
time="2019-02-01T10:42:41Z" level=info msg="Refreshing client metadata" source="kafka_exporter.go:233"
[sarama] 2019/02/01 10:42:41 client/metadata fetching metadata for all topics from broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Failed to connect to broker event-sea-2:9092: dial tcp 172.a.a.a:9092: i/o timeout
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-2:9092 (registered as #102)
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-0:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-1:9092
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-2:9092 (registered as #102)
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-0:9092 (registered as #100)
[sarama] 2019/02/01 10:42:41 Connected to broker at event-sea-1:9092 (registered as #101)
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-1:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-2:9092
[sarama] 2019/02/01 10:42:41 Closed connection to broker event-sea-0:9092
[sarama] 2019/02/01 10:42:56 client/metadata fetching metadata for all topics from broker event-sea-2:9092
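A minimal sketch of the retry idea from the comment above: retry a failed offset query a couple of times, refreshing metadata in between so the client can fail over to another broker, before giving up and leaving a gap (retry count and backoff are illustrative, not the exporter's code):

```go
package sketch

import (
	"time"

	"github.com/IBM/sarama"
)

// getOffsetWithRetry retries an offset lookup a few times, refreshing the
// topic's metadata between attempts, instead of dropping the sample on the
// first dial timeout.
func getOffsetWithRetry(client sarama.Client, topic string, partition int32) (int64, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		offset, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
		if err == nil {
			return offset, nil
		}
		lastErr = err
		// A refresh may move the partition-leader lookup to a healthy broker.
		_ = client.RefreshMetadata(topic)
		time.Sleep(200 * time.Millisecond)
	}
	return -1, lastErr
}
```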

@bojleros

bojleros commented Feb 4, 2019

Hi, apart from Friday I can't see any gaps over the weekend. I'd propose to assume that this PR makes things better. I'll let you know if anything bad happens.

@bojleros

bojleros commented Feb 7, 2019

Hi, we have installed kafka-exporter in prod just to get additional metrics, without alerting. The bad news is that I still see occasional gaps in my metrics. @agolomoodysaada, how does it work for you? Did you change the metadata refresh interval?

@rzerda

rzerda commented Feb 11, 2019

Not sure if it's the right place, but I observe something similar after upgrading Prometheus from 2.5.0 to 2.7.1. I have an odd setup where a single exporter 1.2.0 (Kafka 1.1.1) is scraped by 6 servers simultaneously, and after the update 30-40% of kafka_consumergroup_lag metric values are missing, depending on the RTT between Prometheus and the exporter (I have 0.5, 50 and 90 ms).

My guess is that it's a race condition, because commenting out the defer broker.Close() fixes the issue right away.

Addendum: I don't observe any significant fd consumption.
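To illustrate the suspected race (a sketch, not the exporter's code): if two concurrent scrapes share the same *sarama.Broker and each defers Close, one goroutine can close the connection while the other is mid-request, which then surfaces as "broker not connected". The broker address here is made up:

```go
package main

import (
	"log"
	"sync"

	"github.com/IBM/sarama"
)

// scrape simulates one collection pass: open the shared broker, query it,
// and close it on the way out. Whichever goroutine closes first can yank the
// connection out from under the other one.
func scrape(wg *sync.WaitGroup, broker *sarama.Broker, conf *sarama.Config) {
	defer wg.Done()
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		log.Printf("open: %v", err)
		return
	}
	defer broker.Close() // the line the comment above removed

	if _, err := broker.ListGroups(&sarama.ListGroupsRequest{}); err != nil {
		log.Printf("list groups: %v", err) // often "broker not connected"
	}
}

func main() {
	conf := sarama.NewConfig()
	broker := sarama.NewBroker("kafka0:9092") // illustrative address

	var wg sync.WaitGroup
	wg.Add(2)
	go scrape(&wg, broker, conf)
	go scrape(&wg, broker, conf)
	wg.Wait()
}
```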

@jorgelbg
Contributor

@alexey-dushechkin I think I've come to the same conclusion.

I'm running the Kafka exporter as well, with multiple Prometheus servers scraping the target (4 in my case). After enabling the extended sarama logging I see a lot of consecutive connections and disconnections (from the brokers) in the logs, within the same second:

[sarama] 2019/02/11 09:33:59 Connected to broker at kafka19:9092 (registered as #19)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka0:9092 (registered as #0)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka4:9092 (registered as #4)
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka3:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka4:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka18:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka19:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka10:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka8:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka9:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka7:9092
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka10:9092 (registered as #10)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka8:9092 (registered as #8)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka19:9092 (registered as #19)
[sarama] 2019/02/11 09:33:59 Connected to broker at kafka9:9092 (registered as #9)
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka10:9092
[sarama] 2019/02/11 09:33:59 Closed connection to broker kafka8:9092

At the same time, in the normal logs I see the following:

time="2019-02-11T09:34:17Z" level=error msg="Cannot get consumer group: kafka: broker not connected" source="kafka_exporter.go:372"

This means that between the call to broker.Open() and the call to broker.ListGroups(), the connection was closed.

In theory we don't need to disconnect after each request to the broker. Although recent Kafka versions have connections.max.idle.ms (defaulting to 10m, I think), the connection is going to be used again long before that (unless your scrape_interval is set to >10m).

I've deployed a patched version this morning and will post results here tomorrow.
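The reuse idea could look roughly like this (a sketch under the assumptions above, not necessarily what the patch does): only (re)open the broker when it is actually disconnected, and never close it between scrapes:

```go
package sketch

import "github.com/IBM/sarama"

// ensureConnected reuses an existing broker connection and only dials when
// the broker really is disconnected, instead of an open/close pair per scrape.
func ensureConnected(broker *sarama.Broker, conf *sarama.Config) error {
	connected, err := broker.Connected()
	if err != nil {
		return err
	}
	if connected {
		return nil
	}
	if err := broker.Open(conf); err != nil && err != sarama.ErrAlreadyConnected {
		return err
	}
	return nil
}
```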

@bojleros

Currently I only have a single Prometheus talking to multiple kafka-exporters, so for me it should not be a concurrency problem.

In theory we don't need to disconnect after each request to the broker. Although Kafka has the connections.max.idle.ms on recent versions (defaults to 10m, I think), the connection is going to be used long before that (unless you have a scrape_interval set to >10m).

Yes, there is such a parameter. I do not remember the exact Kafka version that introduced it; I only remember that our versions are newer. But I also see some stalled partition connections (10m+) from time to time, so I am not absolutely sure it is working correctly.

@jorgelbg
Contributor

@bojleros From your logs, what caught my attention is that your exporter is actually timing out when connecting to the broker:

[sarama] 2019/02/01 10:42:41 Failed to connect to broker event-sea-2:9092: dial tcp 172.a.a.a:9092: i/o timeout

This is also what you get directly from the exporter:

time="2019-02-01T10:42:41Z" level=error msg="Cannot get current offset of topic aaaa partition 0: dial tcp 172.a.a.a:9092: i/o timeout" source="kafka_exporter.go:274"

This looks like it really can't connect to that particular broker. Could you try to pass multiple brokers at startup time? Something like:

--kafka.server=kafka1:9092 --kafka.server=kafka2:9092

I've tested locally by passing one existing broker and one dummy broker, and it seems to work. The only issue I've encountered is that on each request it will first try to reconnect to the unhealthy broker (if it was initially selected as the "main" broker). This retry mechanism is provided by sarama directly. But you need to pass multiple brokers on the command line.

@bojleros

bojleros commented Feb 13, 2019

@jorgelbg That is what my command line looks like. Every single broker in my cluster is on that list; since my clusters are rather small, I include every broker. I do not treat it like the '--bootstrap-server' known from the Kafka tools.

    spec:
      containers:
      - args:
        - --log.enable-sarama
        - --log.level=debug
        - --kafka.version=1.1.0
        - --kafka.server=kafka-sea-0:9092
        - --kafka.server=kafka-sea-1:9092
        - --kafka.server=kafka-sea-2:9092
        - --kafka.server=kafka-sea-3:9092
        - --kafka.server=kafka-sea-4:9092
        - --kafka.server=kafka-sea-5:9092

@jorgelbg
Contributor

I've been running a patched version without the broker.Close() (after the broker is consumed) for a couple of days. On the metrics side it looks a lot better than before:

(screenshot attached)

We have fewer gaps now. I've found occasional disconnections from the broker I'm connecting to, but it reconnected automatically and no gap appeared at the same time.

Memory usage is more stable now (as you can see), although it has been increasing. This job was running with plenty of RAM (200MB), so perhaps it was not triggering any GC. I've deployed a new job with 100MB of RAM; I will leave it running for some time and check later on.

It would be interesting to see if the RAM usage of your exporter follows the same pattern @alexey-dushechkin.

(screenshot attached)

@bojleros

bojleros commented Feb 14, 2019

My limits are the following:

    Limits:
      memory:  256Mi
    Requests:
      cpu:        50m
      memory:     32Mi

(screenshot attached)

There seems to be plenty of RAM and I see no OOM at all. I use the following query to generate this graph:

sum (container_memory_working_set_bytes{pod_name=~"kafka-exporter.*"}) by (pod_name)

It is possible that memory usage is related to the broker count, the number of topics, etc., but first, are we using the same metric? How many brokers do you run, @jorgelbg?

PS: My kafka_exporter runs on a k8s cluster, but I also have node_exporters on each Kafka broker. No gaps in the node_exporter metrics at all.

This is our tool: https://github.com/msales/kage . I am posting it since I am not a Go developer, but I hope it can be helpful.

@jorgelbg
Contributor

@forget6 would you mind running the exporter with the debug log level? It would be interesting to see why it is failing to connect to the broker. Are you running master or a released version? Unfortunately, I haven't been able to reproduce this issue in my setup.

@jadams-gr8

Maybe somewhat related: I recently tried to deploy kafka_exporter in a new environment and noticed that the metrics took a really long time to collect (in my case about 11 seconds, with a dozen or so topics, the largest having 50 partitions). Trying to figure out why brought me to this issue. I took a dive into the exporter and learned that it was due to the latency between the exporter and the Kafka cluster.

In most of my deployments there are only a few ms between the exporter and Kafka, but in this scenario there was about 75ms. Each partition's offsets need to be queried from Kafka live, once for the current offset and once for the oldest offset. This must be done for each partition of each topic, so there are a large number of network round trips.

I wound up forking the repo to try and solve this for my use case. I did some refactoring and was able to get more concurrency on the network calls, which brought my collection time down to 5 seconds. Adding a rudimentary connection pool (the sarama.Client instances process commands in order, according to the docs) brought that down further to 3 seconds. That time is now split relatively equally between the topic metrics and the consumer group metrics. 3 seconds is good enough for my case, so I stopped optimizing at that point.

Not sure if the latency and serialized queries contribute to the other commenters' scenarios, but it's probably worth a look. This exporter has the potential to create a lot of network traffic, especially as the number of topics and partitions increases.
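For readers who want the gist of the concurrency change described above, a rough sketch (not the fork's actual code): query the newest and oldest offsets for all partitions of a topic in parallel, so the per-partition round trips overlap instead of serializing. sarama's Client is documented as safe for concurrent use:

```go
package sketch

import (
	"sync"

	"github.com/IBM/sarama"
)

type partitionOffsets struct {
	partition int32
	newest    int64
	oldest    int64
	err       error
}

// topicOffsets fans out one goroutine per partition so the two offset lookups
// per partition overlap across the network instead of running one by one.
func topicOffsets(client sarama.Client, topic string) ([]partitionOffsets, error) {
	partitions, err := client.Partitions(topic)
	if err != nil {
		return nil, err
	}
	results := make([]partitionOffsets, len(partitions))
	var wg sync.WaitGroup
	for i, p := range partitions {
		wg.Add(1)
		go func(i int, p int32) {
			defer wg.Done()
			r := partitionOffsets{partition: p}
			r.newest, r.err = client.GetOffset(topic, p, sarama.OffsetNewest)
			if r.err == nil {
				r.oldest, r.err = client.GetOffset(topic, p, sarama.OffsetOldest)
			}
			results[i] = r
		}(i, p)
	}
	wg.Wait()
	return results, nil
}
```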

@jorgelbg
Contributor

jorgelbg commented Sep 6, 2019

@jadams-gr8 I would be interested in looking at your changes; perhaps they can be merged. I understand that your latency was very high, probably due to specifics of your infrastructure or some other particularities, but I would be interested in testing your changes.

I'm running the current version and monitoring 5 different Kafka clusters (the biggest one with ~270 active topics). Right now in master, the connection to Kafka is closed after each scrape, which caused some issues, see #54 (comment).

Did you try with the PR #90 by any chance? If you can make your fork available I can give it a try and perhaps you can contribute your changes as a PR.

@sysadmind
Contributor

sysadmind commented Sep 6, 2019

I have not tried the patch, but I don't believe that patch to be particularly relevant to the specific time issue that I had. My repo is available here: https://github.com/sysadmind/kafka_exporter/

I think my changes may be of use to you, especially if some of your topics have a lot of partitions, because they add more concurrency to reduce waiting on network latency. I'd be open to submitting my changes as a PR if you can validate that they work for you (maybe even with your patch on top of them). I didn't want to open a PR yet since I'm not sure my work won't break things for other users, and this repo seems to have had little activity recently.

My latency is because the exporter is actually running in a different datacenter than my Kafka cluster; it's a development environment and some things are transitioning between datacenters. I could have moved my exporter, but my current Prometheus deployment makes that harder than it sounds on the surface.

@arkyhuang

arkyhuang commented Jan 2, 2020

Same issue with kafka_exporter 1.2.0 and Kafka 2.2.0, but unfortunately a restart did not solve the issue. Any update or release for it?

After switching on the debug option, there are too many errors like the one below after running for a while:
Jan 03 13:54:58 kafka1 bash[25994]: 2020/01/03 13:54:58 http: Accept error: accept tcp [::]:9308: accept4: too many open files; retrying in 1s
Also:
ls -l /proc/25994/fd|wc -l
1025

@Amojow

Amojow commented Jan 30, 2020

I had a similar error with my kafka exporter (kafka_exporter 1.2):

time="2020-01-30T10:58:03Z" level=error msg="Cannot get describe groups: kafka: broker not connected" source="kafka_exporter.go:373"

I used the Strimzi chart to deploy strimzi-kafka-operator:0.16.2 and then my Kafka cluster.
Kubernetes v1.14.3

I also fixed it by reloading the kafka exporter.

@prasus

prasus commented Aug 12, 2020

Greetings, I have been experiencing the same issues for the last few weeks. A restart of the kafka_exporter pod sometimes solves the problem, but it recurs intermittently, leading to missing data in our Grafana dashboards.

time="2020-08-12T04:06:47Z" level=error msg="Cannot get oldest offset of topic xxxxx partition 31: kafka: broker not connected" source="kafka_exporter.go:296"
time="2020-08-12T04:06:50Z" level=error msg="Cannot get current offset of topic xxxxx partition 129: kafka: broker not connected" source="kafka_exporter.go:284"
time="2020-08-12T04:37:15Z" level=error msg="Cannot get consumer group: kafka: broker not connected" source="kafka_exporter.go:361"

I also tried increasing the scrape_timeout value on the Prometheus service monitor, but it doesn't help much.

Any news on the update/fix about this? Thanks! 🙏

@talonx

talonx commented Oct 9, 2020

I am seeing this too, with the latest exporter image. Any solutions or workarounds? This is the only Kafka exporter out there I could find - so it would be nice if this were fixed.

@ximis

ximis commented Nov 20, 2020

I hit this too. When I started writing data into Kafka, Kafka Exporter became unresponsive.

@atrbgithub

Also seeing this.

IBM/sarama#1857 (comment) may be related?

@alok87

alok87 commented Feb 19, 2021

Adding one more broker.Open() fixed the issue.

I was already opening the broker connection before the function call.

// Caller: open the broker before the call.
err = broker.Open(t.client.Config())
if err != nil && err != sarama.ErrAlreadyConnected {
    return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
}

k.consumerGroupLag(id, topic, partition, broker)

func (t *kafkaWatch) consumerGroupLag(id string, topic string, partition int32, broker *sarama.Broker) {
    // ... do something with the client ... (t.client.GetOffset)

    // one more broker.Open() needed before using the broker directly
    err := broker.Open(t.client.Config())
    if err != nil && err != sarama.ErrAlreadyConnected {
        return defaultLag, fmt.Errorf("Error opening broker connection again, err: %v", err)
    }

    offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
    // ...
}

@alok87

alok87 commented Feb 19, 2021

No, this also did not work :( I am still getting "broker not connected" from broker.FetchOffset(&offsetFetchRequest).

This happens when there is a huge number of parallel requests to Kafka for many topics. Slowing it down makes the problem disappear.

@alesj
Contributor

alesj commented Apr 1, 2021

@alok87 did you try with the latest release?

Here I've merged another perf PR, plus added a way to limit the getTopicMetrics goroutines.

... if it helps ...
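The general idea behind limiting those goroutines is a bounded worker pool; a minimal sketch with a buffered-channel semaphore (the limit and the function name are illustrative, not necessarily how the fork implements it):

```go
package sketch

import "sync"

// forEachTopicLimited runs work for every topic while never letting more than
// limit goroutines run at once, which keeps the fan-out to the brokers bounded.
func forEachTopicLimited(topics []string, limit int, work func(topic string)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for _, topic := range topics {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(topic string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(topic)
		}(topic)
	}
	wg.Wait()
}
```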

@alok87

alok87 commented Apr 21, 2021

Thanks @alesj.

@alesj Yes, with v1.3.0 I am still troubled by

time="2021-04-21T15:16:22Z" level=error msg="Cannot get offset of group CG: kafka: broker not connected" source="kafka_exporter.go:412"

Trying your fork! Do you have an image I can use?

@alok87

alok87 commented Apr 21, 2021

Deployed your optimized fork https://github.com/alesj/kafka_exporter/tree/fork1
Here is the image for it: practodev/kafka-exporter:fork1
We also have a big cluster, 1000s of consumer groups!

This optimization seems to work!

@danielqsj please take these optimizations in, for large cluster support!

alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 5, 2021
alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 7, 2021
@atrbgithub

@alesj Your fork has resolved our issue where the exporter's memory usage would sometimes spike, resulting in an OOM. Thank you! Memory usage is now much lower and stays low.

alok87 added a commit to practo/tipoca-stream that referenced this issue Jun 17, 2021
@alok87

alok87 commented Sep 8, 2021

@danielqsj any plan to take this in? For big clusters with many consumer groups, the master of this repo does not work.

@danielqsj
Owner

Cherry-picked from #228 and released v1.4.1. @alok87, @atrbgithub and @ALL, please give it a try.

@alok87

alok87 commented Sep 8, 2021

Thanks @danielqsj
I have deployed danielqsj/kafka-exporter:v1.4.1. I'll let you know how it goes.

@alok87

alok87 commented Sep 8, 2021

Update: it's working in a 1000+ consumer-group cluster in prod. I will update if any issue happens; it should not. Thanks @danielqsj and @skonmeme.

@alok87

alok87 commented Sep 8, 2021

Update: had to revert v1.4.1; the target endpoint became very slow to respond after this, and the target was down.
(screenshot attached: 2021-09-08, 4:26 PM)

@danielqsj
