Rolling updates downtime effect #8924

28ori · 2023-07-31T08:57:26Z

28ori
Jul 31, 2023

Hi,
I have noticed that the Strimzi Operator upgrade process causes Rolling Upgrade which causes DOWNTIME.
Although I have 3 brokers and min.insync.replicas=2, the producer needs to wait for time (from 30s-2m).
I think this might happen because of new leaders election (new election for the partition leaders on the broker which dies).
I'm wondering if there's a way to prevent this downtime, because while I'm upgrading the operator, this downtime occurs for ALL the Kafkas that the operator manages.
I thought about an option to manually change the spec.kafka.version to fake one and then the operator won't do a rolling update, but then it won't be supported (which is problematic with the strimzipodset).
Does anyone has an idea how to avoid this situation? or maybe propose a feature change?

scholzj · 2023-07-31T09:11:05Z

scholzj
Jul 31, 2023
Maintainer

I'm not sure what do you mean with new election for the partition leaders on the broker which dies. The brokers should not die -> they are shut down. If they are not shutting down cleanly for you, you should investigate it (maybe you for example need to increase the terminationGracePeriod for the brokers). Also, the leadership change will always be something the client needs to react to - from your description it is not clear why it takes 30 seconds in your clients.

The rolling update is always needed when upgrading, because the operator needs to understand what version of the software it is running.

2 replies

28ori Aug 6, 2023
Author

I've tried to configure old image in the new operator.
for example,
for Strimzi 0.29.0 I configured:

name: STRIMZI_KAFKA_IMAGES
value: |
3.1.1=my-image-registry:5000/myimagestream:strimzi-operator:0.29.0-kafka-3.1.1
3.2.0=my-image-registry:5000/myimagestream:strimzi-operator:0.29.0-kafka-3.2.0

I upgraded to Strimzi 0.33.2 and configured:

name: STRIMZI_KAFKA_IMAGES
value: |
3.2.0=my-image-registry:5000/myimagestream:strimzi-operator:0.29.0-kafka-3.2.0
.
.
3.4.0=my-image-registry:5000/myimagestream:strimzi-operator:0.33.2-kafka-3.4.0

I hoped that rolling update won't effect Kafkas with version 3.2.0 but even these Kafkas need a rolling update!
And the rolling update didn't change the image.
NOTE: StrimziPodSets was activated before the upgrade.
So my question, why is the rolling update necessary? can I disable it somehow?

scholzj Aug 6, 2023
Maintainer

This is wrong. The images are coupled with the operator release and there might be changes that make the operator not work with older or newer images at all. There are also many other changes to how the pods look like, new settings and options, new or removed environment variables. If you want the operator to manage the cluster, it needs to understand it and make sure it has the settings it expects.

jonsnowseven · 2024-05-14T12:46:29Z

jonsnowseven
May 14, 2024

Hello everyone,

I have a Kafka cluster managed by Strimzi with the following configuration:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
...
spec:
  kafka:
    version: 3.6.1
    replicas: 5
    config:
      default.replication.factor: 3
      min.insync.replicas: 2

When I update the Kafka version and deploy the change, I observe that the older pods are terminated before the new pods are fully up (they might not be ready in the Kafka sense; I understand that Drain Cleaner might help with this, but that’s a different conversation).

Is there something configuration-wise that I might be doing wrong for this to happen? Is this behavior expected for updates that require broker restarts? Are there any steps I can take to avoid or mitigate this issue? Does terminationGracePeriodSeconds solve the issue all together (assuming it is big enough)?

Thank you in advance for your help.

14 replies

jonsnowseven May 14, 2024

Huumm, so there is no way to overcome this from the cluster side (with a configuration, or something)?

And if you have properly written consumers and producers they should be able to handle it.

You mean with some retries, for instance?

scholzj May 14, 2024
Maintainer

The client can find the new leader, connect to it and continue.

jonsnowseven May 14, 2024

Huumm got it!
By any chance do you know if Kafka Java SDK already does that?

scholzj May 14, 2024
Maintainer

I do not know what you mean with Kafka Java SDK. The Apache Kafka Java clients can do that.

jonsnowseven May 14, 2024

I'll take a look at it, thank you very much for your help!

Tang8330 · 2024-07-10T02:11:28Z

Tang8330
Jul 10, 2024

@scholzj I noticed that while I was doing a rolling upgrade on the Kafka cluster, when the client (in this case - I was using the Go SDK) tried to create a new partition [1] I got the error: Replication factor larger than available brokers.

[1] - Publishing to a Kafka topic that doesn't exist yet while using cluster config of auto.topics.enable=true

It seems like it might be related to this thread. Do you have a recommendation on what to do in this case? Perhaps setting tweaking maxUnavailable or podDisruptionBudget?

Not sure what the best course of action here is. Tweaking maxUnavailable might be an issue if we're using PVC where the access mode is only ReadWriteOnce.

Specifically, the error looks like:

Jul  9 19:53:27.481 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=1 sleep=105ms
Jul  9 19:53:27.705 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=2 sleep=598ms
Jul  9 19:53:28.427 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=3 sleep=154ms
Jul  9 19:53:28.690 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=4 sleep=990ms
Jul  9 19:53:29.802 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=5 sleep=1.41s
Jul  9 19:53:31.328 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=6 sleep=1.034s
Jul  9 19:53:32.474 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=7 sleep=1.809s
Jul  9 19:53:34.413 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=8 sleep=1.385s
Jul  9 19:53:35.921 INF Failed to publish to kafka err="[38] Invalid Replication Factor: the replication-factor is invalid" attempts=9 sleep=3.457s

2 replies

scholzj Jul 10, 2024
Maintainer

This is how Kafka works. When you create a new topic, it is looking only at the running brokers. So during a rolling update, you might get this error. Normally, creating new topics, adding more partitions or changing replication factor is very rare, so you can simply wait for the rolling update to finish.

Tang8330 Jul 10, 2024

Makes sense.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strimzi

Rolling updates downtime effect #8924

{{title}}

Replies: 3 comments 18 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Rolling updates downtime effect #8924

Replies: 3 comments · 18 replies

scholzj Jul 31, 2023 Maintainer

28ori Aug 6, 2023 Author

scholzj Aug 6, 2023 Maintainer

scholzj May 14, 2024 Maintainer

scholzj May 14, 2024 Maintainer

scholzj Jul 10, 2024 Maintainer

Replies: 3 comments 18 replies

scholzj
Jul 31, 2023
Maintainer

28ori Aug 6, 2023
Author

scholzj Aug 6, 2023
Maintainer

scholzj May 14, 2024
Maintainer

scholzj May 14, 2024
Maintainer

scholzj Jul 10, 2024
Maintainer