
Operator wants to allocate only 1 coordinator per zone #2185

Open
mpatou-openai opened this issue Dec 14, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@mpatou-openai

What happened?

In a cloud provider (Azure), I created a new cluster with 3 zones: 1, 2, and 3.

Processes are spread across the 3 zones, but the operator fails to create a cluster file because it seems to recruit only 3 coordinators.

This is despite having 5 log servers and 4 storage servers. I suspect that the controller is choosing at most 1 pod per zone.

{"level":"error","ts":"2024-12-14T05:06:24Z","msg":"Reconciler error","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"fdb-cluster","namespace":"fdb-loadtest2"},"namespace":"fdb-loadtest2","name":"fdb-cluster","reconcileID":"65b1409a-cfaf-4a60-970a-58349c2e8fd1","error":"Could only select 3 processes, but 5 are required","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}

What did you expect to happen?

The controller picks 5 coordinators and creates a proper cluster file.

How can we reproduce it (as minimally and precisely as possible)?

Use a cloud provider and make sure it exposes the zones. Alternatively, you could create an ADDITIONAL_ENV_FILE that hashes the pod name down to a zone between 1 and 3 (for both the log and storage pods) and exports the variable FDB_ZONE_ID; a rough sketch of that mapping follows below.
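
For illustration, a minimal Go sketch of that pod-name-to-zone mapping (the hash function and the use of the pod hostname are assumptions of this sketch; in a real setup the equivalent logic would live in the ADDITIONAL_ENV_FILE itself):

```go
// Hypothetical sketch: map a pod name to one of three synthetic zones by
// hashing it, mirroring what an ADDITIONAL_ENV_FILE exporting FDB_ZONE_ID
// could do.
package main

import (
	"fmt"
	"hash/fnv"
	"os"
)

func zoneForPod(podName string) string {
	h := fnv.New32a()
	h.Write([]byte(podName))
	// Reduce the hash to a zone between 1 and 3.
	return fmt.Sprintf("%d", h.Sum32()%3+1)
}

func main() {
	podName, _ := os.Hostname() // inside a pod, the hostname is the pod name
	fmt.Printf("FDB_ZONE_ID=%s\n", zoneForPod(podName))
}
```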

Anything else we need to know?

No response

FDB Kubernetes operator

foundationdb/fdb-kubernetes-operator:v1.51.0

Kubernetes version

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cloud provider

Azure
mpatou-openai added the bug (Something isn't working) label on Dec 14, 2024
@mpatou-openai
Author

I have "triple" as replication; downgrading to "double" does the trick, as only 3 coordinators are required.

@johscheuer
Member

For triple replication the operator expects 5 different zones to pick coordinators from: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/fault_domains.md#coordinators. A possible alternative (if you want to keep 3 replicas) would be the three-data-hall setup: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/fault_domains.md#three-data-hall-replication.
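
As a rough sketch of the relationship described in the linked docs (the fault-tolerance values and the 2f+1 formula are assumptions of this sketch, not taken from the operator's code):

```go
// Illustrative only: the number of coordinators follows from the fault
// tolerance of the redundancy mode, and with at most one coordinator per
// zone, that is also the number of distinct zones needed.
package main

import "fmt"

// faultTolerance is the number of zone failures the mode is meant to survive
// (assumed values for this sketch).
var faultTolerance = map[string]int{
	"single": 0,
	"double": 1,
	"triple": 2,
}

// desiredCoordinators returns 2*f+1 so that a majority of coordinators
// survives f zone failures.
func desiredCoordinators(mode string) int {
	return 2*faultTolerance[mode] + 1
}

func main() {
	for _, mode := range []string{"single", "double", "triple"} {
		fmt.Printf("%s: needs %d coordinators, each in a distinct zone\n",
			mode, desiredCoordinators(mode))
	}
}
```

With at most one coordinator per zone, the coordinator count and the number of required distinct zones are the same, which is why downgrading to double makes 3 zones sufficient.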

@mpatou-openai
Author

The three-data-hall setup requires 3 clusters, which feels like it has its own set of challenges (i.e. clients would have to know about the 3 clusters). Also:

NOTE: The support for this redundancy mode is new and might have issues. Please make sure you test this configuration in your test/QA environment.

This kind of feels scary. On the other hand, we seem to be facing some weird issues with the operator when it takes down pods without knowing whether they actually are replicas of each other.

@mpatou-openai
Author

For triple replication the operator expects 5 different zones to pick coordinators from

Why couldn't the operator pick at least 1 pod in each zone and then complement that by picking more pods in each zone to match the required amount?

@johscheuer
Member

The three-data-hall setup requires 3 clusters, which feels like it has its own set of challenges (i.e. clients would have to know about the 3 clusters).

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string. With the unified image it should be possible to let the operator manage a single FoundationDBCluster resource, but this setup is not yet documented and I'm not sure I'll find the time to document it before the end of the year.

This kind of feels scary. On the other hand, we seem to be facing some weird issues with the operator when it takes down pods without knowing whether they actually are replicas of each other.

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

Why couldn't the operator pick at least 1 pod in each zone and then complement that by picking more pods in each zone to match the required amount?

In theory we could do this, but it brings a few more challenges and risks. We decided to use only one coordinator per zone to have a clear failure model when a zone fails, e.g. if a whole zone fails, at most one coordinator will be down. If we relax this requirement, we have to make sure that the coordinator layout still supports the replication guarantees, e.g. to allow the failure of x machines/zones. We would also need to ensure we have a minimal number of distinct zones, so that a full zone can fail without bringing down the majority of the coordinators (which would make the FDB cluster unavailable and would require manual intervention to bring the cluster back if the coordinators' data were lost).
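
A rough sketch of that one-coordinator-per-zone constraint (illustrative only, not the operator's actual implementation) also shows where an error like the one reported above comes from:

```go
// Illustrative sketch: pick coordinator candidates with at most one process
// per zone and fail if there are not enough distinct zones, which is the
// situation behind "Could only select 3 processes, but 5 are required".
package main

import "fmt"

type process struct {
	ID   string
	Zone string
}

func selectCoordinators(candidates []process, required int) ([]process, error) {
	usedZones := map[string]bool{}
	var selected []process
	for _, p := range candidates {
		if usedZones[p.Zone] {
			continue // at most one coordinator per zone
		}
		usedZones[p.Zone] = true
		selected = append(selected, p)
		if len(selected) == required {
			return selected, nil
		}
	}
	return nil, fmt.Errorf("could only select %d processes, but %d are required",
		len(selected), required)
}

func main() {
	// 9 processes but only 3 distinct zones; triple replication wants 5.
	var candidates []process
	for i := 0; i < 9; i++ {
		candidates = append(candidates, process{
			ID:   fmt.Sprintf("proc-%d", i),
			Zone: fmt.Sprintf("%d", i%3+1),
		})
	}
	if _, err := selectCoordinators(candidates, 5); err != nil {
		fmt.Println(err)
	}
}
```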

@mpatou-openai
Author

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string.

So they just share the same connection string, no matter which hall the clients are connected to?

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

So the weirdest one was that we changed both the main container and the sidecar to use an internal registry instead of DockerHub, but forgot to actually push the sidecar image to the registry in production (oops...), and the operator started to restart a bunch of pods, both log and storage, and for some ranges we (temporarily) lost replicas because the pods were stuck in the init phase.
I suspect it's because the operator thinks that there are enough zones remaining and so it can afford to restart multiple pods, but it could well happen that all 3 replicas (when using triple replication) for some ranges are in the pods that are getting restarted.
I haven't had the time to fully reproduce the error in a test cluster; what I did, though, was confirm that more than a couple of pods could be restarting if things are not working as expected (i.e. faulty sidecar, slow download from the registry, ...).
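
To illustrate that suspicion (purely hypothetical pod names and placement, not taken from the operator): a zone-level check can look healthy while every replica of one particular range sits in the set of pods being restarted at the same time.

```go
// Hypothetical illustration of the concern described above.
package main

import "fmt"

func main() {
	// Replica placement for one key range under triple replication.
	rangeReplicas := []string{"storage-1", "storage-4", "storage-7"}

	// Pods restarted in one pass; two of three zones still have healthy pods,
	// yet all replicas of this range are in the restarting set.
	restarting := map[string]bool{
		"storage-1": true, "storage-4": true, "storage-7": true, "log-2": true,
	}

	available := 0
	for _, pod := range rangeReplicas {
		if !restarting[pod] {
			available++
		}
	}
	fmt.Printf("replicas of this range still available: %d\n", available) // prints 0
}
```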

@johscheuer
Member

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string.

So they just share the same connection string, no matter which hall the clients are connected to?

The cluster is a single FDB cluster, so from the client side it's not possible to tell whether the cluster is managed by a single operator or by multiple operator instances. And correct, they share the same connection string (independent of the data hall).

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

So the weirdest one was that we changed both the main container and the sidecar to use an internal registry instead of DockerHub, but forgot to actually push the sidecar image to the registry in production (oops...), and the operator started to restart a bunch of pods, both log and storage, and for some ranges we (temporarily) lost replicas because the pods were stuck in the init phase. I suspect it's because the operator thinks that there are enough zones remaining and so it can afford to restart multiple pods, but it could well happen that all 3 replicas (when using triple replication) for some ranges are in the pods that are getting restarted. I haven't had the time to fully reproduce the error in a test cluster; what I did, though, was confirm that more than a couple of pods could be restarting if things are not working as expected (i.e. faulty sidecar, slow download from the registry, ...).

That's interesting. The operators should be synchronising via the locking mechanism, so that only a single operator performs "disruptive" actions (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/update_pods.go#L377C1-L384C3). What version of the operator do you use?
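
A hypothetical sketch of that locking pattern (the interface and method names here are assumptions, not the operator's real API):

```go
// Hypothetical sketch: before doing a disruptive action such as restarting
// pods, an operator instance first has to take a shared lock so that only one
// instance acts at a time.
package main

import (
	"errors"
	"fmt"
)

// lockClient is a stand-in for a cluster-wide lock.
type lockClient interface {
	TakeLock() error
	ReleaseLock() error
}

type alwaysFreeLock struct{}

func (alwaysFreeLock) TakeLock() error    { return nil }
func (alwaysFreeLock) ReleaseLock() error { return nil }

func restartPods(lc lockClient, pods []string) error {
	if err := lc.TakeLock(); err != nil {
		// Another instance holds the lock: skip the disruptive action for now.
		return errors.New("could not take lock, requeueing reconciliation")
	}
	defer lc.ReleaseLock()
	for _, pod := range pods {
		fmt.Printf("restarting %s\n", pod)
	}
	return nil
}

func main() {
	_ = restartPods(alwaysFreeLock{}, []string{"storage-1", "log-2"})
}
```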
