
Operator wants to allocate only 1 coordinator per zone #2185

Open
mpatou-openai opened this issue Dec 14, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@mpatou-openai

What happened?

In a cloud provider (Azure), I created a new cluster with 3 zones: 1, 2, and 3.

Processes are spread across the 3 zones, but the operator fails to create a cluster file because it seems to recruit only 3 coordinators.

This is despite having 5 log servers and 4 storage servers. I suspect that the controller is choosing at most 1 pod per zone.

{"level":"error","ts":"2024-12-14T05:06:24Z","msg":"Reconciler error","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"fdb-cluster","namespace":"fdb-loadtest2"},"namespace":"fdb-loadtest2","name":"fdb-cluster","reconcileID":"65b1409a-cfaf-4a60-970a-58349c2e8fd1","error":"Could only select 3 processes, but 5 are required","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}

What did you expect to happen?

The controller picks 5 coordinators and creates a proper cluster file.

How can we reproduce it (as minimally and precisely as possible)?

Use a cloud provider and make sure it exposes the zones. Alternatively, you could create an ADDITIONAL_ENV_FILE that hashes the pod name down to a zone between 1 and 3 (for both the log and storage pods) and exports the variable FDB_ZONE_ID; a rough sketch of that mapping follows below.
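
For illustration, a minimal Go sketch of that pod-name-to-zone mapping (the hash function and the use of the pod hostname are assumptions of this sketch; in a real setup the equivalent logic would live in the ADDITIONAL_ENV_FILE itself):

```go
// Hypothetical sketch: map a pod name to one of three synthetic zones by
// hashing it, mirroring what an ADDITIONAL_ENV_FILE exporting FDB_ZONE_ID
// could do.
package main

import (
	"fmt"
	"hash/fnv"
	"os"
)

func zoneForPod(podName string) string {
	h := fnv.New32a()
	h.Write([]byte(podName))
	// Reduce the hash to a zone between 1 and 3.
	return fmt.Sprintf("%d", h.Sum32()%3+1)
}

func main() {
	podName, _ := os.Hostname() // inside a pod, the hostname is the pod name
	fmt.Printf("FDB_ZONE_ID=%s\n", zoneForPod(podName))
}
```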

Anything else we need to know?

No response

FDB Kubernetes operator

foundationdb/fdb-kubernetes-operator:v1.51.0

Kubernetes version

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cloud provider

Azure
mpatou-openai added the bug (Something isn't working) label on Dec 14, 2024
@mpatou-openai
Author

I have "triple" as replication; downgrading to "double" does the trick, as only 3 coordinators are required.

@johscheuer
Member

For triple replication the operator expects 5 different zones to pick coordinators from: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/fault_domains.md#coordinators. A possible alternative (if you want to keep 3 replicas) would be the three-data-hall setup: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/fault_domains.md#three-data-hall-replication.
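
As a rough sketch of the relationship described in the linked docs (the fault-tolerance values and the 2f+1 formula are assumptions of this sketch, not taken from the operator's code):

```go
// Illustrative only: the number of coordinators follows from the fault
// tolerance of the redundancy mode, and with at most one coordinator per
// zone, that is also the number of distinct zones needed.
package main

import "fmt"

// faultTolerance is the number of zone failures the mode is meant to survive
// (assumed values for this sketch).
var faultTolerance = map[string]int{
	"single": 0,
	"double": 1,
	"triple": 2,
}

// desiredCoordinators returns 2*f+1 so that a majority of coordinators
// survives f zone failures.
func desiredCoordinators(mode string) int {
	return 2*faultTolerance[mode] + 1
}

func main() {
	for _, mode := range []string{"single", "double", "triple"} {
		fmt.Printf("%s: needs %d coordinators, each in a distinct zone\n",
			mode, desiredCoordinators(mode))
	}
}
```

With at most one coordinator per zone, the coordinator count and the number of required distinct zones are the same, which is why downgrading to double makes 3 zones sufficient.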

@mpatou-openai
Author

The three-data-hall setup requires 3 clusters, which feels like it has its own set of challenges (i.e. clients would have to know about the 3 clusters). Also:

NOTE: The support for this redundancy mode is new and might have issues. Please make sure you test this configuration in your test/QA environment.

This kind of feels scary. On the other hand, we seem to be facing some weird issues with the operator when it takes down pods without knowing whether they actually are replicas of each other.

@mpatou-openai
Author

For triple replication the operator expects 5 different zones to pick coordinators from

Why couldn't the operator pick at least 1 pod in each zone and then complement that by picking more pods in each zone to match the required amount?

@johscheuer
Member

The three-data-hall setup requires 3 clusters, which feels like it has its own set of challenges (i.e. clients would have to know about the 3 clusters).

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string. With the unified image it should be possible to let the operator manage a single FoundationDBCluster resource, but this setup is not yet documented and I'm not sure I'll find the time to document it before the end of the year.

This kind of feels scary. On the other hand, we seem to be facing some weird issues with the operator when it takes down pods without knowing whether they actually are replicas of each other.

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

Why couldn't the operator pick at least 1 pod in each zone and then complement that by picking more pods in each zone to match the required amount?

In theory we could do this, but it brings a few more challenges and risks. We decided to use only one coordinator per zone to have a clear failure model when a zone fails, e.g. if a whole zone fails, at most one coordinator will be down. If we relax this requirement, we have to make sure that the coordinator layout still supports the replication guarantees, e.g. to allow the failure of x machines/zones. We would also need to ensure we have a minimal number of distinct zones, so that a full zone can fail without bringing down the majority of the coordinators (which would make the FDB cluster unavailable and would require manual intervention to bring the cluster back if the coordinators' data were lost).
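
A rough sketch of that one-coordinator-per-zone constraint (illustrative only, not the operator's actual implementation) also shows where an error like the one reported above comes from:

```go
// Illustrative sketch: pick coordinator candidates with at most one process
// per zone and fail if there are not enough distinct zones, which is the
// situation behind "Could only select 3 processes, but 5 are required".
package main

import "fmt"

type process struct {
	ID   string
	Zone string
}

func selectCoordinators(candidates []process, required int) ([]process, error) {
	usedZones := map[string]bool{}
	var selected []process
	for _, p := range candidates {
		if usedZones[p.Zone] {
			continue // at most one coordinator per zone
		}
		usedZones[p.Zone] = true
		selected = append(selected, p)
		if len(selected) == required {
			return selected, nil
		}
	}
	return nil, fmt.Errorf("could only select %d processes, but %d are required",
		len(selected), required)
}

func main() {
	// 9 processes but only 3 distinct zones; triple replication wants 5.
	var candidates []process
	for i := 0; i < 9; i++ {
		candidates = append(candidates, process{
			ID:   fmt.Sprintf("proc-%d", i),
			Zone: fmt.Sprintf("%d", i%3+1),
		})
	}
	if _, err := selectCoordinators(candidates, 5); err != nil {
		fmt.Println(err)
	}
}
```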

@mpatou-openai
Author

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string.

So they just share the same connection string, no matter which hall the clients are connected to?

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

So the weirdest one was that we changed both the main container and the sidecar to use an internal registry instead of DockerHub, but forgot to actually push the sidecar image to the registry in production (oops...), and the operator started to restart a bunch of pods, both log and storage, and for some ranges we (temporarily) lost replicas because the pods were stuck in the init phase.
I suspect it's because the operator thinks that there are enough zones remaining and so it can afford to restart multiple pods, but it could well happen that all 3 replicas (when using triple replication) for some ranges are in the pods that are getting restarted.
I haven't had the time to fully reproduce the error in a test cluster; what I did, though, was confirm that more than a couple of pods could be restarting if things are not working as expected (i.e. faulty sidecar, slow download from the registry, ...).
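
To illustrate that suspicion (purely hypothetical pod names and placement, not taken from the operator): a zone-level check can look healthy while every replica of one particular range sits in the set of pods being restarted at the same time.

```go
// Hypothetical illustration of the concern described above.
package main

import "fmt"

func main() {
	// Replica placement for one key range under triple replication.
	rangeReplicas := []string{"storage-1", "storage-4", "storage-7"}

	// Pods restarted in one pass; two of three zones still have healthy pods,
	// yet all replicas of this range are in the restarting set.
	restarting := map[string]bool{
		"storage-1": true, "storage-4": true, "storage-7": true, "log-2": true,
	}

	available := 0
	for _, pod := range rangeReplicas {
		if !restarting[pod] {
			available++
		}
	}
	fmt.Printf("replicas of this range still available: %d\n", available) // prints 0
}
```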

@johscheuer
Member

That's not true: for the clients it looks like a single FDB cluster (because it is a single FDB cluster); they just connect to the cluster with the corresponding connection string.

So they just share the same connection string, no matter which hall the clients are connected to?

The cluster is a single FDB cluster, so from the client side it's not possible to tell whether the cluster is managed by a single operator or by multiple operator instances. And correct, they share the same connection string (independent of the data hall).

Can you elaborate a bit more on those "weird issues"? I think that documentation note is a bit outdated; we have supported this replication mode for a while now.

So the weirdest one was that we changed both the main container and the sidecar to use an internal registry instead of DockerHub, but forgot to actually push the sidecar image to the registry in production (oops...), and the operator started to restart a bunch of pods, both log and storage, and for some ranges we (temporarily) lost replicas because the pods were stuck in the init phase. I suspect it's because the operator thinks that there are enough zones remaining and so it can afford to restart multiple pods, but it could well happen that all 3 replicas (when using triple replication) for some ranges are in the pods that are getting restarted. I haven't had the time to fully reproduce the error in a test cluster; what I did, though, was confirm that more than a couple of pods could be restarting if things are not working as expected (i.e. faulty sidecar, slow download from the registry, ...).

That's interesting. The operators should be synchronising via the locking mechanism, so that only a single operator performs "disruptive" actions (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/update_pods.go#L377C1-L384C3). What version of the operator do you use?
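
A hypothetical sketch of that locking pattern (the interface and method names here are assumptions, not the operator's real API):

```go
// Hypothetical sketch: before doing a disruptive action such as restarting
// pods, an operator instance first has to take a shared lock so that only one
// instance acts at a time.
package main

import (
	"errors"
	"fmt"
)

// lockClient is a stand-in for a cluster-wide lock.
type lockClient interface {
	TakeLock() error
	ReleaseLock() error
}

type alwaysFreeLock struct{}

func (alwaysFreeLock) TakeLock() error    { return nil }
func (alwaysFreeLock) ReleaseLock() error { return nil }

func restartPods(lc lockClient, pods []string) error {
	if err := lc.TakeLock(); err != nil {
		// Another instance holds the lock: skip the disruptive action for now.
		return errors.New("could not take lock, requeueing reconciliation")
	}
	defer lc.ReleaseLock()
	for _, pod := range pods {
		fmt.Printf("restarting %s\n", pod)
	}
	return nil
}

func main() {
	_ = restartPods(alwaysFreeLock{}, []string{"storage-1", "log-2"})
}
```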
