
Initial support for three data hall replication #1651

Merged
johscheuer merged 10 commits into FoundationDB:main from add-support-three-data-hall on Oct 13, 2023

Conversation

@johscheuer (Member) commented on May 30, 2023

Description

Fixes: #348

Type of change

Please select one of the options below.

  • New feature (non-breaking change which adds functionality)

Discussion

When running a FoundationDB cluster on a public cloud provider, it can be useful to use three_data_hall as the redundancy mode and spread Pods across multiple availability zones. So far the operator did not support this mode. It now does, with minimal code changes; the drawback is that a user has to create 3 FoundationDBCluster resources, as sketched below.
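
For illustration, a minimal sketch of one of the three resources. The field names (dataHall, seedConnectionString) follow the documentation added in this PR, but the exact values below are assumptions, not a definitive example:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: test-cluster-us-west-2a
spec:
  version: 7.1.26
  # Data hall locality for every process created by this resource; the other
  # two resources would use us-west-2b and us-west-2c respectively.
  dataHall: us-west-2a
  databaseConfiguration:
    redundancyMode: three_data_hall
  # Assumption: for the second and third resources, seedConnectionString would
  # point at the first cluster's connection string so that all three resources
  # join the same database.

Each resource then manages only the Pods in its own availability zone, which is why three resources are currently required.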

Testing

I did some manual testing:

$ kubectl get fdb,po 

NAME                                                                GENERATION   RECONCILED   AVAILABLE   FULLREPLICATION   VERSION   AGE
foundationdbcluster.apps.foundationdb.org/test-cluster-us-west-2a   2            2            true        true              7.1.26    8m49s
foundationdbcluster.apps.foundationdb.org/test-cluster-us-west-2b   1            1            true        true              7.1.26    5m58s
foundationdbcluster.apps.foundationdb.org/test-cluster-us-west-2c   1            1            true        true              7.1.26    5m54s

NAME                                                              READY   STATUS    RESTARTS   AGE
pod/fdb-kubernetes-operator-controller-manager-7c55d7786c-424nj   1/1     Running   0          9m2s
pod/fdb-kubernetes-operator-controller-manager-7c55d7786c-929kw   1/1     Running   0          9m12s
pod/grafana-74f64cbb9f-rpdhs                                      1/1     Running   0          38d
pod/test-cluster-us-west-2a-log-1                                 2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-log-2                                 2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-log-3                                 2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-log-4                                 2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-log-5                                 2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-log-6                                 2/2     Running   0          6m3s
pod/test-cluster-us-west-2a-storage-1                             2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-storage-2                             2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-storage-3                             2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-storage-4                             2/2     Running   0          8m41s
pod/test-cluster-us-west-2a-storage-5                             2/2     Running   0          8m41s
pod/test-cluster-us-west-2b-log-1                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-log-2                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-log-3                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-log-4                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-log-5                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-log-6                                 2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-storage-1                             2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-storage-2                             2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-storage-3                             2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-storage-4                             2/2     Running   0          5m59s
pod/test-cluster-us-west-2b-storage-5                             2/2     Running   0          5m59s
pod/test-cluster-us-west-2c-log-1                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-log-2                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-log-3                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-log-4                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-log-5                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-log-6                                 2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-storage-1                             2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-storage-2                             2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-storage-3                             2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-storage-4                             2/2     Running   0          5m55s
pod/test-cluster-us-west-2c-storage-5                             2/2     Running   0          5m55s

and fdbcli shows the correct configuration:

fdb> status

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Configuration:
  Redundancy mode        - three_data_hall
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 4
  Desired Remote Logs    - -1
  Desired Log Routers    - -1
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 33
  Zones                  - 33
  Machines               - 33
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - 2 machines
  Server time            - 05/30/23 11:41:30

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 0 MB
  Disk space used        - 1.993 GB

Operating space:
  Storage server         - 14.9 GB free on most full server
  Log server             - 14.9 GB free on most full server

Workload:
  Read rate              - 16 Hz
  Write rate             - 0 Hz
  Transactions started   - 17 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 05/30/23 11:41:30

Documentation

Added to this PR.

Follow-up

We can improve the three data hall setup in the future to require only a single FoundationDBCluster resource, but that will require additional changes in the operator.

As a follow-up we could think about adding support for three_data_center as well; based on my current understanding, those modes are similar.

@johscheuer added the enhancement (New feature or request) label on May 30, 2023
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: fe562f1
  • Duration 3:03:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr-kind on Linux CentOS 7

  • Commit ID: fe562f1
  • Duration 4:09:57
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer (Member Author) left a comment

We have to create a follow up issue to build e2e test cases for this configuration.

@brownleej (Member) left a comment

I think I found a typo, but otherwise this looks good to me.

Review comments: docs/manual/fault_domains.md (outdated, resolved); internal/locality/locality.go (resolved)
@johscheuer force-pushed the add-support-three-data-hall branch from fe562f1 to a8c689c on October 10, 2023 14:21
@johscheuer requested review from brownleej, 09harsh and manfontan and removed the review request for sbodagala on October 10, 2023 14:21
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: a8c689c
  • Duration 2:51:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: af12c6f
  • Duration 2:47:40
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: d207ce6
  • Duration 0:04:42
  • Result: ❌ FAILED
  • Error: Error while executing command: IMG=${REGISTRY}/${OPERATOR_IMAGE} make container-build container-push. Reason: exit status 2
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 84644f8
  • Duration 2:48:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: f7c4280
  • Duration 2:57:07
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 2a7b132
  • Duration 2:59:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Two review comments on controllers/change_coordinators_test.go (outdated, resolved)
@brownleej (Member) left a comment

👍

@johscheuer requested a review from brownleej on October 12, 2023 15:36
@johscheuer force-pushed the add-support-three-data-hall branch from 7508317 to 892c6a9 on October 12, 2023 16:13
@johscheuer (Member Author) left a comment

Waiting for the e2e tests to pass. I will make a small announcement about this in the forums, as I know a few people are waiting for this feature.

@brownleej (Member) left a comment

👍

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 7508317
  • Duration 3:01:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 892c6a9
  • Duration 2:49:03
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer (Member Author) left a comment

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 892c6a9
  • Duration 2:49:03
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

We've seen this failure a few times:

  2023/10/12 16:52:33 reconciled name=fdb-cluster-ccrfs4x5, namespace=pr-464-x6hz2s62, generation:2
• [FAILED] [1121.240 seconds]
Operator Migrations when a migration is triggered and the namespace quota is limited [BeforeEach] should add the prefix to all instances [e2e, pr]
  [BeforeEach] /codebuild/output/src2181315438/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_migrations/operator_migration_test.go:77
  [It] /codebuild/output/src2181315438/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_migrations/operator_migration_test.go:104

  [FAILED] Timed out after 600.000s.
  Expected
      <int64>: 2
  to be zero-valued
  In [BeforeEach] at: /codebuild/output/src2181315438/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_migrations/operator_migration_test.go:100 @ 10/12/23 16:52:33.427
------------------------------

I'm going to open an issue to fix it.

Same for:

• [FAILED] [1937.014 seconds]
Operator Upgrades upgrading a cluster without chaos [It] Upgrade from 7.1.37 to 7.3.15 [e2e, pr]
/codebuild/output/src2181315438/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/upgrade_test_configuration.go:115

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc00029fb40>: 
      timed out waiting for the condition
      {
          s: "timed out waiting for the condition",
      }
  occurred
  In [It] at: /codebuild/output/src2181315438/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_upgrades_variations/operator_upgrades_variations_test.go:119 @ 10/12/23 16:54:29.772

  There were additional failures detected.  To view them in detail run ginkgo -vv
------------------------------

@johscheuer merged commit a59a3d1 into FoundationDB:main on Oct 13, 2023
8 checks passed
@johscheuer deleted the add-support-three-data-hall branch on October 13, 2023 05:20