Disaster Recovery Updates #4164

Open: wants to merge 11 commits into base: main. Changes from 6 commits.
33 changes: 28 additions & 5 deletions docs/database/CLUSTER_DB.MD
@@ -56,21 +56,44 @@ Spin up the Repo-Standby Cluster with:
### Promoting a Standby Cluster

Once a standby cluster is running, it can be promoted to become the primary cluster. **Note: only do this if the existing primary has been shut down first.**
You can shut down the primary cluster with this command:
`kubectl patch postgrescluster/<cluster-name> --type merge --patch '{"spec":{"shutdown": true}}'`
The cluster is fully shut down once no StatefulSet pods are running for that cluster.
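Shutdown progress can be watched from the CLI; this is a sketch, assuming the operator's standard `postgres-operator.crunchydata.com/cluster` pod label (the same label used elsewhere in these templates):

```
# Watch the cluster's pods terminate; the cluster is fully shut down
# once no StatefulSet pods remain for it.
kubectl get pods -l postgres-operator.crunchydata.com/cluster=<cluster-name> --watch
```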

Promote the standby cluster by editing the [crunchy_standby.yaml](../../openshift/templates/crunchy_standby.yaml) to set the `standby` field to `false`.
Promotion is complete when the logs of the standby StatefulSet show that a leader has been elected.
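If editing the template is awkward, the same flag can be flipped on the live resource instead; a sketch, assuming the field is `spec.standby.enabled` as in the `standby:` block of crunchy_standby.yaml:

```
kubectl patch postgrescluster/<cluster-name> --type merge --patch '{"spec":{"standby":{"enabled":false}}}'
```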

More details here: <https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/disaster-recovery#promoting-a-standby-cluster>

### Setting secrets

The promoted standby cluster creates its own secrets for connecting to pgBouncer, and it creates a new database user with the same name as the cluster. For example, if the standby cluster is named "wps-crunchy-16-2024-12-10", the user will have that name as well.
Once the standby has been promoted, the easiest way to update user privileges is to reassign table ownership from the old user to the new user:
`REASSIGN OWNED BY "<old-user>" TO "<new-user>";`
Use `\du` in psql to list the users in the database.
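Wrapped in a shell one-liner, the reassignment might look like this; the user and database names are illustrative only:

```
# Reassign ownership from the old user to the new cluster-named user.
# "wps" and "wps-crunchy-16-2024-12-10" are example names only.
psql -h localhost -p 5432 -U postgres -d wps \
  -c 'REASSIGN OWNED BY "wps" TO "wps-crunchy-16-2024-12-10";'
```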

Once this is done, the deployment that uses the Crunchy secrets and cluster references needs to be updated. This can be done by manually editing the deployment YAML in the OpenShift UI, changing every config value that referenced the original Crunchy cluster to the newly promoted standby cluster's secrets. New pods should then roll out successfully.

- Change to the name of the new cluster secret:
  - POSTGRES_READ_USER
  - POSTGRES_WRITE_USER
  - POSTGRES_PASSWORD
  - POSTGRES_WRITE_HOST
  - POSTGRES_READ_HOST
  - POSTGRES_PORT
- Change to the name of the new cluster:
  - PATRONI_CLUSTER_NAME
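Rather than hand-editing each variable, `oc set env` can import every key of the new secret in one step. A sketch with placeholder names; note that the secret's key names may not line up with the variable names above, in which case manual edits are still needed:

```
# Point the deployment's environment at the new cluster's pguser secret.
oc -n <namespace-license-plate> set env deployment/<deployment-name> \
  --from=secret/<new-cluster-pguser-secret>

# PATRONI_CLUSTER_NAME is a plain value, not part of the secret:
oc -n <namespace-license-plate> set env deployment/<deployment-name> \
  PATRONI_CLUSTER_NAME=<new-cluster-name>
```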

## Cluster Restore From pg_dump

In the event that the cluster can't be restored from pgBackRest, you can create a new cluster and restore it using a pg_dump stored in S3.

##### Deploy new cluster

```
oc login --token=<your-token> --server=<openshift-api-url>
PROJ_TARGET=<namespace-license-plate> BUCKET=<s3-bucket> CPU_REQUEST=75m CPU_LIMIT=2000m MEMORY_REQUEST=2Gi MEMORY_LIMIT=16Gi DATA_SIZE=65Gi WAL_SIZE=45Gi bash ./oc_provision_crunchy.sh <suffix> apply
```

##### Set superuser permissions in new cluster via OpenShift web GUI

@@ -90,7 +113,7 @@
```
PGUSER=$(oc get secrets -n <namespace-license-plate> "<wps-crunchydb-pguser-secret-name>" -o go-template='{{.data.user | base64decode}}')
PGDATABASE=$(oc get secrets -n <namespace-license-plate> "<wps-crunchydb-pguser-secret-name>" -o go-template='{{.data.dbname | base64decode}}')
oc -n <namespace-license-plate> port-forward "${PG_CLUSTER_PRIMARY_POD}" 5432:5432
```

##### Restore SQL dump into new cluster in another shell

Download the latest SQL dump from S3 storage and unzip it.
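The download-and-restore step might look like the following; the object path and dump name are placeholders, and it assumes the AWS CLI is configured for the bucket and the port-forward from the previous section is still running:

```
# Fetch and unpack the latest dump (path and file name are illustrative).
aws s3 cp "s3://<s3-bucket>/sqldumps/<dump-name>.sql.gz" . \
  --endpoint-url https://nrs.objectstore.gov.bc.ca
gunzip <dump-name>.sql.gz

# Restore through the port-forward from the previous section.
psql -h localhost -p 5432 -U "${PGUSER}" -d "${PGDATABASE}" -f <dump-name>.sql
```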
7 changes: 6 additions & 1 deletion openshift/scripts/oc_provision_crunchy_standby.sh
@@ -25,6 +25,9 @@ source "$(dirname ${0})/common/common"
# Target project override for Dev or Prod deployments
#
PROJ_TARGET="${PROJ_TARGET:-${PROJ_DEV}}"

# Set DATE to today's date if it isn't set
DATE=${DATE:-$(date +"%Y-%m-%d")}
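The `${VAR:-default}` expansion used here is easy to sanity-check locally:

```shell
# DATE keeps an explicit value if one is already set; otherwise it
# falls back to today's date in YYYY-MM-DD form.
unset DATE
DATE=${DATE:-$(date +"%Y-%m-%d")}
echo "$DATE"    # prints today's date

DATE="2024-12-10"
DATE=${DATE:-$(date +"%Y-%m-%d")}
echo "$DATE"    # prints 2024-12-10
```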

# Prepare names for crunchy ephemeral instance for this PR.
IMAGE_STREAM_NAMESPACE=${IMAGE_STREAM_NAMESPACE:-${PROJ_TOOLS}}
@@ -35,7 +38,8 @@ OC_PROCESS="oc -n ${PROJ_TARGET} process -f ${TEMPLATE_PATH}/crunchy_standby.yaml \
-p SUFFIX=${SUFFIX} \
-p TARGET_NAMESPACE=${PROJ_TARGET} \
-p BUCKET=${BUCKET} \
-  -p DATA_SIZE=45Gi \
+  -p DATE=${DATE} \
+  -p DATA_SIZE=65Gi \
> **Collaborator:** Could we parameterize DATA_SIZE and WAL_SIZE? It will give us a little more flexibility around test deployments.

-p WAL_SIZE=15Gi \
dgboss marked this conversation as resolved.
${IMAGE_NAME:+ " -p IMAGE_NAME=${IMAGE_NAME}"} \
${IMAGE_TAG:+ " -p IMAGE_TAG=${IMAGE_TAG}"} \
@@ -46,6 +50,7 @@
-p MEMORY_LIMIT=16Gi"



# In order to avoid running out of storage quota in our development environment, use
# ephemeral storage by removing the pvc request from the template.
if [ "$EPHEMERAL_STORAGE" = "True" ]
Expand Down
64 changes: 51 additions & 13 deletions openshift/templates/crunchy_standby.yaml
@@ -1,12 +1,12 @@
 apiVersion: template.openshift.io/v1
 kind: Template
 metadata:
-  name: wps-crunchydb-standby
+  name: ${APP_NAME}-${DATE}
   annotations:
-    "openshift.io/display-name": wps-crunchydb-standby
+    "openshift.io/display-name": ${APP_NAME}-${DATE}
   labels:
-    app.kubernetes.io/part-of: wps-crunchydb-standby
-    app: wps-crunchydb-standby
+    app.kubernetes.io/part-of: ${APP_NAME}-${DATE}
+    app: ${APP_NAME}-${DATE}
 parameters:
   - description: Namespace in which database resides
     displayName: Target Namespace
@@ -15,6 +15,13 @@ parameters:
   - name: BUCKET
     description: S3 bucket name
     required: true
+  - name: APP_NAME
+    description: Application name (wps - wildfire predictive services)
+    value: wps-crunchydb-16
+    required: true
+  - name: DATE
+    description: Date the standby was created
+    required: true
   - name: DATA_SIZE
     description: Data PVC size
     required: true
@@ -60,23 +67,17 @@ objects:
   - apiVersion: postgres-operator.crunchydata.com/v1beta1
     kind: PostgresCluster
     metadata:
-      name: wps-crunchydb-standby
+      name: ${APP_NAME}-${DATE}
     spec:
       postgresVersion: 16
       postGISVersion: "3.3"
       metadata:
-        name: wps-crunchydb-standby
+        name: ${APP_NAME}-${DATE}
         labels:
-          app: wps-crunchydb-standby
+          app: ${APP_NAME}-${DATE}
       databaseInitSQL:
         key: init.sql
         name: wps-init-sql
-      users:
-        - name: wps
-          databases:
-            - postgres
-            - wps
-          options: "SUPERUSER"
       instances:
         - name: crunchy
           replicas: 1
@@ -104,20 +105,57 @@ backups:
       backups:
         pgbackrest:
           image: artifacts.developer.gov.bc.ca/bcgov-docker-local/crunchy-pgbackrest:ubi8-2.41-4
+          manual:
+            repoName: repo1
+            options:
+              - --type=full
           configuration:
             - secret:
                 name: crunchy-pgbackrest
                 items:
                   - key: conf
                     path: s3.conf
           global:
+            repo1-retention-full: "3"
+            repo1-retention-full-type: count
             repo1-path: /pgbackrest/${SUFFIX}/repo1
           repos:
             - name: repo1
+              schedules:
+                full: "0 1 * * 0"
+                differential: "0 1 * * 1-6"
               s3:
                 bucket: ${BUCKET}
                 endpoint: nrs.objectstore.gov.bc.ca
                 region: "ca-central-1"
+      proxy:
+        pgBouncer:
+          image: artifacts.developer.gov.bc.ca/bcgov-docker-local/crunchy-pgbouncer:ubi8-1.21-0
+          affinity:
+            podAntiAffinity:
+              preferredDuringSchedulingIgnoredDuringExecution:
+                - podAffinityTerm:
+                    labelSelector:
+                      matchLabels:
+                        postgres-operator.crunchydata.com/cluster: db
+                        postgres-operator.crunchydata.com/role: pgbouncer
+                    topologyKey: kubernetes.io/hostname
+                  weight: 1
+          config:
+            global:
+              pool_mode: transaction
+              ignore_startup_parameters: options, extra_float_digits
+              max_prepared_statements: "10"
+              max_client_conn: "1000"
+          port: 5432
+          replicas: 1
+          resources:
+            limits:
+              cpu: 500m
+              memory: 3Gi
+            requests:
+              cpu: 100m
+              memory: 1Gi
       standby:
         enabled: true
         repoName: repo1