Commit

Merge branch 'release/1.6' of github.com:Cray-HPE/docs-csm into CASMPET-6863-rollback-issues
studenym-hpe committed Jan 2, 2025
2 parents 8d5d13b + ca89c9a commit b3f25ca
Showing 15 changed files with 240 additions and 48 deletions.
2 changes: 2 additions & 0 deletions .spelling
@@ -787,6 +787,8 @@ unregisters
unset
unsets
unsquashed
un-suspend
unsuspend
untainting
untarred
update-cfs-config
40 changes: 25 additions & 15 deletions operations/boot_orchestration/Components.md
@@ -6,19 +6,19 @@ This includes information on the desired state and some information on the curre
Component records are created automatically and will include any components found in the [Hardware State Manager (HSM)](../../glossary.md#hardware-state-manager-hsm).

* [BOS component fields](#bos-component-fields)
* [`actual_state`](#actual_state)
* [`desired_state`](#desired_state)
* [`staged_state`](#staged_state)
* [`enabled`](#enabled)
* [`error`](#error)
* [`event_stats`](#event_stats)
* [`last_action`](#last_action)
* [`session`](#session)
* [`status`](#status)
* [`actual_state`](#actual_state)
* [`desired_state`](#desired_state)
* [`staged_state`](#staged_state)
* [`enabled`](#enabled)
* [`error`](#error)
* [`event_stats`](#event_stats)
* [`last_action`](#last_action)
* [`session`](#session)
* [`status`](#status)
* [Managing BOS components](#managing-bos-components)
* [List all components](#list-all-components)
* [Show details for a component](#show-details-for-a-component)
* [Update a component](#update-a-component)
* [List all components](#list-all-components)
* [Show details for a component](#show-details-for-a-component)
* [Update a component](#update-a-component)

## BOS component fields

@@ -39,7 +39,16 @@ Stores information on the eventual desired boot artifacts and configuration for

### `enabled`

If the node is enabled (enabled == True), BOS will take action to make the actual state match the desired state. Disabled nodes may receive status updates from booted nodes, but BOS will not issue power commands to the nodes while they are disabled.
If the node is enabled (enabled == True), BOS will take action to make the actual state match the desired state. If a node is disabled (enabled == False), BOS will take no action against the node. This is an internal state
that BOS uses to track whether it has finished working on a node. Typically, users should **never** manually enable a node (enabled == True); BOS handles this during session creation.
Even if the BOS session is deleted, the nodes remain enabled in BOS, and BOS will continue to take action to make the nodes' actual states match their desired states.
Because of this, if an administrator wishes to stop BOS from taking such actions on a node, then they must disable it (enabled == False).
Thus, while it is still uncommon, it is more likely that users will disable nodes than enable them.

Even if a node is disabled in BOS, if it is booted, then BOS may receive status updates for it. However, BOS will not issue power commands to the nodes while they are disabled. These status updates come from `bos-reporter`, a small program
that runs periodically on the nodes and updates their actual state.

Both BOS and the Hardware State Manager (HSM) use the term disabled, but not consistently. When HSM reports a node as disabled, the node is out of service; this should not be confused with the BOS meaning of disabled.
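
For illustration, here is a sketch (the xname is a placeholder; it assumes the Cray CLI is authenticated and `jq` is available) of checking and changing a node's `enabled` state:

```bash
# Show whether BOS currently considers the node enabled
cray bos v2 components describe x3000c0s19b1n0 --format json | jq '.enabled'

# Disable the node so that BOS stops taking action on it
cray bos v2 components update x3000c0s19b1n0 --enabled False --format json
```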

### `error`

@@ -73,7 +82,7 @@ Stores the session ID of the session that is currently tracking the component.
This collection of fields stores status information that the BOS operators and other users can query to determine the status of the component. Status fields should generally not be manually updated and should be left to BOS. These fields include:

* `phase` - Describes the general phase of the boot process the component is currently in, such as `powering_on`, `powering_off` and `configuring`.
* `status` - A more specific description of where in the boot process the component is. This can be more detailed phases, such as `power_on_pending`, `power_on_called`, as well as final states such as `failed`.
* `status` - A more specific description of where in the boot process the component is. This can be more detailed phases, such as `power_on_pending`, `power_on_called`, as well as final states such as `failed`.
* `on_hold` is a special value that indicates BOS is re-evaluating the status of the component, such as when a component is re-enabled and BOS needs to collect new information from other services to determine the state of the component.
* `status_override` - A special status field that is used to override `status` when BOS would be unable to determine the status of the node with its current information. This includes the `on_hold` status.
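
As a sketch (the component xname is a placeholder; assumes the Cray CLI is authenticated and `jq` is available), the status fields of a single component can be inspected with:

```bash
# Print only the status sub-fields (phase, status, status_override) for one component
cray bos v2 components describe x3000c0s19b1n0 --format json | jq '.status'
```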

@@ -200,11 +209,12 @@ Example output:
### Update a component

Update a BOS component using `xname`. While most fields can be updated manually, users should restrict themselves to updating the `desired_state` and `enabled`. Altering other fields such as `status` or `last_action` may result in unintended behavior.
See the [`enabled`](#enabled) section for cautions about updating a component's `enabled` state.

(`ncn-mw#`):

```bash
cray bos v2 components update <XNAME> --enabled True --format json
cray bos v2 components update <XNAME> --enabled False --format json
```

Example output:
22 changes: 11 additions & 11 deletions operations/iuf/workflows/preparation.md
@@ -3,9 +3,10 @@
This section defines environment variables and directory content that is used throughout the workflow.

- [1. Prepare for the install or upgrade](#1-prepare-for-the-install-or-upgrade)
- [2. Use of `iuf activity`](#2-use-of-iuf-activity)
- [3. Save system state before upgrade](#3-save-system-state-before-upgrade)
- [4. Next steps](#4-next-steps)
- [2. Install the latest documentation](#2-install-the-latest-documentation)
- [3. Use of `iuf activity`](#3-use-of-iuf-activity)
- [4. Save system state before upgrade](#4-save-system-state-before-upgrade)
- [5. Next steps](#5-next-steps)

## 1. Prepare for the install or upgrade

Expand Down Expand Up @@ -35,15 +36,14 @@ This section defines environment variables and directory content that is used th

- Environment variables have been set and required IUF directories have been created

1. Ensure that the
[latest version of `docs-csm`](https://github.com/Cray-HPE/docs-csm/blob/release/1.6/update_product_stream/README.md#check-for-latest-documentation)
is installed for the target CSM version being installed or upgraded.
## 2. Install the latest documentation

**`NOTE`** When using IUF to upgrade CSM, please skip this subsection.
Ensure that the [latest version of `docs-csm`](https://github.com/Cray-HPE/docs-csm/blob/release/1.6/update_product_stream/README.md#check-for-latest-documentation)
is installed. If CSM is being upgraded, install the **target** version of the CSM documentation.

For example: when upgrading from CSM version 1.5.0 to version 1.5.1, install `docs-csm-1.5.1.noarch`
For example, when upgrading from CSM version 1.5.0 to version 1.6.0, install `docs-csm-1.6.0.noarch`.
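
A minimal sketch of installing the RPM, assuming it has already been downloaded to the current directory (the exact file name depends on the release):

```bash
rpm -Uvh docs-csm-1.6.0-*.noarch.rpm
```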

## 2. Use of `iuf activity`
## 3. Use of `iuf activity`

**`NOTE`** This section is informational only. There are no operations to perform.

Expand All @@ -60,7 +60,7 @@ iuf -a "${ACTIVITY_NAME}" activity --create --comment "download complete" waitin

The install and upgrade workflow instructions will not use `iuf activity` in this manner, deferring to the administrator to use it as desired.
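
For example, a sketch of displaying the recorded activity states afterwards (the exact output format may vary between IUF versions):

```bash
iuf -a "${ACTIVITY_NAME}" activity
```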

## 3. Save system state before upgrade
## 4. Save system state before upgrade

(`ncn-m001#`) Before performing the install/upgrade, it is important to save specific system state information so that it can be referenced later if needed.
Run the script below to save the state information. The information gathered by this script is SAT status, SAT site information,
@@ -70,7 +70,7 @@ HSN status, Ceph status, and SDU and RDA configurations. This information will b
/usr/share/doc/csm/upgrade/scripts/upgrade/util/pre-upgrade-status.sh
```

## 4. Next steps
## 5. Next steps

- If performing an initial install or an upgrade of non-CSM products only, return to the
[Install or upgrade additional products with IUF](install_or_upgrade_additional_products_with_iuf.md)
6 changes: 3 additions & 3 deletions operations/kubernetes/Troubleshoot_Postgres_Database.md
@@ -306,7 +306,7 @@ For example:
Re-run the following command until it succeeds and reports that the leader pod is `running`.
```bash
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
@@ -328,7 +328,7 @@ For example:
1. (`ncn-mw#`) Determine which pods are reporting lag.
```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
@@ -352,7 +352,7 @@ For example:
1. (`ncn-mw#`) Once the pods restart, verify that the lag has resolved.
```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
6 changes: 2 additions & 4 deletions operations/node_management/Replace_a_Compute_Blade.md
@@ -144,7 +144,7 @@ Replace an HPE Cray EX liquid-cooled compute blade.
## Power on and boot the compute nodes
1. (`ncn-mw#`) Un-suspend the `hms-discovery` cronjob in Kubernetes.
1. (`ncn-mw#`) If the `hms-discovery` cronjob in Kubernetes was suspended when shutting down the blade, unsuspend it now:
```bash
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
@@ -180,7 +180,7 @@ Replace an HPE Cray EX liquid-cooled compute blade.
1. Wait for 3-5 minutes for the blade to power on and the node BMCs to be discovered.
1. (`ncn-mw#`) Verify that the affected nodes are enabled in the HSM.
1. (`ncn-mw#`) Verify that the affected nodes are enabled in the HSM (repeat the example below for all affected nodes).
```bash
cray hsm state components describe x1000c3s0b0n0 --format toml
@@ -228,8 +228,6 @@ Replace an HPE Cray EX liquid-cooled compute blade.
- If the last discovery state is `HTTPsGetFailed` or `ChildVerificationFailed`, then an error has
occurred during the discovery process.
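
A sketch of rechecking the discovery status for a node BMC (the xname is an example; adjust it for the affected blade):

```bash
cray hsm inventory redfishEndpoints describe x1000c3s0b0 --format toml
```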
1. Enable each node individually in the HSM database (in this example, the nodes are `x1000c3s0b0n0`-`n3`).
1. (`ncn-mw#`) Rediscover the components in the chassis (the example shows cabinet 1000, chassis 3).
```bash
@@ -330,7 +330,7 @@ Some systems are configured with lazy mounts that do not have this requirement f
To resolve the space issue, see [Troubleshoot Ceph OSDs Reporting Full](../utility_storage/Troubleshoot_Ceph_OSDs_Reporting_Full.md).
1. (`ncn-m001#`) Check that `spire` pods have started.
1. (`ncn-m001#`) Check that `spire` and `cray-spire` pods have started.
Monitor the status of the `spire-jwks` pods to ensure they restart and enter the `Running` state.
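
One way to watch these pods is sketched below (the procedure's exact command may differ):

```bash
kubectl get pods -n spire -o wide | grep jwks
```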
@@ -341,6 +341,9 @@ Some systems are configured with lazy mounts that do not have this requirement f
Example output:
```text
cray-spire-jwks-57bbb4f5c7-57j5k 2/3 CrashLoopBackOff 9 23h 10.44.0.31 ncn-w002 <none> <none>
cray-spire-jwks-57bbb4f5c7-crb2m 2/3 CrashLoopBackOff 9 23h 10.36.0.34 ncn-w003 <none> <none>
cray-spire-jwks-57bbb4f5c7-lq9ar 2/3 CrashLoopBackOff 9 23h 10.39.0.5 ncn-w001 <none> <none>
spire-jwks-6b97457548-gc7td 2/3 CrashLoopBackOff 9 23h 10.44.0.117 ncn-w002 <none> <none>
spire-jwks-6b97457548-jd7bd 2/3 CrashLoopBackOff 9 23h 10.36.0.123 ncn-w003 <none> <none>
spire-jwks-6b97457548-lvqmf 2/3 CrashLoopBackOff 9 23h 10.39.0.79 ncn-w001 <none> <none>
@@ -352,6 +355,12 @@ Some systems are configured with lazy mounts that do not have this requirement f
kubectl rollout restart -n spire deployment spire-jwks
```
1. (`ncn-m001#`) If the `cray-spire-jwks` pods indicate `CrashLoopBackOff`, then restart the Cray Spire deployment.
```bash
kubectl rollout restart -n spire deployment cray-spire-jwks
```
1. (`ncn-m001#`) Rejoin Spire on the worker and master NCNs to avoid issues with Spire tokens.
```bash
4 changes: 4 additions & 0 deletions upgrade/scripts/common/ncn-rebuild-common.sh
@@ -85,6 +85,10 @@ else
echo "====> ${state_name} has been completed"
fi

if [[ ${target_ncn} == ncn-m* ]]; then
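# drain_node (defined elsewhere in the upgrade scripts) is expected to cordon the master node
# and evict its pods before rebuild; this step was moved here from ncn-rebuild-master-nodes.sh.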
drain_node ${target_ncn}
fi

state_name="SET rd.live.dir AND rd.live.overlay.reset"
state_recorded=$(is_state_recorded "${state_name}" "${target_ncn}")
if [[ ${state_recorded} == "0" ]]; then
2 changes: 0 additions & 2 deletions upgrade/scripts/rebuild/ncn-rebuild-master-nodes.sh
@@ -136,8 +136,6 @@ if [[ ${first_master_hostname} == ${target_ncn} ]]; then
fi
fi

drain_node $target_ncn

# Validate SLS health before calling csi handoff bss-update-*, since
# it relies on SLS
check_sls_health
4 changes: 4 additions & 0 deletions upgrade/scripts/upgrade/csm-upgrade.sh
@@ -206,6 +206,10 @@ if [[ $state_recorded == "0" ]] && k8s_job_exists "${ns}" "${job_name}"; then
--insecure-skip-tls-verify-backend --tail=-1 \
-l 'app.kubernetes.io/instance in (cray-bos, cray-bos-db)' > "${K8S_POD_LOGS}"

# Apply fix for CASMCMS-9234
echo "Applying fix for CASMCMS-9234, if needed"
"${basedir}/workarounds/CASMCMS-9234/fix.sh" "${SNAPSHOT_DIR}"

SNAPSHOT_DIR_BASENAME=$(basename "${SNAPSHOT_DIR}")
TARFILE_BASENAME="${SNAPSHOT_DIR_BASENAME}.tgz"
TARFILE_FULLPATH="/tmp/${TARFILE_BASENAME}"
16 changes: 16 additions & 0 deletions upgrade/scripts/upgrade/ncn-upgrade-ceph-nodes.sh
@@ -81,6 +81,22 @@ else
echo "====> ${state_name} has been completed"
fi

state_name="CLEANUP_LIVE_IMAGES"
state_recorded=$(is_state_recorded "${state_name}" ${target_ncn})
if [[ $state_recorded == "0" ]]; then
echo "====> ${state_name} ..."
{
if [[ $ssh_keys_done == "0" ]]; then
ssh_keygen_keyscan "${target_ncn}"
ssh_keys_done=1
fi
ssh ${target_ncn} "/srv/cray/scripts/metal/cleanup-live-images.sh -y"
} >> ${LOG_FILE} 2>&1
record_state "${state_name}" ${target_ncn}
else
echo "====> ${state_name} has been completed"
fi

${basedir}/../common/ncn-rebuild-common.sh $target_ncn

state_name="INSTALL_TARGET_SCRIPT"
2 changes: 0 additions & 2 deletions upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh
@@ -139,8 +139,6 @@ fi
fi
} >> ${LOG_FILE} 2>&1

drain_node $target_ncn

# Validate SLS health before calling csi handoff bss-update-*, since
# it relies on SLS
check_sls_health >> "${LOG_FILE}" 2>&1
55 changes: 48 additions & 7 deletions upgrade/scripts/upgrade/prerequisites.sh
@@ -71,6 +71,23 @@ if [[ -z ${CSM_RELEASE} ]]; then
exit 1
fi

# Make sure this prerequisites.sh version matches the CSM_RELEASE version
# This prerequisites script should only be run if the docs-csm and CSM_RELEASE versions match
docs_rpm_vers=$(rpm -qa docs-csm)
docs_trimmed_vers=${docs_rpm_vers#"docs-csm-"}
docs_spaced_vers=$(echo "$docs_trimmed_vers" | tr "." " ")
docs_major=$(echo "$docs_spaced_vers" | awk '{ print $1 }')
docs_minor=$(echo "$docs_spaced_vers" | awk '{ print $2 }')
csm_release_spaced=$(echo "$CSM_RELEASE" | tr "." " ")
csm_release_major=$(echo "$csm_release_spaced" | awk '{ print $1 }')
csm_release_minor=$(echo "$csm_release_spaced" | awk '{ print $2 }')

if [[ $docs_major -ne $csm_release_major ]] || [[ $docs_minor -ne $csm_release_minor ]]; then
echo "ERROR This version of the 'prerequisites.sh' script should be run when upgrading to CSM ${docs_major}.${docs_minor}."
echo "ERROR Make sure docs-csm for CSM $CSM_RELEASE is installed so that the correct prerequisites.sh script is used."
exit 1
fi

if [[ -z ${CSM_ARTI_DIR} ]]; then
echo "CSM_ARTI_DIR environment variable has not been set"
echo "make sure you have run: prepare-assets.sh"
@@ -532,7 +549,8 @@ if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then

# Skopeo image is stored as "skopeo:csm-${CSM_RELEASE}", which may resolve to docker.io/library/skopeo or quay.io/skopeo, depending on configured shortcuts
SKOPEO_IMAGE=$(podman load -q -i "${CSM_ARTI_DIR}/vendor/skopeo.tar" 2> /dev/null | sed -e 's/^.*: //')
nexus_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | contains("nexus"))')
# Grab nexus and docker-kubectl images from cacheImages list, remove duplicates.
nexus_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | contains("nexus", "docker-kubectl"))' | sort | uniq)
worker_nodes=$(grep -oP "(ncn-w\d+)" /etc/hosts | sort -u)
while read -r nexus_image; do
echo "Uploading $nexus_image into Nexus ..."
@@ -605,8 +623,9 @@ state_name="PRECACHE_ISTIO_IMAGES"
state_recorded=$(is_state_recorded "${state_name}" "$(hostname)")
if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then
echo "====> ${state_name} ..." | tee -a "${LOG_FILE}"
# Grab istio and docker-kubectl images from cacheImages list, remove duplicates.
{
istio_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | (contains("istio") or contains("docker-kubectl")))')
istio_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | (contains("istio", "docker-kubectl")))' | sort | uniq)
worker_nodes=$(grep -oP "(ncn-w\d+)" /etc/hosts | sort -u)
while read -r istio_image; do
while read -r worker_node; do
@@ -730,6 +749,14 @@ if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then
fi
fi

# check if the cray-certmanager-issuers chart failed to deploy
# this will be entered if the certmanager upgrade failed on or before
# the certmanager-issuer chart install
if ! helm history -n cert-manager cray-certmanager-issuers > /dev/null 2>&1; then
printf "note: no helm install exists for cert-manager-issuers. Cert-manager upgrade is needed to install cert-manager-issuers\n"
((needs_upgrade += 1))
fi

# cert-manager will need to be upgraded if cray-drydock version is less than 2.18.4.
# This will only be the case in some CSM 1.6 to CSM 1.6 upgrades.
# It only needs to be checked if cert-manager is not already being upgraded.
@@ -757,13 +784,13 @@ fi
fi
fi

# make this name unique for CSM 1.6 in case CSM 1.5 secret still exists
backup_secret="cm-restore-data-16"

# Only run if we need to and detected not 1.12.9 or ""
if [ "${needs_upgrade}" -gt 0 ]; then
cmns="cert-manager"

# make this name unique for CSM 1.6 in case CSM 1.5 secret still exists
backup_secret="cm-restore-data-16"

# We need to backup before any helm uninstalls.
needs_backup=0

Expand Down Expand Up @@ -882,9 +909,23 @@ EOF
# The warning statement above needs to stay a warning. It does not exit 0 because Issuers should already exist.
# 5 is an arbitrary number, expect ~21 certificates
if [[ $(kubectl get certificates -A | wc -l) -lt 5 ]]; then
echo "ERROR: certificates were not restored after certmanager upgrade. 'kubectl get certificates -A' does not show certificates."
echo "WARNING: certificates were not restored after certmanager upgrade. 'kubectl get certificates -A' does not show certificates."
echo "Certificates should have been restored from backup: 'kubectl get secret ${backup_secret?}'"
exit 1
if helm history -n cert-manager cray-certmanager-issuers > /dev/null 2>&1 && helm history -n cert-manager cray-certmanager > /dev/null 2>&1; then
echo "cray-certmanager and cray-certmanager-issuers have been installed. Attempting to restore cert-manager backup"
if kubectl get secret "${backup_secret?}" > /dev/null 2>&1; then
kubectl get secret "${backup_secret?}" -o jsonpath='{.data.data}' | base64 -d | kubectl apply -f -
fi
if [[ $(kubectl get certificates -A | wc -l) -lt 5 ]]; then
echo "ERROR: certificates failed to restore. 'kubectl get certificates -A' does not show certificates."
exit 1
else
echo "Certificates were successfully restored"
fi
else
echo "ERROR: cray-certmanager and/or cray-certmanager-issers charts failed to deploy"
exit 1
fi
fi
# delete CSM 1.5 cert-manager backup if it exists
backup_secret_csm_15="cm-restore-data"
