Commit

Merge branch 'release/1.6' of github.com:Cray-HPE/docs-csm into CASMPET-6863-rollback-issues
studenym-hpe committed Jan 2, 2025
2 parents 8d5d13b + ca89c9a commit b3f25ca
Showing 15 changed files with 240 additions and 48 deletions.
2 changes: 2 additions & 0 deletions .spelling
@@ -787,6 +787,8 @@ unregisters
unset
unsets
unsquashed
un-suspend
unsuspend
untainting
untarred
update-cfs-config
40 changes: 25 additions & 15 deletions operations/boot_orchestration/Components.md
@@ -6,19 +6,19 @@ This includes information on the desired state and some information on the curre
Component records are created automatically and will include any components found in the [Hardware State Manager (HSM)](../../glossary.md#hardware-state-manager-hsm).

* [BOS component fields](#bos-component-fields)
* [`actual_state`](#actual_state)
* [`desired_state`](#desired_state)
* [`staged_state`](#staged_state)
* [`enabled`](#enabled)
* [`error`](#error)
* [`event_stats`](#event_stats)
* [`last_action`](#last_action)
* [`session`](#session)
* [`status`](#status)
* [`actual_state`](#actual_state)
* [`desired_state`](#desired_state)
* [`staged_state`](#staged_state)
* [`enabled`](#enabled)
* [`error`](#error)
* [`event_stats`](#event_stats)
* [`last_action`](#last_action)
* [`session`](#session)
* [`status`](#status)
* [Managing BOS components](#managing-bos-components)
* [List all components](#list-all-components)
* [Show details for a component](#show-details-for-a-component)
* [Update a component](#update-a-component)
* [List all components](#list-all-components)
* [Show details for a component](#show-details-for-a-component)
* [Update a component](#update-a-component)

## BOS component fields

@@ -39,7 +39,16 @@ Stores information on the eventual desired boot artifacts and configuration for

### `enabled`

If the node is enabled (enabled == True), BOS will take action to make the actual state match the desired state. Disabled nodes may receive status updates from booted nodes, but BOS will not issue power commands to the nodes while they are disabled.
If the node is enabled (enabled == True), BOS will take action to make the actual state match the desired state. If a node is disabled (enabled == False), BOS will take no action against the node. This is an internal state
that BOS uses to track whether it has finished working on a node. Typically, users should **never** manually enable a node (enabled == True); BOS handles this during session creation.
Even if the BOS session is deleted, the nodes remain enabled in BOS, and BOS will continue to take action to make the nodes' actual states match their desired states.
Because of this, if an administrator wishes to stop BOS from taking such actions on a node, then they must disable it (enabled == False).
Thus, while it is still uncommon, it is more likely that users will disable nodes than enable them.

Even if a node is disabled in BOS, if it is booted, then BOS may receive status updates for it. However, BOS will not issue power commands to the nodes while they are disabled. These status updates come from `bos-reporter`, a small program
that runs periodically on the nodes and updates their actual state.

Both BOS and the Hardware State Manager (HSM) use the term disabled, but not consistently. When HSM reports a node as disabled, the node is out of service; this should not be confused with the BOS meaning of disabled.
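
For illustration, here is a sketch (the xname is a placeholder; it assumes the Cray CLI is authenticated and `jq` is available) of checking and changing a node's `enabled` state:

```bash
# Show whether BOS currently considers the node enabled
cray bos v2 components describe x3000c0s19b1n0 --format json | jq '.enabled'

# Disable the node so that BOS stops taking action on it
cray bos v2 components update x3000c0s19b1n0 --enabled False --format json
```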

### `error`

@@ -73,7 +82,7 @@ Stores the session ID of the session that is currently tracking the component.
This collection of fields stores status information that the BOS operators and other users can query to determine the status of the component. Status fields should generally not be manually updated and should be left to BOS. These fields include:

* `phase` - Describes the general phase of the boot process the component is currently in, such as `powering_on`, `powering_off` and `configuring`.
* `status` - A more specific description of where in the boot process the component is. This can be more detailed phases, such as `power_on_pending`, `power_on_called`, as well as final states such as `failed`.
* `status` - A more specific description of where in the boot process the component is. This can be more detailed phases, such as `power_on_pending`, `power_on_called`, as well as final states such as `failed`.
* `on_hold` is a special value that indicates BOS is re-evaluating the status of the component, such as when a component is re-enabled and BOS needs to collect new information from other services to determine the state of the component.
* `status_override` - A special status field that is used to override `status` when BOS would be unable to determine the status of the node with its current information. This includes the `on_hold` status.
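
As a sketch (the component xname is a placeholder; assumes the Cray CLI is authenticated and `jq` is available), the status fields of a single component can be inspected with:

```bash
# Print only the status sub-fields (phase, status, status_override) for one component
cray bos v2 components describe x3000c0s19b1n0 --format json | jq '.status'
```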

@@ -200,11 +209,12 @@ Example output:
### Update a component

Update a BOS component using `xname`. While most fields can be updated manually, users should restrict themselves to updating the `desired_state` and `enabled`. Altering other fields such as `status` or `last_action` may result in unintended behavior.
See the [`enabled`](#enabled) section for cautions about updating a component's `enabled` state.

(`ncn-mw#`):

```bash
cray bos v2 components update <XNAME> --enabled True --format json
cray bos v2 components update <XNAME> --enabled False --format json
```

Example output:
22 changes: 11 additions & 11 deletions operations/iuf/workflows/preparation.md
@@ -3,9 +3,10 @@
This section defines environment variables and directory content that is used throughout the workflow.

- [1. Prepare for the install or upgrade](#1-prepare-for-the-install-or-upgrade)
- [2. Use of `iuf activity`](#2-use-of-iuf-activity)
- [3. Save system state before upgrade](#3-save-system-state-before-upgrade)
- [4. Next steps](#4-next-steps)
- [2. Install the latest documentation](#2-install-the-latest-documentation)
- [3. Use of `iuf activity`](#3-use-of-iuf-activity)
- [4. Save system state before upgrade](#4-save-system-state-before-upgrade)
- [5. Next steps](#5-next-steps)

## 1. Prepare for the install or upgrade

Expand Down Expand Up @@ -35,15 +36,14 @@ This section defines environment variables and directory content that is used th

- Environment variables have been set and required IUF directories have been created

1. Ensure that the
[latest version of `docs-csm`](https://github.com/Cray-HPE/docs-csm/blob/release/1.6/update_product_stream/README.md#check-for-latest-documentation)
is installed for the target CSM version being installed or upgraded.
## 2. Install the latest documentation

**`NOTE`** When using IUF to upgrade CSM, please skip this subsection.
Ensure that the [latest version of `docs-csm`](https://github.com/Cray-HPE/docs-csm/blob/release/1.6/update_product_stream/README.md#check-for-latest-documentation)
is installed. If CSM is being upgraded, install the **target** version of the CSM documentation.

For example: when upgrading from CSM version 1.5.0 to version 1.5.1, install `docs-csm-1.5.1.noarch`
For example, when upgrading from CSM version 1.5.0 to version 1.6.0, install `docs-csm-1.6.0.noarch`.
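
A minimal sketch of installing the RPM, assuming it has already been downloaded to the current directory (the exact file name depends on the release):

```bash
rpm -Uvh docs-csm-1.6.0-*.noarch.rpm
```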

## 2. Use of `iuf activity`
## 3. Use of `iuf activity`

**`NOTE`** This section is informational only. There are no operations to perform.

Expand All @@ -60,7 +60,7 @@ iuf -a "${ACTIVITY_NAME}" activity --create --comment "download complete" waitin

The install and upgrade workflow instructions will not use `iuf activity` in this manner, deferring to the administrator to use it as desired.
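
For example, a sketch of displaying the recorded activity states afterwards (the exact output format may vary between IUF versions):

```bash
iuf -a "${ACTIVITY_NAME}" activity
```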

## 3. Save system state before upgrade
## 4. Save system state before upgrade

(`ncn-m001#`) Before performing the install/upgrade, it is important to save specific system state information so that it can be referenced later if needed.
Run the script below to save the state information. The information gathered by this script is SAT status, SAT site information,
@@ -70,7 +70,7 @@ HSN status, Ceph status, and SDU and RDA configurations. This information will b
/usr/share/doc/csm/upgrade/scripts/upgrade/util/pre-upgrade-status.sh
```

## 4. Next steps
## 5. Next steps

- If performing an initial install or an upgrade of non-CSM products only, return to the
[Install or upgrade additional products with IUF](install_or_upgrade_additional_products_with_iuf.md)
6 changes: 3 additions & 3 deletions operations/kubernetes/Troubleshoot_Postgres_Database.md
@@ -306,7 +306,7 @@ For example:
Re-run the following command until it succeeds and reports that the leader pod is `running`.
```bash
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
@@ -328,7 +328,7 @@ For example:
1. (`ncn-mw#`) Determine which pods are reporting lag.
```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
@@ -352,7 +352,7 @@ For example:
1. (`ncn-mw#`) Once the pods restart, verify that the lag has resolved.
```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```
Example output:
6 changes: 2 additions & 4 deletions operations/node_management/Replace_a_Compute_Blade.md
@@ -144,7 +144,7 @@ Replace an HPE Cray EX liquid-cooled compute blade.
## Power on and boot the compute nodes
1. (`ncn-mw#`) Un-suspend the `hms-discovery` cronjob in Kubernetes.
1. (`ncn-mw#`) If the `hms-discovery` cronjob in Kubernetes was suspended when shutting down the blade, unsuspend it now:
```bash
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
@@ -180,7 +180,7 @@ Replace an HPE Cray EX liquid-cooled compute blade.
1. Wait for 3-5 minutes for the blade to power on and the node BMCs to be discovered.
1. (`ncn-mw#`) Verify that the affected nodes are enabled in the HSM.
1. (`ncn-mw#`) Verify that the affected nodes are enabled in the HSM (repeat the example below for all affected nodes).
```bash
cray hsm state components describe x1000c3s0b0n0 --format toml
@@ -228,8 +228,6 @@ Replace an HPE Cray EX liquid-cooled compute blade.
- If the last discovery state is `HTTPsGetFailed` or `ChildVerificationFailed`, then an error has
occurred during the discovery process.
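
A sketch of rechecking the discovery status for a node BMC (the xname is an example; adjust it for the affected blade):

```bash
cray hsm inventory redfishEndpoints describe x1000c3s0b0 --format toml
```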
1. Enable each node individually in the HSM database (in this example, the nodes are `x1000c3s0b0n0`-`n3`).
1. (`ncn-mw#`) Rediscover the components in the chassis (the example shows cabinet 1000, chassis 3).
```bash
@@ -330,7 +330,7 @@ Some systems are configured with lazy mounts that do not have this requirement f
To resolve the space issue, see [Troubleshoot Ceph OSDs Reporting Full](../utility_storage/Troubleshoot_Ceph_OSDs_Reporting_Full.md).
1. (`ncn-m001#`) Check that `spire` pods have started.
1. (`ncn-m001#`) Check that `spire` and `cray-spire` pods have started.
Monitor the status of the `spire-jwks` pods to ensure they restart and enter the `Running` state.
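
One way to watch these pods is sketched below (the procedure's exact command may differ):

```bash
kubectl get pods -n spire -o wide | grep jwks
```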
@@ -341,6 +341,9 @@ Some systems are configured with lazy mounts that do not have this requirement f
Example output:
```text
cray-spire-jwks-57bbb4f5c7-57j5k 2/3 CrashLoopBackOff 9 23h 10.44.0.31 ncn-w002 <none> <none>
cray-spire-jwks-57bbb4f5c7-crb2m 2/3 CrashLoopBackOff 9 23h 10.36.0.34 ncn-w003 <none> <none>
cray-spire-jwks-57bbb4f5c7-lq9ar 2/3 CrashLoopBackOff 9 23h 10.39.0.5 ncn-w001 <none> <none>
spire-jwks-6b97457548-gc7td 2/3 CrashLoopBackOff 9 23h 10.44.0.117 ncn-w002 <none> <none>
spire-jwks-6b97457548-jd7bd 2/3 CrashLoopBackOff 9 23h 10.36.0.123 ncn-w003 <none> <none>
spire-jwks-6b97457548-lvqmf 2/3 CrashLoopBackOff 9 23h 10.39.0.79 ncn-w001 <none> <none>
@@ -352,6 +355,12 @@ Some systems are configured with lazy mounts that do not have this requirement f
kubectl rollout restart -n spire deployment spire-jwks
```
1. (`ncn-m001#`) If the `cray-spire-jwks` pods indicate `CrashLoopBackOff`, then restart the Cray Spire deployment.
```bash
kubectl rollout restart -n spire deployment cray-spire-jwks
```
1. (`ncn-m001#`) Rejoin Spire on the worker and master NCNs to avoid issues with Spire tokens.
```bash
4 changes: 4 additions & 0 deletions upgrade/scripts/common/ncn-rebuild-common.sh
@@ -85,6 +85,10 @@ else
echo "====> ${state_name} has been completed"
fi

if [[ ${target_ncn} == ncn-m* ]]; then
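# drain_node (defined elsewhere in the upgrade scripts) is expected to cordon the master node
# and evict its pods before rebuild; this step was moved here from ncn-rebuild-master-nodes.sh.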
drain_node ${target_ncn}
fi

state_name="SET rd.live.dir AND rd.live.overlay.reset"
state_recorded=$(is_state_recorded "${state_name}" "${target_ncn}")
if [[ ${state_recorded} == "0" ]]; then
2 changes: 0 additions & 2 deletions upgrade/scripts/rebuild/ncn-rebuild-master-nodes.sh
@@ -136,8 +136,6 @@ if [[ ${first_master_hostname} == ${target_ncn} ]]; then
fi
fi

drain_node $target_ncn

# Validate SLS health before calling csi handoff bss-update-*, since
# it relies on SLS
check_sls_health
4 changes: 4 additions & 0 deletions upgrade/scripts/upgrade/csm-upgrade.sh
@@ -206,6 +206,10 @@ if [[ $state_recorded == "0" ]] && k8s_job_exists "${ns}" "${job_name}"; then
--insecure-skip-tls-verify-backend --tail=-1 \
-l 'app.kubernetes.io/instance in (cray-bos, cray-bos-db)' > "${K8S_POD_LOGS}"

# Apply fix for CASMCMS-9234
echo "Applying fix for CASMCMS-9234, if needed"
"${basedir}/workarounds/CASMCMS-9234/fix.sh" "${SNAPSHOT_DIR}"

SNAPSHOT_DIR_BASENAME=$(basename "${SNAPSHOT_DIR}")
TARFILE_BASENAME="${SNAPSHOT_DIR_BASENAME}.tgz"
TARFILE_FULLPATH="/tmp/${TARFILE_BASENAME}"
16 changes: 16 additions & 0 deletions upgrade/scripts/upgrade/ncn-upgrade-ceph-nodes.sh
@@ -81,6 +81,22 @@ else
echo "====> ${state_name} has been completed"
fi

state_name="CLEANUP_LIVE_IMAGES"
state_recorded=$(is_state_recorded "${state_name}" ${target_ncn})
if [[ $state_recorded == "0" ]]; then
echo "====> ${state_name} ..."
{
if [[ $ssh_keys_done == "0" ]]; then
ssh_keygen_keyscan "${target_ncn}"
ssh_keys_done=1
fi
ssh ${target_ncn} "/srv/cray/scripts/metal/cleanup-live-images.sh -y"
} >> ${LOG_FILE} 2>&1
record_state "${state_name}" ${target_ncn}
else
echo "====> ${state_name} has been completed"
fi

${basedir}/../common/ncn-rebuild-common.sh $target_ncn

state_name="INSTALL_TARGET_SCRIPT"
2 changes: 0 additions & 2 deletions upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh
@@ -139,8 +139,6 @@ fi
fi
} >> ${LOG_FILE} 2>&1

drain_node $target_ncn

# Validate SLS health before calling csi handoff bss-update-*, since
# it relies on SLS
check_sls_health >> "${LOG_FILE}" 2>&1
55 changes: 48 additions & 7 deletions upgrade/scripts/upgrade/prerequisites.sh
@@ -71,6 +71,23 @@ if [[ -z ${CSM_RELEASE} ]]; then
exit 1
fi

# Make sure this prerequisites.sh version matches the CSM_RELEASE version
# This prerequisites script should only be run if the docs-csm and CSM_RELEASE versions match
docs_rpm_vers=$(rpm -qa docs-csm)
docs_trimmed_vers=${docs_rpm_vers#"docs-csm-"}
docs_spaced_vers=$(echo "$docs_trimmed_vers" | tr "." " ")
docs_major=$(echo "$docs_spaced_vers" | awk '{ print $1 }')
docs_minor=$(echo "$docs_spaced_vers" | awk '{ print $2 }')
csm_release_spaced=$(echo "$CSM_RELEASE" | tr "." " ")
csm_release_major=$(echo "$csm_release_spaced" | awk '{ print $1 }')
csm_release_minor=$(echo "$csm_release_spaced" | awk '{ print $2 }')

if [[ $docs_major -ne $csm_release_major ]] || [[ $docs_minor -ne $csm_release_minor ]]; then
echo "ERROR This version of the 'prerequisites.sh' script should be run when upgrading to CSM ${docs_major}.${docs_minor}."
echo "ERROR Make sure docs-csm for CSM $CSM_RELEASE is installed so that the correct prerequisites.sh script is used."
exit 1
fi

if [[ -z ${CSM_ARTI_DIR} ]]; then
echo "CSM_ARTI_DIR environment variable has not been set"
echo "make sure you have run: prepare-assets.sh"
@@ -532,7 +549,8 @@ if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then

# Skopeo image is stored as "skopeo:csm-${CSM_RELEASE}", which may resolve to docker.io/library/skopeo or quay.io/skopeo, depending on configured shortcuts
SKOPEO_IMAGE=$(podman load -q -i "${CSM_ARTI_DIR}/vendor/skopeo.tar" 2> /dev/null | sed -e 's/^.*: //')
nexus_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | contains("nexus"))')
# Grab nexus and docker-kubectl images from cacheImages list, remove duplicates.
nexus_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | contains("nexus", "docker-kubectl"))' | sort | uniq)
worker_nodes=$(grep -oP "(ncn-w\d+)" /etc/hosts | sort -u)
while read -r nexus_image; do
echo "Uploading $nexus_image into Nexus ..."
@@ -605,8 +623,9 @@ state_name="PRECACHE_ISTIO_IMAGES"
state_recorded=$(is_state_recorded "${state_name}" "$(hostname)")
if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then
echo "====> ${state_name} ..." | tee -a "${LOG_FILE}"
# Grab istio and docker-kubectl images from cacheImages list, remove duplicates.
{
istio_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | (contains("istio") or contains("docker-kubectl")))')
istio_images=$(yq r -j "${CSM_MANIFESTS_DIR}/platform.yaml" 'spec.charts.(name==cray-precache-images).values.cacheImages' | jq -r '.[] | select( . | (contains("istio", "docker-kubectl")))' | sort | uniq)
worker_nodes=$(grep -oP "(ncn-w\d+)" /etc/hosts | sort -u)
while read -r istio_image; do
while read -r worker_node; do
@@ -730,6 +749,14 @@ if [[ ${state_recorded} == "0" && $(hostname) == "${PRIMARY_NODE}" ]]; then
fi
fi

# check if the cray-certmanager-issuers chart failed to deploy
# this will be entered if the certmanager upgrade failed on or before
# the certmanager-issuer chart install
if ! helm history -n cert-manager cray-certmanager-issuers > /dev/null 2>&1; then
printf "note: no helm install exists for cert-manager-issuers. Cert-manager upgrade is needed to install cert-manager-issuers\n"
((needs_upgrade += 1))
fi

# cert-manager will need to be upgraded if cray-drydock version is less than 2.18.4.
# This will only be the case in some CSM 1.6 to CSM 1.6 upgrades.
# It only needs to be checked if cert-manager is not already being upgraded.
@@ -757,13 +784,13 @@ fi
fi
fi

# make this name unique for CSM 1.6 in case CSM 1.5 secret still exists
backup_secret="cm-restore-data-16"

# Only run if we need to and detected not 1.12.9 or ""
if [ "${needs_upgrade}" -gt 0 ]; then
cmns="cert-manager"

# make this name unique for CSM 1.6 in case CSM 1.5 secret still exists
backup_secret="cm-restore-data-16"

# We need to backup before any helm uninstalls.
needs_backup=0

Expand Down Expand Up @@ -882,9 +909,23 @@ EOF
# The warning statement above needs to stay a warning. It does not exit 0 because Issuers should already exist.
# 5 is an arbitrary number, expect ~21 certificates
if [[ $(kubectl get certificates -A | wc -l) -lt 5 ]]; then
echo "ERROR: certificates were not restored after certmanager upgrade. 'kubectl get certificates -A' does not show certificates."
echo "WARNING: certificates were not restored after certmanager upgrade. 'kubectl get certificates -A' does not show certificates."
echo "Certificates should have been restored from backup: 'kubectl get secret ${backup_secret?}'"
exit 1
if helm history -n cert-manager cray-certmanager-issuers > /dev/null 2>&1 && helm history -n cert-manager cray-certmanager > /dev/null 2>&1; then
echo "cray-certmanager and cray-certmanager-issuers have been installed. Attempting to restore cert-manager backup"
if kubectl get secret "${backup_secret?}" > /dev/null 2>&1; then
kubectl get secret "${backup_secret?}" -o jsonpath='{.data.data}' | base64 -d | kubectl apply -f -
fi
if [[ $(kubectl get certificates -A | wc -l) -lt 5 ]]; then
echo "ERROR: certificates failed to restore. 'kubectl get certificates -A' does not show certificates."
exit 1
else
echo "Certificates were successfully restored"
fi
else
echo "ERROR: cray-certmanager and/or cray-certmanager-issers charts failed to deploy"
exit 1
fi
fi
# delete CSM 1.5 cert-manager backup if it exists
backup_secret_csm_15="cm-restore-data"
