CRAYSAT-1857:bootsys stage to wait for kyverno before job recreation #225

Open
wants to merge 1 commit into
base: feature/CRAYSAT-1740
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -41,6 +41,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
`platform-services` stage of `sat bootsys boot`.
- Added a function to mount s3fs and ceph post ceph health status check on ncn-m001 in
ncn power stage
- Added a function to check Kyverno availability before recreating cronjobs in
the `platform-services` stage

### Fixed
- Updated `sat bootsys` to increase the default management NCN shutdown timeout
38 changes: 38 additions & 0 deletions sat/cli/bootsys/platform.py
@@ -33,7 +33,9 @@
from multiprocessing import Process

from csm_api_client.k8s import load_kube_api
from kubernetes import client, config
from kubernetes.client import BatchV1Api
from kubernetes.client.rest import ApiException
from kubernetes.config import ConfigException
from paramiko import SSHException

@@ -487,6 +489,42 @@ def do_kubelet_start(ncn_groups):
        raise FatalPlatformError("Kubernetes API not available after timeout")
    LOGGER.info("Kubernetes API is available")

    # Wait for Kyverno pods to be ready
    wait_for_kyverno_pods()


def wait_for_kyverno_pods(timeout=300, interval=10):
    """Wait for Kyverno pods to be up and running.
    Args:
        timeout (int): Maximum time to wait for the pods to be ready.
        interval (int): Time interval between checks.
    Raises:
        FatalPlatformError: if Kyverno pods are not ready after timeout.
    """
    config.load_kube_config()  # Ensure your kubeconfig is properly set up
    v1 = client.CoreV1Api()
    namespace = 'kyverno'
    end_time = time.time() + timeout

    LOGGER.info(f"Waiting up to {timeout} seconds for Kyverno pods to be ready in namespace {namespace}")

    while time.time() < end_time:
        try:
            pods = v1.list_namespaced_pod(namespace)
            all_running = all(pod.status.phase == 'Running' for pod in pods.items)
            if all_running:
                LOGGER.info("All Kyverno pods are up and running")
                return True
        except ApiException as e:
            LOGGER.error(f"Exception when checking Kyverno pods: {e}")

        LOGGER.info(f"Kyverno pods are not ready yet. Waiting for {interval} seconds before retrying...")
        time.sleep(interval)

    raise FatalPlatformError(f"Kyverno pods not ready after {timeout} seconds")


def do_recreate_cronjobs(_):
    """Recreate cronjobs that are not being scheduled on time."""
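
Review note on `wait_for_kyverno_pods`: the check `all(pod.status.phase == 'Running' for pod in pods.items)` is true for an empty pod list, so the wait could return immediately if no Kyverno pods have been scheduled yet. A pod in phase `Running` may also still be failing its readiness probe, and a completed Job pod in phase `Succeeded` (if the namespace contains any) would never satisfy the check. The following is a minimal sketch of a stricter predicate plus a small polling helper, not the code in this PR. The names `pod_is_ready`, `kyverno_pods_ready`, `wait_until`, and `fake_pod` are hypothetical and introduced only for illustration. The sketch relies only on the `status.phase` and `status.conditions` fields of the pod objects.

```python
import time
from types import SimpleNamespace


def pod_is_ready(pod):
    """Return True if the pod is Running and reports a Ready=True condition."""
    if pod.status.phase != 'Running':
        return False
    # status.conditions can be None before the kubelet has reported anything.
    conditions = pod.status.conditions or []
    return any(c.type == 'Ready' and c.status == 'True' for c in conditions)


def kyverno_pods_ready(pods):
    """Return True only if at least one active pod exists and all are ready."""
    # Pods of finished Jobs stay in phase 'Succeeded' and never return to
    # 'Running', so they are left out of the check (an assumption about
    # what may be present in the namespace).
    active = [p for p in pods if p.status.phase != 'Succeeded']
    # all() over an empty list is True, so require at least one pod.
    return bool(active) and all(pod_is_ready(p) for p in active)


def wait_until(predicate, timeout=300, interval=10,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. A monotonic clock is
    used so that wall-clock adjustments do not shorten or extend the wait.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def fake_pod(phase, ready=None):
    """Build a stand-in pod object for trying the predicate without a cluster."""
    conditions = None
    if ready is not None:
        conditions = [SimpleNamespace(type='Ready',
                                      status='True' if ready else 'False')]
    return SimpleNamespace(status=SimpleNamespace(phase=phase,
                                                  conditions=conditions))
```

With the real client, the predicate passed to `wait_until` could wrap `v1.list_namespaced_pod('kyverno').items` and catch `ApiException`, and the caller could raise `FatalPlatformError` when `wait_until` returns `False`. Whether readiness conditions or only pod phase should gate cronjob recreation is a design choice for the PR author.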