CRAYSAT-1878: Remove automatic cronjob recreation from bootsys #244

@haasken-hpe haasken-hpe commented Jul 23, 2024

Summary and Scope

Remove the step that automatically checks for and re-creates stuck
Kubernetes CronJobs from the platform-services stage of sat bootsys boot.
This step should no longer be necessary as of Kubernetes 1.21,
which made the new CronJobControllerV2 the default.

In addition, improve the logic of the HMSDiscoveryScheduledWaiter, so
that it will more reliably detect when an hms-discovery Job has been
scheduled for the CronJob. Pass in an explicit start_time, so that we
can look for any jobs created for the CronJob after it is re-enabled.
This ensures we won't miss the first one, which could be scheduled
between when we set suspend=False on the CronJob and when we create
the HMSDiscoveryScheduledWaiter.
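
The start_time idea described above can be sketched roughly as follows. This is a minimal illustration and not the actual SAT implementation: JobInfo and first_job_created_since are hypothetical names, and the job list is assumed to have already been fetched from Kubernetes (for example, by label selector). It also assumes job creation timestamps have whole-second precision, so start_time is truncated to the second before comparing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class JobInfo:
    """Hypothetical stand-in for the fields of a Kubernetes Job we care about."""
    name: str
    creation_time: datetime  # timezone-aware, assumed whole-second precision


def first_job_created_since(jobs: Iterable[JobInfo],
                            start_time: datetime) -> Optional[JobInfo]:
    """Return the earliest job created at or after start_time, or None.

    start_time should be captured *before* the CronJob is resumed, so a job
    created right after un-suspending is still counted.
    """
    # Truncate to whole seconds so a job created in the same second as
    # start_time is not excluded by sub-second differences (assumption:
    # creation timestamps carry no fractional seconds).
    threshold = start_time.replace(microsecond=0)
    candidates = [job for job in jobs if job.creation_time >= threshold]
    return min(candidates, key=lambda job: job.creation_time, default=None)
```

A caller would record start_time = datetime.now(timezone.utc), set suspend=False on the CronJob, and then poll first_job_created_since until it returns a job or a timeout expires.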

Issues and Related PRs

Testing

Tested on:

  • rocket

Test description:

Tested on rocket as follows:

  • Suspended the hms-discovery CronJob
  • Ran sat bootsys boot --stage cabinet-power
  • Verified that it correctly identified when the CronJob was scheduled

Risks and Mitigations

This change should be low-risk. It removes functionality that has caused more problems than it solved. If needed, the CronJob re-creation can still be performed manually as documented.

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1878-remove-automatic-cronjob-recreation branch 2 times, most recently from cbb6bdc to 116f2b8 Compare July 24, 2024 22:16
@haasken-hpe
Contributor Author

Testing on rocket has been completed. The step that un-suspends the hms-discovery cronjob and waits for a job to be scheduled now completes very quickly in my testing thanks to the minor tweaks made here.

Before executing the sat bootsys boot --stage cabinet-power command:

ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   True      0        8m5s            8h
ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        9m45s
hms-discovery-28697655   0/1           8m15s      8m15s

Executing the command:

ncn-m001:~/haasken # sat bootsys boot --stage cabinet-power
INFO: Resuming cronjob hms-discovery in namespace services.
INFO: Waiting for cronjob hms-discovery in namespace services to be scheduled.
INFO: Waiting for ComputeModules in liquid-cooled cabinets to be powered on.
INFO: All ComputeModules have reached powered on state.

Looking at the cronjob and jobs afterwards:

ncn-m001:~ # kubectl get jobs -n services -l cronjob-name=hms-discovery
NAME                     COMPLETIONS   DURATION   AGE
...
hms-discovery-28697652   1/1           81s        10m
hms-discovery-28697655   0/1           9m         9m
hms-discovery-28697661   0/1           14s        14s
ncn-m001:~ # kubectl get cronjobs -n services hms-discovery
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   False     1        3m39s           8h
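
As a side note on reading the output above: the numeric suffix on each Job name appears to be the scheduled time in minutes since the Unix epoch, which is how the CronJob controller names the Jobs it creates as far as I know. That is a controller implementation detail rather than a documented API guarantee, so the helper below (a hypothetical name, not part of SAT) is only a rough sketch for eyeballing when a Job was scheduled.

```python
from datetime import datetime, timezone


def scheduled_time_from_job_name(job_name: str) -> datetime:
    """Decode the scheduled time from a CronJob-created Job name.

    Assumes the name has the form "<cronjob-name>-<minutes since epoch>".
    Raises ValueError if the suffix is missing or not an integer.
    """
    prefix, sep, suffix = job_name.rpartition("-")
    if not sep or not suffix.isdigit():
        raise ValueError(f"unexpected job name format: {job_name!r}")
    return datetime.fromtimestamp(int(suffix) * 60, tz=timezone.utc)
```

For example, scheduled_time_from_job_name("hms-discovery-28697661") gives 2024-07-24 22:21 UTC, which lines up with the */3 schedule.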

@haasken-hpe haasken-hpe marked this pull request as ready for review July 24, 2024 22:31
@annapoorna-s-alt annapoorna-s-alt left a comment

Looks good to me.

@shivaprasad-metimath shivaprasad-metimath left a comment

looks good

@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1878-remove-automatic-cronjob-recreation branch from 116f2b8 to 1123c52 Compare July 25, 2024 16:36
PyCharm was complaining about this. Make it happy.
@haasken-hpe haasken-hpe force-pushed the CRAYSAT-1878-remove-automatic-cronjob-recreation branch from 1123c52 to 97368b2 Compare July 25, 2024 16:39
@haasken-hpe haasken-hpe merged commit 5c2e7d6 into feature/CRAYSAT-1740 Jul 25, 2024
3 checks passed
@haasken-hpe haasken-hpe deleted the CRAYSAT-1878-remove-automatic-cronjob-recreation branch July 25, 2024 16:44