training-operator cannot be upgraded from 1.7/stable to recent version #170

Open
DnPlas opened this issue Jun 26, 2024 · 1 comment
Labels
bug Something isn't working

Comments

DnPlas (Contributor) commented Jun 26, 2024

Bug Description

Due to #161, the latest training-operator charm should only have one container in the scheduled Pod, the one running the charm, since the workload is now deployed and scheduled separately.
Because of https://bugs.launchpad.net/juju/+bug/1991955, when upgrading the charm from 1.7/stable to the most recent version, the Pod still has two containers instead of one after running juju refresh training-operator --channel latest/edge.
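
One way to confirm which containers the charm Pod ended up with (the namespace below is an assumption based on the "testing" model shown in the log output):

# List the containers of the charm Pod; after the broken upgrade this prints
# both the charm container and the leftover workload container
$ kubectl get pod training-operator-0 -n testing -o jsonpath='{.spec.containers[*].name}'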

To Reproduce

  1. Deploy: juju deploy training-operator --channel 1.7/stable --trust
  2. Refresh: juju refresh training-operator --channel latest/edge/pr-162 (this works right now) or juju refresh training-operator --channel latest/edge (this will only work once #167, "refactor, chore: refactor charm to use Deployment for workload, also bumps training-operator 1.7->1.8", is merged)
  3. Observe the Pods (see the command sketch below)
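
Putting the steps together, a minimal reproduction sketch (the channel and model name are assumptions for illustration):

# Deploy the old charm revision
$ juju deploy training-operator --channel 1.7/stable --trust

# Refresh to the new revision
$ juju refresh training-operator --channel latest/edge

# Observe: the old Pod still reports 2/2 containers and a new Deployment-managed Pod appears
$ kubectl get pods -A | grep training-operator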

Environment

microk8s 1.29/stable
juju 3.5.0

Relevant Log Output

# Getting the pods after a refresh shows the training-operator-0 Pod at 2/2,
# i.e. it still contains both the charm and the workload container.
# Then, because of the refactor, there is another Pod, training-operator-7f97689fcf-2zshp,
# which also runs the workload.

$ kubectl get pods -A
testing                               training-operator-0                      2/2     Running   0          10m
testing                               training-operator-7f97689fcf-2zshp       1/1     Running   0          9m59s

Additional Context

This issue is currently affecting the test_upgrade integration test case and the upgrade path.

Workarounds

So far, the only workaround that has worked is to remove the operator and re-deploy it.
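
A sketch of the workaround, assuming the application name training-operator and the latest/edge channel as the re-deploy target:

# Remove the existing application (and its stale Pod)
$ juju remove-application training-operator

# Re-deploy on the new channel so only the Deployment-managed workload Pod is created
$ juju deploy training-operator --channel latest/edge --trust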

DnPlas added the bug label Jun 26, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5901.

This message was autogenerated

DnPlas added a commit that referenced this issue Jun 26, 2024
#170 is affecting the execution of this test, but since the fix
is on juju, there is not much we can do at the moment other than
skipping the test.

Part of #170
DnPlas added a commit that referenced this issue Jul 1, 2024
* tests: skip test_upgrade due to #170

#170 is affecting the execution of this test, but since the fix
is on juju, there is not much we can do at the moment other than
skipping the test.

Part of #170
DnPlas added a commit that referenced this issue Jul 4, 2024
* tests: skip test_upgrade due to #170

#170 is affecting the execution of this test, but since the fix
is on juju, there is not much we can do at the moment other than
skipping the test.

Part of #170
DnPlas added a commit that referenced this issue Jul 9, 2024
…` for workload, also bumps training-operator 1.7->1.8 (#167)

* pin integration test dependencies, refactor constants in tests (#164)
* refactor: deploy the training-operator with kubernetes resources (#161)
* chore: bump training-operator v1.7 -> v1.8 (#162)
* refactor: apply a workload Service instead of using juju created one (#173)
* tests: skip test_upgrade due to #170 (#171)
* build, tests: bump charmed-kubeflow-chisme 0.4.0 -> 0.4.1 (#172)

Fixes #159