Jupyter Controller reconcileHandler incorrectly reports ERRORs when reconciling #409

Open
nishant-dash opened this issue Oct 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@nishant-dash

nishant-dash commented Oct 1, 2024

Bug Description

During various notebook reconciliations, the reconcile handler repeatedly reports the same error (see the log output below).

However, the error is not critical, since the notebook eventually reconciles successfully.
This red herring in the logs interferes with the JupyterControllerRuntimeReconciliationErrorsExceedThreshold alert, which ends up firing almost all the time.

I filed an upstream issue: kubeflow/notebooks#62

To Reproduce

N/A

Environment

ckf 1.9/stable
jupyter controller 1.9/stable 1038

Relevant Log Output

2024-10-01T13:07:25.847Z [jupyter-controller] 1.7277880458479142e+09	ERROR	controller.notebook	Reconciler error	{"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "name": "mini-notebook-gpu", "namespace": "<REDACTED>", "error": "Operation cannot be fulfilled on notebooks.kubeflow.org \"mini-notebook-gpu\": the object has been modified; please apply your changes to the latest version and try again"}
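
For context, the "object has been modified" message is the API server rejecting an update that was built from a stale copy of the object. The following is a minimal sketch of that behaviour using a toy in-memory store; it is not the controller's code, and `Store`, `reconcile`, `reconcile_with_requeue`, and `other_writer` are hypothetical names invented for illustration. It shows how one racing writer produces one counted error even though the next pass succeeds.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale version of the object."""


class Store:
    """Toy stand-in for an API server holding a single object (assumption)."""

    def __init__(self):
        self.version = 1
        self.spec = {}

    def get(self):
        # Return a snapshot: the current version plus a copy of the data.
        return self.version, dict(self.spec)

    def update(self, seen_version, spec):
        # Reject writes built from a snapshot that is no longer current.
        if seen_version != self.version:
            raise Conflict("the object has been modified")
        self.spec = dict(spec)
        self.version += 1


def reconcile(store, key, value, interfere=None):
    """One reconcile pass: read, possibly get raced by another writer, write."""
    version, spec = store.get()
    if interfere is not None:
        interfere()  # another writer updates the object between read and write
    spec[key] = value
    store.update(version, spec)


def reconcile_with_requeue(store, key, value, interfere_once):
    """Repeat the pass until it succeeds, counting every failed pass."""
    errors = 0
    pending = [interfere_once]  # the race happens on the first pass only
    while True:
        try:
            reconcile(store, key, value, pending.pop() if pending else None)
            return errors
        except Conflict:
            errors += 1  # this is what would show up as a logged ERROR


store = Store()


def other_writer():
    # Hypothetical second actor, e.g. something updating status concurrently.
    version, spec = store.get()
    spec["status"] = "ready"
    store.update(version, spec)


errors = reconcile_with_requeue(store, "replicas", 1, other_writer)
print(errors, store.version, store.spec)
```

In this sketch the first pass fails and the second succeeds, which matches the report that the notebook eventually reconciles while the error is still logged and counted.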

Additional Context

No response

@nishant-dash nishant-dash added the bug Something isn't working label Oct 1, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6343.

This message was autogenerated

@orfeas-k
Contributor

orfeas-k commented Oct 25, 2024

#412 removes this alert rule, since it will always be in the Firing state as long as the upstream controller produces those errors. The change will be backported to 1.9 too. Once the upstream issue is fixed, we should restore the alert rule (reverting the changes from the PR). Here is the removed alert rule:

alert: JupyterControllerRuntimeReconciliationErrorsExceedThreshold
expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0
for: 0m
labels:
  severity: critical
annotations:
  summary: Total number of reconciliation errors per controller
  description: >
    Total number of reconciliation errors per controller
    LABELS = {{ $labels }}

We will keep this issue open to track the upstream one.
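
To illustrate why the removed rule is so sensitive, here is a rough sketch of how an expression of the form `rate(counter[5m]) > 0` behaves when combined with `for: 0m`. It uses a simplified rate (last sample minus first sample, divided by elapsed time) and ignores the counter-reset handling and extrapolation that Prometheus applies; the sample data, the function names, and the `0.01` threshold are made up for illustration and are not a proposed replacement rule.

```python
def per_second_rate(samples):
    """Simplified per-second rate over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def fires(samples, threshold=0.0):
    """With `for: 0m`, the alert fires as soon as the expression is true."""
    return per_second_rate(samples) > threshold


# Hypothetical scrapes of the error counter, one per minute.
quiet = [(0, 10), (60, 10), (120, 10), (180, 10), (240, 10)]
one_conflict = [(0, 10), (60, 10), (120, 11), (180, 11), (240, 11)]

print(fires(quiet))                         # no errors in the window
print(fires(one_conflict))                  # a single transient conflict
print(fires(one_conflict, threshold=0.01))  # same data, non-zero threshold
```

Under these assumptions a single transient conflict inside the window is enough to make the expression true, so any steady trickle of such errors keeps the alert firing.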

orfeas-k added a commit that referenced this issue Nov 6, 2024
This comments out an alert rule that's always `Firing` due to kubeflow/notebooks#62

Ref #409
@orfeas-k
Contributor

orfeas-k commented Nov 7, 2024

@nishant-dash We commented out the alert rule, and the change has been promoted to 1.9/stable, so once you refresh, the rule will no longer be there. As mentioned above, we are keeping this issue open to track the upstream issue.
