StepChainParentageFix ReqMgr2 task getting killed #11693
Comments
Alan, DBS provides a full report of what has happened and you need to use your python skills to make it human-friendly. Here is one suggestion:
Now you can easily read it and it is very clear. In particular, read
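For reference, a minimal sketch of such a formatter could look like the snippet below; the field names (`error`, `code`, `reason`, `message`) are assumptions about the payload layout, not the confirmed dbs2go schema:

```python
import json


def summarize_dbs_errors(payload):
    """Turn a raw DBS error payload (JSON string or parsed list/dict)
    into a short, human-readable summary, one line per error record.

    NOTE: the field names used here ('error', 'code', 'reason', 'message')
    are assumptions about the payload layout; adjust them to match the
    actual dbs2go response.
    """
    if isinstance(payload, str):
        payload = json.loads(payload)
    if isinstance(payload, dict):
        payload = [payload]

    lines = []
    for idx, record in enumerate(payload, start=1):
        err = record.get("error", record)  # some records nest the error, some do not
        code = err.get("code", "n/a")
        reason = err.get("reason") or err.get("message") or "unknown reason"
        # keep only the first line, dropping any embedded stack trace
        lines.append(f"[{idx}] code={code}: {str(reason).splitlines()[0]}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '[{"error": {"code": 113, "reason": "DBSError: invalid parameter"}}]'
    print(summarize_dbs_errors(sample))
```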
Valentin, thank you for providing these details. However, I have to strongly disagree with your statement that:
is "easy to read and very clear"! It might be easier for you because you implemented it, but I doubt any mortal client would manage to consume this information. My comments and questions are:
This is not the first problem that we can't decode from the error message, for example:
Is there any chance that the service actually logs some more interesting and helpful information?
Alan
To resolve the issue you need to look at the JSON data from your client and check it. I understand that it is not an ideal error report, but it is the best we can do given the amount of DBS load and the dynamic size of our data. If you need to know the lexicon rules DBS uses, they can be found for the reader, writer and migration DBS servers here: https://github.com/dmwm/dbs2go/tree/master/static. We have had a long-standing issue with lexicon convergence between DBS and WMCore and it is still unresolved. In particular, we may need to look at
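For reference, the kind of client-side check being suggested could look roughly like the sketch below, which validates payload fields against regular expressions in the spirit of the lexicon rules linked above; the flat `{field: regex}` structure and the example pattern are assumptions for illustration, not the actual dbs2go lexicon layout:

```python
import re


def validate_payload(payload, lexicon):
    """Return a list of (field, value, pattern) tuples for payload values
    that do not match the corresponding lexicon rule; fields without a
    rule are skipped. 'lexicon' maps field name -> compiled regex."""
    failures = []
    for field, value in payload.items():
        rule = lexicon.get(field)
        if rule is not None and not rule.match(str(value)):
            failures.append((field, value, rule.pattern))
    return failures


if __name__ == "__main__":
    # Hypothetical rule and payload, just to show the check in action;
    # the real patterns live in the dbs2go static lexicon files.
    lexicon = {"dataset": re.compile(r"^/[\w\-]+/[\w\-]+/[A-Z\-]+$")}
    payload = {"dataset": "/Primary/Processed-v1/RAW", "run_num": 123456}
    print(validate_payload(payload, lexicon))  # prints [] when everything matches
```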
and if you look at a stack which provides
At the end, the stack info is very valuable for the debugging process, as it points to the exact lines of code which someone can inspect to understand the logic of the code flow.
An update on the service getting auto-restarted. I am still not sure whether it is triggered by a single workflow, but I did see this workflow
causing a kill command of the CherryPy application multiple times. As can be seen in this grafana plot: the memory footprint goes from a standard ~500MB all the way up to 5GB and beyond, making the service restart itself. I created an ad-hoc script to deal only with that workflow, which has been running in dry-run mode for more than 30 minutes now. I will soon update the PR with that script so that we have an easier way to deal with this in the future. A long-term solution will be tracked in a new GH issue.
It turns out I misused the psutil module and failed to fetch the maximum memory used when resolving that workflow. From the grafana schedd dashboard, I believe it was around 5GB. I just wanted to say that it took almost 2 hours to fix this one workflow as well:
Note that many of the 3k+ blocks actually failed to get the parentage resolved, since many files lack their parentage, thus failing on the server side. Right now the production instance of
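As a side note on the psutil pitfall: `psutil.Process().memory_info()` only reports the current RSS on Linux, not a peak, so the maximum has to be sampled explicitly. A minimal, generic sketch of such a sampler (not the actual code used in the task):

```python
import threading
import time

import psutil


class PeakMemoryTracker:
    """Sample the current process RSS in a background thread and keep the
    maximum value seen, since psutil does not expose a peak RSS on Linux."""

    def __init__(self, interval=0.5):
        self.interval = interval
        self.peak_rss = 0
        self._proc = psutil.Process()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak_rss = max(self.peak_rss, self._proc.memory_info().rss)
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


if __name__ == "__main__":
    with PeakMemoryTracker(interval=0.1) as tracker:
        blob = [bytearray(10 ** 6) for _ in range(200)]  # hold ~200 MB for a moment
        time.sleep(0.5)
    print(f"Peak RSS: {tracker.peak_rss / 1024 ** 2:.1f} MB")
```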
Quick update from patching reqmgr2-tasks and continuously skipping the large workflow from the comment above:
Today we had a chat with Dennis (DBS expert) and we will try to get this issue resolved on the server side in the next week or two.
Moved from 2023 Q3 to Q4, as we plan to finish it in the coming days/weeks.
Impact of the bug
ReqMgr2 CherryPy threads (reqmgr2-tasks k8s service)
Describe the bug
Resuming investigation of the high number of workflows stuck in announced status:
https://its.cern.ch/jira/browse/CMSPROD-26
and as a follow up of this ticket:
#11260
I found gazillions of DBS-related errors, example [1], and also something more worrisome, which is an actual restart of the service itself inside the kubernetes pod [2]. These records are available in the log reqmgr2-20230824-reqmgr2-tasks-77664bfc6c-xwcrc.log. It is not clear to me whether this restart gets somehow triggered by the CherryPy framework itself, or by a different cmsweb-related watchdog. The effect of these restarts is that the service never manages to converge, given that it first deals with all the dataset parentage, and only then does it try to update the relevant request status to archived.
How to reproduce it
Not sure
Expected behavior
First, the service should not crash!
Second, I am inclined to deal with one workflow at a time, even if that means that some datasets will be looked up twice (a rough sketch of this approach is included below).
Third, we need to understand those DBS errors. It could be a problem on our client side.
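As an illustration of the second point, the per-workflow loop could be as simple as the sketch below; `get_stuck_workflows`, `fix_parentage` and `archive_request` are hypothetical placeholders for the real ReqMgr2/DBS operations, not existing WMCore APIs:

```python
import logging

logger = logging.getLogger("StepChainParentageFix")


def fix_workflows_one_by_one(get_stuck_workflows, fix_parentage, archive_request):
    """Resolve parentage and archive requests one workflow at a time, so a
    single pathological workflow cannot block the whole cycle. The three
    callables are hypothetical hooks, not real WMCore/ReqMgr2 APIs."""
    for workflow in get_stuck_workflows():
        try:
            fix_parentage(workflow)    # may repeat some dataset lookups, by design
            archive_request(workflow)  # archive as soon as this workflow is done
        except Exception:
            # log and continue: one broken workflow must not stop the rest
            logger.exception("Failed to fix parentage for %s, skipping it", workflow)
```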
Additional context and error message
[1]
[2]