StepChainParentageFixTask ReqMgr2 thread failing to resolve parentage for 10 workflows #11260
Comments
A new set of workflows affected by this issue: https://its.cern.ch/jira/browse/CMSCOMPPR-41306 |
Here are my first findings for one of those workflows listed above. This one: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_TSG-Run3Winter23GS-00022__v1_T_230213_204700_2664 Looking into the
And while searching for the child dataset
|
And here is the sequence of calls to reproduce it interactively:
|
Thanks for this investigation, Todor. I think this is the key error message:
Without looking into the code, I would read it as: the child file that needs its parentage information fixed does not have any available parent file matching the same (run, lumi) tuple. I guess this makes the cherrypy thread either not send something that is expected, or send something wrong and/or empty. If all my observations are true, then the way to fix it is: if there is no parent file that matches a given (run, lumi) tuple, we simply skip it when fixing the parentage. Of course, we still have to make a log record for that. |
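A minimal sketch of the skip-and-log idea described above, for illustration only. It is not the WMCore implementation: the function name `map_children_to_parents`, the plain-dict inputs and the integer file ids are all assumptions made for this example.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("parentage_sketch")


def map_children_to_parents(child_lumis, parent_lumis):
    """Match child files to parent files through shared (run, lumi) tuples.

    Both arguments are hypothetical dicts: file id -> list of (run, lumi).
    Returns (resolved, orphans), where resolved maps child id -> set of
    parent ids and orphans lists the children that had no matching parent.
    """
    # Index every parent file by each (run, lumi) tuple it contains.
    parents_by_run_lumi = defaultdict(set)
    for parent_id, run_lumis in parent_lumis.items():
        for run_lumi in run_lumis:
            parents_by_run_lumi[tuple(run_lumi)].add(parent_id)

    resolved = {}
    orphans = []
    for child_id, run_lumis in child_lumis.items():
        parents = set()
        for run_lumi in run_lumis:
            parents |= parents_by_run_lumi.get(tuple(run_lumi), set())
        if parents:
            resolved[child_id] = parents
        else:
            # Skip the child, but leave a log record as suggested above.
            logger.warning("No parent file matches child file %s; skipping", child_id)
            orphans.append(child_id)
    return resolved, sorted(orphans)


# Made-up ids: child 203 has a (run, lumi) that no parent file contains.
parent_lumis = {101: [(1, 1), (1, 2)], 102: [(1, 3)]}
child_lumis = {201: [(1, 1)], 202: [(1, 3)], 203: [(1, 9)]}
resolved, orphans = map_children_to_parents(child_lumis, parent_lumis)
```

Note that with this approach the resolved map contains fewer child files than the block does whenever at least one child is skipped.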
Thanks @amaltaro , The protection you are talking about here:
Is already in place: WMCore/src/python/WMCore/Services/DBS/DBS3Reader.py Lines 879 to 882 in de8ca32
There must be something else I am overlooking here. |
But reading the actual content of the HTTP error returned it also includes the DBS internal error, which kind of leads in the same direction:
I am trying to reproduce the call step by step and will post the result. |
OK now, after 19 cups of coffee and enormous headache, here are my findings:
Where the keys are frozen sets built out of one
So I do not see what we can do here to solve this on our side. At least not until we figure out what has happened with this dataset in the meantime. So @haozturk do you happen to remember any OPS issue with this combination of
p.s. I need to mention there are more errors of this sort in the logs, but once we find out what is going on with one of them we can propagate the solution to all the remaining unresolved blocks. |
* And we end up hitting this piece of DBS Server:
https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/dbs/fileparents.go#L344-L358
+ What the DBS server rightfully does in this case is to check how
many files it has on record for this block and then check whether that
matches the length of the parentage resolved map. Once it
finds a mismatch, it just rejects the call, reports it with an HTTP
440 error header and puts the relevant information for the
error in the response. (***@***.*** please correct me if I am
misinterpreting this exact piece from the DBS Server side)
Todor, this is a correct understanding of the DBS server logic.
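The server-side check confirmed above can be illustrated with a small Python sketch. This is not the dbs2go code (which is written in Go). The function name `validate_block_parentage`, the input shapes and the file ids are assumptions made for the example. It only mirrors the described logic: compare the number of files on record for the block with the number of child files covered by the submitted parentage map, and reject on a mismatch.

```python
def validate_block_parentage(files_on_record, child_parent_ids):
    """Accept the request only if every file of the block is covered.

    files_on_record: hypothetical list of child file ids known for the block.
    child_parent_ids: hypothetical list of (child_id, parent_id) pairs sent
    by the client.
    Returns (ok, message).
    """
    children_sent = {child for child, _parent in child_parent_ids}
    n_record = len(set(files_on_record))
    n_sent = len(children_sent)
    if n_record != n_sent:
        # Mismatch: the whole call is rejected, not just the missing files.
        return False, f"block has {n_record} files on record, request covers {n_sent}"
    return True, "ok"


files_on_record = [201, 202, 203]
complete = [(201, 101), (202, 102), (203, 102)]
partial = [(201, 101), (202, 102)]  # child 203 was skipped by the client

ok_complete = validate_block_parentage(files_on_record, complete)
ok_partial = validate_block_parentage(files_on_record, partial)
```

Under these assumptions, a client that skips child files without a matching parent sends a map shorter than the block's file count, so the request for the whole block is rejected.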
|
Just to make sure I follow this, let me illustrate it with an example. Is it correct? On what concerns this call:
the |
The root cause has already been confirmed here:
So we basically skip those files with missing parents from the final list of
Now we have three courses of action in order to solve this:
FYI @amaltaro @vkuznet I am now about to create a ticket in DBS for releasing the constraint I am talking about, and once it is done we may start iterating with other groups eventually affected by such a change. |
And here is the DBS issue dmwm/dbs2go#94 |
Just for historical reference - this DBS API has been problematic in the past and was timing out due to the heavy database queries. While running through the Q1 stakeholders' issues today, I stumbled on the following one (already closed): #9537, which may have actually led to the current model of the data structure returned by DBS. And there is also a discussion related to the origins of the FYI @vkuznet |
@todor-ivanov even though this WMCore issue has been fixed, the actual problem isn't yet solved and further follow up with @d-ylee on the DBS changes is required. Can you please stay in close contact with Dennis and ensure that he has all the necessary information/assistance to fix: dmwm/dbs2go#94 |
Impact of the bug
ReqMgr2 StepChainParentageFixTask thread
Describe the bug
Now that this thread is behaving again, I noticed that it keeps failing to resolve parentage information for 10 workflows, e.g. from the log [1].
There could be multiple reasons for this failure, so we need to look into the logs and see what requires a change to the WMCore baseline code and what just needs a manual intervention.
How to reproduce it
Unclear
Expected behavior
Check the parentageFixTask reqmgr2-tasks log, identify which 10 workflows are continuously failing and start unsticking them. Depending on the problem, we might want to consider a new error mode within this thread such that in the future it doesn't hold a workflow until a manual intervention is performed.
Additional context and error message
[1]