Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StepChainParentageFixTask ReqMgr2 thread failing to resolve parentage for 10 workflows #11260

Closed
amaltaro opened this issue Aug 27, 2022 · 13 comments · Fixed by #11520
Closed

Comments

@amaltaro
Copy link
Contributor

Impact of the bug
ReqMgr2 StepChainParentageFixTask thread

Describe the bug
Now that this thread is behaving again, I noticed that it keeps failing to resolve parentage information for 10 workflows, e.g. from the log [1].

There could be multiple reasons for this failure. So we need to look into the logs and see what requires a change to the WMCore baseline code or what needs to have just a manual intervention.

How to reproduce it
Unclear

Expected behavior
Check the parentageFixTask reqmgr2-tasks log, identify which 10 workflows are continuously failing and start unstucking them.
Depending on the problem, we might want to consider a new error mode within this thread such that in the future it doesn't hold a workflow until a manual intervention is performed.

Additional context and error message
[1]

vocms0750:/cephfs/product/dmwm-logs $ grep processed parentageFixTask-reqmgr2-tasks-67dcb598c4-l2wx8-20220827.log | tail -n10
2022-08-26 23:29:20,113:INFO:StepChainParentageFixTask:A total of 16 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 02:35:35,006:INFO:StepChainParentageFixTask:A total of 7 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 05:41:42,089:INFO:StepChainParentageFixTask:A total of 18 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 08:47:44,998:INFO:StepChainParentageFixTask:A total of 16 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 11:54:05,594:INFO:StepChainParentageFixTask:A total of 14 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 15:00:38,751:INFO:StepChainParentageFixTask:A total of 26 requests have been processed, where 10 will have to be retried in the next cycle.
2022-08-27 18:07:00,612:INFO:StepChainParentageFixTask:A total of 33 requests have been processed, where 10 will have to be retried in the next cycle.
@todor-ivanov
Copy link
Contributor

@todor-ivanov todor-ivanov moved this from Todo to In Progress in WMCore quarterly developments Mar 1, 2023
@todor-ivanov
Copy link
Contributor

todor-ivanov commented Mar 1, 2023

Here are my first findings for one of those workflows listed above. This one: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_TSG-Run3Winter23GS-00022__v1_T_230213_204700_2664

Looking into the ChainParentageMap it is as follows:

  "ChainParentageMap": {
    "Step1": {
      "ParentDset": null,
      "ChildDsets": [
        "/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM"
      ]
    },
    "Step2": {
      "ParentDset": "/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM",
      "ChildDsets": [
        "/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW"
      ]
    }
  },

And while searching for the child dataset /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW in the parentageFixTask-reqmgr2-tasks-646898bd65-22m4g-20230226.log log, I immediately stumbled on the following error:

2023-02-26 01:59:25,886:INFO:StepChainParentageFixTask:Resolving parentage for dataset: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW
2023-02-26 01:59:25,955:INFO:DBS3Reader:Parent datasets for /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW are: [{'parent_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM', 'parent_dataset_id': 14638924, 'this_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'}]
2023-02-26 01:59:26,070:INFO:DBS3Reader:Found 1 blocks without parentage information
2023-02-26 01:59:26,070:INFO:DBS3Reader:Fixing parentage for block: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78
2023-02-26 01:59:26,082:WARNING:DBS3Reader:Child file id: 2746490637, with run/lumi: frozenset({1, 9249}), has no match in the parent dataset
2023-02-26 01:59:26,083:WARNING:DBS3Reader:Child file id: 2746490637, with run/lumi: frozenset({1, 9290}), has no match in the parent dataset
2023-02-26 01:59:26,095:ERROR:DBS3Reader:Parentage updated failed for block /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78 with error HTTP Error 400: DBSError Code:110 Description:DBS DB insert record error Function:dbs.fileparents.InsertFileParents Message: Error: nested DBSError Code:118 Description:DBS invalid parameter for the DBS API Function:dbs.fileparents.InsertFileParentsBlockTxt Message:not all files present in block Error: record error
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/DBS/DBS3Reader.py", line 953, in fixMissingParentageDatasets
    numFiles = self.findAndInsertMissingParentage(blockName, parentFullInfo, insertFlag=insertFlag)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/DBS/DBS3Reader.py", line 892, in findAndInsertMissingParentage
    self.dbs.insertFileParents({"block_name": childBlockName, "child_parent_id_list": listChildParent})
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 768, in insertFileParents
    return self.__callServer("fileparents", data=fileParentObj, callmethod='POST' )
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 477, in __callServer
    self.__parseForException(http_error)
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 524, in __parseForException
    raise exp
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 522, in __parseForException
    raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body, serverCode(data))
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:110 Description:DBS DB insert record error Function:dbs.fileparents.InsertFileParents Message: Error: nested DBSError Code:118 Description:DBS invalid parameter for the DBS API Function:dbs.fileparents.InsertFileParentsBlockTxt Message:not all files present in block Error: record error
2023-02-26 01:59:26,095:WARNING:StepChainParentageFixTask:These blocks failed to be resolved and will be retried later: ['/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78']
2023-02-26 01:59:26,095:INFO:StepChainParentageFixTask:    spent 0.20961904525756836 secs on this dataset: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Mar 1, 2023

And here is the sequence of calls to reproduce it interactively:

In [1]: from WMCore.Services.DBS.DBS3Reader import DBS3Reader

In [2]: dbsSvc = DBS3Reader(msConfig['dbsUrl'], logger=logger)

In [3]: childDataset = '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'

In [8]: dbsSvc.fixMissingParentageDatasets(childDataset, insertFlag=True)
2023-03-01 15:19:11,399:INFO:DBS3Reader: Parent datasets for /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW are: [{'parent_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM', 'parent_dataset_id': 14638924, 'this_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'}]
2023-03-01 15:19:11,541:INFO:DBS3Reader: Found 1 blocks without parentage information
2023-03-01 15:19:11,541:INFO:DBS3Reader: Fixing parentage for block: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78
2023-03-01 15:19:11,556:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: frozenset({1, 9249}), has no match in the parent dataset
2023-03-01 15:19:11,556:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: frozenset({1, 9290}), has no match in the parent dataset

@amaltaro
Copy link
Contributor Author

amaltaro commented Mar 1, 2023

Thanks for this investigation, Todor.

I think this is the key error message:

2023-03-01 15:19:11,556:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: frozenset({1, 9249}), has no match in the parent dataset

without looking into the code, I would read it as: the child file that needs to have the parentage information fixed, does not have any available parent file matching the same run,lumi tuple.

I guess this makes the cherrypy thread to either not send something that is expected, or to send something wrong and/or empty.

IF all my observations are true, then the way to fix it is: if there is no parent file that matches a given run,lumi tuple, we simply skip it when fixing the parentage. Of course, we still have to make a log record for that.

@todor-ivanov
Copy link
Contributor

Thanks @amaltaro ,

The protection you are talking about here:

if there is no parent file that matches a given run,lumi tuple, we simply skip it when fixing the parentage. Of course, we still have to make a log record for that.

Is already in place:

if parentId is None:
msg = "Child file id: %s, with run/lumi: %s, has no match in the parent dataset"
self.logger.warning(msg, childFileID, frozenKey)
continue

There must be something else I am overlooking here.

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Mar 2, 2023

But reading the actual content of the HTTP error returned it also includes the DBS internal error, which kind of leads in the same direction:

HTTP Error 400: DBSError Code:110 Description:DBS DB insert record error Function:dbs.fileparents.InsertFileParents Messag
e: Error: nested DBSError Code:118 Description:DBS invalid parameter for the DBS API Function:dbs.fileparents.InsertFileParentsBlockTxt Message:not all files present in block Error: record e
rror

I am trying to reproduce the call step by step and will post the result.

@todor-ivanov
Copy link
Contributor

OK now, after 19 cups of coffee and enormous headache, here are my findings:

  • We first find the parent dataset per each child dataset then
  • We fetch full parent dataset info
    parentFullInfo = self.getParentDatasetTrio(childDataset)
    from DBS by calling dbsClisnet.listParentDSTrio
    • it returns a list of nested dictionaries containing [{file: [{run:(lumi,..)},..]},..] tuples, which we then split and rearrange it in the following data structure:
{...
 frozenset({1, 2335}): 2733771317,
 frozenset({1, 2338}): 2733771317,
 frozenset({1, 2345}): 2733771317,
 frozenset({1, 2353}): 2733771317,
 frozenset({1, 2361}): 2733771317,
...}

Where the keys are frozen sets built out of one run and one lumi each, and the values are file ids. So we basically have one record in this data structure per each combination of run:lumi:file for the whole parent dataset.

  • Then we again query DBS to figure out which blocks still have unresolved parrents
  • Then for each of the blocks we do:
    • it again returns a list of nested dictionaries containing [{file: [{run:(lumi,..)},..]},..] tuples, which we again split in frozenset(run, Lumi) : fileId pairs and
    • we try to find each of the run: Lumi combination from the unresolved block in the parendDataset's info and by that
    • we create a map of {childFileId: {parentFileId}}
      mapChildParent.setdefault(childFileID, set())
      mapChildParent[childFileID].add(parentId)
    • but in the meantime we drop those run:lumi:file combinations from the block for which we do not find any information in the parent dataset:
      if parentId is None:
      msg = "Child file id: %s, with run/lumi: %s, has no match in the parent dataset"
      self.logger.warning(msg, childFileID, frozenKey)
      continue
    • This basically means that according to DBS the child dataset contains information which is extra and missing in the parent ( in terms of run:lumi:file combinations). To me this may happen only if there has been some interim file invalidation in the parent dataset, which has not been propagated to the child dataset/blocks or there is a broken record in DBS related to the parent. @amaltaro please correct me here if I am wrong.
  • Now comes the funny part: we then try to insert the so built parentage map for the child block into DBS by calling it here:
    self.dbs.insertFileParents({"block_name": childBlockName, "child_parent_id_list": listChildParent})
  • And we end up hitting this piece of DBS Server: https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/dbs/fileparents.go#L344-L358
    • What DBS server rightfully does in this case, is to check how many files it have on record for this block then checks if it matches the length of the parentage resolved map and once it find a mismatch just rejects the call and reports it by HTTP 440 error header and puts the relevant information for the error in the response. (@vkuznet please correct me if I am misinterpreting this exact piece from the DBS Server side)

So I do not see what we can do here to solve this on our side. At least not until we figure out what has happened with this dataset in the meantime. So @haozturk do you happen to remember any OPS issue with this combination of

p.s. I need to mention there are more errors of the sort in the logs, but once we find what is going on with one of them we can propagate the solution to all the rest unresolved blocks.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 2, 2023 via email

@amaltaro
Copy link
Contributor Author

amaltaro commented Mar 2, 2023

Just to make sure I follow this, let me illustrate it with an example.
'
If a child block /a/b/RECO#1 has run/lumi pairs that cannot be found in the parent block /a/b/RAW#3, it's possible that the relevant child file will not be present in the data structure posted to DBS Server.
And the DBS server is actually collecting a list of file ids, for a given block, and comparing that list from the database against the list provided by the end user. In case they don't match, DBS cannot accept that client request.

Is it correct?

On what concerns this call:

 self.dbs.insertFileParents({"block_name": childBlockName, "child_parent_id_list": listChildParent}) 

the listChildParent object is actually a list of childID, parentID tuples.

@todor-ivanov
Copy link
Contributor

The root cause has already been confirmed here:
We do have some files in the Child Block that are missing from the Parent Dataset. And this is normal for StepChains because the merge jobs for different steps are asynchronous - meaning:

  • The merge job of a later step may read from a file present in the output of a previous step locally on a site before the merge job of the previous step has completed,
  • and if a consequent failure happens in the merge job related to the parent data then we basically end up in the current situation

So we basically skip those files with missing parents from the final list of child_parent_id_list, which we send to DBS. On the other side DBS does check all the files present in the Block (as it knows them) and if it finds there is missing information about the child-parent file relation even for a single file from the Block it fails the call, hence to resolve the Parentage information for the whole Block

Now we have three courses of action in order to solve this;

  • We set some default value (needs to be communicated which this should be so that it complies with the DBS tables structures) for the parent file do and we keep the numbers inline
  • We ask DBS server to release this constrain, so that we can end up with child files ina block with missing parentage information (basically whole blocks with partially resolved parentage information)
  • we start developing a mechanism to automatically invalidate files of the sort in both parent dataset and child block - both in Rucio and DBS

FYI @amaltaro @vkuznet I am now about to create a ticket in DBS for releasing the constraint I am talking about, and once it is done we may start iterating with other groups eventually affected by such a change.

@todor-ivanov
Copy link
Contributor

And here is the DBS issue dmwm/dbs2go#94

@todor-ivanov
Copy link
Contributor

Just for historical reference - this DBS API has been problematic in the past and was timing out due to the heavy database queries. While running through the Q1 stakeholders' issues today, I stumbled on the following one (already closed): #9537 . Which may have actually lead to the current model of the data structure returned by DBS.

And there is also a discussion related to the origins of the parent by lumi API, which obviously stems from a bug into the agent: dmwm/DBS#613. I could not read everything to the bottom, but @amaltaro if you can remember and summarize here it would be really helpful. I'd prefer not to release a constraint that has been put there for a reason and open the Pandora box. If you think my concerns are irrelevant ..., then please ignore my comment here and let's continue with our current plan.

FYI @vkuznet

@github-project-automation github-project-automation bot moved this from In Progress to Done in WMCore quarterly developments Mar 29, 2023
@amaltaro
Copy link
Contributor Author

@todor-ivanov even though this WMCore issue has been fixed, the actual problem isn't yet solved and further follow up with @d-ylee on the DBS changes is required.

Can you please stay in close contact with Dennis and ensure that he has all the necessary information/assistance to fix: dmwm/dbs2go#94
? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
3 participants