Adding Null parentage information for RunLumi pairs missing at the parent Dataset #11520

Conversation

todor-ivanov
Contributor

@todor-ivanov todor-ivanov commented Mar 23, 2023

Fixes #11260

Status

Ready

Description

With this PR we try to add Null parentage information for the RunLumi pairs that are missing at the parent dataset, by adding the following improvements to the current code:

  • Introduce a new data structure, a NamedTuple containing the triplet fileId, run, lumi
  • Benefit from set operations to create subsets of missing and resolved run/lumi pairs
  • Add the subset of unresolved run/lumi pairs with Null parentage information, so that the full length of the block information uploaded to DBS is preserved (see the sketch below)
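
A minimal sketch of the approach described above, assuming the child block and the parent dataset have already been flattened into dicts keyed by (run, lumi); the helper name buildChildParentIdPairs and the exact input shapes are illustrative only, not the actual WMCore API:

from typing import Dict, List, NamedTuple, Optional, Tuple

class FileRunLumi(NamedTuple):
    """Flat triplet of fileId, run, lumi (first bullet above)."""
    fileId: int
    run: int
    lumi: int

def buildChildParentIdPairs(childFlatData: Dict[Tuple[int, int], FileRunLumi],
                            parentFlatData: Dict[Tuple[int, int], int]) -> List[List[Optional[int]]]:
    """Pair each child file id with its parent file id, or with None (Null)
    when the (run, lumi) pair has no match in the parent dataset."""
    resolved = set(childFlatData) & set(parentFlatData)   # run/lumi pairs found in the parent
    missing = set(childFlatData) - set(parentFlatData)    # run/lumi pairs absent from the parent
    pairs = [[childFlatData[rl].fileId, parentFlatData[rl]] for rl in resolved]
    # keep the full block length by adding Null parentage for the unresolved pairs
    pairs.extend([childFlatData[rl].fileId, None] for rl in missing)
    return pairs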

Is it backward compatible (if not, which system it affects?)

No

Related PRs

None

External dependencies / deployment changes

DBS needs to change its code in order to accept blocks with partially resolved parentage information.
The relevant DBS issue is:
dmwm/dbs2go#94

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 67 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14130/artifact/artifacts/PullRequestReport.html

Contributor

@vkuznet vkuznet left a comment


I"m not convinced that current implementation is the best one. What I see is that you need a flat structure like fileId, runNumber, lumiNumber and instead use dict structures. Instead, I rather prefer to discuss proper DBS API for that and use such flat stream of data in a code, instead of many nested loops and list of dictionaries.

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 24, 2023

Hi @vkuznet, and to address your next comment:

What I see is that you need a flat structure like fileId, runNumber, lumiNumber and instead use dict structures

Yes, we do need a flat structure of that kind. Initially I implemented it with just two sets of FileRunLumi members (one for the childBlock and one for the parentDataset), and thought: "OK, sets of tuples are cheap in memory terms because of their fine granularity and member size (in contrast to the dictionary), both are hashable which means quick search operations, etc."... But when I analyzed which were the common fields between the two instances of those sets of flat named tuples (the instance of childFlatData and the instance of parentFlatData), I realized it was not the full member of the data structure (file, run, lumi) but rather only (run, lumi), hence these were the ones that had to be searched through for resolving the parentage information. So full set operations were impossible between the two instances (sets of FileRunLumi tuples for the child and the parent), and the search complexity was about to blow up here. The only option was to do what @amaltaro was doing in the previous implementation: take only the common parts and make them hashable, so that search operations in terms of run/lumis become O(1). That's why I converted the sets to dicts and created the keys out of the (run, lumi) tuples.

I was extremely careful about time complexity this time, exactly because I was aware of all this. So if you look deeper, in the end the complexity here in the worst-case scenario is not, as you state, O(n^4) (where n is the number of run/lumi pairs) but rather O(2n) (or, if I am missing something, at most O(n^2)), because we basically just iterate through the full set of run/lumi pairs of the parent dataset twice, and nothing more. The nested for loops, at first glance, appear to be fully nested over all members of the structure, but they simply operate on chunks of the data (block by block) sequentially as they go through all run/lumi pairs from the parent dataset.

In the end I reused what was there from the previous implementation and did my best to make the code more readable. The idea here is mostly not to make code optimizations but rather to make it more readable, because it was really difficult to understand what was going on while I was reading it for the first time, and to make the data structure for (file, run, lumi) better and more flexible to work with. We can also enforce type checks, as it is right now, etc.
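
To illustrate the complexity point with a self-contained toy example (the input shapes and values below are made up for illustration; only the (run, lumi) keying mirrors the PR):

# Hypothetical inputs: parent file -> its run/lumi pairs, and child (run, lumi) -> child file id
parentFilesToRunLumis = {1723293557: [(357479, 109), (357479, 110)]}
childFlatData = {(357479, 109): 2746490637, (357479, 111): 2746490637}

# Build the parent lookup once: a single pass over the parent dataset's run/lumi pairs
parentFlatData = {}
for parentFileId, runLumiPairs in parentFilesToRunLumis.items():
    for run, lumi in runLumiPairs:
        parentFlatData[(run, lumi)] = parentFileId

# Match every child pair with an O(1) dict lookup: a single pass over the child block's pairs
childParentIdPairs = [[childFileId, parentFlatData.get(runLumi)]
                      for runLumi, childFileId in childFlatData.items()]
print(childParentIdPairs)   # [[2746490637, 1723293557], [2746490637, None]]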

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 67 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14135/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment


Todor, I left a few questions and comments along the code for your consideration.

One of my concerns though is that we are duplicating the run/lumi information, which will increase the memory footprint of this thread, which already is not great.

@@ -9,6 +9,7 @@

from builtins import object, str, bytes
from future.utils import viewitems
from typing import NamedTuple
Contributor


I was wondering whether it should be using the collections library, but it looks like both provide this data type and they are practically the same. Just a note :)

Contributor Author


NamedTuple from typing is a typed wrapper around namedtuple from collections. I deliberately used this one because it allows you to have type annotations with defaults if you decide to. You may also redefine methods if you want to (except __init__ or __new__, though - for those you need some cumbersome inheritance). And, mostly, it makes things much more readable. This is a properly defined data structure, the size of a standard tuple but with named fields, and it is also immutable - hence hashable. It has all we need.
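
A small standalone example of what typing.NamedTuple provides (type annotations, defaults, extra methods, hashability); the field names match the PR, while the default value and the runLumi helper are purely illustrative:

from typing import NamedTuple

class FileRunLumi(NamedTuple):
    fileId: int
    run: int
    lumi: int = 0            # type annotations and defaults are allowed

    def runLumi(self):       # extra methods can be added (not __init__/__new__, though)
        return (self.run, self.lumi)

frl = FileRunLumi(fileId=1723293557, run=357479, lumi=109)
print(frl.runLumi())         # (357479, 109)
print({frl: 'hashable'})     # immutable, hence usable as a dict key or a set member
# frl.lumi = 5               # would raise AttributeError: can't set attribute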

fileId: int
run: int
lumi: int
# def __eq__(frl1, frl2):
Contributor


if we don't need to override this method, then please remove.

@@ -97,6 +119,9 @@ def __init__(self, url, logger=None, parallel=None, **contact):
msg += "%s\n" % formatEx3(ex)
raise DBSReaderError(msg) from None

# A type definition visible only inside the DBS3 Class
# NOTE: these may also go in the global scope in case we decide it won't break anything
Contributor


I'd say these 2 comments could be removed.

for fileId in childBlockInfo[0]:
for runLumiPair in childBlockInfo[0][fileId]:
frlObj = FileRunLumi(fileId=fileId, run=runLumiPair[0], lumi=runLumiPair[1])
childFlatData[(frlObj.run, frlObj.lumi)] = frlObj
Contributor


We are duplicating every single run/lumi pair here both in the key and in the value. Did you check the memory impact with this implementation? Or is there any good reason to keep run/lumi in both places?

@todor-ivanov
Contributor Author

Thanks @amaltaro for your feedback

One of my concerns though is that we are duplicating the run/lumi information, which will increase the memory footprint of this thread, which already is not great.

I wouldn't be concerned about that, since the dictionary overhead is much higher. Here is a measurement done with the two data formats - Dict of Named Tuple and Old Format Dict :

In [1]: import sys

In [2]: from WMCore.Services.DBS.DBS3Reader import DBS3Reader

In [3]: childDataset = '/EGamma/CMSSW_12_4_10-RunEGamma2022C_v1_TkAl_RelVal-v1/RECO'

In [4]: dbsSvc = DBS3Reader(msConfig['dbsUrl'], logger=logger)

In [5]: parentFlatData_NamedTuple = dbsSvc.getParentDatasetTrio(childDataset)

In [6]: parentFlatData_OldFormat = dbsSvc.getParentDatasetTrioOld(childDataset)

In [7]: len(parentFlatData_NamedTuple)
Out[7]: 35543

In [8]: sys.getsizeof(parentFlatData_NamedTuple)
Out[8]: 1310816

In [9]: sys.getsizeof(parentFlatData_OldFormat)
Out[9]: 1310816

In [10]: dbsSvc.listDatasetParents(childDataset)
Out[10]: 
[{'parent_dataset': '/EGamma/Run2022C-v1/RAW',
  'parent_dataset_id': 14404594,
  'this_dataset': '/EGamma/CMSSW_12_4_10-RunEGamma2022C_v1_TkAl_RelVal-v1/RECO'}]

And just for clarity about which is which, here are two items popped from each data structure:

In [16]: parentFlatData_NamedTuple.popitem()
Out[16]: ((357479, 109), FileRunLumi(fileId=1723293557, run=357479, lumi=109))

In [17]: parentFlatData_OldFormat.popitem()
Out[17]: (frozenset({109, 357479}), 1723293557)

I did use the simplest way to measure this (sys.getsizeof returns the object size in bytes), but as you can see, there is no overhead from the fact that we've put two more integers inside the dictionary value.

@amaltaro
Contributor

@todor-ivanov 35k lumis (tuples) will likely not be enough to spot memory footprint differences. I think we would need something with around half a million lumis.

In addition, sys.getsizeof isn't reliable for complex data structures (AFAIK anything beyond int/float/str). I would suggest using either this function to calculate it:
https://github.com/dmwm/WMCore/blob/master/src/python/Utils/Utilities.py#L171

or something like

sys.getsizeof(json.dumps(data))
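
For illustration, the difference between the shallow and a deep measurement can be seen on a toy dict of the same shape (the deepSize helper below is a generic sketch, not the Utils.Utilities.getSize implementation linked above):

import sys, json

data = {(357479, lumi): 1723293557 for lumi in range(100000)}

# Shallow: only the dict's own hash table is counted, not the keys/values it references
print(sys.getsizeof(data))

# Rough deep estimate via serialization (tuple keys must be stringified for JSON)
print(sys.getsizeof(json.dumps({str(k): v for k, v in data.items()})))

# Simple recursive estimate, counting each referenced object once
def deepSize(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deepSize(k, seen) + deepSize(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deepSize(item, seen) for item in obj)
    return size

print(deepSize(data))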

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 29, 2023

Thanks for the idea @amaltaro. I did follow your suggestion to iterate through the whole object, and you will be surprised by the results. (I also took a parent dataset with 113K lumis, the biggest one I could find, taken from the O&C meeting's plots):

In [19]: childDataset = '/ParkingBPH2/Run2018D-05May2019promptD-v1/AOD'

In [20]: dbsSvc.listDatasetParents(childDataset)
Out[20]: 
[{'parent_dataset': '/ParkingBPH2/Run2018D-v1/RAW',
  'parent_dataset_id': 13694337,
  'this_dataset': '/ParkingBPH2/Run2018D-05May2019promptD-v1/AOD'}]

In [21]: parentFlatData_NamedTuple = dbsSvc.getParentDatasetTrio(childDataset)

In [22]: parentFlatData_OldFormat = dbsSvc.getParentDatasetTrioOld(childDataset)

In [23]: len(parentFlatData_NamedTuple)
Out[23]: 113542

In [24]: sys.getsizeof(parentFlatData_NamedTuple)
Out[24]: 5242976

In [25]: sys.getsizeof(parentFlatData_OldFormat)
Out[25]: 5242976

In [26]: from Utils.Utilities import getSize

In [27]: getSize(parentFlatData_NamedTuple)
Out[27]: 29220964

In [28]: getSize(parentFlatData_OldFormat)
Out[28]: 39212660

@todor-ivanov
Contributor Author

You can clearly see that the old dictionary is using more volume than the one with the NamedTuples, despite the two additional integers in the value field. And there is a reason for this: the dominating contribution in this data structure comes not from the values but from the keys used. There is one more difference between the dictionary of named tuples (as I constructed it) and the old dictionary - I used simple tuples as keys, while the old one is using frozensets.

In [16]: parentFlatData_NamedTuple.popitem()
Out[16]: ((357479, 109), FileRunLumi(fileId=1723293557, run=357479, lumi=109))

In [17]: parentFlatData_OldFormat.popitem()
Out[17]: (frozenset({109, 357479}), 1723293557)

In case we want to make things more equal and measure only the difference coming from the two additional integers, we should redefine the old format to use tuples as keys as well. And that is exactly what I did:

In [41]: parentFlatData_OldFormat.popitem()
Out[41]: ((325114, 7), 407993086)

In [42]: parentFlatData_NamedTuple.popitem()
Out[42]: ((325114, 7), FileRunLumi(fileId=407993086, run=325114, lumi=7))

Now, comparing the two on a more equal footing:

In [39]: getSize(parentFlatData_OldFormat)
Out[39]: 21045940


In [40]: getSize(parentFlatData_NamedTuple)
Out[40]: 29220964

Now we get the expected increase - but again it is small in comparison to the contribution from the key sizes. Taking as a baseline the lowest possible size of 21 MBytes per 113K lumis, we have (a small sketch after this list illustrates the key-size difference):

  • The additional two integers in the data structure are adding 8M
  • The frozenset keys are adding 18M
    FYI: @amaltaro
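
A quick way to see where the key overhead comes from (illustrative; exact byte counts depend on the Python build):

import sys

runLumi = (357479, 109)
print(sys.getsizeof(runLumi))              # a 2-element tuple: a few tens of bytes
print(sys.getsizeof(frozenset(runLumi)))   # a 2-element frozenset: several times larger
print(sys.getsizeof(357479))               # each extra int stored in the value adds roughly this much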

@amaltaro
Contributor

amaltaro commented Mar 29, 2023

Thank you for making these tests and providing these numbers, Todor.

The summary I get then is that, using the same tuple-keyed structure with either:
a) the not-so-readable current code (tuple: fileid)
b) or the new data structure (tuple: namedTuple)

there is an increase of almost 40% in the memory footprint, for 113k lumis, taken from:

In [1]: 29220964/21045940
Out[1]: 1.3884371047337396

We already have spikes of almost 4GB for this reqmgr2-tasks service (where one of the CP threads is this StepChain parentage):
https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1680060628726&to=1680103828726&refresh=30m&viewPanel=223

So I fear that this potential 40% increase might be just too much.

@todor-ivanov
Contributor Author

@amaltaro this statement is incorrect:

there is an increase of almost 40% in the memory footprint, for 325k lumis, taken from:

In [1]: 29220964/21045940
Out[1]: 1.3884371047337396

Because the current data structure size is not 21045940, but rather 39212660.

So I'd say the proposed data structure of a dictionary of named tuples already improves the situation (decreases the memory footprint) rather than worsening it.

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 29, 2023

Oh... actually you do compare apples with apples here:

a) the not-so-much readable current code (tuple: fileid)
b) or using the new data structure (tuple: namedTuple)

Sorry for missing that you are referring to the case with tuple keys in both cases - which is indeed what we need to do and what I was also trying to stress.

So do you suggest to just squeeze this to the maximum and use a dict of (tuple: fileid) here?

@amaltaro
Contributor

So do you suggest to just squeeze this to the maximum and use a dict of (tuple: fileid) here?

Yes, Todor. Even though the code is not going to be the most readable, it does save quite a lot of memory (even more so if compared to the current implementation in WMCore).

Maybe we take this opportunity to document something either in the StepChainParentage module, or in a wiki, such that next time we need to investigate it, it does not take too long to learn what and how exactly this code operates. What do you think?

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 29, 2023

Hi @amaltaro,

Even though the code is not going to be the most readable, it does save quite a lot of memory (even more so if compared to the current implementation in WMCore).

With my latest commit I did my best to improve both:

  • Fix code readability
  • Reduce data structure size

I did not test the difference in performance, but it should work at least as well as before, if not even better. Please take a look again.

And here is the result of one single call (without actually writing to DBS).

In [46]: childDataset = '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'

In [47]: dbsSvc.listBlocksWithNoParents(childDataset)
Out[47]: {'/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78'}

In [48]: dbsSvc.fixMissingParentageDatasets(childDataset, insertFlag=False)
2023-03-29 18:59:49,838:INFO:DBS3Reader: Parent datasets for /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW are: [{'parent_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM', 'parent_dataset_id': 14638924, 'this_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'}]
2023-03-29 18:59:50,047:INFO:DBS3Reader: Found 1 blocks without parentage information
2023-03-29 18:59:50,048:INFO:DBS3Reader: Fixing parentage for block: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78
2023-03-29 18:59:50,135:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: (1, 9290), has no match in the parent dataset. Adding it with null parentage information to DBS.
2023-03-29 18:59:50,135:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: (1, 9249), has no match in the parent dataset. Adding it with null parentage information to DBS.
2023-03-29 18:59:50,135:WARNING:DBS3Reader: No parentage information added to DBS for block /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78
Out[48]: []

And here is another call with the actual new error returned from DBS, given that we now provide Null parentage information for some of the run/lumi pairs.

In [49]: dbsSvc.fixMissingParentageDatasets(childDataset, insertFlag=True)
2023-03-29 19:07:32,001:INFO:DBS3Reader: Parent datasets for /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW are: [{'parent_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23GS-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM', 'parent_dataset_id': 14638924, 'this_dataset': '/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW'}]
2023-03-29 19:07:32,203:INFO:DBS3Reader: Found 1 blocks without parentage information
2023-03-29 19:07:32,204:INFO:DBS3Reader: Fixing parentage for block: /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78
2023-03-29 19:07:32,221:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: (1, 9290), has no match in the parent dataset. Adding it with null parentage information to DBS.
2023-03-29 19:07:32,221:WARNING:DBS3Reader: Child file id: 2746490637, with run/lumi: (1, 9249), has no match in the parent dataset. Adding it with null parentage information to DBS.
2023-03-29 19:07:32,228:ERROR:DBS3Reader: Parentage update failed for block /JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78 with error HTTP Error 405: 
URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSReader/fileparents
Code=405
Message=Method Not Allowed
Header=HTTP/1.1 100 Continue

HTTP/1.1 405 Method Not Allowed
Date: Wed, 29 Mar 2023 17:07:42 GMT
Server: Apache
Content-Length: 0
CMS-Server-Time: D=4309 t=1680109662825237


Body=
Traceback (most recent call last):
  File "/data/tmp/WMCore.venv3/srv/current/lib/python3.6/site-packages/dbs/apis/dbsClient.py", line 501, in __parseForException
    data = json.loads(data)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/tmp/WMCore.venv3/srv/WMCore/src/python/WMCore/Services/DBS/DBS3Reader.py", line 940, in fixMissingParentageDatasets
    self.insertFileParents(blockName, listChildParent)
  File "/data/tmp/WMCore.venv3/srv/WMCore/src/python/WMCore/Services/DBS/DBS3Reader.py", line 858, in insertFileParents
    return self.dbs.insertFileParents({"block_name": childBlockName, "child_parent_id_list": childParentsIDPairs})
  File "/data/tmp/WMCore.venv3/srv/current/lib/python3.6/site-packages/dbs/apis/dbsClient.py", line 751, in insertFileParents
    return self.__callServer("fileparents", data=fileParentObj, callmethod='POST' )
  File "/data/tmp/WMCore.venv3/srv/current/lib/python3.6/site-packages/dbs/apis/dbsClient.py", line 473, in __callServer
    self.__parseForException(data)
  File "/data/tmp/WMCore.venv3/srv/current/lib/python3.6/site-packages/dbs/apis/dbsClient.py", line 507, in __parseForException
    raise http_error
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 405: 
URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSReader/fileparents
Code=405
Message=Method Not Allowed
Header=HTTP/1.1 100 Continue

HTTP/1.1 405 Method Not Allowed
Date: Wed, 29 Mar 2023 17:07:42 GMT
Server: Apache
Content-Length: 0
CMS-Server-Time: D=4309 t=1680109662825237


Body=

Out[49]: ['/JPsiTo2Mu_Pt-0To100_pythia8-gun/Run3Winter23Digi-126X_mcRun3_2023_forPU65_v1-v2/GEN-SIM-RAW#7a1d6f5c-d054-4315-91a9-c9ffa7ccbb78']

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 67 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14140/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 83 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14141/artifact/artifacts/PullRequestReport.html

…rent Dataset.

Fix docstring for getChildBlockTrio.

Change data structure from dict of NamedTuples back to dict of integers && Fix dict keys from frozenSets to tuples.

Remove obsolete method getParentDatasetTrioOld.
@todor-ivanov todor-ivanov force-pushed the bugfix_ReqM2gr_StepChainParentageFixTask_fix-11260 branch from 5b4cc8d to 4370f0e on March 29, 2023 18:41
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 82 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14142/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment


Changes look good to me, thanks!
Valentin, I think Todor answered your concern already - and I don't see a complexity increase either - but if we missed anything, please leave your message for further discussion.

@todor-ivanov we must test this functionality in testbed (by announcing a StepChain workflow) to avoid potential surprises in production.

I am merging it now such that we can test this in a dev cluster if we want (also to test things with the open MSTransferor PR), under 2.2.0rc5

@amaltaro amaltaro merged commit a3438ae into dmwm:master Mar 29, 2023
@todor-ivanov
Contributor Author

Thanks @amaltaro

One thing I see now that I missed mentioning before, but which I think is important to note now:

When moving the dictionary keys from frozensets to tuples, there is one detail we should bear in mind: now the order in the run/lumi pair matters, while before, with the frozensets, it did not. This is important later for matching between the parentFlatData and the childFlatData. In our case we always construct them as they have been returned by DBS, so we never reorder - which saves us from stumbling on such an error on our side:

here:

parentFlatData[tuple(runLumiPair)] = fileId

and here:

childFlatData[tuple(runLumiPair)] = fileId

But....
If someone in DBS decides to reverse the order as it is returned by the DBSApi, we will be hit.
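
A tiny illustration of the ordering caveat (the swapped order is a hypothetical scenario, not current DBS behaviour):

runLumiFromDataset = (357479, 109)   # order as returned by the per-dataset call
runLumiFromBlock = (109, 357479)     # hypothetical: the per-block call returning the swapped order

print(frozenset(runLumiFromDataset) == frozenset(runLumiFromBlock))   # True: frozensets ignore order
print(runLumiFromDataset == runLumiFromBlock)                         # False: tuples do not

parentFlatData = {runLumiFromDataset: 1723293557}
print(parentFlatData.get(runLumiFromBlock))                           # None -> the match is lost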
I did go through the DBS code, both the Python client and the Go server, and we indeed use two different calls for obtaining this information per dataset and per block. These, on their side, use two different methods inside the DBS server, materialized in two completely independent SQL queries. I am not dumping the whole call path here, but it all distills to these two SQL queries on the server side:

https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/static/sql/parentdatasetfilelumiids.sql#L1

https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/static/sql/blockfilelumiids.sql#L1

So if anybody touching these queries decides for some reason to swap the order of run/lumi in one of those select statements but not in the other, we will simply stop matching between parent dataset and child block information.

I suppose this is a matter of API stability - once you announce that an API returns its result in a specified form, you do not change the format from then on without announcing it to the end users, so that they can check whether they would be affected by such a change... but I still decided to make it clear, so that if in the future a change to either of those two APIs happens, we remember to ask the DBS developer, or whoever makes the API change, to check the order in the other one as well.

FYI: @amaltaro @vkuznet @khurtado

@amaltaro
Contributor

I initially thought you were talking about order of run/lumis, but in the end I understand that you are actually talking about the order of data attributes that are returned, thus a potential swap between fileid, run, lumi in the returned data structure.
Yes, while this can always happen, I would say there is no reason to break an API that has been around for many years. In the end, this potential problem can happen with any API consumed by WMCore. So we should be good in this sense!

Development

Successfully merging this pull request may close these issues.

StepChainParentageFixTask ReqMgr2 thread failing to resolve parentage for 10 workflows