-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Null parentage information for RunLumi pairs missing at the parent Dataset #11520
Adding Null parentage information for RunLumi pairs missing at the parent Dataset #11520
Conversation
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I"m not convinced that current implementation is the best one. What I see is that you need a flat structure like fileId, runNumber, lumiNumber
and instead use dict structures. Instead, I rather prefer to discuss proper DBS API for that and use such flat stream of data in a code, instead of many nested loops and list of dictionaries.
hi @vkuznet , and to address your next comment:
Yes we do need a flat structure of the kind. And initially I implemented it with just two sets of I was extremely careful on time complexity this time, exactly because I was aware of all this. So if you look deeper at the end the complexity here in the worst case scenario is not as you state O(n^4) (when At the end I reused what was there from the previous implementation and made the best to make the code more readable, but the idea here is mostly not to make code optimizations but rather make it more readable, because it was really difficult to understand what was going on while I was reading it for the firs time. And to make the data structure for (file, run, lumi) better and more flexible to work with. We can also enforce type checks as it is right now... etc. |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I left a few questions and comments along the code for your consideration.
One of my concerns though is that we are duplicating the run/lumi information, which will increase the memory footprint of this thread, which already is not great.
@@ -9,6 +9,7 @@ | |||
|
|||
from builtins import object, str, bytes | |||
from future.utils import viewitems | |||
from typing import NamedTuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering whether it should be using the collections
library, but it looks like both provide this data type and they are practically the same. Just a note :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NamedTuple
from typing
is typified wrapper for namedtupple
from collections
. I deliberately used this one because it allows you to have type annotations with defaults if you decide. And also you may redefine methods if you want to (except __init__
or __new__
, though - for them you need some cumbersome inheritance). And, mostly, it makes things much much more readable. This is a properly defined data structure of the size of a standard tuple but with named fields, and it is also immutable - hence hashable. It has all we need.
fileId: int | ||
run: int | ||
lumi: int | ||
# def __eq__(frl1, frl2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't need to override this method, then please remove.
@@ -97,6 +119,9 @@ def __init__(self, url, logger=None, parallel=None, **contact): | |||
msg += "%s\n" % formatEx3(ex) | |||
raise DBSReaderError(msg) from None | |||
|
|||
# A type definition visible only inside the DBS3 Class | |||
# NOTE: thes may also go in the global scope in case we decide it won't break anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say these 2 comments could be removed.
for fileId in childBlockInfo[0]: | ||
for runLumiPair in childBlockInfo[0][fileId]: | ||
frlObj = FileRunLumi(fileId=fileId, run=runLumiPair[0], lumi=runLumiPair[1]) | ||
childFlatData[(frlObj.run, frlObj.lumi)] = frlObj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are duplicating every single run/lumi pair here both in the key and in the value. Did you check the memory impact with this implementation? Or is there any good reason to keep run/lumi in both places?
Thanks @amaltaro for your feedback
I wouldn't be concerned about that, since the dictionary overhead is much higher. Here is a measurement done with the two data formats -
And just for clarity which is which here are two items popped out from each data structure:
I did use the simplest way to measure this ( |
@todor-ivanov 35k lumis (tuples) will likely not be substantial to spot memory footprint differences. I think we would need something with around half million lumis. In addition, or something like
|
thanks for the idea @amaltaro I did follow your suggestion to iterate through the whole object, and you will be surprised by the results. (I also took a parent dataset of size 113K Lumis, the biggest one I could find - taken from the O&C meeting's plots):
|
You can clearly see that old dictionary is using more volume than the one with the
In case we'd want to make things more equal and measure only the difference coming from the additional two integers... we should redefine the old format to use tuples as keyNames as well. And here is exactly what I did:
Now better comparing the two:
Now we get the expected raise - but again it is insignificant in comparison to the contribution from the keyNames sizes. Taking as base for comparission the lowest possible size of 21MBytes per 113K lumis we have:
|
Thank you for making these tests and providing these numbers, Todor. The summary I get then is that, using the same set/tuple structure with: there is an increase of almost 40% in the memory footprint, for 325k lumis, taken from:
We already have spikes of almost 4GB for this reqmgr2-tasks service (where one of the CP threads is this StepChain parentage): So I fear that this potential 40% increase might be just too much. |
@amaltaro this statement is incorrect:
Because the current datastructure size is not 21045940 So I'd say the so offered data structure of dictionary of named tuples already improves the situation (decreases the memory footprint) rather than worsen it. |
oh .. actually you do compare apples with apples here:
Sorry for missing that you are referring to the case with (tuples) keyNames in both cases - which is indeed what we need to do and what I was also trying to stress. So do you offer to just squeeze this to the maximum and use |
Yes Todor. Even though the code is not going to be most readable, it does save quite a lot of memory (even more if compared to the current implementation in WMCore). Maybe we take this opportunity to document something either in the StepChainParentage module, or in a wiki, such that next time we need to investigate it, it does not take too long to learn what and how exactly this code operates. What do you think? |
Hi @amaltaro,
With my latest commit I made my best to improve both:
And I did Not test the difference in performance but it must be working at least as good as before, if not even better. Please take a look again. And here is the result of one single call (without actually writing to DBS).
And here is another call with the actual new error returned from DBS, given that we now provide
|
Jenkins results:
|
Jenkins results:
|
…rent Dataset. Fix docstring for getChildBlockTrio. Change data structure from dict of NamedTuples back to dict of integers && Fix dict keys from frozenSets to tuples. Remove obsolete method getParentDatasetTrioOld.
5b4cc8d
to
4370f0e
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, thanks!
Valentin, I think Todor answered your concern already - and I don't see a complexity increase either - but if we missed anything, please leave your message for further discussion.
@todor-ivanov we must test this functionality in testbed (by announcing a StepChain workflow) to avoid potential surprises in production.
I am merging it now such that we can test this in a dev cluster if we want (also to test things with the open MSTransferor PR), under 2.2.0rc5
Thanks @amaltaro One thing I see now I've missed to mention before, but I think is important to note now: When moving the dictionary keys from here:
and here:
But.... So if anybody touching these queries decides for some unknown reason to revert the order of I suppose this is a matter of API persistence - once you announce that an API is about to return the result in a specified form you do not change the format from there on until you announce it to the end users to check and take care if they would be affected by a supposed change ... but still decided to make it clear, so if in the future a change to any of those two APIs happens we should remember to ask the DBS developer or whom ever makes this API change to check the order in the other one as well. |
I initially thought you were talking about order of run/lumis, but in the end I understand that you are actually talking about the order of data attributes that are returned, thus a potential swap between fileid, run, lumi in the returned data structure. |
Fixes #11260
Status
Ready
Description
With the current PR we try to add Null information for the RunLumi pairs that are missing at the parent dataset by adding the following improvements to the current code:
NamedTuple
containing the tripletfileId, run, lumi
run/lumi
pairsrun/lumi
pairs withNull
parentage information so that the whole length of the block information uploaded to DBS to be pereservedIs it backward compatible (if not, which system it affects?)
No
Related PRs
None
External dependencies / deployment changes
DBS needs to change their code in order to accept `blocks with partially resolved rentage information.
The relevant DBS issue is:
dmwm/dbs2go#94