
Check if the pool is assigned correctly #142

Open
cronosnull opened this issue Nov 21, 2019 · 5 comments

@cronosnull
Collaborator

See https://cern.service-now.com/service-portal/view-request.do?n=RQF1370606

@cronosnull
Collaborator Author

Context

The tier0 collector, when queried for the schedds, also returns the global pool schedds, with several implications:

  • The schedds from those collectors are queried twice, which means the documents are processed twice (in most cases they are simply overwritten on ES, and the history is not affected since the checkpoint is set by schedd name).
  • Reports exist for the same jobs with different values of the CMS_Pool attribute (documents for the same batch overwrite each other). Because I suspected something like this, I changed the order of the collectors in collectors.json in production two weeks ago, and that resulted in a reduction of the tier0 jobs in ES.
  • This has been happening all along but only became evident because of the CMS_Pool attribute.
  • The same thing happens in the SI data in ES, e.g. the same schedd is listed under two different pools (they use the same strategy to set the metadata.pool value).

Alternatives

There are several alternatives to fix this:

  1. Rely on the order of the collectors.json file and deduplicate by schedd name, i.e. the jobs in a schedd belong to the first pool that claims it.
  2. For the Tier0 collectors, query only the schedds of type tier0schedd. However, I have seen that the global pool also has tier0schedd schedds, and the CERN (Tier0) pool also has production and crab schedds.
  3. Instead of using the pool, use CMSGWMS_Type, which is a property of the schedd and takes the values prodschedd, crabschedd, tier0schedd, cmsconnect, and institutionalschedd.

All alternatives are easy to implement. Alternative 1 doesn't imply changes in the data, since the same schedds will be queried (but now only once). I tested alternative 2 and, with that filter, some jobs go missing (so there are schedds of other types listed only in the tier0 collectors). Alternative 3 solves a different problem (and I'm not sure that, once we have the type attribute, it will give any new information). So I'll go ahead with alternative 1 until we have a better solution; a rough sketch of the deduplication is below.
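A rough sketch of what alternative 1 could look like, assuming the htcondor Python bindings and a collectors.json that maps pool names to collector addresses in priority order (the pool names and addresses below are placeholders, not the production configuration):

```python
import htcondor

# Placeholder priority-ordered pools; the real entries live in collectors.json.
collectors = {
    "Global": ["vocms0815.cern.ch:9620"],
    "Tier0": ["tier0-collector.cern.ch:9620"],
}

claimed = set()        # schedd names already claimed by an earlier pool
schedds_by_pool = {}   # pool name -> schedd ads it owns after deduplication

for pool, addresses in collectors.items():
    schedds_by_pool[pool] = []
    for address in addresses:
        coll = htcondor.Collector(address)
        # Third positional argument is the attribute projection.
        for ad in coll.query(htcondor.AdTypes.Schedd, "true", ["Name", "MyAddress"]):
            name = ad.get("Name")
            if name and name not in claimed:
                claimed.add(name)                # first pool in the file wins
                schedds_by_pool[pool].append(ad)
```

With this in place each schedd is queried exactly once, under whichever pool appears first in the file.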

cc: @leggerf

@leggerf
Contributor

leggerf commented Nov 28, 2019 via email

@vkuznet
Contributor

vkuznet commented Nov 28, 2019 via email

@belforte
Member

I agree with Valentin that #3 is the only solid action. But it does not solve the issue here.
That's because pool is not a "solid" attribute. A schedd can be moved from one pool to another by a configuration change, keeping CMSGWMS_Type the same. So this is fully in SI hands, since they are the ones who can set and enforce a policy about using CMSGWMS_Type to convey pool information as well, add another ad-hoc property to the schedd ClassAds, or point to some other already existing ClassAd attribute.
IMHO the "solid" way to connect a schedd to a pool is via its CollectorHost attribute, e.g.

CollectorHost = "vocms0815.cern.ch:9620,cmssrv623.fnal.gov:9620"

BUT

  1. this should come from SI, not me, even if I happen to know something about HTCondor
  2. SI are the ones who say which collector hosts belong to which pool. In the end, from the HTCondor point of view a pool is identified by its collector(s), while the naming "global", "ITB", "T0" etc. is purely CMS jargon which never appears inside the HTCondor config. (A sketch of such a host-to-pool mapping follows.)
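A minimal sketch of that CollectorHost-based association, assuming the htcondor Python bindings; the host-to-pool table is illustrative only, and the authoritative mapping would have to come from SI:

```python
import htcondor

# Illustrative entries only; which collector belongs to which pool is up to SI.
POOL_BY_COLLECTOR = {
    "vocms0815.cern.ch": "global",
    "cmssrv623.fnal.gov": "global",
}

def pool_of_schedd(schedd_ad):
    """Map a schedd ad to a pool name via its CollectorHost attribute."""
    for entry in schedd_ad.get("CollectorHost", "").split(","):
        host = entry.strip().split(":")[0]   # drop the port
        if host in POOL_BY_COLLECTOR:
            return POOL_BY_COLLECTOR[host]
    return "unknown"

# Example: classify every schedd advertised to one collector (address is a placeholder).
coll = htcondor.Collector("vocms0815.cern.ch:9620")
for ad in coll.query(htcondor.AdTypes.Schedd, "true", ["Name", "CollectorHost"]):
    print(ad.get("Name"), pool_of_schedd(ad))
```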

@bbockelm
Collaborator

  • I would suggest doing a sanity deduplication of schedd names in any case, regardless of the pool name. Seems like good hygiene.
  • I suggest thinking about what the CMS_Pool attribute should mean (if you decide it is worth having...). It's likely not a per-schedd attribute but rather a per-job attribute (where did the job run? Where is it running?). The proper way to get that is from the matched startd, ensuring it is added to the job ad (a sketch of one way to do this follows).
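A minimal sketch of that idea, assuming the htcondor Python bindings and that the schedds copy a startd attribute into each job ad via SYSTEM_JOB_MACHINE_ATTRS; the startd attribute name used here is an example, not necessarily what SI would choose:

```python
import htcondor

STARTD_ATTR = "GLIDEIN_CMSSite"                   # example attribute advertised by the matched startd
JOB_ATTR = "MachineAttr{0}0".format(STARTD_ATTR)  # name HTCondor gives the copied value in the job ad

schedd = htcondor.Schedd()                        # local schedd; use a located ad for a remote one
# Second positional argument is the attribute projection.
for job in schedd.query("true", ["GlobalJobId", JOB_ATTR]):
    print(job.get("GlobalJobId"), job.get(JOB_ATTR, "unknown"))
```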
