Check if the pool is assigned correctly #142
# Context
The tier0 collector, when queried for the schedds, also returns the global pool schedds, with some implications:
- The schedds from those collectors are queried twice. That implies the documents are processed twice (in most cases they are overwritten on ES, and the history is not affected, as the checkpoint is set by schedd name).
- There will be reports for the same jobs with different values of the CMS_Pool attribute (documents for the same batch are overwritten). Because I suspected something like this, I changed the order of the collectors in collectors.json in production two weeks ago, and that resulted in a reduction of the tier0 jobs in ES.
- This has been happening all along but became evident because of the CMS_Pool attribute.
- This also happens in the SI data in ES, e.g. [the same schedd is listed with two different pools](https://monit-kibana.cern.ch/kibana/goto/4236262fbdd611ec29f961f3f55ecb55) (they use the same strategy to add the `metadata.pool` value).
# Alternatives
There are several alternatives to fix this:
1. Rely on the order of the collectors.json file and deduplicate by schedd name. That is, the jobs in a schedd will belong to the first pool that claims it.
2. For the Tier0 collectors, query only for the schedds of type tier0schedd. However, I have seen that the global pool also has tier0schedd schedds, and the CERN (Tier0) pool also has production and CRAB schedds.
3. Instead of using the pool, use CMSGWMS_Type, which is a property of the schedd and takes the values `prodschedd`, `crabschedd`, `tier0schedd`, `cmsconnect`, and `institutionalschedd`.

All alternatives are easy to implement. Alternative 1 doesn't imply changes in the data, as the same schedds will be queried (but now only once). I tested alternative 2 and, with that filter, got some missing jobs (so there are schedds of other types listed only in the tier0 collectors). Alternative 3 would solve a different problem (and I'm not sure whether, given the type attribute, it would add any new information). So I'll go ahead with alternative 1 (see the sketch below) until we have a better solution.
cc: @leggerf
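A minimal sketch of alternative 1, assuming the HTCondor Python bindings and an illustrative collectors.json layout of the form `{pool_name: [collector_host, ...]}`; the function name and the file layout are assumptions for illustration, not the project's actual code:

```python
import json
import htcondor  # HTCondor Python bindings

def get_schedd_ads(collectors_file="collectors.json"):
    """Query each pool's collectors in file order and keep a schedd
    only for the first pool that claims it (alternative 1).
    File layout assumed: {pool_name: [collector_host, ...]}."""
    with open(collectors_file) as fd:
        pools = json.load(fd)  # key order defines pool priority

    seen, ads = set(), []
    for pool, hosts in pools.items():
        for host in hosts:
            collector = htcondor.Collector(host)
            for ad in collector.query(
                htcondor.AdTypes.Schedd,
                projection=["Name", "MyAddress", "CMSGWMS_Type"],
            ):
                name = ad.get("Name")
                if name in seen:
                    continue  # already claimed by an earlier pool
                seen.add(name)
                ad["CMS_Pool"] = pool  # pool label that ends up in the documents
                ads.append(ad)
    return ads
```

With this, a schedd visible from both the global and the Tier0 collectors keeps the CMS_Pool of whichever pool comes first in collectors.json, which matches the sensitivity to file order observed in production.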
Hi Christian,
Many thanks for your investigation. Your 1) seems reasonable, but I would like to get confirmation from the SI experts that we are implementing the right logic.
James, Antonio, can you please take a look and let us know your thoughts (and, by the way, it looks like the same issue also appears in the SI data).
cheers
Federica
Christian,
in my opinion we should solve the problem properly, and the only proper solution
is #3 in your list; all the others are conditional. We should target the general case,
where all use cases are addressed through a unique set of attribute values; in this
case the set of attributes includes the pool, the schedd property, and the type (and can be extended further).
It seems to me that we should keep them all, since only the combination of all of them
makes it possible to uniquely distinguish jobs.
That said, I'm not against implementing solution #1 first; I'm just saying that
eventually we should do things the right way, and solution #3 seems to be the proper
choice.
Best,
Valentin.
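As an illustration of Valentin's point, a minimal sketch (not code from the thread) of keeping all three attributes and distinguishing jobs by their combination; the collector host and the pool label are placeholders, and the constraint is plain ClassAd syntax:

```python
import htcondor

# Placeholder collector host; the pool label comes from our own
# collectors.json bookkeeping and is not stored in the schedd ad.
collector = htcondor.Collector("collector.example.cern.ch")
pool = "global"

ads = collector.query(
    htcondor.AdTypes.Schedd,
    constraint="CMSGWMS_Type isnt undefined",  # keep only typed schedds
    projection=["Name", "CMSGWMS_Type"],
)

# Only the combination of pool, schedd name, and schedd type
# uniquely identifies the origin of a job report.
keys = {(pool, ad.get("Name"), ad.get("CMSGWMS_Type")) for ad in ads}
```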
I agree with Valentin that #3 is the only solid action. But it does not solve the issue here.
See https://cern.service-now.com/service-portal/view-request.do?n=RQF1370606