
Check if the pool is assigned correctly #142

Open
cronosnull opened this issue Nov 21, 2019 · 5 comments

@cronosnull
Collaborator

See https://cern.service-now.com/service-portal/view-request.do?n=RQF1370606

@cronosnull
Collaborator Author

Context

The tier0 collector, when queried for the schedds, also returns the global pool schedds, with several implications:

  • The schedds from those collectors are queried twice, which means the documents are processed twice (in most cases they are simply overwritten on ES, and the history is not affected since the checkpoint is set by schedd name).
  • Reports exist for the same jobs with different values of the CMS_Pool attribute (documents for the same batch overwrite each other). Because I suspected something like this, I changed the order of the collectors in collectors.json in production two weeks ago, and that resulted in a reduction of the tier0 jobs in ES.
  • This has been happening all along but only became evident because of the CMS_Pool attribute.
  • The same thing happens in the SI data in ES, e.g. the same schedd is listed under two different pools (they use the same strategy to set the metadata.pool value).

Alternatives

There are several alternatives to fix this:

  1. Rely on the order of the collectors.json file and deduplicate by schedd name, i.e. the jobs in a schedd belong to the first pool that claims it.
  2. For the Tier0 collectors, query only the schedds of type tier0schedd. However, I have seen that the global pool also has tier0schedd schedds, and the CERN (Tier0) pool also has production and crab schedds.
  3. Instead of using the pool, use CMSGWMS_Type, which is a property of the schedd and takes the values prodschedd, crabschedd, tier0schedd, cmsconnect, and institutionalschedd.

All alternatives are easy to implement. Alternative 1 doesn't imply changes in the data, since the same schedds will be queried (but now only once). I tested alternative 2 and, with that filter, some jobs go missing (so there are schedds of other types listed only in the tier0 collectors). Alternative 3 solves a different problem (and I'm not sure that, once we have the type attribute, it will give any new information). So I'll go ahead with alternative 1 until we have a better solution; a rough sketch of the deduplication is below.
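A rough sketch of what alternative 1 could look like, assuming the htcondor Python bindings and a collectors.json that maps pool names to collector addresses in priority order (the pool names and addresses below are placeholders, not the production configuration):

```python
import htcondor

# Placeholder priority-ordered pools; the real entries live in collectors.json.
collectors = {
    "Global": ["vocms0815.cern.ch:9620"],
    "Tier0": ["tier0-collector.cern.ch:9620"],
}

claimed = set()        # schedd names already claimed by an earlier pool
schedds_by_pool = {}   # pool name -> schedd ads it owns after deduplication

for pool, addresses in collectors.items():
    schedds_by_pool[pool] = []
    for address in addresses:
        coll = htcondor.Collector(address)
        # Third positional argument is the attribute projection.
        for ad in coll.query(htcondor.AdTypes.Schedd, "true", ["Name", "MyAddress"]):
            name = ad.get("Name")
            if name and name not in claimed:
                claimed.add(name)                # first pool in the file wins
                schedds_by_pool[pool].append(ad)
```

With this in place each schedd is queried exactly once, under whichever pool appears first in the file.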

cc: @leggerf

@leggerf
Contributor

leggerf commented Nov 28, 2019 via email

@vkuznet
Contributor

vkuznet commented Nov 28, 2019 via email

@belforte
Member

I agree with Valentin that #3 is the only solid action. But it does not solve the issue here.
That's because pool is not a "solid" attribute. A schedd can be moved from one pool to another by a configuration change, keeping CMSGWMS_Type the same. So this is fully in SI hands, since they are the ones who can set and enforce a policy about using CMSGWMS_Type to convey pool information as well, add another ad-hoc property to the schedd ClassAds, or point to some other already existing ClassAd attribute.
IMHO the "solid" way to connect a schedd to a pool is via its CollectorHost attribute, e.g.

CollectorHost = "vocms0815.cern.ch:9620,cmssrv623.fnal.gov:9620"

BUT

  1. this should come from SI, not me, even if I happen to know something about HTCondor
  2. SI are the ones who say which collector hosts belong to which pool. In the end, from the HTCondor point of view a pool is identified by its collector(s), while the naming "global", "ITB", "T0" etc. is purely CMS jargon which never appears inside the HTCondor config. (A sketch of such a host-to-pool mapping follows.)
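A minimal sketch of that CollectorHost-based association, assuming the htcondor Python bindings; the host-to-pool table is illustrative only, and the authoritative mapping would have to come from SI:

```python
import htcondor

# Illustrative entries only; which collector belongs to which pool is up to SI.
POOL_BY_COLLECTOR = {
    "vocms0815.cern.ch": "global",
    "cmssrv623.fnal.gov": "global",
}

def pool_of_schedd(schedd_ad):
    """Map a schedd ad to a pool name via its CollectorHost attribute."""
    for entry in schedd_ad.get("CollectorHost", "").split(","):
        host = entry.strip().split(":")[0]   # drop the port
        if host in POOL_BY_COLLECTOR:
            return POOL_BY_COLLECTOR[host]
    return "unknown"

# Example: classify every schedd advertised to one collector (address is a placeholder).
coll = htcondor.Collector("vocms0815.cern.ch:9620")
for ad in coll.query(htcondor.AdTypes.Schedd, "true", ["Name", "CollectorHost"]):
    print(ad.get("Name"), pool_of_schedd(ad))
```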

@bbockelm
Collaborator

  • I would suggest doing a sanity deduplication of schedd names in any case, regardless of the pool name. Seems like good hygiene.
  • I suggest thinking about what the CMS_Pool attribute should mean (if you decide it is worth having...). It's likely not a per-schedd attribute but rather a per-job attribute (where did the job run? Where is it running?). The proper way to get that is from the matched startd, ensuring it is added to the job ad (a sketch of one way to do this follows).
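A minimal sketch of that idea, assuming the htcondor Python bindings and that the schedds copy a startd attribute into each job ad via SYSTEM_JOB_MACHINE_ATTRS; the startd attribute name used here is an example, not necessarily what SI would choose:

```python
import htcondor

STARTD_ATTR = "GLIDEIN_CMSSite"                   # example attribute advertised by the matched startd
JOB_ATTR = "MachineAttr{0}0".format(STARTD_ATTR)  # name HTCondor gives the copied value in the job ad

schedd = htcondor.Schedd()                        # local schedd; use a located ad for a remote one
# Second positional argument is the attribute projection.
for job in schedd.query("true", ["GlobalJobId", JOB_ATTR]):
    print(job.get("GlobalJobId"), job.get(JOB_ATTR, "unknown"))
```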
