Global WorkQueue monitoring
This wiki describes which information is monitored in Global WorkQueue, which of it ends up in WMStats (the agentInfo couch view), and which is also pushed to the MonIT infrastructure with a different data structure to ease data aggregation and visualization in Kibana/Grafana.
This information is collected by a specific CMSWEB backend running a dedicated CherryPy thread, https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/HeartbeatMonitor.py, which runs every 10 minutes.
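The periodic behaviour can be sketched with a plain Python thread. This is a simplified stand-in for the actual CherryPy periodic task; `heartbeat_cycle` and its parameters are hypothetical names, not WMCore API:

```python
import threading

def heartbeat_cycle(stop_event, collect, interval=600):
    """Invoke collect() every `interval` seconds (600s = 10 minutes)
    until stop_event is set. Sketch of a periodic monitoring thread."""
    # Event.wait(timeout) returns False on timeout, True once the event is set,
    # so the loop runs collect() once per interval and exits promptly on stop.
    while not stop_event.wait(interval):
        collect()

# Example wiring (not started here):
# stop = threading.Event()
# threading.Thread(target=heartbeat_cycle, args=(stop, my_collector)).start()
```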
Global workqueue monitoring data is only collected for workflows that are active in the system, disregarding archived workflows. The most common metrics shown across these monitoring documents are:
- num_elem: amount/count of GQE - short for global workqueue elements - grouped by another metric.
- sum_jobs: sum of the estimated top level jobs ("Jobs" field from the GQE) present in the GQE that were grouped by another metric.
- max_jobs_elem: contains the largest number of jobs ("Jobs" field from the GQE) found in a single GQE out of all the GQE that were grouped by another metric.
With these 3 important metrics in mind, the following is what's collected under Global WorkQueue and uploaded to WMStats:
- workByStatus: a very basic global workqueue metric, grouping the amount of work by workqueue element status.
- workByStatusAndPriority: lists the amount of work in each GQE status and priority.
- workByAgentAndPriority: lists the amount of work in terms of GQE count, the sum of all their jobs and the maximum number of jobs in a single GQE. This information is available for every pair of agent_name (the agent processing those GQE; "AgentNotDefined" is used when the GQE has not been pulled down by any agent) and GQE priority.
- workByAgentAndStatus: lists the amount of work grouped by status and agent that has acquired those GQE.
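As a sketch of how the three common metrics could be derived for, e.g., workByStatus, consider the snippet below. The "Jobs" field matches the GQE field named above, while the "Status" key and the function name are illustrative assumptions; the real code lives in the HeartbeatMonitor module:

```python
from collections import defaultdict

def work_by_status(elements):
    """Group GQE dicts by status and compute num_elem, sum_jobs and
    max_jobs_elem per group (a sketch, not the actual WMCore code)."""
    summary = defaultdict(lambda: {"num_elem": 0, "sum_jobs": 0, "max_jobs_elem": 0})
    for elem in elements:
        group = summary[elem["Status"]]
        group["num_elem"] += 1                                   # count of GQE
        group["sum_jobs"] += elem["Jobs"]                        # estimated jobs
        group["max_jobs_elem"] = max(group["max_jobs_elem"], elem["Jobs"])
    return dict(summary)
```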
Metrics for work assigned to specific locations are also collected and we can distinguish those in two ways:
- possible vs unique: "possible" jobs assume that all jobs in a GQE can run at every single site that is in the SiteWhitelist and not in the SiteBlacklist, while "unique" evenly distributes the jobs in a GQE among all those sites (SiteWhitelist - SiteBlacklist).
- AAA: this metric assumes that jobs can run on any resource in the SiteWhitelist - SiteBlacklist. Metrics without AAA, on the other hand, use the possibleSites() function, which also evaluates data locality (including parent and pileup data, if needed) in addition to the site white and black lists.
- uniqueJobsPerSite: reports the amount of unique jobs and workqueue elements for GQE in one of the following statuses: Available, Acquired and Negotiating, considering data locality constraints and evenly distributing the jobs among the final list of possible sites. E.g., a workflow containing a single GQE with 500 jobs and assigned to FNAL and CERN would be reported as 250 jobs/1 GQE for FNAL and 250 jobs/1 GQE for CERN.
- possibleJobsPerSite: reports the amount of possible jobs and workqueue elements for GQE in one of the following statuses: Available, Acquired and Negotiating, considering data locality constraints and assuming the whole SiteWhitelist - SiteBlacklist could run all those jobs. Data is grouped by status and site. E.g., a workflow containing a single GQE with 500 jobs and assigned to FNAL and CERN would be reported as 500 jobs/1 GQE for FNAL and 500 jobs/1 GQE for CERN.
- possibleJobsPerSiteAAA: reports the amount of possible jobs and workqueue elements for GQE in one of the following statuses: Available, Acquired and Negotiating, assuming the whole SiteWhitelist - SiteBlacklist could run all those jobs. Information is grouped by status and site.
- uniqueJobsPerSiteAAA: reports the amount of unique jobs and workqueue elements for GQE in one of the following statuses: Available, Acquired and Negotiating, assuming any site in the SiteWhitelist - SiteBlacklist is capable of running those jobs, which get evenly distributed among the final list of possible sites.
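The difference between "possible" and "unique" counting for a single GQE can be sketched as follows. The function name is hypothetical and the even split is shown as plain division, without guessing the rounding behaviour of the real code:

```python
def jobs_per_site(gqe_jobs, sites, unique=False):
    """Return a {site: jobs} map for one GQE (illustrative sketch).
    'possible': every site could run all the jobs of the GQE.
    'unique': jobs are spread evenly across the candidate sites."""
    if unique:
        share = gqe_jobs / len(sites)
        return {site: share for site in sites}
    return {site: gqe_jobs for site in sites}
```

With the wiki's example of one GQE with 500 jobs assigned to FNAL and CERN, "possible" yields 500 jobs per site while "unique" yields 250 per site.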
Last but not least, there is the basic information collected for any WMCore service: agent_url, agent_version, total_query_time (how long it took to collect all these metrics from the database), down_components (the list of components/threads that are down) and timestamp.
The monitoring information posted to the MonIT systems comes from the same metrics that are posted to WMStats, so the metric names referenced below are described above. This wiki also shows a sample of each document posted to AMQ/ElasticSearch, to make it easier to look them up in ES via Kibana/Grafana. In addition to the payload, every single document posted to MonIT has the following key/value pairs:
```
{"agent_url": "reqmgr2",
 "timestamp": 12345}
```
where timestamp is set just before uploading the documents to AMQ (note that timestamp is placed under data.metadata by the AMQ client). The actual monitoring data is always available under data.payload, so an example ES query for a ReqMgr2 metric would be:
```
data.payload.agent_url:reqmgr2 AND data.payload.type:reqmgr2_status
```
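A minimal sketch of how a payload could be stamped just before upload (the AMQ client then nests timestamp under data.metadata; the function name and the setdefault behaviour are illustrative assumptions):

```python
import time

def stamp_for_monit(payload):
    """Attach agent_url/timestamp to a metric payload just before
    sending it to AMQ (sketch, not the actual WMCore uploader)."""
    doc = dict(payload)                      # avoid mutating the caller's dict
    doc.setdefault("agent_url", "reqmgr2")   # common key on every document
    doc["timestamp"] = int(time.time())      # set at upload time
    return doc
```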
- requestsByStatus: represented by the reqmgr2_status document type. A single document per status is created every cycle, regardless of whether there are any workflows in that status.
```
"payload": {
    "agent_url": "reqmgr2",
    "type": "reqmgr2_status",
    "request_status": "aborted-completed",
    "num_requests": 0
}
```
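The "one document per status, even when the count is zero" behaviour can be sketched like this. The field names follow the sample above; the helper name, the workflow dict layout and the explicit status list are illustrative assumptions:

```python
from collections import Counter

def requests_by_status(workflows, known_statuses):
    """Emit one reqmgr2_status payload per known status every cycle,
    including statuses with zero matching workflows (sketch)."""
    counts = Counter(wf["RequestStatus"] for wf in workflows)
    return [{"agent_url": "reqmgr2",
             "type": "reqmgr2_status",
             "request_status": status,
             "num_requests": counts.get(status, 0)}  # 0 when status is empty
            for status in known_statuses]
```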
- requestsByStatusAndCampaign: represented by the reqmgr2_campaign document type. Creates a document for each combination of request_status AND campaign every cycle. Statuses without any workflows (Campaign data) are skipped.
```
"payload": {
    "agent_url": "reqmgr2",
    "type": "reqmgr2_campaign",
    "request_status": "assignment-approved",
    "num_requests": 7,
    "campaign": "HG1804_Validation"
}
```
- requestsByStatusAndNumEvts: represented by the reqmgr2_events document type. A single document per status is created every cycle, regardless of whether there are any workflows (RequestNumEvents) in that status.
```
"payload": {
    "agent_url": "reqmgr2",
    "type": "reqmgr2_events",
    "total_num_events": 100000,
    "request_status": "running-closed"
}
```
- requestsByStatusAndPrio: represented by the reqmgr2_prio document type. Creates a document for each combination of request_status AND request_priority every cycle. Statuses without any workflows (RequestPriority data) are skipped.
```
"payload": {
    "agent_url": "reqmgr2",
    "type": "reqmgr2_prio",
    "request_status": "failed",
    "request_priority": 600000,
    "num_requests": 2
}
```
- basic ReqMgr2 information (as described for WMStats) is not currently posted to MonIT.