The Shasta Heartbeat Tracking Service is a service which tracks the heartbeats emitted by various system components. Generally, compute nodes emit heartbeats to inform the HMS services that they are alive and healthy. Other components can also emit heartbeats if they so choose.
The Shasta Heartbeat Tracking Service uses a RESTful interface to provide an end point to which to send heartbeat packets.
The Shasta Heartbeat Tracking Service will track heartbeats received and check them against the time of the previous heartbeat received for a given component. Changes in heartbeat behavior will be communicated to the Hardware State manager:
- First time heartbeat received (HSM places component in READY state)
- Heartbeat missing -- warning situation. Component may be dead (HSM places component in READY state with a WARNING flag)
- Heartbeat missing -- alert situation. Component is dead (HSM places component in STANDBY state with an ALERT flag)
Normally hbtd runs on the SMS cluster as one or more Docker container instances managed by Kubernetes. It can also be run from a command shell for testing purposes.
hbtd's RESTful API is as follows:
/v1/heartbeat
POST a heartbeat.
/v1/params
GET or PATCH an hbdt operational parameter
See https://stash.us.cray.com/projects/HMS/repos/hms-hmi/browse/api/swagger.yaml for details on the hbtd RESTful API payloads and return values.
Usage: hbtd [options]
--help Help text.
--debug=num Debug level. (Default: 0)
--use_telemetry=yes|no Inject notifications into message.
bus. (Default: yes)
--telemetry_host=h:p:t Hostname:port:topic of telemetry service
--warntime=secs Seconds before sending a warning of
node heartbeat failure.
(Default: 10 seconds)
--errtime=secs Seconds before sending an error of
node heartbeat failure.
(Default: 30 seconds)
--interval=secs Heartbeat check interval.
(Default: 5 seconds)
--port=num HTTPS port to listen on. (Default: 28500)
--kv_url=url Key-Value service 'base' URL..
(Default: https://localhost:2379)
--sm_url=url State Manager 'base' URL.
(Default: http://localhost:27779/hsm/v2)
--sm_retries=num Number of State Manager access retries. (Default: 3)
--sm_timeout=secs State Manager access timeout. (Default: 10)
--nosm Don't contact State Manager (for testing).
Building hbtd after the Repo split
Starting hbtd:
./hbtd --sm_url=https://localhost:27999/hsm/smd/v1 --port=28501 --use_telemetry=no --kv_url="mem:"
From the root of this repo, build the docker container:
# docker build -t cray/cray-hbtd:test .
Then run (add -d
to the arguments list of docker run
to run in detached/background mode):
docker run -p 28500:28500 --name hbtd cray/hbtd:test
In addition to the service itself, this repository builds and publishes cray-hbtd-test images containing tests that verify HBTD on live Shasta systems. The tests are invoked via helm test as part of the Continuous Test (CT) framework during CSM installs and upgrades. The version of the cray-hbtd-test image (vX.Y.Z) should match the version of the cray-hbtd image being tested, both of which are specified in the helm chart for the service.
V1 Feature | V1+ Feature | XC Equivalent |
---|---|---|
/v1/heartbeat | /v1/heartbeat | HB Via HW to Node Ctlr |
/v1/params | /v1/params | - |
- | Ability to query service health | - |
- | Ability to dump service internals | - |
- Ability to receive heartbeats from components
- Ability to track component heartbeats
- Ability to notify Hardware State Manager if heartbeat status changes
- Ability to query service operating parameters
- Ability to modify service operating parameters
- Performance optimizations:
- Ability to "forget" components which are going away/powering down
- Add API to query service health and connectivity
- Add API to dump service internal
For more detailed information, see the Theory Of Operation document.