Refine definition for *.limit metric name suffix -or- update current usage to match definition #438
Comments
cc @bertysentry since you initially proposed most of the hardware semantic conventions.
That's an excellent question! TBH, I had not read the exact definition of the I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. My opinion is that this definition is too restrictive. My recommendation is to update the definition of the
For example, with GPUs and their memory: there's the physical amount of memory, how much of that the kernel exposes to user space (after deducting its own overheads), and how much of that the user-space API exposes to applications. In OneAPI Level-Zero Sysman API, first 2 limits are named as In OpenCL API, first and last are named as
Conclusion to @arminru's question: option 1 (adapt the definition of `limit` to allow for both use cases or interpretations).
In #409 (comment) it was brought up that certain metrics with the `.limit` suffix are defined as Gauges, whereas most others are defined as UpDownCounters.

The UpDownCounters are consistent with our current definition for `.limit` (see https://github.com/open-telemetry/semantic-conventions/blob/v1.22.0/docs/general/metrics.md#instrument-naming).

One can sum up the existing memory, disk space, network bandwidth, or power supply within a given system or compositions of them and get a meaningful aggregate representing the "total amount" available.
The Gauge metrics, however, don't represent an available "total amount". One cannot add the maximum permissible temperature (°C) over multiple components, battery charge fraction for stable operation (%) over multiple batteries, or permissible voltage (V) over multiple components. The aggregated sum breaks the definition and expectation for the individual metric observations.
Two CPUs that can sustain 100 °C each, for example, won't sustain 200 °C together (or 40 °C on one and 160 °C on the other). Three SSDs that operate at 3.3 V won't tolerate 9.9 V on the shared power supply. Nor is a maximum charge level of 300% for three (potentially different) batteries a helpful aggregation.
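To make the difference concrete, here is a minimal sketch of the two kinds of limits. The component names and values are made up for illustration, and taking the minimum for threshold-style limits is only one possible assumption about what a sensible aggregate could be; it is not something the conventions define.

```python
# Hypothetical per-component limit observations (illustrative values only).
memory_limit_bytes = {"dimm0": 8 * 2**30, "dimm1": 8 * 2**30}
temperature_limit_celsius = {"cpu0": 100.0, "cpu1": 100.0}


def total_limit(observations):
    """Sum limits that represent a "total amount" (UpDownCounter-style)."""
    return sum(observations.values())


def strictest_limit(observations):
    """Aggregate threshold-style limits (Gauge-style) by taking the minimum.

    This is an assumption made for the sketch: the minimum at least stays in
    the same unit and range as the individual observations, unlike a sum.
    """
    return min(observations.values())


# Summing capacities gives a meaningful system-wide total: 16 GiB.
system_memory_limit = total_limit(memory_limit_bytes)

# Summing thresholds gives 200.0, which no component can actually sustain.
naive_temperature_sum = total_limit(temperature_limit_celsius)

# The strictest threshold is still a valid temperature limit: 100.0.
system_temperature_limit = strictest_limit(temperature_limit_celsius)
```

The same `total_limit` call is valid for memory and misleading for temperature, which is the mismatch between the two instrument types described above.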
I think our options to resolve this are:
1. Adapt the definition of `limit` to allow for both use cases or interpretations.
   We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.
2. Keep the current definition of `limit`, introduce a new, well-known suffix for the non-aggregatable limits, and change the current Gauge metrics to use this suffix instead.
3. Keep the current definition of `limit` and change the current Gauge metrics to use some other suffix that's not defined by our naming conventions.

I'm looking for feedback on which direction we should pursue and potential suggestions for the respective naming/wording.