
OTLP -> Prometheus: Instrumentation Scope needs clarification #3778

Open
fstab opened this issue Nov 28, 2023 · 5 comments

fstab commented Nov 28, 2023

While implementing open-telemetry/opentelemetry-java#6015 I found some corner cases when mapping the OTel instrumentation scope to Prometheus. I think it would be worthwhile to clarify these cases in the Prometheus and OpenMetrics Compatibility spec.

Good Case Example

Let's say we have two counters named processing.time: one defined with scopeA and one with scopeB. In Prometheus these two counters get merged into a single metric family with different otel_scope_... labels:

# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
# HELP processing_time_seconds processing time in seconds
processing_time_seconds_total{a="b",otel_scope_name="scopeA",otel_scope_version="1.1"} 3.3
processing_time_seconds_total{a="b",otel_scope_name="scopeB",otel_scope_version="1.2"} 3.3
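
For concreteness, here is a minimal sketch of how this situation can be produced with the opentelemetry-java SDK and its Prometheus exporter (the class name and port are arbitrary choices for this example):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;

public class ScopeMergeExample {
  public static void main(String[] args) throws InterruptedException {
    // Expose metrics on http://localhost:9464/metrics (arbitrary port for this sketch).
    SdkMeterProvider meterProvider = SdkMeterProvider.builder()
        .registerMetricReader(PrometheusHttpServer.builder().setPort(9464).build())
        .build();

    // Same instrument name, two different instrumentation scopes.
    Meter meterA = meterProvider.meterBuilder("scopeA").setInstrumentationVersion("1.1").build();
    Meter meterB = meterProvider.meterBuilder("scopeB").setInstrumentationVersion("1.2").build();

    DoubleCounter counterA = meterA.counterBuilder("processing.time").ofDoubles()
        .setUnit("s").setDescription("processing time in seconds").build();
    DoubleCounter counterB = meterB.counterBuilder("processing.time").ofDoubles()
        .setUnit("s").setDescription("processing time in seconds").build();

    Attributes attrs = Attributes.of(AttributeKey.stringKey("a"), "b");
    counterA.add(3.3, attrs);
    counterB.add(3.3, attrs);

    Thread.sleep(Long.MAX_VALUE); // keep the endpoint up for scraping
  }
}
```

Scraping the endpoint should then show both scopes' series under a single processing_time_seconds metric family, distinguished only by otel_scope_name / otel_scope_version, as in the output above.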

This good case is trivial because the metadata for the counters is exactly the same. The corner cases arise when there are differences.

What if the TYPE differs?

Thoughts:

  • TYPE refers to the Prometheus type. Metrics may have different types in OpenTelemetry but the same type in Prometheus. If the Prometheus type is the same, the metrics can be merged. Example: DOUBLE_GAUGE and a non-monotonic LONG_SUM are both GAUGE in Prometheus, hence they can be merged (see the sketch after this list).
  • Note that HISTOGRAM and EXPONENTIAL_HISTOGRAM are also the same type in Prometheus, so they can be merged.
  • The "Prometheus type" depends on the exposition format: For example, an INFO metric in OpenMetrics format becomes a GAUGE metric in Prometheus Text format. I don't think that's relevant here, because OTel can only produce COUNTER, GAUGE, HISTOGRAM, and SUMMARY. However, if OTel can produce more types we need to keep in mind that the other types depend on the exposition format.

What to do if the type differs?

  • Merge the metric anyway and use type UNKNOWN. This should work for all metrics except exponential histograms.
  • Drop one of the metrics and log a warning.
  • Drop both metrics and log a warning.
  • Respond to the scrape request with an HTTP 500 error.

My opinion is that responding to the scrape request with HTTP 500 is better than dropping metrics, because then the Prometheus server will show an issue with scraping and users can fix it, rather than having a metric silently go missing.

What if the HELP text differs?

  • Keep only the first HELP text?
  • Concatenate both HELP texts?
  • Drop HELP?

What if the UNIT differs?

I don't think metrics with different units can be merged, but how should the error be handled?

  • Drop one of the metrics and log a warning.
  • Drop both metrics and log a warning.
  • Respond to the scrape request with an HTTP 500 error.

Curious what you think.

fstab added the spec:miscellaneous label (for issues that don't match any other spec label) on Nov 28, 2023
dashpole commented:

Current Spec

The specification currently says:

Prometheus SDK exporters MUST NOT allow duplicate UNIT, HELP, or TYPE comments for the same metric name to be returned in a single scrape of the Prometheus endpoint. Exporters MUST drop entire metrics to prevent conflicting TYPE comments, but SHOULD NOT drop metric points as a result of conflicting UNIT or HELP comments. Instead, all but one of the conflicting UNIT and HELP comments (but not metric points) SHOULD be dropped. If dropping a comment or metric points, the exporter SHOULD warn the user through error logging.

So, for conflicting unit or help, exporters SHOULD keep one of them. I believe all SDKs that have implemented this currently keep the first HELP or UNIT, and cache it to ensure it doesn't change between scrapes.

For conflicting (prometheus) TYPE, we currently drop the metric.
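
A rough sketch of how an exporter might implement that keep-first / drop behavior (class and method names are illustrative, not from any SDK; the spec mandates the behavior, not this mechanism, and caching the type across scrapes is a simplification):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: remember the first HELP/TYPE seen per metric name so the exposed
// comments stay stable across scrapes; warn on HELP/UNIT conflicts, drop on TYPE conflicts.
final class MetadataCache {
  private final Map<String, String> helpByName = new ConcurrentHashMap<>();
  private final Map<String, String> typeByName = new ConcurrentHashMap<>();

  // Keep the first HELP (the same logic would apply to UNIT).
  String resolveHelp(String metricName, String help) {
    String first = helpByName.putIfAbsent(metricName, help);
    if (first != null && !first.equals(help)) {
      System.err.println("Conflicting HELP for " + metricName + ", keeping the first one");
    }
    return first != null ? first : help;
  }

  // Returns false if the metric must be dropped because its Prometheus type conflicts.
  boolean checkType(String metricName, String prometheusType) {
    String first = typeByName.putIfAbsent(metricName, prometheusType);
    if (first != null && !first.equals(prometheusType)) {
      System.err.println("Conflicting TYPE for " + metricName + ", dropping the metric");
      return false;
    }
    return true;
  }
}
```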

Differences

To summarize the differences between the current spec and your comment:

  1. UNIT conflict:
    • Current: Merge
    • Proposed: Drop/Fail
  2. How to "fail" when irreconcilable conflicts are found:
    • Current: Drop metric(s) and log a warning
    • Proposed: Fail the scrape (500 response)

Thoughts

Thankfully, unit conflicts should be difficult to hit, since we add a unit suffix when the unit is present, and otherwise discourage units in metric names. So to get a conflict, you would have to do something like the following (see the name-mapping sketch after these bullets):

  • name: my.metric.seconds, unit: (empty) -> my_metric_seconds
  • name: my.metric, unit: s -> my_metric_seconds
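
A minimal sketch of the suffixing that makes this hard to hit (the helper names are hypothetical, and the real name mapping in the spec has more rules, e.g. _total for counters and a larger unit table):

```java
// Illustrative only: a simplified version of the name mapping that appends the unit
// as a suffix when it is not already part of the metric name.
final class NameMapping {

  static String prometheusName(String otelName, String otelUnit) {
    String name = otelName.replaceAll("[^a-zA-Z0-9:_]", "_"); // dots etc. become underscores
    String suffix = unitWord(otelUnit);                        // e.g. "s" -> "seconds"
    if (!suffix.isEmpty() && !name.endsWith("_" + suffix)) {
      name = name + "_" + suffix;
    }
    return name;
  }

  // Hypothetical lookup covering only the unit used in this example.
  static String unitWord(String unit) {
    if (unit == null || unit.isEmpty()) {
      return "";
    }
    return "s".equals(unit) ? "seconds" : unit;
  }

  public static void main(String[] args) {
    System.out.println(prometheusName("my.metric.seconds", "")); // my_metric_seconds
    System.out.println(prometheusName("my.metric", "s"));        // my_metric_seconds
  }
}
```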

My only concern with switching unit differences to drop/fail is that units can't currently be configured using Views, which means a user using two libraries with conflicting units has no easy fix and will have to change the name of one of the metrics to resolve it (maybe that isn't a bad thing?). With unit comments having low adoption, it seems like unit conflicts in the collection layer (e.g. in the collector's prometheus exporter) could also be very common.

For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community. @fstab do you think we should fail scrapes for all irreconcilable differences? Additional cases include:

  • Delta sums or histograms
  • Exponential histograms with scale outside [-4,8]


fstab commented Nov 28, 2023

Thanks a lot @dashpole. I didn't see that the "Metric Metadata" section answers some of these points; I had only looked at the "Instrumentation Scope" section. Thanks for the pointer.

For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community.

I'll reach out and ask for opinions. I'll post the result here.

@fstab do you think we should fail scrapes for all irreconcilable differences?

Great question. My gut feeling:

  • If a metric type isn't supported at all, like Delta sums, it's ok to drop them. That's similar to Prometheus not including native histograms in text format.
  • If there is a name conflict, my tendency is towards a 500 error. I can ask what others in the Prometheus community think.
  • Histograms with scale < -4 are difficult. Scale > 8 can be down-scaled, but scale < -4 cannot be up-scaled (see the sketch after this list). I think it would be weird if scrapes all of a sudden failed with HTTP 500 because a histogram scaled down. There's no great option here. Maybe the least bad option is to drop histograms with scale < -4.
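
As context for the last point, down-scaling is the cheap direction because it only merges neighbouring buckets; a rough sketch (the class and method are hypothetical; bucket indexes and offset follow OTLP's exponential histogram buckets):

```java
import java.util.Arrays;

// Illustrative only: reduce an exponential histogram's scale by one step by merging
// pairs of adjacent buckets. counts[i] is the count of the bucket at index offset + i.
final class ExponentialHistogramDownscale {

  static long[] downscaleByOne(long[] counts, int offset) {
    if (counts.length == 0) {
      return counts;
    }
    // At scale s-1 the bucket at index j absorbs the scale-s buckets 2j and 2j+1,
    // so index j maps to floor(j / 2). The inverse direction (up-scaling) would have
    // to split one bucket into two, which is impossible without more information.
    int newOffset = Math.floorDiv(offset, 2);
    int newLast = Math.floorDiv(offset + counts.length - 1, 2);
    long[] newCounts = new long[newLast - newOffset + 1];
    for (int i = 0; i < counts.length; i++) {
      newCounts[Math.floorDiv(offset + i, 2) - newOffset] += counts[i];
    }
    return newCounts; // the new offset is Math.floorDiv(offset, 2)
  }

  public static void main(String[] args) {
    // Buckets at indexes 3..6 with counts 1,2,3,4 collapse to indexes 1..3 with counts 1,5,4.
    System.out.println(Arrays.toString(downscaleByOne(new long[] {1, 2, 3, 4}, 3)));
  }
}
```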


fstab commented Nov 29, 2023

I added an item to tomorrow's Prometheus Dev Summit agenda. If anyone's interested in joining, please do, everyone is welcome to participate and propose topics. Link to the calendar invite is at the top of the agenda doc.

jack-berg added the triaged-accepted label (deprecated; issue triaged and accepted by OTel community, can proceed with creating a PR) and the spec:metrics label (related to the specification/metrics directory), and removed the spec:miscellaneous label on Nov 29, 2023
jsuereth assigned dashpole and unassigned jsuereth on Nov 29, 2023

fstab commented Dec 1, 2023

Awesome news from the Prometheus Dev Summit: The Prometheus team decided to implement support for OTel's instrumentation scope. See Meeting Notes (2023-11-30) for reference.

Background

The internal data model in the Prometheus server already supports metrics with the same name but different metadata (TYPE, UNIT, HELP), as long as the labels are different. Common use case: if service1 exposes a metric of type gauge named request_size_bytes and service2 exposes a metric of type unknown named request_size_bytes, you end up with two metrics with the same name but different labels in the Prometheus server:

# TYPE request_size_bytes gauge
request_size_bytes{job="service1", ...} ...
# TYPE request_size_bytes unknown
request_size_bytes{job="service2", ...} ...

There is no issue with the name conflict because the job label differs. In the same way, there is no issue with name conflicts if the otel_scope_name label differs, even if job and instance are the same. Name conflicts are fine as long as the set of labels differs.

Why doesn't it work then?

The reason why name conflicts with OpenTelemetry instrumentation scope are an issue has nothing to do with Prometheus' internal data model; internally, Prometheus supports this. The issue has to do with the exposition format: the Prometheus Text format (and the OpenMetrics format) does not allow metrics with the same name but different metadata (TYPE, HELP, UNIT).

What did the Prometheus team decide?

The Prometheus team decided to experiment with an # EOS (end of scope) marker in text format. This would look something like this:

# TYPE request_size_bytes gauge
# HELP request_size_bytes some help
request_size_bytes{otel_scope_name="scope1", ...} ...
# EOS
# TYPE request_size_bytes unknown
# HELP request_size_bytes some other help
request_size_bytes{otel_scope_name="scope2", ...} ...

If that works, we will be able to map OpenTelemetry's instrumentation scope to Prometheus without conflicts.


dashpole commented Feb 8, 2024

@fstab Let me know if you need any help on the OTel side of things here.

austinlparker added the triage:accepted:ready label (ready to be implemented; small enough or uncontroversial enough to be implemented without sponsor) and removed the deprecated triaged-accepted label on Apr 30, 2024
austinlparker moved this to Spec - Accepted in 🔭 Main Backlog on Jul 16, 2024
trask moved this from Spec - Accepted to Spec - In Progress in 🔭 Main Backlog on Jul 16, 2024
austinlparker added the sig-issue label (a specific SIG should look into this before discussing at the spec) and removed the triage:accepted:ready label on Jul 30, 2024