Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][pip] PIP-320: OpenTelemetry Scaffolding #21635

Merged
merged 10 commits into from
Dec 13, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
256 changes: 256 additions & 0 deletions pip/pip-320.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
# PIP-320 OpenTelemetry Scaffolding

# Background knowledge

## PIP-264 - parent PIP titled "Enhanced OTel-based metric system"
[PIP-264](https://github.com/apache/pulsar/pull/21080), which can also be viewed [here](pip-264.md), describes in high
asafm marked this conversation as resolved.
Show resolved Hide resolved
level a plan to greatly enhance Pulsar metric system by replacing it with [OpenTelemetry](https://opentelemetry.io/).
You can read in the PIP the numerous existing problems PIP-264 solves. Among them are:
- Control which metrics to export per topic/group/namespace via the introduction of a metric filter configuration.
This configuration is planned to be dynamic as outline in the [PIP-264](pip-264.md).
- Reduce the immense metrics cardinality due to high topic count (One of Pulsar great features), by introducing
asafm marked this conversation as resolved.
Show resolved Hide resolved
the concept of Metric Group - a group of topics for metric purposes. Metric reporting will also be done to a
group granularity. 100k topics can be downsized to 1k groups. The dynamic metric filter configuration would allow
the user to control which metric group to un-filter.
- Proper histogram exporting
- Clean-up codebase clutter, by relying on a single industry standard API, SDK and metrics protocol (OTLP) instead of
existing mix of home-brew libraries and hard coded Prometheus exporter.
- any many more

You can [here](pip-264.md#why-opentelemetry) why OpenTelemetry was chosen.

## OpenTelemetry
Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos and
documentation about it. In this very short paragraph I'll describe what you need to know about OTel from this PIP
perspective.

OpenTelemetry is a project aimed to standardize the way we instrument, collect and ship metrics from applications
to telemetry backends, be it databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io).
It is divided into API, SDK and Collector:
- API: interfaces to use to instrument: define a counter, record values to a histogram, etc.
- SDK: a library, available in many languages, implementing the API, and other important features such as
reading the metrics and exporting it out to a telemetry backend or OTel Collector.
- Collector: a lightweight process (application) which can receive or retrieve telemetry, transform it (e.g.
filter, drop, aggregate) and export it (e.g. send it to various backends). The SDK supports out-of-the-box
exporting metrics as Prometheus HTTP endpoint or sending them out using OTLP protocol. Many times companies choose to
ship to the Collector and there ship to their preferred vendors, since each vendor already published their exporter
plugin to OTel Collector. This makes the SDK exporters very light-weight as they don't need to support any
vendor. It's also easier for the DevOps team as they can make OTel Collector their responsibility, and have
application developers only focus on shipping metrics to that collector.

Just to have some context: Pulsar codebase will use the OTel API to create counters / histograms and records values to
them. So will the Pulsar plugins and Pulsar Function authors. Pulsar itself will be the one creating the SDK
and using that to hand over an implementation of the API where ever needed in Pulsar. Collector is up to the choice
of the user, as OTel provides a way to expose the metrics as `/metrics` endpoint on a configured port, so Prometheus
compatible scrapers can grab it from it directly. They can also send it via OTLP to OTel collector.

## Telemetry layers
PIP-264 clearly outlined there will be two layers of metrics, collected and exported, side by side: OpenTelemetry
and the existing metric system - currently exporting in Prometheus. This PIP will explain in detail how it will work.
The basic premise is that you will be able to enable or disable OTel metrics, alongside the existing Prometheus
asafm marked this conversation as resolved.
Show resolved Hide resolved
metric exporting.

## Why OTel in Pulsar will be marked experimental and not GA
As specified in [PIP-264](pip-264.md), OpenTelemetry Java SDK has several fixes the Pulsar community must
complete before it can be used in production. They are [documented](pip-264.md#what-we-need-to-fix-in-opentelemetry)
in PIP-264. The most important one is reducing memory allocations to be negligible. OTel SDK is built upon immutability,
hence allocated memory in O(`#topics`) which is a performance killer for low latency application like Pulsar.
asafm marked this conversation as resolved.
Show resolved Hide resolved

You can track the proposal and progress the Pulsar and OTel communities are making in
[this issue](https://github.com/open-telemetry/opentelemetry-java/issues/5105).


## Metrics endpoint authentication
Today Pulsar metrics endpoint `/metrics` has an option to be protected by the configured `AuthenticationProvider`.
The configuration option is named `authenticateMetricsEndpoint` in the broker and
`authenticateMetricsEndpoint` in the proxy.


# Motivation

Implementing PIP-264 consists of a long list of steps, which are detailed in
[this issue](https://github.com/apache/pulsar/issues/21121). The first step is add all the bare-bones infrastructure
to use OpenTelemetry in Pulsar, such that next PRs can use it to start translating existing metrics to their
OTel form. It means the same metrics will co-exist in the codebase and also in runtime, if OTel was enabled.

# Goals

## In Scope
- Ability to add metrics using OpenTelemetry to Pulsar components: Broker, Function Worker and Proxy.
- User can disable or enable OpenTelemetry metrics, which by default will be disabled
- OpenTelemetry metrics will be configured via its native OTel Java SDK configuration options
- All the necessary information to use OTel with Pulsar will be documented in Pulsar documentation site
- OpenTelemetry metrics layer defined as experimental, and *not* GA

asafm marked this conversation as resolved.
Show resolved Hide resolved

## Out of Scope
asafm marked this conversation as resolved.
Show resolved Hide resolved
- Ability to add metrics using OpenTelemetry as Pulsar Function author.
- Only authenticated sessions can access OTel Prometheus endpoint, using Pulsar authentication
- Metrics in Pulsar clients (as defined in [PIP-264](pip-264.md#out-of-scope)))

# High Level Design

## Configuration
OpenTelemetry, as any good telemetry library (e.g. log4j, logback), has its own configuration mechanisms:
- System properties
- Environment variables
- Experimental file-based configuration

Pulsar doesn't need to introduce any additional configuration. The user can decide, using OTel configuration
things like:
* How do I want to export the metrics? Prometheus? Which port prometheus will be exposed at
* Change histogram buckets using Views
* and more

Pulsar will use `AutoConfiguredOpenTelemetrySdk` which uses all the above configuration mechanisms
(documented [here](https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure)).
This class builds an `OpenTelemetrySdk` based on configurations. This is the entry point to OpenTelemetry API, as it
implements `OpenTelemetry` API class.

### Setting sensible defaults for Pulsar
There are some configuration options we wish to change their default, but still allow the users to override it
if they wish. We think those default values will make a much easier user experience.

* `otel.experimental.metrics.cardinality.limit` - value: 10,000
This property sets an upper bound on the amount of unique `Attributes` an instrument can have. Take Pulsar for example,
an instrument like `pulsar.broker.messaging.topic.received.size`, the unique `Attributes` would be in the amount of
active topics in the broker. Since Pulsar can handle up to 1M topics, it makes more sense to put the default value
to 10k, which translates to 10k topics.

`AutoConfiguredOpenTelemetrySdkBuilder` allows to add properties using the method `addPropertiesSupplier`. The
System properties and environment variables override it. The file-based configuration still doesn't take
those properties supplied into account, but it will.


## Opting in
We would like to have the ability to toggle OpenTelemetry-based metrics, as they are still new.
We won't need any special Pulsar configuration, as OpenTelemetry SDK comes with a configuration key to do that.
Since OTel is still experimental, it will have to be opt-in, hence we will add the following property to be the default
using the mechanism described [above](#setting-sensible-defaults-for-pulsar):

* `otel.sdk.disabled` - value: true
This property value disables OpenTelemetry.

With OTel disabled, the user remains with the existing metrics system. OTel in a disabled state operates in a
no-op mode. This means, instruments do get built, but the instrument builders return the same instance of a
no-op instrument, which does nothing on record-values method (e.g. `add(number)`, `record(number)`). The no-op
`MeterProvider` has no registered `MetricReader` hence no metric collection will be made. The memory impact
is almost 0 and the same goes for CPU impact.

The current metric system doesn't have a toggle which causes all existing data structures to stop collecting
data. Inserting will need changing in so many places since we don't have a single place which through
all metric instrument are created (one of the motivations for PIP-264).
The current system do have a toggle: `exposeTopicLevelMetricsInPrometheus`. It enables toggling off
topic-level metrics, which means the highest cardinality metrics will be namespace level.
Once that toggle is `false`, the amount of data structures accounting memory would in the range of
a few thousands which shouldn't post a burden memory wise. If the user refrain from calling
`/metrics` it will also reduce the CPU and memory cost associated with collecting metrics.

When the user enables OTel it means there will be a memory increase, but if the user disabled topic-level
metrics in existing system, as specified above, the majority of the memory increase will be due to topic level
metrics in OTel, at the expense of not having them in the existing metric system.



## Cluster attribute name
A broker is part of a cluster. It is configured in the Pulsar configuration key `clusterName`. When the broker is part
of a cluster, it means it shares the topics defined in that cluster (persisted in Metadata service: e.g. ZK)
among the brokers of that cluster.

Today, each unique time series emitted in Prometheus metrics contains the `cluster` label (almost all of them, as it
is done manually). We wish the same with OTel - to have that attribute in each exported unique time series.

OTel has the perfect location to place attributes which are shared across all time series: Resource. An application
can have multiple Resource, with each having 1 or more attributes. You define it once, in OTel initialization or
configuration. It can contain attributes like the hostname, AWS region, etc. The default contains the service name
and some info on the SDK version.

Attributes can be added dynamically, through `addResourceCustomizer()` in `AutoConfiguredOpenTelemetrySdkBuilder`.
We will use that to inject the `cluster` attribute, taken from the configuration.

In Prometheus, we submitted a [proposal](https://github.com/open-telemetry/opentelemetry-specification/pull/3761)
to opentelemetry specifications, which was merged, to allow copying resource attributes into each exported
unique time series in Prometheus exporter.
We plan to contribute its implementation to OTel Java SDK.

Resources in Prometheus exporter, are exported as `target_info{} 1` and the attributes are added to this
time series. This will require making joins to get it, making it extremely difficult to use.
The other alternative was to introduce our own `PulsarAttributesBuilder` class, on top of
`AttributesBuilder` of OTel. Getting every contributor to know this class, use it, is hard. Getting this
across Pulsar Functions or Plugins authors, will be immensely hard. Also, when exporting as
OTLP, it is very inefficient to repeat the attribute across all unique time series, instead of once using
Resource. Hence, this needed to be solved in the Prometheus exporter as we did in the proposal.

The attribute will be named `pulsar.cluster`, as both the proxy and the broker are part of this cluster.

## Naming and using OpenTelemetry

### Attributes
* We shall prefix each attribute with `pulsar.`. Example: `pulsar.topic`, `pulsar.cluster`.

### Instruments
We should have a clear hierarchy, hence use the following prefix
* `pulsar.broker`
* `pulsar.proxy`
* `pulsar.function_worker`

### Meter
It's customary to use reverse domain name for meter names. Hence, we'll use:
* `org.apache.pulsar.broker`
* `org.apache.pulsar.proxy`
* `org.apache.pulsar.function_worker`

OTel meter name is converted to the attribute name `otel_scope_name` and added to each unique time series
attributes by Prometheus exporter.

We won't specify a meter version, as it is used solely to signify the version of the instrumentation, and
currently we are the first version, hence not use it.


# Detailed Design

## Design & Implementation Details

* `OpenTelemetryService` class
* Parameters:
* Cluster name
* What it will do:
- Override default max cardinality to 10k
- Register a resource with cluster name
- Place defaults setting to instruct Prometheus Exporter to copy resource attributes
- In the future: place defaults for Memory Mode to be REUSABLE_DATA

* `PulsarBrokerOpenTelemetry` class
* Initialization
* Construct an `OpenTelemetryService` using the cluster name taken from the broker configuration
* Constructs a Meter for the broker metrics
* Methods
* `getMeter()` returns the `Meter` for the broker
* Notes
* This is the class that will be passed along to other Pulsar service classes that needs to define
telemetry such as metrics (in the future: traces).

* `PulsarProxyOpenTelemetry` class
* Same as `PulsarBrokerOpenTelemetry` but for Pulsar Proxy
* `PulsarWorkerOpenTelemetry` class
* Same as `PulsarBrokerOpenTelemetry` but for Pulsar function worker


## Public-facing Changes

### Public API
* OTel Prometheus Exporter adds `/metrics` endpoint on a user defined port, if user chose to use it

### Configuration
* OTel configurations are used

# Security Considerations
* OTel currently does not support setting a custom Authenticator for Prometheus exporter.
An issue has been raised [here](https://github.com/open-telemetry/opentelemetry-java/issues/6013).
* Once it do we can secure the Prometheus exporter metrics endpoint using `AuthenticationProvider`
* Any user can access metrics, and they are not protected per tenant. Like today's implementation

# Links

* Mailing List discussion thread: https://lists.apache.org/thread/xcn9rm551tyf4vxrpb0th0wj0kktnrr2
* Mailing List voting thread: https://lists.apache.org/thread/zp6vl9z9dhwbvwbplm60no13t8fvlqs2