Add the prometheus-longterm-metrics and thanos optional components #461

Open: wants to merge 12 commits into base: master
36 changes: 35 additions & 1 deletion CHANGES.md
@@ -15,7 +15,41 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- Add the `prometheus-longterm-metrics` and `thanos` optional components

The `prometheus-longterm-metrics` component collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics``, or in other words are
selectable using prometheus's ``'{group="longterm-metrics"}'`` query filter. To see which longterm metric rules are
added by default, see the
``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template`` file.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus
* update the ``PROMETHEUS_LONGTERM_STORE_INTERVAL`` variable to set how often the longterm metrics rules will be
calculated. For example, setting it to ``10h`` will calculate these metrics every 10 hours.

Enabling the `prometheus-longterm-metrics` component creates the additional endpoint ``/prometheus-longterm-metrics``.

The `thanos` component enables better storage of longterm metrics collected by the
``optional-components/prometheus-longterm-metrics`` component. Data will be collected from the
``prometheus-longterm-metrics`` and stored in an S3 object store indefinitely.

When enabling this component, please change the default values for the ``MINIO_ROOT_USER`` and ``MINIO_ROOT_PASSWORD``
by updating the ``env.local`` file. These set the login credentials for the root user that runs the
[minio](https://min.io/) object store.

Enabling the `thanos` component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio web console to inspect the data stored by minio.
Comment on lines +46 to +47
Collaborator

Should those be configurable?

Collaborator Author

Which bits would be useful to configure? The endpoints, paths, images, other?

I agree that we could always add more configuration options, I'm just wondering which are a priority for you?

Collaborator

The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.


This also includes an update to the prometheus version from `v2.19.0` to the current latest `v2.52.0`. This is
required to support the interaction between prometheus and thanos.

[2.4.0](https://github.com/bird-house/birdhouse-deploy/tree/2.4.0) (2024-06-04)
------------------------------------------------------------------------------------------------------------------
2 changes: 1 addition & 1 deletion birdhouse/components/monitoring/default.env
@@ -8,7 +8,7 @@ export GRAFANA_VERSION="7.0.3"
export GRAFANA_DOCKER=grafana/grafana
export GRAFANA_IMAGE='${GRAFANA_DOCKER}:${GRAFANA_VERSION}'

-export PROMETHEUS_VERSION="v2.19.0"
+export PROMETHEUS_VERSION="v2.52.0"
export PROMETHEUS_DOCKER=prom/prometheus
export PROMETHEUS_IMAGE='${PROMETHEUS_DOCKER}:${PROMETHEUS_VERSION}'

9 changes: 9 additions & 0 deletions birdhouse/env.local.example
@@ -574,6 +574,15 @@ export THREDDS_ADDITIONAL_CATALOG=""
#export ALERTMANAGER_EXTRA_INHIBITION=""
#export ALERTMANAGER_EXTRA_RECEIVERS=""

# Below are for the prometheus-longterm-metrics optional component
#export PROMETHEUS_LONGTERM_RETENTION_TIME=1y
#export PROMETHEUS_LONGTERM_STORE_INTERVAL=1h

# Below are for the thanos optional component
# Change these from the default for added security
#export MINIO_ROOT_USER="${__DEFAULT__MINIO_ROOT_USER}"
#export MINIO_ROOT_PASSWORD="${__DEFAULT__MINIO_ROOT_PASSWORD}"

#############################################################################
# Emu optional vars
#############################################################################
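
As a rough illustration of the variables added above, an ``env.local`` override might look like the sketch below. Only the variable names come from this diff; the retention time, the interval and both MinIO credentials are made-up placeholders, not recommended values.

```shell
# env.local (sketch): every value below is illustrative.

# Keep longterm metrics for two years instead of the default 1y.
export PROMETHEUS_LONGTERM_RETENTION_TIME=2y
# Recalculate the longterm recording rules every 6 hours instead of the default 1h.
export PROMETHEUS_LONGTERM_STORE_INTERVAL=6h

# Only relevant when the thanos optional component is enabled.
# Both values are placeholders: choose your own and keep them out of version control.
export MINIO_ROOT_USER="metrics-admin"
export MINIO_ROOT_PASSWORD="replace-with-a-long-random-secret"
```

Both duration values use the Prometheus duration syntax (``y``, ``d``, ``h`` and so on), the same form as the defaults shown in the diff.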
35 changes: 35 additions & 0 deletions birdhouse/optional-components/README.rst
@@ -443,3 +443,38 @@ How to enable X-Robots-Tag Header in ``env.local`` (a copy from `env.local.examp

.. seealso::
See the `env.local.example`_ file for more details about this ``BIRDHOUSE_PROXY_ROOT_LOCATION`` behaviour.

Prometheus Long-term Metrics
----------------------------

This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics``, or in other words are
selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To see which longterm metric rules are
added by default, see the ``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template``
file.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus
* update the ``PROMETHEUS_LONGTERM_STORE_INTERVAL`` variable to set how often the longterm metrics rules will be
calculated. For example, setting it to ``10h`` will calculate these metrics every 10 hours.

Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``.

Thanos
------

This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics``
component. Data will be collected from the ``prometheus-longterm-metrics`` and stored in an S3 object store
indefinitely.
Collaborator

Indefinitely! Do we actually want this? Can we set an expiry after, say, 10 years?

Will Grafana be able to display data from Thanos going back 10 years? With this kind of extreme long-term stats, what is the UI to visualize them?

Collaborator Author

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-metrics component without thanos and setting PROMETHEUS_LONGTERM_RETENTION_TIME to whatever you'd like.

Collaborator

Oh I see, is there a switch to disable Thanos?

Same question: if we do want to use Thanos, how do we visualize the data stored in it? I assume that if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage, so without the data being stored in Prometheus, how do we visualize the data stored in Thanos?

Just a question. If another component is required, we can do it in a follow-up PR.

Collaborator

> Oh I see, is there a switch to disable Thanos?

To answer my own question: Thanos is actually a separate component, so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function independently of Thanos?

Collaborator Author

> Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component

Yes that's right. prometheus-longterm-metrics collects and stores specific metrics that we want to keep for longer from prometheus. If you want to also enable thanos, then thanos will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.

Collaborator

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Collaborator

Let's see how much space that will take in practice and adjust accordingly. Eventually we will also need a way to visualize those older metrics. Otherwise, what's the point of keeping them forever if we cannot visualize them?


When enabling this component, please change the default values for the ``MINIO_ROOT_USER`` and ``MINIO_ROOT_PASSWORD``
by updating the ``env.local`` file. These set the login credentials for the root user that runs the minio_ object
store.

Enabling this component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio_ web console to inspect the data stored by minio_.

.. _minio: https://min.io/
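
Besides the web interfaces, the stored series can presumably be inspected through the Prometheus-style HTTP query API. The sketch below assumes that the API is served under the endpoint prefix (for example ``/prometheus-longterm-metrics/api/v1/query``, which is what the ``--web.external-url`` setting in this PR suggests) and that whatever authentication the proxy requires is passed in as extra ``curl`` arguments. The function name ``query_longterm`` and the host in the usage line are hypothetical.

```shell
# Run an instant query for every series that carries the longterm label.
# $1    : base URL of the endpoint, without a trailing slash
# $2... : optional extra curl arguments (for example credentials for the proxy)
query_longterm() {
    base_url="${1:?usage: query_longterm BASE_URL [curl args...]}"
    shift
    # --get turns the url-encoded data into a query string on the request URL.
    curl --silent --show-error --get "$@" \
        --data-urlencode 'query={group="longterm-metrics"}' \
        "${base_url}/api/v1/query"
}
```

A call might look like ``query_longterm "https://birdhouse.example.org/prometheus-longterm-metrics"``. Whether the same path layout holds for ``/thanos-query`` depends on how that service is configured, which this part of the diff does not show.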
@@ -0,0 +1,3 @@
config/monitoring/prometheus.rules
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf
@@ -0,0 +1,28 @@
providers:
  prometheus-longterm-metrics:
    # below URL is only used to fill in the required location in Magpie
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL
    url: http://proxy:80
    title: PrometheusLongtermMetrics
    public: true
    c4i: false
    type: api
    sync_type: api

permissions:
  - service: prometheus-longterm-metrics
    permission: read
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: read
    group: monitoring
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: monitoring
    action: create
@@ -0,0 +1,7 @@
version: "3.4"

services:
  magpie:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
@@ -0,0 +1,6 @@
version: "3.4"

services:
  prometheus:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro
@@ -0,0 +1,16 @@
groups:
  - name: longterm-metrics
    interval: ${PROMETHEUS_LONGTERM_STORE_INTERVAL}
    rules:
      - record: cpu_instance:cpu_load_irate:avg${PROMETHEUS_LONGTERM_STORE_INTERVAL}
        expr: avg by(cpu, instance) (irate(node_cpu_seconds_total{mode!="idle"}[${PROMETHEUS_LONGTERM_STORE_INTERVAL}]))
        labels:
          group: longterm-metrics
      - record: instance:network_bytes_received_irate:sum${PROMETHEUS_LONGTERM_STORE_INTERVAL}
        expr: sum by (instance) (irate(node_network_receive_bytes_total[${PROMETHEUS_LONGTERM_STORE_INTERVAL}]))
        labels:
          group: longterm-metrics
      - record: instance:network_bytes_sent_irate:sum${PROMETHEUS_LONGTERM_STORE_INTERVAL}
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[${PROMETHEUS_LONGTERM_STORE_INTERVAL}]))
        labels:
          group: longterm-metrics
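
To make the effect of ``PROMETHEUS_LONGTERM_STORE_INTERVAL`` on this template concrete, the sketch below renders one of the rules with the interval set to ``1h``. It uses ``sed`` purely as a stand-in for whatever templating step the deployment actually performs; the ``render_rule`` helper is hypothetical and not part of this PR.

```shell
# Substitute the interval placeholder in template text read from stdin.
# $1: the interval value to insert, for example 1h
render_rule() {
    # \$ matches a literal dollar sign; the braces are literal in basic regex syntax.
    sed 's/\${PROMETHEUS_LONGTERM_STORE_INTERVAL}/'"$1"'/g'
}

# Two lines copied from the rules template; single quotes prevent shell expansion.
template='- record: instance:network_bytes_sent_irate:sum${PROMETHEUS_LONGTERM_STORE_INTERVAL}
  expr: sum by (instance) (irate(node_network_transmit_bytes_total[${PROMETHEUS_LONGTERM_STORE_INTERVAL}]))'

rendered=$(printf '%s\n' "$template" | render_rule 1h)
printf '%s\n' "$rendered"
```

With ``1h`` the recorded series is named ``instance:network_bytes_sent_irate:sum1h`` and the range selector becomes ``[1h]``, so changing the interval also changes the name of the recorded series.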
@@ -0,0 +1,18 @@
location /prometheus-longterm-metrics {
    auth_request /secure-prometheus-longterm-metrics-auth;
    auth_request_set $auth_status $upstream_status;
    proxy_pass http://prometheus-longterm-metrics:9090;
    proxy_set_header Host $host;
}

location = /secure-prometheus-longterm-metrics-auth {
    internal;
    proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/prometheus-longterm-metrics$request_uri;
    proxy_pass_request_body off;
    proxy_set_header Host $host;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI $request_uri;
    proxy_set_header X-Forwarded-Proto $real_scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Host $host:$server_port;
}
@@ -0,0 +1,6 @@
version: "3.4"

services:
  proxy:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/prometheus-longterm-metrics:ro
@@ -0,0 +1,15 @@
export PROMETHEUS_LONGTERM_RETENTION_TIME=1y
export PROMETHEUS_LONGTERM_STORE_INTERVAL=1h

# These are the prometheus defaults
export PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION=2h
export PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION=1d12h

OPTIONAL_VARS="
$OPTIONAL_VARS
\$PROMETHEUS_LONGTERM_STORE_INTERVAL
"

COMPONENT_DEPENDENCIES="
./components/monitoring
Collaborator

Oh, this dependency will prevent the new long-term Prometheus from running standalone.

Collaborator Author

@mishaschwartz mishaschwartz Jun 18, 2024

No, it cannot run standalone; it collects metrics from the services that are enabled in the monitoring component.

Collaborator

Say I already have a bunch of PAVICS servers running with monitoring enabled and I just want to point this new Prometheus to aggregate all the data?

Can be in a follow up PR.

Collaborator Author

The goal of this PR is to create a method for saving existing prometheus data over the long term for a single birdhouse/PAVICS deployment. If you want something that will collect prometheus data for multiple servers I would recommend creating a new repository to host that code.

Collaborator

Since the architecture here is pluggable, we do not need a separate repo that duplicates the work. To deploy only the 2nd Prometheus on a separate machine, as I see it we would enable only the proxy, the prometheus-longterm-metrics and, optionally, the thanos components on the new machine, and that's it.

But I agree we can add this "standalone" support in a separate PR.

"
@@ -0,0 +1,32 @@
version: "3.4"

x-logging:
  &default-logging
  driver: "json-file"
  options:
    max-size: "50m"
    max-file: "10"

services:
  prometheus-longterm-metrics:
    image: ${PROMETHEUS_IMAGE}
    container_name: prometheus-longterm-metrics
    volumes:
      - ./optional-components/prometheus-longterm-metrics/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_longterm_persistence:/prometheus:rw
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --storage.tsdb.retention.time=${PROMETHEUS_LONGTERM_RETENTION_TIME}
      - --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus-longterm-metrics/
      - --storage.tsdb.min-block-duration=${PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION}
      - --storage.tsdb.max-block-duration=${PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION}
    restart: always
    logging: *default-logging

volumes:
  prometheus_longterm_persistence:
    external:
      name: prometheus_longterm_persistence
@@ -0,0 +1,3 @@
#!/bin/sh -x

docker volume create prometheus_longterm_persistence # metrics db
@@ -0,0 +1,18 @@
global:
  external_labels:
    instance_name: prometheus-longterm-metrics

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
Collaborator

This one is long term but it still scrapes every 15 sec? Or I might have misunderstood this config.

Collaborator Author

It collects and stores metrics that are already collected by the other prometheus service. Depending on how often the other prometheus instance is calculating those metrics, you may not need such a small scrape_interval. What interval would you suggest?

Collaborator Author

I made it configurable

Collaborator

I do not fully understand federation yet, but I had the impression that the 2nd Prometheus would query the max/min/average value from the 1st Prometheus at a much longer interval. Ex: the 1st Prometheus scrapes every minute. The 2nd Prometheus would scrape every hour, taking the max/min/average over the hour from the 1st Prometheus?

Collaborator Author

You don't need multiple prometheus instances for that. You just need to create recording rules that calculate the metrics that you want to store for a longer period and then use the second prometheus instance to grab those metrics for long-term storage.

The second prometheus instance is necessary only because the data retention time is configured for each instance, so if you want to keep data longer you need a separate prometheus instance.

If you're storing data for a long time with thanos instead, you still need a second prometheus instance so that you can select which data thanos will store. thanos stores all data from a prometheus instance, you cannot pick and choose, so we get around this problem by only collecting the data we are interested in with the "longterm" prometheus instance and only using thanos to store data from that specific instance.

The best way to think about it is:

  • prometheus: collects all metrics we are interested in and stores them for a short period of time
  • prometheus-longterm-metrics: selects some metrics collected by prometheus that we are interested in storing for a long period of time.
  • thanos: selects all metrics from prometheus-longterm-metrics and stores them efficiently over an even longer period of time

Collaborator

I see more clearly now, thanks!

Just to confirm, the current setup of prometheus-longterm-metrics is collecting everything from prometheus, so we are not "selecting some metrics" yet, right?

That's alright, we can perform the selection in a subsequent PR.

Collaborator

> Just to confirm, the current setup prometheus-longterm-metrics is collecting everything from prometheus so we are not yet "selecting some metrics" yet, right?

Oops, sorry, I just read your other reply, so to answer my own remark: prometheus-longterm-metrics is collecting only the metrics in group: longterm-metrics from prometheus.


    honor_labels: true
    metrics_path: '/prometheus/federate'

    params:
      'match[]':
        - '{group="longterm-metrics"}'

    static_configs:
      - targets:
        - 'prometheus:9090'
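
The scrape job above amounts to a periodic HTTP GET against the federation endpoint of the first prometheus instance, with a ``match[]`` parameter selecting the longterm group. A minimal sketch of the equivalent manual request follows; it assumes the command runs somewhere that can resolve the ``prometheus`` hostname (for example inside the same docker network), and the ``fetch_federated`` helper name is hypothetical.

```shell
# Target and metrics_path taken from the scrape configuration in this diff.
federate_url="http://prometheus:9090/prometheus/federate"

# Fetch the series that the federate job would scrape.
# $1: optional override for the federation URL
fetch_federated() {
    # --get sends the url-encoded match[] selector as a query string.
    curl --silent --show-error --get \
        --data-urlencode 'match[]={group="longterm-metrics"}' \
        "${1:-$federate_url}"
}
```

Calling ``fetch_federated`` should print the matching series in the Prometheus text exposition format, which can help confirm that the recording rules are producing data before the longterm instance scrapes them.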
2 changes: 2 additions & 0 deletions birdhouse/optional-components/thanos/.gitignore
@@ -0,0 +1,2 @@
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf
@@ -0,0 +1,28 @@
providers:
  thanos:
    # below URL is only used to fill in the required location in Magpie
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL
    url: http://proxy:80
    title: Thanos
    public: true
    c4i: false
    type: api
    sync_type: api

permissions:
  - service: thanos
    permission: read
    group: administrators
    action: create
  - service: thanos
    permission: write
    group: administrators
    action: create
  - service: thanos
    permission: read
    group: monitoring
    action: create
  - service: thanos
    permission: write
    group: monitoring
    action: create
@@ -0,0 +1,7 @@
version: "3.4"

services:
  magpie:
    volumes:
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/thanos.yml:ro
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/thanos.yml:ro
@@ -0,0 +1,38 @@
location /thanos-query {
    auth_request /secure-thanos-auth;
    auth_request_set $auth_status $upstream_status;
    proxy_pass http://thanos-query:19192;
    proxy_set_header Host $host;
}

location /thanos-minio/ {
    auth_request /secure-thanos-auth;
    auth_request_set $auth_status $upstream_status;

    rewrite ^/thanos-minio/(.*) /$1 break;
    proxy_pass http://minio:9001;

    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    # This allows WebSocket connections
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

location = /secure-thanos-auth {
    internal;
    proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/thanos$request_uri;
    proxy_pass_request_body off;
    proxy_set_header Host $host;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI $request_uri;
    proxy_set_header X-Forwarded-Proto $real_scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Host $host:$server_port;
}
@@ -0,0 +1,6 @@
version: "3.4"

services:
  proxy:
    volumes:
      - ./optional-components/thanos/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/thanos:ro