Add the prometheus-longterm-metrics and thanos optional components #461

Open: wants to merge 12 commits into base: master
31 changes: 30 additions & 1 deletion CHANGES.md
@@ -15,7 +15,36 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- Add the `prometheus-longterm-metrics` and `thanos` optional components

The `prometheus-longterm-metrics` component collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics`` or in other words are
selectable using prometheus's ``'{group="longterm-metrics"}'`` query filter. To see which longterm metric rules are
added by default see the
``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template`` file.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus

Enabling the `prometheus-longterm-metrics` component creates the additional endpoint ``/prometheus-longterm-metrics``.

The `thanos` component enables better storage of longterm metrics collected by the
``optional-components/prometheus-longterm-metrics`` component. Data will be collected from the
``prometheus-longterm-metrics`` instance and stored in an S3 object store indefinitely.

When enabling this component, please change the default values for the ``THANOS_MINIO_ROOT_USER`` and ``THANOS_MINIO_ROOT_PASSWORD``
by updating the ``env.local`` file. These set the login credentials for the root user that runs the
[minio](https://min.io/) object store.

- Enabling the `thanos` component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio web console to inspect the data stored by minio.
Comment on lines +46 to +47
Collaborator

Should those be configurable?

Collaborator Author

Which bits would be useful to configure? The endpoints, paths, images, other?

I agree that we could always add more configuration options, I'm just wondering which are a priority for you?

Collaborator

The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.


[2.6.0](https://github.com/bird-house/birdhouse-deploy/tree/2.6.0) (2024-11-19)
------------------------------------------------------------------------------------------------------------------
52 changes: 52 additions & 0 deletions birdhouse/components/README.rst
@@ -372,6 +372,7 @@ AlertManager for Alert Dashboard and Silencing
.. image:: monitoring/images/alertmanager-dashboard.png
.. image:: monitoring/images/alertmanager-silence-alert.png

.. _monitoring-customize-the-component:

Customizing the Component
-------------------------
@@ -390,6 +391,57 @@ Customizing the Component
Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``.


Longterm Storage of Prometheus Metrics
--------------------------------------

Prometheus stores metrics for 90 days by default. This may be sufficient for some use cases but you may wish to store
some metrics for longer. To store certain metrics for longer than 90 days, you can enable the following
additional components:

- :ref:`prometheus-longterm-metrics`: a second Prometheus instance used to collect the metrics that you want to store longterm
- :ref:`thanos`: a service that enables more efficient storage of the metrics collected by the :ref:`prometheus-longterm-metrics`
  component.
- :ref:`prometheus-longterm-rules`: adds some example rules to the monitoring Prometheus instance (the one deployed by this `monitoring`
  component) that can be stored longterm by the `prometheus-longterm-metrics` component.

.. note::
    A separate prometheus instance is necessary since the retention time for prometheus metrics is set at the
    instance level. This means that increasing the retention time must be done for all metrics at once, which is undesirable
    because you probably don't need to store every metric for a long period of time and you'll end up using a lot more
    disk space than needed.

If some or all of these additional components are enabled, they interact in the following way to store certain metrics for
longer than 90 days:

1.
   - `recording rules`_ are added to the monitoring Prometheus instance (the one deployed by this `monitoring` component). These
     rules are any that have the `longterm-metrics` label.
   - The metrics described by these rules are collected/calculated by the monitoring Prometheus instance. The monitoring Prometheus
     instance treats these rules the same as any other (i.e. only stores them for 90 days by default).
   - To enable some example longterm `recording rules`_, enable the :ref:`prometheus-longterm-rules` component. You can also choose
     to create your own rules (see :ref:`prometheus-longterm-metrics` for details on how to create these longterm metrics rules).
2.
   - The :ref:`prometheus-longterm-metrics` Prometheus instance collects/copies only the rules with the `longterm-metrics` label from the
     monitoring Prometheus instance.
   - The :ref:`prometheus-longterm-metrics` Prometheus instance stores only these metrics for a custom duration (can be longer than
     90 days).
3.
   - The :ref:`thanos` component can be deployed alongside the :ref:`prometheus-longterm-metrics` Prometheus instance in order to store
     the metrics that the :ref:`prometheus-longterm-metrics` Prometheus instance has already collected.
   - The :ref:`thanos` component collects the metrics collected by the :ref:`prometheus-longterm-metrics` Prometheus instance and
     stores them in an S3 object store.
   - The :ref:`thanos` object store stores the metrics more efficiently, meaning that metrics can be stored for even longer and they'll
     take up less disk space than if they were just stored by the :ref:`prometheus-longterm-metrics` Prometheus instance.

.. note::

    It is possible to deploy the :ref:`prometheus-longterm-metrics` Prometheus instance and the :ref:`thanos` instance on a different
    machine than the monitoring Prometheus instance. However, note that both the :ref:`prometheus-longterm-metrics` and :ref:`thanos`
    components *must* be deployed on the same machine (if both are in use). Also note that this is untested and may require serious
    troubleshooting to work properly.
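
As a rough sketch of how the components described above might be enabled together, the paths below could be listed in ``BIRDHOUSE_EXTRA_CONF_DIRS`` in ``env.local``. The exact list is an assumption: it presumes everything runs on one machine and that no other components are enabled, so merge it with your existing value rather than copying it as is.

```shell
# env.local (sketch): each component that should run is listed explicitly.
# Assumption: these four are the only extra components wanted on this machine.
export BIRDHOUSE_EXTRA_CONF_DIRS="
  ./components/monitoring
  ./optional-components/prometheus-longterm-rules
  ./optional-components/prometheus-longterm-metrics
  ./optional-components/thanos
"

# Optional: keep longterm metrics in Prometheus for two years instead of the default 1y.
export PROMETHEUS_LONGTERM_RETENTION_TIME=2y
```

Leaving out ``./optional-components/thanos`` keeps the longterm metrics in the second Prometheus instance only, bounded by ``PROMETHEUS_LONGTERM_RETENTION_TIME``.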

.. _recording rules: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

Weaver
======

8 changes: 8 additions & 0 deletions birdhouse/env.local.example
@@ -632,6 +632,14 @@ export THREDDS_ADDITIONAL_CATALOG=''
#export ALERTMANAGER_EXTRA_INHIBITION=""
#export ALERTMANAGER_EXTRA_RECEIVERS=""

# Below are for the prometheus-longterm-metrics optional component
#export PROMETHEUS_LONGTERM_RETENTION_TIME=1y

# Below are for the thanos optional component
# Change these from the default for added security
#export THANOS_MINIO_ROOT_USER="${__DEFAULT__THANOS_MINIO_ROOT_USER}"
#export THANOS_MINIO_ROOT_PASSWORD="${__DEFAULT__THANOS_MINIO_ROOT_PASSWORD}"

# Below are for the prometheus-log-parser optional component
#export PROMETHEUS_LOG_PARSER_POLL_DELAY=1 # time in seconds
#export PROMETHEUS_LOG_PARSER_TAIL=true
74 changes: 74 additions & 0 deletions birdhouse/optional-components/README.rst
@@ -444,6 +444,80 @@ How to enable X-Robots-Tag Header in ``env.local`` (a copy from `env.local.examp
.. seealso::
See the `env.local.example`_ file for more details about this ``BIRDHOUSE_PROXY_ROOT_LOCATION`` behaviour.

.. _prometheus-longterm-metrics:

Prometheus Long-term Metrics
----------------------------

This is a second prometheus instance that collects longterm monitoring metrics from the monitoring Prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics`` or in other words are
selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To add some default longterm metrics rules
also enable the ``prometheus-longterm-rules`` component.

You may also choose to create your own set of rules instead of, or as well as, the default ones. See how to
:ref:`add additional rules here <monitoring-customize-the-component>`.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus

If the monitoring Prometheus instance that this Prometheus instance is tracking is not deployed on the same machine
(or is at a non-default network address on the same machine), you may configure the network location of the monitoring
Prometheus instance by setting the ``PROMETHEUS_LONGTERM_TARGETS`` variable. For example, if the monitoring Prometheus
instance's API is available at ``https://example.com/prometheus:9090`` then you can set the variable:

.. code::

    export PROMETHEUS_LONGTERM_TARGETS='["https://example.com/prometheus:9090"]'

.. note::

    You may list multiple monitoring Prometheus instances to track in this way by adding more URLs to the list.

.. warning::

    Deploying the longterm metrics Prometheus instance on a separate machine from the monitoring Prometheus component
    is untested and may require serious troubleshooting to work properly.
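
Before pointing ``PROMETHEUS_LONGTERM_TARGETS`` at another instance, it can help to check what that instance would hand over. The sketch below builds the federation URL for the ``{group="longterm-metrics"}`` selector. The helper name ``federate_url`` is hypothetical, and the base URL in the example assumes the default ``prometheus:9090`` target with the ``/prometheus`` path prefix.

```shell
# Build the federation URL that selects only longterm metrics.
# $1: base URL of the monitoring Prometheus, including its path prefix.
federate_url() {
  base="${1%/}"   # drop one trailing slash, if present
  # URL-encoded form of the selector {group="longterm-metrics"}
  selector='%7Bgroup%3D%22longterm-metrics%22%7D'
  # match%5B%5D is the URL-encoded form of match[]
  printf '%s/federate?match%%5B%%5D=%s\n' "$base" "$selector"
}

federate_url "http://prometheus:9090/prometheus"
```

Fetching the printed URL (for example with ``curl -s``) from a host that can reach the monitoring instance should list only the series that carry the ``group="longterm-metrics"`` label. An empty response suggests that no longterm rules are loaded on that instance.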

Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``.

.. _prometheus-longterm-rules:

Prometheus Long-term Rules
--------------------------

This adds some default longterm metrics rules to the `prometheus` component for use by the `prometheus-longterm-metrics`
component. These rules all have the label ``group: longterm-metrics``.

To see which rules are added, check out the
`optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules` file.

.. _thanos:

Thanos
------

This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics``
component. Data will be collected from the ``prometheus-longterm-metrics`` instance and stored in an S3 object store
indefinitely.
Collaborator

Indefinitely ! Do we actually want this? Can we set an expiry after like 10 years?

Will Grafana be able to display data from Thanos going back 10 years? With this kind of extreme long-term stats, what is the UI to visualize it?

Collaborator Author

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-metrics component without thanos and setting PROMETHEUS_LONGTERM_RETENTION_TIME to whatever you'd like.

Collaborator

Oh I see, is there a switch to disable Thanos?

Same question: if we do want to use Thanos, how do we visualize the data stored in Thanos? I assume that if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage. Without the data being stored in Prometheus, how do we visualize the data stored in Thanos?

Just a question. If another component is required, we can do it in a follow-up PR.

Collaborator

> Oh I see, is there a switch to disable Thanos?

To answer my own question, Thanos is actually a separate component, so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function independently of Thanos?

Collaborator Author

> Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component

Yes that's right. prometheus-longterm-metrics collects and stores specific metrics that we want to keep for longer from prometheus. If you want to also enable thanos, then thanos will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.

Collaborator

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Collaborator

Let's see how much space that will take in practice and we will adjust. Eventually we will need a way to visualize those older metrics. Otherwise, what's the point of keeping them forever if we cannot visualize them?


When enabling this component, please change the default values for the ``THANOS_MINIO_ROOT_USER`` and
``THANOS_MINIO_ROOT_PASSWORD`` by updating the ``env.local`` file. These set the login credentials for the root user
that runs the minio_ object store.

Enabling this component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio_ web console to inspect the data stored by minio_.
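
A minimal sketch of overriding the default credentials in ``env.local``. The user name and the way the password is generated are illustrative assumptions; any sufficiently strong values will do. Generate the password once and paste the literal value into ``env.local`` so that the credential stays the same across restarts.

```shell
# Run once in a terminal: 24 random bytes, base64-encoded into a 32-character secret.
generated_password="$(head -c 24 /dev/urandom | base64)"
echo "$generated_password"

# env.local (sketch): both values below are placeholders.
export THANOS_MINIO_ROOT_USER="thanos-admin"              # hypothetical user name
export THANOS_MINIO_ROOT_PASSWORD="$generated_password"   # in env.local, use the printed literal instead
```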

.. note::

    The `thanos` component must be deployed on the same machine as the `prometheus-longterm-metrics` component since
    `thanos` needs access to the data stored by prometheus on disk (in docker this is achieved by sharing a named volume).

.. _minio: https://min.io/

.. _prometheus-log-parser:

Prometheus Log Parser
@@ -0,0 +1,3 @@
prometheus.yml
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf
@@ -0,0 +1,28 @@
providers:
  prometheus-longterm-metrics:
    # below URL is only used to fill in the required location in Magpie
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL
    url: http://proxy:80
    title: PrometheusLongtermMetrics
    public: true
    c4i: false
    type: api
    sync_type: api

permissions:
  - service: prometheus-longterm-metrics
    permission: read
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: read
    group: monitoring
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: monitoring
    action: create
@@ -0,0 +1,7 @@
version: "3.4"

services:
  magpie:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
@@ -0,0 +1,18 @@
    location /prometheus-longterm-metrics {
        auth_request /secure-prometheus-longterm-metrics-auth;
        auth_request_set $auth_status $upstream_status;
        proxy_pass http://prometheus-longterm-metrics:9090;
        proxy_set_header Host $host;
    }

    location = /secure-prometheus-longterm-metrics-auth {
        internal;
        proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/prometheus-longterm-metrics$request_uri;
        proxy_pass_request_body off;
        proxy_set_header Host $host;
        proxy_set_header Content-Length "";
        proxy_set_header X-Original-URI $request_uri;
        proxy_set_header X-Forwarded-Proto $real_scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
    }
@@ -0,0 +1,6 @@
version: "3.4"

services:
  proxy:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/prometheus-longterm-metrics:ro
@@ -0,0 +1,29 @@
export PROMETHEUS_LONGTERM_VERSION='${PROMETHEUS_VERSION:-"v2.52.0"}'
export PROMETHEUS_LONGTERM_DOCKER='${PROMETHEUS_DOCKER:-prom/prometheus}'
export PROMETHEUS_LONGTERM_IMAGE='${PROMETHEUS_LONGTERM_DOCKER}:${PROMETHEUS_LONGTERM_VERSION}'

export PROMETHEUS_LONGTERM_RETENTION_TIME=1y
export PROMETHEUS_LONGTERM_SCRAPE_INTERVAL=1h

# These are the prometheus defaults
export PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION=2h
export PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION=1d12h

# These are the targets that this Prometheus instance collects (federates) metrics from
export PROMETHEUS_LONGTERM_TARGETS='["prometheus:9090"]' # yaml list syntax

OPTIONAL_VARS="
$OPTIONAL_VARS
\$PROMETHEUS_LONGTERM_SCRAPE_INTERVAL
\$PROMETHEUS_LONGTERM_TARGETS
"

export DELAYED_EVAL="
$DELAYED_EVAL
PROMETHEUS_LONGTERM_VERSION
PROMETHEUS_LONGTERM_DOCKER
PROMETHEUS_LONGTERM_IMAGE
"

# Note that this component does not depend explicitly on the `components/monitoring` component so that this can
# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested.
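
The single-quoted defaults above are stored as literal text and only expanded later, which is why they are listed in `DELAYED_EVAL`. The sketch below imitates that behaviour with plain `eval`; the `resolve` helper is hypothetical and is not how birdhouse itself processes `DELAYED_EVAL`, which this diff does not show.

```shell
# Single quotes keep the ${...} expression as literal text at assignment time.
PROMETHEUS_LONGTERM_DOCKER='${PROMETHEUS_DOCKER:-prom/prometheus}'

# Hypothetical helper (not part of birdhouse): expand the stored expression now.
resolve() {
  eval "template=\${$1}"               # fetch the literal text held by the named variable
  eval "printf '%s\n' \"$template\""   # expand that text against the current environment
}

PROMETHEUS_DOCKER=registry.example.com/prometheus   # e.g. set later, in env.local
resolve PROMETHEUS_LONGTERM_DOCKER
```

Because expansion happens late, an override of `PROMETHEUS_DOCKER` made after this file is read still carries over to the longterm image. When `PROMETHEUS_DOCKER` is unset, the fallback `prom/prometheus` is used.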
Comment on lines +28 to +29
Collaborator

Curious how that would work out?
It seems to depend on prometheus:9090 and extends the volumes of prometheus service.
I think it is more confusing to omit the dependency.

Collaborator Author

It's actually a completely separate service with no shared volumes. It does depend on prometheus:9090 by default but that default can be updated by setting the PROMETHEUS_LONGTERM_TARGETS variable.

PROMETHEUS_LONGTERM_TARGETS can point to any prometheus endpoints at all.

See the discussion about this here: #461 (comment)

Collaborator

What I meant is that the rules are mounted under prometheus service here:
https://github.com/bird-house/birdhouse-deploy/blob/longterm-monitoring/birdhouse/optional-components/prometheus-longterm-rules/config/monitoring/docker-compose-extra.yml

So, whether PROMETHEUS_LONGTERM_TARGETS refers to the same instance or a custom remote one, the local instance needs to have the long-term rules defined. Therefore, doesn't that mean that prometheus service (i.e.: using components/monitoring) becomes mandatory regardless, since it is needed to mount the rules?

Collaborator Author

Yes, prometheus must be running on the same host as the rest of the birdhouse stack.

That first prometheus instance must have at least one rule that has the longterm-metrics group label in order for a second prometheus instance (either running on the same machine or elsewhere) to know which metrics it should monitor and store from the first prometheus instance.

Collaborator

I've taken a second look at the directory structure, and I'm starting to see why components/monitoring is not a dependency. Is it only because components/monitoring combines cadvisor, grafana, alertmanager, prometheus, etc. all together?

Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

    prometheus:
      image: ${PROMETHEUS_IMAGE}
      container_name: prometheus
      volumes:
        - ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
        - ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro
        - prometheus_persistence:/prometheus:rw
      command:
        # restore original CMD from image
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --web.console.libraries=/usr/share/prometheus/console_libraries
        - --web.console.templates=/usr/share/prometheus/consoles
        # https://prometheus.io/docs/prometheus/latest/storage/
        - --storage.tsdb.retention.time=90d
        # wrong default was http://container-hash:9090/
        - --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus/
      restart: always

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why):

From a user's perspective, this all seems really convoluted.
Isn't there a way to simplify this hierarchy? For example, what if there was a distinct optional-components/prometheus definition, and components/monitoring depends on it. Then, optional-components/prometheus-longterm-metrics could also depend only on optional-components/prometheus?

Collaborator Author (mishaschwartz, Sep 18, 2024)

@fmigneault

> Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

If someone only wants longterm metrics then they can have a single prometheus service that sets --storage.tsdb.retention.time to something longer than 90 days.

However, if you want some metrics to be stored for longer than others, you need a second prometheus instance. This is because the --storage.tsdb.retention.time value is set for an entire prometheus instance (see discussion here: #277 (comment) and here: #461 (comment)).

> I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why)

These are separate in order to accommodate @tlvu's request to be able to deploy the prometheus-longterm-metrics component on a different server. prometheus-longterm-rules are just suggested default rules to be added to the prometheus server. As described in the README file:

> This adds some default longterm metrics rules to the `prometheus` component

These rules are added to the prometheus component on the same machine that is running the rest of the stack because that is where all the relevant metrics are collected by the rest of the monitoring components (cadvisor, nodeexporter). These rules are not necessary, they are recommended defaults but you can customize them however you like. It is easier to customize and add your own if the rules are separate. Users can choose to include the defaults, or not.

Collaborator (tlvu, Sep 18, 2024)

Agree with Misha's explanations. I just want to add that each organization's needs are different, so we should not force a certain "configuration".

Specifically for this PR, we should not force the 2 Prometheus instances to be on the same machine. So optional-components/prometheus-longterm-metrics should not depend on components/monitoring. We should also not be forced to use optional-components/prometheus-longterm-rules because each org will probably have different metrics they deem useful to keep longterm.

During Misha's leave, a sysadmin from PCIC actually had a question about how to pull existing metrics from our PAVICS Prometheus into his own centralized Prometheus for a centralized view of all his servers in one place. I pointed him to this PR for inspiration.

Collaborator

My bad. I was not clear in my explanation. What I intended to say was:
"if someone wants only prometheus monitoring with both long-term and short term metrics, they need to define their own prometheus service" [...], because they cannot activate prometheus on its own (i.e.: they must use the full package components/monitoring that includes all other services).

Having short/long-term separate, and having them configurable on different servers is perfectly fine by me. I'm all for that flexibility, and never mentioned otherwise.

It just seems from the configuration files that, in order for long-term metrics to be sent to the remote server, the rules must be mounted into the short-term local prometheus (as shown below), and since that prometheus must exist, then it should be a dependency. I understand long-term-prometheus NOT being a dependency to toggle it independently of the short-term one. This is good. Here I am referring to https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-rules having component/monitoring dependency, NOT https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-metrics. In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

    services:
      prometheus:
        volumes:
          - ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro

Collaborator Author

Ok I understand your concern now.

> In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

Docker compose won't actually complain in this case. Additional settings defined under the config/ folder in components are only applied if the component that it references is also enabled. So in this case, if you add prometheus-longterm-rules as a component but not monitoring, then nothing will happen (since the additional docker-compose settings are in prometheus-longterm-rules/config/monitoring).

If we do make the prometheus-longterm-rules component dependent on the monitoring component, then if someone adds the prometheus-longterm-rules component but not the monitoring component to the BIRDHOUSE_EXTRA_CONF_DIRS configuration variable, monitoring will be added automatically. This is also a surprising behaviour that users may not expect.

My opinion is that we should leave it as is, so that users must specify both components if they want both components. I'm open to other opinions on this subject though.

Collaborator

Thanks for validating how this resolves.
I'm fine with either approach if the README indicates clearly what is supposed to happen and what to do if the one or the other situation is desired.

@@ -0,0 +1,32 @@
version: "3.4"

x-logging:
  &default-logging
  driver: "json-file"
  options:
    max-size: "50m"
    max-file: "10"

services:
  prometheus-longterm-metrics:
    image: ${PROMETHEUS_LONGTERM_IMAGE}
    container_name: prometheus-longterm-metrics
    volumes:
      - ./optional-components/prometheus-longterm-metrics/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_longterm_persistence:/prometheus:rw
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --storage.tsdb.retention.time=${PROMETHEUS_LONGTERM_RETENTION_TIME}
      - --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus-longterm-metrics/
      - --storage.tsdb.min-block-duration=${PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION}
      - --storage.tsdb.max-block-duration=${PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION}
    restart: always
    logging: *default-logging

volumes:
  prometheus_longterm_persistence:
    external:
      name: prometheus_longterm_persistence
@@ -0,0 +1,3 @@
#!/bin/sh -x

docker volume create prometheus_longterm_persistence # metrics db
@@ -0,0 +1,17 @@
global:
  external_labels:
    instance_name: prometheus-longterm-metrics

scrape_configs:
  - job_name: 'federate'
    scrape_interval: ${PROMETHEUS_LONGTERM_SCRAPE_INTERVAL}

    honor_labels: true
    metrics_path: '/prometheus/federate'

    params:
      'match[]':
        - '{group="longterm-metrics"}'

    static_configs:
      - targets: ${PROMETHEUS_LONGTERM_TARGETS}
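
To see what the scrape configuration above looks like once its two variables are filled in, the sketch below expands them with a shell here-document. This only imitates the substitution step for illustration; how birdhouse itself renders the template is not shown in this diff. The values are the defaults from the component's environment file.

```shell
PROMETHEUS_LONGTERM_SCRAPE_INTERVAL=1h
PROMETHEUS_LONGTERM_TARGETS='["prometheus:9090"]'   # yaml list syntax

# Print the scrape config with both variables expanded (illustration only).
render_scrape_config() {
  cat <<EOF
scrape_configs:
  - job_name: 'federate'
    scrape_interval: ${PROMETHEUS_LONGTERM_SCRAPE_INTERVAL}
    honor_labels: true
    metrics_path: '/prometheus/federate'
    params:
      'match[]':
        - '{group="longterm-metrics"}'
    static_configs:
      - targets: ${PROMETHEUS_LONGTERM_TARGETS}
EOF
}

render_scrape_config
```

Since the targets value is inserted verbatim as a YAML list, adding a second entry to `PROMETHEUS_LONGTERM_TARGETS` makes this instance federate from a second monitoring Prometheus with the same selector.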
@@ -0,0 +1,6 @@
version: "3.4"

services:
  prometheus:
    volumes:
      - ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro
@@ -0,0 +1,15 @@
groups:
  - name: longterm-metrics-hourly
    interval: 1h
    rules:
      # percentage of the time, over the last hour, that all CPUs were working
      # 1 means all CPUs were working all the time, 0 means they were all idle all the time
      - record: instance:cpu_load:avg_rate1h
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[1h]))
        labels:
          group: longterm-metrics
      # average number of bytes per second that were sent or received over the network in the last hour
      # (rate() returns a per-second average, not a total)
      - record: instance:network_bytes_transmitted:sum_rate1h
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[1h]) + rate(node_network_receive_bytes_total[1h]))
        labels:
          group: longterm-metrics
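
Because `rate()` yields a per-second average over the window, a value recorded by the network rule can be turned into an approximate hourly byte count by multiplying it by 3600. A small sketch of that arithmetic, using a made-up sample value and a hypothetical helper name:

```shell
# Convert a per-second rate, averaged over one hour, into an approximate hourly total.
# $1: recorded value in bytes per second.
rate_to_hourly_total() {
  awk -v r="$1" 'BEGIN { printf "%.0f\n", r * 3600 }'
}

rate_to_hourly_total 1250.5   # prints 4501800
```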
2 changes: 2 additions & 0 deletions birdhouse/optional-components/thanos/.gitignore
@@ -0,0 +1,2 @@
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf