Add the prometheus-longterm-metrics and thanos optional components #461
base: master
Changes from 9 commits
d981ef3
06dc997
0d7178e
42f687d
a921527
0307996
2eab8b7
76e60f7
3f75d49
79a531d
59f6c68
2c6d5f5
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -443,3 +443,54 @@ How to enable X-Robots-Tag Header in ``env.local`` (a copy from `env.local.examp | |
|
||
.. seealso:: | ||
See the `env.local.example`_ file for more details about this ``BIRDHOUSE_PROXY_ROOT_LOCATION`` behaviour. | ||
|
||
.. _prometheus-longterm-metrics: | ||
|
||
Prometheus Long-term Metrics | ||
---------------------------- | ||
|
||
This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance | ||
(the one created by the ``components/monitoring`` component). | ||
|
||
Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics`` or, in other words, are | ||
selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To add some default longterm metrics rules, | ||
also enable the ``prometheus-longterm-rules`` component. | ||
|
||
You may also choose to create your own set of rules instead of, or as well as, the default ones. See how to | ||
:ref:`add additional rules here <monitoring-customize-the-component>`. | ||
|
||
To configure this component: | ||
|
||
* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus | ||
|
||
Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``. | ||
|
||
.. _prometheus-longterm-rules: | ||
|
||
Prometheus Long-term Rules | ||
-------------------------- | ||
|
||
This adds some default longterm metrics rules to the ``prometheus`` component for use by the ``prometheus-longterm-metrics`` | ||
component. These rules all have the label ``group: longterm-metrics``. | ||
|
||
To see which rules are added, check out the | ||
``optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules`` file. | ||
Re-reading this following other comments in the PR, I am left only more confused...

Putting ourselves in the shoes of someone having absolutely no idea about the implementation details of each component and how they interact with each other, what must the user do to achieve some results (aka: the "To enable this optional-component:" of each other section)? ie:

Specifically regarding the "Prometheus Long-term Rules" section. However, for a user without the details of "why", it is very confusing! I can see a user think they "should logically" be applied to

Basically, something must indicate that service

Ok... just to clarify what you're asking for here ... You would like specific instructions about how to set up each of these three cases:

And you would like an additional description of how the various monitoring components interact?

That's it. Just mitigate user expectations and ensure there is a common understanding of the components/services.
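
As a rough illustration of the kind of "To enable this optional-component" instructions discussed in this thread, an `env.local` fragment could look like the sketch below. It is not taken from the PR. The variable name `BIRDHOUSE_EXTRA_CONF_DIRS` is an assumption about how component directories are listed, and the `2y` retention value is only an example; the component paths and `PROMETHEUS_LONGTERM_RETENTION_TIME` come from the files in this diff.

```shell
# Hypothetical env.local fragment (a sketch, not part of this PR).
# Assumption: BIRDHOUSE_EXTRA_CONF_DIRS lists the component directories to enable.
export BIRDHOUSE_EXTRA_CONF_DIRS="
  ${BIRDHOUSE_EXTRA_CONF_DIRS:-}
  ./components/monitoring
  ./optional-components/prometheus-longterm-rules
  ./optional-components/prometheus-longterm-metrics
"
# components/monitoring             -> provides the first prometheus instance
# prometheus-longterm-rules         -> adds rules labelled group: longterm-metrics to that instance
# prometheus-longterm-metrics       -> second prometheus instance that collects those rules

# Example value only: keep longterm metrics for two years instead of the 1y default.
export PROMETHEUS_LONGTERM_RETENTION_TIME=2y

echo "$BIRDHOUSE_EXTRA_CONF_DIRS"
```

Listing all three directories explicitly matches the position taken later in this review that users should specify each component they want, rather than having one component pull in another.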
||
|
||
.. _thanos: | ||
|
||
Thanos | ||
------ | ||
|
||
This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics`` | ||
component. Data will be collected from the ``prometheus-longterm-metrics`` and stored in an S3 object store | ||
indefinitely. | ||
Indefinitely! Do we actually want this? Can we set an expiry after like 10 years? Will Grafana be able to display data from Thanos going back 10 years? With this kind of extreme long term stats, what is the UI to visualize it?

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the

Oh I see, is there a switch to disable Thanos? Same question, in the case we want to use Thanos, how to visualize the data stored on Thanos? I assume if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage, so without data being stored in Prometheus, how to visualize that data stored on Thanos? Just a question. If another component is required, we can do it in a follow-up PR.

To answer my own question, Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function standalone of Thanos?

Yes that's right.

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Let's see how much space that will take in practice and we will adjust. And eventually we need a way to visualize those older metrics. Otherwise, what's the point of keeping them forever if we cannot visualize them?
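
To put a rough number on the storage question raised in this thread, here is a back-of-envelope sketch. Every input is an assumption chosen for illustration: 36 series stands in for "a few dozen metrics", one sample per hour mirrors the `PROMETHEUS_LONGTERM_SCRAPE_INTERVAL=1h` default in this PR, and 2 bytes per sample is the upper end of the 1-2 bytes per sample rule of thumb from the Prometheus storage documentation. Index, block and object store overhead are ignored.

```shell
# Back-of-envelope estimate; all inputs are illustrative assumptions.
series=36              # "a few dozen" longterm series (assumed count)
samples_per_day=24     # one sample per series per hour (1h scrape interval)
days=365               # one year of data
bytes_per_sample=2     # assumed average sample size in bytes

samples=$(( series * samples_per_day * days ))
bytes=$(( samples * bytes_per_sample ))

echo "samples per year: $samples"
echo "approximate bytes per year: $bytes"
```

Under these assumptions the raw sample data stays well under a megabyte per year, which is consistent with the view above that archiving a few dozen hourly metrics should not be a problem. Real usage would need to be measured, as suggested in the last comment.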
||
|
||
When enabling this component, please change the default values for the ``THANOS_MINIO_ROOT_USER`` and | ||
``THANOS_MINIO_ROOT_PASSWORD`` by updating the ``env.local`` file. These set the login credentials for the root user | ||
that runs the minio_ object store. | ||
|
||
Enabling this component creates the additional endpoints: | ||
|
||
* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos | ||
* ``/thanos-minio``: a minio_ web console to inspect the data stored by minio_. | ||
|
||
.. _minio: https://min.io/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
prometheus.yml | ||
config/magpie/config.yml | ||
config/proxy/conf.extra-service.d/monitoring.conf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
providers: | ||
  prometheus-longterm-metrics: | ||
    # below URL is only used to fill in the required location in Magpie | ||
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL | ||
    url: http://proxy:80 | ||
    title: PrometheusLongtermMetrics | ||
    public: true | ||
    c4i: false | ||
    type: api | ||
    sync_type: api | ||
|
||
permissions: | ||
  - service: prometheus-longterm-metrics | ||
    permission: read | ||
    group: administrators | ||
    action: create | ||
  - service: prometheus-longterm-metrics | ||
    permission: write | ||
    group: administrators | ||
    action: create | ||
  - service: prometheus-longterm-metrics | ||
    permission: read | ||
    group: monitoring | ||
    action: create | ||
  - service: prometheus-longterm-metrics | ||
    permission: write | ||
    group: monitoring | ||
    action: create |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
version: "3.4" | ||
|
||
services: | ||
  magpie: | ||
    volumes: | ||
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro | ||
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
location /prometheus-longterm-metrics { | ||
    auth_request /secure-prometheus-longterm-metrics-auth; | ||
    auth_request_set $auth_status $upstream_status; | ||
    proxy_pass http://prometheus-longterm-metrics:9090; | ||
    proxy_set_header Host $host; | ||
} | ||
|
||
location = /secure-prometheus-longterm-metrics-auth { | ||
    internal; | ||
    proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/prometheus-longterm-metrics$request_uri; | ||
    proxy_pass_request_body off; | ||
    proxy_set_header Host $host; | ||
    proxy_set_header Content-Length ""; | ||
    proxy_set_header X-Original-URI $request_uri; | ||
    proxy_set_header X-Forwarded-Proto $real_scheme; | ||
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; | ||
    proxy_set_header X-Forwarded-Host $host:$server_port; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
version: "3.4" | ||
|
||
services: | ||
  proxy: | ||
    volumes: | ||
      - ./optional-components/prometheus-longterm-metrics/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/prometheus-longterm-metrics:ro |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
export PROMETHEUS_LONGTERM_VERSION='${PROMETHEUS_VERSION:-"v2.52.0"}' | ||
export PROMETHEUS_LONGTERM_DOCKER='${PROMETHEUS_DOCKER:-prom/prometheus}' | ||
fmigneault marked this conversation as resolved.
|
||
export PROMETHEUS_LONGTERM_IMAGE='${PROMETHEUS_LONGTERM_DOCKER}:${PROMETHEUS_LONGTERM_VERSION}' | ||
|
||
export PROMETHEUS_LONGTERM_RETENTION_TIME=1y | ||
export PROMETHEUS_LONGTERM_SCRAPE_INTERVAL=1h | ||
|
||
# These are the prometheus defaults | ||
export PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION=2h | ||
export PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION=1d12h | ||
|
||
# These are the targets that the longterm prometheus instance collects (federates) metrics from | ||
export PROMETHEUS_LONGTERM_TARGETS='["prometheus:9090"]' # yaml list syntax | ||
|
||
OPTIONAL_VARS=" | ||
  $OPTIONAL_VARS | ||
  \$PROMETHEUS_LONGTERM_SCRAPE_INTERVAL | ||
  \$PROMETHEUS_LONGTERM_TARGETS | ||
" | ||
|
||
export DELAYED_EVAL=" | ||
  $DELAYED_EVAL | ||
  PROMETHEUS_LONGTERM_VERSION | ||
  PROMETHEUS_LONGTERM_DOCKER | ||
  PROMETHEUS_LONGTERM_IMAGE | ||
" | ||
|
||
# Note that this component does not depend explicitly on the `components/monitoring` component so that this can | ||
# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested. | ||
Comment on lines +28 to +29
Curious how that would work out?

It's actually a completely separate service with no shared volumes. It does depend on

See the discussion about this here: #461 (comment)

What I meant is that the rules are mounted under

So, whether

Yes, prometheus must be running on the same host as the rest of the birdhouse stack. That first prometheus instance must have at least one rule that has the

I've taken a second look at the directory structure, and starting to see why

Therefore, if someone wants "only" longterm-metrics, they need to define their own

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why):

From a user's perspective, this all seems really convoluted.

If someone only wants longterm metrics then they can have a single prometheus service that sets

However, if you want some metrics to be stored for longer than others, you need a second prometheus instance. This is because the

These are separate in order to accommodate @tlvu's request to be able to deploy the

These rules are added to the

Agree with Misha's explanations. I just want to add that each organization's needs are different so we should not force a certain "configuration". Specifically to this PR, we should not force the 2 Prometheus to be on the same machine. So

During Misha's leave, a sysadmin from PCIC actually had a question about how to pull existing metrics from our PAVICS Prometheus into his own centralized Prometheus for a centralized view of all his servers in one place. I pointed him to this PR for inspiration.

My bad. I was not clear in my explanation. What I intended to say was: Having short/long-term separate, and having them configurable on different servers is perfectly fine by me. I'm all for that flexibility, and never mentioned otherwise. It just seems from the configuration files that, in order for long-term metrics to be sent to the remote server, the rules must be mounted into the short-term local

Lines 3 to 6 in 3f75d49

Ok I understand your concern now.

Docker compose won't actually complain in this case. Additional settings defined under the

If we do make the

My opinion is that we should leave it as is, so that users must specify both components if they want both components. I'm open to other opinions on this subject though.

Thanks for validating how this resolves.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
version: "3.4" | ||
|
||
x-logging: | ||
  &default-logging | ||
  driver: "json-file" | ||
  options: | ||
    max-size: "50m" | ||
    max-file: "10" | ||
|
||
services: | ||
  prometheus-longterm-metrics: | ||
    image: ${PROMETHEUS_LONGTERM_IMAGE} | ||
    container_name: prometheus-longterm-metrics | ||
    volumes: | ||
      - ./optional-components/prometheus-longterm-metrics/prometheus.yml:/etc/prometheus/prometheus.yml:ro | ||
      - prometheus_longterm_persistence:/prometheus:rw | ||
    command: | ||
      - --config.file=/etc/prometheus/prometheus.yml | ||
      - --storage.tsdb.path=/prometheus | ||
      - --web.console.libraries=/usr/share/prometheus/console_libraries | ||
      - --web.console.templates=/usr/share/prometheus/consoles | ||
      - --storage.tsdb.retention.time=${PROMETHEUS_LONGTERM_RETENTION_TIME} | ||
      - --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus-longterm-metrics/ | ||
      - --storage.tsdb.min-block-duration=${PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION} | ||
      - --storage.tsdb.max-block-duration=${PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION} | ||
    restart: always | ||
    logging: *default-logging | ||
|
||
volumes: | ||
  prometheus_longterm_persistence: | ||
    external: | ||
      name: prometheus_longterm_persistence |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
#!/bin/sh -x | ||
|
||
docker volume create prometheus_longterm_persistence # metrics db |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
global: | ||
  external_labels: | ||
    instance_name: prometheus-longterm-metrics | ||
|
||
scrape_configs: | ||
  - job_name: 'federate' | ||
    scrape_interval: ${PROMETHEUS_LONGTERM_SCRAPE_INTERVAL} | ||
|
||
    honor_labels: true | ||
    metrics_path: '/prometheus/federate' | ||
|
||
    params: | ||
      'match[]': | ||
        - '{group="longterm-metrics"}' | ||
|
||
    static_configs: | ||
      - targets: ${PROMETHEUS_LONGTERM_TARGETS} |
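
To make the `federate` job above more concrete, the sketch below reconstructs by hand the URL that the scrape would request. It assumes the default target `prometheus:9090` from `PROMETHEUS_LONGTERM_TARGETS` and plain `http` (the scheme Prometheus uses when none is configured). The variable names are illustrative only and the percent-encoding covers just the characters present in this particular selector.

```shell
# Sketch: rebuild the federation URL from the values in prometheus.yml above.
target="prometheus:9090"               # default entry in PROMETHEUS_LONGTERM_TARGETS
metrics_path="/prometheus/federate"    # metrics_path of the 'federate' job
match='{group="longterm-metrics"}'     # the match[] selector

# Percent-encode only the characters that occur in this selector: { } " =
encoded=$(printf '%s' "$match" \
  | sed -e 's/{/%7B/g' -e 's/}/%7D/g' -e 's/"/%22/g' -e 's/=/%3D/g')

# match[] itself is encoded as match%5B%5D
federate_url="http://${target}${metrics_path}?match%5B%5D=${encoded}"
echo "$federate_url"
```

From a container on the same docker network as the `prometheus` service, passing this URL to `curl` should list the series that carry the `group="longterm-metrics"` label. An empty response would suggest that no such rule is loaded in the first prometheus instance.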
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
version: "3.4" | ||
|
||
services: | ||
  prometheus: | ||
    volumes: | ||
      - ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
groups: | ||
  - name: longterm-metrics-hourly | ||
    interval: 1h | ||
    rules: | ||
      # fraction of the time, over the last hour, that all CPUs were working | ||
      # 1 means all CPUs were working all the time, 0 means they were all idle all the time | ||
      - record: instance:cpu_load:avg_rate1h | ||
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[1h])) | ||
        labels: | ||
          group: longterm-metrics | ||
      # average rate, in bytes per second, at which data was sent or received over the network in the last hour | ||
      - record: instance:network_bytes_transmitted:sum_rate1h | ||
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[1h]) + rate(node_network_receive_bytes_total[1h])) | ||
        labels: | ||
          group: longterm-metrics |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
config/magpie/config.yml | ||
config/proxy/conf.extra-service.d/monitoring.conf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
providers: | ||
  thanos: | ||
    # below URL is only used to fill in the required location in Magpie | ||
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL | ||
    url: http://proxy:80 | ||
    title: Thanos | ||
    public: true | ||
    c4i: false | ||
    type: api | ||
    sync_type: api | ||
|
||
permissions: | ||
  - service: thanos | ||
    permission: read | ||
    group: administrators | ||
    action: create | ||
  - service: thanos | ||
    permission: write | ||
    group: administrators | ||
    action: create | ||
  - service: thanos | ||
    permission: read | ||
    group: monitoring | ||
    action: create | ||
  - service: thanos | ||
    permission: write | ||
    group: monitoring | ||
    action: create |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
version: "3.4" | ||
|
||
services: | ||
  magpie: | ||
    volumes: | ||
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/thanos.yml:ro | ||
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/thanos.yml:ro |
Should those be configurable?
Which bits would be useful to configure? The endpoints, paths, images, other?
I agree that we could always add more configuration options, I'm just wondering which are a priority for you?
The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.