Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the prometheus-longterm-metrics and thanos optional components #461

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

mishaschwartz
Copy link
Collaborator

Overview

The prometheus-longterm-metrics component collects longterm monitoring metrics from the original prometheus instance (the one created by the components/monitoring component).

Longterm metrics are any prometheus rule that have the label group: longterm-metrics or in other words are selectable using prometheus's '{group="longterm-metrics"}' query filter. To see which longterm metric rules are added by default see the optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template file.

To configure this component:

  • update the PROMETHEUS_LONGTERM_RETENTION_TIME variable to set how long the data will be kept by prometheus
  • update the PROMETHEUS_LONGTERM_STORE_INTERVAL variable to set how often the longterm metrics rules will be calculated. For example, setting it to 10h will calculate these metrics every 10 hours.

Enabling the prometheus-longterm-metrics component creates the additional endpoint /prometheus-longterm-metrics.

The thanos component enables better storage of longterm metrics collected by the optional-components/prometheus-longterm-metrics component. Data will be collected from the prometheus-longterm-metrics and stored in an S3 object store indefinitely.

When enabling this component, please change the default values for the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD by updating the env.local file. These set the login credentials for the root user that runs the minio object store.

Enabling the thanos component creates the additional endpoints:

  • /thanos-query: a prometheus-like query interface to inspect the data stored by thanos
  • /thanos-minio: a minio web console to inspect the data stored by minio.

This also includes an update to the prometheus version from v2.19.0 to the current latest v2.52.0. This is to
required to support the interaction between prometheus and thanos.

Changes

Non-breaking changes

  • New component version: prometheus:v2.52.0

Related Issue / Discussion

Additional Information

  • I tested upgrading the prometheus version and there were no issues (no loss of data, no changed APIs etc.)
  • Note that the thanos set up is pretty minimal but probably good enough for our purposes. We can always add more of the thanos features/components in the future if needed.

CI Operations

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false

@mishaschwartz mishaschwartz requested a review from huard June 7, 2024 20:12
@github-actions github-actions bot added component/magpie Related to https://github.com/Ouranosinc/Magpie documentation Improvements or additions to documentation labels Jun 7, 2024
Copy link
Collaborator

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the specifics, but this is really great news !

Question: this assumes a MinIO server exists. Long has set one up on PAVICS, but I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

@mishaschwartz
Copy link
Collaborator Author

mishaschwartz commented Jun 13, 2024

this assumes a MinIO server exists.

If you enable the thanos optional component, the a minio server will be created in a docker container. It is not dependent on an externally deployed/created minio server. But if you have one already you could customize this to use that one instead.

I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

Not at the moment, you likely have a custom component that sets one up for PAVICS. Do you know if this is true @tlvu?

@huard
Copy link
Collaborator

huard commented Jun 13, 2024

Huh, that's convenient ! Ok thanks !

@tlvu
Copy link
Collaborator

tlvu commented Jun 13, 2024

I'm wondering if a MinIO config already officially exists on birdhouse-deploy?

Not at the moment, you likely have a custom component that sets one up for PAVICS. Do you know if this is true @tlvu?

@mishaschwartz Yes I deployed one, just for prototyping so did not hook it up "officially" to the PAVICS stack because it's not behind the proxy and Magpie.

@huard like Misha, I'd rather have the MinIO for the monitoring and MinIO for other usage be separate. It much simplifies the management (custom config, version requirements, upgrade, ...). I think Weaver or Magpie was hooking into the default Posgres at some point, then it also rolls it's own Postgres container because it requires a different version of Postgres.

@mishaschwartz Can this new component be enabled standalone? Meaning on another physical host with only itself and the proxy, as describe in this comment #277 (comment)?

If not yet standalone, it's okay we can do it in another PR.

@huard
Copy link
Collaborator

huard commented Jun 13, 2024

Good point, makes sense.

Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick review just by reading the code, not tried to deploy yet.

CHANGES.md Outdated Show resolved Hide resolved

This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics``
component. Data will be collected from the ``prometheus-longterm-metrics`` and stored in an S3 object store
indefinitely.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indefinitely ! Do we actually want this? Can we set an expiry after like 10 years?

Grafana will be able to display data from Thanos go to back to 10 years? With this kind of extreme long term stats, what is the UI to visualize it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-monitoring component without thanos and setting the PROMETHEUS_LONGTERM_RETENTION_TIME to whatever you'd like.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, is there a switch to disable Thanos?

Same question, in the case we want to use Thanos, how to visualize the data stored on Thanos? I assume if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage so without data being stored in Prometheus, how to visualize that data stored on Thanos.

Just a question. If another component is required, we can do it in a follow up Pr.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, is there a switch to disable Thanos?

To answer my own question, Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function standalone of Thanos?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component

Yes that's right. prometheus-longterm-metrics collects and stores specific metrics that we want to keep for longer from prometheus. If you want to also enable thanos, then thanos will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see how much space that will take in practice and we will adjust. And eventually we need a way to visualize those older metrics. Otherwise, what's the point to keep them forever if we can not visualize?

"

COMPONENT_DEPENDENCIES="
./components/monitoring
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this dependency will make the new Prometheus long term not able to run standalone.

Copy link
Collaborator Author

@mishaschwartz mishaschwartz Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it cannot run standalone, it collects metrics from the services that are enabled in the monitoring component.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say I already have a bunch of PAVICS servers running with monitoring enabled and I just want to point this new Prometheus to aggregate all the data?

Can be in a follow up PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR is to create a method for saving existing prometheus data over the long term for a single birdhouse/PAVICS deployment. If you want something that will collect prometheus data for multiple servers I would recommend creating a new repository to host that code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the architecture here is pluggable, we do not need a separate repo and duplicate the work. To deploy the 2nd Prometheus only, on a separate machine, I see we only enable the proxy and the prometheus-longterm-metrics and optionally thanos components on the new machine and that's it.

But agree we can make this "standalone" support in a separate PR.


scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is a long term but it still scrape every 15 sec? Or I might misunderstood this config.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It collects and stores metrics that are already collected by the other prometheus service. Depending on how often the other prometheus instance is calculating those metrics, you may not need such a small scrape_interval. What interval would you suggest?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it configurable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully understand federation yet but I had the impression the 2nd Prometheus would query the max/min/average value from the 1st Prometheus at a much longer interval. Ex: the 1st Prometheus scrap every minute. The 2nd Prometheus would scrap every hour, taking the max/min/average over the hour of the 1st Prometheus?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need multiple prometheus instances for that. You just need to create recording rules that calculate the metrics that you want to store for a longer period and then use the second prometheus instance to grab those metrics for long-term storage.

The second prometheus instance is necessary only because the data retention time is configured for each instance, so if you want to keep data longer you need a separate prometheus instance.

If you're storing data for a long time with thanos instead, you still need a second prometheus instance so that you can select which data thanos will store. thanos stores all data from a prometheus instance, you cannot pick and choose, so we get around this problem by only collecting the data we are interested in the "longterm" prometheus instance and only using thanos to store data from that specific instance.

The best way to think about it is:

  • prometheus: collects all metrics we are interested in and stores them for a short period of time
  • prometheus-longterm-metrics: selects some metrics collected by prometheus that we are interested in storing for a long period of time.
  • thanos: selects all metrics from prometheus-longterm-metrics and stores them efficiently over an even longer period of time

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see clearer now, thanks !

Just to confirm, the current setup prometheus-longterm-metrics is collecting everything from prometheus so we are not yet "selecting some metrics" yet, right?

It's alright we can perform the selection in a subsequent PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, the current setup prometheus-longterm-metrics is collecting everything from prometheus so we are not yet "selecting some metrics" yet, right?

Opps sorry, just read your other reply now and so to answer my own remark: prometheus-longterm-metrics is collecting only metrics from group: longterm-metrics from prometheus.

birdhouse/optional-components/thanos/default.env Outdated Show resolved Hide resolved
birdhouse/optional-components/thanos/default.env Outdated Show resolved Hide resolved
@tlvu tlvu self-requested a review June 18, 2024 15:07
@mishaschwartz
Copy link
Collaborator Author

mishaschwartz commented Jun 18, 2024

@tlvu

Can this new component be enabled standalone? Meaning on another physical host with only itself and the proxy, as describe in this comment #277 (comment)?

Not with the current way that we deploy without some serious additional coordination to support the networking and permissions. If we used something like kubernetes or docker swarm to deploy this would be easier to deploy on multiple hosts but we're not there yet.

@tlvu
Copy link
Collaborator

tlvu commented Jun 18, 2024

additional coordination to support the networking and permissions.

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

rules:
# percentage of the time, over the last hour, that all CPUs were working
# 1 means all CPUs were working all the time, 0 means they were all idle all the time
- record: instance:cpu_load:avg_rate1h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the 2nd Prometheus-longterm will scrap only these 2 new metrics or it will also scrap all the existing ones?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, only these two. I suggest we add more rules to this default list in the future though.

If you have other metrics that you want it to scrape for long term storage, you just have to give that rule the label group: longterm-metrics.


static_configs:
- targets:
- 'prometheus:9090'
Copy link
Collaborator

@tlvu tlvu Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make this target list a template expansion variable, then we can easily add more hosts and override this default local Prometheus and points to other hosts, ex:

- existing_pavics_1:9090
- existing_pavics_2:9090
- ...

This assume I open up the Prometheus port on existing_pavics_1 and existing_pavics_2 and I handle manually the inclusion of prometheus.rules on on existing_pavics_1 and existing_pavics_2.

Then you can remove the dependency on ./components/monitoring.

Just add a note in the README to enable ./components/monitoring together with this ./optional-components/prometheus-longterm-metrics component only if the admin wants both Prometheus to be on the same machine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is for a separate discussion and out of the scope of this PR.

We can discuss the best way to handle multiple targets later on but at the very least we need to consider the security implications of recommending users open up additional ports.

I really think that any code that handles monitoring multiple prometheus endpoints should go in a separate repo.

@tlvu tlvu self-requested a review June 18, 2024 19:46
@mishaschwartz
Copy link
Collaborator Author

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

Yup, that would work. I think that code can go in a different repo though.

@tlvu
Copy link
Collaborator

tlvu commented Jun 19, 2024

Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work?

Yup, that would work. I think that code can go in a different repo though.

Yup, I never see this code in this repo. Ouranos have our own override repo so we can do this in our own override. Because only our repo will actually know what are all the "other" hostnames !

The only change needed for this repo so to turn the targets section in birdhouse/optional-components/prometheus-longterm-metrics/prometheus.yml.template to a template variable so our override repo can override with a real list of "other" hosts.

Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good for now. Any extra improvements can be done in subsequent PRs.

@mishaschwartz
Copy link
Collaborator Author

@tlvu I've made a few small changes to make it easier to deploy this on a separate server. Check out 2eab8b7 for the relevant changes.

@mishaschwartz mishaschwartz requested a review from tlvu June 25, 2024 20:33
Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! This should indeed make the longterm prometheus standalone if need be.

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. Sorry for the delay of the review.
Still going through my plethora of emails following my long leave.

Nice feature. Will rely on the configurations to have been tested on a local instance since I cannot deep dive into it yet.

I think the README documentation with the existing monitoring component would benefit from a cross-reference of these new optional components details. A user reading about the monitoring component might not have the reflex to look for corresponding optional components definitions, and might get confused by the mixed variables/config/rule references. Also, they might not be able to figure out that thanos is relevant for prometheus and vice-versa.

In the same line of thought, it is not clear yet to me what are all the impacts of the various components and how they interact between each other. There are implicit dependencies between thanos, prometheus, prometheus-longterm-metrics, as well as grafana, alertmanager, cadvisor, others(?) through components/monitoring and all their derived dockers/volumes, but it is really not obvious unless one has personally integrated them. Can we add some guidance in the docs to help disambiguate all of them?

CHANGES.md Outdated Show resolved Hide resolved
CHANGES.md Outdated Show resolved Hide resolved
Comment on lines +49 to +50
* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio web console to inspect the data stored by minio.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should those be configurable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which bits would be useful to configure? The endpoints, paths, images, other?

I agree that we could always add more configuration options, I'm just wondering which are a priority for you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.

CHANGES.md Outdated Show resolved Hide resolved
Comment on lines +29 to +30
# Note that this component does not depend explicitly on the `components/monitoring` component so that this can
# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious how that would work out?
It seems to depend on prometheus:9090 and extends the volumes of prometheus service.
I think it is more confusing to omit the dependency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually a completely separate service with no shared volumes. It does depend on prometheus:9090 by default but that default can be updated by setting the PROMETHEUS_LONGTERM_TARGETS variable.

PROMETHEUS_LONGTERM_TARGETS can point to any prometheus endpoints at all.

See the discussion about this here: #461 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the rules are mounted under prometheus service here:
https://github.com/bird-house/birdhouse-deploy/blob/longterm-monitoring/birdhouse/optional-components/prometheus-longterm-rules/config/monitoring/docker-compose-extra.yml

So, whether PROMETHEUS_LONGTERM_TARGETS refers to the same instance or a custom remote one, the local instance needs to have the long-term rules defined. Therefore, doesn't that mean that prometheus service (i.e.: using components/monitoring) becomes mandatory regardless, since it is needed to mount the rules?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, prometheus must be running on the same host as the rest of the birdhouse stack.

That first prometheus instance must have at least one rule that has the longterm-metrics group label in order for a second prometheus instance (either running on the same machine or elsewhere) to know which metrics it should monitor and store from the first prometheus instance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a second look at the directory structure, and starting to see why components/monitoring is not a dependency. Is it only because components/monitoring combines cardvisor, grafana, alertmanager, prometheus, etc. all together?

Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

prometheus:
image: ${PROMETHEUS_IMAGE}
container_name: prometheus
volumes:
- ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro
- prometheus_persistence:/prometheus:rw
command:
# restore original CMD from image
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --web.console.libraries=/usr/share/prometheus/console_libraries
- --web.console.templates=/usr/share/prometheus/consoles
# https://prometheus.io/docs/prometheus/latest/storage/
- --storage.tsdb.retention.time=90d
# wrong default was http://container-hash:9090/
- --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus/
restart: always

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why):

From a user's perspective, this all seems really convoluted.
Isn't there are way to simplify this hierarchy? For example, what if there was a distinct optional-components/prometheus definition, and components/monitoring depends on it. Then, optional-components/prometheus-longterm-metrics could also depend only on optional-components/prometheus?

Copy link
Collaborator Author

@mishaschwartz mishaschwartz Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmigneault

Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?

If someone only wants longterm metrics then they can have a single prometheus service that sets --storage.tsdb.retention.time to something longer than 90 days.

However, if you want some metrics to be stored for longer than others, you need a second prometheus instance. This is because the --storage.tsdb.retention.time value is set for an entire prometheus instance (see discussion here: #277 (comment) and here: #461 (comment)).

I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why)

These are separate in order to accommodate @tlvu's request to be able to deploy the prometheus-longterm-metrics component on a different server. prometheus-longterm-rules are just suggested default rules to be added to the prometheus server. As described in the README file:

This adds some default longterm metrics rules to the `prometheus` component 

These rules are added to the prometheus component on the same machine that is running the rest of the stack because that is where all the relevant metrics are collected by the rest of the monitoring components (cadvisor, nodeexporter). These rules are not necessary, they are recommended defaults but you can customize them however you like. It is easier to customize and add your own if the rules are separate. Users can choose to include the defaults, or not.

Copy link
Collaborator

@tlvu tlvu Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Misha's explanations. I just want to add that each organization needs are different so we should not force a certain "configuration".

Specifically to this PR, we should not force the 2 Prometheus to be on the same machine. So optional-components/prometheus-longterm-metrics should not depend on components/monitoring. We should also not be forced to use optional-components/prometheus-longterm-rules because each org will probably have different metrics they deemed useful to keep for longterm.

During Misha's leave, a sysadmin from PCIC actually had a question about how to pull existing metrics from our PAVICS Prometheus into his own centralized Prometheus for a centralized view of all his servers in one place. I point him to this PR for inspiration.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I was not clear in my explanation. What I intended to say was:
"if someone wants only prometheus monitoring with both long-term and short term metrics, they need to define their own prometheus service" [...], because they cannot activate prometheus on its own (i.e.: they must use the full package components/monitoring that includes all other services).

Having short/long-term separate, and having them configurable on different servers is perfectly fine by me. I'm all for that flexibility, and never mentioned otherwise.

It just seems from the configuration files that, in order for long-term metrics to be sent to the remote server, the rules must be mounted into the short-term local prometheus (as shown below), and since that prometheus must exist, then it should be a dependency. I understand long-term-prometheus NOT being a dependency to toggle it independently of the short-term one. This is good. Here I am referring to https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-rules having component/monitoring dependency, NOT https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-metrics. In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

services:
prometheus:
volumes:
- ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand your concern now.

In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.

Docker compose won't actually complain in this case. Additional settings defined under the config/ folder in components are only applied if the component that it references is also enabled. So in this case, if you add prometheus-longterm-rules as a component but not monitoring, then nothing will happen (since the additional docker-compose settings are in prometheus-longterm-rules/config/monitoring.

If we do make the prometheus-longterm-rules component dependent on the monitoring components then if someone adds the prometheus-longterm-rules but not the monitoring rule to the BIRDHOUSE_EXTRA_CONF_DIRS configuration variable, then monitoring will be added in automatically. This is also a surprising behaviour that users may not expect.

My opinion is that we should leave it as is, so that users must specify both components if they want both components. I'm open to other opinions on this subject though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for validating how this resolves.
I'm fine with either approach if the README indicates clearly what is supposed to happen and what to do if the one or the other situation is desired.

Comment on lines 449 to 477
Prometheus Long-term Metrics
----------------------------

This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rule that have the label ``group: longterm-metrics`` or in other words are
selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To add some default longterm metrics rules
also enable the ``prometheus-longterm-rules`` component.

You may also choose to create your own set of rules instead of, or as well as, the default ones. See how to
:ref:`add additional rules here <monitoring-customize-the-component>`.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus

Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``.

.. _prometheus-longterm-rules

Prometheus Long-term Rules
--------------------------

This adds some default longterm metrics rules to the `prometheus` component for use by the `prometheus-longterm-metrics`
component. These rules all have the label ``group: longterm-metrics``.

To see which rules are added, check out the
`optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules` file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading this following other comments in the PR, I am left only more confused...

Putting ourselves in the shoes of someone having absolutely no idea about the implementation details of each component and how they interact with each other, what must the user do to achieve some results (aka: the "To enable this optional-component:" of each other section)?

ie:

  • If user wants SHORT only, what must be enabled (guessing only components/monitoring?)
  • If user wants SHORT + LONG all locally, what must be enabled (components/monitoring + optional-components/prometheus-longterm-metrics ?) (+ optional-components/prometheus-longterm-rules depending on other dependency comment)
  • If user wants SHORT local and LONG remote (components/monitoring + optional-components/prometheus-longterm-rules + IMPORTANT PROMETHEUS_LONGTERM_TARGETS override ?)

Specifically regarding the "Prometheus Long-term Rules" section.
It is mentioned that they are applied to prometheus. (this is fine, I understand why to apply the group tags)

However, for a user without the details of "why", it is very confusing! I can see a user think they "should logically" be applied to prometheus-longterm-rules service given their names. A much better explanation must be given.

Basically, something must indicate that service prometheus is the source that feeds the others instances using optional-components/prometheus-longterm-rules. Then, using optional-components/prometheus-longterm-metrics, they can decide that this receiving instance is another local one (ie: prometheus-longterm-metrics service), or dispatch it remotely by overriding PROMETHEUS_LONGTERM_TARGETS accoridngly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... just to clarify what you're asking for here ...

You would like specific instructions about how to set up each of these three cases:

  • If user wants SHORT only ...
  • If user wants SHORT + LONG all locally ...
  • If user wants SHORT local and LONG remote ...

And you would like an additional description of how the various monitoring components interact?
Is there anything else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's it. Just mitigate user expectations and ensure there is a common understanding of the components/services.

@mishaschwartz
Copy link
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/magpie Related to https://github.com/Ouranosinc/Magpie documentation Improvements or additions to documentation
Projects
None yet
4 participants