-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the prometheus-longterm-metrics
and thanos
optional components
#461
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment on the specifics, but this is really great news !
Question: this assumes a MinIO server exists. Long has set one up on PAVICS, but I'm wondering if a MinIO config already officially exists on birdhouse-deploy?
If you enable the
Not at the moment, you likely have a custom component that sets one up for PAVICS. Do you know if this is true @tlvu? |
Huh, that's convenient ! Ok thanks ! |
@mishaschwartz Yes I deployed one, just for prototyping so did not hook it up "officially" to the PAVICS stack because it's not behind the proxy and Magpie. @huard like Misha, I'd rather have the MinIO for the monitoring and MinIO for other usage be separate. It much simplifies the management (custom config, version requirements, upgrade, ...). I think Weaver or Magpie was hooking into the default Posgres at some point, then it also rolls it's own Postgres container because it requires a different version of Postgres. @mishaschwartz Can this new component be enabled standalone? Meaning on another physical host with only itself and the proxy, as describe in this comment #277 (comment)? If not yet standalone, it's okay we can do it in another PR. |
Good point, makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick review just by reading the code, not tried to deploy yet.
|
||
This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics`` | ||
component. Data will be collected from the ``prometheus-longterm-metrics`` and stored in an S3 object store | ||
indefinitely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indefinitely ! Do we actually want this? Can we set an expiry after like 10 years?
Grafana will be able to display data from Thanos go to back to 10 years? With this kind of extreme long term stats, what is the UI to visualize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-monitoring
component without thanos and setting the PROMETHEUS_LONGTERM_RETENTION_TIME
to whatever you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, is there a switch to disable Thanos?
Same question, in the case we want to use Thanos, how to visualize the data stored on Thanos? I assume if Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage so without data being stored in Prometheus, how to visualize that data stored on Thanos.
Just a question. If another component is required, we can do it in a follow up Pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, is there a switch to disable Thanos?
To answer my own question, Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function standalone of Thanos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component
Yes that's right. prometheus-longterm-metrics
collects and stores specific metrics that we want to keep for longer from prometheus
. If you want to also enable thanos
, then thanos
will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see how much space that will take in practice and we will adjust. And eventually we need a way to visualize those older metrics. Otherwise, what's the point to keep them forever if we can not visualize?
" | ||
|
||
COMPONENT_DEPENDENCIES=" | ||
./components/monitoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this dependency will make the new Prometheus long term not able to run standalone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it cannot run standalone, it collects metrics from the services that are enabled in the monitoring component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say I already have a bunch of PAVICS servers running with monitoring enabled and I just want to point this new Prometheus to aggregate all the data?
Can be in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this PR is to create a method for saving existing prometheus data over the long term for a single birdhouse/PAVICS deployment. If you want something that will collect prometheus data for multiple servers I would recommend creating a new repository to host that code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the architecture here is pluggable, we do not need a separate repo and duplicate the work. To deploy the 2nd Prometheus only, on a separate machine, I see we only enable the proxy and the prometheus-longterm-metrics and optionally thanos components on the new machine and that's it.
But agree we can make this "standalone" support in a separate PR.
|
||
scrape_configs: | ||
- job_name: 'federate' | ||
scrape_interval: 15s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a long term but it still scrape every 15 sec? Or I might misunderstood this config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It collects and stores metrics that are already collected by the other prometheus service. Depending on how often the other prometheus instance is calculating those metrics, you may not need such a small scrape_interval
. What interval would you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not fully understand federation yet but I had the impression the 2nd Prometheus would query the max/min/average value from the 1st Prometheus at a much longer interval. Ex: the 1st Prometheus scrap every minute. The 2nd Prometheus would scrap every hour, taking the max/min/average over the hour of the 1st Prometheus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need multiple prometheus instances for that. You just need to create recording rules that calculate the metrics that you want to store for a longer period and then use the second prometheus instance to grab those metrics for long-term storage.
The second prometheus instance is necessary only because the data retention time is configured for each instance, so if you want to keep data longer you need a separate prometheus instance.
If you're storing data for a long time with thanos
instead, you still need a second prometheus instance so that you can select which data thanos
will store. thanos
stores all data from a prometheus instance, you cannot pick and choose, so we get around this problem by only collecting the data we are interested in the "longterm" prometheus instance and only using thanos
to store data from that specific instance.
The best way to think about it is:
prometheus
: collects all metrics we are interested in and stores them for a short period of timeprometheus-longterm-metrics
: selects some metrics collected byprometheus
that we are interested in storing for a long period of time.thanos
: selects all metrics fromprometheus-longterm-metrics
and stores them efficiently over an even longer period of time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see clearer now, thanks !
Just to confirm, the current setup prometheus-longterm-metrics
is collecting everything from prometheus
so we are not yet "selecting some metrics" yet, right?
It's alright we can perform the selection in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, the current setup
prometheus-longterm-metrics
is collecting everything fromprometheus
so we are not yet "selecting some metrics" yet, right?
Opps sorry, just read your other reply now and so to answer my own remark: prometheus-longterm-metrics
is collecting only metrics from group: longterm-metrics
from prometheus
.
Not with the current way that we deploy without some serious additional coordination to support the networking and permissions. If we used something like kubernetes or docker swarm to deploy this would be easier to deploy on multiple hosts but we're not there yet. |
Right. Let's say we expose the 1st Prometheus port to the PAVICS docker host and configure the firewall on the PAVICS docker host to allow connection from the 2nd Prometheus long term only, would that would that work? |
rules: | ||
# percentage of the time, over the last hour, that all CPUs were working | ||
# 1 means all CPUs were working all the time, 0 means they were all idle all the time | ||
- record: instance:cpu_load:avg_rate1h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the 2nd Prometheus-longterm will scrap only these 2 new metrics or it will also scrap all the existing ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, only these two. I suggest we add more rules to this default list in the future though.
If you have other metrics that you want it to scrape for long term storage, you just have to give that rule the label group: longterm-metrics
.
|
||
static_configs: | ||
- targets: | ||
- 'prometheus:9090' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make this target list a template expansion variable, then we can easily add more hosts and override this default local Prometheus and points to other hosts, ex:
- existing_pavics_1:9090
- existing_pavics_2:9090
- ...
This assume I open up the Prometheus port on existing_pavics_1
and existing_pavics_2
and I handle manually the inclusion of prometheus.rules
on on existing_pavics_1
and existing_pavics_2
.
Then you can remove the dependency on ./components/monitoring
.
Just add a note in the README to enable ./components/monitoring
together with this ./optional-components/prometheus-longterm-metrics
component only if the admin wants both Prometheus to be on the same machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is for a separate discussion and out of the scope of this PR.
We can discuss the best way to handle multiple targets later on but at the very least we need to consider the security implications of recommending users open up additional ports.
I really think that any code that handles monitoring multiple prometheus endpoints should go in a separate repo.
Yup, that would work. I think that code can go in a different repo though. |
Yup, I never see this code in this repo. Ouranos have our own override repo so we can do this in our own override. Because only our repo will actually know what are all the "other" hostnames ! The only change needed for this repo so to turn the targets section in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good for now. Any extra improvements can be done in subsequent PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! This should indeed make the longterm prometheus standalone if need be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi. Sorry for the delay of the review.
Still going through my plethora of emails following my long leave.
Nice feature. Will rely on the configurations to have been tested on a local instance since I cannot deep dive into it yet.
I think the README documentation with the existing monitoring component would benefit from a cross-reference of these new optional components details. A user reading about the monitoring component might not have the reflex to look for corresponding optional components definitions, and might get confused by the mixed variables/config/rule references. Also, they might not be able to figure out that thanos
is relevant for prometheus and vice-versa.
In the same line of thought, it is not clear yet to me what are all the impacts of the various components and how they interact between each other. There are implicit dependencies between thanos
, prometheus
, prometheus-longterm-metrics
, as well as grafana
, alertmanager
, cadvisor
, others(?) through components/monitoring
and all their derived dockers/volumes, but it is really not obvious unless one has personally integrated them. Can we add some guidance in the docs to help disambiguate all of them?
* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos | ||
* ``/thanos-minio``: a minio web console to inspect the data stored by minio. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should those be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which bits would be useful to configure? The endpoints, paths, images, other?
I agree that we could always add more configuration options, I'm just wondering which are a priority for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.
# Note that this component does not depend explicitly on the `components/monitoring` component so that this can | ||
# theoretically be deployed on a different machine than the `prometheus` service. This is currently untested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious how that would work out?
It seems to depend on prometheus:9090
and extends the volumes of prometheus
service.
I think it is more confusing to omit the dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a completely separate service with no shared volumes. It does depend on prometheus:9090
by default but that default can be updated by setting the PROMETHEUS_LONGTERM_TARGETS
variable.
PROMETHEUS_LONGTERM_TARGETS
can point to any prometheus endpoints at all.
See the discussion about this here: #461 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the rules are mounted under prometheus
service here:
https://github.com/bird-house/birdhouse-deploy/blob/longterm-monitoring/birdhouse/optional-components/prometheus-longterm-rules/config/monitoring/docker-compose-extra.yml
So, whether PROMETHEUS_LONGTERM_TARGETS
refers to the same instance or a custom remote one, the local instance needs to have the long-term rules defined. Therefore, doesn't that mean that prometheus
service (i.e.: using components/monitoring
) becomes mandatory regardless, since it is needed to mount the rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, prometheus must be running on the same host as the rest of the birdhouse stack.
That first prometheus instance must have at least one rule that has the longterm-metrics
group label in order for a second prometheus instance (either running on the same machine or elsewhere) to know which metrics it should monitor and store from the first prometheus instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a second look at the directory structure, and starting to see why components/monitoring
is not a dependency. Is it only because components/monitoring
combines cardvisor, grafana, alertmanager, prometheus, etc. all together?
Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus
service with a definition similar to this?
prometheus: | |
image: ${PROMETHEUS_IMAGE} | |
container_name: prometheus | |
volumes: | |
- ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro | |
- ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro | |
- prometheus_persistence:/prometheus:rw | |
command: | |
# restore original CMD from image | |
- --config.file=/etc/prometheus/prometheus.yml | |
- --storage.tsdb.path=/prometheus | |
- --web.console.libraries=/usr/share/prometheus/console_libraries | |
- --web.console.templates=/usr/share/prometheus/consoles | |
# https://prometheus.io/docs/prometheus/latest/storage/ | |
- --storage.tsdb.retention.time=90d | |
# wrong default was http://container-hash:9090/ | |
- --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus/ | |
restart: always |
I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why):
From a user's perspective, this all seems really convoluted.
Isn't there are way to simplify this hierarchy? For example, what if there was a distinct optional-components/prometheus
definition, and components/monitoring
depends on it. Then, optional-components/prometheus-longterm-metrics could also depend only on optional-components/prometheus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, if someone wants "only" longterm-metrics, they need to define their own prometheus service with a definition similar to this?
If someone only wants longterm metrics then they can have a single prometheus service that sets --storage.tsdb.retention.time
to something longer than 90 days.
However, if you want some metrics to be stored for longer than others, you need a second prometheus instance. This is because the --storage.tsdb.retention.time
value is set for an entire prometheus instance (see discussion here: #277 (comment) and here: #461 (comment)).
I think part of what makes all of this confusing is that there are 2 different, though related, component definitions (and I still do not understand why)
These are separate in order to accommodate @tlvu's request to be able to deploy the prometheus-longterm-metrics
component on a different server. prometheus-longterm-rules are just suggested default rules to be added to the prometheus
server. As described in the README file:
This adds some default longterm metrics rules to the `prometheus` component
These rules are added to the prometheus
component on the same machine that is running the rest of the stack because that is where all the relevant metrics are collected by the rest of the monitoring components (cadvisor, nodeexporter). These rules are not necessary, they are recommended defaults but you can customize them however you like. It is easier to customize and add your own if the rules are separate. Users can choose to include the defaults, or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Misha's explanations. I just want to add that each organization needs are different so we should not force a certain "configuration".
Specifically to this PR, we should not force the 2 Prometheus to be on the same machine. So optional-components/prometheus-longterm-metrics
should not depend on components/monitoring
. We should also not be forced to use optional-components/prometheus-longterm-rules
because each org will probably have different metrics they deemed useful to keep for longterm.
During Misha's leave, a sysadmin from PCIC actually had a question about how to pull existing metrics from our PAVICS Prometheus into his own centralized Prometheus for a centralized view of all his servers in one place. I point him to this PR for inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I was not clear in my explanation. What I intended to say was:
"if someone wants only prometheus
monitoring with both long-term and short term metrics, they need to define their own prometheus
service" [...], because they cannot activate prometheus
on its own (i.e.: they must use the full package components/monitoring
that includes all other services).
Having short/long-term separate, and having them configurable on different servers is perfectly fine by me. I'm all for that flexibility, and never mentioned otherwise.
It just seems from the configuration files that, in order for long-term metrics to be sent to the remote server, the rules must be mounted into the short-term local prometheus
(as shown below), and since that prometheus
must exist, then it should be a dependency. I understand long-term-prometheus
NOT being a dependency to toggle it independently of the short-term one. This is good. Here I am referring to https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-rules having component/monitoring
dependency, NOT https://github.com/bird-house/birdhouse-deploy/tree/3f75d496b932ab4d90857206003d0225cd20c435/birdhouse/optional-components/prometheus-longterm-metrics. In other words, if one adds optional-components/prometheus-longterm-rules
ONLY without also thinking to add components/monitoring
, docker-compose will complain that service: prometheus
is not defined.
Lines 3 to 6 in 3f75d49
services: | |
prometheus: | |
volumes: | |
- ./optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules:/etc/prometheus/prometheus-longterm-metrics.rules:ro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understand your concern now.
In other words, if one adds optional-components/prometheus-longterm-rules ONLY without also thinking to add components/monitoring, docker-compose will complain that service: prometheus is not defined.
Docker compose won't actually complain in this case. Additional settings defined under the config/
folder in components are only applied if the component that it references is also enabled. So in this case, if you add prometheus-longterm-rules
as a component but not monitoring
, then nothing will happen (since the additional docker-compose settings are in prometheus-longterm-rules/config/monitoring
.
If we do make the prometheus-longterm-rules
component dependent on the monitoring
components then if someone adds the prometheus-longterm-rules
but not the monitoring
rule to the BIRDHOUSE_EXTRA_CONF_DIRS
configuration variable, then monitoring
will be added in automatically. This is also a surprising behaviour that users may not expect.
My opinion is that we should leave it as is, so that users must specify both components if they want both components. I'm open to other opinions on this subject though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for validating how this resolves.
I'm fine with either approach if the README indicates clearly what is supposed to happen and what to do if the one or the other situation is desired.
Prometheus Long-term Metrics | ||
---------------------------- | ||
|
||
This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance | ||
(the one created by the ``components/monitoring`` component). | ||
|
||
Longterm metrics are any prometheus rule that have the label ``group: longterm-metrics`` or in other words are | ||
selectable using prometheus' ``'{group="longterm-metrics"}'`` query filter. To add some default longterm metrics rules | ||
also enable the ``prometheus-longterm-rules`` component. | ||
|
||
You may also choose to create your own set of rules instead of, or as well as, the default ones. See how to | ||
:ref:`add additional rules here <monitoring-customize-the-component>`. | ||
|
||
To configure this component: | ||
|
||
* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus | ||
|
||
Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``. | ||
|
||
.. _prometheus-longterm-rules | ||
|
||
Prometheus Long-term Rules | ||
-------------------------- | ||
|
||
This adds some default longterm metrics rules to the `prometheus` component for use by the `prometheus-longterm-metrics` | ||
component. These rules all have the label ``group: longterm-metrics``. | ||
|
||
To see which rules are added, check out the | ||
`optional-components/prometheus-longterm-rules/config/monitoring/prometheus.rules` file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-reading this following other comments in the PR, I am left only more confused...
Putting ourselves in the shoes of someone having absolutely no idea about the implementation details of each component and how they interact with each other, what must the user do to achieve some results (aka: the "To enable this optional-component:" of each other section)?
ie:
- If user wants SHORT only, what must be enabled (guessing only
components/monitoring
?) - If user wants SHORT + LONG all locally, what must be enabled (
components/monitoring
+optional-components/prometheus-longterm-metrics
?) (+optional-components/prometheus-longterm-rules
depending on other dependency comment) - If user wants SHORT local and LONG remote (
components/monitoring
+optional-components/prometheus-longterm-rules
+ IMPORTANTPROMETHEUS_LONGTERM_TARGETS
override ?)
Specifically regarding the "Prometheus Long-term Rules" section.
It is mentioned that they are applied to prometheus
. (this is fine, I understand why to apply the group tags)
However, for a user without the details of "why", it is very confusing! I can see a user think they "should logically" be applied to prometheus-longterm-rules
service given their names. A much better explanation must be given.
Basically, something must indicate that service prometheus
is the source that feeds the others instances using optional-components/prometheus-longterm-rules
. Then, using optional-components/prometheus-longterm-metrics
, they can decide that this receiving instance is another local one (ie: prometheus-longterm-metrics
service), or dispatch it remotely by overriding PROMETHEUS_LONGTERM_TARGETS
accoridngly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok... just to clarify what you're asking for here ...
You would like specific instructions about how to set up each of these three cases:
- If user wants SHORT only ...
- If user wants SHORT + LONG all locally ...
- If user wants SHORT local and LONG remote ...
And you would like an additional description of how the various monitoring components interact?
Is there anything else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's it. Just mitigate user expectations and ensure there is a common understanding of the components/services.
@fmigneault please check out the changes since your last review when you have a minute (https://github.com/bird-house/birdhouse-deploy/pull/461/files/3f75d496b932ab4d90857206003d0225cd20c435..59f6c6819ff6529b5ce2c43b8b04a723d5bfd437) |
Overview
The
prometheus-longterm-metrics
component collects longterm monitoring metrics from the original prometheus instance (the one created by thecomponents/monitoring
component).Longterm metrics are any prometheus rule that have the label
group: longterm-metrics
or in other words are selectable using prometheus's'{group="longterm-metrics"}'
query filter. To see which longterm metric rules are added by default see theoptional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template
file.To configure this component:
PROMETHEUS_LONGTERM_RETENTION_TIME
variable to set how long the data will be kept by prometheusPROMETHEUS_LONGTERM_STORE_INTERVAL
variable to set how often the longterm metrics rules will be calculated. For example, setting it to10h
will calculate these metrics every 10 hours.Enabling the
prometheus-longterm-metrics
component creates the additional endpoint/prometheus-longterm-metrics
.The
thanos
component enables better storage of longterm metrics collected by theoptional-components/prometheus-longterm-metrics
component. Data will be collected from theprometheus-longterm-metrics
and stored in an S3 object store indefinitely.When enabling this component, please change the default values for the
MINIO_ROOT_USER
andMINIO_ROOT_PASSWORD
by updating theenv.local
file. These set the login credentials for the root user that runs the minio object store.Enabling the
thanos
component creates the additional endpoints:/thanos-query
: a prometheus-like query interface to inspect the data stored by thanos/thanos-minio
: a minio web console to inspect the data stored by minio.This also includes an update to the prometheus version from
v2.19.0
to the current latestv2.52.0
. This is torequired to support the interaction between prometheus and thanos.
Changes
Non-breaking changes
Related Issue / Discussion
birdhouse/optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template
file.Additional Information
CI Operations
birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false