We recently had an M3DB outage in which a subset of metrics "disappeared" for a period of time and could not be queried. Here is the series of events that we think caused this:
1. We scaled up the M3DB cluster from 1 replica per isolation group to 2 replicas per isolation group. We have 3 replication groups.
2. We scaled down the cluster from 2 replicas per isolation group back to 1 replica per isolation group.
3. We scaled up the cluster again from 1 replica per isolation group to 2 replicas per isolation group.
After step 3, some metrics started to disappear from the cluster and could no longer be queried (those metrics showed a gap after step 3). All writes and reads to the cluster succeeded, and there were no failures. One thing worth mentioning: when the new replicas came up during the second scale-up in step 3, they used the same disks that had been provisioned for the replicas brought up in step 1. On a quick look, those disks had some index data, but I don't think they had any actual metrics data.
To mitigate this, we scaled down the cluster by editing the placement, deleted the old disks, and then scaled the cluster back up with newly provisioned disks. The metrics started to work normally again, and they reappeared even for the incident window, so there was no longer a metric gap.
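For anyone repeating scale-up and scale-down steps like the ones above, one sanity check is to confirm that each placement change has settled before starting the next. The sketch below is a minimal, hypothetical helper, not part of M3 itself. It assumes the placement has already been fetched as JSON from the coordinator (for example from `/api/v1/services/m3db/placement`) and that the response has the shape `placement.instances.<id>` with `isolationGroup` and a `shards` list carrying `id` and `state`. The sample data and instance names are made up for illustration.

```python
from collections import Counter


def summarize_placement(response: dict) -> dict:
    """Count instances per isolation group and list shards that are not AVAILABLE.

    Assumes the response shape described in the text above; missing keys are
    treated as empty rather than raising.
    """
    instances = response.get("placement", {}).get("instances", {})
    per_group = Counter()
    unsettled = {}
    for instance_id, inst in instances.items():
        per_group[inst.get("isolationGroup", "unknown")] += 1
        # Shards still INITIALIZING or LEAVING mean the change has not settled.
        pending = [s["id"] for s in inst.get("shards", []) if s.get("state") != "AVAILABLE"]
        if pending:
            unsettled[instance_id] = pending
    return {"instances_per_group": dict(per_group), "unsettled_shards": unsettled}


# Hypothetical sample: one settled instance and one still bootstrapping shard 1.
sample = {
    "placement": {
        "instances": {
            "m3db-rep0-0": {
                "isolationGroup": "group-a",
                "shards": [{"id": 0, "state": "AVAILABLE"}, {"id": 1, "state": "AVAILABLE"}],
            },
            "m3db-rep0-1": {
                "isolationGroup": "group-a",
                "shards": [{"id": 0, "state": "AVAILABLE"}, {"id": 1, "state": "INITIALIZING"}],
            },
        }
    }
}

if __name__ == "__main__":
    print(summarize_placement(sample))
```

An empty `unsettled_shards` map would suggest the previous placement change has finished. It says nothing about what data is already on a reused disk, which is the part of this incident we are unsure about.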
What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc)
M3DB v1.3.0
What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?
We're performing reads and writes through m3 coordinators
Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions.
We haven't attempted to reproduce this issue yet, but we wanted to check whether the series of events above is something that should not be done in general.
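For a reproduction attempt, the "metric gap" could be confirmed mechanically rather than by eye. The following is a minimal sketch under stated assumptions: the sample timestamps for a series have already been pulled out of a range query through the coordinator, the scrape interval is known, and `find_gaps`, `step_seconds`, and `tolerance` are hypothetical names chosen for this example.

```python
def find_gaps(timestamps, step_seconds, tolerance=1.5):
    """Return (last_seen, next_seen) pairs where consecutive samples are
    further apart than step_seconds * tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > step_seconds * tolerance:
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Hypothetical samples at a 10s interval with a hole between t=20 and t=60.
    print(find_gaps([0, 10, 20, 60, 70], step_seconds=10))
```

Running this for the same series before and after each placement change would show whether data for the incident window is missing or only temporarily unreadable.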
Please let me know if you need any more details or configs. Is the series of events above not meant to work?