This repository has been archived by the owner on Jun 11, 2021. It is now read-only.

Add burn rate threshold levels for the SLO #6

Open
slok opened this issue Nov 13, 2018 · 0 comments

slok commented Nov 13, 2018

Alerting on burn rate thresholds would be easier if the operator exposed metrics derived from the thresholds defined in the CRD.

My current idea is to have something like this in the CRD:

apiVersion: measure.slok.xyz/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    # A typical 5xx request SLO.
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      burnRates:
        - errorBudgetDays: 30
          thresholds:
            - timeRangeHours: 1
              errorBudgetPercent: 2
            - timeRangeHours: 6
              errorBudgetPercent: 5
            - timeRangeHours: 72
              errorBudgetPercent: 10
      serviceLevelIndicator:
        prometheus:
          address: http://127.0.0.1:9091
          totalQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com"}[2m]))
          errorQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com", code=~"5.."}[2m]))
      output:
        prometheus: {}

We could have multiple burnRates, each with multiple thresholds.
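
To make the nesting concrete, here is a minimal Python sketch (not the operator's code) that flattens a burnRates section into one row per threshold. The dictionary mirrors the manifest above. The second entry with a 70 day budget is an assumption added only to show more than one burn rate, and `flatten_thresholds` is a hypothetical helper name.

```python
# Hypothetical in-memory form of the burnRates section of the CRD.
# The 30 day entry copies the manifest; the 70 day entry is illustrative only.
burn_rates = [
    {
        "errorBudgetDays": 30,
        "thresholds": [
            {"timeRangeHours": 1, "errorBudgetPercent": 2},
            {"timeRangeHours": 6, "errorBudgetPercent": 5},
            {"timeRangeHours": 72, "errorBudgetPercent": 10},
        ],
    },
    {
        "errorBudgetDays": 70,
        "thresholds": [
            {"timeRangeHours": 6, "errorBudgetPercent": 3},
            {"timeRangeHours": 24, "errorBudgetPercent": 7},
            {"timeRangeHours": 168, "errorBudgetPercent": 10},
        ],
    },
]


def flatten_thresholds(burn_rates):
    """Yield one (budget_days, range_hours, budget_percent) tuple per threshold."""
    for burn_rate in burn_rates:
        for threshold in burn_rate["thresholds"]:
            yield (
                burn_rate["errorBudgetDays"],
                threshold["timeRangeHours"],
                threshold["errorBudgetPercent"],
            )


for row in flatten_thresholds(burn_rates):
    print(row)
```

Each resulting row would map to one threshold metric series per SLO.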

I have a branch that creates the threshold metrics and sets the threshold information on labels:

# HELP service_level_slo_burn_rate_threshold Is the threshold for a burn rate period.
# TYPE service_level_slo_burn_rate_threshold gauge
service_level_slo_burn_rate_threshold{burn_rate_range="168h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="24h",error_budget_spent="7%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 4.9
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="3%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 8.4
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 1
# HELP service_level_slo_objective_ratio Is the objective of the SLO in ratio unit.
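
The sample values above are consistent with a simple formula: the burn rate that spends a given percentage of the error budget within a given time range. The Python sketch below reproduces them. The formula is inferred from the sample output and is not confirmed as what the branch implements; `burn_rate_threshold` and `threshold_labels` are hypothetical names.

```python
def burn_rate_threshold(error_budget_days, time_range_hours, error_budget_percent):
    """Burn rate that spends error_budget_percent of the budget in time_range_hours."""
    budget_window_hours = error_budget_days * 24
    # Multiply before dividing so integer inputs stay exact until the final division.
    return (error_budget_percent * budget_window_hours) / (100 * time_range_hours)


def threshold_labels(error_budget_days, time_range_hours, error_budget_percent):
    """Hypothetical helper: build the threshold-specific labels seen in the sample output."""
    return {
        "burn_rate_range": f"{time_range_hours}h",
        "error_budget_spent": f"{error_budget_percent}%",
        "total_error_budget_range": f"{error_budget_days}d",
    }


# (errorBudgetDays, timeRangeHours, errorBudgetPercent) taken from the sample metrics.
samples = [(30, 1, 2), (30, 6, 5), (30, 72, 10), (70, 6, 3), (70, 24, 7), (70, 168, 10)]
for days, hours, percent in samples:
    print(threshold_labels(days, hours, percent), burn_rate_threshold(days, hours, percent))
```

A threshold of 1 means the error budget would be used up exactly over the full budget window, while 14.4 means it is being consumed 14.4 times faster than that.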

Any thoughts? @ese

slok changed the title from "Add burn rate levels for the SLO" to "Add burn rate threshold levels for the SLO" on Nov 13, 2018