
🐛 [BUG] - use built-in metric to create Latency SLO On Dynatrace #342

Open · GeoffroyLatourDK opened this issue Jun 7, 2023 · 14 comments
Labels: bug (Something isn't working)

@GeoffroyLatourDK

SLO Generator Version

v2.3.4

Python Version

3.10.11

What happened?

In the documentation, the threshold method is shown with an ext: metric coming from a OneAgent or ActiveGate extension:

ext:app.request_latency

Is this mandatory, or can we also use built-in metrics like these?

builtin:service.response.client
builtin:service.keyRequest.response.time

It would also be great to add a list of usable metrics to the documentation :)

What did you expect?

I expected to get a valid result using these two built-in metrics:
builtin:service.response.client
builtin:service.keyRequest.response.time
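
For reference, here is a minimal sketch of the SLI block I have in mind, simply swapping the documented ext: metric for a built-in one. The service name and threshold are placeholders, and I am assuming the Dynatrace backend accepts built-in metric selectors:

service_level_indicator:
  query_valid:
    metric_selector: builtin:service.response.client
    entity_selector: type("service"),entityName("My service")
  threshold: 200000         # us, i.e. 200 ms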


Relevant log output

No response

GeoffroyLatourDK added the bug and triage labels on Jun 7, 2023
@lvaylet (Collaborator) commented Jun 9, 2023

Hi @GeoffroyLatourDK, thanks for reporting this behavior. Have you actually tried using these built-in metrics? If so, could you share the output and error message(s)? Ideally, I would like to reproduce the issue.

@GeoffroyLatourDK (Author)

Hello @lvaylet,
As you can see in the screenshot below, I do not get any errors in the output, but the number of good and bad events is miscalculated.
[screenshot]

In fact, it is really the number of events that does not seem to match. In the screenshot below,
[screenshot]

I am using the builtin:service.errors.client.successCount metric, which lets me count the number of successful calls. There is a big difference between the roughly 6,830 calls (6k83) reported on one side and the 60 calls on the other.

@lvaylet (Collaborator) commented Jun 14, 2023

Can you share your SLO definition, either as YAML or JSON?

@GeoffroyLatourDK (Author)

Hello @lvaylet, of course!
I just had to anonymize the names, but the structure and the metrics used remain identical.

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name:  dummy name
  labels:
    service_name:  dummy name
    feature_name:  dummy name
    slo_name:  dummy name
spec:
  description: dummy name
  backend: dynatrace/prod
  method: threshold
  service_level_indicator:
    query_valid:
      metric_selector: builtin:service.response.time
      entity_selector: type("service"),entityName("Dummy service")
    threshold: 200000.000         # us
  goal: 0.95
  frequency: '*/5 * * * *'

@lvaylet (Collaborator) commented Jun 14, 2023

Thanks @GeoffroyLatourDK. Can you also enable debug mode and share the output? For example, by setting the DEBUG environment variable to 1 before calling slo-generator compute ...:

$ DEBUG=1 slo-generator compute -f <SLO_CONFIG_PATH> -c <SHARED_CONFIG_PATH>
[...]

In the meantime, I am trying my best to get my hands on a Dynatrace environment.

@lvaylet (Collaborator) commented Jun 14, 2023

Looking at your SLO definition, can you also share what the frequency: '*/5 * * * *' field on the last line is supposed to do?

@GeoffroyLatourDK (Author)

Hello @lvaylet,
The frequency: '*/5 * * * *' line was a copy-paste mistake on my side, sorry ^^"
In the attached debug file below you will find the output of the compute run in debug mode :)
debug.txt

Thanks for your time and help.

GeoffroyLatourDK changed the title from "🐛 [BUG] - use built-in metric to create SLO" to "🐛 [BUG] - use built-in metric to create Latency SLO On Dynatrace" on Jul 12, 2023
@GeoffroyLatourDK (Author)

Hello @lvaylet, have you had some time to investigate this issue?

@lvaylet (Collaborator) commented Aug 16, 2023

Hi @GeoffroyLatourDK. Apologies for the late reply. I was on vacation and off the grid.

I do not see anything suspicious in your SLO definition. That being said, I am surprised by the huge difference between the expected (6,830) and actual (60) values. That is two orders of magnitude! Are we really looking at the same metric? With the same filters (or absence of filters)? Over the same duration? Debug mode lets us check the actual requests sent to the Dynatrace API. For example, on lines 67, 68 and 69 of debug.txt:

slo_generator.backends.dynatrace - DEBUG - Running "get" request to https://gwn38670.live.dynatrace.com/api/v2/metrics/query?from=1687249175000&end=1687252775000&metricSelector=builtin:service.response.time&entitySelector=type("service"),entityName("Catalina/localhost (/order)")&aggregation=SUM&includeData=True&Api-Token=   ...
urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): gwn38670.live.dynatrace.com:443
urllib3.connectionpool - DEBUG - https://gwn38670.live.dynatrace.com:443 "GET /api/v2/metrics/query?from=1687249175000&end=1687252775000&metricSelector=builtin:service.response.time&entitySelector=type(%22service%22),entityName(%22Catalina/localhost%20(/order)%22)&aggregation=SUM&includeData=True&Api-Token=   HTTP/1.1" 200 1781

Have you tried running these queries in the Dynatrace UI to confirm you get the same values? Have you also tried setting the Aggregation parameter manually to Sum in the Data explorer UI (set to Auto in your screenshot)? Finally, have you tried activating the Advanced mode (with the toggle switch at the top right) to get the equivalent query?

lvaylet removed the triage label on Aug 16, 2023
@lvaylet (Collaborator) commented Aug 16, 2023

On an unrelated topic, I just noticed this performance warning at line 324 in the debug output:

             'warnings': ['The used `entityName` clause may severely degrade '
                          'the performance of your query. Please consider '
                          'using any of the following to improve query '
                          'performance: `entityName.in`, `entityName.equals`, '
                          '`entityName.startsWith`. If you need to check for '
                          'containment, please use `entityName.contains`.']}],

Most probably this has no impact on the output, but it is worth addressing anyway.
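
If you do want to address it, here is a minimal sketch of the same SLI block with the entity selector rewritten using one of the clauses suggested in the warning. I have not tested it against your environment, so treat it as an assumption:

    query_valid:
      metric_selector: builtin:service.response.time
      entity_selector: type("service"),entityName.equals("Dummy service")
    threshold: 200000.000         # us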

@lvaylet (Collaborator) commented Aug 17, 2023

Hi again, I also noticed that your SLO definition sets spec.service_level_indicator.query_valid.entity_selector to type("service"),entityName("Dummy service") while your screenshot shows these values at two different places: Split by and Filter by. I am not an expert at Dynatrace Query Language (DQL). Does that translate to the same query at the end of the day?

@GeoffroyLatourDK (Author)

Hello, as an update: I have checked a few more parameters via the UI and found the Fold transformation parameter. When I change it from auto to count, I get the same result as the SLO Generator output. For the moment, however, I cannot explain the huge difference between the SLO Generator SLI and the Dynatrace SLI. I will post another update soon :)

@lvaylet (Collaborator) commented Sep 15, 2023

Hi @GeoffroyLatourDK, any update to share?

@GeoffroyLatourDK (Author)

Hello @lvaylet, as far as my investigation goes, this is not a bug but a precision issue on the Dynatrace side. When you request data via the API over "large" periods of time, Dynatrace does not send all the data points, but only averages over a hundred or so time slots. For example, over a 28-day period, Dynatrace sends the average response time per 6-hour slot. In my case, however, a 6-hour slot may contain many peaks above my threshold, and they are not taken into account because the average stays below the threshold.

Finally, I do not know whether Dynatrace behaves this way for every kind of built-in metric, or differently for other types of metrics; we only work with built-in metrics.
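
As a rough sanity check of this explanation, and assuming the metrics API defaults to returning on the order of 100-120 data points per query (an assumption on my side): 28 days is about 40,320 minutes, and 40,320 / 120 ≈ 336 minutes, i.e. roughly 5.6 hours per data point, which is consistent with the ~6-hour averaging described above.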

So we can close the issue :)
