No data on various dashboards #699

Open
j2clerck opened this issue Dec 3, 2024 · 7 comments

Comments

@j2clerck

j2clerck commented Dec 3, 2024

Hello,

I have an issue deploying the generic sample on EKS.
I created a monitoring namespace and successfully deployed the solution.
However, the dashboards below are returning No Data for most, if not all, metrics.

  • NGINX Ingress controller
  • Perf / Node Utilization
  • SAS Launched Jobs (User activity)
  • SAS Launched Jobs (Node activity)

How can I troubleshoot the source of the problem in this case?
I've deployed the solution on another cluster and it works fine.

Thank you,
Joseph de Clerck

@gsmith-sas
Member

Hello,

I am sorry to hear you are having some issues. The first troubleshooting steps I recommend are:

  • confirming that all of the pods in the monitoring space are up and running;
  • and checking the pod logs for those pods to see if they are reporting any issues.
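
As a rough illustration of the first check, the sketch below filters pod status JSON for pods that are not fully ready. It assumes the input has the shape produced by `kubectl get pods -n monitoring -o json`; the pod names in the sample are hypothetical and do not come from this deployment.

```python
import json
from typing import List


def pods_not_ready(pod_list: dict) -> List[str]:
    """Return names of pods that are not Running with every container ready."""
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Job pods are expected to stop
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            problems.append(pod["metadata"]["name"])
    return problems


# Hypothetical sample in the shape of `kubectl get pods -n monitoring -o json`.
SAMPLE_PODS = json.loads("""
{"items": [
  {"metadata": {"name": "example-grafana-0"},
   "status": {"phase": "Running", "containerStatuses": [{"ready": true}]}},
  {"metadata": {"name": "example-prometheus-0"},
   "status": {"phase": "Running", "containerStatuses": [{"ready": false}]}},
  {"metadata": {"name": "example-exporter-abcde"},
   "status": {"phase": "Pending"}}
]}
""")

print(pods_not_ready(SAMPLE_PODS))
```

Any pod name this returns is a candidate for `kubectl logs` and `kubectl describe pod` in the same namespace.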

The Perf/Node Utilization dashboard primarily reports metrics collected from the node-exporter component. The node-exporter is deployed via a DaemonSet and should have a pod running on every node in the cluster. If any metrics are not showing up, or if metrics are there for some nodes and not others, I would confirm that there is, in fact, a node-exporter pod running on every node and that their pod logs show no problems.
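
One possible way to check node coverage is to compare the node list against the nodes that host a running node-exporter pod. This is only a sketch: it assumes the inputs come from `kubectl get nodes -o json` and `kubectl get pods -n monitoring -o json`, and that the pod names contain the fragment `node-exporter`, which may differ in a given deployment. The node and pod names in the sample are hypothetical.

```python
from typing import List


def nodes_without_exporter(node_list: dict, pod_list: dict,
                           name_fragment: str = "node-exporter") -> List[str]:
    """Return node names with no Running pod whose name contains name_fragment."""
    all_nodes = {n["metadata"]["name"] for n in node_list.get("items", [])}
    covered = {
        p.get("spec", {}).get("nodeName")
        for p in pod_list.get("items", [])
        if name_fragment in p["metadata"]["name"]
        and p.get("status", {}).get("phase") == "Running"
    }
    return sorted(all_nodes - covered)


# Hypothetical sample data; real input would come from kubectl.
SAMPLE_NODES = {"items": [{"metadata": {"name": n}} for n in ("node-a", "node-b", "node-c")]}
SAMPLE_EXPORTER_PODS = {"items": [
    {"metadata": {"name": "example-node-exporter-1"},
     "spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "example-node-exporter-2"},
     "spec": {"nodeName": "node-b"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "example-grafana-0"},
     "spec": {"nodeName": "node-c"}, "status": {"phase": "Running"}},
]}

print(nodes_without_exporter(SAMPLE_NODES, SAMPLE_EXPORTER_PODS))
```

A node can appear in the result for a legitimate reason, for example a taint that the DaemonSet does not tolerate, so the output is a starting point for investigation rather than proof of a fault.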

The two SAS Launched Jobs dashboards require that the SAS Workload Orchestrator (SWO) be part of your SAS Viya deployment. Metrics will only show on those dashboards if SAS jobs (launched via SWO) are running or have been running during the time period selected. The Prometheus Pushgateway component also needs to be deployed (and running) within the same namespace as the SAS Viya deployment. We deploy that component when the monitoring/bin/deploy_monitoring_viya.sh script is run. So, if you have not run that script, you will need to run it. If you have run it, please confirm the Pushgateway pod is running and there are no error messages in the pod logs.

I hope these steps will help you identify the problem. Please let us know how things go.

Regards,
Greg Smith

@j2clerck
Author

j2clerck commented Dec 3, 2024

Hi Greg,

First of all, thank you for your prompt reply.
All the pods are looking OK as far as I can tell.
Regarding node exporter, it seems to be running fine as I can see metrics individually and I can see the Perf / Node Utilization detail dashboard without issue.
We have SWO enabled on our installation, so I would expect to see the metrics.
Prometheus push gateway is up and running and not complaining.

Thanks again,
Joseph de Clerck

@j2clerck
Author

j2clerck commented Dec 3, 2024

Adding some troubleshooting:
The Perf / Node Utilization dashboard uses variables such as NodeClass, Node and instance.

I can validate the variables for NodeClass and Node, but I cannot validate the variable for instance.
The instance definition is: label_values(node_uname_info{exported_nodename=~"(?i:$Node))"}, instance)

Would that mean that node exporter is not exporting the right metrics or labels?

Perf / Node Utilization is fixed. The exported_nodename label was not matching the $Node value because a custom DNS name is applied to exported_nodename.
I replaced it with nodename and it works in my environment.
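
To illustrate why the variable came back empty, here is a small sketch of the matching logic. Prometheus regex matchers are fully anchored, so the label value has to match the whole pattern, and `(?i:...)` only relaxes letter case. The host names, the instance address, and the use of `re.escape` to stand in for Grafana's variable interpolation are all assumptions made for the example.

```python
import re
from typing import Dict, List


def label_matches(label_value: str, selected_node: str) -> bool:
    """Mimic an anchored, case-insensitive =~ match against one node name."""
    pattern = "(?i:" + re.escape(selected_node) + ")"
    return re.fullmatch(pattern, label_value) is not None


def instances_for(series: List[Dict[str, str]], label: str, node: str) -> List[str]:
    """Return the instance values of series whose given label matches the node."""
    return sorted({s["instance"] for s in series
                   if label_matches(s.get(label, ""), node)})


# Hypothetical node_uname_info label set with a custom DNS name.
SERIES = [{
    "nodename": "ip-10-0-1-23.ec2.internal",
    "exported_nodename": "worker-1.corp.example.com",
    "instance": "10.0.1.23:9100",
}]
NODE = "ip-10-0-1-23.ec2.internal"  # hypothetical value of the $Node variable

print(instances_for(SERIES, "exported_nodename", NODE))
print(instances_for(SERIES, "nodename", NODE))
```

With these sample values the lookup on exported_nodename finds nothing, while the lookup on nodename returns the instance, which mirrors the behavior described above.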

Now I am still struggling with the SAS Launched Jobs dashboard, which shows 0 metrics for :sas_launcher_pod_info:.
How can I confirm that SWO is working as expected and exposing metrics?

Thank you,
Joseph de Clerck

@gsmith-sas
Member

I am glad you were able to sort out the problem related to your custom host names.

With respect to the SAS Launched Jobs dashboards, here's how I would go about debugging things.

  • Bring up SAS/Studio in a browser and log in with a user.
  • Run some SAS code and confirm it executes. The actual code submitted doesn't really matter; it can be as simple as the following:
 data test;
   set sashelp.cars;
 run;
  • The goal is to confirm it successfully executes since that will validate everything behind the scenes is working.

  • Confirm that a SAS Compute Server pod has been created (the name will be something like: sas-compute-server-7ee83562-0fe3-4adc-a1c4-a64a897215be-10) in the SAS Viya namespace

  • When the SAS Workload Orchestrator is running, there should be several pods in the SAS Viya namespace with names that include "sas-workload-orchestrator". If you do not see any pods with names like that, it isn't deployed. But I don't think that's likely.

  • Bring up Grafana and check either of the SAS Launched Jobs dashboards. Both should show some metrics related to the SAS/Studio session.

  • You can also use the Metrics Explorer interface within Grafana to retrieve metrics. For example, you could use it to retrieve the metric :sas_launcher_pod_status: which is used to report the number of launched jobs. Here's what that looks like in my environment
    image

  • Bring up the Prometheus web interface and confirm that metrics are being collected from the Prometheus Pushgateway. You can do this by clicking Status -> Targets and finding the "target" with a name like serviceMonitor/viya_namespace/pushgateway/0. If things are working, you should see that the state is reported as "UP" as shown in the following screenshot.
    image

  • You can confirm the Prometheus Pushgateway is working properly (i.e. receiving metrics from the SWO and making them available to Prometheus) by checking the metrics endpoint for the Pushgateway pod (which should be running in the SAS Viya namespace). One way to do that is to use Kubernetes port-forwarding to make that endpoint available in your web browser on localhost. Here's a screenshot showing what that looks like in my environment:
    image
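
For the last step, the text returned by the Pushgateway metrics endpoint can also be checked with a short script instead of by eye. The sketch below relies on the `push_time_seconds` series that the Pushgateway records for each pushed group and reports the last push time per job. The job name and metric in the sample are hypothetical, and the parsing is simplified: it assumes label values do not contain a `}` character.

```python
import re
from typing import Dict

_PUSH_TIME = re.compile(r'^push_time_seconds\{([^}]*)\}\s+(\S+)')
_JOB = re.compile(r'job="([^"]*)"')


def last_push_by_job(metrics_text: str) -> Dict[str, float]:
    """Map each job label to the Unix time of its last successful push."""
    pushes = {}
    for line in metrics_text.splitlines():
        found = _PUSH_TIME.match(line.strip())
        if not found:
            continue  # comments, blank lines, and other metrics
        job = _JOB.search(found.group(1))
        if job:
            pushes[job.group(1)] = float(found.group(2))
    return pushes


# Hypothetical excerpt of the text served at /metrics after a port-forward,
# e.g. to local port 9091 (the Pushgateway default port).
SAMPLE_METRICS = """\
# TYPE push_time_seconds gauge
push_time_seconds{instance="",job="example_job"} 1.7333e+09
# TYPE example_metric gauge
example_metric{instance="",job="example_job"} 3
"""

print(last_push_by_job(SAMPLE_METRICS))
```

An empty result would suggest that nothing has pushed to this Pushgateway at all, while an old timestamp would suggest that pushes have stopped.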

If things still aren't working, let us know what you see in each of the interfaces since that should help clarify where things are failing in the process.
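
The target state shown under Status -> Targets is also available from the Prometheus HTTP API at `/api/v1/targets`, which can be handy when collecting evidence for a report. The following sketch summarizes the health of scrape pools whose name contains a given fragment. The response shown is a trimmed, hypothetical sample; the second pool name and the error text are invented for illustration.

```python
from typing import Dict


def pool_health(targets_response: dict, pool_fragment: str = "pushgateway") -> Dict[str, str]:
    """Map matching scrape pool names to 'up', or to the reported problem."""
    summary = {}
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        pool = target.get("scrapePool", "")
        if pool_fragment not in pool:
            continue
        health = target.get("health", "unknown")
        if health == "up":
            summary[pool] = "up"
        else:
            summary[pool] = f"{health}: {target.get('lastError', '')}"
    return summary


# Hypothetical, trimmed response in the shape of GET /api/v1/targets.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {"activeTargets": [
        {"scrapePool": "serviceMonitor/viya_namespace/pushgateway/0",
         "health": "down", "lastError": "connection refused"},
        {"scrapePool": "serviceMonitor/monitoring/example-exporter/0",
         "health": "up", "lastError": ""},
    ]},
}

print(pool_health(SAMPLE_TARGETS))
```

An empty dictionary means no matching target exists at all, which points to a missing ServiceMonitor or Pushgateway rather than a failing scrape.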

@j2clerck
Author

j2clerck commented Dec 4, 2024

Thank you for the step by step troubleshooting.

The root cause seems to come from the pushgateway since it is not exposing any metrics.
I do have 3 workload-orchestrator (0, 1 and uuid) running in my cluster so I presume that workload orchestrator is working fine.

I tried to investigate a bit more and found a rather odd behavior. When I launch a new compute node and start an interactive compute session, I can see some metrics coming into the dashboard. However, as soon as I have batch servers running as a reusable compute context, it seems to break the workload orchestrator, which no longer emits metrics to the pushgateway. If I then reset my Studio session (same node), I don't see any metrics.

Regards,
Joseph de Clerck

@gsmith-sas
Member

Interesting. In your original post, you mentioned you have another cluster where everything had deployed without problems. Have you tried this same sequence of steps in that environment? And, if so, do you see the same problem?

And, at this point, I think it would be best for you to open an issue with SAS Technical Support. I suspect this may be more than a simple misconfiguration or deployment scripting problem. I think we may need to bring some of the Compute Server experts into the discussion as well. When opening the Tech Support ticket, please tell them about this GitHub issue and ask them to add me to the ticket.

Once we've figured everything out and resolved things, I'll post that information here (for future reference) before closing this issue.

Regards,
Greg Smith

@j2clerck
Author

j2clerck commented Dec 4, 2024

Hi Greg,

I ran the same experiment in my "working" environment and I can reproduce the issue.
At 20:06 I launched a reusable batch context.
The picture was taken around 20:15 so it's about 10 minutes without metrics.
image
At 20:20 I disabled the reusable batch context
image
You can see that it started to emit the metrics again.
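
For a support case, the outage window can be stated numerically rather than read off a screenshot. The sketch below looks for gaps between consecutive sample timestamps, such as those returned in the `values` arrays of a Prometheus range query. The one-minute sample spacing and the timestamps, written as seconds since midnight, are hypothetical and only loosely mirror the 20:06 to 20:20 window above.

```python
from typing import Iterable, List, Tuple


def find_gaps(timestamps: Iterable[float], max_step: float) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs separated by more than max_step seconds."""
    ordered = sorted(timestamps)
    return [(prev, cur) for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > max_step]


def at(hour: int, minute: int) -> int:
    """Seconds since midnight, used here as a stand-in for Unix timestamps."""
    return (hour * 60 + minute) * 60


# Hypothetical samples: one per minute, with nothing between 20:06 and 20:20.
SAMPLE_TIMES = [at(20, m) for m in range(0, 7)] + [at(20, m) for m in range(20, 26)]

print(find_gaps(SAMPLE_TIMES, max_step=120))
```

The `max_step` threshold should be somewhat larger than the expected interval between samples so that ordinary jitter is not reported as a gap.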

I will open a support case with SAS then.

Regards,
Joseph de Clerck
