Missing 'GPU' entries in metrics #289
Comments
@Dubrzr could you please provide the output of the following command on your gpu2 server:
It looks like it works fine:
Thanks... wrong intuition then... This is indeed the right endpoint. My sample output:
@micmarty: any ideas? While the nvidia-smi output looks fine and there is no error message from GPUMonitor, there are no 'GPU' entries in metrics.
@Dubrzr: and how about this command:
I see that you have a newer version of the NVIDIA driver (the newest version that we've tested is 418.116); maybe there have also been some changes to nvidia-smi...
Here it is: :)
Everything looks fine here... Could you try modifying line 73 in tensorhive/core/managers/TensorHiveManager.py and set:
and see if it helps?
Yep! It indeed works better 🎉 But gpu2 still doesn't:
@Dubrzr do you have any new observations or hints? If the data were missing for gpu3, we would at least suspect that the differing Fan speed "[N/A]" notation is not parsed properly. But with proper nvidia-smi output on gpu2, we currently have no idea how to help... Which OS user account does tensorhive use? nvidia-smi works properly for the root user, but does it also work for the user account used by TH on gpu2?
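To check the "[N/A]" hypothesis by hand, here is a minimal sketch of a tolerant parser for nvidia-smi CSV output. It is not TensorHive's actual parsing code; the function names, the chosen field list, and the sample GPU names and UUIDs are assumptions made for illustration. Running `run_query` as the account used by TH on gpu2 would also show whether nvidia-smi works for that user.

```python
import csv
import io
import subprocess
from typing import Dict, List, Optional

# Placeholder strings nvidia-smi may print when a value is unavailable.
MISSING_VALUES = {"[N/A]", "N/A", "[Not Supported]"}

# Fields chosen for this illustration only.
FIELDS = ["name", "uuid", "fan.speed", "memory.used"]


def run_query(fields: List[str]) -> str:
    """Run nvidia-smi as the current user and return its raw CSV output."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_gpu_query(output: str, fields: List[str]) -> List[Dict[str, Optional[str]]]:
    """Turn CSV rows into dicts, mapping unavailable values to None."""
    rows = []
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    for record in reader:
        values = [value.strip() for value in record]
        if not any(values):
            continue  # skip blank lines
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {values!r}")
        rows.append({
            field: (None if value in MISSING_VALUES else value)
            for field, value in zip(fields, values)
        })
    return rows


# Hypothetical sample output: the second GPU reports no fan speed.
SAMPLE = (
    "GeForce GTX 1080 Ti, GPU-aaaa, 23, 1024\n"
    "Tesla V100-PCIE-16GB, GPU-bbbb, [N/A], 0\n"
)

if __name__ == "__main__":
    for row in parse_gpu_query(SAMPLE, FIELDS):
        print(row)
```

On a real node, `parse_gpu_query(run_query(FIELDS), FIELDS)` should return one dict per GPU; a row that raises `ValueError` or comes back with unexpected `None` values would point at the output format rather than at the metrics endpoint.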
Hmmm, I cannot see GPUs: even when I click on the "+" sign, nothing happens. I guess the HTTP request used is "/api/0.3.1/nodes/metrics"? If so, here are the contents of the response:
I replaced all hostnames with fake ones.
Here are the free -m results on all my machines (the OS is CentOS 7):
Also nvidia-smi:
Thanks for your help :)
Originally posted by @Dubrzr in #286 (comment)
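For anyone comparing nodes, a small sketch like the following can list which hosts in a saved "/api/0.3.1/nodes/metrics" response lack a 'GPU' entry. The response layout (a JSON object keyed by hostname, each value a dict that may contain a 'GPU' key) is an assumption based on this thread, and the hostnames, the 'CPU' key, and the sample values are hypothetical.

```python
import json
from typing import Any, Dict, List


def hosts_missing_gpu(metrics: Dict[str, Any]) -> List[str]:
    """Return hostnames whose metrics have no (or an empty) 'GPU' entry."""
    missing = []
    for host, node in metrics.items():
        gpu = node.get("GPU") if isinstance(node, dict) else None
        if not gpu:
            missing.append(host)
    return sorted(missing)


# Hypothetical response body, assuming one top-level key per hostname.
SAMPLE_RESPONSE = """
{
  "gpu1": {"GPU": {"GPU-aaaa": {"name": "example"}}, "CPU": {}},
  "gpu2": {"CPU": {}},
  "gpu3": {"GPU": {}}
}
"""

if __name__ == "__main__":
    print(hosts_missing_gpu(json.loads(SAMPLE_RESPONSE)))
```

An empty 'GPU' dict is treated the same as a missing key here, since both leave the node without GPU data to display.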