Help with jobstat installation #16

Open
sumitsaluja opened this issue Oct 11, 2024 · 1 comment

@sumitsaluja

Hi Josko,

I tried to install jobstats, but I am getting the following error:

./jobstats -d 874
DEBUG: jobidraw=874, start=1728671046, end=1728671263, cluster=ganesha, tres=cpu=2,gres/gpu=1,mem=4000M,node=1, data=, user=ss6478, account=sysops, state=COMPLETED, timelimit=90, nodes=1, ncpus=2, reqmem=4000M, qos=normal, partition=gpu, jobname=test
DEBUG: jobid=874, jobidraw=874, start=1728671046, end=1728671263, gpus=1, diff=217, cluster=ganesha, data=, timelimitraw=90
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "./jobstats", line 58, in <module>
    stats.report_job()
  File "/tmp/jobstats/jobstats.py", line 582, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/tmp/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 874, either because it is too old or because
it expired from jobstats database. If you are not running this command on the
cluster where the job was run then use the -c option to specify the cluster.
If the run time was very short then try running "seff 874".

Could you please help?

@plazonic
Collaborator

Hi Sumit,

Every query in your debug output returns an empty result. That means either something is wrong with data collection (e.g. Prometheus is not scraping the nodes where job 874 ran) or the stored data does not match what jobstats asks for (e.g. the job data has no cluster=ganesha label, or the label has a different value).

To narrow it down, open the web interface of the Prometheus server and, on the Graph tab, search for some of this data. Start with cgroup_memory_rss_bytes on its own, with no label filters: do you get anything back?
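The same check can also be scripted against the Prometheus HTTP API instead of the Graph tab. The following is only a minimal sketch: the server address is an assumption (Prometheus listens on port 9090 by default), and the helper names are made up for illustration and are not part of jobstats. An instant query only returns series that have recent samples, so it is most useful while a job is running.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with the address of your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql, time=None):
    """Build the URL of an instant query against /api/v1/query."""
    params = {"query": promql}
    if time is not None:
        # Optional evaluation timestamp (Unix seconds), as in the DEBUG output.
        params["time"] = str(time)
    return f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"


def series_count(response):
    """Return how many series a decoded query response contains."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    return len(response["data"]["result"])


def run_query(base_url, promql, time=None):
    """Send the query to Prometheus and return the decoded JSON response."""
    url = build_query_url(base_url, promql, time)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For example, `series_count(run_query(PROMETHEUS_URL, "cgroup_memory_rss_bytes"))` reports how many series exist with no label filters at all. If that is non-zero, repeating it with `{cluster='ganesha'}` and then with `jobid='874'` added shows which filter makes the result empty.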

If not, check your Prometheus and node configs and fix them until you start getting data, especially for running jobs. Also make sure that the jobid, step and task labels are present. If there is data but those labels are missing, then you did not use the correct cgroup exporter: it has to be our modified version, not the original one.

Do the labels look good? The next steps depend on what you get back. For example, we've had a few issues where folks did not follow the instructions on adding a cluster label to the Prometheus config; our instructions include an example of how to do that.
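If the unfiltered query does return series, the label check can be done mechanically as well. This is a rough sketch, not part of jobstats: it takes the `result` list of a Prometheus query response (each entry has a `metric` dictionary of labels) and reports how many series carry each of the labels mentioned above, plus which cluster values occur. The function names are hypothetical. Prometheus treats a label with an empty value as absent, so a low count for step or task is not necessarily a problem on its own, whereas a count of zero for jobid or cluster is.

```python
# Labels discussed in this thread; jobstats filters on all of them.
LABELS_TO_CHECK = ("cluster", "jobid", "step", "task")


def label_coverage(series, names=LABELS_TO_CHECK):
    """Count, for each label name, how many series carry that label."""
    counts = {name: 0 for name in names}
    for entry in series:
        labels = entry.get("metric", {})
        for name in names:
            if name in labels:
                counts[name] += 1
    return counts


def cluster_values(series):
    """Return the sorted distinct values of the cluster label."""
    return sorted(
        {
            entry["metric"]["cluster"]
            for entry in series
            if "cluster" in entry.get("metric", {})
        }
    )
```

If `cluster_values` returns an empty list, or a value other than the one shown in the DEBUG output (`ganesha`), the cluster label in the Prometheus config is the first thing to fix.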
