Help with jobstat installation #16

sumitsaluja · 2024-10-11T18:37:29Z

Hi Josko,

I tried to install jobstats but am getting an error:

./jobstats -d 874
DEBUG: jobidraw=874, start=1728671046, end=1728671263, cluster=ganesha, tres=cpu=2,gres/gpu=1,mem=4000M,node=1, data=, user=ss6478, account=sysops, state=COMPLETED, timelimit=90, nodes=1, ncpus=2, reqmem=4000M, qos=normal, partition=gpu, jobname=test
DEBUG: jobid=874, jobidraw=874, start=1728671046, end=1728671263, gpus=1, diff=217, cluster=ganesha, data=, timelimitraw=90
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "./jobstats", line 58, in <module>
    stats.report_job()
  File "/tmp/jobstats/jobstats.py", line 582, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/tmp/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 874, either because it is too old or because
it expired from jobstats database. If you are not running this command on the
cluster where the job was run then use the -c option to specify the cluster.
If the run time was very short then try running "seff 874".

Could you please help?

plazonic · 2024-10-19T22:10:46Z

Hi Sumit,

The fact that the queries return no data means either that something is wrong with data collection (e.g. Prometheus is not scraping the nodes where job 874 ran) or that the labels stored in Prometheus do not match the query (e.g. the job data has no cluster=ganesha label, or the label has a different value).

What might help narrow it down is to go to the web interface of the Prometheus server and search (on the Graph tab) for some of this data. Take cgroup_memory_rss_bytes: start with the bare metric name, with no label filters. Do you get anything back?
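The same check can also be scripted against the Prometheus HTTP API instead of the web interface. The following is only a rough sketch: it assumes the server is reachable at http://localhost:9090 (adjust for your site), and the helper names `build_query_url` and `count_series` are made up for illustration and are not part of jobstats.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def count_series(base_url, promql, timeout=10):
    """Run an instant query and return the number of series in the result."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return len(payload["data"]["result"])


# Example usage (the server address is an assumption):
#   base = "http://localhost:9090"
#   count_series(base, "cgroup_memory_rss_bytes")                      # everything
#   count_series(base, "cgroup_memory_rss_bytes{cluster='ganesha'}")   # one cluster
```

If the bare metric returns zero series, the problem is on the collection side. If it returns series but the filtered query does not, the labels are the likely culprit.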

If not, check your Prometheus and node configs and fix them until you start getting data, especially for running jobs. Also make sure that the jobid/step/task labels are present. If there is data but those labels are missing, then you did not use the correct cgroup exporter: it has to be our modified version, not the original.

Do the labels look good? The next steps depend on what you get back. For example, we've had a few issues where folks did not follow the instructions on adding a cluster label to the Prometheus config; our instructions include an example of how to do that.
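To make the label check concrete, here is a small sketch that inspects the `result` list from a Prometheus instant-query response (the same structure shown in the DEBUG output above) and reports which series lack the expected labels. The function names are hypothetical. Note that Prometheus treats a label with an empty value as equivalent to an absent one, so step and task may legitimately not appear on job-level series; by default only cluster and jobid are checked.

```python
# Labels that a job-level series should carry (assumption based on the
# queries in the DEBUG output; step/task can be empty and thus absent).
REQUIRED_LABELS = ("cluster", "jobid")


def missing_labels(result, required=REQUIRED_LABELS):
    """Map each required label to the number of series that lack it."""
    missing = {}
    for series in result:
        metric = series.get("metric", {})
        for label in required:
            if label not in metric:
                missing[label] = missing.get(label, 0) + 1
    return missing


def cluster_values(result):
    """Return the sorted set of cluster label values seen in the result."""
    return sorted({s["metric"]["cluster"] for s in result
                   if "cluster" in s.get("metric", {})})


# Illustrative data only, not real output from any server.
sample = [
    {"metric": {"__name__": "cgroup_memory_rss_bytes",
                "cluster": "ganesha", "jobid": "874"},
     "value": [1728671263, "1024"]},
    {"metric": {"__name__": "cgroup_memory_rss_bytes", "jobid": "875"},
     "value": [1728671263, "2048"]},
]
print(missing_labels(sample))   # one series has no cluster label
print(cluster_values(sample))
```

An empty dictionary from `missing_labels` together with the expected name in `cluster_values` would suggest the labels match what jobstats queries for.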
