hpc priority stuff
Question:
I have a batch job submitted to bmm that farm has been refusing to start for roughly 12 hours now. As batch jobs go, its demands are not exorbitant (16 CPUs, 128 GB memory, 24 hours wall clock time). Is there some reason why farm has effectively gone on strike?
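For reference, the submission is along these lines (the job name and script here are placeholders; the #SBATCH flags are standard Slurm), and the usual ways of asking Slurm why a job is still pending are below it:

```
#!/bin/bash
#SBATCH --partition=bmm        # medium-priority partition
#SBATCH --cpus-per-task=16     # 16 CPUs
#SBATCH --mem=128G             # 128 GB memory
#SBATCH --time=24:00:00        # 24 hours wall clock
#SBATCH --job-name=my_batch    # placeholder name

./run_analysis.sh              # placeholder workload
```

```
squeue -u $USER --start                             # Slurm's estimated start time, if it has one
scontrol show job <jobid> | grep -o 'Reason=[^ ]*'  # why it is still pending (Priority, Resources, ...)
```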
Answer:
looks like somebody from $OTHER_GROUP is consuming basically all of bmm with 32-core / 8 GB, 20-30 day jobs 🙃
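You can check this sort of thing yourself with something like the squeue call below (the format string is just one reasonable choice of fields):

```
# running jobs on bmm with job id, account, user, CPUs, memory, time limit, and elapsed time
squeue -p bmm -t RUNNING -o "%.10i %.12a %.10u %.5C %.10m %.12l %.12M"
```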
Question:
but shouldn’t we have precedence over them for our buy-in nodes?
Answer:
yes, that's what bmh is for -- ctbrowngrp has a QoS that defines how much CPU / RAM worth of buy-in you've paid for, and bmh jobs can suspend jobs on bmm
looks like ctbrowngrp has high-priority access to 48 CPUs and 500 GB RAM on bmh (i.e. the bmX nodes),
224 CPUs / 500 GB RAM on high2,
and several others
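Those numbers live in the accounting database; something like the sacctmgr queries below will show them. The QoS name here is an assumption, so use whatever the association query actually reports:

```
# which account / QoS applies to your jobs on each partition
sacctmgr show assoc user=$USER format=Account%15,Partition%10,QOS%25

# the buy-in limits attached to the group QoS (name assumed; use the one from above)
sacctmgr show qos ctbrowngrp-qos format=Name%25,Priority,GrpTRES%40,Preempt%25
```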
which is to say: jobs on the medium partitions cannot suspend other medium-partition jobs; jobs on the high partitions can suspend their corresponding medium- or low-partition jobs
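The mechanics behind that are Slurm's partition-priority preemption; roughly, the admin-side config looks like the sketch below (node names, tier values, and the bml line are illustrative, not farm's actual slurm.conf):

```
# slurm.conf sketch: suspend-based preemption between partition tiers
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# a higher PriorityTier can suspend lower tiers sharing the same nodes
PartitionName=bmh Nodes=bm[1-26] PriorityTier=3 PreemptMode=off
PartitionName=bmm Nodes=bm[1-26] PriorityTier=2 PreemptMode=suspend
PartitionName=bml Nodes=bm[1-26] PriorityTier=1 PreemptMode=suspend
```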
to actually suspend their jobs, though, you'd need to submit to bmh, not bmm
on bmm you just have bumped priority
but if another lab has the same priority on that partition, it'll come down to Slurm's scheduling magic on job length / CPU cores / memory / throughput, etc.
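If you want to see what that "magic" is weighing, sprio exposes the multifactor priority breakdown (standard options, nothing farm-specific):

```
sprio -w             # relative weights of the priority factors (age, fairshare, job size, partition, QoS)
sprio -j <jobid> -l  # the factor-by-factor priority for one pending job
```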