hpc priority stuff
Question:
I have a batch job submitted to bmm that farm has been refusing to start for roughly 12 hours now. As batch jobs go, its demands are not exorbitant (16 CPUs, 128 GB memory, 24 hours wall clock time). Is there some reason why farm has effectively gone on strike?
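For reference, the submission is along these lines (the job name and script here are placeholders; the #SBATCH flags are standard Slurm), and the usual ways of asking Slurm why a job is still pending are below it:

```
#!/bin/bash
#SBATCH --partition=bmm        # medium-priority partition
#SBATCH --cpus-per-task=16     # 16 CPUs
#SBATCH --mem=128G             # 128 GB memory
#SBATCH --time=24:00:00        # 24 hours wall clock
#SBATCH --job-name=my_batch    # placeholder name

./run_analysis.sh              # placeholder workload
```

```
squeue -u $USER --start                             # Slurm's estimated start time, if it has one
scontrol show job <jobid> | grep -o 'Reason=[^ ]*'  # why it is still pending (Priority, Resources, ...)
```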
Answer:
looks like somebody from $OTHER_GROUP is consuming basically all of bmm with 32-core / 8 GB, 20-30 day jobs 🙃
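You can check this sort of thing yourself with something like the squeue call below (the format string is just one reasonable choice of fields):

```
# running jobs on bmm with job id, account, user, CPUs, memory, time limit, and elapsed time
squeue -p bmm -t RUNNING -o "%.10i %.12a %.10u %.5C %.10m %.12l %.12M"
```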
Question:
but shouldn’t we have precedence over them for our buy-in nodes?
Answer:
yes, that's what bmh is for -- ctbrowngrp has a QoS that defines how much CPU / RAM worth of buy-in you've paid for, and bmh jobs can suspend jobs on bmm
looks like ctbrowngrp has high-priority access to 48 CPUs and 500 GB RAM on bmh (i.e. the bmX nodes),
224 CPUs / 500 GB RAM on high2,
and several others
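Those numbers live in the accounting database; something like the sacctmgr queries below will show them. The QoS name here is an assumption, so use whatever the association query actually reports:

```
# which account / QoS applies to your jobs on each partition
sacctmgr show assoc user=$USER format=Account%15,Partition%10,QOS%25

# the buy-in limits attached to the group QoS (name assumed; use the one from above)
sacctmgr show qos ctbrowngrp-qos format=Name%25,Priority,GrpTRES%40,Preempt%25
```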
which is to say: jobs on the medium partitions cannot suspend other medium-partition jobs; jobs on the high partitions can suspend their corresponding medium- or low-partition jobs
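The mechanics behind that are Slurm's partition-priority preemption; roughly, the admin-side config looks like the sketch below (node names, tier values, and the bml line are illustrative, not farm's actual slurm.conf):

```
# slurm.conf sketch: suspend-based preemption between partition tiers
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# a higher PriorityTier can suspend lower tiers sharing the same nodes
PartitionName=bmh Nodes=bm[1-26] PriorityTier=3 PreemptMode=off
PartitionName=bmm Nodes=bm[1-26] PriorityTier=2 PreemptMode=suspend
PartitionName=bml Nodes=bm[1-26] PriorityTier=1 PreemptMode=suspend
```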
to actually suspend their jobs, though, you'd need to submit to bmh, not bmm
on bmm you just have bumped priority
but if another lab has the same priority on that partition, it'll come down to Slurm's scheduling magic on job length / CPU cores / memory / throughput, etc.
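If you want to see what that "magic" is weighing, sprio exposes the multifactor priority breakdown (standard options, nothing farm-specific):

```
sprio -w             # relative weights of the priority factors (age, fairshare, job size, partition, QoS)
sprio -j <jobid> -l  # the factor-by-factor priority for one pending job
```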