Our GPU server is shared with the AutoML group but does not have a workload manager. Currently, that means the division of resources largely happens over chat and/or unwritten rules (we currently have 2 GPUs reserved by default). This is not only wasteful, it also makes it hard to scale up experiments later on. We want a job scheduler installed so that everyone who needs to run GPU jobs can simply queue them, and we do not need to manually ensure people are not using the same physical resources.
Overall, the server is mainly intended for prototype testing, so the workload manager should allow a quick turn-around time for all users when reasonable. Allowing users to explicitly set a job priority for this is OK, as we only have a small number of users, who shouldn't abuse this.
I am not sure which workload manager is most appropriate, but I think everyone on our team is already familiar with SLURM.
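For reference, if we do go with SLURM, queueing a GPU job would look roughly like the sketch below. This is a sketch under assumptions, not a tested setup: it assumes the node is configured with GPU GRES (a `gres.conf` exposing the devices), and the partition name, GPU count, time limit, and script name are all placeholders. `--nice` is the standard user-facing knob for voluntarily lowering a job's priority (regular users can only lower, not raise, priority this way), which matches the "small number of users who shouldn't abuse this" situation.

```shell
#!/bin/bash
#SBATCH --job-name=prototype-test   # placeholder job name
#SBATCH --partition=gpu             # assumed partition name
#SBATCH --gres=gpu:1                # request one GPU; the scheduler picks the physical device
#SBATCH --time=02:00:00             # short time limit keeps turn-around quick for everyone
#SBATCH --nice=100                  # optionally lower own priority for non-urgent runs

srun python train.py                # placeholder workload
```

Users would submit with `sbatch job.sh` and inspect the queue with `squeue`, instead of coordinating GPU usage over chat.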