Our GPU server is shared with the AutoML group but does not have a workload manager. Currently, that means the division of resources largely happens over chat and/or unwritten rules (we currently have 2 GPUs reserved by default). This is not only wasteful, it also makes it hard to scale up experiments later on. We want a job scheduler installed so that everyone who needs to run GPU jobs can simply queue them, and we do not need to manually ensure people are not using the same physical resources.
Overall, the server is mainly intended for prototype testing, so the workload manager should allow a quick turn-around time for all users when reasonable. Allowing users to explicitly set a job priority for this is OK, as we only have a small number of users, who shouldn't abuse this.
I am not sure which workload manager is most appropriate, but I think everyone on our team is already familiar with SLURM.
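For reference, if we do go with SLURM, queueing a GPU job would look roughly like the sketch below. This is a sketch under assumptions, not a tested setup: it assumes the node is configured with GPU GRES (a `gres.conf` exposing the devices), and the partition name, GPU count, time limit, and script name are all placeholders. `--nice` is the standard user-facing knob for voluntarily lowering a job's priority (regular users can only lower, not raise, priority this way), which matches the "small number of users who shouldn't abuse this" situation.

```shell
#!/bin/bash
#SBATCH --job-name=prototype-test   # placeholder job name
#SBATCH --partition=gpu             # assumed partition name
#SBATCH --gres=gpu:1                # request one GPU; the scheduler picks the physical device
#SBATCH --time=02:00:00             # short time limit keeps turn-around quick for everyone
#SBATCH --nice=100                  # optionally lower own priority for non-urgent runs

srun python train.py                # placeholder workload
```

Users would submit with `sbatch job.sh` and inspect the queue with `squeue`, instead of coordinating GPU usage over chat.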