docs: Add hpc installation guide #9945

Merged 2 commits on Sep 20, 2024
13 changes: 10 additions & 3 deletions docs/setup-cluster/slurm/_index.rst
@@ -24,14 +24,20 @@

Slurm/PBS deployment applies to the Enterprise Edition.

This document describes how Determined can be configured to utilize HPC cluster scheduling systems
This section describes how Determined can be configured to utilize HPC cluster scheduling systems
via the Determined HPC launcher. In this type of configuration, Determined delegates all job
scheduling and prioritization to the HPC workload manager (either Slurm or PBS). This integration
enables existing HPC workloads and Determined workloads to coexist and Determined workloads to
access all of the advanced capabilities of the HPC workload manager.

To install Determined on the HPC cluster, ensure that the :ref:`slurm-requirements` are met, then
follow the steps in the :ref:`install-on-slurm` document.
To install Determined on the HPC cluster, ensure that the :ref:`hpc-environment-requirements` and
:ref:`slurm-requirements` are met, then follow the steps in the :ref:`install-on-slurm` document.

.. note::

   Determined supports installations without root access. For non-root installations, ensure that
   the prerequisites in :ref:`hpc-environment-requirements` have been completed by your system
   administrator before proceeding.

***********
Reference
@@ -52,6 +58,7 @@ follow the steps in the :ref:`install-on-slurm` document.
   :hidden:

   slurm-requirements
   hpc-environment-requirements
   hpc-launching-architecture
   hpc-security-considerations
   install-on-slurm
157 changes: 157 additions & 0 deletions docs/setup-cluster/slurm/hpc-environment-requirements.rst
@@ -0,0 +1,157 @@
.. _hpc-environment-requirements:

##############################
HPC Environment Requirements
##############################

This document describes how to prepare your environment for installing Determined on an HPC cluster
managed by Slurm or PBS workload managers.

.. include:: ../../_shared/tip-keep-install-instructions.txt

**************************
Environment Requirements
**************************

Hardware Requirements
=====================

The recommended requirements for the admin node are:

- 1 admin node for the master, the database, and the launcher with the following specs:

  - 16 cores
  - 32 GB of memory
  - 1 TB of disk space (depends on the database, see "Database Requirements" section below)

The minimal requirements are:

- 1 admin node with 8 cores, 16 GB of memory, and 200 GB of disk space

.. note::

   While the node can be virtual, a physical one is preferred.
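
As a rough illustration, the hardware thresholds above can be checked with a short script. This is
a minimal sketch, not part of Determined: the names ``classify_node`` and ``probe_local_node`` are
hypothetical, 1 TB is treated as 1000 GB, sizes use decimal gigabytes, and the memory probe assumes
a Linux node.

.. code:: python

   import os
   import shutil

   # Thresholds copied from the hardware requirements above (1 TB assumed to be 1000 GB).
   MINIMAL = {"cores": 8, "memory_gb": 16, "disk_gb": 200}
   RECOMMENDED = {"cores": 16, "memory_gb": 32, "disk_gb": 1000}


   def classify_node(cores, memory_gb, disk_gb):
       """Return 'recommended', 'minimal', or 'insufficient' for the given specs."""
       specs = {"cores": cores, "memory_gb": memory_gb, "disk_gb": disk_gb}
       if all(specs[key] >= RECOMMENDED[key] for key in RECOMMENDED):
           return "recommended"
       if all(specs[key] >= MINIMAL[key] for key in MINIMAL):
           return "minimal"
       return "insufficient"


   def probe_local_node(path="/"):
       """Read core count, physical memory, and total disk size of the local Linux node."""
       cores = os.cpu_count() or 0
       memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
       disk_bytes = shutil.disk_usage(path).total
       return cores, memory_bytes / 1e9, disk_bytes / 1e9

Calling ``classify_node(*probe_local_node())`` on the candidate admin node gives a quick first
impression. It does not replace a review of the database sizing described later in this document.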

Network Requirements
====================

The admin node requires the following network configurations:

Admin Node
----------

- **Ports:** 8080, 8443
- **Type:** TCP
- **Description:** Provide HTTP(S) access to the master node for web UI access and agent API
  access

.. note::

   Ensure these ports are open in your firewall settings to allow proper communication with the
   admin node.

Additional Requirements:

- The admin node must reach the HPC shared area (the scratch file system).
- Recommended: 10 Gbps Ethernet link between the admin node and the HPC worker nodes.
- Minimal: 1 Gbps Ethernet link.

.. important::

   The admin node must be connected to the Internet to download container images and Python
   packages. If Internet access is not possible, the local container registry and package
   repository must be populated manually with the required images and packages.
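
Once the master is running, the port requirements above can be spot-checked from a login or worker
node. The following is a minimal sketch using only the Python standard library. The function names
are hypothetical, and the check only confirms that a TCP connection is accepted, not that the
service behind the port is healthy.

.. code:: python

   import socket

   # Ports listed in the network requirements above.
   MASTER_PORTS = (8080, 8443)


   def port_is_open(host, port, timeout=3.0):
       """Return True if a TCP connection to host:port succeeds within the timeout."""
       try:
           with socket.create_connection((host, port), timeout=timeout):
               return True
       except OSError:
           # Covers refused connections, timeouts, and name-resolution failures.
           return False


   def check_master_ports(host, ports=MASTER_PORTS):
       """Map each port to whether it accepted a TCP connection."""
       return {port: port_is_open(host, port) for port in ports}

For example, ``check_master_ports("admin-node.example.com")`` (a placeholder host name) returns a
dictionary keyed by port. A ``False`` entry suggests a firewall rule or a master that is not
listening on that port.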

Storage Requirements
====================

Determined requires shared storage for experiment checkpoints, container images, datasets, and
pre-trained models. All worker nodes connected to the cluster must be able to access it. The storage
can be a network file system (such as VAST, CephFS, GlusterFS, or Lustre) or a bucket (in the
cloud or on-prem, provided it exposes an S3 API).

Space requirements depend on the model complexity/size:

- 10-30 TB of HDD space for small models (up to 1 GB in size)
- 20-60 TB of SSD space for medium to large models (more than 1 GB in size)

Software Requirements
=====================

The following software components are required:

+------------------------+----------------------------------+------------------+
| Component              | Version                          | Installation     |
|                        |                                  | Node             |
+========================+==================================+==================+
| Operating System       | RHEL 8.5+ or 9.0+, SLES 15 SP3+, | Admin            |
|                        | Ubuntu 22.04+                    |                  |
+------------------------+----------------------------------+------------------+
| Java                   | >= 1.8                           | Admin            |
+------------------------+----------------------------------+------------------+
| Python                 | >= 3.8                           | Admin            |
+------------------------+----------------------------------+------------------+
| Podman                 | >= 4.0.0                         | Admin            |
+------------------------+----------------------------------+------------------+
| PostgreSQL             | 10 (RHEL 8), 13 (RHEL 9), 14     | Admin            |
|                        | (Ubuntu 22.04) or newer          |                  |
+------------------------+----------------------------------+------------------+
| HPC client packages    | Same as login nodes              | Admin            |
+------------------------+----------------------------------+------------------+
| Container runtime      | Singularity >= 3.7 (or Apptainer | Workers          |
|                        | >= 1.0), Podman >= 3.3.1, Enroot |                  |
|                        | >= 3.4.0                         |                  |
+------------------------+----------------------------------+------------------+
| HPC scheduler          | Slurm >= 20.02 (excluding        | Workers          |
|                        | 22.05.5 - 22.05.8), PBS >=       |                  |
|                        | 2021.1.2                         |                  |
+------------------------+----------------------------------+------------------+
| NVIDIA drivers         | >= 450.80                        | Workers          |
+------------------------+----------------------------------+------------------+
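
The Slurm row is the only one in the table with an excluded range, so it is easy to misread. The
sketch below encodes that rule as a small helper. The function names are hypothetical, and it
assumes plain dotted numeric version strings such as ``22.05.6``; packaging suffixes would need to
be stripped first.

.. code:: python

   def parse_version(text):
       """Turn a dotted version such as '22.05.6' into a tuple of ints."""
       return tuple(int(part) for part in text.split("."))


   def slurm_version_supported(version):
       """Apply the table's rule: >= 20.02, excluding 22.05.5 through 22.05.8."""
       parsed = parse_version(version)
       if parsed < (20, 2):
           return False
       # Versions inside the excluded window are rejected even though they are >= 20.02.
       return not ((22, 5, 5) <= parsed <= (22, 5, 8))

Under these assumptions, ``slurm_version_supported("22.05.6")`` is ``False`` while
``slurm_version_supported("22.05.9")`` is ``True``.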

Database Requirements
=====================

The solution requires PostgreSQL 10 or newer, which will be installed on the admin node. The
required disk space for the database is estimated as follows:

- 200 GB on small systems (fewer than 15 workers) or big systems if the experiment logs are sent to
  Elasticsearch
- 16 GB/worker on big systems that store experiment logs inside the database
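
The two sizing rules above can be expressed as a small calculation. This is a minimal sketch with a
hypothetical function name. It treats a system with 15 or more workers as "big", which is an
interpretation of the text rather than a documented threshold.

.. code:: python

   def database_disk_gb(workers, logs_in_elasticsearch=False):
       """Estimate database disk space in GB from the sizing rules above."""
       if workers < 0:
           raise ValueError("workers must be non-negative")
       # Small systems, or any system that sends experiment logs to Elasticsearch.
       if workers < 15 or logs_in_elasticsearch:
           return 200
       # Big systems that store experiment logs in the database: 16 GB per worker.
       return 16 * workers

For a 40-worker cluster that stores logs in the database, this yields 640 GB. The same cluster
with logs sent to Elasticsearch stays at 200 GB.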

****************************
Installation Prerequisites
****************************

Before proceeding with the installation, ensure that:

- The operating system is installed along with the HPC client packages (a clone of an existing
  login node could be made if the OS is the same or similar)
- The node has Internet connectivity
- The node has the shared file system mounted on ``/scratch``
- Java is installed
- Podman is installed

A dedicated OS user named ``determined`` should be created on the admin node. This user should:

- Belong to the ``determined`` group
- Be able to run HPC jobs
- Have sudo permissions for specific commands (see :ref:`hpc-security-considerations` for details)

.. note::

   All subsequent installation steps assume the use of the ``determined`` user or root access.
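
Some of the prerequisites listed above (the shared file system mount, Java, and Podman) can be
checked programmatically before starting the installation. The following is a minimal sketch with
a hypothetical function name. It only verifies that the commands are on the ``PATH`` and that the
directory is a mount point; it does not check versions, Internet connectivity, or the
``determined`` user's permissions.

.. code:: python

   import os
   import shutil


   def check_prerequisites(shared_mount="/scratch", commands=("java", "podman")):
       """Return a dict of check name -> bool for the prerequisites listed above."""
       results = {f"{name} on PATH": shutil.which(name) is not None for name in commands}
       # os.path.ismount is False for a plain directory that has nothing mounted on it.
       results[f"{shared_mount} mounted"] = os.path.ismount(shared_mount)
       return results

Running ``check_prerequisites()`` as the ``determined`` user on the admin node returns a dictionary
in which any ``False`` value points to a prerequisite that still needs attention.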

For detailed installation steps, including OS-specific instructions and configuration, refer to the
:ref:`install-on-slurm` document.

Internal Task Gateway
=====================

As of version 0.34.0, Determined supports the Internal Task Gateway feature for Kubernetes. This
feature enables Determined tasks running on remote Kubernetes clusters to be exposed to the
Determined master and proxies. If you're using a hybrid setup with both Slurm/PBS and Kubernetes,
this feature might be relevant for your configuration.

.. important::

   Enabling this feature exposes Determined tasks to the outside world. Implement appropriate
   security measures to restrict access to exposed tasks and secure communication between the
   external cluster and the main cluster.
20 changes: 16 additions & 4 deletions docs/setup-cluster/slurm/install-on-slurm.rst
@@ -5,7 +5,19 @@
#################################

This document describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS
workload managers.
workload managers. It covers both scenarios where root access is available and where it is not.

For non-root installations:

#. Ensure that the prerequisites in :ref:`hpc-environment-requirements` have been completed by your
   system administrator.
#. Verify that you have the necessary permissions to run Slurm/PBS jobs and access the required
   directories.
#. Check that the container runtime (Singularity/Apptainer, Podman, or Enroot) is properly
   configured for non-root usage.

For root installations, ensure that all requirements in :ref:`hpc-environment-requirements` and
:ref:`slurm-requirements` are met before proceeding.

.. include:: ../../_shared/tip-keep-install-instructions.txt

Expand Down Expand Up @@ -123,7 +135,7 @@ configured, install and configure the Determined master:
| | path, you can override the default by updating this value. |
+----------------------------+----------------------------------------------------------------+
| ``gres_supported`` | Indicates that Slurm/PBS identifies available GPUs. The |
| | default is ``true``. See :ref:`slurm-config-requirements` or |
| | default is ``true``. See :ref:`slurm-requirements` or |
| | :ref:`pbs-config-requirements` for details. |
+----------------------------+----------------------------------------------------------------+

Expand Down Expand Up @@ -163,8 +175,8 @@ configured, install and configure the Determined master:
#. If the compute nodes of your cluster do not have internet connectivity to download Docker images,
see :ref:`slurm-image-config`.

#. If internet connectivity requires use of a proxy, make sure the proxy variables are defined as
per :ref:`proxy-config-requirements`.
#. If internet connectivity requires the use of a proxy, make sure the proxy variables are properly
configured in your environment.

#. Log into Determined, see :ref:`users`. The Determined user must be linked to a user on the HPC
cluster. If signed in with a Determined administrator account, the following example creates a
2 changes: 1 addition & 1 deletion docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -249,7 +249,7 @@ Some constraints are due to differences in behavior between Docker and Singulari
*********************

- Enroot uses ``XDG_RUNTIME_DIR`` which is not provided to the compute jobs by Slurm/PBS by
default. The error ``mkdir: cannot create directory /run/enroot: Permission denied`` indicates
default. The error ``mkdir: cannot create directory '/run/enroot': Permission denied`` indicates
that the environment variable ``XDG_RUNTIME_DIR`` is not defined on the compute nodes. See
:ref:`podman-config-requirements` for recommendations.

94 changes: 11 additions & 83 deletions docs/setup-cluster/slurm/slurm-requirements.rst
@@ -1,83 +1,13 @@
.. _slurm-requirements:

###########################
Installation Requirements
###########################
########################
Slurm/PBS Requirements
########################

********************
Basic Requirements
********************

To deploy the Determined HPC Launcher on Slurm/PBS, the following requirements must be met.

- The login node, admin node, and compute nodes must be installed and configured with one of the
following Linux distributions:

- RHEL or Rocky Linux® 8.5, 8.6
- RHEL 9
- SUSE® Linux Enterprise Server (SLES) 12 SP3 , 15 SP3, 15 SP4
- Ubuntu® 20.04, 22.04
- Cray OS (COS) 2.3, 2.4

Note: More restrictive Linux distribution dependencies may be required by your choice of
Slurm/PBS version and container runtime (Singularity/Apptainer®, Podman, or NVIDIA® Enroot).

- Slurm 20.02 or greater (excluding 22.05.5 through at least 22.05.8 - see
:ref:`slurm-known-issues`) or PBS 2021.1.2 or greater.

- Apptainer 1.0 or greater, Singularity 3.7 or greater, Enroot 3.4.0 or greater or Podman 3.3.1 or
greater.

- A cluster-wide shared filesystem with consistent path names across the HPC cluster.

- User and group configuration must be consistent across all nodes.

- All nodes must be able to resolve the hostnames of all other nodes.

- To run jobs with GPUs, the NVIDIA or AMD drivers must be installed on each compute node.
Determined requires a version greater than or equal to 450.80 of the NVIDIA drivers. The NVIDIA
drivers can be installed as part of a CUDA installation but the rest of the CUDA toolkit is not
required.

- Determined supports the `active Python versions <https://endoflife.date/python>`__.

***********************
Launcher Requirements
***********************
This document describes the specific requirements for deploying Determined on Slurm or PBS workload
managers.

The launcher has the following additional requirements on the installation node:

- Support for an RPM or Debian-based package installer
- Java 1.8 or greater
- Sudo is configured to process configuration files present in the ``/etc/sudoers.d`` directory
- Access to the Slurm or PBS command-line interface for the cluster
- Access to a cluster-wide file system with a consistent path names across the cluster

.. _proxy-config-requirements:

**********************************
Proxy Configuration Requirements
**********************************

If internet connectivity requires a use of a proxy, verify the following requirements:

- Ensure that the proxy variables are defined in ``/etc/environment`` (or ``/etc/sysconfig/proxy``
on SLES).

- Ensure that the `no_proxy` setting covers the login and admin nodes. If these nodes may be
referenced by short names known only within the cluster, they must explicitly be included in the
`no_proxy` setting.

- If your experiment code communicates between compute nodes with a protocol that honors proxy
environment variables, you should additionally include the names of all compute nodes in the
`no_proxy` variable setting.

The HPC launcher imports `http_proxy`, `https_proxy`, `ftp_proxy`, `rsync_proxy`, `gopher_proxy`,
`socks_proxy`, `socks5_server`, and `no_proxy` from ``/etc/environment`` and
``/etc/sysconfig/proxy``. These environment variables are automatically exported in lowercase and
uppercase into any launched jobs and containers.

.. _slurm-config-requirements:
For general environment requirements, please refer to :ref:`hpc-environment-requirements`.

********************
Slurm Requirements
Expand Down Expand Up @@ -194,6 +124,8 @@ interacts with Slurm, we recommend the following steps:

.. _pbs-config-requirements:

.. _pbs-ngpus-config:

******************
PBS Requirements
******************
Expand Down Expand Up @@ -226,8 +158,6 @@ interacts with PBS, we recommend the following steps:
configure ``CUDA_VISIBLE_DEVICES`` or set the ``pbs.slots_per_node`` setting in your experiment
configuration file to indicate the desired number of GPU slots for Determined.

.. _pbs-ngpus-config:

- Ensure the ``ngpus`` resource is defined with the correct values.

To ensure the successful operation of Determined, define the ``ngpus`` resource value for each
@@ -396,11 +326,9 @@ interacts with PBS, we recommend the following steps:
Apptainer/Singularity Requirements
************************************

Apptainer/Singularity is the recommended container runtime for Determined on HPC clusters. Apptainer
is a fork of Singularity 3.8 and provides both the ``apptainer`` and ``singularity`` commands. For
purposes of this documentation, you can consider all references to Singularity to also apply to
Apptainer. The Determined launcher interacts with Apptainer/Singularity using the ``singularity``
command.
Determined supports Apptainer (formerly known as Singularity) as a container runtime in HPC
environments. Ensure that Apptainer or Singularity is properly installed and configured on all
compute nodes of your cluster.

.. note::

3 changes: 2 additions & 1 deletion docs/setup-cluster/slurm/upgrade-on-hpc.rst
@@ -8,7 +8,8 @@ This procedure describes how to upgrade Determined on an HPC cluster managed by
workload managers. Use this procedure when an earlier version of Determined is installed,
configured, and functioning properly.

#. Review the latest :ref:`slurm-requirements` and ensure all dependencies have been met.
#. Review the latest :ref:`hpc-environment-requirements` and :ref:`slurm-requirements` and ensure
all dependencies have been met.

#. Upgrade the launcher.
