Merge branch 'main' into thiago/ET-791
thiagodallacqua-hpe authored Oct 25, 2024
2 parents fb49351 + 834eeda commit 2efe98b
Showing 96 changed files with 2,461 additions and 1,517 deletions.
62 changes: 62 additions & 0 deletions .github/workflows/start-minor-release.yml
@@ -0,0 +1,62 @@
---
name: "Start minor release"

on:  # yamllint disable-line rule:truthy
  workflow_dispatch:
    inputs:
      version:
        description: "The Determined minor version to release. E.g. 0.38.0. This will create a new release branch and make commits on main."
        required: true

jobs:
  start-minor-release:
    name: "Start minor release"
    env:
      GH_TOKEN: ${{ secrets.DETERMINED_TOKEN }}
    permissions:
      contents: write
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: "Validate version"
        shell: bash {0}
        run: |
          grep -E -o '[0-9]+\.[0-9]+\.0' <<<'${{ github.event.inputs.version }}'
          ret=$?
          if [[ $ret != 0 ]]; then
            echo '::error::Version string must match <[0-9]+\.[0-9]+\.0>. Got: <${{ github.event.inputs.version }}>'
            exit $ret
          fi

      - name: "Configure git username and e-mail"
        run: |
          git config user.name github-actions
          git config user.email \
            41898282+github-actions[bot]@users.noreply.github.com

      - name: "Setup Go"
        uses: actions/setup-go@v5
        with:
          go-version: "1.22.0"

      - name: "Install protobuf dependencies"
        run: "make get-deps-proto"

      - name: "Create release branch"
        run: |
          echo 'Creating branch: release-${{ github.event.inputs.version }}'
          git checkout -b release-${{ github.event.inputs.version }}
          echo 'Pushing release branch'
          git push -u origin release-${{ github.event.inputs.version }}

      - name: "Switch back to main"
        run: "git checkout main"

      - name: "Publish changes to main"
        run: |
          ./tools/scripts/lock-api-state.sh
          ./tools/scripts/lock-published-urls.sh
          git push origin main
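
The ``Validate version`` step uses ``grep -E -o`` without anchors, so an input such as ``0.38.01`` or ``v0.38.0`` would still pass because it contains a matching substring. A minimal sketch of a stricter check follows, assuming a POSIX shell and a ``grep`` that supports ``-x``. The helper name ``is_minor_release_version`` is hypothetical and is not part of this commit.

```shell
#!/usr/bin/env bash
# Sketch only: is_minor_release_version is a hypothetical helper, not part of
# the workflow. It succeeds only when the whole string looks like X.Y.0.
is_minor_release_version() {
  # -x makes grep match the entire line; -q suppresses output.
  printf '%s\n' "$1" | grep -E -x -q '[0-9]+\.[0-9]+\.0'
}

# Try a few candidate strings and report how each one is classified.
for candidate in 0.38.0 0.38.1 0.38.01 v0.38.0; do
  if is_minor_release_version "$candidate"; then
    echo "$candidate: accepted"
  else
    echo "$candidate: rejected"
  fi
done
```

In the workflow itself, the same anchoring could probably be obtained by swapping ``-o`` for ``-x`` in the existing ``grep`` call.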
63 changes: 45 additions & 18 deletions docs/reference/experiment-config-reference.rst
@@ -308,38 +308,65 @@ Optional. Defines actions and labels in response to trial logs matching specifie
language syntax). For more information about the syntax, you can visit this `RE2 reference page
<https://github.com/google/re2/wiki/Syntax>`__. Each log policy can have the following fields:

-- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a
-label in the UI when the log policy matches.
+- ``name``: Required. The name of the log policy, displayed as a label in the WebUI when a log
+policy match occurs.

-- ``pattern``: Required. The regex pattern to match in the logs.
+- ``pattern``: Optional. Defines a regex pattern to match log entries. If not specified, this
+policy is disabled.

- ``action``: Optional. The action to take when the pattern is matched. Actions include:

-- ``exclude_node``: Excludes a failed trial's restart attempts from being scheduled on nodes
-with matching error logs.
-- ``cancel_retries``: Prevents a trial from restarting if it reports a matching log.
+- ``exclude_node``: Excludes a failed trial's restart attempts (due to its ``max_restarts``
+policy) from being scheduled on nodes with matched error logs. This is useful for bypassing
+nodes with hardware issues, such as uncorrectable GPU ECC errors.
+
+.. note::
+
+This option is not supported on PBS systems.
+
+For the agent resource manager, if a trial becomes unschedulable due to enough node
+exclusions, and ``launch_error`` in the master config is set to true (default), the trial will
+fail.
+
+- ``cancel_retries``: Prevents a trial from restarting if a log matches the pattern, even if the
+trial has remaining max_restarts. This avoids using resources retrying a trial that encounters
+failures unlikely to be resolved by retrying, such as CUDA memory issues.

Example configuration:

.. code:: yaml

   log_policies:
-     - name: "ECC Error"
-       pattern: ".*uncorrectable ECC error encountered.*"
-       action:
-         type: exclude_node
-     - name: "CUDA OOM"
-       pattern: ".*CUDA out of memory.*"
-       action:
-         type: cancel_retries
-
-When a log policy matches, its name (if provided) will be displayed as a label in the WebUI,
-allowing for easy identification of specific issues or events during a run.
+     - name: ECC Error
+       pattern: ".*uncorrectable ECC error encountered.*"
+       action: exclude_node
+     - name: CUDA OOM
+       pattern: ".*CUDA out of memory.*"
+       action: cancel_retries
+
+When a log policy matches, its name appears as a label in the WebUI, making it easy to identify
+specific issues during a run. These labels are shown in both the run table and run detail views.

These settings may also be specified at the cluster or resource pool level through task container
defaults.

-To find out more about log management, visit :ref:`Log Management <log-management>`.
+Default policies:
+
+.. code:: yaml
+
+   log_policies:
+     - name: CUDA OOM
+       pattern: ".*CUDA out of memory.*"
+     - name: ECC Error
+       pattern: ".*uncorrectable ECC error encountered.*"
+
+To disable showing labels from the default policies:
+
+.. code:: yaml
+
+   log_policies:
+     - name: CUDA OOM
+     - name: ECC Error
+
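
To get a feel for which log lines the two default patterns would label, the sketch below runs them over a few invented log lines. It uses ``grep -E`` as a stand-in for the RE2 matching that Determined performs, which is an assumption that should hold for simple patterns like these. The names ``sample_logs`` and ``label_lines`` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample log lines, invented for illustration only.
sample_logs='step 120: loss=0.42
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
GPU 3: uncorrectable ECC error encountered
training finished'

# Print "<policy name> | <log line>" for every line matching the pattern.
# grep -E approximates RE2 here; the real matching happens inside Determined.
label_lines() {
  name=$1
  pattern=$2
  printf '%s\n' "$sample_logs" | grep -E -- "$pattern" | while IFS= read -r line; do
    printf '%s | %s\n' "$name" "$line"
  done
}

label_lines "CUDA OOM" ".*CUDA out of memory.*"
label_lines "ECC Error" ".*uncorrectable ECC error encountered.*"
```

Only the second and third sample lines are labeled; the other two match neither pattern.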
.. _log-retention-days:

2 changes: 1 addition & 1 deletion docs/release-notes/log-search-improvement.rst
@@ -6,4 +6,4 @@
search result will take users directly to the relevant position in the log, allowing them to
easily view logs both before and after the matched entry. Additionally, add support for
regex-based searches, providing more flexible log filtering. For more details, refer to
-:ref:`log_policies <config-log-policies>`.
+:ref:`WebUI <web-ui-if>`.
10 changes: 10 additions & 0 deletions docs/release-notes/log-signal.rst
@@ -0,0 +1,10 @@
:orphan:

**New Features**

- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows
as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in
both the run table and run detail views.

In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string.
For more details, refer to :ref:`log_policies <config-log-policies>`.
19 changes: 19 additions & 0 deletions docs/release-notes/unsupport-aurora-postgres-reminder.rst
@@ -0,0 +1,19 @@
:orphan:

**Deprecations**

- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no
longer supported as the default persistent storage for AWS Determined deployments. We recommend
that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration
instructions <https://gist.github.com/rb-determined-ai/bfa10182e53968e00a3c88df624e777e>`_.

- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy
aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses
Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new
default.

- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November
14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later
to maintain compatibility. The application will log a warning if it detects a connection to any
PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once
it is End of Life.
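
As a rough illustration of the upgrade guidance above, the sketch below classifies a PostgreSQL version string against the "13 or later" threshold from this note. The helper name ``postgres_support_status`` and its output strings are hypothetical; this is not the check that Determined itself performs.

```shell
#!/usr/bin/env bash
# Sketch only: postgres_support_status is a hypothetical helper. The
# threshold of 13 comes from the release note above.
postgres_support_status() {
  major=${1%%.*}            # text before the first dot, e.g. "12" from "12.19"
  major=${major%%[!0-9]*}   # drop trailing non-digits, e.g. "17beta1" -> "17"
  if [ -z "$major" ]; then
    echo "unknown"
  elif [ "$major" -ge 13 ]; then
    echo "supported"
  else
    echo "upgrade recommended"
  fi
}

postgres_support_status "12.19"
postgres_support_status "16.4 (Debian 16.4-1.pgdg120+1)"
```

On a live server, the version string could be obtained with something like ``psql -At -c 'SHOW server_version'``; connection options are omitted here.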
2 changes: 1 addition & 1 deletion docs/setup-cluster/aws/aws-spot.rst
@@ -4,7 +4,7 @@
Use Spot Instances
####################

-This document describes how to use AWS spot instances with Determined. Spot instances can be much
+This guide describes how to use AWS spot instances with Determined. Spot instances can be much
cheaper than on-demand instances (up to 90% cheaper, but more often 70-80%) but they are unreliable,
so software that runs on spot instances must be fault tolerant. Unfortunately, deep learning code is
often not written with fault tolerance in mind, preventing many practitioners from using spot
12 changes: 9 additions & 3 deletions docs/setup-cluster/aws/install-on-aws.rst
@@ -23,12 +23,18 @@ CloudFormation <https://aws.amazon.com/cloudformation/>`__ to automatically depl
Determined cluster. CloudFormation builds the necessary components for Determined into a single
CloudFormation stack.

+.. important::
+
+**Amazon Aurora V1 is no longer supported**. We recommend migrating to **Amazon RDS for
+PostgreSQL 14**. Deployments using ``det deploy aws`` will default to the ``simple-rds``
+deployment type, which uses Amazon RDS.
+
Requirements
============

-- Either AWS credentials or an IAM role with permissions to access AWS CloudFormation APIs. See the
-`AWS Documentation <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html>`__
-for information on how to use AWS credentials.
+- AWS credentials or an IAM role with permissions to access AWS CloudFormation APIs. See the `AWS
+Documentation <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html>`__ for
+information on how to use AWS credentials.

- An `AWS EC2 Keypair <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html>`__.

12 changes: 8 additions & 4 deletions docs/setup-cluster/checklists/postgresql.rst
@@ -6,6 +6,10 @@

Determined uses a PostgreSQL database to store experiment and trial metadata.

+.. note::
+
+We recommend installing the latest available version of PostgreSQL.
+
.. note::

If you are using an existing PostgreSQL installation, we recommend confirming that
@@ -22,7 +26,7 @@ Determined uses a PostgreSQL database to store experiment and trial metadata.
GPUs, ensure that the :ref:`NVIDIA Container Toolkit <validate-nvidia-container-toolkit>` on each
one is working as expected.

-#. Pull the official Docker image for PostgreSQL. PostgreSQL version 10 and later is supported.
+#. Pull the official Docker image for the latest PostgreSQL version.

.. code::
@@ -75,21 +79,21 @@ Determined uses a PostgreSQL database to store experiment and trial metadata.
Install PostgreSQL using ``apt`` or ``yum``
===========================================

-#. Install PostgreSQL 10 or greater.
+#. Install PostgreSQL.

**Debian Distributions**

On Debian distributions, use the following command:

.. code::
-sudo apt install postgresql-10
+sudo apt install postgresql
**Red Hat Distributions**

On Red Hat distributions, you'll need to configure the PostgreSQL yum repository as described in
the `Red Hat Linux documentation <https://www.postgresql.org/download/linux/redhat>`_. Then,
-install version 10:
+install PostgreSQL:

.. code::
2 changes: 1 addition & 1 deletion docs/setup-cluster/on-prem/options/docker.rst
@@ -14,7 +14,7 @@ This user guide provides step-by-step instructions for installing Determined usi
GPUs, ensure that the :ref:`NVIDIA Container Toolkit <validate-nvidia-container-toolkit>` on each
one is working as expected.

-#. Pull the official Docker image for PostgreSQL. PostgreSQL version 10 and later is supported.
+#. Pull the official Docker image for the latest PostgreSQL version.

.. code::
4 changes: 2 additions & 2 deletions docs/setup-cluster/on-prem/options/linux-packages.rst
@@ -34,7 +34,7 @@ Docker container or your Linux distribution's package and service.
Run PostgreSQL in Docker
------------------------

-#. Pull the official Docker image for PostgreSQL. PostgreSQL version 10 and later is supported.
+#. Pull the official Docker image for the latest PostgreSQL version.

.. code::
@@ -65,7 +65,7 @@ Run PostgreSQL in Docker
Install PostgreSQL using ``apt`` or ``yum``
-------------------------------------------

-#. Install PostgreSQL. Version 10 and later is supported.
+#. Install PostgreSQL.

**Debian Distributions**

5 changes: 3 additions & 2 deletions docs/setup-cluster/slurm/hpc-environment-requirements.rst
@@ -91,7 +91,7 @@ The following software components are required:
 | Podman                 | >= 4.0.0                         | Admin            |
 +------------------------+----------------------------------+------------------+
 | PostgreSQL             | 10 (RHEL 8), 13 (RHEL 9), 14     | Admin            |
-|                        | (Ubuntu 22.04) or newer          |                  |
+|                        | (Ubuntu 22.04) or later          |                  |
 +------------------------+----------------------------------+------------------+
 | HPC client packages    | Same as login nodes              | Admin            |
 +------------------------+----------------------------------+------------------+
@@ -109,7 +109,8 @@ The following software components are required:
Database Requirements
=====================

-The solution requires PostgreSQL 10 or newer, which will be installed on the admin node. The
+The solution requires PostgreSQL 13 or later, which will be installed on the admin node. We
+recommend using the latest available version of PostgreSQL for optimal support and security. The
required disk space for the database is estimated as follows:

- 200 GB on small systems (less than 15 workers) or big systems if the experiment logs are sent to
14 changes: 14 additions & 0 deletions docs/tools/webui-if.rst
@@ -241,3 +241,17 @@ Clear the message with the following command:
.. code:: bash
det master cluster-message clear
+****************************
+Viewing Log Search Results
+****************************
+
+To perform a log search:
+
+#. Navigate to your run in the WebUI.
+#. In the Logs tab, start typing in the search box to open the search pane.
+#. To use regex search, click the "Regex" checkbox in the search pane.
+#. Click on a search result to view it in context, with logs before and after visible.
+#. Scroll up and down to fetch new logs.
+
+Note: Search results are not auto-updating. You may need to refresh to see new logs.
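
For readers who want a command-line analogue of regex search with surrounding context, the sketch below does something similar on a local log excerpt. It is only an approximation of the WebUI behavior described above. The log lines and the names ``sample_log`` and ``search_with_context`` are invented, and ``grep -C`` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/usr/bin/env bash
# Hypothetical log excerpt, invented for illustration only.
sample_log='starting trial
loading checkpoint
connection timeout after 30s
retrying
epoch 1 done'

# Print each line matching the regex with one line of context on either side.
# -n prefixes line numbers: "N:" marks a match, "N-" marks a context line.
search_with_context() {
  printf '%s\n' "$sample_log" | grep -n -E -C 1 -- "$1"
}

search_with_context 'time(out|d out)'
```

The match on line 3 is printed together with lines 2 and 4, roughly mirroring how a search result in the WebUI is shown with the logs before and after it.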
1 change: 0 additions & 1 deletion docs/tutorials/_index.rst
@@ -46,7 +46,6 @@ Examples let you build off of an existing model that already runs on Determined.
:hidden:

Quickstart for Model Developers <quickstart-mdldev>
-Managing Logs and Log Policies <log-management>
Get Started with Detached Mode <detached-mode/_index>
Viewing Epoch-Based Metrics in the WebUI <viewing-epoch-based-metrics>
Using Pachyderm to Create a Batch Inferencing Pipeline <pachyderm-cat-dog>
52 changes: 0 additions & 52 deletions docs/tutorials/log-management.rst

This file was deleted.
