docs: Update log policies (#10098)
tara-hpe authored and thiagodallacqua-hpe committed Oct 28, 2024
1 parent fc7ab89 commit 0242b37
Showing 3 changed files with 73 additions and 16 deletions.
34 changes: 19 additions & 15 deletions docs/reference/experiment-config-reference.rst
@@ -304,39 +304,43 @@ if at least one of its trials completes without errors. The default value for ``
``log_policies``
================

Optional. Defines actions in response to trial logs matching specified regex patterns (Go language
syntax). For more information about the syntax, you can visit this `RE2 reference page
<https://github.com/google/re2/wiki/Syntax>`__. Actions include:
Optional. Defines actions and labels in response to trial logs matching specified regex patterns (Go
language syntax). For more information about the syntax, you can visit this `RE2 reference page
<https://github.com/google/re2/wiki/Syntax>`__. Each log policy can have the following fields:

- ``exclude_node``: Excludes a failed trial's restart attempts (due to its ``max_restarts`` policy)
from being scheduled on nodes with matched error logs. This is useful for bypassing nodes with
hardware issues, like uncorrectable GPU ECC errors.
- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a
label in the UI when the log policy matches.

Note: This option is not supported on PBS systems.
- ``pattern``: Required. The regex pattern to match in the logs.

For the agent resource manager, if a trial becomes unschedulable due to enough node exclusions,
and ``launch_error`` in the master config is true (default), the trial fails.
- ``action``: Optional. The action to take when the pattern is matched. Actions include:

- ``cancel_retries``: Prevents a trial from restarting if a trial reports a log that matches the
pattern, even if it has remaining ``max_restarts``. This avoids using resources for retrying a
trial that encounters certain failures that won't be fixed by retrying the trial, such as CUDA
memory issues.
- ``exclude_node``: Excludes a failed trial's restart attempts from being scheduled on nodes
with matching error logs.
- ``cancel_retries``: Prevents a trial from restarting if it reports a matching log.

Example configuration:

.. code:: yaml
log_policies:
- pattern: ".*uncorrectable ECC error encountered.*"
- name: "ECC Error"
pattern: ".*uncorrectable ECC error encountered.*"
action:
type: exclude_node
- pattern: ".*CUDA out of memory.*"
- name: "CUDA OOM"
pattern: ".*CUDA out of memory.*"
action:
type: cancel_retries
When a log policy matches, its name (if provided) will be displayed as a label in the WebUI,
allowing for easy identification of specific issues or events during a run.

These settings may also be specified at the cluster or resource pool level through task container
defaults.
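
As a rough sketch of the cluster-level and resource-pool-level placement mentioned above, a master
configuration might nest ``log_policies`` under ``task_container_defaults`` as follows. The pool
name ``gpu-pool`` is a hypothetical placeholder, and the surrounding ``resource_pools`` layout is an
assumption that should be checked against the master configuration reference for your version.

.. code:: yaml

   # Cluster-level defaults (assumed placement in the master configuration).
   task_container_defaults:
     log_policies:
       - name: "ECC Error"
         pattern: ".*uncorrectable ECC error encountered.*"
         action:
           type: exclude_node

   resource_pools:
     # "gpu-pool" is a hypothetical pool name used only for illustration.
     - pool_name: gpu-pool
       task_container_defaults:
         log_policies:
           - name: "CUDA OOM"
             pattern: ".*CUDA out of memory.*"
             action:
               type: cancel_retries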

To find out more about log management, visit :ref:`Log Management <log-management>`.

.. _log-retention-days:

``retention_policy``
3 changes: 2 additions & 1 deletion docs/tutorials/_index.rst
@@ -46,7 +46,8 @@ Examples let you build off of an existing model that already runs on Determined.
:hidden:

Quickstart for Model Developers <quickstart-mdldev>
Porting Your PyTorch Model to Determined <pytorch-mnist-tutorial>
Managing Logs and Log Policies <log-management>
Get Started with Detached Mode <detached-mode/_index>
Viewing Epoch-Based Metrics in the WebUI <viewing-epoch-based-metrics>
Using Pachyderm to Create a Batch Inferencing Pipeline <pachyderm-cat-dog>
Porting Your PyTorch Model to Determined <pytorch-mnist-tutorial>
52 changes: 52 additions & 0 deletions docs/tutorials/log-management.rst
@@ -0,0 +1,52 @@
.. _log-management:

################
Log Management
################

This guide covers two log management features: Log Search and Log Signal.

************
Log Search
************

To perform a log search:

#. Navigate to your run in the WebUI.
#. In the Logs tab, start typing in the search box to open the search pane.
#. To use regex search, click the "Regex" checkbox in the search pane.
#. Click on a search result to view it in context, with logs before and after visible.
#. Scroll up and down to fetch new logs.

Note: Search results are not auto-updating. You may need to refresh to see new logs.

************
Log Signal
************

Log Signal allows you to configure log policies in the master configuration to display labels in the
UI when specific patterns are matched in the logs.

To set up a log policy:

#. In the master configuration file, under ``task_container_defaults > log_policies``, define your
   log policies.
#. Each policy can have a ``name``, ``pattern``, and ``action``.
#. When a log matching the pattern is encountered, the ``name`` will be displayed as a label in the
   run table and run detail views.

Example configuration:

.. code:: yaml

   log_policies:
     - name: "CUDA OOM"
       pattern: ".*CUDA out of memory.*"
       action:
         type: cancel_retries

This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the
logs.
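
Because ``action`` is described as optional, a policy intended only to label matching runs could
plausibly omit it. The following is a minimal sketch of such a label-only policy, nested under
``task_container_defaults`` in the master configuration; the policy name and pattern are
illustrative assumptions rather than values shipped with Determined.

.. code:: yaml

   task_container_defaults:
     log_policies:
       # Hypothetical label-only policy: no action is taken on a match.
       - name: "Deprecation Warning"
         pattern: ".*DeprecationWarning.*"

Under that assumption, a run whose logs contain ``DeprecationWarning`` would show the "Deprecation
Warning" label without any change to its restart or scheduling behavior.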

For more detailed information on configuring log policies, refer to the :ref:`experiment
configuration reference <config-log-policies>`.
