docs: add log signal release note and update docs #10126

Merged 7 commits on Oct 25, 2024
Changes from 3 commits
47 changes: 32 additions & 15 deletions docs/reference/experiment-config-reference.rst
@@ -308,10 +308,11 @@ Optional. Defines actions and labels in response to trial logs matching specified
language syntax). For more information about the syntax, you can visit this `RE2 reference page
<https://github.com/google/re2/wiki/Syntax>`__. Each log policy can have the following fields:

- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a
label in the UI when the log policy matches.
- ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI
@tara-hpe (Contributor), Oct 25, 2024

Suggested change
- ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI
- ``name``: Required. The name of the log policy, displayed as a label in the WebUI

when the log policy matches.

Suggested change
when the log policy matches.
when a log match occurs.


- ``pattern``: Required. The regex pattern to match in the logs.
- ``pattern``: Optional. It is required to provide a regex pattern to match in the logs unless the

Suggested change
- ``pattern``: Optional. It is required to provide a regex pattern to match in the logs unless the
- ``pattern``: Optional. Defines a regex pattern to match log entries. If not specified, this policy is disabled.

intention is to disable an existing policy.

Suggested change
intention is to disable an existing policy.


- ``action``: Optional. The action to take when the pattern is matched. Actions include:

@@ -336,22 +337,38 @@ Example configuration:
.. code:: yaml

   log_policies:
     - name: "ECC Error"
       pattern: ".*uncorrectable ECC error encountered.*"
       action:
         type: exclude_node
     - name: "CUDA OOM"
       pattern: ".*CUDA out of memory.*"
       action:
         type: cancel_retries

When a log policy matches, its name (if provided) will be displayed as a label in the WebUI,
allowing for easy identification of specific issues or events during a run. These labels will appear
in both the run table and run detail views.
     - name: ECC Error
       pattern: ".*uncorrectable ECC error encountered.*"
       action: exclude_node
     - name: CUDA OOM
       pattern: ".*CUDA out of memory.*"
       action: cancel_retries

When a log policy matches, its name will be displayed as a label in the WebUI, allowing for easy

Suggested change
When a log policy matches, its name will be displayed as a label in the WebUI, allowing for easy
When a log policy matches, its name appears as a label in the WebUI, making it easy

identification of specific issues during a run. These labels will appear in both the run table and

Suggested change
identification of specific issues during a run. These labels will appear in both the run table and
to identify specific issues during a run. These labels are shown in both the run table and

run detail views.

These settings may also be specified at the cluster or resource pool level through task container
defaults.
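
As a rough sketch of what that might look like in the master configuration: this assumes
``log_policies`` is accepted under ``task_container_defaults`` in the same form as in the experiment
configuration, and the pool name ``gpu-pool`` is hypothetical.

.. code:: yaml

   # Cluster-wide default (assumed placement in the master configuration).
   task_container_defaults:
     log_policies:
       - name: ECC Error
         pattern: ".*uncorrectable ECC error encountered.*"
         action: exclude_node

   # Per-pool default; "gpu-pool" is a hypothetical pool name.
   resource_pools:
     - pool_name: gpu-pool
       task_container_defaults:
         log_policies:
           - name: CUDA OOM
             pattern: ".*CUDA out of memory.*"
             action: cancel_retries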

Default policies:

.. code:: yaml

   log_policies:
     - name: CUDA OOM
       pattern: ".*CUDA out of memory.*"
     - name: ECC Error
       pattern: ".*uncorrectable ECC error encountered.*"

To disable showing labels from the default policies:

.. code:: yaml

   log_policies:
     - name: CUDA OOM
     - name: ECC Error

To find out more about log management features like **Log Search** and **Log Signal**, visit
:ref:`Log Management <log-management>`.
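
For illustration, an experiment might hide one default label while attaching an action to another
policy. The sketch below assumes that an entry with only a ``name`` disables the default policy of
that name and leaves other entries unaffected. The ``Disk Full`` policy and its pattern are
hypothetical.

.. code:: yaml

   log_policies:
     # No pattern: disables the default "CUDA OOM" policy, so its label is not shown.
     - name: CUDA OOM
     # Same name as a default policy, with an action attached.
     - name: ECC Error
       pattern: ".*uncorrectable ECC error encountered.*"
       action: exclude_node
     # Hypothetical custom policy: shows a label and stops retries on a match.
     - name: Disk Full
       pattern: ".*No space left on device.*"
       action: cancel_retries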

4 changes: 2 additions & 2 deletions docs/release-notes/log-search-improvement.rst
@@ -5,5 +5,5 @@
- Logs: In the WebUI, add a tab specifically for displaying log search results. Clicking on any
search result will take users directly to the relevant position in the log, allowing them to
easily view logs both before and after the matched entry. Additionally, add support for
regex-based searches, providing more flexible log filtering. For more details, refer to
:ref:`log_policies <config-log-policies>`.
regex-based searches, providing more flexible log filtering. For more details, refer to :ref:`Log
Management <log-management>`.
10 changes: 10 additions & 0 deletions docs/release-notes/log-signal.rst
@@ -0,0 +1,10 @@
:orphan:

**New Features**

- Experiments: ``log_policies`` now have a ``name`` field. When a log policy matches, its name will

Suggested change
- Experiments: ``log_policies`` now have a ``name`` field. When a log policy matches, its name will
- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name

be displayed as a label in the WebUI, allowing for easy identification of specific issues during

Suggested change
be displayed as a label in the WebUI, allowing for easy identification of specific issues during
shows as a label in the WebUI, making it easy to spot specific issues during

a run. These labels will appear in both the run table and run detail views.

Suggested change
a run. These labels will appear in both the run table and run detail views.
a run. Labels appear in both the run table and run detail views.


It has a new format. ``name`` is required, and ``action`` should be a plain string. For more
@tara-hpe (Contributor), Oct 25, 2024

Suggested change
It has a new format. ``name`` is required, and ``action`` should be a plain string. For more
In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. For more

details, refer to :ref:`log_policies <config-log-policies>`.
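
A before-and-after sketch of the format change, based on the example in the experiment
configuration reference. The old form is kept as a comment for comparison.

.. code:: yaml

   log_policies:
     # Old format: name optional, action given as a mapping.
     # - name: "CUDA OOM"
     #   pattern: ".*CUDA out of memory.*"
     #   action:
     #     type: cancel_retries

     # New format: name required, action given as a plain string.
     - name: CUDA OOM
       pattern: ".*CUDA out of memory.*"
       action: cancel_retries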
5 changes: 2 additions & 3 deletions docs/tutorials/log-management.rst
@@ -40,10 +40,9 @@ Example configuration:
.. code:: yaml

log_policies:
PR author (Contributor)

I think the log signal feature is a visual cue for the original fault-tolerance project (exclude_node or cancel_retries when there are serious errors). The main goal is to make it easier for users to know whether a log policy has executed without reading the log.

log_policies is good at:

log_policies:
  - name: "Encountered Serious Error"
    pattern: ".*serious error.*"
    action: cancel_retries

We might not want to steer users toward something like the following, since the backend and WebUI weren't really designed for this use case:

log_policies:
  - name: "stringA appears in the log"
    pattern: ".*stringA.*"
  - name: "stringB appears in the log"
    pattern: ".*stringB.*"

     - name: "CUDA OOM"
     - name: CUDA OOM
       pattern: ".*CUDA out of memory.*"
       action:
         type: cancel_retries
       action: cancel_retries

This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the
logs.