docs: add log signal release note and update docs #10126
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -308,10 +308,11 @@ Optional. Defines actions and labels in response to trial logs matching specifie | |||||
language syntax). For more information about the syntax, you can visit this `RE2 reference page | ||||||
<https://github.com/google/re2/wiki/Syntax>`__. Each log policy can have the following fields: | ||||||
|
||||||
- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a | ||||||
label in the UI when the log policy matches. | ||||||
- ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI | ||||||
when the log policy matches. | ||||||
|
||||||
- ``pattern``: Required. The regex pattern to match in the logs. | ||||||
- ``pattern``: Optional. The regex pattern to match in the logs. Omit it only to disable an | ||||||
existing policy. | ||||||
|
||||||
- ``action``: Optional. The action to take when the pattern is matched. Actions include: | ||||||
|
||||||
|
@@ -336,22 +337,38 @@ Example configuration: | |||||
.. code:: yaml | ||||||
|
||||||
log_policies: | ||||||
- name: "ECC Error" | ||||||
pattern: ".*uncorrectable ECC error encountered.*" | ||||||
action: | ||||||
type: exclude_node | ||||||
- name: "CUDA OOM" | ||||||
pattern: ".*CUDA out of memory.*" | ||||||
action: | ||||||
type: cancel_retries | ||||||
|
||||||
When a log policy matches, its name (if provided) will be displayed as a label in the WebUI, | ||||||
allowing for easy identification of specific issues or events during a run. These labels will appear | ||||||
in both the run table and run detail views. | ||||||
- name: ECC Error | ||||||
pattern: ".*uncorrectable ECC error encountered.*" | ||||||
action: exclude_node | ||||||
- name: CUDA OOM | ||||||
pattern: ".*CUDA out of memory.*" | ||||||
action: cancel_retries | ||||||
|
||||||
When a log policy matches, its name will be displayed as a label in the WebUI, allowing for easy | ||||||
identification of specific issues during a run. These labels will appear in both the run table and | ||||||
run detail views. | ||||||
|
||||||
These settings may also be specified at the cluster or resource pool level through task container | ||||||
defaults. | ||||||
|
||||||
Default policies: | ||||||
|
||||||
.. code:: yaml

   log_policies:
     - name: CUDA OOM
       pattern: ".*CUDA out of memory.*"
     - name: ECC Error
       pattern: ".*uncorrectable ECC error encountered.*"
|
||||||
To disable showing labels from the default policies: | ||||||
|
||||||
.. code:: yaml

   log_policies:
     - name: CUDA OOM
     - name: ECC Error
|
||||||
To find out more about log management features like **Log Search** and **Log Signal**, visit | ||||||
:ref:`Log Management <log-management>`. | ||||||
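
The disable rule in this hunk (a named policy with no `pattern` turns off the default policy of the same name) can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's implementation: `effective_policies` and `matching_labels` are hypothetical helpers, merging by `name` is inferred from the example configurations, and Python's `re` module stands in for the RE2 engine the docs reference.

```python
import re

# Default policies as listed in the documentation hunk above.
DEFAULT_POLICIES = [
    {"name": "CUDA OOM", "pattern": ".*CUDA out of memory.*"},
    {"name": "ECC Error", "pattern": ".*uncorrectable ECC error encountered.*"},
]


def effective_policies(defaults, overrides):
    """Merge policies by name (assumed semantics).

    A later policy replaces an earlier one with the same name, and a policy
    without a pattern is treated as disabled and dropped from the result.
    """
    merged = {policy["name"]: policy for policy in defaults}
    for policy in overrides:
        merged[policy["name"]] = policy
    return [policy for policy in merged.values() if policy.get("pattern")]


def matching_labels(policies, log_line):
    """Return the names of all policies whose pattern matches the log line."""
    return [p["name"] for p in policies if re.search(p["pattern"], log_line)]
```

Calling `effective_policies(DEFAULT_POLICIES, [{"name": "CUDA OOM"}])` leaves only the "ECC Error" policy active, so a CUDA out-of-memory log line would no longer produce a label in this sketch.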
|
||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,10 @@ | ||||||
:orphan: | ||||||
|
||||||
**New Features** | ||||||
|
||||||
- Experiments: ``log_policies`` now have a ``name`` field. When a log policy matches, its name will | ||||||
be displayed as a label in the WebUI, allowing for easy identification of specific issues during | ||||||
a run. These labels will appear in both the run table and run detail views. | ||||||
|
||||||
The ``log_policies`` format has also changed: ``name`` is required, and ``action`` should be a plain string. For more | ||||||
details, refer to :ref:`log_policies <config-log-policies>`. |
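
The format change in this release note (nested `action.type` becoming a plain string, and `name` becoming required) can be illustrated with a small conversion helper. This is a sketch only: `migrate_policy` is a hypothetical function, not part of the project, and it assumes policies are already loaded as Python dictionaries.

```python
def migrate_policy(old):
    """Convert one old-style log policy dict to the new format (illustrative)."""
    name = old.get("name")
    if not name:
        # The new format requires a name, so refuse to guess one.
        raise ValueError("log policy needs a name in the new format")

    new = {"name": name}
    if "pattern" in old:
        new["pattern"] = old["pattern"]

    action = old.get("action")
    if isinstance(action, dict):
        # Old format: action: {type: cancel_retries}
        action = action.get("type")
    if action is not None:
        # New format: action: cancel_retries
        new["action"] = action
    return new
```

A policy that is already in the new format passes through unchanged, so the helper can be applied to a mixed list.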
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,10 +40,9 @@ Example configuration: | |
.. code:: yaml | ||
|
||
log_policies: | ||
Review comment: I think the log signal feature is a visual cue for the original fault-tolerance project (exclude_node or cancel_retries when there are serious errors). The main goal is to make it easier for users to know whether a log policy has executed without reading the logs.
We might not want to steer users toward doing something like the example below, since the backend and WebUI weren't really designed for this use case:
|
||
- name: "CUDA OOM" | ||
- name: CUDA OOM | ||
pattern: ".*CUDA out of memory.*" | ||
action: | ||
type: cancel_retries | ||
action: cancel_retries | ||
|
||
This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the | ||
logs. | ||
|