From 3093f7a5a72be107039927b52e7da12440b8baa8 Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Thu, 24 Oct 2024 21:43:43 -0400 Subject: [PATCH 1/7] add log signal release note and update docs --- .../reference/experiment-config-reference.rst | 46 +++++++++++++------ docs/release-notes/log-search-improvement.rst | 4 +- docs/release-notes/log-signal.rst | 39 ++++++++++++++++ docs/tutorials/log-management.rst | 5 +- 4 files changed, 74 insertions(+), 20 deletions(-) create mode 100644 docs/release-notes/log-signal.rst diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index e4b489d927d..2a688cf4026 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -308,10 +308,10 @@ Optional. Defines actions and labels in response to trial logs matching specifie language syntax). For more information about the syntax, you can visit this `RE2 reference page `__. Each log policy can have the following fields: -- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a - label in the UI when the log policy matches. +- ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI + when the log policy matches. -- ``pattern``: Required. The regex pattern to match in the logs. +- ``pattern``: Optional. The regex pattern to match in the logs. - ``action``: Optional. The action to take when the pattern is matched. Actions include: @@ -336,22 +336,38 @@ Example configuration: .. code:: yaml log_policies: - - name: "ECC Error" - pattern: ".*uncorrectable ECC error encountered.*" - action: - type: exclude_node - - name: "CUDA OOM" - pattern: ".*CUDA out of memory.*" - action: - type: cancel_retries - -When a log policy matches, its name (if provided) will be displayed as a label in the WebUI, -allowing for easy identification of specific issues or events during a run. 
These labels will appear -in both the run table and run detail views. + - name: ECC Error + pattern: ".*uncorrectable ECC error encountered.*" + action: exclude_node + - name: CUDA OOM + pattern: ".*CUDA out of memory.*" + action: cancel_retries + +When a log policy matches, its name will be displayed as a label in the WebUI, allowing for easy +identification of specific issues during a run. These labels will appear in both the run table and +run detail views. These settings may also be specified at the cluster or resource pool level through task container defaults. +Default policies: + +.. code:: yaml + + log_policies: + - name: CUDA OOM + pattern: ".*CUDA out of memory.*" + - name: ECC Error + pattern: ".*uncorrectable ECC error encountered.*" + +To disable showing labels from the default policies: + +.. code:: yaml + + log_policies: + - name: CUDA OOM + - name: ECC Error + To find out more about log management features like **Log Search** and **Log Signal**, visit :ref:`Log Management `. diff --git a/docs/release-notes/log-search-improvement.rst b/docs/release-notes/log-search-improvement.rst index c24773c49ec..e36fece8f2d 100644 --- a/docs/release-notes/log-search-improvement.rst +++ b/docs/release-notes/log-search-improvement.rst @@ -5,5 +5,5 @@ - Logs: In the WebUI, add a tab for specifically for displaying log search results. Clicking on any search result will take users directly to the relevant position in the log, allowing them to easily view logs both before and after the matched entry. Additionally, add support for - regex-based searches, providing more flexible log filtering. For more details, refer to - :ref:`log_policies `. + regex-based searches, providing more flexible log filtering. For more details, refer to :ref:`Log + Management `. 
diff --git a/docs/release-notes/log-signal.rst b/docs/release-notes/log-signal.rst new file mode 100644 index 00000000000..b64b4c54fd4 --- /dev/null +++ b/docs/release-notes/log-signal.rst @@ -0,0 +1,39 @@ +:orphan: + +**New Features** + +- Experiments: ``log_policies`` now have a ``name`` field. When a log policy matches, its name will + be displayed as a label in the WebUI, allowing for easy identification of specific issues during + a run. These labels will appear in both the run table and run detail views. + + It has a new format. ``name`` is required. ``pattern`` and ``action`` are optional. To make + things simpler, user no longer needs to specify the ``type`` field to set an action. For example: + + Old format: + + .. code:: yaml + + log_policies: + - pattern: ".*uncorrectable ECC error encountered.*" + action: + type: exclude_node + - pattern: ".*CUDA out of memory.*" + action: + type: cancel_retries + + New format: + + .. code:: yaml + + log_policies: + - name: ECC Error + pattern: ".*uncorrectable ECC error encountered.*" + action: exclude_node + - name: CUDA OOM + pattern: ".*CUDA out of memory.*" + action: cancel_retries + + Both old and new format are supported at this time. We plan to deprecate the old format in the + future. + + For more details, refer to :ref:`log_policies `. diff --git a/docs/tutorials/log-management.rst b/docs/tutorials/log-management.rst index 3eb95cfea69..ab6e5ec8efe 100644 --- a/docs/tutorials/log-management.rst +++ b/docs/tutorials/log-management.rst @@ -40,10 +40,9 @@ Example configuration: .. code:: yaml log_policies: - - name: "CUDA OOM" + - name: CUDA OOM pattern: ".*CUDA out of memory.*" - action: - type: cancel_retries + action: cancel_retries This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the logs. 
From 79a371fbf6666559d5de95eafe9dd1a4356caa64 Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 11:38:10 -0400 Subject: [PATCH 2/7] Address comments --- .../reference/experiment-config-reference.rst | 3 +- docs/release-notes/log-signal.rst | 33 ++----------------- 2 files changed, 4 insertions(+), 32 deletions(-) diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index 2a688cf4026..cf3b1c193d6 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -311,7 +311,8 @@ language syntax). For more information about the syntax, you can visit this `RE2 - ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI when the log policy matches. -- ``pattern``: Optional. The regex pattern to match in the logs. +- ``pattern``: Required. The regex pattern to match in the logs. Can't omit this field unless the + intention is to disable an existing policy. - ``action``: Optional. The action to take when the pattern is matched. Actions include: diff --git a/docs/release-notes/log-signal.rst b/docs/release-notes/log-signal.rst index b64b4c54fd4..f21982e137b 100644 --- a/docs/release-notes/log-signal.rst +++ b/docs/release-notes/log-signal.rst @@ -6,34 +6,5 @@ be displayed as a label in the WebUI, allowing for easy identification of specific issues during a run. These labels will appear in both the run table and run detail views. - It has a new format. ``name`` is required. ``pattern`` and ``action`` are optional. To make - things simpler, user no longer needs to specify the ``type`` field to set an action. For example: - - Old format: - - .. code:: yaml - - log_policies: - - pattern: ".*uncorrectable ECC error encountered.*" - action: - type: exclude_node - - pattern: ".*CUDA out of memory.*" - action: - type: cancel_retries - - New format: - - .. 
code:: yaml - - log_policies: - - name: ECC Error - pattern: ".*uncorrectable ECC error encountered.*" - action: exclude_node - - name: CUDA OOM - pattern: ".*CUDA out of memory.*" - action: cancel_retries - - Both old and new format are supported at this time. We plan to deprecate the old format in the - future. - - For more details, refer to :ref:`log_policies `. + It has a new format. ``name`` is required, and ``action`` should be a plain string. For more + details, refer to :ref:`log_policies `. From 8c51b01efb20b7ece783c13e354ed3cc82d1eb46 Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 11:48:50 -0400 Subject: [PATCH 3/7] debate between required and optional --- docs/reference/experiment-config-reference.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index cf3b1c193d6..3e0147843b5 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -311,7 +311,7 @@ language syntax). For more information about the syntax, you can visit this `RE2 - ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI when the log policy matches. -- ``pattern``: Required. The regex pattern to match in the logs. Can't omit this field unless the +- ``pattern``: Optional. It is required to provide a regex pattern to match in the logs unless the intention is to disable an existing policy. - ``action``: Optional. The action to take when the pattern is matched. 
Actions include: From 94a00233454b55df244bc406261bbfc89de61680 Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 12:59:50 -0400 Subject: [PATCH 4/7] Address comment --- .../reference/experiment-config-reference.rst | 13 ++++---- docs/tutorials/log-management.rst | 32 +------------------ 2 files changed, 7 insertions(+), 38 deletions(-) diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index 3e0147843b5..e1f8e1d3680 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -308,11 +308,11 @@ Optional. Defines actions and labels in response to trial logs matching specifie language syntax). For more information about the syntax, you can visit this `RE2 reference page `__. Each log policy can have the following fields: -- ``name``: Required. A name for the log policy. This name will be displayed as a label in the UI - when the log policy matches. +- ``name``: Required. The name of the log policy, displayed as a label in the WebUI when a log + policy match occurs. -- ``pattern``: Optional. It is required to provide a regex pattern to match in the logs unless the - intention is to disable an existing policy. +- ``pattern``: Optional. Defines a regex pattern to match log entries. If not specified, this + policy is disabled. - ``action``: Optional. The action to take when the pattern is matched. Actions include: @@ -344,9 +344,8 @@ Example configuration: pattern: ".*CUDA out of memory.*" action: cancel_retries -When a log policy matches, its name will be displayed as a label in the WebUI, allowing for easy -identification of specific issues during a run. These labels will appear in both the run table and -run detail views. +When a log policy matches, its name appears as a label in the WebUI, making it easy to identify +specific issues during a run. These labels are shown in both the run table and run detail views. 
These settings may also be specified at the cluster or resource pool level through task container defaults. diff --git a/docs/tutorials/log-management.rst b/docs/tutorials/log-management.rst index ab6e5ec8efe..305128714cb 100644 --- a/docs/tutorials/log-management.rst +++ b/docs/tutorials/log-management.rst @@ -4,7 +4,7 @@ Log Management ################ -This guide covers two log management features: Log Search and Log Signal. +This guide covers a log management feature: Log Search. ************ Log Search @@ -19,33 +19,3 @@ To perform a log search: #. Scroll up and down to fetch new logs. Note: Search results are not auto-updating. You may need to refresh to see new logs. - -************ - Log Signal -************ - -Log Signal allows you to configure log policies in the master configuration to display labels in the -UI when specific patterns are matched in the logs. - -To set up a log policy: - -#. In the master configuration file, under ``task_container_defaults > log_policies``, define your - log policies. -#. Each policy can have a ``name``, ``pattern``, and ``action``. -#. When a log matching the pattern is encountered, the ``name`` will be displayed as a label in the - run table and run detail views. - -Example configuration: - -.. code:: yaml - - log_policies: - - name: CUDA OOM - pattern: ".*CUDA out of memory.*" - action: cancel_retries - -This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the -logs. - -For more detailed information on configuring log policies, refer to the :ref:`experiment -configuration reference `. 
From 0cae196bb0f518e94b4187ec20c878a299190388 Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 13:29:00 -0400 Subject: [PATCH 5/7] update log management tutorial --- .../reference/experiment-config-reference.rst | 3 --- docs/release-notes/log-search-improvement.rst | 3 +-- docs/tools/webui-if.rst | 14 +++++++++++++ docs/tutorials/_index.rst | 1 - docs/tutorials/log-management.rst | 21 ------------------- 5 files changed, 15 insertions(+), 27 deletions(-) delete mode 100644 docs/tutorials/log-management.rst diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst index e1f8e1d3680..5ba662831e6 100644 --- a/docs/reference/experiment-config-reference.rst +++ b/docs/reference/experiment-config-reference.rst @@ -368,9 +368,6 @@ To disable showing labels from the default policies: - name: CUDA OOM - name: ECC Error -To find out more about log management features like **Log Search** and **Log Signal**, visit -:ref:`Log Management `. - .. _log-retention-days: ``retention_policy`` diff --git a/docs/release-notes/log-search-improvement.rst b/docs/release-notes/log-search-improvement.rst index e36fece8f2d..0e309b76484 100644 --- a/docs/release-notes/log-search-improvement.rst +++ b/docs/release-notes/log-search-improvement.rst @@ -5,5 +5,4 @@ - Logs: In the WebUI, add a tab for specifically for displaying log search results. Clicking on any search result will take users directly to the relevant position in the log, allowing them to easily view logs both before and after the matched entry. Additionally, add support for - regex-based searches, providing more flexible log filtering. For more details, refer to :ref:`Log - Management `. + regex-based searches, providing more flexible log filtering. For more details, refer to :ref:`WebUI `. 
diff --git a/docs/tools/webui-if.rst b/docs/tools/webui-if.rst index 733180a123e..ead6826c3b9 100644 --- a/docs/tools/webui-if.rst +++ b/docs/tools/webui-if.rst @@ -241,3 +241,17 @@ Clear the message with the following command: .. code:: bash det master cluster-message clear + +************ + Viewing Log Search Results +************ + +To perform a log search: + +#. Navigate to your run in the WebUI. +#. In the Logs tab, start typing in the search box to open the search pane. +#. To use regex search, click the "Regex" checkbox in the search pane. +#. Click on a search result to view it in context, with logs before and after visible. +#. Scroll up and down to fetch new logs. + +Note: Search results are not auto-updating. You may need to refresh to see new logs. diff --git a/docs/tutorials/_index.rst b/docs/tutorials/_index.rst index 724736f35c6..ab695aecfce 100644 --- a/docs/tutorials/_index.rst +++ b/docs/tutorials/_index.rst @@ -46,7 +46,6 @@ Examples let you build off of an existing model that already runs on Determined. :hidden: Quickstart for Model Developers - Managing Logs and Log Policies Get Started with Detached Mode Viewing Epoch-Based Metrics in the WebUI Using Pachyderm to Create a Batch Inferencing Pipeline diff --git a/docs/tutorials/log-management.rst b/docs/tutorials/log-management.rst deleted file mode 100644 index 305128714cb..00000000000 --- a/docs/tutorials/log-management.rst +++ /dev/null @@ -1,21 +0,0 @@ -.. _log-management: - -################ - Log Management -################ - -This guide covers a log management feature: Log Search. - -************ - Log Search -************ - -To perform a log search: - -#. Navigate to your run in the WebUI. -#. In the Logs tab, start typing in the search box to open the search pane. -#. To use regex search, click the "Regex" checkbox in the search pane. -#. Click on a search result to view it in context, with logs before and after visible. -#. Scroll up and down to fetch new logs. 
- -Note: Search results are not auto-updating. You may need to refresh to see new logs. From a5244c40430be24e6680961ab600dde6d56eaa1e Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 13:30:29 -0400 Subject: [PATCH 6/7] fmt --- docs/release-notes/log-search-improvement.rst | 3 ++- docs/tools/webui-if.rst | 4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/release-notes/log-search-improvement.rst b/docs/release-notes/log-search-improvement.rst index 0e309b76484..5ce8d46fff1 100644 --- a/docs/release-notes/log-search-improvement.rst +++ b/docs/release-notes/log-search-improvement.rst @@ -5,4 +5,5 @@ - Logs: In the WebUI, add a tab for specifically for displaying log search results. Clicking on any search result will take users directly to the relevant position in the log, allowing them to easily view logs both before and after the matched entry. Additionally, add support for - regex-based searches, providing more flexible log filtering. For more details, refer to :ref:`WebUI `. + regex-based searches, providing more flexible log filtering. For more details, refer to + :ref:`WebUI `. 
diff --git a/docs/tools/webui-if.rst b/docs/tools/webui-if.rst index ead6826c3b9..5ab5187cc33 100644 --- a/docs/tools/webui-if.rst +++ b/docs/tools/webui-if.rst @@ -242,9 +242,9 @@ Clear the message with the following command: det master cluster-message clear -************ +**************************** Viewing Log Search Results -************ +**************************** To perform a log search: From 72b9281161ef99e27bb0e7b9e67baa9b0196ad3b Mon Sep 17 00:00:00 2001 From: Jerry Gong Date: Fri, 25 Oct 2024 13:34:37 -0400 Subject: [PATCH 7/7] more comments --- docs/release-notes/log-signal.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/release-notes/log-signal.rst b/docs/release-notes/log-signal.rst index f21982e137b..743b0c6c56b 100644 --- a/docs/release-notes/log-signal.rst +++ b/docs/release-notes/log-signal.rst @@ -2,9 +2,9 @@ **New Features** -- Experiments: ``log_policies`` now have a ``name`` field. When a log policy matches, its name will - be displayed as a label in the WebUI, allowing for easy identification of specific issues during - a run. These labels will appear in both the run table and run detail views. +- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows + as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in + both the run table and run detail views. - It has a new format. ``name`` is required, and ``action`` should be a plain string. For more - details, refer to :ref:`log_policies `. + In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. + For more details, refer to :ref:`log_policies `.
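Note on the final ``log_policies`` format described in this series: the reference text says ``name`` is required, ``pattern`` is optional (a policy without one is disabled), and ``action`` is a plain string. The Python sketch below illustrates one possible reading of those rules. It is not Determined's implementation: the helper names ``effective_policies`` and ``match_labels`` are hypothetical, merging user policies over the defaults by ``name`` is an assumption inferred from the "disable the default policies" example, and Python's ``re`` module stands in for the RE2 syntax the docs actually specify.

```python
import re

# Default policies as listed in the reference docs touched by this series.
DEFAULT_POLICIES = [
    {"name": "CUDA OOM", "pattern": ".*CUDA out of memory.*"},
    {"name": "ECC Error", "pattern": ".*uncorrectable ECC error encountered.*"},
]


def effective_policies(user_policies):
    """Merge user policies over the defaults, keyed by name (assumed semantics)."""
    merged = {p["name"]: dict(p) for p in DEFAULT_POLICIES}
    for policy in user_policies:
        merged[policy["name"]] = dict(policy)
    # Per the docs, a policy with no pattern is disabled, so drop it here.
    return [p for p in merged.values() if p.get("pattern")]


def match_labels(policies, log_lines):
    """Return (name, action) pairs for each policy whose pattern matches a line."""
    hits = []
    for policy in policies:
        regex = re.compile(policy["pattern"])
        if any(regex.search(line) for line in log_lines):
            # action is optional; None means "label only".
            hits.append((policy["name"], policy.get("action")))
    return hits


user = [
    {"name": "CUDA OOM"},  # no pattern: disables the default policy
    {
        "name": "ECC Error",
        "pattern": ".*uncorrectable ECC error encountered.*",
        "action": "exclude_node",  # plain string, as in the new format
    },
]
logs = [
    "RuntimeError: CUDA out of memory.",
    "GPU 0: uncorrectable ECC error encountered",
]
labels = match_labels(effective_policies(user), logs)
print(labels)
```

With the sample input, only the ``ECC Error`` policy produces a label, because the ``CUDA OOM`` entry carries no pattern and is therefore dropped before matching.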