Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Log Signal Exp Config and Monitoring #9947

Merged
merged 27 commits into from
Oct 24, 2024

Conversation

jgongd
Copy link
Contributor

@jgongd jgongd commented Sep 17, 2024

Ticket

MD-493, MD-494

Description

  • Update log_policies json schema. To avoid any disruption, both legacy log policy and modern log policy are supported.
  • Store the name of the policy to the DB if a matching pattern is found in the log.
  • Rename logSignal to logPolicyMatched in the webUI.

Test Plan

At the release party, testing this PR together with #9959 could make things easier.

  1. Include below in the experiment config and verify if Test Signal are in run details page and run table.
log_policies:
 - name: "test"
   pattern: .*complete.*
image image
  1. Legacy log policy should still be accepted.
log_policies:
 - pattern: .*complete.*
   action:  
     type: cancel_retries
  1. If the experiment config doesn't specify any log policies, it will use the default log policies.
image
  1. Default log policies can be unset.
log_policies:
  - name: ECC Error
  - name: CUDA OOM

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Sep 17, 2024
Copy link

netlify bot commented Sep 17, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 57031d0
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/671ab73c167fbc000896541e

Copy link

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 41.61491% with 94 lines in your changes missing coverage. Please review.

Project coverage is 54.56%. Comparing base (deb3772) to head (57031d0).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
master/pkg/schemas/zgen_schemas.go 0.00% 40 Missing ⚠️
master/internal/logpattern/logpattern.go 5.40% 35 Missing ⚠️
master/pkg/schemas/expconf/log_pattern_config.go 88.05% 8 Missing ⚠️
.../src/pages/TrialDetails/Header/TrialHeaderLeft.tsx 20.00% 4 Missing ⚠️
...pkg/schemas/expconf/zgen_log_policies_config_v0.go 33.33% 2 Missing ⚠️
master/pkg/schemas/expconf/zgen_log_policy_v0.go 0.00% 2 Missing ⚠️
master/pkg/model/task_container_defaults.go 0.00% 1 Missing ⚠️
webui/react/src/pages/FlatRuns/columns.ts 0.00% 1 Missing ⚠️
webui/react/src/services/decoder.ts 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9947   +/-   ##
=======================================
  Coverage   54.55%   54.56%           
=======================================
  Files        1267     1265    -2     
  Lines      159525   159601   +76     
  Branches     3637     3637           
=======================================
+ Hits        87035    87089   +54     
- Misses      72357    72379   +22     
  Partials      133      133           
Flag Coverage Δ
backend 45.95% <42.10%> (+0.03%) ⬆️
harness 72.56% <ø> (ø)
web 54.02% <33.33%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
master/internal/api_experiment.go 57.18% <100.00%> (ø)
master/internal/api_runs.go 66.27% <100.00%> (ø)
...ternal/configpolicy/postgres_task_config_policy.go 91.66% <ø> (ø)
master/pkg/model/task.go 15.26% <ø> (ø)
...ster/pkg/schemas/extensions/disallow_properties.go 96.77% <ø> (ø)
webui/react/src/types.ts 99.70% <100.00%> (ø)
master/pkg/model/task_container_defaults.go 85.00% <0.00%> (+0.62%) ⬆️
webui/react/src/pages/FlatRuns/columns.ts 18.49% <0.00%> (ø)
webui/react/src/services/decoder.ts 21.32% <0.00%> (ø)
...pkg/schemas/expconf/zgen_log_policies_config_v0.go 33.33% <33.33%> (ø)
... and 5 more

... and 4 files with indirect coverage changes

@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch from 8679a19 to 176454d Compare September 29, 2024 23:42
@jgongd jgongd changed the title Log Signal Exp Config and Data Storing feat: Log Signal Exp Config and Data Storing Sep 29, 2024
@jgongd jgongd requested a review from gt2345 September 30, 2024 01:20
@jgongd jgongd marked this pull request as ready for review September 30, 2024 02:06
@jgongd jgongd requested review from a team as code owners September 30, 2024 02:06
@jgongd jgongd requested a review from hamidzr September 30, 2024 02:06
@hamidzr hamidzr removed their request for review September 30, 2024 16:05
@hamidzr
Copy link
Contributor

hamidzr commented Sep 30, 2024

cc @determined-ai/backend for backend review. thank you

InformationalReason: err.Error(),
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add some logs explaining why we want to clear signal when allocation exit fails

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow why we would want to clear it. It seems the point of this change is to track when some exceptional event happen and present them in a more user friendly manner. Losing that we had an ECC error on node xyz002 seems important to surface still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a trial runs into an OOM error, it's caught in log and a signal will be shown in the UI. I think the signal only appears for a brief moment since the trial will restart quickly. This means user might not see there was an OOM error. You are right. I feel I shouldn't clear it at restart.

Copy link
Contributor

@gt2345 gt2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on extensive tests!

@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch from 994f73f to 761f8f8 Compare October 10, 2024 04:17
@jgongd jgongd requested a review from salonig23 October 10, 2024 04:20
@jgongd jgongd changed the title feat: Log Signal Exp Config and Data Storing feat: Log Signal Exp Config and Monitoring Oct 10, 2024
Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know i'm a little late to the party here but log_signal / signal is a really confusing word to me for what this is. is it too late to just call it "log_policy_name"? it makes sense to me that way, that users would name their log policies and look, by name, at which one fired on a given run.

other than that, i've left some general go-related comments and i'm not sure we should be breaking the experiment config but mostly this looks fine.

type LogPoliciesConfigV0 []LogPolicyV0

// WithDefaults implements the Defaultable psuedointerface.
func (b *LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (b *LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {
func (b LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {

Don't use a pointer receiver here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For WithDefaults to work, I think you must match the pointerness of the receiver and return value.

So probably want to return a non-pointer as well, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

for _, p := range b {
if v, ok := patternToLp[p.RawPattern]; ok {
// Union merge actions
actions := make(set.Set[LogActionV0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use set.New

}
return b
}

// Merge implemenets the mergable interface.
func (b LogPoliciesConfigV0) Merge(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add unit test this merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -40,7 +75,8 @@ func (b LogPoliciesConfigV0) Merge(
type LogPolicyV0 struct {
RawPattern string `json:"pattern"`

RawAction LogActionV0 `json:"action"`
RawActions []LogActionV0 `json:"actions,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did anyone explicitly ask for multiple actions? today the available actions seem mutually exclusive in terms of when you would want them to me, and this is a breaking config change for anyone using this featuer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did anyone explicitly ask for multiple actions

The "signal" action shows up in the UI, but you might also want an action with a side-effect.

this is a breaking config change

We should be implementing shims to keep from breaking old configs.

@jgongd Do you have the examples of the desired config examples that we worked out together? I feel like @stoksc would benefit from seeing the overall goal here.

Copy link
Contributor Author

@jgongd jgongd Oct 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerned that it might be a breaking change as well!

Desired config example:

log_policies:
  - name: ECC Error
    pattern: .*uncorrectable ECC error encountered.*
    action: exclude_node   

Shim is implemented here.

Both legacy log policy and modern log policy are accepted now: https://github.com/determined-ai/determined/pull/9947/files#diff-ced209455ef7f0c29127fe9cc7a734f53cf0012d1268916b89afe0ed4f98ad17

Tested here.

@@ -29,7 +29,7 @@ type ExperimentConfigV0 struct {
RawEnvironment *EnvironmentConfigV0 `json:"environment"`
RawHyperparameters HyperparametersV0 `json:"hyperparameters"`
RawLabels LabelsV0 `json:"labels"`
RawLogPolicies LogPoliciesConfigV0 `json:"log_policies"`
RawLogPolicies *LogPoliciesConfigV0 `json:"log_policies"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this to pointers to a slice throughout the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! A slice doesn't need a pointer.

InformationalReason: err.Error(),
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow why we would want to clear it. It seems the point of this change is to track when some exceptional event happen and present them in a more user friendly manner. Losing that we had an ECC error on node xyz002 seems important to surface still.

@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch 4 times, most recently from e4e6acd to eea42c6 Compare October 17, 2024 06:37
@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch from 8816f3b to 50b0ebe Compare October 18, 2024 22:11
},
}
require.Equal(t, expected, tcd)
}

func TestLogPatternPoliciesMerging(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted this test to a unit test log_policy.yaml::legacy_log_policies_merging

@@ -61,20 +61,6 @@
type: directory
container_path: /path/on/disk

- name: log action cancel_retries
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to log_policies.yaml

@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch 2 times, most recently from 1d9d9df to 8095267 Compare October 19, 2024 04:34
@jgongd jgongd force-pushed the jgong/feat-log-signal-exp-config branch from 3110d2d to f08317d Compare October 24, 2024 01:35
@@ -522,6 +522,7 @@ type Run struct {
LogRetentionDays *int16 `db:"log_retention_days"`
Metadata map[string]any `db:"metadata" bun:"metadata,scanonly"`
LocalID int `db:"local_id"`
LogPolicyMatched *string `db:"log_policy_matched"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to add LogPolicyMatched to experiment model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought /api/v1/runs query uses this model. But checking the code again, I see that it actually uses the go struct generated from message Flatrun in the proto file. Thank you, great catch!

); err != nil {
return fmt.Errorf("adding retry on different node: %w", err)
if policy.Name() != nil {
err = db.Bun().RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally this wouldn't be in a separate transaction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe i don’t understand the ideal approach you mentioned yet. My intention is to put two update queries in the same transaction, so they either both succeed or failed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, i was totally unclear. i agree these two query should be in a transaction, but i think there are also writes in addRetryOnDifferentNode and addDontRetry that are in a separate transaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Queries of one policy are either all succeed or all fail.

var lat LogActionType
if err := json.Unmarshal(data, &lat); err == nil {
switch lat {
case LogActionTypeCancelRetries, LogActionTypeExcludeNode:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the type isn't an expected one we end up fmt.Errorf'ing a nil err.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add default to switch.

@stoksc
Copy link
Contributor

stoksc commented Oct 24, 2024

lgtm, only had some non-blocking (i.e., please address but i dont need to review.. i need a better word for this), minor feedback

@jgongd jgongd merged commit c71617c into main Oct 24, 2024
81 of 95 checks passed
@jgongd jgongd deleted the jgong/feat-log-signal-exp-config branch October 24, 2024 21:43
thiagodallacqua-hpe pushed a commit that referenced this pull request Oct 28, 2024
thiagodallacqua-hpe pushed a commit that referenced this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants