feat: Log Signal Exp Config and Monitoring #9947
Codecov Report: coverage diff, main vs. #9947
            main      #9947     +/-
Coverage    54.55%    54.56%
Files       1267      1265      -2
Lines       159525    159601    +76
Branches    3637      3637
+ Hits      87035     87089     +54
- Misses    72357     72379     +22
Partials    133       133
cc @determined-ai/backend for backend review. Thank you.
master/internal/trial.go
Outdated
		InformationalReason: err.Error(),
	})
}
Can we please add some logs explaining why we want to clear the signal when the allocation exit fails?
I don't follow why we would want to clear it. It seems the point of this change is to track when exceptional events happen and present them in a more user-friendly manner. Losing the fact that we had an ECC error on node xyz002 seems important to surface still.
If a trial runs into an OOM error, it's caught in the log and a signal will be shown in the UI. I think the signal only appears for a brief moment since the trial will restart quickly. This means the user might not see that there was an OOM error. You are right. I feel I shouldn't clear it at restart.
Great work on extensive tests!
i know i'm a little late to the party here but log_signal / signal is a really confusing word to me for what this is. is it too late to just call it "log_policy_name"? it makes sense to me that way, that users would name their log policies and look, by name, at which one fired on a given run.
other than that, i've left some general go-related comments and i'm not sure we should be breaking the experiment config but mostly this looks fine.
type LogPoliciesConfigV0 []LogPolicyV0

// WithDefaults implements the Defaultable pseudointerface.
func (b *LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {
Suggested change:
- func (b *LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {
+ func (b LogPoliciesConfigV0) WithDefaults() *LogPoliciesConfigV0 {
Don't use a pointer receiver here.
For WithDefaults to work, I think you must match the pointerness of the receiver and return value.
So probably want to return a non-pointer as well, I think.
Thank you!
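The receiver and return-type point raised in this thread can be sketched as follows. This is a minimal illustration, not the PR's code: `LogPolicy` and `LogPolicies` are simplified stand-ins for `LogPolicyV0` and `LogPoliciesConfigV0`, and the default chosen here (an empty, non-nil list) is an assumption made only for the example.

```go
package main

import "fmt"

// LogPolicy and LogPolicies are hypothetical, simplified stand-ins for
// LogPolicyV0 and LogPoliciesConfigV0.
type LogPolicy struct {
	Pattern string
	Action  string
}

type LogPolicies []LogPolicy

// WithDefaults uses a value receiver and returns a value, so the
// "pointerness" of the receiver and the return type match.
func (b LogPolicies) WithDefaults() LogPolicies {
	if b == nil {
		// Assumed default for this sketch: an empty, non-nil list.
		return LogPolicies{}
	}
	return b
}

func main() {
	var unset LogPolicies
	fmt.Println(unset == nil, unset.WithDefaults() == nil)
}
```

A slice is already a small header that refers to its backing array, so a value receiver copies very little and still lets the method be called on a nil slice.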
for _, p := range b {
	if v, ok := patternToLp[p.RawPattern]; ok {
		// Union merge actions
		actions := make(set.Set[LogActionV0])
use set.New
	}
	return b
}

// Merge implements the mergeable interface.
func (b LogPoliciesConfigV0) Merge(
can you add a unit test for this merge?
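As a rough picture of the behavior such a test might pin down, here is a self-contained sketch of merging two policy lists by pattern and taking the union of their actions. Everything in it is an assumption for illustration: `Policy` and `mergePolicies` are hypothetical names, actions are plain strings, and the real `Merge` may order or combine policies differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Policy is a hypothetical, simplified stand-in for LogPolicyV0.
type Policy struct {
	Pattern string
	Actions []string
}

// mergePolicies is a hypothetical helper: policies that share a pattern are
// combined into one entry whose actions are the union of all inputs.
// Patterns keep first-seen order; actions are sorted for determinism.
func mergePolicies(a, b []Policy) []Policy {
	var order []string
	actions := map[string]map[string]struct{}{}
	for _, list := range [][]Policy{a, b} {
		for _, p := range list {
			if _, ok := actions[p.Pattern]; !ok {
				actions[p.Pattern] = map[string]struct{}{}
				order = append(order, p.Pattern)
			}
			for _, act := range p.Actions {
				actions[p.Pattern][act] = struct{}{}
			}
		}
	}
	merged := make([]Policy, 0, len(order))
	for _, pat := range order {
		acts := make([]string, 0, len(actions[pat]))
		for act := range actions[pat] {
			acts = append(acts, act)
		}
		sort.Strings(acts)
		merged = append(merged, Policy{Pattern: pat, Actions: acts})
	}
	return merged
}

func main() {
	a := []Policy{{Pattern: "a", Actions: []string{"cancel_retries"}}}
	b := []Policy{
		{Pattern: "a", Actions: []string{"exclude_node"}},
		{Pattern: "b", Actions: []string{"exclude_node"}},
	}
	fmt.Println(mergePolicies(a, b))
}
```

A unit test along these lines would assert that duplicate patterns collapse into one policy and that no action is lost or duplicated.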
@@ -40,7 +75,8 @@ func (b LogPoliciesConfigV0) Merge(
type LogPolicyV0 struct {
	RawPattern string `json:"pattern"`

	RawAction LogActionV0 `json:"action"`
	RawActions []LogActionV0 `json:"actions,omitempty"`
did anyone explicitly ask for multiple actions? today the available actions seem mutually exclusive to me in terms of when you would want them, and this is a breaking config change for anyone using this feature.
> did anyone explicitly ask for multiple actions
The "signal" action shows up in the UI, but you might also want an action with a side effect.
> this is a breaking config change
We should be implementing shims to keep from breaking old configs.
@jgongd Do you have the desired config examples that we worked out together? I feel like @stoksc would benefit from seeing the overall goal here.
I was concerned that it might be a breaking change as well!
Desired config example:
log_policies:
  - name: ECC Error
    pattern: .*uncorrectable ECC error encountered.*
    action: exclude_node
Shim is implemented here.
Both legacy log policy and modern log policy are accepted now: https://github.com/determined-ai/determined/pull/9947/files#diff-ced209455ef7f0c29127fe9cc7a734f53cf0012d1268916b89afe0ed4f98ad17
Tested here.
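One way a shim of this kind can work is a custom `UnmarshalJSON` that accepts both shapes and normalizes them. The sketch below is not the PR's implementation. It assumes, based only on the `action` and `actions` JSON tags quoted in the diff above, that a legacy policy carries a single `action` and a newer one carries an `actions` list; the `LogPolicy` type and `parsePolicies` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LogPolicy is a hypothetical, simplified policy type. Which fields count as
// "legacy" is an assumption made for this sketch.
type LogPolicy struct {
	Name    string   `json:"name,omitempty"`
	Pattern string   `json:"pattern"`
	Action  string   `json:"action,omitempty"`
	Actions []string `json:"actions,omitempty"`
}

// UnmarshalJSON accepts either form and folds a single "action" into the
// "actions" list so the rest of the code only handles one shape.
func (p *LogPolicy) UnmarshalJSON(data []byte) error {
	// plain has the same fields but no methods, which avoids infinite
	// recursion back into this UnmarshalJSON.
	type plain LogPolicy
	var raw plain
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	if raw.Action != "" {
		raw.Actions = append(raw.Actions, raw.Action)
		raw.Action = ""
	}
	*p = LogPolicy(raw)
	return nil
}

// parsePolicies is a hypothetical helper that decodes a JSON list of policies.
func parsePolicies(data []byte) ([]LogPolicy, error) {
	var ps []LogPolicy
	err := json.Unmarshal(data, &ps)
	return ps, err
}

func main() {
	in := []byte(`[
		{"pattern": "a", "action": "exclude_node"},
		{"name": "ECC Error", "pattern": "b", "actions": ["exclude_node"]}
	]`)
	ps, err := parsePolicies(in)
	fmt.Println(ps, err)
}
```

Normalizing at decode time keeps old configs working without forcing every consumer to check two fields.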
@@ -29,7 +29,7 @@ type ExperimentConfigV0 struct {
	RawEnvironment *EnvironmentConfigV0 `json:"environment"`
	RawHyperparameters HyperparametersV0 `json:"hyperparameters"`
	RawLabels LabelsV0 `json:"labels"`
-	RawLogPolicies LogPoliciesConfigV0 `json:"log_policies"`
+	RawLogPolicies *LogPoliciesConfigV0 `json:"log_policies"`
why change this to pointers to a slice throughout the PR?
Good point! A slice doesn't need a pointer.
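To illustrate why the pointer is unnecessary: a Go slice can already be nil, so "not set" and "set but empty" are distinguishable without wrapping the slice in a pointer. The following sketch uses a hypothetical `Config` type and `parse` helper with a plain string slice, not the real experiment config.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical stand-in for the experiment config; the field is a
// plain slice rather than a pointer to a slice.
type Config struct {
	LogPolicies []string `json:"log_policies"`
}

// parse is a hypothetical helper that decodes a JSON document into Config.
func parse(data string) (Config, error) {
	var c Config
	err := json.Unmarshal([]byte(data), &c)
	return c, err
}

func main() {
	unset, _ := parse(`{}`)
	empty, _ := parse(`{"log_policies": []}`)
	// The omitted field stays nil; the explicit empty list does not.
	fmt.Println(unset.LogPolicies == nil, empty.LogPolicies == nil)
}
```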
		},
	}
	require.Equal(t, expected, tcd)
}

func TestLogPatternPoliciesMerging(t *testing.T) {
Converted this test to a unit test log_policy.yaml::legacy_log_policies_merging
@@ -61,20 +61,6 @@
type: directory
container_path: /path/on/disk

- name: log action cancel_retries
Moved to log_policies.yaml
action is a legacy field now
master/pkg/model/experiment.go
Outdated
@@ -522,6 +522,7 @@ type Run struct {
	LogRetentionDays *int16 `db:"log_retention_days"`
	Metadata map[string]any `db:"metadata" bun:"metadata,scanonly"`
	LocalID int `db:"local_id"`
	LogPolicyMatched *string `db:"log_policy_matched"`
Why do we want to add LogPolicyMatched to the experiment model?
I thought the /api/v1/runs query used this model. But checking the code again, I see that it actually uses the Go struct generated from message Flatrun in the proto file. Thank you, great catch!
); err != nil {
	return fmt.Errorf("adding retry on different node: %w", err)
if policy.Name() != nil {
	err = db.Bun().RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
ideally this wouldn't be in a separate transaction
Maybe I don't understand the ideal approach you mentioned yet. My intention is to put the two update queries in the same transaction, so they either both succeed or both fail.
sorry, i was totally unclear. i agree these two queries should be in a transaction, but i think there are also writes in addRetryOnDifferentNode and addDontRetry that are in a separate transaction.
I see. The queries for one policy should either all succeed or all fail.
var lat LogActionType
if err := json.Unmarshal(data, &lat); err == nil {
	switch lat {
	case LogActionTypeCancelRetries, LogActionTypeExcludeNode:
if the type isn't an expected one we end up fmt.Errorf'ing a nil err.
Add default to switch.
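The issue in this thread is that when decoding succeeds but the value is not a known type, there is no real error to wrap, so formatting `err` produces a message built around nil. A minimal sketch of the default-case fix follows. The identifiers `LogActionType`, `LogActionTypeCancelRetries`, and `LogActionTypeExcludeNode` come from the snippet above; their string values and the `parseActionType` helper are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type LogActionType string

// The string values are assumed for this sketch.
const (
	LogActionTypeCancelRetries LogActionType = "cancel_retries"
	LogActionTypeExcludeNode   LogActionType = "exclude_node"
)

// parseActionType is a hypothetical helper. It returns a distinct, non-nil
// error for each failure mode instead of wrapping a nil err.
func parseActionType(data []byte) (LogActionType, error) {
	var lat LogActionType
	if err := json.Unmarshal(data, &lat); err != nil {
		return "", fmt.Errorf("unmarshaling log action type: %w", err)
	}
	switch lat {
	case LogActionTypeCancelRetries, LogActionTypeExcludeNode:
		return lat, nil
	default:
		// Decoding succeeded, so build a fresh error describing the value.
		return "", fmt.Errorf("unrecognized log action type: %q", lat)
	}
}

func main() {
	fmt.Println(parseActionType([]byte(`"exclude_node"`)))
	fmt.Println(parseActionType([]byte(`"reboot"`)))
}
```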
lgtm, only had some non-blocking (i.e., please address but i dont need to review.. i need a better word for this), minor feedback
Co-authored-by: Ryan <[email protected]>
Ticket
MD-493, MD-494
Description
log_policies json schema. To avoid any disruption, both legacy log policy and modern log policy are supported.
pattern is found in the log.
Test Plan
At the release party, testing this PR together with #9959 could make things easier.
Test Signal are in run details page and run table.
Checklist
docs/release-notes/
See Release Note for details.