Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of out of order messages from direct connected hosts #305

Conversation

dehort
Copy link
Contributor

@dehort dehort commented Sep 19, 2023

What?

Improve handling of out of order messages from direct connected hosts
RHCLOUD-28028

Why?

It is possible that the response consumer receives a run complete message (success or failure) before receiving a run incomplete message. When this happens, then run will go from having a status of success or failure (run complete) to having a status of running. The job will eventually timeout and data (run output, status complete, etc) is lost.

How?

Describe how the change is implemented. Any noteable new libaries, APIs, or features.

Testing

A couple of unit tests were added.

Anything Else?

Any other notes about the PR that would be useful for the reviewer.

Secure Coding Practices Checklist Link

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

@dehort dehort changed the title Put together a failing test [WIP] Put together a failing test Sep 19, 2023
@dehort
Copy link
Contributor Author

dehort commented Sep 26, 2023

/retest

@dehort dehort changed the title [WIP] Put together a failing test [WIP] Improve handling of out of order messages from direct connected hosts Sep 29, 2023
@dehort dehort changed the title [WIP] Improve handling of out of order messages from direct connected hosts Improve handling of out of order messages from direct connected hosts Sep 29, 2023
@dehort dehort force-pushed the improve_handling_of_out_of_order_messages branch from 7ef4b15 to 2e0501c Compare September 29, 2023 20:28
@dehort dehort marked this pull request as ready for review September 29, 2023 20:41
@dehort dehort requested a review from tahmidefaz September 29, 2023 20:41
Comment on lines 218 to 231
func createRecord(ctx context.Context, tx *gorm.DB, toCreate []db.RunHost) error {

successOrFailure := clause.OrConditions{Exprs: []clause.Expression{
clause.Eq{Column: "run_hosts.status", Value: db.RunStatusSuccess},
clause.Eq{Column: "run_hosts.status", Value: db.RunStatusFailure},
}}

notMarkedAsComplete := clause.Where{Exprs: []clause.Expression{clause.Not(successOrFailure)}}

createResult := tx.Model(db.RunHost{}).
Clauses(clause.OnConflict{
Where: notMarkedAsComplete,
Columns: []clause.Column{{Name: "run_id"}, {Name: "host"}},
DoUpdates: clause.AssignmentColumns([]string{"status", "log"}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dehort Looking good! Any idea if this notMarkedAsComplete check might be redundant, since we are checking for it through the where clause in the main function?

Also, any idea if we need to do the same for satellite requests? I think satellite payloads have a sat_sequence number, which kind of protects us against out of order messages, but might be good to double check. 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the notMarkedAsComplete where clause is needed here because we are only passing the transaction object to the createRecord() method. We don't actually pass the query object that we built in the calling function.

Yeah, the sat_sequence field protect us from out of order messages at the host level. But... I think the change in this PR protects us from marking a top level run as complete or failed and then marking it as running if we received a satellite message out of order. I added another test for this.

Thank you, @tahmidefaz !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thank you!

@dehort dehort force-pushed the improve_handling_of_out_of_order_messages branch from 2e0501c to 50b9ba3 Compare October 4, 2023 22:17
@dehort
Copy link
Contributor Author

dehort commented Oct 5, 2023

/retest

Copy link
Member

@tahmidefaz tahmidefaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@tahmidefaz tahmidefaz merged commit 6889b4f into RedHatInsights:master Oct 5, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants