Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the undo API in worker executor #1180

Open
vigoo opened this issue Dec 13, 2024 · 0 comments
Open

Implement the undo API in worker executor #1180

vigoo opened this issue Dec 13, 2024 · 0 comments
Assignees
Milestone

Comments

@vigoo
Copy link
Contributor

vigoo commented Dec 13, 2024

The "undo" feature is a new worker executor function exposed through it's gRPC API, having two possible modes:

  • undo the worker to a specific oplog index
  • convenience feature: undo the last N invocations

The "last N invocations" can be implemented on top of the "undo to the given oplog index" feature by reading back from the end of the oplog looking for the nth ExportedFunctionInvoked entry.

On the execution level, the undo feature will take advantage of the already existing "deleted regions" support. Similarly how the Jump entry is implemented, when we undo something in the worker, that will mark the "undone" region as deleted.

In details, implementation of this ticket should consist of the following steps:

  • Define the new gRPC API
  • Add a new oplog entry Undo (it should to be different from Jump both to be able to differentiate them, and also because we may have to treat them differently when calculating some part of the worker status)
  • In the implementation, use get_or_create_suspended to get the Worker and check its status. Undo should only work on Idle or Failed or Exited workers - we don't allow undo when it is executing something (running/suspended/interrupted).
  • If the request specifies "last n invocation"s, use the worker's Oplog to translate that into a specific oplog index
  • Write the oplog entry and initiate a restart (by doing a special interrupt - I think the InterruptKind::Jump can be reused here)
  • Modify calculate_last_known_status to take the new oplog entry into account when calculating the deleted regions, and also when calculating WorkerStatus (as we have to "get back" . NOTE: I think the proper implementation is to change it that it first calculates the final deleted regions, and then in all the other "folds" it takes the deleted regions into account - right now this is not the case and it can be considered a bug.
  • Add the new undo API into the test DSL
  • Write worker executor tests that tries out this feature (both versions)

This is not split further into smaller steps, as they only make sense and only testable together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants