Initial worker forking #1138

Open
vigoo opened this issue Dec 5, 2024 · 0 comments

vigoo commented Dec 5, 2024

Forking a worker is a general-purpose feature that will be implemented in the worker executor. For now, nothing user-facing in Golem OSS uses it.

The forking operation takes three parameters:

  • source worker id
  • target worker id
  • oplog index cutoff

The request must be executed in the source worker's worker executor, but the target worker id does not have to belong to a shard that is hosted by that executor.
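
As a rough illustration, the three parameters could be carried in a request type like the one below. This is only a sketch: `WorkerId`, `OplogIndex`, `ForkWorkerRequest` and `check` are hypothetical stand-ins, not Golem's actual types or gRPC messages, and the rule that the cutoff must be at least 1 is an assumption (it presumes 1-based oplog indices, so that the first entry is always included).

```rust
/// Hypothetical stand-in for a worker identifier (not Golem's actual type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct WorkerId(pub String);

/// Hypothetical oplog index. Assumption: indices are 1-based.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct OplogIndex(pub u64);

/// The three parameters of the fork operation.
#[derive(Debug, Clone)]
pub struct ForkWorkerRequest {
    pub source_worker_id: WorkerId,
    pub target_worker_id: WorkerId,
    pub oplog_index_cutoff: OplogIndex,
}

impl ForkWorkerRequest {
    /// Cheap sanity checks that need no access to storage.
    /// Existence checks for the two workers would happen later, in the executor.
    pub fn check(&self) -> Result<(), String> {
        if self.source_worker_id == self.target_worker_id {
            return Err("source and target worker ids must differ".to_string());
        }
        if self.oplog_index_cutoff.0 == 0 {
            // Assumed rule: the cutoff must cover at least the first entry.
            return Err("oplog index cutoff must be at least 1".to_string());
        }
        Ok(())
    }
}

fn main() {
    let request = ForkWorkerRequest {
        source_worker_id: WorkerId("worker-a".to_string()),
        target_worker_id: WorkerId("worker-b".to_string()),
        oplog_index_cutoff: OplogIndex(10),
    };
    println!("{:?} -> {:?}", request, request.check());
}
```

The request itself says nothing about where the target worker lives; only the source worker id determines which executor has to handle it.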

The implementation of this ticket must cover the following areas:

  • Define a new endpoint in the worker executor's gRPC API
  • Create or update an internal service in worker executor - most likely DefaultWorkerService is a good fit to hold this new functionality, but in case it causes difficulties, a new one can be introduced as well
  • Implement the actual worker forking in this service, and wire it to the gRPC request handler
  • Extend the test framework to be able to call fork from tests
  • Write at least one worker executor test for this

The following list is a draft of what steps the implementation would do in order to perform the forking:

  • Validate that the target worker ID does not exist and the source worker ID does exist
  • Get a Worker instance for the source worker with get_or_create_suspended - we need to acquire the instance, but we don't want to start the worker if it was not already running
  • Read the worker's oplog using the Oplog provided by Worker up to the oplog index cutoff
  • Create a new Oplog (using the OplogService) for the target worker, and append all the elements - NOTE that the first element, Create (or CreateV1), must be altered to contain the new worker ID, as that is the primary source of truth for the identity of a worker.
  • Use the worker service (by extending WorkerProxy) to resume the newly created worker. This has to go through the worker service because the new worker may live in another worker executor, depending on sharding.
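
The validation, copy and rewrite steps above can be sketched against a toy in-memory store. Everything here is an assumption made for illustration: `InMemoryOplogs`, `OplogEntry`, `ForkError` and `fork` are hypothetical and do not mirror the real Oplog, OplogService or Worker APIs. The sketch also assumes the cutoff is inclusive and 1-based, and it does not model acquiring the suspended Worker instance or resuming the target through the worker service.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a worker identifier.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct WorkerId(String);

/// Simplified oplog entry: only the Create entry carries the worker identity.
#[derive(Debug, Clone, PartialEq)]
enum OplogEntry {
    Create { worker_id: WorkerId },
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ForkError {
    SourceNotFound,
    TargetAlreadyExists,
    InvalidCutoff,
}

/// Toy replacement for oplog storage: one vector of entries per worker.
#[derive(Default)]
struct InMemoryOplogs {
    oplogs: HashMap<WorkerId, Vec<OplogEntry>>,
}

impl InMemoryOplogs {
    /// Copies the first `cutoff` entries of the source oplog to the target.
    /// Assumption: `cutoff` is the 1-based index of the last entry to copy.
    fn fork(
        &mut self,
        source: &WorkerId,
        target: &WorkerId,
        cutoff: usize,
    ) -> Result<(), ForkError> {
        // Validate: the target must not exist, the source must exist.
        if self.oplogs.contains_key(target) {
            return Err(ForkError::TargetAlreadyExists);
        }
        let source_oplog = self.oplogs.get(source).ok_or(ForkError::SourceNotFound)?;
        if cutoff == 0 || cutoff > source_oplog.len() {
            return Err(ForkError::InvalidCutoff);
        }

        // Read the source oplog up to the cutoff.
        let mut copied: Vec<OplogEntry> = source_oplog[..cutoff].to_vec();

        // Rewrite the first entry so it names the new worker.
        if let Some(OplogEntry::Create { worker_id }) = copied.first_mut() {
            *worker_id = target.clone();
        }

        // Create the target oplog. Resuming the target worker is not modeled.
        self.oplogs.insert(target.clone(), copied);
        Ok(())
    }
}

fn main() {
    let source = WorkerId("source".to_string());
    let target = WorkerId("target".to_string());
    let mut store = InMemoryOplogs::default();
    store.oplogs.insert(
        source.clone(),
        vec![
            OplogEntry::Create { worker_id: source.clone() },
            OplogEntry::Other("first invocation".to_string()),
            OplogEntry::Other("second invocation".to_string()),
        ],
    );

    let result = store.fork(&source, &target, 2);
    println!("{:?}: {:?}", result, store.oplogs.get(&target));
}
```

A worker executor test for the real feature would presumably check the same properties: the target oplog stops at the cutoff, its first entry carries the target worker ID, and the source oplog is unchanged.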

By completing this ticket, we have a new exposed worker forking feature which "works", although not completely correctly yet: at this point the forked worker will replay from the start with the new worker id, which can lead to divergence. A separate ticket will improve this situation.
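
To make the divergence concern concrete, here is a deliberately simplified toy model. It is not Golem's replay logic: `Entry`, `greeting` and `first_divergence` are hypothetical, and the sketch simply assumes that some recorded result depended on the worker's own id. Replaying the copied entries under the new id then recomputes a value that no longer matches what the source worker recorded.

```rust
/// Simplified oplog entry for this toy model.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Create { worker_id: String },
    /// Result of a computation the worker performed while running live.
    Recorded { value: String },
}

/// Toy computation whose result depends on the worker's own identity.
fn greeting(worker_id: &str) -> String {
    format!("hello from {worker_id}")
}

/// Replays the oplog using the identity stored in the Create entry and
/// returns the index of the first recorded value that no longer matches.
fn first_divergence(oplog: &[Entry]) -> Option<usize> {
    let worker_id = match oplog.first() {
        Some(Entry::Create { worker_id }) => worker_id.as_str(),
        // An oplog without a leading Create entry is treated as diverging at 0.
        _ => return Some(0),
    };
    oplog.iter().enumerate().skip(1).find_map(|(index, entry)| match entry {
        Entry::Recorded { value } if *value != greeting(worker_id) => Some(index),
        _ => None,
    })
}

fn main() {
    // Oplog of the source worker: the recorded value was computed as "w1".
    let source = vec![
        Entry::Create { worker_id: "w1".to_string() },
        Entry::Recorded { value: greeting("w1") },
    ];

    // Forked oplog: same entries, but Create now names the new worker.
    let mut forked = source.clone();
    forked[0] = Entry::Create { worker_id: "w2".to_string() };

    println!("source diverges at: {:?}", first_divergence(&source));
    println!("forked diverges at: {:?}", first_divergence(&forked));
}
```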
