Initial worker forking #1138

vigoo · 2024-12-05T14:39:18Z

Forking a worker is a general-purpose feature that will be implemented in the worker executor, but it is not yet used for anything user-facing in Golem OSS.

The forking operation takes three parameters:

source worker id
target worker id
oplog index cutoff
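
As a rough illustration of the three parameters, the request could be modelled as below. This is only a sketch: the struct name, field types, and the `validate` helper are hypothetical and are not taken from Golem's gRPC definitions. It also assumes that oplog indices start at 1, so a cutoff of 0 would copy nothing.

```rust
/// Hypothetical request shape for the fork operation.
/// Names and types are illustrative, not Golem's actual gRPC or Rust types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ForkWorkerRequest {
    pub source_worker_id: String,
    pub target_worker_id: String,
    /// Assumed to be 1-based and inclusive: entries 1..=cutoff are copied.
    pub oplog_index_cutoff: u64,
}

impl ForkWorkerRequest {
    /// Sanity checks that need no access to any worker or oplog storage.
    pub fn validate(&self) -> Result<(), String> {
        if self.source_worker_id == self.target_worker_id {
            return Err("source and target worker ids must differ".to_string());
        }
        if self.oplog_index_cutoff == 0 {
            // Assumption: index 0 means "no entry", so nothing would be copied.
            return Err("cutoff must include at least the first oplog entry".to_string());
        }
        Ok(())
    }
}
```

Checks that depend on stored state, such as whether the target already exists, belong to the fork implementation itself rather than to the request type.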

The request must be executed by the worker executor that hosts the source worker, but the target worker id does not have to belong to a shard hosted by that executor.
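
The routing rule can be illustrated with a toy model. Everything here is an assumption made for the example: the byte-sum shard function, the executor names, and the function names are hypothetical and do not reflect how Golem actually assigns shards.

```rust
/// Toy shard assignment: sum of the id's bytes modulo the shard count.
/// This is a stand-in, not Golem's real sharding function.
fn shard_of(worker_id: &str, shard_count: u32) -> u32 {
    worker_id.bytes().map(u32::from).sum::<u32>() % shard_count
}

/// Picks the executor that handles a fork request.
/// `shard_owner[i]` is the (hypothetical) name of the executor hosting shard `i`.
/// Only the source worker id matters; the target may map to any shard.
fn executor_for_fork(
    source: &str,
    _target: &str,
    shard_owner: &[&'static str],
) -> &'static str {
    shard_owner[shard_of(source, shard_owner.len() as u32) as usize]
}
```

In this model the target's shard plays no part in choosing where the fork request runs, which is why resuming the forked worker may later involve a different executor.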

The implementation of this ticket must cover the following areas:

Define a new endpoint in the worker executor's gRPC API
Create or update an internal service in the worker executor. DefaultWorkerService is most likely a good fit for this new functionality, but if that causes difficulties, a new service can be introduced instead
Implement the actual worker forking in this service, and wire it to the gRPC request handler
Extend the test framework to be able to call fork from tests
Write at least one worker executor test for this

The following list is a draft of the steps the implementation would take to perform the fork:

Validate that the target worker ID does not exist and the source worker ID does exist
Get a Worker instance for the source worker with get_or_create_suspended - we don't want to start it if it was not running but we need to acquire the instance
Read the worker's oplog using the Oplog provided by Worker up to the oplog index cutoff
Create a new Oplog (using the OplogService) for the target worker, and append all the elements - NOTE that the first element, Create (or CreateV1) must be altered to contain the new worker ID, as that is the primary source of truth for the identity of a worker.
Use the worker service (by extending WorkerProxy) to resume the newly created worker. It has to go through the worker service because the new worker may live in another worker executor, depending on sharding.
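
The validation, copy, and rewrite steps above can be sketched against a simplified in-memory model. This is not the worker executor's code: `InMemoryOplogs`, `OplogEntry`, and `ForkError` are hypothetical stand-ins for the real Oplog and OplogService types, the cutoff is assumed to be 1-based and inclusive, and the final resume step through the worker service is left out.

```rust
use std::collections::HashMap;

/// Simplified oplog entry; the real oplog has many more entry kinds.
#[derive(Debug, Clone, PartialEq)]
enum OplogEntry {
    Create { worker_id: String },
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ForkError {
    SourceNotFound,
    TargetAlreadyExists,
}

/// In-memory stand-in for oplog storage, keyed by worker id.
#[derive(Default)]
struct InMemoryOplogs {
    oplogs: HashMap<String, Vec<OplogEntry>>,
}

impl InMemoryOplogs {
    /// Copies the first `cutoff` entries of the source oplog to a new target
    /// oplog, rewriting the leading Create entry to carry the target's id.
    fn fork(&mut self, source: &str, target: &str, cutoff: usize) -> Result<(), ForkError> {
        // The target must not exist and the source must exist.
        if self.oplogs.contains_key(target) {
            return Err(ForkError::TargetAlreadyExists);
        }
        let source_oplog = self.oplogs.get(source).ok_or(ForkError::SourceNotFound)?;

        // Read up to the cutoff and alter the first entry's worker id.
        let forked: Vec<OplogEntry> = source_oplog
            .iter()
            .take(cutoff)
            .enumerate()
            .map(|(idx, entry)| match entry {
                OplogEntry::Create { .. } if idx == 0 => OplogEntry::Create {
                    worker_id: target.to_string(),
                },
                other => other.clone(),
            })
            .collect();

        // Create the target oplog with the copied entries.
        self.oplogs.insert(target.to_string(), forked);
        Ok(())
    }
}
```

Only the first entry is rewritten, which matches the note that Create is the primary source of truth for a worker's identity. Any worker id recorded in later entries is copied unchanged in this sketch.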

By completing this ticket, we have a new exposed worker forking feature which "works", although not completely correctly yet. At this point the forked worker will replay with the new worker id from the start, which can lead to divergence. A separate ticket will improve this situation.

vigoo assigned afsalthaj Dec 5, 2024

vigoo added this to the Golem 1.2 milestone Dec 5, 2024

This was referenced Dec 5, 2024

Divergence-free worker forking #1139

Open

Expose worker forking as a host function #1151

Open
