WIP: Gateway cache #630

Closed
wants to merge 35 commits into from

Conversation

afsalthaj
Contributor

@afsalthaj afsalthaj commented Jul 1, 2024

This is still WIP due to: https://zivergeteam.slack.com/archives/C057S2E4XT5/p1719840912374849

Fix #586

  • With the new golem-rib, function-name parsing is reused between Rib and WorkerService.
  • Introduced a new function in worker-service that takes a ParsedFunctionName (because Rib has already parsed the input function into a ParsedFunctionName).
  • The same function takes the input parameter types and output result types, because the evaluation context of Rib already loads these details.
  • These details are cached per component-id corresponding to a gateway request.
  • Given that ComponentMetadata is already loaded as part of forming the EvaluationContext, the new worker-service functionality doesn't need to load this metadata again.
  • If execution fails during the function invocation, the cached evaluation context is invalidated and the invocation is retried (see the sketch below).
  • While implementing this feature, it turned out that function names were still being parsed not just twice but three or four times; this is fixed by refactoring worker-service.
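A rough sketch of the intended cache-plus-invalidate-and-retry flow. All names and types below are illustrative placeholders, not the actual worker-service APIs:

use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone)]
struct ComponentElements {
    // parameter and result types per exported function, as loaded into the
    // Rib evaluation context (hypothetical, simplified shape)
    functions: HashMap<String, (Vec<String>, Vec<String>)>,
}

trait MetadataFetch {
    fn fetch(&self, component_id: &str) -> Result<ComponentElements, String>;
}

struct CachedMetadata<F: MetadataFetch> {
    fetcher: F,
    cache: Mutex<HashMap<String, ComponentElements>>,
}

impl<F: MetadataFetch> CachedMetadata<F> {
    fn get(&self, component_id: &str) -> Result<ComponentElements, String> {
        let mut cache = self.cache.lock().unwrap();
        if let Some(hit) = cache.get(component_id) {
            return Ok(hit.clone());
        }
        let fresh = self.fetcher.fetch(component_id)?;
        cache.insert(component_id.to_string(), fresh.clone());
        Ok(fresh)
    }

    fn invalidate(&self, component_id: &str) {
        self.cache.lock().unwrap().remove(component_id);
    }
}

// Invoke using cached metadata; if the invocation fails, invalidate the cache
// entry and retry once with freshly fetched metadata.
fn invoke_with_cache<F: MetadataFetch>(
    cached: &CachedMetadata<F>,
    component_id: &str,
    invoke: impl Fn(&ComponentElements) -> Result<String, String>,
) -> Result<String, String> {
    let elements = cached.get(component_id)?;
    match invoke(&elements) {
        Ok(result) => Ok(result),
        Err(_) => {
            cached.invalidate(component_id);
            let refreshed = cached.get(component_id)?;
            invoke(&refreshed)
        }
    }
}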

@afsalthaj
Contributor Author

After more investigation, there is even more function-name parsing in other parts of worker-service.
Let me try to fix those and raise this again.

@afsalthaj afsalthaj marked this pull request as ready for review July 1, 2024 14:25
@afsalthaj afsalthaj changed the title Gateway cache WIP: Gateway cache Jul 1, 2024
@vigoo
Contributor

vigoo commented Jul 2, 2024

General note: should not be merged before #585 so we see the effect. (And that should not be done before merging #624)

Contributor

@vigoo vigoo left a comment

Added a first set of comments. One of them questions the whole approach, so let's make sure we discuss those before moving on.

In addition to that, I think we also need to prove this all works with integration tests that invoke workers through the worker service both directly and via the API gateway (I'm still a bit confused by what we call what now), with different scenarios:

  • invoking existing worker first time, then again when things are cached
  • invoking new worker first time, then again
  • above scenarios but new component versions added
  • above scenarios but worker itself gets updated
  • ...

This is complicated, and we don't have any coverage for it at this level.

self.get_export_function(parsed)
}

pub fn get_export_function(
Contributor

I know this is not a change in this PR; just for the record, now that I'm looking at it, it is full of unnecessary clones. (functions() clones exports, then clones the function again, returns a new collection, and then here it is cloned again, etc.)

Contributor Author

It was a refactoring of what already existed there, as in I moved the code, but sure, I can change those clone bits.

version: ComponentVersion,
) -> Result<ComponentDetails, MetadataFetchError>;

async fn get_active_component_in_worker(
Contributor

Suggested change
async fn get_active_component_in_worker(
async fn get_worker_component_version(

// In case there is a discrepancy between the association of worker-id -> version details
// and the component_elements cache, it will get fixed in the next call.
// -------------------------------------------------------------------------------------------
// The Race condition of worker-executor getting updated with another version of component-id
Contributor

This comment is a bit confusing, because I think this (get_component_elements_from_cache_latest_version) will only be used when there is no worker yet. So we cannot say "worker-executor gets updated with another version of component-id" - there is no worker at all at this point. What happens is that here we assume the new worker will get component version C1, and then we finally make the invocation, where the worker executor will also fetch the latest component version C2, because it is not part of the invocation request and it is a new worker. There is no guarantee that C2 == C1.

Contributor Author

I should have put this comment on the function above. I didn't mean to put it here; will change it straight away.

.await
}

pub(crate) async fn get_component_elements_from_cache_latest_version(
Contributor

I think this should have a more descriptive name, as it is doing something very specific.

How about assume_latest_version_for_new_worker or something?

&self,
worker_id: &WorkerId,
) -> Result<ComponentElements, MetadataFetchError> {
let latest_component_version_details = self
Contributor

Unfortunately this still means that for the "dynamic invoke to create new worker" we do two remote calls:

  • first to figure out the worker does not exist (calling worker executor)
  • then here to figure out the latest component version

I don't see any obvious improvement to this within the context of this PR, but let's take a step back and think about whether we need all this.

If we did not need component metadata (exported functions) for the Rib evaluator, then we could just forward the invocation to the worker executor, which is a single point for deciding which version to use etc., and cache invalidation is also easier there (for example, the component service can notify all worker executors if a component is updated).

So why do we need to know the component interface for Rib? Can't it just treat everything that looks like a function invocation as a potential function invocation, and then the worker executor returns an error if it doesn't exist or doesn't typecheck? It checks these things anyway (even at the wasmtime level).

Contributor Author

@afsalthaj afsalthaj Jul 2, 2024

Note that, prior to this approach, we were simply forwarding everything to the worker executor; that is, the Rib evaluation context hardly knew anything about this.

Having no symbol-table info for evaluating Rib (meaning we simply forward everything to worker-executor) is something I am a bit hesitant to do (that is, to revert to). Forwarding everything to the worker-executor without type checking is an approach that is always after the fact: the Rib evaluator can still (possibly) make mistakes because it couldn't decide properly what to do due to missing type information, and the worker-executor will just return an error. I also think forwarding things as such is a hacky approach if we think about any future improvements for Rib. If the boundary of Rib were well set (for example, foo(1) is always a function call and never a variant construction), then maybe forwarding everything to the worker would make more sense.

Open to discussions there, but it would be more of a revert to where we were before than a change to this approach.

Contributor

It's unfortunate if it ends up a (partial) revert, but I feel like we did not properly think through every aspect of this, and that's why we are in the current state. We can continuously learn and reevaluate our idea of what we need to do.

I don't agree that leaving the validation to the worker executor is "hacky" - it's a perfectly valid validation point, done where all the necessary information must be available anyway to run the worker.

Could you, to help the discussion, collect a more concrete list of what we need the above-mentioned symbol table info for - and what we would sacrifice without it? It is not clear to me. Is it just because of the "worker invocation looks like an inline function call" feature that I remember we were arguing about, or something else?

Contributor

Also, if we support the new dynamic dispatch feature, we won't have any metadata for those components.

The "worker proxy" part of worker service shouldn't require anything else than the cached routing table for forwarding invocations. Right now it requires type information but only for the json invocation api because it's ambiguous and cannot be interpreted without knowing the types. But that interpreter can be moved one layer down to the executor.

The worker bridge part requires the metadata for evaluating Rib, and I increasingly think that is a mistake.

Contributor Author

The worker bridge part requires the metadata for evaluating Rib, and I increasingly think that is a mistake.

Is this a mistake? Or is it becoming difficult due to the dynamic dispatch feature?

Contributor

I think it is a mistake because it requires very complex (or even impossible) caching logic and/or a performance penalty, as I tried to point out.

Contributor

@jdegoes jdegoes Jul 3, 2024

I do feel like there is some place where validation must be possible, even though it doesn't have to be Worker Gateway. The reason is tooling and DX: when a user is writing a Rib script, auto-complete should be possible, and type errors, to the extent they can be known in advance, should be eagerly reported. However, these could occur in some service that is separate from Worker Gateway, and which exists to do pre-compilation of Rib, typing, validation, auto-complete, etc.

Contributor

If we had such a feature, it would be tied to a specific component version, not a specific worker. So I could imagine an explicit, user-triggered validation that takes the targeted component version as a parameter and runs during deployments or live editing in the UI.

We could also permanently associate a component version with an API definition, but that would lead to the same problem, as we don't have per-worker information (what version each uses) at the worker service level.

pub struct NoopComponentMetadataService;

#[async_trait]
impl ComponentMetadataService for NoopComponentMetadataService {
Contributor

By the way: why do we have a Noop implementation for all the things in worker/component services?

Contributor Author

A carry-forward of some history. I hope I can delete them.

@@ -418,6 +503,57 @@ where
Ok(get_json_from_typed_value(&typed_value.result))
}

async fn invoke_and_await_parsed_function(
Contributor

If we need this, shouldn't all the other variants in this file use it, to avoid redundancy?
(And it also feels weird that we need to parse the function name at this level - do we?)

Contributor

I mean, do we need to have it parsed at all, instead of just passing it down to the worker executor, where it is parsed anyway?

Contributor Author

Rib has already parsed a call as a function call. Obviously, in future it will be more like a PreparedStatement, as John mentioned. But logically, yes, it has already parsed most of the details.

Contributor Author

A few more notes:

There is already a type-checker in the peripherals (component-service/worker-service) before calling worker-executor, which we decided on in May last year. If that's the case, why is having the full parsed function name in the same service a wrong/suboptimal choice now?

Contributor

This does not answer my question: why do we need a ParsedFunctionName for the invocation, when the invoke API gets the function name as a string?

Contributor

It implies the worker-executor should be responsible for mapping the Val against the Metadata and return a TypeAnnotatedValue

This is one possibility; it allows us to keep the current JSON response format, but it increases the response payload size unnecessarily in cases where it is not needed.
There are (at least) two more possibilities to consider:

  • if the result types are only needed for the JSON format, we could potentially change the JSON format itself, but in that case, without any field and constructor names (basically Val serialized directly to JSON), it would be a very poor user experience
  • we could make Val a bit more self-describing - basically carrying the field/enum/flag/constructor names that would make it enough on its own to convert to JSON (probably a bit less information than full type annotation) - but then it would always be coupled with this information even where it's not needed.

So probably your suggestion is the best out of these three.
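For illustration only, a minimal sketch with simplified stand-in types (not the actual wasm-rpc or golem definitions) of why a bare Val lacks the names needed for a readable JSON, while the type metadata restores them:

#[derive(Debug)]
enum Val {
    U32(u32),
    Str(String),
    Record(Vec<Val>), // positional: field names are not carried in the value
}

#[derive(Debug)]
enum Type {
    U32,
    Str,
    Record(Vec<(String, Type)>), // field names live in the metadata/type
}

// Without the Type, the best we could emit is a positional JSON array;
// with it, we can emit a JSON object with proper field names.
fn to_json(val: &Val, typ: &Type) -> String {
    match (val, typ) {
        (Val::U32(n), Type::U32) => n.to_string(),
        (Val::Str(s), Type::Str) => format!("\"{}\"", s),
        (Val::Record(fields), Type::Record(field_types)) => {
            let body: Vec<String> = fields
                .iter()
                .zip(field_types)
                .map(|(v, (name, t))| format!("\"{}\": {}", name, to_json(v, t)))
                .collect();
            format!("{{{}}}", body.join(", "))
        }
        _ => "null".to_string(),
    }
}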

Note that I think the worker-executor is a good place to do this mapping, because it naturally has the component metadata available (it even has the whole component binary downloaded and cached).

This definitely implies, the function-invocation should support accepting type annotated value

I don't understand why it would need that. It currently works by just getting

  repeated wasm.rpc.Val input = 3;

so we are not sending any type information to it anyway, as it already downloads and uses the metadata internally.
If we somehow have a TypeAnnotatedValue (from Rib evaluation?) we can just convert it to Val, as you wrote.
The problem is when we have JSON from the REST API, and what I wrote above (and it seems you are writing it too) is that the responsibility of understanding that JSON format can be moved to the worker executor as an optimisation - basically the same way we had a gRPC endpoint in the worker service accepting JSON for a while during the migration from Scala. (It is deleted now.)

Yes, it feels more correct to have this mapping from JSON to Val in the worker bridge, as it is a bit similar (just very simplified) to evaluating Rib on a request.

But I think it is worth it if it lets us avoid the very difficult cache invalidation / race condition problem that triggered this whole discussion.

If the responsibility of the PR is to avoid the metadata download for Rib, sure, I can do that, but we are doing that anyway in worker-service. So services that don't use Rib, but use these interfaces, end up hammering the component service.

I think the original ticket's purpose was not "Rib only" but to solve the repeated metadata download in the worker service by caching (no matter which part of the worker service). This might have been confusing because we have confusing naming around the worker service (proxy, bridge, API gateway, Rib) that we should improve. (We have two "components" merged into one service, some old and new naming, and we use the terms quite randomly in discussions.)

Contributor

Why is caching metadata easy in worker executor and hard in the worker service?

Component metadata is per component version, and which one we need depends on the worker:

  • a worker gets an associated component version when it is first created, by the worker executor
  • it can be updated later using the hot update feature

It would seem like the first one could be moved to the worker service level, so that it would always figure out the component version before calling the worker executor. But that's not good, because there are cases where a worker is spawned without touching the worker service, for example when doing a local RPC call within the executor.

The second is even trickier because even if all hot update requests go through the worker service, such a request is just an attempt to update the worker, which gets enqueued, tried later, and can potentially fail. So the exact time when a worker starts using a new component version can only be determined by polling the worker executor.

This is all complicated by the fact that the worker service is horizontally scaled and load balanced, so requests for the same worker potentially go through different worker service nodes, which all need to deal properly with these invalidation scenarios.

Even if we ignored the above points and somehow moved the "which component version should I use for a new worker" logic to the worker service, we would still have the issue of invalidating this knowledge across the system. That would require enumerating all worker service nodes, which cannot be done atomically, and requests for the same worker potentially arrive at different worker service nodes because it is load balanced.

Now compare all this with just doing this on the worker executor level:

  • it is sharded - if we cache something for a worker, it's always in the same executor (until rebalancing)
  • it is already caching whole wasmtime components (a result of downloading and compiling a component from component service)
  • it knows exactly when an update succeeds and can refresh its internal caches - and there is no need for any cross-node invalidation because it is sharded (see the sketch below)
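A minimal sketch of that last point; the names and types here are purely illustrative, not the actual worker-executor internals:

use std::collections::HashMap;

type WorkerId = String;
type ComponentVersion = u64;

struct WorkerEntry {
    active_version: ComponentVersion,
    // the compiled component, metadata, etc. would be cached here as well
}

#[derive(Default)]
struct ExecutorState {
    // sharding means a given worker's entry lives on exactly one executor node
    workers: HashMap<WorkerId, WorkerEntry>,
}

impl ExecutorState {
    // Called only when a hot update actually succeeds, so the cached version
    // can never silently drift from the running worker - and no other node
    // needs to be told about it.
    fn on_update_succeeded(&mut self, worker: &WorkerId, new_version: ComponentVersion) {
        if let Some(entry) = self.workers.get_mut(worker) {
            entry.active_version = new_version;
        }
    }
}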

Contributor

What is still missing from the conversation is what exactly we would lose by not having component metadata for the Rib evaluation. Could you describe that?

Contributor Author

It offers a better, more reliable developer experience; there must be some place where we can do type checking, validation, resolving references, pre-compilation, validating references to functions, etc.

Say this for example:

let x = foo(1) is my Rib expression. What exactly is foo here? A variant or a function call? Rib doesn't know, unless it knows the type details of the component the user is referring to.

Another example:

let x = my_exported_function(1, {}). Is {} an empty flags value or an empty record? Both flags and records start and end with curly braces. The answer depends on the function parameter types, which Rib doesn't know without downloading the metadata (a small sketch of this disambiguation follows).
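A tiny, purely illustrative stand-in (hypothetical types, not Rib's actual internals) for the kind of decision that only the component metadata makes possible:

// The declared parameter type of the exported function, as found in metadata.
enum ParamType {
    Record(Vec<String>), // field names
    Flags(Vec<String>),  // flag names
}

// What the literal "{}" should be interpreted as.
enum TypedArg {
    EmptyRecord,
    EmptyFlags,
}

// "{}" is syntactically valid for both; only the expected parameter type
// tells us which value to construct.
fn interpret_empty_braces(expected: &ParamType) -> TypedArg {
    match expected {
        ParamType::Record(_) => TypedArg::EmptyRecord,
        ParamType::Flags(_) => TypedArg::EmptyFlags,
    }
}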

Anyway, this explanation doesn't mean I want this download back.
For the performance benefits, which are super important, I am removing these metadata downloads on the Rib side.

And as we discussed, we may need to do the same for some other functionality of worker-service, which you explained in detail in response to my proposal/steps.

Contributor

It offers a better, more reliable developer experience; there must be some place where we can do type checking, validation, resolving references, pre-compilation, validating references to functions, etc.

But none of these things makes sense to me in this context of invocation.

When you develop the Rib expression you don't have a particular worker you are invoking, so this validation at invocation time doesn't help with it at all.

.iter()
.map(|x| x.clone().into())
.collect()
}
}

async fn get_proto_invoke_result(
Contributor

This is a weird function name, as under the hood it does invoke a worker. Isn't this the actual implementation of invoking a worker?

@@ -107,7 +105,7 @@ impl ResolvedWorkerBinding {
pub async fn execute_with<R>(
&self,
evaluator: &Arc<dyn Evaluator + Sync + Send>,
worker_metadata_fetcher: &Arc<dyn WorkerMetadataFetcher + Sync + Send>,
symbol_fetch: &Arc<dyn ComponentElementsFetch + Sync + Send>,
Contributor Author

I will rename this in the next commit

@afsalthaj
Contributor Author

afsalthaj commented Jul 2, 2024

@vigoo

Why does testing this logic have to be all integration tests? Can you help resolve the following doubts/suggestions?

#630 (review)

invoking existing worker first time, then again when things are cached

"Worker exist can be simulated by a test interface that actually returns a worker metadata". With that we invoke first and invoke again, to see if cache is being used

invoking new worker first time, then again

Again, this is the case where the worker does not return any metadata.

above scenarios but worker itself gets updated

The case where we preloaded the execution context with some versions and metadata and made some runs, but the worker is later running a different version, can also be simulated, and we can check that it retries and invalidates the stale metadata. Couldn't it?

I understand the trustworthiness of integration tests, but leaving this complex logic to integration tests alone - I am not fully convinced. If you think this is absolutely not reproducible in a unit test, you are probably right and I have yet to discover why :)
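A rough sketch of the kind of unit-level simulation meant above; every name here is hypothetical, and the cache is reduced to a bare HashMap:

use std::cell::Cell;
use std::collections::HashMap;

// A fake metadata source that counts how often it is hit, so a test can
// assert that the second invocation is served from the cache.
struct FakeFetcher {
    calls: Cell<u32>,
}

impl FakeFetcher {
    fn fetch(&self, _component_id: &str) -> HashMap<String, String> {
        self.calls.set(self.calls.get() + 1);
        HashMap::from([("exported-fn".to_string(), "u32 -> u32".to_string())])
    }
}

struct Cache<'a> {
    fetcher: &'a FakeFetcher,
    entries: HashMap<String, HashMap<String, String>>,
}

impl<'a> Cache<'a> {
    fn get(&mut self, component_id: &str) -> &HashMap<String, String> {
        if !self.entries.contains_key(component_id) {
            let fetched = self.fetcher.fetch(component_id);
            self.entries.insert(component_id.to_string(), fetched);
        }
        self.entries.get(component_id).unwrap()
    }
}

#[test]
fn second_invocation_uses_cache() {
    let fetcher = FakeFetcher { calls: Cell::new(0) };
    let mut cache = Cache { fetcher: &fetcher, entries: HashMap::new() };

    cache.get("component-1"); // first "invocation": metadata is fetched
    cache.get("component-1"); // second "invocation": served from the cache

    assert_eq!(fetcher.calls.get(), 1);
}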

@vigoo
Contributor

vigoo commented Jul 2, 2024

@vigoo

Why does testing this logic have to be all integration tests? Can you help resolve the following doubts/suggestions?

#630 (review)

invoking existing worker first time, then again when things are cached

"Worker exist can be simulated by a test interface that actually returns a worker metadata". With that we invoke first and invoke again, to see if cache is being used

invoking new worker first time, then again

Again, this is the case where the worker does not return any metadata.

above scenarios but worker itself gets updated

If I understand this case properly, it's about running some invokes through Rib a few times, and then the worker-executor implementation is changed to use a different component version for the worker. In this case, we expect the test to fail first and then go through retries.

I understand the trustworthiness of integration tests, but leaving this complex logic to integration tests alone - I am not fully convinced. If you think this is absolutely not reproducible in a unit test, you are probably right and I have yet to discover why :)

I'm sure you can test this in the ways you described, but it feels to me like it would require writing tons of test code to simulate all these scenarios with test implementations / mocks etc., while with our integration test framework it's straightforward: you just write down the tests, with statements like upload component, start worker, invoke worker etc. through the test DSL (which matches the component and worker API more or less).

Even if it were tested in the way you describe, I would feel more confident if we also had integration tests showing that this all works with the real implementations as well.

@afsalthaj
Contributor Author

On tests, let's have a discussion on the following first, and then I'll spend time there:

  1. Should worker-service/component-service/Rib-worker-bridge download metadata/components?
    If the answer is No, I kindly request you to read through the inputs/approaches mentioned here:

#630 (comment)

  2. If the answer is No, then probably let's close this PR and the ticket itself, because it doesn't make sense. And if the answer is No, let's do it entirely with the approach mentioned above - almost immediately.

  3. @vigoo may have an opinion - let's avoid the metadata download on the Rib side first, as the rest of the changes may require time. I prefer to do it properly all across at this stage. But neither of these should be done in this PR.

@afsalthaj
Contributor Author

I think I have come to terms with this explanation:
#630 (comment)

@afsalthaj
Contributor Author

As agreed, there will be a separate PR, which will delete the metadata.
It's already in there, but testing it uncovered a few other issues, which I will detail once done.

I am closing this PR :)

@afsalthaj afsalthaj closed this Jul 4, 2024
@afsalthaj afsalthaj deleted the gateway_cache branch October 30, 2024 06:44