WIP: Gateway cache #630

Closed
wants to merge 35 commits into from

Conversation

afsalthaj
Contributor

@afsalthaj afsalthaj commented Jul 1, 2024

This is still WIP due to: https://zivergeteam.slack.com/archives/C057S2E4XT5/p1719840912374849

Fix #586

  • With the new golem-rib, function-name parsing is reused between Rib and WorkerService.
  • Introduced a new function in worker-service that takes a ParsedFunctionName (because Rib has already parsed the input function into a ParsedFunctionName).
  • The same function takes the input parameter types and output result types, because the evaluation context of Rib already loads these details.
  • These details are cached per component-id corresponding to a gateway request.
  • Given that ComponentMetadata is already loaded as part of forming the EvaluationContext, the new worker-service functionality doesn't need to load this metadata again.
  • If execution fails during the function invocation, the cached evaluation context is invalidated and the invocation is retried (see the sketch below).
  • While implementing this feature, it turned out that function names were still being parsed not just twice but three or four times; this is fixed by refactoring worker-service.
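A rough sketch of the intended cache-plus-invalidate-and-retry flow. All names and types below are illustrative placeholders, not the actual worker-service APIs:

use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone)]
struct ComponentElements {
    // parameter and result types per exported function, as loaded into the
    // Rib evaluation context (hypothetical, simplified shape)
    functions: HashMap<String, (Vec<String>, Vec<String>)>,
}

trait MetadataFetch {
    fn fetch(&self, component_id: &str) -> Result<ComponentElements, String>;
}

struct CachedMetadata<F: MetadataFetch> {
    fetcher: F,
    cache: Mutex<HashMap<String, ComponentElements>>,
}

impl<F: MetadataFetch> CachedMetadata<F> {
    fn get(&self, component_id: &str) -> Result<ComponentElements, String> {
        let mut cache = self.cache.lock().unwrap();
        if let Some(hit) = cache.get(component_id) {
            return Ok(hit.clone());
        }
        let fresh = self.fetcher.fetch(component_id)?;
        cache.insert(component_id.to_string(), fresh.clone());
        Ok(fresh)
    }

    fn invalidate(&self, component_id: &str) {
        self.cache.lock().unwrap().remove(component_id);
    }
}

// Invoke using cached metadata; if the invocation fails, invalidate the cache
// entry and retry once with freshly fetched metadata.
fn invoke_with_cache<F: MetadataFetch>(
    cached: &CachedMetadata<F>,
    component_id: &str,
    invoke: impl Fn(&ComponentElements) -> Result<String, String>,
) -> Result<String, String> {
    let elements = cached.get(component_id)?;
    match invoke(&elements) {
        Ok(result) => Ok(result),
        Err(_) => {
            cached.invalidate(component_id);
            let refreshed = cached.get(component_id)?;
            invoke(&refreshed)
        }
    }
}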

@afsalthaj
Contributor Author

After more investigation, there is even more function-name parsing in other parts of worker-service.
Let me try to fix those and raise this again.

@afsalthaj afsalthaj marked this pull request as ready for review July 1, 2024 14:25
@afsalthaj afsalthaj changed the title Gateway cache WIP: Gateway cache Jul 1, 2024
@vigoo
Contributor

vigoo commented Jul 2, 2024

General note: should not be merged before #585 so we see the effect. (And that should not be done before merging #624)

Contributor

@vigoo vigoo left a comment

Added a first set of comments. One of them questions the whole approach, so let's make sure we discuss those before moving on.

In addition to that, I think we also need to prove this all works with integration tests that invoke workers through the worker service both directly and via the API gateway (I'm still a bit confused by what we call what now), with different scenarios:

  • invoking existing worker first time, then again when things are cached
  • invoking new worker first time, then again
  • above scenarios but new component versions added
  • above scenarios but worker itself gets updated
  • ...

This is complicated, and we don't have any coverage for it at this level.

self.get_export_function(parsed)
}

pub fn get_export_function(
Contributor

I know this is not a change in this PR; just for the record, now that I'm looking at it, it is full of unnecessary clones. (functions() clones exports, then clones the function again, returns a new collection, and then here it is cloned again, etc.)

Contributor Author

It was a refactoring of what already existed there, as in I moved the code, but sure, I can change those clone bits.

version: ComponentVersion,
) -> Result<ComponentDetails, MetadataFetchError>;

async fn get_active_component_in_worker(
Contributor

Suggested change
async fn get_active_component_in_worker(
async fn get_worker_component_version(

// In case there is a discrepancy between the association of worker-id -> version details
// and the component_elements cache, it will get fixed in the next call.
// -------------------------------------------------------------------------------------------
// The Race condition of worker-executor getting updated with another version of component-id
Contributor

This comment is a bit confusing, because I think this (get_component_elements_from_cache_latest_version) will only be used when there is no worker yet. So we cannot say "worker-executor gets updated with another version of component-id" - there is no worker at all at this point. What happens is that here we assume the new worker will get component version C1, and then we finally make the invocation, where the worker executor will also fetch the latest component version C2, because it is not part of the invocation request and it is a new worker. There is no guarantee that C2 == C1.

Contributor Author

I should have put this comment on the function above. I didn't mean to put it here; will change it straight away.

.await
}

pub(crate) async fn get_component_elements_from_cache_latest_version(
Contributor

I think this should have a more descriptive name, as it is doing something very specific.

How about assume_latest_version_for_new_worker or something?

&self,
worker_id: &WorkerId,
) -> Result<ComponentElements, MetadataFetchError> {
let latest_component_version_details = self
Contributor

Unfortunately this still means that for the "dynamic invoke to create new worker" we do two remote calls:

  • first to figure out the worker does not exist (calling worker executor)
  • then here to figure out the latest component version

I don't see any obvious improvement to this within the context of this PR, but let's take a step back and think about whether we need all this.

If we did not need component metadata (exported functions) for the Rib evaluator, then we could just forward the invocation to the worker executor, which is a single point for deciding which version to use etc., and cache invalidation is also easier there (for example, the component service can notify all worker executors if a component is updated).

So why do we need to know the component interface for Rib? Can't it just treat everything that looks like a function invocation as a potential function invocation, and then the worker executor returns an error if it doesn't exist or doesn't typecheck? It checks these things anyway (even at the wasmtime level).

Contributor Author

@afsalthaj afsalthaj Jul 2, 2024

Note that, prior to this approach, we were simply forwarding everything to the worker executor; that is, the Rib evaluation context hardly knew anything about this.

Having no symbol-table info for evaluating Rib (meaning we simply forward everything to worker-executor) is something I am a bit hesitant to do (that is, to revert to). Forwarding everything to the worker-executor without type checking is an approach that is always after the fact: the Rib evaluator can still (possibly) make mistakes because it couldn't decide properly what to do due to missing type information, and the worker-executor will just return an error. I also think forwarding things as such is a hacky approach if we think about any future improvements for Rib. If the boundary of Rib were well set (for example, foo(1) is always a function call and never a variant construction), then maybe forwarding everything to the worker would make more sense.

Open to discussions there, but it would be more of a revert to where we were before than a change to this approach.

Contributor

It's unfortunate if it ends up a (partial) revert, but I feel like we did not properly think through every aspect of this, and that's why we are in the current state. We can continuously learn and reevaluate our idea of what we need to do.

I don't agree that leaving the validation to the worker executor is "hacky" - it's a perfectly valid validation point, done where all the necessary information must be available anyway to run the worker.

Could you, to help the discussion, collect a more concrete list of what we need the above-mentioned symbol table info for - and what we would sacrifice without it? It is not clear to me. Is it just because of the "worker invocation looks like an inline function call" feature that I remember we were arguing about, or something else?

Contributor

Also, if we support the new dynamic dispatch feature, we won't have any metadata for those components.

The "worker proxy" part of worker service shouldn't require anything else than the cached routing table for forwarding invocations. Right now it requires type information but only for the json invocation api because it's ambiguous and cannot be interpreted without knowing the types. But that interpreter can be moved one layer down to the executor.

The worker bridge part requires the metadata for evaluating Rib, and I increasingly think that is a mistake.

Contributor Author

The worker bridge part requires the metadata for evaluating Rib, and I increasingly think that is a mistake.

Is this a mistake? Or is it becoming difficult due to the dynamic dispatch feature?

Contributor

I think it is a mistake because it requires very complex (or even impossible) caching logic and/or a performance penalty, as I tried to point out.

Contributor

@jdegoes jdegoes Jul 3, 2024

I do feel like there is some place where validation must be possible, even though it doesn't have to be Worker Gateway. The reason is tooling and DX: when a user is writing a Rib script, auto-complete should be possible, and type errors, to the extent they can be known in advance, should be eagerly reported. However, these could occur in some service that is separate from Worker Gateway, and which exists to do pre-compilation of Rib, typing, validation, auto-complete, etc.

Contributor

If we had such a feature, it would be tied to a specific component version, not a specific worker. So I could imagine an explicit, user-triggered validation that takes the targeted component version as a parameter and runs during deployments or live editing in the UI.

We could also permanently associate a component version with an API definition, but that would lead to the same problem, as we don't have per-worker information (what version each uses) at the worker service level.

pub struct NoopComponentMetadataService;

#[async_trait]
impl ComponentMetadataService for NoopComponentMetadataService {
Contributor

By the way: why do we have a Noop implementation for all the things in worker/component services?

Contributor Author

A carry-forward of some history. I hope I can delete them.

@@ -418,6 +503,57 @@ where
Ok(get_json_from_typed_value(&typed_value.result))
}

async fn invoke_and_await_parsed_function(
Contributor

If we need this, shouldn't all the other variants in this file use it, to avoid redundancy?
(And it also feels weird that we need to parse the function name at this level - do we?)

Contributor

I mean, do we need to have it parsed at all, instead of just passing it down to the worker executor, where it is parsed anyway?

Contributor Author

Rib has already parsed a call as a function call. Obviously, in future it will be more like a PreparedStatement, as John mentioned. But logically, yes, it has already parsed most of the details.

Contributor Author

A few more notes:

There is already a type-checker in the peripherals (component-service/worker-service) before calling worker-executor, which we decided on in May last year. If that's the case, why is having the full parsed function name in the same service a wrong/suboptimal choice now?

Contributor

This does not answer my question: why do we need a ParsedFunctionName for the invocation, when the invoke API gets the function name as a string?

Contributor

It implies the worker-executor should be responsible for mapping the Val against the Metadata and return a TypeAnnotatedValue

This is one possibility; it allows us to keep the current JSON response format, but it increases the response payload size unnecessarily in cases where it is not needed.
There are (at least) two more possibilities to consider:

  • if the result types are only needed for the JSON format, we could potentially change the JSON format itself, but in that case, without any field and constructor names (basically Val serialized directly to JSON), it would be a very poor user experience
  • we could make Val a bit more self-describing - basically carrying the field/enum/flag/constructor names that would make it enough on its own to convert to JSON (probably a bit less information than full type annotation) - but then it would always be coupled with this information even where it's not needed.

So probably your suggestion is the best out of these three.
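For illustration only, a minimal sketch with simplified stand-in types (not the actual wasm-rpc or golem definitions) of why a bare Val lacks the names needed for a readable JSON, while the type metadata restores them:

#[derive(Debug)]
enum Val {
    U32(u32),
    Str(String),
    Record(Vec<Val>), // positional: field names are not carried in the value
}

#[derive(Debug)]
enum Type {
    U32,
    Str,
    Record(Vec<(String, Type)>), // field names live in the metadata/type
}

// Without the Type, the best we could emit is a positional JSON array;
// with it, we can emit a JSON object with proper field names.
fn to_json(val: &Val, typ: &Type) -> String {
    match (val, typ) {
        (Val::U32(n), Type::U32) => n.to_string(),
        (Val::Str(s), Type::Str) => format!("\"{}\"", s),
        (Val::Record(fields), Type::Record(field_types)) => {
            let body: Vec<String> = fields
                .iter()
                .zip(field_types)
                .map(|(v, (name, t))| format!("\"{}\": {}", name, to_json(v, t)))
                .collect();
            format!("{{{}}}", body.join(", "))
        }
        _ => "null".to_string(),
    }
}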

Note that I think the worker-executor is a good place to do this mapping, because it naturally has the component metadata available (it even has the whole component binary downloaded and cached).

This definitely implies, the function-invocation should support accepting type annotated value

I don't understand why it would need that. It currently works by just getting

  repeated wasm.rpc.Val input = 3;

so we are not sending any type information to it anyway, as it already downloads and uses the metadata internally.
If we somehow have a TypeAnnotatedValue (from Rib evaluation?) we can just convert it to Val, as you wrote.
The problem is when we have JSON from the REST API, and what I wrote above (and it seems you are writing it too) is that the responsibility of understanding that JSON format can be moved to the worker executor as an optimisation - basically the same way we had a gRPC endpoint in the worker service accepting JSON for a while during the migration from Scala. (It is deleted now.)

Yes, it feels more correct to have this mapping from JSON to Val in the worker bridge, as it is a bit similar (just very simplified) to evaluating Rib on a request.

But I think it is worth it if it lets us avoid the very difficult cache invalidation / race condition problem that triggered this whole discussion.

If the responsibility of the PR is to avoid the metadata download for Rib, sure, I can do that, but we are doing that anyway in worker-service. So services that don't use Rib, but use these interfaces, end up hammering the component service.

I think the original ticket's purpose was not "Rib only" but to solve the repeated metadata download in the worker service by caching (no matter which part of the worker service). This might have been confusing because we have confusing naming around the worker service (proxy, bridge, API gateway, Rib) that we should improve. (We have two "components" merged into one service, some old and new naming, and we use the terms quite randomly in discussions.)

Contributor

Why is caching metadata easy in worker executor and hard in the worker service?

Component metadata is per component version, and which one we need depends on the worker:

  • a worker gets an associated component version when it is first created, by the worker executor
  • it can be updated later using the hot update feature

It would seem like the first one could be moved to the worker service level, so that it would always figure out the component version before calling the worker executor. But that's not good, because there are cases where a worker is spawned without touching the worker service, for example when doing a local RPC call within the executor.

The second is even trickier because even if all hot update requests go through the worker service, such a request is just an attempt to update the worker, which gets enqueued, tried later, and can potentially fail. So the exact time when a worker starts using a new component version can only be determined by polling the worker executor.

This is all complicated by the fact that the worker service is horizontally scaled and load balanced, so requests for the same worker potentially go through different worker service nodes, which all need to deal properly with these invalidation scenarios.

Even if we ignored the above points and somehow moved the "which component version should I use for a new worker" logic to the worker service, we would still have the issue of invalidating this knowledge across the system. That would require enumerating all worker service nodes, which cannot be done atomically, and requests for the same worker potentially arrive at different worker service nodes because it is load balanced.

Now compare all this with just doing this on the worker executor level:

  • it is sharded - if we cache something for a worker, it's always in the same executor (until rebalancing)
  • it is already caching whole wasmtime components (a result of downloading and compiling a component from component service)
  • it knows exactly when an update succeeds and can refresh its internal caches - and there is no need for any cross-node invalidation because it is sharded (see the sketch below)
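A minimal sketch of that last point; the names and types here are purely illustrative, not the actual worker-executor internals:

use std::collections::HashMap;

type WorkerId = String;
type ComponentVersion = u64;

struct WorkerEntry {
    active_version: ComponentVersion,
    // the compiled component, metadata, etc. would be cached here as well
}

#[derive(Default)]
struct ExecutorState {
    // sharding means a given worker's entry lives on exactly one executor node
    workers: HashMap<WorkerId, WorkerEntry>,
}

impl ExecutorState {
    // Called only when a hot update actually succeeds, so the cached version
    // can never silently drift from the running worker - and no other node
    // needs to be told about it.
    fn on_update_succeeded(&mut self, worker: &WorkerId, new_version: ComponentVersion) {
        if let Some(entry) = self.workers.get_mut(worker) {
            entry.active_version = new_version;
        }
    }
}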

Contributor

What is still missing from the conversation is what exactly we would lose by not having component metadata for the Rib evaluation. Could you describe that?

Contributor Author

It offers a better, more reliable developer experience; there must be some place where we can do type checking, validation, resolving references, pre-compilation, validating references to functions, etc.

Say this for example:

let x = foo(1) is my Rib expression. What exactly is foo here? A variant or a function call? Rib doesn't know, unless it knows the type details of the component the user is referring to.

Another example:

let x = my_exported_function(1, {}). Is {} an empty flags value or an empty record? Both flags and records start and end with curly braces. The answer depends on the function parameter types, which Rib doesn't know without downloading the metadata (a small sketch of this disambiguation follows).
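A tiny, purely illustrative stand-in (hypothetical types, not Rib's actual internals) for the kind of decision that only the component metadata makes possible:

// The declared parameter type of the exported function, as found in metadata.
enum ParamType {
    Record(Vec<String>), // field names
    Flags(Vec<String>),  // flag names
}

// What the literal "{}" should be interpreted as.
enum TypedArg {
    EmptyRecord,
    EmptyFlags,
}

// "{}" is syntactically valid for both; only the expected parameter type
// tells us which value to construct.
fn interpret_empty_braces(expected: &ParamType) -> TypedArg {
    match expected {
        ParamType::Record(_) => TypedArg::EmptyRecord,
        ParamType::Flags(_) => TypedArg::EmptyFlags,
    }
}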

Anyway, this explanation doesn't mean I want this download back.
For the performance benefits, which are super important, I am removing these metadata downloads on the Rib side.

And as we discussed, we may need to do the same for some other functionality of worker-service, which you explained in detail in response to my proposal/steps.

Contributor

It offers a better, more reliable developer experience; there must be some place where we can do type checking, validation, resolving references, pre-compilation, validating references to functions, etc.

But none of these things makes sense to me in this context of invocation.

When you develop the Rib expression you don't have a particular worker you are invoking, so this validation at invocation time doesn't help with it at all.

.iter()
.map(|x| x.clone().into())
.collect()
}
}

async fn get_proto_invoke_result(
Contributor

This is a weird function name, as under the hood it does invoke a worker. Isn't this the actual implementation of invoking a worker?

@@ -107,7 +105,7 @@ impl ResolvedWorkerBinding {
pub async fn execute_with<R>(
&self,
evaluator: &Arc<dyn Evaluator + Sync + Send>,
worker_metadata_fetcher: &Arc<dyn WorkerMetadataFetcher + Sync + Send>,
symbol_fetch: &Arc<dyn ComponentElementsFetch + Sync + Send>,
Contributor Author

I will rename this in the next commit

@afsalthaj
Contributor Author

afsalthaj commented Jul 2, 2024

@vigoo

Why does testing this logic have to be all integration tests? Can you help resolve the following doubts/suggestions?

#630 (review)

invoking existing worker first time, then again when things are cached

"Worker exist can be simulated by a test interface that actually returns a worker metadata". With that we invoke first and invoke again, to see if cache is being used

invoking new worker first time, then again

Again, this is the case where the worker does not return any metadata.

above scenarios but worker itself gets updated

The case where we preloaded the execution context with some versions and metadata and made some runs, but the worker is later running a different version, can also be simulated, and we can check that it retries and invalidates the stale metadata. Couldn't it?

I understand the trustworthiness of integration tests, but leaving this complex logic to integration tests alone - I am not fully convinced. If you think this is absolutely not reproducible in a unit test, you are probably right and I have yet to discover why :)
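A rough sketch of the kind of unit-level simulation meant above; every name here is hypothetical, and the cache is reduced to a bare HashMap:

use std::cell::Cell;
use std::collections::HashMap;

// A fake metadata source that counts how often it is hit, so a test can
// assert that the second invocation is served from the cache.
struct FakeFetcher {
    calls: Cell<u32>,
}

impl FakeFetcher {
    fn fetch(&self, _component_id: &str) -> HashMap<String, String> {
        self.calls.set(self.calls.get() + 1);
        HashMap::from([("exported-fn".to_string(), "u32 -> u32".to_string())])
    }
}

struct Cache<'a> {
    fetcher: &'a FakeFetcher,
    entries: HashMap<String, HashMap<String, String>>,
}

impl<'a> Cache<'a> {
    fn get(&mut self, component_id: &str) -> &HashMap<String, String> {
        if !self.entries.contains_key(component_id) {
            let fetched = self.fetcher.fetch(component_id);
            self.entries.insert(component_id.to_string(), fetched);
        }
        self.entries.get(component_id).unwrap()
    }
}

#[test]
fn second_invocation_uses_cache() {
    let fetcher = FakeFetcher { calls: Cell::new(0) };
    let mut cache = Cache { fetcher: &fetcher, entries: HashMap::new() };

    cache.get("component-1"); // first "invocation": metadata is fetched
    cache.get("component-1"); // second "invocation": served from the cache

    assert_eq!(fetcher.calls.get(), 1);
}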

@vigoo
Contributor

vigoo commented Jul 2, 2024

@vigoo

Why does testing this logic have to be all integration tests? Can you help resolve the following doubts/suggestions?

#630 (review)

invoking existing worker first time, then again when things are cached

"Worker exist can be simulated by a test interface that actually returns a worker metadata". With that we invoke first and invoke again, to see if cache is being used

invoking new worker first time, then again

Again, this is the case where the worker does not return any metadata.

above scenarios but worker itself gets updated

If I understand this case properly, it's about running some invokes through Rib a few times, and then the worker-executor implementation is changed to use a different component version for the worker. In this case, we expect the test to fail first and then go through retries.

I understand the trustworthiness of integration tests, but leaving this complex logic to integration tests alone - I am not fully convinced. If you think this is absolutely not reproducible in a unit test, you are probably right and I have yet to discover why :)

I'm sure you can test this in the ways you described, but it feels to me like it would require writing tons of test code to simulate all these scenarios with test implementations / mocks etc., while with our integration test framework it's straightforward: you just write down the tests, with statements like upload component, start worker, invoke worker etc. through the test DSL (which matches the component and worker API more or less).

Even if it were tested in the way you describe, I would feel more confident if we also had integration tests showing that this all works with the real implementations as well.

@afsalthaj
Contributor Author

On tests, let's have a discussion on the following first, and then I'll spend time there:

  1. Should worker-service/component-service/Rib-worker-bridge download metadata/components?
    If the answer is No, I kindly request you to read through the inputs/approaches mentioned here:

#630 (comment)

  2. If the answer is No, then probably let's close this PR and the ticket itself, because it doesn't make sense. And if the answer is No, let's do it entirely with the approach mentioned above - almost immediately.

  3. @vigoo may have an opinion - let's avoid the metadata download on the Rib side first, as the rest of the changes may require time. I prefer to do it properly all across at this stage. But neither of these should be done in this PR.

@afsalthaj
Contributor Author

I think I have come to terms with this explanation:
#630 (comment)

@afsalthaj
Contributor Author

As agreed, there will be a separate PR, which will delete the metadata.
It's already in there, but testing it uncovered a few other issues, which I will detail once done.

I am closing this PR :)

@afsalthaj afsalthaj closed this Jul 4, 2024
@afsalthaj afsalthaj deleted the gateway_cache branch October 30, 2024 06:44