Feature: provide a set of results for all invocations caused by a message #28
Comments
Thanks Alex for putting this out, as retry and error handling has been a growing concern for many implementations of the connector-sdk (which might not make it easy to reach consensus). Before going into the details, perhaps it would help to clarify what we want the user experience to be like, i.e. do we want to put as much control/logic (but also complexity) as possible on the caller (the connector-sdk user), or put it on the OpenFaaS gateway to deal with errors and retries? I'm sure many will be in favor of the first, but the latter, for me, would be in line with the great user experience and ease of use of OpenFaaS, which is one of the main reasons why the community likes it.

Instead of letting the invoker deal with figuring out which individual functions failed and need a retry (failure is not binary; also note the risk of retry storms, etc. as pointed out in the article linked), from a UX perspective, knowing that OpenFaaS has received (ack-ed) the invocation request would be the user-friendly implementation, while the complexity is obviously shifted to the OpenFaaS gateway. There's still a need for a (simple) retry/backoff implementation (with jitter) in the connector-sdk, configurable, e.g. at-most/at-least-once with up to X retries. Happy to brainstorm on this and figure out where the logic and heavy lifting should reside.
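For illustration only, here is a minimal sketch of such a configurable retry with exponential backoff and full jitter; the function name, parameters, and defaults are made up for this example and are not part of the connector-sdk:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// invokeWithRetry retries fn up to maxRetries extra times using exponential
// backoff with full jitter, so that many invokers retrying at once do not
// line up into a retry storm. Names and defaults are illustrative only.
func invokeWithRetry(fn func() error, maxRetries int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, baseDelay * 2^attempt).
		backoff := baseDelay * (1 << uint(attempt))
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("gave up after %d retries: %w", maxRetries, err)
}

func main() {
	err := invokeWithRetry(func() error {
		// Stand-in for an HTTP invocation of a function via the gateway.
		return errors.New("invocation failed")
	}, 3, 100*time.Millisecond)
	fmt.Println(err)
}
```

The jitter spreads simultaneous retries out in time, which is one of the simplest guards against the retry storms mentioned above.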
Yes, I looked into it; it's not ideal, but it could be used as a workaround for now. As an example (not exactly the same as what we are talking about), Durable Functions with Azure are pretty cool, though it's a more imperative way that they do it. To take it one step further, I feel it would be way cooler if it was declarative, either on invocation or on deploy of any OpenFaaS function; leaning towards invocation would be the ultimate solution. There is an issue: how do you tell the function to use retries from events? But this could be done with annotations. We already do that in the stack.yml file for topics, so we could add retry and back-off annotations and, if they exist, execute the functions with those parameters, as suggested by @Miserlou.
Then the connector would pick up the metadata and invoke the functions with retries and backoff. This also makes it possible to provide retry functionality for regular REST invocations, instead of relying only on the connector to have retries. So basically it is saying "please retry with the payload given", and whether it came from an event or a REST invocation doesn't matter. How it works under the hood, and whether it would involve Linkerd etc., I am not sure yet. Just throwing out ideas of how I would want it ;)
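For illustration, a rough sketch of how a connector might turn such annotations into a retry policy; the annotation keys below are invented for this example (today only the `topic` annotation is read):

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retryPolicy is a hypothetical per-function policy derived from the
// deployment annotations declared in stack.yml.
type retryPolicy struct {
	Attempts int
	Backoff  time.Duration
}

// policyFromAnnotations falls back to "no retries" when a function does not
// declare the (hypothetical) retry annotations, so existing deployments
// keep their current behaviour.
func policyFromAnnotations(annotations map[string]string) retryPolicy {
	policy := retryPolicy{}

	if v, ok := annotations["com.example/retry.attempts"]; ok {
		if n, err := strconv.Atoi(v); err == nil {
			policy.Attempts = n
		}
	}
	if v, ok := annotations["com.example/retry.backoff"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			policy.Backoff = d
		}
	}
	return policy
}

func main() {
	// Annotations as they might appear alongside "topic: payment" in stack.yml.
	annotations := map[string]string{
		"topic":                      "payment",
		"com.example/retry.attempts": "3",
		"com.example/retry.backoff":  "500ms",
	}
	fmt.Printf("%+v\n", policyFromAnnotations(annotations))
}
```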
Hi @ontehfritz, thanks for moving the comments / conversation over to this repository. See also "How Retries Can Go Wrong" and "Retry budgets" -> https://linkerd.io/2/features/retries-and-timeouts/. Linkerd is not included by default, but can be added through this tutorial: https://github.com/openfaas-incubator/openfaas-linkerd2
@alexellis thanks! Yes, it is a balancing act with retries, but in cases with event-triggered functions, even retrying automatically once after a failure is a huge help compared to figuring out which event it failed on and resubmitting that event. This is especially true if you use an event-sourcing pattern, where you don't want to pollute your stream/topic, and restarting the consumer group at a missed offset is also messy. It is easier to say "go retry all failed jobs". Retry budgets are a pretty cool way to look at retries without circuit breaking on a fixed number of retries, pretty interesting.

Some background on what we want to accomplish with OpenFaaS: if the connector went down, that is not that big a deal, as the consumer group's events would get queued, and when the connector comes back up it will resume consuming them. If a function fails, then we need to know about it and retry the function (manually or automatically) with the same payload it received. Writing this out, maybe that is more important than automatic retry/backoff: just the ability to know which functions have been failing and to trigger them again with the same payload. From there, automatic retries/backoff would be fairly easy to build on top. Thanks for letting me think out loud! We will try the Linkerd approach and see how it goes. I am sure this will stimulate more insight into how to optimally handle this with OpenFaaS :)
Looking at your diagram above with the 1:* problem, you would need to record the invocation, the payload received, and the status of execution. In this scenario, in our microservices (the node.js ones anyway), we offload that functionality to bee-queue, which uses Redis to store the job with its payload. This gives us the functionality I have been describing. It might be possible to do the same thing in the connector; we were going to look into potentially integrating this into our Kafka connector. I haven't had time to dig in, but it might work.
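For illustration only, the same record-and-replay idea sketched in Go, with an in-memory store standing in for bee-queue/Redis:

```go
package main

import (
	"fmt"
	"sync"
)

// job records one invocation with its payload and status so that failed
// invocations can be listed and replayed later. The in-memory map is a
// stand-in for a real queue backend such as Redis.
type job struct {
	ID       int
	Function string
	Payload  []byte
	Failed   bool
}

type jobStore struct {
	mu   sync.Mutex
	next int
	jobs map[int]job
}

func newJobStore() *jobStore {
	return &jobStore{jobs: map[int]job{}}
}

// record stores the invocation, the payload it received, and whether it failed.
func (s *jobStore) record(function string, payload []byte, failed bool) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next++
	s.jobs[s.next] = job{ID: s.next, Function: function, Payload: payload, Failed: failed}
	return s.next
}

// failedJobs returns everything that needs a retry, with its original payload.
func (s *jobStore) failedJobs() []job {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []job
	for _, j := range s.jobs {
		if j.Failed {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	store := newJobStore()
	store.record("stripe-fn", []byte(`{"amount": 100}`), false)
	store.record("datalake-fn", []byte(`{"amount": 100}`), true)
	fmt.Println(store.failedJobs()) // only datalake-fn would be re-invoked
}
```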
I have a connector POC written in node.js, using a job queue ... it needs a little more work, but I will share it on GitHub when done. ;)
Here is our connector: https://github.com/ratehub/openfaas-kafka-connector. It still needs a lot of polish, as I have only had a day and a half to work on it, but testing is going really well so far, so you can keep an eye on our progress! Thanks :) Any insights and opinions would be greatly appreciated @alexellis
Description
Several users have requested the ability to "retry" invocations, but I think that retrying is made more difficult by the 1:* relationship functions have to topics.
Conceptual diagram of invocations
Use-case
As a developer using the connector-sdk, I want to be able to retry invocations that failed
1:1 mapping
If a message comes in on a topic `payment`, and only one function `stripe-fn` has a matching annotation of `topic: payment`, then in the case of a failure the connector could just retry. This would probably be OK, depending on the function; perhaps some caution should be applied here.
The Linkerd2 documentation talks about the dangers of retrying automatically and about retry budgets. See "How Retries Can Go Wrong": https://linkerd.io/2/features/retries-and-timeouts/
1:* mapping
This is the assumed default operating mode.
If a message comes in on a topic `payment`, and two functions exist, `stripe-fn` and `datalake-fn`, both with `topic: payment`, then in the case that `datalake-fn` fails and `stripe-fn` passes, following the logic above, retrying the whole batch may issue a second payment via `stripe-fn`.

The proposed solution in #26 would act this way.
Potential solutions
1) [Feature Request] Include contextual information in responses #26 proposes that any response should indicate whether to acknowledge, delete, or do something else with a message. This works for a `1:1` mapping, but fails for a `1:*` mapping.

2) Have the status of all the invocations returned as a batch
The `Invoke` method, which is called by all connectors when they receive messages, could be extended to return a set of results. By returning a set of results, the caller has to wait for up to N*max_timeout to decide whether a set of invocations was successful or not. This may be offset by using a Go channel, but the connector may still have to wait for all of the results to come in before making a decision.

This is what 2) would look like with the example in the connector-sdk GitHub repo:
Before:
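A simplified sketch of the existing example; the configuration fields and function names approximate the connector-sdk example and may differ between versions:

```go
package main

import (
	"time"

	"github.com/openfaas/connector-sdk/types"
)

func main() {
	creds := types.GetCredentials()
	config := &types.ControllerConfig{
		GatewayURL:      "http://127.0.0.1:8080",
		RebuildInterval: time.Second * 10,
	}

	controller := types.NewController(creds, config)
	controller.BeginMapBuilder()

	// Invoke fans the message out to every function annotated with the
	// topic, but gives the caller nothing back to ack or retry against.
	data := []byte(`{"amount": 100}`)
	controller.Invoke("payment", &data)
}
```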
After:
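A hypothetical sketch of the extended API; the `InvocationResult` type, the interface, and the selective-retry loop are part of the proposal, not the current SDK:

```go
package main

import "fmt"

// InvocationResult is the hypothetical per-function result: one entry for
// each function whose topic annotation matched the message.
type InvocationResult struct {
	Function string
	Topic    string
	Status   int
	Err      error
}

// invoker is the hypothetical extended interface: today the SDK's Invoke
// returns nothing, so this return value is the proposal itself.
type invoker interface {
	Invoke(topic string, message *[]byte) []InvocationResult
}

// handleMessage acks the message only if every matched function succeeded,
// and would retry (or re-queue) just the ones that failed. The blocking call
// means waiting up to N*max_timeout for N matched functions; a
// channel-returning variant could stream results instead, but the caller
// still needs all of them before deciding whether to ack.
func handleMessage(c invoker, topic string, data []byte) (ack bool) {
	ack = true
	for _, r := range c.Invoke(topic, &data) {
		if r.Err != nil || r.Status >= 400 {
			fmt.Printf("would retry %s (status %d)\n", r.Function, r.Status)
			ack = false
		}
	}
	return ack
}

// fakeInvoker simulates the 1:* example above: stripe-fn passes, datalake-fn fails.
type fakeInvoker struct{}

func (fakeInvoker) Invoke(topic string, message *[]byte) []InvocationResult {
	return []InvocationResult{
		{Function: "stripe-fn", Topic: topic, Status: 200},
		{Function: "datalake-fn", Topic: topic, Status: 500},
	}
}

func main() {
	payload := []byte(`{"amount": 100}`)
	fmt.Println("ack whole message:", handleMessage(fakeInvoker{}, "payment", payload))
}
```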
I would welcome input from anyone who is considering conditional acking, retrying or doing anything else with the results from the invocations.