Add docs for Nexus circuit breaker #3220

Open
wants to merge 1 commit into
base: main
from

Conversation

rodrigozhou
Contributor

What does this PR do?

Add docs for Nexus circuit breaker

Notes to reviewers

Comment on lines +149 to +158
### Circuit Breaker {#circuit-breaker}

The circuit breaker kicks in when requests fail with a [retryable error](https://github.com/temporalio/temporal/blob/13d6cd8cf7a4ba0c4660cf98f672bbd645dca3e7/components/nexusoperations/executors.go#L659)
consecutively as it might indicate that the destination (eg: Nexus service to start operation, or
the caller for callback request) is down or unable to process the request. The default behavior of
the circuit breaker is to open after 5 consecutive failed requests. Once in open state, Nexus tasks
will fail early and requests won't be sent to destination. After a minute in open state, it will
change to half-open state, which will allow only 1 request to be made. If the request is successful,
then the circuit breaker changes its state to closed, and allows all requests to pass through.

Contributor Author

This is the only change I made; all other changes are from yarn build.

Member

bergundy left a comment

I would defer to someone from documentation to go over the grammar and add a section on how users would find out when a destination is down. Specifically, mention the BLOCKED state and maybe show what it'd look like when describing a workflow via the CLI.

@@ -146,6 +146,16 @@ As mentioned above, a synchronous Nexus Operation handler has less than 10 secon
Once the caller Workflow schedules an Operation with the caller’s Temporal cluster, the caller’s Nexus Machinery keeps trying to start the Operation, with automatic retries and exponential backoff.
If a Nexus Operation returns a [retryable error](https://github.com/temporalio/temporal/blob/13d6cd8cf7a4ba0c4660cf98f672bbd645dca3e7/components/nexusoperations/executors.go#L659) when attempting to start, the Operation will be retried up to the [default Retry Policy’s](https://github.com/temporalio/temporal/blob/de7c8879e103be666a7b067cc1b247f0ac63c25c/components/nexusoperations/config.go#L111) max attempts and expiration interval.

### Circuit Breaker {#circuit-breaker}

The circuit breaker kicks in when requests fail with a [retryable error](https://github.com/temporalio/temporal/blob/13d6cd8cf7a4ba0c4660cf98f672bbd645dca3e7/components/nexusoperations/executors.go#L659)
Member

I would explicitly call out request timeouts and how that may relate to when a worker is down.

### Circuit Breaker {#circuit-breaker}

The circuit breaker kicks in when requests fail with a [retryable error](https://github.com/temporalio/temporal/blob/13d6cd8cf7a4ba0c4660cf98f672bbd645dca3e7/components/nexusoperations/executors.go#L659)
consecutively as it might indicate that the destination (eg: Nexus service to start operation, or
Member

The endpoint would be down, not the service.

The circuit breaker kicks in when requests fail with a [retryable error](https://github.com/temporalio/temporal/blob/13d6cd8cf7a4ba0c4660cf98f672bbd645dca3e7/components/nexusoperations/executors.go#L659)
consecutively as it might indicate that the destination (eg: Nexus service to start operation, or
the caller for callback request) is down or unable to process the request. The default behavior of
the circuit breaker is to open after 5 consecutive failed requests. Once in open state, Nexus tasks
Member

Not sure users will understand what Nexus tasks are in this context, and this applies to callback "tasks" as well. These are all internal details; you should use "the Nexus machinery" instead.

consecutively as it might indicate that the destination (eg: Nexus service to start operation, or
the caller for callback request) is down or unable to process the request. The default behavior of
the circuit breaker is to open after 5 consecutive failed requests. Once in open state, Nexus tasks
will fail early and requests won't be sent to destination. After a minute in open state, it will
Member

Suggested change
- will fail early and requests won't be sent to destination. After a minute in open state, it will
+ will fail early and requests won't be sent to that destination. After a minute in open state, it will
