Pubsub: Unbounded retry forced when publishing with message ordering enabled #1702
Comments
Hi, I tried to reproduce this error but was unable to do so. Can you please create a customer support issue so we can get more insight into why your publish calls are failing? Regarding an upper bound on retries so they are not infinite: I will bring it up in my next client meeting and discuss it further. I am keeping this bug open for now to post any updates on the upper bound for retries, but I will be unable to debug your exact issue until we get more insight into the publish calls. Thanks!
The reason we do not have an upper bound for retries is that if we stopped retrying, we would essentially have to fail the message, and since ordering is enabled we would also have to fail all subsequent messages to preserve the message ordering. To prevent this, we retry the message indefinitely. I think the best way forward here is to understand why the publishes are failing in the first place.
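The constraint described above can be shown in a minimal sketch (hypothetical code, not the library's implementation): once the head message for an ordering key fails permanently, every queued message with the same key must also be failed, or delivery order would be violated.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class OrderedKeyFailure {
    // If the head message for a key fails permanently, all pending messages
    // for that key must fail too, so no later message is delivered out of order.
    public static List<String> drainAfterFailure(Queue<String> pendingForKey) {
        List<String> failed = new ArrayList<>();
        while (!pendingForKey.isEmpty()) {
            failed.add(pendingForKey.poll());
        }
        return failed;
    }

    public static void main(String[] args) {
        Queue<String> pendingForKey = new ArrayDeque<>(List.of("m1", "m2", "m3"));
        // m1 failed permanently; m2 and m3 must be failed as well.
        System.out.println(drainAfterFailure(pendingForKey));
    }
}
```

This is the trade-off the maintainers describe: either retry the head message indefinitely, or cascade the failure to everything queued behind it.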
I have been able to reproduce this condition with low-level gRPC logging enabled; below are the cleaned-up logs. The application performs publish operations with ordering keys enabled, but in this scenario most messages have a distinct key.
After this point, there are other successful publish operations. However, the future returned by the publish operation that coincides with the stream termination above never completes, with or without an exception. If I wrap the future with a hardcoded timeout, for example 120 seconds, I see a consistent timeout exception 120 seconds after the GO_AWAY load_shed event. Looking at the source code of this java-pubsub library, my naive expectation is that something like this should result in an UNAVAILABLE gRPC error code, which should then be auto-retried; or, if another non-retryable error code is produced, an error should be raised to the library user. So far, all instances of this issue that I have observed are tightly correlated with server load_shed events. The library version has been updated to 1.125.5.
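The timeout wrapper described above can be sketched like this. The stuck publish is simulated here with a plain CompletableFuture that never completes; with the real library, the future would be the one returned by Publisher.publish(), and the method name awaitWithTimeout is illustrative, not part of any API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PublishTimeoutWorkaround {
    // Bound the wait on a publish future so a stalled publish is detectable,
    // instead of blocking forever on an infinitely retried message.
    public static String awaitWithTimeout(CompletableFuture<String> publishFuture,
                                          long timeout, TimeUnit unit) {
        try {
            return publishFuture.get(timeout, unit);
        } catch (TimeoutException e) {
            return "TIMED_OUT"; // the stall described in the report surfaces here
        } catch (Exception e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes
        System.out.println(awaitWithTimeout(stuck, 100, TimeUnit.MILLISECONDS));
    }
}
```

As the summary below notes, this works but scatters the effective retry/timeout policy across the application rather than keeping it in the publisher configuration.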
A potential solution here would be to have a user-set retry configuration.
We had some more discussion about this and talked about the possibility of having user-specified retry settings take precedence over any other configuration. However, we would want to coordinate this change across all client libraries so that the behavior is consistent in every language.
Summary
When publishing to Pub/Sub with message ordering enabled, the library overrides the gRPC retry settings and forces unbounded retries. The application using this library is then unable to detect failures. The workaround is to add a timeout when waiting for the returned future, but this makes the retry/timeout policy scattered and surprising to configure.
This is the code that enforces infinite retry.
https://github.com/googleapis/java-pubsub/blob/main/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java#L168 (lines 174 to 180)
The exact reason why the infinite retry never eventually succeeds is unclear. We have had multiple random occurrences where applications running in a GKE cluster stop publishing. In some instances we waited for the application to retry for up to a week without success. Unfortunately, we do not have a reliable way to trigger this type of failure, and we do not have the low-level gRPC logs from when the failure happened.
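A bounded alternative to the forced infinite retry could take a shape like the following sketch. All names here (publishWithBoundedRetry, maxAttempts, the success/failure callbacks) are illustrative, not the library's API; the point is that once the attempt bound is exhausted, the error is handed to a failure callback instead of retrying forever.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

public class BoundedRetrySketch {
    // Retry a publish attempt up to maxAttempts times; if every attempt fails,
    // propagate the last error to onFailure instead of retrying indefinitely.
    public static <T> void publishWithBoundedRetry(Supplier<T> attempt,
                                                   int maxAttempts,
                                                   Consumer<T> onSuccess,
                                                   Consumer<Exception> onFailure) {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                onSuccess.accept(attempt.get());
                return;
            } catch (Exception e) {
                last = e; // retryable error: try again up to the bound
            }
        }
        onFailure.accept(last); // bound exhausted: surface the failure
    }

    public static void main(String[] args) {
        publishWithBoundedRetry(
            () -> { throw new RuntimeException("UNAVAILABLE"); },
            3,
            msg -> System.out.println("published"),
            err -> System.out.println("failed after retries: " + err.getMessage()));
    }
}
```

In the ordered-publishing case, the failure callback would also have to fail any messages queued behind the failed one for the same key, as the maintainers point out above.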
My suggestion is to allow the library user to set a sane upper bound for retries, and let failures be propagated through the onFailure callback.
Environment details
Steps to reproduce
Code example
Based on sample publish code in documentation.
Stack trace
External references such as API reference guides
Any additional information below