Pubsub: Unbounded retry forced when publishing with message ordering enabled #1702
Comments
Hi, I tried to reproduce this error but was unable to do so. Can you please create a customer support issue so we can get more insight into why your publish calls are failing? Regarding an upper bound on retries so they are not infinite: I will bring it up in my next client meeting and discuss it further. I am keeping this bug open for now to post any updates on the upper bound for retries, but I will be unable to debug your exact issue until we get more insight into the publish calls. Thanks!
The reason we do not have an upper bound for retries is that if we stopped retrying, we would essentially have to fail the message, and since ordering is enabled we would also have to fail all subsequent messages to preserve the message ordering. To prevent this, we retry the message indefinitely. I think the best way forward here is to understand why the publishes are failing in the first place.
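The constraint described above can be shown in a minimal sketch (hypothetical code, not the library's implementation): once the head message for an ordering key fails permanently, every queued message with the same key must also be failed, or delivery order would be violated.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class OrderedKeyFailure {
    // If the head message for a key fails permanently, all pending messages
    // for that key must fail too, so no later message is delivered out of order.
    public static List<String> drainAfterFailure(Queue<String> pendingForKey) {
        List<String> failed = new ArrayList<>();
        while (!pendingForKey.isEmpty()) {
            failed.add(pendingForKey.poll());
        }
        return failed;
    }

    public static void main(String[] args) {
        Queue<String> pendingForKey = new ArrayDeque<>(List.of("m1", "m2", "m3"));
        // m1 failed permanently; m2 and m3 must be failed as well.
        System.out.println(drainAfterFailure(pendingForKey));
    }
}
```

This is the trade-off the maintainers describe: either retry the head message indefinitely, or cascade the failure to everything queued behind it.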
I have been able to reproduce this condition with low-level gRPC logging enabled; below are the cleaned-up logs. The application performs publish operations with ordering keys enabled, but in this scenario most messages have a distinct key.
After this point, there are other successful publish operations. However, the future returned by the publish operation that coincides with the stream termination above never completes, with or without an exception. If I wrap the future with a hardcoded timeout, for example 120 seconds, I see a consistent timeout exception 120 seconds after the GO_AWAY load_shed event. Looking at the source code of this java-pubsub library, my naive expectation is that something like this should result in an UNAVAILABLE gRPC error code, which should then be auto-retried; or, if another non-retryable error code is produced, an error should be raised to the library user. So far, all instances of this issue that I have observed are tightly correlated with server load_shed events. The library version has been updated to 1.125.5.
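The timeout wrapper described above can be sketched like this. The stuck publish is simulated here with a plain CompletableFuture that never completes; with the real library, the future would be the one returned by Publisher.publish(), and the method name awaitWithTimeout is illustrative, not part of any API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PublishTimeoutWorkaround {
    // Bound the wait on a publish future so a stalled publish is detectable,
    // instead of blocking forever on an infinitely retried message.
    public static String awaitWithTimeout(CompletableFuture<String> publishFuture,
                                          long timeout, TimeUnit unit) {
        try {
            return publishFuture.get(timeout, unit);
        } catch (TimeoutException e) {
            return "TIMED_OUT"; // the stall described in the report surfaces here
        } catch (Exception e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes
        System.out.println(awaitWithTimeout(stuck, 100, TimeUnit.MILLISECONDS));
    }
}
```

As the summary below notes, this works but scatters the effective retry/timeout policy across the application rather than keeping it in the publisher configuration.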
A potential solution here would be to have a user-set retry configuration.
We had some more discussion about this and talked about the possibility of having user-specified retry settings take precedence over any other configuration. However, we would want to coordinate this change across all client libraries so that the behavior is consistent in every language.
Summary
When publishing to Pub/Sub with message ordering enabled, the library overrides the gRPC retry settings and forces unbounded retries. The application using this library is then unable to detect failures. The workaround is to add a timeout when waiting for the returned future, but this makes the retry/timeout policy scattered and surprising to configure.
This is the code that enforces infinite retry.
https://github.com/googleapis/java-pubsub/blob/main/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java#L168 (lines 174 to 180)
The exact reason why the infinite retry never eventually succeeds is unclear. We have had multiple random occurrences where applications running in a GKE cluster stop publishing. In some instances we waited for the application to retry for up to a week without success. Unfortunately, we do not have a reliable way to trigger this type of failure, and we do not have the low-level gRPC logs from when the failure happened.
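A bounded alternative to the forced infinite retry could take a shape like the following sketch. All names here (publishWithBoundedRetry, maxAttempts, the success/failure callbacks) are illustrative, not the library's API; the point is that once the attempt bound is exhausted, the error is handed to a failure callback instead of retrying forever.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

public class BoundedRetrySketch {
    // Retry a publish attempt up to maxAttempts times; if every attempt fails,
    // propagate the last error to onFailure instead of retrying indefinitely.
    public static <T> void publishWithBoundedRetry(Supplier<T> attempt,
                                                   int maxAttempts,
                                                   Consumer<T> onSuccess,
                                                   Consumer<Exception> onFailure) {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                onSuccess.accept(attempt.get());
                return;
            } catch (Exception e) {
                last = e; // retryable error: try again up to the bound
            }
        }
        onFailure.accept(last); // bound exhausted: surface the failure
    }

    public static void main(String[] args) {
        publishWithBoundedRetry(
            () -> { throw new RuntimeException("UNAVAILABLE"); },
            3,
            msg -> System.out.println("published"),
            err -> System.out.println("failed after retries: " + err.getMessage()));
    }
}
```

In the ordered-publishing case, the failure callback would also have to fail any messages queued behind the failed one for the same key, as the maintainers point out above.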
My suggestion is to allow the library user to set a sane upper bound for retries, and let failures be propagated through the onFailure callback.
Environment details
Steps to reproduce
Code example
Based on sample publish code in documentation.
Stack trace
External references such as API reference guides
Any additional information below