Retry logic doesn't seem to retry enough nor handle timeouts properly #594
Comments
I totally agree with you. `SocketTimeoutException` should be handled like `HTimedOutException`. We are facing the same problem as you. After some research, it turns out the code was actually there before issue #434, and I vote to revert that change. I found a related issue and am preparing a patch for it. I have been testing with the issue #434 change reverted and cannot reproduce issue #434.
@weizhu-us, glad to hear I am not the only one facing this issue. About my first point, unless I am mistaken, I still believe it will retry an incorrect number of times, given the code I have shown and the fact that Hector will still rethrow the exception after, say in my case, 5 retries instead of 50.
More concretely, I guess this means Hector should be changed in the following way:
...the idea being:
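A hedged sketch of the change being discussed, under my reading of the thread (the names `operateWithFailover`, `Operation`, and the retry-budget field are illustrative, not Hector's actual code): when the failure is a timeout, increment `retries` before the `finally` block decrements it, so a timed-out attempt does not consume a retry token.

```java
// Hedged sketch, NOT Hector's actual code: illustrates compensating the
// finally-block decrement so a timeout does not consume a retry token.
public class RetryLoopSketch {
    static class HTimedOutException extends RuntimeException {}

    interface Operation { String execute() throws HTimedOutException; }

    // Retry budget capped at min(configured retries, number of hosts),
    // mirroring the behaviour observed in HConnectionManager.
    static String operateWithFailover(Operation op, int configuredRetries, int numHosts) {
        int retries = Math.min(configuredRetries, numHosts);
        while (retries > 0) {
            try {
                return op.execute();
            } catch (HTimedOutException e) {
                // Proposed idea: a timed-out attempt should not burn a retry,
                // so compensate for the decrement in the finally block.
                retries++;
            } finally {
                retries--;
            }
        }
        throw new RuntimeException("exhausted retries");
    }

    public static void main(String[] args) {
        // Fake operation that times out twice, then succeeds.
        final int[] calls = {0};
        Operation flaky = () -> {
            if (++calls[0] <= 2) throw new HTimedOutException();
            return "ok";
        };
        // Even with a budget of 1 attempt, the two timeouts do not consume it.
        System.out.println(operateWithFailover(flaky, 1, 5)); // prints ok
    }
}
```

With this compensation in place, only non-timeout failures count against the configured retry budget, which is what the in-code comments appear to intend.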
I think there is a valid reason to retry up to the number of nodes: the idea might be that it's time to give up once all the nodes have been tried. But that's not what happens, since a host that throws `HTimedOutException` is not added to the `excludeHosts` list. We had a case where some requests happened to hit the same node on every retry after a timeout while using the round-robin LB policy. We then switched to the dynamic LB policy to lower the chance of that happening. But it would be nice to add the node to `excludeHosts` after an `HTimedOutException`, unless there is a good reason not to do so.
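A minimal sketch of that suggestion, assuming a hypothetical `excludeHosts` set consulted by the load balancer (the names are illustrative, not Hector's API): once a host throws `HTimedOutException`, excluding it guarantees the next retry lands on a different node, and retrying up to the node count then really does mean "give up once every node has been tried".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExcludeTimedOutHosts {
    // Pick the next candidate host, skipping any host that already threw
    // HTimedOutException during this operation (hypothetical sketch).
    static String nextHost(List<String> hosts, Set<String> excludeHosts) {
        for (String host : hosts) {
            if (!excludeHosts.contains(host)) {
                return host;
            }
        }
        return null; // every node has timed out: a sensible point to give up
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("cass1", "cass2", "cass3");
        Set<String> excludeHosts = new LinkedHashSet<>();

        // cass1 times out, so it is excluded before the next retry,
        // guaranteeing the retry hits a different node.
        excludeHosts.add("cass1");
        System.out.println(nextHost(hosts, excludeHosts)); // prints cass2
    }
}
```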
issue #594, handle SocketTimeoutException differently
Hi,
I am fairly new to Hector's codebase, but following some surprising behaviour observed in production and some subsequent reading of Hector's code, I thought I would raise this ticket to confirm or invalidate my suspicions:
In `HConnectionManager`, it seems that regardless of the number of retries configured in `FailoverPolicy` (I use 50 retries with 2 seconds between retries, on a 5-node cluster), the maximum number of times it will actually retry is the number of nodes in the cluster, as the minimum of these two values is taken:
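The effect of that cap can be illustrated in isolation (a sketch; the real field and pool names in `HConnectionManager` differ):

```java
public class RetryCapExample {
    // Effective attempts = min(configured retries, number of known hosts),
    // as described in the issue.
    static int effectiveRetries(int configuredRetries, int clusterSize) {
        return Math.min(configuredRetries, clusterSize);
    }

    public static void main(String[] args) {
        // 50 configured retries on a 5-node cluster yields only 5 attempts.
        System.out.println(effectiveRetries(50, 5)); // prints 5
    }
}
```

So with the reporter's configuration, 45 of the 50 configured retries can never happen.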
and, regardless of the error, a "retry token" is consumed every time it fails:
Based on the comments in the code, it seems that when a timeout occurs, the number of retries should NOT be decremented, but the code does NOT increment `retries` to compensate for the decrement in the `finally` block:

When Cassandra is so busy that it drops requests, or when requests time out, it happens that a `SocketTimeoutException` gets thrown, which is rethrown as `TTransportException`, which is itself rethrown as `HectorTransportException`. However, it seems this should be handled like `HTimedOutException`.

Could it be possible to:
Thank you very much in advance!
M.
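One hedged way to implement the handling the issue asks for (a sketch, not Hector's code): walk the cause chain of the thrown exception and treat a wrapped `SocketTimeoutException` the same way as `HTimedOutException`. Here plain `RuntimeException` stands in for the `TTransportException` / `HectorTransportException` wrappers to keep the example self-contained:

```java
import java.net.SocketTimeoutException;

public class TimeoutUnwrapExample {
    // Returns true if the throwable, or any exception in its cause chain,
    // is a SocketTimeoutException.
    static boolean isSocketTimeout(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate the wrapping chain described above:
        // SocketTimeoutException -> TTransportException -> HectorTransportException
        // (RuntimeException stands in for the Thrift/Hector wrapper types).
        Throwable wrapped =
            new RuntimeException("HectorTransportException",
                new RuntimeException("TTransportException",
                    new SocketTimeoutException("read timed out")));
        System.out.println(isSocketTimeout(wrapped)); // prints true
    }
}
```

A check like this in the failover path would let the timed-out-host logic fire even when the timeout arrives wrapped in transport exceptions.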