Gracefully handle Riak disconnects #519
Conversation
Previously, Riak CS would not start if it was unable to connect to Riak. Furthermore, it would eventually crash if it was disconnected from Riak. This was because the `poolboy` workers would die and poolboy would not try to restart them, which led to the `poolboy` supervisor eventually crashing and the failure bubbling up to `riak_cs_sup`.

The main change here is that we pass the `auto_reconnect` option into `riakc_pb_socket`, which allows it to return from `start_link` even if it can't connect to Riak. Attempts to use `riakc_pb_socket` while it is in this state return `{error, disconnected}`. I've tried to handle as many of the cases where this might happen as I could, returning 503 to the end user. In cases where we might be far enough into the response (headers already sent over the wire), for now I just Let It Crash.

Originally, I had thought about adding a 'health' process that would monitor Riak availability, for the purpose of informing the deployer via log messages. For now, however, I think the log messages when Riak gets disconnected are clear enough that the problem should be obvious. A sudden spike in 503 responses should also be a pretty good indicator.

Other smaller changes are detailed below:

* Allow `riak_cs_storage_d` to start even if it can't immediately connect to Riak
* Refactor `riak_cs_wm_common:forbidden/2` to handle a disconnected Riak when retrieving the user record
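A minimal sketch of the mechanism described above — note the bucket name, the `#context{}` record shape, and the `maybe_fetch_user` helper are illustrative assumptions, not the PR's actual code:

```erlang
%% Sketch only. Assumes the riakc client library; the bucket name,
%% context record, and function names here are hypothetical.
start_riak_client(Host, Port) ->
    %% With auto_reconnect, start_link returns {ok, Pid} even while
    %% Riak is unreachable, and the client retries in the background.
    riakc_pb_socket:start_link(Host, Port, [{auto_reconnect, true}]).

maybe_fetch_user(Pid, UserKey, RD, Ctx) ->
    case riakc_pb_socket:get(Pid, <<"moss.users">>, UserKey) of
        {ok, UserObj} ->
            {false, RD, Ctx#context{user = UserObj}};
        {error, disconnected} ->
            %% Riak is unreachable: tell the client to retry later.
            {{halt, 503}, RD, Ctx};
        {error, notfound} ->
            {{halt, 404}, RD, Ctx}
    end.
```

The key point is that a disconnected client now surfaces as an ordinary `{error, disconnected}` return value, which a webmachine resource callback can map to a 503, instead of a crashed pool worker that takes supervisors down with it.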
Even with riak running when I start CS, I still see a set of disconnect warnings on startup.
Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering if it becomes too spammy when the riak node is down for an extended time. I don't feel certain either way; just thinking out loud.
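The "check less frequently over time" idea could look something like a capped exponential backoff. Purely illustrative — these function names and the `try_reconnect` message are made up for this sketch:

```erlang
%% Hypothetical backoff sketch: start at 5s, double on each failed
%% attempt, cap at 60s so logs stay useful during a long outage.
-define(INITIAL_RETRY_MS, 5000).
-define(MAX_RETRY_MS, 60000).

next_retry_interval(CurrentMs) ->
    min(CurrentMs * 2, ?MAX_RETRY_MS).

schedule_retry(IntervalMs) ->
    %% The owning process would handle this message and retry the ping.
    erlang:send_after(IntervalMs, self(), try_reconnect).
```

This keeps the first few retries fast (so a transient startup race resolves quickly) while thinning out the log noise if Riak stays down.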
Should we try to handle this in a nicer way? I wonder if a disconnect is worth crashing the block server and the gc daemon processes.
Ah, nice catch. I see it connects fine 5 seconds later, but I'll need to look into why that happens. Maybe if you start the
Yeah, I could go either way on that. It is worth pointing out that this only happens once in the lifetime of the storage calculation (at least that was my read on things), so if we've already checked the bucket properties and then Riak gets disconnected, we won't keep polling, since we've already got that far. Though if the daemon crashes, it will have to be read again.
Yeah -- these might be worth a mumble-discussion, as there are varying levels of handling we could do.
re: seeing disconnect warnings on startup, it must be a race. I only see it with the storage daemon and gc daemon, maybe 50% of the time. The client tries to account for this race, but either it is broken or the race is somewhere else.
Ok, figured out the disconnect issue. We do successfully try to connect before we receive the first
Ok, then I'd say let's just leave this as-is for now.
Another option here might be to try to get poolboy changed to allow a pool to be populated in a more staggered fashion. Seems like a generally useful feature for a worker-pool library, imo.
Yeah, @andrewjstone and I talked briefly about that earlier today. I too think it'd be useful, though I'm not sure how difficult it might be. I have a couple of other [1] [2] poolboy issues I've been musing about, so maybe we could throw the staggered-start stuff in there too?
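For the record, one way the staggered population could work, sketched outside of poolboy entirely — nothing here is poolboy's actual API, and all names are hypothetical:

```erlang
%% Illustrative only: instead of spawning every pool worker at once,
%% schedule each worker's start N * IntervalMs in the future, spreading
%% connection attempts out over time.
start_staggered(WorkerMod, WorkerArgs, Count, IntervalMs) ->
    lists:foreach(
      fun(N) ->
              %% The owning process would receive this message and
              %% start one worker when it arrives.
              erlang:send_after(N * IntervalMs, self(),
                                {start_worker, WorkerMod, WorkerArgs})
      end,
      lists:seq(0, Count - 1)).
```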
Alright, so recapping the outstanding issues raised during review:
Do we want to address either of these in this PR?
I would say maybe to the former and no to the latter question. |
Fixes #262
Testing!
Here are some ideas for testing, beyond the usual bits: