
Gracefully handle Riak disconnects #519

Merged 1 commit on Apr 12, 2013

Conversation

reiddraper
Contributor

Previously, Riak CS would not start if it was unable to connect
to Riak. Furthermore, it would eventually crash if it was disconnected from Riak.
This was because the `poolboy` workers would die and `poolboy` would
not try to restart them. This led to the `poolboy` supervisor eventually
crashing, which then bubbled up to `riak_cs_sup` crashing. The main
change here is that we pass the `auto_reconnect` option into
`riakc_pb_socket`, which allows it to return from `start_link` even if
it can't connect to Riak. Attempts to use `riakc_pb_socket` when it's in
this state will return `{error, disconnected}`. I've tried to handle as
many of the cases where this might happen as I could, returning 503 to the end
user. There are cases where we might be far enough into the response
(headers already sent over the wire) that for now I just Let It Crash.
Originally, I had thought about adding a 'health' process that would
monitor Riak availability, for the purposes of informing the deployer
via log messages. However, for now, I think the log messages are clear
enough when Riak gets disconnected that it should be obvious. The sudden
spike in 503 responses should also be a pretty good indicator. Other smaller
changes are detailed below.

  • Allow `riak_cs_storage_d` to start even if it can't immediately connect to
    Riak
  • Refactor `riak_cs_wm_common:forbidden/2` to handle disconnected Riak
    when retrieving the user record

Fixes #262
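
In sketch form, the new behaviour looks roughly like this (illustrative only: the host/port, bucket name, and `fetch_user/2` helper are placeholders, not the actual Riak CS code):

```erlang
%% Illustrative sketch, not the actual Riak CS code. With auto_reconnect,
%% start_link/3 returns a pid even while Riak is down; later requests on
%% that pid fail with {error, disconnected}, which we translate into a
%% 503 for the end user instead of crashing.
start_riak_client() ->
    riakc_pb_socket:start_link("127.0.0.1", 8087, [{auto_reconnect, true}]).

fetch_user(RiakPid, UserKey) ->
    case riakc_pb_socket:get(RiakPid, <<"moss.users">>, UserKey) of
        {ok, Object} ->
            {ok, Object};
        {error, disconnected} ->
            %% Riak is unreachable; the webmachine resource maps this to
            %% a 503 Service Unavailable response
            {error, service_unavailable};
        {error, Reason} ->
            {error, Reason}
    end.
```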


Testing!

Here are some ideas for testing, beyond the usual bits:

  • Start Riak CS when Riak is not running
  • Start Riak CS when it's configured to talk to the wrong Riak port
  • Kill/stop Riak while Riak CS is running, and verify that requests start seeing 503 errors (I think I eliminated all the cases where they'd see 403, so it's a bug if you see 403).
  • Verify that things start working again when Riak comes back to life

@kellymclaughlin
Contributor

Even with Riak running when I start CS, I still see a set of disconnected messages. Is that expected?

10:40:57:riak-cs(feature/handle-riak-disconnect) $ ./bin/riak-cs console
config is OK
Exec: /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../erts-5.9.3/bin/erlexec -boot /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../releases/1.3.0/riak-cs             -embedded -config /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../etc/app.config             -pa /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../lib/basho-patches             -args_file /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../etc/vm.args -- console
Root: /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/..
Erlang R15B03 (erts-5.9.3) [source] [64-bit] [smp:8:8] [async-threads:64] [kernel-poll:true] [dtrace]


=INFO REPORT==== 5-Apr-2013::10:40:59 ===
    alarm_handler: {set,{{disk_almost_full,"118435569"},[]}}
opening log file: "./log/access.log.2013_04_05_16"
10:40:59.595 [info] Application lager started on node '[email protected]'
10:40:59.596 [info] Application poolboy started on node '[email protected]'
10:40:59.615 [info] Application folsom started on node '[email protected]'
10:40:59.696 [warning] No storage schedule defined. Calculation must be triggered manually.
10:40:59.697 [warning] Unable to verify moss.storage bucket settings (disconnected).
10:40:59.697 [info] Unable to verify bucket properties for storage daemon. Riak currently disconnected. Retrying in 5 seconds.
10:41:00.717 [warning] Unable to verify riak-cs-gc bucket settings (disconnected).
10:41:00.792 [info] Application riak_cs started on node '[email protected]'
Eshell V5.9.3  (abort with ^G)
([email protected])1> 10:41:04.698 [warning] No storage schedule defined. Calculation must be triggered manually.
10:41:04.699 [info] Storage daemon successfully read bucket properties, continuing

@kellymclaughlin
Contributor

  • EUnit and EQC tests pass
  • Client and int tests pass
  • Dialyzer output clean
  • riak_test tests pass

@kellymclaughlin
Contributor

Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering whether it becomes too spammy if the Riak node is down for an extended time. I don't feel strongly either way; just thinking out loud.

@kellymclaughlin
Contributor

Should we try to handle this in a nicer way? I wonder whether a disconnect warrants crashing the block server and the gc daemon processes.

([email protected])1> 12:00:29.783 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
12:15:08.358 [error] gen_server <0.6935.0> terminated with reason: no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222
12:15:08.360 [warning] Unable to verify riak-cs-gc bucket settings (disconnected).
12:15:08.366 [error] CRASH REPORT Process <0.6935.0> with 3 neighbours exited with reason: no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in gen_server:terminate/6 line 747
12:15:08.368 [error] Supervisor riak_cs_sup had child riak_cs_gc_d started with riak_cs_gc_d:start_link() at <0.287.0> exit with reason no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in context child_terminated
12:15:08.368 [error] Supervisor riak_cs_delete_fsm_sup had child undefined started with {riak_cs_delete_fsm,start_link,undefined} at <0.6934.0> exit with reason no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in context child_terminated
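
A minimal sketch of what "nicer" handling could look like (hypothetical; the message shape, state record, and reply mechanism below are placeholders, not the real `riak_cs_block_server` code):

```erlang
%% Hypothetical sketch: add an explicit clause for {error, disconnected}
%% so the caller is told about the outage instead of the gen_server dying
%% on a case_clause (which currently takes the gc daemon down with it).
-record(state, {riak_pid :: pid()}).

handle_cast({get_block, ReplyPid, Bucket, Key}, State = #state{riak_pid = RiakPid}) ->
    case riakc_pb_socket:get(RiakPid, Bucket, Key) of
        {ok, Object} ->
            ReplyPid ! {block, {ok, Object}};
        {error, disconnected} ->
            %% Riak is unreachable: report it rather than crash
            ReplyPid ! {block, {error, disconnected}};
        {error, Reason} ->
            ReplyPid ! {block, {error, Reason}}
    end,
    {noreply, State}.
```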

@reiddraper
Contributor Author

> Even with Riak running when I start CS, I still see a set of disconnected messages. Is that expected?

Ah, nice catch. I see it connects fine 5 seconds later, but I'll need to look into why that happens. Maybe if you start the `riakc_pb_socket` with `auto_reconnect` there is a race condition between starting the proc and sending requests to it.

> Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering whether it becomes too spammy if the Riak node is down for an extended time. I don't feel strongly either way; just thinking out loud.

Yeah, I could go either way on that. It is worth pointing out that this only happens once in the lifetime of the storage calculation (at least that was my read on things), so if we've already checked the bucket properties and then Riak gets disconnected, we won't keep polling, since we've already got that far. Though if the daemon crashes, the bucket properties will have to be read again.
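
For context, the retry in question is roughly this shape (a rough sketch with placeholder names and message, not the actual `riak_cs_storage_d` code):

```erlang
%% Rough sketch: if reading the bucket properties fails because Riak is
%% disconnected, log it and schedule another attempt in 5 seconds rather
%% than giving up or refusing to start.
check_bucket_props(RiakPid, Bucket) ->
    case riakc_pb_socket:get_bucket(RiakPid, Bucket) of
        {ok, Props} ->
            {ok, Props};
        {error, disconnected} ->
            _ = lager:info("Riak currently disconnected. Retrying in 5 seconds."),
            erlang:send_after(5000, self(), {check_bucket_props, Bucket}),
            retrying
    end.
```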

> Should we try to handle this in a nicer way? I wonder whether a disconnect warrants crashing the block server and the gc daemon processes.

Yeah -- these might be worth a Mumble discussion, as there are varying levels of handling we could do.

@reiddraper
Contributor Author

re: seeing disconnect warnings on startup, it must be a race. I only see it with the storage daemon and gc daemon maybe 50% of the time. The client tries to account for this race, but either it is broken or the race is somewhere else.

@ghost assigned kellymclaughlin Apr 9, 2013
@reiddraper
Contributor Author

Ok, figured out the disconnect issue. We do successfully try to connect before we receive the first `set_bucket` request; however, some of these connection requests fail (they get retried later). The connection error they get is `{error, econnreset}`. This is because the default `pb_backlog` in Riak is 64, but we're trying 128 concurrent connections from Riak CS at startup time (because of `poolboy`'s 'prepopulate' feature). One way to 'fix' this is to set the `pb_backlog` on Riak to 128 (though this won't be enough if you start more than one CS node at once). The good news is that with connection retries, the situation works itself out; you just get some error messages along the way...
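
For anyone trying the `pb_backlog` workaround, an illustrative app.config fragment (assumes a Riak 1.3-style layout where the PB settings live under `riak_api`; older releases keep them under `riak_kv`, so check your release before copying):

```erlang
{riak_api, [
            {pb_ip, "127.0.0.1"},
            {pb_port, 8087},
            %% default backlog is 64; 128 matches the prepopulated CS connection pool
            {pb_backlog, 128}
           ]}
```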

@kellymclaughlin
Contributor

> Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering whether it becomes too spammy if the Riak node is down for an extended time. I don't feel strongly either way; just thinking out loud.

> Yeah, I could go either way on that. It is worth pointing out that this only happens once in the lifetime of the storage calculation (at least that was my read on things), so if we've already checked the bucket properties and then Riak gets disconnected, we won't keep polling, since we've already got that far. Though if the daemon crashes, the bucket properties will have to be read again.

Ok, then I'd say let's just leave this as-is for now.

@kellymclaughlin
Contributor

> Ok, figured out the disconnect issue. We do successfully try to connect before we receive the first `set_bucket` request; however, some of these connection requests fail (they get retried later). The connection error they get is `{error, econnreset}`. This is because the default `pb_backlog` in Riak is 64, but we're trying 128 concurrent connections from Riak CS at startup time (because of `poolboy`'s 'prepopulate' feature). One way to 'fix' this is to set the `pb_backlog` on Riak to 128 (though this won't be enough if you start more than one CS node at once). The good news is that with connection retries, the situation works itself out; you just get some error messages along the way...

Another option here might be to try to get `poolboy` changed to allow a pool to be populated in a more staggered fashion. Seems like a generally useful feature for a worker pool library imo.

@reiddraper
Contributor Author

> Another option here might be to try to get `poolboy` changed to allow a pool to be populated in a more staggered fashion. Seems like a generally useful feature for a worker pool library imo.

Yeah, @andrewjstone and I talked briefly about that earlier today. I too think it'd be useful, though I'm not sure how difficult it might be. I have a couple of other [1] [2] poolboy issues I've been musing about too, so maybe we could throw the staggered start stuff in there too?

@reiddraper
Contributor Author

Alright, so recapping the outstanding issues raised during review:

  • 'More gracefully' handle disconnects in the block-server (still some open questions there)
  • Eliminate the start-up SYN flood where the gc and storage daemons see `{error, disconnected}` once on startup

Do we want to address either of these in this PR?

@kellymclaughlin
Contributor

> Do we want to address either of these in this PR?

I would say maybe to the former and no to the latter.
