Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too-large values of request_pool and put_concurrency yield 'disconnected' errors #616

Closed
slfritchie opened this issue Jul 16, 2013 · 11 comments

Comments

@slfritchie
Copy link
Contributor

Using Riak + Riak CS on my Mac in single node dev configuration, plus these changes to the Riak CS vm.args (-env ERL_MAX_PORTS 16000) and these to app.config:

--- rel/riak-cs/etc/app.config.aaa-start    2013-07-15 22:26:24.000000000 -0500
+++ rel/riak-cs/etc/app.config  2013-07-15 22:53:01.000000000 -0500
@@ -4,6 +4,8 @@
  %% Riak CS configuration
  {riak_cs, [
 {yessir_backend_in_use, true},
+{put_concurrency, 8},
+{put_buffer_factor, 16},

               %% == Basic Configuration ==

@@ -61,7 +63,7 @@
               %% tuple of {Name, {FixedSize, OverflowSize}}
               {connection_pools,
                [
-                {request_pool, {128, 0} },
+                {request_pool, {400, 0} },
                 {bucket_list_pool, {5, 0} }
                ]},

Then I see the following errors when running basho_bench with a 100% insert workload.

22:53:14.892 [error] gen_server <0.1786.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.892 [error] CRASH REPORT Process <0.1786.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.892 [error] gen_server <0.1787.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.892 [error] gen_server <0.1788.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.892 [error] CRASH REPORT Process <0.1787.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.893 [error] gen_server <0.1785.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.893 [error] CRASH REPORT Process <0.1788.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.893 [error] CRASH REPORT Process <0.1785.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.901 [error] Lager event handler error_logger_lager_h exited with reason {'EXIT',{{badmatch,["/buckets/test/objects/536",{error,{exit,{noproc,{gen_fsm,sync_send_event,[<0.1778.0>,{augment_data,<<"**************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************...">>},...]}},...}},...]},...}}
22:53:14.922 [error] gen_server <0.1747.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.923 [error] CRASH REPORT Process <0.1747.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.923 [error] gen_server <0.1716.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.923 [error] CRASH REPORT Process <0.1716.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:14.927 [error] gen_server <0.1688.0> terminated with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191
22:53:14.927 [error] CRASH REPORT Process <0.1688.0> with 0 neighbours exited with reason: no match of right hand value {error,disconnected} in riak_cs_block_server:handle_cast/2 line 191 in gen_server:terminate/6 line 747
22:53:15.209 [error] gen_event webmachine_log_handler installed in webmachine_log_event terminated with reason: bad argument in call to erlang:atom_to_list("***********************************************************************************************...") in webmachine_log_handler:format_req/1 line 113

... and lots of complaints from basho_bench:

22:53:04.384 [info] Errors:[{{insert,insert},250},{{{insert,insert},{put,{conn_failed,{error,econnrefused}}}},250}]
22:53:05.384 [info] Errors:[{{insert,insert},12},{{{insert,insert},{http_error,"500"}},10},{{{insert,insert},{http_error,"501"}},2}]
22:53:06.384 [info] Errors:[{{insert,insert},7},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},3}]
22:53:07.383 [info] Errors:[{{insert,insert},6},{{{insert,insert},{http_error,"403"}},4},{{{insert,insert},{http_error,"500"}},1},{{{insert,insert},{http_error,"501"}},1}]
22:53:08.383 [info] Errors:[{{insert,insert},7},{{{insert,insert},{http_error,"403"}},1},{{{insert,insert},{http_error,"500"}},5},{{{insert,insert},{http_error,"501"}},1}]
22:53:09.383 [info] Errors:[{{insert,insert},13},{{{insert,insert},{http_error,"403"}},1},{{{insert,insert},{http_error,"500"}},7},{{{insert,insert},{http_error,"501"}},5}]
22:53:10.383 [info] Errors:[{{insert,insert},10},{{{insert,insert},{http_error,"500"}},8},{{{insert,insert},{http_error,"501"}},2}]
22:53:11.384 [info] Errors:[{{insert,insert},5},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},1}]
22:53:12.383 [info] Errors:[{{insert,insert},9},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},5}]
22:53:13.384 [info] Errors:[{{insert,insert},4},{{{insert,insert},{http_error,"500"}},1},{{{insert,insert},{http_error,"501"}},3}]
22:53:14.384 [info] Errors:[{{insert,insert},7},{{{insert,insert},{http_error,"403"}},1},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},2}]
22:53:15.384 [info] Errors:[{{insert,insert},5},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},1}]
22:53:16.384 [info] Errors:[{{insert,insert},5},{{{insert,insert},{http_error,"500"}},5}]
22:53:17.385 [info] Errors:[{{insert,insert},79},{{{insert,insert},{http_error,"500"}},4},{{{insert,insert},{http_error,"501"}},1},{{{insert,insert},{put,{conn_failed,{error,econnrefused}}}},74}]

Here is the basho_bench config that I used:

%% Sample config file for Riak CS basho_bench benchmarking

%% About rate limiting via the `mode` setting: the `mode` rate is
%% valid for the basho_bench_driver_cs plugin for `get` operations.
%% The `mode` rate does *not* apply to `insert` operations.
%% The driver does *not* support `update` operations

{mode, max}.
%{mode, {rate,4}}.
{duration, 10}.
{concurrent, 15}.
{report_interval, 1}.

{driver, basho_bench_driver_cs}.

%% Replace this with a user you have created.
%% Instructions to create a user are here:
%% https://github.com/basho/riak_cs/wiki/Creating-a-User
{cs_access_key, "J2IP6WQQQ_FNGIAN9AFI"}.
{cs_secret_key, "mbB-1VQQQsrN0yLAUSpCFmXNNBpAC3X0lPmINA=="}.
{cs_bucket, "test"}. % create this with s3cmd before running basho_bench!
{cs_disconnect_frequency, 5}.    % # ops before disconnecting HTTP socket
{cs_raw_ip, "s3.amazonaws.com"}. % DO NOT CHANGE
{cs_raw_port, 80}.               % DO NOT CHANGE
%% Replace these with your HTTP proxy's location (i.e. Riak CS)
{cs_http_proxy_host, [{"localhost", 8080}, {"127.0.0.1", 8080}, {{127,0,0,1}, 8080}]}.
{cs_http_proxy_port, 8080}.
{cs_request_timeout, 999999000}.
% If using the cs_measurement_units option, you need to change
% any R graph's labels of the Y axis, e.g. basho_bench's Makefile target
% "make mbyte_sec-results"
{cs_measurement_units, mbyte_sec}.

%{key_generator, {int_to_str, {partitioned_sequential_int, 1000}}}.
{key_generator, {int_to_str, {uniform_int, 1000}}}.
%% See comments in source code for bigfile_valgen() function for full
%% explanation of the proplist below.
{value_generator, {function, basho_bench_driver_cs, bigfile_valgen,
                  [[{file_size, 8001001},
                  %%%%%%%%%% [[{file_size, 1001},
                    {ibrowse_chunk_size, 1000000},
                    {max_rate_per_chunk, 9999999}]]}}.

%% NOTE: It's not a good idea to mix insert & get ops in a single
%%       basho_bench instance with this driver.
%%       Use separate ones instead!
%% bad idea: {operations, [{insert, 1}, {get, 1}]}.
{operations, [{insert, 1}]}.
%{operations, [{get, 1}]}.
@shino
Copy link
Contributor

shino commented Aug 8, 2013

Just a memo:

  • reproduced at release/1.4 branch
    (commit 2fa8fbb)
  • could not reproduce at release/1.3 branch
    (commit 0aa57c7)

@ghost ghost assigned andrewjstone Aug 26, 2013
@andrewjstone
Copy link
Contributor

Increasing the pb_backlog to 64 in riak app.config is a workaround. It was introduced in this commit: 3a07c25

However, it's not actually a bug per se. This scenario can happen often enough. For example if riak dies for some reason or is unavailable to CS the same error will crop up. The best thing is probably to handle it and return a 503. We will be mumbling about it today.

@reiddraper
Copy link
Contributor

Does the test complete fine before 3a07c25?

@andrewjstone
Copy link
Contributor

My guess is no. In that case it probably crashes the proc as you suggest in that commit. I will give it a try though.

@andrewjstone
Copy link
Contributor

It actually completes with no errors on release/1.3

@andrewjstone
Copy link
Contributor

I didn't notice, but shino also confirmed that.

@andrewjstone
Copy link
Contributor

Steps to fix this type of issue:
- [ ] fix startup issues - can't do it, auto_reconnect always returns {ok, State} from riakc_pb_socket. Proceeding down the list :0

  • prevent non-connected sockets from crashing poolboy (return 503 as much as possible)
  • tell load balancer to re-route if we can't get connected sockets for poolboy procs
  • replace poolboy ?

@andrewjstone
Copy link
Contributor

New data: lowering the request_pool to 128 also does not result in this error. Debugging further, but it seems to be that the poolboy procs never actually fully hit 400 connections during intiialization.

<3 netstat

@andrewjstone
Copy link
Contributor

I found the issue finally. When you have a large number of connection attempts and a low backlog from the accept process the kernel sends RST packets that signal to riakc_pb_client to close the socket and re-attempt a connection. However, if you are attempting a much larger number than the backlog at the same time (thundering herd), the kernel actually rate limits RST packets in order to slow down attackers. Therefore all 400 processes think they are connected to riak, and only get a RST when they try to use the connection. What we would like to happen, when testing locally, is for the clients to know they are not connected, close the socket, and retry to connect. That way when the test is run, we don't get {error, disconnected}.

On my OSX machine the limit was set at 250, and hence 400 connections and a backlog of 5 causes this issue everytime. Log messasges show up in /var/log/system.log that look like the following:

kernel[0]: Limiting open port RST response from 296 to 250 packets per second

There are 2 ways to fix this issue.

  1. Increase the limit in the kernel with something like this: sudo sysctl -w net.inet.icmp.icmplim=800
  2. Set your pb_backlog in riak to something nice and big.

Issue closed. I'm out. Peace.

@slfritchie
Copy link
Contributor Author

Not to flog a dead horse or comment on a closed ticket...

... client connections between CS <-> Riak can fail at inconvenient times. The crazy OS BS that you found (and which I'd never suspected existed) is only one of those ways. IMO, CS needs to be behave better during those inconvenient cases. I suspect that the problems as yet unresolved by #519 covers a subset of "behave better".

Having said that, I've also created #660

@andrewjstone
Copy link
Contributor

Ha. @slfritchie I may have been a bit overzealous in closing this ticket, considering it was Friday evening before the long weekend :) However, there is still an open question of what to do here. I'm not convinced that trying to figure out every place where {error, disconnected} occurs and fixing it is the way to go. That goes doubly so after reading the source ;)

While it's simple enough to fix this for the obvious block_server error on the first block, as @reiddraper mentioned to me, if this happens on later blocks the header has already been sent. In this case a crash seems reasonable. I'll try to get some more discussion/consensus going and maybe I'll make some changes to handle these more gracefully. But in general, if your local riak instance isn't connected to CS something has gone horribly wrong. Putting much more effort into this seems to have diminishing returns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants