RFC: Handling failure to wait for tablet types specified with `--tablet_types_to_wait` #17412

ejortegau · 2024-12-19T13:00:34Z

Question

Introduction

This is an RFC to discuss the behavior of vtgate's --tablet_types_to_wait flag. It starts by describing the current behavior, then discusses an issue we have experienced with it. Finally, it describes proposed changes to address the issue.

The intention of the RFC is to gain information on whether other community members share concerns about or have experienced the issue, and whether they agree with the proposal to address it; or whether they have alternate proposals.

Current behavior

When vtgate is started with --tablet_types_to_wait, its Init() creates a tablet gateway struct and then calls TabletGateway.WaitForTablets() on that struct. If the underlying work done by this function fails with a context deadline exceeded error, a warning is logged but the error is cleared. This means that vtgate's Init() is unaware that WaitForTablets() failed to find healthy tablets of the right types for all keyspaces/shards, and it proceeds to work normally.

Under normal circumstances, this is not a problem, because retrieving the list of Targets is fast. However, we have seen that this can fail under some circumstances. In particular, when the underlying topology service is overloaded, calls to it take too long. If the whole process exceeds the time specified with --gateway_initial_tablet_timeout, TabletGateway.WaitForTablets() hits a context deadline exceeded error. Its deferred function handles this by simply logging a warning and clearing the error. As a result, vtgate's Init() is unaware that TabletGateway.WaitForTablets() actually failed to find healthy tablets of the right types for all keyspaces/shards, and it continues to start up normally.

Note that we saw the above behavior during a topo overload, but other situations (e.g. network issues) might also lead to context deadline exceeded and therefore to the same behavior.

Issues with the current behavior

As described above, if a vtgate fails to get healthy tablets for all targets, it still joins service after waiting for --gateway_initial_tablet_timeout. As soon as such a vtgate receives a query for one of the shard/tablet type combinations it has not yet gotten healthy tablets for, the query errors with something like Execute: target: <keyspace>.<shard>.<tablet_type>: no healthy tablet available for 'keyspace:"<keyspace>" shard:"<shard>" tablet_type:<tablet_type>', causing errors that are visible to the client application. The errors persist until the vtgate eventually manages to get a healthy tablet of the right type - or is taken out of service.

This is only an issue during vtgate initialization; already running vtgates are not affected.

Proposal

Our proposal is that a vtgate that fails to get healthy tablets of the right types for keyspaces/shards that are known to have tablets should not join service. This could be implemented in a number of ways, but we should be careful to distinguish two different scenarios:

1. A keyspace/shard has no tablets (e.g. the shard exists in the topology, but no tablets exist for it).
2. A keyspace/shard has tablets but the vtgate has not been able to get healthy ones during its initialization.

In the first scenario, vtgate should be able to join service. A keyspace/shard with no tablets can be the result of a decommissioned keyspace/shard, for which all vttablets were removed but the topo record for the keyspace/shard was not deleted. In this case, joining service despite the failure to get healthy tablets does not lead to an issue (or at least, not any issue that was not already present on any pre-existing vtgates).

In the second scenario, vtgate should not start serving until it manages to get healthy tablets - even if that means hanging out forever. Otherwise, any queries it gets for the targets it's missing will fail.

For that, we propose that vtgates determine whether they need to wait for targets of a particular keyspace/shard by looking at an attribute in the topo record of the shard (let's temporarily call it has_tablets, though it could be named something else). When a shard is created, the attribute will be set to false. It will only be set to true when a vttablet process is started up for that particular keyspace/shard. It will be set back to false when DeleteTablet is issued for the last tablet in the keyspace/shard.
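The lifecycle of the proposed attribute can be sketched as follows. Everything here is hypothetical: shardRecord, registerTablet and deleteTablet are illustrative names rather than Vitess APIs, and the in-memory set of tablet aliases exists only so the sketch can detect the "last tablet" case. How the real topo code would detect that is an open implementation detail.

```go
package main

import "fmt"

// shardRecord is a hypothetical, simplified stand-in for a shard's topo record.
type shardRecord struct {
	Keyspace   string
	Shard      string
	HasTablets bool // the proposed attribute; the name is not final

	// tablets holds tablet aliases. It is an assumption of this sketch,
	// used only to know when the last tablet has been deleted.
	tablets map[string]struct{}
}

// newShard creates a shard record; HasTablets starts out false.
func newShard(keyspace, shard string) *shardRecord {
	return &shardRecord{Keyspace: keyspace, Shard: shard, tablets: map[string]struct{}{}}
}

// registerTablet models a vttablet process starting up for this shard.
func (s *shardRecord) registerTablet(alias string) {
	s.tablets[alias] = struct{}{}
	s.HasTablets = true
}

// deleteTablet models DeleteTablet; removing the last tablet resets the flag.
func (s *shardRecord) deleteTablet(alias string) {
	delete(s.tablets, alias)
	if len(s.tablets) == 0 {
		s.HasTablets = false
	}
}

func main() {
	s := newShard("commerce", "-80")
	fmt.Println(s.HasTablets) // false: shard just created

	s.registerTablet("zone1-100")
	s.registerTablet("zone1-101")
	fmt.Println(s.HasTablets) // true: vttablets have started

	s.deleteTablet("zone1-100")
	fmt.Println(s.HasTablets) // true: one tablet remains

	s.deleteTablet("zone1-101")
	fmt.Println(s.HasTablets) // false: last tablet deleted
}
```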

With the above, we would suggest that vtgate's init waiting for tablets works as follows:

Fetch the targets, filtering out the ones whose shards have has_tablets set to false.
TabletGateway.WaitForTablets() would not clear the context deadline error so that vtgate's init knows when it failed to get all targets. If there are concerns with this behavior change, it could be controlled via a new, opt-in flag.

We look forward to your input.

ejortegau added Type: RFC Request For Comment Component: VTGate labels Dec 19, 2024
