Some of vtgate's healthcheck flags have been defined as dynamic, because they are expected to be settable at runtime from /debug/env. However, the implementation of dynamic config is broken. It does not respect flags passed on the command line, and it does not use the defaults specified in code. It always falls back to the zero value (0) of the config's type.
The symptom is that vtgate's healthcheck ends up with no healthy REPLICA tablets in its list, because minNumTablets is set to 0. Users end up getting errors like this from @replica queries:
1105: target: commerce.-.replica: no healthy tablet available for 'keyspace:"commerce" shard:"-" tablet_type:REPLICA'
Credit to @aquarapid for figuring out that the problem was with viper/flag handling.
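The failure mode described above can be sketched in a few lines of Go. This is not the Vitess or viper code. The `registry` type, its two layers, and the default of 2 are hypothetical, chosen only to show how a dynamic getter that ignores the static layer always yields the zero value.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a config library with two layers:
// a static layer holding code defaults and command-line values, and a live
// layer intended for runtime overrides (for example from /debug/env).
type registry struct {
	static map[string]int
	live   map[string]int
}

func newRegistry() *registry {
	return &registry{static: map[string]int{}, live: map[string]int{}}
}

// setStatic records a code default or a command-line value.
func (r *registry) setStatic(key string, v int) { r.static[key] = v }

// getDynamicBroken reads only the live layer. If nothing was set at
// runtime, the map lookup returns the zero value of int, which is 0.
func (r *registry) getDynamicBroken(key string) int { return r.live[key] }

// getDynamicFixed prefers a runtime override and otherwise falls back to
// the static layer, so defaults and command-line values are respected.
func (r *registry) getDynamicFixed(key string) int {
	if v, ok := r.live[key]; ok {
		return v
	}
	return r.static[key]
}

func main() {
	r := newRegistry()
	r.setStatic("minNumTablets", 2) // illustrative default, not taken from the issue

	fmt.Println(r.getDynamicBroken("minNumTablets")) // 0
	fmt.Println(r.getDynamicFixed("minNumTablets"))  // 2
}
```

The point of the sketch is only the lookup order: a dynamic value needs to fall back to the static layer when no runtime override exists.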
Reproduction Steps
This is non-trivial to reproduce. Local testing did not run into the issue, because locally there is no load and replication lag is always 0. In addition, the code always returns the replica when there is only one.
So it is necessary to run with at least 2 replicas, preferably more, and with significant replica query load.
On a system with load, deploying Vitess 17.0.0+ will throw replica query errors.
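To illustrate why the problem only appears with several replicas under load, here is a deliberately simplified lag filter. It is not the logic in replicationlag.go. The function `filterByLag`, its parameters, and the sample lag values are assumptions made for this sketch, including the assumption that the low-lag threshold also ends up as 0.

```go
package main

import (
	"fmt"
	"sort"
)

// tablet is a hypothetical, minimal view of a replica and its lag.
type tablet struct {
	name       string
	lagSeconds int
}

// filterByLag keeps tablets whose lag is within lowLagSeconds, then tops
// the result up with the least-lagged remaining tablets until at least
// minNumTablets are kept. A single tablet is always returned as-is.
func filterByLag(tablets []tablet, lowLagSeconds, minNumTablets int) []tablet {
	if len(tablets) <= 1 {
		return tablets
	}
	var kept, rest []tablet
	for _, t := range tablets {
		if t.lagSeconds <= lowLagSeconds {
			kept = append(kept, t)
		} else {
			rest = append(rest, t)
		}
	}
	// Top up from the least-lagged tablets that were filtered out.
	sort.Slice(rest, func(i, j int) bool { return rest[i].lagSeconds < rest[j].lagSeconds })
	for len(kept) < minNumTablets && len(rest) > 0 {
		kept = append(kept, rest[0])
		rest = rest[1:]
	}
	return kept
}

func main() {
	idle := []tablet{{"r1", 0}, {"r2", 0}}              // local test: no load, no lag
	loaded := []tablet{{"r1", 3}, {"r2", 1}, {"r3", 7}} // under load: every replica lags

	fmt.Println(len(filterByLag(idle, 0, 0)))       // 2: zeroed settings are harmless without lag
	fmt.Println(len(filterByLag(loaded, 0, 0)))     // 0: zeroed settings drop every lagging replica
	fmt.Println(len(filterByLag(loaded, 0, 2)))     // 2: a non-zero minimum keeps the least-lagged two
	fmt.Println(len(filterByLag(loaded[:1], 0, 0))) // 1: a lone replica is always returned
}
```

In this model, zeroed settings are invisible when lag is 0 or when there is a single replica, which matches the observation that local testing did not reproduce the errors.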
I added logging in replicationlag.go which exposed the problem.
There's a workaround for the specific vtgate/healthcheck issue, which is to set --legacy_replication_lag_algorithm=false. That is a static flag (default true), and setting it mitigates the issue to some extent. However, tablets with even 1s of lag will still not be used.