-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix the issue that health check may set liveness wrongly #1127
Conversation
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
@cfzjywxk PTAL |
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
/cc @crazycs520 PTAL |
newStore := &Store{storeID: s.storeID, addr: addr, peerAddr: store.GetPeerAddress(), saddr: store.GetStatusAddress(), storeType: storeType, labels: store.GetLabels(), state: uint64(resolved)} | ||
newStore.livenessState = atomic.LoadUint32(&s.livenessState) | ||
newStore.unreachableSince = s.unreachableSince |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about abstracting a newStore
function to force specify the livenessState
and unreachableSince
as the input parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about to handle it later (as a part of #1104). There are 4 &Store{...}
in region_cache.go now, I'd like to handle them togather according to the store lifecycle. Let this PR fix the corresponding issue by now.
@@ -2783,50 +2785,85 @@ func (s *Store) requestLivenessAndStartHealthCheckLoopIfNeeded(bo *retry.Backoff | |||
// It may be already started by another thread. | |||
if atomic.CompareAndSwapUint32(&s.livenessState, uint32(reachable), uint32(liveness)) { | |||
s.unreachableSince = time.Now() | |||
go s.checkUntilHealth(c) | |||
reResolveInterval := 30 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to make it a constant value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rawkv tests use failpoint to set it shorter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
var DefReResolveInterval = 30 * time.Second // global scope.
func (s *Store) checkUntilHealth(c *RegionCache, liveness livenessState){
reResolveInterval = DefReResolveInterval
...
}
and if test needs a shorter reResolveInterval, change DefReResolveInterval
directly, no need to use the following failpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
var DefReResolveInterval = 30 * time.Second // global scope. func (s *Store) checkUntilHealth(c *RegionCache, liveness livenessState){ reResolveInterval = DefReResolveInterval ... }and if test needs a shorter reResolveInterval, change
DefReResolveInterval
directly, no need to use the following failpoint.
Personally I do not prefer that. The global var is required to be handled carefully to avoid data race in ut. I've struggled with SetRegionCacheTTLSec
in #1122 for about an hour.
logutil.BgLogger().Info("[health check] store meta deleted, stop checking", zap.Uint64("storeID", s.storeID), zap.String("addr", s.addr)) | ||
if s.getResolveState() == deleted { | ||
// if the store is deleted, a new store with same id must be inserted (guaranteed by reResolve). | ||
newStore, _ := c.getStore(s.storeID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to add an assertion newStore != nil
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's guaranteed by reResolve
, changeToActiveStore
also rely on it.
Signed-off-by: zyguan <[email protected]>
* fix the issue that health check may set liveness wrongly Signed-off-by: zyguan <[email protected]> * fix lint issue Signed-off-by: zyguan <[email protected]> * fix rawkv ut Signed-off-by: zyguan <[email protected]> * fix data race Signed-off-by: zyguan <[email protected]> * use getStore instead of accessing storeMu directly Signed-off-by: zyguan <[email protected]> * make TestAccessFollowerAfter1TiKVDown stable Signed-off-by: zyguan <[email protected]> * make TestBackoffErrorType stable Signed-off-by: zyguan <[email protected]> * address comments Signed-off-by: zyguan <[email protected]> --------- Signed-off-by: zyguan <[email protected]> Co-authored-by: disksing <[email protected]> Signed-off-by: zyguan <[email protected]>
* fix the issue that health check may set liveness wrongly * fix lint issue * fix rawkv ut * fix data race * use getStore instead of accessing storeMu directly * make TestAccessFollowerAfter1TiKVDown stable * make TestBackoffErrorType stable * address comments --------- Signed-off-by: zyguan <[email protected]> Co-authored-by: disksing <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
close #1111