TestGetTSOImmediately is flaky #8533

Closed
lhy1024 opened this issue Aug 15, 2024 · 3 comments · Fixed by #8597
Labels
type/ci The issue is related to CI.

Comments

@lhy1024
Contributor

lhy1024 commented Aug 15, 2024

Flaky Test

Which jobs are failing

2024-08-15T06:17:50.3934438Z     testutil.go:67: 
2024-08-15T06:17:50.3935833Z         	Error Trace:	/home/runner/work/pd/pd/pkg/utils/testutil/testutil.go:67
2024-08-15T06:17:50.3938318Z         	            				/home/runner/work/pd/pd/tests/integrations/mcs/tso/keyspace_group_manager_test.go:772
2024-08-15T06:17:50.3939789Z         	Error:      	Condition never satisfied
2024-08-15T06:17:50.3940870Z         	Test:       	TestGetTSOImmediately

CI link

https://github.com/tikv/pd/actions/runs/10399520880/job/28798931505

Reason for failure (if possible)

Anything else

lhy1024 added the type/ci label Aug 15, 2024
@rleungx
Member

rleungx commented Sep 5, 2024

Screenshot 2024-09-05 at 14 53 13

One participant seems to be resetting due to the priority, while another participant keeps skipping campaigning due to the expected primary.

@HuSharp
Member

HuSharp commented Sep 5, 2024

There are two TSO participants, 36295 and 40455.

  1. 36295 is elected as primary.
  2. 40455's priority is then set higher. Once the priority check notices this, it starts a new election.
  3. 40455 skips the campaign because the Expected Primary has a value.
  4. At this point, 36295 also exits the primary and deletes the Expected Primary.
  5. 36295 then campaigns faster and is elected again (which amplifies the time difference mentioned above).
    The cycle then repeats.
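
The cycle above can be sketched as a small deterministic model. This is only an illustration, not PD's code: the `participant` type, the `campaignDelay` field, and both functions are hypothetical names, and "whoever campaigns first wins" is a simplifying assumption about the election.

```go
package main

import "fmt"

// participant is a hypothetical stand-in for a TSO participant.
type participant struct {
	name          string
	priority      int
	campaignDelay int // arbitrary time units before it can campaign (assumed)
}

// electAfterReset models one reset-based round: the primary resigns and
// whoever campaigns first wins. Ties go to the old primary.
func electAfterReset(oldPrimary, secondary participant) participant {
	if secondary.campaignDelay < oldPrimary.campaignDelay {
		return secondary
	}
	return oldPrimary
}

// resetsUntilStable counts how many resets the priority check triggers before
// the higher-priority participant holds the primary, giving up at maxResets.
func resetsUntilStable(primary, secondary participant, maxResets int) (resets int, stable bool) {
	for resets = 0; resets < maxResets; resets++ {
		if primary.priority >= secondary.priority {
			return resets, true
		}
		if winner := electAfterReset(primary, secondary); winner.name == secondary.name {
			primary, secondary = secondary, primary
		}
	}
	return resets, primary.priority >= secondary.priority
}

func main() {
	old := participant{name: "36295", priority: 0, campaignDelay: 1}
	slow := participant{name: "40455", priority: 1, campaignDelay: 2}
	fast := participant{name: "40455", priority: 1, campaignDelay: 0}

	// The old primary campaigns faster, so every reset re-elects it.
	fmt.Println(resetsUntilStable(old, slow, 5))
	// The secondary campaigns faster, so a single reset is enough.
	fmt.Println(resetsUntilStable(old, fast, 5))
}
```

In this model the outcome depends entirely on which participant campaigns first, which matches the timing sensitivity described in the steps.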

@HuSharp
Member

HuSharp commented Sep 5, 2024

Root cause

Furthermore, the root cause is that the TSO priority check uses ResetLeader instead of moveLeader.

The tso priority checker process is:

  • secondary: primary checks priority -> secondary exits watch -> secondary enters campaign -> secondary is elected as the new leader
  • primary: primary checks priority -> primary finds out there is no primary and exits the campaign.

The secondary is elected as the new primary only because of a timing gap, which is not reliable.

For example, if the secondary's I/O jitters and it is not elected as the new primary, the old primary is elected again, and the priority check logic loops once more.

The better solution is

  • Transferring the primary is effectively the same as moving the etcd leader.
  • So replace ResetLeader with transfer primary, which is more time-efficient and avoids the timing gap.
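
A toy comparison of the two approaches, to show why naming the target removes the dependence on campaign order. This is a sketch under stated assumptions, not PD's implementation: the `store` type and the `campaign`, `resetPrimary`, and `transferPrimary` methods are hypothetical, and the rule that a participant skips campaigning when the expected primary names someone else is an assumption drawn from the discussion above.

```go
package main

import "fmt"

// store is a toy stand-in for the shared election state (not PD's real types).
type store struct {
	primary         string
	expectedPrimary string
}

// campaign lets name become primary if the seat is empty. Assumption: a
// participant skips the campaign when an expected primary is set and it is
// someone else.
func (s *store) campaign(name string) bool {
	if s.primary != "" {
		return false
	}
	if s.expectedPrimary != "" && s.expectedPrimary != name {
		return false
	}
	s.primary = name
	return true
}

// resetPrimary clears the state and lets participants campaign in the given
// order, so the first one to campaign wins.
func (s *store) resetPrimary(campaignOrder []string) string {
	s.primary, s.expectedPrimary = "", ""
	for _, name := range campaignOrder {
		s.campaign(name)
	}
	return s.primary
}

// transferPrimary records the target before the old primary steps down, so
// the result no longer depends on who campaigns first.
func (s *store) transferPrimary(target string, campaignOrder []string) string {
	s.expectedPrimary = target
	s.primary = ""
	for _, name := range campaignOrder {
		s.campaign(name)
	}
	return s.primary
}

func main() {
	// The old primary 36295 happens to campaign first in both cases.
	order := []string{"36295", "40455"}

	s := &store{primary: "36295"}
	fmt.Println("reset:   ", s.resetPrimary(order))

	s = &store{primary: "36295"}
	fmt.Println("transfer:", s.transferPrimary("40455", order))
}
```

With the reset path the old primary wins whenever it campaigns first; with the transfer path the named target wins regardless of order.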

@ti-chi-bot ti-chi-bot bot closed this as completed in #8597 Sep 9, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in 8733b55 Sep 9, 2024