pd panic after fault recover from time offset 5mins #8049

Closed
Lily2025 opened this issue Apr 10, 2024 · 2 comments · Fixed by #8050
Labels: affects-8.1 (This bug affects the 8.1.x (LTS) versions.), severity/critical, type/bug (The issue is confirmed as a bug.)

Lily2025 commented Apr 10, 2024

Bug Report

What did you do?

1. Run TPC-C (20,000 warehouses) with 32 threads.
2. Inject a 5-minute time offset on the PD leader.
3. After 10 minutes, recover from the fault.

What did you expect to see?

no panic

What did you see instead?

About 5 minutes after the fault was recovered, PD panicked. PD log, oldest entries first:
[2024-04-09 11:30:14]
panic: runtime error: index out of range [1] with length 1

goroutine 1888645 [running]:
github.com/tikv/pd/pkg/schedule/operator.(*randBuckets).GetOperator(0xc00c7df8c0)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/schedule/operator/waiting_operator.go:101 +0x365
github.com/tikv/pd/pkg/schedule/operator.(*Controller).PromoteWaitingOperator(0xc00fa80000)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/schedule/operator/operator_controller.go:382 +0x3d
github.com/tikv/pd/pkg/schedule/operator.(*Controller).Dispatch(0xc00fa80000, 0xc00eb05300, {0x300c8c2, 0x9}, 0xc00c38ad70)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/schedule/operator/operator_controller.go:159 +0x875
github.com/tikv/pd/server/cluster.(*RaftCluster).HandleRegionHeartbeat(0xc00056b680, 0xc00388cdd0?)
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster_worker.go:51 +0x1af
github.com/tikv/pd/server.(*GrpcServer).RegionHeartbeat(0xc001786ae0, {0x3c556d0?, 0xc00f561c90})
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/grpc_service.go:1296 +0xf72
github.com/pingcap/kvproto/pkg/pdpb._PD_RegionHeartbeat_Handler({0x2ff7a80?, 0xc001786ae0}, {0x3c4e7f0?, 0xc00d202db0})
    /go/pkg/mod/github.com/pingcap/[email protected]/pkg/pdpb/pdpb.pb.go:9777 +0x94
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).StreamServerInterceptor.func4({0x2ff7a80, 0xc001786ae0}, {0x3c4f060?, 0xc004c51860}, 0xc00d202d98, 0x39716b8)
    /go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:121 +0xd2
go.etcd.io/etcd/etcdserver/api/v3rpc.Server.ChainStreamServer.func9.1({0x2ff7a80?, 0xc001786ae0?}, {0x3c4f060?, 0xc004c51860?})
    /go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:83 +0x45
go.etcd.io/etcd/etcdserver/api/v3rpc.newStreamInterceptor.func1({0x2ff7a80, 0xc001786ae0}, {0x3c4f060, 0xc004c51860}, 0xc00d202d98, 0xc0070a9a00)
    /go/pkg/mod/go.etcd.io/[email protected]/etcdserver/api/v3rpc/interceptor.go:238 +0x479
go.etcd.io/etcd/etcdserver/api/v3rpc.Server.ChainStreamServer.func9({0x2ff7a80, 0xc001786ae0}, {0x3c4f060, 0xc004c51860}, 0xc00d202d98, 0xc00f561c40?)
    /go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:86 +0x135
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0015f0c00, {0x3c49788, 0xc00667b770}, {0x3c56cc0, 0xc005c5fa00}, 0xc0070ca5a0, 0xc00124fe90, 0x5086f20, 0x0)
    /go/pkg/mod/google.golang.org/[email protected]/server.go:1687 +0x1267
google.golang.org/grpc.(*Server).handleStream(0xc0015f0c00, {0x3c56cc0, 0xc005c5fa00}, 0xc0070ca5a0)
    /go/pkg/mod/google.golang.org/[email protected]/server.go:1801 +0xfbb
google.golang.org/grpc.(*Server).serveStreams.func2.1()
    /go/pkg/mod/google.golang.org/[email protected]/server.go:1027 +0x8b
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 1888642
    /go/pkg/mod/google.golang.org/[email protected]/server.go:1038 +0x135

[2024-04-09 11:30:16]
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   tc-pd-1.tc-pd-peer.ha-test-lightning-tps-7503715-1-618.svc.cluster.local
Address: 10.233.88.210

nslookup domain tc-pd-1.tc-pd-peer.ha-test-lightning-tps-7503715-1-618.svc.svc success
starting pd-server ...

[2024-04-09 11:30:20]
/pd-server --data-dir=/var/lib/pd/data --name=tc-pd-1 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://tc-pd-1.tc-pd-peer.ha-test-lightning-tps-7503715-1-618.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://tc-pd-1.tc-pd-peer.ha-test-lightning-tps-7503715-1-618.svc:2379 --config=/etc/pd/pd.toml

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v8.1.0-alpha
Edition: Community
Git Commit Hash: 726b81f
Git Branch: heads/refs/tags/v8.1.0-alpha
UTC Build Time: 2024-04-08 11:37:36
2024-04-09T01:44:40.309+0800
./tidb-server -V
Release Version: v8.1.0-alpha

Lily2025 added the type/bug label Apr 10, 2024

Lily2025 (Author) commented

/type bug
/severity critical

rleungx (Member) commented Apr 10, 2024

Previously, we added the two merge operators at the same time. After #8032, we might add them one by one. If the first merge operator is promoted before the second one has been added, PD might panic.
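
To make the failure mode concrete, here is a minimal, self-contained sketch of it. It is not PD's actual code: `op`, `waitingQueue`, `put`, `promoteUnchecked`, and `tryPromote` are hypothetical names invented for illustration. The only assumption carried over from the report is that promotion expects a merge operator to be followed immediately by its partner, which is consistent with the `index out of range [1] with length 1` panic in the stack trace.

```go
package main

import "fmt"

// op is a hypothetical stand-in for a scheduling operator.
type op struct {
	desc    string
	isMerge bool // merge operators are expected to come in pairs
}

// waitingQueue is a simplified, hypothetical waiting list (not PD's randBuckets).
type waitingQueue struct {
	ops []op
}

// put appends all given operators in a single step.
func (q *waitingQueue) put(ops ...op) {
	q.ops = append(q.ops, ops...)
}

// promoteUnchecked pops the next operator. For a merge operator it also pops
// the partner, assuming the partner has already been queued.
func (q *waitingQueue) promoteUnchecked() []op {
	if len(q.ops) == 0 {
		return nil
	}
	res := []op{q.ops[0]}
	if q.ops[0].isMerge {
		res = append(res, q.ops[1]) // panics if the partner is not queued yet
		q.ops = q.ops[2:]
	} else {
		q.ops = q.ops[1:]
	}
	return res
}

// tryPromote calls promoteUnchecked and turns a panic into an error so that
// this demo can keep running.
func tryPromote(q *waitingQueue) (res []op, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("promotion panicked: %v", r)
		}
	}()
	return q.promoteUnchecked(), nil
}

func main() {
	src := op{desc: "merge-region (source)", isMerge: true}
	dst := op{desc: "merge-region (target)", isMerge: true}

	// One by one: promotion runs between the two puts and finds half a pair.
	oneByOne := &waitingQueue{}
	oneByOne.put(src)
	_, err := tryPromote(oneByOne)
	fmt.Println("one by one:", err)
	oneByOne.put(dst)

	// Together: both operators are queued in one step, so promotion sees the pair.
	together := &waitingQueue{}
	together.put(src, dst)
	res, err := tryPromote(together)
	fmt.Println("together:", len(res), "operators promoted, err =", err)
}
```

In this sketch, queueing both merge operators in one `put` call is what keeps promotion from ever observing half of a pair. Whether the real fix takes this shape is an assumption; the sketch only illustrates the ordering problem described above.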

ti-chi-bot bot added a commit that referenced this issue Apr 11, 2024
ref #7897, close #8049

pkg/schedule: put merge operators together to maintain atomicity

Signed-off-by: nolouch <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
rleungx added the affects-8.1 label Apr 15, 2024
ti-chi-bot bot pushed a commit that referenced this issue Apr 15, 2024
ref #7897, close #8049

pkg/schedule: put merge operators together to maintain atomicity

Signed-off-by: nolouch <[email protected]>

Co-authored-by: nolouch <[email protected]>