PD panics when list resource-group with some resource group defined. #7206

Closed
AndreMouche opened this issue Oct 16, 2023 · 12 comments · Fixed by #7623
Labels
affects-7.1: This bug affects the 7.1.x (LTS) versions.
affects-7.5: This bug affects the 7.5.x (LTS) versions.
report/customer: Customers have encountered this bug.
severity/major
type/bug: The issue is confirmed as a bug.

Comments

@AndreMouche
Member

Bug Report

What did you do?

Create some resource groups and try to list them.

What did you expect to see?

No panic; all resource groups are returned.

What did you see instead?

PD panics:

panic: json: unsupported value: NaN

goroutine 331601 [running]:
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*ResourceGroup).Copy(0x40179af5d8?)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/resource_group.go:68 +0x13c
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Manager).GetResourceGroupList(0x4000511ec0)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/manager.go:245 +0x124
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Service).ListResourceGroups(0x4000209638?, {0x402a3abc80?, 0x3?}, 0x3?)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/grpc_service.go:114 +0x74
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_ListResourceGroups_Handler.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0?, 0x4035da1b80})
    /root/go/pkg/mod/github.com/pingcap/[email protected]/pkg/resource_manager/resource_manager.pb.go:1868 +0x74
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:31 +0x9c
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x20?, 0x401cdd8e10)
    /root/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x74
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0x74
go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0?, 0x4035da1b80}, 0x3366471900000000?, 0x401cdd8e10)
    /root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/api/v3rpc/interceptor.go:70 +0x2c4
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0x74
go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x4035da1ba0, 0x401cdd8e10)
    /root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/api/v3rpc/interceptor.go:77 +0x80
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x4035da1ba0, 0x40281687c8)
    /root/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:39 +0x17c
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_ListResourceGroups_Handler({0x29e8040?, 0x4000209638}, {0x3891a90, 0x402a3abb30}, 0x4035caaae0, 0x400157c1b0)
    /root/go/pkg/mod/github.com/pingcap/[email protected]/pkg/resource_manager/resource_manager.pb.go:1870 +0x12c
google.golang.org/grpc.(*Server).processUnaryRPC(0x4001cf2480, {0x38a10e0, 0x402dc29800}, 0x400cd0c300, 0x400243f590, 0x48e45c0, 0x0)
    /root/go/pkg/mod/google.golang.org/[email protected]/server.go:1024 +0xb18
google.golang.org/grpc.(*Server).handleStream(0x4001cf2480, {0x38a10e0, 0x402dc29800}, 0x400cd0c300, 0x0)
    /root/go/pkg/mod/google.golang.org/[email protected]/server.go:1313 +0x854
google.golang.org/grpc.(*Server).serveStreams.func1.1()
    /root/go/pkg/mod/google.golang.org/[email protected]/server.go:722 +0x84
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /root/go/pkg/mod/google.golang.org/[email protected]/server.go:720 +0xdc
Stream closed EOF for prod-tidb/prod-redash-pd-0 (pd)

Deleting all resource groups stops the panics.

What version of PD are you using (pd-server -V)?

v7.1.0

@hongshaoyang

I am facing this issue in one of our TiDB 7.1 clusters.

Looking at the stacktrace, could it be related to Prometheus scraping of metrics?

@hongshaoyang

hongshaoyang commented Dec 26, 2023

This is the snippet causing the issue. json.Marshal fails to serialize the ResourceGroup struct; the panic message reports an unsupported NaN value.

func (rg *ResourceGroup) Copy() *ResourceGroup {
	// TODO: use a better way to copy
	rg.RLock()
	defer rg.RUnlock()
	res, err := json.Marshal(rg)
	if err != nil {
		panic(err)
	}
	var newRG ResourceGroup
	err = json.Unmarshal(res, &newRG)
	if err != nil {
		panic(err)
	}
	return &newRG
}
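
To illustrate why this round trip fails, here is a minimal, self-contained sketch. It assumes a hypothetical `tokenState` struct standing in for the float-valued state of a resource group (it is not the real PD type), and `copyViaJSON` and `isUnsupportedValue` are illustrative names. The only library behavior relied on is that `encoding/json` returns a `*json.UnsupportedValueError` for NaN and ±Inf.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"math"
)

// tokenState is a hypothetical stand-in for the float-valued state inside a
// resource group. It is NOT the real PD struct.
type tokenState struct {
	Tokens float64 `json:"tokens"`
}

// copyViaJSON mirrors the marshal/unmarshal round trip used by Copy, but
// returns the error to the caller instead of panicking.
func copyViaJSON(src *tokenState) (*tokenState, error) {
	data, err := json.Marshal(src)
	if err != nil {
		return nil, err
	}
	var dst tokenState
	if err := json.Unmarshal(data, &dst); err != nil {
		return nil, err
	}
	return &dst, nil
}

// isUnsupportedValue reports whether err is the encoding/json error raised for
// values that JSON cannot represent, such as NaN or ±Inf.
func isUnsupportedValue(err error) bool {
	var uve *json.UnsupportedValueError
	return errors.As(err, &uve)
}

func main() {
	// A finite value survives the round trip.
	ok, err := copyViaJSON(&tokenState{Tokens: 14000})
	fmt.Println(ok, err)

	// A NaN value makes json.Marshal fail, which Copy turns into a panic.
	_, err = copyViaJSON(&tokenState{Tokens: math.NaN()})
	fmt.Println(err, isUnsupportedValue(err))
}
```

Returning the error rather than calling panic would let a caller such as a list handler fail one request instead of crashing the whole process, although it would not explain where the NaN came from.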

@CabinfeverB
Member

func (rg *ResourceGroup) Copy() *ResourceGroup {

Yes, but we don't know how some fields change to NaN.
Can you help reproduce it?
cc @nolouch
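
As background on how a float64 field can become NaN at all, here is a small sketch of a few generic IEEE 754 operations that produce NaN. None of these expressions are taken from PD's code, and nothing in this thread establishes which of them, if any, actually occurred; `nanSources` and `isFinite` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// nanSources returns a few float64 expressions that evaluate to NaN under
// IEEE 754. They are generic examples, not operations taken from PD.
func nanSources() map[string]float64 {
	zero := 0.0
	inf := math.Inf(1)
	return map[string]float64{
		"0 / 0":     zero / zero,
		"Inf - Inf": inf - inf,
		"0 * Inf":   zero * inf,
		"Inf / Inf": inf / inf,
	}
}

// isFinite reports whether v is neither NaN nor ±Inf, i.e. whether it can be
// encoded by encoding/json.
func isFinite(v float64) bool {
	return !math.IsNaN(v) && !math.IsInf(v, 0)
}

func main() {
	for expr, v := range nanSources() {
		fmt.Printf("%-10s -> %v (finite: %t)\n", expr, v, isFinite(v))
	}
}
```

Once a NaN appears it propagates through further arithmetic, so a single bad update to a stored value would keep that value NaN until it is overwritten.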

@CabinfeverB
Member

cc @glorv

@hongshaoyang

Yes, but we don't know how some fields change to NaN. Can you help reproduce it? cc @nolouch

Yes, sure, here is the list of resource groups that we used.
[Image: Screenshot 2023-12-25 at 8 46 40 PM]

@CabinfeverB
Member

@hongshaoyang After PD panics, does it panic again when you list the resource groups again?

@hongshaoyang

hongshaoyang commented Dec 26, 2023

@hongshaoyang After PD panics, does it panic again when you list the resource groups again?

@CabinfeverB Yes, PD panics again, repeatedly. The TiDB cluster is deployed on Kubernetes, and the PD pods keep going into CrashLoopBackOff. The logs show the same stack trace. This suggests that some background process is listing the resource groups repeatedly.

It is not a person listing the resource groups: the PD pods crashed outside office hours, when nobody was changing the resource groups or their configurations.

@nolouch
Contributor

nolouch commented Dec 27, 2023

@hongshaoyang
How often does it panic? Could you help us export some data with this command:

curl -sl  http://{pd-leader-ip}:{pd-port}/resource-manager/api/v1/config/groups | jq ".[].r_u_settings" > data.json
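
For anyone without jq at hand, the filter `.[].r_u_settings` can be approximated in Go. This is a rough sketch that assumes the endpoint returns a JSON array of objects, each carrying an `r_u_settings` key, which is what the jq expression implies. `extractRUSettings` is a hypothetical helper, and the `name` field in the sample body is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractRUSettings approximates the jq filter ".[].r_u_settings": it decodes
// a JSON array of objects and collects the raw "r_u_settings" value of each.
// Unlike jq, it skips objects that lack the key instead of emitting null.
func extractRUSettings(body []byte) ([]json.RawMessage, error) {
	var groups []map[string]json.RawMessage
	if err := json.Unmarshal(body, &groups); err != nil {
		return nil, fmt.Errorf("decode resource groups: %w", err)
	}
	out := make([]json.RawMessage, 0, len(groups))
	for _, g := range groups {
		if settings, ok := g["r_u_settings"]; ok {
			out = append(out, settings)
		}
	}
	return out, nil
}

func main() {
	// Sample body shaped like the assumed response; "name" is illustrative.
	body := []byte(`[{"name":"rg1","r_u_settings":{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"initialized":false}}}}]`)

	settings, err := extractRUSettings(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range settings {
		fmt.Println(string(s))
	}
}
```

In practice the body would come from an HTTP GET against the PD leader, as in the curl command above.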

@hongshaoyang

Here is the r_u_settings data:

{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"initialized":false}}}
{"r_u":{"settings":{"fill_rate":2147483647,"burst_limit":-1},"state":{"tokens":29860685960413220,"last_update":"2023-12-27T08:19:16.269363735Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:17.269332808Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:05.143659794Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":-28216.64129750421,"last_update":"2023-12-27T08:19:16.4862813Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:17.420912119Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":1163.6089377586882,"last_update":"2023-12-27T08:19:15.252112524Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:10.78038052Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":11678.85839950952,"last_update":"2023-12-27T08:19:16.269380275Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:09.270797923Z","initialized":true}}}

@hongshaoyang

hongshaoyang commented Dec 27, 2023

#7206 (comment)

It panics every 5-8 days; I am not sure why it occurs so infrequently. The only workaround we have found is to drop all resource groups.

ti-chi-bot bot pushed a commit that referenced this issue Dec 27, 2023
close #7206

resource_mananger: deep clone resource group

Signed-off-by: nolouch <[email protected]>

Co-authored-by: tongjian <[email protected]>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Dec 27, 2023
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Dec 27, 2023
@nolouch nolouch reopened this Dec 27, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jan 2, 2024
close #7206

resource_mananger: deep clone resource group

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: nolouch <[email protected]>

Co-authored-by: ShuNing <[email protected]>
Co-authored-by: nolouch <[email protected]>
ti-chi-bot bot pushed a commit that referenced this issue Jan 3, 2024
close #7206

resource_mananger: deep clone resource group

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: nolouch <[email protected]>

Co-authored-by: ShuNing <[email protected]>
Co-authored-by: nolouch <[email protected]>
ti-chi-bot bot added a commit that referenced this issue Jan 3, 2024
…7626)

ref #7206

Signed-off-by: Cabinfever_B <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue Jan 4, 2024
…7626) (#7658)

ref #7206

Signed-off-by: Cabinfever_B <[email protected]>

Co-authored-by: Cabinfever_B <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Jan 4, 2024
pingandb pushed a commit to pingandb/pd that referenced this issue Jan 18, 2024
…ikv#7626)

ref tikv#7206

Signed-off-by: Cabinfever_B <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: pingandb <[email protected]>
@nolouch nolouch closed this as completed Mar 1, 2024
@nolouch
Contributor

nolouch commented Mar 1, 2024

Fixed. We could not reproduce the NaN problem, but we switched to a new way of copying the data, so this issue should be fixed.
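
To show what copying without a JSON round trip can look like, here is a minimal field-by-field clone. It is only a sketch and not the code from the linked fix: every type and field name (`resourceGroup`, `bucketSettings`, `bucketState`, `Clone`) is hypothetical and only loosely echoes the `r_u` settings/state shape posted earlier in this thread.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// All names below are hypothetical and are not PD's real types.
type bucketSettings struct {
	FillRate   uint64
	BurstLimit int64
}

type bucketState struct {
	Tokens      float64
	Initialized bool
}

type resourceGroup struct {
	mu       sync.RWMutex
	Name     string
	Settings *bucketSettings
	State    *bucketState
}

// Clone copies each field explicitly under the read lock. Because nothing is
// serialized, a NaN in State.Tokens is copied as-is and cannot cause a
// marshalling failure.
func (rg *resourceGroup) Clone() *resourceGroup {
	rg.mu.RLock()
	defer rg.mu.RUnlock()

	out := &resourceGroup{Name: rg.Name}
	if rg.Settings != nil {
		s := *rg.Settings // copy the pointed-to value, not the pointer
		out.Settings = &s
	}
	if rg.State != nil {
		st := *rg.State
		out.State = &st
	}
	return out
}

func main() {
	src := &resourceGroup{
		Name:     "rg1",
		Settings: &bucketSettings{FillRate: 14000, BurstLimit: 14000},
		State:    &bucketState{Tokens: math.NaN(), Initialized: true},
	}
	dup := src.Clone()
	dup.Settings.FillRate = 1 // does not affect src

	fmt.Println(src.Settings.FillRate, dup.Settings.FillRate, math.IsNaN(dup.State.Tokens))
}
```

A clone like this avoids the panic in the copy path, but it does not remove the NaN itself, so any later step that serializes the value to JSON would still need to handle it.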

@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024
7 participants