
Deploying Cass-operator and CassandraDataCenter on Power architecture #561

Open
Sunidhi-Gaonkar1 opened this issue Aug 21, 2023 · 5 comments


@Sunidhi-Gaonkar1

Sunidhi-Gaonkar1 commented Aug 21, 2023

Hi Team, we are working on deploying cass-operator (v1.14.0) and a CassandraDatacenter on Power architecture. We are able to deploy the operator successfully, but while deploying the CassandraDatacenter, the pod with index 0 terminates repeatedly while the pods with indexes 1 and 2 run fine.
We installed the operator using the Helm chart.

NAME                                  READY   STATUS        RESTARTS   AGE
pod/cass-operator-6fb8dffdb6-hl5nm    1/1     Running       0          4h38m
pod/cassandra-default-sts-0           1/2     Terminating   0          19s
pod/cassandra-default-sts-1           2/2     Running       0          4h37m
pod/cassandra-default-sts-2           2/2     Running       0          4h37m

NAME                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                        AGE
service/cass-operator-webhook-service       ClusterIP   172.30.17.183   <none>        443/TCP                                        4h38m
service/cassandra-additional-seed-service   ClusterIP   None            <none>        <none>                                         4h37m
service/cassandra-all-pods-service          ClusterIP   None            <none>        9042/TCP,8080/TCP,9103/TCP,9000/TCP            4h37m
service/cassandra-service                   ClusterIP   None            <none>        9042/TCP,9142/TCP,8080/TCP,9103/TCP,9000/TCP   4h37m
service/seed-service                        ClusterIP   None            <none>        <none>                                         4h37m

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cass-operator   1/1     1            1           4h38m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cass-operator-6fb8dffdb6   1         1         1       4h38m

NAME                                     READY   AGE
statefulset.apps/cassandra-default-sts   2/3     4h37m

Any pointers regarding this will be helpful, Thank you.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-18

@burmanm
Contributor

burmanm commented Aug 22, 2023

There should be some sort of indication from Kubernetes of what caused the termination of the pod. We have no special handling for killing a pod (especially not one with index 0). The only reason we would kill a pod is if the Cassandra container itself becomes stuck (as in, loses readiness).

Other than that, it would require a rolling restart / decommission to start deleting pods. But in those cases, the order would be different.

I recommend checking the container logs to identify whether Cassandra is failing for a reason, or whether Kubernetes has another reason to delete the pod (such as rescheduling).
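The checks described above can be run with kubectl along these lines. This is a sketch only: the pod name comes from the output earlier in the thread, and the `k8ssandra` namespace is an assumption based on the operator logs later in the thread.

```shell
# Why did Kubernetes terminate the pod? Inspect Last State, exit code, and conditions.
kubectl describe pod cassandra-default-sts-0 -n k8ssandra

# Recent events mentioning the pod (probe failures, rescheduling, OOM kills, ...).
kubectl get events -n k8ssandra \
  --field-selector involvedObject.name=cassandra-default-sts-0 \
  --sort-by=.lastTimestamp

# Container logs, including the previous (terminated) instance of the container.
kubectl logs cassandra-default-sts-0 -n k8ssandra -c cassandra --previous
```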

@Sunidhi-Gaonkar1
Author

Thank you for the pointer! The Cassandra logs show no error explaining why the container is failing; attaching the logs below for your reference.
cassandra-container-logs.txt

@burmanm
Contributor

burmanm commented Aug 22, 2023

The cassandra container's log could tell whether the /drain endpoint was called (it's the shutdown hook for the pod) or whether the shutdown came from another source. If it's the shutdown hook, something should indicate why it was shut down.

Did cass-operator logs have any indications? If it did kill the pod, it should log why.
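A minimal way to pull the operator's logs and search them for a pod-kill decision; the deployment name comes from the output earlier in the thread, and the namespace is an assumption (the operator may run in a different namespace than the datacenter).

```shell
# Fetch cass-operator logs and grep for any decision involving the failing pod.
kubectl logs deployment/cass-operator -n k8ssandra \
  | grep -i -E 'delet|kill|cassandra-default-sts-0'
```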

@Sunidhi-Gaonkar1
Author

I checked the cass-operator logs and found this error for the 0th index pod:

2023-08-23T11:18:34.200Z INFO client::callNodeMgmtEndpoint {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba"}
2023-08-23T11:18:34.200Z DEBUG events Starting Cassandra for pod cassandra-default-sts-0 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"k8ssandra","name":"cassandra","uid":"21cca51b-d70f-47a3-8086-0ef3416cf6a4","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"2114359"}, "reason": "StartingCassandra"}
2023-08-23T11:18:34.215Z INFO Failed to start pod cassandra-default-sts-0, deleting it {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "reason": "StartingCassandra", "eventType": "Warning"}
2023-08-23T11:18:34.215Z ERROR controllers.CassandraDatacenter calculateReconciliationActions returned an error {"cassandradatacenter": "k8ssandra/cassandra", "requestNamespace": "k8ssandra", "requestName": "cassandra", "loopID": "94017e6a-a0ad-4314-a6e3-add6a7b32302", "error": "Post \"http://10.254.22.169:8080/api/v0/lifecycle/start\": dial tcp 10.254.22.169:8080: connect: connection refused"}
github.com/k8ssandra/cass-operator/controllers/cassandra.(*CassandraDatacenterReconciler).Reconcile
/workspace/controllers/cassandra/cassandradatacenter_controller.go:145
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
2023-08-23T11:18:34.215Z INFO Post "http://10.254.22.169:8080/api/v0/lifecycle/start": dial tcp 10.254.22.169:8080: connect: connection refused {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "reason": "ReconcileFailed", "eventType": "Warning"}
2023-08-23T11:18:34.215Z INFO controllers.CassandraDatacenter Reconcile loop completed {"cassandradatacenter": "k8ssandra/cassandra", "requestNamespace": "k8ssandra", "requestName": "cassandra", "loopID": "94017e6a-a0ad-4314-a6e3-add6a7b32302", "duration": 0.018262743}
2023-08-23T11:18:34.215Z ERROR Reconciler error {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "error": "Post \"http://10.254.22.169:8080/api/v0/lifecycle/start\": dial tcp 10.254.22.169:8080: connect: connection refused"}

@burmanm
Contributor

burmanm commented Aug 23, 2023

There is the reason:

2023-08-23T11:18:34.215Z INFO Post "http://10.254.22.169:8080/api/v0/lifecycle/start": dial tcp 10.254.22.169:8080: connect: connection refused

The management-api could not be contacted for some reason. The cassandra-container logs might tell something, since they include the management-api output; the server-system-logger container carries Cassandra's own logs.
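One way to check whether the management-api on port 8080 is reachable from inside the pod, and to pull both containers' logs. A sketch under assumptions: container names (`cassandra`, `server-system-logger`) are taken from this thread, the namespace is inferred from the operator logs above, and the liveness probe endpoint is assumed available in the management-api build in use.

```shell
# Probe the management-api locally inside the cassandra container;
# prints the HTTP status code (200 means the API is up).
kubectl exec cassandra-default-sts-0 -n k8ssandra -c cassandra -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/api/v0/probes/liveness

# Management-api output (cassandra container) and Cassandra's own system log
# (server-system-logger container).
kubectl logs cassandra-default-sts-0 -n k8ssandra -c cassandra
kubectl logs cassandra-default-sts-0 -n k8ssandra -c server-system-logger
```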
