Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make gRPC service report backups as FAILED if lose their callback future #786

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rzvoncek
Copy link
Contributor

@rzvoncek rzvoncek commented Jun 19, 2024

Fixes k8ssandra/k8ssandra-operator#1312.

This PR alters how the BackupMan reports a status of the backup.
Previously, it'd just look at the storage, and report the state of the backup as IN_PROGRESS if the backup has started (~wrote something into the storage). It'd not care if the backup is actually ongoing.

With this change, it will also check if there is an active future waiting for the backup to finish. The future's presence indicates a healthy state, which in turn means the backup has a high chance of happening.

However, waiting for the future to complete and run the associated callback is not a requirement for the backup to complete. When I tried to cover this in the integration steps, I was able to restart the gRPC server, but I was unable to kill the actual process that does the backup. So the new server (with new BackupMan) came back and saw the backup as complete.

The benefit of this change is more visible in the world of k8ssandra-operator, where a pod might restart mid-backup. Killing a pod. during a backup, would lead to the following situation:

kubectl get MedusaBackupJob,MedusaBackup -n k8ssandra-operator

NAME                                           STARTED   FINISHED
medusabackupjob.medusa.k8ssandra.io/backup-1   12m       11m
medusabackupjob.medusa.k8ssandra.io/backup-2   7m38s     7m23s
medusabackupjob.medusa.k8ssandra.io/backup-3   51s

NAME                                        STARTED   FINISHED   NODES   FILES   SIZE        COMPLETED   STATUS
medusabackup.medusa.k8ssandra.io/backup-1   12m       11m        1       184     143.53 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-2   7m38s     7m23s      1       296     227.16 KB   1           SUCCESS

The backup-3 would never finish, it'd continue to be reported as IN_PROGRESS. This is because the MedusaBackupJob would always feature:

status:
  inProgress:
    - firstcluster-dc1-default-sts-0
  startTime: '2024-06-19T14:05:18Z'

Deploying a Medusa container built off this branch actually heals the situation:

Every 1.0s: kubectl get MedusaBackupJob,MedusaBackup -n k8ssandra-operator                                                                                                                       

NAME                                           STARTED   FINISHED
medusabackupjob.medusa.k8ssandra.io/backup-1   15m       15m
medusabackupjob.medusa.k8ssandra.io/backup-2   10m       10m
medusabackupjob.medusa.k8ssandra.io/backup-3   4m9s      17s

NAME                                        STARTED   FINISHED   NODES   FILES   SIZE        COMPLETED   STATUS
medusabackup.medusa.k8ssandra.io/backup-1   15m       15m        1       184     143.53 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-2   10m       10m        1       296     227.16 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-3   4m9s      17s        1               0.00 B                  FAILED

Because:

status:
  failed:
    - firstcluster-dc1-default-sts-0
  finishTime: '2024-06-19T14:09:10Z'
  startTime: '2024-06-19T14:05:18Z'

@rzvoncek rzvoncek marked this pull request as ready for review June 19, 2024 14:10
@rzvoncek rzvoncek force-pushed the radovan/failed-backups branch 6 times, most recently from 3f2588b to b2da4fb Compare June 20, 2024 08:24
Copy link

Copy link
Contributor

@adejanovski adejanovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue:

I've tested the image which was built in CI, and deployed it on a cluster that's been running for a while in GCP.
The backups are now failing altogether under normal conditions.
One of the nodes reports the following error:

[2024-07-17 09:16:37,788] DEBUG: [Storage] Reading blob dse68-rc1/index/backup_index/dse69-multi-cluster-prime-schedule-1719842100/tokenmap_dse69-multi-cluster-prime-alpha-sts-1.json...
[2024-07-17 09:16:37,864] DEBUG: Disconnecting from Google Storage
[2024-07-17 09:16:37,867] ERROR: Unexpected [AttributeError] raised by servicer method [/Medusa/GetBackups]
Traceback (most recent call last):
  File "/home/cassandra/medusa/service/grpc/server.py", line 239, in GetBackups
    summary = get_backup_summary(backup)
  File "/home/cassandra/medusa/service/grpc/server.py", line 354, in get_backup_summary
    if BackupMan.get_backup_future(backup.name) is None:
  File "/home/cassandra/medusa/backup_manager.py", line 128, in get_backup_future
    backup_state = BackupMan.__instance.__backups[backup_name]
KeyError: 'dse69-multi-cluster-prime-schedule-1719842100'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 802, in _handle_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 411, in _finish_handler_with_unary_response
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/cassandra/medusa/service/grpc/server.py", line 246, in GetBackups
    response.status = medusa_pb2.StatusType.UNKNOWN
AttributeError: Protocol message GetBackupsResponse has no "status" field.

The other node (there are only two nodes in this DC) seems to process the backup but never finishes, and the last log line is apparently a file upload:

[2024-07-17 09:19:32,350] DEBUG: [Storage] Uploading object from stream -> gcs://ui-playground-backups/dse68-rc1/dse69-multi-cluster-prime-alpha-sts-1/backup-2024-07-17t09-14-54-331z/meta/schema.cql

Despite having both nodes in failed status, I'm not seeing a finish date on the backup job:

status:
  failed:
    - dse69-multi-cluster-prime-alpha-sts-0
    - dse69-multi-cluster-prime-alpha-sts-1
  startTime: '2024-07-17T09:15:09Z'
spec:
  backupType: differential
  cassandraDatacenter: dse69-multi-cluster-prime
NAME                                            STARTED   FINISHED
backup-2024-07-17t09-14-54-331z                 7m48s     
dse69-multi-cluster-prime-schedule-1719841200   15d       15d
dse69-multi-cluster-prime-schedule-1719841500   15d       15d
dse69-multi-cluster-prime-schedule-1719841800   15d       15d

Thus further attempts to run a scheduled backup will get blocked.

@rzvoncek rzvoncek force-pushed the radovan/failed-backups branch from 778fbb1 to 58ed59f Compare October 17, 2024 11:51
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

K8ssandra-operator memory leak after failed MedusaBackupJob
2 participants