Make gRPC service report backups as FAILED if they lose their callback future #786
base: master
Conversation
force-pushed from 3f2588b to b2da4fb
Issue:
I've tested the image that was built in CI by deploying it on a cluster that's been running for a while in GCP.
The backups are now failing altogether under normal conditions.
One of the nodes reports the following error:
[2024-07-17 09:16:37,788] DEBUG: [Storage] Reading blob dse68-rc1/index/backup_index/dse69-multi-cluster-prime-schedule-1719842100/tokenmap_dse69-multi-cluster-prime-alpha-sts-1.json...
[2024-07-17 09:16:37,864] DEBUG: Disconnecting from Google Storage
[2024-07-17 09:16:37,867] ERROR: Unexpected [AttributeError] raised by servicer method [/Medusa/GetBackups]
Traceback (most recent call last):
File "/home/cassandra/medusa/service/grpc/server.py", line 239, in GetBackups
summary = get_backup_summary(backup)
File "/home/cassandra/medusa/service/grpc/server.py", line 354, in get_backup_summary
if BackupMan.get_backup_future(backup.name) is None:
File "/home/cassandra/medusa/backup_manager.py", line 128, in get_backup_future
backup_state = BackupMan.__instance.__backups[backup_name]
KeyError: 'dse69-multi-cluster-prime-schedule-1719842100'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 802, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 411, in _finish_handler_with_unary_response
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cassandra/medusa/service/grpc/server.py", line 246, in GetBackups
response.status = medusa_pb2.StatusType.UNKNOWN
AttributeError: Protocol message GetBackupsResponse has no "status" field.
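For illustration only, here is a minimal sketch (hypothetical, not the actual Medusa code) of how the dictionary lookup shown in the traceback could be guarded so that backup names unknown to the current BackupMan instance yield None instead of raising KeyError:

class BackupMan:
    __instance = None

    def __init__(self):
        # Sketch assumption: maps backup name -> future tracking that backup
        self.__backups = {}
        BackupMan.__instance = self

    @staticmethod
    def get_backup_future(backup_name):
        if BackupMan.__instance is None:
            return None
        # dict.get() returns None for backups registered before a server
        # restart, instead of raising KeyError as in the traceback above.
        return BackupMan.__instance.__backups.get(backup_name)

With a guard like that, GetBackups could fall back to a storage-based status for pre-restart backups instead of crashing the servicer method.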
The other node (there are only two nodes in this DC) seems to process the backup but never finishes, and the last log line is apparently a file upload:
[2024-07-17 09:19:32,350] DEBUG: [Storage] Uploading object from stream -> gcs://ui-playground-backups/dse68-rc1/dse69-multi-cluster-prime-alpha-sts-1/backup-2024-07-17t09-14-54-331z/meta/schema.cql
Despite having both nodes in failed status, I'm not seeing a finish date on the backup job:
status:
  failed:
  - dse69-multi-cluster-prime-alpha-sts-0
  - dse69-multi-cluster-prime-alpha-sts-1
  startTime: '2024-07-17T09:15:09Z'
spec:
  backupType: differential
  cassandraDatacenter: dse69-multi-cluster-prime

NAME                                            STARTED   FINISHED
backup-2024-07-17t09-14-54-331z                 7m48s
dse69-multi-cluster-prime-schedule-1719841200   15d       15d
dse69-multi-cluster-prime-schedule-1719841500   15d       15d
dse69-multi-cluster-prime-schedule-1719841800   15d       15d
Thus further attempts to run a scheduled backup will get blocked.
force-pushed from 778fbb1 to 58ed59f
Fixes k8ssandra/k8ssandra-operator#1312.
This PR alters how BackupMan reports the status of a backup.
Previously, it would only look at the storage and report the backup as IN_PROGRESS if the backup had started (i.e. had written something into the storage), regardless of whether the backup was actually still running.
With this change, it also checks whether there is an active future waiting for the backup to finish. The future's presence indicates a healthy state, which in turn means the backup has a good chance of completing.
However, the future completing and running its associated callback is not a requirement for the backup to complete. When I tried to cover this in the integration test steps, I was able to restart the gRPC server, but I was unable to kill the actual process that performs the backup. So the new server (with a new BackupMan) came back and saw the backup as complete.
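As a rough illustration of the idea (hypothetical names and attributes, not the exact Medusa code), the resulting status resolution could look like this:

# Hypothetical sketch of the reporting logic described above: a backup that
# has started in storage but has no future registered in the current process
# is reported as FAILED, since nothing is still waiting to complete it.
def get_backup_status(backup, backup_man):
    if backup.finished:                    # storage shows a completed backup
        return "SUCCESS"
    if not backup.started:                 # nothing written to storage yet
        return "UNKNOWN"
    # Storage says the backup is in progress: only trust that if an active
    # future is still tracking this backup in the running gRPC server process.
    if backup_man.get_backup_future(backup.name) is not None:
        return "IN_PROGRESS"
    return "FAILED"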
The benefit of this change is more visible in the world of k8ssandra-operator, where a pod might restart mid-backup. Killing a pod during a backup would lead to the following situation:
The backup-3 would never finish; it would continue to be reported as IN_PROGRESS. This is because the MedusaBackupJob would always feature:
Deploying a Medusa container built off this branch actually heals the situation:
Because: