Make gRPC service report backups as FAILED if they lose their callback future #786
base: master
Conversation
force-pushed from 3f2588b to b2da4fb
Issue:
I've tested the image that was built in CI by deploying it on a cluster that's been running for a while in GCP.
The backups are now failing altogether under normal conditions.
One of the nodes reports the following error:
[2024-07-17 09:16:37,788] DEBUG: [Storage] Reading blob dse68-rc1/index/backup_index/dse69-multi-cluster-prime-schedule-1719842100/tokenmap_dse69-multi-cluster-prime-alpha-sts-1.json...
[2024-07-17 09:16:37,864] DEBUG: Disconnecting from Google Storage
[2024-07-17 09:16:37,867] ERROR: Unexpected [AttributeError] raised by servicer method [/Medusa/GetBackups]
Traceback (most recent call last):
File "/home/cassandra/medusa/service/grpc/server.py", line 239, in GetBackups
summary = get_backup_summary(backup)
File "/home/cassandra/medusa/service/grpc/server.py", line 354, in get_backup_summary
if BackupMan.get_backup_future(backup.name) is None:
File "/home/cassandra/medusa/backup_manager.py", line 128, in get_backup_future
backup_state = BackupMan.__instance.__backups[backup_name]
KeyError: 'dse69-multi-cluster-prime-schedule-1719842100'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 802, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 411, in _finish_handler_with_unary_response
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cassandra/medusa/service/grpc/server.py", line 246, in GetBackups
response.status = medusa_pb2.StatusType.UNKNOWN
AttributeError: Protocol message GetBackupsResponse has no "status" field.
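For illustration only, here is a minimal sketch (hypothetical, not the actual Medusa code) of how the dictionary lookup shown in the traceback could be guarded so that backup names unknown to the current BackupMan instance yield None instead of raising KeyError:

class BackupMan:
    __instance = None

    def __init__(self):
        # Sketch assumption: maps backup name -> future tracking that backup
        self.__backups = {}
        BackupMan.__instance = self

    @staticmethod
    def get_backup_future(backup_name):
        if BackupMan.__instance is None:
            return None
        # dict.get() returns None for backups registered before a server
        # restart, instead of raising KeyError as in the traceback above.
        return BackupMan.__instance.__backups.get(backup_name)

With a guard like that, GetBackups could fall back to a storage-based status for pre-restart backups instead of crashing the servicer method.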
The other node (there are only two nodes in this DC) seems to process the backup but never finishes, and the last log line is apparently a file upload:
[2024-07-17 09:19:32,350] DEBUG: [Storage] Uploading object from stream -> gcs://ui-playground-backups/dse68-rc1/dse69-multi-cluster-prime-alpha-sts-1/backup-2024-07-17t09-14-54-331z/meta/schema.cql
Despite having both nodes in failed status, I'm not seeing a finish date on the backup job:
status:
  failed:
  - dse69-multi-cluster-prime-alpha-sts-0
  - dse69-multi-cluster-prime-alpha-sts-1
  startTime: '2024-07-17T09:15:09Z'
spec:
  backupType: differential
  cassandraDatacenter: dse69-multi-cluster-prime

NAME                                            STARTED   FINISHED
backup-2024-07-17t09-14-54-331z                 7m48s
dse69-multi-cluster-prime-schedule-1719841200   15d       15d
dse69-multi-cluster-prime-schedule-1719841500   15d       15d
dse69-multi-cluster-prime-schedule-1719841800   15d       15d
Thus further attempts to run a scheduled backup will get blocked.
force-pushed from 778fbb1 to 58ed59f
Fixes k8ssandra/k8ssandra-operator#1312.
This PR alters how BackupMan reports the status of a backup.
Previously, it would only look at the storage and report the backup as IN_PROGRESS if the backup had started (i.e. had written something into the storage), regardless of whether the backup was actually still running.
With this change, it also checks whether there is an active future waiting for the backup to finish. The future's presence indicates a healthy state, which in turn means the backup has a good chance of completing.
However, the future completing and running its associated callback is not a requirement for the backup to complete. When I tried to cover this in the integration test steps, I was able to restart the gRPC server, but I was unable to kill the actual process that performs the backup. So the new server (with a new BackupMan) came back and saw the backup as complete.
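As a rough illustration of the idea (hypothetical names and attributes, not the exact Medusa code), the resulting status resolution could look like this:

# Hypothetical sketch of the reporting logic described above: a backup that
# has started in storage but has no future registered in the current process
# is reported as FAILED, since nothing is still waiting to complete it.
def get_backup_status(backup, backup_man):
    if backup.finished:                    # storage shows a completed backup
        return "SUCCESS"
    if not backup.started:                 # nothing written to storage yet
        return "UNKNOWN"
    # Storage says the backup is in progress: only trust that if an active
    # future is still tracking this backup in the running gRPC server process.
    if backup_man.get_backup_future(backup.name) is not None:
        return "IN_PROGRESS"
    return "FAILED"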
The benefit of this change is more visible in the world of k8ssandra-operator, where a pod might restart mid-backup. Killing a pod during a backup would lead to the following situation:
The backup-3 would never finish; it would continue to be reported as IN_PROGRESS. This is because the MedusaBackupJob would always feature:
Deploying a Medusa container built off this branch actually heals the situation:
Because: