Backup is marked as waitingForPluginOperationsPartiallyfailed when VolumeSnapshotContent has an error #7356

Closed
shubham-pampattiwar opened this issue Jan 25, 2024 · 7 comments · May be fixed by vmware-tanzu/velero-plugin-for-csi#226 or #8023

@shubham-pampattiwar
Collaborator

What steps did you take and what happened:

We have been seeing this issue recently: if any VolumeSnapshotContent CR has an error related to removing the VolumeSnapshotBeingCreated annotation, the backup moves to the WaitingForPluginOperationsPartiallyFailed phase. As a result, most CSI/NativeDataMover backups have been failing recently.

Here you can see that the backup is marked WaitingForPluginOperationsPartiallyFailed only 3 minutes after it started.
The failure is intermittent (it happens when a VolumeSnapshotContent CR has an error related to removing the annotation).

$ oc get backup test1
NAME    AGE
test1   3m17s
$ oc get backup test1 -o jsonpath={.status.phase}
WaitingForPluginOperationsPartiallyFailed
./velero describe backup test1 -n openshift-adp --details
Name:         test1
Namespace:    openshift-adp
Labels:       velero.io/storage-location=ts-dpa-1
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.27.6+98158f9
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=27
Phase:  PartiallyFailed (run `velero backup logs test1` for more information)

Warnings:
  Velero:   
  Cluster:    <none>
  Namespaces: <none>
Errors:
  Velero:   
  Cluster:    <none>
  Namespaces: <none>
Namespaces:
  Included:  ocp-cassandra
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto
Label selector:  <none>
Storage Location:  ts-dpa-1
Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          auto
Data Mover:                  velero
TTL:  720h0m0s
CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s
Hooks:  <none>
Backup Format Version:  1.1.0
Started:    2023-10-17 18:07:59 +0530 IST
Completed:  2023-10-17 18:22:11 +0530 IST
Expiration:  2023-11-16 18:07:59 +0530 IST
Total items to be backed up:  101
Items backed up:              101
Backup Item Operations:
  Operation for volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-0-glb8b:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshot-backupper
    Operation ID:               ocp-cassandra/velero-cassandra-data-cassandra-0-glb8b/2023-10-17T12:38:13Z
    Items to Update:
              volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-0-glb8b
    Phase:    Completed
    Created:  2023-10-17 18:08:13 +0530 IST
    Started:  2023-10-17 18:08:13 +0530 IST
  Operation for volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-9bb72537-76e0-46be-aa1a-7501b800526c:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshotcontent-backupper
    Operation ID:               snapcontent-9bb72537-76e0-46be-aa1a-7501b800526c/2023-10-17T12:38:13Z
    Items to Update:
              volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-9bb72537-76e0-46be-aa1a-7501b800526c
    Phase:    Completed
    Created:  2023-10-17 18:08:13 +0530 IST
    Started:  2023-10-17 18:08:13 +0530 IST
  Operation for volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-1-wrscv:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshot-backupper
    Operation ID:               ocp-cassandra/velero-cassandra-data-cassandra-1-wrscv/2023-10-17T12:38:23Z
    Items to Update:
              volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-1-wrscv
    Phase:    Completed
    Created:  2023-10-17 18:08:23 +0530 IST
    Started:  2023-10-17 18:08:23 +0530 IST
  Operation for volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshotcontent-backupper
    Operation ID:               snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a/2023-10-17T12:38:23Z
    Items to Update:
                      volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a
    Phase:            Failed
    Operation Error:  Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a: "snapshot controller failed to update snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-cb8d1866-6032-4b08-bce9-c02fc22d3c4a\": the object has been modified; please apply your changes to the latest version and try again"
    Created:          2023-10-17 18:08:23 +0530 IST
    Started:          2023-10-17 18:08:23 +0530 IST
  Operation for volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-2-96rmj:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshot-backupper
    Operation ID:               ocp-cassandra/velero-cassandra-data-cassandra-2-96rmj/2023-10-17T12:38:33Z
    Items to Update:
              volumesnapshots.snapshot.storage.k8s.io ocp-cassandra/velero-cassandra-data-cassandra-2-96rmj
    Phase:    Completed
    Created:  2023-10-17 18:08:33 +0530 IST
    Started:  2023-10-17 18:08:33 +0530 IST
  Operation for volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-1183dce4-f703-4793-9f1d-b0e3028217e7:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshotcontent-backupper
    Operation ID:               snapcontent-1183dce4-f703-4793-9f1d-b0e3028217e7/2023-10-17T12:38:33Z
    Items to Update:
              volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-1183dce4-f703-4793-9f1d-b0e3028217e7
    Phase:    Completed
    Created:  2023-10-17 18:08:33 +0530 IST
    Started:  2023-10-17 18:08:33 +0530 IST
Resource List:
  apiextensions.k8s.io/v1/CustomResourceDefinition:
    - volumesnapshots.snapshot.storage.k8s.io
  apps/v1/ControllerRevision:
    - ocp-cassandra/cassandra-76bd54848b
    - ...

What did you expect to happen:
The backup should not fail on temporary VSC errors. The CSI plugin should keep waiting for at least the specified csiSnapshotTimeout before reporting a failure.

@shubham-pampattiwar
Collaborator Author

shubham-pampattiwar commented Jan 25, 2024

Proposed solution in draft PR: vmware-tanzu/velero-plugin-for-csi#226

This is in line with what we do for the VolumeSnapshot backup plugin.

@qiuming-best qiuming-best added the Area/CSI Related to Container Storage Interface support label Jan 29, 2024
@reasonerjt reasonerjt added this to the v1.14 milestone Feb 6, 2024
@reasonerjt reasonerjt added the Needs triage and target/1.14.1 labels and removed the Needs triage label Apr 17, 2024
@reasonerjt reasonerjt removed this from the v1.14 milestone Apr 24, 2024
@Rohmilchkaese

I'm having the same problem:

To back up (run manually here; usually done via a schedule):
velero --kubeconfig .kube/prod backup create testbackup --snapshot-move-data=true --snapshot-volumes=true --csi-snapshot-timeout=3h --exclude-namespaces=velero

Name:         testbackup
Namespace:    velero
Labels:       velero.io/storage-location=exoscale
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.29.4+k0s
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=29

Phase:  WaitingForPluginOperationsPartiallyFailed


Warnings:
  Velero:     <none>
  Cluster:   resource: /persistentvolumeclaims name: /pvc-36d4a0b7-11af-40dd-b4b6-96c511e31c5b message: /Additional item was not found in Kubernetes API, can't back it up
  Namespaces: <none>

Errors:
  Velero:    message: /Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot harbor/velero-harbor-jobservice-b45q5: volumesnapshots.snapshot.storage.k8s.io "velero-harbor-jobservice-b45q5" not found
             name: /harbor-jobservice-6cc66db858-kxxxp message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=harbor, name=harbor-jobservice): rpc error: code = Unknown desc = failed to get volumesnapshot harbor/velero-harbor-jobservice-b45q5: volumesnapshots.snapshot.storage.k8s.io "velero-harbor-jobservice-b45q5" not found
             message: /Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot limesurvey/velero-limesurvey-kdtbx: volumesnapshots.snapshot.storage.k8s.io "velero-limesurvey-kdtbx" not found
             name: /limesurvey-7bfd487975-44xdx message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=limesurvey, name=limesurvey): rpc error: code = Unknown desc = failed to get volumesnapshot limesurvey/velero-limesurvey-kdtbx: volumesnapshots.snapshot.storage.k8s.io "velero-limesurvey-kdtbx" not found
             message: /Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot matomo/velero-matomo-matomo-zmlkm: volumesnapshots.snapshot.storage.k8s.io "velero-matomo-matomo-zmlkm" not found
             name: /matomo-7c77bbbf95-jl2cz message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=matomo, name=matomo-matomo): rpc error: code = Unknown desc = failed to get volumesnapshot matomo/velero-matomo-matomo-zmlkm: volumesnapshots.snapshot.storage.k8s.io "velero-matomo-matomo-zmlkm" not found
             message: /Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot n8n/velero-n8n-gw8rq: volumesnapshots.snapshot.storage.k8s.io "velero-n8n-gw8rq" not found
             name: /n8n-b4fcf45cb-h7px7 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=n8n, name=n8n): rpc error: code = Unknown desc = failed to get volumesnapshot n8n/velero-n8n-gw8rq: volumesnapshots.snapshot.storage.k8s.io "velero-n8n-gw8rq" not found
             message: /Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot sonarqube/velero-sonarqube-sonarqube-chsdb: volumesnapshots.snapshot.storage.k8s.io "velero-sonarqube-sonarqube-chsdb" not found
             name: /sonarqube-sonarqube-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=sonarqube, name=sonarqube-sonarqube): rpc error: code = Unknown desc = failed to get volumesnapshot sonarqube/velero-sonarqube-sonarqube-chsdb: volumesnapshots.snapshot.storage.k8s.io "velero-sonarqube-sonarqube-chsdb" not found
             name: /export-minio-1-cloned-pvc message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=minio, name=export-minio-1-cloned-pvc): rpc error: code = Unknown desc = failed to get PV pvc-36d4a0b7-11af-40dd-b4b6-96c511e31c5b for PVC minio/export-minio-1-cloned-pvc: persistentvolumes "pvc-36d4a0b7-11af-40dd-b4b6-96c511e31c5b" not found
  Cluster:    <none>
  Namespaces: <none>



HooksAttempted:  0
HooksFailed:     0

@wenchao5211

I had the same problem.

Log:

./velero backup describe backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091
Name:         backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091
Namespace:    velero
Labels:     
              velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.25.12+k3s-3fcb1443
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=25

Phase:  PartiallyFailed (run `velero backup logs backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091` for more information)


Warnings:
  Velero:     <none>
  Cluster:   resource: /persistentvolumes name: /pvc-1099ef37-2921-4923-b8cd-61069ced82fc
  Namespaces: <none>

Errors:
  Velero:   
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  default
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  com.virt.vmsets=vm1715161607

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          auto
Data Mover:                  velero

TTL:  800h0m0s

CSISnapshotTimeout:    30m0s
ItemOperationTimeout:  1h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-05-09 10:35:15 +0800 CST
Completed:  2024-05-09 10:35:37 +0800 CST

Expiration:  2024-06-11 18:35:15 +0800 CST

Total items to be backed up:  8
Items backed up:              8

Backup Item Operations:  0 of 2 completed successfully, 2 failed (specify --details for more information)
Backup Volumes:
  <error getting backup volume info: DownloadRequest.velero.io "backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091-e580dfb0-22b5-44b5-9317-424134ea2625" is invalid: spec.target.kind: Unsupported value: "BackupVolumeInfos": supported values: "BackupLog", "BackupContents", "BackupVolumeSnapshots", "BackupItemOperations", "BackupResourceList", "BackupResults", "RestoreLog", "RestoreResults", "RestoreResourceList", "RestoreItemOperations", "CSIBackupVolumeSnapshots", "CSIBackupVolumeSnapshotContents">
root@u22:~/velero-v1.13.2-linux-amd64# ./velero backup describe backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091 --details
Name:         backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091
Namespace:    velero
Labels:       com.nss=default
              com.virt.vmsets=vm1715161607
              velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.25.12+k3s-3fcb1443
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=25

Phase:  PartiallyFailed (run `velero backup logs backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091` for more information)


Warnings:
  Velero:     <none>
  Cluster:   resource: /persistentvolumes name: /pvc-1099ef37-2921-4923-b8cd-61069ced82fc
  Namespaces: <none>

Errors:
  Velero:   
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  default
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  com.virt.vmsets=vm1715161607

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          auto
Data Mover:                  velero

TTL:  800h0m0s

CSISnapshotTimeout:    30m0s
ItemOperationTimeout:  1h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-05-09 10:35:15 +0800 CST
Completed:  2024-05-09 10:35:37 +0800 CST

Expiration:  2024-06-11 18:35:15 +0800 CST

Total items to be backed up:  8
Items backed up:              8

Backup Item Operations:
  Operation for volumesnapshots.snapshot.storage.k8s.io default/velero-pvc-vm1715161607-0-volume-1-q6ffp:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshot-backupper
    Operation ID:               default/velero-pvc-vm1715161607-0-volume-1-q6ffp/2024-05-09T02:35:30Z
    Items to Update:
                      volumesnapshots.snapshot.storage.k8s.io default/velero-pvc-vm1715161607-0-volume-1-q6ffp
    Phase:            Failed
    Operation Error:  Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb: "snapshot controller failed to update snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb\": the object has been modified; please apply your changes to the latest version and try again"
    Created:          2024-05-09 10:35:30 +0800 CST
    Started:          2024-05-09 10:35:30 +0800 CST
  Operation for volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb:
    Backup Item Action Plugin:  velero.io/csi-volumesnapshotcontent-backupper
    Operation ID:               snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb/2024-05-09T02:35:30Z
    Items to Update:
                      volumesnapshotcontents.snapshot.storage.k8s.io /snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb
    Phase:            Failed
    Operation Error:  Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb: "snapshot controller failed to update snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb\": the object has been modified; please apply your changes to the latest version and try again"
    Created:          2024-05-09 10:35:30 +0800 CST
    Started:          2024-05-09 10:35:30 +0800 CST
Resource List:
  apiextensions.k8s.io/v1/CustomResourceDefinition:
    - vmsets.virt.liveit100.com
  snapshot.storage.k8s.io/v1/VolumeSnapshot:
    - default/velero-pvc-vm1715161607-0-volume-1-q6ffp
  snapshot.storage.k8s.io/v1/VolumeSnapshotClass:
    - longhorn
  snapshot.storage.k8s.io/v1/VolumeSnapshotContent:
    - snapcontent-3ed91ed2-6459-4812-a5b9-fba84d265adb
  v1/PersistentVolume:
    - pvc-1099ef37-2921-4923-b8cd-61069ced82fc
  v1/PersistentVolumeClaim:
    - default/pvc-vm1715161607-0-volume-1
  v1/Pod:
    - default/vm1715161607-0.0
  virt.liveit100.com/v1alpha2/VMSet:
    - default/vm1715161607

Backup Volumes:
  <error getting backup volume info: DownloadRequest.velero.io "backup-vm-backup-vm-vm1715161607-1715222101432-1715222103091-717e4ff6-99a2-43bf-a776-fe92001467cc" is invalid: spec.target.kind: Unsupported value: "BackupVolumeInfos": supported values: "BackupLog", "BackupContents", "BackupVolumeSnapshots", "BackupItemOperations", "BackupResourceList", "BackupResults", "RestoreLog", "RestoreResults", "RestoreResourceList", "RestoreItemOperations", "CSIBackupVolumeSnapshots", "CSIBackupVolumeSnapshotContents">

velero version : 1.11
velero-plugin-for-csi version : main
k8s version 1.25.12

@Lyndon-Li
Contributor

@shubham-pampattiwar This issue is currently targeted for 1.14.1. Are you planning to fix it there? If not, let's remove it from 1.14.1.

@shubham-pampattiwar
Collaborator Author

@Lyndon-Li I will have to rethink the earlier proposed solution, because the code flow has changed: the progress method of the VSC async action operation was updated, and the CSI plugin was merged into Velero core. Yes, I would like to try to solve this for 1.14.1. One option may be to add a retry-on-temporary-error mechanism to the WaitUntilVSCHandleIsReady function in the current code flow. Any thoughts on this?
cc: @blackpiglet
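
A retry mechanism like this needs some way to decide which errors count as temporary. One possible approach, sketched here purely as an assumption, is to match on the message text, since the failure in this issue surfaces as a message string in the operation error rather than as a typed API error. The helper name `isTransientVSCError` and the marker list are hypothetical; the only marker included is the conflict text quoted in the operation errors above, and `snapcontent-example` is a placeholder name.

```go
package main

import (
	"fmt"
	"strings"
)

// transientMarkers lists substrings that, in this sketch, identify errors
// worth retrying. The single entry is the optimistic-concurrency conflict
// text that appears in the Operation Error quoted in this issue.
var transientMarkers = []string{
	"the object has been modified; please apply your changes to the latest version and try again",
}

// isTransientVSCError reports whether an error message taken from a
// VolumeSnapshotContent looks like a temporary condition.
// Hypothetical helper: the name and matching strategy are assumptions.
func isTransientVSCError(msg string) bool {
	for _, m := range transientMarkers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

func main() {
	// Conflict message in the shape reported above, with a placeholder name.
	conflict := `snapshot controller failed to update snapcontent-example on API server: ` +
		`Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-example": ` +
		`the object has been modified; please apply your changes to the latest version and try again`

	// Made-up message standing in for an error that should not be retried.
	other := "driver rejected the snapshot request (made-up message)"

	fmt.Println("conflict is transient:", isTransientVSCError(conflict))
	fmt.Println("other is transient:", isTransientVSCError(other))
}
```

String matching is brittle if the upstream wording changes, so a fix might prefer to retry on any VSC error until the timeout instead; the sketch only illustrates the narrower option.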

@shubham-pampattiwar
Collaborator Author

Draft PR against latest Velero for proposed solution: #8023

@reasonerjt reasonerjt added this to the v1.15 milestone Aug 14, 2024
@shubham-pampattiwar
Collaborator Author

Closing this issue as we are currently not seeing this behavior in our testing.
