
Azure: proper bulk deletion #7931

Open · 3 of 5 tasks
arpad-m opened this issue May 31, 2024 · 2 comments
Labels: c/storage (Component: storage)

arpad-m commented May 31, 2024

The `AzureBlobStorage::delete_objects` function loops over the list of objects to be deleted, without any retry behaviour. If each request has some chance of failing, the chance that the whole chain of requests succeeds shrinks exponentially with the length of the list, so the chance of at least one failure grows quickly. As we might pass long lists to the function, this is a risk.
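
For a concrete illustration (numbers chosen for the example): if each request independently fails with probability 1%, a list of 500 objects gets through the loop without any failure only with probability 0.99^500 ≈ 0.7%, i.e. the call as a whole fails roughly 99.3% of the time.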

Ideally, the SDK would gain support for proper bulk deletion (issue), but until then, we should at least do retries inside the call.

cc #5567

arpad-m added the c/storage (Component: storage) label on May 31, 2024

arpad-m commented Jun 5, 2024

PR #7964 implements the retry workaround, to be used until the SDK gains proper bulk deletion support.

arpad-m added a commit that referenced this issue Jun 6, 2024
This adds retries to the bulk deletion: if each request has a certain chance of failing, the chance that a chain of requests completes without any failure shrinks exponentially with the chain's length.

We've had similar issues with the S3 DR tests, which in the end resulted in adding retries at the remote_storage level. Retries at the top level are not sufficient when one remote_storage "operation" is multiple network requests in a trench coat, especially when there is no notion of saving the progress: even if prior deletions had been successful, we'd still need to get a 404 for each of them in order to continue the loop and get back to the point where we failed in the last iteration, and we might fail again before we've even reached it.

Retries at the bottom level avoid this issue because they have a notion of progress: when one network operation fails, only that operation is retried.

First part of #7931.
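
A minimal sketch of what such bottom-level retries could look like (hypothetical helper names and error type; the actual implementation in PR #7964 differs):

```rust
use std::time::Duration;

const MAX_RETRIES: u32 = 3;

/// Hypothetical stand-in for a single per-blob DELETE request.
async fn delete_one(key: &str) -> std::io::Result<()> {
    let _ = key; // the real code would issue the network request here
    Ok(())
}

/// Retry each object's request individually. A transient failure only
/// repeats that one request, and objects deleted in earlier iterations
/// stay deleted, so progress is never lost across attempts.
async fn delete_objects_with_retries(keys: &[&str]) -> std::io::Result<()> {
    for key in keys {
        let mut attempt: u32 = 0;
        loop {
            match delete_one(key).await {
                Ok(()) => break,
                Err(_) if attempt < MAX_RETRIES => {
                    attempt += 1;
                    // simple exponential backoff between attempts
                    tokio::time::sleep(Duration::from_millis(100u64 << attempt)).await;
                }
                Err(e) => return Err(e),
            }
        }
    }
    Ok(())
}
```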
arpad-m self-assigned this on Dec 6, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 9, 2024
For a while now, we've been unable to update the Azure SDK crates because Azure adopted a non-tokio async runtime, see #7545.

The effort to upstream the fix has stalled, and I think it's better to switch to a patched version of the SDK that is up to date.

We now have a fork of the SDK under the neondatabase GitHub org, to which I have applied Conrad's rebased patches:
https://github.com/neondatabase/azure-sdk-for-rust/tree/neon .

The existence of a fork will also help with shipping bulk delete support
before it's upstreamed (#7931).

In related news, the Azure SDK's development has split: the main branch is for a future, to-be-officially-blessed release of the SDK, while the older versions, which we are currently using, live on the `legacy` branch. Upstream doesn't really want patches for the `legacy` branch any more; they want to focus on `main`. Even so, the `legacy` branch is still newer than what we have right now, so let's switch to `legacy` for now.

Depending on how long it takes, we can switch to the official version of the SDK once it's released, or to the upstream `main` branch if there are changes we want before then.

As a nice side effect of this PR, we now use reqwest 0.12 everywhere, dropping the dependency on version 0.11.

Fixes #7545
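
For illustration only, pinning a workspace to such a fork might look roughly like this; the crate list and the use of `[patch.crates-io]` are assumptions, not the actual change in the repo (only the fork URL and branch name come from the commit message above):

```toml
# Hypothetical sketch: point the Azure SDK crates at the neondatabase fork.
# Which crates need patching, and how, is an assumption for illustration.
[patch.crates-io]
azure_core = { git = "https://github.com/neondatabase/azure-sdk-for-rust", branch = "neon" }
azure_storage = { git = "https://github.com/neondatabase/azure-sdk-for-rust", branch = "neon" }
azure_storage_blobs = { git = "https://github.com/neondatabase/azure-sdk-for-rust", branch = "neon" }
```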
github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024
Azure has a per-request limit of 256 items for bulk deletion, compared to 1000 on AWS, so we need to support different values per backend. Because of `GenericRemoteStorage`, where the concrete backend is only chosen at runtime, an associated constant won't work; it has to be a function.

The PR replaces the `MAX_KEYS_PER_DELETE` constant with a function of the same name, implemented both on the `RemoteStorage` trait and on `GenericRemoteStorage`.

The value serves as a hint for how many objects to pass to the `delete_objects` function per call (see the sketch below).

Reading:

* https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch
* https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html

Part of #7931
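
A minimal sketch of the constant-to-function shape (the trait and function names come from the commit message; everything else is simplified, and the real trait has many more methods):

```rust
/// Simplified, hypothetical stand-in for the remote_storage trait.
pub trait RemoteStorage {
    /// Hint for how many objects to pass to `delete_objects` per call.
    fn max_keys_per_delete(&self) -> usize;
}

pub struct AzureBlobStorage;
pub struct S3Bucket;

impl RemoteStorage for AzureBlobStorage {
    fn max_keys_per_delete(&self) -> usize {
        256 // Azure blob batch limit per request
    }
}

impl RemoteStorage for S3Bucket {
    fn max_keys_per_delete(&self) -> usize {
        1000 // S3 DeleteObjects limit per request
    }
}
```

Callers can then chunk their deletions with something like `keys.chunks(storage.max_keys_per_delete())`, regardless of which backend they are talking to.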

arpad-m commented Dec 18, 2024

A neon-internal project deleted a branch with 115 GiB of WAL segments, causing a stuck-project alert: https://neondb.slack.com/archives/C03H1K0PGKH/p1734533843928339
