Skip to content

Commit

Permalink
Perform retries on azure bulk deletion (#7964)
Browse files Browse the repository at this point in the history
This adds retries to the bulk deletion, because if there is a certain
chance n that a request fails, the chance that at least one of the
requests in a chain of requests fails increases exponentially.

We've had similar issues with the S3 DR tests, which in the end yielded
in adding retries at the remote_storage level. Retries at the top level
are not sufficient when one remote_storage "operation" is multiple
network requests in a trench coat, especially when there is no notion of
saving the progress: even if prior deletions had been successful, we'd
still need to get a 404 in order to continue the loop and get to the
point where we failed in the last iteration. Maybe we'll fail again but
before we've even reached it.

Retries at the bottom level avoid this issue because they have the
notion of progress and also when one network operation fails, only that
operation is retried.

First part of #7931.
  • Loading branch information
arpad-m authored Jun 6, 2024
1 parent a8be077 commit 75bca9b
Showing 1 changed file with 51 additions and 17 deletions.
68 changes: 51 additions & 17 deletions libs/remote_storage/src/azure_blob.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
use std::borrow::Cow;
use std::collections::HashMap;
use std::env;
use std::fmt::Display;
use std::io;
use std::num::NonZeroU32;
use std::pin::Pin;
Expand All @@ -29,6 +30,7 @@ use http_types::{StatusCode, Url};
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use utils::backoff;

use crate::metrics::{start_measuring_requests, AttemptOutcome, RequestKind};
use crate::{
Expand Down Expand Up @@ -451,26 +453,58 @@ impl RemoteStorage for AzureBlobStorage {
// TODO batch requests are not supported by the SDK
// https://github.com/Azure/azure-sdk-for-rust/issues/1068
for path in paths {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));

let request = blob_client.delete().into_future();

let res = tokio::time::timeout(self.timeout, request).await;

match res {
Ok(Ok(_response)) => continue,
Ok(Err(e)) => {
if let Some(http_err) = e.as_http_error() {
if http_err.status() == StatusCode::NotFound {
continue;
}
}
return Err(e.into());
#[derive(Debug)]
enum AzureOrTimeout {
AzureError(azure_core::Error),
Timeout,
Cancel,
}
impl Display for AzureOrTimeout {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{self:?}")
}
Err(_elapsed) => return Err(TimeoutOrCancel::Timeout.into()),
}
let warn_threshold = 3;
let max_retries = 5;
backoff::retry(
|| async {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));

let request = blob_client.delete().into_future();

let res = tokio::time::timeout(self.timeout, request).await;

match res {
Ok(Ok(_v)) => Ok(()),
Ok(Err(azure_err)) => {
if let Some(http_err) = azure_err.as_http_error() {
if http_err.status() == StatusCode::NotFound {
return Ok(());
}
}
Err(AzureOrTimeout::AzureError(azure_err))
}
Err(_elapsed) => Err(AzureOrTimeout::Timeout),
}
},
|err| match err {
AzureOrTimeout::AzureError(_) | AzureOrTimeout::Timeout => false,
AzureOrTimeout::Cancel => true,
},
warn_threshold,
max_retries,
"deleting remote object",
cancel,
)
.await
.ok_or_else(|| AzureOrTimeout::Cancel)
.and_then(|x| x)
.map_err(|e| match e {
AzureOrTimeout::AzureError(err) => anyhow::Error::from(err),
AzureOrTimeout::Timeout => TimeoutOrCancel::Timeout.into(),
AzureOrTimeout::Cancel => TimeoutOrCancel::Cancel.into(),
})?;
}

Ok(())
};

Expand Down

1 comment on commit 75bca9b

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3280 tests run: 3128 passed, 0 failed, 152 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_statvfs_pressure_usage: release
  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 31.6% (6606 of 20937 functions)
  • lines: 48.5% (51069 of 105314 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
75bca9b at 2024-06-06T16:24:23.079Z :recycle:

Please sign in to comment.