storage controller: sleep between compute notify retries (#8869)
## Problem

When live migration fails to notify compute of the new location, it
retries immediately. It should sleep between attempts.

Closes: #8820

## Summary of changes

- Do an `exponential_backoff` in the retry loop for compute
notifications
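
As an illustration of the pattern this change applies, here is a minimal, synchronous sketch of a retry loop that sleeps for a capped, exponentially growing delay between attempts. It is not the `utils::backoff::exponential_backoff` implementation: `backoff_delay` and `retry_with_backoff` are hypothetical names, the doubling schedule is an assumption, and reading the `1.0` and `10.0` arguments in the diff as a base delay and a cap in seconds is also an assumption. The real code is async and cancellable through `self.cancel`, which this sketch leaves out.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before retry number `attempt` (0-based).
/// The delay doubles from `base_secs` on each attempt and is capped at `max_secs`.
fn backoff_delay(attempt: u32, base_secs: f64, max_secs: f64) -> Duration {
    // Clamp the exponent so the cast to i32 can never overflow.
    let factor = 2f64.powi(attempt.min(32) as i32);
    Duration::from_secs_f64((base_secs * factor).min(max_secs))
}

/// Hypothetical retry loop: calls `op` until it succeeds or `max_attempts`
/// calls have failed, invoking `sleep` with the backoff delay between calls.
/// Passing the sleep function in keeps the loop testable without real waits.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
    max_attempts: u32,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                // Assumed parameters: 1 second base, 10 second cap.
                sleep(backoff_delay(attempt, 1.0, 10.0));
                attempt += 1;
            }
        }
    }
}
```

In production the `sleep` argument could be `std::thread::sleep`; in a test it can simply record the requested delays.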
jcsp authored Aug 30, 2024
1 parent 72aa6b0 commit 20f82f9
Showing 1 changed file with 13 additions and 0 deletions.
13 changes: 13 additions & 0 deletions storage_controller/src/reconciler.rs
@@ -12,6 +12,7 @@ use std::collections::HashMap;
 use std::sync::Arc;
 use std::time::{Duration, Instant};
 use tokio_util::sync::CancellationToken;
+use utils::backoff::exponential_backoff;
 use utils::failpoint_support;
 use utils::generation::Generation;
 use utils::id::{NodeId, TimelineId};
@@ -568,6 +569,7 @@ impl Reconciler {

         // During a live migration it is unhelpful to proceed if we couldn't notify compute: if we detach
         // the origin without notifying compute, we will render the tenant unavailable.
+        let mut notify_attempts = 0;
         while let Err(e) = self.compute_notify().await {
             match e {
                 NotifyError::Fatal(_) => return Err(ReconcileError::Notify(e)),
@@ -578,6 +580,17 @@ impl Reconciler {
                     );
                 }
             }
+
+            exponential_backoff(
+                notify_attempts,
+                // Generous waits: control plane operations which might be blocking us usually complete on the order
+                // of hundreds to thousands of milliseconds, so no point busy polling.
+                1.0,
+                10.0,
+                &self.cancel,
+            )
+            .await;
+            notify_attempts += 1;
         }

         // Downgrade the origin to secondary. If the tenant's policy is PlacementPolicy::Attached(0), then

1 comment on commit 20f82f9

@github-actions

3889 tests run: 3772 passed, 1 failed, 116 skipped (full report)


Failures on Postgres 16

  • test_pgbench_intensive_init_workload[neon_off-github-actions-selfhosted-1000]: release-x86-64
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_pgbench_intensive_init_workload[neon_off-release-pg16-github-actions-selfhosted-1000]"
Flaky tests (2)

Postgres 16

Code coverage* (full report)

  • functions: 32.5% (7416 of 22809 functions)
  • lines: 50.6% (60043 of 118599 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
20f82f9 at 2024-08-30T12:55:50.856Z :recycle:
