Fix defusing race between Tenant::shutdown and offload_timeline #10150

Open
wants to merge 2 commits into
base: main

Conversation

arpad-m
Member

@arpad-m arpad-m commented Dec 13, 2024

There is a race condition between the defuse_for_drop loop in Tenant::shutdown and offload_timeline: timeline offloading can insert an offloaded timeline into a tenant that is already shutting down, and the shutdown may have progressed so far that defuse_for_drop has already been called on the existing offloaded timelines.

This PR fixes the race, which prevents warn log lines of the form:

offloaded timeline <hash> was dropped without having cleaned it up at the ancestor

The solution piggybacks on the offloaded_timelines lock: both the defuse loop and the offloaded timeline insertion need to acquire the lock, and we know that the defuse loop only runs after the tenant has set its TenantState to Stopping.

So if we hold the offloaded_timelines lock and know that the TenantState is not Stopping, then we know that the defuse loop has not run yet, and holding the lock ensures that it cannot start running while we are inserting the offloaded timeline.
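
To illustrate the argument above, here is a minimal standalone sketch of the locking pattern. It is not the pageserver code: the `Tenant`, `TenantState`, and `OffloadedTimeline` types are simplified stand-ins, and `try_insert_offloaded`, `undefused_count`, the `defused` flag, and the `u32` timeline IDs are hypothetical names chosen for this example. It also assumes plain `std::sync::Mutex` for both the state and the map, which may differ from what the real code uses.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Simplified stand-in for the tenant state (hypothetical, two variants only).
#[derive(Clone, Copy, PartialEq, Debug)]
enum TenantState {
    Active,
    Stopping,
}

// Simplified stand-in for an offloaded timeline; `defused` is a hypothetical
// flag that models "defuse_for_drop has been called on this entry".
struct OffloadedTimeline {
    defused: bool,
}

struct Tenant {
    state: Mutex<TenantState>,
    offloaded_timelines: Mutex<HashMap<u32, OffloadedTimeline>>,
}

impl Tenant {
    fn new() -> Self {
        Tenant {
            state: Mutex::new(TenantState::Active),
            offloaded_timelines: Mutex::new(HashMap::new()),
        }
    }

    // Offload path: take the offloaded_timelines lock FIRST, then check the state.
    // If the state is not Stopping while we hold the lock, the defuse loop has not
    // run yet, and it cannot start until we release the lock.
    fn try_insert_offloaded(&self, id: u32) -> bool {
        let mut offloaded = self.offloaded_timelines.lock().unwrap();
        if *self.state.lock().unwrap() == TenantState::Stopping {
            // Shutdown is in progress: refuse to insert rather than race the defuse loop.
            return false;
        }
        offloaded.insert(id, OffloadedTimeline { defused: false });
        true
    }

    // Shutdown path: set Stopping BEFORE taking the lock for the defuse loop.
    fn shutdown(&self) {
        // The state guard is a temporary and is released at the end of this statement,
        // so the two locks are never held at once on this path.
        *self.state.lock().unwrap() = TenantState::Stopping;
        let mut offloaded = self.offloaded_timelines.lock().unwrap();
        for timeline in offloaded.values_mut() {
            timeline.defused = true;
        }
    }

    // Number of entries the defuse loop missed; should be 0 after shutdown.
    fn undefused_count(&self) -> usize {
        let offloaded = self.offloaded_timelines.lock().unwrap();
        offloaded.values().filter(|t| !t.defused).count()
    }
}
```

In this sketch, an insertion that observes a non-Stopping state while holding the lock is always visible to the later defuse loop, and an insertion that arrives after `shutdown` has set Stopping is rejected. Either way, no entry is left undefused once `shutdown` returns and all inserters have finished.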

Fixes #10070

@arpad-m arpad-m requested a review from a team as a code owner December 13, 2024 21:48
@arpad-m arpad-m requested a review from problame December 13, 2024 21:48
@arpad-m arpad-m changed the title Fix defusing race between Tenant::shutdown and offload_timeline Fix offloaded timeline defusing race between Tenant::shutdown and offload_timeline Dec 13, 2024
@arpad-m arpad-m changed the title Fix offloaded timeline defusing race between Tenant::shutdown and offload_timeline Fix defusing race between Tenant::shutdown and offload_timeline Dec 13, 2024
@arpad-m arpad-m force-pushed the arpad/fix_offload_defuse_race branch from 1da89f4 to e2dfd26 Compare December 13, 2024 22:35

github-actions bot commented Dec 13, 2024

7095 tests run: 6797 passed, 0 failed, 298 skipped (full report)


Flaky tests (7)

Postgres 17

Postgres 16

  • test_pgdata_import_smoke[8-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: release-arm64
  • test_pgdata_import_smoke[None-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: release-arm64

Postgres 15

Postgres 14

  • test_pgdata_import_smoke[None-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: release-arm64

Code coverage* (full report)

  • functions: 31.3% (8396 of 26831 functions)
  • lines: 48.0% (66653 of 138900 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
ddf3128 at 2024-12-18T18:10:07.685Z :recycle:

Development

Successfully merging this pull request may close these issues.

test_timeline_archival_chaos hits log error