Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeline offload mechanism #8907

Merged
merged 17 commits into from
Oct 8, 2024
Merged

Add timeline offload mechanism #8907

merged 17 commits into from
Oct 8, 2024

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Sep 3, 2024

Implements an initial mechanism for offloading of archived timelines.

Offloading is implemented as specified in the RFC.

For now, there is no persistence, so a restart of the pageserver will retrigger downloads until the timeline is offloaded again.

We trigger offloading in the compaction loop because we need the signal for whether compaction is done and everything has been uploaded or not.

Part of #8088

Copy link

github-actions bot commented Sep 4, 2024

5085 tests run: 4878 passed, 0 failed, 207 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 31.3% (7506 of 23952 functions)
  • lines: 49.5% (60276 of 121841 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
1665c67 at 2024-10-07T23:50:16.674Z :recycle:

@arpad-m arpad-m force-pushed the arpad/timeline_offload branch from d76a83b to 25461cc Compare September 20, 2024 14:07
arpad-m added a commit that referenced this pull request Sep 21, 2024
Moves the per-timeline code to load timeline metadata into a new
dedicated function called `load_timeline_metadata`. The old
`load_timeline_metadata` becomes `load_timelines_metadata`.

Split out of #8907

Part of #8088
@arpad-m arpad-m force-pushed the arpad/timeline_offload branch 2 times, most recently from c363e87 to da16915 Compare September 24, 2024 11:35
@arpad-m arpad-m force-pushed the arpad/timeline_offload branch from 24ea738 to 21139c8 Compare September 30, 2024 20:51
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
@arpad-m arpad-m marked this pull request as ready for review October 3, 2024 00:50
@arpad-m arpad-m requested a review from a team as a code owner October 3, 2024 00:50
@arpad-m
Copy link
Member Author

arpad-m commented Oct 3, 2024

marking as ready for review as I can add a test in a later PR (it needs additional functionality to manually offload).

@arpad-m arpad-m requested a review from jcsp October 3, 2024 00:51
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks good to me. I wonder if we can simplify a bit though.

pageserver/src/tenant.rs Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but check_ancestor_of_to_be_unarchived_is_not_archived vs check_to_be_unarchived_timeline_has_no_archived_parent confuses me.

pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Show resolved Hide resolved
@arpad-m arpad-m requested a review from VladLazar October 7, 2024 15:57
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This isn't used in prod right now so let's merge and iron it out as we go.

@arpad-m arpad-m merged commit e8ae376 into main Oct 8, 2024
79 checks passed
@arpad-m arpad-m deleted the arpad/timeline_offload branch October 8, 2024 23:33
arpad-m added a commit that referenced this pull request Oct 22, 2024
Persist timeline offloaded state to S3.

Right now, as of #8907, at each restart of the pageserver, all offloaded
state is lost, so we load the full timeline again. As it starts with an
empty local directory, we might potentially download some files again,
leading to downloads that are ultimately wasteful.

This patch adds support for persisting the offloaded state, allowing us
to never load offloaded timelines in the first place. The persistence
feature is facilitated via a new file in S3 that is tenant-global, which
contains a list of all offloaded timelines. It is updated each time we
offload or unoffload a timeline, and otherwise never touched.

This choice means that tenants where no offloading is happening will not
immediately get a manifest, keeping the change very minimal at the
start.

We leave generation support for future work. It is important to support
generations, as in the worst case, the manifest might be overwritten by
an older generation after a timeline has been unoffloaded (and
unarchived), so the next pageserver process instantiation might wrongly
believe that some timeline is still offloaded even though it should be
active.

Part of #9386, #8088
arpad-m added a commit that referenced this pull request Oct 25, 2024
Before, we didn't copy over the `index-part.json` of offloaded timelines
to the new shard's location, resulting in the new shard not knowing the
timeline even exists.

In #9444, we copy over the manifest, but we also need to do this for
`index-part.json`.

As the operations to do are mostly the same between offloaded and
non-offloaded timelines, we can iterate over all of them in the same
loop, after the introduction of a `TimelineOrOffloadedArcRef` type to
generalize over the two cases. This is analogous to the deletion code
added in #8907.

The added test also ensures that the sharded archival config endpoint
works, something that has not yet been ensured by tests.

Part of #8088
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants