
Correct mistakes in offloaded timeline retain_lsn management #9760

Merged
merged 15 commits into main from arpad/retain_lsn_management on Nov 15, 2024

Conversation

arpad-m (Member) commented Nov 14, 2024

PR #9308 modified the tenant activation code to take offloaded child timelines into account when populating the list of retain_lsn values. However, there are more places than just tenant activation where the retain_lsn values need to be updated.

This PR fixes some bugs in the current code that could, in the worst case, lead to data corruption:

  1. Deleting an offloaded timeline would not purge its retain_lsn from its parent. With this patch we now do so, but since the parent can be offloaded as well, the situation is a bit trickier than for non-offloaded timelines, which can just keep a pointer to their parent. Here we can't keep a pointer, because the parent might get offloaded and then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where an offloaded_timelines lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline.
  2. Unoffloading a timeline would not get its retain_lsn values populated, leading to it possibly garbage collecting data that its children might still need. We now call initialize_gc_info on the parent.
  3. Offloading a timeline would not register its retain_lsn values as offloaded at the parent. So if we drop the Timeline object and its registration is removed, the parent would not have any of the child's retain_lsns around. Also, before, the Timeline object would delete anything related to its timeline ID; now it only deletes retain_lsns that have MaybeOffloaded::No set (see the sketch after this list).
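
As a rough illustration of the bookkeeping described in points 1 and 3, here is a minimal, self-contained Rust sketch. The names (GcInfo, retain_lsns, insert_child, MaybeOffloaded) echo the ones used in this description, but the bodies are simplified assumptions for illustration, not the actual pageserver implementation.

```rust
// Simplified sketch, not the pageserver code: per-child retain_lsn entries in
// the parent, tagged with whether they belong to an offloaded child.

type TimelineId = u64;
type Lsn = u64;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum MaybeOffloaded {
    Yes,
    No,
}

#[derive(Default)]
struct GcInfo {
    // (retain_lsn, child timeline id, offloaded flag) per child branch point.
    retain_lsns: Vec<(Lsn, TimelineId, MaybeOffloaded)>,
}

impl GcInfo {
    fn insert_child(&mut self, child: TimelineId, at: Lsn, kind: MaybeOffloaded) {
        self.retain_lsns.push((at, child, kind));
    }

    // Remove only entries matching the given offloaded-ness, so dropping a
    // Timeline object does not also wipe the registration made on behalf of
    // its offloaded counterpart (point 3 above).
    fn remove_child(&mut self, child: TimelineId, kind: MaybeOffloaded) {
        self.retain_lsns
            .retain(|(_, id, k)| !(*id == child && *k == kind));
    }
}

fn main() {
    let mut parent_gc_info = GcInfo::default();
    let child: TimelineId = 42;
    let branch_lsn: Lsn = 0x015A_EA60;

    // Offloading: register the child as offloaded before the Timeline object drops.
    parent_gc_info.insert_child(child, branch_lsn, MaybeOffloaded::Yes);
    // Dropping the in-memory Timeline only removes the non-offloaded entry.
    parent_gc_info.remove_child(child, MaybeOffloaded::No);
    assert_eq!(parent_gc_info.retain_lsns.len(), 1);

    // Deleting the offloaded timeline later removes the remaining entry.
    parent_gc_info.remove_child(child, MaybeOffloaded::Yes);
    assert!(parent_gc_info.retain_lsns.is_empty());
}
```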

Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199

The test_timeline_retain_lsn test is extended:

  1. It gains a new dimension that duplicates each mode: either the "main" branch is the direct parent of the timeline we archive, or the "test_archived_parent" branch sits in between as an intermediary, creating a three-timeline structure. This doesn't test anything fixed by this PR in particular; it just explores the vast space of possible configurations a little more.
  2. It gains two new modes: offload-parent, which tests the second point, and offload-no-restart, which tests the third point.

It's easy to verify that the test is actually "sharp" by removing one of the respective self.initialize_gc_info(), gc_info.insert_child(), or ancestor_children.push() calls.

Part of #8088

arpad-m requested a review from skyzh on November 14, 2024 15:56
arpad-m requested a review from a team as a code owner on November 14, 2024 15:56
skyzh (Member) left a comment

LGTM; overall I feel we might want to revisit the retain_lsn mechanism, as manipulating it in Drop doesn't seem like a good idea... As we get more and more timeline states in the future, probably it would be a good idea to re-compute it every time before gc/compaction?

pageserver/src/tenant.rs (outdated; review thread resolved)
pageserver/src/tenant/timeline.rs (review thread resolved)
arpad-m (Member, Author) commented Nov 14, 2024

> probably it would be a good idea to re-compute it every time before gc/compaction?

I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.

skyzh (Member) commented Nov 14, 2024

> probably it would be a good idea to re-compute it every time before gc/compaction?
>
> I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.

But this is done in initialize_gc_info, which is only called in Tenant::activate; if recomputation is the way to go, we would also need to do it in refresh_gc_info in the future. The current way of computing is incremental (i.e., only updated when new timelines get created or timelines get dropped); this is efficient, but causes larger engineering overhead to maintain...
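
For comparison, here is a minimal sketch of the recompute-from-scratch alternative discussed above, under assumed type shapes (TimelineInfo, recompute_retain_lsns are hypothetical names); it is not the pageserver's refresh_gc_info implementation.

```rust
use std::collections::HashMap;

type TimelineId = u64;
type Lsn = u64;

// Hypothetical, simplified view of a timeline; not the real Timeline struct.
struct TimelineInfo {
    id: TimelineId,
    ancestor: Option<(TimelineId, Lsn)>, // (parent id, branch point LSN)
}

/// Rebuild every parent's retain_lsn list from scratch from the full set of
/// timelines (active and offloaded alike), instead of updating it
/// incrementally on timeline creation/drop.
fn recompute_retain_lsns(
    timelines: &[TimelineInfo],
) -> HashMap<TimelineId, Vec<(Lsn, TimelineId)>> {
    let mut branchpoints: HashMap<TimelineId, Vec<(Lsn, TimelineId)>> = HashMap::new();
    for tl in timelines {
        if let Some((parent, lsn)) = tl.ancestor {
            branchpoints.entry(parent).or_default().push((lsn, tl.id));
        }
    }
    branchpoints
}

fn main() {
    let timelines = vec![
        TimelineInfo { id: 1, ancestor: None },
        TimelineInfo { id: 2, ancestor: Some((1, 0x100)) },
        TimelineInfo { id: 3, ancestor: Some((1, 0x200)) }, // e.g. an offloaded child
    ];
    // Called right before gc/compaction, this would make stale incremental
    // state impossible at the cost of iterating all timelines each time.
    let retain = recompute_retain_lsns(&timelines);
    assert_eq!(retain[&1].len(), 2);
}
```

The trade-off sketched here is the one described in the comment: a per-gc scan over all timelines in exchange for not having to keep the incremental bookkeeping correct across offload, unoffload, deletion, and detach.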

arpad-m (Member, Author) commented Nov 14, 2024

> But this is done in initialize_gc_info, which is only called in Tenant::activate;

Yeah, that's what I wanted to say: I mistakenly assumed this when I wrote #9308. Now I know better :)

arpad-m force-pushed the arpad/retain_lsn_management branch from 14e6cfb to 99a959a on November 14, 2024 16:52

github-actions bot commented Nov 14, 2024

5490 tests run: 5247 passed, 0 failed, 243 skipped (full report)


Flaky tests (1): Postgres 17

Code coverage* (full report)

  • functions: 31.9% (7931 of 24849 functions)
  • lines: 49.7% (62841 of 126497 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cef74a5 at 2024-11-15T12:59:31.645Z :recycle:

arpad-m merged commit 7880c24 into main on Nov 15, 2024
80 checks passed
arpad-m deleted the arpad/retain_lsn_management branch on November 15, 2024 13:22
arpad-m (Member, Author) commented Nov 15, 2024

Documenting the test failures if I comment out the offloading-specific retain_lsn code:

commented out self.initialize_gc_info:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - AssertionError: First log error on pageserver_1: (8984, '2024-11-15T13:08:29.472799Z ERROR page_service_conn_main{peer_addr=127.0.0.1:47642}:process_query{tenant_id=55706acf3cbb2e738d341d35f1d49699 timeline_id=8d82c84d21bf67cd8735c4e93c0651db}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - AssertionError: First log error on pageserver_1: (10088, "2024-11-15T13:08:28.585638Z ERROR page_service_conn_main{peer_addr=127.0.0.1:59498}: query handler for 'basebackup 29c9fe6e8eedde931f811b22d663bfe8 7810e79f0135f125419355b95ffb99db --gzip' ...

commented out gc_info.insert_child:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-True] - AssertionError: First log error on pageserver_1: (8075, '2024-11-15T13:12:01.817458Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46472}:process_query{tenant_id=1d4832e6363110d56c35508bdf3e872b timeline_id=953b2c11aee998a26c0fae6c112645aa}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-False] - AssertionError: First log error on pageserver_1: (8146, "2024-11-15T13:12:03.551280Z ERROR page_service_conn_main{peer_addr=127.0.0.1:36762}: query handler for 'basebackup 0d9e975731b4a6e650a4a0d81871ef64 655e71391cf90d0ee14300c3ca2f78bb --gzip' f...

commented out ancestor_children.push:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-True] - AssertionError: First log error on pageserver_1: (8412, '2024-11-15T13:19:58.864915Z ERROR page_service_conn_main{peer_addr=127.0.0.1:34142}:process_query{tenant_id=e1222ebe282467f72d560b2d1c3728c1 timeline_id=44086dc2ff71364001364b788cce9d10}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - AssertionError: First log error on pageserver_1: (8989, '2024-11-15T13:19:59.441191Z ERROR page_service_conn_main{peer_addr=127.0.0.1:36232}:process_query{tenant_id=1b4eb5ccc71cfdba923d6a72f7d93d53 timeline_id=88c85775b855f3452b4072232b2a0ce3}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-False] - AssertionError: First log error on pageserver_1: (8423, "2024-11-15T13:19:57.832008Z ERROR page_service_conn_main{peer_addr=127.0.0.1:53884}: query handler for 'basebackup d3dd85530cc06cf768f784b79b3295b5 e706a8c9aef648f0d6fec837abdd8f02 --gzip' f...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - AssertionError: First log error on pageserver_1: (10104, "2024-11-15T13:19:59.341517Z ERROR page_service_conn_main{peer_addr=127.0.0.1:33498}: query handler for 'basebackup 767019d1809ca7794cdc2ed543df2bcf 7f3203f9a17daf030dca1878bb88c564 --gzip' ...
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 909f99ea9..27de8c2ba 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -542,7 +542,7 @@ fn from_timeline(timeline: &Timeline) -> Result<Self, UploadQueueNotReadyError>
                 let ancestor_lsn = timeline.get_ancestor_lsn();
                 let ancestor_timeline_id = ancestor_timeline.timeline_id;
                 let mut gc_info = ancestor_timeline.gc_info.write().unwrap();
-                gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
+                //gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
                 (Some(ancestor_lsn), Some(ancestor_timeline_id))
             } else {
                 (None, None)
@@ -1988,7 +1988,7 @@ async fn unoffload_timeline(
                 None => warn!("timeline already removed from offloaded timelines"),
             }
 
-            self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
+            //self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
 
             Arc::clone(timeline)
         };
@@ -3865,7 +3865,7 @@ fn initialize_gc_info(
                     return;
                 };
                 let ancestor_children = all_branchpoints.entry(*ancestor_timeline_id).or_default();
-                ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
+                //ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
             });
 
         // The number of bytes we always keep, irrespective of PITR: this is a constant across timelines

arpad-m added a commit that referenced this pull request Nov 18, 2024
…'s parent (#9791)

There is a potential data corruption issue. It's not one I've encountered,
but it's still not hard to hit with some correct-looking code given our
current architecture. It has to do with timeline objects being stored in
memory via reference-counted `Arc`s, and with the removal of `retain_lsn`
entries when the last `Arc` reference is dropped.

The corruption steps are as follows:

1. The timeline gets offloaded. Timeline object A doesn't get dropped
though, because some long-running task still accesses it.
2. The same timeline gets unoffloaded again. Timeline object B gets
created for it, while timeline object A is still referenced. Both point
to the same timeline.
3. The task keeping the reference to timeline object A exits. The
destructor for object A runs, removing the `retain_lsn` in the
timeline's parent.
4. The timeline's parent runs gc without the `retain_lsn` of the still
extant child timeline, leading to data corruption.

In general, we are susceptible every time we recreate a `Timeline`
object in the same process, which happens both during a timeline
offload/unoffload cycle and during an ancestor detach operation.

The solution this PR implements is to make the destructor of a timeline,
as well as of an offloaded timeline, remove at most one `retain_lsn` entry.
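
A hedged sketch of that "remove at most one entry" idea, with hypothetical names (GcInfo, remove_child_once) standing in for the real types rather than quoting the actual change:

```rust
use std::sync::{Arc, Mutex};

type TimelineId = u64;
type Lsn = u64;

struct GcInfo {
    retain_lsns: Vec<(Lsn, TimelineId)>,
}

impl GcInfo {
    /// Remove at most one entry for `child`, so that a stale Timeline object
    /// dropped after an offload/unoffload cycle cannot also delete the entry
    /// registered by the freshly created object.
    fn remove_child_once(&mut self, child: TimelineId) {
        if let Some(pos) = self.retain_lsns.iter().position(|(_, id)| *id == child) {
            self.retain_lsns.remove(pos);
        }
    }
}

struct Timeline {
    id: TimelineId,
    parent_gc_info: Arc<Mutex<GcInfo>>,
}

impl Drop for Timeline {
    fn drop(&mut self) {
        // Only ever remove a single retain_lsn entry per dropped object.
        self.parent_gc_info
            .lock()
            .unwrap()
            .remove_child_once(self.id);
    }
}

fn main() {
    // Entries registered by objects A and B for the same child timeline.
    let parent = Arc::new(Mutex::new(GcInfo {
        retain_lsns: vec![(0x10, 7), (0x10, 7)],
    }));
    let stale = Timeline { id: 7, parent_gc_info: Arc::clone(&parent) };
    drop(stale); // object A: removes one entry; the other survives for gc
    assert_eq!(parent.lock().unwrap().retain_lsns.len(), 1);
}
```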

PR #9760 has added a log line to print the refcounts at timeline
offload, but this only detects one of the places where we do such a
recycle operation. Plus it doesn't prevent the actual issue.

I doubt that this occurs in practice; it is more of a defense-in-depth measure.
Usually I'd assume that the timeline gets dropped immediately in step 1,
as there are no background tasks referencing it after its shutdown.
But one never knows, and reducing the stakes of step 1 actually occurring,
from potential data corruption down to a waste of CPU time, is a really good idea.

Part of #8088