Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: test_download_remote_layers_api again #5322

Merged
merged 7 commits into from
Sep 16, 2023
Merged
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 19 additions & 16 deletions test_runner/regress/test_ondemand_download.py
Original file line number Diff line number Diff line change
Expand Up @@ -323,8 +323,8 @@ def test_download_remote_layers_api(
"compaction_period": "0s",
# small checkpoint distance to create more delta layer files
"checkpoint_distance": f"{1 * 1024 ** 2}", # 1 MB
"compaction_threshold": "1",
"image_creation_threshold": "1",
"compaction_threshold": "999999",
"image_creation_threshold": "999999",
"compaction_target_size": f"{1 * 1024 ** 2}", # 1 MB
}
)
Expand Down Expand Up @@ -357,8 +357,23 @@ def get_resident_physical_size():
tenant_id, timeline_id, "pageserver_resident_physical_size"
)

# Shut down safekeepers before starting the pageserver.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it is better to introduce PS parameter for delaying logical size computation?
Itmay be useful not only for tests...

Copy link
Member Author

@koivunej koivunej Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Problem here is not the fact that safekeeper communication ends up asking logical size, that's what the failpoint is for, set as environment variable when starting up pageserver.

Problem was that new WAL was being received in apparently when there's other load on the test runners leading to flakyness of the test, and my previous "fix" made this more visible. Or alternatively compaction was going on, I didn't really dig too deep because I was doing work.

I should look at the comments because as noted in slack, at best they are distracting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned the test up in e249c8a not to allude to safekeepers launching logical size calculation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the example I linked here #3831 (comment):

EDIT: Log analysis later on: logs clearly show inmemory flush after metrics were read during shutdown

# If we don't, the tenant's walreceiver handler will trigger the
# the logical size computation task, and that downloads layes,
# which makes our assertions on size fail.
koivunej marked this conversation as resolved.
Show resolved Hide resolved
for sk in env.safekeepers:
sk.stop()
# sk.stop(immediate=True)

# it is sad we cannot do a flush inmem layer without compaction, but
# working around with very high layer0 count and image layer creation
# threshold
client.timeline_checkpoint(tenant_id, timeline_id)

wait_for_upload_queue_empty(client, tenant_id, timeline_id)

filled_current_physical = get_api_current_physical_size()
log.info(filled_current_physical)
log.info(f"filled_current_physical: {filled_current_physical}")
filled_size = get_resident_physical_size()
log.info(f"filled_size: {filled_size}")
assert filled_current_physical == filled_size, "we don't yet do layer eviction"
Expand All @@ -371,13 +386,6 @@ def get_resident_physical_size():
log.info(f"unlinking layer {layer.name}")
layer.unlink()

# Shut down safekeepers before starting the pageserver.
# If we don't, the tenant's walreceiver handler will trigger the
# the logical size computation task, and that downloads layes,
# which makes our assertions on size fail.
for sk in env.safekeepers:
sk.stop(immediate=True)

##### Second start, restore the data and ensure it's the same
env.pageserver.start(extra_env_vars={"FAILPOINTS": "remote-storage-download-pre-rename=return"})
env.pageserver.allowed_errors.extend(
Expand All @@ -395,13 +403,8 @@ def get_resident_physical_size():
# witnessed for example difference of 29827072 (filled_current_physical) to 29868032 (here) is no good reason to fail a test.
this_time = get_api_current_physical_size()
assert (
filled_current_physical <= this_time
filled_current_physical == this_time
), "current_physical_size is sum of loaded layer sizes, independent of whether local or remote"
if filled_current_physical != this_time:
log.info(
f"fixing up filled_current_physical from {filled_current_physical} to {this_time} ({this_time - filled_current_physical})"
)
filled_current_physical = this_time

post_unlink_size = get_resident_physical_size()
log.info(f"post_unlink_size: {post_unlink_size}")
Expand Down