Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(test_runner): allowed_errors in storage scrubber #10062

Merged
merged 5 commits into from
Dec 10, 2024

Conversation

skyzh
Copy link
Member

@skyzh skyzh commented Dec 9, 2024

Problem

resolve #9988 (comment)

Summary of changes

  • New verbose mode for storage scrubber scan metadata (pageserver) that contains the error messages.
  • Filter allowed_error list from the JSON output to determine the healthy flag status.

Copy link

github-actions bot commented Dec 9, 2024

7062 tests run: 6747 passed, 0 failed, 315 skipped (full report)


Flaky tests (4)

Postgres 17

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 31.4% (8333 of 26537 functions)
  • lines: 47.7% (65614 of 137569 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3ee2bd3 at 2024-12-10T16:26:30.108Z :recycle:

Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh requested review from problame and yliang412 December 9, 2024 21:07
@skyzh skyzh marked this pull request as ready for review December 9, 2024 21:07
@skyzh skyzh requested a review from a team as a code owner December 9, 2024 21:07
@skyzh skyzh force-pushed the skyzh/scrubber-allowed-errors branch from 5689d07 to 86b92fa Compare December 9, 2024 21:08
@skyzh skyzh force-pushed the skyzh/scrubber-allowed-errors branch from 86b92fa to 457935f Compare December 9, 2024 21:09
Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh force-pushed the skyzh/scrubber-allowed-errors branch from 564d844 to d1e4d76 Compare December 9, 2024 21:13
@skyzh
Copy link
Member Author

skyzh commented Dec 9, 2024

Next step: figure out which test cases would produce orphan layers and allow such errors for these test cases. We can also choose to add it to default allowed errors.

Copy link
Contributor

@yliang412 yliang412 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we discussed the disable_scrub_on_exit method in the meeting. I think currently it is used for two cases:

  1. We deleted the only tenant, and the scrubber fails if it detects nothing.
  2. Immediate shutdown might left orphan layers.

As a next step, we potentially could add those to structured allowed errors and remove disable_scrub_on_exit if that's desired. I don't have strong opinion on this.

storage_scrubber/src/scan_pageserver_metadata.rs Outdated Show resolved Hide resolved
Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh force-pushed the skyzh/scrubber-allowed-errors branch from 6a96956 to 3ee2bd3 Compare December 10, 2024 15:14
@skyzh skyzh enabled auto-merge December 10, 2024 15:23
@skyzh skyzh added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit aa0554f Dec 10, 2024
82 checks passed
@skyzh skyzh deleted the skyzh/scrubber-allowed-errors branch December 10, 2024 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote storage metadata corrupted in test_timeline_archival_chaos
2 participants