
pageserver: assert that keys belong to shard #9943

Merged: 6 commits merged into main from erik/assert-batch-keys on Dec 6, 2024

Conversation

@erikgrinaker (Contributor) commented Nov 29, 2024

We've seen cases where stray keys end up on the wrong shard. This shouldn't happen. Add debug assertions to catch it. In release builds, we should be lenient, to tolerate changes in key ownership policy.

Touches #9914.
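For illustration, here is a minimal, self-contained sketch of the shape of check being added: a debug-only assertion on key ownership that compiles away in release builds. The `Key` and `ShardIdentity` types and the `is_key_local` mapping below are simplified stand-ins for the sketch, not the pageserver's actual definitions or the diff in this PR.

```rust
// Sketch only: Key, ShardIdentity, and is_key_local are simplified
// stand-ins, not the pageserver's real API surface.
#[derive(Debug, Clone, Copy)]
struct Key(u128);

#[derive(Debug)]
struct ShardIdentity {
    number: u32,
    count: u32,
}

impl ShardIdentity {
    /// Hypothetical ownership check: map the key to a shard index and
    /// compare it with this shard's own index. The real mapping is
    /// policy-driven and more involved.
    fn is_key_local(&self, key: &Key) -> bool {
        self.count <= 1 || (key.0 % self.count as u128) as u32 == self.number
    }
}

fn ingest_key(shard: &ShardIdentity, key: Key) {
    // Debug builds: fail fast so tests surface stray keys immediately.
    debug_assert!(
        shard.is_key_local(&key),
        "key {key:?} does not belong on shard {shard:?}"
    );
    // Release builds: be lenient and keep going, so a change in key
    // ownership policy does not take down a production pageserver.
}

fn main() {
    let shard = ShardIdentity { number: 2, count: 4 };
    ingest_key(&shard, Key(6)); // 6 % 4 == 2, so this key is local
}
```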

github-actions bot commented Nov 29, 2024

7066 tests run: 6757 passed, 0 failed, 309 skipped (full report)


Flaky tests (7)

Postgres 17

  • test_physical_replication_config_mismatch_max_locks_per_transaction: release-arm64


Code coverage* (full report)

  • functions: 31.4% (8323 of 26501 functions)
  • lines: 47.7% (65438 of 137048 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results (last update: c340914 at 2024-12-06T10:34:36.698Z).

@erikgrinaker (Contributor, Author) commented:

This keeps timing out on the v17 debug without-lfc regression tests. I wouldn't expect the added assertions here to have a significant performance impact, but I'll look into it.

@erikgrinaker (Contributor, Author) commented:

Idk, this barely shows up on CPU profiles -- validate_lsn_order is at least two orders of magnitude more expensive.

On a successful run over on #10000, this job completed in 40 minutes. Here, it times out after an hour.

Maybe the test fails in some other way, e.g. a node crash that manifests as a timeout. Having a look at the test artifacts.

@erikgrinaker (Contributor, Author) commented:

I'm not seeing any other issues except spurious timeouts, and I don't see how this change would be expensive enough to increase runtime from 40 to 60 minutes. Rerunning.

@erikgrinaker (Contributor, Author) commented:

I don't know why this wasn't included in any of the CI artifacts, but I found the issue with test_isolation locally:

ERROR wal_connection_manager{tenant_id=c1ae72753459caf54556dafe583b6c54 shard_id=0204 timeline_id=31886a8ef88a0618b1d4fb9a28900f92}:connection{node_id=1}:panic{thread=walreceiver worker location=pageserver/src/tenant/timeline.rs:5897:21}: key 000000067F0000400000000A2E0100000002 does not belong on shard 0204

The test just hangs at this point, since the compute is waiting for the pageserver.

Let's see what this is.

@erikgrinaker (Contributor, Author) commented:

It's an FSM key:

field1: 00         kind => relation key
field2: 0000067F   spcnode
field3: 00004000   dbnode
field4: 00000A2E   relnode
field5: 01         forknum => FSM_FORKNUM
field6: 00000002   

Forking off #10027.
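For reference, here is a small self-contained sketch of decoding that 36-hex-digit key into the fields listed above. The slicing below is plain hex parsing for illustration, not the pageserver's own Key parser, and reading field6 as the block number is an assumption.

```rust
/// Decode the 36-hex-digit key layout shown above into its six fields.
/// Plain hex slicing for illustration; not the pageserver's own parser.
fn decode_key(hex: &str) -> Option<(u8, u32, u32, u32, u8, u32)> {
    if hex.len() != 36 || !hex.is_ascii() {
        return None;
    }
    let byte = |r: std::ops::Range<usize>| u8::from_str_radix(&hex[r], 16).ok();
    let word = |r: std::ops::Range<usize>| u32::from_str_radix(&hex[r], 16).ok();
    Some((
        byte(0..2)?,   // field1: key kind (00 => relation key)
        word(2..10)?,  // field2: spcnode (tablespace OID)
        word(10..18)?, // field3: dbnode (database OID)
        word(18..26)?, // field4: relnode (relation OID)
        byte(26..28)?, // field5: forknum (01 => FSM_FORKNUM)
        word(28..36)?, // field6: presumably the block number
    ))
}

fn main() {
    let key = "000000067F0000400000000A2E0100000002";
    // Prints Some((0, 1663, 16384, 2606, 1, 2)):
    // spcnode 1663, dbnode 16384, relnode 2606, FSM fork, block 2.
    println!("{:?}", decode_key(key));
}
```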

@erikgrinaker added this pull request to the merge queue on Dec 6, 2024
Merged via the queue into main with commit 7838659 on Dec 6, 2024
82 checks passed
@erikgrinaker deleted the erik/assert-batch-keys branch on December 6, 2024 at 10:25