pageserver: misrouted FSM key #10027

Closed
erikgrinaker opened this issue Dec 5, 2024 · 1 comment · Fixed by #10032
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

@erikgrinaker (Contributor) commented Dec 5, 2024

Seen in #9943 (comment). An assertion shows that an FSM key was misrouted to shard 2 during test_isolation:

ERROR wal_connection_manager{tenant_id=c1ae72753459caf54556dafe583b6c54 shard_id=0204 timeline_id=31886a8ef88a0618b1d4fb9a28900f92}:connection{node_id=1}:panic{thread=walreceiver worker location=pageserver/src/tenant/timeline.rs:5897:21}: key 000000067F0000400000000A2E0100000002 does not belong on shard 0204
Key: 000000067F0000400000000A2E0100000002

field1: 00         kind => relation key
field2: 0000067F   spcnode
field3: 00004000   dbnode
field4: 00000A2E   relnode
field5: 01         forknum => FSM_FORKNUM
field6: 00000002   
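
For reference, here is a minimal standalone sketch of how this 18-byte key decodes into the fields above. It is an illustration only, not the pageserver's own Key type, and labelling field6 as the block number is my reading of the relation-key layout rather than something stated in the log.

    // Hypothetical helper, for illustration only: unpack the 36-hex-digit key
    // into (kind, spcnode, dbnode, relnode, forknum, blkno).
    fn decode_key(hex: &str) -> Option<(u8, u32, u32, u32, u8, u32)> {
        let bytes = (0..hex.len())
            .step_by(2)
            .map(|i| u8::from_str_radix(hex.get(i..i + 2)?, 16).ok())
            .collect::<Option<Vec<u8>>>()?;
        if bytes.len() != 18 {
            return None;
        }
        let be32 = |b: &[u8]| u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
        Some((
            bytes[0],             // field1: kind, 0x00 => relation key
            be32(&bytes[1..5]),   // field2: spcnode
            be32(&bytes[5..9]),   // field3: dbnode
            be32(&bytes[9..13]),  // field4: relnode
            bytes[13],            // field5: forknum, 0x01 => FSM_FORKNUM
            be32(&bytes[14..18]), // field6: block number (assumed)
        ))
    }

    // decode_key("000000067F0000400000000A2E0100000002")
    //   => Some((0, 1663, 16384, 2606, 1, 2))
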
@erikgrinaker (Contributor, Author) commented Dec 5, 2024

It's the SMGR_TRUNCATE_FSM handling here:

        if blkno % pg_constants::SLOTS_PER_FSM_PAGE != 0 {
            // Tail of last remaining FSM page has to be zeroed.
            // We are not precise here and instead of digging in FSM bitmap format just clear the whole page.
            modification.put_rel_page_image_zero(rel, fsm_physical_page_no)?;
            fsm_physical_page_no += 1;
        }
        // TODO: re-examine the None case here wrt. sharding; should we error?
        let nblocks = get_relsize(modification, rel, ctx).await?.unwrap_or(0);
        if nblocks > fsm_physical_page_no {
            // check if something to do: FSM is larger than truncate position
            self.put_rel_truncation(modification, rel, fsm_physical_page_no, ctx)
                .await?;
        }
    }
    if flags & pg_constants::SMGR_TRUNCATE_VM != 0 {

I think we need the same shard-filtering logic here as in the corresponding visibility map code:

    if (trunc_byte != 0 || trunc_offs != 0)
        && self.shard.is_key_local(&rel_block_to_key(rel, vm_page_no))
    {
        modification.put_rel_wal_record(
            rel,
            vm_page_no,
            NeonWalRecord::TruncateVisibilityMap {
                trunc_byte,
                trunc_offs,
            },
        )?;
        vm_page_no += 1;
    }
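
For concreteness, a minimal sketch of what that filtering could look like in the FSM branch, reusing the is_key_local / rel_block_to_key helpers quoted above. This is only an illustration of the idea; the actual change landed via #10032.

        if blkno % pg_constants::SLOTS_PER_FSM_PAGE != 0 {
            // Tail of last remaining FSM page has to be zeroed, but only the
            // shard that owns that page should write the key.
            if self
                .shard
                .is_key_local(&rel_block_to_key(rel, fsm_physical_page_no))
            {
                modification.put_rel_page_image_zero(rel, fsm_physical_page_no)?;
            }
            // Every shard still advances past the tail page so the relsize
            // comparison below computes the same truncation point everywhere.
            fsm_physical_page_no += 1;
        }

Keeping the increment outside the guard means the subsequent put_rel_truncation check behaves identically on all shards.
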

github-merge-queue bot pushed a commit that referenced this issue Dec 6, 2024
## Problem

FSM pages are managed like regular relation pages and are owned by a single
shard. However, when truncating the FSM relation, the last FSM page was
zeroed out on all shards. This is unnecessary and potentially confusing.

The superfluous keys will be removed during compactions, as they do not
belong on these shards.

Resolves #10027.

## Summary of changes

Only zero out the truncated FSM page on the owning shard.