Emit nbtree vacuum cycle id in nbtree xlog through FPIs
NBTree needs a vacuum cycle ID on pages whose split produced a new right
page located before the original page, and on pages that were split from
such pages in the current vacuum cycle. By WAL-logging the cycle ID and
restoring it during recovery, we ensure that vacuum doesn't fail to clean up
the earlier pages.
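
As a rough illustration of the rule above, the decision to force a full-page image can be written as a small predicate. This is a sketch only: `split_needs_cycleid_image` and its parameter names are hypothetical and do not appear in the patch, which performs the equivalent check inline in `_bt_split`.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch) mirroring the inline check
 * in _bt_split.
 *
 * new_cycleid:  cycle ID stamped on the original page by this split
 *               (0 means no btree vacuum is in progress)
 * orig_cycleid: cycle ID the original page carried before the split
 * orig_blkno / right_blkno: block numbers of the original and new right page
 */
static bool
split_needs_cycleid_image(uint16_t new_cycleid, uint16_t orig_cycleid,
                          uint32_t orig_blkno, uint32_t right_blkno)
{
    if (new_cycleid == 0)
        return false;           /* no vacuum cycle: nothing to preserve */

    /*
     * Either the new right page sits before the original page, or the
     * original page was already split earlier in this same vacuum cycle.
     */
    return orig_blkno > right_blkno || new_cycleid == orig_cycleid;
}
```

When the predicate holds, the patch adds `REGBUF_FORCE_IMAGE` to the flags of block 0, so the cycle ID travels inside the page image instead of requiring a new WAL record field.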

During recovery, we extract the cycle ID from the original page if this page
had an FPI, either directly (when the page was restored) or indirectly (from
the record data).
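
The "indirect" path can be sketched as follows, assuming the btree opaque area is the MAXALIGNed tail of the page and therefore also the tail of an uncompressed image, even when the page's hole was removed. The struct below is a stand-in that mirrors the field layout of `BTPageOpaqueData`; `SketchBTPageOpaque`, `SKETCH_MAXALIGN` and `read_cycleid_from_image` are hypothetical names, and 8-byte maximum alignment is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Stand-in mirroring the field layout of BTPageOpaqueData */
typedef struct SketchBTPageOpaque
{
    uint32_t btpo_prev;
    uint32_t btpo_next;
    uint32_t btpo_level;
    uint16_t btpo_flags;
    uint16_t btpo_cycleid;
} SketchBTPageOpaque;

/*
 * Read the cycle ID from the tail of an uncompressed page image.
 * The source bytes may be unaligned, hence memcpy.
 */
static uint16_t
read_cycleid_from_image(const char *image, size_t image_len)
{
    /* distance from the end of the page back to the start of btpo_cycleid */
    const size_t cycleid_off = SKETCH_MAXALIGN(sizeof(SketchBTPageOpaque))
        - offsetof(SketchBTPageOpaque, btpo_cycleid);
    uint16_t    cycleid;

    assert(image_len >= cycleid_off);
    memcpy(&cycleid, image + image_len - cycleid_off, sizeof(cycleid));
    return cycleid;
}
```

Because the offset is measured from the end, the same call works on a full 8kB image and on a shorter image whose hole was removed, as long as the image is not compressed.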

This fixes neondatabase/neon#9929
MMeent committed Dec 6, 2024
1 parent dff6615 commit 6dfba66
Showing 2 changed files with 94 additions and 1 deletion.
26 changes: 25 additions & 1 deletion src/backend/access/nbtree/nbtinsert.c
@@ -1494,6 +1494,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
bool newitemonleft,
isleaf,
isrightmost;
uint16 origcycleid;

/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1514,6 +1515,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
isrightmost = P_RIGHTMOST(oopaque);
maxoff = PageGetMaxOffsetNumber(origpage);
origpagenumber = BufferGetBlockNumber(buf);
/* NEON: store the page's former cycle ID for FPI check later */
origcycleid = oopaque->btpo_cycleid;

/*
* Choose a point to split origpage at.
@@ -1969,6 +1972,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
xl_btree_split xlrec;
uint8 xlinfo;
XLogRecPtr recptr;
uint8 bufflags = REGBUF_STANDARD;

xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
@@ -1981,7 +1985,27 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);

XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
/*
* NEON: If we split to earlier pages during a btree vacuum cycle,
* then we have to include the cycle ID in the WAL record. The
* easiest method to do that is to force an image, which happens to
* be relatively cheap, as the data already contained in the record is
* enough to populate the new right page.
*
* We MUST log an FPI when the page is split during a vacuum cycle, and:
* - The right page's blkno < the left page's blkno, or
* - The right page might be 'C' in a page split chain B > C > A after
* B split B > A => B > C > A; or B > C > D > A, etc. (as indicated
* by the presence of a cycle ID).
*/
if (oopaque->btpo_cycleid != 0 &&
(origpagenumber > rightpagenumber || oopaque->btpo_cycleid == origcycleid))
{
/* cycle ID is required */
bufflags |= REGBUF_FORCE_IMAGE;
}

XLogRegisterBuffer(0, buf, bufflags);
XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
/* Log original right sibling, since we've changed its prev-pointer */
if (!isrightmost)
69 changes: 69 additions & 0 deletions src/backend/access/nbtree/nbtxlog.c
@@ -434,6 +434,75 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
MarkBufferDirty(buf);
}

/*
* NEON: If the original page was supposed to be recovered from FPI,
* then we need to correct the cycle ID (see _bt_split for reasons)
*
* Note that we can't just use the buffer in WALRedo on Pageserver,
* as that may be InvalidBuffer when the original (left) page of the
* split wasn't requested.
*/
if (XLogRecGetBlock(record, 0)->has_image)
{
/*
* btree split FPIs may contain important cycle IDs on the original
* page's FPI; make sure we correctly transfer this over
*/

/*
* Because we don't want to decompress the page if it's not needed, or
* reconstruct a whole 8kB page when we're only interested in the 2
* bytes of the bkpimg, we recognise there are 3 different ways we can
* get the data, in order of efficiency (from most efficient to least
* efficient):
* - There is an original (left) page in the buffer
* - There is no original buffer, and the logged FPI was not compressed
* - There is no original buffer, and the logged FPI was compressed
*/
if (BufferIsValid(buf))
{
/*
* Neat, we can just use the buffer to copy the cycle ID
*/
BTPageOpaque oopaque = BTPageGetOpaque(BufferGetPage(buf));
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
else if (!BKPIMAGE_COMPRESSED(XLogRecGetBlock(record, 0)->bimg_info))
{
/*
* Good, we don't have to decompress the data, so we can use
* calculated offsets into bkpb->bkp_image
*/

/*
* offset of the start of cycleid relative to the end of the page,
* which is also relative to the end of the FPI
*/
const int cycleid_off = MAXALIGN(sizeof(BTPageOpaqueData))
- offsetof(BTPageOpaqueData, btpo_cycleid);
char *cycleid_ptr; /* may not be aligned */
DecodedBkpBlock *bkpb = XLogRecGetBlock(record, 0);

cycleid_ptr = &bkpb->bkp_image[bkpb->bimg_len - cycleid_off];

memcpy(&ropaque->btpo_cycleid, cycleid_ptr, sizeof(BTCycleId));
}
else
{
/*
* Bummer, we have to decompress the data.
*/
PGAlignedBlock tmp;
BTPageOpaque oopaque;

/* Expensive decompression of data */
RestoreBlockImage(record, 0, tmp.data);

oopaque = BTPageGetOpaque(tmp.data);
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
}

/* Fix left-link of the page to the right of the new right sibling */
if (spagenumber != P_NONE)
{