Emit nbtree vacuum cycle id in nbtree xlog through FPIs
nbtree needs a vacuum cycle ID on every page whose split produced a new
right page located before the original page, and on every page that was
split off from such a page during the current vacuum cycle. By WAL-logging
the cycle ID and restoring it during recovery, we ensure that vacuum doesn't
fail to clean up the earlier pages.
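
The condition described above can be sketched as a standalone predicate. This is a minimal illustration, not the patch itself: the function name, the parameter names, and the two typedefs are assumptions standing in for PostgreSQL's BlockNumber and BTCycleId, and it assumes a cycle ID is nonzero only while a vacuum is scanning the index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's BlockNumber and BTCycleId. */
typedef uint32_t BlockNumber;
typedef uint16_t BTCycleId;

/*
 * Returns true when the split record should carry the original (left)
 * page's cycle ID, i.e. when a full-page image would be forced.
 *
 * cur_cycleid:  cycle ID stamped on the page by this split (0 = no vacuum)
 * prev_cycleid: cycle ID the page carried before this split
 */
static bool
split_needs_cycleid(BTCycleId cur_cycleid, BTCycleId prev_cycleid,
                    BlockNumber origblkno, BlockNumber rightblkno)
{
    /* A page never splits into itself. */
    assert(origblkno != rightblkno);

    /* No vacuum is running, so there is no cycle ID to preserve. */
    if (cur_cycleid == 0)
        return false;

    /*
     * Either the new right page sits before the original page, or the
     * original page was already split earlier in this same vacuum cycle.
     */
    return origblkno > rightblkno || cur_cycleid == prev_cycleid;
}
```

For example, a split of block 10 into new right block 5 during a vacuum cycle needs the cycle ID, while the same split with no vacuum running does not.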

During recovery, we extract the cycle ID from the original page if this page
had an FPI, either directly (when the page was restored) or indirectly (from
the record data).
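
The "indirect" path can be illustrated with a small sketch that reads a 2-byte field from the tail of an uncompressed page image without rebuilding the page. This is an assumption-laden illustration: DemoOpaque, DEMO_MAXALIGN, and cycleid_from_image_tail are hypothetical names, the struct only approximates the layout of the nbtree special area, and the sketch assumes that area is the last thing in the image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for MAXALIGN, assuming 8-byte maximum alignment. */
#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical stand-in for the nbtree special area at the end of a page. */
typedef struct DemoOpaque
{
    uint32_t prev;      /* left sibling */
    uint32_t next;      /* right sibling */
    uint32_t level;     /* tree level */
    uint16_t flags;     /* flag bits */
    uint16_t cycleid;   /* vacuum cycle ID */
} DemoOpaque;

/*
 * Read the cycle ID from a page image whose final bytes are the special
 * area. The source pointer may be unaligned, so copy with memcpy instead
 * of dereferencing it as a uint16_t.
 */
static uint16_t
cycleid_from_image_tail(const char *image, size_t image_len)
{
    /* Distance from the end of the image back to the start of cycleid. */
    const size_t tail_off = DEMO_MAXALIGN(sizeof(DemoOpaque))
        - offsetof(DemoOpaque, cycleid);
    uint16_t cycleid;

    assert(image_len >= tail_off);
    memcpy(&cycleid, image + image_len - tail_off, sizeof(cycleid));
    return cycleid;
}
```

Measuring from the end of the image keeps the offset valid even when bytes have been removed from the middle of the logged page, as long as the tail is stored intact.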

This fixes neondatabase/neon#9929
MMeent committed Dec 6, 2024
1 parent 373f9de commit d3a2380
Showing 2 changed files with 90 additions and 1 deletion.
26 changes: 25 additions & 1 deletion src/backend/access/nbtree/nbtinsert.c
@@ -1491,6 +1491,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
bool newitemonleft,
isleaf,
isrightmost;
uint16 origcycleid;

/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1511,6 +1512,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
isrightmost = P_RIGHTMOST(oopaque);
maxoff = PageGetMaxOffsetNumber(origpage);
origpagenumber = BufferGetBlockNumber(buf);
/* NEON: store the page's former cycle ID for FPI check later */
origcycleid = oopaque->btpo_cycleid;

/*
* Choose a point to split origpage at.
@@ -1966,6 +1969,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xl_btree_split xlrec;
uint8 xlinfo;
XLogRecPtr recptr;
uint8 bufflags = REGBUF_STANDARD;

xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
@@ -1978,7 +1982,27 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);

XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
/*
* NEON: If we split to earlier pages during a btree vacuum cycle,
* then we have to include the cycle ID in the WAL record. The
* easiest method to do that is to force an image, which happens to
* be relatively cheap, as the data already contained in the record is
* enough to populate the new right page.
*
* We MUST log an FPI when the page is split during a vacuum cycle, and:
* - The right page's blkno < the left page's blkno, or
* - The right page might be 'C' in a page split chain B > C > A after
* B split B > A => B > C > A; or B > C > D > A, etc. (as indicated
* by the presence of a cycle ID).
*/
if (oopaque->btpo_cycleid != 0 &&
(origpagenumber > rightpagenumber || oopaque->btpo_cycleid == origcycleid))
{
/* cycle ID is required */
bufflags |= REGBUF_FORCE_IMAGE;
}

XLogRegisterBuffer(0, buf, bufflags);
XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
/* Log original right sibling, since we've changed its prev-pointer */
if (!isrightmost)
65 changes: 65 additions & 0 deletions src/backend/access/nbtree/nbtxlog.c
@@ -434,6 +434,71 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
MarkBufferDirty(buf);
}

/*
* NEON: If the original page was supposed to be recovered from FPI,
* then we need to correct the cycle ID (see _bt_split for reasons)
*
* Note that we can't just use the buffer in WALRedo on Pageserver,
* as that may be InvalidBuffer when the original (left) page of the
* split wasn't requested.
*/
if (record->blocks[0].has_image)
{
/*
* Because we don't want to decompress the page if it's not needed, or
* reconstruct a whole 8kB page when we're only interested in the 2
* bytes of the bkpimg, we recognise there are 3 different ways we can
* get the data, in order of efficiency (from most efficient to least
* efficient):
* - There is an original (left) page in the buffer
* - There is no original buffer, and the logged FPI was not compressed
* - There is no original buffer, and the logged FPI was compressed
*/
if (BufferIsValid(buf))
{
/*
* Neat, we can just use the buffer to copy the cycle ID
*/
Page opage = BufferGetPage(buf);
BTPageOpaque oopaque = (BTPageOpaque) PageGetSpecialPointer(opage);
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
else if (!(record->blocks[0].bimg_info & BKPIMAGE_IS_COMPRESSED))
{
/*
* Good, we don't have to decompress the data, so we can use
* calculated offsets into bkpb->bkp_image
*/

/*
* offset of the start of cycleid relative to the end of the page,
* which is also relative to the end of the FPI
*/
const int cycleid_off = MAXALIGN(sizeof(BTPageOpaqueData))
- offsetof(BTPageOpaqueData, btpo_cycleid);
char *cycleid_ptr; /* may not be aligned */
DecodedBkpBlock *bkpb = &record->blocks[0];

cycleid_ptr = &bkpb->bkp_image[bkpb->bimg_len - cycleid_off];

memcpy(&ropaque->btpo_cycleid, cycleid_ptr, sizeof(BTCycleId));
}
else
{
/*
* Bummer, we have to decompress the data.
*/
PGAlignedBlock tmp;
BTPageOpaque oopaque;

/* Expensive decompression of data */
RestoreBlockImage(record, 0, tmp.data);

oopaque = (BTPageOpaque) PageGetSpecialPointer(tmp.data);
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
}

/* Fix left-link of the page to the right of the new right sibling */
if (spagenumber != P_NONE)
{