Emit nbtree vacuum cycle id in nbtree xlog through FPIs
NBTree needs a vacuum cycle ID on pages whose split produced a new right
page located before the original page, and on pages that were split from
such pages during the current vacuum cycle. By WAL-logging the cycle ID
and restoring it during recovery, we ensure that vacuum doesn't fail to
clean up the earlier pages.
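
The condition above can be sketched as a standalone predicate. This is only
an illustration that mirrors the check added to _bt_split in this commit; the
function name and its parameters are hypothetical and do not exist in the
patch, which tests the values inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: decides whether the original
 * (left) page of a split must be logged as a full-page image so that its
 * vacuum cycle ID survives recovery.
 *
 * new_cycleid  - cycle ID stamped on the left page by this split (0 = no vacuum)
 * old_cycleid  - cycle ID the original page carried before the split
 * origblkno    - block number of the original (left) page
 * rightblkno   - block number of the new right page
 */
static bool
split_needs_cycleid_fpi(uint16_t new_cycleid, uint16_t old_cycleid,
                        uint32_t origblkno, uint32_t rightblkno)
{
    /* A split always places the right half on a different block. */
    assert(origblkno != rightblkno);

    /* No btree vacuum is running, so there is no cycle ID to preserve. */
    if (new_cycleid == 0)
        return false;

    /*
     * The right page lies after the original page, and the original page
     * was not already split during this vacuum cycle.
     */
    if (rightblkno > origblkno && new_cycleid != old_cycleid)
        return false;

    /* Either a split to an earlier block, or part of a split chain. */
    return true;
}
```

The predicate returns true exactly where the patch registers block 0 with
REGBUF_FORCE_IMAGE.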

During recovery, we extract the cycle ID from the original page if this page
had an FPI, either directly (when the page was restored) or indirectly (from
the record data).
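
The indirect path for an uncompressed image relies on the btree special
space sitting at the very end of the page, so the cycle ID can be read at a
fixed distance from the end of the image. A minimal sketch, assuming a mock
struct with the same field layout as BTPageOpaqueData and 8-byte maximum
alignment; all Mock/MOCK names and the helper function are hypothetical
stand-ins, not PostgreSQL definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-in for BTPageOpaqueData (same field order and widths). */
typedef struct MockBTPageOpaqueData
{
    uint32_t btpo_prev;
    uint32_t btpo_next;
    uint32_t btpo_level;
    uint16_t btpo_flags;
    uint16_t btpo_cycleid;
} MockBTPageOpaqueData;

/* Assumed stand-in for MAXALIGN with 8-byte maximum alignment. */
#define MOCK_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/*
 * Hypothetical helper: read the cycle ID from the tail of an uncompressed
 * page image without rebuilding the full page.
 */
static uint16_t
cycleid_from_image_tail(const char *image, size_t image_len)
{
    /* Distance from the end of the image back to the start of btpo_cycleid. */
    const size_t cycleid_off = MOCK_MAXALIGN(sizeof(MockBTPageOpaqueData))
        - offsetof(MockBTPageOpaqueData, btpo_cycleid);
    uint16_t    cycleid;

    assert(image_len >= cycleid_off);

    /* The source may be unaligned, so copy instead of dereferencing. */
    memcpy(&cycleid, image + image_len - cycleid_off, sizeof(cycleid));
    return cycleid;
}
```

Measuring from the end of the image keeps the lookup valid when the free-space
hole in the middle of the page was removed from the logged image, because only
the image length changes, not the position of the tail.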

This fixes neondatabase/neon#9929
MMeent committed Dec 3, 2024
1 parent 972e325 commit b21751d
Showing 2 changed files with 97 additions and 1 deletion.
29 changes: 28 additions & 1 deletion src/backend/access/nbtree/nbtinsert.c
@@ -1489,6 +1489,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
bool newitemonleft,
isleaf,
isrightmost;
uint16 old_cycleid;

/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1554,6 +1555,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
/* handle btpo_next after rightpage buffer acquired */
lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/* NEON: store the page's former cycle ID for FPI check later */
old_cycleid = oopaque->btpo_cycleid;

/*
* Copy the original page's LSN into leftpage, which will become the
@@ -1976,7 +1979,31 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);

XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
/*
* NEON: If we split to earlier pages during a btree vacuum cycle,
* then we have to include the cycle ID in the WAL record. The
* easiest method to do that is to force an image, which happens to
* be relatively cheap, as the data already contained in the record is
* enough to populate the new right page.
*
* We MUST log an FPI when:
* - The right page's blkno < the left page's blkno
* - The right page might be 'C' in a page split chain B > C > A after
* B split B > A => B > C > A; or B > C > D > A, etc. (as indicated
* by the presence of a cycle ID).
*/
if (lopaque->btpo_cycleid == 0 || (rightpagenumber > origpagenumber &&
lopaque->btpo_cycleid != old_cycleid))
{
/* no cycle ID is required */
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
}
else
{
/* cycle ID is required */
XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
}

XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
/* Log original right sibling, since we've changed its prev-pointer */
if (!isrightmost)
69 changes: 69 additions & 0 deletions src/backend/access/nbtree/nbtxlog.c
@@ -434,6 +434,75 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
MarkBufferDirty(buf);
}

/*
* NEON: If the original page was supposed to be recovered from FPI,
* then we need to correct the cycle ID (see _bt_split for reasons)
*
* Note that we can't just use the buffer in WALRedo on Pageserver,
* as that may be InvalidBuffer when the original (left) page of the
* split wasn't requested.
*/
if (XLogRecGetBlock(record, 0)->has_image)
{
/*
* btree split FPIs may contain important cycle IDs on the original
* page's FPI; make sure we correctly transfer this over
*/

/*
* Because we don't want to decompress the page if it's not needed, or
* reconstruct a whole 8kB page when we're only interested in the 2
* bytes of the bkpimg, we recognise there are 3 different ways we can
* get the data, in order of efficiency (from most efficient to least
* efficient):
* - There is an original (left) page in the buffer
* - There is no original buffer, the logged FPI was not compressed
* - There is no original buffer, the logged FPI was compressed
*/
if (BufferIsValid(buf))
{
/*
* Neat, we can just use the buffer to copy the cycle ID
*/
BTPageOpaque oopaque = BTPageGetOpaque(BufferGetPage(buf));
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
else if (!BKPIMAGE_COMPRESSED(XLogRecGetBlock(record, 0)->bimg_info))
{
/*
* Good, we don't have to decompress the data, so we can use
* calculated offsets into bkpb->bkp_image
*/

/*
* offset of the start of cycleid relative to the end of the page,
* which is also relative to the end of the FPI
*/
const int cycleid_off = MAXALIGN(sizeof(BTPageOpaqueData))
- offsetof(BTPageOpaqueData, btpo_cycleid);
char *cycleid_ptr; /* may not be aligned */
DecodedBkpBlock *bkpb = XLogRecGetBlock(record, 0);

cycleid_ptr = &bkpb->bkp_image[bkpb->bimg_len - cycleid_off];

memcpy(&ropaque->btpo_cycleid, cycleid_ptr, sizeof(BTCycleId));
}
else
{
/*
* Bummer, we have to decompress the data.
*/
PGAlignedBlock tmp;
BTPageOpaque oopaque;

/* Expensive decompression of data */
RestoreBlockImage(record, 0, tmp.data);

oopaque = BTPageGetOpaque(tmp.data);
ropaque->btpo_cycleid = oopaque->btpo_cycleid;
}
}

/* Fix left-link of the page to the right of the new right sibling */
if (spagenumber != P_NONE)
{