@@ -490,24 +490,33 @@ lock on the leaf page).
 Once an index tuple has been marked LP_DEAD it can actually be deleted
 from the index immediately; since index scans only stop "between" pages,
 no scan can lose its place from such a deletion. We separate the steps
-because we allow LP_DEAD to be set with only a share lock (it's exactly
-like a hint bit for a heap tuple), but physically removing tuples requires
-exclusive lock. Also, delaying the deletion often allows us to pick up
-extra index tuples that weren't initially safe for index scans to mark
-LP_DEAD. We do this with index tuples whose TIDs point to the same table
-blocks as an LP_DEAD-marked tuple. They're practically free to check in
-passing, and have a pretty good chance of being safe to delete due to
-various locality effects.
-
-We only try to delete LP_DEAD tuples (and nearby tuples) when we are
-otherwise faced with having to split a page to do an insertion (and hence
-have exclusive lock on it already). Deduplication and bottom-up index
-deletion can also prevent a page split, but simple deletion is always our
-preferred approach. (Note that posting list tuples can only have their
-LP_DEAD bit set when every table TID within the posting list is known
-dead. This isn't much of a problem in practice because LP_DEAD bits are
-just a starting point for simple deletion -- we still manage to perform
-granular deletes of posting list TIDs quite often.)
+because we allow LP_DEAD to be set with only a share lock (it's like a
+hint bit for a heap tuple), but physically deleting tuples requires an
+exclusive lock. We also need to generate a latestRemovedXid value for
+each deletion operation's WAL record, which requires additional
+coordination with the tableam when the deletion actually takes place.
+(This latestRemovedXid value may be used to generate a recovery conflict
+during subsequent REDO of the record by a standby.)
+
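+To illustrate, here is a minimal standalone sketch of how a
+latestRemovedXid might be computed (hypothetical types and names, not
+the actual tableam interface; real XID comparisons are wraparound-aware,
+which is glossed over here):
+
+    #include <stdint.h>
+
+    typedef uint32_t TransactionId;
+
+    /* What the tableam reports for each pointed-to dead table row */
+    typedef struct DeadRowInfo
+    {
+        TransactionId removing_xid; /* XID that removed the table row */
+    } DeadRowInfo;
+
+    /*
+     * Take the most advanced removing-XID among all table rows that
+     * the to-be-deleted index tuples point to.  The result is stored
+     * in the deletion operation's WAL record, where it can drive a
+     * recovery conflict on a standby during REDO.
+     */
+    static TransactionId
+    compute_latest_removed_xid(const DeadRowInfo *rows, int nrows)
+    {
+        TransactionId latest = 0;   /* stand-in for InvalidTransactionId */
+
+        for (int i = 0; i < nrows; i++)
+        {
+            if (rows[i].removing_xid > latest)
+                latest = rows[i].removing_xid;
+        }
+        return latest;
+    }
+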
+Delaying and batching index tuple deletion like this enables a further
+optimization: opportunistic checking of "extra" nearby index tuples
+(tuples that are not LP_DEAD-set) when they happen to be very cheap to
+check in passing (because we already know that the tableam will be
+visiting their table block to generate a latestRemovedXid value). Any
+index tuples that turn out to be safe to delete will also be deleted.
+Simple deletion will behave as if the extra tuples that actually turn
+out to be delete-safe had their LP_DEAD bits set right from the start.
+
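+A toy sketch of that candidate-gathering step (hypothetical names and
+data structures, not nbtree's actual ones; assumes table block numbers
+are small enough to index a local array):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define MAX_BLOCKS 64
+
+    typedef struct IndexItem
+    {
+        uint32_t table_block;   /* table block this tuple's TID points to */
+        bool     lp_dead;       /* LP_DEAD bit already set? */
+    } IndexItem;
+
+    /* Fill candidates[] with offsets of tuples worth asking the
+     * tableam about; returns how many were chosen. */
+    static int
+    gather_candidates(const IndexItem *items, int nitems, int *candidates)
+    {
+        bool visited[MAX_BLOCKS] = {false};
+        int  ncand = 0;
+
+        /* Pass 1: table blocks with at least one LP_DEAD-set tuple
+         * will be visited by the tableam anyway */
+        for (int i = 0; i < nitems; i++)
+            if (items[i].lp_dead)
+                visited[items[i].table_block] = true;
+
+        /* Pass 2: take LP_DEAD tuples, plus "extra" tuples whose TIDs
+         * point to those same table blocks (practically free to check) */
+        for (int i = 0; i < nitems; i++)
+            if (items[i].lp_dead || visited[items[i].table_block])
+                candidates[ncand++] = i;
+
+        return ncand;
+    }
+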
+Deduplication can also prevent a page split, but index tuple deletion is
+our preferred approach. Note that posting list tuples can only have
+their LP_DEAD bit set when every table TID within the posting list is
+known dead. This isn't much of a problem in practice because LP_DEAD
+bits are just a starting point for deletion. What really matters is
+that _some_ deletion operation that targets related nearby-in-table TIDs
+takes place at some point before the page finally splits. That's all
+that's required for the deletion process to perform granular removal of
+groups of dead TIDs from posting list tuples (without the situation ever
+being allowed to get out of hand).
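+
+A toy sketch of that granular removal (hypothetical posting list
+representation, not the actual on-disk format): given which of a
+posting list's TIDs the tableam found to be dead, either the whole
+tuple is deleted or it is rewritten with only the live TIDs remaining:
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    typedef struct PostingList
+    {
+        uint64_t tids[32];  /* table TIDs, kept in sorted order */
+        int      ntids;
+    } PostingList;
+
+    /* Returns true when every TID is dead, meaning the entire posting
+     * list tuple can simply be deleted */
+    static bool
+    prune_posting_list(PostingList *plist, const bool *dead)
+    {
+        int nlive = 0;
+
+        for (int i = 0; i < plist->ntids; i++)
+        {
+            if (!dead[i])
+                plist->tids[nlive++] = plist->tids[i];  /* keep live TID */
+        }
+        if (nlive == 0)
+            return true;        /* delete whole tuple */
+
+        plist->ntids = nlive;   /* shrink in place */
+        return false;
+    }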
 
 It's sufficient to have an exclusive lock on the index page, not a
 super-exclusive lock, to do deletion of LP_DEAD items. It might seem