@@ -490,24 +490,33 @@ lock on the leaf page).
 Once an index tuple has been marked LP_DEAD it can actually be deleted
 from the index immediately; since index scans only stop "between" pages,
 no scan can lose its place from such a deletion. We separate the steps
-because we allow LP_DEAD to be set with only a share lock (it's exactly
-like a hint bit for a heap tuple), but physically removing tuples requires
-exclusive lock. Also, delaying the deletion often allows us to pick up
-extra index tuples that weren't initially safe for index scans to mark
-LP_DEAD. We do this with index tuples whose TIDs point to the same table
-blocks as an LP_DEAD-marked tuple. They're practically free to check in
-passing, and have a pretty good chance of being safe to delete due to
-various locality effects.
-
-We only try to delete LP_DEAD tuples (and nearby tuples) when we are
-otherwise faced with having to split a page to do an insertion (and hence
-have exclusive lock on it already). Deduplication and bottom-up index
-deletion can also prevent a page split, but simple deletion is always our
-preferred approach. (Note that posting list tuples can only have their
-LP_DEAD bit set when every table TID within the posting list is known
-dead. This isn't much of a problem in practice because LP_DEAD bits are
-just a starting point for simple deletion -- we still manage to perform
-granular deletes of posting list TIDs quite often.)
+because we allow LP_DEAD to be set with only a share lock (it's like a
+hint bit for a heap tuple), but physically deleting tuples requires an
+exclusive lock. We also need to generate a latestRemovedXid value for
+each deletion operation's WAL record, which requires additional
+coordination with the tableam when the deletion actually takes place.
+(This latestRemovedXid value may be used to generate a recovery conflict
+during subsequent REDO of the record by a standby.)
+
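+To illustrate, here is a minimal standalone sketch of how a
+latestRemovedXid might be computed (hypothetical types and names, not
+the actual tableam interface; real XID comparisons are wraparound-aware,
+which is glossed over here):
+
+    #include <stdint.h>
+
+    typedef uint32_t TransactionId;
+
+    /* What the tableam reports for each pointed-to dead table row */
+    typedef struct DeadRowInfo
+    {
+        TransactionId removing_xid; /* XID that removed the table row */
+    } DeadRowInfo;
+
+    /*
+     * Take the most advanced removing-XID among all table rows that
+     * the to-be-deleted index tuples point to.  The result is stored
+     * in the deletion operation's WAL record, where it can drive a
+     * recovery conflict on a standby during REDO.
+     */
+    static TransactionId
+    compute_latest_removed_xid(const DeadRowInfo *rows, int nrows)
+    {
+        TransactionId latest = 0;   /* stand-in for InvalidTransactionId */
+
+        for (int i = 0; i < nrows; i++)
+        {
+            if (rows[i].removing_xid > latest)
+                latest = rows[i].removing_xid;
+        }
+        return latest;
+    }
+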
+Delaying and batching index tuple deletion like this enables a further
+optimization: opportunistic checking of "extra" nearby index tuples
+(tuples that are not LP_DEAD-set) when they happen to be very cheap to
+check in passing (because we already know that the tableam will be
+visiting their table block to generate a latestRemovedXid value). Any
+index tuples that turn out to be safe to delete will also be deleted.
+Simple deletion will behave as if the extra tuples that actually turn
+out to be delete-safe had their LP_DEAD bits set right from the start.
+
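+A toy sketch of that candidate-gathering step (hypothetical names and
+data structures, not nbtree's actual ones; assumes table block numbers
+are small enough to index a local array):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define MAX_BLOCKS 64
+
+    typedef struct IndexItem
+    {
+        uint32_t table_block;   /* table block this tuple's TID points to */
+        bool     lp_dead;       /* LP_DEAD bit already set? */
+    } IndexItem;
+
+    /* Fill candidates[] with offsets of tuples worth asking the
+     * tableam about; returns how many were chosen. */
+    static int
+    gather_candidates(const IndexItem *items, int nitems, int *candidates)
+    {
+        bool visited[MAX_BLOCKS] = {false};
+        int  ncand = 0;
+
+        /* Pass 1: table blocks with at least one LP_DEAD-set tuple
+         * will be visited by the tableam anyway */
+        for (int i = 0; i < nitems; i++)
+            if (items[i].lp_dead)
+                visited[items[i].table_block] = true;
+
+        /* Pass 2: take LP_DEAD tuples, plus "extra" tuples whose TIDs
+         * point to those same table blocks (practically free to check) */
+        for (int i = 0; i < nitems; i++)
+            if (items[i].lp_dead || visited[items[i].table_block])
+                candidates[ncand++] = i;
+
+        return ncand;
+    }
+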
+Deduplication can also prevent a page split, but index tuple deletion is
+our preferred approach. Note that posting list tuples can only have
+their LP_DEAD bit set when every table TID within the posting list is
+known dead. This isn't much of a problem in practice because LP_DEAD
+bits are just a starting point for deletion. What really matters is
+that _some_ deletion operation that targets related nearby-in-table TIDs
+takes place at some point before the page finally splits. That's all
+that's required for the deletion process to perform granular removal of
+groups of dead TIDs from posting list tuples (without the situation ever
+being allowed to get out of hand).
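+
+A toy sketch of that granular removal (hypothetical posting list
+representation, not the actual on-disk format): given which of a
+posting list's TIDs the tableam found to be dead, either the whole
+tuple is deleted or it is rewritten with only the live TIDs remaining:
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    typedef struct PostingList
+    {
+        uint64_t tids[32];  /* table TIDs, kept in sorted order */
+        int      ntids;
+    } PostingList;
+
+    /* Returns true when every TID is dead, meaning the entire posting
+     * list tuple can simply be deleted */
+    static bool
+    prune_posting_list(PostingList *plist, const bool *dead)
+    {
+        int nlive = 0;
+
+        for (int i = 0; i < plist->ntids; i++)
+        {
+            if (!dead[i])
+                plist->tids[nlive++] = plist->tids[i];  /* keep live TID */
+        }
+        if (nlive == 0)
+            return true;        /* delete whole tuple */
+
+        plist->ntids = nlive;   /* shrink in place */
+        return false;
+    }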
 
 It's sufficient to have an exclusive lock on the index page, not a
 super-exclusive lock, to do deletion of LP_DEAD items. It might seem