@@ -88,15 +88,31 @@ IndexOnlyNext(IndexOnlyScanState *node)
8888 * Note on Memory Ordering Effects: visibilitymap_test does not lock
8989 * the visibility map buffer, and therefore the result we read here
9090 * could be slightly stale. However, it can't be stale enough to
91- * matter. It suffices to show that (1) there is a read barrier
92- * between the time we read the index TID and the time we test the
93- * visibility map; and (2) there is a write barrier between the time
94- * some other concurrent process clears the visibility map bit and the
95- * time it inserts the index TID. Since acquiring or releasing a
96- * LWLock interposes a full barrier, this is easy to show: (1) is
97- * satisfied by the release of the index buffer content lock after
98- * reading the TID; and (2) is satisfied by the acquisition of the
99- * buffer content lock in order to insert the TID.
91+ * matter.
92+ *
93+ * We need to detect clearing a VM bit due to an insert right away,
94+ * because the tuple is present in the index page but not visible. The
95+ * reading of the TID by this scan (using a shared lock on the index
96+ * buffer) is serialized with the insert of the TID into the index
97+ * (using an exclusive lock on the index buffer). Because the VM bit
98+ * is cleared before updating the index, and locking/unlocking of the
99+ * index page acts as a full memory barrier, we are sure to see the
100+ * cleared bit if we see a recently-inserted TID.
101+ *
102+ * Deletes do not update the index page (only VACUUM will clear out
103+ * the TID), so the clearing of the VM bit by a delete is not
104+ * serialized with this test below, and we may see a value that is
105+ * significantly stale. However, we don't care about the delete right
106+ * away, because the tuple is still visible until the deleting
107+ * transaction commits or the statement ends (if it's our
108+ * transaction). In either case, the lock on the VM buffer will have
109+ * been released (acting as a write barrier) after clearing the
110+ * bit. And for us to have a snapshot that includes the deleting
111+ * transaction (making the tuple invisible), we must have acquired
112+ * ProcArrayLock after that time, acting as a read barrier.
113+ *
114+ * It's worth going through this complexity to avoid needing to lock
115+ * the VM buffer, which could cause significant contention.
100116 */
101117 if (!visibilitymap_test (scandesc -> heapRelation ,
102118 ItemPointerGetBlockNumber (tid ),
0 commit comments