mirror of https://github.com/postgres/postgres.git synced 2025-07-28 23:42:10 +03:00

snapshot scalability: Don't compute global horizons while building snapshots.

To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until their use, and to use horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed.
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.
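
The two thresholds split the XID space into three regions. A minimal sketch
of that classification follows. It is not the code from this commit: the
FullXid typedef, the XidVerdict enum, the VisThresholds struct and
classify_xid() are hypothetical names, and plain 64-bit integers are assumed
so that XID wraparound can be ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit transaction ID without wraparound. */
typedef uint64_t FullXid;

typedef enum
{
    XID_REMOVABLE,      /* xid < maybe_needed: row can definitely be removed */
    XID_UNCERTAIN,      /* between the thresholds: accurate horizons needed */
    XID_STILL_NEEDED    /* xid >= definitely_needed: row is still visible */
} XidVerdict;

typedef struct
{
    FullXid maybe_needed;
    FullXid definitely_needed;
} VisThresholds;

/* Classify the XID that deleted a row against the two thresholds. */
static XidVerdict
classify_xid(const VisThresholds *t, FullXid xid)
{
    /* The lower threshold must never pass the upper one. */
    assert(t->maybe_needed <= t->definitely_needed);

    if (xid < t->maybe_needed)
        return XID_REMOVABLE;
    if (xid >= t->definitely_needed)
        return XID_STILL_NEEDED;
    return XID_UNCERTAIN;
}
```

With maybe_needed = 100 and definitely_needed = 200, XID 99 is removable,
XID 200 is still needed, and anything from 100 through 199 cannot be decided
from the thresholds alone.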

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further tests can benefit from an earlier
computation of accurate horizons.
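
The lazy recomputation described above could look roughly like the sketch
below. Everything in it is an assumption made for illustration: VisState,
recompute_thresholds(), xid_is_removable() and the accurate_horizon global
(standing in for the expensive scan over all backends) are hypothetical, and
the rate limit is modelled as "recompute at most once per distinct snapshot
xmin", which may differ from what the commit actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FullXid;       /* hypothetical wraparound-free XID */

typedef struct
{
    FullXid maybe_needed;
    FullXid definitely_needed;
    FullXid last_recompute_xmin;    /* snapshot xmin at the last recompute */
    int     recompute_count;        /* only for demonstration */
} VisState;

/* Simulated result of the expensive, accurate horizon computation. */
static FullXid accurate_horizon = 0;

static void
recompute_thresholds(VisState *s, FullXid snapshot_xmin)
{
    /* Thresholds only move forward; they are never reset. */
    if (accurate_horizon > s->maybe_needed)
        s->maybe_needed = accurate_horizon;
    if (s->definitely_needed < s->maybe_needed)
        s->definitely_needed = s->maybe_needed;
    s->last_recompute_xmin = snapshot_xmin;
    s->recompute_count++;
}

static bool
xid_is_removable(VisState *s, FullXid xid, FullXid snapshot_xmin)
{
    if (xid < s->maybe_needed)
        return true;            /* cheap path: definitely removable */
    if (xid >= s->definitely_needed)
        return false;           /* cheap path: definitely still needed */

    /* Uncertain: recompute, but not twice for the same snapshot xmin. */
    if (snapshot_xmin != s->last_recompute_xmin)
        recompute_thresholds(s, snapshot_xmin);

    return xid < s->maybe_needed;
}
```

Because the raised maybe_needed is kept, a later test for a smaller XID takes
the cheap path without another recomputation.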

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) are now only done when necessary to remove
tuples.
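
The deferral can be pictured as "try the cheap test first, and pay for the
limited horizon only if that test fails". The sketch below illustrates that
ordering only. compute_limited_horizon(), page_may_be_pruned() and the
threshold_bonus parameter are hypothetical, and the real
TransactionIdLimitedForOldSnapshots() has a different signature and
time-based logic that is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FullXid;       /* hypothetical wraparound-free XID */

/* Counts how often the costly path runs; only for demonstration. */
static int limited_horizon_calls = 0;

/*
 * Hypothetical stand-in for the costly old_snapshot_threshold computation.
 * It is modelled as simply advancing the regular horizon by a fixed amount.
 */
static FullXid
compute_limited_horizon(FullXid regular_horizon, FullXid threshold_bonus)
{
    limited_horizon_calls++;
    return regular_horizon + threshold_bonus;
}

static bool
page_may_be_pruned(FullXid prune_xid, FullXid regular_horizon,
                   bool old_snapshot_threshold_active,
                   FullXid threshold_bonus)
{
    /* The regular horizon already allows pruning: nothing more to do. */
    if (prune_xid < regular_horizon)
        return true;

    /* Without old_snapshot_threshold there is no second chance. */
    if (!old_snapshot_threshold_active)
        return false;

    /* Only now compute the more aggressive, more expensive horizon. */
    return prune_xid < compute_limited_horizon(regular_horizon,
                                               threshold_bonus);
}
```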

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are still accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Andres Freund
2020-08-12 16:03:49 -07:00
parent 1f42d35a1d
commit dc7420c2c9
38 changed files with 1466 additions and 570 deletions

@ -281,7 +281,7 @@ present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into MyPgXact, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
-ProcArray. That would break GetOldestXmin, as discussed below.
+ProcArray. That would break ComputeXidHorizons, as discussed below.
We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
subxid array) without taking ProcArrayLock. This was once necessary to
@ -293,42 +293,50 @@ once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide. Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction). GetOldestXmin takes the MIN() of the
-valid xmin fields. It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns. We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins. The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate. If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls. (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeXidHorizons, which must determine a lower bound for the oldest xmin
+of any active MVCC snapshot, system-wide. Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction). ComputeXidHorizons takes
+the MIN() of the valid xmin fields. It does this with only shared lock on
+ProcArrayLock, which means there is a potential race condition against other
+backends doing GetSnapshotData concurrently: we must be certain that a
+concurrent backend that is about to set its xmin does not compute an xmin
+less than what ComputeXidHorizons determines. We ensure that by including
+all the active XIDs into the MIN() calculation, along with the valid xmins.
+The rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock. So
+ComputeXidHorizons's view of the minimum active XID will be the same as that
+of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+If there is no active transaction at all, ComputeXidHorizons uses
+latestCompletedXid + 1, which is a lower bound for the xmin that might
+be computed by concurrent or later GetSnapshotData calls. (We know that no
+XID less than this could be about to appear in the ProcArray, because of the
+XidGenLock interlock discussed above.)
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive. Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin. This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not. This is OK since RecentGlobalXmin need only be a valid
-lower bound. As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an accurate
+oldest-xmin calculation (it used to, until v13). The contents of a snapshot
+only depend on the xids of other backends, not their xmin. As backend's xmin
+changes much more often than its xid, having GetSnapshotData look at xmins
+can lead to a lot of unnecessary cacheline ping-pong. Instead
+GetSnapshotData updates approximate thresholds (one that guarantees that all
+deleted rows older than it can be removed, another determining that deleted
+rows newer than it can not be removed). GlobalVisTest* uses those threshold
+to make invisibility decision, falling back to ComputeXidHorizons if
+necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots, there is
+no such guarantee for the horizons computed by ComputeXidHorizons. This is
+because we allow XID-less transactions to clear their MyPgXact->xmin
+asynchronously (without taking ProcArrayLock), so one execution might see
+what had been the oldest xmin, and another not. This is OK since the
+thresholds need only be a valid lower bound. As noted above, we are already
+assuming that fetch/store of the xid fields is atomic, so assuming it for
+xmin as well is no extra risk.
pg_xact and pg_subtrans
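
The MIN() calculation that the README hunk above attributes to
ComputeXidHorizons can be sketched as follows. This is an illustration under
stated assumptions, not the PostgreSQL implementation: BackendSlot,
INVALID_XID and compute_oldest_horizon() are hypothetical names, XIDs are
plain 64-bit integers with no wraparound, and the shared ProcArrayLock that
the README relies on for correctness is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t FullXid;               /* hypothetical wraparound-free XID */
#define INVALID_XID ((FullXid) 0)       /* "no xid" / "no live snapshot" */

/* Hypothetical per-backend advertisement of xid and xmin. */
typedef struct
{
    FullXid xid;    /* running transaction's XID, or INVALID_XID */
    FullXid xmin;   /* smallest xmin of its snapshots, or INVALID_XID */
} BackendSlot;

/*
 * Lower bound for the oldest xmin of any snapshot. Start from
 * latest_completed_xid + 1, then take the minimum over every valid xmin
 * and every active xid, so that a backend about to set its xmin cannot
 * end up below the result.
 */
static FullXid
compute_oldest_horizon(const BackendSlot *slots, size_t nslots,
                       FullXid latest_completed_xid)
{
    FullXid horizon = latest_completed_xid + 1;

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].xid != INVALID_XID && slots[i].xid < horizon)
            horizon = slots[i].xid;
        if (slots[i].xmin != INVALID_XID && slots[i].xmin < horizon)
            horizon = slots[i].xmin;
    }
    return horizon;
}
```

With no active backends the function returns latest_completed_xid + 1, which
matches the "no active transaction at all" case in the text.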

@ -9096,7 +9096,7 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
-TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@ -9456,7 +9456,7 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
-TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
/* Real work is done, but log and update before releasing lock. */
LogCheckpointEnd(true);