Commit 5764f611e used dclist_delete_from() to remove the proc from the
wait queue. However, since it doesn't clear dist_node's next/prev to
NULL, it could call RemoveFromWaitQueue() twice: when the process
detects a deadlock and then when cleaning up locks on aborting the
transaction. The waiting lock information is cleared in the first
call, so it led to a crash in the second call.
Backpatch to v16, where the change was introduced.
Bug: #18031
Reported-by: Justin Pryzby, Alexander Lakhin
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/ZKy4AdrLEfbqrxGJ%40telsasoft.com
Discussion: https://postgr.es/m/18031-ebe2d08cb405f6cc@postgresql.org
Backpatch-through: 16
This commit reverts the work done by commits 3ba59ccc89 and 72e78d831a.
Those commits were incorrect in asserting that we never acquire any other
heavy-weight lock after acquring page lock other than relation extension
lock. We can acquire a lock on catalogs while doing catalog look up after
acquring page lock.
This won't impact any existing feature but we need to think some other way
to achieve this before parallelizing other write operations or even
improving the parallelism in vacuum (like allowing multiple workers
for an index).
Reported-by: Jaime Casanova
Author: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAJKUy5jffnRKNvRHKQ0LynRb0RJC-o4P8Ku3x9vGAVLwDBWumQ@mail.gmail.com
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
The setting of the process title could be seen on profiles of very
fast-to-execute queries. In many locations where we call
set_ps_display() we pass along a string constant, the length of which is
known during compilation. Here we effectively rename set_ps_display() to
set_ps_display_with_len() and then add a static inline function named
set_ps_display() which calls strlen() on the given string. This allows
the compiler to optimize away the strlen() call when dealing with
call sites passing a string constant. We can then also use memcpy()
instead of strlcpy() to copy the string into the destination buffer.
That's significantly faster than strlcpy's byte-at-a-time way of
copying.
Here we also take measures to improve some code which was adjusting the
process title to add a " waiting" suffix to it. Call sites which require
this can now just call set_ps_display_suffix() to add or adjust the suffix
and call set_ps_display_remove_suffix() to remove it again.
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAApHDvocBvvk-0gWNA2Gohe+sv9fMcv+fK_G+siBKJrgDG4O7g@mail.gmail.com
In a similar effort to f01592f91, here we mostly rename shadowed local
variables to remove the warnings produced when compiling with
-Wshadow=compatible-local.
This fixes 63 warnings and leaves just 5.
Author: Justin Pryzby, David Rowley
Reviewed-by: Justin Pryzby
Discussion https://postgr.es/m/20220817145434.GC26426%40telsasoft.com
This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018.
There is no longer agreement that introducing this function
was the right way to address the problem. The consensus now
seems to favor trying to make a correct value for MaxBackends
available to mdules executing their _PG_init() functions.
Nathan Bossart
Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
Some of the code paths changed by aa64f23 can reduce the number of times
GetMaxBackends() is called. The performance gain is marginal, but most
of the code changed by this commit already did that. Hence, let's be
clean and apply the same rule everywhere, for consistency.
Some of the code paths, like in deadlock.c, involve only assertions.
These are left unchanged.
Reviewed-by: Nathan Bossart, Robert Haas
Discussion: https://postgr.es/m/YgMpGZhPOjNfS7er@paquier.xyz
Previously, it was really easy to write code that accessed MaxBackends
before we'd actually initialized it, especially when coding up an
extension. To make this less error-prune, introduce a new function
GetMaxBackends() which should be used to obtain the correct value.
This will ERROR if called too early. Demote the global variable to
a file-level static, so that nobody can peak at it directly.
Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane,
by Michael Paquier (who had doubts about the approach), and by me.
Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de
The purpose of commit 8a54e12a38 was to
fix this, and it sufficed when the PREPARE TRANSACTION completed before
the CIC looked for lock conflicts. Otherwise, things still broke. As
before, in a cluster having used CIC while having enabled prepared
transactions, queries that use the resulting index can silently fail to
find rows. It may be necessary to reindex to recover from past
occurrences; REINDEX CONCURRENTLY suffices. Fix this for future index
builds by making CIC wait for arbitrarily-recent prepared transactions
and for ordinary transactions that may yet PREPARE TRANSACTION. As part
of that, have PREPARE TRANSACTION transfer locks to its dummy PGPROC
before it calls ProcArrayClearTransaction(). Back-patch to 9.6 (all
supported versions).
Andrey Borodin, reviewed (in earlier versions) by Andres Freund.
Discussion: https://postgr.es/m/01824242-AA92-4FE9-9BA7-AEBAFFEA3D0C@yandex-team.ru
They are used in code that runs both during normal operation and during
WAL replay, and needs to behave differently during replay. Move them to
xlogutils.c, because that's where we have other helper functions used by
redo routines.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
We have an implementation restriction that PREPARE TRANSACTION can't
handle cases where both session-lifespan and transaction-lifespan locks
are held on the same lockable object. (That's because we'd otherwise
need to acquire a new PROCLOCK entry during post-prepare cleanup, which
is an operation that might fail. The situation can only arise with odd
usages of advisory locks, so removing the restriction is probably not
worth the amount of effort it would take.) AtPrepare_Locks attempted
to enforce this, but its logic was many bricks shy of a load, because
it only detected cases where the session and transaction locks had the
same lockmode. Locks of different modes on the same object would lead
to the rather unhelpful message "PANIC: we seem to have dropped a bit
somewhere".
To fix, build a transient hashtable with one entry per locktag,
not one per locktag + mode, and use that to detect conflicts.
Per bug #17122 from Alexander Pyhalov. This bug is ancient,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/17122-04f3c32098a62233@postgresql.org
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.
This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.
Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.
The first attempt of this patch (commit 3b733fcd04) caused the buildfarm
member "rorqual" (built with --disable-atomics --disable-spinlocks) to report
the failure of the regression test. It was reverted by commit 890d2182a2.
The cause of this failure was that the atomic variable for "waitstart"
in the dummy process entry created at the end of prepare transaction was
not initialized. This second attempt fixes that issue.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.
This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.
Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
In a cluster having used CREATE INDEX CONCURRENTLY while having enabled
prepared transactions, queries that use the resulting index can silently
fail to find rows. Fix this for future CREATE INDEX CONCURRENTLY by
making it wait for prepared transactions like it waits for ordinary
transactions. This expands the VirtualTransactionId structure domain to
admit prepared transactions. It may be necessary to reindex to recover
from past occurrences. Back-patch to 9.5 (all supported versions).
Andrey Borodin, reviewed (in earlier versions) by Tom Lane and Michael
Paquier.
Discussion: https://postgr.es/m/2E712143-97F7-4890-B470-4A35142ABC82@yandex-team.ru
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).
Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes. This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)
Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize. Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.
Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one. This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not. We might as well save a few
cycles by standardizing on "not".
Also improve the documentation for hash_create().
In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.
Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).
This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.
As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in l2 cache. In
read-mostly workloads scanning the xids[] array will sufficient to
build a snapshot, as most backends will not have an xid assigned.
To keep the xid array dense ProcArrayRemove() needs to move entries
behind the to-be-removed proc's one further up in the array. Obviously
moving array entries cannot happen while a backend sets it
xid. I.e. locking needs to prevent that array entries are moved while
a backend modifies its xid.
To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.
Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.
While the new dense array is very good when needing to look at all
xids it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend had acquired an
xid.
The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Choose names that fit into the conventions for wait event names
(particularly, that multi-word names are in the style MultiWordName)
and hopefully convey more information to non-hacker users than the
previous names did.
Also rename SerializablePredicateLockListLock to
SerializablePredicateListLock; the old name was long enough to cause
table formatting problems, plus the double occurrence of "Lock" seems
confusing/error-prone.
Also change a couple of particularly opaque LWLock field names.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record. The standby then checks whether its own settings are at least
as big as the ones on the primary. If not, the standby shuts down
with a fatal error.
The correspondence of settings between primary and standby is required
because those settings influence certain shared memory sizings that
are required for processing WAL records that the primary might send.
For example, if the primary sends a prepared transaction, the standby
must have had max_prepared_transaction set appropriately or it won't
be able to process those WAL records.
However, fatally shutting down the standby immediately upon receipt of
the parameter change record might be a bit of an overreaction. The
resources related to those settings are not required immediately at
that point, and might never be required if the activity on the primary
does not exhaust all those resources. If we just let the standby roll
on with recovery, it will eventually produce an appropriate error when
those resources are used.
So this patch relaxes this a bit. Upon receipt of
XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
warning and set a global flag if there is a problem. Then when we
actually hit the resource issue and the flag was set, we issue another
warning message with relevant information. At that point we pause
recovery, so a hot standby remains usable. We also repeat the last
warning message once a minute so it is harder to miss or ignore.
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Back-patch to 9.5 (all supported versions). This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As
always, update standby systems before master systems. This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions. (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
This is required as it is no safer for two related processes to perform
clean up in gin indexes at a time than for unrelated processes to do the
same. After acquiring page locks, we can acquire relation extension lock
but reverse never happens which means these will also not participate in
deadlock. So, avoid checking wait edges from this lock.
Currently, the parallel mode is strictly read-only, but after this patch
we have the infrastructure to allow parallel inserts and parallel copy.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
Assert that we don't acquire any other heavyweight lock while holding the
page lock except for relation extension. However, these locks are never
taken in reverse order which implies that page locks will never
participate in the deadlock cycle.
Similar to relation extension, page locks are also held for a short
duration, so imposing such a restriction won't hurt.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
The init_ps_display() arguments were mostly lies by now, so to match
typical usage, just use one argument and let the caller assemble it
from multiple sources if necessary. The only user of the additional
arguments is BackendInitialize(), which was already doing string
assembly on the caller side anyway.
Remove the second argument of set_ps_display() ("force") and just
handle that in init_ps_display() internally.
BackendInitialize() also used to set the initial status as
"authentication", but that was very far from where authentication
actually happened. So now it's set to "initializing" and then
"authentication" just before the actual call to
ClientAuthentication().
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
This is numbered take 7, and addresses a set of issues around:
- Fixes for typos and incorrect reference names.
- Removal of unneeded comments.
- Removal of unreferenced functions and structures.
- Fixes regarding variable name consistency.
Author: Alexander Lakhin
Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.
There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific. The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.
The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan. (The index validation index scan requires
patching each AM, which has not been included here.)
Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
When preparing a transaction in two-phase commit, a dummy PGPROC entry
holding the GID used for the transaction is registered, which gets
released once COMMIT PREPARED is run. Prior releasing its shared memory
state, all the locks taken in the prepared transaction are released
using a dedicated set of callbacks (pgstat and multixact having similar
callbacks), which may cause the locks to be released before the GID is
set free.
Hence, there is a small window where lock conflicts could happen, for
example:
- Transaction A releases its locks, still holding its GID in shared
memory.
- Transaction B held a lock which conflicted with locks of transaction
A.
- Transaction B continues its processing, reusing the same GID as
transaction A.
- Transaction B fails because of a conflicting GID, already in use by
transaction A.
This commit changes the shared memory state release so as post-commit
callbacks and predicate lock cleanup happen consistently with the shared
memory state cleanup for the dummy PGPROC entry. The race window is
small and 2PC had this issue from the start, so no backpatch is done.
On top if that fixes discussed involved ABI breakages, which are not
welcome in stable branches.
Reported-by: Oleksii Kliukin, Ildar Musin
Diagnosed-by: Oleksii Kliukin, Ildar Musin
Author: Michael Paquier
Reviewed-by: Masahiko Sawada, Oleksii Kliukin
Discussion: https://postgr.es/m/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6@hintbits.com
Opening a relation with no lock at all is unsafe; there's no guarantee
that we'll see a consistent state of the relevant catalog entries.
While use of MVCC scans to read the catalogs partially addresses that
complaint, it's still possible to switch to a new catalog snapshot
partway through loading the relcache entry. Moreover, whether or not
you trust the reasoning behind sometimes using less than
AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not
valid if concurrent users of the table don't hold a lock corresponding
to the operation they want to perform.
Hence, add some assertion-build-only checks that require any caller
of relation_open(x, NoLock) to hold at least AccessShareLock. This
isn't a full solution, since we can't verify that the lock level is
semantically appropriate for the action --- but it's definitely of
some use, because it's already caught two bugs.
We can also assert that callers of addRangeTableEntryForRelation()
hold at least the lock level specified for the new RTE.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
Commit 37c54863c removed the code in StandbyAcquireAccessExclusiveLock
that checked the return value of LockAcquireExtended. That created a
bug, because it's still passing reportMemoryError = false to
LockAcquireExtended, meaning that LOCKACQUIRE_NOT_AVAIL will be returned
if we're out of shared memory for the lock table.
In such a situation, the startup process would believe it had acquired an
exclusive lock even though it hadn't, with potentially dire consequences.
To fix, just drop the use of reportMemoryError = false, which allows us
to simplify the call into a plain LockAcquire(). It's unclear that the
locktable-full situation arises often enough that it's worth having a
better recovery method than crash-and-restart. (I strongly suspect that
the only reason the code path existed at all was that it was relatively
simple to do in the pre-37c54863c implementation. But now it's not.)
LockAcquireExtended's reportMemoryError parameter is now dead code and
could be removed. I refrained from doing so, however, because there
was some interest in resurrecting the behavior if we do get reports of
locktable-full failures in the field. Also, it seems unwise to remove
the parameter concurrently with shipping commit f868a8143, which added a
parameter; if there are any third-party callers of LockAcquireExtended,
we want them to get a wrong-number-of-parameters compile error rather
than a possibly-silent misinterpretation of its last parameter.
Back-patch to 9.6 where the bug was introduced.
Discussion: https://postgr.es/m/6202.1536359835@sss.pgh.pa.us
LockRelationOid and sibling routines supposed that, if our session already
holds the lock they were asked to acquire, they could skip calling
AcceptInvalidationMessages on the grounds that we must have already read
any remote sinval messages issued against the relation being locked.
This is normally true, but there's a critical special case where it's not:
processing inside AcceptInvalidationMessages might attempt to access system
relations, resulting in a recursive call to acquire a relation lock.
Hence, if the outer call had acquired that same system catalog lock, we'd
fall through, despite the possibility that there's an as-yet-unread sinval
message for that system catalog. This could, for example, result in
failure to access a system catalog or index that had just been processed
by VACUUM FULL. This is the explanation for buildfarm failures we've been
seeing intermittently for the past three months. The bug is far older
than that, but commits a54e1f158 et al added a new recursion case within
AcceptInvalidationMessages that is apparently easier to hit than any
previous case.
To fix this, we must not skip calling AcceptInvalidationMessages until
we have *finished* a call to it since acquiring a relation lock, not
merely acquired the lock. (There's already adequate logic inside
AcceptInvalidationMessages to deal with being called recursively.)
Fortunately, we can implement that at trivial cost, by adding a flag
to LOCALLOCK hashtable entries that tracks whether we know we have
completed such a call.
There is an API hazard added by this patch for external callers of
LockAcquire: if anything is testing for LOCKACQUIRE_ALREADY_HELD,
it might be fooled by the new return code LOCKACQUIRE_ALREADY_CLEAR
into thinking the lock wasn't already held. This should be a fail-soft
condition, though, unless something very bizarre is being done in
response to the test.
Also, I added an additional output argument to LockAcquireExtended,
assuming that that probably isn't called by any outside code given
the very limited usefulness of its additional functionality.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/12259.1532117714@sss.pgh.pa.us