1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-03 09:13:20 +03:00
Commit Graph

2646 Commits

Author SHA1 Message Date
Alexander Korotkov
2c91e13013 Move WaitLSNShmemInit() to CreateOrAttachShmemStructs()
Thanks to Andres Freund, Thomas Munrom and David Rowley for investigating
this issue.

Discussion: https://postgr.es/m/CAPpHfdvap5mMLikt8CUjA0osAvCJHT0qnYeR3f84EJ_Kvse0mg%40mail.gmail.com
2024-04-03 02:55:03 +03:00
Alexander Korotkov
06c418e163 Implement pg_wal_replay_wait() stored procedure
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
2024-04-02 22:48:03 +03:00
Tom Lane
6faca9ae28 Avoid deadlock during orphan temp table removal.
If temp tables have dependencies (such as sequences) then it's
possible for autovacuum's cleanup of orphan temp tables to deadlock
against an incoming backend that's trying to clean out the temp
namespace for its own use.  That can happen because RemoveTempRelations'
performDeletion call can visit objects within the namespace in
an order different from the order in which a per-table deletion
will visit them.

To fix, observe that performDeletion will begin by taking an exclusive
lock on the temp namespace (even though it won't actually delete it).
So, if we can get a shared lock on the namespace, we can be sure we're
not running concurrently with RemoveTempRelations, while also not
conflicting with ordinary use of the namespace.  This requires
introducing a conditional version of LockDatabaseObject, but that's no
big deal.  (It's surprising we've got along without that this long.)

Report and patch by Mikhail Zhilin.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/c43ce028-2bc2-4865-9b89-3f706246eed5@postgrespro.ru
2024-04-02 14:59:32 -04:00
Thomas Munro
b5a9b18cd0 Provide API for streaming relation data.
Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream.  This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers().  It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.

Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.

Several patches using this API are proposed separately.

This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:49:46 +13:00
Thomas Munro
210622c60e Provide vectored variant of ReadBuffer().
Break ReadBuffer() up into two steps.  StartReadBuffers() and
WaitReadBuffers() give us two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs.  Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once.  The following commit will add a higher level mechanism
for doing that automatically with a practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:23:20 +13:00
Masahiko Sawada
667e65aac3 Use TidStore for dead tuple TIDs storage during lazy vacuum.
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-04-02 10:15:37 +09:00
Alvaro Herrera
da952b415f Rework lwlocknames.txt to become lwlocklist.h
This way, we can fold the list of lock names to occur in
BuiltinTrancheNames instead of having its own separate array.  This
saves two lines of code in GetLWTrancheName and some space in
BuiltinTrancheNames, as foreseen in commit 74a7306310, as well as
removing the need for a separate lwlocknames.c file.

We still have to build lwlocknames.h using Perl code, which initially I
wanted to avoid, but it gives us the chance to cross-check
wait_event_names.txt.

Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
2024-03-20 11:55:20 +01:00
Robert Haas
2346df6fc3 Allow a no-wait lock acquisition to succeed in more cases.
We don't determine the position at which a process waiting for a lock
should insert itself into the wait queue until we reach ProcSleep(),
and we may at that point discover that we must insert ourselves ahead
of everyone who wants a conflicting lock, in which case we obtain the
lock immediately. Up until now, a no-wait lock acquisition would fail
in such cases, erroneously claiming that the lock couldn't be obtained
immediately.  Fix that by trying ProcSleep even in the no-wait case.

No back-patch for now, because I'm treating this as an improvement to
the existing no-wait feature. It could instead be argued that it's a
bug fix, on the theory that there should never be any case whatsoever
where no-wait fails to obtain a lock that would have been obtained
immediately without no-wait, but I'm reluctant to interpret the
semantics of no-wait that strictly.

Robert Haas and Jingxian Li

Discussion: http://postgr.es/m/CA+TgmobCH-kMXGVpb0BB-iNMdtcNkTvcZ4JBxDJows3kYM+GDg@mail.gmail.com
2024-03-14 08:56:06 -04:00
Alexander Korotkov
e85662df44 Fix false reports in pg_visibility
Currently, pg_visibility computes its xid horizon using the
GetOldestNonRemovableTransactionId().  The problem is that this horizon can
sometimes go backward.  That can lead to reporting false errors.

In order to fix that, this commit implements a new function
GetStrictOldestNonRemovableTransactionId().  This function computes the xid
horizon, which would be guaranteed to be newer or equal to any xid horizon
computed before.

We have to do the following to achieve this.

1. Ignore processes xmin's, because they consider connection to other databases
   that were ignored before.
2. Ignore KnownAssignedXids, because they are not database-aware. At the same
   time, the primary could compute its horizons database-aware.
3. Ignore walsender xmin, because it could go backward if some replication
   connections don't use replication slots.

As a result, we're using only currently running xids to compute the horizon.
Surely these would significantly sacrifice accuracy.  But we have to do so to
avoid reporting false errors.

Inspired by earlier patch by Daniel Shelepanov and the following discussion
with Robert Haas and Tom Lane.

Discussion: https://postgr.es/m/1649062270.289865713%40f403.i.mail.ru
Reviewed-by: Alexander Lakhin, Dmitry Koval
2024-03-14 13:12:05 +02:00
Peter Eisentraut
97d85be365 Make the order of the header file includes consistent
Similar to commit 7e735035f2.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com
2024-03-13 15:07:00 +01:00
Heikki Linnakangas
af0e7deb4a Don't destroy SMgrRelations at relcache invalidation
With commit 21d9c3ee4e, SMgrRelations remain valid until end of
transaction (or longer if they're "pinned"). Relcache invalidation can
happen in the middle of a transaction, so we must not destroy them at
relcache invalidation anymore.

This was revealed by failures in the 'constraints' test in buildfarm
animals using -DCLOBBER_CACHE_ALWAYS. That started failing with commit
8af2565248, which was the first commit that started to rely on an
SMgrRelation living until end of transaction.

Diagnosed-by: Tomas Vondra, Thomas Munro
Reviewed-by: Thomas Munro
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com
2024-03-11 09:08:02 +02:00
Michael Paquier
f160bf06f7 Document units of "timeout" in ConditionVariableTimedSleep()
The timeout is passed down to WaitLatch() as milliseconds.

Author: Shveta Malik
Discussion: https://postgr.es/m/CAJpy0uC=xiBQD1WapgYYvOiytap6ULJaakLd867zZXqu9tYc8w@mail.gmail.com
2024-03-09 15:44:41 +09:00
Peter Eisentraut
43a8875f49 Put back required #include
Fix for dbbca2cf29: "storage/shmem.h" is required with
-Dspinlocks=false.
2024-03-04 13:17:37 +01:00
Daniel Gustafsson
cc09e6549f Remove the adminpack contrib extension
The adminpack extension was only used to support pgAdmin III,  which
in turn was declared EOL many years ago. Removing the extension also
allows us to remove functions from core as well which were only used
to support old version of adminpack.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CALj2ACUmL5TraYBUBqDZBi1C+Re8_=SekqGYqYprj_W8wygQ8w@mail.gmail.com
2024-03-04 12:39:22 +01:00
Peter Eisentraut
dbbca2cf29 Remove unused #include's from backend .c files
as determined by include-what-you-use (IWYU)

While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that.  In some cases, a more
specific #include replaces another less specific one.

Some manual adjustments of the automatic result:

- IWYU currently doesn't know about includes that provide global
  variable declarations (like -Wmissing-variable-declarations), so
  those includes are being kept manually.

- All includes for port(ability) headers are being kept for now, to
  play it safe.

- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
  patch from exploding in size.

Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.

As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.

Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-03-04 12:02:20 +01:00
Heikki Linnakangas
393b5599e5 Use MyBackendType in more places to check what process this is
Remove IsBackgroundWorker, IsAutoVacuumLauncherProcess(),
IsAutoVacuumWorkerProcess(), and IsLogicalSlotSyncWorker() in favor of
new Am*Process() macros that use MyBackendType. For consistency with
the existing Am*Process() macros.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0@iki.fi
2024-03-04 10:25:12 +02:00
Heikki Linnakangas
024c521117 Replace BackendIds with 0-based ProcNumbers
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.

The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.

One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:38:22 +02:00
Heikki Linnakangas
ab355e3a88 Redefine backend ID to be an index into the proc array
Previously, backend ID was an index into the ProcState array, in the
shared cache invalidation manager (sinvaladt.c). The entry in the
ProcState array was reserved at backend startup by scanning the array
for a free entry, and that was also when the backend got its backend
ID. Things become slightly simpler if we redefine backend ID to be the
index into the PGPROC array, and directly use it also as an index to
the ProcState array. This uses a little more memory, as we reserve a
few extra slots in the ProcState array for aux processes that don't
need them, but the simplicity is worth it.

Aux processes now also have a backend ID. This simplifies the
reservation of BackendStatusArray and ProcSignal slots.

You can now convert a backend ID into an index into the PGPROC array
simply by subtracting 1. We still use 0-based "pgprocnos" in various
places, for indexes into the PGPROC array, but the only difference now
is that backend IDs start at 1 while pgprocnos start at 0. (The next
commmit will get rid of the term "backend ID" altogether and make
everything 0-based.)

There is still a 'backendId' field in PGPROC, now part of 'vxid' which
encapsulates the backend ID and local transaction ID together. It's
needed for prepared xacts. For regular backends, the backendId is
always equal to pgprocno + 1, but for prepared xact PGPROC entries,
it's the ID of the original backend that processed the transaction.

Reviewed-by: Andres Freund, Reid Thompson
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:37:28 +02:00
Thomas Munro
653b55b570 Return ssize_t in fd.c I/O functions.
In the past, FileRead() and FileWrite() used types based on the Unix
read() and write() functions from before C and POSIX standardization,
though not exactly (we had int for amount instead of unsigned).  In
commit 2d4f1ba6 we changed to the appropriate standard C types, just
like the modern POSIX functions they wrap, but again not exactly: the
return type stayed as int.  In theory, a ssize_t value could be returned
by the underlying call that is too large for an int.

That wasn't really a live bug, because we don't expect PostgreSQL code
to perform reads or writes of gigabytes, and OSes probably apply
internal caps smaller than that anyway.  This change is done on the
principle that the return might as well follow the standard interfaces
consistently.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1672202.1703441340%40sss.pgh.pa.us
2024-03-02 12:09:28 +13:00
Alvaro Herrera
53c2a97a92 Improve performance of subsystems on top of SLRU
More precisely, what we do here is make the SLRU cache sizes
configurable with new GUCs, so that sites with high concurrency and big
ranges of transactions in flight (resp. multixacts/subtransactions) can
benefit from bigger caches.  In order for this to work with good
performance, two additional changes are made:

1. the cache is divided in "banks" (to borrow terminology from CPU
   caches), and algorithms such as eviction buffer search only affect
   one specific bank.  This forestalls the problem that linear searching
   for a specific buffer across the whole cache takes too long: we only
   have to search the specific bank, whose size is small.  This work is
   authored by Andrey Borodin.

2. Change the locking regime for the SLRU banks, so that each bank uses
   a separate LWLock.  This allows for increased scalability.  This work
   is authored by Dilip Kumar.  (A part of this was previously committed as
   d172b717c6f4.)

Special care is taken so that the algorithms that can potentially
traverse more than one bank release one bank's lock before acquiring the
next.  This should happen rarely, but particularly clog.c's group commit
feature needed code adjustment to cope with this.  I (Álvaro) also added
lots of comments to make sure the design is sound.

The new GUCs match the names introduced by bcdfa5f2e2 in the
pg_stat_slru view.

The default values for these parameters are similar to the previous
sizes of each SLRU.  commit_ts, clog and subtrans accept value 0, which
means to adjust by dividing shared_buffers by 512 (so 2MB for every 1GB
of shared_buffers), with a cap of 8MB.  (A new slru.c function
SimpleLruAutotuneBuffers() was added to support this.)  The cap was
previously 1MB for clog, so for sites with more than 512MB of shared
memory the total memory used increases, which is likely a good tradeoff.
However, other SLRUs (notably multixact ones) retain smaller sizes and
don't support a configured value of 0.  These values based on
shared_buffers may need to be revisited, but that's an easy change.

There was some resistance to adding these new GUCs: it would be better
to adjust to memory pressure automatically somehow, for example by
stealing memory from shared_buffers (where the caches can grow and
shrink naturally).  However, doing that seems to be a much larger
project and one which has made virtually no progress in several years,
and because this is such a pain point for so many users, here we take
the pragmatic approach.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amul Sul, Gilles Darold, Anastasia Lubennikova,
	Ivan Lazarev, Robert Haas, Thomas Munro, Tomas Vondra,
	Yura Sokolov, Васильев Дмитрий (Dmitry Vasiliev).
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86@yandex-team.ru
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
2024-02-28 17:05:31 +01:00
Alvaro Herrera
bcdfa5f2e2 Rename SLRU elements in view pg_stat_slru
The new names are intended to match those in an upcoming patch that adds
a few GUCs to configure the SLRU buffer sizes.

Backwards compatibility concern: this changes the accepted names for
function pg_stat_slru_rest().  Since this function recognizes "any other
string" as a request to reset the entry for "other", this means that
calling it with the old names would silently reset "other" instead of
doing nothing or throwing an error.

Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql
2024-02-28 09:39:52 +01:00
Nathan Bossart
42a1de3013 Add helper functions for dshash tables with string keys.
Presently, string keys are not well-supported for dshash tables.
The dshash code always copies key_size bytes into new entries'
keys, and dshash.h only provides compare and hash functions that
forward to memcmp() and tag_hash(), both of which do not stop at
the first NUL.  This means that callers must pad string keys so
that the data beyond the first NUL does not adversely affect the
results of copying, comparing, and hashing the keys.

To better support string keys in dshash tables, this commit does
a couple things:

* A new copy_function field is added to the dshash_parameters
  struct.  This function pointer specifies how the key should be
  copied into new table entries.  For example, we only want to copy
  up to the first NUL byte for string keys.  A dshash_memcpy()
  helper function is provided and used for all existing in-tree
  dshash tables without string keys.

* A set of helper functions for string keys are provided.  These
  helper functions forward to strcmp(), strcpy(), and
  string_hash(), all of which ignore data beyond the first NUL.

This commit also adjusts the DSM registry's dshash table to use the
new helper functions for string keys.

Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/20240119215941.GA1322079%40nathanxps13
2024-02-26 15:47:13 -06:00
Heikki Linnakangas
d360e3cc60 Fix compiler warning on typedef redeclaration
bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
    } BulkWriteState;
      ^
    ../../../../src/include/storage/bulk_write.h:20:31: note: previous definition is here
    typedef struct BulkWriteState BulkWriteState;
                                  ^
    1 error generated.

Per buildfarm animals 'sifaka' and 'longfin'.

Discussion: https://www.postgresql.org/message-id/9e1f63c3-ef16-404c-b3cb-859a96eaba39@iki.fi
2024-02-23 17:39:27 +02:00
Heikki Linnakangas
8af2565248 Introduce a new smgr bulk loading facility.
The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to pageinspect GiST test needs an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
2024-02-23 16:10:51 +02:00
Amit Kapila
93db6cbda0 Add a new slot sync worker to synchronize logical slots.
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.

The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.

A new parameter sync_replication_slots enables or disables this new
process.

On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.

A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.

Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
2024-02-22 15:25:15 +05:30
Heikki Linnakangas
28f3915b73 Remove superfluous 'pgprocno' field from PGPROC
It was always just the index of the PGPROC entry from the beginning of
the proc array. Introduce a macro to compute it from the pointer
instead.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-02-22 01:21:34 +02:00
Nathan Bossart
5497daf3aa Replace calls to pg_qsort() with the qsort() macro.
Calls to this function might give the impression that pg_qsort()
is somehow different than qsort(), when in fact there is a qsort()
macro in port.h that expands all in-tree uses to pg_qsort().

Reviewed-by: Mats Kindahl
Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
2024-02-16 11:37:50 -06:00
Alexander Korotkov
51efe38cb9 Introduce transaction_timeout
This commit adds timeout that is expected to be used as a prevention
of long-running queries. Any session within the transaction will be
terminated after spanning longer than this timeout.

However, this timeout is not applied to prepared transactions.
Only transactions with user connections are affected.

Discussion: https://postgr.es/m/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com
Author: Andrey Borodin <amborodin@acm.org>
Author: Japin Li <japinli@hotmail.com>
Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Nikolay Samokhvalov <samokhvalov@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: bt23nguyent <bt23nguyent@oss.nttdata.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
2024-02-15 23:56:12 +02:00
Nathan Bossart
28e4632509 Centralize logic for restoring errno in signal handlers.
Presently, we rely on each individual signal handler to save the
initial value of errno and then restore it before returning if
needed.  This is easily forgotten and, if missed, often goes
undetected for a long time.

In commit 3b00fdba9f, we introduced a wrapper signal handler
function that checks whether MyProcPid matches getpid().  This
commit moves the aforementioned errno restoration code from the
individual signal handlers to the new wrapper handler so that we no
longer need to worry about missing it.

Reviewed-by: Andres Freund, Noah Misch
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
2024-02-14 16:34:18 -06:00
Amit Kapila
ddd5f4f54a Add a slot synchronization function.
This commit introduces a new SQL function pg_sync_replication_slots()
which is used to synchronize the logical replication slots from the
primary server to the physical standby so that logical replication can be
resumed after a failover or planned switchover.

A new 'synced' flag is introduced in pg_replication_slots view, indicating
whether the slot has been synchronized from the primary server. On a
standby, synced slots cannot be dropped or consumed, and any attempt to
perform logical decoding on them will result in an error.

The logical replication slots on the primary can be synchronized to the
hot standby by using the 'failover' parameter of
pg-create-logical-replication-slot(), or by using the 'failover' option of
CREATE SUBSCRIPTION during slot creation, and then calling
pg_sync_replication_slots() on standby. For the synchronization to work,
it is mandatory to have a physical replication slot between the primary
and the standby aka 'primary_slot_name' should be configured on the
standby, and 'hot_standby_feedback' must be enabled on the standby. It is
also necessary to specify a valid 'dbname' in the 'primary_conninfo'.

If a logical slot is invalidated on the primary, then that slot on the
standby is also invalidated.

If a logical slot on the primary is valid but is invalidated on the
standby, then that slot is dropped but will be recreated on the standby in
the next pg_sync_replication_slots() call provided the slot still exists
on the primary server. It is okay to recreate such slots as long as these
are not consumable on standby (which is the case currently). This
situation may occur due to the following reasons:
- The 'max_slot_wal_keep_size' on the standby is insufficient to retain
WAL records from the restart_lsn of the slot.
- 'primary_slot_name' is temporarily reset to null and the physical slot
is removed.

The slot synchronization status on the standby can be monitored using the
'synced' column of pg_replication_slots view.

A functionality to automatically synchronize slots by a background worker
and allow logical walsenders to wait for the physical will be done in
subsequent commits.

Author: Hou Zhijie, Shveta Malik, Ajin Cherian based on an earlier version by Peter Eisentraut
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
2024-02-14 09:45:36 +05:30
Heikki Linnakangas
fbf9a7ac4d Fix 'mmap' DSM implementation with allocations larger than 4 GB
Fixes bug #18341. Backpatch to all supported versions.

Discussion: https://www.postgresql.org/message-id/18341-ce16599e7fd6228c@postgresql.org
2024-02-13 21:23:41 +02:00
Heikki Linnakangas
18dd9d2ed9 Fix typo in comments
Backpatch-through: v16
2024-02-03 00:18:21 +02:00
Heikki Linnakangas
d212957254 Fix bug in bulk extending temp relation after failure
A ResourceOwnerEnlarge() call was missing. That led to an error:

ERROR:  ResourceOwnerRemember called but array was full

and an assertion failure, if you tried to extend a temp relation again
after a failure. Alexander's test case used running out of disk space
to trigger the original failure.

This bug was introduced in the large ResourceOwner rewrite commit
b8bff07daa. Before that, the UnpinLocalBuffer() call guaranteed that
the subsequent PinLocalBuffer() will succeed, but after the rewrite,
releasing an old resource doesn't guarantee that there is space for a
new one.

Add a comment explaining why the UnpinBuffer + PinBuffer calls in
BufferAlloc(), with no ResourceOwnerEnlarge() in between, are safe.

Reported-by: Alexander Lakhin
Discussion: https://www.postgresql.org/message-id/dc574fea-c83e-a600-08cd-10881762e4fa@gmail.com
2024-02-02 21:12:30 +02:00
Alvaro Herrera
402388946f Fix copy&paste typo in comment 2024-01-31 23:11:53 +01:00
Heikki Linnakangas
21d9c3ee4e Give SMgrRelation pointers a well-defined lifetime.
After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

smgrclose() is now just an alias for smgrrelease(). It closes files
and forgets all state except the rlocator, but keeps the SMgrRelation
object valid.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrclose() is now just an alias for smgrrelease(). It releases
   resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used.

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

This also replaces the "ownership" mechanism of SMgrRelations with a
pin counter.  An SMgrRelation can now be "pinned", which prevents it
from being destroyed at end of transaction.  There can be multiple pins
on the same SMgrRelation.  In practice, the pin mechanism is only used
by the relcache, so there cannot be more than one pin on the same
SMgrRelation.  Except with swap_relation_files XXX

Author: Thomas Munro, Heikki Linnakangas
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
2024-01-31 12:31:02 +02:00
Alvaro Herrera
7b745d85b8 Split use of SerialSLRULock, creating SerialControlLock
predicate.c has been using SerialSLRULock (the control lock for its SLRU
structure) to coordinate access to SerialControlData, another of its
numerous shared memory structures; this is unnecessary and confuses
further SLRU scalability work.  Create a separate LWLock to cover
SerialControlData.

Extracted from a larger patch from the same author, and some additional
changes by Álvaro.

Author: Dilip Kumar <dilip.kumar@enterprisedb.com>
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
2024-01-30 18:11:17 +01:00
Alvaro Herrera
8c9da1441d Make spelling of cancelled/cancellation consistent
This fixes places where words derived from cancel were not using their
common en-US ugly^H^H^H^Hspelling.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reported-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKG+Lrq+ty6yWXF5572qNQ8KwxGwG5n4fsEcCUap685nWvQ@mail.gmail.com
2024-01-26 12:38:15 +01:00
Alvaro Herrera
55627ba2d3 Remove dummy_spinlock
It's been unused since 1b468a131b (2015).
2024-01-25 11:43:47 +01:00
Alvaro Herrera
46778187f5 Fix s_lock_test compile
This is a mostly unused tool, but I discovered while nosing around the
Makefile that it hasn't been kept in line with other changes.  Fix it.

Backpatching doesn't appear to be necessary.

Discussion: https://postgr.es/m/202401241114.ied53jcich72@alvherre.pgsql
2024-01-25 11:17:33 +01:00
Alvaro Herrera
74a7306310 Improve notation of BuiltinTrancheNames
Use C99 designated initializer syntax for array elements, instead of
writing the position in a comment.  This is less verbose and much more
readable.  Akin to cc15059634.

One disadvantage is that the BuiltinTrancheNames array now has a hole of
51 NULLs -- previously, the array elements were shifted 51 elements
 downward to avoid this.  This can be fixed by merging the
IndividualLWLockNames array into BuiltinTrancheNames, which would occupy
those 51 pointers, but because it requires some arguably ugly Meson
hackery, it's left for later.

Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
2024-01-24 15:01:30 +01:00
Nathan Bossart
4372adfa24 Fix possible NULL pointer dereference in GetNamedDSMSegment().
GetNamedDSMSegment() doesn't check whether dsm_attach() returns
NULL, which creates the possibility of a NULL pointer dereference
soon after.  To fix, emit an ERROR if dsm_attach() returns NULL.
This shouldn't happen, but it would be nice to avoid a segfault if
it does.  In passing, tidy up the surrounding code.

Reported-by: Tom Lane
Reviewed-by: Michael Paquier, Bharath Rupireddy
Discussion: https://postgr.es/m/3348869.1705854106%40sss.pgh.pa.us
2024-01-22 20:44:38 -06:00
Michael Paquier
d86d20f0ba Add backend support for injection points
Injection points are a new facility that makes possible for developers
to run custom code in pre-defined code paths.  Its goal is to provide
ways to design and run advanced tests, for cases like:
- Race conditions, where processes need to do actions in a controlled
ordered manner.
- Forcing a state, like an ERROR, FATAL or even PANIC for OOM, to force
recovery, etc.
- Arbitrary sleeps.

This implements some basics, and there are plans to extend it more in
the future depending on what's required.  Hence, this commit adds a set
of routines in the backend that allows developers to attach, detach and
run injection points:
- A code path calling an injection point can be declared with the macro
INJECTION_POINT(name).
- InjectionPointAttach() and InjectionPointDetach() to respectively
attach and detach a callback to/from an injection point.  An injection
point name is registered in a shmem hash table with a library name and a
function name, which will be used to load the callback attached to an
injection point when its code path is run.

Injection point names are just strings, so as an injection point can be
declared and run by out-of-core extensions and modules, with callbacks
defined in external libraries.

This facility is hidden behind a dedicated switch for ./configure and
meson, disabled by default.

Note that backends use a local cache to store callbacks already loaded,
cleaning up their cache if a callback has found to be removed on a
best-effort basis.  This could be refined further but any tests but what
we have here was fine with the tests I've written while implementing
these backend APIs.

Author: Michael Paquier, with doc suggestions from Ashutosh Bapat.
Reviewed-by: Ashutosh Bapat, Nathan Bossart, Álvaro Herrera, Dilip
Kumar, Amul Sul, Nazir Bilal Yavuz
Discussion: https://postgr.es/m/ZTiV8tn_MIb_H2rE@paquier.xyz
2024-01-22 10:15:50 +09:00
Nathan Bossart
8b2bcf3f28 Introduce the dynamic shared memory registry.
Presently, the most straightforward way for a shared library to use
shared memory is to request it at server startup via a
shmem_request_hook, which requires specifying the library in
shared_preload_libraries.  Alternatively, the library can create a
dynamic shared memory (DSM) segment, but absent a shared location
to store the segment's handle, other backends cannot use it.  This
commit introduces a registry for DSM segments so that these other
backends can look up existing segments with a library-specified
string.  This allows libraries to easily use shared memory without
needing to request it at server startup.

The registry is accessed via the new GetNamedDSMSegment() function.
This function handles allocating the segment and initializing it
via a provided callback.  If another backend already created and
initialized the segment, it simply attaches the segment.
GetNamedDSMSegment() locks the registry appropriately to ensure
that only one backend initializes the segment and that all other
backends just attach it.

The registry itself is comprised of a dshash table that stores the
DSM segment handles keyed by a library-specified string.

Reviewed-by: Michael Paquier, Andrei Lepikhov, Nikita Malakhov, Robert Haas, Bharath Rupireddy, Zhang Mingli, Amul Sul
Discussion: https://postgr.es/m/20231205034647.GA2705267%40nathanxps13
2024-01-19 14:24:36 -06:00
Alexander Korotkov
448a3331d9 Fix name collision in c64086b79d
Reported-by: Erik Rijkers, Tom Lane
Discussion: https://postgr.es/m/E1rQqeS-002A0s-Qm%40gemulon.postgresql.org
2024-01-19 18:17:13 +02:00
Alexander Korotkov
c64086b79d Reorder actions in ProcArrayApplyRecoveryInfo()
Since 5a1dfde833, 2PC filenames use FullTransactionId.  Thus, it needs to
convert TransactionId to FullTransactionId in StandbyTransactionIdIsPrepared()
using TransamVariables->nextXid.  However, ProcArrayApplyRecoveryInfo()
first releases locks with usage StandbyTransactionIdIsPrepared(), then advances
TransamVariables->nextXid.  This sequence of actions could cause errors.
This commit makes ProcArrayApplyRecoveryInfo() advance
TransamVariables->nextXid before releasing locks.

Reported-by: Thomas Munro, Michael Paquier
Discussion: https://postgr.es/m/CA%2BhUKGLj_ve1_pNAnxwYU9rDcv7GOhsYXJt7jMKSA%3D5-6ss-Cw%40mail.gmail.com
Discussion: https://postgr.es/m/Zadp9f4E1MYvMJqe%40paquier.xyz
2024-01-19 17:19:17 +02:00
Peter Eisentraut
ed1e0a6512 Error message capitalisation
per style guidelines

Author: Peter Smith <peter.b.smith@fujitsu.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPtzstExQ4%3DvFH%2BWzZ4g4xEx2JA%3DqxussxOdxVEwJce6bw%40mail.gmail.com
2024-01-18 09:35:12 +01:00
Michael Paquier
e72a37528d Refactor code checking for file existence
jit.c and dfgr.c had a copy of the same code to check if a file exists
or not, with a twist: jit.c did not check for EACCES when failing the
stat() call for the path whose existence is tested.  This refactored
routine will be used by an upcoming patch.

Reviewed-by: Ashutosh Bapat
Discussion: https://postgr.es/m/ZTiV8tn_MIb_H2rE@paquier.xyz
2024-01-12 12:04:51 +09:00
Nathan Bossart
5b1b9bce84 Cross-check lists of predefined LWLocks.
Both lwlocknames.txt and wait_event_names.txt contain a list of all
the predefined LWLocks, i.e., those with predefined positions
within MainLWLockArray.  It is easy to miss one or the other,
especially since the list in wait_event_names.txt omits the "Lock"
suffix from all the LWLock wait events.  This commit adds a cross-
check of these lists to the script that generates lwlocknames.h.
If the lists do not match exactly, building will fail.

Suggested-by: Robert Haas
Reviewed-by: Robert Haas, Michael Paquier, Bertrand Drouvot
Discussion: https://postgr.es/m/20240102173120.GA1061678%40nathanxps13
2024-01-09 11:05:19 -06:00
Michael Paquier
9cd0d77dfc Fix corruption of local buffer state during extend of temp relation
A typo has been introduced by 31966b151e when updating the state of a
local buffer when a temporary relation is extended, for the case of a
block included in the relation range extended, when it is already found
in the hash table holding the local buffers.  In this case, BM_VALID
should be cleared, but the buffer state was changed so as BM_VALID
remained while clearing the other flags.

As reported on the thread, it was possible to corrupt the state of the
local buffers on ENOSPC, but the states would be corrupted on any kind
of ERROR during the relation extend (like partial writes or some other
errno).

Reported-by: Alexander Lakhin
Author: Tender Wang
Reviewed-by: Richard Guo, Alexander Lakhin, Michael Paquier
Discussion: https://postgr.es/m/18259-6e256429825dd435@postgresql.org
Backpatch-through: 16
2024-01-05 20:08:34 +09:00
Bruce Momjian
29275b1d17 Update copyright for 2024
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz

Backpatch-through: 12
2024-01-03 20:49:05 -05:00