Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
on the preliminary stage of optimization when we are still working with the
expression tree.
Here Cn is a n-th constant expression, 'expr' is non-constant expression, 'op'
is an operator which returns boolean result and has a commuter (for the case
of reverse order of constant and non-constant parts of the expression,
like 'Cn op expr').
Sometimes it can lead to not optimal plan. This is why there is a
or_to_any_transform_limit GUC. It specifies a threshold value of length of
arguments in an OR expression that triggers the OR-to-ANY transformation.
Generally, more groupable OR arguments mean that transformation will be more
likely to win than to lose.
Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina <lena.ribackina@yandex.ru>
Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
The BAS_VACUUM ring size has been 256kB since commit d526575f introduced
the mechanism 17 years ago. Commit 1cbbee03 recently made it
configurable but retained the traditional default. The correct default
size has been debated for years, but 256kB is certainly very small.
VACUUM soon needs to write back data it dirtied only 32 blocks ago,
which usually requires flushing the WAL. New experiments in prefetching
pages for VACUUM exacerbated the problem by crashing into dirty data
even sooner. Let's make the default 2MB. That's 1.6% of the default
toy buffer pool size, and 0.2% of 1GB, which would be a considered a
small shared_buffers setting for a real system these days. Users are
still free to set the GUC to a different value.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
Break ReadBuffer() up into two steps. StartReadBuffers() and
WaitReadBuffers() give us two main advantages:
1. Multiple consecutive blocks can be read with one system call.
2. Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.
The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.
A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs. Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once. The following commit will add a higher level mechanism
for doing that automatically with a practical interface.
With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
This is marked PGC_SIGHUP, so it can only be set in a configuration
file, not anywhere else; and it is also marked GUC_DISALLOW_IN_AUTO_FILE,
so it can't be set using ALTER SYSTEM. When set to false, the
ALTER SYSTEM command is disallowed.
There was considerable concern that this would be misinterpreted as
a security feature, which it is not, because a determined superuser
has various ways of bypassing it. Hence, a lot of work has gone into
wordsmithing the documentation, in the hopes of avoiding any such
confusion.
Jelte Fennemia-Nio and Gabriele Bartolini, with wording suggestions
for the documentation from many others.
Discussion: http://postgr.es/m/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com
This patch provides a way to ensure that physical standbys that are
potential failover candidates have received and flushed changes before
the primary server making them visible to subscribers. Doing so guarantees
that the promoted standby server is not lagging behind the subscribers
when a failover is necessary.
The logical walsender now guarantees that all local changes are sent and
flushed to the standby servers corresponding to the replication slots
specified in 'standby_slot_names' before sending those changes to the
subscriber.
Additionally, the SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to ensure that they process changes for failover slots only after physical
slots specified in 'standby_slot_names' have confirmed WAL receipt for those.
Author: Hou Zhijie and Shveta Malik
Reviewed-by: Masahiko Sawada, Peter Smith, Bertrand Drouvot, Ajin Cherian, Nisha Moond, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
More precisely, what we do here is make the SLRU cache sizes
configurable with new GUCs, so that sites with high concurrency and big
ranges of transactions in flight (resp. multixacts/subtransactions) can
benefit from bigger caches. In order for this to work with good
performance, two additional changes are made:
1. the cache is divided in "banks" (to borrow terminology from CPU
caches), and algorithms such as eviction buffer search only affect
one specific bank. This forestalls the problem that linear searching
for a specific buffer across the whole cache takes too long: we only
have to search the specific bank, whose size is small. This work is
authored by Andrey Borodin.
2. Change the locking regime for the SLRU banks, so that each bank uses
a separate LWLock. This allows for increased scalability. This work
is authored by Dilip Kumar. (A part of this was previously committed as
d172b717c6f4.)
Special care is taken so that the algorithms that can potentially
traverse more than one bank release one bank's lock before acquiring the
next. This should happen rarely, but particularly clog.c's group commit
feature needed code adjustment to cope with this. I (Álvaro) also added
lots of comments to make sure the design is sound.
The new GUCs match the names introduced by bcdfa5f2e2 in the
pg_stat_slru view.
The default values for these parameters are similar to the previous
sizes of each SLRU. commit_ts, clog and subtrans accept value 0, which
means to adjust by dividing shared_buffers by 512 (so 2MB for every 1GB
of shared_buffers), with a cap of 8MB. (A new slru.c function
SimpleLruAutotuneBuffers() was added to support this.) The cap was
previously 1MB for clog, so for sites with more than 512MB of shared
memory the total memory used increases, which is likely a good tradeoff.
However, other SLRUs (notably multixact ones) retain smaller sizes and
don't support a configured value of 0. These values based on
shared_buffers may need to be revisited, but that's an easy change.
There was some resistance to adding these new GUCs: it would be better
to adjust to memory pressure automatically somehow, for example by
stealing memory from shared_buffers (where the caches can grow and
shrink naturally). However, doing that seems to be a much larger
project and one which has made virtually no progress in several years,
and because this is such a pain point for so many users, here we take
the pragmatic approach.
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amul Sul, Gilles Darold, Anastasia Lubennikova,
Ivan Lazarev, Robert Haas, Thomas Munro, Tomas Vondra,
Yura Sokolov, Васильев Дмитрий (Dmitry Vasiliev).
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86@yandex-team.ru
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.
The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.
A new parameter sync_replication_slots enables or disables this new
process.
On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.
A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.
Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
When evaluating a query with a multi-column GROUP BY clause, we can minimize
sort operations or avoid them if we synchronize the order of GROUP BY clauses
with the ORDER BY sort clause or sort order, which comes from the underlying
query tree. Grouping does not imply any ordering, so we can compare
the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg,
we simply compared keys in the order specified in the query. This commit
explores alternative ordering of the keys, trying to find a cheaper one.
The ordering of group keys may interact with other parts of the query, some of
which may not be known while planning the grouping. For example, there may be
an explicit ORDER BY clause or some other ordering-dependent operation higher up
in the query, and using the same ordering may allow using either incremental
sort or even eliminating the sort entirely.
The patch always keeps the ordering specified in the query, assuming the user
might have additional insights.
This introduces a new GUC enable_group_by_reordering so that the optimization
may be disabled if needed.
Discussion: https://postgr.es/m/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru
Author: Andrei Lepikhov, Teodor Sigaev
Reviewed-by: Tomas Vondra, Claudio Freire, Gavin Flower, Dmitry Dolgov
Reviewed-by: Robert Haas, Pavel Borisov, David Rowley, Zhihong Yu
Reviewed-by: Tom Lane, Alexander Korotkov, Richard Guo, Alena Rybakina
In commit 3e51b278d I (tgl) caused initdb to leave lc_messages and
other lc_xxx GUCs commented-out in the installed postgresql.conf file
if they were going to be set to 'C'. This was a hack for cosmetic
purposes, and it was buggy because lc_messages' wired-in default is
not 'C' but '' (empty string). That led to --no-locale not having
the expected effect, since the postmaster would then obtain
lc_messages from its startup environment.
Let's just revert to the prior behavior of always de-commenting the
lc_xxx entries; the argument for changing that longstanding behavior
was weak in the first place.
Also, fix postgresql.conf.sample's erroneous claim that the default
value of lc_messages is 'C'. I suspect that was what misled me into
making this mistake in the first place.
Report and patch by Kyotaro Horiguchi. Back-patch to v16 where
the problem was introduced.
Discussion: https://postgr.es/m/20231122.162700.1995154567625541112.horikyota.ntt@gmail.com
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again.
In other words, it tells us which blocks need to copied in case of an
incremental backup covering that range of WAL records. But this
doesn't yet add the capability to actually perform an incremental
backup; the next patch will do that.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar and Andres Freund.
Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter
Eisentraut, and Álvaro Herrera.
Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
This avoids the wraparound in async.c and removes the corresponding code
complexity. The maximum amount of allocated SLRU pages for NOTIFY / LISTEN
queue is now determined by the max_notify_queue_pages GUC. The default
value is 1048576. It allows to consume up to 8 GB of disk space which is
exactly the limit we had previously.
Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev
Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov
Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov
Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund
Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev
Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com
Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
vacuum_defer_cleanup_age was introduced before hot_standby_feedback and
replication slots existed. It is hard to use reasonably - commonly it will
either be set too low (not preventing recovery conflicts, while still causing
some bloat), or too high (causing a lot of bloat). The alternatives do not
have that issue.
That on its own might not be sufficient reason to remove
vacuum_defer_cleanup_age, but it also complicates computation of xid
horizons. See e.g. the bug fixed in be504a3e97. It also is untested.
This commit removes TransactionIdRetreatSafely(), as there are no users
anymore. There might be potential future users, hence noting that here.
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20230317230930.nhsgk3qfk7f4axls@awork3.anarazel.de
This reverts commit 3d03b24c3 (Revert Add support for Kerberos
credential delegation) which was committed on the grounds of concern
about portability, but on further review and discussion, it's clear that
we are better off explicitly requiring MIT Kerberos as that appears to
be the only GSSAPI library currently that's under proper maintenance
and ongoing development. The API used for storing credentials was added
to MIT Kerberos over a decade ago while for the other libraries which
appear to be mainly based on Heimdal, which exists explicitly to be a
re-implementation of MIT Kerberos, the API never made it to a released
version (even though it was added to the Heimdal git repo over 5 years
ago..).
This post-feature-freeze change was approved by the RMT.
Discussion: https://postgr.es/m/ZDDO6jaESKaBgej0%40tamriel.snowman.net
This reverts commit 3d4fa227bc.
Per discussion and buildfarm, this depends on APIs that seem to not
be available on at least one platform (NetBSD). Should be certainly
possible to rework to be optional on that platform if necessary but bit
late for that at this point.
Discussion: https://postgr.es/m/3286097.1680922218@sss.pgh.pa.us
Support GSSAPI/Kerberos credentials being delegated to the server by a
client. With this, a user authenticating to PostgreSQL using Kerberos
(GSSAPI) credentials can choose to delegate their credentials to the
PostgreSQL server (which can choose to accept them, or not), allowing
the server to then use those delegated credentials to connect to
another service, such as with postgres_fdw or dblink or theoretically
any other service which is able to be authenticated using Kerberos.
Both postgres_fdw and dblink are changed to allow non-superuser
password-less connections but only when GSSAPI credentials have been
delegated to the server by the client and GSSAPI is used to
authenticate to the remote system.
Authors: Stephen Frost, Peifeng Qiu
Reviewed-By: David Christensen
Discussion: https://postgr.es/m/CO1PR05MB8023CC2CB575E0FAAD7DF4F8A8E29@CO1PR05MB8023.namprd05.prod.outlook.com
Add new options to the VACUUM and ANALYZE commands called
BUFFER_USAGE_LIMIT to allow users more control over how large to make the
buffer access strategy that is used to limit the usage of buffers in
shared buffers. Larger rings can allow VACUUM to run more quickly but
have the drawback of VACUUM possibly evicting more buffers from shared
buffers that might be useful for other queries running on the database.
Here we also add a new GUC named vacuum_buffer_usage_limit which controls
how large to make the access strategy when it's not specified in the
VACUUM/ANALYZE command. This defaults to 256KB, which is the same size as
the access strategy was prior to this change. This setting also
controls how large to make the buffer access strategy for autovacuum.
Per idea by Andres Freund.
Author: Melanie Plageman
Reviewed-by: David Rowley
Reviewed-by: Andres Freund
Reviewed-by: Justin Pryzby
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb@awork3.anarazel.de
For ICU collations, ensure that the locale's language exists in ICU,
and that the locale can be opened.
Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important to avoid such mistakes in ICU
versions 54 and earlier, where the same (misspelled) locale string
could fall back to different locales depending on the environment.
Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
Reviewed-by: Peter Eisentraut
Replace the hardcoded value with a GUC such that the iteration
count can be raised in order to increase protection against
brute-force attacks. The hardcoded value for SCRAM iteration
count was defined to be 4096, which is taken from RFC 7677, so
set the default for the GUC to 4096 to match. In RFC 7677 the
recommendation is at least 15000 iterations but 4096 is listed
as a SHOULD requirement given that it's estimated to yield a
0.5s processing time on a mobile handset of the time of RFC
writing (late 2015).
Raising the iteration count of SCRAM will make stored passwords
more resilient to brute-force attacks at a higher computational
cost during connection establishment. Lowering the count will
reduce computational overhead during connections at the tradeoff
of reducing strength against brute-force attacks.
There are however platforms where even a modest iteration count
yields a too high computational overhead, with weaker password
encryption schemes chosen as a result. In these situations,
SCRAM with a very low iteration count still gives benefits over
weaker schemes like md5, so we allow the iteration count to be
set to one at the low end.
The new GUC is intentionally generically named such that it can
be made to support future SCRAM standards should they emerge.
At that point the value can be made into key:value pairs with
an undefined key as a default which will be backwards compatible
with this.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Jonathan S. Katz <jkatz@postgresql.org>
Discussion: https://postgr.es/m/F72E7BC7-189F-4B17-BF47-9735EB72C364@yesql.se
Eager freezing strategy avoids large build-ups of all-visible pages. It
makes VACUUM trigger page-level freezing whenever doing so will enable
the page to become all-frozen in the visibility map. This is useful for
tables that experience continual growth, particularly strict append-only
tables such as pgbench's history table. Eager freezing significantly
improves performance stability by spreading out the cost of freezing
over time, rather than doing most freezing during aggressive VACUUMs.
It complements the insert autovacuum mechanism added by commit b07642db.
VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables.
Tables that exceed the size threshold use the eager freezing strategy.
Unlogged tables and temp tables always use eager freezing strategy,
since the added cost is negligible there. Non-permanent relations won't
incur any extra overhead in WAL written (for the obvious reason), nor in
pages dirtied (since any extra freezing will only take place on pages
whose PD_ALL_VISIBLE bit needed to be set either way).
VACUUM uses lazy freezing strategy for logged tables that fall under the
GUC size threshold. Page-level freezing triggers based on the criteria
established in commit 1de58df4, which added basic page-level freezing.
Eager freezing is strictly more aggressive than lazy freezing. Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use. The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.
Note that even lazy freezing strategy will trigger freezing whenever a
page happens to have required that an FPI be written during pruning,
provided that the page will thereby become all-frozen in the visibility
map afterwards (due to the FPI optimization from commit 1de58df4).
The vacuum_freeze_strategy_threshold default setting is 4GB. This is a
relatively low setting that prioritizes performance stability. It will
be reviewed at the end of the Postgres 16 beta period.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
This provides a way to reserve connection slots for non-superusers.
The slots reserved via the new GUC are available only to users who
have the new predefined role pg_use_reserved_connections.
superuser_reserved_connections remains as a final reserve in case
reserved_connections has been exhausted.
Patch by Nathan Bossart. Reviewed by Tushar Ahuja and by me.
Discussion: http://postgr.es/m/20230119194601.GA4105788@nathanxps13
Can be set to the empty string, or to either or both of "set" or
"inherit". If set to a non-empty value, a non-superuser who creates
a role (necessarily by relying up the CREATEROLE privilege) will
grant that role back to themselves with the specified options.
This isn't a security feature, because the grant that this feature
triggers can also be performed explicitly. Instead, it's a user experience
feature. A superuser would necessarily inherit the privileges of any
created role and be able to access all such roles via SET ROLE;
with this patch, you can configure createrole_self_grant = 'set, inherit'
to provide a similar experience for a user who has CREATEROLE but not
SUPERUSER.
Discussion: https://postgr.es/m/CA+TgmobN59ct+Emmz6ig1Nua2Q-_o=r6DSD98KfU53kctq_kQw@mail.gmail.com
Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber-side, the apply
worker writes the changes into temporary files and once it receives the
commit, it reads from those files and applies the entire transaction. To
improve the performance of such transactions, we can instead allow them to
be applied via parallel workers.
In this approach, we assign a new parallel apply worker (if available) as
soon as the xact's first stream is received and the leader apply worker
will send changes to this new worker via shared memory. The parallel apply
worker will directly apply the change instead of writing it to temporary
files. However, if the leader apply worker times out while attempting to
send a message to the parallel apply worker, it will switch to
"partial serialize" mode - in this mode, the leader serializes all
remaining changes to a file and notifies the parallel apply workers to
read and apply them at the end of the transaction. We use a non-blocking
way to send the messages from the leader apply worker to the parallel
apply to avoid deadlocks. We keep this parallel apply assigned till the
transaction commit is received and also wait for the worker to finish at
commit. This preserves commit ordering and avoid writing to and reading
from files in most cases. We still need to spill if there is no worker
available.
This patch also extends the SUBSCRIPTION 'streaming' parameter so that the
user can control whether to apply the streaming transaction in a parallel
apply worker or spill the change to disk. The user can set the streaming
parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means
the streaming will be applied via a parallel apply worker, if available.
The parameter value 'on' means the streaming transaction will be spilled
to disk. The default value is 'off' (same as current behaviour).
In addition, the patch extends the logical replication STREAM_ABORT
message so that abort_lsn and abort_time can also be sent which can be
used to update the replication origin in parallel apply worker when the
streaming transaction is aborted. Because this message extension is needed
to support parallel streaming, parallel streaming is not supported for
publications on servers < PG16.
Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko
Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
1349d279 added query planner support to allow more efficient execution of
aggregate functions which have an ORDER BY or a DISTINCT clause. Prior to
that commit, the planner would only request that the lower planner produce
a plan with the order required for the GROUP BY clause and it would be
left up to nodeAgg.c to perform the final sort of records within each
group so that the aggregate transition functions were called in the
correct order. Now that the planner requests the lower planner produce a
plan with the GROUP BY and the ORDER BY / DISTINCT aggregates in mind,
there is the possibility that the planner chooses a plan which could be
less efficient than what would have been produced before 1349d279.
While developing 1349d279, I had in mind that Incremental Sort would help
us in cases where an index exists only on the GROUP BY column(s).
Incremental Sort would just replace the implicit tuplesorts which are
being performed in nodeAgg.c. However, because the planner has the
flexibility to instead choose a plan which just performs a full sort on
both the GROUP BY and ORDER BY / DISTINCT aggregate columns, there is
potential for the planner to make a bad choice. The costing for
Incremental Sort is not perfect as it assumes an even distribution of rows
to sort within each sort group.
Here we add an escape hatch in the form of the enable_presorted_aggregate
GUC. This will allow users to get the pre-PG16 behavior in cases where
they have no other means to convince the query planner to produce a plan
which only sorts on the GROUP BY column(s).
Discussion: https://postgr.es/m/CAApHDvr1Sm+g9hbv4REOVuvQKeDWXcKUAhmbK5K+dfun0s9CvA@mail.gmail.com
This reverts commit db0d67db24 and
several follow-on fixes. The idea of making a cost-based choice
of the order of the sorting columns is not fundamentally unsound,
but it requires cost information and data statistics that we don't
really have. For example, relying on procost to distinguish the
relative costs of different sort comparators is pretty pointless
so long as most such comparator functions are labeled with cost 1.0.
Moreover, estimating the number of comparisons done by Quicksort
requires more than just an estimate of the number of distinct values
in the input: you also need some idea of the sizes of the larger
groups, if you want an estimate that's good to better than a factor of
three or so. That's data that's often unknown or not very reliable.
Worse, to arrive at estimates of the number of calls made to the
lower-order-column comparison functions, the code needs to make
estimates of the numbers of distinct values of multiple columns,
which are necessarily even less trustworthy than per-column stats.
Even if all the inputs are perfectly reliable, the cost algorithm
as-implemented cannot offer useful information about how to order
sorting columns beyond the point at which the average group size
is estimated to drop to 1.
Close inspection of the code added by db0d67db2 shows that there
are also multiple small bugs. These could have been fixed, but
there's not much point if we don't trust the estimates to be
accurate in-principle.
Finally, the changes in cost_sort's behavior made for very large
changes (often a factor of 2 or so) in the cost estimates for all
sorting operations, not only those for multi-column GROUP BY.
That naturally changes plan choices in many situations, and there's
precious little evidence to show that the changes are for the better.
Given the above doubts about whether the new estimates are really
trustworthy, it's hard to summon much confidence that these changes
are better on the average.
Since we're hard up against the release deadline for v15, let's
revert these changes for now. We can always try again later.
Note: in v15, I left T_PathKeyInfo in place in nodes.h even though
it's unreferenced. Removing it would be an ABI break, and it seems
a bit late in the release cycle for that.
Discussion: https://postgr.es/m/TYAPR01MB586665EB5FB2C3807E893941F5579@TYAPR01MB5866.jpnprd01.prod.outlook.com
Referring to the WAL as just "log" invites confusion with the
postmaster log, so avoid doing that in docs and error messages.
Also shorten "WAL segment file" to just "WAL file" in various
places.
Bharath Rupireddy, reviewed by Nathan Bossart and Kyotaro Horiguchi
Discussion: https://postgr.es/m/CALj2ACUeXa8tDPaiTLexBDMZ7hgvaN+RTb957-cn5qwv9zf-MQ@mail.gmail.com
This commit addresses a few things around GUCs:
- The TCP-related parameters (the four tcp_keepalives_* and
client_connection_check_interval are listed in postgresql.conf.sample in
a subsection called "TCP settings" of "CONNECTIONS AND AUTHENTICATION",
but they did not have their own group name in guc.c.
- enable_group_by_reordering, stats_fetch_consistency and
recovery_prefetch had an inconsistent description, missing a dot at the
end.
- In postgresql.conf.sample, "Process title" should not have a section
of its own, but it should be a subsection of "REPORTING AND LOGGING".
This impacts the contents of pg_settings, which could be seen as a
compatibility break, so no backpatch is done. This is similar to the
cleanup done in a55a984.
Author: Shinya Kato
Discussion: https://postgr.es/m/5e0c9c608624eafbba910c344282cb14@oss.nttdata.com
POSIX shm_open() can sleep for a long time and fail spuriously because
of contention on an internal lock file on Solaris (and presumably
illumos). Commit 389869af fixed the main problem with this, namely that
we could crash, but it's now clear that "posix" is not a good default.
Therefore, choose "sysv" at initdb time on Solaris and illumos. Other
choices are still available by editing the postgresql.conf file.
Back-patch only to 15, because contention is much less likely further
back, and it doesn't seem like a good idea to change this in released
branches. This should clear up the failures on build farm animal
margay.
Discussion: https://postgr.es/m/CA%2BhUKGKqKrCV5xKWfh9rnm%3Do%3DDwZLTLtnsj_XpUi9g5%3DV%2B9oyg%40mail.gmail.com
Introduce a new GUC recovery_prefetch. When enabled, look ahead in the
WAL and try to initiate asynchronous reading of referenced data blocks
that are not yet cached in our buffer pool. For now, this is done with
posix_fadvise(), which has several caveats. Since not all OSes have
that system call, "try" is provided so that it can be enabled where
available. Better mechanisms for asynchronous I/O are possible in later
work.
Set to "try" for now for test coverage. Default setting to be finalized
before release.
The GUC wal_decode_buffer_size limits the distance we can look ahead in
bytes of decoded data.
The existing GUC maintenance_io_concurrency is used to limit the number
of concurrent I/Os allowed, based on pessimistic heuristics used to
infer that I/Os have begun and completed. We'll also not look more than
maintenance_io_concurrency * 4 block references ahead.
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Previously the statistics collector received statistics updates via UDP and
shared statistics data by writing them out to temporary files regularly. These
files can reach tens of megabytes and are written out up to twice a
second. This has repeatedly prevented us from adding additional useful
statistics.
Now statistics are stored in shared memory. Statistics for variable-numbered
objects are stored in a dshash hashtable (backed by dynamic shared
memory). Fixed-numbered stats are stored in plain shared memory.
The header for pgstat.c contains an overview of the architecture.
The stats collector is not needed anymore, remove it.
By utilizing the transactional statistics drop infrastructure introduced in a
prior commit statistics entries cannot "leak" anymore. Previously leaked
statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On
systems with many small relations pgstat_vacuum_stat() could be quite
expensive.
Now that replicas drop statistics entries for dropped objects, it is not
necessary anymore to reset stats when starting from a cleanly shut down
replica.
Subsequent commits will perform some further code cleanup, adapt docs and add
tests.
Bumps PGSTAT_FILE_FORMAT_ID.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@2ndquadrant.com> (in a much earlier version)
Reviewed-By: Arthur Zakirov <a.zakirov@postgrespro.ru> (in a much earlier version)
Reviewed-By: Antonin Houska <ah@cybertec.at> (in a much earlier version)
Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
Discussion: https://postgr.es/m/20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de