Our configure script intended to ensure this, but it supposed that
expr(1) would report an error for integer overflow. Maybe that
was true when the code was written (commit 3c6248a82 of 2008-05-02),
but all the modern expr's I tried will deliver bigger-than-int32
results without complaint. Moreover, if you use --with-segsize-blocks
then there's no check at all.
Ideally we'd add a test in configure itself to check that the value
fits in int, but to do that we'd need to suppose that test(1) handles
bigger-than-int32 numbers correctly. Probably modern ones do, but
that's an assumption I could do without; and I'm not too trusting
about meson either. Instead, let's install a static assertion, so
that even people who ignore all the compiler warnings you get from
such values will be forced to confront the fact that it won't work.
This has been hazardous for awhile, but given that we hadn't heard
a complaint about it till now, I don't feel a need to back-patch.
Reported-by: Casey Shobe <casey.allen.shobe@icloud.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/C5DC82D6-C76D-4E8F-BC2E-DF03EFC4FA24@icloud.com
Coverity complains that the return value from gettuple_eval_partition
(stored in variable "datum") in a do..while loop in
WinGetFuncArgInPartition is overwritten when exiting the while
loop. This commit tries to fix the issue by changing the
gettuple_eval_partition call to:
(void) gettuple_eval_partition()
explicitly stating that we discard the return value. We are just
interested in whether we are inside or outside of partition, NULL or
NOT NULL here.
Also enhance some comments for easier code reading.
Reported-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aPCOabSE4VfJLaky%40paquier.xyz
This removes a few static functions and replaces them with 2 functions
which aim to be more reusable. The upper planner's pathkey requirements
can be simplified down to operations which require pathkeys in the same
order as the pathkeys for the given operation, and operations which can
make use of a Path's pathkeys in any order.
Here we also add some short-circuiting to truncate_useless_pathkeys(). At
any point we discover that all pathkeys are useful to a single operation,
we can stop checking the remaining operations as we're not going to be
able to find any further useful pathkeys - they're all possibly useful
already. Adjusting this seems to warrant trying to put the checks
roughly in order of least-expensive-first so that the short-circuits
have the most chance of skipping the more expensive checks.
In passing clean up has_useful_pathkeys() as it seems to have grown a
redundant check for group_pathkeys. This isn't needed as
standard_qp_callback will set query_pathkeys if there's any requirement
to have group_pathkeys. All this code does is waste run-time effort and
take up needless space.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpbsEoTksvW5901MMoZo-hHf78E5up3uDOfkJnxDe_WAw@mail.gmail.com
This fixes an unlikely issue when fetching GROUPING SET results from
their internally stored hash tables. It was possible in rare cases that
the hash iterator would be set up incorrectly which could result in a
crash.
This was introduced in 4d143509c, so backpatch to v18.
Many thanks to Yuri Zamyatin for reporting and helping to debug this
issue.
Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
Commit a1b4f289be improved the hashjoin sizing to also consider the
memory used by BufFiles for batches. The code however had multiple
issues, making it ineffective or not working as expected in some cases.
* The amount of memory needed by buffers was calculated using uint32,
so it would overflow for nbatch >= 262144. If this happened the loop
would exit prematurely and the memory usage would not be reduced.
The nbatch overflow is fixed by reworking the condition to not use a
multiplication at all, so there's no risk of overflow. An explicit
cast was added to a similar calculation in ExecHashIncreaseBatchSize.
* The loop adjusting the nbatch value used hash_table_bytes to calculate
the old/new size, but then updated only space_allowed. The consequence
is the total memory usage was not reduced, but all the memory saved by
reducing the number of batches was used for the internal hash table.
This was fixed by using only space_allowed. This is also more correct,
because hash_table_bytes does not account for skew buckets.
* The code was also doubling multiple parameters (e.g. the number of
buckets for hash table), but was missing overflow protections.
The loop now checks for overflow, and terminates if needed. It'd be
possible to cap the value and continue the loop, but it's not worth
the complexity. And the overflow implies the in-memory hash table is
already very large anyway.
While at it, rework the comment explaining how the memory balancing
works, to make it more concise and easier to understand.
The initial nbatch overflow issue was reported by Vaibhav Jain. The
other issues were noticed by me and Melanie Plageman. Fix by me, with a
lot of review and feedback by Melanie.
Backpatch to 18, where the hashjoin memory balancing was introduced.
Reported-by: Vaibhav Jain <jainva@google.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CABa-Az174YvfFq7rLS+VNKaQyg7inA2exvPWmPWqnEn6Ditr_Q@mail.gmail.com
The URL for "Sorting and Searching Algorithms: A Cookbook"
by Thomas Niemann has started returning 404, and since we
refer to the page for license terms this replaces the now
defunct link with one to the copy on archive.org.
Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/6DED3DEF-875E-4D1D-8F8F-7353D5AF7B79@gmail.com
EvalPlanQualStart() failed to propagate es_partition_directory into
the child EState used for EPQ rechecks. When execution time partition
pruning ran during the EPQ scan, executor code dereferenced a NULL
partition directory and crashed.
Previously, propagating es_partition_directory into the EPQ EState was
unnecessary because CreatePartitionPruneState(), which sets it on
demand, also initialized the exec-pruning context. After commit
d47cbf474, CreatePartitionPruneState() now initializes only the init-
time pruning context, leaving exec-pruning context initialization to
ExecInitNode(). Since EvalPlanQualStart() runs only ExecInitNode() and
not CreatePartitionPruneState(), it can encounter a NULL
es_partition_directory. Other executor fields initialized during
CreatePartitionPruneState() are already copied into the child EState
thanks to commit 8741e48e5d, but es_partition_directory was missed.
Fix by borrowing the parent estate's es_partition_directory in
EvalPlanQualStart(), and by clearing that field in EvalPlanQualEnd()
so the parent remains responsible for freeing the directory.
Add an isolation test permutation that triggers EPQ with execution-
time partition pruning, the case that reproduces this crash.
Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
This commit adjusts RangeVarCallbackForReindexIndex() to handle an
extremely unlikely race condition involving concurrent OID reuse.
In short, if REINDEX INDEX is executed at the same time that the
index is re-created with the same name and OID but a different
parent table OID, we might lock the wrong parent table. To fix,
simply detect when this happens and emit an ERROR. Unfortunately,
we can't gracefully handle this situation because we will have
already locked the index, and we must lock the parent table before
the index to avoid deadlocks.
While at it, I've replaced all but one early return in this
callback function with ERRORs that should be unreachable. While I
haven't verified the presence of a live bug, the checks in question
appear to be unnecessary, and the early returns seem prone to
breaking the parent table locking code in subtle ways. If nothing
else, this simplifies the code a bit.
This is a bug fix and could be back-patched, but given the presumed
rarity of the race condition and the lack of reports, I'm not going
to bother.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
Per-character pg_locale_t APIs. Useful for tsearch parsing and
potentially other places.
Significant overlap with the regc_wc_isalpha() and related functions
in regc_pg_locale.c, but this change leaves those intact for
now.
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
Presently, these functions look up the relation's OID, lock it, and
then check privileges. Not only does this approach provide no
guarantee that the locked relation matches the arguments of the
lookup, but it also allows users to briefly lock relations for
which they do not have privileges, which might enable
denial-of-service attacks. This commit adjusts these functions to
use RangeVarGetRelidExtended(), which is purpose-built to avoid
both of these issues. The new RangeVarGetRelidCallback function is
somewhat complicated because it must handle both tables and
indexes, and for indexes, we must check privileges on the parent
table and lock it first. Also, it needs to handle a couple of
extremely unlikely race conditions involving concurrent OID reuse.
A downside of this change is that the coding doesn't allow for
locking indexes in AccessShare mode anymore; everything is locked
in ShareUpdateExclusive mode. Per discussion, the original choice
of lock levels was intended for a now defunct implementation that
used in-place updates, so we believe this change is okay.
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
Backpatch-through: 18
The log output functionality of log_autovacuum_min_duration applies to
both VACUUM and ANALYZE, so it is not possible to separate the VACUUM
and ANALYZE log output thresholds. Logs are likely to be output only for
VACUUM and not for ANALYZE.
Therefore, we decided to separate the threshold for log output of VACUUM
by autovacuum (log_autovacuum_min_duration) and the threshold for log
output of ANALYZE by autovacuum (log_autoanalyze_min_duration).
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kasahara Tatsuhito <kasaharatt@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CAOzEurQtfV4MxJiWT-XDnimEeZAY+rgzVSLe8YsyEKhZcajzSA@mail.gmail.com
After scanning the line pointers on a heap page during the first phase
of vacuum, we use the information collected to decide whether to use
the assembled freeze plans.
Move this decision logic into a helper function to improve readability.
While here, rename a PruneState member and disambiguate some local
variables in heap_page_prune_and_freeze().
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
Previously WinCheckAndInitializeNullTreatment() used elog() to emit an
error message. ereport() should be used instead because it's a
user-facing error. Also use existing get_func_name() to get a
function's name, rather than own implementation.
Moreover add an assertion to validate winobj parameter, just like
other window function API.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
Instead of emitting a separate XLOG_HEAP2_VISIBLE WAL record for each
page that becomes all-visible in vacuum's third phase, specify the VM
changes in the already emitted XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record.
Visibility checks are now performed before marking dead items unused.
This is safe because the heap page is held under exclusive lock for the
entire operation.
This reduces the number of WAL records generated by VACUUM phase III by
up to 50%.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
log_error() would probably fail completely if used, and would
certainly print garbage for anything that needed to be interpolated
into the message, because it was failing to use the correct printing
subroutine for a va_list argument.
This bug likely went undetected because the error cases this code
is used for are rarely exercised - they only occur when Windows
security API calls fail catastrophically (out of memory, security
subsystem corruption, etc).
The FRONTEND variant can be fixed just by calling vfprintf()
instead of fprintf(). However, there was no va_list variant
of write_stderr(), so create one by refactoring that function.
Following the usual naming convention for such things, call
it vwrite_stderr().
Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAF+pBj8goe4fRmZ0V3Cs6eyWzYLvK+HvFLYEYWG=TzaM+tWPnw@mail.gmail.com
Backpatch-through: 13
Remove a variable that is no longer in use following commit 9a2e2a28.
It's not immediately clear why there were no compiler warnings about
this oversight.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
Commit 71f4c8c6f7 (which implemented DETACH CONCURRENTLY) added code
to create a separate table constraint when a table is detached
concurrently, identical to the partition constraint, on the theory that
such a constraint was needed in case the optimizer had constructed any
query plans that depended on the constraint being there. However, that
theory was apparently bogus because any such plans would be invalidated.
For hash partitioning, those constraints are problematic, because their
expressions reference the OID of the parent partitioned table, to which
the detached table is no longer related; this causes all sorts of
problems (such as inability of restoring a pg_dump of that table, and
the table no longer working properly if the partitioned table is later
dropped).
We'd like to get rid of all those constraints. In fact, for branch
master, do that -- no longer create any substitute constraints.
However, out of fear that some users might somehow depend on these
constraints for other partitioning strategies, for stable branches
(back to 14, which added DETACH CONCURRENTLY), only do it for hash
partitioning.
(If you repeatedly DETACH CONCURRENTLY and then ATTACH a partition, then
with this constraint addition you don't need to scan the table in the
ATTACH step, which presumably is good. But if users really valued this
feature, they would have requested that it worked for non-concurrent
DETACH also.)
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reported-by: Fei Changhong <feichanghong@qq.com>
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/18371-7fef49f63de13f02@postgresql.org
Discussion: https://postgr.es/m/19070-781326347ade7c57@postgresql.org
An assertion in _bt_killitems expected the scan's currPos state to
contain a valid LSN, saved from when currPos's page was initially read.
The assertion failed to account for the fact that even logged relations
can have leaf pages with an invalid LSN when built with wal_level set to
"minimal". Remove the faulty assertion.
Oversight in commit e6eed40e (though note that the assertion was
backpatched to stable branches before 18 by commit 7c319f54).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Matthijs van der Vleuten <postgresql@zr40.nl>
Bug: #19082
Discussion: https://postgr.es/m/19082-628e62160dbbc1c1@postgresql.org
Backpatch-through: 13
An error happening while a slot data is saved on disk in
SaveSlotToPath() could cause a state.tmp file (temporary file holding
the slot state data, renamed to its permanent name at the end of the
function) to remain around after it has been created. This temporary
file is created with O_EXCL, meaning that if an existing state.tmp is
found, its creation would fail. This would prevent the slot data to be
saved, requiring a manual intervention to remove state.tmp before being
able to save again a slot. Possible scenarios where this temporary file
could remain on disk is for example a ENOSPC case (no disk space) while
writing, syncing or renaming it. The bug reports point to a write
failure as the principal cause of the problems.
Using O_TRUNC has been argued back in 2019 as a potential solution to
discard any temporary file that could exist. This solution was rejected
as O_EXCL can also act as a safety measure when saving the slot state,
crash recovery offering cleanup guarantees post-crash. This commit uses
the alternative approach that has been suggested by Andres Freund back
in 2019. When the temporary state file cannot be written, synced,
closed or renamed (note: not when created!), an unlink() is used to
remove the temporary state file while holding the in-progress I/O
LWLock, so as any follow-up attempts to save a slot's data would not
choke on an existing file that remained around because of a previous
failure.
This problem has been reported a few times across the years, going back
to 2019, but for some reason I have never come back to do something
about it and it has been forgotten. A recent report has reminded me
that this was still a problem.
Reported-by: Kevin K Biju <kevinkbiju@gmail.com>
Reported-by: Sergei Kornilov <sk@zsrv.org>
Reported-by: Grigory Smolkin <g.smolkin@postgrespro.ru>
Discussion: https://postgr.es/m/CAM45KeHa32soKL_G8Vk38CWvTBeOOXcsxAPAs7Jt7yPRf2mbVA@mail.gmail.com
Discussion: https://postgr.es/m/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net
Discussion: https://postgr.es/m/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru
Backpatch-through: 13
The processing of the PARALLEL option for VACUUM was not quite
following what the DefElem code had intended. defGetInt32() already has
code to handle missing parameters and returns a perfectly good error
message for when that happens.
Here we get rid of the ExecVacuum() error:
ERROR: parallel option requires a value between 0 and N
and leave defGetInt32() handle it, which will give:
ERROR: parallel requires an integer value
defGetInt32() was already handling the non-integer parameter case, so it
may as well handle the missing parameter case too.
Additionally, parameterize the option name to make translator work easier,
and also use errhint_internal() rather than errhint() for the
BUFFER_USAGE_LIMIT option since there isn't any work for a translator to
do for "%s".
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAApHDvovH14tNWB+WvP6TSbfi7-=TysQ9h5tQ5AgavwyWRWKHA@mail.gmail.com
An error context callback function might leak some memory into
ErrorContext, since those functions are run with ErrorContext as
current context. In the case where the elevel is ERROR, this is
no problem since the code level that catches the error should do
FlushErrorState to clean up, and that will reset ErrorContext.
However, if the elevel is less than ERROR then no such cleanup occurs.
In principle, repeated leaks while emitting log messages or client
notices could accumulate arbitrarily much leaked data, if no ERROR
occurs in the session.
To fix, let errfinish() perform an ErrorContext reset if it is
at the outermost error nesting level. (If it isn't, we'll delay
cleanup until the outermost nesting level is exited.)
The only actual leakage of this sort that I've been able to observe
within our regression tests was recently introduced by commit
f727b63e8. While it seems plausible that there are other such
leaks not reached in the regression tests, the lack of field
reports suggests that they're not a big problem. Accordingly,
I won't take the risk of back-patching this now. We can always
back-patch later if we get field reports of leaks.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/jngsjonyfscoont4tnwi2qoikatpd5hifsg373vmmjvugwiu6g@m6opxh7uisgd
While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.
With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.
Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.
While the backbranches v13 and v14 have a similar issue where
RelationSyncCache persists even after an error when pgoutput is used
via SQL API, we decided not to backport this fix. This decision was
made because v13 is approaching its final minor release, and we won't
have an chance to fix any new issues that might arise. Additionally,
since using pgoutput via SQL API is not a common use case, the risk
outwights the benefit. If we receive bug reports, we can consider
backporting the fixes then.
Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch-through: 15
Some of the buildfarm is still unhappy with WinGetFuncArgInPartition
even after 2273fa32b. While it seems to be just very old compilers,
we can suppress the warnings and arguably make the code more readable
by not initializing these variables till closer to where they are
used. While at it, make a couple of cosmetic comment improvements.
This patch adds support for the ALL SEQUENCES clause in publications,
enabling synchronization/replication of all sequences that is useful for
upgrades.
Publications can now include all sequences via FOR ALL SEQUENCES.
psql enhancements:
\d shows publications for a given sequence.
\dRp indicates if a publication includes all sequences.
ALL SEQUENCES can be combined with ALL TABLES, but not with other options
like TABLE or TABLES IN SCHEMA. We can extend support for more granular
clauses in future.
The view pg_publication_sequences provides information about the mapping
between publications and sequences.
This patch enables publishing of sequences; subscriber-side support will
be added in upcoming patches.
Author: vignesh C <vignesh21@gmail.com>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
SQL/JSON functions such as JSON_VALUE could fail with "unrecognized
node type" errors when a DEFAULT clause contained an explicit COLLATE
expression. That happened because assign_collations_walker() could
invoke exprSetCollation() on a JsonBehavior expression whose DEFAULT
still contained a CollateExpr, which exprSetCollation() does not
handle.
For example:
SELECT JSON_VALUE('{"a":1}', '$.c' RETURNING text
DEFAULT 'A' COLLATE "C" ON EMPTY);
Fix by validating in transformJsonBehavior() that the DEFAULT
expression's collation matches the enclosing JSON expression’s
collation. In exprSetCollation(), replace the recursive call on the
JsonBehavior expression with an assertion that its collation already
matches the target, since the parser now enforces that condition.
Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CACJufxHVwYYSyiVQ6o+PsRX6zQ7rAFinh_fv1kCfTsT1xG4Zeg@mail.gmail.com
Backpatch-through: 17
truncate_useless_pathkeys() seems to have neglected to account for
PathKeys that might be useful for WindowClause evaluation. Modify it so
that it properly accounts for that.
Making this work required adjusting two things:
1. Change from checking query_pathkeys to check sort_pathkeys instead.
2. Add explicit check for window_pathkeys
For #1, query_pathkeys gets set in standard_qp_callback() according to the
sort order requirements for the first operation to be applied after the
join planner is finished, so this changes depending on which upper
planner operations a particular query needs. If the query has window
functions and no GROUP BY, then query_pathkeys gets set to
window_pathkeys. Before this change, this meant PathKeys useful for the
ORDER BY were not accounted for in queries with window functions.
Because of #1, #2 is now required so that we explicitly check to ensure
we don't truncate away PathKeys useful for window functions.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
Previously StrategyGetBuffer() acquired the buffer header spinlock for every
buffer, whether it was reusable or not. If reusable, it'd be returned, with
the lock held, to GetVictimBuffer(), which then would pin the buffer with
PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for
holding spinlocks (i.e. that they are only held for a few lines of consecutive
code) and necessitates using PinBuffer_Locked(), which scales worse than
PinBuffer() due to holding the spinlock. This alone makes it worth changing
the code.
However, the main reason to change this is that a future commit will make
PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain
scalability for the much more common case of pinning a pre-existing buffer. By
pinning the buffer with a single atomic operation, iff the buffer is reusable,
we avoid any potential regression for miss-heavy workloads. There strictly are
fewer atomic operations for each potential buffer after this change.
The price for this improvement is that freelist.c needs two CAS loops and
needs to be able to set up the resource accounting for pinned buffers. The
latter is achieved by exposing a new function for that purpose from bufmgr.c,
that seems better than exposing the entire private refcount infrastructure.
The improvement seems worth the complexity.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().
The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.
Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
BM_PERMANENT is defined as 1U<<31, which is a negative number when interpreted
as a signed integer. Unfortunately the mask variable in BufferSync() was
signed. This has been wrong for a long time, but failed to fail, due to
integer conversion rules.
However, in an upcoming patch the width of the state variable will be
increased, with the wrong signedness leading to never flushing permanent
buffers - luckily caught in a test.
It seems better to fix this separately, instead of doing so as part of a
large, otherwise mechanical, patch.
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff