This function is able to clear the data associated to an extended
statistics object, making things so as the object looks as
newly-created.
The caller of this function needs the following arguments for the
extended stats to clear:
- The name of the relation.
- The schema name of the relation.
- The name of the extended stats object.
- The schema name of the extended stats object.
- If the stats are inherited or not.
The first two parameters are especially important to ensure a consistent
lookup and ACL checks for the relation on which is based the extended
stats object that will be cleared, relying first on a RangeVar lookup
where permissions are checked without locking a relation, critical to
prevent denial-of-service attacks when using this kind of function (see
also 688dc6299a for a similar concern). The third to fifth arguments
give a way to target the extended stats records to clear.
This has been extracted from a larger patch by the same author, for a
piece which is again useful on its own. I have rewritten large portions
of it. The tests have been extended while discussing this piece,
resulting on what this commit includes. The intention behind this
feature is to add support for the import of extended statistics across
dumps and upgrades, this change building one piece that we will be able
to rely on for the rest of the changes.
Bump catalog version.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
This reverts commit f8d7f29b3e, plus parts of
subsequent commits fixing a typo in a parameter name.
Support for disowned lwlocks was added for the benefit of AIO, to be able to
have content locks "owned" by the AIO subsystem. But as of commit fcb9c977aa,
content locks do not use lwlocks anymore.
It does not seem particularly likely that we need this facility outside of the
AIO use-case, therefore remove the now unused functions.
I did choose to keep the comment added in the aforementioned commit about
lock->owner intentionally being left pointing to the last owner.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/cj5mcjdpucvw4a54hehslr3ctukavrbnxltvuzzhqnimvpju5e@cy3g3mnsefwz
As of commit fcb9c977aa, ForEachLWLockHeldByMe(), introduced in f4ece891fc,
is not used anymore, as content locks are now implemented in bufmgr.c. It
doesn't seem that likely that a new user of the functionality will appear all
that soon, making removal of the function seem like the most sensible path. It
can easily be added back if necessary.
Discussion: https://postgr.es/m/lneuyxqxamqoayd2ntau3lqjblzdckw6tjgeu4574ezwh4tzlg%40noioxkquezdw
Until now buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate efficient implementation of
locks. However, the time for a dedicated buffer content lock implementation
has come:
1) Hint bits are currently set while holding only a share lock. This leads to
having to copy pages while they are being written out if checksums are
enabled, which is not cheap. We would like to add AIO writes, however once
many buffers can be written out at the same time, it gets a lot more
expensive to copy them, particularly because that copy needs to reside in
shared buffers (for worker mode to have access to the buffer).
In addition, modifying buffers while they are being written out can cause
issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
like that, due to filesystem internal checksums getting corrupted.
The solution to this is to require a new share-exclusive lock-level to set
hint bits and to write out buffers, making those operations mutually
exclusive. We could introduce such a lock-level into the generic lwlock
implementation, however it does not look like there would be other users,
and it does add some overhead into important code paths.
2) For AIO writes we need to be able to race-freely check whether a buffer is
undergoing IO and whether an exclusive lock on the page can be acquired. That
is rather hard to do efficiently when the buffer state and the lock state
are separate atomic variables. This is a major hindrance to allowing writes
to be done asynchronously.
3) Buffer locks are by far the most frequently taken locks. Optimizing them
specifically for their use case is worth the effort. E.g. by merging
content locks into buffer locks we will be able to release a buffer lock
and pin in one atomic operation.
4) There are more complicated optimizations, like long-lived "super pinned &
locked" pages, that cannot realistically be implemented with the generic
lwlock implementation.
Therefore implement content locks inside bufmgr.c. The lockstate is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:
1) An additional lock-level share-exclusive has been added. This lock-level
conflicts with exclusive locks and itself, but not share locks.
2) Error recovery for content locks is implemented as part of the already
existing private-refcount tracking mechanism in combination with resowners,
instead of a bespoke mechanism as the case for lwlocks. This means we do
not need to add dedicated error-recovery code paths to release all content
locks (like done with LWLockReleaseAll() for lwlocks).
3) The lock state is embedded in BufferDesc.state instead of having its own
struct.
4) The wakeup logic is a tad more complicated due to needing to support the
additional lock-level
This commit unfortunately introduces some code that is very similar to the
code in lwlock.c, however the code is not equivalent enough to easily merge
it. The future wins that this commit makes possible seem worth the cost.
As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.
It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better
than duplicating the relevant infrastructure.
Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b74). The old BufferContent lwlock
tranche has been removed.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
This patch reworks LISTEN/NOTIFY to avoid waking backends that have
no need to process the notification messages we just sent.
The primary change is to create a shared hash table that tracks
which processes are listening to which channels (where a "channel" is
defined by a database OID and channel name). This allows a notifying
process to accurately determine which listeners are interested,
replacing the previous weak approximation that listeners in other
databases couldn't be interested.
Secondly, if a listener is known not to be interested and is
currently stopped at the old queue head, we avoid waking it at all
and just directly advance its queue pointer past the notifications
we inserted.
These changes permit very significant improvements (integer multiples)
in NOTIFY throughput, as well as a noticeable reduction in latency,
when there are many listeners but only a few are interested in any
specific message. There is no improvement for the simplest case where
every listener reads every message, but any loss seems below the noise
level.
Author: Joel Jacobson <joel@compiler.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com
When an autovacuum worker exits, the launcher needs to be notified
with SIGUSR2, so that it can rebalance and possibly launch a new
worker. The launcher must be notified only after the worker has
finished ProcKill(), so that the worker slot is available for a new
worker. Before this commit, the autovacuum worker was responsible for
that, which required a slightly complicated dance to pass the
launcher's PID from FreeWorkerInfo() to ProcKill() in a global
variable.
Simplify that by moving the responsibility of the signaling to the
postmaster. The postmaster was already doing it when it failed to fork
a worker process, so it seems logical to make it responsible for
notifying the launcher on worker exit too. That's also how the
notification on background worker exit is done.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: li carol <carol.li2025@outlook.com>
Discussion: https://www.postgresql.org/message-id/a5e27d25-c7e7-45d5-9bac-a17c8f462def@iki.fi
With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.
'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets, it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used for them in other places.
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/000301b2-5b81-4938-bdac-90f6eb660843@iki.fi
These routines are useful to perform some basic validation checks on
each object structure, working currently on attribute numbers for
non-expression and expression attnums. These checks could be extended
in the future.
Note that this code is not used yet in the tree, and that these
functions will become handy for an upcoming patch for the import of
extended statistics data. However, they are worth their own independent
change as they are actually useful by themselves, with at least the
extension code argument in mind (or perhaps I am just feeling more
pedantic today).
Extracted from a larger patch by the same author, with many adjustments
and fixes by me.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
This reverts commit f0f2c0c1ae.
The original problem that led to the use of pg_restrict was that MSVC
couldn't handle plain restrict, and defining it to something else
would conflict with its __declspec(restrict) that is used in system
header files. In C11 mode, this is no longer a problem, as MSVC
handles plain restrict. This led to the commit to replace pg_restrict
with restrict. But this did not take C++ into account. Standard C++
does not have restrict, so we defined it as something else (for
example, MSVC supports __restrict). But this then again conflicts
with __declspec(restrict) in system header files. So we have to
revert this attempt. The comments are updated to clarify that the
reason for this is now C++ only.
Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/CAGECzQRoD7chJP1-dneSrhxUJv%2BBRcigoGOO4UwGzaShLot2Yw%40mail.gmail.com
RangeTblEntry.groupexprs was marked with the node attribute
query_jumble_ignore, causing a list of GROUP BY expressions to be
ignored during the query jumbling. For example, these two queries could
be grouped together within the same query ID:
SELECT count(*) FROM t GROUP BY a;
SELECT count(*) FROM t GROUP BY b;
However, as such queries use different GROUP BY clauses, they should be
split across multiple entries.
This fixes an oversight in 247dea89f7, that has introduced an RTE for
GROUP BY clauses. Query IDs are documented as being stable across minor
releases, but as this is a regression new to v18 and that we are still
early in its support cycle, a backpatch is exceptionally done as this
has broken a behavior that exists since query jumbling is supported in
core, since its introduction in pg_stat_statements.
The tests of pg_stat_statements are expanded to cover this area, with
patterns involving GROUP BY and GROUPING clauses.
Author: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxEy2W+tCqC7XuJ94r3ivWsM=onKJp94kRFx3hoARjBeFQ@mail.gmail.com
Backpatch-through: 18
There are two reasons for doing so:
1) It is generally faster to perform checks in a batched fashion and making
sequential scans faster is nice.
2) We would like to stop setting hint bits while pages are being written
out. The necessary locking becomes visible for page mode scans, if done for
every tuple. With batching, the overhead can be amortized to only happen
once per page.
There are substantial further optimization opportunities along these
lines:
- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
instead write an explicitly optimized version that avoids repeated xid
tests.
- Introduce batched version of the serializability test
- Introduce batched version of HeapTupleSatisfiesVacuum
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
Doing this meant that those two headers, which are supposed to be
internal to their corresponding index AMs, were being included pretty
much universally, because tuplesort.h is included by execnodes.h which
is very widely used. Stop that, and fix fallout.
We also change indexing.h to no longer include execnodes.h (tuptable.h
is sufficient), and relscan.h to no longer include buf.h (pointless
since c2fe139c20).
Author: Mario González <gonzalemario@gmail.com>
Discussion: https://postgr.es/m/CAFsReFUcBFup=Ohv_xd7SNQ=e73TXi8YNEkTsFEE2BW7jS1noQ@mail.gmail.com
Previously the instrumentation logic always converted to seconds, only for
many of the callers to do unnecessary division to get to milliseconds. As an
upcoming refactoring will split the Instrumentation struct, utilize instrtime
always to keep things simpler. It's also a bit faster to not have to first
convert to a double in functions like InstrEndLoop(), InstrAggNode().
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com
This is important for Postgres extensions that are written in C++,
such as pg_duckdb, which uses PGXS as the build system currently. In
the autotools build, C++ is not coupled to LLVM. If the autotools
build is configured without --with-llvm, the C++ compiler and the
various flags get persisted into the Makefile.global.
Author: Tristan Partin <tristan@partin.io>
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/D98JHQF7H2A8.VSE3I4CJBTAB%40partin.io
This patch completes the transition to making inet_ops be default for
inet/cidr columns, rather than btree_gist's opclasses. Once we do
that, though, pg_upgrade has a big problem. A dump from an older
version will see btree_gist's opclasses as being default, so it will
not mention the opclass explicitly in CREATE INDEX commands, which
would cause the restore to create the indexes using inet_ops. Since
that's not compatible with what's actually in the files, havoc would
ensue.
This isn't readily fixable, because the CREATE INDEX command strings
are built by the older server's pg_get_indexdef() function; pg_dump
hasn't nearly enough knowledge to modify those strings successfully.
Even if we cared to put in the work to make that happen in pg_dump,
it would be counterproductive because the end goal here is to get
people off of these opclasses. Allowing such indexes to persist
through pg_upgrade wouldn't advance that goal.
Therefore, this patch just adds code to pg_upgrade to detect indexes
that would be problematic and refuse to upgrade.
There's another issue too: even without any indexes to worry about,
pg_dump in binary-upgrade mode will reproduce the "CREATE OPERATOR
CLASS ... DEFAULT" commands for btree_gist's opclasses, and those
will fail because now we have a built-in opclass that provides a
conflicting default. We could ask users to drop the btree_gist
extension altogether before upgrading, but that would carry very
severe penalties. It would affect perfectly-valid indexes for other
data types, and it would drop operators that might be relied on in
views or other database objects. Instead, put a hack in DefineOpClass
to ignore the DEFAULT clauses for these opclasses when in
binary-upgrade mode. This will result in installing a version of
btree_gist that isn't quite the version it claims to be, but that can
be fixed by issuing ALTER EXTENSION UPDATE afterwards.
Since we don't apply that hack when not in binary-upgrade mode,
it is now impossible to install any version of btree_gist
less than 1.9 via CREATE EXTENSION.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
The only user-visible change is the fix in the "malformed
pg_dependencies" error detail. That one is new in commit e1405aa5e3,
so no backpatching required.
This new identifier type provides support for 64-bit unsigned values,
to be used in catalogs, like OIDs. An advantage of a new data type is
that it becomes easier to grep for it in the code when assigning this
type to a catalog attribute, linking it to dedicated APIs and internal
structures.
The following operators are added in this commit, with dedicated tests:
- Casts with integer types and OID.
- btree and hash operators
- min/max functions.
- C type with related macros and defines, named around "Oid8".
This has been mentioned as useful on its own on the thread to add
support for 64-bit TOAST values, so as it becomes possible to attach
this data type to the TOAST code and catalog definitions. However, as
this concept can apply to many more areas, it is implemented as its own
independent change. This is based on a discussion with Andres Freund
and Tom Lane.
Bump catalog version.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>
Discussion: https://postgr.es/m/1891064.1754681536@sss.pgh.pa.us
Background workers gain a new flag, called BGWORKER_INTERRUPTIBLE, that
offers the possibility to terminate the workers when these are connected
to a database that is involved in one of the following commands:
ALTER DATABASE RENAME TO
ALTER DATABASE SET TABLESPACE
CREATE DATABASE
DROP DATABASE
This is useful to give background workers the same behavior as backends
and autovacuum workers, which are stopped when these commands are
executed. The default behavior, that exists since 9.3, is still to
never terminate bgworkers connected to the database involved in any of
these commands. The new flag has to be set to terminate the workers.
A couple of tests are added to worker_spi to track the commands that
impact the termination of the workers. There is a test case for a
non-interruptible worker, additionally, that relies on an injection
point to make the wait time in CountOtherDBBackends() reduced from 5s to
0.3s for faster test runs. The tests rely on the contents of the server
logs to check if a worker has been started or terminated:
- LOG generated by worker_spi_main() at startup, once connection to
database is done.
- FATAL in bgworker_die() when terminated.
A couple of tests run in the CI have showed that this method is stable
enough. The safe_psql() calls that scan pg_stat_activity could be
replaced with some poll_query_until() for more stability, if the current
method proves to be an issue in the buildfarm.
Author: Aya Iwata <iwata.aya@fujitsu.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/OS7PR01MB11964335F36BE41021B62EAE8EAE4A@OS7PR01MB11964.jpnprd01.prod.outlook.com
It's been almost a year since we last did this, and upstream has
been busy. They've added stemmers for Polish and Esperanto,
and also deprecated their old Dutch stemmer in favor of the
Kraaij-Pohlmann algorithm. (The "dutch" stemmer is now the
latter, and "dutch_porter" is the old algorithm.)
Upstream also decided to rename their internal header "header.h"
to something less generic: "snowball_runtime.h". Seems like a good
thing, but it complicates this patch a bit because we were relying on
interposing our own version of "header.h" to control system header
inclusion order. (We're partially failing at that now, because now the
generated stemmer files include <stddef.h> before snowball_runtime.h.
I think that'll be okay, but if the buildfarm complains then we'll
have to do more-extensive editing of the generated files.)
I realized that we weren't documenting the available stemmers in
any user-visible place, except indirectly through sample \dFd output.
That's incomplete because we only provide built-in dictionaries for
the recommended stemmers for each language, not alternative stemmers
such as dutch_porter. So I added a list to the documentation.
I did not do anything with the stopword lists. If those are still
available from snowballstem.org, they are mighty well hidden.
Discussion: https://postgr.es/m/1185975.1767569534@sss.pgh.pa.us
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes are following.
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to
WaitLSNType.
- Add GetCurrentLSNForWaitType() to retrieve the current LSN for each wait
type.
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility.
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally.
Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
Add a new "location" column to the pg_available_extensions and
pg_available_extension_versions views, exposing the directory where
the extension is located.
The default system location is shown as '$system', the same value
that can be used to configure the extension_control_path GUC.
User-defined locations are only visible for super users, otherwise
'<insufficient privilege>' is returned as a column value, the same
behaviour that we already use in pg_stat_activity.
I failed to resist the temptation to do a little extra editorializing of
the TAP test script.
Catalog version bumped.
Author: Matheus Alcantara <mths.dev@pm.me>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Reviewed-By: Rohit Prasad <rohit.prasad@arm.com>
Reviewed-By: Michael Banck <mbanck@gmx.net>
Reviewed-By: Manni Wood <manni.wood@enterprisedb.com>
Reviewed-By: Euler Taveira <euler@eulerto.com>
Reviewed-By: Quan Zongliang <quanzongliang@yeah.net>
Up to now, index amhandlers were expected to produce a new, palloc'd
struct on each call. That requires palloc/pfree overhead, and creates
a risk of memory leaks if the caller fails to pfree, and the time
taken to fill such a large structure isn't nil. Moreover, we were
storing these things in the relcache, eating several hundred bytes for
each cached index. There is not anything in these structs that needs
to vary at runtime, so let's change the definition so that an
amhandler can return a pointer to a "static const" struct of which
there's only one copy per index AM. Mark all the core code's
IndexAmRoutine pointers const so that we catch anyplace that might
still try to change or pfree one.
(This is similar to the way we were already handling TableAmRoutine
structs. This commit does fix one comment that was infelicitously
copied-and-pasted into tableamapi.c.)
This commit needs to be called out in the v19 release notes as an API
change for extension index AMs. An un-updated AM will still work
(as of now, anyway) but it risks memory leaks and will be slower than
necessary.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAEoWx2=vApYk2LRu8R0DdahsPNEhWUxGBZ=rbZo1EXE=uA+opQ@mail.gmail.com
We currently require LLVM 14, so these probes for LLVM 9 functions
always succeeded. Even when the features aren't enabled in an LLVM
build, dummy functions are defined (a problem for a later commit).
The whole PGAC_CHECK_LLVM_FUNCTIONS macro and Meson equivalent are
removed, because we switched to testing LLVM_VERSION_MAJOR at compile
time in subsequent work and these were the last holdouts. That suits
the nature of LLVM API evolution better, and also allows for strictly
mechanical pruning in future commits like 820b5af7 and 972c2cd2. They
advanced the minimum LLVM version but failed to spot these.
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJgB6gvrdDohgwLfCwzVQm%3DVMtb9m0vzQn%3DCwWn-kwG9w%40mail.gmail.com
This new function exposes at SQL level some information related to
multixacts, not available until now. This data is useful for monitoring
purposes, especially for workloads that make a heavy use of multixacts:
- num_mxids, number of MultiXact IDs in use.
- num_members, number of member entries in use.
- members_size, bytes used by num_members in pg_multixact/members/.
- oldest_multixact: oldest MultiXact still needed.
This patch has been originally proposed when MultiXactOffset was still
32 bits, to monitor wraparound. This part is not relevant anymore since
bd8d9c9bdf that has widen MultiXactOffset to 64 bits. The monitoring
of disk space usage for the members is still relevant.
Some tests are added to check this function, in the shape of one
isolation test with concurrent transactions that take a ROW SHARE lock,
and some SQL tests for pg_read_all_stats. Some documentation is added
to explain some patterns that can come from the information provided by
the function.
Bump catalog version.
Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
This function calculates in bytes the storage taken between two
multixact offsets. This will be used in an upcoming patch, introduced
separately here as this piece can be useful on its own.
Author: Naga Appani <nagnrik@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aUyTvZMq2CLgNEB4@paquier.xyz
This routine returned a number of members as a MultiXactOffset,
calculated based on the difference between the next-to-be-assigned
offset and the oldest offset. However, this number is not actually an
offset but a number.
This type confusion comes from the original implementation of
MultiXactMemberFreezeThreshold(), in 53bb309d2d. The number of
members is now defined as a uint64, large enough for MultiXactOffset.
This change will be used in a follow-up patch.
Reviewed-by: Naga Appani <nagnrik@gmail.com>
Discussion: https://postgr.es/m/aUyTvZMq2CLgNEB4@paquier.xyz
When pulling up a subquery, we may need to wrap its targetlist items
in PlaceHolderVars to enforce separate identity or as a result of
outer joins. However, this causes any upper-level WHERE clauses
referencing these outputs to contain PlaceHolderVars, which prevents
indxpath.c from recognizing that they could be matched to index
columns or index expressions, potentially affecting the planner's
ability to use indexes.
To fix, explicitly strip PlaceHolderVars from index operands. A
PlaceHolderVar appearing in a relation-scan-level expression is
effectively a no-op. Nevertheless, to play it safe, we strip only
PlaceHolderVars that are not marked nullable.
The stripping is performed recursively to handle cases where
PlaceHolderVars are nested or interleaved with other node types. To
minimize performance impact, we first use a lightweight walker to
check for the presence of strippable PlaceHolderVars. The expensive
mutator is invoked only if a candidate is found, avoiding unnecessary
memory allocation and tree copying in the common case where no
PlaceHolderVars are present.
Back-patch to v18. Although this issue exists before that, changes in
this version made it common enough to notice. Given the lack of field
reports for older versions, I am not back-patching further.
Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
Presumably, the C type MsgType was meant to hold the protocol message
type in the pre-version-3 era, but this was never fully developed even
then, and the name is pretty confusing nowadays. It has only one
vestigial use for cancel requests that we can get rid of. Since a
cancel request is indicated by a special protocol version number, we
can use the ProtocolVersion type, which MsgType was based on.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/505e76cb-0ca2-4e22-ba0f-772b5dc3f230%40eisentraut.org
Many of the operations done for attribute stats in attribute_stats.c
share the same logic as extended stats, as done by a patch under
discussion to add support for extended stats import and export. All the
pieces necessary for extended statistics are moved to stats_utils.c,
which is the file where common facilities are shared for stats files.
The following renames are done:
* get_attr_stat_type() -> statatt_get_type()
* init_empty_stats_tuple() -> statatt_init_empty_tuple()
* set_stats_slot() -> statatt_set_slot()
* get_elem_stat_type() -> statatt_get_elem_type()
While on it, this commit adds more documentation for all these
functions, describing more their internals and the dependencies that
have been implied for attribute statistics. The same concepts apply to
extended statistics, at some degree.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
If there are any SRFs in a PathTarget, we must separate it into
SRF-computing and SRF-free targets. This is because the executor can
only handle SRFs that appear at the top level of the targetlist of a
ProjectSet plan node.
If we find a subexpression that matches an expression already computed
in the previous plan level, we should treat it like a Var and should
not split it again. setrefs.c will later replace the expression with
a Var referencing the subplan output.
However, when processing the grouping target for grouping sets, the
planner can fail to recognize that an expression is already computed
in the scan/join phase. The root cause is a mismatch in the
nullingrels bits. Expressions in the grouping target carry the
grouping nulling bit in their nullingrels to indicate that they can be
nulled by the grouping step. However, the corresponding expressions
in the scan/join target do not have these bits.
As a result, the exact match check in list_member() fails, leading the
planner to incorrectly believe that the expression needs to be
re-evaluated from its arguments, which are often not available in the
subplan. This can lead to planner errors such as "variable not found
in subplan target list".
To fix, ignore the grouping nulling bit when checking whether an
expression from the grouping target is available in the pre-grouping
input target. This aligns with the matching logic in setrefs.c.
Backpatch to v18, where this issue was introduced.
Bug: #19353
Reported-by: Marian MULLER REBEYROL <marian.muller@serli.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19353-aaa179bba986a19b@postgresql.org
Backpatch-through: 18
Previously logical decoding required wal_level to be set to 'logical'
at server start. This meant that users had to incur the overhead of
logical-level WAL logging even when no logical replication slots were
in use.
This commit adds functionality to automatically control logical
decoding availability based on logical replication slot presence. The
newly introduced module logicalctl.c allows logical decoding to be
dynamically activated when needed when wal_level is set to
'replica'.
When the first logical replication slot is created, the system
automatically increases the effective WAL level to maintain
logical-level WAL records. Conversely, after the last logical slot is
dropped or invalidated, it decreases back to 'replica' WAL level.
While activation occurs synchronously right after creating the first
logical slot, deactivation happens asynchronously through the
checkpointer process. This design avoids a race condition at the end
of recovery; a concurrent deactivation could happen while the startup
process enables logical decoding at the end of recovery, but WAL
writes are still not permitted until recovery fully completes. The
checkpointer will handle it after recovery is done. Asynchronous
deactivation also avoids excessive toggling of the logical decoding
status in workloads that repeatedly create and drop a single logical
slot. On the other hand, this lazy approach can delay changes to
effective_wal_level and the disabling logical decoding, especially
when the checkpointer is busy with other tasks. We chose this lazy
approach in all deactivation paths to keep the implementation simple,
even though laziness is strictly required only for end-of-recovery
cases. Future work might address this limitation either by using a
dedicated worker instead of the checkpointer, or by implementing
synchronous waiting during slot drops if workloads are significantly
affected by the lazy deactivation of logical decoding.
The effective WAL level, determined internally by XLogLogicalInfo, is
allowed to change within a transaction until an XID is assigned. Once
an XID is assigned, the value becomes fixed for the remainder of the
transaction. This behavior ensures that the logging mode remains
consistent within a writing transaction, similar to the behavior of
GUC parameters.
A new read-only GUC parameter effective_wal_level is introduced to
monitor the actual WAL level in effect. This parameter reflects the
current operational WAL level, which may differ from the configured
wal_level setting.
Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct.
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
off_t was previously used for offsets, which is 4 bytes on Windows,
hence limiting the backend code to a hard limit for files longer than
2GB. This leads to some simplification in these files, removing some
casts based on long, also 4 bytes on Windows.
This commit removes one comment introduced in db3c4c3a2d, not relevant
anymore as pgoff_t is a safe 8-byte alternative on Windows.
This change is surprisingly not invasive, as the callers of
BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c,
tuplestore.c, etc.) track offsets in local structures that just to
switch from off_t to pgoff_t for the most part.
The file is still relying on a maximum file size of
MAX_PHYSICAL_FILESIZE (1GB). This change allows the code to make this
maximum potentially larger in the future, or larger on a per-demand
basis.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz