This addresses some post-commit review comments for commits 6185c973,
de3600452, and 9425c596a0, with the following changes:
* Fix JSON_TABLE() syntax documentation to use the term
"path_expression" for JSON path expressions instead of
"json_path_specification" to be consistent with the other SQL/JSON
functions.
* Fix a typo in the example code in JSON_TABLE() documentation.
* Rewrite some newly added comments in jsonpath.h.
* In JsonPathQuery(), add missing cast to int before printing an enum
value.
Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
This improves some error messages emitted by SQL/JSON query functions
by mentioning column name when available, such as when they are
invoked as part of evaluating JSON_TABLE() columns. To do so, a new
field column_name is added to both JsonFuncExpr and JsonExpr that is
only populated when creating those nodes for transformed JSON_TABLE()
columns.
While at it, relevant error messages are reworded for clarity.
Reported-by: Jian He <jian.universality@gmail.com>
Suggested-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
GlobalVisTestNonRemovableHorizon() was only used for the implementation of
snapshot_too_old, which was removed in f691f5b80a. As using
GlobalVisTestNonRemovableHorizon() is not particularly efficient, no new uses
for it should be added. Therefore remove.
Discussion: https://postgr.es/m/20240415185720.q4dg4dlcyvvrabz4@awork3.anarazel.de
When matching constraints in AttachPartitionEnsureIndexes() we weren't
testing the constraint type, which could make a UNIQUE key lacking a
not-null constraint incorrectly satisfy a primary key requirement. Fix
this by testing that the constraint types match. (Other possible
mismatches are verified by comparing index properties.)
Discussion: https://postgr.es/m/202402051447.wimb4xmtiiyb@alvherre.pgsql
Let table AM define custom reloptions for its tables. This allows specifying
AM-specific parameters by the WITH clause when creating a table.
The reloptions, which could be used outside of table AM, are now extracted
into the CommonRdOptions data structure. These options could be by decision
of table AM directly specified by a user or calculated in some way.
The new test module test_tam_options evaluates the ability to set up custom
reloptions and calculate fields of CommonRdOptions on their base.
The code may use some parts from prior work by Hao Wu.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
29f6a959c added a bump allocator type for efficient compact allocations.
Here we make use of this for non-bounded tuplesorts to store tuples.
This is very space efficient when storing narrow tuples due to bump.c
not having chunk headers. This means we can fit more tuples in work_mem
before spilling to disk, or perform an in-memory sort touching fewer
cacheline.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
This introduces a bump MemoryContext type. The bump context is best
suited for short-lived memory contexts which require only allocations
of memory and never a pfree or repalloc, which are unsupported.
Memory palloc'd into a bump context has no chunk header. This makes
bump a useful context type when lots of small allocations need to be
done without any need to pfree those allocations. Allocation sizes are
rounded up to the next MAXALIGN boundary, so with this and no chunk
header, allocations are very compact indeed.
Allocations are also very fast as bump does not check any freelists to
try and make use of previously free'd chunks. It just checks if there
is enough room on the current block, and if so it bumps the freeptr
beyond this chunk and returns the value that the freeptr was previously
pointing to. Simple and fast. A new block is malloc'd when there's not
enough space in the current block.
Code using the bump allocator must take care never to call any functions
which could try to call realloc() (or variants), pfree(),
GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated
chunk. Due to lack of chunk headers, these operations are unsupported.
To increase the chances of catching such issues, when compiled with
MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and
any attempt to perform an unsupported operation will result in an ERROR.
Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported
operation could result in a segfault.
A follow-on commit will implement the first user of bump.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
Reserve 4 bits for MemoryContextMethodID rather than 3. 3 bits did
technically allow a maximum of 8 memory context types, however, we've
opted to reserve some bit patterns which left us with only 4 slots, all
of which were used.
Here we add another bit which frees up 8 slots for future memory context
types.
In passing, adjust the enum names in MemoryContextMethodID to make it
more clear which ones can be used and which ones are reserved.
Author: Matthias van de Meent, David Rowley
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
This new DDL command splits a single partition into several parititions.
Just like ALTER TABLE ... MERGE PARTITIONS ... command, new patitions are
created using createPartitionTable() function with parent partition as the
template.
This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing. This is why this new DDL command
can't be recommended for large partitioned tables under a high load. However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.
Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient. All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value. Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan. Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem). They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions. Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
JSON_TABLE() allows JSON data to be converted into a relational view
and thus used, for example, in a FROM clause, like other tabular
data. Data to show in the view is selected from a source JSON object
using a JSON path expression to get a sequence of JSON objects that's
called a "row pattern", which becomes the source to compute the
SQL/JSON values that populate the view's output columns. Column
values themselves are computed using JSON path expressions applied to
each of the JSON objects comprising the "row pattern", for which the
SQL/JSON query functions added in 6185c9737c are used.
To implement JSON_TABLE() as a table function, this augments the
TableFunc and TableFuncScanState nodes that are currently used to
support XMLTABLE() with some JSON_TABLE()-specific fields.
Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN
clauses, which are required to provide more flexibility to extract
data out of nested JSON objects, but they are not implemented here
to keep this commit of manageable size.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewers have included (in no particular order):
Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
This is marked PGC_SIGHUP, so it can only be set in a configuration
file, not anywhere else; and it is also marked GUC_DISALLOW_IN_AUTO_FILE,
so it can't be set using ALTER SYSTEM. When set to false, the
ALTER SYSTEM command is disallowed.
There was considerable concern that this would be misinterpreted as
a security feature, which it is not, because a determined superuser
has various ways of bypassing it. Hence, a lot of work has gone into
wordsmithing the documentation, in the hopes of avoiding any such
confusion.
Jelte Fennemia-Nio and Gabriele Bartolini, with wording suggestions
for the documentation from many others.
Discussion: http://postgr.es/m/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com
The user-facing name is "Other Platforms and Clients", but the
internal name seems too focused on clients specifically, especially
given the plan to add a new setting to this session that is about
platform or deployment model compatibility rather than client
compatibility.
Jelte Fennema-Nio
Discussion: http://postgr.es/m/CAGECzQTfMbDiM6W3av+3weSnHxJvPmuTEcjxVvSt91sQBdOxuQ@mail.gmail.com
This adds 3 new variants of the random() function:
random(min integer, max integer) returns integer
random(min bigint, max bigint) returns bigint
random(min numeric, max numeric) returns numeric
Each returns a random number x in the range min <= x <= max.
For the numeric function, the number of digits after the decimal point
is equal to the number of digits that "min" or "max" has after the
decimal point, whichever has more.
The main entry points for these functions are in a new C source file.
The existing random(), random_normal(), and setseed() functions are
moved there too, so that they can all share the same PRNG state, which
is kept private to that file.
Dean Rasheed, reviewed by Jian He, David Zhang, Aleksander Alekseev,
and Tomas Vondra.
Discussion: https://postgr.es/m/CAEZATCV89Vxuq93xQdmc0t-0Y2zeeNQTdsjbmV7dyFBPykbV4Q@mail.gmail.com
Previously, the DSA segment size always started with 1MB and grew up
to DSA_MAX_SEGMENT_SIZE. It was inconvenient in certain scenarios,
such as when the caller desired a soft constraint on the total DSA
segment size, limiting it to less than 1MB.
This commit introduces the capability to specify the initial and
maximum DSA segment sizes when creating a DSA area, providing more
flexibility and control over memory usage.
Reviewed-by: John Naylor, Tomas Vondra
Discussion: https://postgr.es/m/CAD21AoAYGGC1ePjVX0H%2Bpp9rH%3D9vuPK19nNOiu12NprdV5TVJA%40mail.gmail.com
It's now possible to specify a table access method via
CREATE TABLE ... USING for a partitioned table, as well change it with
ALTER TABLE ... SET ACCESS METHOD. Specifying an AM for a partitioned
table lets the value be used for all future partitions created under it,
closely mirroring the behavior of the TABLESPACE option for partitioned
tables. Existing partitions are not modified.
For a partitioned table with no AM specified, any new partitions are
created with the default_table_access_method.
Also add ALTER TABLE ... SET ACCESS METHOD DEFAULT, which reverts to the
original state of using the default for new partitions.
The relcache of partitioned tables is not changed: rd_tableam is not
set, even if a partitioned table has a relam set.
Author: Justin Pryzby <pryzby@telsasoft.com>
Author: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Author: Michaël Paquier <michael@paquier.xyz>
Reviewed-by: The authors themselves
Discussion: https://postgr.es/m/CAE-ML+9zM4wJCGCBGv01k96qQ3gFv4WFcFy=zqPHKeaEFwwv6A@mail.gmail.com
Discussion: https://postgr.es/m/20210308010707.GA29832%40telsasoft.com
Up to now, all of the "catcache list" objects within a catalog cache
were just chained together on a single dlist, requiring O(N) time to
search. Remarkably, we've not had serious performance problems with
that so far; but we got a complaint of a bad performance regression
from v15 in a case with a large number of roles in the system, which
traced down to O(N^2) total time when we probed N catcache lists.
Replace that data structure with a hashtable having an enlargeable
number of dlists, in an exactly parallel way to the data structure
we've used for years for the plain CatCTup cache members. The extra
cost of maintaining a hash table seems negligible, since we were
already computing a hash value for list searches.
Normally this'd be HEAD-only material, but in view of the performance
regression it seems advisable to back-patch into v16. In the v16
version of the patch, leave the dead cc_lists field where it is and
add the new fields at the end of struct catcache, to avoid possible
ABI breakage in case any external code is looking at these structs.
(We assume no external code is actually allocating new catcache
structs.)
Per report from alex work.
Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz=vggErdSG7pv2s6vmmTOLJSg@mail.gmail.com
This adds the X509 attributes notBefore and notAfter to sslinfo
as well as pg_stat_ssl to allow verifying and identifying the
validity period of the current client certificate. OpenSSL has
APIs for extracting notAfter and notBefore, but they are only
supported in recent versions so we have to calculate the dates
by hand in order to make this work for the older versions of
OpenSSL that we still support.
Original patch by Cary Huang with additional hacking by Jacob
and myself.
Author: Cary Huang <cary.huang@highgo.ca>
Co-author: Jacob Champion <jacob.champion@enterprisedb.com>
Co-author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/182b8565486.10af1a86f158715.2387262617218380588@highgo.ca
The new table AM method free_rd_amcache is responsible for freeing all the
memory related to rd_amcache and setting free_rd_amcache to NULL. If the new
method is not specified, we still assume rd_amcache to be a single chunk of
memory, which could be just pfree'd.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
Based on comments from Peter Eisentraut.
* Document CREATE DATABASE ... BUILTIN_LOCALE.
* Determine required encoding based on locale name for CREATE
COLLATION. Use -1 for "C" (requires catversion bump).
* initdb output fixups.
* Make ctype_is_c a constant true for now.
* Fixups to ICU 010_create_database.pl test.
Discussion: https://postgr.es/m/4135cf11-206d-40ed-96c0-9363c1232379@eisentraut.org
New provider for collations, like "libc" or "icu", but without any
external dependency.
Initially, the only locale supported by the builtin provider is "C",
which is identical to the libc provider's "C" locale. The libc
provider's "C" locale has always been treated as a special case that
uses an internal implementation, without using libc at all -- so the
new builtin provider uses the same implementation.
The builtin provider's locale is independent of the server environment
variables LC_COLLATE and LC_CTYPE. Using the builtin provider, the
database collation locale can be "C" while LC_COLLATE and LC_CTYPE are
set to "en_US", which is impossible with the libc provider.
By offering a new builtin provider, it clarifies that the semantics of
a collation using this provider will never depend on libc, and makes
it easier to document the behavior.
Discussion: https://postgr.es/m/ab925f69-5f9d-f85e-b87c-bd2a44798659@joeconway.com
Discussion: https://postgr.es/m/dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
Roles with MAINTAIN on a relation may run VACUUM, ANALYZE, REINDEX,
REFRESH MATERIALIZE VIEW, CLUSTER, and LOCK TABLE on the relation.
Roles with privileges of pg_maintain may run those same commands on
all relations.
This was previously committed for v16, but it was reverted in
commit 151c22deee due to concerns about search_path tricks that
could be used to escalate privileges to the table owner. Commits
2af07e2f74, 59825d1639, and c7ea3f4229 resolved these concerns by
restricting search_path when running maintenance commands.
Bumps catversion.
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/20240305161235.GA3478007%40nathanxps13
... and in particular don't return them as replica identity.
The motivation for this change is letting the primary keys be seen by
code that derives NOT NULL constraints from them, when creating
inheritance children; before this change, if you had a deferrable PK,
pg_dump would not recreate the attnotnull marking properly, because the
column would not be considered as having anything to back said marking
after dropping the throwaway NOT NULL constraint.
The reason we don't want these PKs as replica identities is that
replication can corrupt data, if the uniqueness constraint is
transiently broken.
Reported-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b94QonkgsbDXofakHDnORQNgafd1y3Oa5QXfpQNJyXyQ7A@mail.gmail.com
You might run out of stack space with recursion, which is not nice in
functions that might be used e.g. at cleanup after transaction
abort. MemoryContext contains pointer to parent and siblings, so we
can traverse a tree of contexts iteratively, without using
stack. Refactor the functions to do that.
MemoryContextStats() still recurses, but it now has a limit to how
deep it recurses. Once the limit is reached, it prints just a summary
of the rest of the hierarchy, similar to how it summarizes contexts
with lots of children. That seems good anyway, because a context dump
with hundreds of nested contexts isn't very readable.
Report by Egor Chindyaskin and Alexander Lakhin.
Discussion: https://postgr.es/m/1672760457.940462079%40f306.i.mail.ru
Author: Heikki Linnakangas
Reviewed-by: Robert Haas, Andres Freund, Alexander Korotkov, Tom Lane
This patch provides a way to ensure that physical standbys that are
potential failover candidates have received and flushed changes before
the primary server making them visible to subscribers. Doing so guarantees
that the promoted standby server is not lagging behind the subscribers
when a failover is necessary.
The logical walsender now guarantees that all local changes are sent and
flushed to the standby servers corresponding to the replication slots
specified in 'standby_slot_names' before sending those changes to the
subscriber.
Additionally, the SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to ensure that they process changes for failover slots only after physical
slots specified in 'standby_slot_names' have confirmed WAL receipt for those.
Author: Hou Zhijie and Shveta Malik
Reviewed-by: Masahiko Sawada, Peter Smith, Bertrand Drouvot, Ajin Cherian, Nisha Moond, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
This implements a radix tree data structure based on the design in
"The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases"
by Viktor Leis, Alfons Kemper, and ThomasNeumann, 2013. The main
technique that makes it adaptive is using several different node types,
each with a different capacity of elements, and a different algorithm
for accessing them. The nodes start small and grow/shrink as needed.
The main advantage over hash tables is efficient sorted iteration and
better memory locality when successive keys are lexicographically
close together. The implementation currently assumes 64-bit integer
keys, and traversing the tree is in general slower than a linear
probing hash table, so this is not a general-purpose associative array.
The paper describes two other techniques not implemented here,
namely "path compression" and "lazy expansion". These can further
reduce memory usage and speed up traversal, but the former would add
significant complexity and the latter requires storing the full key
with the value. We do trivially compress the path when leading bytes
of the key are zeros, however.
For value storage, we use "combined pointer/value slots", as
recommended in the paper. Values of size equal or smaller than the the
platform's pointer type are stored in the array of child pointers in
the last level node, while larger values are each stored in a separate
allocation. This is for now fixed at compile time, but it would be
fairly trivial to allow determining at runtime how variable-length
values are stored.
One innovation in our implementation compared to the ART paper is
decoupling the notion of node "size class" from "kind". The size
classes within a given node kind have the same underlying type, but
a variable capacity for children, so we can introduce additional node
sizes with little additional code.
To enable different use cases to specialize for different value types
and for shared/local memory, we use macro-templatized code generation
in the same manner as simplehash.h and sort_template.h.
Future commits will use this infrastructure for storing TIDs.
Patch by Masahiko Sawada and John Naylor, but a substantial amount of
credit is due to Andres Freund, whose proof-of-concept was a valuable
source of coding idioms and awareness of performance pitfalls, and
who reviewed earlier versions.
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
This reverts commit eae7be600b, following a discussion with Tom Lane,
due to concerns that this impacts the decisions made by the planner for
the number of workers spawned based on the inlining and const-folding of
index expressions and predicate for cases that would have worked until
this commit.
Discussion: https://postgr.es/m/162802.1709746091@sss.pgh.pa.us
Backpatch-through: 12
As coded, the planner logic that calculates the number of parallel
workers to use for a parallel index build uses expressions and
predicates from the relcache, which are flattened for the planner by
eval_const_expressions().
As reported in the bug, an immutable parallel-unsafe function flattened
in the relcache would become a Const, which would be considered as
parallel-safe, even if the predicate or the expressions including the
function are not safe in parallel workers. Depending on the expressions
or predicate used, this could cause the parallel build to fail.
Tests are included that check parallel index builds with parallel-unsafe
predicate and expressions. Two routines are added to lsyscache.h to be
able to retrieve expressions and predicate of an index from its pg_index
data.
Reported-by: Alexander Lakhin
Author: Tender Wang
Reviewed-by: Jian He, Michael Paquier
Discussion: https://postgr.es/m/CAHewXN=UaAaNn9ruHDH3Os8kxLVmtWqbssnf=dZN_s9=evHUFA@mail.gmail.com
Backpatch-through: 12
Use GUC_ACTION_SAVE rather than GUC_ACTION_SET, necessary for working
with parallel query.
Now that the call requires more arguments, wrap the call in a new
function to avoid code duplication and offer a place for a comment.
Discussion: https://postgr.es/m/E1rhJpO-0027Wf-9L@gemulon.postgresql.org
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.
The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.
One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
More precisely, what we do here is make the SLRU cache sizes
configurable with new GUCs, so that sites with high concurrency and big
ranges of transactions in flight (resp. multixacts/subtransactions) can
benefit from bigger caches. In order for this to work with good
performance, two additional changes are made:
1. the cache is divided in "banks" (to borrow terminology from CPU
caches), and algorithms such as eviction buffer search only affect
one specific bank. This forestalls the problem that linear searching
for a specific buffer across the whole cache takes too long: we only
have to search the specific bank, whose size is small. This work is
authored by Andrey Borodin.
2. Change the locking regime for the SLRU banks, so that each bank uses
a separate LWLock. This allows for increased scalability. This work
is authored by Dilip Kumar. (A part of this was previously committed as
d172b717c6f4.)
Special care is taken so that the algorithms that can potentially
traverse more than one bank release one bank's lock before acquiring the
next. This should happen rarely, but particularly clog.c's group commit
feature needed code adjustment to cope with this. I (Álvaro) also added
lots of comments to make sure the design is sound.
The new GUCs match the names introduced by bcdfa5f2e2 in the
pg_stat_slru view.
The default values for these parameters are similar to the previous
sizes of each SLRU. commit_ts, clog and subtrans accept value 0, which
means to adjust by dividing shared_buffers by 512 (so 2MB for every 1GB
of shared_buffers), with a cap of 8MB. (A new slru.c function
SimpleLruAutotuneBuffers() was added to support this.) The cap was
previously 1MB for clog, so for sites with more than 512MB of shared
memory the total memory used increases, which is likely a good tradeoff.
However, other SLRUs (notably multixact ones) retain smaller sizes and
don't support a configured value of 0. These values based on
shared_buffers may need to be revisited, but that's an easy change.
There was some resistance to adding these new GUCs: it would be better
to adjust to memory pressure automatically somehow, for example by
stealing memory from shared_buffers (where the caches can grow and
shrink naturally). However, doing that seems to be a much larger
project and one which has made virtually no progress in several years,
and because this is such a pain point for so many users, here we take
the pragmatic approach.
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amul Sul, Gilles Darold, Anastasia Lubennikova,
Ivan Lazarev, Robert Haas, Thomas Munro, Tomas Vondra,
Yura Sokolov, Васильев Дмитрий (Dmitry Vasiliev).
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86@yandex-team.ru
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
The new names are intended to match those in an upcoming patch that adds
a few GUCs to configure the SLRU buffer sizes.
Backwards compatibility concern: this changes the accepted names for
function pg_stat_slru_rest(). Since this function recognizes "any other
string" as a request to reset the entry for "other", this means that
calling it with the old names would silently reset "other" instead of
doing nothing or throwing an error.
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql
Many modern compilers are able to optimize function calls to functions
where the parameters of the called function match a leading subset of
the calling function's parameters. If there are no instructions in the
calling function after the function is called, then the compiler is free
to avoid any stack frame setup and implement the function call as a
"jmp" rather than a "call". This is called sibling call optimization.
Here we adjust the memory allocation functions in mcxt.c to allow this
optimization. This requires moving some responsibility into the memory
context implementations themselves. It's now the responsibility of the
MemoryContext to check for malloc failures. This is good as it both
allows the sibling call optimization, but also because most small and
medium allocations won't call malloc and just allocate memory to an
existing block. That can't fail, so checking for NULLs in that case
isn't required.
Also, traditionally it's been the responsibility of palloc and the other
allocation functions in mcxt.c to check for invalid allocation size
requests. Here we also move the responsibility of checking that into the
MemoryContext. This isn't to allow the sibling call optimization, but
more because most of our allocators handle large allocations separately
and we can just add the size check when doing large allocations. We no
longer check this for non-large allocations at all.
To make checking the allocation request sizes and ERROR handling easier,
add some helper functions to mcxt.c for the allocators to use.
Author: Andres Freund
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20210719195950.gavgs6ujzmjfaiig@alap3.anarazel.de
Thanks to commit 3b00fdba9f, this check in the SIGTERM handler for
the startup process is now obsolete and can be removed. Instead of
leaving around the dead function write_stderr_signal_safe(), I've
opted to just remove it for now.
This partially reverts commit 97550c0711.
Reviewed-by: Andres Freund, Noah Misch
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.
Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.
smgrclose() is now just an alias for smgrrelease(). It closes files
and forgets all state except the rlocator, but keeps the SMgrRelation
object valid.
A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.
The short version:
* smgrclose() is now just an alias for smgrrelease(). It releases
resources, but doesn't destroy until EOX
* smgrdestroy() now frees memory, and should rarely be used.
Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped. Such code would
normally hold a lock on the relation.
This also replaces the "ownership" mechanism of SMgrRelations with a
pin counter. An SMgrRelation can now be "pinned", which prevents it
from being destroyed at end of transaction. There can be multiple pins
on the same SMgrRelation. In practice, the pin mechanism is only used
by the relcache, so there cannot be more than one pin on the same
SMgrRelation. Except with swap_relation_files XXX
Author: Thomas Munro, Heikki Linnakangas
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
This function requires simd.h, which is a rather large dependency
for a widely-used header file like pg_wchar.h. Furthermore, there
is a report of a third-party tool that is struggling to use
pg_wchar.h due to its dependence on simd.h (presumably because
simd.h uses several intrinsics). Moving the function to the much
less popular ascii.h resolves these issues for now.
This commit is back-patched for the benefit of the aforementioned
third-party tool. The simd.h dependency was only added in v16,
but we've opted to back-patch to v15 so that is_valid_ascii() lives
in the same file for all versions where it exists. This could
break existing third-party code that uses the function, but we
couldn't find any examples of such code. It should be possible to
fix any code that this commit breaks by including ascii.h in the
file that uses is_valid_ascii().
Author: Jubilee Young
Reviewed-by: Tom Lane, John Naylor, Andres Freund, Eric Ridge
Discussion: https://postgr.es/m/CAPNHn3oKJJxMsYq%2BqLYzVJOFrUcOr4OF1EC-KtFT-qh8nOOOtQ%40mail.gmail.com
Backpatch-through: 15
This adds a new "Memory:" line under the "Planning:" group (which
currently only has "Buffers:") when the MEMORY option is specified.
In order to make the reporting reasonably accurate, we create a separate
memory context for planner activities, to be used only when this option
is given. The total amount of memory allocated by that context is
reported as "allocated"; we subtract memory in the context's freelists
from that and report that result as "used". We use
MemoryContextStatsInternal() to obtain the quantities.
The code structure to show buffer usage during planning was not in
amazing shape, so I (Álvaro) modified the patch a bit to clean that up
in passing.
Author: Ashutosh Bapat
Reviewed-by: David Rowley, Andrey Lepikhov, Jian He, Andy Fan
Discussion: https://www.postgresql.org/message-id/CAExHW5sZA=5LJ_ZPpRO-w09ck8z9p7eaYAqq3Ks9GDfhrxeWBw@mail.gmail.com
Formerly, these were only supported in to_char(), but there seems
little reason for that restriction. We should at least have enough
support to permit round-tripping the output of to_char().
In that spirit, TZ accepts either zone abbreviations or numeric
(HH or HH:MM) offsets, which are the cases that to_char() can output.
In an ideal world we'd make it take full zone names too, but
that seems like it'd introduce an unreasonable amount of ambiguity,
since the rules for POSIX-spec zone names are so lax.
OF is a subset of this, accepting only HH or HH:MM.
One small benefit of this improvement is that we can simplify
jsonpath's executeDateTimeMethod function, which no longer needs
to consider the HH and HH:MM cases separately. Moreover, letting
it accept zone abbreviations means it will accept "Z" to mean UTC,
which is emitted by JSON.stringify() for example.
Patch by me, reviewed by Aleksander Alekseev and Daniel Gustafsson
Discussion: https://postgr.es/m/1681086.1686673242@sss.pgh.pa.us
This commit implements ithe jsonpath .bigint(), .boolean(),
.date(), .decimal([precision [, scale]]), .integer(), .number(),
.string(), .time(), .time_tz(), .timestamp(), and .timestamp_tz()
methods.
.bigint() converts the given JSON string or a numeric value to
the bigint type representation.
.boolean() converts the given JSON string, numeric, or boolean
value to the boolean type representation. In the numeric case, only
integers are allowed. We use the parse_bool() backend function
to convert a string to a bool.
.decimal([precision [, scale]]) converts the given JSON string
or a numeric value to the numeric type representation. If precision
and scale are provided for .decimal(), then it is converted to the
equivalent numeric typmod and applied to the numeric number.
.integer() and .number() convert the given JSON string or a
numeric value to the int4 and numeric type representation.
.string() uses the datatype's output function to convert numeric
and various date/time types to the string representation.
The JSON string representing a valid date/time is converted to the
specific date or time type representation using jsonpath .date(),
.time(), .time_tz(), .timestamp(), .timestamp_tz() methods. The
changes use the infrastructure of the .datetime() method and perform
the datatype conversion as appropriate. Unlike the .datetime()
method, none of these methods accept a format template and use ISO
DateTime format instead. However, except for .date(), the
date/time related methods take an optional precision to adjust the
fractional seconds.
Jeevan Chalke, reviewed by Peter Eisentraut and Andrew Dunstan.
This adds a Node *escontext parameter to it and a bunch of functions
downstream to it, replacing any ereport()s in that path by either
errsave() or ereturn() as appropriate. This also adds code to those
functions where necessary to return early upon encountering a soft
error.
The changes here are mainly intended to suppress errors in the
functions of jsonfuncs.c. Functions in any external modules, such as
arrayfuncs.c, that those functions may in turn call are not changed
here based on the assumption that the various checks in jsonfuncs.c
functions should ensure that only values that are structurally valid
get passed to the functions in those external modules. An exception
is made for domain_check() to allow handling domain constraint
violation errors softly.
For testing, this adds a function jsonb_populate_record_valid(),
which returns true if jsonb_populate_record() would finish without
causing an error for the provided JSON object, false otherwise. Note
that jsonb_populate_record() internally calls populate_record(),
which in turn uses populate_record_field().
Extracted from a much larger patch to add SQL/JSON query functions.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewers have included (in no particular order) Andres Freund,
Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers,
Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby,
Álvaro Herrera, Jian He, Peter Eisentraut
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
Injection points are a new facility that makes possible for developers
to run custom code in pre-defined code paths. Its goal is to provide
ways to design and run advanced tests, for cases like:
- Race conditions, where processes need to do actions in a controlled
ordered manner.
- Forcing a state, like an ERROR, FATAL or even PANIC for OOM, to force
recovery, etc.
- Arbitrary sleeps.
This implements some basics, and there are plans to extend it more in
the future depending on what's required. Hence, this commit adds a set
of routines in the backend that allows developers to attach, detach and
run injection points:
- A code path calling an injection point can be declared with the macro
INJECTION_POINT(name).
- InjectionPointAttach() and InjectionPointDetach() to respectively
attach and detach a callback to/from an injection point. An injection
point name is registered in a shmem hash table with a library name and a
function name, which will be used to load the callback attached to an
injection point when its code path is run.
Injection point names are just strings, so as an injection point can be
declared and run by out-of-core extensions and modules, with callbacks
defined in external libraries.
This facility is hidden behind a dedicated switch for ./configure and
meson, disabled by default.
Note that backends use a local cache to store callbacks already loaded,
cleaning up their cache if a callback has found to be removed on a
best-effort basis. This could be refined further but any tests but what
we have here was fine with the tests I've written while implementing
these backend APIs.
Author: Michael Paquier, with doc suggestions from Ashutosh Bapat.
Reviewed-by: Ashutosh Bapat, Nathan Bossart, Álvaro Herrera, Dilip
Kumar, Amul Sul, Nazir Bilal Yavuz
Discussion: https://postgr.es/m/ZTiV8tn_MIb_H2rE@paquier.xyz