There are some problems with the new way of handling these constraints
that were detected at the last minute, and they require fixes that appear
too invasive to be applying this late in the cycle. Revert this (again)
for now; we'll try again once these problems are fixed.
The following commits are reverted:
b0e96f3119 Catalog not-null constraints
9b581c5341 Disallow changing NO INHERIT status of a not-null constraint
d0ec2ddbe0 Fix not-null constraint test
ac22a9545c Move privilege check to the right place
b0f7dd915b Check stack depth in new recursive functions
3af7217942 Update information_schema definition for not-null constraints
c3709100be Fix propagating attnotnull in multiple inheritance
d9f686a72e Fix restore of not-null constraints with inheritance
d72d32f52d Don't try to assign smart names to constraints
0cd711271d Better handle indirect constraint drops
13daa33fa5 Disallow NO INHERIT not-null constraints on partitioned tables
d45597f72f Disallow direct change of NO INHERIT of not-null constraints
21ac38f498 Fix inconsistencies in error messages
Discussion: https://postgr.es/m/202405110940.joxlqcx4dogd@alvherre.pgsql
This commit extends the backend-side infrastructure of injection points
so that it becomes possible to register some input data when attaching a
point. This private data can be registered along with the function name
and the library name of the callback when attaching a point, and it is
then given as an input argument to the callback. This makes it possible
for modules to pass down custom data at runtime when attaching a point
without having to manage it internally, in a manner consistent with the
callback entry retrieved from the shmem hash table storing the injection
point data.
InjectionPointAttach() gains two arguments, used to define the private
data contents and their size.
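For example, a module could attach a point along these lines (an
illustrative sketch; the point, library, and callback names are made up,
and only the two trailing arguments are new):
typedef struct InjPointCondition
{
	int		pid;	/* example payload: restrict the point to one backend */
} InjPointCondition;
InjPointCondition condition;
condition.pid = MyProcPid;
InjectionPointAttach("my-injection-point",
					 "my_module",		/* library holding the callback */
					 "my_callback",		/* callback function name */
					 &condition,		/* private data given to the callback */
					 sizeof(condition));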
A follow-up commit will rely on this infrastructure to close a race
condition with the injection point detach in the module
injection_points.
While at it, this changes InjectionPointDetach() to return a boolean,
returning false if a point cannot be detached. Noah mentioned this as
useful for implementing more complex tests with concurrent point detach,
together with the automatic detach done for local points in the test
module.
Documentation is adjusted accordingly.
Per discussion with Noah Misch.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/20240509031553.47@rfd.leadboat.com
In commit 25cd2d640 I (tgl) opined that "The additions of the months
and microseconds fields could also overflow, of course. However,
I believe we need no additional checks there; the existing range
checks should catch such cases". This is demonstrably wrong, however,
for the microseconds field, and given that discovery it seems prudent
to be paranoid about the months addition as well.
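The fix boils down to making those additions overflow-aware, roughly
like this (a sketch using the helpers from src/include/common/int.h, not
the literal patch):
if (pg_add_s32_overflow(span1->month, span2->month, &result->month) ||
    pg_add_s64_overflow(span1->time, span2->time, &result->time))
    ereport(ERROR,
            (errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
             errmsg("interval out of range")));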
Report and patch by Joseph Koshakow. As before, back-patch to all
supported branches. (However, the test case doesn't work before
v15 because we didn't allow wider-than-int32 numbers in interval
literals. A variant test could probably be built that fits within
that restriction, but it didn't seem worth the trouble.)
Discussion: https://postgr.es/m/CAAvxfHf77sRHKoEzUw9_cMYSpbpNS2C+J_+8Dq4+0oi8iKopeA@mail.gmail.com
This addresses some post-commit review comments for commits 6185c973,
de3600452, and 9425c596a0, with the following changes:
* Fix JSON_TABLE() syntax documentation to use the term
"path_expression" for JSON path expressions instead of
"json_path_specification" to be consistent with the other SQL/JSON
functions.
* Fix a typo in the example code in JSON_TABLE() documentation.
* Rewrite some newly added comments in jsonpath.h.
* In JsonPathQuery(), add missing cast to int before printing an enum
value.
Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
This improves some error messages emitted by SQL/JSON query functions
by mentioning the column name when available, such as when they are
invoked as part of evaluating JSON_TABLE() columns. To do so, a new
field column_name is added to both JsonFuncExpr and JsonExpr; it is
only populated when creating those nodes for transformed JSON_TABLE()
columns.
While at it, relevant error messages are reworded for clarity.
Reported-by: Jian He <jian.universality@gmail.com>
Suggested-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
BumpContext uses the head block of its 'blocks' list as the current
block from which to allocate new chunks. When we receive an allocation
request larger than allocChunkLimit, we place these chunks on a new
dedicated block and, until now, we pushed that block onto the *head* of
the 'blocks' list.
This behavior caused the previous bump block to no longer be available
for new normal-sized (non-large) allocations and would result in blocks
only being partially filled if a large allocation request arrived before
the block became full.
Here we adjust the code to push these dedicated blocks onto the *tail*
of the 'blocks' list so that the head block remains intact and available
for normal-sized allocation requests until it becomes full.
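In terms of the ilist API the change is essentially the following
(variable names are illustrative):
/* dedicated block for an oversized chunk: append it so the current
 * bump block stays at the head for normal-sized requests */
dlist_push_tail(&set->blocks, &block->node);	/* was dlist_push_head() */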
In passing, make the elog(ERROR) calls for the unsupported callbacks
consistent. Likewise for the header comments for those functions.
Discussion: https://postgr.es/m/CAApHDvp9___r-ayJj0nZ6GD3MeCGwGZ0_6ZptWpwj+zqHtmwCw@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvqerXpzUnuDQfUEi3DZA+9=Ud9WSt3ruxN5b6PcOosx2g@mail.gmail.com
Commit 2ed8f9a01 intended to institute a policy that if a
RangeTblFunction has a coldeflist, then the function return type is
certainly RECORD, and we should use the coldeflist as the source of
truth about what the columns of the record type are. When the
original function has been folded to a constant, inspection of the
constant might give a different answer. This situation will lead to
a tuple-type-mismatch error at execution, but up until that point we
need to consistently believe the coldeflist, or we'll have problems
from different bits of code reaching different conclusions.
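For reference, a coldeflist is the column definition list attached to a
RECORD-returning function in FROM, for example:
SELECT t.a, t.b
FROM json_to_record('{"a": 1, "b": "two"}') AS t(a int, b text);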
expandRTE didn't get that memo though, and would try to produce a
tupdesc based on the constant in this situation, leading to an
assertion failure. (Desultory testing suggests that non-assert
builds often manage to give the expected error, although I also
saw a "cache lookup failed for type 0" error, and it seems at
least possible that a crash could happen.)
Some other callers of get_expr_result_type and get_expr_result_tupdesc
were also being incautious about this. While none of them seem to
have actual bugs, they're working harder than necessary in this case,
besides which it seems safest to have an explicit policy of not using
those functions on an RTE with a coldeflist. Adjust the code
accordingly, and add commentary to funcapi.c about this policy.
Also fix an obsolete comment that claimed "get_expr_result_type()
doesn't know how to extract type info from a RECORD constant".
That hasn't been true since commit d57534740.
Per bug #18422 from Alexander Lakhin.
As with the previous commit, back-patch to all supported branches.
Discussion: https://postgr.es/m/18422-89ca86c8eac5246d@postgresql.org
When matching constraints in AttachPartitionEnsureIndexes() we weren't
testing the constraint type, which could make a UNIQUE key lacking a
not-null constraint incorrectly satisfy a primary key requirement. Fix
this by testing that the constraint types match. (Other possible
mismatches are verified by comparing index properties.)
Discussion: https://postgr.es/m/202402051447.wimb4xmtiiyb@alvherre.pgsql
This addresses a few problems with commit e5da0fe3c2 ("Catalog domain
not-null constraints").
In CREATE DOMAIN, a NOT NULL constraint looks like
CREATE DOMAIN d1 AS int [ CONSTRAINT conname ] NOT NULL
(Before e5da0fe3c2, the constraint name was accepted but ignored.)
But in ALTER DOMAIN, a NOT NULL constraint looks like
ALTER DOMAIN d1 ADD [ CONSTRAINT conname ] NOT NULL VALUE
where VALUE appears in the place where, for a table constraint, the
column name would be.
(This works as of e5da0fe3c2. Before e5da0fe3c2, this syntax
resulted in an internal error.)
But for domains, this latter syntax is confusing and needlessly
inconsistent between CREATE and ALTER. So this changes it to just
ALTER DOMAIN d1 ADD [ CONSTRAINT conname ] NOT NULL
(None of these syntaxes are per SQL standard; we are just living with
the bits of inconsistency that have built up over time.)
In passing, this also changes the psql \dD output to not show not-null
constraints in the column "Check", since it's already shown in the
column "Nullable". This has also been off since e5da0fe3c2.
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/9ec24d7b-633d-463a-84c6-7acff769c9e8%40eisentraut.org
This adjusts various appendStringInfo* function calls to use a more
appropriate and efficient function with the same behavior. For example,
use appendStringInfoChar() rather than appendStringInfo() when appending
a single character, and appendStringInfoString() rather than
appendStringInfo() when no formatting is required.
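Illustrative before/after pairs:
appendStringInfo(&buf, "(");		/* before */
appendStringInfoChar(&buf, '(');	/* after: single character */
appendStringInfo(&buf, "%s", str);	/* before */
appendStringInfoString(&buf, str);	/* after: no formatting needed */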
All adjustments made here are in code that's new to v17, so it makes
sense to fix these now rather than wait a few years and make
backpatching harder.
Discussion: https://postgr.es/m/CAApHDvojY2UvMiO+9_55ArTj10P1LBNJyyoGB+C65BLDNT0GsQ@mail.gmail.com
Reviewed-by: Nathan Bossart, Tom Lane
The bump allocator was recently added in 29f6a959c. Our other
allocators have a similar macro to this, but seemingly the version of
the macro for those allocators is only used in places where the chunk
header is decoded. Since the bump allocator has no chunk header, none
of those functions exist for bump, so the macro is unused. Remove it.
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/5f724fb2-96e1-4f36-b65b-47b337ad432e@eisentraut.org
Let a table AM define custom reloptions for its tables. This allows
specifying AM-specific parameters in the WITH clause when creating a
table.
The reloptions that can be used outside of the table AM are now
extracted into the CommonRdOptions data structure. At the table AM's
discretion, these options can be specified directly by the user or
calculated in some way.
The new test module test_tam_options exercises the ability to set up
custom reloptions and to calculate the fields of CommonRdOptions based
on them.
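As an illustration, this makes something like the following possible
(the access method and option names here are hypothetical):
CREATE TABLE t (a int) USING my_tableam WITH (my_custom_option = 10);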
The code may use some parts from prior work by Hao Wu.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
Currently, get_json_expr_options() does not emit the default values
for QUOTES (KEEP QUOTES) and WRAPPER (WITHOUT WRAPPER). That causes
the deparsed JSON_TABLE() columns, such as those contained in a
view's query, to behave differently when executed than the original
definition. That's because the rules encoded in
transformJsonTableColumns() will choose either JSON_VALUE() or
JSON_QUERY() as the implementation to execute a given column's path
expression depending on the QUOTES and WRAPPER specifications, and
they have slightly different semantics.
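For instance, a column written with the defaults left implicit should
deparse with them spelled out, roughly (illustrative fragment):
-- as written
a text PATH '$.a'
-- as deparsed, with the defaults made explicit
a text PATH '$.a' WITHOUT WRAPPER KEEP QUOTES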
Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxEqhqsfrg_p7EMyo5zak3d767iFDL8vz_4%3DZBHpOtrghw%40mail.gmail.com
The new "log_connection_negotiation" server option causes the server
to print messages to the log when it receives an SSLRequest or
GSSENCRequest packet from the client. Together with "log_connections",
it gives a trace of how a connection and its encryption are
negotiated. Use the option in the libpq_encryption test, to verify
in more detail how libpq negotiates encryption with different
gssencmode and sslmode options.
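In postgresql.conf, that means:
log_connections = on
log_connection_negotiation = on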
This revealed a couple of cases where libpq retries encryption or
authentication, when it should already know that it cannot succeed. I
marked them with XXX comments in the test tables. They only happen
when the connection was going to fail anyway, and only with rare
combinations of options, so they're not serious.
Discussion: https://www.postgresql.org/message-id/CAEze2Wja8VUoZygCepwUeiCrWa4jP316k0mvJrOW4PFmWP0Tcw@mail.gmail.com
Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
at the preliminary stage of optimization, when we are still working with the
expression tree.
Here Cn is the n-th constant expression, 'expr' is a non-constant expression,
and 'op' is an operator which returns a boolean result and has a commuter
(for the case of reverse order of the constant and non-constant parts of the
expression, like 'Cn op expr').
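For example, the transformation turns a qual like
WHERE x = 1 OR x = 2 OR x = 3
into the equivalent
WHERE x = ANY (ARRAY[1, 2, 3])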
Sometimes this transformation can lead to a suboptimal plan. This is why
there is an or_to_any_transform_limit GUC. It specifies the threshold
number of arguments in an OR expression that triggers the OR-to-ANY
transformation. Generally, more groupable OR arguments mean that the
transformation will be more likely to win than to lose.
Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina <lena.ribackina@yandex.ru>
Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
In one multixact.c edge case, we need a mechanism to wait for one
multixact offset to be written before being allowed to read the next
one. We used to handle this case by sleeping for one millisecond and
retrying, but such sleeps have been reported as problematic in
production cases. We can avoid the problem by using a condition
variable: readers sleep on it and then every creator of multixacts
broadcasts into the CV when creation is sufficiently far along.
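With the condition variable API this looks roughly like the following
(a sketch; the CV field, wait event, and helper names are illustrative):
/* reader: wait until the required offset has been written */
ConditionVariablePrepareToSleep(&MultiXactState->nextOffsetCV);
while (!offset_is_written(multi))
	ConditionVariableSleep(&MultiXactState->nextOffsetCV,
						   WAIT_EVENT_MULTIXACT_CREATION);
ConditionVariableCancelSleep();
/* creator: wake any readers once the offset is written */
ConditionVariableBroadcast(&MultiXactState->nextOffsetCV);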
Author: Kyotaro Horiguchi <horikyotajntt@gmail.com>
Reviewed-by: Andrey Borodin <amborodin@acm.org>
Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru
Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
29f6a959c added a bump allocator type for efficient compact allocations.
Here we make use of this for non-bounded tuplesorts to store tuples.
This is very space efficient when storing narrow tuples due to bump.c
not having chunk headers. This means we can fit more tuples in work_mem
before spilling to disk, or perform an in-memory sort touching fewer
cachelines.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
This introduces a bump MemoryContext type. The bump context is best
suited for short-lived memory contexts which require only allocations
of memory and never a pfree or repalloc, which are unsupported.
Memory palloc'd into a bump context has no chunk header. This makes
bump a useful context type when lots of small allocations need to be
done without any need to pfree those allocations. Allocation sizes are
rounded up to the next MAXALIGN boundary, so with this and no chunk
header, allocations are very compact indeed.
Allocations are also very fast as bump does not check any freelists to
try and make use of previously free'd chunks. It just checks if there
is enough room on the current block, and if so it bumps the freeptr
beyond this chunk and returns the value that the freeptr was previously
pointing to. Simple and fast. A new block is malloc'd when there's not
enough space in the current block.
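A minimal usage sketch (assuming a creation function analogous to the
other context types; exact signature aside, the point is that chunks are
only ever released by resetting or deleting the whole context):
MemoryContext bumpcxt = BumpContextCreate(CurrentMemoryContext,
										  "short-lived bump context",
										  ALLOCSET_DEFAULT_SIZES);
char	   *p = MemoryContextAlloc(bumpcxt, 64);	/* no chunk header */
/* ... use p; never pfree(p) or repalloc(p, ...) ... */
MemoryContextDelete(bumpcxt);	/* releases everything at once */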
Code using the bump allocator must take care never to call any functions
which could try to call realloc() (or variants), pfree(),
GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated
chunk. Due to lack of chunk headers, these operations are unsupported.
To increase the chances of catching such issues, when compiled with
MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and
any attempt to perform an unsupported operation will result in an ERROR.
Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported
operation could result in a segfault.
A follow-on commit will implement the first user of bump.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
Reserve 4 bits for MemoryContextMethodID rather than 3. 3 bits did
technically allow a maximum of 8 memory context types; however, we've
opted to reserve some bit patterns, which left us with only 4 slots, all
of which were used.
Here we add another bit which frees up 8 slots for future memory context
types.
In passing, adjust the enum names in MemoryContextMethodID to make it
more clear which ones can be used and which ones are reserved.
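Conceptually, the reservation now looks like this (a sketch of the
relevant constants; the macro names are assumed to match
memutils_memorychunk.h):
#define MEMORY_CONTEXT_METHODID_BITS 4
#define MEMORY_CONTEXT_METHODID_MASK \
	((((uint64) 1) << MEMORY_CONTEXT_METHODID_BITS) - 1)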
Author: Matthias van de Meent, David Rowley
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
This new DDL command splits a single partition into several partitions.
Just like the ALTER TABLE ... MERGE PARTITIONS ... command, new partitions
are created using the createPartitionTable() function with the parent
partition as the template.
This commit comprises a quite naive implementation which works in a single
process and holds an ACCESS EXCLUSIVE LOCK on the parent table during all
the operations, including the tuple routing. This is why this new DDL
command can't be recommended for large partitioned tables under a high
load. However, this implementation comes in handy in certain cases even
as is.
Also, it could be used as a foundation for future implementations with
less locking and possibly parallelism.
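The syntax looks roughly like this for a range-partitioned table (an
illustrative sketch; the bound clauses depend on the partitioning
strategy):
ALTER TABLE measurement SPLIT PARTITION measurement_2024 INTO
	(PARTITION measurement_2024_h1 FOR VALUES FROM ('2024-01-01') TO ('2024-07-01'),
	 PARTITION measurement_2024_h2 FOR VALUES FROM ('2024-07-01') TO ('2025-01-01'));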
Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient. All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value. Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan. Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).
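For example, with an index on (a, b), a query such as the following
(illustrative) can now be satisfied by a single ordered index scan that
terminates early:
SELECT * FROM tab
WHERE a BETWEEN 1 AND 10 AND b = ANY (ARRAY[100, 200, 300])
ORDER BY a, b
LIMIT 10;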
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem). They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions. Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
The BAS_VACUUM ring size has been 256kB since commit d526575f introduced
the mechanism 17 years ago. Commit 1cbbee03 recently made it
configurable but retained the traditional default. The correct default
size has been debated for years, but 256kB is certainly very small.
VACUUM soon needs to write back data it dirtied only 32 blocks ago,
which usually requires flushing the WAL. New experiments in prefetching
pages for VACUUM exacerbated the problem by crashing into dirty data
even sooner. Let's make the default 2MB. That's 1.6% of the default
toy buffer pool size, and 0.2% of 1GB, which would be considered a
small shared_buffers setting for a real system these days. Users are
still free to set the GUC to a different value.
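For example, assuming the GUC added by 1cbbee03 is
vacuum_buffer_usage_limit, the old behavior can be restored per session
or per command:
SET vacuum_buffer_usage_limit = '256kB';
VACUUM (BUFFER_USAGE_LIMIT '256kB') mytable;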
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
The current design behind the automatic generation of the C code and
documentation related to wait events introduced in fa88928470 does not
offer a way to attach new wait events without breaking ABI
compatibility, as all the events are forcibly reordered for each section
in the input file wait_event_names.txt. Adding new wait events to
stable branches is something that has happened in the past, 0b6517a3b7
being a recent example of that with VERSION_FILE_SYNC, so we need a way
to generate any C code for wait events while maintaining compatibility
on stable branches already released.
This commit solves this issue by adding a new region called
"ABI_compatibility" (the keyword could be updated to something else if
someone has a better idea) to each section of wait_event_names.txt, so
that one can add new wait events to stable branches in
wait_event_names.txt while keeping the code ABI-compatible.
"ABI_compatibility" has no impact on the documentation generated: all
the wait events of one section are still alphabetically ordered. LWLock
and Lock sections generate their C code elsewhere, so they do not need
an "ABI_compatibility" region.
For example, let's imagine a wait_event_names.txt like this:
Section: ClassName - Foo
FOO_1 "Waiting in Foo 1"
FOO_2 "Waiting in Foo 2"
ABI_compatibility:
NEW_FOO_1 "Waiting in New Foo 1"
NEW_BAR_1 "Waiting in New Bar 1"
This results in the following enum, where the events in the ABI region
are listed last with the same ordering as in wait_event_names.txt:
typedef enum
{
WAIT_EVENT_FOO_1,
WAIT_EVENT_FOO_2,
WAIT_EVENT_NEW_FOO_1,
WAIT_EVENT_NEW_BAR_1
} WaitEventFoo;
New wait events added in stable branches should be added at the end of
each ABI_compatibility region, and ABI_compatibility should remain empty
on HEAD and unreleased stable branches.
This design has been suggested by Noah Misch and me.
Reported-by: Noah Misch
Author: Bertrand Drouvot
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20240317183114.16@rfd.leadboat.com
JSON_TABLE() allows JSON data to be converted into a relational view
and thus used, for example, in a FROM clause, like other tabular
data. Data to show in the view is selected from a source JSON object
using a JSON path expression to get a sequence of JSON objects that's
called a "row pattern", which becomes the source to compute the
SQL/JSON values that populate the view's output columns. Column
values themselves are computed using JSON path expressions applied to
each of the JSON objects comprising the "row pattern", for which the
SQL/JSON query functions added in 6185c9737c are used.
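A simple example (illustrative):
SELECT jt.*
FROM JSON_TABLE('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]', '$[*]'
				COLUMNS (a int PATH '$.a', b text PATH '$.b')) AS jt;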
To implement JSON_TABLE() as a table function, this augments the
TableFunc and TableFuncScanState nodes that are currently used to
support XMLTABLE() with some JSON_TABLE()-specific fields.
Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN
clauses, which are required to provide more flexibility to extract
data out of nested JSON objects, but they are not implemented here
to keep this commit of manageable size.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewers have included (in no particular order):
Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
This would cause the timestamp values used by emit_log_hook and all the
other log destinations to differ, because the timestamps are reset
before sending the logs to the server and after calling the hook.
This change matters for emit_log_hook when generating log information
with 'n' or 'm' in log_line_prefix through log_status_format(), or when
doing direct calls to get_formatted_log_time() like in the JSON or CSV
log formats.
While at it, this commit fixes a couple of comments related to the
formatted timestamps where the JSON format was not mentioned. Oversight
in dc686681e0, which I noticed while reviewing this patch.
Author: Kambam Vinay, Michael Paquier
Discussion: https://postgr.es/m/CANiRfmsK36A0i8mnQtzaxhSm3CUCimPwJPp4WQNq53OdSNkgWg@mail.gmail.com
This adds errcodes to a set of PANIC and FATAL errors in xlog.c
and relcache.c, which previously had no errcode at all set, in
order to make fleetwide analysis of errorlogs easier. There are
many more ereport/elogs left which could benefit from having an
errcode but this at least makes a dent in the issue.
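The change amounts to attaching an errcode to calls that previously
lacked one, along these lines (a sketch, not a literal excerpt):
/* before */
elog(PANIC, "could not locate a valid checkpoint record");
/* after */
ereport(PANIC,
		(errcode(ERRCODE_DATA_CORRUPTED),
		 errmsg("could not locate a valid checkpoint record")));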
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
pg_wal_replay_wait() is to be used on a standby and waits for a specific
WAL location to be replayed before starting the transaction. This is
useful when the user makes some data changes on the primary and needs a
guarantee of seeing those changes on the standby.
The queue of waiters is stored in a shared memory array sorted by LSN.
During WAL replay, waiters whose LSNs have already been replayed are
removed from the shared memory array and woken up by setting their
latches.
pg_wal_replay_wait() needs to wait without any snapshot held.
Otherwise, the snapshot could prevent the replay of WAL records,
implying a kind of self-deadlock. This is why it is only possible to
implement pg_wal_replay_wait() as a procedure working in a non-atomic
context, not a function.
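Usage on a standby looks like this (illustrative; the target LSN would
typically be captured on the primary with pg_current_wal_insert_lsn()):
CALL pg_wal_replay_wait('0/306EE20');
-- a transaction started after this call sees all changes up to that LSN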
Catversion is bumped.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Break ReadBuffer() up into two steps. StartReadBuffers() and
WaitReadBuffers() give us two main advantages:
1. Multiple consecutive blocks can be read with one system call.
2. Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.
The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.
A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs. Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once. The following commit will add a higher level mechanism
for doing that automatically with a practical interface.
With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:
* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
branch prediction behavior.
This commit replaces that array with the TID store from commit
30e144287a.
Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.
Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.
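The renamed columns can be monitored as usual, e.g.:
SELECT relid::regclass, phase, max_dead_tuple_bytes, dead_tuple_bytes
FROM pg_stat_progress_vacuum;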
For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.
Bump catalog version.
Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com