Predicate locks are used on per page basis only if fastupdate = off, in
opposite case predicate lock on pending list will effectively lock whole index,
to reduce locking overhead, just lock a relation. Entry and posting trees are
essentially B-tree, so locks are acquired on leaf pages only.
Author: Shubham Barai with some editorization by me and Dmitry Ivanov
Review by: Alexander Korotkov, Dmitry Ivanov, Fedor Sigaev
Discussion: https://www.postgresql.org/message-id/flat/CALxAEPt5sWW+EwTaKUGFL5_XFcZ0MuGBcyJ70oqbWqr42YKR8Q@mail.gmail.com
Store GID of 2PC in commit/abort WAL records when wal_level = logical.
This allows logical decoding to send the SAME gid to subscribers
across restarts of logical replication.
Track relica origin replay progress for 2PC.
(Edited from patch 0003 in the logical decoding 2PC series.)
Authors: Nikhil Sontakke, Stas Kelvich
Reviewed-by: Simon Riggs, Andres Freund
Currently adding a column to a table with a non-NULL default results in
a rewrite of the table. For large tables this can be both expensive and
disruptive. This patch removes the need for the rewrite as long as the
default value is not volatile. The default expression is evaluated at
the time of the ALTER TABLE and the result stored in a new column
(attmissingval) in pg_attribute, and a new column (atthasmissing) is set
to true. Any existing row when fetched will be supplied with the
attmissingval. New rows will have the supplied value or the default and
so will never need the attmissingval.
Any time the table is rewritten all the atthasmissing and attmissingval
settings for the attributes are cleared, as they are no longer needed.
The most visible code change from this is in heap_attisnull, which
acquires a third TupleDesc argument, allowing it to detect a missing
value if there is one. In many cases where it is known that there will
not be any (e.g. catalog relations) NULL can be passed for this
argument.
Andrew Dunstan, heavily modified from an original patch from Serge
Rielau.
Reviewed by Tom Lane, Andres Freund, Tomas Vondra and David Rowley.
Discussion: https://postgr.es/m/31e2e921-7002-4c27-59f5-51f08404c858@2ndQuadrant.com
If the value of an index expression is unchanged after UPDATE,
allow HOT updates where previously we disallowed them, giving
a significant performance boost in those cases.
Particularly useful for indexes such as JSON->>field where the
JSON value changes but the indexed value does not.
Submitted as "surjective indexes" patch, now enabled by use
of new "recheck_on_update" parameter.
Author: Konstantin Knizhnik
Reviewer: Simon Riggs, with much wordsmithing and some cleanup
Performing JIT compilation for deforming gains performance benefits
over unJITed deforming from compile-time knowledge of the tuple
descriptor. Fixed column widths, NOT NULLness, etc can be taken
advantage of.
Right now the JITed deforming is only used when deforming tuples as
part of expression evaluation (and obviously only if the descriptor is
known). It's likely to be beneficial in other cases, too.
By default tuple deforming is JITed whenever an expression is JIT
compiled. There's a separate boolean GUC controlling it, but that's
expected to be primarily useful for development and benchmarking.
Docs will follow in a later commit containing docs for the whole JIT
feature.
Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
Using the standard bool type provided by C allows some recent compilers
and debuggers to give better diagnostics. Also, some extension code and
third-party headers are increasingly pulling in stdbool.h, so it's
probably saner if everyone uses the same definition.
But PostgreSQL code is not prepared to handle bool of a size other than
1, so we keep our own old definition if we encounter a stdbool.h with a
bool of a different size. (Among current build farm members, this only
applies to old macOS versions on PowerPC.)
To check that the used bool is of the right size, add a static
assertions about size of GinTernaryValue vs bool. This is currently the
only place that assumes that bool and char are of the same size.
Discussion: https://www.postgresql.org/message-id/flat/3a0fe7e1-5ed1-414b-9230-53bbc0ed1f49@2ndquadrant.com
For any interesting JIT target, fields inside structs need to be
accessed. b96d550e contains infrastructure for syncing the definition
of types between postgres C code and runtime code generation with
LLVM. But that doesn't sync the number or names of fields inside
structs, just the types (including padding etc).
One option would be to hardcode the offset numbers in the JIT code,
but that'd be hard to keep in sync. Instead add macros indicating the
field offset to the fields that need to be accessed. Not pretty, but
manageable.
Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
The original coding of the SP-GiST scan traversalValue feature (commit
ccd6eb49a) arranged for traversal values to be stored in the query's main
executor context. That's fine if there's only one index scan per query,
but if there are many, we have a memory leak as successive scans create
new traversal values. Fix it by creating a separate memory context for
traversal values, which we can reset during spgrescan(). Back-patch
to 9.6 where this code was introduced.
In principle, adding the traversalCxt field to SpGistScanOpaqueData
creates an ABI break in the back branches. But I (tgl) have little
sympathy for extensions including spgist_private.h, so I'm not very
worried about that. Alternatively we could stick the new field at the
end of the struct in back branches, but that has its own downsides.
Anton Dignös, reviewed by Alexander Kuzmenkov
Discussion: https://postgr.es/m/CALNdv1jb6y2Te-m8xHLxLX12RsBmZJ1f4hESX7J0HjgyOhA9eA@mail.gmail.com
Instead of embedding the savepoint name in a list and then requiring
complex code to unpack it, just add another struct field to store it
directly.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
We call this thing a "transaction block" everywhere except in a few
functions, where it is mysteriously called a "transaction chain". In
the SQL standard, a transaction chain is something different. So rename
these functions to match the common terminology.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Use IndexTupleSize everywhere, instead. Also, remove IndexTupleSize's
internal typecast, as that's not really needed and might mask coding
errors. Change some pointer variable datatypes in the call sites
to compensate for that and make it clearer what we're assuming.
Ildar Musin, Robert Haas, Stephen Frost
Discussion: https://postgr.es/m/0274288e-9e88-13b6-c61c-7b36928bf221@postgrespro.ru
This patch adds the ability to use "RANGE offset PRECEDING/FOLLOWING"
frame boundaries in window functions. We'd punted on that back in the
original patch to add window functions, because it was not clear how to
do it in a reasonably data-type-extensible fashion. That problem is
resolved here by adding the ability for btree operator classes to provide
an "in_range" support function that defines how to add or subtract the
RANGE offset value. Factoring it this way also allows the operator class
to avoid overflow problems near the ends of the datatype's range, if it
wishes to expend effort on that. (In the committed patch, the integer
opclasses handle that issue, but it did not seem worth the trouble to
avoid overflow failures for datetime types.)
The patch includes in_range support for the integer_ops opfamily
(int2/int4/int8) as well as the standard datetime types. Support for
other numeric types has been requested, but that seems like suitable
material for a follow-on patch.
In addition, the patch adds GROUPS mode which counts the offset in
ORDER-BY peer groups rather than rows, and it adds the frame_exclusion
options specified by SQL:2011. As far as I can see, we are now fully
up to spec on window framing options.
Existing behaviors remain unchanged, except that I changed the errcode
for a couple of existing error reports to meet the SQL spec's expectation
that negative "offset" values should be reported as SQLSTATE 22013.
Internally and in relevant parts of the documentation, we now consistently
use the terminology "offset PRECEDING/FOLLOWING" rather than "value
PRECEDING/FOLLOWING", since the term "value" is confusingly vague.
Oliver Ford, reviewed and whacked around some by me
Discussion: https://postgr.es/m/CAGMVOdu9sivPAxbNN0X+q19Sfv9edEPv=HibOJhB14TJv_RCQg@mail.gmail.com
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
Once this function has been called, we know that all workers have
started and attached to their error queues -- so if any of them
subsequently exit uncleanly, we'll be sure to throw an ERROR promptly.
Otherwise, users of the ParallelContext machinery must be careful not
to wait forever for a worker that has failed to start. Parallel query
manages to work without needing this for reasons explained in new
comments added by this patch, but it's a useful primitive for other
parallel operations, such as the pending patch to make creating a
btree index run in parallel.
Amit Kapila, revised by me. Additional review by Peter Geoghegan.
Discussion: http://postgr.es/m/CAA4eK1+e2MzyouF5bg=OtyhDSX+=Ao=3htN=T-r_6s3gCtKFiw@mail.gmail.com
We have a lot of code in which option names, which from the user's
viewpoint are logically keywords, are passed through the grammar as plain
identifiers, and then matched to string literals during command execution.
This approach avoids making words into lexer keywords unnecessarily. Some
places matched these strings using plain strcmp, some using pg_strcasecmp.
But the latter should be unnecessary since identifiers would have been
downcased on their way through the parser. Aside from any efficiency
concerns (probably not a big factor), the lack of consistency in this area
creates a hazard of subtle bugs due to different places coming to different
conclusions about whether two option names are the same or different.
Hence, standardize on using strcmp() to match any option names that are
expected to have been fed through the parser.
This does create a user-visible behavioral change, which is that while
formerly all of these would work:
alter table foo set (fillfactor = 50);
alter table foo set (FillFactor = 50);
alter table foo set ("fillfactor" = 50);
alter table foo set ("FillFactor" = 50);
now the last case will fail because that double-quoted identifier is
different from the others. However, none of our documentation says that
you can use a quoted identifier in such contexts at all, and we should
discourage doing so since it would break if we ever decide to parse such
constructs as true lexer keywords rather than poor man's substitutes.
So this shouldn't create a significant compatibility issue for users.
Daniel Gustafsson, reviewed by Michael Paquier, small changes by me
Discussion: https://postgr.es/m/29405B24-564E-476B-98C0-677A29805B84@yesql.se
Commit 28724fd90d fixed things so that
if a background worker fails to start due to fork() failure or because
it is terminated before startup succeeds, BGWH_STOPPED will be
reported. However, that only helps if the code that uses the
background worker machinery notices the change in status, and the code
in parallel.c did not.
To fix that, do two things. First, make sure that when a worker
exits, it triggers the leader to read from error queues. That way, if
a worker which has attached to an error queue exits uncleanly, the
leader is sure to throw some error, either the contents of the
ErrorResponse sent by the worker, or "lost connection to parallel
worker" if it exited without sending one. To cover the case where
the worker never starts up in the first place or exits before
attaching to the error queue, the ParallelContext now keeps track
of which workers have sent at least one message via the error
queue. A worker which sends no messages by the time the parallel
operation finishes will be checked to see whether it exited before
attaching to the error queue; if so, a new error message, "parallel
worker failed to initialize", will be reported. If not, we'll
continue to wait until it either starts up and exits cleanly, starts
up and exits uncleanly, or fails to start, and then take the
appropriate action.
Patch by me, reviewed by Amit Kapila.
Discussion: http://postgr.es/m/CA+TgmoYnBgXgdTu6wk5YPdWhmgabYc9nY_pFLq=tB=FSLYkD8Q@mail.gmail.com
The original idea was that we could use an isNull-style bool array
directly as a GinNullCategory array. However, the existing code already
acknowledges that that doesn't really work, because of the possibility
that bool as currently defined can have arbitrary bit patterns for true
values. So it has to loop through the nullFlags array to set each bool
value to an acceptable value. But if we are looping through the whole
array anyway, we might as well build a proper GinNullCategory array
instead and abandon the type casting. That makes the code much safer in
case bool is ever changed to something else.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
The previous commit has shown that the sanity checks around freezing
aren't strong enough. Strengthening them seems especially important
because the existance of the bug has caused corruption that we don't
want to make even worse during future vacuum cycles.
The errors are emitted with ereport rather than elog, despite being
"should never happen" messages, so a proper error code is emitted. To
avoid superflous translations, mark messages as internal.
Author: Andres Freund and Alvaro Herrera
Reviewed-By: Alvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
Backpatch: 9.3-
do_pg_start_backup() expects its callers to pass in an open DIR pointer
for the pg_tblspc directory, but there's no apparent advantage in that.
It complicates the callers without adding any flexibility, and there's no
robustness advantage, since we surely have to be prepared for errors during
the scan of pg_tblspc anyway. In fact, by holding an extra kernel resource
during operations like the preliminary checkpoint, we might be making
things a fraction more failure-prone not less. Hence, remove that argument
and open the directory just for the duration of the actual scan.
Discussion: https://postgr.es/m/28752.1512413887@sss.pgh.pa.us
Originally, we palloc'd this buffer just barely big enough to hold the
largest xlog record seen so far. It turns out that that can result in
valgrind complaints, because some compilers will emit code that assumes
it can safely fetch padding bytes at the end of a struct, and those
padding bytes were unallocated so far as aset.c was concerned. We can
fix that by MAXALIGN'ing the palloc request size, ensuring that it is big
enough to include any possible padding that might've been omitted from
the on-disk record.
An additional objection to the original coding is that it could result in
many repeated palloc cycles, in the worst case where we see a series of
gradually larger xlog records. We can ameliorate that cheaply by
imposing a minimum buffer size that's large enough for most xlog records.
BLCKSZ/2 was chosen after a bit of discussion.
In passing, remove an obsolete comment in struct xl_heap_new_cid that the
combocid field is free due to alignment considerations. Perhaps that was
true at some point, but it's not now.
Back-patch to 9.5 where this code came in.
Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
Previously, executor nodes running in parallel worker processes didn't
have access to the dsm_segment object used for parallel execution. In
order to support resource management based on DSM segment lifetime,
they need that. So create a ParallelWorkerContext object to hold it
and pass it to all InitializeWorker functions.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
The pending list must (for correctness) always be cleaned up by vacuum, and
should (for the avoidance of surprising behavior) always be cleaned up
by an explicit call to gin_clean_pending_list, but cleanup is optional
when inserting. The old logic got this backward: cleanup was forced
if (stats == NULL), but that's going to be *false* when vacuuming and
*true* for inserts.
Masahiko Sawada, reviewed by me.
Discussion: http://postgr.es/m/CAD21AoBLUSyiYKnTYtSAbC+F=XDjiaBrOUEGK+zUXdQ8owfPKw@mail.gmail.com
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
It turns out we misdiagnosed what the real problem was. Revert the
previous changes, because they may have worse consequences going
forward. A better fix is forthcoming.
The simplistic test case is kept, though disabled.
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
This is the last major omission in our domains feature: you can now
make a domain over anything that's not a pseudotype.
The major complication from an implementation standpoint is that places
that might be creating tuples of a domain type now need to be prepared
to apply domain_check(). It seems better that unprepared code fail
with an error like "<type> is not composite" than that it silently fail
to apply domain constraints. Therefore, relevant infrastructure like
get_func_result_type() and lookup_rowtype_tupdesc() has been adjusted
to treat domain-over-composite as a distinct case that unprepared code
won't recognize, rather than just transparently treating it the same
as plain composite. This isn't a 100% solution to the possibility of
overlooked domain checks, but it catches most places.
In passing, improve typcache.c's support for domains (it can now cache
the identity of a domain's base type), and rewrite the argument handling
logic in jsonfuncs.c's populate_record[set]_worker to reduce duplicative
per-call lookups.
I believe this is code-complete so far as the core and contrib code go.
The PLs need varying amounts of work, which will be tackled in followup
patches.
Discussion: https://postgr.es/m/4206.1499798337@sss.pgh.pa.us
There's three categories of changes leading to better performance:
- Splitting the per-attribute part of SendRowDescriptionMessage into a
v2 and a v3 version allows avoiding branches for every attribute.
- Preallocating the size of the buffer to be big enough for all
attributes and then using pq_write* avoids unnecessary buffer
size checks & resizing.
- Reusing a persistently allocated StringInfo for all
SendRowDescriptionMessage() invocations avoids repeated allocations
& reallocations.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complain like
ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b6552242 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
This was intended as infrastructure for weakening VACUUM's locking
requirements, similar to what was done for btree indexes in commit
2ed5b87f96. However, for hash indexes,
it seems that the improvements which are possible are actually
extremely marginal. Furthermore, performing the LSN cross-check will
end up skipping cleanup far more often than is necessary; we only care
about page modifications due to a VACUUM, but the LSN check will fail
if ANY modification has occurred. So, rather than pressing forward
with that "optimization", just rip the LSN field out.
Patch by me, reviewed by Ashutosh Sharma and Amit Kapila
Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
Commit 09cb5c0e7d added a similar
optimization to btree back in 2006, but nobody bothered to implement
the same thing for hash indexes, probably because they weren't
WAL-logged and had lots of other performance problems as well. As
with the corresponding btree case, this eliminates the problem of
potentially needing to refind our position within the page, and cuts
down on pin/unpin traffic as well.
Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen,
Amit Kapila, and me. Some final edits to comments and README by
me.
Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
The bug was caused by not re-reading the control file during crash
recovery restarts, which lead to an attempt to pfree() shared memory
contents. The fix is to re-read the control file, which seems good
anyway.
It's unclear as of this moment, whether we want to keep the
refactoring introduced in the commit referenced above, or come up with
an alternative approach. But fixing the bug in the mean time seems
like a good idea regardless.
A followup commit will introduce regression test coverage for crash
restarts.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool. Use them
in a few places that were randomly referencing the flag bits directly.
In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ. (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)
Masahiko Sawada, reviewed by Doug Doole
Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache. To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.
To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query. The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.
State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Previously we read the control file in multiple places. But soon the
segment size will be configurable and stored in the control file, and
that needs to be available earlier than it currently is needed.
Instead of adding yet another place where it's read, refactor things
so there's a single processing of the control file during startup (in
EXEC_BACKEND that's every individual backend's startup).
Author: Andres Freund
Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de
Issuing a savepoint-related command in a Query message that contains
multiple SQL statements led to a FATAL exit with a complaint about
"unexpected state STARTED". This is a shortcoming of commit 4f896dac1,
which attempted to prevent such misbehaviors in multi-statement strings;
its quick hack of marking the individual statements as "not top-level"
does the wrong thing in this case, and isn't a very accurate description
of the situation anyway.
To fix, let's introduce into xact.c an explicit model of what happens for
multi-statement Query strings. This is an "implicit transaction block
in progress" state, which for many purposes works like the normal
TBLOCK_INPROGRESS state --- in particular, IsTransactionBlock returns true,
causing the desired result that PreventTransactionChain will throw error.
But in case of error abort it works like TBLOCK_STARTED, allowing the
transaction to be cancelled without need for an explicit ROLLBACK command.
Commit 4f896dac1 is reverted in toto, so that we go back to treating the
individual statements as "top level". We could have left it as-is, but
this allows sharpening the error message for PreventTransactionChain
calls inside functions.
Except for getting a normal error instead of a FATAL exit for savepoint
commands, this patch should result in no user-visible behavioral change
(other than that one error message rewording). There are some things
we might want to do in the line of changing the appearance or wording of
error and warning messages around this behavior, which would be much
simpler to do now that it's an explicitly modeled state. But I haven't
done them here.
Although this fixes a long-standing bug, no backpatch. The consequences
of the bug don't seem severe enough to justify the risk that this commit
itself creates some new issue.
Patch by me, but it owes something to previous investigation by
Takayuki Tsunakawa, who also reported the bug in the first place.
Also thanks to Michael Paquier for reviewing.
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F6BE40D@G01JPEXMBYT05
This will be useful for hash partitioning, which needs a way to seed
the hash functions to avoid problems such as a hash index on a hash
partitioned table clumping all values into a small portion of the
bucket space; it's also useful for anything that wants a 64-bit hash
value rather than a 32-bit hash value.
Just in case somebody wants a 64-bit hash value that is compatible
with the existing 32-bit hash values, make the low 32-bits of the
64-bit hash value match the 32-bit hash value when the seed is 0.
Robert Haas and Amul Sul
Discussion: http://postgr.es/m/CA+Tgmoafx2yoJuhCQQOL5CocEi-w_uG4S2xT0EtgiJnPGcHW3g@mail.gmail.com
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes. This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call). That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized. Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.
Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper. ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.
As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
Previously, tuple descriptors were stored in chains keyed by a fixed size
array of OIDs. That meant there were effectively two levels of collision
chain -- one inside and one outside the hash table. Instead, let dynahash.c
look after conflicts for us by supplying a proper hash and equal function
pair.
This is a nice cleanup on its own, but also simplifies followup
changes allowing blessed TupleDescs to be shared between backends
participating in parallel query.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm%3D34GVhOL%2BarUx56yx7OPk7%3DqpGsv3CpO54feqjAwQKm5g%40mail.gmail.com
TupleDesc's attributes were already stored in contiguous memory after the
struct. Go one step further and get rid of the array of pointers to
attributes so that they can be stored in shared memory mapped at different
addresses in each backend. This won't work for TupleDescs with contraints
and defaults, since those point to other objects, but for many purposes
only attributes are needed.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
This is a mechanical change in preparation for a later commit that
will change the layout of TupleDesc. Introducing a macro to abstract
the details of where attributes are stored will allow us to change
that in separate step and revise it in future.
Author: Thomas Munro, editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com