The code checking whether an aggregate transition value needs to be
reparented into the current context has always only compared the
transition return value with the previous transition value by datum,
i.e. without regard for NULLness. This normally works, because when
the transition function returns NULL (via fcinfo->isnull), it'll
return a value that won't be the same as its input value.
But there's no hard requirement that that's the case. And it turns
out, it's possible to hit this case (see discussion or reproducers),
leading to a non-null transition value not being reparented, followed
by a crash caused by that.
Instead of adding another comparison of NULLness, instead have
ExecAggTransReparent() ensure that pergroup->transValue ends up as 0
when the new transition value is NULL. That avoids having to add an
additional branch to the much more common cases of the transition
function returning the old transition value (which is a pointer in
this case), and when the new value is different, but not NULL.
In branches since 69c3936a14, also deduplicate the reparenting code
between the expression evaluation based transitions, and the path for
ordered aggregates.
Reported-By: Teodor Sigaev, Nikita Glukhov
Author: Andres Freund
Discussion: https://postgr.es/m/bd34e930-cfec-ea9b-3827-a8bc50891393@sigaev.ru
Backpatch: 9.4-, this issue has existed since at least 7.4
Commit 9b63c13f0 turns out to have been fundamentally misguided:
the parent node's subPlan list is by no means the only way in which
a child SubPlan node can be hooked into the outer execution state.
As shown in bug #16213 from Matt Jibson, we can also get short-lived
tuple table slots added to the outer es_tupleTable list. At this point
I have little faith that there aren't other possible connections as
well; the long time it took to notice this problem shows that this
isn't a heavily-exercised situation.
Therefore, revert that fix, returning to the coding that passed a
NULL parent plan pointer down to the transiently-built subexpressions.
That gives us a pretty good guarantee that they won't hook into the
outer executor state in any way. But then we need some other solution
to make SubPlans work. Adopt the solution speculated about in the
previous commit's log message: do expression initialization at plan
startup for just those VALUES rows containing SubPlans, abandoning the
goal of reclaiming memory intra-query for those rows. In practice it
seems unlikely that queries containing a vast number of VALUES rows
would be using SubPlans in them, so this should not give up much.
(BTW, this test case also refutes my claim in connection with the prior
commit that the issue only arises with use of LATERAL. That was just
wrong: some variants of SubLink always produce SubPlans.)
As with previous patch, back-patch to all supported branches.
Discussion: https://postgr.es/m/16213-871ac3bc208ecf23@postgresql.org
A view with conditional INSTEAD rules and no unconditional INSTEAD
rules or INSTEAD OF triggers is not auto-updatable. Previously we
relied on a check in the executor to catch this, but that's
problematic since the planner may fail to properly handle such a query
and thus return a particularly unhelpful error to the user, before
reaching the executor check.
Instead, trap this in the rewriter and report the correct error there.
Doing so also allows us to include more useful error detail than the
executor check can provide. This doesn't change the existing behaviour
of updatable views; it merely ensures that useful error messages are
reported when a view isn't updatable.
Per report from Pengzhou Tang, though not adopting that suggested fix.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAG4reAQn+4xB6xHJqWdtE0ve_WqJkdyCV4P=trYr4Kn8_3_PEA@mail.gmail.com
Our algorithm for choosing batch numbers turned out not to work
effectively for multi-billion key inner relations. We would use
more hash bits than we have, and effectively concentrate all tuples
into a smaller number of batches than we intended. While ideally
we should switch to wider hashes, for now, change the algorithm to
one that effectively gives up bits from the bucket number when we
don't have enough bits. That means we'll finish up with longer
bucket chains than would be ideal, but that's better than having
batches that don't fit in work_mem and can't be divided.
Batch-patch to all supported releases.
Author: Thomas Munro
Reviewed-by: Tom Lane, thanks also to Tomas Vondra, Alvaro Herrera, Andres Freund for testing and discussion
Reported-by: James Coleman
Discussion: https://postgr.es/m/16104-dc11ed911f1ab9df%40postgresql.org
Revert part of commit 19df1702f5.
Early shutdown was added by that commit so that we could collect
statistics from workers, but unfortunately, it interacted badly with
rescans. The problem is that we ended up destroying the parallel context
which is required for rescans. This leads to rescans of a Limit node over
a Gather node to produce unpredictable results as it tries to access
destroyed parallel context. By reverting the early shutdown code, we
might lose statistics in some cases of Limit over Gather [Merge], but that
will require further study to fix.
Reported-by: Jerry Sievers
Diagnosed-by: Thomas Munro
Author: Amit Kapila, testcase by Vignesh C
Backpatch-through: 9.6
Discussion: https://postgr.es/m/87ims2amh6.fsf@jsievers.enova.com
Call ExecShutdownNode() after ExecutePlan()'s loop, rather than at each
break. We had forgotten to do that in one case. The omission caused
intermittent "temporary file leak" warnings from multi-batch parallel
hash joins with a LIMIT clause.
Back-patch to 11. Though the problem exists in theory in earlier
parallel query releases, nothing really depended on it.
Author: Kyotaro Horiguchi
Reviewed-by: Thomas Munro, Amit Kapila
Discussion: https://postgr.es/m/20191111.212418.2222262873417235945.horikyota.ntt%40gmail.com
Avoid creating transiently-inconsistent slot states where possible,
by not setting TTS_FLAG_SHOULDFREE until after the slot actually has
a free'able tuple pointer, and by making sure that we reset tts_nvalid
and related derived state before we replace the tuple contents. This
would only matter if something were to examine the slot after we'd
suffered some kind of error (e.g. out of memory) while manipulating
the slot. We typically don't do that, so these changes might just be
cosmetic --- but even if so, it seems like good future-proofing.
Also remove some redundant Asserts, and add a couple for consistency.
Back-patch to v12 where all this code was rewritten.
Discussion: https://postgr.es/m/16095-c3ff2e5283b8dba5@postgresql.org
In the course of 5567d12ce0, 356687bd8 and 317ffdfeaa, I changed
BuildTupleHashTable[Ext]'s call to ExecBuildGroupingEqual to not pass
in the parent node, but NULL. Which in turn prevents the tuple
equality comparator from being JIT compiled. While that fixes
bug #15486, it is not actually necessary after all of the above commits,
as we don't re-build the comparator when using the new
BuildTupleHashTableExt() interface (as the content of the hashtable
are reset, but the TupleHashTable itself is not).
Therefore re-allow jit compilation for callers that use
BuildTupleHashTableExt with a separate context for "metadata" and
content.
As in the previous commit, there's ongoing work to make this easier to
test to prevent such regressions in the future, but that
infrastructure is not going to be backpatchable.
The performance impact of not JIT compiling hashtable equality
comparators can be substantial e.g. for aggregation queries that
aggregate a lot of input rows to few output rows (when there are a lot
of output groups, there will be fewer comparisons).
Author: Andres Freund
Discussion: https://postgr.es/m/20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de
Backpatch: 11, just as 5567d12ce0
For many queries the fact that the tuple descriptor from the lower
node was not taken into account when determining whether the type of a
slot is fixed, lead to tuple deforming for such upper nodes not to be
JIT accelerated.
I broke this in 675af5c01e.
There is ongoing work to enable writing regression tests for related
behavior (including a patch that would have detected this
regression), by optionally showing such details in EXPLAIN. But as it
seems unlikely that that will be suitable for stable branches, just
merge the fix for now.
While it's fairly close to the 12 release window, the fact that 11
continues to perform JITed tuple deforming in these cases, that
there's still cases where we do so in 12, and the fact that the
performance regression can be sizable, weigh in favor of fixing it
now.
Author: Andres Freund
Discussion: https://postgr.es/m/20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de
Backpatch: 12-, where 675af5c01e was merged.
The code used the destination slot's natts where it intended to
use the source slot's natts. Adding an Assert shows that there
is no case in "make check-world" where these counts are different,
so maybe this is a harmless bug, but it's still a bug.
Takayuki Tsunakawa
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1FD34C0E@G01JPEXMBYT05
In ad0bda5d24 I changed the EvalPlanQual machinery to store
substitution tuples in slot, instead of using plain HeapTuples. The
main motivation for that was that using HeapTuples will be inefficient
for future tableams. But it turns out that that conversion was buggy
for non-locking rowmarks - the wrong tuple descriptor was used to
create the slot.
As a secondary issue 5db6df0c0 changed ExecLockRows() to begin EPQ
earlier, to allow to fetch the locked rows directly into the EPQ
slots, instead of having to copy tuples around. Unfortunately, as Tom
complained, that forces some expensive initialization to happen
earlier.
As a third issue, the test coverage for EPQ was clearly insufficient.
Fixing the first issue is unfortunately not trivial: Non-locked row
marks were fetched at the start of EPQ, and we don't have the type
information for the rowmarks available at that point. While we could
change that, it's not easy. It might be worthwhile to change that at
some point, but to fix this bug, it seems better to delay fetching
non-locking rowmarks when they're actually needed, rather than
eagerly. They're referenced at most once, and in cases where EPQ
fails, might never be referenced. Fetching them when needed also
increases locality a bit.
To be able to fetch rowmarks during execution, rather than
initialization, we need to be able to access the active EPQState, as
that contains necessary data. To do so move EPQ related data from
EState to EPQState, and, only for EStates creates as part of EPQ,
reference the associated EPQState from EState.
To fix the second issue, change EPQ initialization to allow use of
EvalPlanQualSlot() to be used before EvalPlanQualBegin() (but
obviously still requiring EvalPlanQualInit() to have been done).
As these changes made struct EState harder to understand, e.g. by
adding multiple EStates, significantly reorder the members, and add a
lot more comments.
Also add a few more EPQ tests, including one that fails for the first
issue above. More is needed.
Reported-By: yi huang
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion:
https://postgr.es/m/CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@mail.gmail.comhttps://postgr.es/m/24530.1562686693@sss.pgh.pa.us
Backpatch: 12-, where the EPQ changes were introduced
Commit bf6c614a2 rearranged the lookup of the comparison operators
needed in a hashed subplan, and in so doing, broke the cross-type
case: it caused the original LHS-vs-RHS operator to be used to compare
hash table entries too (which of course are all of the RHS type).
This leads to C functions being passed a Datum that is not of the
type they expect, with the usual hazards of crashes and unauthorized
server memory disclosure.
For the set of hashable cross-type operators present in v11 core
Postgres, this bug is nearly harmless on 64-bit machines, which
may explain why it escaped earlier detection. But it is a live
security hazard on 32-bit machines; and of course there may be
extensions that add more hashable cross-type operators, which
would increase the risk.
Reported by Andreas Seltenreich. Back-patch to v11 where the
problem came in.
Security: CVE-2019-10209
In 5f32b29c18 I changed the creation of HashState.hashkeys to
actually use HashState as the parent (instead of HashJoinState, which
was incorrect, as they were executed below HashState), to fix the
problem of hashkeys expressions otherwise relying on slot types
appropriate for HashJoinState, rather than HashState as would be
correct. That reliance was only introduced in 12, which is why it
previously worked to use HashJoinState as the parent (although I'd be
unsurprised if there were problematic cases).
Unfortunately that's not a sufficient solution, because before this
commit, the to-be-hashed expressions referenced inner/outer as
appropriate for the HashJoin, not Hash. That didn't have obvious bad
consequences, because the slots containing the tuples were put into
ecxt_innertuple when hashing a tuple for HashState (even though Hash
doesn't have an inner plan).
There are less common cases where this can cause visible problems
however (rather than just confusion when inspecting such executor
trees). E.g. "ERROR: bogus varno: 65000", when explaining queries
containing a HashJoin where the subsidiary Hash node's hash keys
reference a subplan. While normally hashkeys aren't displayed by
EXPLAIN, if one of those expressions references a subplan, that
subplan may be printed as part of the Hash node - which then failed
because an inner plan was referenced, and Hash doesn't have that.
It seems quite possible that there's other broken cases, too.
Fix the problem by properly splitting the expression for the HashJoin
and Hash nodes at plan time, and have them reference the proper
subsidiary node. While other workarounds are possible, fixing this
correctly seems easy enough. It was a pretty ugly hack to have
ExecInitHashJoin put the expression into the already initialized
HashState, in the first place.
I decided to not just split inner/outer hashkeys inside
make_hashjoin(), but also to separate out hashoperators and
hashcollations at plan time. Otherwise we would have ended up having
two very similar loops, one at plan time and the other during executor
startup. The work seems to more appropriately belong to plan time,
anyway.
Reported-By: Nikita Glukhov, Alexander Korotkov
Author: Andres Freund
Reviewed-By: Tom Lane, in an earlier version
Discussion: https://postgr.es/m/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR+teQ_8tEXU8mxg@mail.gmail.com
Backpatch: 12-
These were introduced by pgindent due to fixe to broken
indentation (c.f. 8255c7a5ee). Previously the mis-indentation of
function prototypes was creatively used to reduce indentation in a few
places.
As that formatting only exists in master and REL_12_STABLE, it seems
better to fix it in both, rather than having some odd indentation in
v12 that somebody might copy for future patches or such.
Author: Andres Freund
Discussion: https://postgr.es/m/20190728013754.jwcbe5nfyt3533vx@alap3.anarazel.de
Backpatch: 12-
Since 15d8f8312 we assert that - and since 7ef04e4d2c, 4da597edf1
rely on - the slot type for an expression's
ecxt_{outer,inner,scan}tuple not changing, unless explicitly flagged
as such. That allows to either skip deforming (for a virtual tuple
slot) or optimize the code for JIT accelerated deforming
appropriately (for other known slot types).
This assumption was sometimes violated for grouping sets, when
nodeAgg.c internally uses tuplesorts, and the child node doesn't
return a TTSOpsMinimalTuple type slot. Detect that case, and flag that
the outer slot might not be "fixed".
It's probably worthwhile to optimize this further in the future, and
more granularly determine whether the slot is fixed. As we already
instantiate per-phase transition and equal expressions, we could
cheaply set the slot type appropriately for each phase. But that's a
separate change from this bugfix.
This commit does include a very minor optimization by avoiding to
create a slot for handling tuplesorts, if no such sorts are
performed. Previously we created that slot unnecessarily in the common
case of computing all grouping sets via hashing. The code looked too
confusing without that, as the conditions for needing a sort slot and
flagging that the slot type isn't fixed, are the same.
Reported-By: Ashutosh Sharma
Author: Andres Freund
Discussion: https://postgr.es/m/CAE9k0PmNaMD2oHTEAhRyxnxpaDaYkuBYkLa1dpOpn=RS0iS2AQ@mail.gmail.com
Backpatch: 12-, where the bug was introduced in 15d8f8312
After 277cb78983 ON CONFLICT ... SET ... RETURNING failed with
ERROR: virtual tuple table slot does not have system attributes
when taking the update path, as the slot used to insert into the
table (and then process RETURNING) was defined to be a virtual slot in
that commit. Virtual slots don't support system columns except for
tableoid and ctid, as the other system columns are AM dependent.
Fix that by using a slot of the table's type. Add tests for system
column accesses in ON CONFLICT ... RETURNING.
Reported-By: Roby, bisected to the relevant commit by Jeff Janes
Author: Andres Freund
Discussion: https://postgr.es/m/73436355-6432-49B1-92ED-1FE4F7E7E100@finefun.com.au
Backpatch: 12-, where the bug was introduced in 277cb78983
There were a number of issues in the recent commits which include typos,
code and comments mismatch, leftover function declarations. Fix them.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin, Amit Kapila and Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
We used the same slot to store a tuple from the index, and to store a
tuple from the table. That's not OK. It worked with the heap, because
heapam_getnextslot() stores a HeapTuple to the slot, and doesn't care how
large the tts_values/nulls arrays are. But when I played with a toy table
AM implementation that used a virtual tuple, it caused memory overruns.
In the passing, tidy up comments on the ioss_PscanLen fields.
86b85044e rewrote how COPY FROM works to allow multiple tuple buffers to
exist to once thus allowing multi-inserts to be used in more cases with
partitioned tables. That commit neglected to update the estate's
es_result_relation_info when flushing the insert buffer to the partition
making it possible for the index tuples to be added into an index on the
wrong partition.
Fix this and also add an Assert in ExecInsertIndexTuples to help ensure
that we never make this mistake again.
Reported-by: Haruka Takatsuka
Author: Ashutosh Sharma
Discussion: https://postgr.es/m/15832-b1bf336a4ee246b5@postgresql.org
Some of the wrapper functions didn't match the callback names. Many of
them due to staying "consistent" with historic naming of the wrapped
functionality. We decided that for most cases it's more important to
be for tableam to be consistent going forward, than with the past.
The one exception is beginscan/endscan/... because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).
Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
When there were duplicate columns in the hash key list, the array
sizes could be miscomputed, resulting in access off the end of the
array. Adjust the computation to ensure the array is always large
enough.
(I considered whether the duplicates could be removed in planning, but
I can't rule out the possibility that duplicate columns might have
different hash functions assigned. Simpler to just make sure it works
at execution time regardless.)
Bug apparently introduced in fc4b3dea2 as part of narrowing down the
tuples stored in the hashtable. Reported by Colm McHugh of Salesforce,
though I didn't use their patch. Backpatch back to version 10 where
the bug was introduced.
Discussion: https://postgr.es/m/CAFeeJoKKu0u+A_A9R9316djW-YW3-+Gtgvy3ju655qRHR3jtdA@mail.gmail.com
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
This code was still using the old style of forming a heap tuple rather
than using tuple slots. This would be less efficient if a non-heap
access method was used. And using tuple slots is actually quite a bit
faster when using heap as well.
Also add some test cases for generated columns with null values and
with varlena values. This lack of coverage was discovered while
working on this patch.
Discussion: https://www.postgresql.org/message-id/flat/20190331025744.ugbsyks7czfcoksd%40alap3.anarazel.de
For partial aggregation combine steps,
AggStatePerTrans->numTransInputs was set to the transition function's
number of inputs, rather than the combine function's number of
inputs (always 1).
That lead to partial aggregates with strict combine functions to
wrongly check for NOT NULL input as required by strictness. When the
aggregate wasn't exactly passed one argument, the strictness check was
either omitted (in the 0 args case) or too many arguments were
checked. In the latter case we'd read beyond the end of
FunctionCallInfoData->args (only in master).
AggStatePerTrans->numTransInputs actually has been wrong since since
9.6, where partial aggregates were added. But it turns out to not be
an active problem in 9.6 and 10, because numTransInputs wasn't used at
all for combine functions: Before c253b722f6 there simply was no NULL
check for the input to strict trans functions, and after that the
check was simply hardcoded for the right offset in fcinfo, as it's
done by code specific to combine functions.
In bf6c614a2f (11) the strictness check was generalized, with common
code doing the strictness checks for both plain and combine transition
functions, based on numTransInputs. For combine functions this lead to
not emitting an expression step to check for strict input in the 0
arguments case, and in the > 1 arguments case, we'd check too many
arguments.Due to the fact that the relevant fcinfo->isnull[2..] was
always zero-initialized (more or less by accident, by being part of
the AggStatePerTrans struct, which is palloc0'ed), there was no
observable damage in the latter case before a9c35cf85c, we just
checked too many array elements.
Due to the changes in a9c35cf85c, > 1 argument bug became visible,
because these days fcinfo is a) dynamically allocated without being
zeroed b) exactly the length required for the number of specified
arguments (hardcoded to 2 in this case).
This commit only contains a fairly minimal fix, setting numTransInputs
to a hardcoded 1 when building a pertrans for a combine function. It
seems likely that we'll want to clean this up further (e.g. the
arguments build_pertrans_for_aggref() aren't particularly meaningful
for combine functions). But the wrap date for 12 beta1 is coming up
fast, so it seems good to have a minimal fix in place.
Backpatch to 11. While AggStatePerTrans->numTransInputs was set
wrongly before that, the value was not used for combine functions.
Reported-By: Rajkumar Raghuwanshi
Diagnosed-By: Kyotaro Horiguchi, Jeevan Chalke, Andres Freund, David Rowley
Author: David Rowley, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/CAKcux6=uZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew@mail.gmail.com
Instead add a tableam callback to do so. To avoid adding per
validation overhead, pass a scan to tuple_tid_valid. In heap's case
we'd otherwise incurred a RelationGetNumberOfBlocks() call for each
tid - which'd have added noticable overhead to nodeTidscan.c.
Author: Andres Freund
Reviewed-By: Ashwin Agrawal
Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
Previously, gen_partprune_steps() always built executor pruning steps
using all suitable clauses, including those containing PARAM_EXEC
Params. This meant that the pruning steps were only completely safe
for executor run-time (scan start) pruning. To prune at executor
startup, we had to ignore the steps involving exec Params. But this
doesn't really work in general, since there may be logic changes
needed as well --- for example, pruning according to the last operator's
btree strategy is the wrong thing if we're not applying that operator.
The rules embodied in gen_partprune_steps() and its minions are
sufficiently complicated that tracking their incremental effects in
other logic seems quite impractical.
Short of a complete redesign, the only safe fix seems to be to run
gen_partprune_steps() twice, once to create executor startup pruning
steps and then again for run-time pruning steps. We can save a few
cycles however by noting during the first scan whether we rejected
any clauses because they involved exec Params --- if not, we don't
need to do the second scan.
In support of this, refactor the internal APIs in partprune.c to make
more use of passing information in the GeneratePruningStepsContext
struct, rather than as separate arguments.
This is, I hope, the last piece of our response to a bug report from
Alan Jackson. Back-patch to v11 where this code came in.
Discussion: https://postgr.es/m/FAD28A83-AC73-489E-A058-2681FA31D648@tvsquared.com
For some reason both callsite and the implementation for heapam had
the meaning inverted (i.e. succeeded == true was passed in case of
conflict). That's confusing.
I (Andres) briefly pondered whether it'd be better to rename
table_complete_speculative's argument to 'bool specConflict' or such,
but decided not to. The 'complete' in the function name for me makes
`succeeded` sound a bit better.
Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas
Discussion:
https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.comhttps://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi
For amcanreorderby scans the nodeIndexscan.c's reorder queue holds
heap tuples, but the underlying table likely does not. Before this fix
we'd return different types of slots, depending on whether the tuple
came from the reorder queue, or from the index + table.
While that could be fixed by signalling that the node doesn't return a
fixed type of slot, it seems better to instead remove the separate
slot for the reorder queue, and use ExecForceStoreHeapTuple() to store
tuples from the queue. It's not particularly common to need
reordering, after all.
This reverts most of the iss_ReorderQueueSlot related changes to
nodeIndexscan.c made in 1a0586de36, except that now
ExecForceStoreHeapTuple() is used instead of ExecStoreHeapTuple().
Noticed when testing zheap against the in-core version of tableam.
Author: Andres Freund
As reported by Tom, when ExecStoreMinimalTuple() had to perform a
conversion to store the minimal tuple in the slot, it forgot to
respect the shouldFree flag, and leaked the tuple into the current
memory context if true. Fix that by freeing the tuple in that case.
Looking at the relevant code made me (Andres) realize that not having
the shouldFree parameter to ExecForceStoreHeapTuple() was a bad
idea. Some callers had to locally implement the necessary logic, and
in one case it was missing, creating a potential per-group leak in
non-hashed aggregation.
The choice to not free the tuple in ExecComputeStoredGenerated() is
not pretty, but not introduced by this commit - I'll start a separate
discussion about it.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/366.1555382816@sss.pgh.pa.us
HoldPinnedPortals() did things in the wrong order: it must not mark
a portal autoHeld until it's been successfully held. Otherwise,
a failure while persisting the portal results in a server crash
because we think the portal is in a good state when it's not.
Also add a check that portal->status is READY before attempting to
hold a pinned portal. We have such a check before the only other
use of HoldPortal(), so it seems unwise not to check it here.
Lastly, rethink the responsibility for where to call HoldPinnedPortals.
The comment for it imagined that it was optional for any individual PL
to call it or not, but that cannot be the case: if some outer level of
procedure has a pinned portal, failing to persist it when an inner
procedure commits is going to be trouble. Let's have SPI do it instead
of the individual PLs. That's not a complete solution, since in theory
a PL might not be using SPI to perform commit/rollback, but such a PL
is going to have to be aware of lots of related requirements anyway.
(This change doesn't cause an API break for any external PLs that might
be calling HoldPinnedPortals per the old regime, because calling it
twice during a commit or rollback sequence won't hurt.)
Per bug #15703 from Julian Schauder. Back-patch to v11 where this code
came in.
Discussion: https://postgr.es/m/15703-c12c5bc0ea34ba26@postgresql.org
Commit 3f2393edef changed ExecCleanupTupleRouting() so that it skipped
cleaning up subplan resultrels before calling EndForeignInsert(), but
that would cause an issue: when those resultrels were foreign tables,
the FDWs would fail to shut down. Repair by skipping it after calling
EndForeignInsert() as before.
Author: Etsuro Fujita
Reviewed-by: David Rowley and Amit Langote
Discussion: https://postgr.es/m/5CAF3B8F.2090905@lab.ntt.co.jp
This commit fixes three, unfortunately related, issues:
1) Since 5db6df0c01, the introduction of DML via tableam, it was
possible to trigger "ERROR: unexpected table_lock_tuple status: 1"
when updating a row that was previously updated in the same
transaction - but only when the previously updated row was before
updated in a concurrent transaction (and READ COMMITTED was
used). The reason for that was that that case simply wasn't
expected. Fixing that lead to:
2) Even before the above commit, there were error checks (introduced
in 6868ed7491) preventing a row being updated by different
commands within the same statement (say in a function called by an
UPDATE) - but that check wasn't performed when the row was first
updated in a concurrent transaction - instead the second update was
silently skipped in that case. After this change we throw the same
error as we'd without the concurrent transaction.
3) The error messages (introduced in 6868ed7491) preventing such
updates emitted the same error message for both DELETE and
UPDATE ("tuple to be updated was already modified by an operation
triggered by the current command"). While that could be changed
separately, it made it hard to write tests that verify the correct
correct behavior of the code.
This commit changes heap's implementation of table_lock_tuple() to
return TM_SelfModified instead of TM_Invisible (previously loosely
modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to
handle that in response to table_lock_tuple() and not just in response
to table_(delete|update).
Additionally it fixes the wrong error message (see 3 above). The
comment for table_lock_tuple() is also adjusted to state that
TM_Deleted won't return information in TM_FailureData - it'll not
always be available.
This also adds tests to ensure that DELETE/UPDATE correctly error out
when affecting a row that concurrently was modified by another
transaction.
Author: Andres Freund
Reported-By: Tom Lane, when investigating a bug bug fix to another bug
by Amit Langote
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.
A simple conversion of COPY FROM use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there's constant partition changes is measurable.
Thus instead revamp the COPY FROM buffering for partitioned tables to
allow to buffer inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actuall speed such cases
up significantly.
By using slots COPY of very narrow rows into unlogged / temporary
might slow down very slightly (due to the indirect function calls).
Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
The assertions added by commit b04aeb0a0 exposed that there are some
code paths wherein the executor will try to open an index without
holding any lock on it. We do have some lock on the index's table,
so it seems likely that there's no fatal problem with this (for
instance, the index couldn't get dropped from under us). Still,
it's bad practice and we should fix it.
To do so, remove the optimizations in ExecInitIndexScan and friends
that tried to avoid taking a lock on an index belonging to a target
relation, and just take the lock always. In non-bug cases, this
will result in no additional shared-memory access, since we'll find
in the local lock table that we already have a lock of the desired
type; hence, no significant performance degradation should occur.
Also, adjust the planner and executor so that the type of lock taken
on an index is always identical to the type of lock taken for its table,
by relying on the recently added RangeTblEntry.rellockmode field.
This avoids some corner cases where that might not have been true
before (possibly resulting in extra locking overhead), and prevents
future maintenance issues from having multiple bits of logic that
all needed to be in sync. In addition, this change removes all core
calls to ExecRelationIsTargetRelation, which avoids a possible O(N^2)
startup penalty for queries with large numbers of target relations.
(We'd probably remove that function altogether, were it not that we
advertise it as something that FDWs might want to use.)
Also adjust some places in selfuncs.c to not take any lock on indexes
they are transiently opening, since we can assume that plancat.c
did that already.
In passing, change gin_clean_pending_list() to take RowExclusiveLock
not AccessShareLock on its target index. Although it's not clear that
that's actually a bug, it seemed very strange for a function that's
explicitly going to modify the index to use only AccessShareLock.
David Rowley, reviewed by Julien Rouhaud and Amit Langote,
a bit of further tweaking by me
Discussion: https://postgr.es/m/19465.1541636036@sss.pgh.pa.us
This moves bitmap heap scan support to below an optional tableam
callback. It's optional as the whole concept of bitmap heapscans is
fairly block specific.
This basically moves the work previously done in bitgetpage() into the
new scan_bitmap_next_block callback, and the direct poking into the
buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple()
callback.
The abstraction is currently somewhat leaky because
nodeBitmapHeapscan.c's prefetching and visibilitymap based logic
remains - it's likely that we'll later have to move more into the
AM. But it's not trivial to do so without introducing a significant
amount of code duplication between the AMs, so that's a project for
later.
Note that now nodeBitmapHeapscan.c and the associated node types are a
bit misnamed. But it's not clear whether renaming wouldn't be a cure
worse than the disease. Either way, that'd be best done in a separate
commit.
Author: Andres Freund
Reviewed-By: Robert Haas (in an older version)
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de