Our policy since commit ab77a5a45 has been that a plan node having
any initplans is automatically not parallel-safe. (This could be
relaxed, but not today.) clean_up_removed_plan_level neglected
this, and could attach initplans to a parallel-safe child plan
node without clearing the plan's parallel-safe flag. That could
lead to "subplan was not initialized" errors at runtime, in case
an initplan referenced another one and only the referencing one
got transmitted to parallel workers.
The fix in clean_up_removed_plan_level is trivial enough.
materialize_finished_plan also moves initplans from one node
to another, but it's okay because it already copies the source
node's parallel_safe flag. The other place that does this kind
of thing is standard_planner's hack to inject a top-level Gather
when debug_parallel_query is active. But that's actually dead
code given that we're correctly enforcing the "initplans aren't
parallel safe" rule, so just replace it with an Assert that
there are no initplans.
Also improve some related comments.
Normally we'd add a regression test case for this sort of bug.
The mistake itself is already reached by existing tests, but there
is accidentally no visible problem. The only known test case that
creates an actual failure seems too indirect and fragile to justify
keeping it as a regression test (not least because it fails to fail
in v11, though the bug is clearly present there too).
Per report from Justin Pryzby. Back-patch to all supported branches.
Discussion: https://postgr.es/m/ZDVt6MaNWkRDO1LQ@telsasoft.com
This function can be called from mark_async_capable_plan(), a helper
function for create_append_plan(), before set_subqueryscan_references(),
to determine the triviality of a SubqueryScan that is a child of an
Append plan node, which is done before doing finalize_plan() on the
SubqueryScan (if necessary) and set_plan_references() on the subplan,
unlike when called from set_subqueryscan_references(). The reason why
this is safe wouldn't be that obvious, so add comments explaining this.
Follow-up for commit c2bb02bc2.
Reviewed by Zhihong Yu.
Discussion: https://postgr.es/m/CAPmGK17%2BGiJBthC6va7%2B9n6t75e-M1N0U18YB2G1B%2BE5OdrNTA%40mail.gmail.com
We were setting MERGE source subplan's targetlist by expanding the
individual attributes of the source relation completely, early in the
parse analysis phase. This failed to work when the condition of an
action included a whole-row reference, causing setrefs.c to error out
with
ERROR: variable not found in subplan target lists
because at that point there is nothing to resolve the whole-row
reference with. We can fix this by having preprocess_targetlist expand
the source targetlist for Vars required from the source rel by all
actions. Moreover, by using this expansion mechanism we can do away
with the targetlist expansion in transformMergeStmt, which is good
because then we no longer pull in columns that aren't needed for
anything.
Add a test case for the problem.
While at it, remove some redundant code in preprocess_targetlist():
MERGE was doing separately what is already being done for UPDATE/DELETE,
so we can just rely on the latter and remove the former. (The handling
of inherited rels was different for MERGE, but that was a no-longer-
necessary hack.)
Fix outdated, related comments for fix_join_expr also.
Author: Richard Guo <guofenglinux@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Joe Wildish <joe@lateraljoin.com>
Discussion: https://postgr.es/m/fab3b90a-914d-46a9-beb0-df011ee39ee5@www.fastmail.com
Window functions such as row_number() always return a value higher than
the previously returned value for tuples in any given window partition.
Traditionally queries such as;
SELECT * FROM (
SELECT *, row_number() over (order by c) rn
FROM t
) t WHERE rn <= 10;
were executed fairly inefficiently. Neither the query planner nor the
executor knew that once rn made it to 11 that nothing further would match
the outer query's WHERE clause. It would blindly continue until all
tuples were exhausted from the subquery.
Here we implement means to make the above execute more efficiently.
This is done by way of adding a pg_proc.prosupport function to various of
the built-in window functions and adding supporting code to allow the
support function to inform the planner if the window function is
monotonically increasing, monotonically decreasing, both or neither. The
planner is then able to make use of that information and possibly allow
the executor to short-circuit execution by way of adding a "run condition"
to the WindowAgg to allow it to determine if some of its execution work
can be skipped.
This "run condition" is not like a normal filter. These run conditions
are only built using quals comparing values to monotonic window functions.
For monotonic increasing functions, quals making use of the btree
operators for <, <= and = can be used (assuming the window function column
is on the left). You can see here that once such a condition becomes false
that a monotonic increasing function could never make it subsequently true
again. For monotonically decreasing functions the >, >= and = btree
operators for the given type can be used for run conditions.
The best-case situation for this is when there is a single WindowAgg node
without a PARTITION BY clause. Here when the run condition becomes false
the WindowAgg node can simply return NULL. No more tuples will ever match
the run condition. It's a little more complex when there is a PARTITION
BY clause. In this case, we cannot return NULL as we must still process
other partitions. To speed this case up we pull tuples from the outer
plan to check if they're from the same partition and simply discard them
if they are. When we find a tuple belonging to another partition we start
processing as normal again until the run condition becomes false or we run
out of tuples to process.
When there are multiple WindowAgg nodes to evaluate then this complicates
the situation. For intermediate WindowAggs we must ensure we always
return all tuples to the calling node. Any filtering done could lead to
incorrect results in WindowAgg nodes above. For all intermediate nodes,
we can still save some work when the run condition becomes false. We've
no need to evaluate the WindowFuncs anymore. Other WindowAgg nodes cannot
reference the value of these and these tuples will not appear in the final
result anyway. The savings here are small in comparison to what can be
saved in the top-level WingowAgg, but still worthwhile.
Intermediate WindowAgg nodes never filter out tuples, but here we change
WindowAgg so that the top-level WindowAgg filters out tuples that don't
match the intermediate WindowAgg node's run condition. Such filters
appear in the "Filter" clause in EXPLAIN for the top-level WindowAgg node.
Here we add prosupport functions to allow the above to work for;
row_number(), rank(), dense_rank(), count(*) and count(expr). It appears
technically possible to do the same for min() and max(), however, it seems
unlikely to be useful enough, so that's not done here.
Bump catversion
Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
In commit 27e1f1456, create_append_plan() only allowed the subplan
created from a given subpath to be executed asynchronously when it was
an async-capable ForeignPath. To extend coverage, this patch handles
cases when the given subpath includes some other Path types as well that
can be omitted in the plan processing, such as a ProjectionPath directly
atop an async-capable ForeignPath, allowing asynchronous execution in
partitioned-scan/partitioned-join queries with non-Var tlist expressions
and more UNION queries.
Andrey Lepikhov and Etsuro Fujita, reviewed by Alexander Pyhalov and
Zhihong Yu.
Discussion: https://postgr.es/m/659c37a8-3e71-0ff2-394c-f04428c76f08%40postgrespro.ru
MERGE performs actions that modify rows in the target table using a
source table or query. MERGE provides a single SQL statement that can
conditionally INSERT/UPDATE/DELETE rows -- a task that would otherwise
require multiple PL statements. For example,
MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
DO NOTHING;
MERGE works with regular tables, partitioned tables and inheritance
hierarchies, including column and row security enforcement, as well as
support for row and statement triggers and transition tables therein.
MERGE is optimized for OLTP and is parameterizable, though also useful
for large scale ETL/ELT. MERGE is not intended to be used in preference
to existing single SQL commands for INSERT, UPDATE or DELETE since there
is some overhead. MERGE can be used from PL/pgSQL.
MERGE does not support targetting updatable views or foreign tables, and
RETURNING clauses are not allowed either. These limitations are likely
fixable with sufficient effort. Rewrite rules are also not supported,
but it's not clear that we'd want to support them.
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Reviewed-by: Peter Geoghegan <pg@bowt.ie> (earlier versions)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com
Discussion: https://postgr.es/m/20201231134736.GA25392@alvherre.pgsql
8edd0e794 added some code to remove Append and MergeAppend nodes when they
contained a single child node. As it turned out, this was unsafe to do
when the Append/MergeAppend was parallel_aware and the child node was not.
Removing the Append/MergeAppend, in this case, could lead to the child plan
being called multiple times by parallel workers when it was unsafe to do
so.
Here we fix this by just not removing the Append/MergeAppend when the
parallel_aware flag of the parent and child node don't match.
Reported-by: Yura Sokolov
Bug: #17335
Discussion: https://postgr.es/m/b59605fecb20ba9ea94e70ab60098c237c870628.camel%40postgrespro.ru
Backpatch-through: 12, where 8edd0e794 was first introduced
Commit 4ace45677 failed to fix the problem fully, because the
same issue of attempting to fetch a non-returnable index column
can occur when rechecking the indexqual after using a lossy index
operator. Moreover, it broke EXPLAIN for such indexquals (which
indicates a gap in our test cases :-().
Revert the code changes of 4ace45677 in favor of adding a new field
to struct IndexOnlyScan, containing a version of the indexqual that
can be executed against the index-returned tuple without using any
non-returnable columns. (The restrictions imposed by check_index_only
guarantee this is possible, although we may have to recompute indexed
expressions.) Support construction of that during setrefs.c
processing by marking IndexOnlyScan.indextlist entries as resjunk
if they can't be returned, rather than removing them entirely.
(We could alternatively require setrefs.c to look up the IndexOptInfo
again, but abusing resjunk this way seems like a reasonably safe way
to avoid needing to do that.)
This solution isn't great from an API-stability standpoint: if there
are any extensions out there that build IndexOnlyScan structs directly,
they'll be broken in the next minor releases. However, only a very
invasive extension would be likely to do such a thing. There's no
change in the Path representation, so typical planner extensions
shouldn't have a problem.
As before, back-patch to all supported branches.
Discussion: https://postgr.es/m/3179992.1641150853@sss.pgh.pa.us
Discussion: https://postgr.es/m/17350-b5bdcf476e5badbb@postgresql.org
Up to now the size of a query's rangetable has been limited by the
constants INNER_VAR et al, which mustn't be equal to any real
rangetable index. 65000 doubtless seemed like enough for anybody,
and it still is orders of magnitude larger than the number of joins
we can realistically handle. However, we need a rangetable entry
for each child partition that is (or might be) processed by a query.
Queries with a few thousand partitions are getting more realistic,
so that the day when that limit becomes a problem is in sight,
even if it's not here yet. Hence, let's raise the limit.
Rather than just increase the values of INNER_VAR et al, this patch
adopts the approach of making them small negative values, so that
rangetables could theoretically become as long as INT_MAX.
The bulk of the patch is concerned with changing Var.varno and some
related variables from "Index" (unsigned int) to plain "int". This
is basically cosmetic, with little actual effect other than to help
debuggers print their values nicely. As such, I've only bothered
with changing places that could actually see INNER_VAR et al, which
the parser and most of the planner don't. We do have to be careful
in places that are performing less/greater comparisons on varnos,
but there are very few such places, other than the IS_SPECIAL_VARNO
macro itself.
A notable side effect of this patch is that while it used to be
possible to add INNER_VAR et al to a Bitmapset, that will now
draw an error. I don't see any likelihood that it wouldn't be a
bug to include these fake varnos in a bitmapset of real varnos,
so I think this is all to the good.
Although this touches outfuncs/readfuncs, I don't think a catversion
bump is required, since stored rules would never contain Vars
with these fake varnos.
Andrey Lepikhov and Tom Lane, after a suggestion by Peter Eisentraut
Discussion: https://postgr.es/m/43c7f2f5-1e27-27aa-8c65-c91859d15190@postgrespro.ru
It's possible for us to copy an AlternativeSubPlan expression node
into multiple places, for example the scan quals of several
partition children. Then it's possible that we choose a different
one of the alternatives as optimal in each place. Commit 41efb8340
failed to consider this scenario, so its attempt to remove "unused"
subplans could remove subplans that were still used elsewhere.
Fix by delaying the removal logic until we've examined all the
AlternativeSubPlans in a given query level. (This does assume that
AlternativeSubPlans couldn't get copied to other query levels, but
for the foreseeable future that's fine; cf qual_is_pushdown_safe.)
Per report from Rajkumar Raghuwanshi. Back-patch to v14
where the faulty logic came in.
Discussion: https://postgr.es/m/CAKcux6==O3NNZC3bZ2prRYv3cjm3_Zw1GfzmOjEVqYN4jub2+Q@mail.gmail.com
As of v14, pg_depend contains almost 7000 "pin" entries recording
the OIDs of built-in objects. This is a fair amount of bloat for
every database, and it adds time to pg_depend lookups as well as
initdb. We can get rid of all of those entries in favor of an OID
range check, i.e. "OIDs below FirstUnpinnedObjectId are pinned".
(template1 and the public schema are exceptions. Those exceptions
are now wired into IsPinnedObject() instead of initdb's code for
filling pg_depend, but it's the same amount of cruft either way.)
The contents of pg_shdepend are modified likewise.
Discussion: https://postgr.es/m/3737988.1618451008@sss.pgh.pa.us
"Result Cache" was never a great name for this node, but nobody managed
to come up with another name that anyone liked enough. That was until
David Johnston mentioned "Node Memoization", which Tom Lane revised to
just "Memoize". People seem to like "Memoize", so let's do the rename.
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us
Backpatch-through: 14, where Result Cache was introduced
Similar to 50e17ad28, which allowed hash tables to be used for IN clauses
with a set of constants, here we add the same feature for NOT IN clauses.
NOT IN evaluates the same as: WHERE a <> v1 AND a <> v2 AND a <> v3.
Obviously, if we're using a hash table we must be exactly equivalent to
that and return the same result taking into account that either side of
the condition could contain a NULL. This requires a little bit of
special handling to make work with the hash table version.
When processing NOT IN, the ScalarArrayOpExpr's operator will be the <>
operator. To be able to build and lookup a hash table we must use the
<>'s negator operator. The planner checks if that exists and is hashable
and sets the relevant fields in ScalarArrayOpExpr to instruct the executor
to use hashing.
Author: David Rowley, James Coleman
Reviewed-by: James Coleman, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvoF1mum_FRk6D621edcB6KSHBi2+GAgWmioj5AhOu2vwQ@mail.gmail.com
Result Cache, added in 9eacee2e6 neglected to properly adjust the plan
references in setrefs.c. This could lead to the following error during
EXPLAIN:
ERROR: cannot decompile join alias var in plan tree
Fix that.
Bug: 17030
Reported-by: Hans Buschmann
Discussion: https://postgr.es/m/17030-5844aecae42fe223@postgresql.org
ScalarArrayOpExprs with "useOr=true" and a set of Consts on the righthand
side have traditionally been evaluated by using a linear search over the
array. When these arrays contain large numbers of elements then this
linear search could become a significant part of execution time.
Here we add a new method of evaluating ScalarArrayOpExpr expressions to
allow them to be evaluated by first building a hash table containing each
element, then on subsequent evaluations, we just probe that hash table to
determine if there is a match.
The planner is in charge of determining when this optimization is possible
and it enables it by setting hashfuncid in the ScalarArrayOpExpr. The
executor will only perform the hash table evaluation when the hashfuncid
is set.
This means that not all cases are optimized. For example CHECK constraints
containing an IN clause won't go through the planner, so won't get the
hashfuncid set. We could maybe do something about that at some later
date. The reason we're not doing it now is from fear that we may slow
down cases where the expression is evaluated only once. Those cases can
be common, for example, a single row INSERT to a table with a CHECK
constraint containing an IN clause.
In the planner, we enable this when there are suitable hash functions for
the ScalarArrayOpExpr's operator and only when there is at least
MIN_ARRAY_SIZE_FOR_HASHED_SAOP elements in the array. The threshold is
currently set to 9.
Author: James Coleman, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAaqYe8x62+=wn0zvNKCj55tPpg-JBHzhZFFc6ANovdqFw7-dA@mail.gmail.com
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
This removes "Add Result Cache executor node". It seems that something
weird is going on with the tracking of cache hits and misses as
highlighted by many buildfarm animals. It's not yet clear what the
problem is as other parts of the plan indicate that the cache did work
correctly, it's just the hits and misses that were being reported as 0.
This is especially a bad time to have the buildfarm so broken, so
reverting before too many more animals go red.
Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
Here we add a new executor node type named "Result Cache". The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins. This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again. Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.
For certain data sets, this can significantly improve the performance of
joins. The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join. In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch. Merge joins would have to
skip over all of the unmatched rows. If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join. The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large. Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join. This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does. The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables. Smaller hash tables generally perform better.
The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size. We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.
For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node. We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be. Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.
For now, the planner will only consider using a result cache for
parameterized nested loop joins. This works for both normal joins and
also for LATERAL type joins to subqueries. It is possible to use this new
node for other uses in the future. For example, to cache results from
correlated subqueries. However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio. Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.
The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations. With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be. In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%. Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join. However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values. If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join. Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature. Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.
For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache. However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default. There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression. Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default. It remains to be seen if we'll
maintain that setting for the release.
Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch. Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people. If there's some consensus on a better name, then we can
change it before the release. Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.
Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
This patch makes two closely related sets of changes:
1. For UPDATE, the subplan of the ModifyTable node now only delivers
the new values of the changed columns (i.e., the expressions computed
in the query's SET clause) plus row identity information such as CTID.
ModifyTable must re-fetch the original tuple to merge in the old
values of any unchanged columns. The core advantage of this is that
the changed columns are uniform across all tables of an inherited or
partitioned target relation, whereas the other columns might not be.
A secondary advantage, when the UPDATE involves joins, is that less
data needs to pass through the plan tree. The disadvantage of course
is an extra fetch of each tuple to be updated. However, that seems to
be very nearly free in context; even worst-case tests don't show it to
add more than a couple percent to the total query cost. At some point
it might be interesting to combine the re-fetch with the tuple access
that ModifyTable must do anyway to mark the old tuple dead; but that
would require a good deal of refactoring and it seems it wouldn't buy
all that much, so this patch doesn't attempt it.
2. For inherited UPDATE/DELETE, instead of generating a separate
subplan for each target relation, we now generate a single subplan
that is just exactly like a SELECT's plan, then stick ModifyTable
on top of that. To let ModifyTable know which target relation a
given incoming row refers to, a tableoid junk column is added to
the row identity information. This gets rid of the horrid hack
that was inheritance_planner(), eliminating O(N^2) planning cost
and memory consumption in cases where there were many unprunable
target relations.
Point 2 of course requires point 1, so that there is a uniform
definition of the non-junk columns to be returned by the subplan.
We can't insist on uniform definition of the row identity junk
columns however, if we want to keep the ability to have both
plain and foreign tables in a partitioning hierarchy. Since
it wouldn't scale very far to have every child table have its
own row identity column, this patch includes provisions to merge
similar row identity columns into one column of the subplan result.
In particular, we can merge the whole-row Vars typically used as
row identity by FDWs into one column by pretending they are type
RECORD. (It's still okay for the actual composite Datums to be
labeled with the table's rowtype OID, though.)
There is more that can be done to file down residual inefficiencies
in this patch, but it seems to be committable now.
FDW authors should note several API changes:
* The argument list for AddForeignUpdateTargets() has changed, and so
has the method it must use for adding junk columns to the query. Call
add_row_identity_var() instead of manipulating the parse tree directly.
You might want to reconsider exactly what you're adding, too.
* PlanDirectModify() must now work a little harder to find the
ForeignScan plan node; if the foreign table is part of a partitioning
hierarchy then the ForeignScan might not be the direct child of
ModifyTable. See postgres_fdw for sample code.
* To check whether a relation is a target relation, it's no
longer sufficient to compare its relid to root->parse->resultRelation.
Instead, check it against all_result_relids or leaf_result_relids,
as appropriate.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
This adds a new executor node named TID Range Scan. The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column. These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.
To support this, two new optional callback functions have been added to
table AM. scan_set_tidrange is used to set the scan range to just the
given range of TIDs. scan_getnextslot_tidrange fetches the next tuple
in the given range.
For AMs were scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine. The query planner won't
generate TID Range Scan Paths in that case.
Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
It's unsafe to do this at parse time because addition of generated
columns to a table would not invalidate stored rules containing
UPDATEs on the table ... but there might now be dependent generated
columns that were not there when the rule was made. This also fixes
an oversight that rewriteTargetView failed to update extraUpdatedCols
when transforming an UPDATE on an updatable view. (Since the new
calculation is downstream of that, rewriteTargetView doesn't actually
need to do anything; but before, there was a demonstrable bug there.)
In v13 and HEAD, this leads to easily-visible bugs because (since
commit c6679e4fc) we won't recalculate generated columns that aren't
listed in extraUpdatedCols. In v12 this bitmap is mostly just used
for trigger-firing decisions, so you'd only notice a problem if a
trigger cared whether a generated column had been updated.
I'd complained about this back in May, but then forgot about it
until bug #16671 from Michael Paul Killian revived the issue.
Back-patch to v12 where this field was introduced. If existing
stored rules contain any extraUpdatedCols values, they'll be
ignored because the rewriter will overwrite them, so the bug will
be fixed even for existing rules. (But note that if someone were
to update to 13.1 or 12.5, store some rules with UPDATEs on tables
having generated columns, and then downgrade to a prior minor version,
they might observe issues similar to what this patch fixes. That
seems unlikely enough to not be worth going to a lot of effort to fix.)
Discussion: https://postgr.es/m/10206.1588964727@sss.pgh.pa.us
Discussion: https://postgr.es/m/16671-2fa55851859fb166@postgresql.org
FDWs that can perform an UPDATE/DELETE remotely using the "direct modify"
set of APIs need to access the ResultRelInfo of the target table. That's
currently available in EState.es_result_relation_info, but the next
commit will remove that field.
This commit adds a new resultRelation field in ForeignScan, to store the
target relation's RT index, and the corresponding ResultRelInfo in
ForeignScanState. The FDW's PlanDirectModify callback is expected to set
'resultRelation' along with 'operation'. The core code doesn't need them
for anything, they are for the convenience of FDW's Begin- and
IterateDirectModify callbacks.
Authors: Amit Langote, Etsuro Fujita
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
Instead of allocating all the ResultRelInfos upfront in one big array,
allocate them in ExecInitModifyTable(). es_result_relations is now an
array of ResultRelInfo pointers, rather than an array of structs, and it
is indexed by the RT index.
This simplifies things: we get rid of the separate concept of a "result
rel index", and don't need to set it in setrefs.c anymore. This also
allows follow-up optimizations (not included in this commit yet) to skip
initializing ResultRelInfos for target relations that were not needed at
runtime, and removal of the es_result_relation_info pointer.
The EState arrays of regular result rels and root result rels are merged
into one array. Similarly, the resultRelations and rootResultRelations
lists in PlannedStmt are merged into one. It's not actually clear to me
why they were kept separate in the first place, but now that the
es_result_relations array is indexed by RT index, it certainly seems
pointless.
The PlannedStmt->resultRelations list is now only needed for
ExecRelationIsTargetRelation(). One visible effect of this change is that
ExecRelationIsTargetRelation() will now return 'true' also for the
partition root, if a partitioned table is updated. That seems like a good
thing, although the function isn't used in core code, and I don't see any
reason for an FDW to call it on a partition root.
Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
When commit bd3daddaf introduced AlternativeSubPlans, I had some
ambitions towards allowing the choice of subplan to change during
execution. That has not happened, or even been thought about, in the
ensuing twelve years; so it seems like a failed experiment. So let's
rip that out and resolve the choice of subplan at the end of planning
(in setrefs.c) rather than during executor startup. This has a number
of positive benefits:
* Removal of a few hundred lines of executor code, since
AlternativeSubPlans need no longer be supported there.
* Removal of executor-startup overhead (particularly, initialization
of subplans that won't be used).
* Removal of incidental costs of having a larger plan tree, such as
tree-scanning and copying costs in the plancache; not to mention
setrefs.c's own costs of processing the discarded subplans.
* EXPLAIN no longer has to print a weird (and undocumented)
representation of an AlternativeSubPlan choice; it sees only the
subplan actually used. This should mean less confusion for users.
* Since setrefs.c knows which subexpression of a plan node it's
working on at any instant, it's possible to adjust the estimated
number of executions of the subplan based on that. For example,
we should usually estimate more executions of a qual expression
than a targetlist expression. The implementation used here is
pretty simplistic, because we don't want to expend a lot of cycles
on the issue; but it's better than ignoring the point entirely,
as the executor had to.
That last point might possibly result in shifting the choice
between hashed and non-hashed EXISTS subplans in a few cases,
but in general this patch isn't meant to change planner choices.
Since we're doing the resolution so late, it's really impossible
to change any plan choices outside the AlternativeSubPlan itself.
Patch by me; thanks to David Rowley for review.
Discussion: https://postgr.es/m/1992952.1592785225@sss.pgh.pa.us
Thomas Munro fixed a longstanding annoyance in pg_bsd_indent, that
it would misformat lines containing IsA() macros on the assumption
that the IsA() call should be treated like a cast. This improves
some other cases involving field/variable names that match typedefs,
too. The only places that get worse are a couple of uses of the
OpenSSL macro STACK_OF(); we'll gladly take that trade-off.
Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
The core idea of this patch is to make the parser generate join alias
Vars (that is, ones with varno pointing to a JOIN RTE) only when the
alias Var is actually different from any raw join input, that is a type
coercion and/or COALESCE is necessary to generate the join output value.
Otherwise just generate varno/varattno pointing to the relevant join
input column.
In effect, this means that the planner's flatten_join_alias_vars()
transformation is already done in the parser, for all cases except
(a) columns that are merged by JOIN USING and are transformed in the
process, and (b) whole-row join Vars. In principle that would allow
us to skip doing flatten_join_alias_vars() in many more queries than
we do now, but we don't have quite enough infrastructure to know that
we can do so --- in particular there's no cheap way to know whether
there are any whole-row join Vars. I'm not sure if it's worth the
trouble to add a Query-level flag for that, and in any case it seems
like fit material for a separate patch. But even without skipping the
work entirely, this should make flatten_join_alias_vars() faster,
particularly where there are nested joins that it previously had to
flatten recursively.
An essential part of this change is to replace Var nodes'
varnoold/varoattno fields with varnosyn/varattnosyn, which have
considerably more tightly-defined meanings than the old fields: when
they differ from varno/varattno, they identify the Var's position in
an aliased JOIN RTE, and the join alias is what ruleutils.c should
print for the Var. This is necessary because the varno change
destroyed ruleutils.c's ability to find the JOIN RTE from the Var's
varno.
Another way in which this change broke ruleutils.c is that it's no
longer feasible to determine, from a JOIN RTE's joinaliasvars list,
which join columns correspond to which columns of the join's immediate
input relations. (If those are sub-joins, the joinaliasvars entries
may point to columns of their base relations, not the sub-joins.)
But that was a horrid mess requiring a lot of fragile assumptions
already, so let's just bite the bullet and add some more JOIN RTE
fields to make it more straightforward to figure that out. I added
two integer-List fields containing the relevant column numbers from
the left and right input rels, plus a count of how many merged columns
there are.
This patch depends on the ParseNamespaceColumn infrastructure that
I added in commit 5815696bc. The biggest bit of code change is
restructuring transformFromClauseItem's handling of JOINs so that
the ParseNamespaceColumn data is propagated upward correctly.
Other than that and the ruleutils fixes, everything pretty much
just works, though some processing is now inessential. I grabbed
two pieces of low-hanging fruit in that line:
1. In find_expr_references, we don't need to recurse into join alias
Vars anymore. There aren't any except for references to merged USING
columns, which are more properly handled when we scan the join's RTE.
This change actually fixes an edge-case issue: we will now record a
dependency on any type-coercion function present in a USING column's
joinaliasvar, even if that join column has no references in the query
text. The odds of the missing dependency causing a problem seem quite
small: you'd have to posit somebody dropping an implicit cast between
two data types, without removing the types themselves, and then having
a stored rule containing a whole-row Var for a join whose USING merge
depends on that cast. So I don't feel a great need to change this in
the back branches. But in theory this way is more correct.
2. markRTEForSelectPriv and markTargetListOrigin don't need to recurse
into join alias Vars either, because the cases they care about don't
apply to alias Vars for USING columns that are semantically distinct
from the underlying columns. This removes the only case in which
markVarForSelectPriv could be called with NULL for the RTE, so adjust
the comments to describe that hack as being strictly internal to
markRTEForSelectPriv.
catversion bump required due to changes in stored rules.
Discussion: https://postgr.es/m/7115.1577986646@sss.pgh.pa.us
This patch causes EXPLAIN to always assign a separate table alias to the
parent RTE of an append relation (inheritance set); before, such RTEs
were ignored if not actually scanned by the plan. Since the child RTEs
now always have that same alias to start with (cf. commit 55a1954da),
the net effect is that the parent RTE usually gets the alias used or
implied by the query text, and the children all get that alias with "_N"
appended. (The exception to "usually" is if there are duplicate aliases
in different subtrees of the original query; then some of those original
RTEs will also have "_N" appended.)
This results in more uniform output for partitioned-table plans than
we had before: the partitioned table itself gets the original alias,
and all child tables have aliases with "_N", rather than the previous
behavior where one of the children would get an alias without "_N".
The reason for giving the parent RTE an alias, even if it isn't scanned
by the plan, is that we now use the parent's alias to qualify Vars that
refer to an appendrel output column and appear above the Append or
MergeAppend that computes the appendrel. But below the append, Vars
refer to some one of the child relations, and are displayed that way.
This seems clearer than the old behavior where a Var that could carry
values from any child relation was displayed as if it referred to only
one of them.
While at it, change ruleutils.c so that the code paths used by EXPLAIN
deal in Plan trees not PlanState trees. This effectively reverts a
decision made in commit 1cc29fe7c, which seemed like a good idea at
the time to make ruleutils.c consistent with explain.c. However,
it's problematic because we'd really like to allow executor startup
pruning to remove all the children of an append node when possible,
leaving no child PlanState to resolve Vars against. (That's not done
here, but will be in the next patch.) This requires different handling
of subplans and initplans than before, but is otherwise a pretty
straightforward change.
Discussion: https://postgr.es/m/001001d4f44b$2a2cca50$7e865ef0$@lab.ntt.co.jp
In the wake of commit 1cff1b95a, the result of list_concat no longer
shares the ListCells of the second input. Therefore, we can replace
"list_concat(x, list_copy(y))" with just "list_concat(x, y)".
To improve call sites that were list_copy'ing the first argument,
or both arguments, invent "list_concat_copy()" which produces a new
list sharing no ListCells with either input. (This is a bit faster
than "list_concat(list_copy(x), y)" because it makes the result list
the right size to start with.)
In call sites that were not list_copy'ing the second argument, the new
semantics mean that we are usually leaking the second List's storage,
since typically there is no remaining pointer to it. We considered
inventing another list_copy variant that would list_free the second
input, but concluded that for most call sites it isn't worth worrying
about, given the relative compactness of the new List representation.
(Note that in cases where such leakage would happen, the old code
already leaked the second List's header; so we're only discussing
the size of the leak not whether there is one. I did adjust two or
three places that had been troubling to free that header so that
they manually free the whole second List.)
Patch by me; thanks to David Rowley for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
In 5f32b29c18 I changed the creation of HashState.hashkeys to
actually use HashState as the parent (instead of HashJoinState, which
was incorrect, as they were executed below HashState), to fix the
problem of hashkeys expressions otherwise relying on slot types
appropriate for HashJoinState, rather than HashState as would be
correct. That reliance was only introduced in 12, which is why it
previously worked to use HashJoinState as the parent (although I'd be
unsurprised if there were problematic cases).
Unfortunately that's not a sufficient solution, because before this
commit, the to-be-hashed expressions referenced inner/outer as
appropriate for the HashJoin, not Hash. That didn't have obvious bad
consequences, because the slots containing the tuples were put into
ecxt_innertuple when hashing a tuple for HashState (even though Hash
doesn't have an inner plan).
There are less common cases where this can cause visible problems
however (rather than just confusion when inspecting such executor
trees). E.g. "ERROR: bogus varno: 65000", when explaining queries
containing a HashJoin where the subsidiary Hash node's hash keys
reference a subplan. While normally hashkeys aren't displayed by
EXPLAIN, if one of those expressions references a subplan, that
subplan may be printed as part of the Hash node - which then failed
because an inner plan was referenced, and Hash doesn't have that.
It seems quite possible that there's other broken cases, too.
Fix the problem by properly splitting the expression for the HashJoin
and Hash nodes at plan time, and have them reference the proper
subsidiary node. While other workarounds are possible, fixing this
correctly seems easy enough. It was a pretty ugly hack to have
ExecInitHashJoin put the expression into the already initialized
HashState, in the first place.
I decided to not just split inner/outer hashkeys inside
make_hashjoin(), but also to separate out hashoperators and
hashcollations at plan time. Otherwise we would have ended up having
two very similar loops, one at plan time and the other during executor
startup. The work seems to more appropriately belong to plan time,
anyway.
Reported-By: Nikita Glukhov, Alexander Korotkov
Author: Andres Freund
Reviewed-By: Tom Lane, in an earlier version
Discussion: https://postgr.es/m/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR+teQ_8tEXU8mxg@mail.gmail.com
Backpatch: 12-
If there's only one child relation, the Append or MergeAppend isn't
doing anything useful, and can be elided. It does have a purpose
during planning though, which is to serve as a buffer between parent
and child Var numbering. Therefore we keep it all the way through
to setrefs.c, and get rid of it only after fixing references in the
plan level(s) above it. This works largely the same as setrefs.c's
ancient hack to get rid of no-op SubqueryScan nodes, and can even
share some code with that.
Note the change to make setrefs.c use apply_tlist_labeling rather than
ad-hoc code. This has the effect of propagating the child's resjunk
and ressortgroupref labels, which formerly weren't propagated when
removing a SubqueryScan. Doing that is demonstrably necessary for
the [Merge]Append cases, and seems harmless for SubqueryScan, if only
because trivial_subqueryscan is afraid to collapse cases where the
resjunk marking differs. (I suspect that restriction could now be
removed, though it's unclear that it'd make any new matches possible,
since the outer query can't have references to a child resjunk column.)
David Rowley, reviewed by Alvaro Herrera and Tomas Vondra
Discussion: https://postgr.es/m/CAKJS1f_7u8ATyJ1JGTMHFoKDvZdeF-iEBhs+sM_SXowOr9cArg@mail.gmail.com
Create a new header optimizer/optimizer.h, which exposes just the
planner functions that can be used "at arm's length", without need
to access Paths or the other planner-internal data structures defined
in nodes/relation.h. This is intended to provide the whole planner
API seen by most of the rest of the system; although FDWs still need
to use additional stuff, and more thought is also needed about just
what selfuncs.c should rely on.
The main point of doing this now is to limit the amount of new
#include baggage that will be needed by "planner support functions",
which I expect to introduce later, and which will be in relevant
datatype modules rather than anywhere near the planner.
This commit just moves relevant declarations into optimizer.h from
other header files (a couple of which go away because everything
got moved), and adjusts #include lists to match. There's further
cleanup that could be done if we want to decide that some stuff
being exposed by optimizer.h doesn't belong in the planner at all,
but I'll leave that for another day.
Discussion: https://postgr.es/m/11460.1548706639@sss.pgh.pa.us
This change allows callers of query_tree_walker() to choose whether
to visit an RTE before or after visiting the contents of the RTE
(i.e., prefix or postfix tree order). All existing users of
QTW_EXAMINE_RTES want the QTW_EXAMINE_RTES_BEFORE behavior, but
an upcoming patch will want QTW_EXAMINE_RTES_AFTER, and it seems
like a potentially useful change on its own.
Andreas Karlsson (extracted from CTE inlining patch)
Discussion: https://postgr.es/m/8810.1542402910@sss.pgh.pa.us
Avoid using "typeid" as a parameter name in header files, since that
is a C++ keyword. These cases were introduced recently, in 04fe805a1
and 586b98fdf.
Since I'm an incurable neatnik, also rename these parameters in the
underlying function definitions. That's not really necessary per
project rules, but I don't like function declarations that don't
quite agree with the underlying definitions.
Per src/tools/pginclude/cpluspluscheck.
If a domain has no constraints, then CoerceToDomain doesn't really do
anything and can be simplified to a RelabelType. This not only
eliminates cycles at execution, but allows the planner to optimize better
(for instance, match the coerced expression to an index on the underlying
column). However, we do have to support invalidating the plan later if
a constraint gets added to the domain. That's comparable to the case of
a change to a SQL function that had been inlined into a plan, so all the
necessary logic already exists for plans depending on functions. We
need only duplicate or share that logic for domains.
ALTER DOMAIN ADD/DROP CONSTRAINT need to be taught to send out sinval
messages for the domain's pg_type entry, since those operations don't
update that row. (ALTER DOMAIN SET/DROP NOT NULL do update that row,
so no code change is needed for them.)
Testing this revealed what's really a pre-existing bug in plpgsql:
it caches the SQL-expression-tree expansion of type coercions and
had no provision for invalidating entries in that cache. Up to now
that was only a problem if such an expression had inlined a SQL
function that got changed, which is unlikely though not impossible.
But failing to track changes of domain constraints breaks an existing
regression test case and would likely cause practical problems too.
We could fix that locally in plpgsql, but what seems like a better
idea is to build some generic infrastructure in plancache.c to store
standalone expressions and track invalidation events for them.
(It's tempting to wonder whether plpgsql's "simple expression" stuff
could use this code with lower overhead than its current use of the
heavyweight plancache APIs. But I've left that idea for later.)
Other stuff fixed in passing:
* Allow estimate_expression_value() to drop CoerceToDomain
unconditionally, effectively assuming that the coercion will succeed.
This will improve planner selectivity estimates for cases involving
estimatable expressions that are coerced to domains. We could have
done this independently of everything else here, but there wasn't
previously any need for eval_const_expressions_mutator to know about
CoerceToDomain at all.
* Use a dlist for plancache.c's list of cached plans, rather than a
manually threaded singly-linked list. That eliminates a potential
performance problem in DropCachedPlan.
* Fix a couple of inconsistencies in typecmds.c about whether
operations on domains drop RowExclusiveLock on pg_type. Our common
practice is that DDL operations do drop catalog locks, so standardize
on that choice.
Discussion: https://postgr.es/m/19958.1544122124@sss.pgh.pa.us
Our support for multiple-set-clauses in UPDATE assumes that the Params
referencing a MULTIEXPR_SUBLINK SubPlan will appear before that SubPlan
in the targetlist of the plan node that calculates the updated row.
(Yeah, it's a hack...) In some PG branches it's possible that a Result
node gets inserted between the primary calculation of the update tlist
and the ModifyTable node. setrefs.c did the wrong thing in this case
and left the upper-level Params as Params, causing a crash at runtime.
What it should do is replace them with "outer" Vars referencing the child
plan node's output. That's a result of careless ordering of operations
in fix_upper_expr_mutator, so we can fix it just by reordering the code.
Fix fix_join_expr_mutator similarly for consistency, even though join
nodes could never appear in such a context. (In general, it seems
likely to be a bit cheaper to use Vars than Params in such situations
anyway, so this patch might offer a tiny performance improvement.)
The hazard extends back to 9.5 where the MULTIEXPR_SUBLINK stuff
was introduced, so back-patch that far. However, this may be a live
bug only in 9.6.x and 10.x, as the other branches don't seem to want
to calculate the final tlist below the Result node. (That plan shape
change between branches might be a mini-bug in itself, but I'm not
really interested in digging into the reasons for that right now.
Still, add a regression test memorializing what we expect there,
so we'll notice if it changes again.)
Per bug report from Eduards Bezverhijs.
Discussion: https://postgr.es/m/b6cd572a-3e44-8785-75e9-c512a5a17a73@tieto.com
In the wake of commit f2343653f, we no longer need some fields that
were used before to control executor lock acquisitions:
* PlannedStmt.nonleafResultRelations can go away entirely.
* partitioned_rels can go away from Append, MergeAppend, and ModifyTable.
However, ModifyTable still needs to know the RT index of the partition
root table if any, which was formerly kept in the first entry of that
list. Add a new field "rootRelation" to remember that. rootRelation is
partly redundant with nominalRelation, in that if it's set it will have
the same value as nominalRelation. However, the latter field has a
different purpose so it seems best to keep them distinct.
Amit Langote, reviewed by David Rowley and Jesper Pedersen,
and whacked around a bit more by me
Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
Create an array estate->es_relations[] paralleling the es_range_table,
and store references to Relations (relcache entries) there, so that any
given RT entry is opened and closed just once per executor run. Scan
nodes typically still call ExecOpenScanRelation, but ExecCloseScanRelation
is no more; relation closing is now done centrally in ExecEndPlan.
This is slightly more complex than one would expect because of the
interactions with relcache references held in ResultRelInfo nodes.
The general convention is now that ResultRelInfo->ri_RelationDesc does
not represent a separate relcache reference and so does not need to be
explicitly closed; but there is an exception for ResultRelInfos in the
es_trig_target_relations list, which are manufactured by
ExecGetTriggerResultRel and have to be cleaned up by
ExecCleanUpTriggerState. (That much was true all along, but these
ResultRelInfos are now more different from others than they used to be.)
To allow the partition pruning logic to make use of es_relations[] rather
than having its own relcache references, adjust PartitionedRelPruneInfo
to store an RT index rather than a relation OID.
Amit Langote, reviewed by David Rowley and Jesper Pedersen,
some mods by me
Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
Commit f49842d, which added support for partitionwise joins, built the
child's tlist by applying adjust_appendrel_attrs() to the parent's. So in
the case where the parent's included a whole-row Var for the parent, the
child's contained a ConvertRowtypeExpr. To cope with that, that commit
added code to the planner, such as setrefs.c, but some code paths still
assumed that the tlist for a scan (or join) rel would only include Vars
and PlaceHolderVars, which was true before that commit, causing errors:
* When creating an explicit sort node for an input path for a mergejoin
path for a child join, prepare_sort_from_pathkeys() threw the 'could not
find pathkey item to sort' error.
* When deparsing a relation participating in a pushed down child join as a
subquery in contrib/postgres_fdw, get_relation_column_alias_ids() threw
the 'unexpected expression in subquery output' error.
* When performing set_plan_references() on a local join plan generated by
contrib/postgres_fdw for EvalPlanQual support for a pushed down child
join, fix_join_expr() threw the 'variable not found in subplan target
lists' error.
To fix these, two approaches have been proposed: one by Ashutosh Bapat and
one by me. While the former keeps building the child's tlist with a
ConvertRowtypeExpr, the latter builds it with a whole-row Var for the
child not to violate the planner assumption, and tries to fix it up later,
But both approaches need more work, so refuse to generate partitionwise
join paths when whole-row Vars are involved, instead. We don't need to
handle ConvertRowtypeExprs in the child's tlists for now, so this commit
also removes the changes to the planner.
Previously, partitionwise join computed attr_needed data for each child
separately, and built the child join's tlist using that data, which also
required an extra step for adding PlaceHolderVars to that tlist, but it
would be more efficient to build it from the parent join's tlist through
the adjust_appendrel_attrs() transformation. So this commit builds that
list that way, and simplifies build_joinrel_tlist() and placeholder.c as
well as part of set_append_rel_size() to basically what they were before
partitionwise join went in.
Back-patch to PG11 where partitionwise join was introduced.
Report by Rajkumar Raghuwanshi. Analysis by Ashutosh Bapat, who also
provided some of regression tests. Patch by me, reviewed by Robert Haas.
Discussion: https://postgr.es/m/CAKcux6ktu-8tefLWtQuuZBYFaZA83vUzuRd7c1YHC-yEWyYFpg@mail.gmail.com
This reverts commits d204ef6377,
83454e3c2b and a few more commits thereafter
(complete list at the end) related to MERGE feature.
While the feature was fully functional, with sufficient test coverage and
necessary documentation, it was felt that some parts of the executor and
parse-analyzer can use a different design and it wasn't possible to do that in
the available time. So it was decided to revert the patch for PG11 and retry
again in the future.
Thanks again to all reviewers and bug reporters.
List of commits reverted, in reverse chronological order:
f1464c5380 Improve parse representation for MERGE
ddb4158579 MERGE syntax diagram correction
530e69e59b Allow cpluspluscheck to pass by renaming variable
01b88b4df5 MERGE minor errata
3af7b2b0d4 MERGE fix variable warning in non-assert builds
a5d86181ec MERGE INSERT allows only one VALUES clause
4b2d44031f MERGE post-commit review
4923550c20 Tab completion for MERGE
aa3faa3c7a WITH support in MERGE
83454e3c2b New files for MERGE
d204ef6377 MERGE SQL Command following SQL:2016
Author: Pavan Deolasee
Reviewed-by: Michael Paquier
Review comments from Andres Freund
* Consolidate code into AfterTriggerGetTransitionTable()
* Rename nodeMerge.c to execMerge.c
* Rename nodeMerge.h to execMerge.h
* Move MERGE handling in ExecInitModifyTable()
into a execMerge.c ExecInitMerge()
* Move mt_merge_subcommands flags into execMerge.h
* Rename opt_and_condition to opt_merge_when_and_condition
* Wordsmith various comments
Author: Pavan Deolasee
Reviewer: Simon Riggs
MERGE performs actions that modify rows in the target table
using a source table or query. MERGE provides a single SQL
statement that can conditionally INSERT/UPDATE/DELETE rows
a task that would other require multiple PL statements.
e.g.
MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
DO NOTHING;
MERGE works with regular and partitioned tables, including
column and row security enforcement, as well as support for
row, statement and transition triggers.
MERGE is optimized for OLTP and is parameterizable, though
also useful for large scale ETL/ELT. MERGE is not intended
to be used in preference to existing single SQL commands
for INSERT, UPDATE or DELETE since there is some overhead.
MERGE can be used statically from PL/pgSQL.
MERGE does not yet support inheritance, write rules,
RETURNING clauses, updatable views or foreign tables.
MERGE follows SQL Standard per the most recent SQL:2016.
Includes full tests and documentation, including full
isolation tests to demonstrate the concurrent behavior.
This version written from scratch in 2017 by Simon Riggs,
using docs and tests originally written in 2009. Later work
from Pavan Deolasee has been both complex and deep, leaving
the lead author credit now in his hands.
Extensive discussion of concurrency from Peter Geoghegan,
with thanks for the time and effort contributed.
Various issues reported via sqlsmith by Andreas Seltenreich
Authors: Pavan Deolasee, Simon Riggs
Reviewer: Peter Geoghegan, Amit Langote, Tomas Vondra, Simon Riggs
Discussion:
https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.comhttps://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com