On further reflection, commit e5d83995e didn't go far enough: pretty much
everywhere in the planner that examines a clause's is_pushed_down flag
ought to be changed to use the more complicated behavior where we also
check the clause's required_relids. Otherwise we could make incorrect
decisions about whether, say, a clause is safe to use as a hash clause.
Some (many?) of these places are safe as-is, either because they are
never reached while considering a parameterized path, or because there
are additional checks that would reject a pushed-down clause anyway.
However, it seems smarter to just code them all the same way rather
than rely on easily-broken reasoning of that sort.
In support of that, invent a new macro RINFO_IS_PUSHED_DOWN that should
be used in place of direct tests on the is_pushed_down flag.
Like the previous patch, back-patch to all supported branches.
Discussion: https://postgr.es/m/f8128b11-c5bf-3539-48cd-234178b2314d@proxel.se
- Explicitly forbids opclass, collation and indoptions (like DESC/ASC etc) for
including columns. Throw an error if user points that.
- Truncated storage arrays for such attributes to store only key atrributes,
added assertion checks.
- Do not check opfamily and collation for including columns in
CompareIndexInfo()
Discussion: https://www.postgresql.org/message-id/5ee72852-3c4e-ee35-e2ed-c1d053d45c08@sigaev.ru
Existing partition pruning is only able to work at plan time, for query
quals that appear in the parsed query. This is good but limiting, as
there can be parameters that appear later that can be usefully used to
further prune partitions.
This commit adds support for pruning subnodes of Append which cannot
possibly contain any matching tuples, during execution, by evaluating
Params to determine the minimum set of subnodes that can possibly match.
We support more than just simple Params in WHERE clauses. Support
additionally includes:
1. Parameterized Nested Loop Joins: The parameter from the outer side of the
join can be used to determine the minimum set of inner side partitions to
scan.
2. Initplans: Once an initplan has been executed we can then determine which
partitions match the value from the initplan.
Partition pruning is performed in two ways. When Params external to the plan
are found to match the partition key we attempt to prune away unneeded Append
subplans during the initialization of the executor. This allows us to bypass
the initialization of non-matching subplans meaning they won't appear in the
EXPLAIN or EXPLAIN ANALYZE output.
For parameters whose value is only known during the actual execution
then the pruning of these subplans must wait. Subplans which are
eliminated during this stage of pruning are still visible in the EXPLAIN
output. In order to determine if pruning has actually taken place, the
EXPLAIN ANALYZE must be viewed. If a certain Append subplan was never
executed due to the elimination of the partition then the execution
timing area will state "(never executed)". Whereas, if, for example in
the case of parameterized nested loops, the number of loops stated in
the EXPLAIN ANALYZE output for certain subplans may appear lower than
others due to the subplan having been scanned fewer times. This is due
to the list of matching subnodes having to be evaluated whenever a
parameter which was found to match the partition key changes.
This commit required some additional infrastructure that permits the
building of a data structure which is able to perform the translation of
the matching partition IDs, as returned by get_matching_partitions, into
the list index of a subpaths list, as exist in node types such as
Append, MergeAppend and ModifyTable. This allows us to translate a list
of clauses into a Bitmapset of all the subpath indexes which must be
included to satisfy the clause list.
Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi,
Jesper Pedersen
Discussion: https://postgr.es/m/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com
This patch introduces INCLUDE clause to index definition. This clause
specifies a list of columns which will be included as a non-key part in
the index. The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans. Also, such columns don't need to have
appropriate operator classes. Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.
Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine. For now, only B-tree indexes support INCLUDE clause.
In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys). Therefore, B-tree indexes
now might have variable number of attributes. This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid. Free 13th bit of t_info is used for indicating
that. This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.
Bump catalog version
Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
Add a new module backend/partitioning/partprune.c, implementing a more
sophisticated algorithm for partition pruning. The new module uses each
partition's "boundinfo" for pruning instead of constraint exclusion,
based on an idea proposed by Robert Haas of a "pruning program": a list
of steps generated from the query quals which are run iteratively to
obtain a list of partitions that must be scanned in order to satisfy
those quals.
At present, this targets planner-time partition pruning, but there exist
further patches to apply partition pruning at execution time as well.
This commit also moves some definitions from include/catalog/partition.h
to a new file include/partitioning/partbounds.h, in an attempt to
rationalize partitioning related code.
Authors: Amit Langote, David Rowley, Dilip Kumar
Reviewers: Robert Haas, Kyotaro Horiguchi, Ashutosh Bapat, Jesper Pedersen.
Discussion: https://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
Don't call generate_gather_paths for the topmost scan/join relation
when it is initially populated with paths. Instead, do the work in
grouping_planner. By itself, this gains nothing; in fact it loses
slightly because we end up calling set_cheapest() for the topmost
scan/join rel twice rather than once. However, it paves the way for
a future commit which will postpone generate_gather_paths for the
topmost scan/join relation even further, allowing more accurate
costing of parallel paths.
Amit Kapila and Robert Haas. Earlier versions of this patch (which
different substantially) were reviewed by Dilip Kumar, Amit
Khandekar, Marina Polyakova, and Ashutosh Bapat.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
A simple UNION ALL gets flattened into an appendrel of subquery
RTEs, but up until now it's been impossible for the appendrel to use
the partial paths for the subqueries, so we can implement the
appendrel as a Parallel Append but only one with non-partial paths
as children.
There are three separate obstacles to removing that limitation.
First, when planning a subquery, propagate any partial paths to the
final_rel so that they are potentially visible to outer query levels
(but not if they have initPlans attached, because that wouldn't be
safe). Second, after planning a subquery, propagate any partial paths
for the final_rel to the subquery RTE in the outer query level in the
same way we do for non-partial paths. Third, teach finalize_plan() to
account for the possibility that the fake parameter we use for rescan
signalling when the plan contains a Gather (Merge) node may be
propagated from an outer query level.
Patch by me, reviewed and tested by Amit Khandekar, Rajkumar
Raghuwanshi, and Ashutosh Bapat. Test cases based on examples by
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CA+Tgmoa6L9A1nNCk3aTDVZLZ4KkHDn1+tm7mFyFvP+uQPS7bAg@mail.gmail.com
Commit 1804284042e659e7d16904e7bbb0ad546394b6a3 established that single-batch
parallel-aware hash joins could create one large shared hash table using the
combined work_mem budget of all participants. The costing accidentally
assumed that parallel-oblivious hash joins could also do that. The
documentation for initial_cost_hashjoin() also failed to mention the new
argument. Repair.
Author: Thomas Munro
Reported-By: Antonin Houska
Reviewed-By: Antonin Houska
Discussion: https://postgr.es/m/12441.1513935950%40localhost
Since 9.5, it's possible that some but not all columns of an index
support returning the indexed value for index-only scans. If the
same indexed column appears in index columns that behave both ways,
check_index_only() supposed that it'd be OK to do an index-only scan
testing that column; but that fails if we have to recheck the indexed
condition on one of the columns that doesn't support this.
In principle we could make this work by remapping the recheck expressions
to pull the value from a column that does support returning the indexed
value. But such cases are so weird and rare that, at least for now,
it doesn't seem worth the trouble. Instead, just teach check_index_only
that a value is returnable only if all the index columns containing it
are returnable, rather than any of them.
Per report from David Pereiro Lagares. Back-patch to 9.5 where the
possibility of this situation appeared.
Kyotaro Horiguchi
Discussion: https://postgr.es/m/1516210494.1798.16.camel@nlpgo.com
Up until now, we've abused grouped_rel->partial_pathlist as a place to
store partial paths that have been partially aggregate, but that's
really not correct, because a partial path for a relation is supposed
to be one which produces the correct results with the addition of only
a Gather or Gather Merge node, and these paths also require a Finalize
Aggregate step. Instead, add a new partially_group_rel which can hold
either partial paths (which need to be gathered and then have
aggregation finalized) or non-partial paths (which only need to have
aggregation finalized). This allows us to reuse generate_gather_paths
for partially_grouped_rel instead of writing new code, so that this
patch actually basically no net new code while making things cleaner,
simplifying things for pending patches for partition-wise aggregate.
Robert Haas and Jeevan Chalke. The larger patch series of which this
patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CA+TgmobrzFYS3+U8a_BCy3-hOvh5UyJbC18rEcYehxhpw5=ETA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZyQEjdBNuoG9-wC5GQ5GrO4544Myo13dVptvx+uLg9uQ@mail.gmail.com
Given overlapping or partially redundant join clauses, for example
t1 JOIN t2 ON t1.a = t2.x AND t1.b = t2.x
the planner's EquivalenceClass machinery will ordinarily refactor the
clauses as "t1.a = t1.b AND t1.a = t2.x", so that join processing doesn't
see multiple references to the same EquivalenceClass in a list of join
equality clauses. However, if the join is outer, it's incorrect to derive
a restriction clause on the outer side from the join conditions, so the
clause refactoring does not happen and we end up with overlapping join
conditions. The code that attempted to deal with such cases had several
subtle bugs, which could result in "left and right pathkeys do not match in
mergejoin" or "outer pathkeys do not match mergeclauses" planner errors,
if the selected join plan type was a mergejoin. (It does not appear that
any actually incorrect plan could have been emitted.)
The core of the problem really was failure to recognize that the outer and
inner relations' pathkeys have different relationships to the mergeclause
list. A join's mergeclause list is constructed by reference to the outer
pathkeys, so it will always be ordered the same as the outer pathkeys, but
this cannot be presumed true for the inner pathkeys. If the inner sides of
the mergeclauses contain multiple references to the same EquivalenceClass
({t2.x} in the above example) then a simplistic rendering of the required
inner sort order is like "ORDER BY t2.x, t2.x", but the pathkey machinery
recognizes that the second sort column is redundant and throws it away.
The mergejoin planning code failed to account for that behavior properly.
One error was to try to generate cut-down versions of the mergeclause list
from cut-down versions of the inner pathkeys in the same way as the initial
construction of the mergeclause list from the outer pathkeys was done; this
could lead to choosing a mergeclause list that fails to match the outer
pathkeys. The other problem was that the pathkey cross-checking code in
create_mergejoin_plan treated the inner and outer pathkey lists
identically, whereas actually the expectations for them must be different.
That led to false "pathkeys do not match" failures in some cases, and in
principle could have led to failure to detect bogus plans in other cases,
though there is no indication that such bogus plans could be generated.
Reported by Alexander Kuzmenkov, who also reviewed this patch. This has
been broken for years (back to around 8.3 according to my testing), so
back-patch to all supported branches.
Discussion: https://postgr.es/m/5dad9160-4632-0e47-e120-8e2082000c01@postgrespro.ru
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
The previous code assumed that we'd always succeed in creating
child-joins for a joinrel for which partition-wise join was considered,
but that's not guaranteed, at least in the case where dummy rels
are involved.
Ashutosh Bapat, with some wordsmithing by me.
Discussion: http://postgr.es/m/CAFjFpRf8=uyMYYfeTBjWDMs1tR5t--FgOe2vKZPULxxdYQ4RNw@mail.gmail.com
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
When an UPDATE causes a row to no longer match the partition
constraint, try to move it to a different partition where it does
match the partition constraint. In essence, the UPDATE is split into
a DELETE from the old partition and an INSERT into the new one. This
can lead to surprising behavior in concurrency scenarios because
EvalPlanQual rechecks won't work as they normally did; the known
problems are documented. (There is a pending patch to improve the
situation further, but it needs more review.)
Amit Khandekar, reviewed and tested by Amit Langote, David Rowley,
Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro
Herrera, Amit Kapila, and me. A few final revisions by me.
Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
- Remove unnecessary #include mistakenly added in execnodes.h.
- Fix mistake in comment in choose_next_subplan_for_leader.
- Adjust row estimates in cost_append for a possibly-different
parallel divisor.
- Clamp row estimates in cost_append after operations that may
not produce integers.
Amit Kapila, with cosmetic adjustments by me.
Discussion: http://postgr.es/m/CAA4eK1+qcbeai3coPpRW=GFCzFeLUsuY4T-AKHqMjxpEGZBPQg@mail.gmail.com
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.comhttps://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
Hash partitioning is useful when you want to partition a growing data
set evenly. This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join.
At present, we still depend on constraint exclusion for partitioning
pruning, and the shape of the partition constraints for hash
partitioning is such that that doesn't work. Work is underway to fix
that, which should both improve performance and make partitioning
pruning work with hash partitioning.
Amul Sul, reviewed and tested by Dilip Kumar, Ashutosh Bapat, Yugo
Nagata, Rajkumar Raghuwanshi, Jesper Pedersen, and by me. A few
final tweaks also by me.
Discussion: http://postgr.es/m/CAAJ_b96fhpJAP=ALbETmeLk1Uni_GFZD938zgenhF49qgDTjaQ@mail.gmail.com
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
For some reason, we have never accounted for either the evaluation cost
or the selectivity of filter conditions attached to Agg and Group nodes
(which, in practice, are always conditions from a HAVING clause).
Applying our regular selectivity logic to post-grouping conditions is a
bit bogus, but it's surely better than taking the selectivity as 1.0.
Perhaps someday the extended-statistics mechanism can be taught to provide
statistics that would help us in getting non-default estimates here.
Per a gripe from Benjamin Coutu. This is surely a bug fix, but I'm
hesitant to back-patch because of the prospect of destabilizing existing
plan choices. Given that it took us this long to notice the bug, it's
probably not hurting too many people in the field.
Discussion: https://postgr.es/m/20968.1509486337@sss.pgh.pa.us
If the operator is a strict btree equality operator, and X isn't volatile,
then the clause must yield true for any non-null value of X, or null if X
is null. At top level of a WHERE clause, we can ignore the distinction
between false and null results, so it's valid to simplify the clause to
"X IS NOT NULL". This is a useful improvement mainly because we'll get
a far better selectivity estimate in most cases.
Because such cases seldom arise in well-written queries, it is unappetizing
to expend a lot of planner cycles looking for them ... but it turns out
that there's a place we can shoehorn this in practically for free, because
equivclass.c already has to detect and reject candidate equivalences of the
form X = X. That doesn't catch every place that it would be valid to
simplify to X IS NOT NULL, but it catches the typical case. Working harder
doesn't seem justified.
Patch by me, reviewed by Petr Jelinek
Discussion: https://postgr.es/m/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
Allowing arrays with a domain type as their element type was left un-done
in the original domain patch, but not for any very good reason. This
omission leads to such surprising results as array_agg() not working on
a domain column, because the parser can't identify a suitable output type
for the polymorphic aggregate.
In order to fix this, first clean up the APIs of coerce_to_domain() and
some internal functions in parse_coerce.c so that we consistently pass
around a CoercionContext along with CoercionForm. Previously, we sometimes
passed an "isExplicit" boolean flag instead, which is strictly less
information; and coerce_to_domain() didn't even get that, but instead had
to reverse-engineer isExplicit from CoercionForm. That's contrary to the
documentation in primnodes.h that says that CoercionForm only affects
display and not semantics. I don't think this change fixes any live bugs,
but it makes things more consistent. The main reason for doing it though
is that now build_coercion_expression() receives ccontext, which it needs
in order to be able to recursively invoke coerce_to_target_type().
Next, reimplement ArrayCoerceExpr so that the node does not directly know
any details of what has to be done to the individual array elements while
performing the array coercion. Instead, the per-element processing is
represented by a sub-expression whose input is a source array element and
whose output is a target array element. This simplifies life in
parse_coerce.c, because it can build that sub-expression by a recursive
invocation of coerce_to_target_type(). The executor now handles the
per-element processing as a compiled expression instead of hard-wired code.
The main advantage of this is that we can use a single ArrayCoerceExpr to
handle as many as three successive steps per element: base type conversion,
typmod coercion, and domain constraint checking. The old code used two
stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
inefficient, and adding yet another array deconstruction to do domain
constraint checking seemed very unappetizing.
In the case where we just need a single, very simple coercion function,
doing this straightforwardly leads to a noticeable increase in the
per-array-element runtime cost. Hence, add an additional shortcut evalfunc
in execExprInterp.c that skips unnecessary overhead for that specific form
of expression. The runtime speed of simple cases is within 1% or so of
where it was before, while cases that previously required two levels of
array processing are significantly faster.
Finally, create an implicit array type for every domain type, as we do for
base types, enums, etc. Everything except the array-coercion case seems
to just work without further effort.
Tom Lane, reviewed by Andrew Dunstan
Discussion: https://postgr.es/m/9852.1499791473@sss.pgh.pa.us
These variables are only ever written to in assertion-enabled builds,
and the latest Microsoft compilers complain about such variables in
non-assertion-enabled builds.
Apparently they don't worry so much about variables that are written to
but not read from, so most of our PG_USED_FOR_ASSERTS_ONLY variables
don't cause the problem.
Discussion: https://postgr.es/m/7800.1505950322@sss.pgh.pa.us
Flattening the partitioning hierarchy at this stage makes various
desirable optimizations difficult. The original use case for this
patch was partition-wise join, which wants to match up the partitions
in one partitioning hierarchy with those in another such hierarchy.
However, it now seems that it will also be useful in making partition
pruning work using the PartitionDesc rather than constraint exclusion,
because with a flattened expansion, we have no easy way to figure out
which PartitionDescs apply to which leaf tables in a multi-level
partition hierarchy.
As it turns out, we end up creating both rte->inh and !rte->inh RTEs
for each intermediate partitioned table, just as we previously did for
the root table. This seems unnecessary since the partitioned tables
have no storage and are not scanned. We might want to go back and
rejigger things so that no partitioned tables (including the parent)
need !rte->inh RTEs, but that seems to require some adjustments not
related to the core purpose of this patch.
Ashutosh Bapat, reviewed by me and by Amit Langote. Some final
adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRd=1venqLL7oGU=C1dEkuvk2DJgvF+7uKbnPHaum1mvHQ@mail.gmail.com
Historically, the selectivity functions have simply not distinguished
< from <=, or > from >=, arguing that the fraction of the population that
satisfies the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV will
take care of things properly.) But that isn't really true unless we're
dealing with a continuum of variable values, and in practice we seldom are.
If "x = const" would estimate a nonzero number of rows for a given const
value, then it follows that we ought to estimate different numbers of rows
for "x < const" and "x <= const", even if the const is not one of the MCVs.
Handling this more honestly makes a significant difference in edge cases,
such as the estimate for a tight range (x BETWEEN y AND z where y and z
are close together).
Hence, split scalarltsel into scalarltsel/scalarlesel, and similarly
split scalargtsel into scalargtsel/scalargesel. Adjust <= and >=
operator definitions to reference the new selectivity functions.
Improve the core ineq_histogram_selectivity() function to make a
correction for equality. (Along the way, I learned quite a bit about
exactly why that function gives good answers, which I tried to memorialize
in improved comments.)
The corresponding join selectivity functions were, and remain, just stubs.
But I chose to split them similarly, to avoid confusion and to prevent the
need for doing this exercise again if someone ever makes them less stubby.
In passing, change ineq_histogram_selectivity's clamp for extreme
probability estimates so that it varies depending on the histogram
size, instead of being hardwired at 0.0001. With the default histogram
size of 100 entries, you still get the old clamp value, but bigger
histograms should allow us to put more faith in edge values.
Tom Lane, reviewed by Aleksander Alekseev and Kuntal Ghosh
Discussion: https://postgr.es/m/12232.1499140410@sss.pgh.pa.us
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
Instead of duplicating the logic to search for a matching
ParamPathInfo in multiple places, factor it out into a separate
function.
Pass only the relevant bits of the PartitionKey to
partition_bounds_equal instead of the whole thing, because
partition-wise join will want to call this without having a
PartitionKey available.
Adjust allow_star_schema_join and calc_nestloop_required_outer
to take relevant Relids rather than the entire Path, because
partition-wise join will want to call it with the top-parent
relids to determine whether a child join is allowable.
Ashutosh Bapat. Review and testing of the larger patch set of which
this is a part by Amit Langote, Rajkumar Raghuwanshi, Rafia Sabih,
Thomas Munro, Dilip Kumar, and me.
Discussion: http://postgr.es/m/CA+TgmobQK80vtXjAsPZWWXd7c8u13G86gmuLupN+uUJjA+i4nA@mail.gmail.com
Currently, child relations are always base relations, so when we
translate parent relids to child relids, we only need to translate
a singler relid. However, the proposed partition-wise join feature
will create child joins, which will mean we need to translate a set
of parent relids to the corresponding child relids. This is
preliminary refactoring to make that possible.
Ashutosh Bapat. Review and testing of the larger patch set of which
this is a part by Amit Langote, Rajkumar Raghuwanshi, Rafia Sabih,
Thomas Munro, Dilip Kumar, and me. Some adjustments, mostly
cosmetic, by me.
Discussion: http://postgr.es/m/CA+TgmobQK80vtXjAsPZWWXd7c8u13G86gmuLupN+uUJjA+i4nA@mail.gmail.com
Add error checks in some places that were calling get_opfamily_member
or get_opfamily_proc and just assuming that the call could never fail.
Also, standardize the wording for such errors in some other places.
None of these errors are expected in normal use, hence they're just
elog not ereport. But they may be handy for diagnosing omissions in
custom opclasses.
Rushabh Lathia found the oversight in RelationBuildPartitionKey();
I found the others by grepping for all callers of these functions.
Discussion: https://postgr.es/m/CAGPqQf2R9Nk8htpv0FFi+FP776EwMyGuORpc9zYkZKC8sFQE3g@mail.gmail.com
Add missing infrastructure for this node type, notably in ruleutils.c where
its lack could demonstrably cause EXPLAIN to fail. Add outfuncs/readfuncs
support. (outfuncs support is useful today for debugging purposes. The
readfuncs support may never be needed, since at present it would only
matter for parallel query and NextValueExpr should never appear in a
parallelizable query; but it seems like a bad idea to have a primnode type
that isn't fully supported here.) Teach planner infrastructure that
NextValueExpr is a volatile, parallel-unsafe, non-leaky expression node
with cost cpu_operator_cost. Given its limited scope of usage, there
*might* be no live bug today from the lack of that knowledge, but it's
certainly going to bite us on the rear someday. Teach pg_stat_statements
about the new node type, too.
While at it, also teach cost_qual_eval() that MinMaxExpr, SQLValueFunction,
XmlExpr, and CoerceToDomain should be charged as cpu_operator_cost.
Failing to do this for SQLValueFunction was an oversight in my commit
0bb51aa96. The others are longer-standing oversights, but no time like the
present to fix them. (In principle, CoerceToDomain could have cost much
higher than this, but it doesn't presently seem worth trying to examine the
domain's constraints here.)
Modify execExprInterp.c to execute NextValueExpr as an out-of-line
function; it seems quite unlikely to me that it's worth insisting that
it be inlined in all expression eval methods. Besides, providing the
out-of-line function doesn't stop anyone from inlining if they want to.
Adjust some places where NextValueExpr support had been inserted with the
aid of a dartboard rather than keeping it in the same order as elsewhere.
Discussion: https://postgr.es/m/23862.1499981661@sss.pgh.pa.us