The keys are integers, not strings. The code accidentally worked on
little-endian machines, at least up to 256 distinct record types within
a session, but failed utterly on big-endian. This was unexpectedly
exposed by a test case added by commit 4452000f3, which apparently is the
only parallelizable query in the regression suite that uses more than one
anonymous record type. Fortunately, buildfarm member mandrill is
big-endian and is running with force_parallel_mode on, so it failed.
ExecMakeTableFunctionResult(), which is used in SELECT FROM function(...)
cases, formerly treated a simple NULL output from a function that both
returnsSet and returnsTuple as a violation of the SRF protocol. What seems
better is to treat a NULL output as equivalent to ROW(NULL,NULL,...).
Without this, cases such as SELECT FROM unnest(...) on an array of
composite are vulnerable to unexpected and not-very-helpful failures.
Old code comments here suggested an alternative of just ignoring
simple-NULL outputs, but that doesn't seem very principled.
This change had been hung up for a long time due to uncertainty about
how much we wanted to buy into the equivalence of simple NULL and
ROW(NULL,NULL,...). I think that's been mostly resolved by the discussion
around bug #14235, so let's go ahead and do it.
Per bug #7808 from Joe Van Dyk. Although this is a pretty old report,
fixing it smells a bit more like a new feature than a bug fix, and the
lack of other similar complaints suggests that we shouldn't take much risk
of destabilization by back-patching. (Maybe that could be revisited once
this patch has withstood some field usage.)
Andrew Gierth and Tom Lane
Report: <E1TurJE-0006Es-TK@wrigleys.postgresql.org>
The SQL standard appears to specify that IS [NOT] NULL's tests of field
nullness are non-recursive, ie, we shouldn't consider that a composite
field with value ROW(NULL,NULL) is null for this purpose.
ExecEvalNullTest got this right, but eval_const_expressions did not,
leading to weird inconsistencies depending on whether the expression
was such that the planner could apply constant folding.
Also, adjust the docs to mention that IS [NOT] DISTINCT FROM NULL can be
used as a substitute test if a simple null check is wanted for a rowtype
argument. That motivated reordering things so that IS [NOT] DISTINCT FROM
is described before IS [NOT] NULL. In HEAD, I went a bit further and added
a table showing all the comparison-related predicates.
Per bug #14235. Back-patch to all supported branches, since it's certainly
undesirable that constant-folding should change the semantics.
Report and patch by Andrew Gierth; assorted wordsmithing and revised
regression test cases by me.
Report: <20160708024746.1410.57282@wrigleys.postgresql.org>
We must not push down a foreign join when the foreign tables involved
should be accessed under different user mappings. Previously we tried
to enforce that rule literally during planning, but that meant that the
resulting plans were dependent on the current contents of the
pg_user_mapping catalog, and we had to blow away all cached plans
containing any remote join when anything at all changed in pg_user_mapping.
This could have been improved somewhat, but the fact that a syscache inval
callback has very limited info about what changed made it hard to do better
within that design. Instead, let's change the planner to not consider user
mappings per se, but to allow a foreign join if both RTEs have the same
checkAsUser value. If they do, then they necessarily will use the same
user mapping at runtime, and we don't need to know specifically which one
that is. Post-plan-time changes in pg_user_mapping no longer require any
plan invalidation.
This rule does give up some optimization ability, to wit where two foreign
table references come from views with different owners or one's from a view
and one's directly in the query, but nonetheless the same user mapping
would have applied. We'll sacrifice the first case, but to not regress
more than we have to in the second case, allow a foreign join involving
both zero and nonzero checkAsUser values if the nonzero one is the same as
the prevailing effective userID. In that case, mark the plan as only
runnable by that userID.
The plancache code already had a notion of plans being userID-specific,
in order to support RLS. It was a little confused though, in particular
lacking clarity of thought as to whether it was the rewritten query or just
the finished plan that's dependent on the userID. Rearrange that code so
that it's clearer what depends on which, and so that the same logic applies
to both RLS-injected role dependency and foreign-join-injected role
dependency.
Note that this patch doesn't remove the other issue mentioned in the
original complaint, which is that while we'll reliably stop using a foreign
join if it's disallowed in a new context, we might fail to start using a
foreign join if it's now allowed, but we previously created a generic
cached plan that didn't use one. It was agreed that the chance of winning
that way was not high enough to justify the much larger number of plan
invalidations that would have to occur if we tried to cause it to happen.
In passing, clean up randomly-varying spelling of EXPLAIN commands in
postgres_fdw.sql, and fix a COSTS ON example that had been allowed to
leak into the committed tests.
This reverts most of commits fbe5a3fb7 and 5d4171d1c, which were the
previous attempt at ensuring we wouldn't push down foreign joins that
span permissions contexts.
Etsuro Fujita and Tom Lane
Discussion: <d49c1e5b-f059-20f4-c132-e9752ee0113e@lab.ntt.co.jp>
ExecInsertIndexTuples treated an exclusion constraint as subject to
noDupErr processing even when it was not listed in arbiterIndexes, and
would therefore not error out for a conflict in such a constraint, instead
returning it as an arbiter-index failure. That led to an infinite loop in
ExecInsert, since ExecCheckIndexConstraints ignored the index as-intended
and therefore didn't throw the expected error. To fix, make the exclusion
constraint code path use the same condition as the index_insert call does
to decide whether no-error-for-duplicates behavior is appropriate. While
at it, refactor a little bit to avoid unnecessary list_member_oid calls.
(That surely wouldn't save anything worth noticing, but I find the code
a bit clearer this way.)
Per bug report from Heikki Rauhala. Back-patch to 9.5 where ON CONFLICT
was introduced.
Report: <4C976D6B-76B4-434C-8052-D009F7B7AEDA@reaktor.fi>
The original coding had three separate booleans representing partial
aggregation behavior, which was confusing, unreadable, and error-prone,
not least because the booleans weren't always listed in the same order.
It was also inadequate for the allegedly-desirable future extension to
support intermediate partial aggregation, because we'd need separate
markers for serialization and deserialization in such a case.
Merge these bools into an enum "AggSplit" to provide symbolic names for
the supported operating modes (and document what those are). By assigning
the values of the enum constants carefully, we can treat AggSplit values
as options bitmasks so that tests of what to do aren't noticeably more
expensive than before.
While at it, get rid of Aggref.aggoutputtype. That's not needed since
commit 59a3795c2 got rid of setrefs.c's special-purpose Aggref comparison
code, and it likewise seemed more confusing than helpful.
Assorted comment cleanup as well (there's still more that I want to do
in that line).
catversion bump for change in Aggref node contents. Should be the last
one for partial-aggregation changes.
Discussion: <29309.1466699160@sss.pgh.pa.us>
The original specification for this called for the deserialization function
to have signature "deserialize(serialtype) returns transtype", which is a
security violation if transtype is INTERNAL (which it always would be in
practice) and serialtype is not (which ditto). The patch blithely overrode
the opr_sanity check for that, which was sloppy-enough work in itself,
but the indisputable reason this cannot be allowed to stand is that CREATE
FUNCTION will reject such a signature and thus it'd be impossible for
extensions to create parallelizable aggregates.
The minimum fix to make the signature type-safe is to add a second, dummy
argument of type INTERNAL. But to lock it down a bit more and make misuse
of INTERNAL-accepting functions less likely, let's get rid of the ability
to specify a "serialtype" for an aggregate and just say that the only
useful serialtype is BYTEA --- which, in practice, is the only interesting
value anyway, due to the usefulness of the send/recv infrastructure for
this purpose. That means we only have to allow "serialize(internal)
returns bytea" and "deserialize(bytea, internal) returns internal" as
the signatures for these support functions.
In passing fix bogus signature of int4_avg_combine, which I found thanks
to adding an opr_sanity check on combinefunc signatures.
catversion bump due to removing pg_aggregate.aggserialtype and adjusting
signatures of assorted built-in functions.
David Rowley and Tom Lane
Discussion: <27247.1466185504@sss.pgh.pa.us>
When doing partial aggregation, the args list of the upper (combining)
Aggref node is replaced by a Var representing the output of the partial
aggregation steps, which has either the aggregate's transition data type
or a serialized representation of that. However, nodeAgg.c blindly
continued to use the args list as an indication of the user-level argument
types. This broke resolution of polymorphic transition datatypes at
executor startup (though it accidentally failed to fail for the ANYARRAY
case, which is likely the only one anyone had tested). Moreover, the
constructed FuncExpr passed to the finalfunc contained completely wrong
information, which would have led to bogus answers or crashes for any case
where the finalfunc examined that information (which is only likely to be
with polymorphic aggregates using a non-polymorphic transition type).
As an independent bug, apply_partialaggref_adjustment neglected to resolve
a polymorphic transition datatype before assigning it as the output type
of the lower-level Aggref node. This again accidentally failed to fail
for ANYARRAY but would be unlikely to work in other cases.
To fix the first problem, record the user-level argument types in a
separate OID-list field of Aggref, and look to that rather than the args
list when asking what the argument types were. (It turns out to be
convenient to include any "direct" arguments in this list too, although
those are not currently subject to being overwritten.)
Rather than adding yet another resolve_aggregate_transtype() call to fix
the second problem, add an aggtranstype field to Aggref, and store the
resolved transition type OID there when the planner first computes it.
(By doing this in the planner and not the parser, we can allow the
aggregate's transition type to change from time to time, although no DDL
support yet exists for that.) This saves nothing of consequence for
simple non-polymorphic aggregates, but for polymorphic transition types
we save a catalog lookup during executor startup as well as several
planner lookups that are new in 9.6 due to parallel query planning.
In passing, fix an error that was introduced into count_agg_clauses_walker
some time ago: it was applying exprTypmod() to something that wasn't an
expression node at all, but a TargetEntry. exprTypmod silently returned
-1 so that there was not an obvious failure, but this broke the intended
sensitivity of aggregate space consumption estimates to the typmod of
varchar and similar data types. This part needs to be back-patched.
Catversion bump due to change of stored Aggref nodes.
Discussion: <8229.1466109074@sss.pgh.pa.us>
If a Gather node has read as many tuples as it needs (for example, due
to Limit) it may detach the queue connecting it to the worker before
reading all of the worker's tuples. Rather than let the worker
continue to generate and send all of the results, have it stop after
sending the next tuple.
More could be done here to stop the worker even quicker, but this is
about as well as we can hope to do for 9.6.
This is in response to a problem report from Andreas Seltenreich.
Commit 44339b892a04e94bbb472235882dc6f7023bdc65 should be actually be
sufficient to fix that example even without this change, but it seems
better to do this, too, since we might otherwise waste quite a large
amount of effort in one or more workers.
Discussion: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
Amit Kapila
Fix still another bug in commit 35fcb1b3d: it failed to fully initialize
the SortSupport states it introduced to allow the executor to re-check
ORDER BY expressions containing distance operators. That led to a null
pointer dereference if the sortsupport code tried to use ssup_cxt. The
problem only manifests in narrow cases, explaining the lack of previous
field reports. It requires a GiST-indexable distance operator that lacks
SortSupport and is on a pass-by-ref data type, which among core+contrib
seems to be only btree_gist's interval opclass; and it requires the scan
to be done as an IndexScan not an IndexOnlyScan, which explains how
btree_gist's regression test didn't catch it. Per bug #14134 from
Jihyun Yu.
Peter Geoghegan
Report: <20160511154904.2603.43889@wrigleys.postgresql.org>
Further thought about bug #14174 motivated me to try the case of a
R/W datum being returned from a VALUES list, and sure enough it was
broken. Fix that.
Also add a regression test case exercising the same scenario for
FunctionScan. That's not broken right now, because the function's
result will get shoved into a tuplestore between generation and use;
but it could easily become broken whenever we get around to optimizing
FunctionScan better.
There don't seem to be any other places where we put the result of
expression evaluation into a virtual tuple slot that could then be
the source for Vars of further expression evaluation, so I think
this is the end of this bug.
If a plan node output expression returns an "expanded" datum, and that
output column is referenced in more than one place in upper-level plan
nodes, we need to ensure that what is returned is a read-only reference
not a read/write reference. Otherwise one of the referencing sites could
scribble on or even delete the expanded datum before we have evaluated the
others. Commit 1dc5ebc9077ab742, which introduced this feature, supposed
that it'd be sufficient to make SubqueryScan nodes force their output
columns to read-only state. The folly of that was revealed by bug #14174
from Andrew Gierth, and really should have been immediately obvious
considering that the planner will happily optimize SubqueryScan nodes
out of the plan without any regard for this issue.
The safest fix seems to be to make ExecProject() force its results into
read-only state; that will cover every case where a plan node returns
expression results. Actually we can delegate this to ExecTargetList()
since we can recursively assume that plain Vars will not reference
read-write datums. That should keep the extra overhead down to something
minimal. We no longer need ExecMakeSlotContentsReadOnly(), which was
introduced only in support of the idea that just a few plan node types
would need to do this.
In the future it would be nice to have the planner account for this problem
and inject force-to-read-only expression evaluation nodes into only the
places where there's a risk of multiple evaluation. That's not a suitable
solution for 9.5 or even 9.6 at this point, though.
Report: <20160603124628.9932.41279@wrigleys.postgresql.org>
do_text_output_multiline() would fail (typically with a null pointer
dereference crash) if its input string did not end with a newline. Such
cases do not arise in our current sources; but it certainly could happen
in future, or in extension code's usage of the function, so we should fix
it. To fix, replace "eol += len" with "eol = text + len".
While at it, make two cosmetic improvements: mark the input string const,
and rename the argument from "text" to "txt" to dodge pgindent strangeness
(since "text" is a typedef name).
Even though this problem is only latent at present, it seems like a good
idea to back-patch the fix, since it's a very simple/safe patch and it's
not out of the realm of possibility that we might in future back-patch
something that expects sane behavior from do_text_output_multiline().
Per report from Hao Lee.
Report: <CAGoxFiFPAGyPAJLcFxTB5cGhTW2yOVBDYeqDugYwV4dEd1L_Ag@mail.gmail.com>
That way, if the result overflows size_t, you'll get an error instead
of undefined behavior, which seems like a plus. This also has the
effect of casting the number of workers from int to Size, which is
better because it's harder to overflow int than size_t.
Dilip Kumar reported this issue and provided a patch upon which this
patch is based, but his version did use mul_size.
These adjustments adjust code and comments in minor ways to prevent
pgindent from mangling them. Among other things, I tried to avoid
situations where pgindent would emit "a +b" instead of "a + b", and I
tried to avoid having it break up inline comments across multiple
lines.
In nodeFuncs.c, pgindent wants to introduce spurious indentation into
the definitions of planstate_tree_walker and planstate_walk_subplans.
Fix that by spreading the definition out across several lines, similar
to what is already done for other walker functions in that file.
In execParallel.c, in the definition of SharedExecutorInstrumentation,
pgindent wants to insert more whitespace between the type name and the
member name. That causes it to mangle comments later on the line. Fix
by moving the comments out of line. Now that we have a bit more room,
add some more details that may be useful to the next person reading
this code.
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.
Julien Rouhaud, reviewed by me.
When IF NOT EXISTS was added to CREATE TABLE AS, this logic didn't get
the memo, possibly resulting in an Assert failure. It looks like there
would have been no ill effects in a non-Assert build, though. Back-patch
to 9.5 where the IF NOT EXISTS option was added.
Stas Kelvich
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
resulttypeLen and resulttypeByVal must be set correctly when serializing
aggregates, not just when finalizing them. This was in David's final
patch but I downloaded the wrong version by mistake and failed to spot
the error.
David Rowley
This is necessary infrastructure for supporting parallel aggregation
for aggregates whose transition type is "internal". Such values
can't be passed between cooperating processes, because they are
just pointers.
David Rowley, reviewed by Tomas Vondra and by me.
In commit afb9249d06f47d7a, we (probably I) made ExecLockRows assign
null test tuples to all relations of the query while setting up to do an
EvalPlanQual recheck for a newly-updated locked row. This was sheerest
brain fade: we should only set test tuples for relations that are lockable
by the LockRows node, and in particular empty test tuples are only sensible
for inheritance child relations that weren't the source of the current
tuple from their inheritance tree. Setting a null test tuple for an
unrelated table causes it to return NULLs when it should not, as exhibited
in bug #14034 from Bronislav Houdek. To add insult to injury, doing it the
wrong way required two loops where one would suffice; so the corrected code
is even a bit shorter and faster.
Add a regression test case based on his example, and back-patch to 9.5
where the bug was introduced.
Parallel workers can now partially aggregate the data and pass the
transition values back to the leader, which can combine the partial
results to produce the final answer.
David Rowley, based on earlier work by Haribabu Kommi. Reviewed by
Álvaro Herrera, Tomas Vondra, Amit Kapila, James Sewell, and me.
postgres_fdw can now sent an UPDATE or DELETE statement directly to
the foreign server in simple cases, rather than sending a SELECT FOR
UPDATE statement and then updating or deleting rows one-by-one.
Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro
Horiguchi, Albe Laurenz, Thom Brown, and me.
INSERT ... ON CONFLICT's precheck may have to wait on the outcome of
another insertion, which may or may not itself be a speculative
insertion. This wait is not necessarily associated with an exclusion
constraint, but was always reported that way in log messages if the wait
happened to involve a tuple that had no speculative token.
Initially discovered through use of ON CONFLICT DO NOTHING, where
spurious references to exclusion constraints in log messages were more
likely.
Patch by Peter Geoghegan.
Reviewed by Julien Rouhaud.
Back-patch to 9.5 where INSERT ... ON CONFLICT was added.
Commit 23a27b039d94ba35 widened the rows-stored counters to uint64, but
that's academic unless we allow the tuple pointer array to exceed 1GB.
(It might be a good idea to provide some other limit on how much storage
a SPITupleTable can eat. On the other hand, there are plenty of other
ways to drive a backend into swap hell.)
Dagfinn Ilmari Mannsåker
This patch widens SPI_processed, EState's es_processed field, PortalData's
portalPos field, FuncCallContext's call_cntr and max_calls fields,
ExecutorRun's count argument, PortalRunFetch's result, and the max number
of rows in a SPITupleTable to uint64, and deals with (I hope) all the
ensuing fallout. Some of these values were declared uint32 before, and
others "long".
I also removed PortalData's posOverflow field, since that logic seems
pretty useless given that portalPos is now always 64 bits.
The user-visible results are that command tags for SELECT etc will
correctly report tuple counts larger than 4G, as will plpgsql's GET
GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples
than that are still not exactly the norm, but they're becoming more
common.
Most values associated with FETCH/MOVE distances, such as PortalRun's count
argument and the count argument of most SPI functions that have one, remain
declared as "long". It's not clear whether it would be worth promoting
those to int64; but it would definitely be a large dollop of additional
API churn on top of this, and it would only help 32-bit platforms which
seem relatively less likely to see any benefit.
Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
Originally, we didn't have nworkers_launched, so code that used parallel
contexts had to be preprared for the possibility that not all of the
workers requested actually got launched. But now we can count on knowing
the number of workers that were successfully launched, which can shave
off a few cycles and simplify some code slightly.
Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter
Geoghegan.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.
A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations. A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be. This commit does not attempt
that optimization, although that optimization is the goal here. It
seems better to get the basic infrastructure in place first.
Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format. That,
too, will be handled in a follow-on patch.
Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.
Peter Geoghegan, reviewd by Andreas Karlsson and by me.
Commit 45f6240a8fa9d355 added an assumption in ExecHashIncreaseNumBatches
and ExecHashIncreaseNumBuckets that they could find all tuples in the main
hash table by iterating over the "dense storage" introduced by that patch.
However, ExecHashRemoveNextSkewBucket continued its old practice of simply
re-linking deleted skew tuples into the main table's hashchains. Hence,
such tuples got lost during any subsequent increase in nbatch or nbuckets,
and would never get joined, as reported in bug #13908 from Seth P.
I (tgl) think that the aforesaid commit has got multiple design issues
and should be reworked rather completely; but there is no time for that
right now, so band-aid the problem by making ExecHashRemoveNextSkewBucket
physically copy deleted skew tuples into the "dense storage" arena.
The added test case is able to exhibit the problem by means of fooling the
planner with a WHERE condition that it will underestimate the selectivity
of, causing the initial nbatch estimate to be too small.
Tomas Vondra and Tom Lane. Thanks to David Johnston for initial
investigation into the bug report.
Commit 30d7ae3c76d2de144232ae6ab328ca86b70e72c3 introduced an HJDEBUG
stanza that probably didn't compile at the time, and definitely doesn't
compile now, because it refers to a nonexistent variable. It doesn't seem
terribly useful anyway, so just get rid of it.
While I'm fooling with it, use %z modifier instead of the obsolete hack of
casting size_t to unsigned long, and include the HashJoinTable's address in
each printout so that it's possible to distinguish the activities of
multiple hashjoins occurring in one query.
Noted while trying to use HJDEBUG to investigate bug #13908. Back-patch
to 9.5, because code that doesn't compile is certainly not very helpful.
This patch doesn't put the new infrastructure to use anywhere, and
indeed it's not clear how it could ever be used for something like
postgres_fdw which has to send an SQL query and wait for a reply,
but there might be FDWs or custom scan providers that are CPU-bound,
so let's give them a way to join club parallel.
KaiGai Kohei, reviewed by me.
Previously, the foreign join pushdown infrastructure left the question
of security entirely up to individual FDWs, but it would be easy for
a foreign data wrapper to inadvertently open up subtle security holes
that way. So, make it the core code's job to determine which user
mapping OID is relevant, and don't attempt join pushdown unless it's
the same for all relevant relations.
Per a suggestion from Tom Lane. Shigeru Hanada and Ashutosh Bapat,
reviewed by Etsuro Fujita and KaiGai Kohei, with some further
changes by me.
The core innovation of this patch is the introduction of the concept
of a partial path; that is, a path which if executed in parallel will
generate a subset of the output rows in each process. Gathering a
partial path produces an ordinary (complete) path. This allows us to
generate paths for parallel joins by joining a partial path for one
side (which at the baserel level is currently always a Partial Seq
Scan) to an ordinary path on the other side. This is subject to
various restrictions at present, especially that this strategy seems
unlikely to be sensible for merge joins, so only nested loops and
hash joins paths are generated.
This also allows an Append node to be pushed below a Gather node in
the case of a partitioned table.
Testing revealed that early versions of this patch made poor decisions
in some cases, which turned out to be caused by the fact that the
original cost model for Parallel Seq Scan wasn't very good. So this
patch tries to make some modest improvements in that area.
There is much more to be done in the area of generating good parallel
plans in all cases, but this seems like a useful step forward.
Patch by me, reviewed by Dilip Kumar and Amit Kapila.
Aggregate nodes now have two new modes: a "partial" mode where they
output the unfinalized transition state, and a "finalize" mode where
they accept unfinalized transition states rather than individual
values as input.
These new modes are not used anywhere yet, but they will be necessary
for parallel aggregation. The infrastructure also figures to be
useful for cases where we want to aggregate local data and remote
data via the FDW interface, and want to bring back partial aggregates
from the remote side that can then be combined with locally generated
partial aggregates to produce the final value. It may also be useful
even when neither FDWs nor parallelism are in play, as explained in
the comments in nodeAgg.c.
David Rowley and Simon Riggs, reviewed by KaiGai Kohei, Heikki
Linnakangas, Haribabu Kommi, and me.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.