When decoding the old version of an UPDATE or DELETE change, if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple logged for
logical decoding; the replica identity is logged as one piece. With the
default replica identity only the key columns are logged, and btree's
size limit keeps those tuples small, but that's not the case for
REPLICA IDENTITY FULL.
Change the tuple buffer infrastructure to separately allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, since we need to store the allocated size someplace. To
avoid requiring output plugins to recompile, don't store
HeapTupleHeaderData directly after HeapTupleData, but point to it via
t_data; that leaves room for the allocated size. As there's no reason
for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
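A sketch of the revised buffer layout (comments abbreviated):

    typedef struct ReorderBufferTupleBuf
    {
        /* position in preallocated list */
        slist_node      node;

        /* tuple header; its t_data points at the area below */
        HeapTupleData   tuple;

        /* pre-allocated size of the tuple buffer, not the tuple size */
        Size            alloc_tuple_size;

        /* the actual tuple data (HeapTupleHeaderData) follows */
    } ReorderBufferTupleBuf;

    /* pointer to the data stored in a TupleBuf */
    #define ReorderBufferTupleBufData(p) \
        ((HeapTupleHeader) MAXALIGN(((char *) p) + sizeof(ReorderBufferTupleBuf)))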
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
Somehow I managed to flip the order of restoring old & new tuples when
de-spooling a change in a large transaction from disk. This happens to
only take effect when a change is spooled to disk which has old/new
versions of the tuple. That is only the case for UPDATEs where the
primary key changed or where the replica identity has been changed to
FULL.
The tests didn't catch this because either spooled updates, or updates
that changed primary keys, were tested; not both at the same time.
Found while adding tests for the following commit.
Backpatch: 9.4, where logical decoding was added
Logical decoding's reorderbuffer keeps transactions in an LSN-ordered
list for efficiency. To make that efficiently possible, upper-level
xids are forced to be logged before nested subtransaction xids. That
only works, though, if all such records are actually looked at:
unfortunately we didn't do so for e.g. row-level locks, which are
otherwise uninteresting for logical decoding.
This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".
It's not sufficient to just look at row-locking records; the xid could
appear first due to many other record types (which will trigger the
transaction to be marked as logged via MarkCurrentTransactionIdLoggedIfAny()).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.
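A sketch of the new entry point (abbreviated):

    /* reorderbuffer.c: note that 'xid' was seen at 'lsn', creating a
     * toplevel ReorderBufferTXN for it if there isn't one yet */
    void
    ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid,
                            XLogRecPtr lsn)
    {
        /* many records carry no xid at all; centralize that check here */
        if (xid != InvalidTransactionId)
            ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    }

The per-record-type decode routines now call this even for records they
otherwise ignore, e.g. row-level lock records.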
Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
Previously, we included NULLS FIRST when appropriate but relied on the
default behavior being NULLS LAST. That is, however, not true for a
sort in descending order, and it seems like a fragile assumption anyway.
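The deparsed ORDER BY now spells out both the direction and the nulls
ordering explicitly; roughly:

    if (pathkey->pk_strategy == BTLessStrategyNumber)
        appendStringInfoString(buf, " ASC");
    else
        appendStringInfoString(buf, " DESC");

    if (pathkey->pk_nulls_first)
        appendStringInfoString(buf, " NULLS FIRST");
    else
        appendStringInfoString(buf, " NULLS LAST");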
Report by Rajkumar Raghuwanshi. Patch by Ashutosh Bapat. Review
comments from Michael Paquier and Tom Lane.
Otherwise running installcheck-force on a server with
synchronous_commit=off will result in the tests failing. All the other
tests already do so...
Backpatch: 9.4, where logical decoding was added
When adding replication origins in 5aa235042, I somehow managed to set
the timestamp of decoded transactions to InvalidXLogRecPtr when decoding
one made without a replication origin. Fix that, and the wrong type of
the new commit_time variable.
This didn't trigger a regression test failure because we explicitly
don't show commit timestamps in the regression tests, as they obviously
are variable. Add a test that checks that a decoded commit's timestamp
is within minutes of NOW() from before the commit.
Reported-By: Weiping Qu
Diagnosed-By: Artur Zakirov
Discussion: 56D4197E.9050706@informatik.uni-kl.de,
56D42918.1010108@postgrespro.ru
Backpatch: 9.5, where 5aa235042 originates.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.
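Concretely, each heap page now has two bits in the visibility map
instead of one; roughly:

    /* per-heap-page bits in the visibility map */
    #define VISIBILITYMAP_ALL_VISIBLE   0x01
    #define VISIBILITYMAP_ALL_FROZEN    0x02

    /* number of visibility-map bits per heap page, up from 1 */
    #define BITS_PER_HEAPBLOCK          2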
A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations. A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be. This commit does not attempt
that optimization, although that optimization is the goal here. It
seems better to get the basic infrastructure in place first.
Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format. That,
too, will be handled in a follow-on patch.
Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
This is basically a bug fix; the old code assumes that a ForeignScan
is always parallel-safe, but for postgres_fdw, for example, this is
definitely false. It should be true for file_fdw, though, since a
worker can read a file from the filesystem just as well as any other
backend process.
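For file_fdw the new parallel-safety callback can simply say yes; a
minimal sketch:

    static bool
    fileIsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
                                  RangeTblEntry *rte)
    {
        /* a worker can read the flat file just like the leader can */
        return true;
    }

    /* hooked up in file_fdw's FdwRoutine alongside the other callbacks */
    fdwroutine->IsForeignScanParallelSafe = fileIsForeignScanParallelSafe;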
Original patch by Thomas Munro. Documentation, and changes to the
comments, by me.
list_concat(list_concat(a, b), c) destructively changes both a and b;
to avoid such perils, copy lists of remote_conds before incorporating
them into larger lists via list_concat().
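To illustrate (variable names here are only for the example):

    /* hazardous: list_concat() scribbles on its first argument, so the
     * cached remote_conds lists get modified in place */
    conds = list_concat(list_concat(fpinfo_o->remote_conds,
                                    fpinfo_i->remote_conds),
                        other_conds);

    /* safe: work on copies, leaving the cached lists untouched */
    conds = list_concat(list_copy(fpinfo_o->remote_conds),
                        list_copy(fpinfo_i->remote_conds));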
Ashutosh Bapat, per a report from Etsuro Fujita
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
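Roughly, the new struct looks like this (additional fields may be added
later, per the above):

    typedef struct PathTarget
    {
        List       *exprs;      /* list of expressions to be computed */
        QualCost    cost;       /* cost of evaluating the expressions */
        int         width;      /* estimated avg width of result tuples */
    } PathTarget;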
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also to PlaceHolderVars, where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
Dead or half-dead index leaf pages were incorrectly reported as live, as a
consequence of a code rearrangement I made (during a moment of severe brain
fade, evidently) in commit d287818eb514d431.
The index metapage was not counted in index_size, causing that result to
not agree with the actual index size on-disk.
Index root pages were not counted in internal_pages, which is inconsistent
compared to the case of a root that's also a leaf (one-page index), where
the root would be counted in leaf_pages. Aside from that inconsistency,
this could lead to additional transient discrepancies between the reported
page counts and index_size, since it's possible for pgstatindex's scan to
see zero or multiple pages marked as BTP_ROOT, if the root moves due to
a split during the scan. With these fixes, index_size will always be
exactly one page more than the sum of the displayed page counts.
Also, the index_size result was incorrectly documented as being measured in
pages; it's always been measured in bytes. (While fixing that, I couldn't
resist doing some small additional wordsmithing on the pgstattuple docs.)
Including the metapage causes the reported index_size to not be zero for
an empty index. To preserve the desired property that the pgstattuple
regression test results are platform-independent (ie, BLCKSZ configuration
independent), scale the index_size result in the regression tests.
The documentation issue was reported by Otsuka Kenji, and the inconsistent
root page counting by Peter Geoghegan; the other problems noted by me.
Back-patch to all supported branches, because this has been broken for
a long time.
If we've got a relatively straightforward join between two tables,
this pushes that join down to the remote server instead of fetching
the rows for each table and performing the join locally. Some cases
are not handled yet, such as SEMI and ANTI joins. Also, we don't
yet attempt to create presorted join paths or parameterized join
paths even though these options do get tried for a base relation
scan. Nevertheless, this seems likely to be a very significant win
in many practical cases.
Shigeru Hanada and Ashutosh Bapat, reviewed by Robert Haas, with
additional review at various points by Tom Lane, Etsuro Fujita,
KaiGai Kohei, and Jeevan Chalke.
deparseReturningList ended up adding RETURNING NULL to the code, but
code elsewhere saw an empty list of attributes and concluded that it
should not expect tuples from the remote side.
Etsuro Fujita and Robert Haas, reviewed by Thom Brown
The previous RequestAddinLWLocks() method had several disadvantages.
First, the locks would be in the main tranche; we've recently decided
that it's useful for LWLocks used for separate purposes to have
separate tranche IDs. Second, there wasn't any correlation between
what code called RequestAddinLWLocks() and what code called
LWLockAssign(); when multiple modules are in use, it could become
quite difficult to troubleshoot problems where LWLockAssign() ran out
of locks. To fix, create a concept of named LWLock tranches which
can be used either by extensions or by core code.
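From an extension, usage looks roughly like this (the tranche name is
just an example):

    /* in _PG_init(), while shared_preload_libraries is being processed */
    RequestNamedLWLockTranche("my_extension", 1);

    /* later, e.g. from the shmem startup hook */
    LWLockPadded *locks = GetNamedLWLockTranche("my_extension");
    LWLock       *my_lock = &locks[0].lock;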
Amit Kapila and Robert Haas
Commit e09996ff8dee3f70 removed some ad-hoc code in hstore_to_json_loose
that determined whether an hstore value string looked like a number,
in favor of calling the JSON parser's is-it-a-number code. However,
it neglected the fact that the exact same code appeared in
hstore_to_jsonb_loose.
This is not a bug, exactly, because the requirements on the two functions
are not the same: hstore_to_json_loose must accept only syntactically legal
JSON numbers as numbers, or it will produce invalid JSON output, as per bug
#12070 which spawned the prior commit. But hstore_to_jsonb_loose could
accept anything that numeric_in will eat, other than Inf and NaN.
Nonetheless it seems surprising and arbitrary that the two functions don't
use the same rules for what is a number versus what is a string; especially
since they did use the same rules before the aforesaid commit. For one
thing, that means that doing hstore_to_json_loose and then casting to jsonb
can produce results different from doing just hstore_to_jsonb_loose.
Hence, change hstore_to_jsonb_loose's logic to match hstore_to_json_loose,
ie, hstore values are treated as numbers when they match the JSON syntax
for numbers.
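Schematically (val and vallen here stand for the hstore value's bytes
and length), both functions now gate on the JSON lexer's test:

    /* treat the value as a number only if the JSON lexer would accept it */
    if (IsValidJsonNumber(val, vallen))
        v.type = jbvNumeric;    /* convert via numeric_in, emit as numeric */
    else
        v.type = jbvString;     /* otherwise keep it as a jsonb string */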
No back-patch, since this is more in the nature of a definitional change
than a bug fix.
Remove duplicate assignment. This part by Ashutosh Bapat.
Remove now-obsolete comment. This part by me, although the pending
join pushdown patch does something similar, and for the same reason:
there's no reason to keep two lists of the things in the fdw_private
structure that have to be kept in sync with each other.
The default fetch size of 100 rows might not be right in every
environment, so allow users to configure it.
Corey Huinker, reviewed by Kyotaro Horiguchi, Andres Freund, and me.
Commit e09996ff8dee3f70 was one brick shy of a load: it didn't insist
that the detected JSON number be the whole of the supplied string.
This allowed inputs such as "2016-01-01" to be misdetected as valid JSON
numbers. Per bug #13906 from Dmitry Ryabov.
In passing, be more wary of zero-length input (I'm not sure this can
happen given current callers, but better safe than sorry), and do some
minor cosmetic cleanup.
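A hedged sketch of the tightened test, assuming the number lexer reports
how much of the input it consumed:

    /* in IsValidJsonNumber(): */
    if (len <= 0)
        return false;           /* an empty string is surely not a number */
    json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error, &total_len);
    /* valid only if the whole string was consumed as a single number */
    return !numeric_error && total_len == dummy_lex.input_length;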
The code that generates a complete SQL query for a given foreign relation
was repeated in two places, and they didn't quite agree: the EXPLAIN case
left out the locking clause. Centralize the code so we get the same
behavior everywhere, and adjust calling conventions and which functions
are static vs. extern accordingly.
Ashutosh Bapat, reviewed and slightly adjusted by me.
listForeignTables' invocation of processSQLNamePattern did not match up
with the other ones that handle potentially-schema-qualified names; it
failed to make use of pg_table_is_visible() and also passed the name
arguments in the wrong order. Bug seems to have been aboriginal in commit
0d692a0dc9f0e532. It accidentally sort of worked as long as you didn't
inquire too closely into the behavior, although the silliness was later
exposed by inconsistencies in the test queries added by 59efda3e50ca4de6
(which I probably should have questioned at the time, but didn't).
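The corrected invocation looks roughly like the other backslash-command
helpers (schema variable before name variable, plus a visibility test):

    processSQLNamePattern(pset.db, &buf, pattern, true, false,
                          "n.nspname", "c.relname", NULL,
                          "pg_catalog.pg_table_is_visible(c.oid)");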
Per bug #13899 from Reece Hart. Patch by Reece Hart and Tom Lane.
Back-patch to all affected branches.
The upcoming patch to allow join pushdown in postgres_fdw needs to use
this code multiple times, which requires moving it to deparse.c. That
seems like a good idea anyway, so do that now both on general principle
and to simplify the future patch.
Inspired by a patch by Shigeru Hanada and Ashutosh Bapat, but I did
it a little differently than what that patch did.
Previously, postgres_fdw's connection cache was keyed by user OID and
server OID, but this can lead to multiple connections when it's not
really necessary. In particular, if all relevant users are mapped to
the public user mapping, then their connection options are certainly
the same, so one connection can be used for all of them.
While we're cleaning things up here, drop the "server" argument to
GetConnection(), which isn't really needed. This saves a few cycles
because callers no longer have to look this up; the function itself
does, but only when establishing a new connection, not when reusing
an existing one.
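Concretely, the cache key collapses to the user mapping's OID; a sketch:

    /* connection.c: one cache entry per user mapping, not (user, server) */
    typedef Oid ConnCacheKey;

    extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
    /* callers pass only the UserMapping; the server is looked up inside,
       and only when a new connection actually has to be established */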
Ashutosh Bapat, with a few small changes by me.
Commit e529cd4ffa605c6f introduced an Assert requiring NAMEDATALEN to be
less than MAX_LEVENSHTEIN_STRLEN, which has been 255 for a long time.
Since up to that instant we had always allowed NAMEDATALEN to be
substantially more than that, this was ill-advised.
It's debatable whether we need MAX_LEVENSHTEIN_STRLEN at all (versus
putting a CHECK_FOR_INTERRUPTS into the loop), or whether it has to be
so tight; but this patch takes the narrower approach of just not applying
the MAX_LEVENSHTEIN_STRLEN limit to calls from the parser.
Trusting the parser for this seems reasonable, first because the strings
are limited to NAMEDATALEN which is unlikely to be hugely more than 256,
and second because the maximum distance is tightly constrained by
MAX_FUZZY_DISTANCE (though we'd forgotten to make use of that limit in one
place). That means the cost is not really O(mn) but more like O(max(m,n)).
Relaxing the limit for user-supplied calls is left for future research;
given the lack of complaints to date, it doesn't seem very high priority.
In passing, fix confusion between lengths-in-bytes and lengths-in-chars
in comments and error messages.
Per gripe from Kevin Day; solution suggested by Robert Haas. Back-patch
to 9.5 where the unwanted restriction was introduced.
GIN had some minor issues too, mostly using "internal" where something
else would be more appropriate. I went with the same approach as in
9ff60273e35cad6e, namely preferring the opclass' indexed datatype for
arguments that receive an operator RHS value, even if that's not
necessarily what they really are.
Again, this is with an eye to having a uniform rule for ginvalidate()
to check support function signatures.
The conventions specified by the GiST SGML documentation were widely
ignored. For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum. None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures. So let's try to instill
some consistency here.
Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types). Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same. So I've specified a convention
that that's what to do always.
Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that. Standardize
on telling the truth, instead.
Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".
Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.
Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.
So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion. Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
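Each AM's handler now just fills in and returns that struct; an
abbreviated sketch for btree:

    Datum
    bthandler(PG_FUNCTION_ARGS)
    {
        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

        amroutine->amstrategies = BTMaxStrategyNumber;
        amroutine->amsupport = BTNProcs;
        amroutine->amcanorder = true;
        /* ... further capability fields ... */
        amroutine->ambuild = btbuild;
        amroutine->aminsert = btinsert;
        amroutine->amgettuple = btgettuple;
        amroutine->amvalidate = btvalidate;
        /* ... remaining support-function pointers ... */

        PG_RETURN_POINTER(amroutine);
    }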
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
Commit 866566a690bb9916 is insufficient to prevent dump/reload failures
when using transform modules in a database with both plpython2 and
plpython3 installed. The reason is that the transform extension scripts
use DO blocks as a mechanism to pull in the libpython library before
creating the transform function. It's necessary to preload the library
because the dynamic loader won't do it for us on every platform, leading
to "unresolved symbol" failures when the transform library is loaded.
But it's *not* necessary to execute Python code, and doing so will
provoke a multiple-Pythons-are-loaded error even after the preceding
commit.
To fix, use LOAD instead of a DO block. That requires superuser privilege,
but creation of a C function does anyway. It also embeds knowledge of
the underlying library name for each PL language; but that's wired into
the initdb-time contents of pg_pltemplate too, so that doesn't seem like
a large problem either. Note that CREATE TRANSFORM as such doesn't call
the language module at all.
Per a report from Paul Jones. Back-patch to 9.5 where transform modules
were introduced.
The order of inclusion of .o files makes a difference in linker output;
not a functional difference, but still a bitwise difference, which annoys
some packagers who would like reproducible builds.
Report and patch by Christoph Berg
On closer inspection, the reason copyright.pl was missing files is
that it is looking for 'Copyright (c)' and they had 'Copyright (C)'.
Fix that, and update a couple more that grepping for that revealed.
postgresImportForeignSchema pays attention to IMPORT's EXCEPT and LIMIT TO
options, but only as an efficiency hack, not for correctness' sake. The
FDW documentation does explain that, but someone using postgres_fdw.c
as a coding guide might not remember it, so let's add a comment here.
Per question from Regina Obe.
Commit 33bd250f6c4cc309f4eeb657da80f1e7743b3e5c could have done with
some more review:
Adjust coding so that compilers unfamiliar with elog/ereport don't complain
about uninitialized values.
Fix misuse of PG_GETARG_INT16 to retrieve arguments declared as "integer"
at the SQL level. (This was evidently copied from cube_ll_coord and
cube_ur_coord, but those were wrong too.)
Fix non-style-guide-conforming error messages.
Fix underparenthesized if statements, which pgindent would have made a
hash of, and remove some unnecessary parens elsewhere.
Run pgindent over new code.
Revise documentation: repeated accretion of more operators without any
rethinking of the text already there had left things in a bit of a mess.
Merge all the cube operators into one table and adjust surrounding text
appropriately.
David Rowley and Tom Lane
As of commit d5563d7df, psql -c no longer implies -X, but not all of
our regression testing scripts had gotten that memo.
To ensure consistency of results across different developers, make
sure that *all* invocations of psql in all scripts in our tree
use -X, even where this is not what previously happened.
Michael Paquier and Tom Lane
Both the Blowfish and DES implementations of crypt() can take an
arbitrarily long time, depending on the number of rounds specified by
the caller; make sure they can be interrupted.
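The fix is simply to poll for interrupts inside the long-running round
loops; schematically:

    for (i = 0; i < nrounds; i++)
    {
        CHECK_FOR_INTERRUPTS();     /* allow query cancel / backend exit */
        /* ... one round of DES or Blowfish work ... */
    }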
Author: Andreas Karlsson
Reviewer: Jeff Janes
Backpatch to 9.1.
When use_remote_estimate is enabled, consider adding ORDER BY to the
query we send to the remote server so that we can use that ordered
data for a merge join. Commit f18c944b6137329ac4a6b2dce5745c5dc21a8578
arranges to push down the query pathkeys, which seems like the case
most likely to be a win, but testing shows this can sometimes win,
too.
For a regular table, we know which indexes are present and therefore
test whether the ordering provided by each such index is useful. Here,
we take the opposite approach: guess what orderings would be useful if
they could be generated cheaply, and then ask the remote side what those
will cost.
Ashutosh Bapat, with very substantial cosmetic revisions by me. Also
reviewed by Rushabh Lathia.
Introduce distance operators over cubes:
<#>  taxicab distance
<->  Euclidean distance
<=>  Chebyshev distance
Also add kNN support for those distances to the GiST opclass.
Author: Stas Kelvich
Commit e7cb7ee14555cc9c5773e2c102efd6371f6f2005 provided basic
infrastructure for allowing a foreign data wrapper or custom scan
provider to replace a join of one or more tables with a scan.
However, this infrastructure failed to take into account the need
for possible EvalPlanQual rechecks, and ExecScanFetch would fail
an assertion (or just overwrite memory) if such a check was attempted
for a plan containing a pushed-down join. To fix, adjust the EPQ
machinery to skip some processing steps when scanrelid == 0, making
those the responsibility of the scan's recheck method, which also has
the responsibility in this case of correctly populating the relevant
slot.
To allow foreign scans to gain control in the right place to make
use of this new facility, add a new, optional RecheckForeignScan
method. Also, allow a foreign scan to have a child plan, which can
be used to correctly populate the slot (or perhaps for something
else, but this is the only use currently envisioned).
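The new callback is optional, so existing FDWs are unaffected; its shape
is roughly:

    /* in FdwRoutine: re-verify (and if necessary re-populate) 'slot'
       against the quals of the pushed-down join during an EPQ recheck */
    bool (*RecheckForeignScan) (ForeignScanState *node, TupleTableSlot *slot);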
KaiGai Kohei, reviewed by Robert Haas, Etsuro Fujita, and Kyotaro
Horiguchi.