Such indexes can only be duplicated leftovers of a previously failed
REINDEX CONCURRENTLY command, and a valid equivalent is guaranteed to
exist. As toast indexes can only be dropped if invalid, reindexing
these would lead to useless duplicated indexes that can't be dropped
anymore, except if the parent relation is dropped.
Thanks to Justin Pryzby for reminding that this problem was reported
long ago during the review of the original patch of REINDEX
CONCURRENTLY, but the issue was never addressed.
Reported-by: Sergei Kornilov, Justin Pryzby
Author: Julien Rouhaud
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/36712441546604286%40sas1-890ba5c2334a.qloud-c.yandex.net
Discussion: https://postgr.es/m/20200216190835.GA21832@telsasoft.com
Backpatch-through: 12
Specifically, this patch allows ALTER TYPE to:
* Change the default TOAST strategy for a toastable base type;
* Promote a non-toastable type to toastable;
* Add/remove binary I/O functions for a type;
* Add/remove typmod I/O functions for a type;
* Add/remove a custom ANALYZE statistics functions for a type.
The first of these can be done by the type's owner; all the others
require superuser privilege since misuse could cause problems.
The main motivation for this patch is to allow extensions to
upgrade the feature sets of their data types, so the set of
alterable properties is biased towards that use-case. However
it's also true that changing some other properties would be
a lot harder, as they get baked into physical storage and/or
stored expressions that depend on the type.
Along the way, refactor GenerateTypeDependencies() to make it easier
to call, refactor DefineType's volatility checks so they can be shared
by AlterType, and teach typcache.c that it might have to reload data
from the type's pg_type row, a scenario it never handled before.
Also rearrange alter_type.sgml a bit for clarity (put the
composite-type operations together).
Tomas Vondra and Tom Lane
Discussion: https://postgr.es/m/20200228004440.b23ein4qvmxnlpht@development
Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code. But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage. It's never
too late to make it better though, so let's do that.
The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.
I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references. But that's not fatal --- there's no expectation that
we'd actually change any of these values. We can clean up stragglers
over time.
Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
The backend was using strings to represent command tags and doing string
comparisons in multiple places, but that's slow and unhelpful. Create a
new command list with a supporting structure to use instead; this is
stored in a tag-list-file that can be tailored to specific purposes with
a caller-definable C macro, similar to what we do for WAL resource
managers. The first first such uses are a new CommandTag enum and a
CommandTagBehavior struct.
Replace numerous occurrences of char *completionTag with a
QueryCompletion struct so that the code no longer stores information
about completed queries in a cstring. Only at the last moment, in
EndCommand(), does this get converted to a string.
EventTriggerCacheItem no longer holds an array of palloc’d tag strings
in sorted order, but rather just a Bitmapset over the CommandTags.
Author: Mark Dilger, with unsolicited help from Álvaro Herrera
Reviewed-by: John Naylor, Tom Lane
Discussion: https://postgr.es/m/981A9DB4-3F0C-4DA5-88AD-CB9CFF4D6CAD@enterprisedb.com
This also involves renaming src/include/utils/hashutils.h, which
becomes src/include/common/hashfn.h. Perhaps an argument can be
made for keeping the hashutils.h name, but it seemed more
consistent to make it match the name of the file, and also more
descriptive of what is actually going on here.
Patch by me, reviewed by Suraj Kharage and Mark Dilger. Off-list
advice on how not to break the Windows build from Davinder Singh
and Amit Kapila.
Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
In certain transient states, it's possible that a table has attributes
with attgenerated set but no default expressions in pg_attrdef yet.
In that case, the old code path would not set
relation->rd_att->constr->has_generated_stored, unless
relation->rd_att->constr was also populated for some other reason.
There was probably no practical impact, but it's better to keep this
consistent.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200115181105.ad6ab6dlgyww3lb6%40alap3.anarazel.de
Commit fc7695891 changed CheckAttributeType to recurse into ranges,
but made it pass down the wrong collation (always InvalidOid, since
ranges as such have no collation). This would result in guaranteed
failure when considering a range type whose subtype is collatable.
Embarrassingly, we lack any regression tests that would expose such
a problem (but fortunately, somebody noticed before we shipped this
bug in any release).
Fix it to pass down the range's subtype collation property instead,
and add some regression test cases to exercise collatable-subtype
ranges a bit more. Back-patch to all supported branches, as the
previous patch was.
Report and patch by Julien Rouhaud, test cases tweaked by me
Discussion: https://postgr.es/m/CAOBaU_aBWqNweiGUFX0guzBKkcfJ8mnnyyGC_KBQmO12Mj5f_A@mail.gmail.com
This follows multiple complains from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug more before actually
happening, if it happens.
Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
In the wake of commit 5b9312378, there's no particular reason
for this restriction (previously, it was problematic because of
the implied rowtype reference). A simple constraint on a whole-row
Var probably isn't that useful, but conceivably somebody would want
to pass one to a function that extracts a partitioning key. Besides
which, we're expending much more code to enforce the restriction than
we save by having it, since the latter quantity is now zero.
So drop the restriction.
Amit Langote
Discussion: https://postgr.es/m/CA+HiwqFUzjfj9HEsJtYWcr1SgQ_=iCAvQ=O2Sx6aQxoDu4OiHw@mail.gmail.com
Formerly the rd_partkey and rd_partdesc data structures were always
populated immediately when a relcache entry was built or rebuilt.
This patch changes things so that they are populated only when they
are first requested. (Hence, callers *must* now always use
RelationGetPartitionKey or RelationGetPartitionDesc; just fetching
the pointer directly is no longer acceptable.)
This seems to have some performance benefits, but the main reason to do
it is that it eliminates a recursive-reload failure that occurs if the
partkey or partdesc expressions contain any references to the relation's
rowtype (as discovered by Amit Langote). In retrospect, since loading
these data structures might result in execution of nearly-arbitrary code
via eval_const_expressions, it was a dumb idea to require that to happen
during relcache entry rebuild.
Also, fix things so that old copies of a relcache partition descriptor
will be dropped when the cache entry's refcount goes to zero. In the
previous coding it was possible for such copies to survive for the
lifetime of the session, as I'd complained of in a previous discussion.
(This management technique still isn't perfect, but it's better than
before.) Improve the commentary explaining how that works and why
it's safe to hand out direct pointers to these relcache substructures.
In passing, improve RelationBuildPartitionDesc by using the same
memory-context-parent-swap approach used by RelationBuildPartitionKey,
thereby making it less dependent on strong assumptions about what
partition_bounds_copy does. Avoid doing get_rel_relkind in the
critical section, too.
Patch by Amit Langote and Tom Lane; Robert Haas deserves some credit
for prior work in the area, too. Although this is a pre-existing
problem, no back-patch: the patch seems too invasive to be safe to
back-patch, and the bug it fixes is a corner case that seems
relatively unlikely to cause problems in the field.
Discussion: https://postgr.es/m/CA+HiwqFUzjfj9HEsJtYWcr1SgQ_=iCAvQ=O2Sx6aQxoDu4OiHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com
The following renaming is done so as source files related to index
access methods are more consistent with table access methods (the
original names used for index AMs ware too generic, and could be
confused as including features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c. Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.
This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.
Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
We implement ON COMMIT DELETE ROWS by truncating tables marked that
way, which requires also truncating/rebuilding their indexes. But
RelationTruncateIndexes asks the relcache for up-to-date copies of any
index expressions, which may cause execution of eval_const_expressions
on them, which can result in actual execution of subexpressions.
This is a bad thing to have happening during ON COMMIT. Manuel Rigger
reported that use of a SQL function resulted in crashes due to
expectations that ActiveSnapshot would be set, which it isn't.
The most obvious fix perhaps would be to push a snapshot during
PreCommit_on_commit_actions, but I think that would just open the door
to more problems: CommitTransaction explicitly expects that no
user-defined code can be running at this point.
Fortunately, since we know that no tuples exist to be indexed, there
seems no need to use the real index expressions or predicates during
RelationTruncateIndexes. We can set up dummy index expressions
instead (we do need something that will expose the right data type,
as there are places that build index tupdescs based on this), and
just ignore predicates and exclusion constraints.
In a green field it'd likely be better to reimplement ON COMMIT DELETE
ROWS using the same "init fork" infrastructure used for unlogged
relations. That seems impractical without catalog changes though,
and even without that it'd be too big a change to back-patch.
So for now do it like this.
Per private report from Manuel Rigger. This has been broken forever,
so back-patch to all supported branches.
When maintaining or merging patches, one of the most common sources
for conflicts are the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.
By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.
Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
f2e40380 introduces support of non-key attributes in GiST indexes. Then if
get_index_column_opclass() is asked by gistproperty() to get an opclass of
non-key column, it returns garbage past oidvector value. This commit fixes
that by making get_index_column_opclass() return InvalidOid in this case.
Discussion: https://postgr.es/m/20190902231948.GA5343%40alvherre.pgsql
Author: Nikita Glukhov, Alexander Korotkov
Backpatch-through: 12
detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.
toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.
heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.
detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap. At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.
Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.
Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
RelationAllowsEarlyPruning() performed a catalog scan, but is used
in two contexts where that was a bad idea:
1. In heap_page_prune_opt(), which runs very frequently in some large
scans. This caused major performance problems in a field report
that was easy to reproduce.
2. In TestForOldSnapshot(), which runs while we hold a buffer content
lock. It's not clear if this was guaranteed to be free of buffer
deadlock risk.
The check was introduced in commit 2cc41acd8 and defended against a
real problem: 9.6's hash indexes have no page LSN and so we can't
allow early pruning (ie the snapshot-too-old feature). We can remove
the check from all later releases though: hash indexes are now logged,
and there is no way to create UNLOGGED indexes on regular logged
tables.
If a future release allows such a combination, it might need to put
a similar check in place, but it'll need some more thought.
Back-patch to 10.
Author: Thomas Munro
Reviewed-by: Tom Lane, who spotted the second problem
Discussion: https://postgr.es/m/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1%2BWy%2BN4eE5zPm765h68LrkWc3Biu_8rzzi%2BOYX4j%2BiHRw%40mail.gmail.com
Commit 4b93f5799 rearranged things in plpgsql to make it cope better with
composite types changing underneath it intra-session. However, I failed to
consider the case of a composite type being dropped and recreated entirely.
In my defense, the previous coding didn't consider that possibility at all
either --- but it would accidentally work so long as you didn't change the
type's field list, because the built-at-compile-time list of component
variables would then still match the type's new definition. The new
coding, however, occasionally tries to re-look-up the type by OID, and
then fails to find the dropped type.
To fix this, we need to save the TypeName struct, and then redo the type
OID lookup from that. Of course that's expensive, so we don't want to do
it every time we need the type OID. This can be fixed in the same way that
4b93f5799 dealt with changes to composite types' definitions: keep an eye
on the type's typcache entry to see if its tupledesc has been invalidated.
(Perhaps, at some point, this mechanism should be generalized so it can
work for non-composite types too; but for now, plpgsql only tries to
cope with intra-session redefinitions of composites.)
I'm slightly hesitant to back-patch this into v11, because it changes
the contents of struct PLpgSQL_type as well as the signature of
plpgsql_build_datatype(), so in principle it could break code that is
poking into the innards of plpgsql. However, the only popular extension
of that ilk is pldebugger, and it doesn't seem to be affected. Since
this is a regression for people who were relying on the old behavior,
it seems worth taking the small risk of causing compatibility issues.
Per bug #15913 from Daniel Fiori. Back-patch to v11 where 4b93f5799
came in.
Discussion: https://postgr.es/m/15913-a7e112e16dedcffc@postgresql.org
The rd_amcache allows an index AM to cache arbitrary information in a
relcache entry. This commit moves the cleanup of rd_amcache so that it
can also be used by table AMs. Nothing takes advantage of that yet, but
I'm sure it'll come handy for anyone writing new table AMs.
Backpatch to v12, where table AM interface was introduced.
Reviewed-by: Julien Rouhaud
heap.c and relcache.c contained nearly identical copies of logic
to insert OIDs into an OID list while preserving the list's OID
ordering (and rejecting duplicates, in one case but not the other).
The comments argue that this is faster than qsort for small numbers
of OIDs, which is at best unproven, and seems even less likely to be
true now that lappend_cell_oid has to move data around. In any case
it's ugly and hard-to-follow code, and if we do have a lot of OIDs
to consider, it's O(N^2).
Hence, replace with simply lappend'ing OIDs to a List, then list_sort
the completed List, then remove adjacent duplicates if necessary.
This is demonstrably O(N log N) and it's much simpler for the
callers. It's possible that this would be somewhat inefficient
if there were a very large number of duplicates, but that seems
unlikely in the existing usage.
This adds list_deduplicate_oid and list_oid_cmp infrastructure
to list.c. I didn't bother with equivalent functionality for
integer or pointer Lists, but such could always be added later
if we find a use for it.
Discussion: https://postgr.es/m/26193.1563228600@sss.pgh.pa.us
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
Vignesh found this bug in the check function for
default_table_access_method's check hook, but that was just copied
from older GUCs. Investigation by Michael and me then found the bug in
further places.
When not connected to a database (e.g. in a walsender connection), we
cannot perform (most) GUC checks that need database access. Even when
only shared tables are needed, unless they're
nailed (c.f. RelationCacheInitializePhase2()), they cannot be accessed
without pg_class etc. being present.
Fix by extending the existing IsTransactionState() checks to also
check for MyDatabaseOid.
Reported-By: Vignesh C, Michael Paquier, Andres Freund
Author: Vignesh C, Andres Freund
Discussion: https://postgr.es/m/CALDaNm1KXK9gbZfY-p_peRFm_XrBh1OwQO1Kk6Gig0c0fVZ2uw%40mail.gmail.com
Backpatch: 9.4-
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.
The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed. That's even sillier. With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.
With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID. Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)". But keep
IsCatalogRelation as a convenience function.
This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid. The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.
While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.
Also improve the comments in catalog.c.
There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId. I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId. But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.
Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
In the wake of commit f912d7dec, RelationSetIndexList isn't used any
more. It was always a horrid wart, so getting rid of it is very nice.
We can also convert rd_indexvalid back to a plain boolean.
Discussion: https://postgr.es/m/28926.1556664156@sss.pgh.pa.us
Commits 3dbb317d3 et al failed under CLOBBER_CACHE_ALWAYS testing.
Investigation showed that to reindex pg_class_oid_index, we must
suppress accesses to the index (via SetReindexProcessing) before we call
RelationSetNewRelfilenode, or at least before we do CommandCounterIncrement
therein; otherwise, relcache reloads happening within the CCI may try to
fetch pg_class rows using the index's new relfilenode value, which is as
yet an empty file.
Of course, the point of 3dbb317d3 was that that ordering didn't work
either, because then RelationSetNewRelfilenode's own update of the index's
pg_class row cannot access the index, should it need to.
There are various ways we might have got around that, but Andres Freund
came up with a brilliant solution: for a mapped index, we can really just
skip the pg_class update altogether. The only fields it was actually
changing were relpages etc, but it was just setting them to zeroes which
is useless make-work. (Correct new values will be installed at the end
of index build.) All pg_class indexes are mapped and probably always will
be, so this eliminates the problem by removing work rather than adding it,
always a pleasant outcome. Having taught RelationSetNewRelfilenode to do
it that way, we can revert the code reordering in reindex_index. (But
I left the moved setup code where it was; there seems no reason why it
has to run without use of the old index. If you're trying to fix a
busted pg_class index, you'll have had to disable system index use
altogether to get this far.)
Moreover, this means we don't need RelationSetIndexList at all, because
reindex_relation's hacking to make "REINDEX TABLE pg_class" work is
likewise now unnecessary. We'll leave that code in place in the back
branches, but a follow-on patch will remove it in HEAD.
In passing, do some minor cleanup for commit 5c1560606 (in HEAD only),
notably removing a duplicate newrnode assignment.
Patch by me, using a core idea due to Andres Freund. Back-patch to all
supported branches, as 3dbb317d3 was.
Discussion: https://postgr.es/m/28926.1556664156@sss.pgh.pa.us
Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".
1) To pass data to the relation_set_new_filenode()
RelationSetNewRelfilenode() was made to update RelationData.rd_rel
directly. That's not OK however, as it makes the relcache entries
temporarily inconsistent. Which among other scenarios is a problem
if a REINDEX targets an index on pg_class - the
CatalogTupleUpdate() in RelationSetNewRelfilenode(). Presumably
that was introduced because other places in the code do so - while
those aren't "good practice" they don't appear to be actively
buggy (e.g. because system tables may not be targeted).
I (Andres) should have caught this while reviewing and signficantly
evolving the code in that commit, mea culpa.
Fix that by instead passing in the new RelFileNode as separate
argument to relation_set_new_filenode() and rely on the relcache to
update the catalog entry. Also revert that the
RelationMapUpdateMap() call was changed to immediate, and undo some
other more unnecessary changes.
2) Document that the relation_set_new_filenode cannot rely on the
whole relcache entry to be valid. It might be worthwhile to
refactor the code to never have to rely on that, but given the way
heap_create() is currently coded, that'd be a large change.
3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
table AM might not use shared buffers at all. Move to
index_copy_data() and heapam_relation_copy_data().
4) heapam_relation_set_new_filenode() previously sometimes accessed
rel->rd_rel->relpersistence rather than the `persistence`
argument. Code movement mistake.
5) Previously heapam_relation_set_new_filenode() re-opened the smgr
relation to create the init for, if necesary. Instead have
RelationCreateStorage() return the SMgrRelation and use it to
create the init fork.
6) Add a note about the danger of modifying the relcache directly to
ATExecSetTableSpace() - it's currently not a bug because there's a
check ERRORing for catalog tables.
Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.
Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
If a FOR ALL TABLES publication exists, temporary and unlogged tables
are ignored for publishing changes. But CheckCmdReplicaIdentity()
would still check in that case that such a table has a replica
identity set before accepting updates. To fix, have
GetRelationPublicationActions() return that such a table publishes no
actions.
Discussion: https://www.postgresql.org/message-id/f3f151f7-c4dd-1646-b998-f60bd6217dd3@2ndquadrant.com
The original coding of generate_partition_qual() just copied the list
of predicate expressions into the global CacheMemoryContext, making it
effectively impossible to clean up when the owning relcache entry is
destroyed --- the relevant code in RelationDestroyRelation() only managed
to free the topmost List header :-(. This resulted in a session-lifespan
memory leak whenever a table partition's relcache entry is rebuilt.
Fortunately, that's not normally a large data structure, and rebuilds
shouldn't occur all that often in production situations; but this is
still a bug worth fixing back to v10 where the code was introduced.
To fix, put the cached expression tree into its own small memory context,
as we do with other complicated substructures of relcache entries.
Also, deal more honestly with the case that a partition has an empty
partcheck list; while that probably isn't a case that's very interesting
for production use, it's legal.
In passing, clarify comments about how partitioning-related relcache
data structures are managed, and add some Asserts that we're not leaking
old copies when we overwrite these data fields.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/7961.1552498252@sss.pgh.pa.us
Rewrite get_attgenerated() to avoid compiler warning if the compiler
does not recognize that elog(ERROR) does not return.
Reported-by: David Rowley <david.rowley@2ndquadrant.com>
This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.
This implements one kind of generated column: stored (computed on
write). Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.
This is fairly straight forward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
Some buildfarm members using CLOBBER_CACHE_ALWAYS have been having OOM
problems of late. Commit 2455ab488 addressed this problem by recovering
space transiently used within RelationBuildPartitionDesc, but it turns
out that leaves quite a lot on the table, because other subroutines of
RelationBuildDesc also leak memory like mad. Let's move the temp-context
management into RelationBuildDesc so that leakage from the other
subroutines is also recovered.
I examined this issue by arranging for postgres.c to dump the size of
MessageContext just before resetting it in each command cycle, and
then running the update.sql regression test (which is one of the two
that are seeing buildfarm OOMs) with and without CLOBBER_CACHE_ALWAYS.
Before 2455ab488, the peak space usage with CCA was as much as 250MB.
That patch got it down to ~80MB, but with this patch it's about 0.5MB,
and indeed the space usage now seems nearly indistinguishable from a
non-CCA build.
RelationBuildDesc's traditional behavior of not worrying about leaking
transient data is of many years' standing, so I'm pretty hesitant to
change that without more evidence that it'd be useful in a normal build.
(So far as I can see, non-CCA memory consumption is about the same with
or without this change, whuch if anything suggests that it isn't useful.)
Hence, configure the patch so that we recover space only when
CLOBBER_CACHE_ALWAYS or CLOBBER_CACHE_RECURSIVELY is defined. However,
that choice can be overridden at compile time, in case somebody would
like to do some performance testing and try to develop evidence for
changing that decision.
It's possible that we ought to back-patch this change, but in the
absence of back-branch OOM problems in the buildfarm, I'm not in
a hurry to do that.
Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com
... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.
access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.
Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.
To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h. (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)
Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary. These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened. In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error. However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it. This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.
Note that one frontend code path is impacted by this commit so as an
error is issued when fetching control file data, making backend and
frontend to be treated consistently.
Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.
To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc. If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued. To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.
Second, successive calls to RelationGetPartitionDesc might not return
the same answer. The query planner will get confused if lookup up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning. Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times. Invent a new
PartitionDirectory concept and use it to ensure consistency. This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view. Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point. Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started. We might want to tighten that up, but it's material for a
separate patch. There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.
Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution. So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly. There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.
This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc. That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends. Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends. There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.
Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.
Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
This introduces the concept of table access methods, i.e. CREATE
ACCESS METHOD ... TYPE TABLE and
CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit, that'll be done in following commits.
Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.
Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.
Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.
Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs. It's quite possible that this behaviour
still needs to be fine tuned.
For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.
Catversion bumped, to add the heap table AM and references to it.
Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsqlhttps://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.dehttps://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
This is similar in spirit to the existing partbounds.c file in the
same directory, except that there's a lot less code in the new file
created by this commit. Pending work in this area proposes to add a
bunch more code related to PartitionDescs, though, and this will give
us a good place to put it.
Discussion: http://postgr.es/m/CA+TgmoZUwPf_uanjF==gTGBMJrn8uCq52XYvAEorNkLrUdoawg@mail.gmail.com
Add support function requests for estimating the selectivity, cost,
and number of result rows (if a SRF) of the target function.
The lack of a way to estimate selectivity of a boolean-returning
function in WHERE has been a recognized deficiency of the planner
since Berkeley days. This commit finally fixes it.
In addition, non-constant estimates of cost and number of output
rows are now possible. We still fall back to looking at procost
and prorows if the support function doesn't service the request,
of course.
To make concrete use of the possibility of estimating output rowcount
for SRFs, this commit adds support functions for array_unnest(anyarray)
and the integer variants of generate_series; the lack of plausible
rowcount estimates for those, even when it's obvious to a human,
has been a repeated subject of complaints. Obviously, much more
could now be done in this line, but I'm mostly just trying to get
the infrastructure in place.
Discussion: https://postgr.es/m/15193.1548028093@sss.pgh.pa.us