This moves bitmap heap scan support to below an optional tableam
callback. It's optional as the whole concept of bitmap heapscans is
fairly block specific.
This basically moves the work previously done in bitgetpage() into the
new scan_bitmap_next_block callback, and the direct poking into the
buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple()
callback.
The abstraction is currently somewhat leaky because
nodeBitmapHeapscan.c's prefetching and visibilitymap based logic
remains - it's likely that we'll later have to move more into the
AM. But it's not trivial to do so without introducing a significant
amount of code duplication between the AMs, so that's a project for
later.
Note that now nodeBitmapHeapscan.c and the associated node types are a
bit misnamed. But it's not clear whether renaming wouldn't be a cure
worse than the disease. Either way, that'd be best done in a separate
commit.
Author: Andres Freund
Reviewed-By: Robert Haas (in an older version)
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This moves sample scan support to below tableam. It's not optional as
there is, in contrast to e.g. bitmap heap scans, no alternative way to
perform tablesample queries. If an AM can't deal with the block based
API, it will have to throw an ERROR.
The tableam callbacks for this are block based, but given the current
TsmRoutine interface, that seems to be required.
The new interface doesn't require TsmRoutines to perform visibility
checks anymore - that requires the TsmRoutine to know details about
the AM, which we want to avoid. To continue to allow taking the
returned number of tuples account SampleScanState now has a donetuples
field (which previously e.g. existed in SystemRowsSamplerData), which
is only incremented after the visibility check succeeds.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
An anti-wraparound vacuum has to be by definition aggressive as it needs
to work on all the pages of a relation. However it can happen that due
to some concurrent activity an anti-wraparound vacuum is marked as
non-aggressive, which makes it redundant with a previous run, and
it is actually useless as an anti-wraparound vacuum should process all
the pages of a relation. This commit makes such vacuums to be skipped.
An anti-wraparound vacuum not aggressive can be found easily by mixing
low values of autovacuum_freeze_max_age (to control anti-wraparound) and
autovacuum_freeze_table_age (to control the aggressiveness).
28a8fa9 has added some extra logging printing all the possible
combinations of anti-wraparound and aggressive vacuums, which now gets
simplified as an anti-wraparound vacuum also non-aggressive gets
skipped.
Per discussion mainly between Andrew Dunstan, Robert Haas, Álvaro
Herrera, Kyotaro Horiguchi, Masahiko Sawada, and myself.
Author: Kyotaro Horiguchi, Michael Paquier
Reviewed-by: Andrew Dunstan, Álvaro Herrera
Discussion: https://postgr.es/m/20180914153554.562muwr3uwujno75@alvherre.pgsql
This just moves the table/matview[/toast] determination of relation
size to a callback, and uses a copy of the existing logic to implement
that callback for heap.
It probably would make sense to also move the index specific logic
into a callback, so the metapage handling (and probably more) can be
index specific. But that's a separate task.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This is a relatively straightforward move of the current
implementation to sit below tableam. As the current analyze sampling
implementation is pretty inherently block based, the tableam analyze
interface is as well. It might make sense to generalize that at some
point, but that seems like a larger project that shouldn't be
undertaken at the same time as the introduction of tableam.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.
This is fairly straight forward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM. While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.
Currently the range based interface wouldn't make much sense for non
block based table AMs. But that seems aceptable for now.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to looks like. That was
easy enough before the introducing of tableam (we knew it had to be
heap, although some trickery around logging the heap relfilenodes
was required). But to properly handle table AMs we need
per-database catalog access to look up the AM handler, which
recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
decoding on standbys. When on a catalog table, we need to be able
to conflict with slots that have an xid horizon that's too old. But
computing the horizon by visiting the heap only works once
consistency is reached, but we always need to be able to detect
conflicts.
There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).
Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.
To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers. As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled. That's
possibly not the perfect formula, but seems good enough for now.
Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.
Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.dehttps://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.dehttps://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
After 71bdc99d0d7, "tableam: Add helper for indexes to check if a
corresponding table tuples exist." there's no in-core user left. As
there's unlikely to be an external user, and such an external user
could easily be adjusted to use table_index_fetch_tuple_check(),
remove heap_hot_search().
Per complaint from Peter Geoghegan
Author: Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com
This primarily is to allow WHERE CURRENT OF to continue to work as it
currently does. It's not clear to me that these semantics make sense
for every AM, but it works for the in-core heap, and the out of core
zheap. We can refine it further at a later point if necessary.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.
Note that this different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.
This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.
Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.
Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that find the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one. It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.
The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possible slightly less helpful for users, it seems
like an acceptable trade-off.
As part of this commit HTSU_Result has been renamed to TM_Result, and
its members been expanded to differentiated between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.
The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
succeeded = false).
Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.
Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.
Author: Andres Freund and Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
Many places need both, so this allows a few functions to take one
fewer parameter. More importantly, as soon as we add a VACUUM
option that takes a non-Boolean parameter, we need to replace
'int options' with a struct, and it seems better to think
of adding more fields to VacuumParams rather than passing around
both VacuumParams and a separate struct as well.
Patch by me, reviewed by Masahiko Sawada
Discussion: http://postgr.es/m/CA+Tgmob6g6-s50fyv8E8he7APfwCYYJ4z0wbZC2yZeSz=26CYQ@mail.gmail.com
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for a other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
intiialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensible adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary. These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened. In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error. However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it. This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.
Note that one frontend code path is impacted by this commit so as an
error is issued when fetching control file data, making backend and
frontend to be treated consistently.
Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
This introduces the concept of table access methods, i.e. CREATE
ACCESS METHOD ... TYPE TABLE and
CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit, that'll be done in following commits.
Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.
Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.
Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.
Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs. It's quite possible that this behaviour
still needs to be fine tuned.
For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.
Catversion bumped, to add the heap table AM and references to it.
Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsqlhttps://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.dehttps://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
After commit b0eaa4c51b, we use a local map of pages to find the required
space for small relations. We do clear this map when we have found a block
with enough free space, when we extend the relation, or on transaction
abort so that it can be used next time. However, we miss to clear it when
we didn't find any pages to try from the map which leads to an assertion
failure when we later tried to use it after relation extension.
In the passing, I have improved some comments in this area.
Reported-by: Tom Lane based on buildfarm results
Author: Amit Kapila
Reviewed-by: John Naylor
Tested-by: Kuntal Ghosh
Discussion: https://postgr.es/m/32368.1551114120@sss.pgh.pa.us
Test for the compiler builtins __builtin_clz, __builtin_ctz, and
__builtin_popcount, and make use of these in preference to
handwritten C code if they're available. Create src/port
infrastructure for "leftmost one", "rightmost one", and "popcount"
so as to centralize these decisions.
On x86_64, __builtin_popcount generally won't make use of the POPCNT
opcode because that's not universally supported yet. Provide code
that checks CPUID and then calls POPCNT via asm() if available.
This requires indirecting through a function pointer, which is
an annoying amount of overhead for a one-instruction operation,
but it's probably not worth working harder than this for our
current use-cases.
I'm not sure we've found all the existing places that could profit
from this new infrastructure; but we at least touched all the
ones that used copied-and-pasted versions of the bitmapset.c code,
and got rid of multiple copies of the associated constant arrays.
While at it, replace c-compiler.m4's one-per-builtin-function
macros with a single one that can handle all the cases we need
to worry about so far. Also, because I'm paranoid, make those
checks into AC_LINK checks rather than just AC_COMPILE; the
former coding failed to verify that libgcc has support for the
builtin, in cases where it's not inline code.
David Rowley, Thomas Munro, Alvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
This reverts commits fc6c72747ae6, 109de05cbb03, d0b4663c23b7 and
711bab1e4d19.
Somebody will have to try harder before submitting this patch again.
I've spent entirely too much time on it already, and the #ifdef maze yet
to be written in order for it to build at all got on my nerves. The
amount of work needed to get a platform-specific performance improvement
that's barely above the noise level is not worth it.
These opcodes have been around in the AMD world since 2007, and 2008 in
the case of intel. They're supported in GCC and Clang via some __builtin
macros. The opcodes may be unavailable during runtime, in which case we
fall back on a C-based implementation of the code. In order to get the
POPCNT instruction we must pass the -mpopcnt option to the compiler. We
do this only for the pg_bitutils.c file.
David Rowley (with fragments taken from a patch by Thomas Munro)
Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
Previously, all heaps had FSMs. For very small tables, this means that the
FSM took up more space than the heap did. This is wasteful, so now we
refrain from creating the FSM for heaps with 4 pages or fewer. If the last
known target block has insufficient space, we still try to insert into some
other page before giving up and extending the relation, since doing
otherwise leads to table bloat. Testing showed that trying every page
penalized performance slightly, so we compromise and try every other page.
This way, we visit at most two pages. Any pages with wasted free space
become visible at next relation extension, so we still control table bloat.
As a bonus, directly attempting one or two pages can even be faster than
consulting the FSM would have been.
Once the FSM is created for a heap we don't remove it even if somebody
deletes all the rows from the corresponding relation. We don't think it is
a useful optimization as it is quite likely that relation will again grow
to the same size.
Author: John Naylor, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Mithun C Y
Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks, but that's rarely hit at runtime.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
To avoid deadlock, backend acquires a lock on heap pages in block
number order. In certain cases, lock on heap pages is dropped and
reacquired. In this case, the locks are dropped for reading in
corresponding VM page/s. The issue is we re-acquire locks in bufferId
order whereas the intention was to acquire in blockid order.
This commit ensures that we will always acquire locks on heap pages in
blockid order.
Reported-by: Nishant Fnu
Author: Nishant Fnu
Reviewed-by: Amit Kapila and Robert Haas
Backpatch-through: 9.4
Discussion: https://postgr.es/m/5883C831-2ED1-47C8-BFAC-2D5BAE5A8CAE@amazon.com
When logging the replica identity of a deleted tuple, XLOG_HEAP_DELETE
records include references of the old tuple. Its data is stored in an
intermediate variable used to register this information for the WAL
record, but this variable gets away from the stack when the record gets
actually inserted.
Spotted by clang's AddressSanitizer.
Author: Stas Kelvish
Discussion: https://postgr.es/m/085C8825-AD86-4E93-AF80-E26CDF03D1EA@postgrespro.ru
Backpatch-through: 9.4
This reverts commit fc02e6724f3ce069b33284bce092052ab55bd751 and
e6799d5a53011985d916fdb48fe014a4ae70422e.
Parts of the buildfarm error out with
ERROR: page %u of relation "%s" should be empty but is not
errors, and so far I/we do not know why. fc02e672 didn't fix the
issue. As I cannot reproduce the issue locally, it seems best to get
the buildfarm green again, and reproduce the issue without time
pressure.
In e6799d5a5301 I removed vacuumlazy.c trickery around re-checking
whether a page is actually empty after acquiring an extension lock on
the relation, because the page is not PageInit()ed anymore, and
entries in the FSM ought not to lead to user-visible errors.
As reported by various buildfarm animals that is not correct, given
the way to code currently stands: If vacuum processes a page that's
just been newly added by either RelationGetBufferForTuple() or
RelationAddExtraBlocks(), it could add that page to the FSM and it
could be reused by other backends, before those two functions check
whether the newly added page is actually new. That's a relatively
narrow race, but several buildfarm machines appear to be able to hit
it.
While it seems wrong that the FSM, given it's lack of durability and
approximative nature, can trigger errors like this, that seems better
fixed in a separate commit. Especially given that a good portion of
the buildfarm is red, and this is just re-introducing logic that
existed a few hours ago.
Author: Andres Freund
Discussion: https://postgr.es/m/20190128222259.zhi7ovzgtkft6em6@alap3.anarazel.de
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, have different amount of
free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns, is
considerably higher. The warning sounds much more serious than what
they are.
Fix those two issues together by not initializing pages in
RelationAddExtraPages() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
Previously, all heaps had FSMs. For very small tables, this means that the
FSM took up more space than the heap did. This is wasteful, so now we
refrain from creating the FSM for heaps with 4 pages or fewer. If the last
known target block has insufficient space, we still try to insert into some
other page before giving up and extending the relation, since doing
otherwise leads to table bloat. Testing showed that trying every page
penalized performance slightly, so we compromise and try every other page.
This way, we visit at most two pages. Any pages with wasted free space
become visible at next relation extension, so we still control table bloat.
As a bonus, directly attempting one or two pages can even be faster than
consulting the FSM would have been.
Once the FSM is created for a heap we don't remove it even if somebody
deletes all the rows from the corresponding relation. We don't think it is
a useful optimization as it is quite likely that relation will again grow
to the same size.
Author: John Naylor with design inputs and some code contribution by Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Mithun C Y
Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com
Given these routines are heap specific, and that there will be more
generic visibility support in via table AM, it makes sense to move the
prototypes to heapam.h (routines like HeapTupleSatisfiesVacuum will
not be exposed in a generic fashion, because they are too storage
specific).
Similarly, the code in tqual.c is specific to heap, so moving it into
access/heap/ makes sense.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
access/heapam contains functions that are very storage specific (say
heap_insert() and a lot of lower level functions), and fairly generic
infrastructure like relation_open(), heap_open() etc. In the upcoming
pluggable storage work we're introducing a layer between table
accesses in general and heapam, to allow for different storage
methods. For a bit cleaner separation it thus seems advantageous to
move generic functions like the aforementioned to their own headers.
access/relation.h will contain relation_open() etc, and access/table.h
will contain table_open() (formerly known as heap_open()). I've decided
for table.h not to include relation.h, but we might change that at a
later stage.
relation.h already exists in another directory, but the other
plausible name (rel.h) also conflicts. It'd be nice if there were a
non-conflicting name, but nobody came up with a suggestion. It's
possible that the appropriate way to address the naming conflict would
be to rename nodes/relation.h, which isn't particularly well named.
To avoid breaking a lot of extensions that just use heap_open() etc,
table.h has macros mapping the old names to the new ones, and heapam.h
includes relation, table.h. That also allows to keep the
bulk renaming of existing callers in a separate commit.
Author: Andres Freund
Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de
This is architecturally mildly problematic, which becomes more
pronounced with the upcoming introduction of pluggable storage.
To fix, teach heap_parallelscan_estimate() to deal with SnapshotAny
snapshots, and then use it from _bt_parallel_estimate_shared().
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This reverts commit c203d6cf8 and some follow-on fixes, completing the
task begun in commit 5d28c9bd7. If that feature is ever resurrected,
the code will look quite a bit different from this, so it seems best
to start from a clean slate.
The v11 branch is not touched; in that branch, the recheck_on_update
storage option remains present, but nonfunctional and undocumented.
Discussion: https://postgr.es/m/20190114223409.3tcvejfhlvbucrv5@alap3.anarazel.de
This is the genam.h equivalent of 4c850ecec649c (which removed
heapam.h from a lot of other headers). There's still a few header
includes of genam.h, but not from central headers anymore.
As a few headers are not indirectly included anymore, execnodes.h and
relscan.h need a few additional includes. Some of the depended on
types were replacable by using the underlying structs, but e.g. for
Snapshot in execnodes.h that'd have gotten more invasive than
reasonable in this commit.
Like the aforementioned commit 4c850ecec649c, this requires adding new
genam.h includes to a number of backend files, which likely is also
required in a few external projects.
Author: Andres Freund
Discussion: https://postgr.es/m/20190114000701.y4ttcb74jpskkcfb@alap3.anarazel.de
In toast_fetch_datum_slice(), we Assert() that what is passed in isn't
compressed, but we then later had a check to see what the length of if
what was passed in is compressed. That later check is rather confusing
since toast_fetch_datum_slice() is only ever called with non-compressed
datums and the Assert() earlier makes it clear that one shouldn't be
passing in compressed datums.
Add a comment to make it clear that toast_fetch_datum_slice() is just
for non-compressed datums, and remove the dead code.
During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged
using XLOG / FPI records, and thus (correctly) ignored in decoding.
But the associated TOAST table is WAL-logged as plain INSERT records,
and so was logically decoded and passed to reorder buffer.
That has severe consequences with TOAST tables of non-trivial size.
Firstly, reorder buffer has to keep all those changes, possibly spilling
them to a file, incurring I/O costs and disk space.
Secondly, ReoderBufferCommit() was stashing all those TOAST chunks into
a hash table, which got discarded only after processing the row from the
main heap. But as the main heap is not decoded for rewrites, this never
happened, so all the TOAST data accumulated in memory, resulting either
in excessive memory consumption or OOM.
The fix is simple, as commit e9edc1ba already introduced infrastructure
(namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST
tables, but it only applied it to system tables. So simply use it for
all TOAST data in raw_heap_insert().
That would however solve only the memory consumption issue - the TOAST
changes would still be decoded and added to the reorder buffer, and
spilled to disk (although without TOAST tuple data, so much smaller).
But we can solve that by tweaking DecodeInsert() to just ignore such
INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag,
instead of skipping them later in ReorderBufferCommit().
Review: Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com
Backpatch: 9.4-, where logical decoding was introduced
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de