1
0
mirror of https://github.com/postgres/postgres.git synced 2025-10-25 13:17:41 +03:00
Commit Graph

202 Commits

Author SHA1 Message Date
Michael Paquier
62a17a9283 Integrate FullTransactionIds deeper into two-phase code
This refactoring is a follow-up of the work done in 5a1dfde833, that
has switched 2PC file names to use FullTransactionIds when written on
disk.  This will help with the integration of a follow-up solution
related to the handling of two-phase files during recovery, to address
older defects while reading these from disk after a crash.

This change is useful in itself as it reduces the need to build the
file names from epoch numbers and TransactionIds, because we can use
directly FullTransactionIds from which the 2PC file names are guessed.
So this avoids a lot of back-and-forth between the FullTransactionIds
retrieved from the file names and how these are passed around in the
internal 2PC logic.

Note that the core of the change is the use of a FullTransactionId
instead of a TransactionId in GlobalTransactionData, that tracks 2PC
file information in shared memory.  The change in TwoPhaseCallback makes
this commit unfit for stable branches.

Noah has contributed a good chunk of this patch.  I have spent some time
on it as well while working on the issues with two-phase state files and
recovery.

Author: Noah Misch <noah@leadboat.com>
Co-Authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz
Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com
2025-07-07 12:50:40 +09:00
Fujii Masao
73bdcfab35 Rename log_lock_failure GUC to log_lock_failures for consistency.
This commit renames the GUC log_lock_failure to log_lock_failures
to align with the existing similar setting log_lock_waits, which uses
the plural form. This improves naming consistency across related GUCs.

Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Author: Fujii Masao <masao.fujii@gmail.com
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org
2025-06-03 10:02:55 +09:00
Heikki Linnakangas
51a0382e8d Fix crash if LockErrorCleanup() is called twice
The refactoring in commit 3c0fd64fec removed the clearing of
awaitedLock from LockErrorCleanup(). It's still needed, otherwise
LockErrorCleanup() during abort processing will try to update the
LOCALLOCK struct even after the lock has already been released. Put it
back.

Reported-by: Richard Guo <guofenglinux@gmail.com>
Reported-by: Robins Tharakan <tharakan@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4_dNX1SzBmvFdoY-LxJh_4W_BjtVd5i008ihfU-wFF=eg@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/18832-38e5575b1bbd7277@postgresql.org
Discussion: https://www.postgresql.org/message-id/e11a30e5-c0d8-491d-8546-3a1b50c10ad4@gmail.com
2025-03-28 20:19:17 +02:00
Fujii Masao
6d376c3b0d Add GUC option to log lock acquisition failures.
This commit introduces a new GUC, log_lock_failure, which controls whether
a detailed log message is produced when a lock acquisition fails. Currently,
it only supports logging lock failures caused by SELECT ... NOWAIT.

The log message includes information about all processes holding or
waiting for the lock that couldn't be acquired, helping users analyze and
diagnose the causes of lock failures.

Currently, this option does not log failures from SELECT ... SKIP LOCKED,
as that could generate excessive log messages if many locks are skipped,
causing unnecessary noise.

This mechanism can be extended in the future to support for logging
lock failures from other commands, such as LOCK TABLE ... NOWAIT.

Author: Yuki Seino <seinoyu@oss.nttdata.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/411280a186cc26ef7034e0f2dfe54131@oss.nttdata.com
2025-03-14 23:14:12 +09:00
Peter Eisentraut
3691edfab9 pg_noreturn to replace pg_attribute_noreturn()
We want to support a "noreturn" decoration on more compilers besides
just GCC-compatible ones, but for that we need to move the decoration
in front of the function declaration instead of either behind it or
wherever, which is the current style afforded by GCC-style attributes.
Also rename the macro to "pg_noreturn" to be similar to the C11
standard "noreturn".

pg_noreturn is now supported on all compilers that support C11 (using
_Noreturn), as well as GCC-compatible ones (using __attribute__, as
before), as well as MSVC (using __declspec).  (When PostgreSQL
requires C11, the latter two variants can be dropped.)

Now, all supported compilers effectively support pg_noreturn, so the
extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped.

This also fixes a possible problem if third-party code includes
stdnoreturn.h, because then the current definition of

    #define pg_attribute_noreturn() __attribute__((noreturn))

would cause an error.

Note that the C standard does not support a noreturn attribute on
function pointer types.  So we have to drop these here.  There are
only two instances at this time, so it's not a big loss.  In one case,
we can make up for it by adding the pg_noreturn to a wrapper function
and adding a pg_unreachable(), in the other case, the latter was
already done before.

Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw
2025-03-13 12:37:26 +01:00
Bruce Momjian
50e6eb731d Update copyright for 2025
Backpatch-through: 13
2025-01-01 11:21:55 -05:00
Heikki Linnakangas
3c0fd64fec Split ProcSleep function into JoinWaitQueue and ProcSleep
Split ProcSleep into two functions: JoinWaitQueue and ProcSleep.
JoinWaitQueue is called while holding the partition lock, and inserts
the current process to the wait queue, while ProcSleep() does the
actual sleeping. ProcSleep() is now called without holding the
partition lock, and it no longer re-acquires the partition lock before
returning. That makes the wakeup a little cheaper. Once upon a time,
re-acquiring the partition lock was needed to prevent a signal handler
from longjmping out at a bad time, but these days our signal handlers
just set flags, and longjmping can only happen at points where we
explicitly run CHECK_FOR_INTERRUPTS().

If JoinWaitQueue detects an "early deadlock" before even joining the
wait queue, it returns without changing the shared lock entry, leaving
the cleanup of the shared lock entry to the caller. This makes the
handling of an early deadlock the same as the dontWait=true case.

One small user-visible side-effect of this refactoring is that we now
only set the 'ps' title to say "waiting" when we actually enter the
sleep, not when the lock is skipped because dontWait=true, or when a
deadlock is detected early before entering the sleep.

This eliminates the 'lockAwaited' global variable in proc.c, which was
largely redundant with 'awaitedLock' in lock.c

Note: Updating the local lock table is now the caller's responsibility.
JoinWaitQueue and ProcSleep are now only responsible for modifying the
shared state. Seems a little nicer that way.

Based on Thomas Munro's earlier patch and observation that ProcSleep
doesn't really need to re-acquire the partition lock.

Reviewed-by: Maxim Orlov
Discussion: https://www.postgresql.org/message-id/7c2090cd-a72a-4e34-afaa-6dd2ef31440e@iki.fi
2024-11-04 17:59:24 +02:00
Heikki Linnakangas
fbce7dfc77 Refactor lock manager initialization to make it a bit less special
Split the shared and local initialization to separate functions, and
follow the common naming conventions. With this, we no longer create
the LockMethodLocalHash hash table in the postmaster process, which
was always pointless.

Reviewed-by: Andreas Karlsson
Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
2024-08-29 09:46:06 +03:00
Noah Misch
0cecc908e9 Lock before setting relhassubclass on RELKIND_PARTITIONED_INDEX.
Commit 5b562644fe added a comment that
SetRelationHasSubclass() callers must hold this lock.  When commit
17f206fbc8 extended use of this column to
partitioned indexes, it didn't take the lock.  As the latter commit
message mentioned, we currently never reset a partitioned index to
relhassubclass=f.  That largely avoids harm from the lock omission.  The
cause for fixing this now is to unblock introducing a rule about locks
required to heap_update() a pg_class row.  This might cause more
deadlocks.  It gives minor user-visible benefits:

- If an ALTER INDEX SET TABLESPACE runs concurrently with ALTER TABLE
  ATTACH PARTITION or CREATE PARTITION OF, one transaction blocks
  instead of failing with "tuple concurrently updated".  (Many cases of
  DDL concurrency still fail that way.)

- Match ALTER INDEX ATTACH PARTITION in choosing to lock the index.

While not user-visible today, we'll need this if we ever make something
set the flag to false for a partitioned index, like ANALYZE does today
for tables.  Back-patch to v12 (all supported versions), the plan for
the commit relying on the new rule.  In back branches, add
LockOrStrongerHeldByMe() instead of adding a LockHeldByMe() parameter.

Reviewed (in an earlier version) by Robert Haas.

Discussion: https://postgr.es/m/20240611024525.9f.nmisch@google.com
2024-06-27 19:21:05 -07:00
Heikki Linnakangas
024c521117 Replace BackendIds with 0-based ProcNumbers
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.

The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.

One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:38:22 +02:00
Heikki Linnakangas
ab355e3a88 Redefine backend ID to be an index into the proc array
Previously, backend ID was an index into the ProcState array, in the
shared cache invalidation manager (sinvaladt.c). The entry in the
ProcState array was reserved at backend startup by scanning the array
for a free entry, and that was also when the backend got its backend
ID. Things become slightly simpler if we redefine backend ID to be the
index into the PGPROC array, and directly use it also as an index to
the ProcState array. This uses a little more memory, as we reserve a
few extra slots in the ProcState array for aux processes that don't
need them, but the simplicity is worth it.

Aux processes now also have a backend ID. This simplifies the
reservation of BackendStatusArray and ProcSignal slots.

You can now convert a backend ID into an index into the PGPROC array
simply by subtracting 1. We still use 0-based "pgprocnos" in various
places, for indexes into the PGPROC array, but the only difference now
is that backend IDs start at 1 while pgprocnos start at 0. (The next
commmit will get rid of the term "backend ID" altogether and make
everything 0-based.)

There is still a 'backendId' field in PGPROC, now part of 'vxid' which
encapsulates the backend ID and local transaction ID together. It's
needed for prepared xacts. For regular backends, the backendId is
always equal to pgprocno + 1, but for prepared xact PGPROC entries,
it's the ID of the original backend that processed the transaction.

Reviewed-by: Andres Freund, Reid Thompson
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-03-03 19:37:28 +02:00
Heikki Linnakangas
28f3915b73 Remove superfluous 'pgprocno' field from PGPROC
It was always just the index of the PGPROC entry from the beginning of
the proc array. Introduce a macro to compute it from the pointer
instead.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-02-22 01:21:34 +02:00
Bruce Momjian
29275b1d17 Update copyright for 2024
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz

Backpatch-through: 12
2024-01-03 20:49:05 -05:00
Peter Eisentraut
611806cd72 Add trailing commas to enum definitions
Since C99, there can be a trailing comma after the last value in an
enum definition.  A lot of new code has been introducing this style on
the fly.  Some new patches are now taking an inconsistent approach to
this.  Some add the last comma on the fly if they add a new last
value, some are trying to preserve the existing style in each place,
some are even dropping the last comma if there was one.  We could
nudge this all in a consistent direction if we just add the trailing
commas everywhere once.

I omitted a few places where there was a fixed "last" value that will
always stay last.  I also skipped the header files of libpq and ecpg,
in case people want to use those with older compilers.  There were
also a small number of cases where the enum type wasn't used anywhere
(but the enum values were), which ended up confusing pgindent a bit,
so I left those alone.

Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
2023-10-26 09:20:54 +02:00
Tom Lane
0245f8db36 Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical.  We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop).  We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up.  Going
forward, that should make for fewer random-seeming changes to existing
code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-19 17:24:48 -04:00
Andres Freund
5764f611e1 Use dlist/dclist instead of PROC_QUEUE / SHM_QUEUE for heavyweight locks
Part of a series to remove SHM_QUEUE. ilist.h style lists are more widely used
and have an easier to use interface.

As PROC_QUEUE is now unused, remove it.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version)
Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de
Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
2023-01-18 11:41:14 -08:00
Amit Kapila
216a784829 Perform apply of large transactions by parallel workers.
Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber-side, the apply
worker writes the changes into temporary files and once it receives the
commit, it reads from those files and applies the entire transaction. To
improve the performance of such transactions, we can instead allow them to
be applied via parallel workers.

In this approach, we assign a new parallel apply worker (if available) as
soon as the xact's first stream is received and the leader apply worker
will send changes to this new worker via shared memory. The parallel apply
worker will directly apply the change instead of writing it to temporary
files. However, if the leader apply worker times out while attempting to
send a message to the parallel apply worker, it will switch to
"partial serialize" mode -  in this mode, the leader serializes all
remaining changes to a file and notifies the parallel apply workers to
read and apply them at the end of the transaction. We use a non-blocking
way to send the messages from the leader apply worker to the parallel
apply to avoid deadlocks. We keep this parallel apply assigned till the
transaction commit is received and also wait for the worker to finish at
commit. This preserves commit ordering and avoid writing to and reading
from files in most cases. We still need to spill if there is no worker
available.

This patch also extends the SUBSCRIPTION 'streaming' parameter so that the
user can control whether to apply the streaming transaction in a parallel
apply worker or spill the change to disk. The user can set the streaming
parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means
the streaming will be applied via a parallel apply worker, if available.
The parameter value 'on' means the streaming transaction will be spilled
to disk. The default value is 'off' (same as current behaviour).

In addition, the patch extends the logical replication STREAM_ABORT
message so that abort_lsn and abort_time can also be sent which can be
used to update the replication origin in parallel apply worker when the
streaming transaction is aborted. Because this message extension is needed
to support parallel streaming, parallel streaming is not supported for
publications on servers < PG16.

Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko
Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-09 07:52:45 +05:30
Bruce Momjian
c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Robert Haas
8ec569479f Apply PGDLLIMPORT markings broadly.
Up until now, we've had a policy of only marking certain variables
in the PostgreSQL header files with PGDLLIMPORT, but now we've
decided to mark them all. This means that extensions running on
Windows should no longer operate at a disadvantage as compared to
extensions running on Linux: if the variable is present in a header
file, it should be accessible.

Discussion: http://postgr.es/m/CA+TgmoYanc1_FSfimhgiWSqVyP5KKmh5NP2BWNwDhO8Pg2vGYQ@mail.gmail.com
2022-04-08 08:16:38 -04:00
Bruce Momjian
27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Noah Misch
3cd9c3b921 Fix CREATE INDEX CONCURRENTLY for the newest prepared transactions.
The purpose of commit 8a54e12a38 was to
fix this, and it sufficed when the PREPARE TRANSACTION completed before
the CIC looked for lock conflicts.  Otherwise, things still broke.  As
before, in a cluster having used CIC while having enabled prepared
transactions, queries that use the resulting index can silently fail to
find rows.  It may be necessary to reindex to recover from past
occurrences; REINDEX CONCURRENTLY suffices.  Fix this for future index
builds by making CIC wait for arbitrarily-recent prepared transactions
and for ordinary transactions that may yet PREPARE TRANSACTION.  As part
of that, have PREPARE TRANSACTION transfer locks to its dummy PGPROC
before it calls ProcArrayClearTransaction().  Back-patch to 9.6 (all
supported versions).

Andrey Borodin, reviewed (in earlier versions) by Andres Freund.

Discussion: https://postgr.es/m/01824242-AA92-4FE9-9BA7-AEBAFFEA3D0C@yandex-team.ru
2021-10-23 18:36:38 -07:00
Fujii Masao
46d6e5f567 Display the time when the process started waiting for the lock, in pg_locks, take 2
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.

This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.

Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.

The first attempt of this patch (commit 3b733fcd04) caused the buildfarm
member "rorqual" (built with --disable-atomics --disable-spinlocks) to report
the failure of the regression test. It was reverted by commit 890d2182a2.
The cause of this failure was that the atomic variable for "waitstart"
in the dummy process entry created at the end of prepare transaction was
not initialized. This second attempt fixes that issue.

Bump catalog version.

Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
2021-02-15 15:13:37 +09:00
Fujii Masao
890d2182a2 Revert "Display the time when the process started waiting for the lock, in pg_locks."
This reverts commit 3b733fcd04.

Per buildfarm members prion and rorqual.
2021-02-09 18:30:40 +09:00
Fujii Masao
3b733fcd04 Display the time when the process started waiting for the lock, in pg_locks.
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.

This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.

Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.

Bump catalog version.

Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
2021-02-09 18:10:19 +09:00
Noah Misch
8a54e12a38 Fix CREATE INDEX CONCURRENTLY for simultaneous prepared transactions.
In a cluster having used CREATE INDEX CONCURRENTLY while having enabled
prepared transactions, queries that use the resulting index can silently
fail to find rows.  Fix this for future CREATE INDEX CONCURRENTLY by
making it wait for prepared transactions like it waits for ordinary
transactions.  This expands the VirtualTransactionId structure domain to
admit prepared transactions.  It may be necessary to reindex to recover
from past occurrences.  Back-patch to 9.5 (all supported versions).

Andrey Borodin, reviewed (in earlier versions) by Tom Lane and Michael
Paquier.

Discussion: https://postgr.es/m/2E712143-97F7-4890-B470-4A35142ABC82@yandex-team.ru
2021-01-30 00:00:27 -08:00
Bruce Momjian
ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Noah Misch
566372b3d6 Prevent concurrent SimpleLruTruncate() for any given SLRU.
The SimpleLruTruncate() header comment states the new coding rule.  To
achieve this, add locktype "frozenid" and two LWLocks.  This closes a
rare opportunity for data loss, which manifested as "apparent
wraparound" or "could not access status of transaction" errors.  Data
loss is more likely in pg_multixact, due to released branches' thin
margin between multiStopLimit and multiWrapLimit.  If a user's physical
replication primary logged ":  apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms.  At less
risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point.  One can test a cluster for
this data loss by running VACUUM FREEZE in every database.  Back-patch
to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
2020-08-15 10:15:53 -07:00
Noah Misch
c6b92041d3 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 12:25:34 -07:00
Noah Misch
de9396326e Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac2.  Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64.  Back-patch to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-22 09:24:09 -07:00
Noah Misch
cb2fd7eac2 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21 09:38:26 -07:00
Amit Kapila
85f6b49c2c Allow relation extension lock to conflict among parallel group members.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same.  We
don't acquire a heavyweight lock on any other object after relation
extension lock which means such a lock can never participate in the
deadlock cycle.  So, avoid checking wait edges from this lock.

This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
2020-03-20 08:20:56 +05:30
Amit Kapila
15ef6ff4b9 Assert that we don't acquire a heavyweight lock on another object after
relation extension lock.

The only exception to the rule is that we can try to acquire the same
relation extension lock more than once.  This is allowed as we are not
creating any new lock for this case.  This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.

Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end.  These are
taken for a short duration to extend a particular relation and then
released.

Author: Dilip Kumar, with few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
2020-03-18 07:20:17 +05:30
Peter Eisentraut
c096a804d9 Remove STATUS_FOUND
Replace the solitary use with a bool.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/a6f91ead-0ce4-2a34-062b-7ab9813ea308%402ndquadrant.com
2020-01-11 07:48:57 +01:00
Bruce Momjian
7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Amit Kapila
e0487223ec Make the order of the header file includes consistent.
Similar to commits 14aec03502, 7e735035f2 and dddf4cdc33, this commit
makes the order of header file inclusion consistent in more places.

Author: Vignesh C
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-25 08:08:57 +05:30
Michael Paquier
eb43f3d193 Fix inconsistencies and typos in the tree
This is numbered take 8, and addresses again a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/b137b5eb-9c95-9c2f-586e-38aba7d59788@gmail.com
2019-07-29 12:28:30 +09:00
Tom Lane
8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Michael Paquier
752f06443f Fix and improve description of locktag types in lock.h
The description of the lock type for speculative insertions was
incorrect, being copy-pasted from another one.

As discussed, also move the description for all the fields of lock tag
types from the structure listing lock tag types to the set of macros
setting each LOCKTAG.

Author: John Naylor
Discussion: https://postgr.es/m/CACPNZCtA0-ybaC4fFfaDq_8p_TUOLvGxZH9Dm-=TMHZJarBa7Q@mail.gmail.com
2019-05-10 09:35:27 +09:00
Alvaro Herrera
ab0dfc961b Report progress of CREATE INDEX operations
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.

There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific.  The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.

The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan.  (The index validation index scan requires
patching each AM, which has not been included here.)

Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-04-02 15:18:08 -03:00
Peter Eisentraut
28988a84cf Reorder LOCALLOCK structure members to compact the size
Save 8 bytes (on x86-64) by filling up padding holes.

Author: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Discussion: https://www.postgresql.org/message-id/20190219001639.ft7kxir2iz644alf@alap3.anarazel.de
2019-03-19 14:07:08 +01:00
Bruce Momjian
97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Tom Lane
b04aeb0a05 Add assertions that we hold some relevant lock during relation open.
Opening a relation with no lock at all is unsafe; there's no guarantee
that we'll see a consistent state of the relevant catalog entries.
While use of MVCC scans to read the catalogs partially addresses that
complaint, it's still possible to switch to a new catalog snapshot
partway through loading the relcache entry.  Moreover, whether or not
you trust the reasoning behind sometimes using less than
AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not
valid if concurrent users of the table don't hold a lock corresponding
to the operation they want to perform.

Hence, add some assertion-build-only checks that require any caller
of relation_open(x, NoLock) to hold at least AccessShareLock.  This
isn't a full solution, since we can't verify that the lock level is
semantically appropriate for the action --- but it's definitely of
some use, because it's already caught two bugs.

We can also assert that callers of addRangeTableEntryForRelation()
hold at least the lock level specified for the new RTE.

Amit Langote and Tom Lane

Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
2018-10-01 12:43:21 -04:00
Tom Lane
f868a8143a Fix longstanding recursion hazard in sinval message processing.
LockRelationOid and sibling routines supposed that, if our session already
holds the lock they were asked to acquire, they could skip calling
AcceptInvalidationMessages on the grounds that we must have already read
any remote sinval messages issued against the relation being locked.
This is normally true, but there's a critical special case where it's not:
processing inside AcceptInvalidationMessages might attempt to access system
relations, resulting in a recursive call to acquire a relation lock.

Hence, if the outer call had acquired that same system catalog lock, we'd
fall through, despite the possibility that there's an as-yet-unread sinval
message for that system catalog.  This could, for example, result in
failure to access a system catalog or index that had just been processed
by VACUUM FULL.  This is the explanation for buildfarm failures we've been
seeing intermittently for the past three months.  The bug is far older
than that, but commits a54e1f158 et al added a new recursion case within
AcceptInvalidationMessages that is apparently easier to hit than any
previous case.

To fix this, we must not skip calling AcceptInvalidationMessages until
we have *finished* a call to it since acquiring a relation lock, not
merely acquired the lock.  (There's already adequate logic inside
AcceptInvalidationMessages to deal with being called recursively.)
Fortunately, we can implement that at trivial cost, by adding a flag
to LOCALLOCK hashtable entries that tracks whether we know we have
completed such a call.

There is an API hazard added by this patch for external callers of
LockAcquire: if anything is testing for LOCKACQUIRE_ALREADY_HELD,
it might be fooled by the new return code LOCKACQUIRE_ALREADY_CLEAR
into thinking the lock wasn't already held.  This should be a fail-soft
condition, though, unless something very bizarre is being done in
response to the test.

Also, I added an additional output argument to LockAcquireExtended,
assuming that that probably isn't called by any outside code given
the very limited usefulness of its additional functionality.

Back-patch to all supported branches.

Discussion: https://postgr.es/m/12259.1532117714@sss.pgh.pa.us
2018-09-07 18:04:54 -04:00
Bruce Momjian
9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Tom Lane
c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Tom Lane
e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
Bruce Momjian
1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Robert Haas
719c84c1be Extend relations multiple blocks at a time to improve scalability.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate.  Experimentation shows the extending the relation
multiple blocks at a time improves scalability.

Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
2016-04-08 02:04:46 -04:00
Robert Haas
53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Tom Lane
52f5d578d6 Create a function to reliably identify which sessions block which others.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether.  (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)

The new function has the following behaviors that are painful or impossible
to get right via pg_locks:

1. Correctly understands which lock modes block which other ones.

2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.

3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.

The motivation for doing this right now is mostly to fix the isolation
tests.  Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly.  But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions.  Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds.  That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
2016-02-22 14:31:43 -05:00