mirror of https://github.com/postgres/postgres.git synced 2025-11-25 12:03:53 +03:00

Fix bug where we truncated CLOG that was still needed by LISTEN/NOTIFY

The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087
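
The file name in the DETAIL line can be derived from the XID. The sketch below assumes the default 8 kB block size, 2 status bits per transaction and 32 pages per SLRU segment, so each pg_xact segment covers 1,048,576 XIDs. The helper names are hypothetical and are not PostgreSQL's own functions:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed defaults; a build with a different BLCKSZ changes these numbers. */
#define BLCKSZ                  8192
#define CLOG_XACTS_PER_BYTE     4	/* 2 status bits per transaction */
#define SLRU_PAGES_PER_SEGMENT  32
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACTS_PER_SEGMENT  (CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)

/* Hypothetical helper: number of the pg_xact segment holding 'xid'. */
uint32_t
clog_segment_number(uint32_t xid)
{
	return xid / CLOG_XACTS_PER_SEGMENT;
}

/* Hypothetical helper: format the segment number as a 4-digit hex file name. */
void
clog_segment_name(uint32_t xid, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%04X", (unsigned int) clog_segment_number(xid));
}
```

Under these assumptions, XID 14279685 falls in segment 13, which is formatted as "000D", the file named in the error above.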

To fix, make VACUUM "freeze" the XIDs in the async notification queue
before truncating the CLOG. Old XIDs are replaced with
FrozenTransactionId or InvalidTransactionId.
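
As a rough illustration of the freezing step, here is a minimal sketch over a simplified in-memory queue. It is not the code from this commit: the QueueEntry struct, freeze_queue_xids() and toy_did_commit() are hypothetical, and the mapping of committed XIDs to FrozenTransactionId and aborted ones to InvalidTransactionId is an assumption based on the description above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId      ((TransactionId) 0)
#define FrozenTransactionId       ((TransactionId) 2)
#define FirstNormalTransactionId  ((TransactionId) 3)

/* Simplified stand-in for one entry in the notification queue. */
typedef struct QueueEntry
{
	TransactionId xid;			/* XID of the sender */
} QueueEntry;

/* Modulo-2^32 comparison, the way normal XIDs are ordered. */
bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Toy stand-in for TransactionIdDidCommit(): even XIDs count as committed. */
bool
toy_did_commit(TransactionId xid)
{
	return (xid % 2) == 0;
}

/*
 * Hypothetical freeze pass: replace every normal XID older than 'cutoff'
 * with a special value, so that it never needs a CLOG lookup again.
 */
void
freeze_queue_xids(QueueEntry *entries, size_t n, TransactionId cutoff)
{
	for (size_t i = 0; i < n; i++)
	{
		TransactionId xid = entries[i].xid;

		if (xid < FirstNormalTransactionId)
			continue;			/* already frozen or invalid */
		if (!xid_precedes(xid, cutoff))
			continue;			/* CLOG for this XID is being kept */

		entries[i].xid = toy_did_commit(xid) ? FrozenTransactionId
			: InvalidTransactionId;
	}
}
```

The point of the design is ordering: the commit status is looked up while the CLOG segment still exists, and only afterwards may the segment be truncated.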

Note: This commit is not a full fix. A race condition remains, where a
backend is executing asyncQueueReadAllNotifications() and has just
made a local copy of an async SLRU page which contains old XIDs, while
vacuum concurrently truncates the CLOG covering those XIDs. When the
backend then calls TransactionIdDidCommit() on those XIDs from the
local copy, you still get the error. The next commit will fix that
remaining race condition.
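
The remaining race can be replayed as a single-threaded toy model. Everything in this sketch is hypothetical (the ToyShared struct, the function names, and the plain integer comparison that ignores wraparound); it only illustrates the interleaving described above, not the actual backend code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define FrozenTransactionId ((TransactionId) 2)
#define PAGE_ENTRIES 4

/* Toy shared state: one queue page and the oldest XID still covered by CLOG. */
typedef struct ToyShared
{
	TransactionId page[PAGE_ENTRIES];
	TransactionId clog_oldest_xid;
} ToyShared;

/* Stand-in for a CLOG lookup; returns false if the segment is gone. */
bool
toy_clog_lookup_ok(const ToyShared *s, TransactionId xid)
{
	if (xid == FrozenTransactionId)
		return true;			/* no CLOG access needed */
	return xid >= s->clog_oldest_xid;	/* toy model: ignores wraparound */
}

/* Vacuum: freeze old XIDs on the shared page, then truncate the CLOG. */
void
toy_vacuum(ToyShared *s, TransactionId cutoff)
{
	for (int i = 0; i < PAGE_ENTRIES; i++)
	{
		if (s->page[i] != FrozenTransactionId && s->page[i] < cutoff)
			s->page[i] = FrozenTransactionId;
	}
	s->clog_oldest_xid = cutoff;
}

/* Replay the interleaving; returns how many lookups fail for the reader. */
int
toy_race_failed_lookups(void)
{
	ToyShared	s = {{100, 101, 102, 103}, 50};
	TransactionId local[PAGE_ENTRIES];
	int			failed = 0;

	memcpy(local, s.page, sizeof(local));	/* 1. reader copies the page */
	toy_vacuum(&s, 1000);		/* 2. vacuum freezes and truncates */
	for (int i = 0; i < PAGE_ENTRIES; i++)	/* 3. reader uses the stale copy */
	{
		if (!toy_clog_lookup_ok(&s, local[i]))
			failed++;
	}
	return failed;
}
```

In this model the shared page is safe after vacuum, but the reader's local copy still holds the old XIDs, so every lookup made from the copy fails.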

This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink
  for investigating and providing reproducible test cases,
- Matheus Alcantara and Arseniy Mukhin for review and earlier proposed
  patches to fix this,
- Álvaro Herrera and Masahiko Sawada for reviews,
- Yura Sokolov aka funny-falcon for the idea of marking transactions
  as committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget
  anyone.

Backpatch to all supported versions. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.

Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
Backpatch-through: 14
Heikki Linnakangas
2025-11-12 20:59:36 +02:00
parent 1b4699090e
commit 8eeb4a0f7c
5 changed files with 196 additions and 0 deletions


@@ -30,6 +30,7 @@ tests += {
      't/001_emergency_vacuum.pl',
      't/002_limits.pl',
      't/003_wraparounds.pl',
      't/004_notify_freeze.pl',
    ],
  },
}


@@ -0,0 +1,71 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
#
# Test freezing XIDs in the async notification queue. This isn't
# really wraparound-related, but the test depends on the
# consume_xids() helper function.
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use Test::More;
my $node = PostgreSQL::Test::Cluster->new('node');
$node->init;
$node->start;
if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
{
	plan skip_all => "test xid_wraparound not enabled in PG_TEST_EXTRA";
}
# Setup
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
$node->safe_psql('postgres',
'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
# Start Session 1 and leave it idle in transaction
my $psql_session1 = $node->background_psql('postgres');
$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
# Send some notifys from other sessions
for my $i (1 .. 10)
{
	$node->safe_psql('postgres', "NOTIFY s, '$i'");
}
# Consume enough XIDs to trigger truncation, and one more with
# 'txid_current' to bump up the freeze horizon.
$node->safe_psql('postgres', 'select consume_xids(10000000);');
$node->safe_psql('postgres', 'select txid_current()');
# Remember current datfrozenxid before vacuum freeze so that we can
# check that it is advanced. (Taking the min() this way assumes that
# XID wraparound doesn't happen.)
my $datfrozenxid = $node->safe_psql('postgres',
	"select min(datfrozenxid::text::bigint) from pg_database");
# Execute vacuum freeze on all databases
$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
	"vacuumdb --all --freeze");
# Check that vacuumdb advanced datfrozenxid
my $datfrozenxid_freeze = $node->safe_psql('postgres',
	"select min(datfrozenxid::text::bigint) from pg_database");
ok($datfrozenxid_freeze > $datfrozenxid, 'datfrozenxid advanced');
# On Session 1, commit and ensure that all the notifications are
# received. This depends on correctly freezing the XIDs in the pending
# notification entries.
my $res = $psql_session1->query_safe('commit;', "commit listen s;");
my $notifications_count = 0;
foreach my $i (split('\n', $res))
{
	$notifications_count++;
	like($i,
		qr/Asynchronous notification "s" with payload "$notifications_count" received/
	);
}
is($notifications_count, 10, 'received all committed notifications');
done_testing();