1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-07 00:04:31 +03:00
Commit Graph

4561 Commits

Author SHA1 Message Date
Brandon Nesterenko
ea9869504d MDEV-33921: Replication breaks when filtering two-phase XA transactions
There are two problems.

First, replication fails when XA transactions are used where the
slave has replicate_do_db set and the client has touched a different
database when running DML such as inserts. This is because XA
commands are not treated as keywords, and are thereby not exempt
from the replication filter. The effect of this is that during an XA
transaction, if its logged “use db” from the master is filtered out
by the replication filter, then XA END will be ignored, yet its
corresponding XA PREPARE will be executed in an invalid state,
thereby breaking replication.

Second, if the slave replicates an XA transaction which results in
an empty transaction, the XA START through XA PREPARE first phase of
the transaction won’t be binlogged, yet the XA COMMIT will be
binlogged. This will break replication in chain configurations.

The first problem is fixed by treating XA commands in
Query_log_event as keywords, thus allowing them to bypass the
replication filter. Note that Query_log_event::is_trans_keyword() is
changed to accept a new parameter to define its mode, to either
check for XA commands or regular transaction commands, but not both.
In addition, mysqlbinlog is adapted to use this mode so its
--database filter does not remove XA commands from its output.

The second problem fixed by overwriting the XA state in the XID
cache to be XA_ROLLBACK_ONLY, so at commit time, the server knows to
rollback the transaction and skip its binlogging. If the xid cache
is cleared before an XA transaction receives its completion command
(e.g. on server shutdown), then before reporting ER_XAER_NOTA when
the completion command is executed, the filter is first checked if
the database is ignored, and if so, the error is ignored.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-07-10 14:37:39 -06:00
Monty
dd99780967 MDEV-34504 PURGE BINARY LOGS not working anymore
PURGE BINARY LOGS did not always purge binary logs. This commit fixes
some of the issues and adds notifications if a binary log cannot be
purged.

User visible changes:
- 'PURGE BINARY LOG TO log_name' and 'PURGE BINARY LOGS BEFORE date'
  worked differently. 'TO' ignored 'slave_connections_needed_for_purge'
  while 'BEFORE' did not. Now both versions ignores the
  'slave_connections_needed_for_purge variable'.
- 'PURGE BINARY LOG..' commands now returns 'note' if a binary log cannot
   be deleted like
   Note 1375 Binary log 'master-bin.000004' is not purged because it is
             the current active binlog
- Automatic binary log purges, based on date or size, will write a
  note to the error log if a binary log matching the size or date
  cannot yet be deleted.
- If 'slave_connections_needed_for_purge' is set from a config or
  command line, it is set to 0 if Galera is enabled and 1 otherwise
  (old default). This ensures that automatic binary log purge works
  with Galera as before the addition of
  'slave_connections_needed_for_purge'.
  If the variable is changed to 0, a warning will be printed to the error
  log.

Code changes:
- Added THD argument to several purge_logs related functions that needed
  THD.
- Added 'interactive' options to purge_logs functions. This allowed
  me to remove testing of sql_command == SQLCOM_PURGE.
- Changed purge_logs_before_date() to first check if log is applicable
  before calling can_purge_logs(). This ensures we do not get a
  notification for logs that does not match the remove criteria.
- MYSQL_BIN_LOG::can_purge_log() will write notifications to the user
  or error log if a log cannot yet be removed.
- log_in_use() will return reason why a binary log cannot be removed.

Changes to keep code consistent:
- Moved checking of binlog_format for Galera to be after Galera is
  initialized (The old check never worked). If Galera is enabled
  we now change the binlog_format to ROW, with a warning, instead of
  aborting the server. If this change happens a warning will be printed to
  the error log.
- Print a warning if Galera or FLASHBACK changes the binlog_format
  to ROW. Before it was done silently.

Reviewed by: Sergei Golubchik <serg@mariadb.com>,
             Kristian Nielsen <knielsen@knielsen-hq.org>
2024-07-10 18:50:08 +03:00
Alexander Barkov
5fb07d942b Merge remote-tracking branch 'origin/11.2' into 11.4 2024-07-09 21:45:37 +04:00
Alexander Barkov
8aad19ddfc Merge remote-tracking branch 'origin/11.1' into 11.2 2024-07-09 14:04:11 +04:00
Julius Goryavsky
4026f04425 Merge branch 10.5 into 10.6 2024-07-09 11:56:47 +02:00
Alexander Barkov
44af9bfc67 Merge remote-tracking branch 'origin/10.11' into 11.1 2024-07-09 10:45:47 +04:00
Alexander Barkov
4ffeca643d Renaming a few mysql-test/suite/rpl/include/*.test files into *.inc
Builders:
- amd64-fedora-38-last-N-failed
- amd64-debian-10-last-N-failed
erroneously execute MTR for these files:

- rpl/include/rpl_extra_col_master.test
- rpl/include.rpl_binlog_max_cache_size.test
- rpl/include.rpl_charset.test

when these files are changed by a commit.
Changing their extension from *.test to *.inc to avoid this.
2024-07-09 08:46:40 +04:00
Oleksandr Byelkin
2447dda2c0 Merge branch '10.11' into 11.1 2024-07-08 22:40:16 +02:00
Alexander Barkov
4d71a117a3 Merge remote-tracking branch 'origin/10.6' into 10.11 2024-07-08 21:52:08 +04:00
Brandon Nesterenko
744580d5a7 MDEV-32892: IO Thread Reports False Error When Stopped During Connecting to Primary
The IO thread can report error code 2013 into the error log when it
is stopped during the initial connection process to the primary, as
well as when trying to read an event. However, because the IO thread
is being stopped, its connection to the primary is force-killed by
the signaling thread (see THD::awake_no_mutex()), and thereby these
connection errors should be ignored.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-07-08 10:39:17 -06:00
Alexander Barkov
e56040fee8 Merge remote-tracking branch 'origin/10.5' into 10.6 2024-07-08 18:59:04 +04:00
Brandon Nesterenko
eb4458e993 MDEV-33465: an option to enable semisync recovery
The current semi-sync binlog fail-over recovery process uses
rpl_semi_sync_slave_enabled==TRUE as its condition to truncate a
primary server’s binlog, as it is anticipating the server to re-join
a replication topology as a replica. However, for servers configured
with both rpl_semi_sync_master_enabled=1 and
rpl_semi_sync_slave_enabled=1, if a primary is just re-started (i.e.
retaining its role as master), it can truncate its binlog to drop
transactions which its replica(s) has already received and executed.
If this happens, when the replica reconnects, its gtid_slave_pos can
be ahead of the recovered primary’s gtid_binlog_pos, resulting in an
error state where the replica’s state is ahead of the primary’s.

This patch changes the condition for semi-sync recovery to truncate
the binlog to instead use the configuration variable
--init-rpl-role, when set to SLAVE. This allows for both
rpl_semi_sync_master_enabled and rpl_semi_sync_slave_enabled to be
set for a primary that is restarted, and no transactions will be
lost, so long as --init-rpl-role is not set to SLAVE.

Reviewed By:
============
Sergei Golubchik <serg@mariadb.com>
2024-07-05 19:53:57 -06:00
Brandon Nesterenko
cbc1898e82 MDEV-25607: Auto-generated DELETE from HEAP table can break replication
The special logic used by the memory storage engine
to keep slaves in sync with the master on a restart can
break replication. In particular, after a restart, the
master writes DELETE statements in the binlog for
each MEMORY-based table so the slave can empty its
data. If the DELETE is not executable, e.g. due to
invalid triggers, the slave will error and fail, whereas
the master will never see the problem.

Instead of DELETE statements, use TRUNCATE to
keep slaves in-sync with the master, thereby bypassing
triggers.

Reviewed By:
===========
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-07-05 12:00:09 -06:00
Marko Mäkelä
27a3366663 Merge 10.6 into 10.11 2024-06-27 10:26:09 +03:00
Marko Mäkelä
0076eb3d4e Merge 10.5 into 10.6 2024-06-24 13:09:47 +03:00
Monty
3541bd63f0 MDEV-33582 Add more warnings to be able to better diagnose network issues
Changed the logged messages from errors to warnings
Also changed 'remain' to 'read_length' in the warning to make it more readable.
2024-06-20 09:53:01 +03:00
Brandon Nesterenko
8c8b3ab784 MDEV-34274: Test rpl.rpl_change_master_demote frequently fails on buildbot with "IO thread should not be running..."
The test rpl.rpl_change_master_demote used a `sleep 1` command
to give time for a START SLAVE UNTIL to start the slave threads
and wait for them to automatically die by UNTIL.  On machines
with heavy load (especially MSAN bb builders), one second was
not enough, and the test would fail due to the IO thread
still being up.

This patch fixes the test by replacing the sleep with specific
conditions to wait for. The test cannot wait for the IO or SQL
threads to start, as it would be possible that they would be
started and stopped by the time the MTR executor would check
the slave status. So instead, we test for proof that they
existed via the Connections status variable being incremented
by at least 2 (Connections just shows the global thread id).
At this point, we still can't use the wait_for_slave_to_stop
helper, as the SQL/IO_Running fields of SHOW SLAVE STATUS
may not be updated yet. So instead, we use
information_schema.processlist, which would show the presence
of the Slave_SQL/IO threads. So to "wait for the slave to stop",
we wait for the Slave_SQL/IO threads to be gone from the
processlist.
2024-06-19 10:17:08 -06:00
Andrei
387bdb2a2e MDEV-29934 rpl.rpl_start_alter_chain_basic, rpl.rpl_start_alter_restart_slave sometimes fail in BB with result content mismatch
rpl.rpl_start_alter_chain_basic was used to fail sporadically due
to a missed GTID master-slave synchronization which was necessary
because of the following SELECT from GTID-state table.

Fixed with arranging two synchronization pieces for two
chain slaves requiring that.

Note rpl.rpl_start_alter_restart_slave must have been fixed by
MDEV-30460 and 87e13722a9 (manual) merge commit.
2024-06-19 14:09:00 +03:00
Brandon Nesterenko
6cab2f75fe MDEV-23857: replication master password length
After MDEV-4013, the maximum length of replication passwords was extended to
96 ASCII characters. After a restart, however, slaves only read the first 41
characters of MASTER_PASSWORD from the master.info file. This lead to slaves
unable to reconnect to the master after a restart.

After a slave restart, if a master.info file is detected, use the full
allowable length of the password rather than 41 characters.

Reviewed By:
============
Sergei Golubchik <serg@mariadb.com>
2024-06-18 07:21:18 -06:00
Andrei
c37b2a9f04 MDEV-30460 rpl.rpl_start_alter_restart_slave sometimes fails in BB with result length mismatch
The test was used to fail because of lacking a synchronization point
designating an expected for processing "CA_1" event has been indeed
taken into this phase.
The event apparently may not have even arrived at slave.

Fixed with deploying the missed synchronization.
2024-06-17 19:06:34 +03:00
Alexander Barkov
c4bf4ce948 Merge remote-tracking branch 'origin/11.2' into 11.4 2024-06-17 15:46:39 +04:00
Marko Mäkelä
a21e49cbcc Merge 11.1 into 11.2 2024-06-17 12:02:03 +03:00
Marko Mäkelä
d34289a3e2 Merge 10.11 into 11.1 2024-06-17 09:21:50 +03:00
Marko Mäkelä
b81d717387 Merge 10.6 into 10.11 2024-06-11 12:50:10 +03:00
Brandon Nesterenko
fcd21d3e40 MDEV-34355: rpl.rpl_semi_sync_no_missed_ack_after_add_slave ‘server_3 should have sent…’
The problem is that the test could query the status variable
Rpl_semi_sync_slave_send_ack before the slave actually updated it.
This would result in an immediate --die assertion killing the rest
of the test. The bottom of this commit message has a small patch
that can be applied to reproduce the test failure.

This patch fixes the test failure by waiting for the variable to be
updated before querying its value.

diff --git a/sql/semisync_slave.cc b/sql/semisync_slave.cc
index 9ddd4c5c8d7..60538079fce 100644
--- a/sql/semisync_slave.cc
+++ b/sql/semisync_slave.cc
@@ -303,7 +303,10 @@ int Repl_semi_sync_slave::slave_reply(Master_info *mi)
     reply_res= DBUG_EVALUATE_IF("semislave_failed_net_flush", 1,
                                 net_flush(net));
     if (!reply_res)
+    {
+      sleep(1);
       rpl_semi_sync_slave_send_ack++;
+    }
   }
   DBUG_RETURN(reply_res);
 }
2024-06-10 12:27:20 -06:00
Oleksandr Byelkin
99b370e023 Merge branch '11.2' into 11.4 2024-05-21 19:38:51 +02:00
Sergei Golubchik
bf5da43e50 Merge branch '11.1' into 11.2 2024-05-13 10:00:26 +02:00
Sergei Golubchik
f0a5412037 Merge branch '11.0' into 11.1 2024-05-13 09:52:30 +02:00
Sergei Golubchik
f9807aadef Merge branch '10.11' into 11.0 2024-05-12 12:18:28 +02:00
Sergei Golubchik
a6b2f820e0 Merge branch '10.6' into 10.11 2024-05-10 20:02:18 +02:00
Sergei Golubchik
7b53672c63 Merge branch '10.5' into 10.6 2024-05-08 20:06:00 +02:00
Kristian Nielsen
383ee364dc Merge 10.6 to 10.11 2024-05-07 08:45:31 +02:00
Sergei Golubchik
7a789e2027 sporadic failures of rpl.rpl_parallel_sbm
the test waits for the event to get stuck on MASTER_DELAY,
but on a slow/overloaded slave the event might pass MASTER_DELAY
before the test starts waiting.

Wait for the event to get stuck on the LOCK TABLES (after MASTER_DELAY),
the event cannot avoid that,
2024-05-05 21:37:07 +02:00
Sergei Golubchik
cea083af9f cleanup: use THD_STAGE_INFO, not thd_proc_info
and put master-slave.inc *last* in the series of includes
2024-05-05 21:37:07 +02:00
Kristian Nielsen
4b4db4a8e5 MDEV-34042: Deadlock kill of XA PREPARE can break replication / rpl.rpl_parallel_multi_domain_xa sporadic failure
Refinement of the original patch.

Move the code to reset the kill up into the parent class
Xid_apply_log_event, to also fix the similar issue for XA COMMIT.

Increase the number of slave retries in the test case
rpl.rpl_parallel_multi_domain_xa to fix some sporadic failures. The test
generates massive amounts of conflicting transactions in multiple
independent domains, which can cause multiple rollback+retry for a
transaction as it conflicts with transactions in other domains one-by-one.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-05-05 19:01:56 +02:00
Sergei Golubchik
3ee6f69d49 sporadic failures of binlog_encryption.rpl_parallel_gco_wait_kill
CURRENT_TEST: binlog_encryption.rpl_parallel_gco_wait_kill
mysqltest: In included file "./suite/rpl/t/rpl_parallel_gco_wait_kill.test":
included from /home/buildbot/amd64-ubuntu-2004-debug/build/mysql-test/suite/binlog_encryption/rpl_parallel_gco_wait_kill.test at line 2:
At line 334: Can't initialize replace from 'replace_result $thd_id THD_ID'

An sql thread can reach the "Slave has read all relay log" state
and then start reading relay log again. Let's use a more generic
pattern to retrieve the sql thread ID even if it's not
in the "read all relay log" state.
2024-05-02 22:14:19 +02:00
Kristian Nielsen
e365877bae MDEV-33798: ROW base optimistic deadlock with concurrent writes on same table
One case is conflicting transactions T1 and T2 with different domain id, in
optimistic parallel replication in non-GTID mode. Then T2 will
wait_for_prior_commit on T1; and if T1 got a row lock wait on T2 it would
hang, as different domains caused the deadlock kill to be skipped in
thd_rpl_deadlock_check().

More generally, if we have transactions T1 and T2 in one domain/master
connection, and independent transactions U in another, then we can
still deadlock like this:

  T1 row low wait on U
  U row lock wait on T2
  T2 wait_for_prior_commit on T1

This commit enforces the deadlock kill in these cases. If the waited-for
transaction is speculatively applied, then it will be deadlock killed in
case of a conflict, even if the two transactions are in different domains
or master connections.

Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-05-02 21:07:45 +02:00
Sergei Golubchik
0aae11ac28 Merge branch '10.6' into 10.11 2024-04-30 16:56:49 +02:00
Andrei
3fa2caf553 MDEV-31404 post-push for rpl.max_binlog_total_size
The test's header did not follow a correct `have_` and `master-slave`
sourcing pattern.

That's corrected.
2024-04-30 14:00:19 +03:00
Andrei
ae03374f29 MDEV-34030 rpl.rpl_using_gtid_default can fail in (BB) mtr
The test's header is not written to follow strictly a correct order
of checks by mtr at test start which may lead to an error. E.g

./mtr --mysqld=--binlog-format=row rpl.rpl_using_gtid_default

to
At line 175: query 'SET GLOBAL gtid_slave_pos= ""' failed: ER_SLAVE_MUST_STOP (1198): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first

Fixed to require the binlog format first in the test header.
2024-04-30 12:40:50 +03:00
Andrei
6a63204c36 MDEV-34029 rpl.rpl_heartbeat can fail when (BB) mtr reorders tests
rpl.rpl_heartbeat turns out to miss a standard include/master-slave
header which made it potentially in BB and actually with manual mtr
failing as it may have used a previous slave GTID state.

Fixed with installing the standard rpl suite header/footer in the
test file.
2024-04-30 12:40:50 +03:00
Sergei Golubchik
c1f3eff53f Merge branch '10.5' into 10.6 2024-04-29 10:08:58 +02:00
Sergei Golubchik
7ff649315e sporadic failures of rpl.rpl_parallel_multi_domain_xa
it's a slow test, the slave needs to catch up, reading >1500
transactions. A default MASTER_GTID_WAIT() timeout in
sync_with_master_gtid.inc is 120 seconds, which might be not
enough for a slow/overloaded slave.

Let's wait forever or until ./mtr --testcase-timeout,
whatever comes first.
2024-04-26 14:24:32 +02:00
Sergei Golubchik
9e92582024 sporadic failures of rpl.rpl_parallel_sbm
the test waits for the event to get stuck on MASTER_DELAY,
but on a slow/overloaded slave the event might pass MASTER_DELAY
before the test starts waiting.

Wait for the event to get stuck on the LOCK TABLES (after MASTER_DELAY),
the event cannot avoid that,
2024-04-25 12:47:23 +02:00
Kristian Nielsen
553a4d6271 MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start
The test could fail with a duplicate key error because switching to non-GTID
mode could start at the wrong old-style position. The position could be
wrong when the previous GTID connect was stopped before receiving the fake
GTID list event which gives the old-style position corresponding to the GTID
connected position.

Work-around by injecting an extra event and syncing the slave before
switching to non-GTID mode.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-25 11:00:45 +02:00
Sergei Golubchik
9cf718859f cleanup: use THD_STAGE_INFO, not thd_proc_info
and put master-slave.inc *last* in the series of includes
2024-04-24 22:08:52 +02:00
Brandon Nesterenko
8c7992165b MDEV-33672: 10.11 Fix for Two Phase Alter Flags
Extends 89c907bd4f to account for
binlog_two_phase_alter flags in a Gtid log event. I.e., if the
FL_COMMIT_ALTER_E1 or FL_ROLLBACK_ALTER_E2 flags are set in the
event flags, yet the length of the event is too short to hold
the value, then set the event as invalid
2024-04-24 13:19:36 +02:00
Sergei Golubchik
7fe764b109 MDEV-32973 SHOW TABLES LIKE shows temporary tables with non-matching names
* compare both db and table name
* use the correct charset
2024-04-23 12:31:41 +02:00
Sergei Golubchik
f243c73788 sporadic failures of rpl.rpl_rewrite_db_sys_vars
first stop the slave, then run commands on the master that are
supposed to fail on the slave, then start the slave.

if you swap first two steps, the slave might get and execute those
commands before it's stopped, which will fail the test.

also, improve debugability
2024-04-22 21:02:11 +02:00
Sergei Golubchik
018d537ec1 Merge branch '10.6' into 10.11 2024-04-22 15:23:10 +02:00