1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-07 00:04:31 +03:00
Commit Graph

4561 Commits

Author SHA1 Message Date
Marko Mäkelä
3d23adb766 Merge 10.6 into 10.11 2024-11-29 13:43:17 +02:00
Brandon Nesterenko
a06d81ff3f MDEV-35477: rpl_semi_sync_no_missed_ack_after_add_slave fails after MDEV-35109
MTR test rpl_semi_sync_no_missed_ack_after_add_slave fails on
buildbot after the preparatory commit for MDEV-35109 (5290fa043b)
which changed a sleep to a debug_sync point. The problem is that the
debug_sync point would time-out on a slave while waiting to enter
the logic to send an ACK reply. More specifically, where the test
config is a primary with two replicas, and the test waits on one of
the replicas to start sending an ACK, if the other replica was able
to receive the event and respond with an ACK before the binlog dump
thread of the timing-out server would prepare to send event, it
wouldn't set the SEMI_SYNC_NEED_ACK flag, and the replica wouldn't
even try to respond with an ACK.

Fix is to use debug_sync for both replicas such that both replicas
are held before sending their ack, so one can’t temporarily disable
semi-sync for the other before it receives the transaction.
2024-11-21 11:30:25 -07:00
Dave Gosselin
3f114a0930 MDEV-35046 SIGSEGV in list_delete in optimized builds when using pseudo_slave_mode
slave_applier_reset_xa_trans() should clear the THD::pseudo_thread_id when called
to reset XA transaction state completely.  Clearing when pseudo_thread_id models
the binlog applier that handles BASE64-encoded events which possibly contain the
pseudo_thread_id, allowing us to restore the pre-event's state of the
connection's respective session var.
2024-11-15 09:48:28 -05:00
Brandon Nesterenko
716ed2ce22 MDEV-35350: Consolidate MTR wait_for_pattern_in_file.inc and SEARCH_WAIT in search_pattern_in_file.inc
Replace wait_for_pattern_in_file.inc and all of its uses
to use search_pattern_in_file.inc with SEARCH_WAIT.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Sergei Golubchik <serg@mariadb.org>
2024-11-07 13:25:58 -07:00
Vladislav Vaintroub
faf9e755ba MDEV-35109 fix test case
rpl_semi_sync_after_sync_coord_consistency fails on release compilation
2024-11-05 22:38:55 +01:00
Brandon Nesterenko
b07258a0d5 MDEV-35109: Semi-sync Replication stalling Primary using wait point=AFTER_SYNC
For a primary configured with wait_point=AFTER_SYNC, if two threads
T1 (binlogging through MYSQL_BIN_LOG::write()) and T2 were
binlogging at the same time, T1 could accidentally wait for its
semi-sync ACK using the binlog coordinates of T2. Prior to
MDEV-33551, this only resulted in delayed transactions, because all
transactions shared the same condition variable for ACK signaling.
However, with the MDEV-33551 changes, each thread has its own
condition variable to signal. So T1 could wait indefinitely when
either:
  1) T1's ACK is received but not T2's when T1 goes into
wait_after_sync(), because the ACK receiver thread has already
notified about the T1 ACK, but T1 was _actually_ waiting on T2's
ACK, and therefore tries to wait (in vain).

  2) T1 goes to wait_after_sync() before any ACKs have arrived. When
T1's ACK comes in, T1 is woken up; however, sees it needs to wait
more (because it was actually waiting on T2's ACK), and goes to wait
again (this time, in vain).

Note that the actual cause of T1 waiting on T2's binlog coordinates
is when MYSQL_BIN_LOG::write() would call
Repl_semisync_master::wait_after_sync(), the binlog offset parameter
was read as the end of MYSQL_BIN_LOG::log_file, which is shared
among transactions. So if T2 had updated the binary log _after_ T1
had released LOCK_log, but not yet invoked wait_after_sync(), it
would use the end of the binary log file as the binlog offset, which
was that of T2 (or any future transaction).

The fix in this patch ensures consistency between the binary log
coordinates a transaction uses between report_binlog_update() and
wait_after_sync().

Reviewed By
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-11-04 10:45:58 -07:00
Brandon Nesterenko
5290fa043b MDEV-35109 PREP: simulate_delay_semisync_slave_reply use debug_sync
This is a preparatory commit for MDEV-35109 to make its
testing code cleaner (and harden other tests too).

The DEBUG_DBUG point simulate_delay_semisync_slave_reply
up to this patch used my_sleep() to delay an ACK response,
but sleeps are prone to test failures on machines that
run tests when already having a heavy load (e.g. on
buildbot).

This patch changes this DEBUG_DBUG sleep to use DEBUG_SYNC
to coordinate exactly when a slave should send its reply,
which is safer and faster.

As DEBUG_SYNC can't be used while a server is shutting
down, to synchronize threads with SHUTDOWN WAIT FOR SLAVES
logic, we use and extend wait_for_pattern_in_file.inc to
wait for an informational error message in the logic to
indicate that the shutdown process has reached the
intended state (i.e. indicating that the shutdown has
been delayed to await semi-sync ACKs). Specifically, the
extensions are as follows:

 1. wait_for_pattern_in_file.inc is extended with parameter
    wait_for_pattern_count as a number that indicates the
    number of times a pattern should occur in the file before
    return control back to the calling script.

 2. search_for_pattern_in_file.inc is extended with parameter
    SEARCH_ABORT_IS_SUCCESS to inverse the error/success
    logic, so the SEARCH_ABORT condition can be used to
    indicate success, rather than error.
2024-11-04 10:45:58 -07:00
Brandon Nesterenko
e9a502df08 Testing fix for rpl_semi_sync_cond_var_per_thd failure 2024-10-30 08:32:19 -06:00
Oleksandr Byelkin
c770bce898 Merge branch '11.2' into 11.4 2024-10-30 15:11:17 +01:00
Oleksandr Byelkin
69d033d165 Merge branch '10.11' into 11.2 2024-10-29 16:42:46 +01:00
Oleksandr Byelkin
3d0fb15028 Merge branch '10.6' into 10.11 2024-10-29 15:24:38 +01:00
Monty
066f920484 MDEV-35110 Deadlock on Replica during BACKUP STAGE BLOCK_COMMIT on XA transactions
This is an extension of MDEV-30423 "Deadlock on Replica during BACKUP
STAGE BLOCK_COMMIT on XA transactions"

The original commit in MDEV-30423 was not complete as some usage in XA of
MDL_BACKUP_COMMIT locks did not set thd->backup_commit_lock.
This is required to be set when using parallel replication.

Fixed by ensuring that all usage of BACKUP_COMMIT lock i XA is uniform and
all sets thd->backup_commit_lock. I also changed all locks to be
MDL_EXPLICIT to keep also that part uniform.

A regression test is added.
2024-10-28 13:29:21 +02:00
Brandon Nesterenko
1ed30e08af MDEV-34122: Assertion `entry' failed in Active_tranx::assert_thd_is_waiter
If semi-sync is switched off then on while a transaction is
in-between binlogging and waiting for an ACK, the semi-sync state of
the transaction is removed, leading to a debug assertion that
indicates the transaction tried to wait, but cannot receive an ACK
signal. More specifically, when semi-sync is switched off, the
Active_tranx list is cleared (where a transaction adds an entry to
this list during binlogging), and each entry in this list saves the
thread which will wait for an ACK, and the thread has the COND
variable to signal to wake itself. So if the entry is lost, the
Ack_receiver thread won’t be able to find the thread to wake up when
an ACK comes in

The fix is to ensure that the entry exists before awaiting the ACK,
and if there is no entry, skip the wait. In debug builds, an
informative message is written explaining that the transaction is
skipping its wait. Additional debug-build only logic is added to
ensure that the cause of the missing entry is due to semi-sync being
turned off and on

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-10-21 15:35:54 -06:00
Sergei Golubchik
3a1cf2c85b MDEV-34679 ER_BAD_FIELD uses non-localizable substrings 2024-10-17 21:37:37 +02:00
Sergei Golubchik
5ebda30ccc Revert "MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2"
This reverts commit 8ae462a220.
2024-10-16 13:23:47 +02:00
Kristian Nielsen
8ae462a220 MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2
Implement variable legacy_xa_rollback_at_disconnect to support
backwards compatibility for applications that rely on the pre-10.5
behavior for connection disconnect, which is to rollback the
transaction (in violation of the XA specification).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-10-16 10:18:36 +02:00
Marko Mäkelä
b53b81e937 Merge 11.2 into 11.4 2024-10-03 14:32:14 +03:00
Marko Mäkelä
12a91b57e2 Merge 10.11 into 11.2 2024-10-03 13:24:43 +03:00
Marko Mäkelä
63913ce5af Merge 10.6 into 10.11 2024-10-03 10:55:08 +03:00
Marko Mäkelä
7e0afb1c73 Merge 10.5 into 10.6 2024-10-03 09:31:39 +03:00
Lena Startseva
0a5e4a0191 MDEV-31005: Make working cursor-protocol
Updated tests: cases with bugs or which cannot be run
with the cursor-protocol were excluded with
"--disable_cursor_protocol"/"--enable_cursor_protocol"

Fix for v.10.5
2024-09-18 18:39:26 +07:00
Brandon Nesterenko
68938d2b42 MDEV-33500 (part 2): rpl.rpl_parallel_sbm can still fail
The failing test case validates Seconds_Behind_Master for a delayed
slave, while STOP SLAVE is executed during a delay. The test fixes
initially added to the test (commit b04c857596) added a table lock
to ensure a transaction could not finish before validating the
Seconds_Behind_Master field after SLAVE START, but did not address a
possibility that the transaction could finish before running the
STOP SLAVE command, which invalidates the validations for the rest
of the test case. Specifically, this would result in 1) a timeout in
“Waiting for table metadata lock” on the replica, which expects the
transaction to retry after slave restart and hit a lock conflict on
the locked tables (added in b04c857596), and 2) that
Seconds_Behind_Master should have increased, but did not.

The failure can be reproduced by synchronizing the slave to the master
before the MDEV-32265 echo statement (i.e. before the SLAVE STOP).

This patch fixes the test by adding a mechanism to use DEBUG_SYNC to
synchronize a MASTER_DELAY, rather than continually increase the
duration of the delay each time the test fails on buildbot. This is
to ensure that on slow machines, a delay does not pass before the
test gets a chance to validate results. Additionally, it decreases
overall test time because the test can continue immediately after
validation, thereby bypassing the remainder of a full delay for each
transaction.
2024-09-17 06:29:20 -06:00
Marko Mäkelä
44733aa8cf Merge 11.2 into 11.4 2024-08-29 19:10:38 +03:00
Marko Mäkelä
e91a799458 Merge 10.11 into 11.2 2024-08-29 16:02:57 +03:00
Marko Mäkelä
cfcf27c6fe Merge 10.6 into 10.11 2024-08-29 07:47:29 +03:00
Marko Mäkelä
48becffd07 Merge 10.5 into 10.6 2024-08-27 08:52:10 +03:00
Kristian Nielsen
8642453ce6 Fix sporadic failure of test case rpl.rpl_start_stop_slave
The test was expecting the I/O thread to be in a specific state, but thread
scheduling may cause it to not yet have reached that state. So just have a
loop that waits for the expected state to occur.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
214e6c5b3d Fix sporadic failure of test case rpl.rpl_old_master
Remove the test for MDEV-14528. This is supposed to test that parallel
replication from pre-10.0 master will update Seconds_Behind_Master. But
after MDEV-12179 the SQL thread is blocked from even beginning to fetch
events from the relay log due to FLUSH TABLES WITH READ LOCK, so the test
case is no longer testing what is was intended to. And pre-10.0 versions are
long since out of support, so does not seem worthwhile to try to rewrite the
test to work another way.

The root cause of the test failure is MDEV-34778. Briefly, depending on
exact timing during slave stop, the rli->sql_thread_caught_up flag may end
up with different value. If it ends up as "true", this causes
Seconds_Behind_Master to be 0 during next slave start; and this caused test
case timeout as the test was waiting for Seconds_Behind_Master to become
non-zero.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
7dc4ea5649 Fix sporadic test failure in rpl.rpl_create_drop_event
Depending on timing, an extra event run could start just when the event
scheduler is shut down and delay running until after the table has been
dropped; this would cause the test to fail with a "table does not exist"
error in the log.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
33854d7324 Restore skiping rpl.rpl_mdev6020 under Valgrind
(Revert a change done by mistake when XtraDB was removed.)

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Brandon Nesterenko
bd54475efa MDEV-34779: Sporadic test failure in rpl.rpl_semi_sync_cond_var_per_thd
In a merge, an mtr.call_suppression was erroneously removed
for "Got an error writing communication packets". So when
the error would expectedly occur, the test would fail.

This patch adds this suppression back.

Note that the test will still fail due MDEV-34799, which
is to be fixed in 10.6.
2024-08-22 13:44:15 -06:00
Kristian Nielsen
78fcb9474c Fix sporadic failure in test rpl.rpl_rotate_logs
Clarify confusing comments in the previous commit, and note that the failure
started after push of MDEV-34504.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-19 21:18:56 +02:00
Kristian Nielsen
5dc2fe4815 Fix sporadic failure in test rpl.rpl_rotate_logs
The test started failing after push of MDEV-31404.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-16 22:27:01 +02:00
Oleksandr Byelkin
1640c9b06e Merge branch '11.2' into 11.4 2024-08-04 17:27:48 +02:00
Oleksandr Byelkin
dced6cbdb6 Merge branch '11.1' into 11.2 2024-08-03 09:50:16 +02:00
Oleksandr Byelkin
80abd847da Merge branch '10.11' into 11.1 2024-08-03 09:32:42 +02:00
Oleksandr Byelkin
8f020508c8 Merge branch '10.5' into 10.6 2024-08-03 09:04:24 +02:00
Brandon Nesterenko
001608de7e MDEV-15393: Fix rpl_mysqldump_gtid_slave_pos
The slave would try to sync_with_master_gtid.inc,
but the master never actually saved its gtid position
so the test would move on too quickly.
2024-07-31 14:17:46 -06:00
Andrei
c944cd6fec MDEV-15393 post-push: complete rpl_mysqldump_gtid_slave_pos fixes.
Added a missed
  --source include/save_master_gtid.inc
by the previous commit.
2024-07-22 20:52:26 +03:00
Oleksandr Byelkin
0fe39d368a Merge branch '10.6' into 10.11 2024-07-22 15:14:50 +02:00
Oleksandr Byelkin
a938503cfb Merge branch '10.5' into 10.6 2024-07-20 08:12:42 +02:00
Andrei
b8f92ade57 MDEV-15393 gtid_slave_pos duplicate key errors after mysqldump restore
When mysqldump is run to dump the `mysql` system database, it generates
INSERT statements into the table `mysql.gtid_slave_pos`.
After running the backup script
those inserts did not produce the expected gtid state on slave. In
particular the maximum of mysql.gtid_slave_pos.sub_id did not make
into
   rpl_global_gtid_slave_state.last_sub_id

an in-memory object that is supposed to match the current state of the
table. And that was regardless of whether --gtid option was specified
or not. Later when the backup recipient server starts as slave
in *non-gtid* mode this desychronization may lead to a duplicate key
error.

This effect is corrected for --gtid mode mysqldump/mariadb-dump only
as the following.  The fixes ensure the insert block of the dump
script is followed with a "summing-up" SET @global.gtid_slave_pos
assignment.

For the implemenation part, note a deferred print-out of
SET-gtid_slave_pos and associated comments is prefered over relocating
of the entire blocks if (opt_master,slave_data &&
do_show_master,slave_status) ...  because of compatiblity
concern. Namely an error inside do_show_*() is handled in the new code
the same way, as early as, as before.

A regression test can be run in how-to-reproduce mode as well.
One affected mtr test observed.
rpl_mysqldump_slave.result "mismatch" shows now the new deferring print
of SET-gtid_slave_pos policy in action.
2024-07-19 21:44:12 +03:00
Oleksandr Byelkin
9af2caca33 Merge branch '10.5' into 10.6 2024-07-18 16:25:33 +02:00
Brandon Nesterenko
a061ae1079 MDEV-33921: Fix rpl_xa_empty_transaction.test
The test was missing a save_master_gtid.inc on the master,
leading to the slave thinking it was in sync after executing
sync_with_master_gtid.inc, despite not having executed the
latest transaction. This skipped transaction, XA COMMIT,
was supposed to error-to-be-ignored because its XID could not
be found, but be thrown out because the replication filters
would filter out the target database. However, if the slave
was able to stop before executing the transaction, then
the replication filer is reset (to empty), and when the
slave is later restarted, that transactions error would
no longer be ignored.

Additionally, as the test cases added in MDEV-33921 rely
on GTID synchronization, the test cases now force
master_use_gtid=slave_pos for consistency
2024-07-17 16:38:26 -06:00
Sergei Golubchik
d60f5c11ea MDEV-34318 mariadb-dump SQL syntax error with MAX_STATEMENT_TIME against Percona MySQL server
protect MariaDB conditional comments from a bug
in Percona MySQL comment parser
2024-07-17 21:25:40 +02:00
Yuchen Pei
f071b7620b Merge branch '10.5' into 10.6 2024-07-16 15:54:22 +08:00
Daniel Black
e8bcc4e455 MDEV-34568 rpl.rpl_mdev12179 - correct for Windows
Simplify in an attempt to avoid:

mysqltest: At line 275: File already exist: on the write_file
lines.

Using write_line as that's what a lot of other tests
do for writing small bits to a expect file.

Review thanks Valdislav Vaintroub
2024-07-12 12:55:28 +02:00
Brandon Nesterenko
632dd304c7 MDEV-34554: rpl_change_master_demote sporadically fails on buildbot
MDEV-34274 did not fix the test failure. The test has a START SLAVE
UNTIL condition, where we can't use sync_with_master_gtid.inc,
wait_for_slave_to_start.inc, or wait_for_slave_to_stop.inc because
our MTR connection thread races with the start/stop of the SQL/IO
threads. So instead, for slave start, we prove the threads started
by waiting for the connection count to increase by 2; and for slave
stop, we wait for the processlist count to return to its pre start
slave number.
2024-07-11 14:45:12 -06:00
Brandon Nesterenko
fa80449725 MDEV-34274: Test rpl.rpl_change_master_demote frequently fails on buildbot with "IO thread should not be running..."
Note this is a backport of 8c8b3ab784
from 11.1.

The test rpl.rpl_change_master_demote used a `sleep 1` command
to give time for a START SLAVE UNTIL to start the slave threads
and wait for them to automatically die by UNTIL.  On machines
with heavy load (especially MSAN bb builders), one second was
not enough, and the test would fail due to the IO thread
still being up.

This patch fixes the test by replacing the sleep with specific
conditions to wait for. The test cannot wait for the IO or SQL
threads to start, as it would be possible that they would be
started and stopped by the time the MTR executor would check
the slave status. So instead, we test for proof that they
existed via the Connections status variable being incremented
by at least 2 (Connections just shows the global thread id).
At this point, we still can't use the wait_for_slave_to_stop
helper, as the SQL/IO_Running fields of SHOW SLAVE STATUS
may not be updated yet. So instead, we use
information_schema.processlist, which would show the presence
of the Slave_SQL/IO threads. So to "wait for the slave to stop",
we wait for the Slave_SQL/IO threads to be gone from the
processlist.
2024-07-11 09:06:23 -06:00
Monty
e0cff1e72b Fixed failure in rpl.rpl_change_master_demote : "IO thread should not be running..."
The issue was that the test did not take into account that the IO thread
could have been in COMMAND=Connecting state, which happens before the
COMMANMD=Slave_IO state.

The test is a bit fragile as it depends on the COMMAND state to be
syncronised with the Slave_IO_State, which is not the case.

I added a new proc state and some more information to the error
output to be able to diagnose future failures more easily.
2024-07-11 11:15:47 +03:00