MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0

The problem occured when using parallel replication, and an error occured that caused the SQL thread to stop when the IO thread had already reached a following binlog file from the master (or otherwise performed a relay log rotation). In this case, the Rotate Event at the end of the relay log file could still be executed, even though an earlier event in that relay log file had gotten an error. This would cause the position to be incorrectly updated, so that upon restart of the SQL thread, the event that had failed would be silently skipped and ignored, causing replication corruption. Fixed by checking before executing Rotate Event, whether an earlier event has failed. If so, the Rotate Event is not executed, just dequeued, same as for other normal events following a failing event.
2026-01-06 05:22:24 +03:00 · 2014-08-15 11:31:13 +02:00
parent 65ac881c80
commit cfa1ce81bb
4 changed files with 131 additions and 8 deletions
--- a/mysql-test/suite/rpl/t/rpl_parallel.test
+++ b/mysql-test/suite/rpl/t/rpl_parallel.test
@@ -1408,6 +1408,64 @@ SELECT * FROM t6 ORDER BY a;
 SELECT * FROM t6 ORDER BY a;


+--echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***
+
+--connection server_1
+INSERT INTO t2 VALUES (31);
+--source include/save_master_gtid.inc
+
+--connection server_2
+--source include/sync_with_master_gtid.inc
+--source include/stop_slave.inc
+SET GLOBAL slave_parallel_threads= 0;
+--source include/start_slave.inc
+
+# Force a duplicate key error on the slave.
+SET sql_log_bin= 0;
+INSERT INTO t2 VALUES (32);
+SET sql_log_bin= 1;
+
+--connection server_1
+INSERT INTO t2 VALUES (32);
+# Rotate the binlog; the bug is triggered when the master binlog file changes
+# after the event group that causes the duplicate key error.
+FLUSH LOGS;
+INSERT INTO t2 VALUES (33);
+INSERT INTO t2 VALUES (34);
+SELECT * FROM t2 WHERE a >= 30 ORDER BY a;
+--source include/save_master_gtid.inc
+
+--connection server_2
+--let $slave_sql_errno= 1062
+--source include/wait_for_slave_sql_error.inc
+
+--connection server_2
+--source include/stop_slave_io.inc
+SET GLOBAL slave_parallel_threads=10;
+START SLAVE;
+
+--let $slave_sql_errno= 1062
+--source include/wait_for_slave_sql_error.inc
+
+# Note: IO thread is still running at this point.
+# The bug seems to have been that restarting the SQL thread after an error with
+# the IO thread still running, somehow picks up a later relay log position and
+# thus ends up skipping the failing event, rather than re-executing.
+
+START SLAVE SQL_THREAD;
+--let $slave_sql_errno= 1062
+--source include/wait_for_slave_sql_error.inc
+
+SELECT * FROM t2 WHERE a >= 30 ORDER BY a;
+
+# Skip the duplicate error, so we can proceed.
+SET sql_slave_skip_counter= 1;
+--source include/start_slave.inc
+--source include/sync_with_master_gtid.inc
+
+SELECT * FROM t2 WHERE a >= 30 ORDER BY a;
+
+
 --connection server_2
 --source include/stop_slave.inc
 SET GLOBAL slave_parallel_threads=@old_parallel_threads;