mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-09-11 05:52:26 +03:00

Author	SHA1	Message	Date
Brandon Nesterenko	744580d5a7	MDEV-32892: IO Thread Reports False Error When Stopped During Connecting to Primary The IO thread can report error code 2013 into the error log when it is stopped during the initial connection process to the primary, as well as when trying to read an event. However, because the IO thread is being stopped, its connection to the primary is force-killed by the signaling thread (see THD::awake_no_mutex()), and thereby these connection errors should be ignored. Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-07-08 10:39:17 -06:00
Alexander Barkov	e56040fee8	Merge remote-tracking branch 'origin/10.5' into 10.6	2024-07-08 18:59:04 +04:00
Brandon Nesterenko	eb4458e993	MDEV-33465: an option to enable semisync recovery The current semi-sync binlog fail-over recovery process uses rpl_semi_sync_slave_enabled==TRUE as its condition to truncate a primary server’s binlog, as it is anticipating the server to re-join a replication topology as a replica. However, for servers configured with both rpl_semi_sync_master_enabled=1 and rpl_semi_sync_slave_enabled=1, if a primary is just re-started (i.e. retaining its role as master), it can truncate its binlog to drop transactions which its replica(s) has already received and executed. If this happens, when the replica reconnects, its gtid_slave_pos can be ahead of the recovered primary’s gtid_binlog_pos, resulting in an error state where the replica’s state is ahead of the primary’s. This patch changes the condition for semi-sync recovery to truncate the binlog to instead use the configuration variable --init-rpl-role, when set to SLAVE. This allows for both rpl_semi_sync_master_enabled and rpl_semi_sync_slave_enabled to be set for a primary that is restarted, and no transactions will be lost, so long as --init-rpl-role is not set to SLAVE. Reviewed By: ============ Sergei Golubchik <serg@mariadb.com>	2024-07-05 19:53:57 -06:00
Brandon Nesterenko	cbc1898e82	MDEV-25607: Auto-generated DELETE from HEAP table can break replication The special logic used by the memory storage engine to keep slaves in sync with the master on a restart can break replication. In particular, after a restart, the master writes DELETE statements in the binlog for each MEMORY-based table so the slave can empty its data. If the DELETE is not executable, e.g. due to invalid triggers, the slave will error and fail, whereas the master will never see the problem. Instead of DELETE statements, use TRUNCATE to keep slaves in-sync with the master, thereby bypassing triggers. Reviewed By: =========== Kristian Nielsen <knielsen@knielsen-hq.org> Andrei Elkin <andrei.elkin@mariadb.com>	2024-07-05 12:00:09 -06:00
Marko Mäkelä	27a3366663	Merge 10.6 into 10.11	2024-06-27 10:26:09 +03:00
Marko Mäkelä	0076eb3d4e	Merge 10.5 into 10.6	2024-06-24 13:09:47 +03:00
Monty	3541bd63f0	MDEV-33582 Add more warnings to be able to better diagnose network issues Changed the logged messages from errors to warnings Also changed 'remain' to 'read_length' in the warning to make it more readable.	2024-06-20 09:53:01 +03:00
Brandon Nesterenko	8c8b3ab784	MDEV-34274: Test rpl.rpl_change_master_demote frequently fails on buildbot with "IO thread should not be running..." The test rpl.rpl_change_master_demote used a `sleep 1` command to give time for a START SLAVE UNTIL to start the slave threads and wait for them to automatically die by UNTIL. On machines with heavy load (especially MSAN bb builders), one second was not enough, and the test would fail due to the IO thread still being up. This patch fixes the test by replacing the sleep with specific conditions to wait for. The test cannot wait for the IO or SQL threads to start, as it would be possible that they would be started and stopped by the time the MTR executor would check the slave status. So instead, we test for proof that they existed via the Connections status variable being incremented by at least 2 (Connections just shows the global thread id). At this point, we still can't use the wait_for_slave_to_stop helper, as the SQL/IO_Running fields of SHOW SLAVE STATUS may not be updated yet. So instead, we use information_schema.processlist, which would show the presence of the Slave_SQL/IO threads. So to "wait for the slave to stop", we wait for the Slave_SQL/IO threads to be gone from the processlist.	2024-06-19 10:17:08 -06:00
Andrei	387bdb2a2e	MDEV-29934 rpl.rpl_start_alter_chain_basic, rpl.rpl_start_alter_restart_slave sometimes fail in BB with result content mismatch rpl.rpl_start_alter_chain_basic was used to fail sporadically due to a missed GTID master-slave synchronization which was necessary because of the following SELECT from GTID-state table. Fixed with arranging two synchronization pieces for two chain slaves requiring that. Note rpl.rpl_start_alter_restart_slave must have been fixed by MDEV-30460 and `87e13722a9` (manual) merge commit.	2024-06-19 14:09:00 +03:00
Brandon Nesterenko	6cab2f75fe	MDEV-23857: replication master password length After MDEV-4013, the maximum length of replication passwords was extended to 96 ASCII characters. After a restart, however, slaves only read the first 41 characters of MASTER_PASSWORD from the master.info file. This lead to slaves unable to reconnect to the master after a restart. After a slave restart, if a master.info file is detected, use the full allowable length of the password rather than 41 characters. Reviewed By: ============ Sergei Golubchik <serg@mariadb.com>	2024-06-18 07:21:18 -06:00
Andrei	c37b2a9f04	MDEV-30460 rpl.rpl_start_alter_restart_slave sometimes fails in BB with result length mismatch The test was used to fail because of lacking a synchronization point designating an expected for processing "CA_1" event has been indeed taken into this phase. The event apparently may not have even arrived at slave. Fixed with deploying the missed synchronization.	2024-06-17 19:06:34 +03:00
Alexander Barkov	c4bf4ce948	Merge remote-tracking branch 'origin/11.2' into 11.4	2024-06-17 15:46:39 +04:00
Marko Mäkelä	a21e49cbcc	Merge 11.1 into 11.2	2024-06-17 12:02:03 +03:00
Marko Mäkelä	d34289a3e2	Merge 10.11 into 11.1	2024-06-17 09:21:50 +03:00
Marko Mäkelä	b81d717387	Merge 10.6 into 10.11	2024-06-11 12:50:10 +03:00
Brandon Nesterenko	fcd21d3e40	MDEV-34355: rpl.rpl_semi_sync_no_missed_ack_after_add_slave ‘server_3 should have sent…’ The problem is that the test could query the status variable Rpl_semi_sync_slave_send_ack before the slave actually updated it. This would result in an immediate --die assertion killing the rest of the test. The bottom of this commit message has a small patch that can be applied to reproduce the test failure. This patch fixes the test failure by waiting for the variable to be updated before querying its value. diff --git a/sql/semisync_slave.cc b/sql/semisync_slave.cc index 9ddd4c5c8d7..60538079fce 100644 --- a/sql/semisync_slave.cc +++ b/sql/semisync_slave.cc @@ -303,7 +303,10 @@ int Repl_semi_sync_slave::slave_reply(Master_info *mi) reply_res= DBUG_EVALUATE_IF("semislave_failed_net_flush", 1, net_flush(net)); if (!reply_res) + { + sleep(1); rpl_semi_sync_slave_send_ack++; + } } DBUG_RETURN(reply_res); }	2024-06-10 12:27:20 -06:00
Oleksandr Byelkin	99b370e023	Merge branch '11.2' into 11.4	2024-05-21 19:38:51 +02:00
Sergei Golubchik	bf5da43e50	Merge branch '11.1' into 11.2	2024-05-13 10:00:26 +02:00
Sergei Golubchik	f0a5412037	Merge branch '11.0' into 11.1	2024-05-13 09:52:30 +02:00
Sergei Golubchik	f9807aadef	Merge branch '10.11' into 11.0	2024-05-12 12:18:28 +02:00
Sergei Golubchik	7b53672c63	Merge branch '10.5' into 10.6	2024-05-08 20:06:00 +02:00
Kristian Nielsen	383ee364dc	Merge 10.6 to 10.11	2024-05-07 08:45:31 +02:00
Sergei Golubchik	7a789e2027	sporadic failures of rpl.rpl_parallel_sbm the test waits for the event to get stuck on MASTER_DELAY, but on a slow/overloaded slave the event might pass MASTER_DELAY before the test starts waiting. Wait for the event to get stuck on the LOCK TABLES (after MASTER_DELAY), the event cannot avoid that,	2024-05-05 21:37:07 +02:00
Sergei Golubchik	cea083af9f	cleanup: use THD_STAGE_INFO, not thd_proc_info and put master-slave.inc last in the series of includes	2024-05-05 21:37:07 +02:00
Kristian Nielsen	4b4db4a8e5	MDEV-34042: Deadlock kill of XA PREPARE can break replication / rpl.rpl_parallel_multi_domain_xa sporadic failure Refinement of the original patch. Move the code to reset the kill up into the parent class Xid_apply_log_event, to also fix the similar issue for XA COMMIT. Increase the number of slave retries in the test case rpl.rpl_parallel_multi_domain_xa to fix some sporadic failures. The test generates massive amounts of conflicting transactions in multiple independent domains, which can cause multiple rollback+retry for a transaction as it conflicts with transactions in other domains one-by-one. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-05-05 19:01:56 +02:00
Sergei Golubchik	3ee6f69d49	sporadic failures of binlog_encryption.rpl_parallel_gco_wait_kill CURRENT_TEST: binlog_encryption.rpl_parallel_gco_wait_kill mysqltest: In included file "./suite/rpl/t/rpl_parallel_gco_wait_kill.test": included from /home/buildbot/amd64-ubuntu-2004-debug/build/mysql-test/suite/binlog_encryption/rpl_parallel_gco_wait_kill.test at line 2: At line 334: Can't initialize replace from 'replace_result $thd_id THD_ID' An sql thread can reach the "Slave has read all relay log" state and then start reading relay log again. Let's use a more generic pattern to retrieve the sql thread ID even if it's not in the "read all relay log" state.	2024-05-02 22:14:19 +02:00
Kristian Nielsen	e365877bae	MDEV-33798: ROW base optimistic deadlock with concurrent writes on same table One case is conflicting transactions T1 and T2 with different domain id, in optimistic parallel replication in non-GTID mode. Then T2 will wait_for_prior_commit on T1; and if T1 got a row lock wait on T2 it would hang, as different domains caused the deadlock kill to be skipped in thd_rpl_deadlock_check(). More generally, if we have transactions T1 and T2 in one domain/master connection, and independent transactions U in another, then we can still deadlock like this: T1 row low wait on U U row lock wait on T2 T2 wait_for_prior_commit on T1 This commit enforces the deadlock kill in these cases. If the waited-for transaction is speculatively applied, then it will be deadlock killed in case of a conflict, even if the two transactions are in different domains or master connections. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-05-02 21:07:45 +02:00
Sergei Golubchik	0aae11ac28	Merge branch '10.6' into 10.11	2024-04-30 16:56:49 +02:00
Andrei	3fa2caf553	MDEV-31404 post-push for rpl.max_binlog_total_size The test's header did not follow a correct `have_` and `master-slave` sourcing pattern. That's corrected.	2024-04-30 14:00:19 +03:00
Andrei	ae03374f29	MDEV-34030 rpl.rpl_using_gtid_default can fail in (BB) mtr The test's header is not written to follow strictly a correct order of checks by mtr at test start which may lead to an error. E.g ./mtr --mysqld=--binlog-format=row rpl.rpl_using_gtid_default to At line 175: query 'SET GLOBAL gtid_slave_pos= ""' failed: ER_SLAVE_MUST_STOP (1198): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first Fixed to require the binlog format first in the test header.	2024-04-30 12:40:50 +03:00
Andrei	6a63204c36	MDEV-34029 rpl.rpl_heartbeat can fail when (BB) mtr reorders tests rpl.rpl_heartbeat turns out to miss a standard include/master-slave header which made it potentially in BB and actually with manual mtr failing as it may have used a previous slave GTID state. Fixed with installing the standard rpl suite header/footer in the test file.	2024-04-30 12:40:50 +03:00
Sergei Golubchik	c1f3eff53f	Merge branch '10.5' into 10.6	2024-04-29 10:08:58 +02:00
Sergei Golubchik	7ff649315e	sporadic failures of rpl.rpl_parallel_multi_domain_xa it's a slow test, the slave needs to catch up, reading >1500 transactions. A default MASTER_GTID_WAIT() timeout in sync_with_master_gtid.inc is 120 seconds, which might be not enough for a slow/overloaded slave. Let's wait forever or until ./mtr --testcase-timeout, whatever comes first.	2024-04-26 14:24:32 +02:00
Sergei Golubchik	9e92582024	sporadic failures of rpl.rpl_parallel_sbm the test waits for the event to get stuck on MASTER_DELAY, but on a slow/overloaded slave the event might pass MASTER_DELAY before the test starts waiting. Wait for the event to get stuck on the LOCK TABLES (after MASTER_DELAY), the event cannot avoid that,	2024-04-25 12:47:23 +02:00
Kristian Nielsen	553a4d6271	MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start The test could fail with a duplicate key error because switching to non-GTID mode could start at the wrong old-style position. The position could be wrong when the previous GTID connect was stopped before receiving the fake GTID list event which gives the old-style position corresponding to the GTID connected position. Work-around by injecting an extra event and syncing the slave before switching to non-GTID mode. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-25 11:00:45 +02:00
Sergei Golubchik	9cf718859f	cleanup: use THD_STAGE_INFO, not thd_proc_info and put master-slave.inc last in the series of includes	2024-04-24 22:08:52 +02:00
Brandon Nesterenko	8c7992165b	MDEV-33672: 10.11 Fix for Two Phase Alter Flags Extends `89c907bd4f` to account for binlog_two_phase_alter flags in a Gtid log event. I.e., if the FL_COMMIT_ALTER_E1 or FL_ROLLBACK_ALTER_E2 flags are set in the event flags, yet the length of the event is too short to hold the value, then set the event as invalid	2024-04-24 13:19:36 +02:00
Sergei Golubchik	f243c73788	sporadic failures of rpl.rpl_rewrite_db_sys_vars first stop the slave, then run commands on the master that are supposed to fail on the slave, then start the slave. if you swap first two steps, the slave might get and execute those commands before it's stopped, which will fail the test. also, improve debugability	2024-04-22 21:02:11 +02:00
Sergei Golubchik	018d537ec1	Merge branch '10.6' into 10.11	2024-04-22 15:23:10 +02:00
Sergei Golubchik	e83d92ee5e	sporadic failures of rpl.rpl_semi_sync_fail_over in the $case=2 - it's wrong to kill after the first binlog EOF, because that might happen between INSERT(4) and INSERT(5). So, wait for the slave to acknowledge INSERT(5) before killing the master, that is, both connection threads must pass repl_semisync_master.wait_after_sync()	2024-04-21 22:54:52 +02:00
Sergei Golubchik	6242783f24	rpl.rpl_semi_sync_fail_over improve debugability	2024-04-21 14:03:26 +02:00
Sergei Golubchik	a4b6409ff6	sporadic failures of binlog_encryption.rpl_parallel_slave_bgc_kill do CHANGE MASTER before sync_with_master to have the slave in a predictable fully synced state before the next test	2024-04-21 01:17:31 +02:00
Sergei Golubchik	d8368ae289	Merge '10.5' into 10.6	2024-04-20 14:47:26 +02:00
Kristian Nielsen	0c249ad718	MDEV-30232: rpl.rpl_gtid_crash fails sporadically in BB The root cause of the failure is a bug in the Linux network stack: https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u If the slave does a connect(2) at the exact same time that kill -9 of the master process closes the listening socket, the FIN or RST packet is lost in the kernel, and the slave ends up timing out waiting for the initial communication from the server. This timeout defaults to --slave-net-timeout=120, which causes include/master_gtid_wait.inc to time out first and fail the test. Work-around this problem by reducing the --slave-net-timeout for this test case. If this problem turns up in other tests, we can consider reducing the default value for all tests. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-20 13:41:08 +02:00
Marko Mäkelä	bb2e125d07	Merge 10.5 into 10.6 This excludes commit `040069f4ba` because it is specific to innodb_sync_debug, which had been removed in commit `ff5d306e29`.	2024-04-18 07:14:56 +03:00
Brandon Nesterenko	0ad52e4d6a	MDEV-27512: Assertion !thd->transaction_rollback_request failed in rows_event_stmt_cleanup If replicating an event in ROW format, and InnoDB detects a deadlock while searching for a row, the row event will error and rollback in InnoDB and indicate that the binlog cache also needs to be cleared, i.e. by marking thd->transaction_rollback_request. In the normal case, this will trigger an error in Rows_log_event::do_apply_event() and cause a rollback. During the Rows_log_event::do_apply_event() cleanup of a successful event application, there is a DBUG_ASSERT in log_event_server.cc::rows_event_stmt_cleanup(), which sets the expectation that thd->transaction_rollback_request cannot be set because the general rollback (i.e. not the InnoDB rollback) should have happened already. However, if the replica is configured to skip deadlock errors, the rows event logic will clear the error and continue on, as if no error happened. This results in thd->transaction_rollback_request being set while in rows_event_stmt_cleanup(), thereby triggering the assertion. This patch fixes this in the following ways: 1) The assertion is invalid, and thereby removed. 2) The rollback case is forced in rows_event_stmt_cleanup() if transaction_rollback_request is set. Note the differing behavior between transactions which are skipped due to deadlock errors and other errors. When a transaction is skipped due to an ignored deadlock error, the entire transaction is rolled back and skipped (though note MDEV-33930 which allows statements in the same transaction after the deadlock-inducing one to commit). When a transaction is skipped due to ignoring a different error, only the erroring statements are rolled-back and skipped - the rest of the transaction will execute as normal. The effect of this can be seen in the test results. The added test case to rpl_skip_error.test shows that only statements which are ignored due to non-deadlock errors are ignored in larger transactions. A diff between rpl_temporary_error2_skip_all.result and rpl_temporary_error2.result shows that all statements in the errored transaction are rolled back (diff pasted below): : diff rpl_temporary_error2.result rpl_temporary_error2_skip_all.result 49c49 < 2 1 --- > 2 NULL 51c51 < 4 1 --- > 4 NULL 53c53 < * There will be two rows in t2 due to the retry. --- > * There will be one row in t2 because the ignored deadlock does not retry. 57d56 < 1 59c58 < 1 --- > 0 Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2024-04-17 11:14:21 -06:00
Vladislav Vaintroub	061adae9a2	MDEV-16944 Fix file sharing issues on Windows in mysqltest On Windows systems, occurrences of ERROR_SHARING_VIOLATION due to conflicting share modes between processes accessing the same file can result in CreateFile failures. mysys' my_open() already incorporates a workaround by implementing wait/retry logic on Windows. But this does not help if files are opened using shell redirection like mysqltest traditionally did it, i.e via --echo exec "some text" > output_file In such cases, it is cmd.exe, that opens the output_file, and it won't do any sharing-violation retries. This commit addresses the issue by introducing a new built-in command, 'write_line', in mysqltest. This new command serves as a brief alternative to 'write_file', with a single line output, that also resolves variables like "exec" would. Internally, this command will use my_open(), and therefore retry-on-error logic. Hopefully this will eliminate the very sporadic "can't open file because it is used by another process" error on CI.	2024-04-17 16:52:37 +02:00
Marko Mäkelä	829cb1a49c	Merge 10.5 into 10.6	2024-04-17 14:14:58 +03:00
Daniel Black	ea810b04cb	MDEV-30676 rpl.parallel_backup* tests sometimes fail Raise innodb_lock_wait_timeout from 1 to 5	2024-04-15 15:45:03 +10:00
Brandon Nesterenko	a6aecbb036	MDEV-10684: rpl.rpl_domain_id_filter_restart fails in buildbot The test failure in rpl.rpl_domain_id_filter_restart is caused by MDEV-33887. That is, the test uses master_pos_wait() (called indirectly by sync_slave_with_master) to try and wait for the replica to catch up to the master. However, the waited on transaction is ignored by the configured CHANGE MASTER TO IGNORE_DOMAIN_IDS=() As MDEV-33887 reports, due to the IO thread updating the binlog coordinates and the SQL thread updating the GTID state, if the replica is stopped in-between these updates, the replica state will be inconsistent. That is, the test expects that the GTID state will be updated, so upon restart, the replica will be up-to-date. However, if the replica is stopped before the SQL thread updates its GTID state, then upon restart, the replica will fetch the previously ignored event, which is no longer ignored upon restart, and execute it. This leads to the sporadic extra row in t2. This patch changes master_pos_wait() to use master_gtid_wait() to ensure the replica state is consistent with the master state.	2024-04-11 09:49:20 -06:00

1 2 3 4 5 ...

3659 Commits