Mutex order violation when the wsrep BF thread kills a conflicting trx.
The stack is:
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are taken in the order
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
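For illustration, a minimal standalone sketch (plain C++ with
std::mutex, not MariaDB code) of why the two inverted acquisition
orders can deadlock: each thread holds its first mutex and blocks
waiting for the other's.

  #include <mutex>

  std::mutex lock_sys_mutex, trx_mutex, lock_thd_data;

  void bf_kill_path()              // wsrep BF abort path
  {
    std::lock_guard<std::mutex> a(lock_sys_mutex);
    std::lock_guard<std::mutex> b(trx_mutex);
    std::lock_guard<std::mutex> c(lock_thd_data);  // blocks if KILL holds it
  }

  void kill_statement_path()       // normal KILL path
  {
    std::lock_guard<std::mutex> c(lock_thd_data);
    std::lock_guard<std::mutex> a(lock_sys_mutex); // blocks if BF holds it
    std::lock_guard<std::mutex> b(trx_mutex);
  }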
This patch is the plan D variant for fixing the potential
mutex-locking-order violation exercised by BF aborting and KILL
command execution.
In this approach, the KILL command is replicated as a TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting happening in parallel with KILL command execution
either, so the potential mutex deadlocks between the different
mutex access paths of KILL command execution and BF aborting
cannot happen.
TOI replication is used, in this approach, purely as a means
to provide isolated KILL command execution in the first node.
The KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make sure of this by skipping KILL
execution in secondary nodes: in the applying phase, we bail
out if an applier thread is trying to execute a KILL command.
This is effective, but skipping the applying of the KILL command
could happen much earlier as well.
This also fixes unprotected calls to wsrep_thd_abort, which
uses wsrep_abort_transaction; the fix is to hold
THD::LOCK_thd_data while we abort the transaction.
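A minimal sketch of that locking pattern (simplified names, not the
exact MariaDB signatures):

  mysql_mutex_lock(&victim_thd->LOCK_thd_data);
  wsrep_thd_abort(bf_thd, victim_thd);  /* uses wsrep_abort_transaction() */
  mysql_mutex_unlock(&victim_thd->LOCK_thd_data);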
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
In the replication-related code, in the exec_relay_log_event()
function (slave.cc), the "data_lock" mutex is acquired but not
released on one of the early-return branches of a WSREP-specific
insertion, namely under the branch "if (wsrep_before_statement(thd))".
As a result, the mutex remains held, resulting in errors or hangs.
This commit fixes this issue, which is now showing up as intermittent
failures in mtr tests for the galera and galera_sr suites.
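A sketch of the shape of the fix (the exact code in slave.cc may
differ):

  #ifdef WITH_WSREP
    if (wsrep_before_statement(thd))
    {
      /* Previously missing: data_lock stayed held on this early return */
      mysql_mutex_unlock(&rli->data_lock);
      return 1;
    }
  #endif /* WITH_WSREP */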
Sample log error message generated:
2021-01-21 2:33:24 139912137520896 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 369
2021-01-21 2:33:24 139912137520896 [Note] master was 127.0.0.1:16400
2021-01-21 2:33:24 139912137828096 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 369
2021-01-21 2:33:24 139912137828096 [Note] master was 127.0.0.1:16400
Based on work by Hartmut Holzgraefe.
Reviewer: knielsen@knielsen-hq.org, Andrei, Sachin
Analysis:
========
Writes to 'rli->log_space_total' need to be synchronized; otherwise
both the SQL_THREAD and the IO_THREAD can try to modify the variable
simultaneously, resulting in an incorrect rli->log_space_total. In the
current test scenario the SQL_THREAD is trying to decrement
'rli->log_space_total' in 'purge_first_log' while the IO_THREAD is
trying to increment it in 'queue_event'. Hence the test occasionally
fails with a result mismatch.
Fix:
===
Convert the 'rli->log_space_total' variable to an atomic type.
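An illustrative sketch of the change, assuming a C++ std::atomic
counter (the actual member type and call sites in Relay_log_info may
differ):

  #include <atomic>
  #include <cstdint>

  // Shared between the IO and SQL threads; atomic operations replace
  // the previously unsynchronized read-modify-write.
  std::atomic<uint64_t> log_space_total{0};

  void io_thread_queue_event(uint64_t event_len)     // illustrative
  {
    log_space_total.fetch_add(event_len, std::memory_order_relaxed);
  }

  void sql_thread_purge_first_log(uint64_t log_len)  // illustrative
  {
    log_space_total.fetch_sub(log_len, std::memory_order_relaxed);
  }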
Problem:
=======
rpl.rpl_slave_load_tmpdir_not_exist 'stmt' w3 [ fail ] Found warnings/errors
in server log file!
Test ended at 2017-09-27 20:34:55
[Warning] Master is configured to log replication events with checksum, but
will not send such events to slaves that cannot process them
^ Found warnings in /mnt/buildbot/build/mariadb-10.2.10/mysql-test/var/3/log/mysqld.1.err
ok
Analysis:
========
When a slave tries to connect to the master, the
'get_master_version_and_clock' function is invoked to perform an
elaborate slave-master handshake. During this process the slave server
queries the master server to know if it is checksum-aware, and at the
same time the master is notified about the slave's CRC-awareness. The
master's side instant value of @@global.binlog_checksum is stored in
the dump thread's uservar area as well as cached locally, so that its
value is known in consensus by master and slave.
After the handshake, the slave requests a binlog dump from the master
by sending 'COM_BINLOG_DUMP'. This command is sent to the master by
the 'cli_advanced_command' call. If there is some temporary network
failure during this request_dump call, 'end_server' is invoked to
close the current connection between master and slave. Upon connection
close the dump thread on the master gets terminated and clears the
'uservar' data it got through the master-slave handshake.
The 'COM_BINLOG_DUMP' command is then sent once again without a
master-slave handshake. Since the checksum data is not available to
the new dump thread, a warning gets reported.
Fix:
===
Upon a network write error, do not attempt a plain reconnect; proceed
to the master-slave handshake. This ensures that the master is aware
of the slave's capability to use checksums.
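Hedged pseudocode of the idea (control flow simplified; the actual
code in handle_slave_io() in slave.cc differs in detail, and the goto
target here is illustrative):

  if (request_dump(thd, mysql, mi, &suppress_warnings))
  {
    /* A plain reconnect would start a new dump thread that never saw
       the checksum handshake; redo the full handshake instead. */
    goto connect_and_handshake;  /* re-runs get_master_version_and_clock() */
  }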
Problem:- When we issue FTWRL with a shutdown in parallel, there is a
race between FTWRL and shutdown: shutdown might destroy the mutex
(pool->LOCK_rpl_thread_pool) before FTWRL can lock it, so we can get a
crash in the FTWRL thread.
Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait
for the FTWRL thread to complete its work, and only then destroy the
mutex. So slave_prepare_for_shutdown will just deactivate the pool,
and the mutex is destroyed later in end_slave().
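A sketch of the resulting ordering (simplified; the pool accessor is
illustrative):

  void slave_prepare_for_shutdown()
  {
    /* Only deactivate; FTWRL may still be about to lock the mutex. */
    global_rpl_thread_pool.deactivate();  /* illustrative */
  }

  void end_slave()
  {
    /* All users (including FTWRL) are gone by now; safe to destroy. */
    mysql_mutex_destroy(&global_rpl_thread_pool.LOCK_rpl_thread_pool);
  }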
The immediate bug was caused by a failure to recognize the correct
position at which to stop the slave applier run in optimistic parallel
mode. The analysis unveiled the following set of issues:
1. an incorrect estimate of the event binlog position passed to
   is_until_satisfied;
2. the driver thread's wait for workers to complete did not account
   for non-group events that could be left unprocessed, thus mixing up
   the last executed binlog group's file and position: the file
   remained old while the position related to the new rotated file;
3. an incorrect 'slave reached file:pos' report by the parallel slave
   in the error log;
4. the relay-log UNTIL case missed the parallel-slave branch in
   is_until_satisfied.
The patch addresses all of them to simplify the logic of log-change
notification in both the master and relay-log UNTIL cases.
P.1 is addressed by passing the event into is_until_satisfied()
for proper analysis by the function.
P.2 is fixed by changes in handle_queued_pos_update().
P.4 required removing relay-log change notification by workers.
Instead, the driver thread fully updates the notion of the current
relay log itself, with the aid of the introduced
bool Relay_log_info::until_relay_log_names_defer.
An extra printout of the requested until file:pos is arranged
with --log-warning=3.
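A sketch of the P.1 call shape after the patch (simplified; the stop
helper is illustrative):

  /* The event itself is passed in, so the function can derive the
     correct end position for the group instead of relying on a
     precomputed, possibly wrong, estimate. */
  if (rli->until_condition != Relay_log_info::UNTIL_NONE &&
      rli->is_until_satisfied(ev))
    stop_slave_applier();  /* illustrative */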
In fact, in MariaDB it cannot, but it can show spurious slaves
in SHOW SLAVE HOSTS.
A slave was registered in COM_REGISTER_SLAVE and unregistered after
COM_BINLOG_DUMP. If there was no COM_BINLOG_DUMP, it would never
be unregistered.
This is a backport of the applicable part of
commit 93475aff8d and
commit 2c39f69d34
from 10.4.
Before 10.4 and Galera 4, WSREP_ON is a macro that points to
a global Boolean variable, so it is not that expensive to
evaluate, but we will add an unlikely() hint around it.
WSREP_ON_NEW: Remove. This macro was introduced in
commit c863159c32
when reverting WSREP_ON to its previous definition.
We replace some use of WSREP_ON with WSREP(thd), like it was done
in 93475aff8d. Note: the macro
WSREP() in 10.1 is equivalent to WSREP_NNULL() in 10.4.
Item_func_rand::seed_random(): Avoid invoking current_thd
when WSREP is not enabled.
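For reference, an illustrative (not verbatim) rendering of the
distinction between the two macros in the pre-10.4 scheme:

  /* WSREP_ON checks only the global switch (now with a
     branch-prediction hint); WSREP(thd) additionally checks the
     session variable. Definitions simplified for illustration. */
  #define WSREP_ON   unlikely(global_system_variables.wsrep_on)
  #define WSREP(thd) (WSREP_ON && (thd)->variables.wsrep_on)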
If an async slave thread (slave SQL handler) becomes a BF victim, it
may occasionally happen that the rollbacker thread is used to carry
out the rollback instead of the async slave thread.
This can happen if the async slave thread has flagged the "idle" state
when the BF thread tries to figure out how to kill the victim.
The issue could be tested by using a Galera cluster as a slave for an
external master and issuing a high load of conflicting writes through
async replication and directly against the Galera cluster nodes.
However, a deterministic mtr test for the "conflict window" has not
yet been worked on.
The fix in this patch makes sure that the async slave thread state is
never set to IDLE. This prevents the rollbacker thread from
intervening. The wsrep_query_state change was refactored to happen
through a dedicated function, so that the idle-state change is
controlled in one place.
If an async replication slave thread conflicts with cluster
replication, then the async slave transaction should be BF aborted
and, depending on the state of the async slave transaction's
execution, potentially also replayed. There were problems in this BF
abort implementation and the replaying was not started.
This pull request contains fixes which make sure that if an async
slave thread is marked to abort and replay, it will carry out the
rollback and release all locks and resources completely before
starting the replaying. After replaying, the async slave transaction
is treated as successful, so the slave thread will continue as usual,
handling the next replication event.
There is also a new mtr test, galera.galera_slave_replay, which
stresses both a certification failure for the async slave thread and
a successful BF abort followed by replaying.
Parallel slave server shutdown was found to hang in
close_connections(), triggered by shutdown, because a slave worker
thread would not be notified to exit if the worker was sitting idle.
Fixed by destroying the worker pool earlier, namely in
slave_prepare_for_shutdown(), when all the driver threads have already
left. A test file is added to simulate the bug condition as well as
to check multi-sourced and not-idle worker cases.
The string doesn't appear to be null-terminated when binlog checksums are
enabled. This causes a corrupt binlog name in the error message when a
slave is ahead of the master.
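A sketch of the usual remedy, assuming the name is built into a fixed
buffer:

  char binlog_name[FN_REFLEN];
  /* strmake() copies at most sizeof(buf)-1 characters and always
     NUL-terminates, unlike a raw memcpy of the name bytes. */
  strmake(binlog_name, log_file_name, sizeof(binlog_name) - 1);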
In MariaDB 10.2 the master could have been configured so that there
are extra annotate events. When we peek at the next event type for
CTAS, we need to skip annotate events.
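A sketch of the skip loop (the event accessor is illustrative):

  Log_event *ev= read_next_event(rgi);            /* illustrative */
  while (ev && ev->get_type_code() == ANNOTATE_ROWS_EVENT)
  {
    delete ev;                /* annotate events carry no row data */
    ev= read_next_event(rgi);
  }
  /* 'ev' is now the event type we actually wanted to peek at */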
The problem happens when a MariaDB master replicates writes for
non-InnoDB tables only (e.g. writes to MyISAM table(s)). An async
slave node in a Galera cluster can apply these writes successfully,
but it will, in the end, write the gtid position into the
mysql.gtid_slave_pos table. The mysql.gtid_slave_pos table uses the
InnoDB engine, and this write makes the innodb handlerton part of the
replicated "transaction".
Note that the wsrep patch identifies that the write to gtid_slave_pos
should not be replicated and skips appending wsrep keys for these
writes. However, as InnoDB was present in the transaction and there
are replication events (for the MyISAM table) in the transaction
cache, but no appended keys, wsrep raises an error, and this makes the
slave thread stop.
The fix is simply to not treat it as an error if the async slave tries
to replicate a write set with binlog events but no keys. We just skip
wsrep replication and return successfully.
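A sketch of the shape of the fix (variable names illustrative):

  if (binlog_data_len > 0 && wsrep_key_count == 0)
  {
    WSREP_DEBUG("empty key array, skipping replication of write set");
    DBUG_RETURN(0);           /* not an error: nothing to certify */
  }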
This commit also contains an mtr test which forces the
mysql.gtid_slave_pos table to be of InnoDB engine and executes a
MyISAM-only write through async replication.
There is an additional fix for declaring the IO and background slave
threads as non-wsrep. These threads should not write anything for
wsrep replication, and this is just a safeguard to make sure nothing
leaks into the cluster from these slave threads.
This PR contains an mtr test for reproducing a failure when
replicating a create table as select statement (CTAS) through
asynchronous MariaDB replication to a MariaDB Galera cluster.
The problem happens when the CTAS replication contains both the create
table statement and the following row events for populating the table.
In such a situation, the Galera node operating as a MariaDB
replication slave will first replicate only the create table part into
the cluster, and then perform another replication containing both the
create table and the row events. This leads all other nodes to fail
with a duplicate table create attempt, and to crash due to this
failure.
The PR also contains a fix, which identifies the situation when CTAS
has been replicated and scans further in the async replication stream
to see whether row events follow. The slave node will replicate either
a single TOI in case the CTAS table is empty, or, if the CTAS table
contains rows, a single bundled write set with the create table and
the row events.
This fix should keep the master server's GTIDs for CTAS replication in
sync with the GTIDs in the Galera cluster.