Mutex order violation when the wsrep BF thread kills a conflicting trx.
The stack is:
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are taken in the order
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
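For illustration, a minimal standalone sketch (plain C++ with
std::mutex, not MariaDB code) of why the two inverted acquisition
orders can deadlock: each thread holds its first mutex and blocks
waiting for the other's.

  #include <mutex>

  std::mutex lock_sys_mutex, trx_mutex, lock_thd_data;

  void bf_kill_path()              // wsrep BF abort path
  {
    std::lock_guard<std::mutex> a(lock_sys_mutex);
    std::lock_guard<std::mutex> b(trx_mutex);
    std::lock_guard<std::mutex> c(lock_thd_data);  // blocks if KILL holds it
  }

  void kill_statement_path()       // normal KILL path
  {
    std::lock_guard<std::mutex> c(lock_thd_data);
    std::lock_guard<std::mutex> a(lock_sys_mutex); // blocks if BF holds it
    std::lock_guard<std::mutex> b(trx_mutex);
  }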
This patch is the plan D variant for fixing the potential
mutex-locking-order violation exercised by BF aborting and KILL
command execution.
In this approach, the KILL command is replicated as a TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting happening in parallel with KILL command execution
either, so the potential mutex deadlocks between the different
mutex access paths of KILL command execution and BF aborting
cannot happen.
TOI replication is used, in this approach, purely as a means
to provide isolated KILL command execution in the first node.
The KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make sure of this by skipping KILL
execution in secondary nodes: in the applying phase, we bail
out if an applier thread is trying to execute a KILL command.
This is effective, but skipping the applying of the KILL command
could happen much earlier as well.
This also fixes unprotected calls to wsrep_thd_abort, which
uses wsrep_abort_transaction; the fix is to hold
THD::LOCK_thd_data while we abort the transaction.
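A minimal sketch of that locking pattern (simplified names, not the
exact MariaDB signatures):

  mysql_mutex_lock(&victim_thd->LOCK_thd_data);
  wsrep_thd_abort(bf_thd, victim_thd);  /* uses wsrep_abort_transaction() */
  mysql_mutex_unlock(&victim_thd->LOCK_thd_data);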
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
In the replication-related code, in the exec_relay_log_event()
function (slave.cc), the "data_lock" mutex is acquired but not
released on one of the early-return branches of a WSREP-specific
insertion, namely under the branch "if (wsrep_before_statement(thd))".
As a result, the mutex remains held, resulting in errors or hangs.
This commit fixes this issue, which is now showing up as intermittent
failures in mtr tests for the galera and galera_sr suites.
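A sketch of the shape of the fix (the exact code in slave.cc may
differ):

  #ifdef WITH_WSREP
    if (wsrep_before_statement(thd))
    {
      /* Previously missing: data_lock stayed held on this early return */
      mysql_mutex_unlock(&rli->data_lock);
      return 1;
    }
  #endif /* WITH_WSREP */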
Sample log error message generated:
2021-01-21 2:33:24 139912137520896 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 369
2021-01-21 2:33:24 139912137520896 [Note] master was 127.0.0.1:16400
2021-01-21 2:33:24 139912137828096 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 369
2021-01-21 2:33:24 139912137828096 [Note] master was 127.0.0.1:16400
Based on work by Hartmut Holzgraefe.
Reviewer: knielsen@knielsen-hq.org, Andrei, Sachin
Analysis:
========
Writes to 'rli->log_space_total' need to be synchronized; otherwise
both the SQL_THREAD and the IO_THREAD can try to modify the variable
simultaneously, resulting in an incorrect rli->log_space_total. In the
current test scenario the SQL_THREAD is trying to decrement
'rli->log_space_total' in 'purge_first_log' while the IO_THREAD is
trying to increment it in 'queue_event'. Hence the test occasionally
fails with a result mismatch.
Fix:
===
Convert the 'rli->log_space_total' variable to an atomic type.
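An illustrative sketch of the change, assuming a C++ std::atomic
counter (the actual member type and call sites in Relay_log_info may
differ):

  #include <atomic>
  #include <cstdint>

  // Shared between the IO and SQL threads; atomic operations replace
  // the previously unsynchronized read-modify-write.
  std::atomic<uint64_t> log_space_total{0};

  void io_thread_queue_event(uint64_t event_len)     // illustrative
  {
    log_space_total.fetch_add(event_len, std::memory_order_relaxed);
  }

  void sql_thread_purge_first_log(uint64_t log_len)  // illustrative
  {
    log_space_total.fetch_sub(log_len, std::memory_order_relaxed);
  }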
Problem:
=======
rpl.rpl_slave_load_tmpdir_not_exist 'stmt' w3 [ fail ] Found warnings/errors
in server log file!
Test ended at 2017-09-27 20:34:55
[Warning] Master is configured to log replication events with checksum, but
will not send such events to slaves that cannot process them
^ Found warnings in /mnt/buildbot/build/mariadb-10.2.10/mysql-test/var/3/log/mysqld.1.err
ok
Analysis:
========
When a slave tries to connect to the master, the
'get_master_version_and_clock' function is invoked to perform an
elaborate slave-master handshake. During this process the slave server
queries the master server to know if it is checksum-aware, and at the
same time the master is notified about the slave's CRC-awareness. The
master's side instant value of @@global.binlog_checksum is stored in
the dump thread's uservar area as well as cached locally, so that its
value is known in consensus by master and slave.
After the handshake, the slave requests a binlog dump from the master
by sending 'COM_BINLOG_DUMP'. This command is sent to the master by
the 'cli_advanced_command' call. If there is some temporary network
failure during this request_dump call, 'end_server' is invoked to
close the current connection between master and slave. Upon connection
close the dump thread on the master gets terminated and clears the
'uservar' data it got through the master-slave handshake.
The 'COM_BINLOG_DUMP' command is then sent once again without a
master-slave handshake. Since the checksum data is not available to
the new dump thread, a warning gets reported.
Fix:
===
Upon a network write error, do not attempt a plain reconnect; proceed
to the master-slave handshake. This ensures that the master is aware
of the slave's capability to use checksums.
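Hedged pseudocode of the idea (control flow simplified; the actual
code in handle_slave_io() in slave.cc differs in detail, and the goto
target here is illustrative):

  if (request_dump(thd, mysql, mi, &suppress_warnings))
  {
    /* A plain reconnect would start a new dump thread that never saw
       the checksum handshake; redo the full handshake instead. */
    goto connect_and_handshake;  /* re-runs get_master_version_and_clock() */
  }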
Problem:- When we issue FTWRL with a shutdown in parallel, there is a
race between FTWRL and shutdown: shutdown might destroy the mutex
(pool->LOCK_rpl_thread_pool) before FTWRL can lock it, so we can get a
crash in the FTWRL thread.
Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait
for the FTWRL thread to complete its work, and only then destroy the
mutex. So slave_prepare_for_shutdown will just deactivate the pool,
and the mutex is destroyed later in end_slave().
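A sketch of the resulting ordering (simplified; the pool accessor is
illustrative):

  void slave_prepare_for_shutdown()
  {
    /* Only deactivate; FTWRL may still be about to lock the mutex. */
    global_rpl_thread_pool.deactivate();  /* illustrative */
  }

  void end_slave()
  {
    /* All users (including FTWRL) are gone by now; safe to destroy. */
    mysql_mutex_destroy(&global_rpl_thread_pool.LOCK_rpl_thread_pool);
  }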
The immediate bug was caused by a failure to recognize the correct
position at which to stop the slave applier run in optimistic parallel
mode. The analysis unveiled the following set of issues:
1. an incorrect estimate of the event binlog position passed to
   is_until_satisfied;
2. the driver thread's wait for workers to complete did not account
   for non-group events that could be left unprocessed, thus mixing up
   the last executed binlog group's file and position: the file
   remained old while the position related to the new rotated file;
3. an incorrect 'slave reached file:pos' report by the parallel slave
   in the error log;
4. the relay-log UNTIL case missed the parallel-slave branch in
   is_until_satisfied.
The patch addresses all of them to simplify the logic of log-change
notification in both the master and relay-log UNTIL cases.
P.1 is addressed by passing the event into is_until_satisfied()
for proper analysis by the function.
P.2 is fixed by changes in handle_queued_pos_update().
P.4 required removing relay-log change notification by workers.
Instead, the driver thread fully updates the notion of the current
relay log itself, with the aid of the introduced
bool Relay_log_info::until_relay_log_names_defer.
An extra printout of the requested until file:pos is arranged
with --log-warning=3.
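A sketch of the P.1 call shape after the patch (simplified; the stop
helper is illustrative):

  /* The event itself is passed in, so the function can derive the
     correct end position for the group instead of relying on a
     precomputed, possibly wrong, estimate. */
  if (rli->until_condition != Relay_log_info::UNTIL_NONE &&
      rli->is_until_satisfied(ev))
    stop_slave_applier();  /* illustrative */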
In fact, in MariaDB it cannot, but it can show spurious slaves
in SHOW SLAVE HOSTS.
A slave was registered in COM_REGISTER_SLAVE and unregistered after
COM_BINLOG_DUMP. If there was no COM_BINLOG_DUMP, it would never
be unregistered.
This is a backport of the applicable part of
commit 93475aff8d and
commit 2c39f69d34
from 10.4.
Before 10.4 and Galera 4, WSREP_ON is a macro that points to
a global Boolean variable, so it is not that expensive to
evaluate, but we will add an unlikely() hint around it.
WSREP_ON_NEW: Remove. This macro was introduced in
commit c863159c32
when reverting WSREP_ON to its previous definition.
We replace some use of WSREP_ON with WSREP(thd), like it was done
in 93475aff8d. Note: the macro
WSREP() in 10.1 is equivalent to WSREP_NNULL() in 10.4.
Item_func_rand::seed_random(): Avoid invoking current_thd
when WSREP is not enabled.
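For reference, an illustrative (not verbatim) rendering of the
distinction between the two macros in the pre-10.4 scheme:

  /* WSREP_ON checks only the global switch (now with a
     branch-prediction hint); WSREP(thd) additionally checks the
     session variable. Definitions simplified for illustration. */
  #define WSREP_ON   unlikely(global_system_variables.wsrep_on)
  #define WSREP(thd) (WSREP_ON && (thd)->variables.wsrep_on)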
If an async slave thread (slave SQL handler) becomes a BF victim, it
may occasionally happen that the rollbacker thread is used to carry
out the rollback instead of the async slave thread.
This can happen if the async slave thread has flagged the "idle" state
when the BF thread tries to figure out how to kill the victim.
The issue could be tested by using a Galera cluster as a slave for an
external master and issuing a high load of conflicting writes through
async replication and directly against the Galera cluster nodes.
However, a deterministic mtr test for the "conflict window" has not
yet been worked on.
The fix in this patch makes sure that the async slave thread state is
never set to IDLE. This prevents the rollbacker thread from
intervening. The wsrep_query_state change was refactored to happen
through a dedicated function, so that the idle-state change is
controlled in one place.
If an async replication slave thread conflicts with cluster
replication, then the async slave transaction should be BF aborted
and, depending on the state of the async slave transaction's
execution, potentially also replayed. There were problems in this BF
abort implementation and the replaying was not started.
This pull request contains fixes which make sure that if an async
slave thread is marked to abort and replay, it will carry out the
rollback and release all locks and resources completely before
starting the replaying. After replaying, the async slave transaction
is treated as successful, so the slave thread will continue as usual,
handling the next replication event.
There is also a new mtr test, galera.galera_slave_replay, which
stresses both a certification failure for the async slave thread and
a successful BF abort followed by replaying.
Parallel slave server shutdown was found to hang in
close_connections(), triggered by shutdown, because a slave worker
thread would not be notified to exit if the worker was sitting idle.
Fixed by destroying the worker pool earlier, namely in
slave_prepare_for_shutdown(), when all the driver threads have already
left. A test file is added to simulate the bug condition as well as
to check multi-sourced and not-idle worker cases.
The string doesn't appear to be null-terminated when binlog checksums are
enabled. This causes a corrupt binlog name in the error message when a
slave is ahead of the master.
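A sketch of the usual remedy, assuming the name is built into a fixed
buffer:

  char binlog_name[FN_REFLEN];
  /* strmake() copies at most sizeof(buf)-1 characters and always
     NUL-terminates, unlike a raw memcpy of the name bytes. */
  strmake(binlog_name, log_file_name, sizeof(binlog_name) - 1);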
In MariaDB 10.2 the master could have been configured so that there
are extra annotate events. When we peek at the next event type for
CTAS, we need to skip annotate events.
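A sketch of the skip loop (the event accessor is illustrative):

  Log_event *ev= read_next_event(rgi);            /* illustrative */
  while (ev && ev->get_type_code() == ANNOTATE_ROWS_EVENT)
  {
    delete ev;                /* annotate events carry no row data */
    ev= read_next_event(rgi);
  }
  /* 'ev' is now the event type we actually wanted to peek at */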
The problem happens when a MariaDB master replicates writes for
non-InnoDB tables only (e.g. writes to MyISAM table(s)). An async
slave node in a Galera cluster can apply these writes successfully,
but it will, in the end, write the gtid position into the
mysql.gtid_slave_pos table. The mysql.gtid_slave_pos table uses the
InnoDB engine, and this write makes the innodb handlerton part of the
replicated "transaction".
Note that the wsrep patch identifies that the write to gtid_slave_pos
should not be replicated and skips appending wsrep keys for these
writes. However, as InnoDB was present in the transaction and there
are replication events (for the MyISAM table) in the transaction
cache, but no appended keys, wsrep raises an error, and this makes the
slave thread stop.
The fix is simply to not treat it as an error if the async slave tries
to replicate a write set with binlog events but no keys. We just skip
wsrep replication and return successfully.
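A sketch of the shape of the fix (variable names illustrative):

  if (binlog_data_len > 0 && wsrep_key_count == 0)
  {
    WSREP_DEBUG("empty key array, skipping replication of write set");
    DBUG_RETURN(0);           /* not an error: nothing to certify */
  }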
This commit also contains an mtr test which forces the
mysql.gtid_slave_pos table to be of InnoDB engine and executes a
MyISAM-only write through async replication.
There is an additional fix for declaring the IO and background slave
threads as non-wsrep. These threads should not write anything for
wsrep replication, and this is just a safeguard to make sure nothing
leaks into the cluster from these slave threads.
This PR contains an mtr test for reproducing a failure when
replicating a create table as select statement (CTAS) through
asynchronous MariaDB replication to a MariaDB Galera cluster.
The problem happens when the CTAS replication contains both the create
table statement and the following row events for populating the table.
In such a situation, the Galera node operating as a MariaDB
replication slave will first replicate only the create table part into
the cluster, and then perform another replication containing both the
create table and the row events. This leads all other nodes to fail
with a duplicate table create attempt, and to crash due to this
failure.
The PR also contains a fix, which identifies the situation when CTAS
has been replicated and scans further in the async replication stream
to see whether row events follow. The slave node will replicate either
a single TOI in case the CTAS table is empty, or, if the CTAS table
contains rows, a single bundled write set with the create table and
the row events.
This fix should keep the master server's GTIDs for CTAS replication in
sync with the GTIDs in the Galera cluster.