mirror of https://github.com/MariaDB/server.git synced 2025-07-08 17:02:21 +03:00

MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave

AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in
buildbot with timeout in include

A replication parallel worker thread can deadlock with another
connection running SHOW SLAVE STATUS. That is, if the replication
worker thread is in do_gco_wait() and is killed, it will already
hold the LOCK_parallel_entry, and during error reporting, try to
grab the err_lock. SHOW SLAVE STATUS, however, grabs these locks in
reverse order. It will initially grab the err_lock, and then try to
grab LOCK_parallel_entry. This leads to a deadlock when both threads
have grabbed their first lock without the second.
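
The lock-order inversion described above can be reproduced in isolation. The following is a minimal sketch, not MariaDB code: the `Locks` struct and both function names are hypothetical stand-ins, and timed mutexes are used so that the demonstration ends with two timeouts instead of hanging. Both threads take their first lock, wait until the peer holds its own first lock, and then fail to obtain the second one.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for LOCK_parallel_entry and err_lock.
struct Locks {
  std::timed_mutex parallel_entry;
  std::timed_mutex err;
  std::atomic<int> holding_first{0};  // threads that hold their first lock
  std::atomic<int> finished_try{0};   // threads done attempting the second lock
};

// Take `first`, wait for the peer to hold its own first lock, then attempt
// `second` with a timeout. Returns true if `second` was acquired.
static bool lock_in_order(Locks &s, std::timed_mutex &first,
                          std::timed_mutex &second)
{
  std::lock_guard<std::timed_mutex> guard(first);
  s.holding_first.fetch_add(1);
  while (s.holding_first.load() < 2)
    std::this_thread::yield();

  bool got_second = second.try_lock_for(std::chrono::milliseconds(50));
  if (got_second)
    second.unlock();

  // Keep `first` held until the peer has finished its own attempt, so the
  // outcome does not depend on which timeout expires first.
  s.finished_try.fetch_add(1);
  while (s.finished_try.load() < 2)
    std::this_thread::yield();
  return got_second;
}

// Returns true when neither thread could take its second lock, which is the
// situation that becomes a deadlock with ordinary (untimed) mutexes.
static bool both_threads_time_out()
{
  Locks s;
  bool worker_got = true, show_got = true;
  // Worker thread order: parallel_entry, then err.
  std::thread worker([&] {
    worker_got = lock_in_order(s, s.parallel_entry, s.err);
  });
  // SHOW SLAVE STATUS order: err, then parallel_entry.
  std::thread show_status([&] {
    show_got = lock_in_order(s, s.err, s.parallel_entry);
  });
  worker.join();
  show_status.join();
  assert(s.holding_first.load() == 2);  // both threads reached the barrier
  return !worker_got && !show_got;
}
```

With plain mutexes the two `try_lock_for` calls would be blocking `lock` calls, and neither thread would ever return.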

This patch implements the fix proposed in MDEV-31894: workers_idle()
now determines idleness by checking queued_count==dequeued_count on
the last in-use relay log. This removes the need for workers_idle()
to grab LOCK_parallel_entry, as these values are atomically updated.
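
As a rough illustration of why the counter comparison needs no mutex, here is a minimal sketch using `std::atomic`. The `RelayLogCounters` struct and the two helper functions are hypothetical simplifications; the real relay log bookkeeping in the server has more state and may use different types.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the per-relay-log bookkeeping.
struct RelayLogCounters {
  std::atomic<std::int64_t> queued_count{0};    // events handed to workers
  std::atomic<std::int64_t> dequeued_count{0};  // events workers have finished
};

// Called by the (assumed) driver thread when it queues an event.
inline void on_event_queued(RelayLogCounters &c)
{
  c.queued_count.fetch_add(1);
}

// Called by an (assumed) worker thread when it finishes an event.
inline void on_event_done(RelayLogCounters &c)
{
  c.dequeued_count.fetch_add(1);
  assert(c.dequeued_count.load() <= c.queued_count.load());
}

// Lock-free idleness check: no work is outstanding when every queued event
// has also been dequeued.
inline bool workers_idle(const RelayLogCounters &c)
{
  return c.queued_count.load() == c.dequeued_count.load();
}
```

Because the check takes no lock, it cannot take part in a lock-order cycle. Note that the two loads are not a single joint snapshot, so the sketch assumes that a momentarily stale answer is acceptable to the caller.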

Huge thanks to Kristian Nielsen for diagnosing the problem!

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
This commit is contained in:
Brandon Nesterenko
2023-11-29 06:53:31 -07:00
parent 5ca63b2b8b
commit 8dad51481b
6 changed files with 203 additions and 19 deletions


@@ -2537,23 +2537,10 @@ rpl_parallel::stop_during_until()
 bool
-rpl_parallel::workers_idle()
+rpl_parallel::workers_idle(Relay_log_info *rli)
 {
-  struct rpl_parallel_entry *e;
-  uint32 i, max_i;
-  max_i= domain_hash.records;
-  for (i= 0; i < max_i; ++i)
-  {
-    bool active;
-    e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i);
-    mysql_mutex_lock(&e->LOCK_parallel_entry);
-    active= e->current_sub_id > e->last_committed_sub_id;
-    mysql_mutex_unlock(&e->LOCK_parallel_entry);
-    if (active)
-      break;
-  }
-  return (i == max_i);
+  return rli->last_inuse_relaylog->queued_count ==
+    rli->last_inuse_relaylog->dequeued_count;
 }