MDEV-31448: Killing a replica thread awaiting its GCO can hang/crash a parallel replica

The problem was an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). If an event group gets an error, this unmark could run after the following GCO started, and the subsequent re-marking could access de-allocated GCO.

The offending unmark_start_commit() looks obviously incorrect, and the fix is to just remove it. It was introduced in the MDEV-8302 patch, the commit message of which suggests it was added there solely to satisfy an assertion in ha_rollback_trans(). So update this assertion instead to not trigger for event groups that experienced an error (rgi->worker_error). When an error occurs in an event group, all following event groups are skipped anyway, so the unmark should never be needed in this case.

Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
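
The assertion change described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the code from this commit: the struct `GroupInfoModel`, its field `marked_start_commit`, and both check functions are hypothetical names invented for illustration. Only `worker_error` corresponds to a name in the commit message.

```cpp
#include <cassert>

// Toy stand-in for the per-event-group state discussed in the message.
// NOT the MariaDB rpl_group_info definition; field names are assumptions.
struct GroupInfoModel {
  int worker_error;          // non-zero once the event group has failed
  bool marked_start_commit;  // hypothetical flag: group was marked as starting commit
};

// Hypothetical strict rollback-time check: the mark must have been cleared.
// A check of this shape would force an unmark on the error path.
inline bool strict_check_passes(const GroupInfoModel &g) {
  return !g.marked_start_commit;
}

// Hypothetical relaxed check: event groups that experienced an error are
// exempt, so the error path no longer has to unmark anything.
inline bool relaxed_check_passes(const GroupInfoModel &g) {
  return g.worker_error != 0 || !g.marked_start_commit;
}
```

In this model, a failed group that is still marked fails the strict check but passes the relaxed one, while a healthy group that is still marked fails both. That mirrors the stated intent: the check stays in force for normal event groups and is waived only when `worker_error` is set.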
2025-07-29 05:21:33 +03:00 · 2023-06-11 17:44:58 +02:00
parent 60bec1d54d
commit a8ea6627a4
3 changed files with 15 additions and 7 deletions
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -286,16 +286,11 @@ static void
 signal_error_to_sql_driver_thread(THD *thd, rpl_group_info *rgi, int err)
 {
  rgi->worker_error= err;
-  /*
-    In case we get an error during commit, inform following transactions that
-    we aborted our commit.
-  */
  DBUG_EXECUTE_IF("hold_worker2_favor_worker3", {
      if (rgi->current_gtid.seq_no == 2002) {
        debug_sync_set_action(thd, STRING_WITH_LEN("now WAIT_FOR cont_worker2"));
      }});

-  rgi->unmark_start_commit();
  rgi->cleanup_context(thd, true);
  rgi->rli->abort_slave= true;
  rgi->rli->stop_for_until= false;