MDEV-34696: do_gco_wait() completes too early on InnoDB dict stats updates

Before doing mark_start_commit(), check that there is no pending deadlock kill. If there is a pending kill, we won't commit (we will abort, roll back, and retry). Then we should not mark the commit as started, since that could potentially make the following GCO start too early, before we completed the commit after the retry. This condition could trigger in some corner cases, where InnoDB would take temporarily table/row locks that are released again immediately, not held until the transaction commits. This happens with dict_stats updates and possibly auto-increment locks. Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock kill to be scheduled in the background. But since the blocking locks are held only temporarily, they can be released before the background kill happens. This way, the kill can be delayed until after mark_start_commit() has been called. Thus we need to check the synchronous indication rgi->killed_for_retry, not just the asynchroneous thd->killed. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2025-07-30 16:24:05 +03:00 · 2024-08-03 03:51:47 +02:00
parent 1f040ae048
commit b4c2e23954
2 changed files with 33 additions and 4 deletions
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@ -1450,11 +1450,23 @@ handle_rpl_parallel_thread(void *arg)
          after mark_start_commit(), we have to unmark, which has at least a
          theoretical possibility of leaving a window where it looks like all
          transactions in a GCO have started committing, while in fact one
-          will need to rollback and retry. This is not supposed to be possible
-          (since there is a deadlock, at least one transaction should be
-          blocked from reaching commit), but this seems a fragile ensurance,
-          and there were historically a number of subtle bugs in this area.
+          will need to rollback and retry.
+
+          Normally this will not happen, since the kill is there to resolve a
+          deadlock that is preventing at least one transaction from proceeding.
+          One case it can happen is with InnoDB dict stats update, which can
+          temporarily cause transactions to block each other, but locks are
+          released immediately, they don't linger until commit. There could be
+          other similar cases, there were historically a number of subtle bugs
+          in this area.
+
+          But once we start the commit, we can expect that no new lock
+          conflicts will be introduced. So by handling any lingering deadlock
+          kill at this point just before mark_start_commit(), we should be
+          robust even towards spurious deadlock kills.
        */
+        if (rgi->killed_for_retry != rpl_group_info::RETRY_KILL_NONE)
+          wait_for_pending_deadlock_kill(thd, rgi);
        if (!thd->killed)
        {
          DEBUG_SYNC(thd, "rpl_parallel_before_mark_start_commit");