MDEV-30260: Slave crashed:reload_acl_and_cache during shutdown

The signal handler thread can use various different runtime resources when processing a SIGHUP (e.g. master-info information) due to calling into reload_acl_and_cache(). Currently, the shutdown process waits for the termination of the signal thread after performing cleanup. However, this could cause resources actively used by the signal handler to be freed while reload_acl_and_cache() is processing.

The specific resource that caused MDEV-30260 is a race condition for the hostname_cache, such that mysqld would delete it in clean_up()::hostname_cache_free(), before the signal handler would use it in reload_acl_and_cache()::hostname_cache_refresh().

Another similar resource is the active_mi/master_info_index. There was a race between its deletion by the main thread in end_slave(), and their usage by the Signal Handler as a part of Master_info_index::flush_all_relay_logs.read(active_mi) in reload_acl_and_cache().

This patch fixes these race conditions by relocating where server shutdown waits for the signal handler to die until after server-level threads have been killed (i.e., as a last step of close_connections()). With respect to the hostname_cache, active_mi and master_info_cache, this ensures that they cannot be destroyed while the signal handler is still active, and potentially using them.

Additionally:

  1) This requires that Events memory is still in place for SIGHUP handling's mysql_print_status(). So event deinitialization is moved into clean_up(), but the event scheduler still needs to be stopped in close_connections() at the same spot.

  2) The function kill_server_thread is no longer used, so it is deleted.

  3) The timeout to wait for the death of the signal thread was not consistent with the comment. The comment mentioned up to 10 seconds, whereas it was actually 0.01s. The code has been fixed to wait up to 10 seconds.

  4) A warning has been added if the signal handler thread fails to exit in time.

  5) Added pthread_join() to end of wait_for_signal_thread_to_end() if it hadn't ended in 10s with a warning. Note this also removes the pthread_detached attribute from the signal_thread to allow for the pthread_join().

Reviewed By:
===========
Vladislav Vaintroub <wlad@mariadb.com>
Andrei Elkin <andrei.elkin@mariadb.com>
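
The shutdown ordering described in the message can be illustrated with a minimal, self-contained sketch. This is not the server code: it uses standard C++ threads instead of pthreads, and `Server`, `HostCache`, `start_handler()`, `wait_for_handler_to_end()` and `shutdown()` are hypothetical names chosen for illustration. It only shows the idea of waiting for the handler thread (with a timeout and a warning), joining it, and only then freeing the resources it may touch.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a shared resource such as the hostname cache.
struct HostCache { std::string entry = "localhost"; };

struct Server {
  std::unique_ptr<HostCache> cache = std::make_unique<HostCache>();
  std::mutex lock;
  std::condition_variable cond;
  bool stop_requested = false;
  bool handler_done = false;
  std::thread handler;  // joinable (not detached), so it can be joined later

  // Simulated signal handler: keeps touching the cache until asked to stop.
  void start_handler() {
    handler = std::thread([this] {
      std::unique_lock<std::mutex> guard(lock);
      while (!stop_requested) {
        cache->entry = "refreshed";  // would be a use-after-free if freed early
        cond.wait_for(guard, std::chrono::milliseconds(1));
      }
      handler_done = true;
      cond.notify_all();
    });
  }

  // Ask the handler to stop, wait up to `timeout`, warn on timeout, then join.
  bool wait_for_handler_to_end(std::chrono::milliseconds timeout) {
    bool ended;
    {
      std::unique_lock<std::mutex> guard(lock);
      stop_requested = true;
      cond.notify_all();
      ended = cond.wait_for(guard, timeout, [this] { return handler_done; });
    }
    if (!ended)
      std::fprintf(stderr, "Warning: signal handler did not exit in time\n");
    handler.join();  // blocks until the handler has really finished
    return ended;
  }

  // Shutdown order: stop and join the handler first, free resources last.
  bool shutdown() {
    bool ended = wait_for_handler_to_end(std::chrono::seconds(10));
    cache.reset();  // safe: no other thread can reach the cache any more
    return ended;
  }
};
```

Calling `start_handler()` followed by `shutdown()` on a `Server` frees the cache only after the handler thread has been joined. Reversing the two steps inside `shutdown()` would reintroduce the kind of race the commit message describes.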
2025-07-30 16:24:05 +03:00 · 2024-04-08 13:04:59 -06:00
parent 4980fcb990
commit 952ab9a596
4 changed files with 259 additions and 35 deletions
--- a/sql/sql_reload.cc
+++ b/sql/sql_reload.cc
@@ -67,6 +67,15 @@ bool reload_acl_and_cache(THD *thd, unsigned long long options,
  bool result=0;
  select_errors=0;				/* Write if more errors */
  int tmp_write_to_binlog= *write_to_binlog= 1;
+#ifndef DBUG_OFF
+  /*
+    When invoked for handling a SIGHUP by rpl_shutdown_sighup.test, we need to
+    force the signal handler to wait after REFRESH_TABLES, as that will check
+    for a killed server, and we need to call hostname_cache_refresh after
+    server cleanup has happened to trigger MDEV-30260.
+  */
+  int do_dbug_sleep= 0;
+#endif

  DBUG_ASSERT(!thd || !thd->in_sub_stmt);

@@ -99,6 +108,15 @@ bool reload_acl_and_cache(THD *thd, unsigned long long options,
        */
        my_error(ER_UNKNOWN_ERROR, MYF(0));
      }
+
+#ifndef DBUG_OFF
+      DBUG_EXECUTE_IF("hold_sighup_log_refresh", {
+        DBUG_ASSERT(!debug_sync_set_action(
+            thd, STRING_WITH_LEN("now SIGNAL in_reload_acl_and_cache "
+                                 "WAIT_FOR refresh_logs")));
+        do_dbug_sleep= 1;
+      });
+#endif
    }
    opt_noacl= 0;

@@ -351,6 +369,11 @@ bool reload_acl_and_cache(THD *thd, unsigned long long options,
    }
    my_dbopt_cleanup();
  }
+
+#ifndef DBUG_OFF
+  if (do_dbug_sleep)
+    my_sleep(3000000); // 3s
+#endif
  if (options & REFRESH_HOSTS)
    hostname_cache_refresh();
  if (thd && (options & REFRESH_STATUS))