The problem was that parallel replication of temporary tables using
statement-based binlogging could overlap the COMMIT in one thread with a DML
or DROP TEMPORARY TABLE in another thread using the same temporary table.
Temporary tables are not safe for concurrent access, so this caused
references to freed memory and possibly other nastiness.
The fix is to disable the optimisation that overlaps the commit of one
transaction with the start of a later transaction when temporary tables
are in use. Following event groups are then blocked from starting until
the one using temporary tables has completed.
This also fixes occasional test failures of rpl.rpl_parallel_temptable seen
in Buildbot.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem is that when a worker thread is (user) killed in
wait_for_prior_commit, the event group may complete out-of-order since the
wait for prior commit was aborted by the kill.
This fix ensures that event groups will always complete in-order, even
in the error case. This is done in finish_event_group() by doing an
extra wait_for_prior_commit(), if necessary, that ignores kills.
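A minimal standalone model of that ordering guarantee (the names mirror
the server code, but the implementation below is purely illustrative):

  #include <condition_variable>
  #include <mutex>

  struct EventGroup {
    std::mutex mtx;
    std::condition_variable cv;
    bool prior_committed = false;
    bool killed = false;

    // Normal wait: aborts early when the worker is killed.
    bool wait_for_prior_commit() {
      std::unique_lock<std::mutex> lk(mtx);
      cv.wait(lk, [&] { return prior_committed || killed; });
      return prior_committed;  // false => the wait was aborted by a kill
    }

    void finish_event_group() {
      if (!wait_for_prior_commit()) {
        // Killed mid-wait: do one extra wait that ignores the kill flag,
        // so this group still completes strictly after its predecessor.
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return prior_committed; });
      }
      // ... mark this group as committed, signal any error, and
      // wakeup_subsequent_commits() ...
    }
  };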
This fix supersedes the fix for MDEV-30780, so the earlier fix for
that is reverted in this patch.
Also fix that an error from wait_for_prior_commit() inside
finish_event_group() was not signalled to
wakeup_subsequent_commits().
Based on earlier work by Brandon Nesterenko and Andrei Elkin, with
some changes to simplify the semantics of wait_for_prior_commit() and
make the code more robust to future changes.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem is that a parallel replica would not immediately stop
running or queued transactions when issued STOP SLAVE. That is, it
allowed the current group of transactions to run, and sometimes
transactions belonging to the next group could also be started and run
through commit after STOP SLAVE was issued, if the last group had
started committing. This could lead to long waits for all pending
transactions to finish.
This patch updates a parallel replica to try to abort immediately and
roll back any ongoing transactions. The exceptions are transactions
which are non-transactional (e.g. those modifying sequences or
non-transactional tables) and any prior transactions; these are run
to completion.
The specifics are as follows:
1. A new stage was added to SHOW PROCESSLIST output for the SQL
Thread when it is waiting for a replica thread to either rollback or
finish its transaction before stopping. This stage presents as
“Waiting for worker thread to stop”
2. Worker threads which error or are killed no longer perform GCO
cleanup if there is a concurrently running prior transaction. This
is because a worker thread scheduled to run in a future GCO could be
killed and incorrectly perform cleanup of the active GCO.
3. Refined the cases in which the FL_TRANSACTIONAL flag is added to
GTID binlog events, to disallow adding it to transactions which modify
both transactional and non-transactional engines when the binlogging
configuration allows the modifications to exist in the same event,
i.e. when using binlog_direct_non_trans_update == 0 and
binlog_format == statement.
4. A few existing MTR tests relied on the completion of certain
transactions after issuing STOP SLAVE, and were re-recorded
(potentially with added synchronizations) under the new rollback
behavior.
Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
This is a backport from 10.5.
The problem seems to be a deadlock between KILL command execution
and BF abort issued by an applier, where:
* KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
* Applier has innodb side global lock mutex and victim trx mutex.
* KILL is calling innobase_kill_query, and is blocked by innodb
global lock mutex.
* Applier is in wsrep_innobase_kill_one_trx and is blocked by
victim's LOCK_thd_kill.
The fix in this commit removes the TOI replication of the KILL command
and makes KILL execution a less intrusive operation. Aborting the
victim now happens by using awake_no_mutex() and ha_abort_transaction().
If the KILL happens while the transaction is committing, the KILL
operation is postponed until after the statement has completed, so
that the KILL does not interrupt commit processing.
Notable changes in this commit:
* A wsrep client connection's error state could remain sticky after
the client connection was closed. The error would then pop up for
the next client session issuing its first SQL statement. This
problem showed up with the test galera.galera_bf_kill.
The fix is to reset the wsrep client error state before a THD is
reused for the next connection.
* Release THD locks in wsrep_abort_transaction when locking
innodb mutexes. This guarantees same locking order as with applier
BF aborting.
* BF abort from MDL was changed to do BF abort on server/wsrep-lib
side first, and only then do the BF abort on InnoDB side. This
removes the need to call back from InnoDB for BF aborts which originate
from MDL and simplifies the locking.
* Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
The manipulation of the wsrep_aborter can be done solely on the
server side. Moreover, it is now a debug-only variable and
could be excluded from optimized builds.
* Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow
finer-grained locking for SR BF abort, which may require locking
the victim's LOCK_thd_kill. Added explicit calls to
wsrep_thd_kill_LOCK/UNLOCK where appropriate.
* Wsrep-lib was updated to a version which allows external
locking for BF abort calls.
Changes to MTR tests:
* Disable galera_bf_abort_group_commit. This test is going to
be removed (MDEV-30855).
* Record galera_gcache_recover_manytrx, as its result file was
incomplete. Trivial change.
* Make galera_create_table_as_select more deterministic:
Wait until CTAS execution has reached the MDL wait in the multi-master
conflict case. The expected error from a multi-master conflict is
ER_QUERY_INTERRUPTED: because CTAS does not yet have an open wsrep
transaction while it is waiting for MDL, the query gets interrupted
instead of BF aborted. This should be addressed in a separate task.
* A new test galera_kill_group_commit to verify correct behavior
when KILL is executed while the transaction is committing.
Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
This patch is the result of running
run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' .
Code style changes have been done on top. The result of this change
leads to the following improvements:
1. Binary size reduction.
* For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by
~400kb.
* A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb.
2. The compiler can better understand the intent of the code, which
leads to more optimization possibilities. Additionally it enables
detecting unused variables that have an empty default constructor
not explicitly marked as such.
One particular change was required following this patch: in
sql/opt_range.cc, result_keys, an unused variable of the template
class Bitmap, now correctly issues an unused-variable warning.
Setting the Bitmap template class constructor to default allows the
compiler to see that there are no side effects when instantiating the
class. Previously the compiler could not issue the warning, as it had
to assume that Bitmap (being a template) might not perform a no-op in
its default constructor. This suppressed the unused-variable warning.
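A small self-contained demo of the effect (hypothetical types, not the
server's Bitmap): a user-provided empty constructor is opaque to the
compiler, while a defaulted one makes the type trivially constructible,
so -Wunused-variable can fire:

  struct EmptyBody { EmptyBody() {} };          // user-provided: compiler
                                                // must assume side effects
  struct Defaulted { Defaulted() = default; };  // provably a no-op

  int main() {
    EmptyBody a;  // no unused-variable warning: construction is opaque
    Defaulted b;  // -Wunused-variable fires: construction is trivial
    return 0;
  }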
(Initial patch by Varun Gupta. Amended and added comments).
When the query has both
1. Aggregate functions that require sorting data by group, and
2. Window functions
we need to use two temporary tables. The first temp. table holds the
join output. It is then passed to filesort(); reading it in sorted
order allows computing the aggregate functions.
Their values are then written into the second temp. table, which the
Window Function computation step can pass to filesort() and read in
the order it needs.
Failure to create the second temp. table would cause an assertion
failure: the window function code would not find where to get the
values of the aggregate functions.
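A toy model of the two-pass plan (plain C++, not server code; a single
SUM per group and an assumed window ordering stand in for the general
case):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct Row    { int grp; long val; };
  struct GrpRow { int grp; long sum; long running; };

  std::vector<GrpRow> aggregate_then_window(std::vector<Row> rows) {
    // "First temp table": the join output; filesort it by group so
    // SUM(val) per group can be computed in one sequential scan.
    std::sort(rows.begin(), rows.end(),
              [](const Row &a, const Row &b) { return a.grp < b.grp; });
    std::vector<GrpRow> groups;  // plays the role of the second temp table
    for (std::size_t i = 0; i < rows.size();) {
      long s = 0;
      std::size_t j = i;
      while (j < rows.size() && rows[j].grp == rows[i].grp)
        s += rows[j++].val;
      groups.push_back({rows[i].grp, s, 0});
      i = j;
    }
    // The window step "filesorts" the second table in its own order
    // (here: by sum) and computes, e.g., a running total.
    std::sort(groups.begin(), groups.end(),
              [](const GrpRow &a, const GrpRow &b) { return a.sum < b.sum; });
    long run = 0;
    for (auto &g : groups) { run += g.sum; g.running = run; }
    return groups;
  }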
Regression from MDEV-29540 / 8c389393695.
INSERT SELECT errors needed to be unconditionally ignored.
As this touches the CREATE .. SELECT functionality, show the
equivalent test there.
Virtual column values are updated in handler in reading commands,
like ha_index_next, etc. This was missing for ha_ft_read.
handler::ha_ft_read: add table->update_virtual_fields() call
The population of default values in INSERT SELECT was being
performed twice. With sequences, this resulted in every
second sequence value being used.
With INSERT SELECT we remove the second invocation of
table->update_default_fields(). This was already performed in
store_values(), which invokes fill_record_n_invoke_before_triggers(),
which previously invoked update_default_fields().
We do need to return an error on duplicate values, so
::store_values is extended to take the ignore option.
As of now InnoDB does not store a trx_id for each record in secondary
indexes. The idea behind this is the following: store only a per-page
max_trx_id, and delete-mark records when they are deleted/updated.
When a read starts, it remembers the lowest id of the currently active
transactions. InnoDB refers to it as trx->read_view->m_up_limit_id.
See also ReadView::open.
When a page is fetched, its max_trx_id is compared to m_up_limit_id.
If the value is lower, and the secondary index record is not
delete-marked, then the page is safe to read as is. Otherwise, a
clustered index access may be needed. See the page_get_max_trx_id call
in row_search_mvcc, and the corresponding
switch (row_search_idx_cond_check(...)) below it.
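The decision reduces to something like the following simplified,
standalone sketch (assumed names; the real logic is in
row_search_mvcc):

  #include <cstdint>

  struct ReadView { uint64_t m_up_limit_id; };  // lowest active trx id

  enum class SecRecAction { UseAsIs, NeedClusteredLookup };

  // Simplified version of the check made on a secondary index page.
  SecRecAction check_sec_rec(uint64_t page_max_trx_id,
                             bool rec_delete_marked, const ReadView &view) {
    // Every transaction that touched this page committed before the read
    // view was opened, and the record is not delete-marked: safe as is.
    if (page_max_trx_id < view.m_up_limit_id && !rec_delete_marked)
      return SecRecAction::UseAsIs;
    // Otherwise visibility must be resolved via the clustered index.
    return SecRecAction::NeedClusteredLookup;
  }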
Virtual columns are required to be updated in case if the record was
delete-marked. The motivation behind it is documented in
Row_sel_get_clust_rec_for_mysql::operator() near
row_sel_sec_rec_is_for_clust_rec call.
That covers why virtual column computation can normally happen during
a SELECT and, more generally, during a vcol index access.
Sometimes stats tables are updated by InnoDB. This starts a new
transaction, and it can happen that it has not finished by the time a
SELECT executes, forcing virtual column recomputation. If the result
was something that normally emits a warning, like division by zero,
the warning could be emitted in a racy manner.
The solution is to suppress the warnings when a column is computed
for the described purpose.
An ignore_wrnings argument is added to innobase_get_computed_value.
Currently, it is only true for the call from
row_sel_sec_rec_is_for_clust_rec.
The problem is that if the table definition cache (TDC) is full of
real tables which are in the table cache, a view definition can not
stay there and will be evicted by its own underlying tables.
In the situation above, the old mechanism for detecting whether the
definition in the PS matches the current version always required a
reprepare and so prevented executing the PS.
One workaround is to increase the TDC size; another is to improve the
version check for views/triggers (which is done here). Now in
suspicious cases we check:
- the timestamp (microseconds) of the view, to be sure that the
version really has changed;
- the creation time (microseconds) of a trigger, relative to the time
(microseconds) of statement preparation.
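In essence (a standalone sketch with assumed field names, not the
server's representation):

  #include <cstdint>

  struct PreparedStmt { uint64_t prepare_time_us; };  // assumed fields
  struct ViewDef      { uint64_t create_time_us; };

  // Reprepare only if the definition really changed after the statement
  // was prepared, instead of on any version-counter mismatch.
  bool needs_reprepare(const PreparedStmt &ps, const ViewDef &current) {
    return current.create_time_us > ps.prepare_time_us;
  }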
Problem:
========
Replication can break while applying a query log event if its
respective command errors on the primary, but is ignored by the
replication filter within Grant_tables on the replica. The bug
reported by MDEV-28530 shows this with REVOKE ALL PRIVILEGES using a
non-existent user. The primary will binlog the REVOKE command with
an error code, and the replica will think the command executed with
success because the replication filter will ignore the command while
accessing the Grant_tables classes. When the replica performs an
error check, it sees the difference between the error codes, and
replication breaks.
Solution:
========
If the replication filter check done by Grant_tables logic ignores
the tables, reset thd->slave_expected_error to 0 so that
Query_log_event::do_apply_event() can be made aware that the
underlying query was ignored when it compares errors.
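A simplified sketch of that idea (hypothetical helper name; the real
change sits in the Grant_tables logic and
Query_log_event::do_apply_event()):

  struct THD { int slave_expected_error; };

  // Returns false when the replication filter skips the grant tables;
  // in that case the error code recorded on the primary must not be
  // expected, or the later error comparison would break replication.
  bool open_grant_tables_for_replica(THD *thd, bool ignored_by_filter) {
    if (ignored_by_filter) {
      thd->slave_expected_error = 0;  // query ignored => expect no error
      return false;                   // nothing to apply
    }
    return true;
  }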
Note that this bug also affects DROP USER if not all users exist in
the provided list; the patch fixes and tests this case too.
Reviewed By:
============
andrei.elkin@mariadb.com
Making changes to wsrep_mysqld.h causes large parts of the server code
to be recompiled. The reason is that wsrep_mysqld.h is included by
sql_class.h, even though very little of wsrep_mysqld.h is needed in
sql_class.h. This commit introduces a new header file, wsrep_on.h,
which is meant to be included from sql_class.h, and contains only
macros and variable declarations used to determine whether wsrep is
enabled.
Also, header wsrep.h should only contain definitions that are also
used outside of sql/. Therefore, move WSREP_TO_ISOLATION* and
WSREP_SYNC_WAIT macros to wsrep_mysqld.h.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
TIMESTAMP columns were compared as strings in ALL/ANY comparison,
which did not work well near DST time change.
Changing ALL/ANY comparison to use "Native" representation to compare
TIMESTAMP columns, like simple comparison does.
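The underlying issue, reduced to a standalone sketch (assumed struct,
not the server's Native type): near a DST change, two distinct instants
can format to the same local-time string, so only the native value
compares reliably:

  #include <cstdint>

  struct Timestamp { int64_t sec; uint32_t usec; };  // native: UTC-based

  bool native_less(const Timestamp &a, const Timestamp &b) {
    return a.sec != b.sec ? a.sec < b.sec : a.usec < b.usec;
  }
  // Comparing "YYYY-MM-DD hh:mm:ss" strings rendered in local time can
  // instead declare two distinct instants equal (or mis-order them)
  // when the clock is set back one hour.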
When creating a recursive CTE, the column types are taken from the
non-recursive part of the CTE (this is according to the SQL standard).
This patch adds code to abort the CTE if the values calculated in the
recursive part do not fit in the fields of the created temporary table.
The new code only affects recursive CTEs, so it should not cause any
notable problems for old applications.
Other things:
- Fixed that we get correct row numbers for warnings generated with
WITH RECURSIVE
Reviewer: Alexander Barkov <bar@mariadb.com>
MDEV-21810 MBR: Unexpected "Unsafe statement" warning for unsafe IODKU
The MDEV-17614 fixes to replication unsafety for INSERT ON DUP KEY
UPDATE on a table with two or more unique keys left a flaw. The fixes
checked the safety condition for each inserted record, with the idea
of catching a user-supplied value for an autoincrement column; when
that happens, the autoincrement column becomes a source of unsafety
too.
It was not expected that after a duplicate error the next record's
write_set may become different, so the unsafe decision computed for
that specific record would corrupt the Query's binlogging state, and
when @@binlog_format is MIXED nothing got binlogged.
This case had already been fixed in 10.5.2 by 91ab42a823, which
relocated/optimized THD::decide_logging_format_low() out of the
record-insert loop, so the safety decision is computed once, at the
right time. Pertinent parts of that commit are cherry-picked.
Also, a spurious warning about unsafety under MIXED @@binlog_format
is removed; the original MDEV-17614 test result is corrected.
The original test of MDEV-17614 is extended and made more readable.
Problem:
========
If a primary is shut down while a semi-sync connection is active and
the primary is awaiting an ACK, the primary hard kills the active
communication thread and does not ensure the transaction was received
by a replica. This can lead to an inconsistent replication state.
Solution:
========
During shutdown, the primary should wait for an ACK or a timeout
before hard killing a thread that is awaiting an ACK. We extend the
`SHUTDOWN WAIT FOR SLAVES` logic to identify and ignore
any threads waiting for a semi-sync ACK in phase 1. Then, before
stopping the ack receiver thread, the shutdown is delayed until all
waiting semi-sync connections receive an ACK or time out. The
connections are then killed in phase 2.
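A standalone model of that wait (assumed names; the real logic lives in
the ACK receiver and shutdown code):

  #include <chrono>
  #include <condition_variable>
  #include <mutex>

  struct SemiSyncWaiter {
    std::mutex mtx;
    std::condition_variable cv;
    bool ack_received = false;

    void on_ack() {  // called by the ACK receiver thread
      { std::lock_guard<std::mutex> lk(mtx); ack_received = true; }
      cv.notify_all();
    }

    // Phase 1 of shutdown skips threads inside this wait; they are only
    // killed in phase 2, after this returns (ACK received or timed out).
    bool await_ack_or_timeout(std::chrono::milliseconds timeout) {
      std::unique_lock<std::mutex> lk(mtx);
      return cv.wait_for(lk, timeout, [&] { return ack_received; });
    }
  };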
Notes:
1) There remains an unresolved corner case that affects this patch,
MDEV-28141: Slave crashes with Packets out of order when connecting
to a shutting down master. Specifically, if a slave is connecting to
a master which is actively shutting down, the slave can crash with a
"Packets out of order" assertion error. To work around this issue in
the MTR tests, the primary waits a small amount of time before
killing threads in phase 1, to let the replicas safely stop (if
applicable).
2) This patch also fixes MDEV-28114: Semi-sync Master ACK Receiver
Thread Can Error on COM_QUIT
Reviewed By
============
Andrei Elkin <andrei.elkin@mariadb.com>
The bug was that the in_vector array in Item_func_in was allocated in
the statement arena, not in table->expr_arena.
Revert part of 5acd391e8b2d. Instead, change the arena correctly in
fix_all_session_vcol_exprs().
Remove TABLE_ARENA, which was introduced in 5acd391e8b2d to force item
tree changes to be rolled back (they were allocated in the wrong arena
and didn't persist; now they do).
Problem: In regular replication, when the master binlogged using
statement format, the slave might not write an event to its binary
log when the Query event targeted a temporary table.
Specifically, this was observed with LOAD DATA INFILE.
This was possible because, unlike the master, the slave holds
temporary tables in its pool, so the master-side check for the
existence of a temporary table at the binlogging-format decision did
not apply.
Solution: replace THD::has_thd_temporary_tables() with
THD::has_temporary_tables(), which can identify temporary table
presence on either side.
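Roughly (a standalone sketch; member names are assumed from the commit
text, not taken from the actual patch):

  struct TmpTableList {};
  struct RplGroupInfo { TmpTableList *slave_tmp_tables; };

  struct THD {
    TmpTableList *temporary_tables;  // session-local temporary tables
    RplGroupInfo *rgi_slave;         // non-null on a replication worker

    bool has_thd_temporary_tables() const {
      return temporary_tables != nullptr;  // master-side view only
    }

    bool has_temporary_tables() const {
      // Also sees the slave-side pool, so the binlogging-format
      // decision works on either side.
      return rgi_slave ? rgi_slave->slave_tmp_tables != nullptr
                       : has_thd_temporary_tables();
    }
  };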
--
Reviewed by Andrei Elkin.
Using parallel slave applying can cause a deadlock between DDL and
other events. A GTID with a lower seqno can be blocked in Galera when
the node has entered TOI mode, while the DDL GTID, which has a higher
seqno, can be blocked before previous GTIDs are applied locally.
Fix is to check prior commits before entering TOI.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Handle BF abort for prepared statement execution so that EXECUTE
processing continues until parameter setup is complete, before the BF
abort bails out the statement execution.
The THD class has a new boolean member, wsrep_delayed_BF_abort, which
is set if a BF abort is observed in do_command() right after reading
the client's packet, and the client has sent a PS execute command.
In such a case, the deadlock error is not returned immediately to the
client; instead the PS execution is started. The PS execution loop now
checks if wsrep_delayed_BF_abort is set, and stops the PS execution
after the type information has been assigned for the PS.
With this, the PS protocol type information, which is present in the
first PS EXECUTE command, is not lost even if the first PS EXECUTE
command was marked to abort.
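The control flow, as a simplified standalone sketch (the member name
follows the commit text; everything else is illustrative):

  struct THD { bool wsrep_delayed_BF_abort = false; };

  void do_command(THD *thd, bool bf_abort_seen, bool is_ps_execute) {
    if (bf_abort_seen && is_ps_execute)
      thd->wsrep_delayed_BF_abort = true;  // don't error out yet
    // ... dispatch the command as usual ...
  }

  void ps_execute(THD *thd) {
    // 1. Assign parameter types from the first EXECUTE packet.
    // 2. Only now honour a pending BF abort:
    if (thd->wsrep_delayed_BF_abort)
      return;  // report the deadlock error; type info has been kept
    // ... normal statement execution ...
  }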
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
extra2_read_len is resolved by keeping the implementation in
sql/table.cc and exposing it for use by ha_partition.cc.
The identical implementation in unireg.h is removed.
(ref: bfed2c7d57a7ca34936d6ef0688af7357592dc40)
Problem:
Parse-time conversion from binary to tricky character sets like utf32
produced ill-formed strings. So, later, a crash happened in debug
builds, or a wrong SHOW CREATE TABLE was returned in release builds.
Fix:
1. Backporting a few methods from 10.3:
- THD::check_string_for_wellformedness()
- THD::convert_string() overloads
- THD::make_text_string_connection()
2. Adding a new method THD::reinterpret_string_from_binary(),
which makes sure to either return a well-formed string (optionally
prepending it with zero bytes) or return an error.
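The idea behind the new method, as a standalone sketch (assumed
helper, not the server implementation): pad the binary value with
leading zero bytes up to a multiple of the character size, then
validate well-formedness:

  #include <cstddef>
  #include <string>

  // Pads `bin` with leading zero bytes so its length is a multiple of
  // char_size (4 for utf32); the zero-byte prefix preserves the value.
  bool reinterpret_from_binary(const std::string &bin,
                               std::size_t char_size, std::string &out) {
    std::size_t pad = (char_size - bin.size() % char_size) % char_size;
    out.assign(pad, '\0');
    out += bin;
    return true;  // placeholder: real code checks each character is
                  // well-formed and returns an error otherwise
  }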
close_connections() in mysqld.cc sends a signal to all threads.
But InnoDB is too busy purging and doesn't react immediately.
close_connections() waits 20 seconds, which isn't enough in this
particular case, then unlinks all threads from the list and forcibly
closes their vio connections.
InnoDB background threads have no vio connection to close, but
they're unlinked all the same. So when they finally notice the
shutdown request and try to unlink themselves, they fail to assert
that they're still linked.
Fix: don't assert_linked, as another thread can unlink this THD at
any time.