This commit contains a merge from 10.5-MDEV-29293-squash
into 10.6.
Although the bug MDEV-29293 was not reproducible with 10.6,
the fix contains several improvements for wsrep KILL query and
BF abort handling, and addresses the following issues:
* MDEV-30307 KILL command issued inside a transaction is
problematic for galera replication:
This commit will remove KILL TOI replication, so Galera side
transaction context is not lost during KILL.
* MDEV-21075 KILL QUERY maintains nodes data consistency but
breaks GTID sequence: This is fixed as well as KILL does not
use TOI, and thus does not change GTID state.
* MDEV-30372 Assertion in wsrep-lib state: This was caused by
BF abort or KILL when local transaction was in the middle
of group commit. This commit disables THD::killed handling
during commit, so the problem is avoided.
* MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim
in trx0trx.h:1065: The assertion happened when the victim was
BF aborted via MDL while it was committing. This commit changes
MDL BF aborts so that transactions which are committing cannot
be BF aborted via MDL. The RQG grammar attached in the issue
could not reproduce the crash anymore.
Original commit message from 10.5 fix:
MDEV-29293 MariaDB stuck on starting commit state
The problem seems to be a deadlock between KILL command execution
and BF abort issued by an applier, where:
* KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
* Applier has innodb side global lock mutex and victim trx mutex.
* KILL is calling innobase_kill_query, and is blocked by innodb
global lock mutex.
* Applier is in wsrep_innobase_kill_one_trx and is blocked by
victim's LOCK_thd_kill.
The fix in this commit removes the TOI replication of KILL command
and makes KILL execution less intrusive operation. Aborting the
victim happens now by using awake_no_mutex() and ha_abort_transaction().
If the KILL happens when the transaction is committing, the
KILL operation is postponed to happen after the statement
has completed in order to avoid KILL to interrupt commit
processing.
Notable changes in this commit:
* wsrep client connections's error state may remain sticky after
client connection is closed. This error message will then pop
up for the next client session issuing first SQL statement.
This problem raised with test galera.galera_bf_kill.
The fix is to reset wsrep client error state, before a THD is
reused for next connetion.
* Release THD locks in wsrep_abort_transaction when locking
innodb mutexes. This guarantees same locking order as with applier
BF aborting.
* BF abort from MDL was changed to do BF abort on server/wsrep-lib
side first, and only then do the BF abort on InnoDB side. This
removes the need to call back from InnoDB for BF aborts which originate
from MDL and simplifies the locking.
* Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
The manipulation of the wsrep_aborter can be done solely on
server side. Moreover, it is now debug only variable and
could be excluded from optimized builds.
* Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more
fine grained locking for SR BF abort which may require locking
of victim LOCK_thd_kill. Added explicit call for
wsrep_thd_kill_LOCK/UNLOCK where appropriate.
* Wsrep-lib was updated to version which allows external
locking for BF abort calls.
Changes to MTR tests:
* Disable galera_bf_abort_group_commit. This test is going to
be removed (MDEV-30855).
* Make galera_var_retry_autocommit result more readable by echoing
cases and expectations into result. Only one expected result for
reap to verify that server returns expected status for query.
* Record galera_gcache_recover_manytrx as result file was incomplete.
Trivial change.
* Make galera_create_table_as_select more deterministic:
Wait until CTAS execution has reached MDL wait for multi-master
conflict case. Expected error from multi-master conflict is
ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open
wsrep transaction when it is waiting for MDL, query gets interrupted
instead of BF aborted. This should be addressed in separate task.
* A new test galera_bf_abort_registering to check that registering trx gets
BF aborted through MDL.
* A new test galera_kill_group_commit to verify correct behavior
when KILL is executed while the transaction is committing.
Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
This patch is the result of running
run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' .
Code style changes have been done on top. The result of this change
leads to the following improvements:
1. Binary size reduction.
* For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by
~400kb.
* A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb.
2. Compiler can better understand the intent of the code, thus it leads
to more optimization possibilities. Additionally it enabled detecting
unused variables that had an empty default constructor but not marked
so explicitly.
Particular change required following this patch in sql/opt_range.cc
result_keys, an unused template class Bitmap now correctly issues
unused variable warnings.
Setting Bitmap template class constructor to default allows the compiler
to identify that there are no side-effects when instantiating the class.
Previously the compiler could not issue the warning as it assumed Bitmap
class (being a template) would not be performing a NO-OP for its default
constructor. This prevented the "unused variable warning".
The problem was that federated engine does not support comparable rowids
which was not taken into account by semijoin code.
Fixed by checking that we don't use semijoin with tables that does not
support comparable rowids.
Other things:
- Fixed some typos in the code comments
Works like vers_force but forces trx_id-based system-versioned tables
if the storage supports it (currently InnoDB-only). Otherwise creates
timestamp-based system-versioned table.
If the backup finished in the middle of a Aria bulk load insert,
which could happen with LOAD DATA INFILE, CREATE ... SELECT etc)
there was a chance that Aria recovery would fail on the backup.
Fixed by ensuring that bulk load operations for Aria are not allowed
under BACKUP LOCK.
I also changed so that the table TRN is updated just before truncate
which ensures that old redo's for the table are ignored.
I also enabled Aria redo for DDL's to be able to repeat REPAIR commands.
Without this change recovery would not work on repaired tables.
Notes:
- We take the backup lock protection at the end of bulk insert (as we
don't want to keep the lock over a very long running insert).
If mariadb-backup keeps the backup lock too long, this may fail with
a lock timeout. In this case the batch insert will fail and the table
will be truncated (set to it's original state).
When a range rowid filter was used with an index ref access the cost of
accessing the index entries for the records rejected by the filter was not
taken into account. For a ref access by an index with big average number
of records per key this led to poor execution plans if selectivity of the
used filter was high.
The patch resolves this problem. It also introduces a minor optimization
that skips look-ups into a filter that turns out to be empty.
With this patch the output of ANALYZE stmt reports the number of look-ups
into used rowid filters.
The patch also back-ports from 10.5 the code that properly sets the field
TABLE::file::table for opened temporary tables.
The test cases that were supposed to use rowid filters have been adjusted
in order to use similar execution plans after this fix.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
that is not in binlog.
Post-crash recovery of --rpl-semi-sync-slave-enabled server
failed to recognize a transaction in-doubt that needed rolled back.
A prepared-but-not-in-binlog transaction gets committed instead
to possibly create inconsistency with a master (e.g the way it was observed
in the bug report).
The semisync recovery is corrected now with initializing binlog coordinates
of any transaction in-doubt to the maximum offset which is
unreachable.
In effect when a prepared transaction that is not found in binlog
it will be decided to rollback because it's guaranteed to reside
in a truncated tail area of binlog.
Mtr tests are reinforced to cover the described scenario.
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.
During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:
sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.
sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.
The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.
Reviewed by: Jan Lindström
Commit 6c39eaeb1 made the crash recovery dependent on server_id.
The crash recovery could fail when restoring a new instance from
original crashed data directory USING A NEW SERVER ID.
The issue doesn't exist in previous major versions before 10.6.
Root cause is when generating the input XID to be searched in the hash,
server id is populated with the current server id.
So if the server id changed when recovering, the XID couldn't be found
in the hash due to server id doesn't match.
This fix is to use original server id when creating the input XID
object in function `xarecover_do_commit_or_rollback`.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
The first step for deprecating innodb_autoinc_lock_mode(see MDEV-27844) is:
- to switch statement binlog format to ROW if binlog format is MIXED and
the statement changes autoincremented fields
- issue warnings if innodb_autoinc_lock_mode == 2 and binlog format is
STATEMENT
If the optimizer decides to rewrites a NOT IN predicand of the form
outer_expr IN (SELECT inner_col FROM ... WHERE subquery_where)
into the EXISTS subquery
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND
(outer_expr=inner_col OR inner_col IS NULL))
then the pushed equality predicate outer_expr=inner_col can be used for
ref[or_null] access if inner_col is a reference to an indexed column.
In this case if there is a selective range condition over this column then
a Rowid filter may be employed coupled the with ref[or_null] access. The
filter is 'pushed' into the engine and in InnoDB currently it cannot be
used with index look-ups by primary key. The ref[or_null] access can be
used only when outer_expr is not NULL. Otherwise the original predicand
is evaluated to TRUE only if the result set returned by the query
SELECT 1 FROM ... WHERE subquery_where
is empty. When performing this evaluation the executor switches to the
table scan by primary key. Before this patch the pushed filter still
remained marked as active and the engine tried to apply the filter. This
was incorrect and in InnoDB this attempt to use the filter led to an
assertion failure.
This patch fixes the problem by disabling usage of the filter when
outer_expr is evaluated to NULL.
Federated and Federatex cannot be used with ROR scans
Federated::position() and Federatex::position() is storing in 'ref' a
pointer into a local result set buffer. This means that one cannot
compare 'ref' from different handler instances to see if they point to the
same physical record.
This bug caused federated.federatedx to return wrong results when the
optimizer tried to use index_merge to resolve some queries.
Fixed by introducing table flag HA_NON_COMPARABLE_ROWID and using this
with the above handlers.
Todo:
- Fix multi_delete(), multi_update and read_records() to use primary key
instead of 'ref' if case HA_NON_COMPARABLE_ROWID is set. The current
code only works if we have only one range (like table scan) for the
tables that will be updated in the second pass.
- Enable DBUG_ASSERT() in ha_federated::cmp_ref() and
ha_federatedx::cmp_ref().
- In ha_innobase::prepare_inplace_alter_table(), InnoDB should
check whether the table is empty. If the table is empty then
server should avoid downgrading the MDL after prepare phase.
It is more like instant alter, does change only in dicationary
and metadata.
- Changed few debug test case to make non-empty DDL table