The immediate bug was caused by a failure to recognize a correct
position to stop the slave applier run in optimistic parallel mode.
There were the following set of issues that the analysis unveil.
1 incorrect estimate for the event binlog position passed to
is_until_satisfied
2 wait for workers to complete by the driver thread did not account non-group events
that could be left unprocessed and thus to mix up the last executed
binlog group's file and position:
the file remained old and the position related to the new rotated file
3 incorrect 'slave reached file:pos' by the parallel slave report in the error log
4 relay log UNTIL missed out the parallel slave branch in
is_until_satisfied.
The patch addresses all of them to simplify logics of log change
notification in either the master and relay-log until case.
P.1 is addressed with passing the event into is_until_satisfied()
for proper analisis by the function.
P.2 is fixed by changes in handle_queued_pos_update().
P.4 required removing relay-log change notification by workers.
Instead the driver thread updates the notion of the current relay-log
fully itself with aid of introduced
bool Relay_log_info::until_relay_log_names_defer.
An extra print out of the requested until file:pos is arranged
with --log-warning=3.
This is a new test from upstream that did not expect the correct value
of the command slot of the Dump thread when the latter gets killed.
The test is made to expect "Killed" string as the command
in show-processlist as it is supposed to when a thread gets killed.
If the InnoDB buffer pool contains many pages for a table or index
that is being dropped or rebuilt, and if many of such pages are
pointed to by the adaptive hash index, dropping the adaptive hash index
may consume a lot of time.
The time-consuming operation of dropping the adaptive hash index entries
is being executed while the InnoDB data dictionary cache dict_sys is
exclusively locked.
It is not actually necessary to drop all adaptive hash index entries
at the time a table or index is being dropped or rebuilt. We can let
the LRU replacement policy of the buffer pool take care of this gradually.
For this to work, we must detach the dict_table_t and dict_index_t
objects from the main dict_sys cache, and once the last
adaptive hash index entry for the detached table is removed
(when the garbage page is evicted from the buffer pool) we can free
the dict_table_t and dict_index_t object.
Related to this, in MDEV-16283, we made ALTER TABLE...DISCARD TABLESPACE
skip both the buffer pool eviction and the drop of the adaptive hash index.
We shifted the burden to ALTER TABLE...IMPORT TABLESPACE or DROP TABLE.
We can remove the eviction from DROP TABLE. We must retain the eviction
in the ALTER TABLE...IMPORT TABLESPACE code path, so that in case the
discarded table is being re-imported with the same tablespace identifier,
the fresh data from the imported tablespace will replace any stale pages
in the buffer pool.
rpl.rpl_failed_drop_tbl_binlog: Remove the test. DROP TABLE can
no longer be interrupted inside InnoDB.
fseg_free_page(), fseg_free_step(), fseg_free_step_not_header(),
fseg_free_page_low(), fseg_free_extent(): Remove the parameter
that specifies whether the adaptive hash index should be dropped.
btr_search_lazy_free(): Lazily free an index when the last
reference to it is dropped from the adaptive hash index.
buf_pool_clear_hash_index(): Declare static, and move to the
same compilation unit with the bulk of the adaptive hash index
code.
dict_index_t::clone(), dict_index_t::clone_if_needed():
Clone an index that is being rebuilt while adaptive hash index
entries exist. The original index will be inserted into
dict_table_t::freed_indexes and dict_index_t::set_freed()
will be called.
dict_index_t::set_freed(), dict_index_t::freed(): Note that
or check whether the index has been freed. We will use the
impossible page number 1 to denote this condition.
dict_index_t::n_ahi_pages(): Replaces btr_search_info_get_ref_count().
dict_index_t::detach_columns(): Move the assignment n_fields=0
to ha_innobase_inplace_ctx::clear_added_indexes().
We must have access to the columns when freeing the
adaptive hash index. Note: dict_table_t::v_cols[] will remain
valid. If virtual columns are dropped or added, the table
definition will be reloaded in ha_innobase::commit_inplace_alter_table().
buf_page_mtr_lock(): Drop a stale adaptive hash index if needed.
We will also reduce the number of btr_get_search_latch() calls
and enclose some more code inside #ifdef BTR_CUR_HASH_ADAPT
in order to benefit cmake -DWITH_INNODB_AHI=OFF.
Also fixes:
MDEV-21487: Implement option for mysql_upgrade that allows root@localhost to be replaced
MDEV-21486: Implement option for mysql_install_db that allows root@localhost to be replaced
Add user mariadb.sys to be definer of user view
(and has right on underlying table global_priv for
required operation over global_priv
(SELECT,UPDATE,DELETE))
Also changed definer of gis functions in case of creation,
but they work with any definer so upgrade script do not try
to push this change.
in fact, in MariaDB it cannot, but it can show spurious slaves
in SHOW SLAVE HOSTS.
slave was registered in COM_REGISTER_SLAVE and un-registered after
COM_BINLOG_DUMP. If there was no COM_BINLOG_DUMP, it would never
unregister.
Problem:
=======
SET @@GLOBAL.replicate_wild_ignore_table='';
SET @@GLOBAL.replicate_wild_do_table='';
Reports following valgrind error.
Conditional jump or move depends on uninitialised value(s)
Rpl_filter::set_wild_ignore_table(char const*) (rpl_filter.cc:439)
Conditional jump or move depends on uninitialised value(s)
at 0xF60390: delete_dynamic (array.c:304)
by 0x74F3F2: Rpl_filter::set_wild_do_table(char const*) (rpl_filter.cc:421)
Analysis:
========
List of values provided for options "wild_do_table" and "wild_ignore_table" are
stored in DYNAMIC_ARRAYS. When an empty list is provided these dynamic arrays
are not initialized. Existing code treats empty element list as an error and
tries to clean the uninitialized list. This results in above valgrind issue.
Fix:
===
The clean up should be initiated only when there is an error while parsing the
'wild_do_table' or 'wild_ignore_table' list and the dynamic_array is in
initialized state. Otherwise for empty list it should simply return success.
MDEV-21605 Clean up and speed up interfaces for binary row logging
MDEV-21617 Bug fix for previous version of this code
The intention is to have as few 'if' as possible in ha_write() and
related functions. This is done by pre-calculating once per statement the
row_logging state for all tables.
Benefits are simpler and faster code both when binary logging is disabled
and when it's enabled.
Changes:
- Added handler->row_logging to make it easy to check it table should be
row logged. This also made it easier to disabling row logging for system,
internal and temporary tables.
- The tables row_logging capabilities are checked once per "statements
that updates tables" in THD::binlog_prepare_for_row_logging() which
is called when needed from THD::decide_logging_format().
- Removed most usage of tmp_disable_binlog(), reenable_binlog() and
temporary saving and setting of thd->variables.option_bits.
- Moved checks that can't change during a statement from
check_table_binlog_row_based() to check_table_binlog_row_based_internal()
- Removed flag row_already_logged (used by sequence engine)
- Moved binlog_log_row() to a handler::
- Moved write_locked_table_maps() to THD::binlog_write_table_maps() as
most other related binlog functions are in THD.
- Removed binlog_write_table_map() and binlog_log_row_internal() as
they are now obsolete as 'has_transactions()' is pre-calculated in
prepare_for_row_logging().
- Remove 'is_transactional' argument from binlog_write_table_map() as this
can now be read from handler.
- Changed order of 'if's in handler::external_lock() and wsrep_mysqld.h
to first evaluate fast and likely cases before more complex ones.
- Added error checking in ha_write_row() and related functions if
binlog_log_row() failed.
- Don't clear check_table_binlog_row_based_result in
clear_cached_table_binlog_row_based_flag() as it's not needed.
- THD::clear_binlog_table_maps() has been replaced with
THD::reset_binlog_for_next_statement()
- Added 'MYSQL_OPEN_IGNORE_LOGGING_FORMAT' flag to open_and_lock_tables()
to avoid calculating of binary log format for internal opens. This flag
is also used to avoid reading statistics tables for internal tables.
- Added OPTION_BINLOG_LOG_OFF as a simple way to turn of binlog temporary
for create (instead of using THD::sql_log_bin_off.
- Removed flag THD::sql_log_bin_off (not needed anymore)
- Speed up THD::decide_logging_format() by remembering if blackhole engine
is used and avoid a loop over all tables if it's not used
(the common case).
- THD::decide_logging_format() is not called anymore if no tables are used
for the statement. This will speed up pure stored procedure code with
about 5%+ according to some simple tests.
- We now get annotated events on slave if a CREATE ... SELECT statement
is transformed on the slave from statement to row logging.
- In the original code, the master could come into a state where row
logging is enforced for all future events if statement could be used.
This is now partly fixed.
Other changes:
- Ensure that all tables used by a statement has query_id set.
- Had to restore the row_logging flag for not used tables in
THD::binlog_write_table_maps (not normal scenario)
- Removed injector::transaction::use_table(server_id_type sid, table tbl)
as it's not used.
- Cleaned up set_slave_thread_options()
- Some more DBUG_ENTER/DBUG_RETURN, code comments and minor indentation
changes.
- Ensure we only call THD::decide_logging_format_low() once in
mysql_insert() (inefficiency).
- Don't annotate INSERT DELAYED
- Removed zeroing pos_in_table_list in THD::open_temporary_table() as it's
already 0
e.g.
- dont -> don't
- occurence -> occurrence
- succesfully -> successfully
- easyly -> easily
Also remove trailing space in selected files.
These changes span:
- server core
- Connect and Innobase storage engine code
- OQgraph, Sphinx and TokuDB storage engines
Related to MDEV-21769.
Lifted long standing limitation to the XA of rolling it back at the
transaction's
connection close even if the XA is prepared.
Prepared XA-transaction is made to sustain connection close or server
restart.
The patch consists of
- binary logging extension to write prepared XA part of
transaction signified with
its XID in a new XA_prepare_log_event. The concusion part -
with Commit or Rollback decision - is logged separately as
Query_log_event.
That is in the binlog the XA consists of two separate group of
events.
That makes the whole XA possibly interweaving in binlog with
other XA:s or regular transaction but with no harm to
replication and data consistency.
Gtid_log_event receives two more flags to identify which of the
two XA phases of the transaction it represents. With either flag
set also XID info is added to the event.
When binlog is ON on the server XID::formatID is
constrained to 4 bytes.
- engines are made aware of the server policy to keep up user
prepared XA:s so they (Innodb, rocksdb) don't roll them back
anymore at their disconnect methods.
- slave applier is refined to cope with two phase logged XA:s
including parallel modes of execution.
This patch does not address crash-safe logging of the new events which
is being addressed by MDEV-21469.
CORNER CASES: read-only, pure myisam, binlog-*, @@skip_log_bin, etc
Are addressed along the following policies.
1. The read-only at reconnect marks XID to fail for future
completion with ER_XA_RBROLLBACK.
2. binlog-* filtered XA when it changes engine data is regarded as
loggable even when nothing got cached for binlog. An empty
XA-prepare group is recorded. Consequent Commit-or-Rollback
succeeds in the Engine(s) as well as recorded into binlog.
3. The same applies to the non-transactional engine XA.
4. @@skip_log_bin=OFF does not record anything at XA-prepare
(obviously), but the completion event is recorded into binlog to
admit inconsistency with slave.
The following actions are taken by the patch.
At XA-prepare:
when empty binlog cache - don't do anything to binlog if RO,
otherwise write empty XA_prepare (assert(binlog-filter case)).
At Disconnect:
when Prepared && RO (=> no binlogging was done)
set Xid_cache_element::error := ER_XA_RBROLLBACK
*keep* XID in the cache, and rollback the transaction.
At XA-"complete":
Discover the error, if any don't binlog the "complete",
return the error to the user.
Kudos
-----
Alexey Botchkov took to drive this work initially.
Sergei Golubchik, Sergei Petrunja, Marko Mäkelä provided a number of
good recommendations.
Sergei Voitovich made a magnificent review and improvements to the code.
They all deserve a bunch of thanks for making this work done!
Remove usage of deprecated variable storage_engine. It was deprecated in 5.5 but
it never issued a deprecation warning. Make it issue a warning in 10.5.1.
Replaced with default_storage_engine.
Fix:
===
Add "REPLICA" as an alias for "SLAVE". All commands which use "SLAVE" keyword
can be used with new alias "REPLICA".
List of commands:
On Master:
=========
SHOW REPLICA HOSTS <--> SHOW SLAVE HOSTS
Privilege "SLAVE" <--> "REPLICA"
On Slave:
=========
START SLAVE <--> START REPLICA
START ALL SLAVES <--> START ALL REPLICAS
START SLAVE UNTIL <--> START REPLICA UNTIL
STOP SLAVE <--> STOP REPLICA
STOP ALL SLAVES <--> STOP ALL REPLICAS
RESET SLAVE <--> RESET REPLICA
RESET SLAVE ALL <--> RESET REPLICA ALL
SLAVE_POS <--> REPLICA_POS
Parallel slave server shutdown found to be hanging in
close_connections() triggered by shutdown due to a slave worker thread
would not be notified to exit in case the worker was sitting idle.
Fixed with destroying the worker pool earlier that is in
slave_prepare_for_shutdown() when all their driver threads have already left.
A test file is added to simulate the bug condition as well as check
multi-sourced and not-idle worker cases.
Analysis:
========
'max_binlog_cache_size' is configured and a huge transaction is executed. When
the transaction specific events size exceeds 'max_binlog_cache_size' the event
cannot be written to the binary log cache and cache write error is raised.
Upon cache write error the statement is rolled back and the transaction cache
should be truncated to a previous statement specific position. The truncate
operation should reset the cache to earlier valid positions and flush the new
changes. Even though the flush is successful the cache write error is still in
marked state. The truncate code interprets the cache write error as cache flush
failure and returns abruptly without modifying the write cache parameters.
Hence cache is in a invalid state. When a COMMIT statement is executed in this
session it tries to flush the contents of transaction cache to binary log.
Since cache has partial events the cache write operation will report
'writer.remains' assert.
Fix:
===
Binlog truncate function resets the cache to a specified size. As a first step
of truncation, clear the cache write error flag that was raised during earlier
execution. With this new errors that surface during cache truncation can be
clearly identified.
Description:
============
To change 'CONSERVATIVE' @@global.slave_parallel_mode default to 'OPTIMISTIC'
in 10.5.
@sql/sys_vars.cc
Changed default parallel_mode to 'OPTIMISTIC'
@sql/rpl_filter.cc
Changed default parallel_mode to 'OPTIMISTIC'
@sql/mysqld.cc
Removed the initialization of 'SLAVE_PARALLEL_CONSERVATIVE' to
'opt_slave_parallel_mode' variable.
@mysql-test/suite/rpl/t/rpl_parallel_mdev6589.test
@mysql-test/suite/rpl/t/rpl_mdev6386.test
Added 'mtr' suppression to ignore 'ER_PRIOR_COMMIT_FAILED'. In case of
'OPTIMISTIC' mode if a transaction gets killed during "wait_for_prior_commit"
it results in above error "1964". Hence suppression needs to be added for this
error.
@mysql-test/suite/rpl/t/rpl_parallel_conflicts.test
Test has a 'slave.opt' which explicitly sets slave_parallel_mode to
'conservative'. When the test ends this mode conflicts with new default mode.
Hence check test case reports an error. The 'slave.opt' is removed and options
are set and reset within test.
@mysql-test/suite/multi_source/info_logs.result
@mysql-test/suite/multi_source/reset_slave.result
@mysql-test/suite/multi_source/simple.result
Result content mismatch in "show slave status" output. This is expected as new
slave_parallel_mode='OPTIMISTIC'.
@mysql-test/include/check-testcase.test
Updated default 'slave_parallel_mode' to 'optimistic'.
Refactored rpl_parallel.test into following test cases.
Test case 1: @mysql-test/suite/rpl/t/rpl_parallel_domain.test
Test case 2: @mysql-test/suite/rpl/t/rpl_parallel_domain_slave_single_grp.test
Test case 3: @mysql-test/suite/rpl/t/rpl_parallel_single_grpcmt.test
Test case 4: @mysql-test/suite/rpl/t/rpl_parallel_stop_slave.test
Test case 5: @mysql-test/suite/rpl/t/rpl_parallel_slave_bgc_kill.test
Test case 6: @mysql-test/suite/rpl/t/rpl_parallel_gco_wait_kill.test
Test case 7: @mysql-test/suite/rpl/t/rpl_parallel_free_deferred_event.test
Test case 8: @mysql-test/suite/rpl/t/rpl_parallel_missed_error_handling.test
Test case 9: @mysql-test/suite/rpl/t/rpl_parallel_innodb_lock_conflict.test
Test case 10: @mysql-test/suite/rpl/t/rpl_parallel_gtid_slave_pos_update_fail.test
Test case 11: @mysql-test/suite/rpl/t/rpl_parallel_wrong_exec_master_pos.test
Test case 12: @mysql-test/suite/rpl/t/rpl_parallel_partial_binlog_trans.test
Test case 13: @mysql-test/suite/rpl/t/rpl_parallel_ignore_error_on_rotate.test
Test case 14: @mysql-test/suite/rpl/t/rpl_parallel_wrong_binlog_order.test
Test case 15: @mysql-test/suite/rpl/t/rpl_parallel_incorrect_relay_pos.test
Test case 16: @mysql-test/suite/rpl/t/rpl_parallel_retry_deadlock.test
Test case 17: @mysql-test/suite/rpl/t/rpl_parallel_deadlock_corrupt_binlog.test
Test case 18: @mysql-test/suite/rpl/t/rpl_parallel_mode.test
Test case 19: @mysql-test/suite/rpl/t/rpl_parallel_analyze_table_hang.test
Test case 20: @mysql-test/suite/rpl/t/rpl_parallel_record_gtid_wakeup.test
Test case 21: @mysql-test/suite/rpl/t/rpl_parallel_stop_on_con_kill.test
Test case 22: @mysql-test/suite/rpl/t/rpl_parallel_rollback_assert.test