LIMIT history switching requires the number of history partitions to
be marked for read: from first to last non-empty plus one empty. The
least we can do is to fail with error message if the needed partition
was not marked for read. As this is handler interface we require new
handler error code to display user-friendly error message.
Switching by INTERVAL works out-of-the-box with
ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET error.
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.
TOI replication is used, in this approach, purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.
This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
in about a hundred of users of MY_BITMAP, only two were using its
built-in mutex, and only one of those two was actually needing it.
Remove the mutex from MY_BITMAP, remove all associated conditions
and checks in bitmap functions. Use an external LOCK_temp_pool
mutex and temp_pool_set_next/temp_pool_clear_bit acccessors.
Remove bitmap_init/bitmap_free, always use my_* versions.
1. work around MDEV-25912 to not apply assert
at wsrep running time;
2. handle wsrep mode of the server recovery
3. convert hton calls to static binlog_commit ones.
4. satisfy MSAN complain on uninitialized std::pair
Problem:
=======
When the semisync master is crashed and restarted as slave it could
recover transactions that former slaves may never have seen.
A known method existed to clear out all prepared transactions
with --tc-heuristic-recover=rollback does not care to adjust
binlog accordingly.
Fix:
===
The binlog-based recovery is made to concern of the slave semisync role of
post-crash restarted server.
No changes in behavior is done to the "normal" binloggging server
and the semisync master.
When the restarted server is configured with
--rpl-semi-sync-slave-enabled=1
the refined recovery attempts to roll back prepared transactions
and truncate binlog accordingly.
In case of a partially committed (that is committed at least
in one of the engine participants) such transaction gets committed.
It's guaranteed no (partially as well) committed transactions
exist beyond the truncate position.
In case there exists a non-transactional replication event
(being in a way a committed transaction) past the
computed truncate position the recovery ends with an error.
As after master crash and failover to slave, the demoted-to-slave
ex-master must be ready to face and accept its own (generated by)
events, without generally necessary --replicate-same-server-id.
So the acceptance conditions are relaxed for the semisync slave
to accept own events without that option.
While gtid_strict_mode ON ensures no duplicate transaction can be
(re-)executed the master_use_gtid=none slave has to be
configured with --replicate-same-server-id.
*NOTE* for reviewers.
This patch does not handle the user XA which is done
in next git commit.
- Major rewrite of ddl_log.cc and ddl_log.h
- ddl_log.cc described in the beginning how the recovery works.
- ddl_log.log has unique signature and is dynamic. It's easy to
add more information to the header and other ddl blocks while still
being able to execute old ddl entries.
- IO_SIZE for ddl blocks is now dynamic. Can be changed without affecting
recovery of old logs.
- Code is more modular and is now usable outside of partition handling.
- Renamed log file to dll_recovery.log and added option --log-ddl-recovery
to allow one to specify the path & filename.
- Added ddl_log_entry_phase[], number of phases for each DDL action,
which allowed me to greatly simply set_global_from_ddl_log_entry()
- Changed how strings are stored in log entries, which allows us to
store much more information in a log entry.
- ddl log is now always created at start and deleted on normal shutdown.
This simplices things notable.
- Added probes debug_crash_here() and debug_simulate_error() to simply
crash testing and allow crash after a given number of times a probe
is executed. See comments in debug_sync.cc and rename_table.test for
how this can be used.
- Reverting failed table and view renames is done trough the ddl log.
This ensures that the ddl log is tested also outside of recovery.
- Added helper function 'handler::needs_lower_case_filenames()'
- Extend binary log with Q_XID events. ddl log handling is using this
to check if a ddl log entry was logged to the binary log (if yes,
it will be deleted from the log during ddl_log_close_binlogged_events()
- If a DDL entry fails 3 time, disable it. This is to ensure that if
we have a crash in ddl recovery code the server will not get stuck
in a forever crash-restart-crash loop.
mysqltest.cc changes:
- --die will now replace $variables with their values
- $error will contain the error of the last failed statement
storage engine changes:
- maria_rename() was changed to be more robust against crashes during
rename.
This change is to get rid of randomly failing tests, especially those
that reads random position of the binary log. From looking at the logs
it's clear that some failures is because of a read char (with value >= 128)
is converted to a big long value. Using uchar everywhere makes this much
less likely to happen.
Another benefit is that a lot of cast of char to uchar could be removed.
Other things:
- Removed some extra space before '=' and '+=' in assignments
- Fixed indentations and lines > 80 characters
- Replace '16' with 'element_size' (from class definition) in
Gtid_list_log_event()
Problem:
=======
In slave_parallel_mode=optimistic configuration, when admin commands and
DML operation on the same table are scheduled simultaneously for execution,
it results in lock conflict and slave server either hangs due to
deadlock or goes down with an assert.
Analysis:
========
Admin commands OPTIMIZE, REPAIR and ANALYZE are written to binary log as
ordinary transactions. When 'slave_parallel_mode' is 'optimistic' DMLs are
allowed to run in parallel. But these locks are not detected by parallel
replication deadlock detection-and-handling mechanism. At times they result
in deadlock or assertion.
Fix:
===
Flag admin commands as DDL in Gtid_log_event at the time of writing to
binary log. Add a new bit EXECUTED_TABLE_ADMIN_CMD to
'm_unsafe_rollback_flags'. During 'mysql_admin_table' command execution it
accepts a list of tables to be processed and executes them in a loop. Upon
successful execution enable 'EXECUTED_TABLE_ADMIN_CMD' bit in
thd->transaction.stmt_unsafe_rollback_flags. Gtid_log_event constructor
will notice this flag and mark the current transaction with 'FL_DDL' flag.
Gtid_log_events marked as FL_DDL will not be scheduled parallel execution,
on the slave. They will execute in isolation to prevent deadlocks.
Note: Removed the call to 'trans_commit_implicit' from 'mysql_admin_table'
function as 'mysql_execute_command' will take care of invoking
'trans_commit_implicit'.
Problem:
========
180511 11:07:58 [ERROR] Slave I/O: Unexpected master's heartbeat data:
heartbeat is not compatible with local info;the event's data: log_file_name
mysql-bin.000009 log_pos 1054262041, Error_code: 1623
Analysis:
=========
In replication setup when master server doesn't have any events to send to
slave server it sends an 'Heartbeat_log_event'. This event carries the
current binary log filename and offset details. The offset values is stored
within 4 bytes of event header. When the size of binary log is higher than
UINT32_MAX the log_pos values will not fit in 4 bytes memory. It overflows
and hence slave stops with an error.
Fix:
===
Since we cannot extend the common_header of Log_event class, a greater than
4GB value of Log_event::log_pos is made to be transported with a HeartBeat
event's sub-header. Log_event::log_pos in such case is set to zero to
indicate that the 8 byte sub-header is allocated in the event.
In case of cross version replication following behaviour is expected
OLD - Server without fix
NEW - Server with fix
OLD<->NEW : works bidirectionally as long as the binlog offset is
(normally) within 4GB.
When log_pos > UINT32_MAX
OLD->NEW : The 'log_pos' is bound to overflow and NEW slave may report
an invalid event/incompatible heart beat event error.
NEW->OLD : Since patched server sets log_pos=0 on overflow, OLD slave will
report invalid event error.
The assertion failed in handler::ha_reset upon SELECT under
READ UNCOMMITTED from table with index on virtual column.
This was the debug-only failure, though the problem is mush wider:
* MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw
bitmap.
* read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP
* The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE
* The pointers to the stored MY_BITMAPs, like orig_read_set etc, and
sometimes all_set and tmp_set, are assigned to the pointers.
* Sometimes tmp_use_all_columns is used to substitute the raw bitmap
directly with all_set.bitmap
* Sometimes even bitmaps are directly modified, like in
TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called.
The last three bullets in the list, when used together (which is mostly
always) make the program flow cumbersome and impossible to follow,
notwithstanding the errors they cause, like this MDEV-17556, where tmp_set
pointer was assigned to read_set, write_set and vcol_set, then its bitmap
was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call,
and then bitmap_clear_all(&tmp_set) was applied to all this.
To untangle this knot, the rule should be applied:
* Never substitute bitmaps! This patch is about this.
orig_*, all_set bitmaps are never substituted already.
This patch changes the following function prototypes:
* tmp_use_all_columns, dbug_tmp_use_all_columns
to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map*
* tmp_restore_column_map, dbug_tmp_restore_column_maps to accept
MY_BITMAP* instead of my_bitmap_map*
These functions now will substitute read_set/write_set/vcol_set directly,
and won't touch underlying bitmaps.
Fix for MDEV-23033 fixes a problem in replication applying of transactions, which contain cascading foreign key delete for a table, which has indexed virtual column.
This fix adds slave_fk_event_map flag for table, to mark when the prelocking is needed for applying of a transaction.
See commit 608b0ee52e for more details.
However, this fix is targeted for async replication only, Rows_log_event::do_apply_event() has condition to rule out galera replication from the fix domain, and use cases suffering from MDEV-23033 and related MDEV-21153 will fail in galera cluster.
The fix in this commit removes the condition to rule out the setting of slave_fk_event_map flag from galera replication, and makes the fix in MDEV-23033 effective for galera replication as well.
However, the above fix has caused regressions for some galera_sr suite tests, which run tests for streaming replication.
This regression can be observed e.g. by: /mtr galera_sr.galera_sr_multirow_rollback --mysqld=--slave_run_triggers_for_rbr=yes
These galera_sr suite tests were failing in last phase of replication applying, where actual transaction is already applied, and streaming replication related meta data needs to be updated in wsrep system tables.
Opening the wsrep system tables failed for corrupt data in THD::lex:query_tables_list. The fix in this commit uses back query table list for the duration of fragment update operation.
Finally, a mtr test for virtual column support has been added. galera.galera_virtual_column.test has as first test a scenario from MDEV-21153
new fix
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Fix for MDEV-23033 fixes a problem in replication applying of transactions, which contain cascading foreign key delete for a table, which has indexed virtual column.
This fix adds slave_fk_event_map flag for table, to mark when the prelocking is needed for applying of a transaction.
See commit 608b0ee52e for more details.
However, this fix is targeted for async replication only, Rows_log_event::do_apply_event() has condition to rule out galera replication from the fix domain, and use cases suffering from MDEV-23033 and related MDEV-21153 will fail in galera cluster.
The fix in this commit removes the condition to rule out the setting of slave_fk_event_map flag from galera replication, and makes the fix in MDEV-23033 effective for galera replication as well.
Finally, a mtr test for virtual column support has been added. galera.galera_virtual_column.test has as first test a scenario from MDEV-21153
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Problem:
=======
Upon deleting or updating a row in a parent table (with primary key), if
the child table has virtual column and an associated key with ON UPDATE
CASCADE/ON DELETE CASCADE, it will result in slave crash.
Analysis:
========
Tables which are related through foreign key require prelocking similar to
triggers. i.e If a table has triggers/foreign keys we should add all tables
and routines used by them to the prelocking set. This prelocking happens
during 'open_and_lock_tables' call. Each table being opened is checked for
foreign key references. If foreign key reference exists then the child
table is opened and it is linked to the table_list. Upon any modification
to parent table its corresponding child tables are retried from table_list
and they are updated accordingly. This prelocking work fine on master.
On slave prelocking works for following cases.
- Statement/mixed based replication
- In row based replication when trigger execution is enabled through
'slave_run_triggers_for_rbr=YES/LOGGING/ENFORCE'
Otherwise it results in an assert/crash, as the parent table will not find
the corresponding child table and it will be NULL. Dereferencing NULL
pointer leads to slave server exit.
Fix:
===
Introduce a new 'slave_fk_event_map' flag similar to 'trg_event_map'. This
flag will ensure that when foreign key is enabled in row based replication
all the parent and child tables are prelocked, so that parent is able to
locate the child table.
Note: This issue is specific to slave, hence only slave needs to be
upgraded.
The reason for the failure is that
thd->mdl_context.release_transactional_locks()
was called after commit & rollback even in cases where the current
transaction is still active.
For 10.2, 10.3 and 10.4 the fix is simple:
- Replace all calls to thd->mdl_context.release_transactional_locks() with
thd->release_transactional_locks(). The thd function will only call
the mdl_context function if there are no active transactional locks.
In 10.6 we will better fix where we will change the return value for
some trans_xxx() functions to indicate if transaction did close the
transaction or not. This will avoid the need of the indirect call.
Other things:
- trans_xa_commit() and trans_xa_rollback() will automatically
call release_transactional_locks() if the transaction is closed.
- We can't do that for the other functions as the caller of many of these
are doing additional work (like close_thread_tables) before calling
release_transactional_locks().
- Added missing abort_result_set() and missing DBUG_RETURN in
select_create::send_eof()
- Fixed wrong indentation in injector::transaction::commit()
The crash was caused by improper raising of an error or replication checksum
verification at time of the server initialization. As there is no THD object
associated with the main initializing thread yet the error text should be
assigned with calling a respective macro that is aware of that possibility.
Fixed accordingly.
[At merging to 10.4 the new test result file needs
+# restart: --master_verify_checksum=ON --debug_dbug=+d,corrupt_read_log_event_char
that mtr run will hint on.]
Analysis:
========
"mysqlbinlog -v" option will reconstruct row events and display them as
commented SQL statements. If this option is given twice, the output includes
comments to indicate column data types and some metadata.
`log_event_print_value` is the function reponsible for printing values and
their types. This function doesn't handle GEOMETRY type. Hence the above error
gets printed.
Fix:
===
Add support for GEOMETRY datatype.
(This commit is exclusively for 10.1 branch, do not merge it to upper ones)
In case of a pattern of non-STMT_END-marked Rows-log-event (A) followed by
a STMT_END marked one (B) mysqlbinlog mixes up the base64 encoded rows events
with their pseudo sql representation produced by the verbose option:
BINLOG '
base64 encoded data for A
### verbose section for A
base64 encoded data for B
### verbose section for B
'/*!*/;
In effect the produced BINLOG '...' query is not valid and is rejected with the error.
Examples of this way malformed BINLOG could have been found in binlog_row_annotate.result
that gets corrected with the patch.
The issue is fixed with introduction an auxiliary IO_CACHE to hold on the verbose
comments until the terminal STMT_END event is found. The new cache is emptied
out after two pre-existing ones are done at that time.
The correctly produced output now for the above case is as the following:
BINLOG '
base64 encoded data for A
base64 encoded data for B
'/*!*/;
### verbose section for A
### verbose section for B
Thanks to Alexey Midenkov for the problem recognition and attempt to tackle,
Venkatesh Duggirala who produced a patch for the upstream whose
idea is exploited here, as well as to MDEV-23077 reporter LukeXwang who
also contributed a piece of a patch aiming at this issue.
Extra: mysqlbinlog_row_minimal refined to not produce mutable numeric values into the result file.
(This commit is for 10.3 and upper branches)
In case of a pattern of non-STMT_END-marked Rows-log-event (A) followed by
a STMT_END marked one (B) mysqlbinlog mixes up the base64 encoded rows events
with their pseudo sql representation produced by the verbose option:
BINLOG '
base64 encoded data for A
### verbose section for A
base64 encoded data for B
### verbose section for B
'/*!*/;
In effect the produced BINLOG '...' query is not valid and is rejected with the error.
Examples of this way malformed BINLOG could have been found in binlog_row_annotate.result
that gets corrected with the patch.
The issue is fixed with introduction an auxiliary IO_CACHE to hold on the verbose
comments until the terminal STMT_END event is found. The new cache is emptied
out after two pre-existing ones are done at that time.
The correctly produced output now for the above case is as the following:
BINLOG '
base64 encoded data for A
base64 encoded data for B
'/*!*/;
### verbose section for A
### verbose section for B
Thanks to Alexey Midenkov for the problem recognition and attempt to tackle,
and to Venkatesh Duggirala who produced a patch for the upstream whose
idea is exploited here, as well as to MDEV-23077 reporter LukeXwang who
also contributed a piece of a patch aiming at this issue.
(This commit is exclusively for 10.2 branch. Do not merge it to 10.3)
In case of a pattern of non-STMT_END-marked Rows-log-event (A) followed by
a STMT_END marked one (B) mysqlbinlog mixes up the base64 encoded rows events
with their pseudo sql representation produced by the verbose option:
BINLOG '
base64 encoded data for A
### verbose section for A
base64 encoded data for B
### verbose section for B
'/*!*/;
In effect the produced BINLOG '...' query is not valid and is rejected with the error.
Examples of this way malformed BINLOG could have been found in binlog_row_annotate.result
that gets corrected with the patch.
The issue is fixed with introduction an auxiliary IO_CACHE to hold on the verbose
comments until the terminal STMT_END event is found. The new cache is emptied
out after two pre-existing ones are done at that time.
The correctly produced output now for the above case is as the following:
BINLOG '
base64 encoded data for A
base64 encoded data for B
'/*!*/;
### verbose section for A
### verbose section for B
Thanks to Alexey Midenkov for the problem recognition and attempt to tackle,
and to Venkatesh Duggirala who produced a patch for the upstream whose
idea is exploited here, as well as to MDEV-23077 reporter LukeXwang who
also contributed a piece of a patch aiming at this issue.