Lifted long standing limitation to the XA of rolling it back at the
transaction's
connection close even if the XA is prepared.
Prepared XA-transaction is made to sustain connection close or server
restart.
The patch consists of
- binary logging extension to write prepared XA part of
transaction signified with
its XID in a new XA_prepare_log_event. The concusion part -
with Commit or Rollback decision - is logged separately as
Query_log_event.
That is in the binlog the XA consists of two separate group of
events.
That makes the whole XA possibly interweaving in binlog with
other XA:s or regular transaction but with no harm to
replication and data consistency.
Gtid_log_event receives two more flags to identify which of the
two XA phases of the transaction it represents. With either flag
set also XID info is added to the event.
When binlog is ON on the server XID::formatID is
constrained to 4 bytes.
- engines are made aware of the server policy to keep up user
prepared XA:s so they (Innodb, rocksdb) don't roll them back
anymore at their disconnect methods.
- slave applier is refined to cope with two phase logged XA:s
including parallel modes of execution.
This patch does not address crash-safe logging of the new events which
is being addressed by MDEV-21469.
CORNER CASES: read-only, pure myisam, binlog-*, @@skip_log_bin, etc
Are addressed along the following policies.
1. The read-only at reconnect marks XID to fail for future
completion with ER_XA_RBROLLBACK.
2. binlog-* filtered XA when it changes engine data is regarded as
loggable even when nothing got cached for binlog. An empty
XA-prepare group is recorded. Consequent Commit-or-Rollback
succeeds in the Engine(s) as well as recorded into binlog.
3. The same applies to the non-transactional engine XA.
4. @@skip_log_bin=OFF does not record anything at XA-prepare
(obviously), but the completion event is recorded into binlog to
admit inconsistency with slave.
The following actions are taken by the patch.
At XA-prepare:
when empty binlog cache - don't do anything to binlog if RO,
otherwise write empty XA_prepare (assert(binlog-filter case)).
At Disconnect:
when Prepared && RO (=> no binlogging was done)
set Xid_cache_element::error := ER_XA_RBROLLBACK
*keep* XID in the cache, and rollback the transaction.
At XA-"complete":
Discover the error, if any don't binlog the "complete",
return the error to the user.
Kudos
-----
Alexey Botchkov took to drive this work initially.
Sergei Golubchik, Sergei Petrunja, Marko Mäkelä provided a number of
good recommendations.
Sergei Voitovich made a magnificent review and improvements to the code.
They all deserve a bunch of thanks for making this work done!
This patch changes how old rows in mysql.gtid_slave_pos* tables are deleted.
Instead of doing it as part of every replicated transaction in
record_gtid(), it is done periodically (every @@gtid_cleanup_batch_size
transaction) in the slave background thread.
This removes the deletion step from the replication process in SQL or worker
threads, which could speed up replication with many small transactions. It
also decreases contention on the global mutex LOCK_slave_state. And it
simplifies the logic, eg. when a replicated transaction fails after having
deleted old rows.
With this patch, the deletion of old GTID rows happens asynchroneously and
slightly non-deterministic. Thus the number of old rows in
mysql.gtid_slave_pos can temporarily exceed @@gtid_cleanup_batch_size. But
all old rows will be deleted eventually after sufficiently many new GTIDs
have been replicated.
This would happen especially in optimistic parallel replication, where there
is a good chance that a transaction will be rolled back (due to conflicts)
after it has executed record_gtid(). If the transaction did any deletions of
old rows as part of record_gtid(), those deletions will be undone as well.
And the code did not properly ensure that the deletions would be re-tried.
This patch makes record_gtid() remember the list of deletions done as part
of a transaction. Then in rpl_slave_state::update() when the changes have
been committed, we discard the list. However, in case of error and rollback,
in cleanup_context() we will instead put the list back into
rpl_global_gtid_slave_state so that the deletions will be re-tried later.
Probably fixes part of the cause of MDEV-12147 as well.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Replicated transaction extra gtid statement on slave failed to specify
an engine gtid_slave_pos name correctly. In case lower-case-table-names > 0
the InnoDB table name was generated to reproduce the lower-case-table-names=0 version
which is of mixed cases.
In rpl.rpl_mdev12179 test run this triggered a failure to DROP table which
was due to the innodb table handle was not closed:
InnoDB: Waited XYZ seconds for ref-count on table: `mysql`.`gtid_slave_pos_innodb`
on windows.
The closing issue was caused by having the table registered twice in the table cache,
for its lower- and mixed- case name versions. The DROP-table handler closed only
only one of the cache item to leave the 2nd one active.
(On Linux a failure occurs earlier at attempt to open an expected lower-cased table:
Last_Error: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos_InnoDB' doesn't exist
but the table's name as the message shows is not in the right case).
Fixed with consulting lower-case-table-names when the engine gtid-slave-pos table
is created.
Note the lower-case-table-names=a-value created table will not recognized when next
the lower case option changes to a different value.
In 10.4 a follow-up patch is going to lowercase gtid-slave-pos autocreated table
at once at their origination, and a warning is issued in the 10.3 current patch.
This was done in, among other things:
- thd->db and thd->db_length
- TABLE_LIST tablename, db, alias and schema_name
- Audit plugin database name
- lex->db
- All db and table names in Alter_table_ctx
- st_select_lex db
Other things:
- Changed a lot of functions to take const LEX_CSTRING* as argument
for db, table_name and alias. See init_one_table() as an example.
- Changed some function arguments from LEX_CSTRING to const LEX_CSTRING
- Changed some lists from LEX_STRING to LEX_CSTRING
- threads_mysql.result changed because process list_db wasn't always
correctly updated
- New append_identifier() function that takes LEX_CSTRING* as arguments
- Added new element tmp_buff to Alter_table_ctx to separate temp name
handling from temporary space
- Ensure we store the length after my_casedn_str() of table/db names
- Removed not used version of rename_table_in_stat_tables()
- Changed Natural_join_column::table_name and db_name() to never return
NULL (used for print)
- thd->get_db() now returns db as a printable string (thd->db.str or "")
and specifically the ack receiving functionality.
Semisync is turned to be static instead of plugin so its functions
are invoked at the same points as RUN_HOOKS.
The RUN_HOOKS and the observer interface remain to be removed by later
patch.
Todo:
React on killed status by repl_semisync_master.wait_after_sync(). Currently
Repl_semi_sync_master::commit_trx does not check the killed status.
There were few bugfixes found that are present in mysql and its unclear
whether/how they are covered. Those include:
Bug#15985893: GTID SKIPPED EVENTS ON MASTER CAUSE SEMI SYNC TIME-OUTS
Bug#17932935 CALLING IS_SEMI_SYNC_SLAVE() IN EACH FUNCTION CALL
HAS BAD PERFORMANCE
Bug#20574628: SEMI-SYNC REPLICATION PERFORMANCE DEGRADES WITH A HIGH NUMBER OF THREADS
Part of MDEV-13073 AliSQL Optimize performance of semisync
The idea it to use a dedicated lock detecting if there is new data in
the master's binary log instead of the overused LOCK_log.
Changes:
- Use dedicated COND variables for the relay and binary log signaling.
This was needed as we where the old 'update_cond' variable was used
with different mutex's, which could cause deadlocks.
- Relay log uses now COND_relay_log_updated and LOCK_log
- Binary log uses now COND_bin_log_updated and LOCK_binlog_end_pos
- Renamed signal_cnt to relay_signal_cnt (as we now have two signals)
- Added some missing error handling in MYSQL_BIN_LOG::new_file_impl()
- Reformatted some comments with old style
- Renamed m_key_LOCK_binlog_end_pos to key_LOCK_binlog_end_pos
- Changed 'signal_update()' to update_binlog_end_pos() which works for
both relay and binary log
- Added sql/mariadb.h file that should be included first by files in sql
directory, if sql_plugin.h is not used (sql_plugin.h adds SHOW variables
that must be done before my_global.h is included)
- Removed a lot of include my_global.h from include files
- Removed include's of some files that my_global.h automatically includes
- Removed duplicated include's of my_sys.h
- Replaced include my_config.h with my_global.h
Intermediate commit.
Fix compilation failure with different my_atomic implementation.
The my_atomic_loadptr* takes void ** as first argument, so variables
updated with it needs to be void * (it is not legal C to cast
some_type ** to void **).
Intermediate commit.
Move the discovery of mysql.gtid_slave_pos* tables into the SQL thread.
This avoids doing things like opening tables and scanning the mysql
schema for tables inside of the START SLAVE statement, which might
interact badly with existing transaction or table locks.
(Even though START SLAVE is documented to implicitly commit any active
transactions, this appears not to be the case in current code).
Table discovery fits naturally in the SQL thread init code, next to
the loading of mysql.gtid_slave_pos state.
Intermediate commit.
Implement auto-creation of mysql.gtid_slave_pos* tables with needed engines,
if listed in --gtid-pos-auto-engines.
Uses an asynchronous approach to minimise locking overhead.
The list of available tables is extended with a flag. Extra entries are
added for --gtid-pos-auto-engines tables that do not exist yet, marked as
not existing but ready for auto-creation.
If record_gtid() needs a table marked for auto-creation, it sends a request
to the slave background thread to create the table, and continues to use an
existing table for the current and immediately coming transactions.
As soon as the slave background thread has made the new table available, it
will be used for all subsequent relevant transactions in record_gtid().
This asynchronous approach also avoids a lot of complex issues around trying
to do DDL in the middle of an on-going transaction.
Intermediate commit.
This commit implements that record_gtid() selects a gtid_slave_posXXX table
with a storage engine already in use by current transaction, if any.
The default table mysql.gtid_slave_pos is used if no match can be found on
storage engine, or for GTID position updates with no specific storage
engine.
Table discovery of mysql.gtid_slave_pos* happens on initial GTID state load
as well as on every START SLAVE. Some effort is made to make this possible
without additional locking. New tables are added using lock-free atomics.
Removing tables requires stopping all slaves first. A warning is given in
the error log when a table is removed but a non-stopped slave still has a
reference to it.
If multiple mysql.gtid_slave_posXXX tables with same storage engine exist,
one is chosen arbitrarily to be used, with a warning in the error log. GTID
data from all tables is still read, but only one among redundant tables with
same storage engine will be updated.
Intermediate commit.
For each GTID recorded in mysq.gtid_slave_pos, keep track of which
engine the update was made in.
This will be later used to know which rows can be deleted in the table
of a given engine.