MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY
Move a test from innodb.rename_table_debug to innodb.alter_copy.
ha_innobase::extra(HA_EXTRA_BEGIN_ALTER_COPY): Register id-versioned
tables so that mysql.transaction_registry will be updated, even for
empty tables that are subjected to ALTER TABLE…ALGORITHM=COPY.
If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend
a lot of time rolling back writes to the intermediate copy of the table.
To reduce the amount of busy work done, a work-around was introduced in
commit fd069e2bb3 in MySQL 4.1.8 and 5.0.2,
to commit the transaction after every 10,000 inserted rows.
A proper fix would have been to disable the undo logging altogether and
to simply drop the intermediate copy of the table on subsequent server
startup. This is what happens in MariaDB 10.3 with MDEV-14717,MDEV-14585.
In MariaDB 10.2, the intermediate copy of the table would be left behind
with a name starting with the string #sql.
This is a backport of a bug fix from MySQL 8.0.0 to MariaDB,
contributed by jixianliang <271365745@qq.com>.
Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation
InnoDB must for now keep the undo logging enabled, so that the latest
row can be rolled back in case of an error.
In Galera cluster, the LOAD DATA statement will retain the existing
behaviour and commit the transaction after every 10,000 rows if
the parameter wsrep_load_data_splitting=ON is set. The logic to do
so (the wsrep_load_data_split() function and the call
handler::extra(HA_EXTRA_FAKE_START_STMT)) are joint work
by Ji Xianliang and Marko Mäkelä.
The original fix:
Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530
Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY
Problem:
During ALTER TABLE, we commit and restart the transaction for every
10,000 rows, so that the rollback after recovery would not take so long.
Fix:
Suppress the undo logging during copy alter operation. If fts_index is
present then insert directly into fts auxiliary table rather
than doing at commit time.
ha_innobase::num_write_row: Remove the variable.
ha_innobase::write_row(): Remove the hack for committing every 10000 rows.
row_lock_table_for_mysql(): Remove the extra 2 parameters.
lock_get_src_table(), lock_is_table_exclusive(): Remove.
Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
Only invoke set_versioned() on trx_id versioned tables.
dict_table_t::versioned_by_id(): New accessor, to determine if
a table is system versioned by transaction ID.
InnoDB RNG maintains global state, causing otherwise unnecessary bus
traffic. Even worse this is cross-mutex traffic. That is different
mutexes suffer from contention.
Fixed delay of 4 was verified to give best throughput by OLTP update
index and read-write benchmarks on Intel Broadwell (2/20/40) and
ARM (1/46/46).
Traditionally, DROP TABLE and TRUNCATE TABLE discarded any locks that
may have been held on the table. This feels like an ACID violation.
Probably most occurrences of it were prevented by meta-data locks (MDL)
which were introduced in MySQL 5.5.
dict_table_t::n_foreign_key_checks_running: Reduce the number of
non-debug checks.
lock_remove_all_on_table(), lock_remove_all_on_table_for_trx(): Remove.
ha_innobase::truncate(): Acquire an exclusive InnoDB table lock
before proceeding. DROP TABLE and DISCARD/IMPORT were already doing
this.
row_truncate_table_for_mysql(): Convert the already started transaction
into a dictionary operation, and do not invoke lock_remove_all_on_table().
row_mysql_table_id_reassign(): Do not call lock_remove_all_on_table().
This function is only used in ALTER TABLE...DISCARD/IMPORT TABLESPACE,
which is already holding an exclusive InnoDB table lock.
TODO: Make n_foreign_key_checks running a debug-only variable.
This would require two fixes:
(1) DROP TABLE: Exclusively lock the table beforehand, to prevent
the possibility of concurrently running foreign key checks (which
would acquire a table IS lock and then record S locks).
(2) RENAME TABLE: Find out if n_foreign_key_checks_running>0 actually
constitutes a potential problem.
innobase_commit_by_xid(), innobase_rollback_by_xid(): Decrement
the reference count before freeing the transaction object to the pool.
Failure to do so might corrupt the transaction bookkeeping
if trx_create_low() returns the same object to another thread
before we are done with it.
trx_sys_close(): Detach the recovered XA PREPARE transactions from
trx_sys->rw_trx_list before freeing them.
This will allow us to reduce critical section protected by
trx_sys.mutex:
- no need to maintain global m_free list
- eliminate if (trx->read_view == NULL) condition.
On x86_64 sizeof(Readview) is 144 mostly due to padding, sizeof(trx_t)
with ReadView is 1200.
Also don't close ReadView for read-write transactions, just mark it
closed similarly to read-only.
Clean-up: removed n_prepared_recovered_trx and n_prepared_trx, which
accidentally re-appeared after some rebase.
MDEV-14511 tried to avoid some consistency problems related to InnoDB
persistent statistics. The persistent statistics are being written by
an InnoDB internal SQL interpreter that requires the InnoDB data dictionary
cache to be locked.
Before MDEV-14511, the statistics were written during DDL in separate
transactions, which could unnecessarily reduce performance (each commit
would require a redo log flush) and break atomicity, because the statistics
would be updated separately from the dictionary transaction.
However, because it is unacceptable to hold the InnoDB data dictionary
cache locked while suspending the execution for waiting for a
transactional lock (in the mysql.innodb_index_stats or
mysql.innodb_table_stats tables) to be released, any lock conflict
was immediately be reported as "lock wait timeout".
To fix MDEV-14941, an attempt to reduce these lock conflicts by acquiring
transactional locks on the user tables in both the statistics and DDL
operations was made, but it would still not entirely prevent lock conflicts
on the mysql.innodb_index_stats and mysql.innodb_table_stats tables.
Fixing the remaining problems would require a change that is too intrusive
for a GA release series, such as MariaDB 10.2.
Thefefore, we revert the change MDEV-14511. To silence the
MDEV-13201 assertion, we use the pre-existing flag trx_t::internal.
trx->read_view|= 1 was done in a silly attempt to fix race condition
where trx->read_view was closed without trx_sys.mutex lock by read-only
trasnactions.
This just made the problem less likely to happen. In fact there was race
condition in const version of trx_get_read_view(): pointer may change to
garbage any moment after MVCC::is_view_active(trx->read_view) check and
before this function returns.
This patch doesn't fix this race condition, but rather makes it's
consequences less destructive.
There is only one transaction system object in InnoDB.
Allocate the storage for it at link time, not at runtime.
lock_rec_fetch_page(): Use the correct fetch mode BUF_GET.
Pages may never be deallocated from a tablespace while
record locks are pointing to them.
Use atomic operations when accessing trx_sys_t::max_trx_id. We can't yet
move trx_sys_t::get_new_trx_id() out of mutex because it must be updated
atomically along with trx_sys_t::rw_trx_ids.
trx_sys_t::rw_trx_set is implemented as std::set, which does a few quite
expensive operations under trx_sys_t::mutex protection: e.g. malloc/free
when adding/removing elements. Traversing b-tree is not that cheap either.
This has negative scalability impact, which is especially visible when running
oltp_update_index.lua benchmark on a ramdisk.
To reduce trx_sys_t::mutex contention std::set is replaced with LF_HASH. None
of LF_HASH operations require trx_sys_t::mutex (nor any other global mutex)
protection.
Another interesting issue observed with std::set is reproducible ~2% performance
decline after benchmark is ran for ~60 seconds. With LF_HASH results are stable.
All in all this patch optimises away one of three trx_sys->mutex locks per
oltp_update_index.lua query. The other two critical sections became smaller.
Relevant clean-ups:
Replaced rw_trx_set iteration at startup with local set. The latter is needed
because values inserted to rw_trx_list must be ordered by trx->id.
Removed redundant conditions from trx_reference(): it is (and even was) never
called with transactions that have trx->state == TRX_STATE_COMMITTED_IN_MEMORY.
do_ref_count doesn't (and probably even didn't) make any sense: now it is called
only when reference counter increment is actually requested.
Moved condition out of mutex in trx_erase_lists().
trx_rw_is_active(), trx_rw_is_active_low() and trx_get_rw_trx_by_id() were
greatly simplified and replaced by appropriate trx_rw_hash_t methods.
Compared to rw_trx_set, rw_trx_hash holds transactions only in PREPARED or
ACTIVE states. Transactions in COMMITTED state were required to be found
at InnoDB startup only. They are now looked up in the local set.
Removed unused trx_assert_recovered().
Removed unused innobase_get_trx() declaration.
Removed rather semantically incorrect trx_sys_rw_trx_add().
Moved information printout from trx_sys_init_at_db_start() to
trx_lists_init_at_db_start().
Other things, mainly to get
create_mysqld_error_find_printf_error tool to work:
- Added protection to not include mysqld_error.h twice
- Include "unireg.h" instead of "mysqld_error.h" in server
- Added protection if ER_XX messages are already defined
- Removed wrong calls to my_error(ER_OUTOFMEMORY) as
my_malloc() and my_alloc will do this automatically
- Added missing %s to ER_DUP_QUERY_NAME
- Removed old and wrong calls to my_strerror() when using
MY_ERROR_ON_RENAME (wrong merge)
- Fixed deadlock error message from Galera. Before the extra
information given to ER_LOCK_DEADLOCK was missing because
ER_LOCK_DEADLOCK doesn't provide any extra information.
I kept #ifdef mysqld_error_find_printf_error_used in sql_acl.h
to make it easy to do this kind of check again in the future
InnoDB in MariaDB 10.2 appears to only write MLOG_FILE_RENAME2
redo log records during table-rebuilding ALGORITHM=INPLACE operations.
We must write the records for any .ibd file renames, so that the
operations are crash-safe.
If InnoDB is killed during a RENAME TABLE operation, it can happen that
the transaction for updating the data dictionary will be rolled back.
But, nothing will roll back the renaming of the .ibd file
(the MLOG_FILE_RENAME2 only guarantees roll-forward), or for that matter,
the renaming of the dict_table_t::name in the dict_sys cache. We introduce
the undo log record TRX_UNDO_RENAME_TABLE to fix this.
fil_space_for_table_exists_in_mem(): Remove the parameters
adjust_space, table_id and some code that was trying to work around
these deficiencies.
fil_name_write_rename(): Write a MLOG_FILE_RENAME2 record.
dict_table_rename_in_cache(): Invoke fil_name_write_rename().
trx_undo_rec_copy(): Set the first 2 bytes to the length of the
copied undo log record.
trx_undo_page_report_rename(), trx_undo_report_rename():
Write a TRX_UNDO_RENAME_TABLE record with the old table name.
row_rename_table_for_mysql(): Invoke trx_undo_report_rename()
before modifying any data dictionary tables.
row_undo_ins_parse_undo_rec(): Roll back TRX_UNDO_RENAME_TABLE
by invoking dict_table_rename_in_cache(), which will take care
of both renaming the table and the file.
Merge branch '10.3' into trunk
Both field_visibility and VERS_HIDDEN_FLAG exist independently.
TODO:
VERS_HIDDEN_FLAG should be replaced with SYSTEM_INVISIBLE (or COMPLETELY_INVISIBLE?).