Implement undo tablespace truncation via normal redo logging.
Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name,
CREATE, and DROP.
Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2
is killed before the DROP operation is committed. If MariaDB Server 10.2
is killed during TRUNCATE, it is also possible that the old table
was renamed to #sql-ib*.ibd but the data dictionary will refer to the
table using the original name.
In MariaDB Server 10.3, RENAME inside InnoDB is transactional,
and #sql-* tables will be dropped on startup. So, this new TRUNCATE
will be fully crash-safe in 10.3.
ha_mroonga::wrapper_truncate(): Pass table options to the underlying
storage engine, now that ha_innobase::truncate() will need them.
rpl_slave_state::truncate_state_table(): Before truncating
mysql.gtid_slave_pos, evict any cached table handles from
the table definition cache, so that there will be no stale
references to the old table after truncating.
== TRUNCATE TABLE ==
WL#6501 in MySQL 5.7 introduced separate log files for implementing
atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB
undo and redo log. Some convoluted logic was added to the InnoDB
crash recovery, and some extra synchronization (including a redo log
checkpoint) was introduced to make this work. This synchronization
has caused performance problems and race conditions, and the extra
log files cannot be copied or applied by external backup programs.
In order to support crash-upgrade from MariaDB 10.2, we will keep
the logic for parsing and applying the extra log files, but we will
no longer generate those files in TRUNCATE TABLE.
A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE
(with full redo and undo logging and proper rollback). This will
be implemented in MDEV-14717.
ha_innobase::truncate(): Invoke RENAME, create(), delete_table().
Because RENAME cannot be fully rolled back before MariaDB 10.3
due to missing undo logging, add some explicit rename-back in
case the operation fails.
ha_innobase::delete(): Introduce a variant that takes sqlcom as
a parameter. In TRUNCATE TABLE, we do not want to touch any
FOREIGN KEY constraints.
ha_innobase::create(): Add the parameters file_per_table, trx.
In TRUNCATE, the new table must be created in the same transaction
that renames the old table.
create_table_info_t::create_table_info_t(): Add the parameters
file_per_table, trx.
row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.
row_drop_table_after_create_fail(): New function, wrapping
row_drop_table_for_mysql().
dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(),
fil_prepare_for_truncate(), fil_reinit_space_header_for_table(),
row_truncate_table_for_mysql(), TruncateLogger,
row_truncate_prepare(), row_truncate_rollback(),
row_truncate_complete(), row_truncate_fts(),
row_truncate_update_system_tables(),
row_truncate_foreign_key_checks(), row_truncate_sanity_checks():
Remove.
row_upd_check_references_constraints(): Remove a check for
TRUNCATE, now that the table is no longer truncated in place.
The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some
race-condition like scenarios. The test innodb-innodb.truncate does
not use any synchronization.
We add a redo log subformat to indicate backup-friendly format.
MariaDB 10.4 will remove support for the old TRUNCATE logging,
so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve
limitations.
== Undo tablespace truncation ==
MySQL 5.7 implements undo tablespace truncation. It is only
possible when innodb_undo_tablespaces is set to at least 2.
The logging is implemented similar to the WL#6501 TRUNCATE,
that is, using separate log files and a redo log checkpoint.
We can simply implement undo tablespace truncation within
a single mini-transaction that reinitializes the undo log
tablespace file. Unfortunately, due to the redo log format
of some operations, currently, the total redo log written by
undo tablespace truncation will be more than the combined size
of the truncated undo tablespace. It should be acceptable
to have a little more than 1 megabyte of log in a single
mini-transaction. This will be fixed in MDEV-17138 in
MariaDB Server 10.4.
recv_sys_t: Add truncated_undo_spaces[] to remember for which undo
tablespaces a MLOG_FILE_CREATE2 record was seen.
namespace undo: Remove some unnecessary declarations.
fil_space_t::is_being_truncated: Document that this flag now
only applies to undo tablespaces. Remove some references.
fil_space_t::is_stopping(): Do not refer to is_being_truncated.
This check is for tablespaces of tables. Potentially used
tablespaces are never truncated any more.
buf_dblwr_process(): Suppress the out-of-bounds warning
for undo tablespaces.
fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero
page number (new size of the tablespace in pages) to inform
crash recovery that the undo tablespace size has been reduced.
fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2
can be written for undo tablespaces (without .ibd file suffix)
for a nonzero page number.
os_file_truncate(): Add the parameter allow_shrink=false
so that undo tablespaces can actually be shrunk using this function.
fil_name_parse(): For undo tablespace truncation,
buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[].
recv_read_in_area(): Avoid reading pages for which no redo log
records remain buffered, after recv_addr_trim() removed them.
trx_rseg_header_create(): Add a FIXME comment that we could write
much less redo log.
trx_undo_truncate_tablespace(): Reinitialize the undo tablespace
in a single mini-transaction, which will be flushed to the redo log
before the file size is trimmed.
recv_addr_trim(): Discard any redo logs for pages that were
logged after the new end of a file, before the truncation LSN.
If the rec_list becomes empty, reduce n_addrs. After removing
any affected records, actually truncate the file.
recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before
applying any log records. The undo tablespace files must be open
at this point.
buf_flush_or_remove_pages(), buf_flush_dirty_pages(),
buf_LRU_flush_or_remove_pages(): Add a parameter for specifying
the number of the first page to flush or remove (default 0).
trx_purge_initiate_truncate(): Remove the log checkpoints, the
extra logging, and some unnecessary crash points. Merge the code
from trx_undo_truncate_tablespace(). First, flush all to-be-discarded
pages (beyond the new end of the file), then trim the space->size
to make the page allocation deterministic. At the only remaining
crash injection point, flush the redo log, so that the recovery
can be tested.
Because innodb_file_per_table can be enabled at runtime after it
was disabled at startup, it is better to always register the same
innobase_hton->tablefile_extensions. Besides,
innodb_file_per_table=OFF does not prevent loading tables that may
have been created earlier with the .ibd file extension.
Disable "Invalid (old?) table or database name" warning when
converting table names in InnoDB's get_foreign_key_info().
Because a name can be a temporary table name during the ALTER TABLE,
and some other thread can do SHOW CREATE TABLE for the other table
in the FK relationships _anytime_.
extra/mariabackup/fil_cur.cc:361:42: warning: format specifies type 'unsigned long' but the argument has type 'ib_int64_t' (aka 'long long') [-Wformat]
extra/mariabackup/fil_cur.cc:376:9: warning: format specifies type 'unsigned long' but the argument has type 'ib_int64_t' (aka 'long long') [-Wformat]
sql/handler.cc:6196:45: warning: format specifies type 'unsigned long' but the argument has type 'wsrep_trx_id_t' (aka 'unsigned long long') [-Wformat]
sql/log.cc:1681:16: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
sql/log.cc:1687:16: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
sql/wsrep_sst.cc:1388:86: warning: format specifies type 'long' but the argument has type 'wsrep_seqno_t' (aka 'long long') [-Wformat]
sql/wsrep_sst.cc:232:86: warning: format specifies type 'long' but the argument has type 'wsrep_seqno_t' (aka 'long long') [-Wformat]
storage/connect/filamdbf.cpp:450:47: warning: format specifies type 'short' but the argument has type 'int' [-Wformat]
storage/connect/filamdbf.cpp:970:47: warning: format specifies type 'short' but the argument has type 'int' [-Wformat]
storage/connect/inihandl.cpp:197:16: warning: address of array 'key->name' will always evaluate to 'true' [-Wpointer-bool-conversion]
storage/innobase/btr/btr0scrub.cc:151:17: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/buf/buf0buf.cc:5085:8: warning: nonnull parameter 'bpage' will evaluate to 'true' on first encounter [-Wpointer-bool-conversion]
storage/innobase/fil/fil0crypt.cc:2454:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/handler/ha_innodb.cc:18685:7: warning: format specifies type 'unsigned long' but the argument has type 'wsrep_trx_id_t' (aka 'unsigned long long') [-Wformat]
storage/innobase/row/row0mysql.cc:3319:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/row/row0mysql.cc:3327:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/maria/ma_norec.c:35:10: warning: implicit conversion from 'int' to 'my_bool' (aka 'char') changes value from 131 to -125 [-Wconstant-conversion]
storage/maria/ma_norec.c:42:10: warning: implicit conversion from 'int' to 'my_bool' (aka 'char') changes value from 131 to -125 [-Wconstant-conversion]
storage/maria/ma_test2.c:1009:12: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
storage/maria/ma_test2.c:1010:12: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
storage/mroonga/ha_mroonga.cpp:9189:44: warning: use of logical '&&' with constant operand [-Wconstant-logical-operand]
storage/mroonga/vendor/groonga/lib/expr.c:4987:22: warning: comparison of constant -1 with expression of type 'grn_operator' is always false [-Wtautological-constant-out-of-range-compare]
storage/xtradb/btr/btr0scrub.cc:151:17: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/buf/buf0buf.cc:5047:8: warning: nonnull parameter 'bpage' will evaluate to 'true' on first encounter [-Wpointer-bool-conversion]
storage/xtradb/fil/fil0crypt.cc:2454:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/row/row0mysql.cc:3324:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/row/row0mysql.cc:3332:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
unittest/sql/mf_iocache-t.cc:120:35: warning: format specifies type 'unsigned long' but the argument has type 'int' [-Wformat]
unittest/sql/mf_iocache-t.cc:96:35: note: expanded from macro 'INFO_TAIL'
When a table is renamed to an internal #sql2 or #sql-ib name during
a table-rebuilding DDL operation such as OPTIMIZE TABLE or ALTER TABLE,
and shortly after that a purge operation in an index on virtual columns
is attempted, the operation could fail, but purge would fail to release
the table reference.
innodb_acquire_mdl(): Release the reference if the table name is not
valid for acquiring a meta-data lock (MDL).
innodb_find_table_for_vc(): Add a debug assertion if the table name
is not valid. This code path is for DML execution. The table
should have a valid name for executing DML, and furthermore a MDL
will prevent the table from being renamed.
row_vers_build_clust_v_col(): Add a debug assertion that both indexes
must belong to the same table.
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
Similar to the tables SYS_FOREIGN and SYS_FOREIGN_COLS,
the tables mysql.innodb_table_stats and mysql.innodb_index_stats
are updated by the InnoDB internal SQL parser, which fails to
enforce the size limits of the data. Due to this, it is possible
for InnoDB to hang when there are persistent statistics defined on
partitioned tables where the total length of table name,
partition name and subpartition name exceeds the incorrectly
defined limit VARCHAR(64). That column should have been defined
as VARCHAR(199).
btr_node_ptr_max_size(): Interpret the VARCHAR(64) as VARCHAR(199),
to prevent a hang in the case that the upgrade script has not been
run.
dict_table_schema_check(): Ignore difference in the length of the
table_name column.
ha_innobase::max_supported_key_length(): For innodb_page_size=4k,
return a larger value so that the table mysql.innodb_index_stats
can be created. This could allow "impossible" tables to be created,
such that it is not possible to insert anything into a secondary
index when both the secondary key and the primary key are long,
but this is the easiest and most consistent way. The Oracle fix
would only ignore the maximum length violation for the two
statistics tables.
os_file_get_status_posix(), os_file_get_status_win32(): Handle
ENAMETOOLONG as well.
This patch is based on the following change in MySQL 5.7.23.
Not all changes were applied, and our variant allows persistent
statistics to work without hangs even if the table definitions
were not upgraded.
From fdbdce701ab8145ae234c9d401109dff4e4106cb Mon Sep 17 00:00:00 2001
From: Aditya A <aditya.a@oracle.com>
Date: Thu, 17 May 2018 16:11:43 +0530
Subject: [PATCH] Bug #26390736 THE FIELD TABLE_NAME (VARCHAR(64)) FROM
MYSQL.INNODB_TABLE_STATS CAN OVERFLOW.
In mysql.innodb_index_stats and mysql.innodb_table_stats
tables the table name column didn't take into consideration
partition names which can be more than varchar(64).
INNOBASE_SHARE: remove
check_index_consistency(): iterates through keys and looks for InnoDB and .frm
mismatches.
ha_innobase::innobase_get_index(): now uses dict_table_get_index_on_name()
dict_table_get_index_on_name(): uses strcmp() instead of innobase_casestrcmp()
as we just need to know whether strings are equal or not
Problem:
As part of bug #24938374 fix, dict_operation_lock was not taken by
fts_optimize_thread while syncing fts cache.
Due to this change, alter query is able to update SYS_TABLE rows
simultaneously. Now when fts_optimizer_thread goes open index table,
It doesn't open index table if the record corresponding to that table is
set to REC_INFO_DELETED_FLAG in SYS_TABLES and hits an assert.
Fix:
If fts sync is already in progress, Alter query would wait for sync to
complete before renaming table.
RB: #19604
Reviewed by : Jimmy.Yang@oracle.com
Introduce the configuration option innodb_log_optimize_ddl
for controlling whether native index creation or table-rebuild
in InnoDB should keep optimizing the redo log
(and writing MLOG_INDEX_LOAD records to ensure that
concurrent backup would fail).
By default, we have innodb_log_optimize_ddl=ON, that is,
the default behaviour that was introduced in MariaDB 10.2.2
(with the merge of InnoDB from MySQL 5.7) will be unchanged.
BtrBulk::m_trx: Replaces m_trx_id. We must be able to check for
KILL QUERY even if !m_flush_observer (innodb_log_optimize_ddl=OFF).
page_cur_insert_rec_write_log(): Declare globally, so that this
can be called from PageBulk::insert().
row_merge_insert_index_tuples(): Remove the unused parameter trx_id.
row_merge_build_indexes(): Enable or disable redo logging based on
the innodb_log_optimize_ddl parameter.
PageBulk::init(), PageBulk::insert(), PageBulk::finish(): Write
redo log records if needed. For ROW_FORMAT=COMPRESSED, redo log
will be written in PageBulk::compress() unless we called
m_mtr.set_log_mode(MTR_LOG_NO_REDO).
During a table-rebuilding operation, the function table_name_parse()
can encounter a table name that starts with #sql. Here is an example
of a failure:
CURRENT_TEST: gcol.innodb_virtual_basic
mysqltest: At line 1204: query 'alter table t drop column d ' failed:
2013: Lost connection to MySQL server during query
Let us just remove these bogus debug assertions.
If the final renaming phase during ALTER TABLE never fails, it
should not do any harm to skip the purge. If it does fail, then
we might end up 'leaking' some delete-marked records in the
indexes on virtual columns of the original table, and these
garbage records would keep consuming space until the indexes are
dropped or the table is successfully rebuilt.
The parameter innodb_lock_schedule_algorithm was introduced in
MariaDB Server 10.1.19, 10.2.13, 10.3.4 as part of MDEV-11039.
In MariaDB 10.1, the default value of the parameter is 'fcfs',
that is, the existing algorithm is used by default. But in
later versions of MariaDB Server, the parameter was 'vats',
enabling the new algorithm.
Because the new algorithm is triggering a debug assertion failure
that suggests corruption of the transactional lock data structures,
we will revert to the old algorithm by default until we have
resolved the problem.
Problem:
========
Truncate operation holds MDL on the table (t1) and tries to
acquire InnoDB dict_operation_lock. Purge holds dict_operation_lock
and tries to acquire MDL on the table (t1) to evaluate virtual
column expressions for indexed virtual columns.
It leads to deadlock of purge and truncate table (DDL).
Solution:
=========
If purge tries to acquire MDL on the table then it should do the following:
i) Purge should release all innodb latches (including dict_operation_lock)
before acquiring metadata lock on the table.
ii) After acquiring metadata lock on the table, it should check whether the
table was dropped or renamed. If the table is dropped then purge should
ignore the undo log record. If the table is renamed then it should
release the old MDL and acquire MDL on the new name.
iii) Once purge acquires MDL, it should use the SQL table handle for all
the remaining virtual index for the purge record.
purge_node_t: Introduce new virtual column information to know whether
the MDL was acquired successfully.
This is joint work with Marko Mäkelä.
Make dict_table_t::n_ref_count private, and protect it with
a combination of dict_sys->mutex and atomics. We want to be
able to invoke dict_table_t::release() without holding
dict_sys->mutex.
The bug was that innobase_get_computed_value() trashed record[0] and data
in Field_blob::value
Fixed by using a record on the heap for innobase_get_computed_value()
Reviewer: Marko Mäkelä
The following conditions will decide the query cache retrieval or
storing inside innodb:
(1) There should not be any locks on the table.
(2) Some other trx shouldn't invalidated the cache before the
transaction started.
(3) Read view shouldn't exist. If exists then the view
low_limit_id should be greater than or equal to the transaction that
invalidates the cache for the particular table.
For read-only transaction: should satisfy the above (1) and (3)
For read-write transaction: should satisfy the above (1), (2), (3).
- Changed the variable from query_cache_inv_id to query_cache_inv_trx_id.
- Moved the function row_search_check_if_query_cache_permitted from
row0sel.h and made it as static function in ha_innodb.cc
When type of the settable global variable innodb_change_buffering was
changed from string to ENUM in MDEV-12218, the "shadow" variable ibuf_use
stopped being updated as a result of SET GLOBAL innodb_change_buffering.
Only on InnoDB startup, the parameter innodb_change_buffering would
take effect.
ibuf_use: Remove, and use the global variable innodb_change_buffering.