The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in the system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
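A minimal sketch of the error-propagation idea described above, assuming
hypothetical names (Err, Page, page_get, update_two_pages) that are not the
actual InnoDB interfaces. The lowest layer reports a problem through an
error code instead of aborting, and each caller passes it up unchanged:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical error codes, loosely modelled on the idea of dberr_t.
enum class Err { SUCCESS, CORRUPTION };

// Hypothetical page descriptor for this sketch only.
struct Page { uint32_t id; bool corrupted; };

// Lowest layer: report a problem instead of aborting the process.
// Returns nullptr and sets *err when the page is missing or unusable.
const Page *page_get(const std::vector<Page> &pool, uint32_t id, Err *err)
{
  for (const Page &p : pool)
    if (p.id == id)
    {
      *err = p.corrupted ? Err::CORRUPTION : Err::SUCCESS;
      return p.corrupted ? nullptr : &p;
    }
  *err = Err::CORRUPTION; // the page cannot be read at all
  return nullptr;
}

// Middle layer: an operation that touches two pages. Any failure is
// returned to the caller; nothing is flagged persistently corrupted.
Err update_two_pages(const std::vector<Page> &pool, uint32_t a, uint32_t b)
{
  Err err;
  if (!page_get(pool, a, &err))
    return err;
  if (!page_get(pool, b, &err))
    return err;
  return Err::SUCCESS;
}
```

The top-level caller can then turn the returned code into an SQL error,
while the table remains readable for a later rescue attempt.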
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures.
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Inconsistencies will no longer trigger
assertion failures; error codes will be returned instead.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
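The distinction between an uninitialized page and a page number mismatch
might be sketched as follows. The names ReadResult, classify_page and
read_be32 are hypothetical, and the position of the page number (a
big-endian 32-bit field at byte offset 4) is an assumption of this sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical result codes, named after the ones mentioned above.
enum class ReadResult { OK, FAIL, PAGE_CORRUPTED };

// Assumption: the page number is stored big-endian at byte offset 4.
constexpr size_t PAGE_NO_OFFSET = 4;

uint32_t read_be32(const unsigned char *p)
{
  return uint32_t{p[0]} << 24 | uint32_t{p[1]} << 16 |
         uint32_t{p[2]} << 8 | uint32_t{p[3]};
}

ReadResult classify_page(const unsigned char *frame, size_t size,
                         uint32_t expected_page_no)
{
  // An all-zero frame was never initialized; report it separately.
  if (std::all_of(frame, frame + size,
                  [](unsigned char c) { return c == 0; }))
    return ReadResult::FAIL;
  // A frame that carries another page's number is treated as corrupted.
  if (size < PAGE_NO_OFFSET + 4 ||
      read_be32(frame + PAGE_NO_OFFSET) != expected_page_no)
    return ReadResult::PAGE_CORRUPTED;
  return ReadResult::OK;
}
```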
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
In commit 7a4fbb55b0 (MDEV-25105)
the innochecksum option --write (-w) was removed altogether.
It should have been made a Boolean option, so that old data files
may be converted to a format that is compatible with
innodb_checksum_algorithm=strict_crc32 by executing the following:
innochecksum -n -w ibdata* */*.ibd
It would be better to use an older-version innochecksum
for such a conversion, so that page checksums will be validated
before updating the checksum.
It never was possible for innochecksum to convert files to the
innodb_checksum_algorithm=full_crc32 format that is the default
for new InnoDB data files.
Some GNU/Linux distributions ship a zlib that is modified to use
the s390x DFLTCC instruction. That modification would essentially
redefine compressBound(sourceLen) as (sourceLen * 16 + 2308) / 8 + 6.
Let us relax the tests for InnoDB ROW_FORMAT=COMPRESSED to cope with
such a weaker compression guarantee.
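To illustrate how much weaker the guarantee is, the sketch below evaluates
the bound quoted above for a 16384-byte input. The function names are
hypothetical, and the stock zlib formula is an assumption quoted from
memory, not taken from this commit:

```cpp
#include <cstdint>

// Bound quoted above for the DFLTCC-modified zlib.
constexpr uint64_t dfltcc_bound(uint64_t source_len)
{
  return (source_len * 16 + 2308) / 8 + 6;
}

// Assumption: the bound used by an unmodified zlib compressBound().
constexpr uint64_t stock_bound(uint64_t source_len)
{
  return source_len + (source_len >> 12) + (source_len >> 14) +
         (source_len >> 25) + 13;
}

// For a 16KiB page, the modified bound is roughly twice the input size.
static_assert(dfltcc_bound(16384) == 33062, "modified bound for 16KiB");
static_assert(stock_bound(16384) == 16402, "assumed stock bound for 16KiB");
```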
create_table_info_t::row_size_is_acceptable(): Remove a bogus debug-only
assertion that would fail to hold for the test innodb_zip.bug36169.
The function page_zip_empty_size() may indeed return 0.
It is misleading to compare the numbers of columns and fields and to report
them to the user. Thus, it is better to remove that check and let the user see
a subsequent error message about a missing or misplaced column.
row_import::match_schema(): remove misleading check
ALTER TABLE IMPORT doesn't properly handle instant alter metadata.
This patch makes IMPORT read, parse and apply instant alter metadata at the
very beginning of the operation. So, cases where the source table has some
metadata and the destination table doesn't have it now work fine.
DISCARD already removes instant metadata, so importing a normal table into
an instant table worked fine before this patch.
decrypt_decompress(): decrypts and decompresses a page if needed
handle_instant_metadata(): this should be the first thing to read the source
table. Basically, it applies instant metadata to a destination
dict_table_t object. This is the first thing to read the FSP flags, so
all possible checks of them were moved to this function.
PageConverter::update_index_page(): it no longer reads instant metadata.
This logic was moved into handle_instant_metadata()
row_import::match_flags(): this is the first part of row_import::match_schema().
As a separate function, it's used by handle_instant_metadata().
fil_space_t::is_full_crc32_compressed(): added a convenience function
ha_innobase::discard_or_import_tablespace(): do not reload table definition
to read instant metadata because handle_instant_metadata() does it better.
The reverted code was originally added in
4e7ee166a9
ANONYMOUS_VAR: this is a handy thing to use along with make_scope_exit()
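For readers unfamiliar with the idiom, here is a minimal sketch of a scope
guard combined with a macro that generates a unique variable name. All names
carrying the word "sketch" are hypothetical stand-ins; this is not the
actual MariaDB implementation of make_scope_exit() or ANONYMOUS_VAR:

```cpp
#include <utility>

// Minimal scope guard: runs the stored callable when it goes out of scope.
template <typename F> class scope_exit_sketch
{
  F f_;
  bool active_ = true;
public:
  explicit scope_exit_sketch(F f) : f_(std::move(f)) {}
  scope_exit_sketch(scope_exit_sketch &&o)
    : f_(std::move(o.f_)), active_(o.active_) { o.active_ = false; }
  scope_exit_sketch(const scope_exit_sketch &) = delete;
  scope_exit_sketch &operator=(const scope_exit_sketch &) = delete;
  ~scope_exit_sketch() { if (active_) f_(); }
};

template <typename F> scope_exit_sketch<F> make_scope_exit_sketch(F f)
{
  return scope_exit_sketch<F>(std::move(f));
}

// Hypothetical stand-in for ANONYMOUS_VAR: pastes the line number onto a
// prefix, so that several guards can coexist without hand-picked names.
#define SKETCH_CONCAT2(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT2(a, b)
#define SKETCH_ANONYMOUS_VAR(prefix) SKETCH_CONCAT(prefix, __LINE__)

int run_with_cleanup(int *cleanups)
{
  auto SKETCH_ANONYMOUS_VAR(guard) =
    make_scope_exit_sketch([=] { ++*cleanups; });
  auto SKETCH_ANONYMOUS_VAR(guard) =
    make_scope_exit_sketch([=] { ++*cleanups; });
  // The guards have not run yet; they run when the function returns.
  return *cleanups;
}
```

Each guard must be declared on its own source line, because the generated
name depends on __LINE__.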
full_crc32_import.test shows different results, because no
dict_table_close() and dict_table_open_on_id() happens.
Thus, SHOW CREATE TABLE shows a slightly older table definition.
This essentially reverts commit 4e89ec6692
and only disables InnoDB persistent statistics for tests where it is
desirable. By design, InnoDB persistent statistics will not be updated
except by ANALYZE TABLE or by STATS_AUTO_RECALC.
The internal transactions that update persistent InnoDB statistics
in background tasks (with innodb_stats_auto_recalc=ON) may cause
nondeterministic query plans or interfere with some tests that deal
with other InnoDB internals, such as the purge of transaction history.
InnoDB tablespace identifiers and page numbers are 32-bit numbers.
Let us use a 32-bit type for them in innochecksum.
The changes in commit 1918bdf32c
broke the build on 32-bit Windows.
Thanks to Vicențiu Ciorbaru for an initial version of this fixup.
Let us simply refuse an upgrade from earlier versions if the
upgrade procedure was not followed. This simplifies the purge,
commit, and rollback of transactions.
Before upgrading to MariaDB 10.3 or later, a clean shutdown
of the server (with innodb_fast_shutdown=1 or 0) is necessary,
to ensure that any incomplete transactions are rolled back.
The undo log format was changed in MDEV-12288. There is only
one persistent undo log for each transaction.
This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.
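The invariant can be illustrated with a toy model. Everything here
(Dictionary, DdlTransaction, truncate_sketch, the temporary name) is
hypothetical and far simpler than the real data dictionary; the point is
only that a rename to an internal name and the drop of that name happen in
the same transaction:

```cpp
#include <set>
#include <string>

// Toy data dictionary: just a set of table names.
using Dictionary = std::set<std::string>;

// Toy transaction: works on a private copy and publishes it on commit.
class DdlTransaction
{
  Dictionary &committed_;
  Dictionary working_;
public:
  explicit DdlTransaction(Dictionary &d) : committed_(d), working_(d) {}
  void rename(const std::string &from, const std::string &to)
  { working_.erase(from); working_.insert(to); }
  void create(const std::string &name) { working_.insert(name); }
  void drop(const std::string &name) { working_.erase(name); }
  void commit() { committed_ = working_; }
  // Rolling back is simply not calling commit(): committed_ is unchanged.
};

// TRUNCATE-like sequence: rename away, create anew, drop the renamed table.
void truncate_sketch(DdlTransaction &trx, const std::string &name)
{
  const std::string tmp = "#sql-ib-hypothetical-1"; // made-up internal name
  trx.rename(name, tmp);
  trx.create(name);
  trx.drop(tmp);
}

// True if any table in the dictionary carries an internal #sql-ib name.
bool has_internal_name(const Dictionary &d)
{
  for (const std::string &n : d)
    if (n.compare(0, 7, "#sql-ib") == 0)
      return true;
  return false;
}
```

Whether the toy transaction commits or is abandoned, the committed
dictionary never contains a #sql-ib name.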
Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() is closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.
HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.
ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.
ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.
dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.
dict_table_t::parse_name(): Use mdl_name instead of name.
dict_table_rename_in_cache(): Update mdl_name.
For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.
trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.
trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.
dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.
ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
HA_ERR_TABLE_IN_FK_CHECK: Remove.
row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.
This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.
Many InnoDB data dictionary cache operations require that the
table name be copied so that it will be NUL terminated.
(For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.)
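A minimal sketch of such a copy, assuming a hypothetical helper copy_name()
that receives a length-delimited field, as a record column would be:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copy a length-delimited name into storage that guarantees a trailing NUL.
// The source buffer itself is not assumed to be NUL-terminated.
std::string copy_name(const char *data, size_t len)
{
  return std::string(data, len); // c_str() is always NUL-terminated
}
```

For example, copying the first 7 bytes of a buffer that continues with
unrelated bytes yields a string that can safely be passed to strlen().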
dict_table_t::is_garbage_name(): Check if a name belongs to
the background drop table queue.
dict_check_if_system_table_exists(): Remove.
dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables
SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup.
dict_sys_t::create_or_check_sys_tables(): Replaces
dict_create_or_check_foreign_constraint_tables() and
dict_create_or_check_sys_virtual().
dict_sys_t::load_table(): Replaces dict_table_get_low()
and dict_load_table().
dict_sys_t::find_table(): Renamed from get_table().
dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded
tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist.
trx_t::has_stats_table_lock(): Moved to dict0stats.cc.
Some error messages will now report table names in the internal
databasename/tablename format, instead of `databasename`.`tablename`.
Ever since MDEV-24589, MDEV-18518 and other recent changes corrected the
rollback of CREATE and DROP operations, there is no need to crash the
server if we run out of space during a DROP operation. We can simply
let the transaction be rolled back.
This patch changes the main name of the 3-byte character set from utf8 to
utf8mb3. A new old_mode flag, UTF8_IS_UTF8MB3, is added and set by default,
so that utf8 means utf8mb3. If it is not set, utf8 means utf8mb4.
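The alias rule can be summarized with a small sketch. The function
resolve_utf8_alias() is hypothetical and only mirrors the behavior stated
above; it is not how the server resolves character set names:

```cpp
#include <string>

// Resolve the ambiguous name "utf8" depending on the old_mode flag.
// Any other character set name is returned unchanged.
std::string resolve_utf8_alias(const std::string &name, bool utf8_is_utf8mb3)
{
  if (name != "utf8")
    return name;
  return utf8_is_utf8mb3 ? "utf8mb3" : "utf8mb4";
}
```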
A side effect of the MDEV-24589 bug fix is that if
FLUSH TABLE...FOR EXPORT is initiated before the history of an
earlier DROP INDEX operation has been purged, then the data file
will contain allocated pages that belonged to the dropped indexes.
These pages would never be freed after a subsequent IMPORT TABLESPACE.
We will work around this regression by making IMPORT TABLESPACE
tolerate pages that refer to an unknown index.
Historically, InnoDB supported a buggy page checksum algorithm that did not
compute a checksum over the full page. Later, well before MySQL 4.1
introduced .ibd files and the innodb_file_per_table option, the algorithm
was corrected and the first 4 bytes of each page were redefined to be
a checksum.
The original checksum was so slow that an option to disable page checksum
was introduced for benchmarketing purposes.
The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set
extension, which includes instructions for faster computation of CRC-32C.
In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was
implemented to make use of that. As that option was changed to be the default
in MySQL 5.7, a bug was found on big-endian platforms and some work-around
code was added to weaken that checksum further. MariaDB disables that
work-around by default since MDEV-17958.
Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER
and ARM and also for IA-32/AMD64, making use of carry-less multiplication
where available.
Long story short, innodb_checksum_algorithm=crc32 is faster and more secure
than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb.
It should have removed any need to use innodb_checksum_algorithm=none.
The setting innodb_checksum_algorithm=crc32 is the default in
MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5,
MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default.
It is even faster and more secure.
The default settings in MariaDB do allow old data files to be read,
no matter if a worse checksum algorithm had been used.
(Unfortunately, before innodb_checksum_algorithm=full_crc32,
the data files did not identify which checksum algorithm is being used.)
The non-default settings innodb_checksum_algorithm=strict_crc32 or
innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C
checksums. The incompatibility with old data files is why they are
not the default.
The newest servers not to support innodb_checksum_algorithm=crc32
were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life.
A valid reason for using innodb_checksum_algorithm=innodb could have
been the ability to downgrade. If it is really needed, data files
can be converted with an older version of the innochecksum utility.
Because there is no good reason to allow data files to be written
with insecure checksums, we will reject those option values:
innodb_checksum_algorithm=none
innodb_checksum_algorithm=innodb
innodb_checksum_algorithm=strict_none
innodb_checksum_algorithm=strict_innodb
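The resulting set of accepted values might be checked as in the sketch
below. The function checksum_algorithm_accepted() is hypothetical; the
accepted names are the CRC-32C based settings mentioned in this message:

```cpp
#include <array>
#include <string>

// Accept only the CRC-32C based settings; everything else is rejected,
// including none, innodb, strict_none and strict_innodb.
bool checksum_algorithm_accepted(const std::string &value)
{
  static const std::array<const char *, 4> accepted{{
    "crc32", "strict_crc32", "full_crc32", "strict_full_crc32"}};
  for (const char *a : accepted)
    if (value == a)
      return true;
  return false;
}
```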
Furthermore, the following innochecksum options will be removed,
because only strict crc32 will be supported:
innochecksum --strict-check=crc32
innochecksum -C crc32
innochecksum --write=crc32
innochecksum -w crc32
If a user wishes to convert a data file to use a different checksum
(so that it might be used with the no-longer-supported
MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE
nor system tablespace format changes that were made in MariaDB 10.3),
then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or
MySQL 5.7 can be used.
Reviewed by: Thirunarayanan Balathandayuthapani
* be strict in CREATE TABLE, just like in ALTER TABLE, because
CREATE TABLE, just like ALTER TABLE, can be rolled back for any engine
* but don't auto-convert warnings into errors for engine warnings
(handler::create) - this matches ALTER TABLE behavior
* and not when creating a default record, these errors are handled
specially (and replaced with ER_INVALID_DEFAULT)
* always issue a Note when a non-unique key is truncated, because it's
not a Warning that can be converted to an Error. Before this commit
it was a Note for blobs and a Warning for all other data types.
The InnoDB internal tables SYS_TABLESPACES and SYS_DATAFILES as well as the
INFORMATION_SCHEMA views INNODB_SYS_TABLESPACES and INNODB_SYS_DATAFILES
were introduced in MySQL 5.6 for no good reason in
mysql/mysql-server/commit/e9255a22ef16d612a8076bc0b34002bc5a784627
when the InnoDB support for the DATA DIRECTORY attribute was introduced.
The file system should be the authoritative source of information on files.
Storing information about file system paths in the file system (symlinks,
or even the .isl files that were unfortunately chosen as the solution) is
sufficient. If information is additionally stored in some hidden tables
inside the InnoDB system tablespace, everything unnecessarily becomes
more complicated, because more copies of data mean more opportunity
for the copies to be out of sync, and because modifying the data in
the system tablespace in the desired way might not be possible at all
without modifying the InnoDB source code. So, the copy in the system
tablespace basically is a redundant, non-authoritative source of
information.
We will stop creating or accessing the system tables SYS_TABLESPACES
and SYS_DATAFILES.
We will also remove the view INFORMATION_SCHEMA.INNODB_SYS_DATAFILES
along with SYS_DATAFILES.
The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES will be repurposed
to directly reflect fil_system.space_list. The column PAGE_SIZE, which
would always contain the value of the GLOBAL read-only variable
innodb_page_size, is removed. The column ZIP_PAGE_SIZE, which would
actually contain the physical page size of a page, is renamed to
PAGE_SIZE. Finally, a new column FILENAME is added, as a replacement
of SYS_DATAFILES.PATH.
This will also address MDEV-21801 (files that were created before
upgrading to MySQL 5.6 or MariaDB 10.0 or later were never registered
in SYS_TABLESPACES or SYS_DATAFILES) and MDEV-21801 (information about
the system tablespace is not stored in SYS_TABLESPACES or SYS_DATAFILES).
Let us introduce the parameter innodb_read_only_compressed
that is ON by default, making any ROW_FORMAT=COMPRESSED tables
read-only.
I developed the ROW_FORMAT=COMPRESSED format based on
Heikki Tuuri's rough design between 2005 and 2008. It might
have been a good idea back then, but no proper benchmarks were
ever run to validate the design or the implementation.
The format has been more or less obsolete for years.
It limits innodb_page_size to 16384 bytes (the default),
and instant ALTER TABLE is not supported.
This is the first step towards deprecating and removing
write support for ROW_FORMAT=COMPRESSED tables.
In main.index_merge_myisam we remove the test that was added in
commit a2d24def8c because
it duplicates the test case that was added in
commit 5af12e4635.
The test innodb.innodb_wl6326 that had been disabled in 10.4 due to
MDEV-21535 is failing on 10.5 due to a different reason: the removal
of the MLOG_COMP_END_COPY_CREATED operations in MDEV-12353
commit 276f996af9 caused PAGE_LAST_INSERT
to be set to something nonzero by the function page_copy_rec_list_end().
This in turn would cause btr_page_get_split_rec_to_right() to behave
differently: we would not attempt to split the page at all, but simply
insert the new record into the new, empty, right leaf page.
Even though the change reduced the sizes of some tables, it is better
to aim for balanced trees.
page_copy_rec_list_end(), PageBulk::finishPage():
Preserve PAGE_LAST_INSERT, PAGE_N_DIRECTION, PAGE_DIRECTION.
PageBulk::finish(): Move some common code from PageBulk::finishPage().