dict_load_foreigns(): Use a correctly sized buffer for the maximum-length
SYS_FOREIGN.ID. In case of overflow, do not crash the server but instead
return DB_CORRUPTION.
Problem:
=======
ALTER TABLE in InnoDB fails to detect duplicate entries
for the unique index when the character set or collation of
an indexed column is changed in such a way that the character
encoding is compatible with the old table definition.
In this case, any secondary indexes on the changed columns
would be rebuilt (DROP INDEX, ADD INDEX).
Solution:
========
During ALTER TABLE, InnoDB keeps track of columns whose collation
changed, and will fill in the correct metadata when sorting the
index records, or applying changes from concurrent DML.
This metadata will be allocated in the dict_index_t::heap of
the being-created secondary indexes.
The fix was developed by Thirunarayanan Balathandayuthapani
and simplified by me.
ha_innobase::check_if_supported_inplace_alter(): Refuse to change the
collation of a column that would become or remain indexed as part of
the ALTER TABLE operation.
In MariaDB Server 10.6, we will allow this type of operation;
that fix depends on MDEV-15250.
Starting with 10.5, InnoDB crash recovery tests seem to time out
more easily under Valgrind, which emulates multiple threads by
interleaving them in a single operating system thread.
These tests will still be covered by
AddressSanitizer and MemorySanitizer.
- In case of discarded tablespace, InnoDB can't read the root page to
assign the n_core_null_bytes. Consecutive instant DDL fails because
of non-matching n_core_null_bytes.
prepare_inplace_alter_table_dict(): If the table will not be rebuilt,
preserve all of the original ROW_FORMAT, including the compressed
page size flags related to ROW_FORMAT=COMPRESSED.
On FreeBSD, tests run on persistent storage, and no asynchronous I/O
has been implemented. Warnings about 205-second waits on dict_sys.latch
may occur.
There was a race condition between log_checkpoint_low() and
deleting or renaming data files. The scenario is as follows:
1. The buffer pool does not contain dirty pages.
2. A FILE_DELETE or FILE_RENAME record is written.
3. The checkpoint LSN will be moved ahead of the write of the record.
4. The server is killed before the file is actually renamed or deleted.
We will prevent this race condition by ensuring that a log checkpoint
cannot occur between the durable write and the file system operation:
1. Durably write the FILE_DELETE or FILE_RENAME record.
2. Perform the file system operation.
3. Allow any log checkpoint to proceed.
mtr_t::commit_file(): Implement the DELETE or RENAME logic.
fil_delete_tablespace(): Delegate some of the logic to
mtr_t::commit_file().
fil_space_t::rename(): Delegate some logic to mtr_t::commit_file().
Remove the debug injection point fil_rename_tablespace_failure_2
because we do test RENAME failures without any debug injection.
fil_name_write_rename_low(), fil_name_write_rename(): Remove.
Tested by Matthias Leich
We do not really care about the exact result; we only care that the
statistics will be accessed. The result could change depending on
when some statistics were updated in the background or when some
committed delete-marked rows were purged from other tables on
which persistent statistics are enabled.
innodb_drop_database(): Use explicit TO_BINARY casts on
SYS_TABLES.NAME, which for historical reasons uses the wrong collation
latin1_swedish_ci instead of BINARY.
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
The InnoDB srv_stats counters
n_rows_updated, n_rows_deleted, n_rows_inserted, and n_rows_read
are duplicating
Handler_update, Handler_delete, Handler_write, and Handler_read_ counters.
Updating those counters is not free, especially because some counters
are furthermore split to distinguish a rare case of modifying tables
in the system schema.
srv_printf_innodb_monitor(): Only display an ADAPTIVE HASH INDEX
section if the adaptive hash index is enabled.
ibuf_print(): Only display an INSERT BUFFER section if the
change buffer is not empty.
InnoDB buffer pool resize messages are more succinct from this change:
Before:
```
2022-05-07 17:10:33 0 [Note] InnoDB: Completed resizing buffer pool from 14745600 to 19660800 bytes.
2022-05-07 17:10:33 0 [Note] InnoDB: Completed resizing buffer pool.
2022-05-07 17:10:33 8 [Note] InnoDB: Completed resizing buffer pool. (New size: 19660800 bytes).
```
After:
```
2022-05-07 17:10:33 0 [Note] InnoDB: Completed resizing buffer pool from 14745600 to 19660800 bytes.
```
Additionally, the INNODB_BUFFER_POOL_RESIZE_STATUS has more complete
info: it contains both the old and new buffer pool size values.