The purpose of the change buffer was to reduce random disk access,
which could be useful on rotational storage, but maybe less so on
solid-state storage.
When we wished to
(1) insert a record into a non-unique secondary index,
(2) delete-mark a secondary index record,
(3) delete a secondary index record as part of purge (but not ROLLBACK),
and the B-tree leaf page where the record belongs to is not in the buffer
pool, we inserted a record into the change buffer B-tree, indexed by
the page identifier. When the page was eventually read into the buffer
pool, we looked up the change buffer B-tree for any modifications to the
page, applied these upon the completion of the read operation. This
was called the insert buffer merge.
We remove the change buffer, because it has been the source of
various hard-to-reproduce corruption bugs, including those fixed in
commit 5b9ee8d819 and
commit 165564d3c3 but not limited to them.
A downgrade will fail with a clear message starting with
commit db14eb16f9 (MDEV-30106).
buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
WRITE_FIX_IBUF to WRITE_FIX.
buf_pool_t::watch[]: Remove.
trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
bulk_insert into the same bit-field. The only purpose of
trx_t::check_unique_secondary is to enable bulk insert into an
empty table. It no longer enables insert buffering for UNIQUE INDEX.
btr_cur_t::thr: Remove. This field was originally needed for change
buffering. Later, its use was extended to cover SPATIAL INDEX.
Much of the time, rtr_info::thr holds this field. When it does not,
we will add parameters to SPATIAL INDEX specific functions.
ibuf_upgrade_needed(): Check if the change buffer needs to be updated.
ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
has been applied. Free any pages consumed by the change buffer, and
zero out the change buffer root page to mark the upgrade completed,
and to prevent a downgrade to an earlier version.
dict_load_tablespaces(): Renamed from
dict_check_tablespaces_and_store_max_id(). This needs to be invoked
before ibuf_upgrade().
btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.
btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
allocate any change buffer pages.
btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.
row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
for the SPATIAL INDEX case.
rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().
rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().
Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
change buffer format that predates the MySQL 4.1 introduction of
the option innodb_file_per_table was removed in MySQL 5.6.5
as part of mysql/mysql-server@69b6241a79
and MariaDB 10.0.11 as part of 1d0f70c2f8.
In the tests innodb.log_upgrade and innodb.log_corruption, we create
valid (upgraded) change buffer pages.
Tested by: Matthias Leich
Before commit 6112853cda in MySQL 4.1.1
introduced the parameter innodb_file_per_table, all InnoDB data was
written to the InnoDB system tablespace (often named ibdata1).
A serious design problem is that once the system tablespace has grown to
some size, it cannot shrink even if the data inside it has been deleted.
There are also other design problems, such as the server hang MDEV-29930
that should only be possible when using innodb_file_per_table=0 and
innodb_undo_tablespaces=0 (storing both tables and undo logs in the
InnoDB system tablespace).
The parameter innodb_change_buffering was deprecated
in commit b5852ffbee.
Starting with commit baf276e6d4
(MDEV-19229) the number of innodb_undo_tablespaces can be increased,
so that the undo logs can be moved out of the system tablespace
of an existing installation.
If all these things (tables, undo logs, and the change buffer) are
removed from the InnoDB system tablespace, the only variable-size
data structure inside it is the InnoDB data dictionary.
DDL operations on .ibd files was optimized in
commit 86dc7b4d4c (MDEV-24626).
That should have removed any thinkable performance advantage of
using innodb_file_per_table=0.
Since there should be no benefit of setting innodb_file_per_table=0,
the parameter should be deprecated. Starting with MySQL 5.6 and
MariaDB Server 10.0, the default value is innodb_file_per_table=1.
- Information_schema.innodb_tablespaces_encryption should print
undo tablespace name as innodb_undo001, innodb_undo002 and soon.
- Encryption test should include undo tablespaces count when
the tests are waiting for the condition to check whether all
tables are encrypted or decrypted.
The log overwrite warnings are not being reliably emitted in all
debug-instrumented environments. It may be related to the
scheduling of some InnoDB internal activity, such as the purging
of committed transaction history.
The InnoDB write-ahead log ib_logfile0 is of fixed size,
specified by innodb_log_file_size. If the tail of the log
manages to overwrite the head (latest checkpoint) of the log,
crash recovery will be broken.
Let us clarify the messages about this, including adding
a message on the completion of a log checkpoint that notes
that the dangerous situation is over.
To reproduce the dangerous scenario, we will introduce the
debug injection label ib_log_checkpoint_avoid_hard, which will
avoid log checkpoints even harder than the previous
ib_log_checkpoint_avoid.
log_t::overwrite_warned: The first known dangerous log sequence number.
Set in log_close() and cleared in log_write_checkpoint_info(),
which will output a "Crash recovery was broken" message.
To prevent ASAN heap-use-after-poison in the MDEV-16549 part of
./mtr --repeat=6 main.derived
the initialization of Name_resolution_context was cleaned up.
In commit 28325b0863
a compile-time option was introduced to disable the macros
DBUG_ENTER and DBUG_RETURN or DBUG_VOID_RETURN.
The parameter name WITH_DBUG_TRACE would hint that it also
covers DBUG_PRINT statements. Let us do that: WITH_DBUG_TRACE=OFF
shall disable DBUG_PRINT() as well.
A few InnoDB recovery tests used to check that some output from
DBUG_PRINT("ib_log", ...) is present. We can live without those checks.
Reviewed by: Vladislav Vaintroub
recv_sys_t::free_corrupted_page(): Identify the corrupted page in
an error or warning message.
buf_page_free(): Just in case, register the page as modified.
This should already have been done in mtr_t::free() as part of
fseg_free_page_low().
mtr_t::memo_push(): Simplify a condition, so that when invoked
with MTR_MEMO_PAGE_X_MODIFY, we will do the right thing.
fseg_free_page_low(): Remove an accidentally added return statement
that prevented mtr_t::free() from being called. This fixes a regression
that was introduced in
commit 0b47c126e3 (MDEV-13542).
A message used to say "failed to read or decrypt"
but the "or decrypt" part was removed in
commit 0b47c126e3
without adjusting rarely needed error message suppressions in some
encryption tests.
Let us improve the error message so that it mentions the file name,
and adjust all error message suppressions in tests.
Thanks to Oleksandr Byelkin for noticing one test failure.
- InnoDB bulk insert fails to use encryption buffer for encrypting
the temporary log file. Declare the m_crypt_block, m_crypt_pfx in
row_merge_bulk_t to be used for encrypting the temporary file.
When attempting to recover a database with an incorrect encryption key,
the unencrypted page contents should be expected to differ from what
was written before recovery. Let us suppress some more messages.
This caused intermittent failures, depending on when the latest
log checkpoint was triggered.
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.