Thanks to MDEV-15058, there is only one InnoDB buffer pool.
Allocating buf_pool statically removes one level of pointer indirection
and makes code more readable, and removes the awkward initialization of
some buf_pool members.
While doing this, we will also declare some buf_pool_t data members
private and replace some functions with member functions. This is
mostly affecting buffer pool resizing.
This is not aiming to be a complete rewrite of buf_pool_t to
a proper class. Most of the buffer pool interface, such as
buf_page_get_gen(), will remain in the C programming style
for now.
buf_pool_t::withdrawing: Replaces buf_pool_withdrawing.
buf_pool_t::withdraw_clock_: Replaces buf_withdraw_clock.
buf_pool_t::create(): Repalces buf_pool_init().
buf_pool_t::close(): Replaces buf_pool_free().
buf_bool_t::will_be_withdrawn(): Replaces buf_block_will_be_withdrawn(),
buf_frame_will_be_withdrawn().
buf_pool_t::clear_hash_index(): Replaces buf_pool_clear_hash_index().
buf_pool_t::get_n_pages(): Replaces buf_pool_get_n_pages().
buf_pool_t::validate(): Replaces buf_validate().
buf_pool_t::print(): Replaces buf_print().
buf_pool_t::block_from_ahi(): Replaces buf_block_from_ahi().
buf_pool_t::is_block_field(): Replaces buf_pointer_is_block_field().
buf_pool_t::is_block_mutex(): Replaces buf_pool_is_block_mutex().
buf_pool_t::is_block_lock(): Replaces buf_pool_is_block_lock().
buf_pool_t::is_obsolete(): Replaces buf_pool_is_obsolete().
buf_pool_t::io_buf: Make default-constructible.
buf_pool_t::io_buf::create(): Delayed 'constructor'
buf_pool_t::io_buf::close(): Early 'destructor'
HazardPointer: Make default-constructible. Define all member functions
inline, also for derived classes.
- This issue is caused by MDEV-19176
(bba59abb03).
- Problem is that there is miscalculation of available memory during
recovery if innodb_buffer_pool_instances > 1.
- Ignore the buffer pool instance while calculating available_memory
- Removed recv_n_pool_free_frames variable and use buf_pool_get_n_pages()
instead.
Our benchmarking efforts indicate that the reasons for splitting the
buf_pool in commit c18084f71b
have mostly gone away, possibly as a result of
mysql/mysql-server@ce6109ebfd
or similar work.
Only in one write-heavy benchmark where the working set size is
ten times the buffer pool size, the buf_pool->mutex would be
less contended with 4 buffer pool instances than with 1 instance,
in buf_page_io_complete(). That contention could be alleviated
further by making more use of std::atomic and by splitting
buf_pool_t::mutex further (MDEV-15053).
We will deprecate and ignore the following parameters:
innodb_buffer_pool_instances
innodb_page_cleaners
There will be only one buffer pool and one page cleaner task.
In a number of INFORMATION_SCHEMA views, columns that indicated
the buffer pool instance will be removed:
information_schema.innodb_buffer_page.pool_id
information_schema.innodb_buffer_page_lru.pool_id
information_schema.innodb_buffer_pool_stats.pool_id
information_schema.innodb_cmpmem.buffer_pool_instance
information_schema.innodb_cmpmem_reset.buffer_pool_instance
InnoDB crash recovery used a special type of mem_heap_t that
allocates backing store from the buffer pool. That incurred
a significant overhead, leading to underutilization of memory,
and limiting the maximum contiguous allocated size of a log record.
recv_sys_t::blocks: A linked list of buf_block_t that are allocated
by buf_block_alloc() for redo log records. Replaces recv_sys_t::heap.
We repurpose buf_block_t::unzip_LRU for linking the elements.
recv_sys_t::max_log_blocks: Renamed from recv_n_pool_free_frames.
recv_sys_t::max_blocks(): Accessor for max_log_blocks.
recv_sys_t::alloc(): Allocate memory from the current recv_sys_t::blocks
element, or allocate another block. In debug builds, various free()
member functions must be invoked, because we repurpose
buf_page_t::buf_fix_count for tracking allocations.
recv_sys_t::free_corrupted_page(): Renamed from recv_recover_corrupt_page()
recv_sys_t::is_memory_exhausted(): Renamed from recv_sys_heap_check()
recv_sys_t::pages and its elements are allocated directly by the
system memory allocator.
recv_parse_log_recs(): Remove the parameter available_memory.
We rename some variables 'store_to_hash' to 'store', because
recv_sys.pages is not actually a hash table.
This is joint work with Thirunarayanan Balathandayuthapani.
buf_read_ibuf_merge_pages(): Discard any page numbers that are
outside the current bounds of the tablespace, by invoking the
function ibuf_delete_recs() that was introduced in MDEV-20934.
This could avoid an infinite change buffer merge loop on
innodb_fast_shutdown=0, because normally the change buffer merge
would only be attempted if a page was successfully loaded into
the buffer pool.
dict_drop_index_tree(): Add the parameter trx_t*.
To prevent the DROP TABLE crash, do not invoke btr_free_if_exists()
if the entire .ibd file will be dropped. Thus, we will avoid a crash
if the BTR_SEG_LEAF or BTR_SEG_TOP of the index is corrupted,
and we will also avoid unnecessarily accessing the to-be-dropped
tablespace via the buffer pool.
In MariaDB 10.2, we disable the DROP TABLE fix if innodb_safe_truncate=0,
because the backup-unsafe MySQL 5.7 WL#6501 form of TRUNCATE TABLE
requires that the individual pages be freed inside the tablespace.
After MDEV-11556, not even crash recovery should attempt to access
non-existing pages. But, buf_load() is not validating its input
and must thus be able to ignore missing pages, so that is why
buf_read_page_background() does that.
Almost all threads have gone
- the "ticking" threads, that sleep a while then do some work)
(srv_monitor_thread, srv_error_monitor_thread, srv_master_thread)
were replaced with timers. Some timers are periodic,
e.g the "master" timer.
- The btr_defragment_thread is also replaced by a timer , which
reschedules it self when current defragment "item" needs throttling
- the buf_resize_thread and buf_dump_threads are substitutes with tasks
Ditto with page cleaner workers.
- purge workers threads are not tasks as well, and purge cleaner
coordinator is a combination of a task and timer.
- All AIO is outsourced to tpool, Innodb just calls thread_pool::submit_io()
and provides the callback.
- The srv_slot_t was removed, and innodb_debug_sync used in purge
is currently not working, and needs reimplementation.
We will remove the InnoDB background operation of merging buffered
changes to secondary index leaf pages. Changes will only be merged as a
result of an operation that accesses a secondary index leaf page,
such as a SQL statement that performs a lookup via that index,
or is modifying the index. Also ROLLBACK and some background operations,
such as purging the history of committed transactions, or computing
index cardinality statistics, can cause change buffer merge.
Encryption key rotation will not perform change buffer merge.
The motivation of this change is to simplify the I/O logic and to
allow crash recovery to happen in the background (MDEV-14481).
We also hope that this will reduce the number of "mystery" crashes
due to corrupted data. Because change buffer merge will typically
take place as a result of executing SQL statements, there should be
a clearer connection between the crash and the SQL statements that
were executed when the server crashed.
In many cases, a slight performance improvement was observed.
This is joint work with Thirunarayanan Balathandayuthapani
and was tested by Axel Schwenke and Matthias Leich.
The InnoDB monitor counter innodb_ibuf_merge_usec will be removed.
On slow shutdown (innodb_fast_shutdown=0), we will continue to
merge all buffered changes (and purge all undo log history).
Two InnoDB configuration parameters will be changed as follows:
innodb_disable_background_merge: Removed.
This parameter existed only in debug builds.
All change buffer merges will use synchronous reads.
innodb_force_recovery will be changed as follows:
* innodb_force_recovery=4 will be the same as innodb_force_recovery=3
(the change buffer merge cannot be disabled; it can only happen as
a result of an operation that accesses a secondary index leaf page).
The option used to be capable of corrupting secondary index leaf pages.
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'.
* innodb_force_recovery=5 (which essentially hard-wires
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED)
becomes safe to use. Bogus data can be returned to SQL, but
persistent InnoDB data files will not be corrupted further.
* innodb_force_recovery=6 (ignore the redo log files)
will be the only option that can potentially cause
persistent corruption of InnoDB data files.
Code changes:
buf_page_t::ibuf_exist: New flag, to indicate whether buffered
changes exist for a buffer pool page. Pages with pending changes
can be returned by buf_page_get_gen(). Previously, the changes
were always merged inside buf_page_get_gen() if needed.
ibuf_page_exists(const buf_page_t&): Check if a buffered changes
exist for an X-latched or read-fixed page.
buf_page_get_gen(): Add the parameter allow_ibuf_merge=false.
All callers that know that they may be accessing a secondary index
leaf page must pass this parameter as allow_ibuf_merge=true,
unless it does not matter for that caller whether all buffered
changes have been applied. Assert that whenever allow_ibuf_merge
holds, the page actually is a leaf page. Attempt change buffer
merge only to secondary B-tree index leaf pages.
btr_block_get(): Add parameter 'bool merge'.
All callers of btr_block_get() should know whether the page could be
a secondary index leaf page. If it is not, we should avoid consulting
the change buffer bitmap to even consider a merge. This is the main
interface to requesting index pages from the buffer pool.
ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace
buf_page_get_known_nowait() with much simpler logic, because
it is now guaranteed that that the block is x-latched or read-fixed.
mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge().
On crash recovery, we will no longer merge any buffered changes
for the pages that we read into the buffer pool during the last batch
of applying log records.
buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove.
btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait()
to its only remaining caller.
buf_page_make_young_if_needed(): Define as an inline function.
Add the parameter buf_pool.
buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the
parameter buf_pool.
fil_space_validate_for_mtr_commit(): Remove a bogus comment
about background merge of the change buffer.
btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(),
btr_cur_open_at_index_side_func(): Use narrower data types and scopes.
ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages().
Merge the change buffer by invoking buf_page_get_gen().
According to the code, it was Windows specific "simulated AIO"
workaround. The simulated s not supported on Windows anymore.
Thus, remove the dead code
Problem:
=========
One of the purge thread access the corrupted page and tries to remove from
LRU list. In the mean time, other purge threads are waiting for same page
in buf_wait_for_read(). Assertion(buf_fix_count == 0) fails for the
purge thread which tries to remove the page from LRU list.
Solution:
========
- Set the page id as FIL_NULL to indicate the page is corrupted before
removing the block from LRU list. Acquire hash lock for the particular
page id and wait for the other threads to release buf_fix_count
for the block.
- Added the error check for btr_cur_open() in row_search_on_row_ref().
Normally, InnoDB is not in the process of executing crash recovery.
Provide a hint to the compiler that the recovery-related code paths
are rarely executed.
MySQL 5.7 introduced the class page_size_t and increased the size of
buffer pool page descriptors by introducing this object to them.
Maybe the intention of this exercise was to prepare for a future
where the buffer pool could accommodate multiple page sizes.
But that future never arrived, not even in MySQL 8.0. It is much
easier to manage a pool of a single page size, and typically all
storage devices of an InnoDB instance benefit from using the same
page size.
Let us remove page_size_t from MariaDB Server. This will make it
easier to remove support for ROW_FORMAT=COMPRESSED (or make it a
compile-time option) in the future, just by removing various
occurrences of zip_size.
main.derived_cond_pushdown: Move all 10.3 tests to the end,
trim trailing white space, and add an "End of 10.3 tests" marker.
Add --sorted_result to tests where the ordering is not deterministic.
main.win_percentile: Add --sorted_result to tests where the
ordering is no longer deterministic.
Stop supporting the additional *trunc.log files that were
introduced via MySQL 5.7 to MariaDB Server 10.2 and 10.3.
DB_TABLESPACE_TRUNCATED: Remove.
purge_sys.truncate: A new structure to track undo tablespace
file truncation.
srv_start(): Remove the call to buf_pool_invalidate(). It is
no longer necessary, given that we no longer access things in
ways that violate the ARIES protocol. This call was originally
added for innodb_file_format, and it may later have been necessary
for the proper function of the MySQL 5.7 TRUNCATE recovery, which
we are now removing.
trx_purge_cleanse_purge_queue(): Take the undo tablespace as a
parameter.
trx_purge_truncate_history(): Rewrite everything mostly in a
single function, replacing references to undo::Truncate.
recv_apply_hashed_log_recs(): If any redo log is to be applied,
and if the log_sys.log.subformat indicates that separately
logged truncate may have been used, refuse to proceed except if
innodb_force_recovery is set. We will still refuse crash-upgrade
if TRUNCATE TABLE was logged. Undo tablespace truncation would
only be logged in undo*trunc.log files, which we are no longer
checking for.
This is a merge from 10.2, but the 10.2 version of this will not
be pushed into 10.2 yet, because the 10.2 version would include
backports of MDEV-14717 and MDEV-14585, which would introduce
a crash recovery regression: Tables could be lost on
table-rebuilding DDL operations, such as ALTER TABLE,
OPTIMIZE TABLE or this new backup-friendly TRUNCATE TABLE.
The test innodb.truncate_crash occasionally loses the table due to
the following bug:
MDEV-17158 log_write_up_to() sometimes fails
Implement undo tablespace truncation via normal redo logging.
Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name,
CREATE, and DROP.
Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2
is killed before the DROP operation is committed. If MariaDB Server 10.2
is killed during TRUNCATE, it is also possible that the old table
was renamed to #sql-ib*.ibd but the data dictionary will refer to the
table using the original name.
In MariaDB Server 10.3, RENAME inside InnoDB is transactional,
and #sql-* tables will be dropped on startup. So, this new TRUNCATE
will be fully crash-safe in 10.3.
ha_mroonga::wrapper_truncate(): Pass table options to the underlying
storage engine, now that ha_innobase::truncate() will need them.
rpl_slave_state::truncate_state_table(): Before truncating
mysql.gtid_slave_pos, evict any cached table handles from
the table definition cache, so that there will be no stale
references to the old table after truncating.
== TRUNCATE TABLE ==
WL#6501 in MySQL 5.7 introduced separate log files for implementing
atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB
undo and redo log. Some convoluted logic was added to the InnoDB
crash recovery, and some extra synchronization (including a redo log
checkpoint) was introduced to make this work. This synchronization
has caused performance problems and race conditions, and the extra
log files cannot be copied or applied by external backup programs.
In order to support crash-upgrade from MariaDB 10.2, we will keep
the logic for parsing and applying the extra log files, but we will
no longer generate those files in TRUNCATE TABLE.
A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE
(with full redo and undo logging and proper rollback). This will
be implemented in MDEV-14717.
ha_innobase::truncate(): Invoke RENAME, create(), delete_table().
Because RENAME cannot be fully rolled back before MariaDB 10.3
due to missing undo logging, add some explicit rename-back in
case the operation fails.
ha_innobase::delete(): Introduce a variant that takes sqlcom as
a parameter. In TRUNCATE TABLE, we do not want to touch any
FOREIGN KEY constraints.
ha_innobase::create(): Add the parameters file_per_table, trx.
In TRUNCATE, the new table must be created in the same transaction
that renames the old table.
create_table_info_t::create_table_info_t(): Add the parameters
file_per_table, trx.
row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.
row_drop_table_after_create_fail(): New function, wrapping
row_drop_table_for_mysql().
dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(),
fil_prepare_for_truncate(), fil_reinit_space_header_for_table(),
row_truncate_table_for_mysql(), TruncateLogger,
row_truncate_prepare(), row_truncate_rollback(),
row_truncate_complete(), row_truncate_fts(),
row_truncate_update_system_tables(),
row_truncate_foreign_key_checks(), row_truncate_sanity_checks():
Remove.
row_upd_check_references_constraints(): Remove a check for
TRUNCATE, now that the table is no longer truncated in place.
The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some
race-condition like scenarios. The test innodb-innodb.truncate does
not use any synchronization.
We add a redo log subformat to indicate backup-friendly format.
MariaDB 10.4 will remove support for the old TRUNCATE logging,
so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve
limitations.
== Undo tablespace truncation ==
MySQL 5.7 implements undo tablespace truncation. It is only
possible when innodb_undo_tablespaces is set to at least 2.
The logging is implemented similar to the WL#6501 TRUNCATE,
that is, using separate log files and a redo log checkpoint.
We can simply implement undo tablespace truncation within
a single mini-transaction that reinitializes the undo log
tablespace file. Unfortunately, due to the redo log format
of some operations, currently, the total redo log written by
undo tablespace truncation will be more than the combined size
of the truncated undo tablespace. It should be acceptable
to have a little more than 1 megabyte of log in a single
mini-transaction. This will be fixed in MDEV-17138 in
MariaDB Server 10.4.
recv_sys_t: Add truncated_undo_spaces[] to remember for which undo
tablespaces a MLOG_FILE_CREATE2 record was seen.
namespace undo: Remove some unnecessary declarations.
fil_space_t::is_being_truncated: Document that this flag now
only applies to undo tablespaces. Remove some references.
fil_space_t::is_stopping(): Do not refer to is_being_truncated.
This check is for tablespaces of tables. Potentially used
tablespaces are never truncated any more.
buf_dblwr_process(): Suppress the out-of-bounds warning
for undo tablespaces.
fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero
page number (new size of the tablespace in pages) to inform
crash recovery that the undo tablespace size has been reduced.
fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2
can be written for undo tablespaces (without .ibd file suffix)
for a nonzero page number.
os_file_truncate(): Add the parameter allow_shrink=false
so that undo tablespaces can actually be shrunk using this function.
fil_name_parse(): For undo tablespace truncation,
buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[].
recv_read_in_area(): Avoid reading pages for which no redo log
records remain buffered, after recv_addr_trim() removed them.
trx_rseg_header_create(): Add a FIXME comment that we could write
much less redo log.
trx_undo_truncate_tablespace(): Reinitialize the undo tablespace
in a single mini-transaction, which will be flushed to the redo log
before the file size is trimmed.
recv_addr_trim(): Discard any redo logs for pages that were
logged after the new end of a file, before the truncation LSN.
If the rec_list becomes empty, reduce n_addrs. After removing
any affected records, actually truncate the file.
recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before
applying any log records. The undo tablespace files must be open
at this point.
buf_flush_or_remove_pages(), buf_flush_dirty_pages(),
buf_LRU_flush_or_remove_pages(): Add a parameter for specifying
the number of the first page to flush or remove (default 0).
trx_purge_initiate_truncate(): Remove the log checkpoints, the
extra logging, and some unnecessary crash points. Merge the code
from trx_undo_truncate_tablespace(). First, flush all to-be-discarded
pages (beyond the new end of the file), then trim the space->size
to make the page allocation deterministic. At the only remaining
crash injection point, flush the redo log, so that the recovery
can be tested.
fil_space_t::n_pending_ops, n_pending_ios: Use a combination of
fil_system.mutex and atomic memory access for protection.
fil_space_t::release(): Replaces fil_space_release().
Does not acquire fil_system.mutex.
fil_space_t::release_for_io(): Replaces fil_space_release_for_io().
Does not acquire fil_system.mutex.
Issue
=====
The original issue was that the size of a fil_per_table tablespace was calculated
incorrectly during truncate in the presence of an fts index. This incorrect calculation
was fixed as part of BUG#25053705 along with a testcase to reproduce the bug. The
assert that was added as part of it to reproduce the bug was wrong and resulted in
this bug.
Fix
===
Although the assert was removed earlier in a seperate commit as it was blocking the
ntest, this patch replaces the other parts of the code that were added to reproduce
the bug and replaces it with code that tries to reproduce the bug in a different way.
The new code basically tries to tweak conditions so as to simulate the random read
where a page that doesn't exist is tried to be read.
RB: 15890
Reviewed-by: Jimmy Yang <Jimmy.Yang@oracle.com>
Reviewed-by: Satya Bodapati <satya.bodapati@oracle.com>
Background thread is doing ibuf merge, in buf0rea.cc buf_read_ibuf_merge_pages().
It first tries to get page_size and if space is not found it deletes them, but
as we do not hold any mutexes, space can be marked as stopped between that
and buf_read_page_low() for same space. This naturally leads seen error
message on log.
buf_read_page_low(): Add parameter ignore_missing_space = false that
is passed to fil_io()
buf_read_ibuf_merge_pages(): call buf_read_page_low with
ignore_missing_space = true, this function will handle missing
space error code after buf_read_page_low returns.
fil_io(): if ignore_missing_space = true do not print error
message about trying to do I/0 for missing space, just return
correct error code that is handled later.
Page read could return DB_PAGE_CORRUPTED error that should
be reported and passed to upper layer. In case of unknown
error code we should print both number and string.
A similar change was contributed to Percona XtraBackup, but for some
reason, it is not present in Percona XtraDB. Since MDEV-9566
(MariaDB 10.1.23), that change is present in the MariaDB XtraDB.
The function buf_read_page_low() invokes fil_io(), which can return
DB_ERROR when the requested page is out of bounds (such as when
restoring a buffer pool dump). The callers should be handling that.
The function buf_read_page_low() invokes fil_io(), which can return
DB_ERROR when the requested page is out of bounds (such as when
restoring a buffer pool dump). The callers should be handling that.
This only merges MDEV-12253, adapting it to MDEV-12602 which is already
present in 10.2 but not yet in the 10.1 revision that is being merged.
TODO: Error handling in crash recovery needs to be improved.
If a page cannot be decrypted (or read), we should cleanly abort
the startup. If innodb_force_recovery is specified, we should
ignore the problematic page and apply redo log to other pages.
Currently, the test encryption.innodb-redo-badkey randomly fails
like this (the last messages are from cmake -DWITH_ASAN):
2017-05-05 10:19:40 140037071685504 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1635994
2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Missing MLOG_FILE_NAME or MLOG_FILE_DELETE before MLOG_CHECKPOINT for tablespace 1
2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[2201] with error Data structure corruption
2017-05-05 10:19:41 140037071685504 [Note] InnoDB: Starting shutdown...
i=================================================================
==5226==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x612000018588 in thread T0
#0 0x736750 in operator delete(void*) (/mariadb/server/build/sql/mysqld+0x736750)
#1 0x1e4833f in LatchCounter::~LatchCounter() /mariadb/server/storage/innobase/include/sync0types.h:599:4
#2 0x1e480b8 in LatchMeta<LatchCounter>::~LatchMeta() /mariadb/server/storage/innobase/include/sync0types.h:786:17
#3 0x1e35509 in sync_latch_meta_destroy() /mariadb/server/storage/innobase/sync/sync0debug.cc:1622:3
#4 0x1e35314 in sync_check_close() /mariadb/server/storage/innobase/sync/sync0debug.cc:1839:2
#5 0x1dfdc18 in innodb_shutdown() /mariadb/server/storage/innobase/srv/srv0start.cc:2888:2
#6 0x197e5e6 in innobase_init(void*) /mariadb/server/storage/innobase/handler/ha_innodb.cc:4475:3