log_buf_pool_get_oldest_modification(): Acquire
log_sys_t::flush_order_mutex in order to prevent a race condition
that was introduced in
commit 1a6f708ec5 (MDEV-15058).
Before that change, log_buf_pool_get_oldest_modification()
was protected by both log_sys.mutex and log_sys.flush_order_mutex
like it was supposed to be ever since
commit a52c4820a3 (MySQL 5.5.10).
buf_pool_t::get_oldest_modification(): Replaces
buf_pool_get_oldest_modification(), to emphasize that
log_sys.flush_order_mutex must be acquired by the caller if needed.
log_close(): Invoke log_buf_pool_get_oldest_modification()
in order to ensure a clean shutdown.
The scenario of the race condition is as follows:
1. The buffer pool is clean (no writes are pending).
2. mtr_add_dirtied_pages_to_flush_list() releases log_sys.mutex.
3. log_buf_pool_get_oldest_modification() observes that the
buffer pool is clean and returns log_sys.lsn.
4. log_checkpoint() completes, writing a wrong checkpoint header
according to which everything up to log_sys.lsn was clean.
5. mtr_add_dirtied_pages_to_flush_list() completes the execution
of mtr_memo_note_modifications(), releases the page latches and
the flush_order_mutex.
6. On a subsequent log_checkpoint(), the assertion could fail
if the page modifications had not been flushed yet.
The failing assertion (which is valid) was added in MySQL 5.7
mysql/mysql-server@5c6c6ec693
and merged to MariaDB Server 10.2.2 in
commit fec844aca8.
Problem:
=======
While evicting the uncompressed page from buffer pool, InnoDB writes
the checksum for the compressed page in buf_LRU_free_page().
So while flushing the compressed page, checksum validation fails
when innodb_checksum_algorithm variable changed to strict_none.
Solution:
========
- Calculate the checksum only during flushing of page. Removed the
checksum write in buf_LRU_free_page().
If the InnoDB buffer pool contains many pages for a table or index
that is being dropped or rebuilt, and if many of such pages are
pointed to by the adaptive hash index, dropping the adaptive hash index
may consume a lot of time.
The time-consuming operation of dropping the adaptive hash index entries
is being executed while the InnoDB data dictionary cache dict_sys is
exclusively locked.
It is not actually necessary to drop all adaptive hash index entries
at the time a table or index is being dropped or rebuilt. We can let
the LRU replacement policy of the buffer pool take care of this gradually.
For this to work, we must detach the dict_table_t and dict_index_t
objects from the main dict_sys cache, and once the last
adaptive hash index entry for the detached table is removed
(when the garbage page is evicted from the buffer pool) we can free
the dict_table_t and dict_index_t object.
Related to this, in MDEV-16283, we made ALTER TABLE...DISCARD TABLESPACE
skip both the buffer pool eviction and the drop of the adaptive hash index.
We shifted the burden to ALTER TABLE...IMPORT TABLESPACE or DROP TABLE.
We can remove the eviction from DROP TABLE. We must retain the eviction
in the ALTER TABLE...IMPORT TABLESPACE code path, so that in case the
discarded table is being re-imported with the same tablespace identifier,
the fresh data from the imported tablespace will replace any stale pages
in the buffer pool.
rpl.rpl_failed_drop_tbl_binlog: Remove the test. DROP TABLE can
no longer be interrupted inside InnoDB.
fseg_free_page(), fseg_free_step(), fseg_free_step_not_header(),
fseg_free_page_low(), fseg_free_extent(): Remove the parameter
that specifies whether the adaptive hash index should be dropped.
btr_search_lazy_free(): Lazily free an index when the last
reference to it is dropped from the adaptive hash index.
buf_pool_clear_hash_index(): Declare static, and move to the
same compilation unit with the bulk of the adaptive hash index
code.
dict_index_t::clone(), dict_index_t::clone_if_needed():
Clone an index that is being rebuilt while adaptive hash index
entries exist. The original index will be inserted into
dict_table_t::freed_indexes and dict_index_t::set_freed()
will be called.
dict_index_t::set_freed(), dict_index_t::freed(): Note that
or check whether the index has been freed. We will use the
impossible page number 1 to denote this condition.
dict_index_t::n_ahi_pages(): Replaces btr_search_info_get_ref_count().
dict_index_t::detach_columns(): Move the assignment n_fields=0
to ha_innobase_inplace_ctx::clear_added_indexes().
We must have access to the columns when freeing the
adaptive hash index. Note: dict_table_t::v_cols[] will remain
valid. If virtual columns are dropped or added, the table
definition will be reloaded in ha_innobase::commit_inplace_alter_table().
buf_page_mtr_lock(): Drop a stale adaptive hash index if needed.
We will also reduce the number of btr_get_search_latch() calls
and enclose some more code inside #ifdef BTR_CUR_HASH_ADAPT
in order to benefit cmake -DWITH_INNODB_AHI=OFF.
On a checksum failure of a ROW_FORMAT=COMPRESSED page,
buf_LRU_free_one_page() would invoke buf_LRU_block_remove_hashed()
which will read the uncompressed page frame, although it would not
be initialized. With bad enough luck, fil_page_get_type(page)
could return an unrecognized value and cause the server to abort.
buf_page_io_complete(): On the corruption of a ROW_FORMAT=COMPRESSED
page, zerofill the uncompressed page frame.
ibuf_read_merge_pages(): Request a possibly freed page.
The change buffer is discarded lazily for freed pages either
by this function or when buf_page_create() reuses a page.
buf_page_get_low(): Relax a debug assertion.
Do not attempt change buffer merge on freed pages.
ibuf_merge_or_delete_for_page(): Assert that the page state is NORMAL.
INIT_ON_FLUSH is not possible, because in that case buf_page_create()
should have removed any buffered changes for the page.
buf_page_get_gen(): Apply buffered changes also in the case when
we can avoid reading the page based on buffered redo log records.
This addresses a hard-to-reproduce scenario that was broken in
commit 6697135c6d.
Maybe this patch will help catch problems like buffer overflow.
log_t::first_in_use: removed
log_t::buf: this is where mtr_t are supposed to append data
log_t::flush_buf: this is from server writes to a file
Those two buffers are std::swap()ped when some thread is gonna write
to a file
commit 121a5e8d07 revised the function
buf_pool_watch_unset() in such a way that the debug field
buf_page_t::in_page_hash is no longer protected by buf_pool.mutex
and thus not safe to access by the debug assertion in
buf_pool_watch_set().
For now, let us revert the change to buf_pool_watch_unset()
and have it acquire the buf_pool.mutex for a longer time.
redo log during recovery
- InnoDB unnecessarily reads the page even though it has fully initialized
buffered redo log records. Allow the page initialization redo log to
apply for the page in buf_page_get_gen() during recovery.
- Renamed buf_page_get_gen() to buf_page_get_low()
- Newly added buf_page_get_gen() will check for buffered redo log for
the particular page id during recovery
- Added new function buf_page_mtr_lock() which basically latches the page
for the given latch type.
- recv_recovery_create_page() is inline function which creates a page
if it has page initialization redo log records.
Remove unnecessary buf_pool_t:: qualifiers. In comments,
replace buf_pool::mutex with buf_pool.mutex.
Remove an outdated comment about a planned buffer pool resizing feature.
It is already implemented in MariaDB 10.2.2 (and MySQL 5.7.9).
In my micro-benchmarks memcmp(4196) 3 times faster than old
implementation. Also, it's generally better to use as less
reinterpret_casts<> as possible.
buf_is_zeroes(): renamed from buf_page_is_zeroes() and
argument changed to span<> for convenience.
st_::span<T>::const_iterator: fixed
page_zip-verify_checksum(): make argument byte* instead of void*
buf_pool_resize(): Simplify the fault injection
for innodb.buf_pool_resize_oom.
innodb.buf_pool_resize_oom: Use a small buffer pool.
innodb.innodb_buffer_pool_load_now: Make use of the sequence engine,
to avoid creating explicit InnoDB record locks. Clean up the
accesses to information_schema.innodb_buffer_page_lru.
Adapt from 10.2: git cherry-pick bfb5e1c3f0
buf_pool_t::chunk_t::create(), buf_pool_t::resize():
Restore or simplify the debug instrumentation.
buf_resize_callback(): Add DBUG_ENTER/DBUG_VOID_RETURN so that
the DBUG_EXECUTE_IF in buf_pool_t::resize() can be triggered.
Thanks to MDEV-15058, there is only one InnoDB buffer pool.
Allocating buf_pool statically removes one level of pointer indirection
and makes code more readable, and removes the awkward initialization of
some buf_pool members.
While doing this, we will also declare some buf_pool_t data members
private and replace some functions with member functions. This is
mostly affecting buffer pool resizing.
This is not aiming to be a complete rewrite of buf_pool_t to
a proper class. Most of the buffer pool interface, such as
buf_page_get_gen(), will remain in the C programming style
for now.
buf_pool_t::withdrawing: Replaces buf_pool_withdrawing.
buf_pool_t::withdraw_clock_: Replaces buf_withdraw_clock.
buf_pool_t::create(): Repalces buf_pool_init().
buf_pool_t::close(): Replaces buf_pool_free().
buf_bool_t::will_be_withdrawn(): Replaces buf_block_will_be_withdrawn(),
buf_frame_will_be_withdrawn().
buf_pool_t::clear_hash_index(): Replaces buf_pool_clear_hash_index().
buf_pool_t::get_n_pages(): Replaces buf_pool_get_n_pages().
buf_pool_t::validate(): Replaces buf_validate().
buf_pool_t::print(): Replaces buf_print().
buf_pool_t::block_from_ahi(): Replaces buf_block_from_ahi().
buf_pool_t::is_block_field(): Replaces buf_pointer_is_block_field().
buf_pool_t::is_block_mutex(): Replaces buf_pool_is_block_mutex().
buf_pool_t::is_block_lock(): Replaces buf_pool_is_block_lock().
buf_pool_t::is_obsolete(): Replaces buf_pool_is_obsolete().
buf_pool_t::io_buf: Make default-constructible.
buf_pool_t::io_buf::create(): Delayed 'constructor'
buf_pool_t::io_buf::close(): Early 'destructor'
HazardPointer: Make default-constructible. Define all member functions
inline, also for derived classes.
When a InnoDB data file page is freed, its contents becomes garbage,
and any storage allocated in the data file is wasted. During flushing,
InnoDB initializes the page with zeros if scrubbing is enabled. If the
tablespace is compressed then InnoDB should punch a hole else ignore the
flushing of the freed page.
buf_page_t:
- Replaced the variable file_page_was_freed, init_on_flush in buf_page_t
with status enum variable.
- Changed all debug assert of file_page_was_freed to DBUG_ASSERT
of buf_page_t::status
Removed buf_page_set_file_page_was_freed(),
buf_page_reset_file_page_was_freed().
buf_page_free(): Newly added function which takes X-lock on the page
before marking the status as FREED. So that InnoDB flush handler can
avoid concurrent flush of the freed page. Also while flushing the page,
InnoDB make sure that redo log which does freeing of the page also written
to the disk. Currently, this function only marks the page as FREED if
it is in buffer pool
buf_flush_freed_page(): Newly added function which initializes zeros
asynchorously if innodb_immediate_scrub_data_uncompressed is enabled.
Punch a hole to the file synchorously if page_compressed is enabled.
Reset the io_fix to NORMAL. Release the block from flush list and
associated mutex before writing zeros or punch a hole to the file.
buf_flush_page(): Removed the unnecessary usage of temporary
variable "flush"
fil_io(): Introduce new parameter called punch_hole. It allows fil_io()
to punch the hole to the file for the given offset.
buf_page_create(): Let the callers assign buf_page_t::status.
Every caller should eventually invoke mtr_t::init().
fsp_page_create(): Remove the unused mtr_t parameter.
In all other callers of buf_page_create() except fsp_page_create(),
before invoking mtr_t::init(), invoke
mtr_t::sx_latch_at_savepoint() or mtr_t::x_latch_at_savepoint().
mtr_t::init(): Initialize buf_page_t::status also for the temporary
tablespace (when redo logging is disabled), to avoid assertion failures.
Some fields were protected by log_sys.mutex, which adds quite some
overhead for readers. Some readers were submitting dirty reads.
log_t::lsn: Declare private and atomic. Add wrappers get_lsn()
and set_lsn() that will use relaxed memory access. Many accesses
to log_sys.lsn are still protected by log_sys.mutex; we avoid the
mutex for some readers.
log_t::flushed_to_disk_lsn: Declare private and atomic, and move
to the same cache line with log_t::lsn.
log_t::buf_free: Declare as size_t, and move to the same cache line
with log_t::lsn.
log_t::check_flush_or_checkpoint_: Declare private and atomic,
and move to the same cache line with log_t::lsn.
log_get_lsn(): Define as an alias of log_sys.get_lsn().
log_get_lsn_nowait(), log_peek_lsn(): Remove.
log_get_flush_lsn(): Define as an alias of log_sys.get_flush_lsn().
log_t::initiate_write(): Replaces log_buffer_sync_in_background().