As of CMake 3.24 CMAKE_COMPILER_IS_GNU(CC|CXX) are deprecated and should
be replaced with CMAKE_(C|CXX)_COMPILER_ID which were introduced with
CMake 2.6.
If the execution of the two reads in log_t::get_lsn_approx() is
interleaved with concurrent writes of those fields in
log_t::write_buf() or log_t::persist(), the returned approximation
will be an upper bound. If log_t::append_prepare_wait() is pending,
the approximation could be a lower bound.
We must adjust each caller of log_t::get_lsn_approx() for the
possibility that the return value is larger than
MAX(oldest_modification) in buf_pool.flush_list.
af_needed_for_redo(): Add a comment that explains why the glitch
is not a problem.
page_cleaner_flush_pages_recommendation(): Revise the logic for
the unlikely case cur_lsn < oldest_lsn. The original logic would have
invoked af_get_pct_for_lsn() with a very large age value, which
would likely cause an overflow of the local variable lsn_age_factor,
and make pct_for_lsn a "random number". Based on that value,
total_ratio would be normalized to something between 0.0 and 1.0.
Nothing extremely bad should have happened in this case;
the innodb_io_capacity_max should not be exceeded.
buf_pool_t::resize(): After successfully shrinking the buffer pool,
announce the success. The size had already been updated in shrunk().
After failing to shrink the buffer pool, re-enable the adaptive
hash index if it had been enabled.
Reviewed by: Debarun Banerjee
Clang ~16+ on MSAN became quite strict with uninitalized
data being passed and returned from functions. Non-debug builds
have a basic optimization that hides these from those builds
Two innodb cases violate the assumptions, however once inlined
with a basic optimization those that existed for uninitialized
values are removed.
(MDEV-36316) rec_set_bit_field_2 calling mach_read_from_2 hits a read of
bits it wasn't actually changing.
(MDEV-36327) The function dict_process_sys_columns_rec left
nth_v_col uninitialized unless it was a virtual column. This was
ok as the function i_s_sys_columns_fill_table also didn't read
this value unless it was a virtual column.
Problem:
=======
- In 10.11, During Copy algorithm, InnoDB does use bulk insert
for row by row insert operation. When temporary directory
ran out of memory, row_mysql_handle_errors() fails to handle
DB_TEMP_FILE_WRITE_FAIL.
- During inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fail to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely before
release the log memory entry
Fix:
===
row_mysql_handle_errors(): Rollback the transaction when
InnoDB encounters DB_TEMP_FILE_WRITE_FAIL
convert_error_code_to_mysql(): Report an aborted transaction
when InnoDB encounters DB_TEMP_FILE_WRITE_FAIL during
alter table algorithm=copy or innodb bulk insert operation
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing
the log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion that
blob pointer can be null during rollback because InnoDB can
ran out of space while allocating the external page
ha_innobase::extra(): Rollback the transaction during DDL before
calling convert_error_code_to_mysql().
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build index entry when rollbacking
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB ran out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion to
make that externally stored field can be null when InnoDB ran out
of space.
Problem:
=======
- During inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fail to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely before
release the log memory entry
Fix:
===
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing the
log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion that
blob pointer can be null during rollback because InnoDB can
ran out of space while allocating the external page
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build index entry when rollbacking
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB ran out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion to
make that externally stored field can be null when InnoDB ran out
of space.
ER_DUP_ENTRY on partitioned table
Now as c1492f3d07 (MDEV-36115) restores m_last_part table->file
points to partition p0 while the error happens in p1, so error index
does not match ib_table in innobase_get_mysql_key_number_for_index().
This case is handled by separate code block in
innobase_get_mysql_key_number_for_index() which was wrong on using
secondary index for dict_index_is_auto_gen_clust() and it was not
covered by the tests.
page_is_corrupted(): Do not allocate the buffers from stack,
but from the heap, in xb_fil_cur_open().
row_quiesce_write_cfg(): Issue one type of message when we
fail to create the .cfg file.
update_statistics_for_table(), read_statistics_for_table(),
delete_statistics_for_table(), rename_table_in_stat_tables():
Use a common stack buffer for Index_stat, Column_stat, Table_stat.
ha_connect::FileExists(): Invoke push_warning_printf() so that
we can avoid allocating a buffer for snprintf().
translog_init_with_table(): Do not duplicate TRANSLOG_PAGE_SIZE_BUFF.
Let us also globally enable the GCC 4.4 and clang 3.0 option
-Wframe-larger-than=16384 to reduce the possibility of introducing
such stack overflow in the future. For RocksDB and Mroonga we relax
these limits.
Reviewed by: Vladislav Lesin
In commit b6923420f3 (MDEV-29445)
some hash tables were accidentally created with the minimum size
(101 entries) instead of correctly deriving the size from the
initial innodb_buffer_pool_size. This led to very long hash bucket
chains, which are very slow to traverse.
ut_find_prime(): Assert that the size is nonzero in order to catch
this type of regression in the future.
innodb_init_params(): Do not bother reading buf_pool.curr_size()
when it is known to be 0,
srv_start(): Correctly initialize srv_lock_table_size to 5 times
buf_pool.curr_size(), that is, the buffer pool size in pages,
between invoking buf_pool.create() and lock_sys.create().
btr_search_enable(), dict_sys_t::create(), dict_sys_t::resize():
Correctly refer to buf_pool.curr_pool_size(), that is,
innodb_buffer_pool_size in bytes, when calculating the hash table size.
In MDEV-29445 the expressions buf_pool_get_curr_size() were
accidentally replaced with buf_pool.curr_size().
buf_buddy_shrink(): Properly cover the case when KEY_BLOCK_SIZE
corresponds to the innodb_page_size, that is, the ROW_FORMAT=COMPRESSED
page frame is directly allocated from the buffer pool, not via the
binary buddy allocator.
buf_LRU_check_size_of_non_data_objects(): Avoid a crash when the
buffer pool is being shrunk.
buf_pool_t::shrink(): Abort if over 95% of the shrunk buffer pool
would be occupied by the adaptive hash index or record locks.
In commit b6923420f3 (MDEV-29445)
we started to specify the MAP_POPULATE flag for allocating the
InnoDB buffer pool. This would cause a lot of time to be spent
on __mm_populate() inside the Linux kernel, such as 16 seconds
to pre-fault or commit innodb_buffer_pool_size=64G.
Let us revert to the previous way of allocating the buffer pool
at startup. Note: An attempt to increase the buffer pool size by
SET GLOBAL innodb_buffer_pool_size (up to innodb_buffer_pool_size_max)
will invoke my_virtual_mem_commit(), which will use MAP_POPULATE
to zero-fill and prefault the requested additional memory area, blocking
buf_pool.mutex.
Before MDEV-29445 we allocated the InnoDB buffer pool by invoking
mmap(2) once (via my_large_malloc()). After the change, we would
invoke mmap(2) twice, first via my_virtual_mem_reserve() and then
via my_virtual_mem_commit(). Outside Microsoft Windows, we are
reverting back to my_large_malloc() like allocation.
my_virtual_mem_reserve(): Define only for Microsoft Windows.
Other platforms should invoke my_large_virtual_alloc() and
update_malloc_size() instead of my_virtual_mem_reserve() and
my_virtual_mem_commit().
my_large_virtual_alloc(): Define only outside Microsoft Windows.
Do not specify MAP_NORESERVE nor MAP_POPULATE, to preserve compatibility
with my_large_malloc(). Were MAP_POPULATE specified, the mmap()
system call would be significantly slower, for example 18 seconds
to reserve 64 GiB upfront.
log_t::append_prepare_wait(): Do not attempt to read log_sys.write_lsn
because it is not protected by log_sys.latch but by write_lock, which
we cannot hold here. The assertion could fail if log_t::write_buf()
is executing concurrently, and it has not yet executed log_write_buf()
or updated log_sys.write_lsn.
Fixes up commit acd071f599 (MDEV-21923)
log_t::append_prepare_wait(): Do not attempt to read log_sys.write_lsn
because it is not protected by log_sys.latch but by write_lock, which
we cannot hold here. The assertion could fail if log_t::write_buf()
is executing concurrently, and it has not yet executed log_write_buf()
or updated log_sys.write_lsn.
Fixes up commit acd071f599 (MDEV-21923)
Valgrind is single threaded and only changes threads as part of
system calls or waits.
Some busy loops were identified and fixed where the server assumes
that some other thread will change the state, which will not happen
with valgrind.
Based on patch by Monty. Original patch introduced VALGRIND_YIELD,
which emits pthread_yield() only in valgrind builds. However it was
agreed that it is a good idea to emit yield() unconditionally, such
that other affected schedulers (like SCHED_FIFO) benefit from this
change. Also avoid pthread_yield() in favour of standard
std::this_thread::yield().
Reason:
=======
- MDEV-16239 does apply the DML logs after bulk insert for
ALTER TABLE..ALGORITHM=COPY, but InnoDB fails to reset the bulk_insert
in ha_innobase::extra(HA_EXTRA_END_ALTER_COPY). This leads to crash
while applying DML logs.
Solution:
=======
ha_innobase::extra(HA_EXTRA_END_ALTER_COPY): Reset TRX_DDL_BULK at the
end of bulk insert operation
A statement SET GLOBAL innodb_buffer_pool_size=...
could fail for no good reason when the buffer pool contains many
pages that can actually be evicted.
buf_flush_LRU_list_batch(): Keep evicting as long as the buffer pool
is being shrunk, for at most innodb_lru_scan_depth extra blocks.
Disregard the flush limit for pages that are marked as freed in files.
buf_flush_LRU_to_withdraw(): Update the to_withdraw target during
buf_flush_LRU_list_batch().
buf_pool_t::will_be_withdrawn(): Allow also ptr=nullptr (the condition
will not hold for it).
This fixes a regression that was introduced in
commit b6923420f3 (MDEV-29445)
and caught by the test innodb.temp_truncate_freed in MariaDB Server 11.4.
Tested by: Thirunarayanan Balathandayuthapani
Reviewed by: Thirunarayanan Balathandayuthapani
Set solution is to check if transaction, which modified a record, is
still active in lock_clust_rec_read_check_and_lock(). if yes, then just
request a lock. If no, then, depending on if the current transaction read
view can see the changes, return eighter DB_RECORD_CHANGED or request a
lock.
We can do the check in lock_clust_rec_read_check_and_lock() because
transaction tries to set a lock on the record which cursor points to after
transaction resuming and cursor position restoring. If the lock already
exists, then we don't request the lock again. But for the current commit
it's important that lock_clust_rec_read_check_and_lock() will be invoked
again for the same record, so we can do the check again after
transaction, which modified a record, was committed or rolled back.
MDEV-33802(4aa9291) is partially reverted. If some transaction holds
implicit lock on some record and transaction with snapshot isolation level
requests conflicting lock on the same record, it should be blocked instead
of returning DB_RECORD_CHANGED to have ability to continue execution when
implicit lock owner is rolled back.
The construction
--------------------------------------------------------------------------
let $wait_condition=
select count(*) = 1 from information_schema.processlist
where state = 'Updating' and info = 'UPDATE t SET b = 2 WHERE a';
--source include/wait_condition.inc
--------------------------------------------------------------------------
is not reliable enought to make sure transaction is blocked in test
case, the test failed sporadically with
--------------------------------------------------------------------------
./mtr --max-test-fail=1 --parallel=96 lock_isolation{,,,,,,,}{,,,}{,,} \
--repeat=500
--------------------------------------------------------------------------
command. That's why it was replaced with debug sync-points.
Reviewed by: Marko Mäkelä
during mariabackup --prepare
Reason:
======
During --prepare of partial backup, if InnoDB encounters the redo log
for the excluded tablespace then InnoDB stores the space id in dirty
tablespace list during recovery, anticipates that it may encounter
FILE_* redo log records in the future. Even though we encounter FILE_*
record for the partial excluded tablespace then we fail to replace the
name in dirty tablespace list. This lead to missing of
FILE_* redo log records error.
Solution:
========
fil_name_process(): Rename the file name from "" to name encountered
during FILE_* record
recv_init_missing_space(): Correct the condition to print the warning
message of missing tablespace during mariabackup restore process.
- InnoDB fails to check the table is being dropped or evicted
while acquiring the MDL for the table when table open operation
mode is DICT_TABLE_OP_OPEN_ONLY_IF_CACHED. This is caused by
the commit 337bf8ac4b (MDEV-36122)
Fix:
===
dict_acquire_mdl_shared(): If the table is evicted or dropped when
table operation mode is DICT_TABLE_OP_OPEN_IF_CACHED then return
nullptr
ha_innobase::info_low(): Assert that dict_table_t::stat_initialized()
only within the critical section. Changes of this field should be
protected by dict_table_t::lock_latch.
Problem:
========
- After commit cc8eefb0dc (MDEV-33087),
InnoDB does use bulk insert operation for ALTER TABLE.. ALGORITHM=COPY
and CREATE TABLE..SELECT as well. InnoDB fails to clear the bulk
buffer when it encounters error during CREATE..SELECT. Problem
is that while transaction cleanup, InnoDB fails to identify
the bulk insert for DDL operation.
Fix:
====
- Represent bulk_insert in trx by 2 bits. By doing that, InnoDB
can distinguish between TRX_DML_BULK, TRX_DDL_BULK. During DDL,
set bulk insert value for transaction to TRX_DDL_BULK.
- Introduce a parameter HA_EXTRA_ABORT_ALTER_COPY which rollbacks
only TRX_DDL_BULK transaction.
- bulk_insert_apply() happens for TRX_DDL_BULK transaction happens
only during HA_EXTRA_END_ALTER_COPY extra() call.
...when using io_uring on a potentially affected kernel"
Remove version check on the kernel as it now corresponds to
a working RHEL9 kernel and the problem was only there in
pre-release kernels that shouldn't have been used in production.
This reverts commit 1193a793c4.
Remove version check on the kernel as it now corresponds to
a working RHEL9 kernel and the problem was only there in
pre-release kernels that shouldn't have been used in production.
This reverts commit 3dc0d884ec.
Starting with mysql/mysql-server@02f8eaa998
and commit 2e814d4702 the index ID of
indexes on virtual columns was being encoded insufficiently in
InnoDB undo log records. Only the least significant 32 bits were
being written. This could lead to some corruption of the affected
indexes on ROLLBACK, as well as to missed chances to remove some
history from such indexes when purging the history of committed
transactions that included DELETE or an UPDATE in the indexes.
dict_hdr_create(): In debug instrumented builds, initialize the
DICT_HDR_INDEX_ID close to the 32-bit barrier, instead of initializing
it to DICT_HDR_FIRST_ID (10). This will allow the changed code to
be exercised while running ./mtr --suite=gcol,vcol.
trx_undo_log_v_idx(): Encode large index->id in a similar way as
mysql/mysql-server@e00328b4d0
but using a different implementation.
trx_undo_read_v_idx_low(): Decode large index->id in a similar way
as mach_u64_read_much_compressed().
Reviewed by: Debarun Banerjee
Added retry logic to certain file operations during installation as a
workaround for issues caused by buggy antivirus software on Windows.
Retry logic added for WritePrivateProfileString (mysql_install_db.cc)
and renaming file in Innodb.
Problem was that thread was holding lock_sys.wait_mutex when
streaming replication transaction rollback was handled and
in wsrep-lib requests THD::LOCK_thd_kill mutex causing
wrong mutex usage (thd->reset_globals()).
Fix is to remove streaming replication rollback handling
from Deadlock::report() i.e. wsrep_handle_SR_rollback call.
Purpose of Deadloc::report() is to find a cycle in the
waits-for graph if exists, report it, mark victim transaction
as deadlock victim and release locks it is waiting for.
Actual streaming replication rollback that can take longer
time can be handled later at trx_t::rollback where
lock_sys.wait_mutex is not held.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
log_t::clear_mmap(): Do not modify buf_size; we may have
file_size==0 here during bootstrap.
log_t::set_recovered(): If we are writing to a memory-mapped log,
update log_sys.buf_size to the record payload area of log_sys.buf.
This fixes up commit acd071f599
(MDEV-21923).
buf_buddy_alloc_from(): Pass the correct argument to
buf_pool.contains_zip(). This fixes a failure of the test
encryption.innochecksum when the code is built with
cmake -DWITH_UBSAN=ON -DCMAKE_BUILD_TYPE=Debug
- With the help of MDEV-14795, InnoDB implemented a way to shrink
the InnoDB system tablespace after undo tablespaces have been moved
to separate files (MDEV-29986). There is no way to defragment any
pages of InnoDB system tables. By doing that, shrinking of
system tablespace can be more effective. This patch deals with
defragment of system tables inside ibdata1.
Following steps are done to do the defragmentation of system
tablespace:
1) Make sure that there is no user tables exist in ibdata1
2) Iterate through all extent descriptor pages in system tablespace
and note their states.
3) Find the free earlier extent to replace the lastly used
extents in the system tablespace.
4) Iterate through all indexes of system tablespace and defragment
the tree level by level.
5) Iterate the level from left page to right page and find out
the page comes under the extent to be replaced. If it is then
do step (6) else step(4)
6) Prepare the allocation of new extent by latching necessary
pages. If any error happens then there is no modification of
page happened till step (5).
7) Allocate the new page from the new extent
8) Prepare the associated pages for the block to be modified
9) Prepare the step of freeing of page
10) If any error happens during preparing of associated pages,
freeing of page then restore the page which was modified
during new page allocation
11) Copy the old page content to new page
12) Change the associative pages like left, right and parent page
13) Complete the freeing of old page
Allocation of page from new extent, changing of relative pages,
freeing of page are done by 2 steps. one is prepare which
latches the to be modified pages and checks their validation.
Other is complete(), Do the operation
fseg_validate(): Validate the list exist in inode segment
Defragmentation is enabled only when :autoextend exist in
innodb_data_file_path variable.
The parameter innodb_log_spin_wait_delay will be deprecated and
ignored, because there is no spin loop anymore.
Thanks to commit 685d958e38
and commit a635c40648
multiple mtr_t::commit() can concurrently copy their slice of
mtr_t::m_log to the shared log_sys.buf. Each writer would allocate
their own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch. This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.
It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub() as well as a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later)
can translate into loop-free code on AMD64.
Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.
log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.
log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.
log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove.
Replaced by base_lsn and write_lsn_offset.
log_t::buf_size: Always reflects the usable size in append_prepare().
log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.
log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.
log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.
recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.
In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().
Reviewed by: Debarun Banerjee
recv_sys_t::report_progress(): Display the largest currently known LSN.
recv_scan_log(): Display an error with fewer function calls.
Reviewed by: Debarun Banerjee
buf_block_t::initialise(): Remove a redundant call to page.lock.init()
that was already executed in buf_pool_t::create() or
buf_pool_t::resize().
This fixes a regression that was introduced in
commit b6923420f3 (MDEV-29445).
- fix several Windows-specific "variable set but not used",
or "variable unused" warnings.
- correctly initialize std::atomic_flag (ATOMIC_FLAG_INIT)
- fix Ninja build for spider on Windows
- adjust check for sizeof(MYSQL) for Windows compilers
Problem:
=======
- While loading the foreign key constraints for the parent table,
if child table wasn't open then InnoDB uses the parent table heap
to store the child table name in fk_tables list. If the consecutive
foreign key relation for the parent table fails with error,
InnoDB evicts the parent table from memory. But InnoDB accesses the
evicted table memory again in dict_sys.load_table()
Solution:
========
dict_load_table_one(): In case of error, remove the child table
names which was added during dict_load_foreigns()
Problem:
========
- InnoDB does consecutive instant alter operation, first instant DDL
fails, it fails to reset the old instant information in table during
rollback. This lead to consecutive instant alter to have wrong
assumption about the exisitng instant column information.
Fix:
====
dict_table_t::instant_column(): Duplicate the instant information
field of the table. By doing this, InnoDB alter retains the old
instant information and reset it during rollback operation