The test in commit 1756b0f37d
is occasionally failing if there are unexpectedly many page cleaner
batches that are updating the log checkpoint by small amounts.
This occurs in particular when running the server under Valgrind.
Let us insert the same number of records with a larger number of
statements in a hope that the test would then be more likely to pass.
innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size that
a Linux memory pressure event can lead to shrinking the buffer pool to.
On a memory pressure event, we will attempt to shrink
innodb_buffer_pool_size halfway between its current value and
innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is
specified as 0 or not specified on startup, its default value will be
adjusted to innodb_buffer_pool_size_max, that is, memory pressure events
will be disregarded by default.
buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink
the buffer pool in response to a memory pressure event.
Reviewed by: Debarun Banerjee
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size to be changed in arbitrary 1-megabyte
increments.
innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size. If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.
The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.
Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement
a client disconnect, or server shutdown. If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.
SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows. On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Redident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).
The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.
Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent. The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.
We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the of extents size.
The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default
innodb_page_size=16k this corresponds to 5 MiB. However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.
my_large_virtual_alloc(): A new function, similar to my_large_malloc().
my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.
After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified. This option is hard-coded on Microsoft Windows,
where VirtualMemory(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately. On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size. Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.
opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.
buf_pool_t::chunk_t::create(): Remove.
buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.
buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().
buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.
buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().
buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().
buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.
buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.
buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.
buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.
buf_pool_t::is_shrinking(): Return the total number or blocks that
were withdrawn or are to be withdrawn.
buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.
buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.
buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.
buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().
innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.
buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.
buf_buddy_condense_free(): Do not relocate blocks.
buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().
buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.
buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().
buf_buddy_alloc_from(): Remove the redundant parameter j.
buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.
buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.
trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.
Reviewed by: Debarun Banerjee
Reason:
=======
- InnoDB DML commit aborts the server when InnoDB does online
virtual index. During online DDL, concurrent DML commit
operation does read the undo log record and their related
current version of the clustered index record. Based on the
operation, InnoDB do build the old tuple and new tuple for
the table. If the concurrent online index can be affected
by the operation, InnoDB does build the entry
for the index and log the operation.
Problematic case is update operation, InnoDB does build the
update vector. But while building the old row, InnoDB fails
to fill the non-affected virtual column. This lead to
server abort while build the entry for index.
Fix:
===
- First, fill the virtual column entries for the new row.
Duplicate the old row based on new row and change only the
affected fields in old row based on the update vector.
ha_innobase::statistics_init(), ha_innobase::info_low():
Correctly handle a DB_READ_ONLY return value from dict_stats_save().
Fixes up commit 6e6a1b316c (MDEV-35000)
Let us integrate the test case with innodb.page_cleaner so that there
will be less interference from log writes due to checkpoints.
Also, make the test compatible with ./mtr --cursor-protocol.
This will make mysql-test-run.pl try to schedule these long-running
(> 60 seconds) tests early in --parallel runs, which helps avoid that
the testsuite gets stuck with a few long-running tests at the end
while most other test workers are idle.
This speed up mtr --parallel=96 with 25 seconds for me.
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
In commit 6e6a1b316c (MDEV-35000)
a race condition was exposed.
ha_innobase::check_if_incompatible_data(): If the statistics have
already been initialized for the table, skip the invocation of
innobase_copy_frm_flags_from_create_info() in order to avoid
unexpectedly ruining things for other threads that are concurrently
accessing the table.
dict_stats_save(): Add debug instrumentation that is necessary for
reproducing the interlocking of the failure scenario.
stats_deinit(): Replaces dict_stats_deinit().
Deinitialize the statistics for persistent tables,
so that they will be reloaded or recalculated
on a subsequent ha_innobase::open().
ha_innobase::rename_table(): Invoke stats_deinit() so that the
subsequent ha_innobase::open() will reload the InnoDB persistent
statistics. That is, it will remain possible to have the InnoDB
persistent statistics reloaded by executing the following:
RENAME TABLE t TO tmp, tmp TO t;
dict_table_close(table): Replaced with table->release().
There will no longer be any logic that would attempt to ensure
that the InnoDB persistent statistics will be reloaded after
FLUSH TABLES has been executed. This also fixes the problem that
dict_table_t::stat_modified_counter would be frequently reset to 0,
whenever ha_innobase::open() is invoked after the table reference
count had dropped to 0.
dict_table_close(table, thd, mdl): Remove the parameter "dict_locked".
Do not try to invalidate the statistics.
ha_innobase::statistics_init(): Replaces dict_stats_init(table).
Reviewed by: Thirunarayanan Balathandayuthapani
row_parse_int(): Refactor the code and define the function static in
one compilation unit. For any negative values, we must return 0.
row_search_get_max_rec(), row_search_max_autoinc(): Moved to the same
compilation unit with row_parse_int().
We also remove a work-around of an internal compiler error when
targeting ARMv8 on GCC 4.8.5, a compiler that is no longer supported.
Reviewed by: Debarun Banerjee
The test fails trying to compare (innodb/lock)_row_lock_time_avg with
some limit. We can't predict (innodb/lock)_row_lock_time_avg value,
because it's counted as the whole waiting time divided by the amount of
waits. Both waiting time and amount of waits depend on the previous
tests execution. The corresponding counters in lock_sys can't be reset
with any query. Remove (innodb/lock)_row_lock_time_avg comparision from
the test.
information_schema.global_status.innodb_row_lock_time can't be reset,
compare its difference instead of absolute value.
Reviewed by: Marko Mäkelä
When the first attempt of XA ROLLBACK is expected to fail,
some recovered changes could be written back through the doublewrite
buffer. Should that happen, the next recovery attempt (after mangling
the data file t1.ibd further) could fail because no copy of the
affected pages would be available in the doublewrite buffer.
To prevent this from happening, ensure that the doublewrite buffer
will not be used and no log checkpoint occurs during the previous
failed recovery attempt.
Also, let a successful XA ROLLBACK serve the additional purpose of
freeing a BLOB page and therefore rewriting page 0, which we must then
be able to recover despite induced corruption. In the last restart step,
we will tolerate an unexpected checkpoint, because one is frequently
occurring on FreeBSD and AIX, despite our efforts to force a buffer pool
flush before each "no checkpoint" section.
Problem:
=======
- Import tablespace fails to check the index fields descending
property while matching the schema given in cfg file with the
table schema.
Fix:
===
row_quiesce_write_index_fields(): Write the descending
property of the field into field fixed length field.
Since the field fixed length uses only 10 bits,
InnoDB can use 0th bit of the field fixed length
to store the descending field property.
row_import_cfg_read_index_fields(): Read the field
descending information from field fixed length.
This commit implements
mysql/mysql-server@7037a0bdc8
functionality.
If some transaction 't' requests not-gap X-lock 'Xt' on record 'r', and
locks list of the record 'r' contains not-gap granted S-lock 'St' of
transaction 't', followed by not-gap waiting locks WB={Wb1,
Wb2, ..., Wbn} conflicting with 'Xt', and 'Xt' does not conflict with any
other lock, located in the list after 'St', then grant 'Xt'. Note that
insert-intention locks are also gap locks.
If some transaction 't' holds not-gap lock 'Lt' on record 'r', and some
other transactions have not-gap continuous waiting locks sequence
L(B)={L(b1), L(b2), ..., L(bn)} following L(t) in
the list of locks for the record 'r', and transaction 't' requests not-gap,
what means also not insert intention, as ii-locks are also gap locks,
X-lock conflicting with any lock in L(B), then grant the.
MySQL's commit contains the following explanation of why insert-intention
locks must not overtake a waiting ordinary or gap locks:
"It is important that this decission rule doesn't allow
INSERT_INTENTION locks to overtake WAITING locks on gaps (`S`, `S|GAP`,
`X`, `X|GAP`), as inserting a record into a gap would split such WAITING
lock, violating the invariant that each transaction can have at most
single WAITING lock at any time."
I would add to the explanation the following. Suppose we have trx 1 which
holds ordinary X-lock on some record. And trx 2 executes "DELETE FROM t"
or "SELECT * FOR UPDATE" in RR(see lock_delete_updated.test and
MDEV-27992), i.e. it creates waiting ordinary X-lock on the same record.
And then trx 1 wants to insert some record just before the locked record.
It requests insert-intention lock, and if the lock overtakes trx 2 lock,
there will be phantom records for trx 2 in RR. lock_delete_updated.test
shows how "DELETE" allows to insert some records in already scanned gap
and misses some records to delete.
The current implementation differs from MySQL implementation. There are
two key differences:
1. Lock queue ordering. In MySQL all waiting locks precede all granted
locks. A new waiting lock is added to the head of the queue, a new
granted lock is added to the end of the queue, if some waiting lock
is granted, it's moved to the end of the queue. In MariaDB any new
lock is added to the end of the queue and waiting lock does not change
its position in the queue where the lock is granted. The rule is that
blocking lock must be located before blocked lock in lock queue. We
maintain the rule with inserting bypassing lock just before bypassed
one.
2. MySQL implementation uses some object(locksys::Trx_locks_cache) which
can be passed to consecutive calls to rec_lock_has_to_wait() for the
same trx and heap_no to cache the result of checking if trx has a
granted lock which is blocking the waiting lock(see
locksys::Trx_locks_cache::has_granted_blocker()). The current
implementation does not use such object, because it looks for such
granted lock on the level of lock_rec_other_has_conflicting() and
lock_rec_has_to_wait_in_queue(). I.e. there is no need in additional
lock queue iteration in
locksys::Trx_locks_cache::has_granted_blocker(), as we already iterate
it in lock_rec_other_has_conflicting() and
lock_rec_has_to_wait_in_queue().
During the testing the following case was found. Suppose we have
delete-marked record and going to do inplace insert into
that delete-marked record. Usually we don't create explicit lock if
there are no conlicting with not gap X-lock locks(see
lock_clust_rec_modify_check_and_lock(), btr_cur_update_in_place()). The
implicit lock will be converted to explicit one by demand.
That can happen during INSERT, the not-gap S-lock can
be acquired on searching for duplicates(see
row_ins_duplicate_error_in_clust()), and, if delete-marked record is
found, inplace insert(see btr_cur_upd_rec_in_place()) modifies the
record, what is treated as implicit lock.
But there can be a case when some transaction trx1 holds not-gap S-lock,
another transaction trx2 creates waiting X-lock, and then trx2 tries to
do inplace insert. Before the fix the waiting X-lock of trx2 would be
conflicting lock, and trx1 would try to create explicit X-lock, what
would cause deadlock, and one of the transactions whould be rolled back.
But after the fix, trx2 waiting X-lock is not treated as conflicting
with trx1 X-lock anymore, as trx1 already holds S-lock. If we don't create
explicit lock, then some other transaction trx3 can create it during
implicit to explicit lock conversion and place it at the end of the
queue. So there can be the following locks order in the queue:
S1(granted) X2(waiting) X1(granted)
The above queue is not valid, because all granted trx1 locks must be
placed before waiting trx2 lock. Besides, lock_rec_release_try() can
remove S(granted, trx1) lock and grant X lock to trx 2, and there can be
two granted X-locks on the same record:
X2(granted) X1(granted)
Taking into account that lock_rec_release_try() can release cell and
lock_sys latches leaving some locks unreleased, the queue validation
function can fail in any unexpected place.
It can be fixed with two ways:
1) Place explicit X(granted, trx1) lock before X(waiting, trx2) lock
during implicit to explicit lock conversion. This option is implemented
in MySQL, as granted lock is always placed at the top of locks queue,
and waiting locks are placed at the bottom of the queue. MariaDB does
not do this, and implementing this variant would require conflicting
locks search before converting implicit to explicit lock, what, in
turns, would require cell and/or lock_sys latch acquiring.
2) Create and place X(granted, trx1) lock before X(waiting, trx2) during
inplace INSERT, i.e. when lock_rec_lock() is invoked from
lock_clust_rec_modify_check_and_lock() or
lock_sec_rec_modify_check_and_lock(), if X(waiting, trx2) is
bypassed. Such a way we don't need in additional conflicting locks
search, as they are searched anyway in lock_rec_low().
This fix implements the second variant(see the changes around
c_lock_info.insert_after in lock_rec_lock). I.e. if some record was
delete-marked and we do inplace insert in such a record, and some lock for
bypass was found, create explicit lock to avoid conflicting lock search on
each implicit to explicit lock conversion. We can remove it if MDEV-35624
is implemented.
lock_rec_other_has_conflicting(), lock_rec_has_to_wait_in_queue():
search locks to bypass along with conflicting locks searching in the
same loop. The result is returned in conflicting_lock_info object.
There can be several locks to bypass, only the first one is returned to
limit lock_rec_find_similar_on_page() with the first bypassed lock to
preserve "blocking before blocked" invariant. conflicting_lock_info also
contains a pointer to the lock, after which we can insert bypassing
lock. This lock precedes bypassed one.
Bypassing lock can be next-key lock, and the following cases are
possible:
1. S1(not-gap, granted) II2(granted) X3(waiting for S1),
When new X1(ordinary) lock is acquired, there will be the following
locks queue:
S1(not-gap, granted) II2(granted) X1(ordinary, granted) X3(waiting for
S1)
If we had inserted new X1 lock just after S1, and S1 had been released
on transaction commit or rollback, we would have the following
sequence in the locks queue:
X1(ordinary, granted) II2(granted) X3(waiting for X1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is not a real issue as II lock once granted can be
ignored but it could possibly hit some assert(taking into account
that lock_release_try() can release lock_sys latch, and other threads
can acquire the latch and validate lock queue) as it breaks our design
constraint that any granted lock in the queue should not conflict
with locks ahead in the queue. But lock_rec_queue_validate() does not
check the above constraint. We place new bypassing lock just before
bypassed one, but there still can be the case when lock bitmap is used
instead of creating new lock object(see lock_rec_add_to_queue() and
lock_rec_find_similar_on_page()), and the lock, which owns the
bitmap, can precede II2(granted). We can either disable
lock_rec_find_similar_on_page() space optimization for bypassing locks
or treat "X1(ordinary, granted) II2(granted)" sequence as valid. As
we don't currently have the function which would fail on the above
sequence, let treat it as valid for the case, when lock_release()
execution is in process.
2. S1(ordinary, granted) II2(waiting for S1) X3(waiting for S1)
When new X1(ordinary) lock is acquired, there will be the following
locks queue:
S1(ordinary, granted) II2(waiting for S1) X1(ordinary, granted)
X3(waiting for S1).
After S1 releasing there will be:
II2(granted) X1(ordinary, granted) X3(waiting for X1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The above queue is valid because ordinary lock does not conflict with
II-lock(see lock_rec_has_to_wait()).
lock_rec_create_low(): insert new lock to the position which
lock_rec_other_has_conflicting(), lock_rec_has_to_wait_in_queue()
returned if the lock is bypassing.
lock_rec_find_similar_on_page(): add ability to limit similiar lock search
with the certain lock to preserve "blocking before blocked" invariant for
all bypassed locks.
lock_rec_add_to_queue(): don't treat bypassed locks as waiting ones to
let lock bitmap reusing for bypassing locks.
lock_rec_lock(): fix inplace insert case, explained above.
lock_rec_dequeue_from_page(), lock_rec_rebuild_waiting_queue(): move
bypassing lock to the correct place to preserve "blocking before blocked"
invariant.
Reviewed by: Debarun Banerjee, Marko Mäkelä.
innodb_convert_name(): Convert a schema or table name to
my_charset_filename compatible format.
dict_table_lookup(): Replaces dict_get_referenced_table().
Make the callers responsible for invoking innodb_convert_name().
innobase_casedn_str(): Remove. Let us invoke my_casedn_str() directly.
dict_table_rename_in_cache(): Do not duplicate a call to
dict_mem_foreign_table_name_lookup_set().
innobase_convert_to_filename_charset(): Defined static in the only
compilation unit that needs it.
dict_scan_id(): Remove the constant parameters
table_id=FALSE, accept_also_dot=TRUE. Invoke strconvert() directly.
innobase_convert_from_id(): Remove; only called from dict_scan_id().
innobase_convert_from_table_id(): Remove (dead code).
table_name_t::dblen(), table_name_t::basename(): In non-debug builds,
tolerate names that may miss a '/' separator.
Reviewed by: Debarun Banerjee
row_ins_cascade_calc_update_vec(): Skip any virtual columns in the
update vector of the parent table.
Based on mysql/mysql-server@0ac176453b
Reviewed by: Debarun Banerjee
- MDEV-34392(commit cc810e64d4) adds
the check for nullability of foreign key column when foreign key
relation is of UPDATE_CASCADE or UPDATE SET NULL. This check
makes DDL fail when it violates foreign key nullability.
This patch basically does the nullability check for foreign key
column only for strict sql mode
problem:
=======
- During load statement, InnoDB bulk operation relies on temporary
directory and it got crash when tmpdir is exhausted.
Solution:
========
During bulk insert, LOAD statement is building the clustered
index one record at a time instead of page. By doing this,
InnoDB does the following
1) Avoids creation of temporary file for clustered index.
2) Writes the undo log for first insert operation alone
trx_t::autoinc_locks: Use small_vector<lock_t*,4> in order to avoid any
dynamic memory allocation in the most common case (a statement is
holding AUTO_INCREMENT locks on at most 4 tables or partitions).
lock_cancel_waiting_and_release(): Instead of removing elements from
the middle, simply assign nullptr, like lock_table_remove_autoinc_lock().
The added test innodb.auto_increment_lock_mode covers the dynamic memory
allocation as well as nondeterministically (occasionally) covers
the out-of-order lock release in lock_table_remove_autoinc_lock().
Reviewed by: Debarun Banerjee
log_t::resize_start(): If the ib_logfile101 cannot be created,
be sure to reset log_sys.resize_lsn.
log_t::resize_abort(): In case SET GLOBAL innodb_log_file_size is
aborted, delete the ib_logfile101.
ha_innobase::delete_table(): Clear trx->dict_operation_lock_mode
after, not before invoking trx->rollback(), so that
row_undo_mod_parse_undo_rec() will be invoked with dict_locked=true
and dict_sys_t::freeze() will not be invoked for loading a table
definition. Inside dict_sys_t::freeze(), an assertion !have_any()
would fail when the current thread is already holding the latch.
This fixes up commit c5fd9aa562 (MDEV-25919).
Reviewed by: Debarun Banerjee
Problem:
=======
- insert..select statement on partition table fails to use
bulk insert for the transaction.
Solution:
========
- Enable the bulk insert operation for insert..select
statement for partition table.
This is done by mapping most of the existing MySQL unicode 0900 collations
to MariadB 1400 unicode collations. The assumption is that 1400 is a super
set of 0900 for all practical purposes.
I also added a new function 'compare_collations()' and changed most code
to use this instead of comparing character sets directly.
This enables one to seamlessly mix-and-match the corresponding 0900 and
1400 sets. Field comparision and alter table treats the character sets
as identical.
All MySQL 8.0 0900 collations are supported except:
- utf8mb4_ja_0900_as_cs
- utf8mb4_ja_0900_as_cs_ks
- utf8mb4_ru_0900_as_cs
- utf8mb4_zh_0900_as_cs
These do not have corresponding entries in the MariadB 01400 collations.
Other things:
- Added COMMENT colum to information_schema.collations. For utf8mb4_0900
colletions it contains the corresponding alias collation.
opt_calc_index_goodness(): Correct an inaccurate condition.
We can very well use a clustered index of a table that is subject
to online rebuild. But we must not choose an index that has not been
committed (it is a secondary index that was not fully created)
or that is corrupted or not a normal B-tree index.
opt_search_plan_for_table(): Remove some redundant code, now that
opt_calc_index_goodness() checks against corrupted indexes.
The test case allows this code to be exercised. The main observation
in the following:
./mtr --rr innodb.stats_persistent
rr replay var/log/mysqld.1.rr/latest-trace
should be that when opt_search_plan_for_table() is being invoked by
dict_stats_update_persistent() on the being-altered statistics table
in the 2nd call after ha_innobase::inplace_alter_table(),
and the fix in opt_calc_index_goodness() is absent,
it would choose the code path if (n_fields == 0), that is, a full
table scan, instead of searching for the record. The GDB commands to
execute in "rr replay" would be as follows:
break ha_innobase::inplace_alter_table
continue
break opt_search_plan_for_table
continue
continue
next
next
…
Reviewed by: Vladislav Lesin
Under unknown circumstances, the SQL layer may wrongly disregard an
invocation of thd_mark_transaction_to_rollback() when an InnoDB
transaction had been aborted (rolled back) due to one of the following errors:
* HA_ERR_LOCK_DEADLOCK
* HA_ERR_RECORD_CHANGED (if innodb_snapshot_isolation=ON)
* HA_ERR_LOCK_WAIT_TIMEOUT (if innodb_rollback_on_timeout=ON)
Such an error used to cause a crash of InnoDB during transaction commit.
These changes aim to catch and report the error earlier, so that not only
this crash can be avoided but also the original root cause be found and
fixed more easily later.
The idea of this fix is from Michael 'Monty' Widenius.
HA_ERR_ROLLBACK: A new error code that will be translated into
ER_ROLLBACK_ONLY, signalling that the current transaction
has been aborted and the only allowed action is ROLLBACK.
trx_t::state: Add TRX_STATE_ABORTED that is like
TRX_STATE_NOT_STARTED, but noting that the transaction had been
rolled back and aborted.
trx_t::is_started(): Replaces trx_is_started().
ha_innobase: Check the transaction state in various places.
Simplify the logic around SAVEPOINT.
ha_innobase::is_valid_trx(): Replaces ha_innobase::is_read_only().
The InnoDB logic around transaction savepoints, commit, and rollback
was unnecessarily complex and might have contributed to this
inconsistency. So, we are simplifying that logic as well.
trx_savept_t: Replace with const undo_no_t*. When we rollback to
a savepoint, all we need to know is the number of undo log records
that must survive.
trx_named_savept_t, DB_NO_SAVEPOINT: Remove. We can store undo_no_t
directly in the space allocated at innobase_hton->savepoint_offset.
fts_trx_create(): Do not copy previous savepoints.
fts_savepoint_rollback(): If a savepoint was not found, roll back
everything after the default savepoint of fts_trx_create().
The test innodb_fts.savepoint is extended to cover this code.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich