Problem:
=======
- InnoDB fails to write the buffered insert operation during
create..select operation. This happens when bulk_insert
in transaction is reset to false while unlocking a source table.
Fix:
===
- InnoDB should apply the previous buffered changes to
all tables if we encounter any statement other than
pure INSERT or INSERT..SELECT statement in
ha_innobase::external_lock() and start_stmt().
- Remove the function bulk_insert_apply_for_table()
start_stmt(), external_lock(): Assert that trx->duplicates
should be enabled during bulk insert operation
lock_rec_unlock_unmodified() is executed either under lock_sys.wr_lock()
or under a combination of lock_sys.rd_lock() + record locks hash table
cell latch. It also requests page latch to check if locked records were
changed by the current transaction or not.
Usually InnoDB requests page latch to find the certain record on the
page, and then requests lock_sys and/or record lock hash cell latch to
request record lock. lock_rec_unlock_unmodified() requests the latches
in the opposite order, what causes deadlocks. One of the possible
scenario for the deadlock is the following:
thread 1 - lock_rec_unlock_unmodified() is invoked under locks hash table
cell latch, the latch is acquired;
thread 2 - purge thread acquires page latch and tries to remove
delete-marked record, it invokes lock_update_delete(), which
requests locks hash table cell latch, held by thread 1;
thread 1 - requests page latch, held by thread 2.
To fix it we need to release lock_sys.latch and/or lock hash cell latch,
acquire page latch and re-acquire lock_sys related latches.
When lock_sys.latch and/or lock hash cell latch are released in
lock_release_on_prepare() and lock_release_on_prepare_try(), the page on
which the current lock is held, can be merged. In this case the bitmap
of the current lock must be cleared, and the new lock must be added to
the end of trx->lock.trx_locks list, or bitmap of already existing lock
must be changed.
The new field trx_lock_t::set_nth_bit_calls indicates if new locks
(bits in existing lock bitmaps or new lock objects) were created during
the period when lock_sys was released in trx->lock.trx_locks list
iteration loop in lock_release_on_prepare() or
lock_release_on_prepare_try(). And, if so, we traverse the list again.
The block can be freed during pages merging, what causes assertion
failure in buf_page_get_gen(), as btr_block_get() passes BUF_GET as page
get mode to it. That's why page_get_mode parameter was added to
btr_block_get() to pass BUF_GET_POSSIBLY_FREED from
lock_release_on_prepare() and lock_release_on_prepare_try() to
buf_page_get_gen().
As searching for id of trx, which modified secondary index record, is
quite expensive operation, restrict its usage for master. System variable
was added to remove the restriction for testing simplifying. The
variable exists only either for debug build or for build with
-DINNODB_ENABLE_XAP_UNLOCK_UNMODIFIED_FOR_PRIMARY option to increase the
probability of catching bugs for release build with RQG.
Note that the code, which does primary index lookup to find out what
transaction modified secondary index record, is necessary only when
there is no primary key and no unique secondary key on replica with row
based replication, because only in this case extra X locks on unmodified
records can be set during scan phase.
Reviewed by Marko Mäkelä.
There is no need to exclude exclusive non-gap locks from the procedure
of locks releasing on XA PREPARE execution in
lock_release_on_prepare_try() after commit
17e59ed3aa (MDEV-33454), because
lock_rec_unlock_unmodified() should check if the record was modified
with the XA, and release the lock if it was not.
lock_release_on_prepare_try(): don't skip X-locks, let
lock_rec_unlock_unmodified() to process them.
lock_sec_rec_some_has_impl(): add template parameter for not acquiring
trx_t::mutex for the case if a caller already holds the mutex, don't
crash if lock's bitmap is clean.
row_vers_impl_x_locked(), row_vers_impl_x_locked_low(): add new argument
to skip trx_t::mutex acquiring.
rw_trx_hash_t::validate_element(): don't acquire trx_t::mutex if the
current thread already holds it.
Thanks to Andrei Elkin for finding the bug.
Reviewed by Marko Mäkelä, Debarun Banerjee.
Problem:
=======
- Redundant table fails to insert into the table after
instant drop blob column. Instant drop column only marking
the column as hidden and consecutive insert statement tries
to insert NULL value for the dropped BLOB column and returns
the fixed length of the blob type as 65535. This lead to
row size too large error.
Fix:
====
For redundant table, if the non-fixed dropped column can be null
then set the length of the field type as 0.
- InnoDB fails to set the index information or index number
for the spatial index error HA_ERR_NULL_IN_SPATIAL.
row_build_spatial_index_key(): Initialize the tmp_mbr array completely.
check_if_supported_inplace_alter(): Fix the spelling mistake of alter
row_purge_reset_trx_id(): Reserve large enough offsets for accomodating
the maximum width PRIMARY KEY followed by DB_TRX_ID,DB_ROLL_PTR.
Reviewed by: Thirunarayanan Balathandayuthapani
Don't allow the referencing key column from NULL TO NOT NULL
when
1) Foreign key constraint type is ON UPDATE SET NULL
2) Foreign key constraint type is ON DELETE SET NULL
3) Foreign key constraint type is UPDATE CASCADE and referenced
column declared as NULL
Don't allow the referenced key column from NOT NULL to NULL
when foreign key constraint type is UPDATE CASCADE
and referencing key columns doesn't allow NULL values
get_foreign_key_info(): InnoDB sends the information about
nullability of the foreign key fields and referenced key fields.
fk_check_column_changes(): Enforce the above rules for COPY
algorithm
innobase_check_foreign_drop_col(): Checks whether the dropped
column exists in existing foreign key relation
innobase_check_foreign_low() : Enforce the above rules for
INPLACE algorithm
dict_foreign_t::check_fk_constraint_valid(): This is used
by CREATE TABLE statement to check nullability for foreign
key relation.
In the bug report MDEV-32817 it occurred that the function
row_mysql_get_table_status() is outputting a fil_space_t*
as if it were a numeric tablespace identifier.
ib_push_warning(): Remove. Let us invoke push_warning_printf() directly.
innodb_decryption_failed(): Report a decryption failure and set the
dict_table_t::file_unreadable flag. This code was being duplicated in
very many places. We return the constant value DB_DECRYPTION_FAILED
in order to avoid code duplication in the callers and to allow tail calls.
innodb_fk_error(): Report a FOREIGN KEY error.
dict_foreign_def_get(), dict_foreign_def_get_fields(): Remove.
This code was being used in dict_create_add_foreign_to_dictionary()
in an apparently uncovered code path. That ib_push_warning() call
would pass the integer i+1 instead of a pointer to NUL terminated
string ("%s"), and therefore the call should have resulted in a crash.
dict_print_info_on_foreign_key_in_create_format(),
innobase_quote_identifier(): Add const qualifiers.
row_mysql_get_table_error(): Replaces row_mysql_get_table_status().
Display no message on DB_CORRUPTION; it should be properly reported at
the SQL layer anyway.
btr_cur_t::search_leaf(): When the index root page is also a leaf page,
we may need to upgrade our existing shared root page latch into an
exclusive latch. Even if we end up waiting, the root page won't be able
to go away while we hold an index()->lock. The index page may be split;
that is all.
btr_latch_prev(): Acquire the page latch while holding a buffer-fix
and an index tree latch. Merge the change buffer if needed. Use
buf_pool_t::page_fix() for this special case instead of complicating
buf_page_get_low() and buf_page_get_gen().
row_merge_read_clustered_index(): Remove some code that does not seem
to be useful. No difference was observed with regard to removing this
code when a CREATE INDEX or OPTIMIZE TABLE statement was run concurrently
with sysbench oltp_update_index --tables=1 --table_size=1000 --threads=16.
buf_pool_t::unzip(): Decompress a ROW_FORMAT=COMPRESSED page.
buf_pool_t::page_fix(): Handle also ROW_FORMAT=COMPRESSED pages
as well as change buffer merge. Optionally return an error.
Add a flag for suppressing a page latch wait and a special return
value -1 to indicate that the call would block.
This is the preferred way of buffer-fixing blocks.
The functions buf_page_get_gen() and buf_page_get_low() are only being
invoked with rw_latch=RW_NO_LATCH in operations on SPATIAL INDEX.
buf_page_t: Define some static functions for interpreting state().
buf_page_get_zip(), buf_read_page(),
buf_read_ahead_random(), buf_read_ahead_linear():
Remove the redundant parameter zip_size. We must look up the
tablespace and can invoke fil_space_t::zip_size() on it.
buf_page_get_low(): Require mtr!=nullptr.
buf_page_get_gen(): Implement some lock downgrading during recovery.
ibuf_page_low(): Use buf_pool_t::page_fix() in a debug check.
We do wait for a page read here, because otherwise a debug assertion in
buf_page_get_low() in the test innodb.ibuf_delete could occasionally fail.
PageConverter::operator(): Invoke buf_pool_t::page_fix() in order
to possibly evict a block. This allows us to remove some
special case code from buf_page_get_low().
In a Sysbench oltp_update_index workload that involves 1 table,
a serious contention between the workload and the purge of history
was observed. This was the worst when the table contained only 1 record.
This turned out to be fixed by setting innodb_purge_batch_size=128,
which corresponds to the number of usable persistent rollback segments.
When we go above that, there would be contention between row_purge_poss_sec()
and the workload, typically on the clustered index page latch, sometimes
also on a secondary index page latch. It might be that with smaller
batches, trx_sys.history_size() will end up pausing all concurrent
transaction start/commit frequently enough so that purge will be able
to make some progress, so that there would be less contention on the
index page latches between purge and SQL execution.
In commit aa719b5010 (part of MDEV-32050)
the interpretation of the parameter innodb_purge_batch_size was slightly
changed. It would correspond to the maximum desired size of the
purge_sys.pages cache. Before that change, the parameter was referring to
a number of undo log pages, but the accounting might have been inaccurate.
To avoid a regression, we will reduce the default value to
innodb_purge_batch_size=127, which will also be compatible with
innodb_undo_tablespaces>1 (which will disable rollback segment 0).
Additionally, some logic in the purge and MVCC checks is simplified.
The purge tasks will make use of purge_sys.pages when accessing undo
log pages to find out if a secondary index record can be removed.
If an undo page needs to be looked up in buf_pool.page_hash, we will
merely buffer-fix it. This is correct, because the undo pages are
append-only in nature. Holding purge_sys.latch or purge_sys.end_latch
or the fact that the current thread is executing as a part of an
in-progress purge batch will prevent the contents of the undo page from
being freed and subsequently reused. The buffer-fix will prevent the
page from being evicted form the buffer pool. Thanks to this logic,
we can refer to the undo log record directly in the buffer pool page
and avoid copying the record.
buf_pool_t::page_fix(): Look up and buffer-fix a page. This is useful
for accessing undo log pages, which are append-only by nature.
There will be no need to deal with change buffer or ROW_FORMAT=COMPRESSED
in that case.
purge_sys_t::view_guard::view_guard(): Allow the type of guard to be
acquired: end_latch, latch, or no latch (in case we are a purge thread).
purge_sys_t::view_guard::get(): Read-only accessor to purge_sys.pages.
purge_sys_t::get_page(): Invoke buf_pool_t::page_fix().
row_vers_old_has_index_entry(): Replaced with row_purge_is_unsafe()
and row_undo_mod_sec_unsafe().
trx_undo_get_undo_rec(): Merged to trx_undo_prev_version_build().
row_purge_poss_sec(): Add the parameter mtr and remove redundant
or unused parameters sec_pcur, sec_mtr, is_tree. We will use the
caller's mtr object but release any acquired page latches before
returning.
btr_cur_get_page(), page_cur_get_page(): Do not invoke page_align().
row_purge_remove_sec_if_poss_leaf(): Return the value of PAGE_MAX_TRX_ID
to be checked against the page in row_purge_remove_sec_if_poss_tree().
If the secondary index page was not changed meanwhile, it will be
unnecessary to invoke row_purge_poss_sec() again.
trx_undo_prev_version_build(): Access any undo log pages using
the caller's mini-transaction object.
row_purge_vc_matches_cluster(): Moved to the only compilation unit that
needs it.
Reviewed by: Debarun Banerjee
When SUX_LOCK_GENERIC is defined, the srw_mutex, srw_lock, sux_lock are
implemented based on pthread_mutex_t and pthread_cond_t. This is the
only option for systems that lack a futex-like system call.
In the SUX_LOCK_GENERIC mode, if pthread_mutex_init() is allocating
some resources that need to be freed by pthread_mutex_destroy(),
a memory leak could occur when we are repeatedly invoking
pthread_mutex_init() without a pthread_mutex_destroy() in between.
pthread_mutex_wrapper::initialized: A debug field to track whether
pthread_mutex_init() has been invoked. This also helps find bugs
like the one that was fixed by
commit 1c8af2ae53 (MDEV-34422);
one simply needs to add -DSUX_LOCK_GENERIC to the CMAKE_CXX_FLAGS
to catch that particular bug on the initial server bootstrap.
buf_block_init(), buf_page_init_for_read(): Invoke block_lock::init()
because buf_page_t::init() will no longer do that.
buf_page_t::init(): Instead of invoking lock.init(), assert that it
has already been invoked (the lock is vacant).
add_fts_index(), build_fts_hidden_table(): Explicitly invoke
index_lock::init() in order to avoid a pthread_mutex_destroy()
invocation on an uninitialized object.
srw_lock_debug::destroy(): Invoke readers_lock.destroy().
trx_sys_t::create(): Invoke trx_rseg_t::init() on all rollback segments
in order to guarantee a deterministic state for shutdown, even if
InnoDB fails to start up.
trx_rseg_array_init(), trx_temp_rseg_create(), trx_rseg_create():
Invoke trx_rseg_t::destroy() before trx_rseg_t::init() in order to
balance pthread_mutex_init() and pthread_mutex_destroy() calls.
- During copy algorithm, InnoDB should use bulk insert operation
for row by row insert operation. By doing this, copy algorithm
can effectively build indexes. This optimization is disabled
for temporary table, versioning table and table which has
foreign key relation.
Introduced the variable innodb_alter_copy_bulk to allow
the bulk insert operation for copy alter operation
inside InnoDB. This is enabled by default
ha_innobase::extra(): HA_EXTRA_END_ALTER_COPY mode tries to apply
the buffered bulk insert operation, updates the non-persistent
table stats.
row_merge_bulk_t::write_to_index(): Update stat_n_rows after
applying the bulk insert operation
row_ins_clust_index_entry_low(): In case of copy algorithm,
switch to bulk insert operation.
copy_data_error_ignore(): Handles the error while copying
the data from source to target file.
The performance regression seen while loading BP is caused by the
deadlock fix given in MDEV-33543. The area of impact is wider but is
more visible when BP is being loaded initially via DMLs. Specifically
the response time could be impacted in DML doing pessimistic operation
on index(split/merge) and the leaf pages are not found in buffer pool.
It is more likely to occur with small BP size.
The origin of the issue dates back to MDEV-30400 that introduced
btr_cur_t::search_leaf() replacing btr_cur_search_to_nth_level() for
leaf page searches. In btr_latch_prev, we use RW_NO_LATCH to get the
previous page fixed in BP without latching. When the page is not in BP,
we try to acquire and wait for S latch violating the latching order.
This deadlock was analyzed in MDEV-33543 and fixed by using the already
present wait logic in buf_page_get_gen() instead of waiting for latch.
The wait logic is inferior to usual S latch wait and is simply a
repeated sleep 100 of micro-sec (The actual sleep time could be more
depending on platforms). The bug was seen with "change-buffering" code
path and the idea was that this path should be less exercised. The
judgement was not correct and the path is actually quite frequent and
does impact performance when pages are not in BP and being loaded by
DML expanding/shrinking large data.
Fix: While trying to get a page with RW_NO_LATCH and we are attempting
"out of order" latch, return from buf_page_get_gen immediately instead
of waiting and follow the ordered latching path.
The fixes in b8a6719889 have not disabled
semi-consistent read for innodb_snapshot_isolation=ON mode, they just allowed
read uncommitted version of a record, that's why the test for MDEV-26643 worked
well.
The semi-consistent read should be disabled on upper level in
row_search_mvcc() for READ COMMITTED isolation level.
Reviewed by Marko Mäkelä.
- Added a counter innodb_num_bulk_insert_operation in
INFORMATION_SCHEMA.GLOBAL_STATUS. This counter is incremented
whenever a InnoDB undergoes bulk insert operation.
- Change the innodb_instant_alter_column to atomic variable.
Problem:
========
- Partition update operation enables the bulk insert for the
transaction while moving the row between partitions. This leads
to debug assert failure while removing the row from one
of the partition.
Solution:
========
- Disallow the bulk insert operation for non-insert operation
of partition table.
I checked all stack overflow potential problems found with
gcc -Wstack-usage=16384
and
clang -Wframe-larger-than=16384 -no-inline
Fixes:
Added '#pragma clang diagnostic ignored "-Wframe-larger-than="'
to a lot of function to where stack usage large but resonable.
- Added stack check warnings to BUILD scrips when using clang and debug.
Function changed to use malloc instead allocating things on stack:
- read_bootstrap_query() now allocates line_buffer (20000 bytes) with
malloc() instead of using stack. This has a small performance impact
but this is not releant for bootstrap.
- mroonga grn_select() used 65856 bytes on stack. Changed it to use
malloc().
- Wsrep_schema::replay_transaction() and
Wsrep_schema::recover_sr_transactions().
- Connect zipOpen3()
Not fixed:
- mroonga/vendor/groonga/lib/expr.c grn_proc_call() uses
43712 byte on stack. However this is not easy to fix as the stack
used is caused by a lot of code generated by defines.
- Most changes in mroonga/groonga where only adding of pragmas to disable
stack warnings.
- rocksdb/options/options_helper.cc uses 20288 of stack space.
(no reason to fix except to get rid of the compiler warning)
- Causes using alloca() where the allocation size is resonable.
- An issue in libmariadb (reported to connectors).
In case of partition insert, InnoDB fails to end the bulk insert
for one of the partition. It leads to bulk insert operation for
the consecutive delete statement.
trx_t::bulk_insert_apply_for_table(): Irrespective of bulk insert
value, InnoDB should end the bulk insert for the table.