trx_t::commit_tables(): Ensure that mod_tables will be empty.
This was broken in commit b08448de64
where the query cache invalidation was moved from lock_release().
This failure is caused by commit 43ca6059ca
(MDEV-24720). InnoDB fails to remove the AHI (adaptive hash index)
entries during the rollback of a bulk insert operation. InnoDB should
remove the AHI entries of the root page before reinitialising it.
Reviewed-by: Marko Mäkelä
Now that an INSERT into an empty table is replicated more efficiently
during online ALTER, an old test case started to fail. Let us disable
the MDEV-515 logic for the critical INSERT statement.
This is caused by commit 3cef4f8f0f
(MDEV-515). dict_table_t::clear() frees all BLOB pages during the
rollback of a bulk insert, but the online log tries to read the
freed BLOBs while applying the log. This can be fixed by truncating
the online log during the rollback of a bulk insert operation.
InnoDB fails to remove the AHI entries during the rollback
of a bulk insert operation, and throws an error when it
validates the AHI hash tables. InnoDB should remove the
AHI entries while freeing the segment, but only during the
rollback of a bulk insert into an index.
Reviewed-by: Marko Mäkelä
The purpose of the test was to ensure that the SX (update) mode of
index tree and buffer page latches are being used.
The test has become unstable, possibly due to changes related to
buf_pool.mutex and buf_pool.page_hash, or to the use of MDL in the
purge of transaction history.
In 10.6, the test depends on instrumentation that was refactored
or removed in MDEV-24142.
The use of different latching modes can be better observed indirectly
through high-concurrency benchmarks. For MDEV-14637, a performance test
was conducted where the finer-grained latching and
BTR_CUR_FINE_HISTORY_LENGTH were removed. It caused a 20% performance
regression for UPDATE and a somewhat smaller one for INSERT.
Any new problem with latching granularity should be easily caught by
performance testing, or by stress tests with Random Query Generator.
In commit 3cef4f8f0f (MDEV-515)
we inadvertently broke CREATE TABLE...REPLACE SELECT statements
by wrongly disabling row-level undo logging.
select_create::prepare(): Only invoke extra(HA_EXTRA_BEGIN_ALTER_COPY)
if no special treatment of duplicates is needed.
After an ignored INSERT IGNORE statement into an empty table, we would
wrongly use the MDEV-515 table-level undo logging for a subsequent
REPLACE statement.
ha_innobase::reset_template(): Clear m_prebuilt->ins_node->bulk_insert
on every statement boundary.
ha_innobase::start_stmt(): Invoke end_bulk_insert().
ha_innobase::extra(): Avoid accessing m_prebuilt->trx. Do not call
thd_to_trx(). Invoke end_bulk_insert() and try to reset bulk_insert
when changing the REPLACE or IGNORE settings.
trx_mod_table_time_t::WAS_BULK: Use a distinct value from BULK.
trx_undo_report_row_operation(): Add debug assertions.
Note: Some calls to end_bulk_insert() may be redundant, but statement
boundaries are not always clear in the API (especially in the
presence of LOCK TABLES or stored procedures).
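The following minimal sketch illustrates the statement-boundary handling
described above; the structs are placeholders and do not reflect the
actual row_prebuilt_t or ins_node_t layout:

  /* Placeholder stand-ins for m_prebuilt and its insert node. */
  struct sketch_ins_node
  {
    bool bulk_insert= false;  /* a TRX_UNDO_EMPTY bulk insert is in progress */
  };

  struct sketch_prebuilt
  {
    sketch_ins_node* ins_node= nullptr;
  };

  /* Invoked at every statement boundary (compare reset_template() and
  start_stmt() above): forget any bulk insert state, so that a following
  REPLACE or INSERT IGNORE falls back to row-level undo logging. */
  void reset_bulk_insert(sketch_prebuilt& prebuilt)
  {
    if (prebuilt.ins_node)
      prebuilt.ins_node->bulk_insert= false;
  }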
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
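A rough sketch of that visibility rule, using simplified stand-in types
rather than the actual ReadView and dict_table_t definitions:

  #include <cstdint>

  typedef uint64_t trx_id_t;

  /* Simplified read view: changes of transactions that committed before
  the view was created are visible. */
  struct sketch_read_view
  {
    trx_id_t low_limit_id;  /* changes with id >= low_limit_id are invisible */
    bool sees(trx_id_t id) const { return id < low_limit_id; }
  };

  /* Simplified table, carrying the new dict_table_t::bulk_trx_id. */
  struct sketch_table
  {
    trx_id_t bulk_trx_id;  /* 0 if no bulk insert into an empty table happened */
  };

  /* Records written by the bulk insert carry the reset DB_TRX_ID=0 and
  cannot be filtered one by one; hence the whole table must appear empty
  to any read view created before the bulk insert was committed. */
  bool bulk_insert_visible(const sketch_read_view& view,
                           const sketch_table& table)
  {
    return table.bulk_trx_id == 0 || view.sees(table.bulk_trx_id);
  }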
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
supporting it with the new table-level undo logging would require the
ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
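A minimal sketch of that bookkeeping, with plain members instead of the
flag bits that the real trx_mod_table_time_t packs into the first undo
number:

  #include <cstdint>

  class sketch_mod_table_time
  {
    uint64_t first_undo_no;  /* undo number of the first modification */
    bool bulk= false;        /* table-level (TRX_UNDO_EMPTY) logging is active */
    bool was_bulk= false;    /* table-level logging was used earlier */
  public:
    explicit sketch_mod_table_time(uint64_t undo_no) : first_undo_no(undo_no) {}
    uint64_t first() const { return first_undo_no; }

    void start_bulk_insert() { bulk= true; }
    /* Stop table-level logging, but remember that it happened, so that a
    later statement in the same transaction uses row-level undo logging. */
    void end_bulk_insert() { if (bulk) { bulk= false; was_bulk= true; } }
    bool is_bulk_insert() const { return bulk; }
    bool was_bulk_insert() const { return was_bulk; }
  };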
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
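Roughly, the two variants could look as follows (simplified types; the
real ReadView implementation differs):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  typedef uint64_t trx_id_t;

  struct sketch_view
  {
    trx_id_t up_limit_id;       /* older changes are certainly visible */
    trx_id_t low_limit_id;      /* newer changes are certainly invisible */
    std::vector<trx_id_t> ids;  /* sorted ids of transactions active at view creation */

    /* Variant for a trx_id_t taken directly from memory: no sanity check. */
    bool changes_visible(trx_id_t id) const
    {
      if (id < up_limit_id)
        return true;
      if (id >= low_limit_id)
        return false;
      return !std::binary_search(ids.begin(), ids.end(), id);
    }

    /* Variant for a trx_id_t read from an index page: reject values that
    can only result from page corruption before the normal check. */
    bool changes_visible(trx_id_t id, trx_id_t max_committed_id) const
    {
      return id <= max_committed_id && changes_visible(id);
    }
  };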
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc(),
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
We may end up with an empty leaf page (containing only an ADD COLUMN
metadata record) that is not the root page.
innobase_add_instant_try(): Disable an optimization for a non-canonical
empty table that contains a metadata record somewhere else than in
the root page.
btr_pcur_store_position(): Tolerate a non-canonical empty table.
When commit 5eb539555b (MDEV-12227)
removed the pages of temporary tables from the buf_pool.flush_list,
an adjustment to the buffer pool resizing was forgotten.
buf_pool_t::realloc(): Do not invoke buf_flush_relocate_on_flush_list()
for pages that belong to the temporary tablespace. Also, deduplicate
some code at the end.
buf_page_t::set_corrupt_id(): Tolerate oldest_modification()==1
(the dummy value) for temporary tablespace pages. The revised
buf_pool_t::realloc() may invoke this on dirty temporary tablespace pages.
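A rough model of the relocation decision (placeholder names, not the
actual buf_page_t interface):

  #include <cstdint>

  struct sketch_block
  {
    bool in_temporary_tablespace;
    uint64_t oldest_modification;  /* 0 = clean; 1 = dirty temporary page; >1 = LSN */
  };

  /* Only dirty pages of persistent tablespaces are present in
  buf_pool.flush_list, so only they need
  buf_flush_relocate_on_flush_list() when the block is relocated. */
  bool needs_flush_list_relocation(const sketch_block& block)
  {
    return block.oldest_modification > 1 && !block.in_temporary_tablespace;
  }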
mysql_col_offset was not updated after a new column had been added by an
INSTANT ALTER TABLE -- the table's data dictionary had remained the same.
When a virtual column is added or removed, the table was usually evicted and
then reopened, which triggered a vcol metadata rebuild on the next open.
However, this should also be done when a regular column is added or removed:
MariaDB always stores virtual fields at the end of the record,
so the shift should always happen.
Fix:
expand the eviction condition to the case when regular fields are
added or removed
Note:
this should happen only in the case of !new_clustered:
* When new_clustered is true, a new data dictionary is created, and vcol
metadata is rebuilt in `alter_rebuild_apply_log()`
* We can't do it in the `new_clustered` case, because the old table is not yet
substituted correctly
btr_discard_only_page_on_level(): Attempt to read the MDEV-15562 metadata
record from the leaf page, not the root page. In the root, the leftmost
(in this case, the only) node pointer would look like a metadata record.
This corruption bug was introduced in
commit 0e5a4ac253 (MDEV-15562).
The scenario is rare: a column was dropped instantly or the order of
columns was changed instantly, and then the table became empty in such
a way that in the last step, the root page had one child page.
Normally, a non-leaf B-tree page would always contain at least 2 children.
In queries like
create view v1 as select 2 like 1 escape (3 in (select 0 union select 1));
select 2 union select * from v1;
Item_func_like::escape was left uninitialized, because
Item_in_optimizer is const_during_execution()
but not actually const_item() during execution.
It is not, because const subquery evaluation was disabled for derived tables.
Practically, it only needs to be disabled for multi-update,
which runs fix_fields() before all tables are locked.
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.
The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.
The parameter innodb_sync_array_size is removed.
FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!
innodb_sync_debug=ON might still cover ib_mutex_t.
- There is no reason to collect EITS statistics
- The test is sporadically failing on some platforms. I believe the
issue is in InnoDB. Let's rule out EITS code as a possible source
of the issue.
The flushing of the InnoDB temporary tablespace is unnecessarily
tied to the write-ahead redo logging and redo log checkpoints,
which must be tied to the page writes of persistent tablespaces.
Let us simply omit any pages of temporary tables from buf_pool.flush_list.
In this way, log checkpoints will never incur any 'collateral damage' of
writing out changes of temporary tables.
After this change, pages of the temporary tablespace can only be written
out by buf_flush_lists(n_pages,0) as part of LRU eviction. Hopefully,
most of the time, that code will never be executed, and instead, the
temporary pages will be evicted by buf_release_freed_page() without
ever being written back to the temporary tablespace file.
This should improve the efficiency of the checkpoint flushing and
the buf_flush_page_cleaner thread.
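A rough sketch of the resulting rule for marking a page dirty
(placeholder names; not the actual buf_page_t interface):

  #include <cstdint>
  #include <list>

  typedef uint64_t lsn_t;

  struct sketch_page
  {
    bool in_temporary_tablespace;
    lsn_t oldest_modification;  /* 0 = clean; 1 = dirty temporary page; >1 = LSN */
  };

  std::list<sketch_page*> sketch_flush_list;  /* stand-in for buf_pool.flush_list */

  /* Only persistent pages enter the flush list and can therefore hold back
  a log checkpoint; temporary pages merely get the dummy value 1 so that
  they are still recognized as dirty for LRU eviction. */
  void sketch_set_dirty(sketch_page& page, lsn_t mtr_start_lsn)
  {
    if (page.oldest_modification)
      return;                                 /* already dirty */
    if (page.in_temporary_tablespace)
      page.oldest_modification= 1;
    else
    {
      page.oldest_modification= mtr_start_lsn;
      sketch_flush_list.push_front(&page);
    }
  }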
Reviewed by: Vladislav Vaintroub
This hang was caused by MDEV-23855, and we failed to fix it in
MDEV-24109 (commit 4cbfdeca84).
When buf_flush_ahead() is invoked soon before server shutdown
and the non-default setting innodb_flush_sync=OFF is in effect
and the buffer pool contains dirty pages of temporary tables,
the page cleaner thread may remain in an infinite loop
without completing its work, thus causing the shutdown to hang.
buf_flush_page_cleaner(): If the buffer pool contains no
dirty persistent pages, ensure that buf_flush_sync_lsn=0
will be assigned, so that shutdown will proceed.
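Sketched in isolation, the relevant branch could look like this
(illustrative only; variable names follow the description above):

  #include <atomic>
  #include <cstdint>

  typedef uint64_t lsn_t;

  /* Target LSN of a pending buf_flush_ahead() request (0 = none). */
  std::atomic<lsn_t> sketch_flush_sync_lsn{0};

  void sketch_page_cleaner_iteration(bool have_dirty_persistent_pages)
  {
    if (!have_dirty_persistent_pages)
    {
      /* Nothing can be flushed towards the target: clear it so that
      shutdown does not wait for the page cleaner forever. */
      sketch_flush_sync_lsn.store(0, std::memory_order_release);
      return;
    }
    /* ...otherwise, keep flushing towards sketch_flush_sync_lsn... */
  }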
The test case is not deterministic. On my system, it reproduced
the hang with 95% probability when running multiple instances
of the test in parallel, and 4% when running single-threaded.
Thanks to Eugene Kosov for debugging and testing this.
Let us remove sux_lock::waits and the associated bookkeeping.
Starting with commit 1669c8890c,
the PERFORMANCE_SCHEMA instrumentation interface has been keeping
track of lock waits.
The view INFORMATION_SCHEMA.INNODB_MUTEXES only exported counts
of rw-lock waits.
Also, SHOW ENGINE INNODB MUTEX will no longer export any information
about rw-locks.