- InnoDB fails to skip newly created column while checking for
change column when table is in redundant row format. This issue
is caused the MDEV-18035 (ccb1acbd3c)
Let us explicitly wait for purge before invoking a slow shutdown,
so that instrumented builds (such as ASAN or UBSAN) will not
exceed the 60-second timeout during shutdown.
MDEV-27025 allows to insert records before the record on which DELETE is
locked, as a result the DELETE misses those records, what causes serious ACID
violation.
Revert MDEV-27025, MDEV-27550. The test which shows the scenario of ACID
violation is added.
During an increase in resize, the new curr_size got a value
less than old_size.
As n_chunks_new and n_chunks have a strong correlation to the
resizing operation in progress, we can use them and remove the
need for old_size.
For convienece the n_chunks_new < n_chunks is now the is_shrinking
function.
The volatile compiler optimization on n_chunks{,_new} is removed
as real mutex uses are needed.
Other n_chunks_new/n_chunks methods:
n_chunks_new and n_chunks almost always read and altered under
the pool mutex.
Exceptions are:
* i_s_innodb_buffer_page_fill,
* buf_pool_t::is_uncompressed (via is_blocked_field)
These need reexamining for the need of a mutex, however comments
indicates this already.
get_n_pages has uses in buffer pool load, recover log memory
exhaustion estimates and innodb status so take the minimum number
of chunks for safety.
The buf_pool_t::running_out function also uses curr_size/old_size.
We replace this hot function calculation with just n_chunks_new.
This is the new size of the chunks before the resizing occurs.
If we are resizing down, we've already got the case we had previously
(as the minimum). If we are resizing upwards, we are taking an
optimistic view that there will be buffer chunks available for locks.
As this memory allocation is occurring immediately next the resizing
function it seems likely.
Compiler hint UNIV_UNLIKELY removed to leave it to the branch
predictor to make an informed decision.
Added test case of a smaller size than the Marko/Roel original
in JIRA reducing the size to 256M. SEGV hits roughly 1/10 times
but its better than a 21G memory size.
Reviewer: Marko
btr_cur_optimistic_insert(): Disregard DEBUG_DBUG injection to
invoke btr_page_reorganize() if the page (and the table) is empty.
Otherwise, an assertion would fail in btr_page_reorganize_low()
because PAGE_MAX_TRX_ID is 0 in an empty secondary index leaf page.
We support online log resizing by replicating the current ib_logfile0
to a new file ib_logfile101, which will eventually replace the
ib_logfile0 on the first applicable log checkpoint.
Unless the log is located in a persistent memory file system (PMEM),
an attempt to SET GLOBAL innodb_log_file_size to less than
innodb_log_buffer_size will be refused. (With PMEM, a.k.a. mmap()
based log, that parameter has no meaning.)
Should the server be killed while the log was being resized,
both files ib_logfile0 and ib_logfile101 may exist on startup,
and since commit 3b06415cb8
the extra file ib_logfile101 will be removed.
We will initiate checkpoint flushing by invoking buf_flush_ahead(),
to let buf_flush_page_cleaner() write out pages until the
buf_flush_async_lsn target has been reached.
On a log checkpoint, if the new checkpoint LSN is not older than
log_sys.resize_lsn (the start LSN of the ib_logfile101),
we can switch files and complete the log resizing. Else, we will
attempt to switch files on the next checkpoint.
Log resizing can be aborted by killing the connection that is
executing the SET GLOBAL statement.
If the ib_logfile101 wraps around to the beginning, we must
advance the log_sys.resize_lsn. In the resized log file,
the sequence bit will always be written as 1 (no wrap-around).
The log will be duplicated in log_t::resize_write(), invoked by
mtr_t::finish_write().
When the log is being written via system calls (not PMEM), the initial
log_sys.resize_lsn is the current log_sys.first_lsn, plus an integer
multiple of log_sys.block_size, corresponding to the LSN at the start
of the block that was written by log_sys.write_lsn. The log_sys.resize_buf
will be of the same size as the log_sys.buf. During resizing, the
contents of log_sys.buf and log_sys.resize_buf will be identical,
except that the sequence bit of each mini-transaction will always be 1 in
log_sys.resize_buf. If resizing is in progress, log_t::write_buf()
will write log_sys.resize_buf to log_sys.resize_log (ib_logfile101).
If the file would wrap around, the buffer will be written to
log_sys.START_OFFSET and the log_sys.resize_lsn advanced accordingly.
When using mmap() on /dev/shm or a PMEM mount -o dax file system,
the initial log_sys.resize_lsn will be the log_sys.lsn at the time
the resizing is initiated. If the log file wraps around during resizing,
then the log_sys.resize_lsn will be advanced by
(log_sys.resize_target - log_sys.START_OFFSET).
log_t::resize_start(), log_t::resize_abort(), log_t::write_checkpoint():
Unless the log is mmap() based, acquire flush_lock and write_lock.
In any case, acquire exclusive log_sys.latch to prevent race conditions.
log_t::resize_rename(): Renamed from log_t::rename_resized(),
and moved some code to the previous sole caller srv_start().
Thanks to Vladislav Vaintroub for helpful review comments
and to Matthias Leich for testing this, in particular, testing
crash recovery, multiple concurrent SET GLOBAL innodb_log_file_size
and frequently killed connections.
create_table_info_t::innobase_table_flags(): Ignore page_compressed
and page_compression_level on TEMPORARY tables.
ha_innobase::truncate(): Add a debug assertion that create() must succeed
on temporary tables.
- Server incorrectly downgrading the MDL after prepare phase when
table is empty. mdl_exclusive_after_prepare is being set in
prepare phase only. But mdl_exclusive_after_prepare condition was
misplaced and checked before prepare phase by
commit d270525dfd and it is now
changed to check after prepare phase.
- main.innodb_mysql_sync test case was changed to avoid locking
optimization when table is empty.
Backported from 10.5 20e9e804c1 and
5948d7602e.
sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).
It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.
But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.
The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.
The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.
Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.
If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.
To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.
The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.
One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.
If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().
convert_error_code_to_mysql(): Use the correct limit FK_MAX_CASCADE_DEL
in the error message. The DICT_FK_MAX_RECURSIVE_LOAD applies to
the number of foreign key constraints in table definitions,
not to the number of rows that are visited while processing
a foreign key constraint.
Some GNU/Linux distributions ship a zlib that is modified to use
the s390x DFLTCC instruction. That modification would essentially
redefine compressBound(sourceLen) as (sourceLen * 16 + 2308) / 8 + 6.
Let us relax the tests for InnoDB ROW_FORMAT=COMPRESSED to cope with
such a weaker compression guarantee.
create_table_info_t::row_size_is_acceptable(): Remove a bogus debug-only
assertion that would fail to hold for the test innodb_zip.bug36169.
The function page_zip_empty_size() may indeed return 0.
Post-push fix: remove unstable test.
The test was developed to find the reason of duplicated rows caused by
MDEV-20605 fix. The test is not necessary as the reason was found and
the bug was fixed.