The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.
We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.
Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.
An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.
An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.
Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.
The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.
The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.
We extend the MDEV-12353 redo log record format as follows.
(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.
Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.
The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.
The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:
os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes
The following status variables will be removed:
Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)
log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.
If the file system buffers can be bypassed, a message like the following
will be issued:
InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)
This has been tested on Linux and Microsoft Windows with both sizes.
On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.
When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:
InnoDB: log sequence number 0 (memory-mapped); transaction id 3
On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().
mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.
backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.
xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.
trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).
recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.
recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.
recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.
recv_sys_t, log_sys_t: Removed many data members.
recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.
recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.
recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.
recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)
recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().
recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.
mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.
log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.
log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.
log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.
recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).
We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.
Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.
Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
.. to be the same as startup.
In resolving MDEV-27461, BUF_LRU_MIN_LEN (256) is the minimum number of
pages for the innodb buffer pool size. Obviously we need more than just
flushing pages. Taking the 16k page size and its default minimum, an
extra 25% is needed on top of the flushing pages to make a workable buffer
pool.
The minimum innodb_buffer_pool_chunk_size (1M) restricts the minimum
otherwise we'd have a pool made up of different chunk sizes.
The resulting minimum innodb buffer pool sizes are:
Page Size, Previously minimum (startup), with change.
4k 5M 2M
8k 5M 3M
16k 5M 5M
32k 24M 10M
64k 24M 20M
With this patch, SET GLOBAL innodb_buffer_pool_size minimums are
enforced.
The evident minimum system variable size for innodb_buffer_pool_size
is 2M, however this is only setable if using 4k page size. As
the order of the page_size and buffer_pool_size aren't fixed, we can't
hide this change.
Subsequent changes:
* innodb_buffer_pool_resize_with_chunks.test - raised of pool resize due to new
minimums. Chunk size also needed increase as the test was for
pool_size < chunk_size to generate a warning.
* Removed srv_buf_pool_min_size and replaced use with MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Removed srv_buf_pool_def_size and replaced constant defination in
MYSQL_SYSVAR_LONGLONG(buffer_pool_size)
* Reordered ha_innodb to allow for direct use of MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Moved buf_pool_size_align into ha_innodb to access to MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* loose-innodb_disable_resize_buffer_pool_debug is needed in the
innodb.restart.opt test so that under debug mode, resizing of the
innodb buffer pool can occur.
The previous default innodb_buffer_pool_chunk_size of 128M
made sense when the innodb buffer pool size was a few GB.
When the pool size is 128GB this means the chunk size is 0.1%
of this. Fine tuning the buffer pool size on such a fine
increment doesn't make practical sense. Also on extremely
large buffer pool systems, initializing on the default 128M can
also take a considerable amount of time.
When large pages are enabled, the chunk size has to be a multiple
of an available large page size or memory allocation without
use can occur.
Previously the default 0 was documented as disabling resizing.
With srv_buf_pool_chunk_unit > 0 assertions in the code and the
minimium value set, I doubt this was ever the case.
As such the autosizing (based on default 0) takes place as follows:
* a 64th of the innodb_buffer_pool_size
* if large pages, this is rounded down the the nearest multiple
of the large page size.
* If less than 1MB, set to 1MB.
This does mean the new default innodb_buffer_pool_chunk size is
2MB, derived form the above formular with 128MB as the buffer pool
size.
The innodb_buffer_pool_chunk_size is changed to a size_t for
better compatiblity with the memory allocations which use size_t.
The previous upper limit is changed to the maxium of a size_t. The
maximium value used is the buffer pool size anyway.
Getting this default value of the chunk size to a more practical
size facilitates further development of more automated resizing
without significant overhead or memory fragmentation.
innodb_buffer_pool_resize test adjusted based on 1M default
chunk size thanks Wlad.
The following options were introduced in
commit 2e814d4702 (mariadb-10.2.2)
and have little use:
innodb_disable_resize_buffer_pool_debug had no effect even in
MariaDB 10.2.2 or MySQL 5.7.9. It was introduced in
mysql/mysql-server@5c4094cf49
to work around a problem that was fixed in
mysql/mysql-server@2957ae4f99
(but the parameter was not removed).
innodb_page_cleaner_disabled_debug and innodb_master_thread_disabled_debug
are only used by the test innodb.redo_log_during_checkpoint
that will be removed as part of this commit.
innodb_dict_stats_disabled_debug is only used by that test,
and it is redundant because one could simply use
innodb_stats_persistent=OFF or the STATS_PERSISTENT=0 attribute
of the table in the test to achieve the same effect.
MySQL 5.5 in commit 177d8b0c12
introduced a configuration parameter innodb_force_load_corrupted
whose purpose was to allow a corrupted table to be dropped.
Given that MDEV-11412 in MariaDB 10.5.4 aims to allow any metadata
for a missing or corrupted table to be dropped, and given that
MDEV-17567 and MDEV-25506 and related tasks made DDL operations
crash-safe, the parameter no longer serves any purpose.
Because this obscure parameter was read-only (not settable by a client),
it seems that we can simply declare it with MARIADB_REMOVED_OPTION
(commit 1bc9cce702) without breaking
any upgrades.
DICT_ERR_IGNORE_INDEX: Replaces DICT_ERR_IGNORE_INDEX_ROOT and
DICT_ERR_IGNORE_CORRUPT, which were always set equally.
dict_load_indexes(): Report "No indexes found for table" in
a uniform way, and only when the DICT_ERR_IGNORE_INDEX flag is
not set.
If the clustered index is marked corrupted, and the operation
is DICT_ERR_IGNORE_DROP (we are about to drop the table), we will
load the metadata; else, we will return DB_INDEX_CORRUPT.
If SYS_INDEXES.PAGE is FIL_NULL, report an error or warning
unless we are about to drop the table.
dict_load_table_one(): Simplify the logic.
In the InnoDB data files, we allocate 32 bits for tablespace identifiers
and page numbers as well as tablespace flags. But, in main memory
data structures we allocate 32 or 64 bits, depending on the register
width of the processor. Let us always use 32-bit fields to eliminate
a mismatch and reduce the memory footprint on 64-bit systems.
The practical maximum value of the parameter innodb_lock_wait_timeout
is 100,000,000. Any value larger than that specifies an infinite timeout.
Therefore, we should make 100,000,000 the maximum value of the parameter.
The MariaDB implementation of page_compressed tables for InnoDB used
sparse files. In the worst case, in the data file, every data page
will consist of some data followed by a hole. This may be extremely
inefficient in some file systems.
If the underlying storage device is thinly provisioned (can compress
data on the fly), it would be good to write regular files (with sequences
of NUL bytes at the end of each page_compressed block) and let the
storage device take care of compressing the data.
For reads, sparse file regions and regions containing NUL bytes will be
indistinguishable.
my_test_if_disable_punch_hole(): A new predicate for detecting thinly
provisioned storage. (Not implemented yet.)
innodb_atomic_writes: Correct the comment.
buf_flush_page(): Support all values of fil_node_t::punch_hole.
On a thinly provisioned storage device, we will always write
NUL-padded innodb_page_size bytes also for page_compressed tables.
buf_flush_freed_pages(): Remove a redundant condition.
fil_space_t::atomic_write_supported: Remove. (This was duplicating
fil_node_t::atomic_write.)
fil_space_t::punch_hole: Remove. (Duplicated fil_node_t::punch_hole.)
fil_node_t: Remove magic_n, and consolidate flags into bitfields.
For punch_hole we introduce a third value that indicates a
thinly provisioned storage device.
fil_node_t::find_metadata(): Detect all attributes of the file.
This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.
Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() will be closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.
HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.
ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.
ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.
dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.
dict_table_t::parse_name(): Use mdl_name instead of name.
dict_table_rename_in_cache(): Update mdl_name.
For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.
trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.
trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.
dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.
ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
HA_ERR_TABLE_IN_FK_CHECK: Remove.
row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.
This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.
innodb_debug_sync was introduced in commit
b393e2cb0c and reverted in
commit fc58c17216 due to memory leak reported
by valgrind, see MDEV-21336.
The leak is now fixed by adding `rw_lock_free(&slot->debug_sync_lock)`
after background thread working loop is finished, and the patch is
reapplied, with respect to c++98 fixes by Marko.
The missing DEBUG_SYNC for MDEV-18546 in row0vers.cc is also reapplied.
As suggested by Vladislav Vaintroub, let us remove misleading
and malformatted startup messages.
Even if the global variable srv_use_atomic_writes were set, we would
still invoke my_test_if_atomic_write() to check if writes are atomic
with a particular page size.
When using the default innodb_page_size=16k, page writes should be
atomic on NTFS when using ROW_FORMAT=COMPRESSED and KEY_BLOCK_SIZE<=4.
Disabling srv_use_atomic_writes when innodb_file_per_table=OFF does
not make sense, because that is a dynamic parameter.
We also correct the documentation string of innodb_use_atomic_writes
and remove the duplicate variable innobase_use_atomic_writes.
The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)
Historically, InnoDB supported a buggy page checksum algorithm that did not
compute a checksum over the full page. Later, well before MySQL 4.1
introduced .ibd files and the innodb_file_per_table option, the algorithm
was corrected and the first 4 bytes of each page were redefined to be
a checksum.
The original checksum was so slow that an option to disable page checksum
was introduced for benchmarketing purposes.
The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set
extension, which includes instructions for faster computation of CRC-32C.
In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was
implemented to make of that. As that option was changed to be the default
in MySQL 5.7, a bug was found on big-endian platforms and some work-around
code was added to weaken that checksum further. MariaDB disables that
work-around by default since MDEV-17958.
Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER
and ARM and also for IA-32/AMD64, making use of carry-less multiplication
where available.
Long story short, innodb_checksum_algorithm=crc32 is faster and more secure
than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb.
It should have removed any need to use innodb_checksum_algorithm=none.
The setting innodb_checksum_algorithm=crc32 is the default in
MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5,
MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default.
It is even faster and more secure.
The default settings in MariaDB do allow old data files to be read,
no matter if a worse checksum algorithm had been used.
(Unfortunately, before innodb_checksum_algorithm=full_crc32,
the data files did not identify which checksum algorithm is being used.)
The non-default settings innodb_checksum_algorithm=strict_crc32 or
innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C
checksums. The incompatibility with old data files is why they are
not the default.
The newest server not to support innodb_checksum_algorithm=crc32
were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life.
A valid reason for using innodb_checksum_algorithm=innodb could have
been the ability to downgrade. If it is really needed, data files
can be converted with an older version of the innochecksum utility.
Because there is no good reason to allow data files to be written
with insecure checksums, we will reject those option values:
innodb_checksum_algorithm=none
innodb_checksum_algorithm=innodb
innodb_checksum_algorithm=strict_none
innodb_checksum_algorithm=strict_innodb
Furthermore, the following innochecksum options will be removed,
because only strict crc32 will be supported:
innochecksum --strict-check=crc32
innochecksum -C crc32
innochecksum --write=crc32
innochecksum -w crc32
If a user wishes to convert a data file to use a different checksum
(so that it might be used with the no-longer-supported
MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE
nor system tablespace format changes that were made in MariaDB 10.3),
then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or
MySQL 5.7 can be used.
Reviewed by: Thirunarayanan Balathandayuthapani
We have innodb_use_native_aio=ON by default since the introduction of
that parameter in commit 2f9fb41b05
(MySQL 5.5 and MariaDB 5.5).
However, to really benefit from the setting, the files should be
opened in O_DIRECT mode, to bypass the file system cache.
In this way, the reads and writes can be submitted with DMA, using
the InnoDB buffer pool directly, and no processor cycles need to be
used for copying data. The use of O_DIRECT benefits not only the
current libaio implementation, but also liburing.
os_file_set_nocache(): Test innodb_flush_method in the function,
not in the callers.
A new configuration parameter innodb_deadlock_report is introduced:
* innodb_deadlock_report=off: Do not report any details of deadlocks.
* innodb_deadlock_report=basic: Report transactions and waiting locks.
* innodb_deadlock_report=full (default): Report also the blocking locks.
The improved deadlock checker will consider all involved transactions
in one loop, even if the deadlock loop includes several transactions.
The theoretical maximum number of transactions that can be involved in
a deadlock is `innodb_page_size` * 8, limited by the persistent data
structures.
Note: Similar to
mysql/mysql-server@3859219875
our deadlock checker will consider at most one blocking transaction
for each waiting transaction. The new field trx->lock.wait_trx be
nullptr if and only if trx->lock.wait_lock is nullptr. Note that
trx->lock.wait_lock->trx == trx (the waiting transaction), while
trx->lock.wait_trx points to one of the transactions whose lock is
conflicting with trx->lock.wait_lock.
Considering only one blocking transaction will greatly simplify
our deadlock checker, but it may also make the deadlock checker
blind to some deadlocks where the deadlock cycle is 'hidden' by
the fact that the registered trx->lock.wait_trx is not actually
waiting for any InnoDB lock, but something else. So, instead of
deadlocks, sometimes lock wait timeout may be reported.
To improve on this, whenever trx->lock.wait_trx is changed, we
will register further 'candidate' transactions in Deadlock::to_check(),
and check for 'revealed' deadlocks as soon as possible, in lock_release()
and innobase_kill_query().
The old DeadlockChecker was holding lock_sys.latch, even though using
lock_sys.wait_mutex should be less contended (and thus preferred)
in the likely case that no deadlock is present.
lock_wait(): Defer the deadlock check to this function, instead of
executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker: Complete rewrite:
(1) Explicitly keep track of transactions that are being waited for,
in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously,
we were painstakingly traversing the lock heaps while blocking
concurrent registration or removal of any locks (even uncontended ones).
(2) Use Brent's cycle-detection algorithm for deadlock detection,
traversing each trx->lock.wait_trx edge at most 2 times.
(3) If a deadlock is detected, release lock_sys.wait_mutex,
acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke
find_cycle() to find out whether the deadlock is still present.
(4) Display information on all transactions that are involved in the
deadlock, and choose a victim to be rolled back.
lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex.
Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx...
using Brent's cycle detection algorithm.
Deadlock::report(): Report a deadlock cycle that was found by
Deadlock::find_cycle(), and choose a victim with the least weight.
Altogether, we may traverse each trx->lock.wait_trx edge up to 5
times (2*find_cycle()+1 time for reporting and choosing the victim).
Deadlock::check_and_resolve(): Find and resolve a deadlock.
lock_wait_rpl_report(): Report the waits-for information to
replication. This used to be executed as part of DeadlockChecker.
Replication must know the waits-for relations even if no deadlocks
are present in InnoDB.
Reviewed by: Vladislav Vaintroub
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.
Let us declare the parameter as MARIADB_REMOVED_OPTION.
For earlier versions, commit ea9cd97f85
declared the parameter deprecated.
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.
Let us declare the parameter as deprecated and having no effect.
In commit 3a9a3be1c6 (MDEV-23855)
some previous logic was replaced with the condition
dirty_pct < srv_max_dirty_pages_pct_lwm, which caused
the default value of the parameter innodb_max_dirty_pages_pct_lwm=0
to lose its special meaning: 'refer to innodb_max_dirty_pages_pct instead'.
This implicit special meaning was visible in the function
af_get_pct_for_dirty(), which was removed in
commit f0c295e2de (MDEV-24369).
page_cleaner_flush_pages_recommendation(): Restore the special
meaning that was removed in MDEV-24369.
buf_flush_page_cleaner(): If srv_max_dirty_pages_pct_lwm==0.0,
refer to srv_max_buf_pool_modified_pct. This fixes the observed
performance regression due to excessive page flushing.
buf_pool_t::page_cleaner_wakeup(): Revise the wakeup condition.
innodb_init(): Do initialize srv_max_io_capacity in Mariabackup.
It was previously constantly 0, which caused mariadb-backup --prepare
to hang in buf_flush_sync(), making no progress.
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.
The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.
The parameter innodb_sync_array_size is removed.
FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!
innodb_sync_debug=ON might still cover ib_mutex_t.
In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded. This is questionable, especially
because falling back to simulated AIO may lead to significantly
reduced performance.
srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads:
Change the data type from ulong to uint.
os_aio_init(): Remove the parameters, and actually return an error code.
thread_pool::configure_aio(): Do not silently fall back to simulated AIO.
Reviewed by: Vladislav Vaintroub
After commit a5a2ef079c (part of MDEV-23855)
implemented asynchronous doublewrite, it is possible that the server will
hang when the following parametes are in effect:
innodb_doublewrite=1 (default)
innodb_write_io_threads=1
innodb_use_native_aio=0
Note: In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded.
Before commit a5a2ef079c, we used
a synchronous write for the doublewrite buffer batches, always at
most 64 pages at a time. So, upon completing a doublewrite batch,
a single thread would submit at most 64 page writes (for the
individual pages that were first written to the doublewrite buffer).
With that commit, we may submit up to 128 page writes at a time.
The maximum number of outstanding requests per thread is 256.
Because the maximum number of asynchronous write submissions per
thread was roughly doubled, it is now possible that
buf_dblwr_t::flush_buffered_writes_completed() will hang in
io_slots::acquire(), called via os_aio() and fil_space_t::io(),
when submitting writes of the individual blocks.
We will prevent this type of hang by increasing the minimum number
of innodb_write_io_threads from 1 to 2, so that this type of hang
would only become possible when 512 outstanding write requests
are exceeded.
Let us introduce the parameter innodb_read_only_compressed
that is ON by default, making any ROW_FORMAT=COMPRESSED tables
read-only.
I developed the ROW_FORMAT=COMPRESSED format based on
Heikki Tuuri's rough design between 2005 and 2008. It might
have been a good idea back then, but no proper benchmarks were
ever run to validate the design or the implementation.
The format has been more or less obsolete for years.
It limits innodb_page_size to 16384 bytes (the default),
and instant ALTER TABLE is not supported.
This is the first step towards deprecating and removing
write support for ROW_FORMAT=COMPRESSED tables.
Let us introduce a dummy variable innodb_max_purge_lag_wait
for waiting that the InnoDB history list length is below
the user-specified limit. Specifically,
SET GLOBAL innodb_max_purge_lag_wait=0;
should wait for all history to be purged. This could be useful
when upgrading from an older version to MariaDB 10.3 or later,
to avoid hitting MDEV-15912.
Note: the history cannot be purged if there exist transactions
that may see old versions.
Reviewed by: Vladislav Vaintroub
After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability
bottleneck, log checkpoints became a new bottleneck.
If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is
set high and the workload fits in the buffer pool, the page cleaner
thread will perform very little flushing. When we reach the capacity
of the circular redo log file ib_logfile0 and must initiate a checkpoint,
some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
then flushing would continue at the innodb_io_capacity rate, and
writers would be throttled.)
We have the best chance of advancing the checkpoint LSN immediately
after a page flush batch has been completed. Hence, it is best to
perform checkpoints after every batch in the page cleaner thread,
attempting to run once per second.
By initiating high-priority flushing in the page cleaner as early
as possible, we aim to make the throughput more stable.
The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
that the page cleaner thread would do something during that time.
The observed end result was that a large number of threads that call
log_free_check() would end up sleeping while nothing useful is happening.
We will revise the design so that in the default innodb_flush_sync=ON
mode, buf_flush_wait_flushed() will wake up the page cleaner thread
to perform the necessary flushing, and it will wait for a signal from
the page cleaner thread.
If innodb_io_capacity is set to a low value (causing the page cleaner to
throttle its work), a write workload would initially perform well, until
the capacity of the circular ib_logfile0 is reached and log_free_check()
will trigger checkpoints. At that point, the extra waiting in
buf_flush_wait_flushed() will start reducing throughput.
The page cleaner thread will also initiate log checkpoints after each
buf_flush_lists() call, because that is the best point of time for
the checkpoint LSN to advance by the maximum amount.
Even in 'furious flushing' mode we invoke buf_flush_lists() with
innodb_io_capacity_max pages at a time, and at the start of each
batch (in the log_flush() callback function that runs in a separate
task) we will invoke os_aio_wait_until_no_pending_writes(). This
tweak allows the checkpoint to advance in smaller steps and
significantly reduces the maximum latency. On an Intel Optane 960
NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds.
On Microsoft Windows with a slower SSD, it reduced from more than
180 seconds to 0.6 seconds.
We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity
per second whenever the dirty proportion of buffer pool pages exceeds
innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
to make page_cleaner_flush_pages_recommendation() more consistent and
predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
pages according to the return value of af_get_pct_for_dirty().
innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
guarantees that a shutdown of an idle server will be fast. Users might
be surprised if normal shutdown suddenly became slower when upgrading
within a GA release series.
innodb_checkpoint_usec: Remove. The master task will no longer perform
periodic log checkpoints. It is the duty of the page cleaner thread.
log_sys.max_modified_age: Remove. The current span of the
buf_pool.flush_list expressed in LSN only matters for adaptive
flushing (outside the 'furious flushing' condition).
For the correctness of checkpoints, the only thing that matters is
the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
This run-time constant was also reported as log_max_modified_age_sync.
log_sys.max_checkpoint_age_async: Remove. This does not serve any
purpose, because the checkpoints will now be triggered by the page
cleaner thread. We will retain the log_sys.max_checkpoint_age limit
for engaging 'furious flushing'.
page_cleaner.slot: Remove. It turns out that
page_cleaner_slot.flush_list_time was duplicating
page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
was duplicating page_cleaner.flush_pass.
Likewise, there were some redundant monitor counters, because the
page cleaner thread no longer performs any buf_pool.LRU flushing, and
because there only is one buf_flush_page_cleaner thread.
buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.
buf_pool_t::get_oldest_modification(): Add a parameter to specify the
return value when no persistent data pages are dirty. Require the
caller to hold buf_pool.flush_list_mutex.
log_buf_pool_get_oldest_modification(): Take the fall-back LSN
as a parameter. All callers will also invoke log_sys.get_lsn().
log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().
buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool
has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF)
and wait for the page cleaner to complete. If the page cleaner
thread is not running (which can be the case durign shutdown),
initiate the flush and wait for it directly.
buf_flush_ahead(): If innodb_flush_sync=ON (the default),
submit a new buf_flush_sync_lsn target for the page cleaner
but do not wait for the flushing to finish.
log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.
page_cleaner_flush_pages_recommendation(): Protect all access to
buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
were some race conditions in the calculation.
buf_flush_sync_for_checkpoint(): New function to process
buf_flush_sync_lsn in the page cleaner thread. At the end of
each batch, we try to wake up any blocked buf_flush_wait_flushed().
If everything up to buf_flush_sync_lsn has been flushed, we will
reset buf_flush_sync_lsn=0. The page cleaner thread will keep
'furious flushing' until the limit is reached. Any threads that
are waiting in buf_flush_wait_flushed() will be able to resume
as soon as their own limit has been satisfied.
buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
sleep as long as it is set. Do not update any page_cleaner statistics
for this special mode of operation. In the normal mode
(buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
try to wake up once per second. No longer check whether
srv_inc_activity_count() has been called. After each batch,
try to perform a log checkpoint, because the best chances for
the checkpoint LSN to advance by the maximum amount are upon
completing a flushing batch.
log_t: Move buf_free, max_buf_free possibly to the same cache line
with log_sys.mutex.
log_margin_checkpoint_age(): Simplify the logic, and replace
a 0.1-second sleep with a call to buf_flush_wait_flushed() to
initiate flushing. Moved to the same compilation unit
with the only caller.
log_close(): Clean up the calculations. (Should be no functional
change.) Return whether flush-ahead is needed. Moved to the same
compilation unit with the only caller.
mtr_t::finish_write(): Return whether flush-ahead is needed.
mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
external calls in mtr_t::commit() and make the logic easier to follow
by having related code in a single compilation unit. Also, we will
invoke srv_stats.log_write_requests.inc() only once per
mini-transaction commit, while not holding mutexes.
log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
the log from getting corrupted, let us wait for at most 1MiB of LSN
at a time, before rechecking the condition. This should allow writers
to proceed even if the redo log capacity has been reached and
'furious flushing' is in progress. We no longer care about
log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
The log_sys.max_modified_age_sync could be a relic from the time when
there was a srv_master_thread that wrote dirty pages to data files.
Also, we no longer have any log_sys.max_checkpoint_age_async limit,
because log checkpoints will now be triggered by the page cleaner
thread upon completing buf_flush_lists().
log_set_capacity(): Simplify the calculations of the limit
(no functional change).
log_checkpoint_low(): Split from log_checkpoint(). Moved to the
same compilation unit with the caller.
log_make_checkpoint(): Only wait for everything to be flushed until
the current LSN.
create_log_file(): After checkpoint, invoke log_write_up_to()
to ensure that the FILE_CHECKPOINT record has been written.
This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().
srv_start(): Do not call recv_recovery_from_checkpoint_start()
if the log has just been created. Set fil_system.space_id_reuse_warned
before dict_boot() has been executed, and clear it after recovery
has finished.
dict_boot(): Initialize fil_system.max_assigned_id.
srv_check_activity(): Remove. The activity count is counting transaction
commits and therefore mostly interesting for the purge of history.
BtrBulk::insert(): Do not explicitly wake up the page cleaner,
but do invoke srv_inc_activity_count(), because that counter is
still being used in buf_load_throttle_if_needed() for some
heuristics. (It might be cleaner to execute buf_load() in the
page cleaner thread!)
Reviewed by: Vladislav Vaintroub
MariaDB 10.2.2 inherited from MySQL 5.7 a perceived optimization
of ALTER TABLE, which skips the writing of redo log records.
In MDEV-16809 we introduced a parameter that allows the redo log to
be written, so that Mariabackup would not be impacted, but we kept
the MySQL 5.7 behaviour enabled by default (innodb_log_optimize_ddl=ON).
As noted in MDEV-19747 (Deprecate and ignore innodb_log_optimize_ddl,
implemented in MariaDB 10.5.1), omitting the redo log writes can
actually reduce performance, because we will have to wait for the data
pages to be written out. When the redo log file is configured to be
large enough, it actually can be much faster to write the redo log and
avoid the extra page flushing.
When the redo log is omitted (innodb_log_optimize_ddl=ON), also
Mariabackup may have to perform a lot of extra work, to re-copy the
entire data file if it is possible that any log was omitted during
the backup.
Starting with MariaDB 10.5.1, the parameter innodb_log_optimize_ddl
is deprecated and ignored. We hereby deprecate (but will not ignore)
the parameter in earlier versions as well.
InnoDB stores a 32-bit page number in page headers and in some
data structures, such as FIL_ADDR (consisting of a 32-bit page number
and a 16-bit byte offset within a page). For better compile-time
error detection and to reduce the memory footprint in some data
structures, let us use a uint32_t for the page number, instead
of ulint (size_t) which can be 64 bits.