1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-07 00:04:31 +03:00
Commit Graph

67 Commits

Author SHA1 Message Date
Marko Mäkelä
ba81009f63 MDEV-34863 RAM Usage Changed Significantly Between 10.11 Releases
innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size that
a Linux memory pressure event can lead to shrinking the buffer pool to.
On a memory pressure event, we will attempt to shrink
innodb_buffer_pool_size halfway between its current value and
innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is
specified as 0 or not specified on startup, its default value will be
adjusted to innodb_buffer_pool_size_max, that is, memory pressure events
will be disregarded by default.

buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink
the buffer pool in response to a memory pressure event.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:48 +02:00
Marko Mäkelä
b6923420f3 MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size to be changed in arbitrary 1-megabyte
increments.

innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size.  If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB).  The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.

The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.

Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement
a client disconnect, or server shutdown.  If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.

SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows.  On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Redident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).

The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.

Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent.  The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.

We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the of extents size.

The minimum innodb_buffer_pool_size is 256*5/4 pages.  At the default
innodb_page_size=16k this corresponds to 5 MiB.  However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.

my_large_virtual_alloc(): A new function, similar to my_large_malloc().

my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.

After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified.  This option is hard-coded on Microsoft Windows,
where VirtualMemory(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately.  On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size.  Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.

opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.

buf_pool_t::chunk_t::create(): Remove.

buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.

buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().

buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.

buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().

buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().

buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.

buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.

buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.

buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.

buf_pool_t::is_shrinking(): Return the total number or blocks that
were withdrawn or are to be withdrawn.

buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.

buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.

buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.

buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().

innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.

buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.

buf_buddy_condense_free(): Do not relocate blocks.

buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().

buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.

buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().

buf_buddy_alloc_from(): Remove the redundant parameter j.

buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.

buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.

trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:44 +02:00
mariadb-DebarunBanerjee
a8e35a1cc6 MDEV-36149 UBSAN in X is outside the range of representable values of type 'unsigned long' | page_cleaner_flush_pages_recommendation
Currently it is allowed to set innodb_io_capacity to very large value
up to unsigned 8 byte maximum value 18446744073709551615. While
calculating the number of pages to flush, we could sometime go beyond
innodb_io_capacity. Specifically, MDEV-24369 has introduced a logic
for aggressive flushing when dirty page percentage in buffer pool
exceeds innodb_max_dirty_pages_pct. So, when innodb_io_capacity is
set to very large value and dirty page percentage exceeds the
threshold, there is a multiplication overflow in Innodb page cleaner.

Fix: We should prevent setting io_capacity to unrealistic values and
define a practical limit to it. The patch introduces limits for
innodb_io_capacity_max and innodb_io_capacity to the maximum of 4 byte
unsigned integer i.e. 4294967295 (2^32-1). For 16k page size this limit
translates to 64 TiB/sec write IO speed which looks sufficient.

Reviewed by: Marko Mäkelä
2025-03-17 11:44:09 +05:30
Marko Mäkelä
cfcf27c6fe Merge 10.6 into 10.11 2024-08-29 07:47:29 +03:00
Marko Mäkelä
bda40ccb85 MDEV-34803 innodb_lru_flush_size is no longer used
In commit fa8a46eb68 (MDEV-33613)
the parameter innodb_lru_flush_size ceased to have any effect.

Let us declare the parameter as deprecated and additionally as
MARIADB_REMOVED_OPTION, so that there will be a warning written
to the error log in case the option is specified in the command line.

Let us also do the same for the parameter
innodb_purge_rseg_truncate_frequency
that was deprecated&ignored earlier in MDEV-32050.

Reviewed by: Debarun Banerjee
2024-08-28 07:18:03 +03:00
Marko Mäkelä
0892e6d028 MDEV-33585 The maximum innodb_log_buffer_size is too large
On Microsoft Windows, ReadFile() as well as WriteFile() limit the size
of the request to DWORD, which is 32 bits (at most 4 GiB - 1) also on
64-bit systems.

On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a
write request to INT_MAX. The size of a read request is always limited
to INT_MAX. This would allow the request size to be 4095 bytes more than
the Linux limit (0x7ffff000 according to "man 2 read" and "man 2 write").

On OpenBSD, Solaris and possibly NetBSD, the read request size is limited
to SSIZE_T_MAX, which would be half the current maximum
innodb_log_buffer_size. This should be not much of an issue anyway,
because on contemporary 64-bit platforms, the virtual addresses are
limited to 48 bits.

IBM AIX documentation mentions OFF_MAX which would apply when
a 64-bit application is running on a 32-bit kernel.

Let us declare innodb_log_buffer_size as 32-bit unsigned and make the
maximum 0x7ffff000, to be compatible with the least common
denominator (Linux).

The maximum innodb_sort_buffer_size already was 64 MiB,
which is not a problem.

SyncFileIO::execute(): Assert that the size of a synchronous read or
write request is limited to the maximum.

Reviewed by: Vladislav Vaintroub
2024-04-09 09:32:47 +03:00
Oleksandr Byelkin
04d9a46c41 Merge branch '10.6' into 10.10 2023-11-08 16:23:30 +01:00
Marko Mäkelä
aa719b5010 MDEV-32050: Do not copy undo records in purge
Also, default to innodb_purge_batch_size=1000,
replacing the old default value of processing 300 undo log pages
in a batch. Axel Schwenke found this value to help reduce purge lag
without having a significant impact on workload throughput.

In purge, we can simply acquire a shared latch on the undo log page
(to avoid a race condition like the one that was fixed in
commit b102872ad5) and retain a buffer-fix
after releasing the latch. The buffer-fix will prevent the undo log
page from being evicted from the buffer pool. Concurrent modification
is prevented by design. Only the purge_coordinator_task
(or its accomplice purge_truncation_task) may free the undo log pages,
after any purge_worker_task have completed execution. Hence, we do not
have to worry about any overwriting or reuse of the undo log records.

trx_undo_rec_copy(): Remove. The only remaining caller would have been
trx_undo_get_undo_rec_low(), which is where the logic was merged.

purge_sys_t::m_initialized: Replaces heap.

purge_sys_t::pages: A cache of buffer-fixed pages that have been
looked up from buf_pool.page_hash.

purge_sys_t::get_page(): Return a buffer-fixed undo page, using the
pages cache.

trx_purge_t::batch_cleanup(): Renamed from clone_end_view().
Clear the pages cache and clone the end_view at the end of a batch.

purge_sys_t::n_pages_handled(): Return pages.size(). This determines
if innodb_purge_batch_size was exceeded.

purge_sys_t::rseg_get_next_history_log(): Replaces
trx_purge_rseg_get_next_history_log().

purge_sys_t::choose_next_log(): Replaces trx_purge_choose_next_log()
and trx_purge_read_undo_rec().

purge_sys_t::get_next_rec(): Replaces trx_purge_get_next_rec()
and trx_undo_get_next_rec().

purge_sys_t::fetch_next_rec(): Replaces trx_purge_fetch_next_rec()
and some use of trx_undo_get_first_rec().

trx_purge_attach_undo_recs(): Do not allow purge_sys.n_pages_handled()
exceed the innodb_purge_batch_size or ¾ of the buffer pool, whichever
is smaller.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich and Axel Schwenke
2023-10-25 10:19:17 +03:00
Marko Mäkelä
14685b10df MDEV-32050: Deprecate&ignore innodb_purge_rseg_truncate_frequency
The motivation of introducing the parameter
innodb_purge_rseg_truncate_frequency in
mysql/mysql-server@28bbd66ea5 and
mysql/mysql-server@8fc2120fed
seems to have been to avoid stalls due to freeing undo log pages
or truncating undo log tablespaces. In MariaDB Server,
innodb_undo_log_truncate=ON should be a much lighter operation
than in MySQL, because it will not involve any log checkpoint.

Another source of performance stalls should be
trx_purge_truncate_rseg_history(), which is shrinking the history list
by freeing the undo log pages whose undo records have been purged.
To alleviate that, we will introduce a purge_truncation_task that will
offload this from the purge_coordinator_task. In that way, the next
innodb_purge_batch_size pages may be parsed and purged while the pages
from the previous batch are being freed and the history list being shrunk.

The processing of innodb_undo_log_truncate=ON will still remain the
responsibility of the purge_coordinator_task.

purge_coordinator_state::count: Remove. We will ignore
innodb_purge_rseg_truncate_frequency, and act as if it had been
set to 1 (the maximum shrinking frequency).

purge_coordinator_state::do_purge(): Invoke an asynchronous task
purge_truncation_callback() to free the undo log pages.

purge_sys_t::iterator::free_history(): Free those undo log pages
that have been processed. This used to be a part of
trx_purge_truncate_history().

purge_sys_t::clone_end_view(): Take a new value of purge_sys.head
as a parameter, so that it will be updated while holding exclusive
purge_sys.latch. This is needed for race-free access to the field
in purge_truncation_callback().

Reviewed by: Vladislav Lesin
2023-10-25 09:11:58 +03:00
Marko Mäkelä
32d741b5b0 Merge 10.7 into 10.8 2022-02-25 16:24:13 +02:00
Marko Mäkelä
3d88f9f34c Merge 10.6 into 10.7 2022-02-25 16:09:16 +02:00
Marko Mäkelä
06eaca9b86 Merge 10.5 into 10.6 (MDEV-27913) 2022-02-25 12:15:16 +02:00
Marko Mäkelä
f42d6234bd Merge 10.4 into 10.5 (MDEV-27913) 2022-02-25 11:47:27 +02:00
Marko Mäkelä
7ab3db142b MDEV-27913 fixup: sys_vars.sysvars_innodb result 2022-02-25 10:30:04 +02:00
Marko Mäkelä
685d958e38 MDEV-14425 Improve the redo log for concurrency
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
2022-01-21 16:03:47 +02:00
Daniel Black
d434250ee1 MDEV-25342: autosize innodb_buffer_pool_chunk_size
The previous default innodb_buffer_pool_chunk_size of 128M
made sense when the innodb buffer pool size was a few GB.

When the pool size is 128GB this means the chunk size is 0.1%
of this. Fine tuning the buffer pool size on such a fine
increment doesn't make practical sense. Also on extremely
large buffer pool systems, initializing on the default 128M can
also take a considerable amount of time.

When large pages are enabled, the chunk size has to be a multiple
of an available large page size or memory allocation without
use can occur.

Previously the default 0 was documented as disabling resizing.
With srv_buf_pool_chunk_unit > 0 assertions in the code and the
minimium value set, I doubt this was ever the case.

As such the autosizing (based on default 0) takes place as follows:
* a 64th of the innodb_buffer_pool_size
* if large pages, this is rounded down the the nearest multiple
  of the large page size.
* If less than 1MB, set to 1MB.

This does mean the new default innodb_buffer_pool_chunk size is
2MB, derived form the above formular with 128MB as the buffer pool
size.

The innodb_buffer_pool_chunk_size is changed to a size_t for
better compatiblity with the memory allocations which use size_t.
The previous upper limit is changed to the maxium of a size_t. The
maximium value used is the buffer pool size anyway.

Getting this default value of the chunk size to a more practical
size facilitates further development of more automated resizing
without significant overhead or memory fragmentation.

innodb_buffer_pool_resize test adjusted based on 1M default
chunk size thanks Wlad.
2022-01-18 14:20:57 +02:00
Marko Mäkelä
ca501ffb04 MDEV-26195: Use a 32-bit data type for some tablespace fields
In the InnoDB data files, we allocate 32 bits for tablespace identifiers
and page numbers as well as tablespace flags. But, in main memory
data structures we allocate 32 or 64 bits, depending on the register
width of the processor. Let us always use 32-bit fields to eliminate
a mismatch and reduce the memory footprint on 64-bit systems.
2021-07-22 11:22:47 +03:00
Marko Mäkelä
8c5c3a4594 MDEV-26067 innodb_lock_wait_timeout values above 100,000,000 are useless
The practical maximum value of the parameter innodb_lock_wait_timeout
is 100,000,000. Any value larger than that specifies an infinite timeout.

Therefore, we should make 100,000,000 the maximum value of the parameter.
2021-07-01 10:31:08 +03:00
Marko Mäkelä
c68007d958 MDEV-24738 Improve the InnoDB deadlock checker
A new configuration parameter innodb_deadlock_report is introduced:
* innodb_deadlock_report=off: Do not report any details of deadlocks.
* innodb_deadlock_report=basic: Report transactions and waiting locks.
* innodb_deadlock_report=full (default): Report also the blocking locks.

The improved deadlock checker will consider all involved transactions
in one loop, even if the deadlock loop includes several transactions.
The theoretical maximum number of transactions that can be involved in
a deadlock is `innodb_page_size` * 8, limited by the persistent data
structures.

Note: Similar to
mysql/mysql-server@3859219875
our deadlock checker will consider at most one blocking transaction
for each waiting transaction. The new field trx->lock.wait_trx be
nullptr if and only if trx->lock.wait_lock is nullptr. Note that
trx->lock.wait_lock->trx == trx (the waiting transaction), while
trx->lock.wait_trx points to one of the transactions whose lock is
conflicting with trx->lock.wait_lock.

Considering only one blocking transaction will greatly simplify
our deadlock checker, but it may also make the deadlock checker
blind to some deadlocks where the deadlock cycle is 'hidden' by
the fact that the registered trx->lock.wait_trx is not actually
waiting for any InnoDB lock, but something else. So, instead of
deadlocks, sometimes lock wait timeout may be reported.

To improve on this, whenever trx->lock.wait_trx is changed, we
will register further 'candidate' transactions in Deadlock::to_check(),
and check for 'revealed' deadlocks as soon as possible, in lock_release()
and innobase_kill_query().

The old DeadlockChecker was holding lock_sys.latch, even though using
lock_sys.wait_mutex should be less contended (and thus preferred)
in the likely case that no deadlock is present.

lock_wait(): Defer the deadlock check to this function, instead of
executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().

DeadlockChecker: Complete rewrite:
(1) Explicitly keep track of transactions that are being waited for,
in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously,
we were painstakingly traversing the lock heaps while blocking
concurrent registration or removal of any locks (even uncontended ones).
(2) Use Brent's cycle-detection algorithm for deadlock detection,
traversing each trx->lock.wait_trx edge at most 2 times.
(3) If a deadlock is detected, release lock_sys.wait_mutex,
acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke
find_cycle() to find out whether the deadlock is still present.
(4) Display information on all transactions that are involved in the
deadlock, and choose a victim to be rolled back.

lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex.

Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx...
using Brent's cycle detection algorithm.

Deadlock::report(): Report a deadlock cycle that was found by
Deadlock::find_cycle(), and choose a victim with the least weight.
Altogether, we may traverse each trx->lock.wait_trx edge up to 5
times (2*find_cycle()+1 time for reporting and choosing the victim).

Deadlock::check_and_resolve(): Find and resolve a deadlock.

lock_wait_rpl_report(): Report the waits-for information to
replication. This used to be executed as part of DeadlockChecker.
Replication must know the waits-for relations even if no deadlocks
are present in InnoDB.

Reviewed by: Vladislav Vaintroub
2021-02-17 12:44:08 +02:00
Sergei Golubchik
60ea09eae6 Merge branch '10.2' into 10.3 2021-02-01 13:49:33 +01:00
Marko Mäkelä
e4205fba7c MDEV-24536 innodb_idle_flush_pct has no effect
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.

Let us declare the parameter as MARIADB_REMOVED_OPTION.
For earlier versions, commit ea9cd97f85
declared the parameter deprecated.
2021-01-13 19:11:31 +02:00
Marko Mäkelä
ea9cd97f85 MDEV-24536 innodb_idle_flush_pct has no effect
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.

Let us declare the parameter as deprecated and having no effect.
2021-01-13 18:55:56 +02:00
Marko Mäkelä
f24b738318 MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1
In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded. This is questionable, especially
because falling back to simulated AIO may lead to significantly
reduced performance.

srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads:
Change the data type from ulong to uint.

os_aio_init(): Remove the parameters, and actually return an error code.

thread_pool::configure_aio(): Do not silently fall back to simulated AIO.

Reviewed by: Vladislav Vaintroub
2020-12-14 15:27:03 +02:00
Marko Mäkelä
17d3f8560b MDEV-24313 (1 of 2): Hang with innodb_write_io_threads=1
After commit a5a2ef079c (part of MDEV-23855)
implemented asynchronous doublewrite, it is possible that the server will
hang when the following parametes are in effect:

    innodb_doublewrite=1 (default)
    innodb_write_io_threads=1
    innodb_use_native_aio=0

Note: In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded.

Before commit a5a2ef079c, we used
a synchronous write for the doublewrite buffer batches, always at
most 64 pages at a time. So, upon completing a doublewrite batch,
a single thread would submit at most 64 page writes (for the
individual pages that were first written to the doublewrite buffer).
With that commit, we may submit up to 128 page writes at a time.

The maximum number of outstanding requests per thread is 256.
Because the maximum number of asynchronous write submissions per
thread was roughly doubled, it is now possible that
buf_dblwr_t::flush_buffered_writes_completed() will hang in
io_slots::acquire(), called via os_aio() and fil_space_t::io(),
when submitting writes of the individual blocks.

We will prevent this type of hang by increasing the minimum number
of innodb_write_io_threads from 1 to 2, so that this type of hang
would only become possible when 512 outstanding write requests
are exceeded.
2020-12-14 13:11:44 +02:00
Marko Mäkelä
dee6902922 After-merge fix: sys_vars.sysvars_innodb,32bit 2020-10-28 18:48:14 +02:00
Marko Mäkelä
9028cc6b86 Cleanup: Make InnoDB page numbers uint32_t
InnoDB stores a 32-bit page number in page headers and in some
data structures, such as FIL_ADDR (consisting of a 32-bit page number
and a 16-bit byte offset within a page). For better compile-time
error detection and to reduce the memory footprint in some data
structures, let us use a uint32_t for the page number, instead
of ulint (size_t) which can be 64 bits.
2020-10-15 17:06:17 +03:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
bbd70fcc43 MDEV-23379 Deprecate&ignore InnoDB concurrency throttling parameters
The parameters innodb_thread_concurrency and innodb_commit_concurrency
were useful years ago when both computing resources and the implementation
of some shared data structures were limited. MySQL 5.0 or 5.1 had trouble
scaling beyond 8 concurrent connections. Most of the scalability bottlenecks
have been removed since then, and the transactions per second delivered
by MariaDB Server 10.5 should not dramatically drop upon exceeding the
'optimal' number of connections.

Hence, enabling any concurrency throttling for InnoDB actually makes
things worse. We have seen many customers mistakenly setting this to a
small value like 16 or 64 and then complaining the server was slow.

Ignoring the parameters allows us to remove some normally unused code
and data structures, which could slightly improve performance.

innodb_thread_concurrency, innodb_commit_concurrency,
innodb_replication_delay, innodb_concurrency_tickets,
innodb_thread_sleep_delay, innodb_adaptive_max_sleep_delay:
Deprecate and ignore; hard-wire to 0.

The column INFORMATION_SCHEMA.INNODB_TRX.trx_concurrency_tickets
will always report 0.
2020-08-04 06:59:29 +03:00
Marko Mäkelä
5155a300fa MDEV-22871: Reduce InnoDB buf_pool.page_hash contention
The rw_lock_s_lock() calls for the buf_pool.page_hash became a
clear bottleneck after MDEV-15053 reduced the contention on
buf_pool.mutex. We will replace that use of rw_lock_t with a
special implementation that is optimized for memory bus traffic.

The hash_table_locks instrumentation will be removed.

buf_pool_t::page_hash: Use a special implementation whose API is
compatible with hash_table_t, and store the custom rw-locks
directly in buf_pool.page_hash.array, intentionally sharing
cache lines with the hash table pointers.

rw_lock: A low-level rw-lock implementation based on std::atomic<uint32_t>
where read_trylock() becomes a simple fetch_add(1).

buf_pool_t::page_hash_latch: The special of rw_lock for the page_hash.

buf_pool_t::page_hash_latch::read_lock(): Assert that buf_pool.mutex
is not being held by the caller.

buf_pool_t::page_hash_latch::write_lock() may be called while not holding
buf_pool.mutex. buf_pool_t::watch_set() is such a caller.

buf_pool_t::page_hash_latch::read_lock_wait(),
page_hash_latch::write_lock_wait(): The spin loops.
These will obey the global parameters innodb_sync_spin_loops and
innodb_sync_spin_wait_delay.

buf_pool_t::freed_page_hash: A singly linked list of copies of
buf_pool.page_hash that ever existed. The fact that we never
free any buf_pool.page_hash.array guarantees that all
page_hash_latch that ever existed will remain valid until shutdown.

buf_pool_t::resize_hash(): Replaces buf_pool_resize_hash().
Prepend a shallow copy of the old page_hash to freed_page_hash.

buf_pool_t::page_hash_table::n_cells: Declare as Atomic_relaxed.

buf_pool_t::page_hash_table::lock(): Explain what prevents a
race condition with buf_pool_t::resize_hash().
2020-06-18 14:16:01 +03:00
Marko Mäkelä
574d8b2940 MDEV-21907: Fix most clang -Wconversion in InnoDB
Declare innodb_purge_threads as 4-byte integer (UINT)
instead of 4-or-8-byte (ULONG) and adjust the documentation string.
2020-03-11 08:29:48 +02:00
Eugene Kosov
852dcb9a56 try to fix sysvars_innodb,32bit test 2020-02-24 17:21:21 +03:00
Marko Mäkelä
1a6f708ec5 MDEV-15058: Deprecate and ignore innodb_buffer_pool_instances
Our benchmarking efforts indicate that the reasons for splitting the
buf_pool in commit c18084f71b
have mostly gone away, possibly as a result of
mysql/mysql-server@ce6109ebfd
or similar work.

Only in one write-heavy benchmark where the working set size is
ten times the buffer pool size, the buf_pool->mutex would be
less contended with 4 buffer pool instances than with 1 instance,
in buf_page_io_complete(). That contention could be alleviated
further by making more use of std::atomic and by splitting
buf_pool_t::mutex further (MDEV-15053).

We will deprecate and ignore the following parameters:

	innodb_buffer_pool_instances
	innodb_page_cleaners

There will be only one buffer pool and one page cleaner task.

In a number of INFORMATION_SCHEMA views, columns that indicated
the buffer pool instance will be removed:

	information_schema.innodb_buffer_page.pool_id
	information_schema.innodb_buffer_page_lru.pool_id
	information_schema.innodb_buffer_pool_stats.pool_id
	information_schema.innodb_cmpmem.buffer_pool_instance
	information_schema.innodb_cmpmem_reset.buffer_pool_instance
2020-02-12 14:45:21 +02:00
Marko Mäkelä
64952203af MDEV-18115: Fix up sys_vars.sysvars_innodb
This was forgotten in e9de6386ad
2020-01-20 16:46:39 +02:00
Eugene Kosov
e9de6386ad MDEV-18115 remove now unneeded constraint
log_group_max_size: is not needed because redo log do not use fil_io() now
2020-01-18 23:42:55 +08:00
Marko Mäkelä
b42294bc64 MDEV-19514 Defer change buffer merge until pages are requested
We will remove the InnoDB background operation of merging buffered
changes to secondary index leaf pages. Changes will only be merged as a
result of an operation that accesses a secondary index leaf page,
such as a SQL statement that performs a lookup via that index,
or is modifying the index. Also ROLLBACK and some background operations,
such as purging the history of committed transactions, or computing
index cardinality statistics, can cause change buffer merge.
Encryption key rotation will not perform change buffer merge.

The motivation of this change is to simplify the I/O logic and to
allow crash recovery to happen in the background (MDEV-14481).
We also hope that this will reduce the number of "mystery" crashes
due to corrupted data. Because change buffer merge will typically
take place as a result of executing SQL statements, there should be
a clearer connection between the crash and the SQL statements that
were executed when the server crashed.

In many cases, a slight performance improvement was observed.

This is joint work with Thirunarayanan Balathandayuthapani
and was tested by Axel Schwenke and Matthias Leich.

The InnoDB monitor counter innodb_ibuf_merge_usec will be removed.

On slow shutdown (innodb_fast_shutdown=0), we will continue to
merge all buffered changes (and purge all undo log history).

Two InnoDB configuration parameters will be changed as follows:

innodb_disable_background_merge: Removed.
This parameter existed only in debug builds.
All change buffer merges will use synchronous reads.

innodb_force_recovery will be changed as follows:
* innodb_force_recovery=4 will be the same as innodb_force_recovery=3
(the change buffer merge cannot be disabled; it can only happen as
a result of an operation that accesses a secondary index leaf page).
The option used to be capable of corrupting secondary index leaf pages.
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'.
* innodb_force_recovery=5 (which essentially hard-wires
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED)
becomes safe to use. Bogus data can be returned to SQL, but
persistent InnoDB data files will not be corrupted further.
* innodb_force_recovery=6 (ignore the redo log files)
will be the only option that can potentially cause
persistent corruption of InnoDB data files.

Code changes:

buf_page_t::ibuf_exist: New flag, to indicate whether buffered
changes exist for a buffer pool page. Pages with pending changes
can be returned by buf_page_get_gen(). Previously, the changes
were always merged inside buf_page_get_gen() if needed.

ibuf_page_exists(const buf_page_t&): Check if a buffered changes
exist for an X-latched or read-fixed page.

buf_page_get_gen(): Add the parameter allow_ibuf_merge=false.
All callers that know that they may be accessing a secondary index
leaf page must pass this parameter as allow_ibuf_merge=true,
unless it does not matter for that caller whether all buffered
changes have been applied. Assert that whenever allow_ibuf_merge
holds, the page actually is a leaf page. Attempt change buffer
merge only to secondary B-tree index leaf pages.

btr_block_get(): Add parameter 'bool merge'.
All callers of btr_block_get() should know whether the page could be
a secondary index leaf page. If it is not, we should avoid consulting
the change buffer bitmap to even consider a merge. This is the main
interface to requesting index pages from the buffer pool.

ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace
buf_page_get_known_nowait() with much simpler logic, because
it is now guaranteed that that the block is x-latched or read-fixed.

mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge().
On crash recovery, we will no longer merge any buffered changes
for the pages that we read into the buffer pool during the last batch
of applying log records.

buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove.

btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait()
to its only remaining caller.

buf_page_make_young_if_needed(): Define as an inline function.
Add the parameter buf_pool.

buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the
parameter buf_pool.

fil_space_validate_for_mtr_commit(): Remove a bogus comment
about background merge of the change buffer.

btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(),
btr_cur_open_at_index_side_func(): Use narrower data types and scopes.

ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages().
Merge the change buffer by invoking buf_page_get_gen().
2019-10-11 17:28:15 +03:00
Marko Mäkelä
4081b7b27a Merge 10.4 into 10.5 2019-09-06 17:16:40 +03:00
Monty
a071e0e029 Merge branch '10.2' into 10.3 2019-09-03 13:17:32 +03:00
Monty
9cba6c5aa3 Updated mtr files to support different compiled in options
This allows one to run the test suite even if any of the following
options are changed:
- character-set-server
- collation-server
- join-cache-level
- log-basename
- max-allowed-packet
- optimizer-switch
- query-cache-size and query-cache-type
- skip-name-resolve
- table-definition-cache
- table-open-cache
- Some innodb options
etc

Changes:
- Don't print out the value of system variables as one can't depend on
  them to being constants.
- Don't set global variables to 'default' as the default may not
  be the same as the test was started with if there was an additional
  option file. Instead save original value and reset it at end of test.
- Test that depends on the latin1 character set should include
  default_charset.inc or set the character set to latin1
- Test that depends on the original optimizer switch, should include
  default_optimizer_switch.inc
- Test that depends on the value of a specific system variable should
  set it in the test (like optimizer_use_condition_selectivity)
- Split subselect3.test into subselect3.test and subselect3.inc to
  make it easier to set and reset system variables.
- Added .opt files for test that required specfic options that could
  be changed by external configuration files.
- Fixed result files in rockdsb & tokudb that had not been updated for
  a while.
2019-09-01 19:17:35 +03:00
Marko Mäkelä
893472d005 MDEV-19570 Deprecate and ignore innodb_undo_logs, remove innodb_rollback_segments
The option innodb_rollback_segments was deprecated already in
MariaDB Server 10.0. Its misleadingly named replacement innodb_undo_logs
is of very limited use. It makes sense to always create and use the
maximum number of rollback segments.

Let us remove the deprecated parameter innodb_rollback_segments and
deprecate&ignore the parameter innodb_undo_logs (to be removed in a
later major release).

This work involves some cleanup of InnoDB startup. Similar to other
write operations, DROP TABLE will no longer be allowed if
innodb_force_recovery is set to a value larger than 3.
2019-05-23 17:34:47 +03:00
Marko Mäkelä
df563e0c03 Merge 10.2 into 10.3
main.derived_cond_pushdown: Move all 10.3 tests to the end,
trim trailing white space, and add an "End of 10.3 tests" marker.
Add --sorted_result to tests where the ordering is not deterministic.

main.win_percentile: Add --sorted_result to tests where the
ordering is no longer deterministic.
2018-11-06 09:40:39 +02:00
Marko Mäkelä
32062cc61c Merge 10.1 into 10.2 2018-11-06 08:41:48 +02:00
Sergei Golubchik
bf28ba67b6 update rdiffs for 32bit 2018-10-31 22:06:15 +01:00
Marko Mäkelä
715e4f4320 MDEV-12218 Clean up InnoDB parameter validation
Bind more InnoDB parameters directly to MYSQL_SYSVAR and
remove "shadow variables".

innodb_change_buffering: Declare as ENUM, not STRING.

innodb_flush_method: Declare as ENUM, not STRING.

innodb_log_buffer_size: Bind directly to srv_log_buffer_size,
without rounding it to a multiple of innodb_page_size.

LOG_BUFFER_SIZE: Remove.

SysTablespace::normalize_size(): Renamed from normalize().

innodb_init_params(): A new function to initialize and validate
InnoDB startup parameters.

innodb_init(): Renamed from innobase_init(). Invoke innodb_init_params()
before actually trying to start up InnoDB.

srv_start(bool): Renamed from innobase_start_or_create_for_mysql().
Added the input parameter create_new_db.

SRV_ALL_O_DIRECT_FSYNC: Define only for _WIN32.

xb_normalize_init_values(): Merge to innodb_init_param().
2018-04-29 09:41:42 +03:00
Marko Mäkelä
ba19764209 Fix most -Wsign-conversion in InnoDB
Change innodb_buffer_pool_size, innodb_fill_factor to unsigned.
2018-04-28 20:45:45 +03:00
Jan Lindström
b23a109695 MDEV-11025: Make number of page cleaner threads variable dynamic
New test cases
        innodb-page-cleaners

Modified test cases
        innodb_page_cleaners_basic

New function buf_flush_set_page_cleaner_thread_cnt
    Increase or decrease the amount of page cleaner worker threads.
    In case of increase this function creates based on current
    abount and requested amount how many new threads should be
    created. In case of decrease this function sets up the
    requested amount of threads and uses is_requested event
    to signal workers. Then we wait until all new treads
    are started, old threads that should exit signal
    is_finished or shutdown has marked that page cleaner
    should finish.

buf_flush_page_cleaner_worker
        Store current thread id and thread_no and then signal
        event is_finished. If number of used page cleaner threads
        decrease we shut down those threads that have thread_no
        greater or equal than number of page configured page
        cleaners - 1 (note that there will be always page cleaner
        coordinator). Before exiting we signal is_finished.

New function innodb_page_cleaners_threads_update
        Update function for innodb-page-cleaners system variable.

innobase_start_or_create_for_mysql
        If more than one page cleaner threads is configured
        we use new function buf_flush_set_page_cleaner_thread_cnt
        to set up the requested threads (-1 coordinator).
2017-10-24 19:12:59 +03:00
Jan Lindström
016c35a7f2 MDEV-13690: Remove unnecessary innodb_use_mtflush, innodb_mtflush_threads parameters and related code
Users can use innodb-page-cleaners instead.
2017-09-01 18:33:46 +03:00
Marko Mäkelä
4e1fa7f63d Merge bb-10.2-ext into 10.3 2017-09-01 11:33:45 +03:00
Marko Mäkelä
1b41a54fc9 Fix test for MDEV-13674: Deprecate innodb_use_mtflush and innodb_mtflush_threads 2017-09-01 08:38:19 +03:00
Sergei Golubchik
8e8d42ddf0 Merge branch '10.0' into 10.1 2017-08-08 10:18:43 +02:00
Marko Mäkelä
813e6e628f Adjust sys_vars.sysvars_innodb for 32-bit builds
Remove the adjustments for some parameters that were deprecated in
MariaDB 10.2 and removed in 10.3.
2017-06-20 09:35:58 +03:00