mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-08-07 00:04:31 +03:00

Author	SHA1	Message	Date
Marko Mäkelä	ba81009f63	MDEV-34863 RAM Usage Changed Significantly Between 10.11 Releases innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size that a Linux memory pressure event can lead to shrinking the buffer pool to. On a memory pressure event, we will attempt to shrink innodb_buffer_pool_size halfway between its current value and innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is specified as 0 or not specified on startup, its default value will be adjusted to innodb_buffer_pool_size_max, that is, memory pressure events will be disregarded by default. buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink the buffer pool in response to a memory pressure event. Reviewed by: Debarun Banerjee	2025-03-26 17:05:48 +02:00
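The "halfway" rule in MDEV-34863 can be sketched as follows. This is only an illustration of the arithmetic in the commit message, not the server code: the helper name `shrink_target` and the use of plain byte counts are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of MariaDB Server): the size that a memory
// pressure event would try to shrink the buffer pool to, halfway between
// the current size and innodb_buffer_pool_size_auto_min.
constexpr std::size_t shrink_target(std::size_t current, std::size_t auto_min)
{
  // If auto_min is not below the current size (for example when it defaulted
  // to innodb_buffer_pool_size_max), there is nothing to shrink.
  return auto_min >= current ? current : current - (current - auto_min) / 2;
}
```

Under these assumptions, repeated memory pressure events would approach the configured minimum geometrically without ever going below it.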
Marko Mäkelä	b6923420f3	MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size We deprecate and ignore the parameter innodb_buffer_pool_chunk_size and allow the buffer pool size to be changed in arbitrary 1-megabyte increments. innodb_buffer_pool_size_max: A new read-only startup parameter that specifies the maximum innodb_buffer_pool_size. If 0 or unspecified, it will default to the specified innodb_buffer_pool_size rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems. This maximum is very likely to be limited further by the operating system. The status variable Innodb_buffer_pool_resize_status will reflect the status of shrinking the buffer pool. When no shrinking is in progress, the string will be empty. Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size will block until the requested buffer pool size change has been implemented, or the execution is interrupted by a KILL statement, a client disconnect, or server shutdown. If the buf_flush_page_cleaner() thread notices that we are running out of memory, the operation may fail with ER_WRONG_USAGE. SET GLOBAL innodb_buffer_pool_size will be refused if the server was started with --large-pages (even if no HugeTLB pages were successfully allocated). This functionality is somewhat exercised by the test main.large_pages, which now runs also on Microsoft Windows. On Linux, explicit HugeTLB mappings are apparently excluded from the reported Resident Set Size (RSS), and apparently unshrinkable between mmap(2) and munmap(2). The buffer pool will be mapped to a contiguous virtual memory area that will be aligned and partitioned into extents of 8 MiB on 64-bit systems and 2 MiB on 32-bit systems. Within an extent, the first few innodb_page_size blocks contain buf_block_t objects that will cover the page frames in the rest of the extent.
The number of such frames is precomputed in the array first_page_in_extent[] for each innodb_page_size. In this way, there is a trivial mapping between page frames and block descriptors and we do not need any lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map. We will always allocate the same number of block descriptors for an extent, even if we do not need all the buf_block_t in the last extent in case the innodb_buffer_pool_size is not an integer multiple of the extent size. The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default innodb_page_size=16k this corresponds to 5 MiB. However, now that the innodb_buffer_pool_size includes the memory allocated for the block descriptors, the minimum would be innodb_buffer_pool_size=6m. my_large_virtual_alloc(): A new function, similar to my_large_malloc(). my_virtual_mem_reserve(), my_virtual_mem_commit(), my_virtual_mem_decommit(), my_virtual_mem_release(): New interface mostly by Vladislav Vaintroub, to separately reserve and release virtual address space, as well as to commit and decommit memory within it. After my_virtual_mem_decommit(), the virtual memory range will be read-only or inaccessible, depending on whether the build option cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1 has been specified. This option is hard-coded on Microsoft Windows, where VirtualFree(MEM_DECOMMIT) will make the memory inaccessible. On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory will be zeroed out immediately. On other POSIX-like systems, madvise(MADV_FREE) will be used if available, to give the operating system kernel a permission to zero out the virtual memory range. We prefer immediate freeing so that the reported resident set size (RSS) of the process will reflect the current innodb_buffer_pool_size. Shrinking the buffer pool is a rarely executed resource intensive operation, and the immediate configuration of the MMU mappings should not incur significant additional penalty.
opt_super_large_pages: Declare only on Solaris. Actually, this is specific to the SPARC implementation of Solaris, but because we lack access to a Solaris development environment, we will not revise this for other MMU and ISA. buf_pool_t::chunk_t::create(): Remove. buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list. buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only(). buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>, only to be modified by the buf_flush_page_cleaner() thread. buf_pool_t::shrink(): Attempt to shrink the buffer pool. There are 3 possible outcomes: SHRINK_DONE (success), SHRINK_IN_PROGRESS (the caller may keep trying), and SHRINK_ABORT (we seem to be running out of buffer pool). While traversing buf_pool.LRU, release the contended buf_pool.mutex once in every 32 iterations in order to reduce starvation. Use lru_scan_itr for efficient traversal, similar to buf_LRU_free_from_common_LRU_list(). buf_pool_t::shrunk(): Update the reduced size of the buffer pool in a way that is compatible with buf_pool_t::page_guess(), and invoke my_virtual_mem_decommit(). buf_pool_t::resize(): Before invoking shrink(), run one batch of buf_flush_page_cleaner() in order to prevent LRU_warn(). Abort if shrink() recommends it, or no blocks were withdrawn in the past 15 seconds, or the execution of the statement SET GLOBAL innodb_buffer_pool_size was interrupted. buf_pool_t::first_to_withdraw: The first block descriptor that is out of the bounds of the shrunk buffer pool. buf_pool_t::withdrawn: The list of withdrawn blocks. If buf_pool_t::resize() is aborted before shrink() completes, we must be able to resurrect the withdrawn blocks in the free list. buf_pool_t::contains_zip(): Added a parameter for the number of least significant pointer bits to disregard, so that we can find any pointers to within a block that is supposed to be free. buf_pool_t::is_shrinking(): Return the total number of blocks that were withdrawn or are to be withdrawn.
buf_pool_t::to_withdraw(): Return the number of blocks that will need to be withdrawn. buf_pool_t::usable_size(): Number of usable pages, considering possible in-progress attempt at shrinking the buffer pool. buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer. If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will be validated before being dereferenced. buf_pool_t::get_info(): Replaces buf_stats_get_pool_info(). innodb_init_param(): Refactored. We must first compute srv_page_size_shift and then determine the valid bounds of innodb_buffer_pool_size. buf_buddy_shrink(): Replaces buf_buddy_realloc(). Part of the work is deferred to buf_buddy_condense_free(), which is being executed when we are not holding any buf_pool.page_hash latch. buf_buddy_condense_free(): Do not relocate blocks. buf_buddy_free_low(): Do not care about buffer pool shrinking. This will be handled by buf_buddy_shrink() and buf_buddy_condense_free(). buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip() when we are allocating from the binary buddy system. Previously we were asserting this on multiple recursion levels. buf_buddy_block_free(), buf_buddy_free_low(): Assert !buf_pool.contains_zip(). buf_buddy_alloc_from(): Remove the redundant parameter j. buf_flush_LRU_list_batch(): Add the parameter to_withdraw to keep track of buf_pool.n_blocks_to_withdraw. buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch() if we are shrinking the buffer pool. In that case, we want to minimize the page relocations and just finish as quickly as possible. trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled() in every iteration, in case the buffer pool is being shrunk in the middle of a purge batch. Reviewed by: Debarun Banerjee	2025-03-26 17:05:44 +02:00
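The extent layout described in MDEV-29445 (block descriptors stored in the first few pages of each extent, page frames in the rest) can be illustrated with a small calculation. This is a sketch under stated assumptions, not the actual first_page_in_extent[] computation: the names `extent_layout` and `compute_layout` are hypothetical, and the descriptor size of 160 bytes used in the usage note is an arbitrary stand-in for sizeof(buf_block_t).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how an extent is split between pages that
// hold block descriptors and pages that serve as page frames.
struct extent_layout
{
  std::size_t descriptor_pages; // leading pages holding descriptors
  std::size_t frames;           // remaining pages usable as page frames
};

// Find the smallest number of leading pages whose combined size can hold
// one descriptor for every remaining page frame in the extent.
inline extent_layout compute_layout(std::size_t extent_size,
                                    std::size_t page_size,
                                    std::size_t descriptor_size)
{
  const std::size_t pages = extent_size / page_size;
  std::size_t head = 0;
  while (head < pages && head * page_size < (pages - head) * descriptor_size)
    ++head;
  return {head, pages - head};
}
```

With an 8 MiB extent, 16 KiB pages and the assumed 160-byte descriptor, `compute_layout(8 << 20, 16384, 160)` yields 5 descriptor pages and 507 frames. Because every extent uses the same split, a frame's descriptor can be located by index arithmetic alone, which is the "trivial mapping" the commit message refers to.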
mariadb-DebarunBanerjee	a8e35a1cc6	MDEV-36149 UBSAN in X is outside the range of representable values of type 'unsigned long' \| page_cleaner_flush_pages_recommendation Currently it is allowed to set innodb_io_capacity to a very large value, up to the unsigned 8-byte maximum of 18446744073709551615. While calculating the number of pages to flush, we could sometimes go beyond innodb_io_capacity. Specifically, MDEV-24369 introduced logic for aggressive flushing when the dirty page percentage in the buffer pool exceeds innodb_max_dirty_pages_pct. So, when innodb_io_capacity is set to a very large value and the dirty page percentage exceeds the threshold, there is a multiplication overflow in the InnoDB page cleaner. Fix: We should prevent setting io_capacity to unrealistic values and define a practical limit for it. The patch limits innodb_io_capacity_max and innodb_io_capacity to the maximum of a 4-byte unsigned integer, i.e. 4294967295 (2^32-1). For 16k page size this limit translates to 64 TiB/sec write IO speed, which looks sufficient. Reviewed by: Marko Mäkelä	2025-03-17 11:44:09 +05:30
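The figures quoted in MDEV-36149 can be checked with a few lines of arithmetic. The following is a sketch for illustration only; the names `IO_CAPACITY_LIMIT`, `bytes_per_second` and `mul_overflows` are hypothetical and do not appear in the server source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// The new upper bound for innodb_io_capacity and innodb_io_capacity_max,
// as stated in the commit message: 2^32-1 pages per second.
constexpr std::uint64_t IO_CAPACITY_LIMIT =
  std::numeric_limits<std::uint32_t>::max();

// Hypothetical helper: write bandwidth implied by an I/O capacity.
constexpr std::uint64_t bytes_per_second(std::uint64_t io_capacity,
                                         std::uint64_t page_size)
{
  return io_capacity * page_size;
}

// Hypothetical helper: would a * b exceed the range of a 64-bit unsigned?
constexpr bool mul_overflows(std::uint64_t a, std::uint64_t b)
{
  return b != 0 && a > std::numeric_limits<std::uint64_t>::max() / b;
}
```

At the limit and a 16 KiB page size, the product is 2^46 - 2^14 bytes per second, which is just under 64 TiB/s, and multiplying the limit by any factor that fits in 32 bits cannot overflow a 64-bit value.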
Marko Mäkelä	cfcf27c6fe	Merge 10.6 into 10.11	2024-08-29 07:47:29 +03:00
Marko Mäkelä	bda40ccb85	MDEV-34803 innodb_lru_flush_size is no longer used In commit `fa8a46eb68` (MDEV-33613) the parameter innodb_lru_flush_size ceased to have any effect. Let us declare the parameter as deprecated and additionally as MARIADB_REMOVED_OPTION, so that there will be a warning written to the error log in case the option is specified in the command line. Let us also do the same for the parameter innodb_purge_rseg_truncate_frequency that was deprecated&ignored earlier in MDEV-32050. Reviewed by: Debarun Banerjee	2024-08-28 07:18:03 +03:00
Marko Mäkelä	0892e6d028	MDEV-33585 The maximum innodb_log_buffer_size is too large On Microsoft Windows, ReadFile() as well as WriteFile() limit the size of the request to DWORD, which is 32 bits (at most 4 GiB - 1) also on 64-bit systems. On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a write request to INT_MAX. The size of a read request is always limited to INT_MAX. This would allow the request size to be 4095 bytes more than the Linux limit (0x7ffff000 according to "man 2 read" and "man 2 write"). On OpenBSD, Solaris and possibly NetBSD, the read request size is limited to SSIZE_T_MAX, which would be half the current maximum innodb_log_buffer_size. This should not be much of an issue anyway, because on contemporary 64-bit platforms, the virtual addresses are limited to 48 bits. IBM AIX documentation mentions OFF_MAX which would apply when a 64-bit application is running on a 32-bit kernel. Let us declare innodb_log_buffer_size as 32-bit unsigned and make the maximum 0x7ffff000, to be compatible with the least common denominator (Linux). The maximum innodb_sort_buffer_size already was 64 MiB, which is not a problem. SyncFileIO::execute(): Assert that the size of a synchronous read or write request is limited to the maximum. Reviewed by: Vladislav Vaintroub	2024-04-09 09:32:47 +03:00
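The relationship between the limits named in MDEV-33585 can be written down directly. This is a minimal sketch assuming a platform with a 32-bit int; the names `LOG_BUFFER_SIZE_MAX` and `request_size_ok` are hypothetical and are not the identifiers used in the server.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// The Linux limit for a single read or write request, quoted in the
// commit message from "man 2 read" and "man 2 write".
constexpr std::uint32_t LOG_BUFFER_SIZE_MAX = 0x7ffff000;

// The INT_MAX limit mentioned for FreeBSD is 4095 bytes larger
// (assuming a 32-bit int).
static_assert(LOG_BUFFER_SIZE_MAX ==
              static_cast<std::uint32_t>(INT_MAX) - 4095,
              "Linux limit is INT_MAX rounded down to a 4096-byte multiple");

// Hypothetical check in the spirit of the assertion that the commit
// message says was added to SyncFileIO::execute().
constexpr bool request_size_ok(std::size_t n)
{
  return n <= LOG_BUFFER_SIZE_MAX;
}
```

Choosing the smallest platform limit as the maximum means a single synchronous request for the whole log buffer is never truncated on any of the listed systems.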
Oleksandr Byelkin	04d9a46c41	Merge branch '10.6' into 10.10	2023-11-08 16:23:30 +01:00
Marko Mäkelä	aa719b5010	MDEV-32050: Do not copy undo records in purge Also, default to innodb_purge_batch_size=1000, replacing the old default value of processing 300 undo log pages in a batch. Axel Schwenke found this value to help reduce purge lag without having a significant impact on workload throughput. In purge, we can simply acquire a shared latch on the undo log page (to avoid a race condition like the one that was fixed in commit `b102872ad5`) and retain a buffer-fix after releasing the latch. The buffer-fix will prevent the undo log page from being evicted from the buffer pool. Concurrent modification is prevented by design. Only the purge_coordinator_task (or its accomplice purge_truncation_task) may free the undo log pages, after any purge_worker_task have completed execution. Hence, we do not have to worry about any overwriting or reuse of the undo log records. trx_undo_rec_copy(): Remove. The only remaining caller would have been trx_undo_get_undo_rec_low(), which is where the logic was merged. purge_sys_t::m_initialized: Replaces heap. purge_sys_t::pages: A cache of buffer-fixed pages that have been looked up from buf_pool.page_hash. purge_sys_t::get_page(): Return a buffer-fixed undo page, using the pages cache. trx_purge_t::batch_cleanup(): Renamed from clone_end_view(). Clear the pages cache and clone the end_view at the end of a batch. purge_sys_t::n_pages_handled(): Return pages.size(). This determines if innodb_purge_batch_size was exceeded. purge_sys_t::rseg_get_next_history_log(): Replaces trx_purge_rseg_get_next_history_log(). purge_sys_t::choose_next_log(): Replaces trx_purge_choose_next_log() and trx_purge_read_undo_rec(). purge_sys_t::get_next_rec(): Replaces trx_purge_get_next_rec() and trx_undo_get_next_rec(). purge_sys_t::fetch_next_rec(): Replaces trx_purge_fetch_next_rec() and some use of trx_undo_get_first_rec(). 
trx_purge_attach_undo_recs(): Do not allow purge_sys.n_pages_handled() to exceed the innodb_purge_batch_size or ¾ of the buffer pool, whichever is smaller. Reviewed by: Vladislav Lesin Tested by: Matthias Leich and Axel Schwenke	2023-10-25 10:19:17 +03:00
Marko Mäkelä	14685b10df	MDEV-32050: Deprecate&ignore innodb_purge_rseg_truncate_frequency The motivation of introducing the parameter innodb_purge_rseg_truncate_frequency in mysql/mysql-server@28bbd66ea5 and mysql/mysql-server@8fc2120fed seems to have been to avoid stalls due to freeing undo log pages or truncating undo log tablespaces. In MariaDB Server, innodb_undo_log_truncate=ON should be a much lighter operation than in MySQL, because it will not involve any log checkpoint. Another source of performance stalls should be trx_purge_truncate_rseg_history(), which is shrinking the history list by freeing the undo log pages whose undo records have been purged. To alleviate that, we will introduce a purge_truncation_task that will offload this from the purge_coordinator_task. In that way, the next innodb_purge_batch_size pages may be parsed and purged while the pages from the previous batch are being freed and the history list being shrunk. The processing of innodb_undo_log_truncate=ON will still remain the responsibility of the purge_coordinator_task. purge_coordinator_state::count: Remove. We will ignore innodb_purge_rseg_truncate_frequency, and act as if it had been set to 1 (the maximum shrinking frequency). purge_coordinator_state::do_purge(): Invoke an asynchronous task purge_truncation_callback() to free the undo log pages. purge_sys_t::iterator::free_history(): Free those undo log pages that have been processed. This used to be a part of trx_purge_truncate_history(). purge_sys_t::clone_end_view(): Take a new value of purge_sys.head as a parameter, so that it will be updated while holding exclusive purge_sys.latch. This is needed for race-free access to the field in purge_truncation_callback(). Reviewed by: Vladislav Lesin	2023-10-25 09:11:58 +03:00
Marko Mäkelä	32d741b5b0	Merge 10.7 into 10.8	2022-02-25 16:24:13 +02:00
Marko Mäkelä	3d88f9f34c	Merge 10.6 into 10.7	2022-02-25 16:09:16 +02:00
Marko Mäkelä	06eaca9b86	Merge 10.5 into 10.6 (MDEV-27913)	2022-02-25 12:15:16 +02:00
Marko Mäkelä	f42d6234bd	Merge 10.4 into 10.5 (MDEV-27913)	2022-02-25 11:47:27 +02:00
Marko Mäkelä	7ab3db142b	MDEV-27913 fixup: sys_vars.sysvars_innodb result	2022-02-25 10:30:04 +02:00
Marko Mäkelä	685d958e38	MDEV-14425 Improve the redo log for concurrency The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. Previously, the redo log used to be created automatically if it was missing. Only with innodb_force_recovery=6, it is possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped.
Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid. The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision. We extend the MDEV-12353 redo log record format as follows. (1) Empty mini-transactions or extra NUL bytes will not be allowed. (2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning. (3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will be written. (4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data. (5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV. (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON. In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file. When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of them will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time.
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode. When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message: InnoDB: log sequence number 0 (memory-mapped); transaction id 3 On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap(). mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys. backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing. xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only. trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance. The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. 
recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex. log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits. recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. 
We are inserting the blocks in arbitrary order anyway, to be adjusted in recv_sys.apply(true). We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys. Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right. Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.	2022-01-21 16:03:47 +02:00
Daniel Black	d434250ee1	MDEV-25342: autosize innodb_buffer_pool_chunk_size The previous default innodb_buffer_pool_chunk_size of 128M made sense when the innodb buffer pool size was a few GB. When the pool size is 128GB this means the chunk size is 0.1% of this. Fine tuning the buffer pool size on such a fine increment doesn't make practical sense. Also on extremely large buffer pool systems, initializing on the default 128M can also take a considerable amount of time. When large pages are enabled, the chunk size has to be a multiple of an available large page size or memory allocation without use can occur. Previously the default 0 was documented as disabling resizing. With srv_buf_pool_chunk_unit > 0 assertions in the code and the minimum value set, I doubt this was ever the case. As such the autosizing (based on default 0) takes place as follows: * a 64th of the innodb_buffer_pool_size * if large pages, this is rounded down to the nearest multiple of the large page size. * If less than 1MB, set to 1MB. This does mean the new default innodb_buffer_pool_chunk_size is 2MB, derived from the above formula with 128MB as the buffer pool size. The innodb_buffer_pool_chunk_size is changed to a size_t for better compatibility with the memory allocations which use size_t. The previous upper limit is changed to the maximum of a size_t. The maximum value used is the buffer pool size anyway. Getting this default value of the chunk size to a more practical size facilitates further development of more automated resizing without significant overhead or memory fragmentation. innodb_buffer_pool_resize test adjusted based on 1M default chunk size thanks Wlad.	2022-01-18 14:20:57 +02:00
Marko Mäkelä	ca501ffb04	MDEV-26195: Use a 32-bit data type for some tablespace fields In the InnoDB data files, we allocate 32 bits for tablespace identifiers and page numbers as well as tablespace flags. But, in main memory data structures we allocate 32 or 64 bits, depending on the register width of the processor. Let us always use 32-bit fields to eliminate a mismatch and reduce the memory footprint on 64-bit systems.	2021-07-22 11:22:47 +03:00
Marko Mäkelä	8c5c3a4594	MDEV-26067 innodb_lock_wait_timeout values above 100,000,000 are useless The practical maximum value of the parameter innodb_lock_wait_timeout is 100,000,000. Any value larger than that specifies an infinite timeout. Therefore, we should make 100,000,000 the maximum value of the parameter.	2021-07-01 10:31:08 +03:00
Marko Mäkelä	c68007d958	MDEV-24738 Improve the InnoDB deadlock checker A new configuration parameter innodb_deadlock_report is introduced: * innodb_deadlock_report=off: Do not report any details of deadlocks. * innodb_deadlock_report=basic: Report transactions and waiting locks. * innodb_deadlock_report=full (default): Report also the blocking locks. The improved deadlock checker will consider all involved transactions in one loop, even if the deadlock loop includes several transactions. The theoretical maximum number of transactions that can be involved in a deadlock is `innodb_page_size` * 8, limited by the persistent data structures. Note: Similar to mysql/mysql-server@3859219875 our deadlock checker will consider at most one blocking transaction for each waiting transaction. The new field trx->lock.wait_trx will be nullptr if and only if trx->lock.wait_lock is nullptr. Note that trx->lock.wait_lock->trx == trx (the waiting transaction), while trx->lock.wait_trx points to one of the transactions whose lock is conflicting with trx->lock.wait_lock. Considering only one blocking transaction will greatly simplify our deadlock checker, but it may also make the deadlock checker blind to some deadlocks where the deadlock cycle is 'hidden' by the fact that the registered trx->lock.wait_trx is not actually waiting for any InnoDB lock, but something else. So, instead of deadlocks, sometimes lock wait timeout may be reported. To improve on this, whenever trx->lock.wait_trx is changed, we will register further 'candidate' transactions in Deadlock::to_check(), and check for 'revealed' deadlocks as soon as possible, in lock_release() and innobase_kill_query(). The old DeadlockChecker was holding lock_sys.latch, even though using lock_sys.wait_mutex should be less contended (and thus preferred) in the likely case that no deadlock is present. lock_wait(): Defer the deadlock check to this function, instead of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker: Complete rewrite: (1) Explicitly keep track of transactions that are being waited for, in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously, we were painstakingly traversing the lock heaps while blocking concurrent registration or removal of any locks (even uncontended ones). (2) Use Brent's cycle-detection algorithm for deadlock detection, traversing each trx->lock.wait_trx edge at most 2 times. (3) If a deadlock is detected, release lock_sys.wait_mutex, acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke find_cycle() to find out whether the deadlock is still present. (4) Display information on all transactions that are involved in the deadlock, and choose a victim to be rolled back. lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex. Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx... using Brent's cycle detection algorithm. Deadlock::report(): Report a deadlock cycle that was found by Deadlock::find_cycle(), and choose a victim with the least weight. Altogether, we may traverse each trx->lock.wait_trx edge up to 5 times (2*find_cycle()+1 time for reporting and choosing the victim). Deadlock::check_and_resolve(): Find and resolve a deadlock. lock_wait_rpl_report(): Report the waits-for information to replication. This used to be executed as part of DeadlockChecker. Replication must know the waits-for relations even if no deadlocks are present in InnoDB. Reviewed by: Vladislav Vaintroub	2021-02-17 12:44:08 +02:00
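Brent's cycle-detection algorithm, which MDEV-24738 names as the basis of Deadlock::find_cycle(), can be sketched on a simplified waits-for chain. This is not the InnoDB implementation: the struct `trx_node`, the free function `find_cycle` and the absence of any locking are assumptions made to keep the example self-contained. Only the single `wait_trx` edge per transaction mirrors the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a transaction: at most one outgoing
// waits-for edge, like trx->lock.wait_trx in the commit message.
struct trx_node
{
  const trx_node *wait_trx = nullptr;
};

// Brent's algorithm: the "hare" advances one edge per step, and the
// "tortoise" is teleported to the hare's position whenever the number of
// steps since the last teleport reaches the current power of two.
// Returns a node on the cycle, or nullptr if the chain ends.
inline const trx_node *find_cycle(const trx_node *start)
{
  const trx_node *tortoise = start;
  const trx_node *hare = start ? start->wait_trx : nullptr;
  std::size_t power = 1, length = 1;
  while (hare && hare != tortoise)
  {
    if (length == power)
    {
      tortoise = hare;
      power *= 2;
      length = 0;
    }
    hare = hare->wait_trx;
    ++length;
  }
  return hare;
}
```

A chain that ends in a transaction that is not waiting returns nullptr, while any chain that loops back on itself returns some member of the loop. A real checker would additionally need to hold the appropriate mutex while following the edges and re-validate the cycle before choosing a victim, as the commit message describes in steps (1) and (3).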
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Marko Mäkelä	e4205fba7c	MDEV-24536 innodb_idle_flush_pct has no effect The parameter innodb_idle_flush_pct that was introduced in MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since the InnoDB changes from MySQL 5.7.9 were applied in commit `2e814d4702`. Let us declare the parameter as MARIADB_REMOVED_OPTION. For earlier versions, commit `ea9cd97f85` declared the parameter deprecated.	2021-01-13 19:11:31 +02:00
Marko Mäkelä	ea9cd97f85	MDEV-24536 innodb_idle_flush_pct has no effect The parameter innodb_idle_flush_pct that was introduced in MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since the InnoDB changes from MySQL 5.7.9 were applied in commit `2e814d4702`. Let us declare the parameter as deprecated and having no effect.	2021-01-13 18:55:56 +02:00
Marko Mäkelä	f24b738318	MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1 In commit `5e62b6a5e0` (MDEV-16264) the logic of os_aio_init() was changed so that it will never fail, but instead automatically disable innodb_use_native_aio (which is enabled by default) if the io_setup() system call would fail due to resource limits being exceeded. This is questionable, especially because falling back to simulated AIO may lead to significantly reduced performance. srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads: Change the data type from ulong to uint. os_aio_init(): Remove the parameters, and actually return an error code. thread_pool::configure_aio(): Do not silently fall back to simulated AIO. Reviewed by: Vladislav Vaintroub	2020-12-14 15:27:03 +02:00
Marko Mäkelä	17d3f8560b	MDEV-24313 (1 of 2): Hang with innodb_write_io_threads=1 After commit `a5a2ef079c` (part of MDEV-23855) implemented asynchronous doublewrite, it is possible that the server will hang when the following parameters are in effect: innodb_doublewrite=1 (default) innodb_write_io_threads=1 innodb_use_native_aio=0 Note: In commit `5e62b6a5e0` (MDEV-16264) the logic of os_aio_init() was changed so that it will never fail, but instead automatically disable innodb_use_native_aio (which is enabled by default) if the io_setup() system call would fail due to resource limits being exceeded. Before commit `a5a2ef079c`, we used a synchronous write for the doublewrite buffer batches, always at most 64 pages at a time. So, upon completing a doublewrite batch, a single thread would submit at most 64 page writes (for the individual pages that were first written to the doublewrite buffer). With that commit, we may submit up to 128 page writes at a time. The maximum number of outstanding requests per thread is 256. Because the maximum number of asynchronous write submissions per thread was roughly doubled, it is now possible that buf_dblwr_t::flush_buffered_writes_completed() will hang in io_slots::acquire(), called via os_aio() and fil_space_t::io(), when submitting writes of the individual blocks. We will prevent this type of hang by increasing the minimum number of innodb_write_io_threads from 1 to 2, so that this type of hang would only become possible when 512 outstanding write requests are exceeded.	2020-12-14 13:11:44 +02:00
Marko Mäkelä	dee6902922	After-merge fix: sys_vars.sysvars_innodb,32bit	2020-10-28 18:48:14 +02:00
Marko Mäkelä	9028cc6b86	Cleanup: Make InnoDB page numbers uint32_t InnoDB stores a 32-bit page number in page headers and in some data structures, such as FIL_ADDR (consisting of a 32-bit page number and a 16-bit byte offset within a page). For better compile-time error detection and to reduce the memory footprint in some data structures, let us use a uint32_t for the page number, instead of ulint (size_t) which can be 64 bits.	2020-10-15 17:06:17 +03:00
Marko Mäkelä	7cffb5f6e8	MDEV-23399: Performance regression with write workloads The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. 
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit `d1ab89037a` unless log_warnings>2. 
Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. 
Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. 
logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. 
Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. 
buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub	2020-10-15 17:04:56 +03:00
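The 'best effort' flush batch described in this commit (skip pages that are newer than the durably written log or that are latched by a writer, and stop after max_n pages) can be sketched as follows. This is only an illustration under simplifying assumptions: `DirtyPage` and `flush_batch_sketch()` are hypothetical names, a plain vector stands in for buf_pool.flush_list, and no mutexes, doublewrite buffer, or real I/O are modelled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a dirty page; not InnoDB's buf_page_t.
struct DirtyPage {
  std::uint64_t page_lsn;    // LSN of the latest modification (FIL_PAGE_LSN)
  bool exclusively_latched;  // true while a writer holds the page latch
  bool written;              // set once this sketch has "written" the page
};

// Write out at most max_n pages and return how many were written.
// Pages that cannot be written right now are skipped, never waited for;
// a later batch is expected to pick them up.
std::size_t flush_batch_sketch(std::vector<DirtyPage> &flush_list,
                               std::uint64_t flushed_lsn, std::size_t max_n) {
  std::size_t n_flushed = 0;
  for (DirtyPage &page : flush_list) {
    if (n_flushed == max_n)
      break;                      // batch limit reached
    if (page.written)
      continue;                   // already clean
    if (page.page_lsn > flushed_lsn)
      continue;                   // the redo log for this change is not durable yet
    if (page.exclusively_latched)
      continue;                   // do not wait for the page latch
    page.written = true;          // stands in for submitting the page write
    ++n_flushed;
  }
  return n_flushed;
}
```

Skipping instead of waiting keeps the batch from stalling on a single page. The cost, as the commit message notes, is that a page with a small oldest_modification may be skipped repeatedly if it keeps being modified.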
Marko Mäkelä	bbd70fcc43	MDEV-23379 Deprecate&ignore InnoDB concurrency throttling parameters The parameters innodb_thread_concurrency and innodb_commit_concurrency were useful years ago when both computing resources and the implementation of some shared data structures were limited. MySQL 5.0 or 5.1 had trouble scaling beyond 8 concurrent connections. Most of the scalability bottlenecks have been removed since then, and the transactions per second delivered by MariaDB Server 10.5 should not dramatically drop upon exceeding the 'optimal' number of connections. Hence, enabling any concurrency throttling for InnoDB actually makes things worse. We have seen many customers mistakenly setting this to a small value like 16 or 64 and then complaining the server was slow. Ignoring the parameters allows us to remove some normally unused code and data structures, which could slightly improve performance. innodb_thread_concurrency, innodb_commit_concurrency, innodb_replication_delay, innodb_concurrency_tickets, innodb_thread_sleep_delay, innodb_adaptive_max_sleep_delay: Deprecate and ignore; hard-wire to 0. The column INFORMATION_SCHEMA.INNODB_TRX.trx_concurrency_tickets will always report 0.	2020-08-04 06:59:29 +03:00
Marko Mäkelä	5155a300fa	MDEV-22871: Reduce InnoDB buf_pool.page_hash contention The rw_lock_s_lock() calls for the buf_pool.page_hash became a clear bottleneck after MDEV-15053 reduced the contention on buf_pool.mutex. We will replace that use of rw_lock_t with a special implementation that is optimized for memory bus traffic. The hash_table_locks instrumentation will be removed. buf_pool_t::page_hash: Use a special implementation whose API is compatible with hash_table_t, and store the custom rw-locks directly in buf_pool.page_hash.array, intentionally sharing cache lines with the hash table pointers. rw_lock: A low-level rw-lock implementation based on std::atomic<uint32_t> where read_trylock() becomes a simple fetch_add(1). buf_pool_t::page_hash_latch: The specialization of rw_lock for the page_hash. buf_pool_t::page_hash_latch::read_lock(): Assert that buf_pool.mutex is not being held by the caller. buf_pool_t::page_hash_latch::write_lock() may be called while not holding buf_pool.mutex. buf_pool_t::watch_set() is such a caller. buf_pool_t::page_hash_latch::read_lock_wait(), page_hash_latch::write_lock_wait(): The spin loops. These will obey the global parameters innodb_sync_spin_loops and innodb_sync_spin_wait_delay. buf_pool_t::freed_page_hash: A singly linked list of copies of buf_pool.page_hash that ever existed. The fact that we never free any buf_pool.page_hash.array guarantees that all page_hash_latch that ever existed will remain valid until shutdown. buf_pool_t::resize_hash(): Replaces buf_pool_resize_hash(). Prepend a shallow copy of the old page_hash to freed_page_hash. buf_pool_t::page_hash_table::n_cells: Declare as Atomic_relaxed. buf_pool_t::page_hash_table::lock(): Explain what prevents a race condition with buf_pool_t::resize_hash().	2020-06-18 14:16:01 +03:00
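The idea of an rw-lock built on std::atomic<uint32_t> where read_trylock() is a simple fetch_add(1) can be illustrated with the following minimal sketch. It is not the MariaDB rw_lock class: the name `rw_lock_sketch`, the choice of the top bit as the writer flag, and the memory orderings are assumptions made for illustration, and the spin-wait loops mentioned in the commit are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed encoding: the top bit marks a writer; the low bits count readers.
constexpr std::uint32_t WRITER_FLAG = 1U << 31;

class rw_lock_sketch {
  std::atomic<std::uint32_t> word{0};

public:
  // Optimistically register as a reader with a single fetch_add(1).
  // If a writer holds the lock, undo the increment and report failure.
  bool read_trylock() {
    if (!(word.fetch_add(1, std::memory_order_acquire) & WRITER_FLAG))
      return true;
    word.fetch_sub(1, std::memory_order_relaxed);
    return false;
  }

  void read_unlock() {
    std::uint32_t old = word.fetch_sub(1, std::memory_order_release);
    assert(old & ~WRITER_FLAG);  // at least one reader was registered
    (void) old;
  }

  // A writer may enter only when there are no readers and no writer.
  bool write_trylock() {
    std::uint32_t expected = 0;
    return word.compare_exchange_strong(expected, WRITER_FLAG,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  // Clear only the writer flag, so that transient increments made by
  // failing read_trylock() calls are not lost.
  void write_unlock() {
    std::uint32_t old = word.fetch_and(~WRITER_FLAG, std::memory_order_release);
    assert(old & WRITER_FLAG);
    (void) old;
  }
};
```

The uncontended read path costs one atomic read-modify-write and no compare-and-swap retry loop, which is how the sketch reflects the commit's stated aim of reducing memory bus traffic. A failed read attempt briefly leaves a nonzero reader count, so a concurrent write_trylock() can fail spuriously; that is acceptable for a try-lock.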
Marko Mäkelä	574d8b2940	MDEV-21907: Fix most clang -Wconversion in InnoDB Declare innodb_purge_threads as 4-byte integer (UINT) instead of 4-or-8-byte (ULONG) and adjust the documentation string.	2020-03-11 08:29:48 +02:00
Eugene Kosov	852dcb9a56	try to fix sysvars_innodb,32bit test	2020-02-24 17:21:21 +03:00
Marko Mäkelä	1a6f708ec5	MDEV-15058: Deprecate and ignore innodb_buffer_pool_instances Our benchmarking efforts indicate that the reasons for splitting the buf_pool in commit `c18084f71b` have mostly gone away, possibly as a result of mysql/mysql-server@ce6109ebfd or similar work. Only in one write-heavy benchmark where the working set size is ten times the buffer pool size, the buf_pool->mutex would be less contended with 4 buffer pool instances than with 1 instance, in buf_page_io_complete(). That contention could be alleviated further by making more use of std::atomic and by splitting buf_pool_t::mutex further (MDEV-15053). We will deprecate and ignore the following parameters: innodb_buffer_pool_instances innodb_page_cleaners There will be only one buffer pool and one page cleaner task. In a number of INFORMATION_SCHEMA views, columns that indicated the buffer pool instance will be removed: information_schema.innodb_buffer_page.pool_id information_schema.innodb_buffer_page_lru.pool_id information_schema.innodb_buffer_pool_stats.pool_id information_schema.innodb_cmpmem.buffer_pool_instance information_schema.innodb_cmpmem_reset.buffer_pool_instance	2020-02-12 14:45:21 +02:00
Marko Mäkelä	64952203af	MDEV-18115: Fix up sys_vars.sysvars_innodb This was forgotten in `e9de6386ad`	2020-01-20 16:46:39 +02:00
Eugene Kosov	e9de6386ad	MDEV-18115 remove now unneeded constraint log_group_max_size: is not needed because the redo log does not use fil_io() now	2020-01-18 23:42:55 +08:00
Marko Mäkelä	b42294bc64	MDEV-19514 Defer change buffer merge until pages are requested We will remove the InnoDB background operation of merging buffered changes to secondary index leaf pages. Changes will only be merged as a result of an operation that accesses a secondary index leaf page, such as a SQL statement that performs a lookup via that index, or is modifying the index. Also ROLLBACK and some background operations, such as purging the history of committed transactions, or computing index cardinality statistics, can cause change buffer merge. Encryption key rotation will not perform change buffer merge. The motivation of this change is to simplify the I/O logic and to allow crash recovery to happen in the background (MDEV-14481). We also hope that this will reduce the number of "mystery" crashes due to corrupted data. Because change buffer merge will typically take place as a result of executing SQL statements, there should be a clearer connection between the crash and the SQL statements that were executed when the server crashed. In many cases, a slight performance improvement was observed. This is joint work with Thirunarayanan Balathandayuthapani and was tested by Axel Schwenke and Matthias Leich. The InnoDB monitor counter innodb_ibuf_merge_usec will be removed. On slow shutdown (innodb_fast_shutdown=0), we will continue to merge all buffered changes (and purge all undo log history). Two InnoDB configuration parameters will be changed as follows: innodb_disable_background_merge: Removed. This parameter existed only in debug builds. All change buffer merges will use synchronous reads. innodb_force_recovery will be changed as follows: * innodb_force_recovery=4 will be the same as innodb_force_recovery=3 (the change buffer merge cannot be disabled; it can only happen as a result of an operation that accesses a secondary index leaf page). The option used to be capable of corrupting secondary index leaf pages. 
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'. * innodb_force_recovery=5 (which essentially hard-wires SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED) becomes safe to use. Bogus data can be returned to SQL, but persistent InnoDB data files will not be corrupted further. * innodb_force_recovery=6 (ignore the redo log files) will be the only option that can potentially cause persistent corruption of InnoDB data files. Code changes: buf_page_t::ibuf_exist: New flag, to indicate whether buffered changes exist for a buffer pool page. Pages with pending changes can be returned by buf_page_get_gen(). Previously, the changes were always merged inside buf_page_get_gen() if needed. ibuf_page_exists(const buf_page_t&): Check if buffered changes exist for an X-latched or read-fixed page. buf_page_get_gen(): Add the parameter allow_ibuf_merge=false. All callers that know that they may be accessing a secondary index leaf page must pass this parameter as allow_ibuf_merge=true, unless it does not matter for that caller whether all buffered changes have been applied. Assert that whenever allow_ibuf_merge holds, the page actually is a leaf page. Attempt change buffer merge only to secondary B-tree index leaf pages. btr_block_get(): Add parameter 'bool merge'. All callers of btr_block_get() should know whether the page could be a secondary index leaf page. If it is not, we should avoid consulting the change buffer bitmap to even consider a merge. This is the main interface to requesting index pages from the buffer pool. ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace buf_page_get_known_nowait() with much simpler logic, because it is now guaranteed that the block is x-latched or read-fixed. mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge(). On crash recovery, we will no longer merge any buffered changes for the pages that we read into the buffer pool during the last batch of applying log records. 
buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove. btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait() to its only remaining caller. buf_page_make_young_if_needed(): Define as an inline function. Add the parameter buf_pool. buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the parameter buf_pool. fil_space_validate_for_mtr_commit(): Remove a bogus comment about background merge of the change buffer. btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(), btr_cur_open_at_index_side_func(): Use narrower data types and scopes. ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages(). Merge the change buffer by invoking buf_page_get_gen().	2019-10-11 17:28:15 +03:00
Marko Mäkelä	4081b7b27a	Merge 10.4 into 10.5	2019-09-06 17:16:40 +03:00
Monty	a071e0e029	Merge branch '10.2' into 10.3	2019-09-03 13:17:32 +03:00
Monty	9cba6c5aa3	Updated mtr files to support different compiled in options This allows one to run the test suite even if any of the following options are changed: - character-set-server - collation-server - join-cache-level - log-basename - max-allowed-packet - optimizer-switch - query-cache-size and query-cache-type - skip-name-resolve - table-definition-cache - table-open-cache - Some innodb options etc Changes: - Don't print out the value of system variables as one can't depend on them being constants. - Don't set global variables to 'default' as the default may not be the same as the test was started with if there was an additional option file. Instead save original value and reset it at end of test. - Test that depends on the latin1 character set should include default_charset.inc or set the character set to latin1 - Test that depends on the original optimizer switch, should include default_optimizer_switch.inc - Test that depends on the value of a specific system variable should set it in the test (like optimizer_use_condition_selectivity) - Split subselect3.test into subselect3.test and subselect3.inc to make it easier to set and reset system variables. - Added .opt files for test that required specific options that could be changed by external configuration files. - Fixed result files in rocksdb & tokudb that had not been updated for a while.	2019-09-01 19:17:35 +03:00
Marko Mäkelä	893472d005	MDEV-19570 Deprecate and ignore innodb_undo_logs, remove innodb_rollback_segments The option innodb_rollback_segments was deprecated already in MariaDB Server 10.0. Its misleadingly named replacement innodb_undo_logs is of very limited use. It makes sense to always create and use the maximum number of rollback segments. Let us remove the deprecated parameter innodb_rollback_segments and deprecate&ignore the parameter innodb_undo_logs (to be removed in a later major release). This work involves some cleanup of InnoDB startup. Similar to other write operations, DROP TABLE will no longer be allowed if innodb_force_recovery is set to a value larger than 3.	2019-05-23 17:34:47 +03:00
Marko Mäkelä	df563e0c03	Merge 10.2 into 10.3 main.derived_cond_pushdown: Move all 10.3 tests to the end, trim trailing white space, and add an "End of 10.3 tests" marker. Add --sorted_result to tests where the ordering is not deterministic. main.win_percentile: Add --sorted_result to tests where the ordering is no longer deterministic.	2018-11-06 09:40:39 +02:00
Marko Mäkelä	32062cc61c	Merge 10.1 into 10.2	2018-11-06 08:41:48 +02:00
Sergei Golubchik	bf28ba67b6	update rdiffs for 32bit	2018-10-31 22:06:15 +01:00
Marko Mäkelä	715e4f4320	MDEV-12218 Clean up InnoDB parameter validation Bind more InnoDB parameters directly to MYSQL_SYSVAR and remove "shadow variables". innodb_change_buffering: Declare as ENUM, not STRING. innodb_flush_method: Declare as ENUM, not STRING. innodb_log_buffer_size: Bind directly to srv_log_buffer_size, without rounding it to a multiple of innodb_page_size. LOG_BUFFER_SIZE: Remove. SysTablespace::normalize_size(): Renamed from normalize(). innodb_init_params(): A new function to initialize and validate InnoDB startup parameters. innodb_init(): Renamed from innobase_init(). Invoke innodb_init_params() before actually trying to start up InnoDB. srv_start(bool): Renamed from innobase_start_or_create_for_mysql(). Added the input parameter create_new_db. SRV_ALL_O_DIRECT_FSYNC: Define only for _WIN32. xb_normalize_init_values(): Merge to innodb_init_param().	2018-04-29 09:41:42 +03:00
Marko Mäkelä	ba19764209	Fix most -Wsign-conversion in InnoDB Change innodb_buffer_pool_size, innodb_fill_factor to unsigned.	2018-04-28 20:45:45 +03:00
Jan Lindström	b23a109695	MDEV-11025: Make number of page cleaner threads variable dynamic New test cases innodb-page-cleaners Modified test cases innodb_page_cleaners_basic New function buf_flush_set_page_cleaner_thread_cnt Increase or decrease the amount of page cleaner worker threads. In case of increase this function creates based on current amount and requested amount how many new threads should be created. In case of decrease this function sets up the requested amount of threads and uses is_requested event to signal workers. Then we wait until all new threads are started, old threads that should exit signal is_finished or shutdown has marked that page cleaner should finish. buf_flush_page_cleaner_worker Store current thread id and thread_no and then signal event is_finished. If number of used page cleaner threads decrease we shut down those threads that have thread_no greater than or equal to the number of configured page cleaners - 1 (note that there will be always page cleaner coordinator). Before exiting we signal is_finished. New function innodb_page_cleaners_threads_update Update function for innodb-page-cleaners system variable. innobase_start_or_create_for_mysql If more than one page cleaner thread is configured we use new function buf_flush_set_page_cleaner_thread_cnt to set up the requested threads (-1 coordinator).	2017-10-24 19:12:59 +03:00
Jan Lindström	016c35a7f2	MDEV-13690: Remove unnecessary innodb_use_mtflush, innodb_mtflush_threads parameters and related code Users can use innodb-page-cleaners instead.	2017-09-01 18:33:46 +03:00
Marko Mäkelä	4e1fa7f63d	Merge bb-10.2-ext into 10.3	2017-09-01 11:33:45 +03:00
Marko Mäkelä	1b41a54fc9	Fix test for MDEV-13674: Deprecate innodb_use_mtflush and innodb_mtflush_threads	2017-09-01 08:38:19 +03:00
Sergei Golubchik	8e8d42ddf0	Merge branch '10.0' into 10.1	2017-08-08 10:18:43 +02:00
Marko Mäkelä	813e6e628f	Adjust sys_vars.sysvars_innodb for 32-bit builds Remove the adjustments for some parameters that were deprecated in MariaDB 10.2 and removed in 10.3.	2017-06-20 09:35:58 +03:00

67 Commits