mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-09-18 04:07:41 +03:00

Author	SHA1	Message	Date
Oleksandr Byelkin	a8d4642375	Merge branch '10.11' into 11.4	2025-04-26 10:53:02 +02:00
Marko Mäkelä	a524ec5951	MDEV-36587 InnoDB uses too much memory log_t::clear_mmap(): Do not modify buf_size; we may have file_size==0 here during bootstrap. log_t::set_recovered(): If we are writing to a memory-mapped log, update log_sys.buf_size to the record payload area of log_sys.buf. This fixes up commit `acd071f599` (MDEV-21923).	2025-04-14 10:33:22 +03:00
Marko Mäkelä	acd071f599	MDEV-21923: LSN allocation is a bottleneck The parameter innodb_log_spin_wait_delay will be deprecated and ignored, because there is no spin loop anymore. Thanks to commit `685d958e38` and commit `a635c40648` multiple mtr_t::commit() can concurrently copy their slice of mtr_t::m_log to the shared log_sys.buf. Each writer would allocate their own log sequence number by invoking log_t::append_prepare() while holding a shared log_sys.latch. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path. It turns out that with a simpler data structure, instead of having several data fields that needed to be kept consistent with each other, we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which we can operate using fetch_add(), fetch_sub() as well as a single-bit fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later) can translate into loop-free code on AMD64. Before anything can be written to the log, log_sys.clear_mmap() must be invoked. log_t::base_lsn: The LSN of the last write_buf() or persist(). This is a rough approximation of log_sys.lsn, which will be removed. log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers updates of write_to_buf and base_lsn. log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove. Replaced by base_lsn and write_lsn_offset. log_t::buf_size: Always reflects the usable size in append_prepare(). log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(), there will be a resize_wrap_mutex. log_t::get_lsn_approx(): Return a lower bound of get_lsn(). This should be exact unless append_prepare_wait() is pending. log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked while holding an exclusive log_sys.latch. recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear(); it would seem to be unnecessary. 
In many places, references to log_sys.get_lsn() are replaced with log_sys.get_flushed_lsn(), which remains a simple std::atomic::load(). Reviewed by: Debarun Banerjee	2025-04-10 13:02:17 +03:00
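The single-counter idea described in MDEV-21923 can be illustrated with a minimal sketch. It assumes a hypothetical `LsnAllocator` type that is not part of the MariaDB source; it only shows how one atomic offset updated with `fetch_add()` can hand out non-overlapping slices to concurrent writers. The real `log_t` additionally handles buffer wrap-around, waiting, and the single-bit `fetch_or()` flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the log bookkeeping; this is NOT the real log_t.
struct LsnAllocator {
  uint64_t base_lsn;                          // LSN at the last buffer write
  std::atomic<uint64_t> write_lsn_offset{0};  // bytes reserved since base_lsn

  explicit LsnAllocator(uint64_t base) : base_lsn(base) {}

  // Reserve len bytes and return the start LSN of the caller's slice.
  // One atomic read-modify-write; no lock and no retry loop.
  uint64_t reserve(uint64_t len) {
    assert(len > 0);
    return base_lsn +
           write_lsn_offset.fetch_add(len, std::memory_order_relaxed);
  }

  // End of everything reserved so far, as seen by this thread.
  uint64_t lsn_approx() const {
    return base_lsn + write_lsn_offset.load(std::memory_order_relaxed);
  }
};
```

Two successive calls such as `reserve(10)` and `reserve(5)` on an allocator based at LSN 1000 would return 1000 and 1010, so each writer can copy its records into its own slice without further coordination.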
Marko Mäkelä	f5bd250f5b	Merge 10.11 into 11.4	2025-03-28 13:55:21 +02:00
Marko Mäkelä	b6923420f3	MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size We deprecate and ignore the parameter innodb_buffer_pool_chunk_size and allow the buffer pool size to be changed in arbitrary 1-megabyte increments. innodb_buffer_pool_size_max: A new read-only startup parameter that specifies the maximum innodb_buffer_pool_size. If 0 or unspecified, it will default to the specified innodb_buffer_pool_size rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems. This maximum is very likely to be limited further by the operating system. The status variable Innodb_buffer_pool_resize_status will reflect the status of shrinking the buffer pool. When no shrinking is in progress, the string will be empty. Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size will block until the requested buffer pool size change has been implemented, or the execution is interrupted by a KILL statement, a client disconnect, or server shutdown. If the buf_flush_page_cleaner() thread notices that we are running out of memory, the operation may fail with ER_WRONG_USAGE. SET GLOBAL innodb_buffer_pool_size will be refused if the server was started with --large-pages (even if no HugeTLB pages were successfully allocated). This functionality is somewhat exercised by the test main.large_pages, which now runs also on Microsoft Windows. On Linux, explicit HugeTLB mappings are apparently excluded from the reported Resident Set Size (RSS), and apparently unshrinkable between mmap(2) and munmap(2). The buffer pool will be mapped to a contiguous virtual memory area that will be aligned and partitioned into extents of 8 MiB on 64-bit systems and 2 MiB on 32-bit systems. Within an extent, the first few innodb_page_size blocks contain buf_block_t objects that will cover the page frames in the rest of the extent. 
The number of such frames is precomputed in the array first_page_in_extent[] for each innodb_page_size. In this way, there is a trivial mapping between page frames and block descriptors and we do not need any lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map. We will always allocate the same number of block descriptors for an extent, even if we do not need all the buf_block_t in the last extent in case the innodb_buffer_pool_size is not an integer multiple of the extent size. The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default innodb_page_size=16k this corresponds to 5 MiB. However, now that the innodb_buffer_pool_size includes the memory allocated for the block descriptors, the minimum would be innodb_buffer_pool_size=6m. my_large_virtual_alloc(): A new function, similar to my_large_malloc(). my_virtual_mem_reserve(), my_virtual_mem_commit(), my_virtual_mem_decommit(), my_virtual_mem_release(): New interface mostly by Vladislav Vaintroub, to separately reserve and release virtual address space, as well as to commit and decommit memory within it. After my_virtual_mem_decommit(), the virtual memory range will be read-only or unaccessible, depending on whether the build option cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1 has been specified. This option is hard-coded on Microsoft Windows, where VirtualFree(MEM_DECOMMIT) will make the memory unaccessible. On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory will be zeroed out immediately. On other POSIX-like systems, madvise(MADV_FREE) will be used if available, to give the operating system kernel permission to zero out the virtual memory range. We prefer immediate freeing so that the reported resident set size (RSS) of the process will reflect the current innodb_buffer_pool_size. Shrinking the buffer pool is a rarely executed resource intensive operation, and the immediate configuration of the MMU mappings should not incur significant additional penalty. 
opt_super_large_pages: Declare only on Solaris. Actually, this is specific to the SPARC implementation of Solaris, but because we lack access to a Solaris development environment, we will not revise this for other MMU and ISA. buf_pool_t::chunk_t::create(): Remove. buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list. buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only(). buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>, only to be modified by the buf_flush_page_cleaner() thread. buf_pool_t::shrink(): Attempt to shrink the buffer pool. There are 3 possible outcomes: SHRINK_DONE (success), SHRINK_IN_PROGRESS (the caller may keep trying), and SHRINK_ABORT (we seem to be running out of buffer pool). While traversing buf_pool.LRU, release the contended buf_pool.mutex once in every 32 iterations in order to reduce starvation. Use lru_scan_itr for efficient traversal, similar to buf_LRU_free_from_common_LRU_list(). buf_pool_t::shrunk(): Update the reduced size of the buffer pool in a way that is compatible with buf_pool_t::page_guess(), and invoke my_virtual_mem_decommit(). buf_pool_t::resize(): Before invoking shrink(), run one batch of buf_flush_page_cleaner() in order to prevent LRU_warn(). Abort if shrink() recommends it, or no blocks were withdrawn in the past 15 seconds, or the execution of the statement SET GLOBAL innodb_buffer_pool_size was interrupted. buf_pool_t::first_to_withdraw: The first block descriptor that is out of the bounds of the shrunk buffer pool. buf_pool_t::withdrawn: The list of withdrawn blocks. If buf_pool_t::resize() is aborted before shrink() completes, we must be able to resurrect the withdrawn blocks in the free list. buf_pool_t::contains_zip(): Added a parameter for the number of least significant pointer bits to disregard, so that we can find any pointers to within a block that is supposed to be free. buf_pool_t::is_shrinking(): Return the total number of blocks that were withdrawn or are to be withdrawn. 
buf_pool_t::to_withdraw(): Return the number of blocks that will need to be withdrawn. buf_pool_t::usable_size(): Number of usable pages, considering possible in-progress attempt at shrinking the buffer pool. buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer. If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will be validated before being dereferenced. buf_pool_t::get_info(): Replaces buf_stats_get_pool_info(). innodb_init_param(): Refactored. We must first compute srv_page_size_shift and then determine the valid bounds of innodb_buffer_pool_size. buf_buddy_shrink(): Replaces buf_buddy_realloc(). Part of the work is deferred to buf_buddy_condense_free(), which is being executed when we are not holding any buf_pool.page_hash latch. buf_buddy_condense_free(): Do not relocate blocks. buf_buddy_free_low(): Do not care about buffer pool shrinking. This will be handled by buf_buddy_shrink() and buf_buddy_condense_free(). buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip() when we are allocating from the binary buddy system. Previously we were asserting this on multiple recursion levels. buf_buddy_block_free(), buf_buddy_free_low(): Assert !buf_pool.contains_zip(). buf_buddy_alloc_from(): Remove the redundant parameter j. buf_flush_LRU_list_batch(): Add the parameter to_withdraw to keep track of buf_pool.n_blocks_to_withdraw. buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch() if we are shrinking the buffer pool. In that case, we want to minimize the page relocations and just finish as quickly as possible. trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled() in every iteration, in case the buffer pool is being shrunk in the middle of a purge batch. Reviewed by: Debarun Banerjee	2025-03-26 17:05:44 +02:00
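The extent layout and the minimum-size arithmetic in MDEV-29445 can be made concrete with a short sketch. Everything here is an illustration under stated assumptions: the function names are hypothetical, and the 160-byte descriptor size is an invented placeholder, not the real sizeof(buf_block_t). The sketch computes how many leading pages of an extent would be needed to hold one descriptor per remaining page frame, and it reproduces the 256*5/4 pages = 5 MiB figure for innodb_page_size=16k.

```cpp
#include <cassert>
#include <cstddef>

// Number of leading pages in an extent that must hold block descriptors:
// the smallest k such that k * page_size >= (n - k) * descriptor_size,
// where n = extent_size / page_size. descriptor_size is an ASSUMED value.
constexpr size_t descriptor_pages(size_t extent_size, size_t page_size,
                                  size_t descriptor_size) {
  return ((extent_size / page_size) * descriptor_size +
          page_size + descriptor_size - 1) / (page_size + descriptor_size);
}

// Page frames left in the extent after the descriptor pages.
constexpr size_t frames_per_extent(size_t extent_size, size_t page_size,
                                   size_t descriptor_size) {
  return extent_size / page_size -
         descriptor_pages(extent_size, page_size, descriptor_size);
}

// The "trivial mapping": a frame's byte offset within its extent directly
// yields the index of its descriptor, with no lookup table.
inline size_t descriptor_index(size_t offset_in_extent, size_t page_size,
                               size_t first_frame_page) {
  assert(offset_in_extent / page_size >= first_frame_page);
  return offset_in_extent / page_size - first_frame_page;
}

// Minimum buffer pool size in bytes, excluding descriptors: 256*5/4 pages.
constexpr size_t min_pool_bytes(size_t page_size) {
  return 256 * 5 / 4 * page_size;
}
```

With an 8 MiB extent, 16 KiB pages and the assumed 160-byte descriptor, 5 pages would hold descriptors and 507 would remain as frames; the real counts depend on the actual descriptor size in a given build.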
Marko Mäkelä	49a6baec56	Merge 10.11 into 11.4	2025-03-03 11:07:56 +02:00
Marko Mäkelä	937ae4137e	MDEV-36155: MSAN use-of-uninitialized-value innodb.log_file_size_online Writing the resized redo log will write uninitialized data. There is a MEM_MAKE_DEFINED construct in the code to bless this; however, it was correct only for the initial length, not for the changed length. The MEM_MAKE_DEFINED is moved earlier in the code, where the length contains the correct value.	2025-02-27 08:19:07 +11:00
Marko Mäkelä	7e001b2a3c	MDEV-36082 Race condition between log_t::resize_start() and log_t::resize_abort() log_t::writer_update(): Add the parameter bool resizing, to indicate whether log resizing is in progress. We must enable log_writer_resizing only if resize_lsn>1, to ensure that log_t::resize_abort() will not choose the wrong log_sys.log_writer. log_t::resize_initiator: The thread that successfully invoked resize_start(). log_t::resize_start(): Simplify some logic, and assign resize_initiator if we successfully started log resizing. log_t::resize_abort(): Abort log resizing if we are the resize_initiator. innodb_log_file_size_update(): Clean up some logic. Reviewed by: Debarun Banerjee	2025-02-17 15:55:58 +02:00
Sergei Golubchik	f1a7693bc0	Merge branch '10.11' into 11.4	2025-01-14 23:45:41 +01:00
Marko Mäkelä	aa35f62f1c	MDEV-35810 Missing error handling in log resizing log_t::resize_start(): If the ib_logfile101 cannot be created, be sure to reset log_sys.resize_lsn. log_t::resize_abort(): In case SET GLOBAL innodb_log_file_size is aborted, delete the ib_logfile101.	2025-01-13 10:41:40 +02:00
Marko Mäkelä	42e6058629	MDEV-35785 innodb_log_file_mmap is not defined on 32-bit systems innodb_log_file_mmap: Use a constant documentation string that refers to persistent memory also when it is not available in the build. HAVE_INNODB_MMAP: Remove, and unconditionally enable this code. log_mmap(): On 32-bit systems, ensure that the size fits in 32 bits. log_t::resize_start(), log_t::resize_abort(): Only handle memory-mapping if HAVE_PMEM is defined. The generic memory-mapped interface is only for reading the log in recovery. Writable memory mappings are only for persistent memory, that is, Linux file systems with mount -o dax. Reviewed by: Debarun Banerjee, Otto Kekäläinen	2025-01-13 07:27:17 +02:00
Marko Mäkelä	4704435068	MDEV-35802 Race condition in log_t::persist() log_t::persist(): Remove the parameter holding_latch, and assert latch_holding_any(). We used to avoid acquiring a latch when log resizing was not in progress. That allowed a race condition to occur where log_t::write_checkpoint() has just completed log resizing. In that case, we could wrongly invoke pmem_persist() on the old log_sys.buf instead of the new one, which was shortly before known as log_sys.resize_buf. log_write_persist(): A non-inline wrapper function that will invoke log_sys.persist() while holding a shared log_sys.latch.	2025-01-10 08:15:09 +02:00
Marko Mäkelä	17f01186f5	Merge 10.11 into 11.4	2025-01-09 07:58:08 +02:00
Marko Mäkelä	c391fb1ff1	MDEV-35577 Broken recovery after SET GLOBAL innodb_log_file_size If InnoDB is killed in such a way that there had been no writes to a newly resized ib_logfile101 after it replaced ib_logfile0 in log_t::write_checkpoint(), it is possible that recovery will accidentally interpret some garbage at the end of the log as valid. log_t::write_buf(): To prevent the corruption, write an extra NUL byte at the end of log_sys.resize_buf, like we always did for the main log_sys.buf. To remove some conditional branches from a time critical code path, we instantiate a separate template for the rare case that the log is being resized. Define as __attribute__((always_inline)) so that this will be inlined also in the rare case the log is being resized. log_t::writer: Pointer to the current implementation of log_t::write_buf(). For quick access, this is located in the same cache line with log_sys.latch, which protects it. log_t::writer_update(): Update log_sys.writer. log_t::resize_write_buf(): Remove ATTRIBUTE_NOINLINE ATTRIBUTE_COLD. Now that log_t::write_buf() will be instantiated separately for the rare case of log resizing being in progress, there is no need to forbid this code from being inlined. Thanks to Thirunarayanan Balathandayuthapani for finding the root cause of this bug and suggesting the fix of writing an extra NUL byte. Reviewed by: Debarun Banerjee	2024-12-16 11:50:00 +02:00
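The dispatch technique described in MDEV-35577 (a pointer to the currently active template instantiation of the write function, updated only when the resizing state changes) can be sketched as follows. `LogSketch` and its string buffers are hypothetical stand-ins, and the extra NUL byte and all I/O are deliberately left out; only the idea of moving the `resizing` test out of the hot path is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical miniature of the pattern; this is NOT the real log_t.
struct LogSketch {
  std::string main_buf;    // stands in for log_sys.buf
  std::string resize_buf;  // stands in for log_sys.resize_buf

  using writer_t = void (*)(LogSketch&, const char*, size_t);
  writer_t writer;  // points at the currently valid instantiation

  // 'resizing' is a compile-time constant, so the common instantiation
  // write_buf<false> contains no code for the resize case.
  template <bool resizing>
  static void write_buf(LogSketch& log, const char* data, size_t len) {
    log.main_buf.append(data, len);
    if (resizing)
      log.resize_buf.append(data, len);
  }

  LogSketch() : writer(&LogSketch::write_buf<false>) {}

  // Called only when resizing starts or stops (a rare event).
  void writer_update(bool resizing) {
    if (resizing)
      writer = &LogSketch::write_buf<true>;
    else
      writer = &LogSketch::write_buf<false>;
  }
};
```

Callers would always invoke `log.writer(log, data, len)`; the cost of deciding which variant applies is paid once in `writer_update()` rather than on every write.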
Marko Mäkelä	1a557d087c	MDEV-35608 Fake PMEM on /dev/shm no longer works In commit `6acada713a` the logic for treating the file system of /dev/shm as if it were persistent memory was broken. Let us restore the original logic, so that we will have some more CI coverage of the memory-mapped redo log interface.	2024-12-09 12:53:38 +02:00
Oleksandr Byelkin	69d033d165	Merge branch '10.11' into 11.2	2024-10-29 16:42:46 +01:00
Oleksandr Byelkin	3d0fb15028	Merge branch '10.6' into 10.11	2024-10-29 15:24:38 +01:00
Vladislav Vaintroub	e8db5c8760	MDEV-35171 OS_FILE_NORMAL and OS_FILE_AIO are misleading Removed 'purpose' parameter from os_file_create() and related functions. Always use FILE_FLAG_OVERLAPPED when opening Windows files. No performance regression was measured, nor was there any measurable improvement.	2024-10-21 15:31:32 +02:00
Marko Mäkelä	12a91b57e2	Merge 10.11 into 11.2	2024-10-03 13:24:43 +03:00
Marko Mäkelä	dd5ce6b0c4	MDEV-34450 os_file_write_func() is an overkill for ib_logfile0 log_file_t::read(), log_file_t::write(): Invoke pread() or pwrite() directly, so that we can give more accurate diagnostics in case of a failure, and so that we will avoid the overhead of setting up 5(!) stack frames and related objects. tpool::pwrite(): Add a missing const qualifier.	2024-09-30 13:36:38 +03:00
Marko Mäkelä	6acada713a	MDEV-34062: Implement innodb_log_file_mmap on 64-bit systems When using the default innodb_log_buffer_size=2m, mariadb-backup --backup would spend a lot of time re-reading and re-parsing the log. For reads, it would be beneficial to memory-map the entire ib_logfile0 to the address space (typically 48 bits or 256 TiB) and read it from there, both during --backup and --prepare. We will introduce the Boolean read-only parameter innodb_log_file_mmap that will be OFF by default on most platforms, to avoid aggressive read-ahead of the entire ib_logfile0 when only a tiny portion would be accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON, because those platforms define a specific mmap(2) option for enabling such read-ahead and therefore it can be assumed that the default would be on-demand paging. This parameter will only have impact on the initial InnoDB startup and recovery. Any writes to the log will use regular I/O, except when the ib_logfile0 is stored in a specially configured file system that is backed by persistent memory (Linux "mount -o dax"). We also experimented with allowing writes of the ib_logfile0 via a memory mapping and decided against it. A fundamental problem would be unnecessary read-before-write in case of a major page fault, that is, when a new, not yet cached, virtual memory page in the circular ib_logfile0 is being written to. There appears to be no way to tell the operating system that we do not care about the previous contents of the page, or that the page fault handler should just zero it out. Many references to HAVE_PMEM have been replaced with references to HAVE_INNODB_MMAP. The predicate log_sys.is_pmem() has been replaced with log_sys.is_mmap() && !log_sys.is_opened(). Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the way that an open file handle to ib_logfile0 will be retained. In both code paths, log_sys.is_mmap() will hold. 
Holding a file handle open will allow log_t::clear_mmap() to disable the interface with fewer operations. It should be noted that ever since commit `685d958e38` (MDEV-14425) most 64-bit Linux platforms on our CI platforms (s390x a.k.a. IBM System Z being a notable exception) read and write /dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is persistent memory (mount -o dax). So, the memory mapping based log parsing that this change is enabling by default on Linux and FreeBSD has already been extensively tested on Linux. ::log_mmap(): If a log cannot be opened as PMEM and the desired access is read-only, try to open a read-only memory mapping. xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile(): Copy the InnoDB log in mariadb-backup --backup from a memory mapped file.	2024-09-26 18:47:12 +03:00
Marko Mäkelä	9ea7f7129a	MDEV-34909 DDL hang during SET GLOBAL innodb_log_file_size on PMEM log_t::persist(): Add a parameter holding_latch to specify whether the caller is already holding exclusive log_sys.latch, like log_write_and_flush() always is.	2024-09-20 15:29:56 +03:00
Marko Mäkelä	e782e416ac	Merge 10.11 into 11.2	2024-09-18 07:38:49 +03:00
Marko Mäkelä	e3f653ca66	MDEV-34750 fixup: -Wconversion on 32-bit log_t::resize_write_buf(): If d<0 and d>-length, d will fit in ssize_t, which is a signed 32-bit or 64-bit integer. Cast from int64_t to ssize_t to make this clear and to silence a compiler warning.	2024-09-14 10:35:28 +03:00
Marko Mäkelä	e91a799458	Merge 10.11 into 11.2	2024-08-29 16:02:57 +03:00
Marko Mäkelä	984606d747	MDEV-34750 SET GLOBAL innodb_log_file_size is not crash safe The recent commit `4ca355d863` (MDEV-33894) caused a serious regression for online InnoDB ib_logfile0 resizing, breaking crash-safety unless the memory-mapped log file interface is being used. However, the log resizing was broken also before this. To prevent such regressions in the future, we extend the test innodb.log_file_size_online with a kill and restart of the server and with some writes running concurrently with the log size change. When run enough times, this test revealed all the bugs that are being fixed by the code changes. log_t::resize_start(): Do not allow the resized log to start before the current log sequence number. In this way, there is no need to copy anything to the first block of resize_buf. The previous logic regarding that was incorrect in two ways. First, we would have to copy from the last written buffer (buf or flush_buf). Second, we failed to ensure that the mini-transaction end marker bytes would be 1 in the buffer. If the source ib_logfile0 had wrapped around an odd number of times, the end marker would be 0. This was occasionally observed when running the test innodb.log_file_size_online. log_t::resize_write_buf(): To adjust for the resize_start() change, do not write anything that would be before the resize_lsn. Take the buffer (resize_buf or resize_flush_buf) as a parameter. Starting with commit `4ca355d863` we no longer swap buffers when rewriting the last log block. log_t::append(): Define as a static function; only some debug assertions need to refer to the log_sys object. innodb_log_file_size_update(): Wake up the buf_flush_page_cleaner() if needed, and wait for it to complete a batch while waiting for the log resizing to be completed. If the current LSN is behind the resize target LSN, we will write redundant FILE_CHECKPOINT records to ensure that the log resizing completes. 
If the buf_pool.flush_list is empty or the buf_flush_page_cleaner() is stuck for some reason, our wait will time out in 5 seconds, so that we can periodically check if the execution of SET GLOBAL innodb_log_file_size was aborted. Previously, we could get into a busy loop here while the buf_flush_page_cleaner() would remain idle.	2024-08-29 14:53:08 +03:00
Oleksandr Byelkin	80abd847da	Merge branch '10.11' into 11.1	2024-08-03 09:32:42 +02:00
Marko Mäkelä	1c8af2ae53	MDEV-34422 Corrupted ib_logfile0 due to uninitialized log_sys.lsn_lock In commit `bf0b82d24b` (MDEV-33515) the function log_t::init_lsn_lock() was removed. This was fine on those platforms where InnoDB uses futex-based mutexes (Linux, FreeBSD, OpenBSD, NetBSD, DragonflyBSD). Dave Gosselin debugged this on Apple macOS and submitted a fix where pthread_mutex_wrapper::pthread_mutex_wrapper() would invoke init(). We do not really need that; we only need to invoke lsn_lock.init() like we used to do before commit `bf0b82d24b`. This should be a no-op for the futex based mutexes, which intentionally rely on zero initialization. The missing pthread_mutex_init() call would cause race conditions and corruption of log_sys.buf because multiple threads could apparently hold log_sys.lsn_lock concurrently in log_t::append_prepare(). The error would be caught by a debug assertion in log_t::write_buf(), or in non-debug builds by the fact that the server cannot be restarted due to an apparently missing FILE_CHECKPOINT record (because it had been written to the wrong offset in log_sys.buf). The failure in log_t::append_prepare() was caught on Microsoft Windows after enabling SUX_LOCK_GENERIC and therefore forcing the use of pthread_mutex_wrapper for the log_sys.lsn_lock. It appears to be fine to omit the pthread_mutex_init() call on GNU/Linux. log_t::create(): Invoke lsn_lock.init(). log_t::close(): Invoke lsn_lock.destroy(). To better catch this kind of issue in the future by simply defining SUX_LOCK_GENERIC on any platform, a separate debug instrumentation patch will be applied to the 10.6 branch later. Reviewed by: Debarun Banerjee	2024-07-30 11:58:02 +03:00
Oleksandr Byelkin	2447dda2c0	Merge branch '10.11' into 11.1	2024-07-08 22:40:16 +02:00
Marko Mäkelä	4ca355d863	MDEV-33894: Resurrect innodb_log_write_ahead_size As part of commit `685d958e38` (MDEV-14425) the parameter innodb_log_write_ahead_size was removed, because it was thought that determining the physical block size would be a sufficient replacement. However, we can only determine the physical block size on Linux or Microsoft Windows. On some file systems, the physical block size is not relevant. For example, XFS uses a block size of 4096 bytes even if the underlying block size may be smaller. On Linux, we failed to determine the physical block size if innodb_log_file_buffered=OFF was not requested or possible. This will be fixed. log_sys.write_size: The value of the reintroduced parameter innodb_log_write_ahead_size. To keep it simple, this is read-only and a power of two between 512 and 4096 bytes, so that the previous alignment guarantees are fulfilled. This will replace the previous log_sys.get_block_size(). log_sys.block_size, log_t::get_block_size(): Remove. log_t::set_block_size(): Ensure that write_size will not be less than the physical block size. There is no point to invoke this function with 512 or less, because that is the minimum value of write_size. innodb_params_adjust(): Add some disabled code for adjusting the minimum value and default value of innodb_log_write_ahead_size to reflect the log_sys.write_size. log_t::set_recovered(): Mark the recovery completed. This is the place to adjust some things if we want to allow write_size>4096. log_t::resize_write_buf(): Refer to write_size. log_t::resize_start(): Refer to write_size instead of get_block_size(). log_write_buf(): Simplify some arithmetics and remove a goto. log_t::write_buf(): Refer to write_size. If we are writing less than that, do not switch buffers, but keep writing to the same buffer. Move some code to improve the locality of reference. recv_scan_log(): Refer to write_size instead of get_block_size(). 
os_file_create_func(): For type==OS_LOG_FILE on Linux, always invoke os_file_log_maybe_unbuffered(), so that log_sys.set_block_size() will be invoked even if we are not attempting to use O_DIRECT. recv_sys_t::find_checkpoint(): Read the entire log header in a single 12 KiB request into log_sys.buf. Tested with: ./mtr --loose-innodb-log-write-ahead-size=4096 ./mtr --loose-innodb-log-write-ahead-size=2048	2024-06-27 16:38:08 +03:00
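The constraints that MDEV-33894 states for the reintroduced innodb_log_write_ahead_size (a power of two between 512 and 4096 bytes, and never less than the physical block size) can be expressed as a small check. The function names below are hypothetical and are not taken from the MariaDB source; the sketch only restates the rules in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// True if v is an acceptable write size per the commit message:
// a power of two in the range [512, 4096].
constexpr bool valid_write_size(uint32_t v) {
  return v >= 512 && v <= 4096 && (v & (v - 1)) == 0;
}

// Hypothetical counterpart of the rule "write_size will not be less than
// the physical block size": take the larger of the two values.
inline uint32_t effective_write_size(uint32_t configured,
                                     uint32_t physical_block_size) {
  assert(valid_write_size(configured));
  return physical_block_size > configured ? physical_block_size : configured;
}
```

For example, a configured value of 512 on a device that reports 4096-byte physical blocks would yield 4096, while a configured 2048 on a 512-byte device would stay at 2048.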
Sergei Golubchik	f0a5412037	Merge branch '11.0' into 11.1	2024-05-13 09:52:30 +02:00
Sergei Golubchik	f9807aadef	Merge branch '10.11' into 11.0	2024-05-12 12:18:28 +02:00
Marko Mäkelä	8e663f5e90	MDEV-32791 MariaDB cannot be installed on Red Hat ubi9 The libpmem dependency that had been added in commit `3daef523af` (MDEV-17084) did not achieve any measurable performance improvement when comparing the same PMEM device with and without "mount -o dax" using the Linux ext4 file system. Because Red Hat has deprecated libpmem, let us remove the code altogether. Note: This is a 10.6 version of commit `3f9f5ca48e` which will retain PMEM support in MariaDB Server 10.11.	2024-04-19 11:04:51 +03:00
Marko Mäkelä	3f9f5ca48e	MDEV-33447: libpmem is not available in RHEL 8 Because the Red Hat Enterprise Linux 8 core repository does not include libpmem, let us implement the necessary subset ourselves. pmem_persist(): Implement for 64-bit x86, ARM, POWER, RISC-V, Loongarch in a way that should be compatible with the https://github.com/pmem/pmdk/ implementation of pmem_persist(). The CMake option WITH_INNODB_PMEM can be used for enabling or disabling this interface at compile time. By default, it is enabled on all applicable systems that are covered by our CI system. Note: libpmem had not been previously enabled for Loongarch in our Debian packaging. It was enabled for RISC-V, but we will not enable it by default on RISC-V or Loongarch because we lack CI coverage. The generated code for x86_64 was reviewed and tested on two Intel implementations: one that only supports clflush, and another that supports both clflushopt and clwb. The generated machine code was also reviewed on https://godbolt.org using various compiler versions. Godbolt helpfully includes an option to compile to binary code and display the encoding, which was useful on POWER. Reviewed by: Vladislav Vaintroub	2024-04-19 10:54:08 +03:00
Marko Mäkelä	42bda685db	MDEV-33585 follow-up optimization log_t: Define buf_size, max_buf_free as 32-bit and next_checkpoint_no as byte (we only need a bit) and rearrange some data members, so that on AMD64 we can fit log_sys.latch and log_sys.log in the same 64-byte cache line. mtr_t::commit_log(), mtr_t::commit_logger: A part of mtr_t::commit() split into a separate function, so that we will not unnecessarily invoke log_sys.get_write_target() when running on a memory-mapped log file, or log_sys.is_pmem(). Reviewed by: Vladislav Vaintroub Tested by: Matthias Leich	2024-04-09 09:36:45 +03:00
Marko Mäkelä	683fbced6b	Merge 11.0 into 11.1	2024-03-28 12:15:36 +02:00
Marko Mäkelä	fec2fd6add	Merge 10.11 into 11.0	2024-03-28 10:51:36 +02:00
Marko Mäkelä	788953463d	Merge 10.6 into 10.11 Some fixes related to commit `f838b2d799` and Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row() for system-versioned tables were provided by Nikita Malyavin. This was required by test versioning.rpl,trx_id,row.	2024-03-28 09:16:57 +02:00
Marko Mäkelä	bf0b82d24b	MDEV-33515 log_sys.lsn_lock causes excessive context switching The log_sys.lsn_lock is a very contended resource with a small critical section in log_sys.append_prepare(). On many processor microarchitectures, replacing the system call based log_sys.lsn_lock with a pure spin lock would fare worse during high concurrency workloads, wasting a significant amount of CPU cycles in the spin loop. On other microarchitectures, we would see a significant amount of time being spent in native_queued_spin_lock_slowpath() in the Linux kernel, plus context switching between user and kernel address space. This was pointed out by Steve Shaw from Intel Corporation. Depending on the workload and the hardware implementation, it may be useful to use a pure spin lock in log_sys.append_prepare(). We will introduce a parameter. The statement SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50; would enable a spin lock that will execute that many MY_RELAX_CPU() operations (such as the x86 PAUSE instruction) between successive attempts of acquiring the spin lock. The use of a system call based log_sys.lsn_lock (which is the default setting) can be enabled by SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0; This patch will also introduce #ifdef LOG_LATCH_DEBUG (part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate tracking of log_sys.latch ownership and reorganize the fields of log_sys to improve the locality of reference and to reduce the chances of false sharing. When a spin lock is being used, it will be maintained in the most significant bit of log_sys.buf_free. This is useful, because that is one of the fields that is covered by the lock. For IA-32 or AMD64, we implement the spin lock specially via log_t::lsn_lock_bts(), employing the i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would translate into an inefficient loop around LOCK CMPXCHG. mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay. 
mtr_t::finisher: Pointer to the currently used mtr_t::finish_write() implementation. This allows us to avoid introducing conditional branches. We no longer invoke log_sys.is_pmem() at the mini-transaction level, but we would do that in log_write_up_to(). mtr_t::finisher_update(): Update finisher when spin_wait_delay is changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or vice versa).	2024-03-22 12:29:01 +02:00
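The spin lock described in MDEV-33515, kept in the most significant bit of a field that the lock itself protects, can be sketched portably with `std::atomic::fetch_or()`. This is an illustration only: `BufFreeLock` is a hypothetical type, `std::this_thread::yield()` is a stand-in for MY_RELAX_CPU(), and the commit notes that on IA-32 and AMD64 the real code uses LOCK BTS via log_t::lsn_lock_bts() instead of `fetch_or()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Most significant bit of the 32-bit field serves as the lock flag.
constexpr uint32_t kLockBit = 1U << 31;

// Hypothetical sketch; this is NOT the real log_t.
struct BufFreeLock {
  std::atomic<uint32_t> buf_free{0};  // low 31 bits: value, top bit: lock

  // Attempt to set the lock bit; succeeds if it was previously clear.
  bool try_lock() {
    return !(buf_free.fetch_or(kLockBit, std::memory_order_acquire) &
             kLockBit);
  }

  // Spin, pausing spin_wait_delay times between successive attempts.
  void lock(unsigned spin_wait_delay) {
    while (!try_lock())
      for (unsigned i = 0; i < spin_wait_delay; i++)
        std::this_thread::yield();  // stand-in for MY_RELAX_CPU()
  }

  void unlock() {
    uint32_t old = buf_free.fetch_and(~kLockBit, std::memory_order_release);
    assert(old & kLockBit);  // must have been locked
    (void) old;
  }

  // The protected value, with the lock bit masked out.
  uint32_t value() const {
    return buf_free.load(std::memory_order_relaxed) & ~kLockBit;
  }
};
```

Keeping the flag inside the protected word means that acquiring the lock already brings the data's cache line to the acquiring core, which matches the locality argument made in the commit message.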
Marko Mäkelä	4ac8c4c820	MDEV-24167 fixup: Stricter assertion log_free_check(): Assert that the current thread is not holding lock_sys.latch in any mode. This fixes up commit `5f2dcd112b`	2024-03-12 09:20:36 +02:00
Marko Mäkelä	8155342a96	Merge 10.11 into 11.0	2024-02-20 15:31:18 +02:00
Marko Mäkelä	d73baa402a	Merge 10.11 into 11.0	2024-02-20 12:02:01 +02:00
Marko Mäkelä	3dd7b0a80c	Cleanup: Remove OS_FILE_ON_ERROR_NO_EXIT Ever since commit `412ee0330c` or commit `a440d6ed3a` InnoDB should generally not abort when failing to open or create files. In Datafile::open_or_create() we had failed to set the flag to avoid abort() on failure, but everywhere else we were setting it. We may still call abort() via os_file_handle_error(). Reviewed by: Vladislav Vaintroub	2024-02-20 11:22:52 +02:00
Marko Mäkelä	86c2c89743	Merge 10.6 into 10.11	2024-02-08 15:04:46 +02:00
Marko Mäkelä	91a2192bf2	Merge 10.5 into 10.6	2024-02-07 13:51:03 +02:00
Daniel Black	e06b159f02	MDEV-33397: Innodb include OS error information when failing to write to iblogfileX Without an OS error, it could be any one of the many errors from write.	2024-02-07 17:27:35 +11:00
Marko Mäkelä	edc478847b	Merge 11.0 into 11.1	2023-11-24 15:58:35 +02:00
Marko Mäkelä	5b6134b040	Merge 10.11 into 11.0	2023-11-24 11:20:56 +02:00
Marko Mäkelä	7443ad1c8a	MDEV-32374 log_sys.lsn_lock is a performance hog The log_sys.lsn_lock that was introduced in commit `a635c40648` had better be located in the same cache line with log_sys.latch so that log_t::append_prepare() needs to modify only two first cache lines where log_sys is stored. log_t::lsn_lock: On Linux, change the type from pthread_mutex_t to something that may be as small as 32 bits, to pack more data members in the same cache line. On Microsoft Windows, CRITICAL_SECTION works better. log_t::check_flush_or_checkpoint_: Renamed to need_checkpoint. There is no need to pause all writer threads in log_free_check() when we only need to write log_sys.buf to ib_logfile0. That will be done in mtr_t::commit(). log_t::append_prepare_wait(): Make the member function non-static to simplify the call interface, and add a parameter for the LSN. log_t::append_prepare(): Invoke append_prepare_wait() at most once. Only set_check_for_checkpoint() if a log checkpoint needs to be written. If the log buffer needs to be written, we will take care of it ourselves later in our caller. This will reduce interference with log_free_check() in other threads. mtr_t::commit(): Call log_write_up_to() if needed. log_t::get_write_target(): Return a log_write_up_to() target to mtr_t::commit(). buf_flush_ahead(): If we are in furious flushing, call log_sys.set_check_for_checkpoint() so that all writers will wait in log_free_check() until the checkpoint is done. Otherwise, the test innodb.insert_into_empty could occasionally report an error "Crash recovery is broken". log_check_margins(): Replaced by log_free_check(). log_flush_margin(): Removed. This is part of mtr_t::commit() and other operations that write log. log_t::create(), log_t::attach(): Guarantee that buf_free < max_buf_free will always hold on PMEM, to satisfy an assumption of log_t::get_write_target(). log_write_up_to(): Assert lsn!=0. 
Such calls are not incorrect, but it is cheaper to test that single unlikely condition in mtr_t::commit() rather than test several conditions in log_write_up_to(). innodb_drop_database(), unlock_and_close_files(): Check the LSN before calling log_write_up_to(). ha_innobase::commit_inplace_alter_table(): Remove redundant calls to log_write_up_to() after calling unlock_and_close_files(). Reviewed by: Vladislav Vaintroub Stress tested by: Matthias Leich Performance tested by: Steve Shaw	2023-11-21 14:38:35 +02:00
Marko Mäkelä	90d968dab9	Merge 10.6 into 10.11	2023-11-20 10:08:19 +02:00


443 Commits