mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-08-29 00:08:14 +03:00

Author	SHA1	Message	Date
Marko Mäkelä	3a9a3be1c6	MDEV-23855: Improve InnoDB log checkpoint performance After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability bottleneck, log checkpoints became a new bottleneck. If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is set high and the workload fits in the buffer pool, the page cleaner thread will perform very little flushing. When we reach the capacity of the circular redo log file ib_logfile0 and must initiate a checkpoint, some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF, then flushing would continue at the innodb_io_capacity rate, and writers would be throttled.) We have the best chance of advancing the checkpoint LSN immediately after a page flush batch has been completed. Hence, it is best to perform checkpoints after every batch in the page cleaner thread, attempting to run once per second. By initiating high-priority flushing in the page cleaner as early as possible, we aim to make the throughput more stable. The function buf_flush_wait_flushed() used to sleep for 10ms, hoping that the page cleaner thread would do something during that time. The observed end result was that a large number of threads that call log_free_check() would end up sleeping while nothing useful is happening. We will revise the design so that in the default innodb_flush_sync=ON mode, buf_flush_wait_flushed() will wake up the page cleaner thread to perform the necessary flushing, and it will wait for a signal from the page cleaner thread. If innodb_io_capacity is set to a low value (causing the page cleaner to throttle its work), a write workload would initially perform well, until the capacity of the circular ib_logfile0 is reached and log_free_check() will trigger checkpoints. At that point, the extra waiting in buf_flush_wait_flushed() will start reducing throughput. The page cleaner thread will also initiate log checkpoints after each buf_flush_lists() call, because that is the best point of time for the checkpoint LSN to advance by the maximum amount. Even in 'furious flushing' mode we invoke buf_flush_lists() with innodb_io_capacity_max pages at a time, and at the start of each batch (in the log_flush() callback function that runs in a separate task) we will invoke os_aio_wait_until_no_pending_writes(). This tweak allows the checkpoint to advance in smaller steps and significantly reduces the maximum latency. On an Intel Optane 960 NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds. On Microsoft Windows with a slower SSD, it reduced from more than 180 seconds to 0.6 seconds. We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity per second whenever the dirty proportion of buffer pool pages exceeds innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try to make page_cleaner_flush_pages_recommendation() more consistent and predictable: if we are below innodb_adaptive_flushing_lwm, let us flush pages according to the return value of af_get_pct_for_dirty(). innodb_max_dirty_pages_pct_lwm: Revert the change of the default value that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0 guarantees that a shutdown of an idle server will be fast. Users might be surprised if normal shutdown suddenly became slower when upgrading within a GA release series. innodb_checkpoint_usec: Remove. The master task will no longer perform periodic log checkpoints. It is the duty of the page cleaner thread. log_sys.max_modified_age: Remove. The current span of the buf_pool.flush_list expressed in LSN only matters for adaptive flushing (outside the 'furious flushing' condition). For the correctness of checkpoints, the only thing that matters is the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn). This run-time constant was also reported as log_max_modified_age_sync. log_sys.max_checkpoint_age_async: Remove. This does not serve any purpose, because the checkpoints will now be triggered by the page cleaner thread. We will retain the log_sys.max_checkpoint_age limit for engaging 'furious flushing'. page_cleaner.slot: Remove. It turns out that page_cleaner_slot.flush_list_time was duplicating page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass was duplicating page_cleaner.flush_pass. Likewise, there were some redundant monitor counters, because the page cleaner thread no longer performs any buf_pool.LRU flushing, and because there only is one buf_flush_page_cleaner thread. buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex. buf_pool_t::get_oldest_modification(): Add a parameter to specify the return value when no persistent data pages are dirty. Require the caller to hold buf_pool.flush_list_mutex. log_buf_pool_get_oldest_modification(): Take the fall-back LSN as a parameter. All callers will also invoke log_sys.get_lsn(). log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed(). buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF) and wait for the page cleaner to complete. If the page cleaner thread is not running (which can be the case durign shutdown), initiate the flush and wait for it directly. buf_flush_ahead(): If innodb_flush_sync=ON (the default), submit a new buf_flush_sync_lsn target for the page cleaner but do not wait for the flushing to finish. log_get_capacity(), log_get_max_modified_age_async(): Remove, to make it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes. page_cleaner_flush_pages_recommendation(): Protect all access to buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there were some race conditions in the calculation. buf_flush_sync_for_checkpoint(): New function to process buf_flush_sync_lsn in the page cleaner thread. At the end of each batch, we try to wake up any blocked buf_flush_wait_flushed(). If everything up to buf_flush_sync_lsn has been flushed, we will reset buf_flush_sync_lsn=0. The page cleaner thread will keep 'furious flushing' until the limit is reached. Any threads that are waiting in buf_flush_wait_flushed() will be able to resume as soon as their own limit has been satisfied. buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not sleep as long as it is set. Do not update any page_cleaner statistics for this special mode of operation. In the normal mode (buf_flush_sync_lsn is not set for innodb_flush_sync=ON), try to wake up once per second. No longer check whether srv_inc_activity_count() has been called. After each batch, try to perform a log checkpoint, because the best chances for the checkpoint LSN to advance by the maximum amount are upon completing a flushing batch. log_t: Move buf_free, max_buf_free possibly to the same cache line with log_sys.mutex. log_margin_checkpoint_age(): Simplify the logic, and replace a 0.1-second sleep with a call to buf_flush_wait_flushed() to initiate flushing. Moved to the same compilation unit with the only caller. log_close(): Clean up the calculations. (Should be no functional change.) Return whether flush-ahead is needed. Moved to the same compilation unit with the only caller. mtr_t::finish_write(): Return whether flush-ahead is needed. mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid external calls in mtr_t::commit() and make the logic easier to follow by having related code in a single compilation unit. Also, we will invoke srv_stats.log_write_requests.inc() only once per mini-transaction commit, while not holding mutexes. log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age. Upon reaching log_sys.max_checkpoint_age where we must wait to prevent the log from getting corrupted, let us wait for at most 1MiB of LSN at a time, before rechecking the condition. This should allow writers to proceed even if the redo log capacity has been reached and 'furious flushing' is in progress. We no longer care about log_sys.max_modified_age_sync or log_sys.max_modified_age_async. The log_sys.max_modified_age_sync could be a relic from the time when there was a srv_master_thread that wrote dirty pages to data files. Also, we no longer have any log_sys.max_checkpoint_age_async limit, because log checkpoints will now be triggered by the page cleaner thread upon completing buf_flush_lists(). log_set_capacity(): Simplify the calculations of the limit (no functional change). log_checkpoint_low(): Split from log_checkpoint(). Moved to the same compilation unit with the caller. log_make_checkpoint(): Only wait for everything to be flushed until the current LSN. create_log_file(): After checkpoint, invoke log_write_up_to() to ensure that the FILE_CHECKPOINT record has been written. This avoids ut_ad(!srv_log_file_created) in create_log_file_rename(). srv_start(): Do not call recv_recovery_from_checkpoint_start() if the log has just been created. Set fil_system.space_id_reuse_warned before dict_boot() has been executed, and clear it after recovery has finished. dict_boot(): Initialize fil_system.max_assigned_id. srv_check_activity(): Remove. The activity count is counting transaction commits and therefore mostly interesting for the purge of history. BtrBulk::insert(): Do not explicitly wake up the page cleaner, but do invoke srv_inc_activity_count(), because that counter is still being used in buf_load_throttle_if_needed() for some heuristics. (It might be cleaner to execute buf_load() in the page cleaner thread!) Reviewed by: Vladislav Vaintroub	2020-10-26 17:09:01 +02:00
Marko Mäkelä	7cffb5f6e8	MDEV-23399: Performance regression with write workloads The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit `d1ab89037a` unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub	2020-10-15 17:04:56 +03:00
Marko Mäkelä	7e798534f0	MDEV-22858 Remove unused innodb_mem_validate_usec, innodb_master_purge_usec MONITOR_SRV_MEM_VALIDATE_MICROSECOND, MEM_PERIODIC_CHECK, SRV_MASTER_MEM_VALIDATE_INTERVAL: Remove. These were unused ever since UNIV_MEM_DEBUG was removed. MONITOR_SRV_PURGE_MICROSECOND: Remove. This was always unused.	2020-06-10 18:07:13 +03:00
Thirunarayanan Balathandayuthapani	a5584b13d1	MDEV-15528 Punch holes when pages are freed The following parameters are deprecated: innodb-background-scrub-data-uncompressed innodb-background-scrub-data-compressed innodb-background-scrub-data-interval innodb-background-scrub-data-check-interval Removed scrubbing code completely(btr0scrub.h, btr0scrub.cc) Removed information_schema.innodb_tablespaces_scrubbing tables Removed the scrubbing logic from fil_crypt_thread()	2020-03-10 10:51:08 +05:30
Marko Mäkelä	9e488653ae	Cleanup: Make MONITOR_LSN_CHECKPOINT_AGE a value. Compute MONITOR_LSN_CHECKPOINT_AGE on demand in srv_mon_process_existing_counter(). This allows us to remove the overhead of MONITOR_SET calls for the counter.	2020-03-04 12:59:20 +02:00
Marko Mäkelä	20a7f75fbf	MDEV-15058: Revert the changes to INFORMATION_SCHEMA For compatibility with diagnostic software, let us return a dummy buffer pool identifier 0 and restore the columns that were initially deleted in commit `1a6f708ec5`: information_schema.innodb_buffer_page.pool_id information_schema.innodb_buffer_page_lru.pool_id information_schema.innodb_buffer_pool_stats.pool_id information_schema.innodb_cmpmem.buffer_pool_instance information_schema.innodb_cmpmem_reset.buffer_pool_instance Thanks to Vladislav Vaintroub for pointing this out.	2020-02-12 20:54:59 +02:00
Marko Mäkelä	1a6f708ec5	MDEV-15058: Deprecate and ignore innodb_buffer_pool_instances Our benchmarking efforts indicate that the reasons for splitting the buf_pool in commit `c18084f71b` have mostly gone away, possibly as a result of mysql/mysql-server@ce6109ebfd or similar work. Only in one write-heavy benchmark where the working set size is ten times the buffer pool size, the buf_pool->mutex would be less contended with 4 buffer pool instances than with 1 instance, in buf_page_io_complete(). That contention could be alleviated further by making more use of std::atomic and by splitting buf_pool_t::mutex further (MDEV-15053). We will deprecate and ignore the following parameters: innodb_buffer_pool_instances innodb_page_cleaners There will be only one buffer pool and one page cleaner task. In a number of INFORMATION_SCHEMA views, columns that indicated the buffer pool instance will be removed: information_schema.innodb_buffer_page.pool_id information_schema.innodb_buffer_page_lru.pool_id information_schema.innodb_buffer_pool_stats.pool_id information_schema.innodb_cmpmem.buffer_pool_instance information_schema.innodb_cmpmem_reset.buffer_pool_instance	2020-02-12 14:45:21 +02:00
Marko Mäkelä	28c89b7151	Merge 10.4 into 10.5	2019-12-16 07:47:17 +02:00
Oleksandr Byelkin	a15234bf4b	Merge branch '10.3' into 10.4	2019-12-09 15:09:41 +01:00
Faustin Lammler	2df2238cb8	Lintian complains on spelling error The lintian check complains on spelling error: https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/95739	2019-12-02 12:41:13 +02:00
Marko Mäkelä	b42294bc64	MDEV-19514 Defer change buffer merge until pages are requested We will remove the InnoDB background operation of merging buffered changes to secondary index leaf pages. Changes will only be merged as a result of an operation that accesses a secondary index leaf page, such as a SQL statement that performs a lookup via that index, or is modifying the index. Also ROLLBACK and some background operations, such as purging the history of committed transactions, or computing index cardinality statistics, can cause change buffer merge. Encryption key rotation will not perform change buffer merge. The motivation of this change is to simplify the I/O logic and to allow crash recovery to happen in the background (MDEV-14481). We also hope that this will reduce the number of "mystery" crashes due to corrupted data. Because change buffer merge will typically take place as a result of executing SQL statements, there should be a clearer connection between the crash and the SQL statements that were executed when the server crashed. In many cases, a slight performance improvement was observed. This is joint work with Thirunarayanan Balathandayuthapani and was tested by Axel Schwenke and Matthias Leich. The InnoDB monitor counter innodb_ibuf_merge_usec will be removed. On slow shutdown (innodb_fast_shutdown=0), we will continue to merge all buffered changes (and purge all undo log history). Two InnoDB configuration parameters will be changed as follows: innodb_disable_background_merge: Removed. This parameter existed only in debug builds. All change buffer merges will use synchronous reads. innodb_force_recovery will be changed as follows: * innodb_force_recovery=4 will be the same as innodb_force_recovery=3 (the change buffer merge cannot be disabled; it can only happen as a result of an operation that accesses a secondary index leaf page). The option used to be capable of corrupting secondary index leaf pages. Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'. * innodb_force_recovery=5 (which essentially hard-wires SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED) becomes safe to use. Bogus data can be returned to SQL, but persistent InnoDB data files will not be corrupted further. * innodb_force_recovery=6 (ignore the redo log files) will be the only option that can potentially cause persistent corruption of InnoDB data files. Code changes: buf_page_t::ibuf_exist: New flag, to indicate whether buffered changes exist for a buffer pool page. Pages with pending changes can be returned by buf_page_get_gen(). Previously, the changes were always merged inside buf_page_get_gen() if needed. ibuf_page_exists(const buf_page_t&): Check if a buffered changes exist for an X-latched or read-fixed page. buf_page_get_gen(): Add the parameter allow_ibuf_merge=false. All callers that know that they may be accessing a secondary index leaf page must pass this parameter as allow_ibuf_merge=true, unless it does not matter for that caller whether all buffered changes have been applied. Assert that whenever allow_ibuf_merge holds, the page actually is a leaf page. Attempt change buffer merge only to secondary B-tree index leaf pages. btr_block_get(): Add parameter 'bool merge'. All callers of btr_block_get() should know whether the page could be a secondary index leaf page. If it is not, we should avoid consulting the change buffer bitmap to even consider a merge. This is the main interface to requesting index pages from the buffer pool. ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace buf_page_get_known_nowait() with much simpler logic, because it is now guaranteed that that the block is x-latched or read-fixed. mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge(). On crash recovery, we will no longer merge any buffered changes for the pages that we read into the buffer pool during the last batch of applying log records. buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove. btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait() to its only remaining caller. buf_page_make_young_if_needed(): Define as an inline function. Add the parameter buf_pool. buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the parameter buf_pool. fil_space_validate_for_mtr_commit(): Remove a bogus comment about background merge of the change buffer. btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(), btr_cur_open_at_index_side_func(): Use narrower data types and scopes. ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages(). Merge the change buffer by invoking buf_page_get_gen().	2019-10-11 17:28:15 +03:00
Marko Mäkelä	d09aec7a15	MDEV-19940 Clean up INFORMATION_SCHEMA.INNODB_ tables Shorten some VARCHAR attributes to a more reasonable length. INNODB_METRICS: Rename the column STATUS to ENABLED, and make it Boolean. Replace with INT(1) many Boolean attributes that were declared as VARCHAR containing 'NO','YES','disabled','enabled','Uninitialized','Initialized'. Replace some VARCHAR attributes with ENUM. Replace some BIGINT with INT when 32 bits are sufficient. Remove INNODB_SYS_TABLESPACES.SPACE_TYPE. The type of a tablespace can be derived from the tablespace ID. A fixed number is used for the system tablespace and the temporary tablespace. All other tablespaces are single-table or single-partition tablespaces. i_s_locks_row_t::lock_type, lock_get_type_str(): Remove. This is a redundant field. Table and record locks can be distinguished by whether i_s_locks_row_t::lock_index is NULL. fill_trx_row(): Do not unnecessarily copy the constant strings that trx->op_info is pointing to. i_s_locks_row_t::lock_mode: Replace string with integer. lock_get_mode_str(), lock_get_trx_id(), lock_get_trx(): Remove. field_store_ulint(): Remove.	2019-07-04 00:09:16 +03:00
Oleksandr Byelkin	c07325f932	Merge branch '10.3' into 10.4	2019-05-19 20:55:37 +02:00
Marko Mäkelä	874f8f30f2	Merge 10.2 into 10.3	2019-05-14 17:25:25 +03:00
Marko Mäkelä	50999738ea	Merge 10.1 into 10.2	2019-05-13 18:48:28 +03:00
Marko Mäkelä	2647fd101d	MDEV-19445 heap-use-after-free related to innodb_ft_aux_table Try to fix the race conditions between SET GLOBAL innodb_ft_aux_table = ...; and access to the INFORMATION_SCHEMA tables that depend on this variable. innodb_ft_aux_table: Replaces fts_internal_tbl_name,fts_internal_tbl_name2. Just store the user-specified parameter as is. innodb_ft_aux_table_id: The table_id corresponding to SET GLOBAL innodb_ft_aux_table, or 0 if the table does not exist or does not contain FULLTEXT INDEX. If the table is renamed later, the INFORMATION_SCHEMA tables will continue to refer to the table. If the table is dropped or rebuilt, the INFORMATION_SCHEMA tables will not find the table.	2019-05-13 17:16:42 +03:00
Oleksandr Byelkin	c51f85f882	Merge branch '10.2' into 10.3	2019-05-12 17:20:23 +02:00
Oleksandr Byelkin	633946fb63	Merge branch '10.1' into 10.2	2019-05-06 18:07:40 +02:00
Eugene Kosov	c83f837053	MDEV-18214 remove some duplicated MONITOR counters MONITOR_PENDING_LOG_WRITE MONITOR_PENDING_CHECKPOINT_WRITE MONITOR_LOG_IO: read values from log_t members instead of updating own monitor variables	2019-05-06 16:00:15 +03:00
Marko Mäkelä	7b42d892de	Merge 10.2 into 10.3	2019-04-02 09:19:34 +03:00
Marko Mäkelä	bce380f2a5	Merge 10.1 into 10.2	2019-04-02 09:14:15 +03:00
Marko Mäkelä	10dd290b4b	MDEV-17380 innodb_flush_neighbors=ON should be ignored on SSD For tablespaces that do not reside on spinning storage, it does not make sense to attempt to write nearby pages when writing out dirty pages from the InnoDB buffer pool. It is actually detrimental to performance and to the life span of flash ROM storage. With this change, MariaDB will detect whether an InnoDB file resides on solid-state storage. The detection has been implemented for Linux and Microsoft Windows. For other systems, we will err on the safe side and assume that files reside on SSD. As part of this change, we will reduce the number of fstat() calls when opening data files on POSIX systems and slightly clean up some file I/O code. FIXME: os_is_sparse_file_supported() on POSIX works in a destructive manner. Thus, we can only invoke it when creating files, not when opening them. For diagnostics, we introduce the column ON_SSD to the table INFORMATION_SCHEMA.INNODB_TABLESPACES_SCRUBBING. The table INNODB_SYS_TABLESPACES might seem more appropriate, but its purpose is to reflect the contents of the InnoDB system table SYS_TABLESPACES, which we would like to remove at some point. On Microsoft Windows, querying StorageDeviceSeekPenaltyProperty sometimes returns ERROR_GEN_FAILURE instead of ERROR_INVALID_FUNCTION or ERROR_NOT_SUPPORTED. We will silently ignore also this error, and assume that the file does not reside on SSD. On Linux, the detection will be based on the files /sys/block//queue/rotational and /sys/block//dev. Especially for USB storage, it is possible that /sys/block//queue/rotational will wrongly report 1 instead of 0. fil_node_t::on_ssd: Whether the InnoDB data file resides on solid-state storage. fil_system_t::ssd: Collection of Linux block devices that reside on non-rotational storage. fil_system_t::create(): Detect ssd on Linux based on the contents of /sys/block//queue/rotational and /sys/block/*/dev. fil_system_t::is_ssd(dev_t): Determine if a Linux block device is non-rotational. Partitions will be identified with the containing block device by assuming that the least significant 4 bits of the minor number identify a partition, and that the "partition number" of the entire device is 0.	2019-04-01 12:00:56 +03:00
Marko Mäkelä	b3860a8621	InnoDB review fixes Fix the formatting, and remove the MONITOR interface. Remove unnecessary wrapper functions for the callbacks, and replace void* with ha_innobase*.	2019-02-05 21:51:35 +02:00
Igor Babaev	33907360f5	MDEV-16188 Post-merge corrections and adjustments	2019-02-04 22:44:33 -08:00
Eugene Kosov	89337d510e	MDEV-16580 Remove unused monitor counters from InnoDB Remove one totally dead monitor.	2018-11-16 11:52:44 +03:00
Marko Mäkelä	13f7ac2269	MDEV-15705 Remove global status counter Innodb_pages0_read MDEV-9931 introduced a counter for keeping track of reads of the first page of InnoDB data files, because the original implementation of data-at-rest-encryption for InnoDB introduced new code paths for reading the pages. Ultimately, the extra reads of the first page were removed, and the encryption subsystem will be initialized whenever we first read the first page of each data file, in fil_node_open_file(). It should not be that interesting to observe how many times an InnoDB data file was opened for the first time.	2018-05-28 09:36:18 +03:00
Eugene Kosov	22f2b39c14	fix data races in rwlock	2017-12-19 13:03:04 +04:00
Marko Mäkelä	99e017d099	Remove INFORMATION_SCHEMA.INNODB_TRX.trx_adaptive_hash_latched In MariaDB 10.2, this column is practically always reported as 0, even before the data field trx_t::has_search_latch was removed.	2017-06-20 09:39:13 +03:00
Marko Mäkelä	0c92794db3	Remove deprecated InnoDB file format parameters The following options will be removed: innodb_file_format innodb_file_format_check innodb_file_format_max innodb_large_prefix They have been deprecated in MySQL 5.7.7 (and MariaDB 10.2.2) in WL#7703. The file_format column in two INFORMATION_SCHEMA tables will be removed: innodb_sys_tablespaces innodb_sys_tables Code to update the file format tag at the end of page 0:5 (TRX_SYS_PAGE in the InnoDB system tablespace) will be removed. When initializing a new database, the bytes will remain 0. All references to the Barracuda file format will be removed. Some references to the Antelope file format (meaning ROW_FORMAT=REDUNDANT or ROW_FORMAT=COMPACT) will remain. This basically ports WL#7704 from MySQL 8.0.0 to MariaDB 10.3.1: commit 4a69dc2a95995501ed92d59a1de74414a38540c6 Author: Marko Mäkelä <marko.makela@oracle.com> Date: Wed Mar 11 22:19:49 2015 +0200	2017-06-02 09:36:14 +03:00
Sergei Golubchik	da4d71d10d	Merge branch '10.1' into 10.2	2017-03-30 12:48:42 +02:00
Jan Lindström	50eb40a2a8	MDEV-11738: Mariadb uses 100% of several of my 8 cpus doing nothing MDEV-11581: Mariadb starts InnoDB encryption threads when key has not changed or data scrubbing turned off Background: Key rotation is based on background threads (innodb-encryption-threads) periodically going through all tablespaces on fil_system. For each tablespace current used key version is compared to max key age (innodb-encryption-rotate-key-age). This process naturally takes CPU. Similarly, in same time need for scrubbing is investigated. Currently, key rotation is fully supported on Amazon AWS key management plugin only but InnoDB does not have knowledge what key management plugin is used. This patch re-purposes innodb-encryption-rotate-key-age=0 to disable key rotation and background data scrubbing. All new tables are added to special list for key rotation and key rotation is based on sending a event to background encryption threads instead of using periodic checking (i.e. timeout). fil0fil.cc: Added functions fil_space_acquire_low() to acquire a tablespace when it could be dropped concurrently. This function is used from fil_space_acquire() or fil_space_acquire_silent() that will not print any messages if we try to acquire space that does not exist. fil_space_release() to release a acquired tablespace. fil_space_next() to iterate tablespaces in fil_system using fil_space_acquire() and fil_space_release(). Similarly, fil_space_keyrotation_next() to iterate new list fil_system->rotation_list where new tables. are added if key rotation is disabled. Removed unnecessary functions fil_get_first_space_safe() fil_get_next_space_safe() fil_node_open_file(): After page 0 is read read also crypt_info if it is not yet read. btr_scrub_lock_dict_func() buf_page_check_corrupt() buf_page_encrypt_before_write() buf_merge_or_delete_for_page() lock_print_info_all_transactions() row_fts_psort_info_init() row_truncate_table_for_mysql() row_drop_table_for_mysql() Use fil_space_acquire()/release() to access fil_space_t. buf_page_decrypt_after_read(): Use fil_space_get_crypt_data() because at this point we might not yet have read page 0. fil0crypt.cc/fil0fil.h: Lot of changes. Pass fil_space_t* directly to functions needing it and store fil_space_t* to rotation state. Use fil_space_acquire()/release() when iterating tablespaces and removed unnecessary is_closing from fil_crypt_t. Use fil_space_t::is_stopping() to detect when access to tablespace should be stopped. Removed unnecessary fil_space_get_crypt_data(). fil_space_create(): Inform key rotation that there could be something to do if key rotation is disabled and new table with encryption enabled is created. Remove unnecessary functions fil_get_first_space_safe() and fil_get_next_space_safe(). fil_space_acquire() and fil_space_release() are used instead. Moved fil_space_get_crypt_data() and fil_space_set_crypt_data() to fil0crypt.cc. fsp_header_init(): Acquire fil_space_t, write crypt_data and release space. check_table_options() Renamed FIL_SPACE_ENCRYPTION_ TO FIL_ENCRYPTION_* i_s.cc: Added ROTATING_OR_FLUSHING field to information_schema.innodb_tablespace_encryption to show current status of key rotation.	2017-03-14 16:23:10 +02:00
Marko Mäkelä	ff0530ef68	MDEV-12121: Revert test adjustments for -DWITH_INNODB_AHI=OFF Because the default build configuration of the server will remain at -DWITH_INNODB_AHI=ON, we want to test the instrumentation. We make and revert the test adjustments in separate commits on purpose, so that this commit can be easily reverted later if the default build configuration is changed to -DWITH_INNODB_AHI=OFF.	2017-03-03 17:08:06 +02:00
Marko Mäkelä	27b9989d31	MDEV-12121 Introduce build option WITH_INNODB_AHI to disable innodb_adaptive_hash_index The InnoDB adaptive hash index is sometimes degrading the performance of InnoDB, and it is sometimes disabled to get more consistent performance. We should have a compile-time option to disable the adaptive hash index. Let us introduce two options: OPTION(WITH_INNODB_AHI "Include innodb_adaptive_hash_index" ON) OPTION(WITH_INNODB_ROOT_GUESS "Cache index root block descriptors" ON) where WITH_INNODB_AHI always implies WITH_INNODB_ROOT_GUESS. As part of this change, the misleadingly named function trx_search_latch_release_if_reserved(trx) will be replaced with the macro trx_assert_no_search_latch(trx) that will be empty unless BTR_CUR_HASH_ADAPT is defined (cmake -DWITH_INNODB_AHI=ON). We will also remove the unused column INFORMATION_SCHEMA.INNODB_TRX.TRX_ADAPTIVE_HASH_TIMEOUT. In MariaDB Server 10.1, it used to reflect the value of trx_t::search_latch_timeout which could be adjusted during row_search_for_mysql(). In 10.2, there is no such field. Other than the removal of the unused column TRX_ADAPTIVE_HASH_TIMEOUT, this is an almost non-functional change to the server when using the default build options. Some tests are adjusted so that they will work with both -DWITH_INNODB_AHI=ON and -DWITH_INNODB_AHI=OFF. The test innodb.innodb_monitor has been renamed to innodb.monitor in order to track MySQL 5.7, and the duplicate tests sys_vars.innodb_monitor_* are removed.	2017-03-03 16:55:50 +02:00
Jan Lindström	6495806e59	MDEV-11254: innodb-use-trim has no effect in 10.2 Problem was that implementation merged from 10.1 was incompatible with InnoDB 5.7. buf0buf.cc: Add functions to return should we punch hole and how big. buf0flu.cc: Add written page to IORequest fil0fil.cc: Remove unneeded status call and add test is sparse files and punch hole supported by file system when tablespace is created. Add call to get file system block size. Used file node is added to IORequest. Added functions to check is punch hole supported and setting punch hole. ha_innodb.cc: Remove unneeded status variables (trim512-32768) and trim_op_saved. Deprecate innodb_use_trim and set it ON by default. Add function to set innodb-use-trim dynamically. dberr.h: Add error code DB_IO_NO_PUNCH_HOLE if punch hole operation fails. fil0fil.h: Add punch_hole variable to fil_space_t and block size to fil_node_t. os0api.h: Header to helper functions on buf0buf.cc and fil0fil.cc for os0file.h os0file.h: Remove unneeded m_block_size from IORequest and add bpage to IORequest to know actual size of the block and m_fil_node to know tablespace file system block size and does it support punch hole. os0file.cc: Add function punch_hole() to IORequest to do punch_hole operation, get the file system block size and determine does file system support sparse files (for punch hole). page0size.h: remove implicit copy disable and use this implicit copy to implement copy_from() function. buf0dblwr.cc, buf0flu.cc, buf0rea.cc, fil0fil.cc, fil0fil.h, os0file.h, os0file.cc, log0log.cc, log0recv.cc: Remove unneeded write_size parameter from fil_io calls. srv0mon.h, srv0srv.h, srv0mon.cc: Remove unneeded trim512-trim32678 status variables. Removed these from monitor tests.	2017-01-24 14:40:58 +02:00
Sergei Golubchik	4a5d25c338	Merge branch '10.1' into 10.2	2016-12-29 13:23:18 +01:00
Jan Lindström	1d55cfce10	Do not use os_file_read() directly for reading first page of the tablespace. Instead use fil_read() with syncronous setting. Fix test failures and mask tablespace number as it could change in concurrent mtr runs.	2016-09-22 21:47:27 +03:00
Jan Lindström	da168b3405	MDEV-10200: IS tables are not found on 10.2 InnoDB 5.7 (branch bb-10.2-jan) information_schema.innodb_changed_pages IS table available only on xtradb, add possible error for now.	2016-09-14 08:27:22 +03:00
Jan Lindström	ee768d8e0e	MDEV-9640: Add used key_id to INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION	2016-03-18 11:48:49 +02:00
Sergei Golubchik	a5679af1b1	Merge branch '10.0' into 10.1	2016-02-23 21:35:05 +01:00
Sergei Golubchik	87c306802b	update test results	2015-11-20 09:16:36 +01:00
Jan Lindström	c0b7bf2625	MDEV-8683: Bunch of tests fail in buildbot on new InnoDB variables	2015-08-26 18:07:34 +03:00
Sergei Golubchik	5091a4ba75	Merge tag 'mariadb-10.0.19' into 10.1	2015-06-01 15:51:25 +02:00
Jan Lindström	bad81f23f6	MDEV-8046: Server crashes in pfs_mutex_enter_func on select from I_S.INNODB_TABLESPACES_ENCRYPTION if InnoDB is disabled Problem was that information schema tables innodb_tablespaces_encryption and innodb_tablespaces_scrubbing where missing required check is InnoDB enabled or not.	2015-05-06 15:16:28 +03:00

43 Commits