mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-09-11 05:52:26 +03:00

Author	SHA1	Message	Date
Marko Mäkelä	5bada1246d	Merge 10.5 into 10.6	2023-04-11 16:15:19 +03:00
Oleksandr Byelkin	ac5a534a4c	Merge remote-tracking branch '10.4' into 10.5	2023-03-31 21:32:41 +02:00
Yuchen Pei	7c91082e39	MDEV-27912 Fixing inconsistency w.r.t. expect files in tests. mtr uses group suffix, but some existing inc and test files use server_id for expect files. This patch aims to fix that. For spider: With this change we will not have to maintain a separate version of restart_mysqld.inc for spider, that duplicates code, just because spider tests use different names for expect files, and shutdown_mysqld requires magical names for them. With this change spider tests will also be able to use other features provided by restart_mysqld.inc without code duplication, like the parameter $restart_parameters (see e.g. the testcase mdev_29904.test in commit ef1161e5d4f). Tests run after this change: default, spider, rocksdb, galera, using the following command mtr --parallel=auto --force --max-test-fail=0 --skip-core-file mtr --suite spider,spider/,spider//* \ --skip-test="spider/oracle.\|./t\..*" --parallel=auto --big-test \ --force --max-test-fail=0 --skip-core-file mtr --suite galera --parallel=auto mtr --suite rocksdb --parallel=auto	2023-03-22 11:55:57 +11:00
Marko Mäkelä	4c355d4e81	Merge 10.11 into 11.0	2023-03-17 15:03:17 +02:00
Marko Mäkelä	c50f849d64	Merge 10.10 into 10.11	2023-03-17 07:00:03 +02:00
Marko Mäkelä	3dd33789c1	Merge 10.9 into 10.10	2023-03-17 06:59:46 +02:00
Marko Mäkelä	fffa4b28a1	Merge 10.8 into 10.9	2023-03-17 06:58:33 +02:00
Marko Mäkelä	acf46b7b36	Merge 10.6 into 10.8	2023-03-16 18:11:37 +02:00
Marko Mäkelä	a55b951e60	MDEV-26827 Make page flushing even faster For more convenient monitoring of something that could greatly affect the volume of page writes, we add the status variable Innodb_buffer_pool_pages_split that was previously only available via information_schema.innodb_metrics as "innodb_page_splits". This was suggested by Axel Schwenke. buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written. We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex and remove unnecessary export_vars indirection. buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes. Protected by buf_pool.flush_list_mutex. buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_, buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle. Protected by buf_pool.flush_list_mutex. We will exclusively broadcast buf_pool.done_flush_list by the buf_flush_page_cleaner thread, and only wait for it when communicating with buf_flush_page_cleaner. There is no need to keep a count of pending writes by the buf_pool.flush_list processing. A single flag suffices for that. Waits for page write completion can be performed by simply waiting on block->page.lock, or by invoking buf_dblwr.wait_for_page_writes(). buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and set buf_pool.try_LRU_scan when freeing a page. This would be executed also as part of buf_page_write_complete(). buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list, and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed. Let buf_dblwr count all writes to persistent pages and broadcast a condition variable when no outstanding writes remain. buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after "furious flushing" (lsn_limit). Simplify the conditions and reduce the hold time of buf_pool.flush_list_mutex. Refuse to shut down or sleep if buf_pool.ran_out(), that is, LRU eviction is needed. 
buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU. buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to ensure that buf_flush_page_cleaner() will process the LRU flush request. buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space(): Update buf_pool.stat.n_pages_written when submitting writes (while holding buf_pool.mutex), not when completing them. buf_page_t::flush(), buf_flush_discard_page(): Require that the page U-latch be acquired upfront, and remove buf_page_t::ready_for_flush(). buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear". buf_flush_page(): Count pending page writes via buf_dblwr. buf_flush_try_neighbors(): Take the block of page_id as a parameter. If the tablespace is dropped before our page has been written out, release the page U-latch. buf_pool_invalidate(): Let the caller ensure that there are no outstanding writes. buf_flush_wait_batch_end(false), buf_flush_wait_batch_end_acquiring_mutex(false): Replaced with buf_dblwr.wait_for_page_writes(). buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true). buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list. buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes(). buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove. Outstanding writes are reflected by buf_dblwr.pending_writes(). buf_dblwr_t::init(): New function, to initialize the mutex and the condition variables, but not the backing store. buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised(). buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending: Keeps track of writes of persistent data pages. buf_flush_LRU(): Allow calls while LRU flushing may be in progress in another thread. Tested by Matthias Leich (correctness) and Axel Schwenke (performance)	2023-03-16 17:19:58 +02:00
Sergei Petrunia	1529881595	Stabilize rocksdb.rocksdb test.	2023-02-03 14:31:21 +03:00
Monty	1f4a9f086a	Removed "<select expression> INTO <destination>" deprecation. This was done after discussions with Igor, Sanja and Bar. The main reason for removing the deprecation was to ensure that MariaDB is always backward compatible whenever possible. Other things: - Added statistics counters, mainly for the feedback plugin. - INTO OUTFILE - INTO variable - If INTO is using the old syntax (end of query)	2023-02-03 11:57:50 +03:00
Sergei Petrunia	6c4076fac4	MDEV-30032: EXPLAIN FORMAT=JSON output: part #2 : print 'loops'.	2023-02-03 11:22:17 +03:00
Sergei Petrunia	ffe0beca25	MDEV-30032: EXPLAIN FORMAT=JSON output: print costs Basic printout for join and table execution costs.	2023-02-03 11:01:24 +03:00
Monty	727491b72a	Added test cases for preceding test This includes all test changes from "Changing all cost calculation to be given in milliseconds" and forwards. Some of the things that caused changes in the result files: - As part of fixing tests, I added 'echo' to some comments to be able to more easily find out where things were wrong. - MATERIALIZED has now a higher cost compared to X than before. Because of this some MATERIALIZED types have changed to DEPENDENT SUBQUERY. - Some test cases that required MATERIALIZED to repeat a bug were changed by adding more rows to force MATERIALIZED to happen. - 'Filtered' in SHOW EXPLAIN has in many cases changed from 100.00 to something smaller. This is because now filtered also takes into account the smallest possible ref access and filters, even if they were not used. Another reason for 'Filtered' being smaller is that we now also take into account implicit filtering done for subqueries using FIRSTMATCH. (main.subselect_no_exists_to_in) This is calculated in best_access_path() and stored in records_out. - Table orders have changed because of more accurate costs. - 'index' and 'ALL' for small tables have changed to use 'range' or 'ref' because of optimizer_scan_setup_cost. - index can be changed to 'range' as 'range' optimizer assumes we don't have to read the blocks from disk that range optimizer has already read. This can be confusing in the case where there is no obvious where clause but instead there is a hidden 'key_column > NULL' added by the optimizer. (main.subselect_no_exists_to_in) - Scan on primary clustered key does not report 'Using Index' anymore (It's a table scan, not an index scan). - For derived tables, the number of rows is now 100 instead of 2, which can be seen in EXPLAIN. - More tests have "Using index for group by" as the cost of this optimization is now more correct (lower). 
- A primary key could be preferred for a normal key, even if it would access more rows, as it's faster to do 1 lookup and 3 'index_next' on a clustered primary key than one lookup through a secondary. (main.stat_tables_innodb) Notes: - There were 4.7% more calls to best_extension_by_limited_search() in the main.greedy_optimizer test. However examining the test results it looked that the plans were slightly better (eq_ref were more chained together) so I assume this is ok. - I have verified a few test cases where there were notable/unexpected changes in the plan and in all cases the new optimizer plans were faster. (main.greedy_optimizer and some others)	2023-02-03 00:00:35 +03:00
Marko Mäkelä	f27e9c8947	MDEV-29694 Remove the InnoDB change buffer The purpose of the change buffer was to reduce random disk access, which could be useful on rotational storage, but maybe less so on solid-state storage. When we wished to (1) insert a record into a non-unique secondary index, (2) delete-mark a secondary index record, (3) delete a secondary index record as part of purge (but not ROLLBACK), and the B-tree leaf page where the record belongs to is not in the buffer pool, we inserted a record into the change buffer B-tree, indexed by the page identifier. When the page was eventually read into the buffer pool, we looked up the change buffer B-tree for any modifications to the page, applied these upon the completion of the read operation. This was called the insert buffer merge. We remove the change buffer, because it has been the source of various hard-to-reproduce corruption bugs, including those fixed in commit `5b9ee8d819` and commit `165564d3c3` but not limited to them. A downgrade will fail with a clear message starting with commit `db14eb16f9` (MDEV-30106). buf_page_t::state: Merge IBUF_EXIST to UNFIXED and WRITE_FIX_IBUF to WRITE_FIX. buf_pool_t::watch[]: Remove. trx_t: Move isolation_level, check_foreigns, check_unique_secondary, bulk_insert into the same bit-field. The only purpose of trx_t::check_unique_secondary is to enable bulk insert into an empty table. It no longer enables insert buffering for UNIQUE INDEX. btr_cur_t::thr: Remove. This field was originally needed for change buffering. Later, its use was extended to cover SPATIAL INDEX. Much of the time, rtr_info::thr holds this field. When it does not, we will add parameters to SPATIAL INDEX specific functions. ibuf_upgrade_needed(): Check if the change buffer needs to be updated. ibuf_upgrade(): Merge and upgrade the change buffer after all redo log has been applied. 
Free any pages consumed by the change buffer, and zero out the change buffer root page to mark the upgrade completed, and to prevent a downgrade to an earlier version. dict_load_tablespaces(): Renamed from dict_check_tablespaces_and_store_max_id(). This needs to be invoked before ibuf_upgrade(). btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer allocate any change buffer pages. btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. row_search_index_entry(), btr_lift_page_up(): Add a parameter thr for the SPATIAL INDEX case. rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert(). rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert(). Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0 change buffer format that predates the MySQL 4.1 introduction of the option innodb_file_per_table was removed in MySQL 5.6.5 as part of mysql/mysql-server@69b6241a79 and MariaDB 10.0.11 as part of `1d0f70c2f8`. In the tests innodb.log_upgrade and innodb.log_corruption, we create valid (upgraded) change buffer pages. Tested by: Matthias Leich	2023-01-11 17:59:36 +02:00
Sergei Golubchik	8759967d1c	MDEV-29625 Some clients/scripts refer to old slow log variables	2022-10-04 12:28:04 +02:00
Marko Mäkelä	5e996fbad9	Merge 10.9 into 10.10	2022-09-21 10:59:56 +03:00
Marko Mäkelä	a8e4540476	Merge 10.8 into 10.9	2022-09-21 10:07:09 +03:00
Marko Mäkelä	4345d93100	Merge 10.7 into 10.8	2022-09-21 09:52:09 +03:00
Marko Mäkelä	7c7ac6d4a4	Merge 10.6 into 10.7	2022-09-21 09:33:07 +03:00
Marko Mäkelä	44fd2c4b24	Merge 10.5 into 10.6	2022-09-20 16:53:20 +03:00
Alexander Barkov	fe844c16b6	Merge remote-tracking branch 'origin/10.4' into 10.5	2022-09-14 16:24:51 +04:00
Marko Mäkelä	18795f5512	Merge 10.3 into 10.4	2022-09-13 16:36:38 +03:00
Alexander Barkov	4c14243373	A cleanup for MDEV-29446 Change SHOW CREATE TABLE to display default collation Recording test results according to MDEV-29446 changes: storage/rocksdb/mysql-test/rocksdb/r/use_direct_io_for_flush_and_compaction.result	2022-09-13 12:44:23 +04:00
Alexander Barkov	f1544424de	MDEV-29446 Change SHOW CREATE TABLE to display default collation	2022-09-12 22:10:39 +04:00
Marko Mäkelä	fdc039db29	MDEV-28540 Deprecate and ignore the parameter innodb_prefix_index_cluster_optimization The parameter innodb_prefix_index_cluster_optimization used to enable an optimization that was added in `cb37c55768` and was disabled by default. We will unconditionally enable the extension and mark the parameter as deprecated. Related to this, the counters Innodb_secondary_index_triggered_cluster_reads and Innodb_secondary_index_triggered_cluster_reads_avoided allowed to determine the usefulness of this optimization. Now that the configuration parameter is disabled, the counters do not serve any useful purpose and can be removed. row_search_with_covering_prefix(): Fix a bug that caused an incorrect result to be returned.	2022-06-03 12:20:20 +03:00
Marko Mäkelä	0dab74ff3f	MDEV-28539 Some InnoDB counters are duplicating generic SHOW STATUS The InnoDB srv_stats counters n_rows_updated, n_rows_deleted, n_rows_inserted, and n_rows_read are duplicating Handler_update, Handler_delete, Handler_write, and Handler_read_ counters. Updating those counters is not free, especially because some counters are furthermore split to distinguish a rare case of modifying tables in the system schema.	2022-06-03 12:20:20 +03:00
Sergei Golubchik	bf2bdd1a1a	Merge branch '10.8' into 10.9	2022-05-19 14:07:55 +02:00
Sergei Golubchik	b7ffccf49b	Merge branch '10.7' into 10.8	2022-05-18 13:26:48 +02:00
Sergei Golubchik	99a433ed1c	Merge branch '10.6' into 10.7	2022-05-18 10:34:38 +02:00
Sergei Golubchik	b2187662bc	Merge branch '10.5' into 10.6	2022-05-18 10:30:47 +02:00
Sergei Golubchik	7970ac7fe8	Merge branch '10.4' into 10.5	2022-05-18 09:50:26 +02:00
Sergei Golubchik	23ddc3518f	Merge branch '10.3' into 10.4	2022-05-18 01:25:30 +02:00
Marko Mäkelä	4e1bf2bb23	MDEV-28537 Unused or useless InnoDB counters num_index_pages_written, num_non_index_pages_written The counters were added in commit `5e55d1ced5` and any code to update them was inadvertently removed in commit `2e814d4702` when applying InnoDB changes from MySQL 5.7. Let us remove these counters that never reported anything useful. If such statistics are really needed in a special case, they can be obtained by instrumenting the code by some means, such as eBPF or a source code patch.	2022-05-16 13:41:53 +03:00
Rucha Deodhar	5945e420f1	MDEV-24920: Merge "old" SQL variable to "old_mode" sql variable Analysis: There are 2 server variables: "old_mode" and "old". "old" is no longer needed as "old_mode" has replaced it (however still used in some places in the code). "old_mode" and "old" have the same purpose: emulate behavior from previous MariaDB versions. So they can be merged to avoid confusion. Fix: Deprecate "old" variable and create another mode for @@old_mode to mimic behavior of previous "old" variable. Create specific modes for specific tasks that --old sql variable was doing earlier and use the new modes instead.	2022-04-20 00:30:22 +05:30
Daniel Black	bea47a6f59	MDEV-27791: rocksdb_log_dir test postfix We can only remove a subdirectory in mtr on an installed instance Example failure previously: CURRENT_TEST: rocksdb.rocksdb_log_dir mysqltest: At line 15: Path '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/tmp/1' is not a subdirectory of MYSQLTEST_VARDIR '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/1' or MYSQL_TMP_DIR '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/tmp/1'	2022-04-13 15:46:56 +10:00
Xinyi Hong	1ac87d6dd4	MDEV-27791: Create a new MyRocks parameter rocksdb_log_dir Parameter rocksdb_log_dir specifies the path of MyRocks error logs. By default, the error logs are stored in the same folder with MyRocks redo logs. Being able to put human readable logs in one place and machine logs in another place improves usability. All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.	2022-04-13 08:36:33 +10:00
Alexander Barkov	d25b10fede	MDEV-27712 Reduce the size of Lex_length_and_dec_st from 16 to 8 User visible change: Removing the length specified by user from error messages: ER_TOO_BIG_SCALE and ER_TOO_BIG_PRECISION as discussed with Sergei.	2022-03-22 14:42:54 +04:00
Marko Mäkelä	934b2d605e	MDEV-27917 Some redo log diagnostics is always reported as 0 The InnoDB monitor counter log_sys.n_log_ios was almost removed in commit `685d958e38` (MDEV-14425). This counter was rather meaningless already since commit `30ea63b7d2` introduced a redo log group commit mechanism, and on the persistent memory interface there are no file system calls that could be counted. The only case when log_sys.n_log_ios was updated is when the log file was being read during crash recovery. Some related output in log_print() as well as the information_schema.innodb_metrics counter log_num_log_io are best removed.	2022-02-22 18:56:21 +02:00
Oleksandr Byelkin	4fb2cb1a30	Merge branch '10.7' into 10.8	2022-02-04 14:50:25 +01:00
Oleksandr Byelkin	9ed8deb656	Merge branch '10.6' into 10.7	2022-02-04 14:11:46 +01:00
Oleksandr Byelkin	f5c5f8e41e	Merge branch '10.5' into 10.6	2022-02-03 17:01:31 +01:00
Oleksandr Byelkin	cf63eecef4	Merge branch '10.4' into 10.5	2022-02-01 20:33:04 +01:00
Oleksandr Byelkin	c04a203a10	Rocksdb result fix after merge	2022-01-31 08:37:33 +01:00
Oleksandr Byelkin	a576a1cea5	Merge branch '10.3' into 10.4	2022-01-30 09:46:52 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Monty	a85d942be9	Fixed result file for rocksdb.i_s_deadlock This failed because of MDEV-18918 which removed DEFAULT's	2022-01-27 19:15:02 +02:00
Sergei Golubchik	918f524490	RocksDB doesn't support DESC indexes yet disallow descending indexes in rocksdb	2022-01-26 18:43:06 +01:00
Alexander Barkov	216834b068	A cleanup for MDEV-18918/MDEV-20254 Adjusting rocksdb tests results.	2022-01-25 17:48:44 +04:00
Marko Mäkelä	685d958e38	MDEV-14425 Improve the redo log for concurrency The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. Previously, the redo log used to be created automatically if it was missing. Only with innodb_force_recovery=6, it is possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped. 
Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid. The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision. We extend the MDEV-12353 redo log record format as follows. (1) Empty mini-transactions or extra NUL bytes will not be allowed. (2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning. (3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will be written. (4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data. (5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV. (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON. In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file. When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number. 
See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of them will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time. 
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode. When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message: InnoDB: log sequence number 0 (memory-mapped); transaction id 3 On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap(). mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys. backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing. xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only. trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance. The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. 
recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex. log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits. recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. 
We are inserting the blocks in arbitrary order anyway, to be adjusted in recv_sys.apply(true). We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys. Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right. Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.	2022-01-21 16:03:47 +02:00
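
The file layout described in the MDEV-14425 message (commit 685d958e38) can be illustrated with a rough sketch. The offsets and the minimum padding sizes below are quoted from that message. The helper names `locate` and `min_padding_record_size`, the `first_lsn` parameter, and the simple modular mapping from a log sequence number to a file offset are assumptions made for illustration only; they are not taken from the server source, and the polarity of the real sequence bit is not stated in the message.

```python
# Layout constants quoted from the MDEV-14425 commit message.
OLD_CHECKPOINT_OFFSETS = (0x200, 0x600)    # 512-byte checkpoint blocks
OLD_START_OFFSET = 0x800                   # old start of log records
NEW_CHECKPOINT_OFFSETS = (0x1000, 0x2000)  # 64-byte checkpoint blocks
NEW_START_OFFSET = 0x3000                  # new start of log records


def locate(lsn, first_lsn, file_size, start_offset=NEW_START_OFFSET):
    """Hypothetical helper: map an LSN to (file offset, wrap count).

    Assumes that first_lsn is stored at start_offset and that the log
    is a plain ring over the bytes from start_offset to file_size.
    """
    capacity = file_size - start_offset
    if capacity <= 0:
        raise ValueError("file_size must exceed start_offset")
    if lsn < first_lsn:
        raise ValueError("lsn precedes first_lsn")
    wraps, position = divmod(lsn - first_lsn, capacity)
    return start_offset + position, wraps


def min_padding_record_size(encrypted):
    """Minimum FILE_CHECKPOINT padding record size per the message:
    7 bytes, or 7+8 bytes with innodb_encrypt_log=ON."""
    return 7 + 8 if encrypted else 7


if __name__ == "__main__":
    offset, wraps = locate(lsn=0x1105, first_lsn=0x100, file_size=0x4000)
    # The sequence bit toggles on every wrap, so only the parity of the
    # wrap count matters; which value the first pass uses is not shown.
    print(hex(offset), wraps, wraps % 2)
    print(min_padding_record_size(encrypted=True))
```

Because the message says the sequence bit is toggled each time the circular file wraps, the parity of the wrap count is the only part of it the sketch models.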


730 Commits