1
0
mirror of https://github.com/MariaDB/server.git synced 2025-09-11 05:52:26 +03:00
Commit Graph

730 Commits

Author SHA1 Message Date
Marko Mäkelä
5bada1246d Merge 10.5 into 10.6 2023-04-11 16:15:19 +03:00
Oleksandr Byelkin
ac5a534a4c Merge remote-tracking branch '10.4' into 10.5 2023-03-31 21:32:41 +02:00
Yuchen Pei
7c91082e39 MDEV-27912 Fixing inconsistency w.r.t. expect files in tests.
mtr uses group suffix, but some existing inc and test files use
server_id for expect files. This patch aims to fix that.

For spider:

With this change we will not have to maintain a separate version of
restart_mysqld.inc for spider, that duplicates code, just because
spider tests use different names for expect files, and shutdown_mysqld
requires magical names for them.

With this change spider tests will also be able to use other features
provided by restart_mysqld.inc without code duplication, like the
parameter $restart_parameters (see e.g. the testcase mdev_29904.test
in commit ef1161e5d4f).

Tests run after this change: default, spider, rocksdb, galera, using
the following command

mtr --parallel=auto --force --max-test-fail=0 --skip-core-file
mtr --suite spider,spider/*,spider/*/* \
    --skip-test="spider/oracle.*|.*/t\..*" --parallel=auto --big-test \
    --force --max-test-fail=0 --skip-core-file
mtr --suite galera --parallel=auto
mtr --suite rocksdb --parallel=auto
2023-03-22 11:55:57 +11:00
Marko Mäkelä
4c355d4e81 Merge 10.11 into 11.0 2023-03-17 15:03:17 +02:00
Marko Mäkelä
c50f849d64 Merge 10.10 into 10.11 2023-03-17 07:00:03 +02:00
Marko Mäkelä
3dd33789c1 Merge 10.9 into 10.10 2023-03-17 06:59:46 +02:00
Marko Mäkelä
fffa4b28a1 Merge 10.8 into 10.9 2023-03-17 06:58:33 +02:00
Marko Mäkelä
acf46b7b36 Merge 10.6 into 10.8 2023-03-16 18:11:37 +02:00
Marko Mäkelä
a55b951e60 MDEV-26827 Make page flushing even faster
For more convenient monitoring of something that could greatly affect
the volume of page writes, we add the status variable
Innodb_buffer_pool_pages_split that was previously only available
via information_schema.innodb_metrics as "innodb_page_splits".
This was suggested by Axel Schwenke.

buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written.
We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex
and remove unnecessary export_vars indirection.

buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes.
Protected by buf_pool.flush_list_mutex.

buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_,
buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle.
Protected by buf_pool.flush_list_mutex. We will exclusively broadcast
buf_pool.done_flush_list by the buf_flush_page_cleaner thread,
and only wait for it when communicating with buf_flush_page_cleaner.
There is no need to keep a count of pending writes by the
buf_pool.flush_list processing. A single flag suffices for that.

Waits for page write completion can be performed by
simply waiting on block->page.lock, or by invoking
buf_dblwr.wait_for_page_writes().

buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and
set buf_pool.try_LRU_scan when freeing a page. This would be
executed also as part of buf_page_write_complete().

buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list,
and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed.
Let buf_dblwr count all writes to persistent pages and broadcast a
condition variable when no outstanding writes remain.

buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after
"furious flushing" (lsn_limit). Simplify the conditions and reduce the
hold time of buf_pool.flush_list_mutex. Refuse to shut down
or sleep if buf_pool.ran_out(), that is, LRU eviction is needed.

buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU.

buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed
with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to
to ensure that buf_flush_page_cleaner() will process the LRU flush
request.

buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space():
Update buf_pool.stat.n_pages_written when submitting writes
(while holding buf_pool.mutex), not when completing them.

buf_page_t::flush(), buf_flush_discard_page(): Require that
the page U-latch be acquired upfront, and remove
buf_page_t::ready_for_flush().

buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear".

buf_flush_page(): Count pending page writes via buf_dblwr.

buf_flush_try_neighbors(): Take the block of page_id as a parameter.
If the tablespace is dropped before our page has been written out,
release the page U-latch.

buf_pool_invalidate(): Let the caller ensure that there are no
outstanding writes.

buf_flush_wait_batch_end(false),
buf_flush_wait_batch_end_acquiring_mutex(false):
Replaced with buf_dblwr.wait_for_page_writes().

buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true).

buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list.

buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes().

buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove.
Outstanding writes are reflected by buf_dblwr.pending_writes().

buf_dblwr_t::init(): New function, to initialize the mutex and
the condition variables, but not the backing store.

buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised().

buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending:
Keeps track of writes of persistent data pages.

buf_flush_LRU(): Allow calls while LRU flushing may be in progress
in another thread.

Tested by Matthias Leich (correctness) and Axel Schwenke (performance)
2023-03-16 17:19:58 +02:00
Sergei Petrunia
1529881595 Stabilize rocksdb.rocksdb test. 2023-02-03 14:31:21 +03:00
Monty
1f4a9f086a Removed "<select expression> INTO <destination>" deprication.
This was done after discussions with Igor, Sanja and Bar.

The main reason for removing the deprication was to ensure that MariaDB
is always backward compatible whenever possible.

Other things:
- Added statistics counters, mainly for the feedback plugin.
  - INTO OUTFILE
  - INTO variable
  - If INTO is using the old syntax (end of query)
2023-02-03 11:57:50 +03:00
Sergei Petrunia
6c4076fac4 MDEV-30032: EXPLAIN FORMAT=JSON output: part #2: print 'loops'. 2023-02-03 11:22:17 +03:00
Sergei Petrunia
ffe0beca25 MDEV-30032: EXPLAIN FORMAT=JSON output: print costs
Basic printout for join and table execution costs.
2023-02-03 11:01:24 +03:00
Monty
727491b72a Added test cases for preceding test
This includes all test changes from
"Changing all cost calculation to be given in milliseconds"
and forwards.

Some of the things that caused changes in the result files:

- As part of fixing tests, I added 'echo' to some comments to be able to
  easier find out where things where wrong.
- MATERIALIZED has now a higher cost compared to X than before. Because
  of this some MATERIALIZED types have changed to DEPENDEND SUBQUERY.
  - Some test cases that required MATERIALIZED to repeat a bug was
    changed by adding more rows to force MATERIALIZED to happen.
- 'Filtered' in SHOW EXPLAIN has in many case changed from 100.00 to
  something smaller. This is because now filtered also takes into
  account the smallest possible ref access and filters, even if they
  where not used. Another reason for 'Filtered' being smaller is that
  we now also take into account implicit filtering done for subqueries
  using FIRSTMATCH.
  (main.subselect_no_exists_to_in)
  This is caluculated in best_access_path() and stored in records_out.
- Table orders has changed because more accurate costs.
- 'index' and 'ALL' for small tables has changed to use 'range' or
   'ref' because of optimizer_scan_setup_cost.
- index can be changed to 'range' as 'range' optimizer assumes we don't
  have to read the blocks from disk that range optimizer has already read.
  This can be confusing in the case where there is no obvious where clause
  but instead there is a hidden 'key_column > NULL' added by the optimizer.
  (main.subselect_no_exists_to_in)
- Scan on primary clustered key does not report 'Using Index' anymore
  (It's a table scan, not an index scan).
- For derived tables, the number of rows is now 100 instead of 2,
  which can be seen in EXPLAIN.
- More tests have "Using index for group by" as the cost of this
  optimization is now more correct (lower).
- A primary key could be preferred for a normal key, even if it would
  access more rows, as it's faster to do 1 lokoup and 3 'index_next' on a
  clustered primary key than one lookup trough a secondary.
  (main.stat_tables_innodb)

Notes:

- There was a 4.7% more calls to best_extension_by_limited_search() in
  the main.greedy_optimizer test.  However examining the test results
  it looked that the plans where slightly better (eq_ref where more
  chained together) so I assume this is ok.
- I have verified a few test cases where there was notable/unexpected
  changes in the plan and in all cases the new optimizer plans where
  faster.  (main.greedy_optimizer and some others)
2023-02-03 00:00:35 +03:00
Marko Mäkelä
f27e9c8947 MDEV-29694 Remove the InnoDB change buffer
The purpose of the change buffer was to reduce random disk access,
which could be useful on rotational storage, but maybe less so on
solid-state storage.
When we wished to
(1) insert a record into a non-unique secondary index,
(2) delete-mark a secondary index record,
(3) delete a secondary index record as part of purge (but not ROLLBACK),
and the B-tree leaf page where the record belongs to is not in the buffer
pool, we inserted a record into the change buffer B-tree, indexed by
the page identifier. When the page was eventually read into the buffer
pool, we looked up the change buffer B-tree for any modifications to the
page, applied these upon the completion of the read operation. This
was called the insert buffer merge.

We remove the change buffer, because it has been the source of
various hard-to-reproduce corruption bugs, including those fixed in
commit 5b9ee8d819 and
commit 165564d3c3 but not limited to them.

A downgrade will fail with a clear message starting with
commit db14eb16f9 (MDEV-30106).

buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
WRITE_FIX_IBUF to WRITE_FIX.

buf_pool_t::watch[]: Remove.

trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
bulk_insert into the same bit-field. The only purpose of
trx_t::check_unique_secondary is to enable bulk insert into an
empty table. It no longer enables insert buffering for UNIQUE INDEX.

btr_cur_t::thr: Remove. This field was originally needed for change
buffering. Later, its use was extended to cover SPATIAL INDEX.
Much of the time, rtr_info::thr holds this field. When it does not,
we will add parameters to SPATIAL INDEX specific functions.

ibuf_upgrade_needed(): Check if the change buffer needs to be updated.

ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
has been applied. Free any pages consumed by the change buffer, and
zero out the change buffer root page to mark the upgrade completed,
and to prevent a downgrade to an earlier version.

dict_load_tablespaces(): Renamed from
dict_check_tablespaces_and_store_max_id(). This needs to be invoked
before ibuf_upgrade().

btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.

btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
allocate any change buffer pages.

btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.

row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
for the SPATIAL INDEX case.

rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().

rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().

Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
change buffer format that predates the MySQL 4.1 introduction of
the option innodb_file_per_table was removed in MySQL 5.6.5
as part of mysql/mysql-server@69b6241a79
and MariaDB 10.0.11 as part of 1d0f70c2f8.

In the tests innodb.log_upgrade and innodb.log_corruption, we create
valid (upgraded) change buffer pages.

Tested by: Matthias Leich
2023-01-11 17:59:36 +02:00
Sergei Golubchik
8759967d1c MDEV-29625 Some clients/scripts refer to old slow log variables 2022-10-04 12:28:04 +02:00
Marko Mäkelä
5e996fbad9 Merge 10.9 into 10.10 2022-09-21 10:59:56 +03:00
Marko Mäkelä
a8e4540476 Merge 10.8 into 10.9 2022-09-21 10:07:09 +03:00
Marko Mäkelä
4345d93100 Merge 10.7 into 10.8 2022-09-21 09:52:09 +03:00
Marko Mäkelä
7c7ac6d4a4 Merge 10.6 into 10.7 2022-09-21 09:33:07 +03:00
Marko Mäkelä
44fd2c4b24 Merge 10.5 into 10.6 2022-09-20 16:53:20 +03:00
Alexander Barkov
fe844c16b6 Merge remote-tracking branch 'origin/10.4' into 10.5 2022-09-14 16:24:51 +04:00
Marko Mäkelä
18795f5512 Merge 10.3 into 10.4 2022-09-13 16:36:38 +03:00
Alexander Barkov
4c14243373 A cleanup for MDEV-29446 Change SHOW CREATE TABLE to display default collation
Recording test results according to MDEV-29446 changes:

storage/rocksdb/mysql-test/rocksdb/r/use_direct_io_for_flush_and_compaction.result
2022-09-13 12:44:23 +04:00
Alexander Barkov
f1544424de MDEV-29446 Change SHOW CREATE TABLE to display default collation 2022-09-12 22:10:39 +04:00
Marko Mäkelä
fdc039db29 MDEV-28540 Deprecate and ignore the parameter innodb_prefix_index_cluster_optimization
The parameter innodb_prefix_index_cluster_optimization used to enable an
optimization that was added in cb37c55768
and was disabled by default.

We will unconditionally enable the extension and mark the parameter
as deprecated.

Related to this, the counters
Innodb_secondary_index_triggered_cluster_reads and
Innodb_secondary_index_triggered_cluster_reads_avoided
allowed to determine the usefulness of this optimization.

Now that the configuration parameter is disabled, the counters
do not serve any useful purpose and can be removed.

row_search_with_covering_prefix(): Fix a bug that caused an
incorrect result to be returned.
2022-06-03 12:20:20 +03:00
Marko Mäkelä
0dab74ff3f MDEV-28539 Some InnoDB counters are duplicating generic SHOW STATUS
The InnoDB srv_stats counters
n_rows_updated, n_rows_deleted, n_rows_inserted, and n_rows_read
are duplicating
Handler_update, Handler_delete, Handler_write, and Handler_read_ counters.

Updating those counters is not free, especially because some counters
are furthermore split to distinguish a rare case of modifying tables
in the system schema.
2022-06-03 12:20:20 +03:00
Sergei Golubchik
bf2bdd1a1a Merge branch '10.8' into 10.9 2022-05-19 14:07:55 +02:00
Sergei Golubchik
b7ffccf49b Merge branch '10.7' into 10.8 2022-05-18 13:26:48 +02:00
Sergei Golubchik
99a433ed1c Merge branch '10.6' into 10.7 2022-05-18 10:34:38 +02:00
Sergei Golubchik
b2187662bc Merge branch '10.5' into 10.6 2022-05-18 10:30:47 +02:00
Sergei Golubchik
7970ac7fe8 Merge branch '10.4' into 10.5 2022-05-18 09:50:26 +02:00
Sergei Golubchik
23ddc3518f Merge branch '10.3' into 10.4 2022-05-18 01:25:30 +02:00
Marko Mäkelä
4e1bf2bb23 MDEV-28537 Unused or useless InnoDB counters num_index_pages_written, num_non_index_pages_written
The counters were added in commit 5e55d1ced5
and any code to update them was
inadvertently removed in commit 2e814d4702
when applying InnoDB changes from MySQL 5.7.

Let us remove these counters that never reported anything useful. If such
statistics are really needed in a special case, they can be obtained by
instrumenting the code by some means, such as eBPF or a source code patch.
2022-05-16 13:41:53 +03:00
Rucha Deodhar
5945e420f1 MDEV-24920: Merge "old" SQL variable to "old_mode" sql variable
Analysis: There are 2 server variables- "old_mode" and "old". "old" is no
longer needed as "old_mode" has replaced it (however still used in some places
 in the code). "old_mode" and "old" has same purpose- emulate behavior from
previous MariaDB versions. So they can be merged to avoid confusion.
Fix: Deprecate "old" variable and create another mode for @@old_mode to mimic
behavior of previous "old" variable. Create specific modes for specifix task
that --old sql variable was doing earlier and use the new modes instead.
2022-04-20 00:30:22 +05:30
Daniel Black
bea47a6f59 MDEV-27791: rocksdb_log_dir test postfix
We can only remove a subdirectory in mtr on an installed instance

Example failure previously:

CURRENT_TEST: rocksdb.rocksdb_log_dir
mysqltest: At line 15: Path '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/tmp/1' is not a subdirectory of MYSQLTEST_VARDIR '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/1'or MYSQL_TMP_DIR '/usr/local/mariadb-10.9.0-linux-systemd-x86_64/mysql-test/var/tmp/1'
2022-04-13 15:46:56 +10:00
Xinyi Hong
1ac87d6dd4 MDEV-27791: Create a new MyRocks parameter rocksdb_log_dir
Parameter rocksdb_log_dir specifies the path of MyRocks error logs.

By default, the error logs are stored in the same folder with MyRocks
redo logs. Being able to put human readable logs in one place and
machine logs in another place improves usability.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
2022-04-13 08:36:33 +10:00
Alexander Barkov
d25b10fede MDEV-27712 Reduce the size of Lex_length_and_dec_st from 16 to 8
User visible change:
Removing the length specified by user from error messages:
ER_TOO_BIG_SCALE and ER_TOO_BIG_PRECISION
as discussed with Sergei.
2022-03-22 14:42:54 +04:00
Marko Mäkelä
934b2d605e MDEV-27917 Some redo log diagnostics is always reported as 0
The InnoDB monitor counter log_sys.n_log_ios was almost removed
in commit 685d958e38 (MDEV-14425).
This counter was rather meaningless already since
commit 30ea63b7d2
introduced a redo log group commit mechanism, and on the persistent
memory interface there are no file system calls that could be counted.

The only case when log_sys.n_log_ios was updated is when the log file
was being read during crash recovery.

Some related output in log_print() as well as the
information_schema.innodb_metrics counter log_num_log_io are best removed.
2022-02-22 18:56:21 +02:00
Oleksandr Byelkin
4fb2cb1a30 Merge branch '10.7' into 10.8 2022-02-04 14:50:25 +01:00
Oleksandr Byelkin
9ed8deb656 Merge branch '10.6' into 10.7 2022-02-04 14:11:46 +01:00
Oleksandr Byelkin
f5c5f8e41e Merge branch '10.5' into 10.6 2022-02-03 17:01:31 +01:00
Oleksandr Byelkin
cf63eecef4 Merge branch '10.4' into 10.5 2022-02-01 20:33:04 +01:00
Oleksandr Byelkin
c04a203a10 Rocksdb result fix after merge 2022-01-31 08:37:33 +01:00
Oleksandr Byelkin
a576a1cea5 Merge branch '10.3' into 10.4 2022-01-30 09:46:52 +01:00
Oleksandr Byelkin
41a163ac5c Merge branch '10.2' into 10.3 2022-01-29 15:41:05 +01:00
Monty
a85d942be9 Fixed result file for rocksdb.i_s_deadlock
This failed because of MDEV-18918 which removed DEFAULT's
2022-01-27 19:15:02 +02:00
Sergei Golubchik
918f524490 RocksDB doesn't support DESC indexes yet
disallow descending indexes in rocksdb
2022-01-26 18:43:06 +01:00
Alexander Barkov
216834b068 A cleanup for MDEV-18918/MDEV-20254
Adjusting rocksdb tests results.
2022-01-25 17:48:44 +04:00
Marko Mäkelä
685d958e38 MDEV-14425 Improve the redo log for concurrency
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
2022-01-21 16:03:47 +02:00