1
0
mirror of https://github.com/MariaDB/server.git synced 2025-09-15 05:41:27 +03:00
Commit Graph

882 Commits

Author SHA1 Message Date
Marko Mäkelä
572e34304e Merge 10.5 into 10.6 2022-03-14 10:59:46 +02:00
Marko Mäkelä
59359fb44a MDEV-24841 Build error with MSAN use-of-uninitialized-value in comp_err
The MemorySanitizer implementation in clang includes some built-in
instrumentation (interceptors) for GNU libc. In GNU libc 2.33, the
interface to the stat() family of functions was changed. Until the
MemorySanitizer interceptors are adjusted, any MSAN code builds
will act as if that the stat() family of functions failed to initialize
the struct stat.

A fix was applied in
https://reviews.llvm.org/rG4e1a6c07052b466a2a1cd0c3ff150e4e89a6d87a
but it fails to cover the 64-bit variants of the calls.

For now, let us work around the MemorySanitizer bug by defining
and using the macro MSAN_STAT_WORKAROUND().
2022-03-14 09:28:55 +02:00
Daniel Black
bd1ba7801f Merge branch 10.5 into 10.6 2022-03-12 16:16:03 +11:00
Daniel Black
d78173828e MDEV-27900: aio handle partial reads/writes
As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can
really confuse MariaDB.

Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.

While a fix was done in the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we do only get a partial read/write.

Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.

[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a

Also spell synchronously correctly in other files.
2022-03-12 09:47:53 +11:00
Marko Mäkelä
32d741b5b0 Merge 10.7 into 10.8 2022-02-25 16:24:13 +02:00
Marko Mäkelä
3d88f9f34c Merge 10.6 into 10.7 2022-02-25 16:09:16 +02:00
Marko Mäkelä
6daf8f8a0d Merge 10.5 into 10.6 2022-02-25 13:48:47 +02:00
Marko Mäkelä
b791b942e1 Merge 10.4 into 10.5 2022-02-25 13:27:41 +02:00
Marko Mäkelä
f5ff7d09c7 Merge 10.3 into 10.4 2022-02-25 13:00:48 +02:00
Marko Mäkelä
00b70bbb51 Merge 10.2 into 10.3 2022-02-25 10:43:38 +02:00
Marko Mäkelä
a3b96d4584 Merge 10.7 into 10.8 2022-02-22 13:20:18 +02:00
Marko Mäkelä
507084517f Merge 10.6 into 10.7 2022-02-22 12:47:48 +02:00
Marko Mäkelä
92f79a22e6 Merge 10.5 into 10.6 2022-02-22 12:12:49 +02:00
Vlad Lesin
a112a80b47 Merge 10.4 into 10.5 2022-02-22 10:35:16 +03:00
Vladislav Vaintroub
24ec144c63 MDEV-27901 Windows : expensive system calls used to calculate file system block size
The result is not used anywhere but in the output of Innodb information
schema, but this can take as much as 7%CPU (only) on a benchmark.

Fix to move fs blocksize calculate to where it is used.
2022-02-20 22:00:42 +01:00
Vladislav Vaintroub
fa557986ac MDEV-24175 Windows - fix detection of whether file is on SSD
Fix detection. SSD is when storage does *not* incur a seek penalty.
2022-02-17 22:55:08 +01:00
Marko Mäkelä
f1beeb58e6 MDEV-27848: Remove unused wait/io/file/innodb/innodb_log_file
The performance_schema counter wait/io/file/innodb/innodb_log_file
is always reported as 0.

The way how redo log writes are being waited for was refactored in
commit 30ea63b7d2 by the introduction
of flush_lock and write_lock. Even before that change, all the
wait/io/file/innodb/ counters were always 0 in my tests.

Moreover, if the PMEM interface that was introduced in
commit 3daef523af
is being used, writes to the InnoDB log file will completely avoid
any system calls and performance_schema instrumentation.
In commit 685d958e38 also the reads
of the redo log (during recovery) would bypass any system calls.
2022-02-15 15:03:15 +02:00
Marko Mäkelä
3b06415cb8 Clean up log resizing
log_t::rename_resized(): Replaces create_log_file_rename()

delete_log_files(): Moved to a separate function. Also invoked
on a normal startup when the log is not being resized, to
remove any garbage log files.

create_log_file(): Remove the std::string& parameter.

os_file_delete_if_exists_func() [_WIN32]: Do not retry DeleteFile()
on ERROR_ACCESS_DENIED.
2022-02-14 19:49:54 +02:00
Vincent Milum Jr
e375f51924 MDEV-27790: Fix mis-matched braces for non-Linux targets
Ran into this while compiling on FreeBSD 13.0-RELEASE

After this one change, it compiles and runs just fine on my FreeBSD Aarch64 server.
2022-02-10 17:18:37 +11:00
Vladislav Vaintroub
9d93b51eff MDEV-14425 fix inadverent changes on Windows
innodb_flush_log_on_trx_commit=2 means we want to write
redo using filesystem buffering, and not  flushing each time.

Also simplify code, remove gotos, and comment on a bug
in mariabackup that produces redo of odd sizes, that must
be worked around by forcing file to be buffered.
2022-01-26 13:28:10 +01:00
Marko Mäkelä
685d958e38 MDEV-14425 Improve the redo log for concurrency
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
2022-01-21 16:03:47 +02:00
Marko Mäkelä
7dfaded962 Merge 10.6 into 10.7 2022-01-04 09:55:58 +02:00
Marko Mäkelä
3f5726768f Merge 10.5 into 10.6 2022-01-04 09:26:38 +02:00
Marko Mäkelä
4c3ad24413 MDEV-27416 InnoDB hang in buf_flush_wait_flushed(), on log checkpoint
InnoDB could sometimes hang when triggering a log checkpoint. This is
due to commit 7b1252c03d (MDEV-24278),
which introduced an untimed wait to buf_flush_page_cleaner().

The hang was noticed by occasional failures of IMPORT TABLESPACE tests,
such as innodb.innodb-wl5522, which would (unnecessarily) invoke
log_make_checkpoint() from row_import_cleanup().

The reason of the hang was that buf_flush_page_cleaner() would enter
untimed sleep despite buf_flush_sync_lsn being set. The exact failure
scenario is unclear, because buf_flush_sync_lsn should actually be
protected by buf_pool.flush_list_mutex. We prevent the hang by
invoking buf_pool.page_cleaner_set_idle(false) whenever we are
setting buf_flush_sync_lsn and signaling buf_pool.do_flush_list.

The bulk of these changes was originally developed as a preparation
for MDEV-26827, to invoke buf_flush_list() from fewer threads,
and tested on 10.6 by Matthias Leich.

This fix was tested by running 100 repetitions of 100 concurrent instances
of the test innodb.innodb-wl5522 on a RelWithDebInfo build, using ext4fs
and innodb_flush_method=O_DIRECT on a SATA SSD with 4096-byte block size.
During the test, the call to log_make_checkpoint() in row_import_cleanup()
was present.

buf_flush_list(): Make static.

buf_flush_wait(): Wait for buf_pool.get_oldest_modification()
to reach a target, by work done in the buf_flush_page_cleaner.
If buf_flush_sync_lsn is going to be set, we will invoke
buf_pool.page_cleaner_set_idle(false).

buf_flush_ahead(): If buf_flush_sync_lsn or buf_flush_async_lsn
is going to be set and the page cleaner woken up, we will invoke
buf_pool.page_cleaner_set_idle(false).

buf_flush_wait_flushed(): Invoke buf_flush_wait().

buf_flush_sync(): Invoke recv_sys.apply() at the start in case
crash recovery is active. Invoke buf_flush_wait().

buf_flush_sync_batch(): A lower-level variant of buf_flush_sync()
that is only called by recv_sys_t::apply().

buf_flush_sync_for_checkpoint(): Do not trigger log apply
or checkpoint during recovery.

buf_dblwr_t::create(): Only initiate a buffer pool flush, not
a checkpoint.

row_import_cleanup(): Do not unnecessarily invoke log_make_checkpoint().
Invoking buf_flush_list_space() before starting to generate redo log
for the imported tablespace should suffice.

srv_prepare_to_delete_redo_log_file():
Set recv_sys.recovery_on in order to prevent
buf_flush_sync_for_checkpoint() from initiating a checkpoint
while the log is inaccessible. Remove a wait loop that is already
part of buf_flush_sync().
Do not invoke fil_names_clear() if the log is being upgraded,
because the FILE_MODIFY record is specific to the latest format.

create_log_file(): Clear recv_sys.recovery_on only after calling
log_make_checkpoint(), to prevent buf_flush_page_cleaner from
invoking a checkpoint.

innodb_shutdown(): Simplify the logic in mariadb-backup --prepare.

os_aio_wait_until_no_pending_writes(): Update the function comment.
Apart from row_quiesce_table_start() during FLUSH TABLES...FOR EXPORT,
this is being called by buf_flush_list_space(), which is invoked
by ALTER TABLE...IMPORT TABLESPACE as well as some encryption operations.
2022-01-04 07:40:31 +02:00
Marko Mäkelä
4be366111b Merge 10.6 into 10.7 2021-09-11 18:01:31 +03:00
Marko Mäkelä
15139964d5 Merge 10.5 into 10.6 2021-09-11 17:55:27 +03:00
Marko Mäkelä
2d0847818d Merge 10.4 into 10.5 2021-09-11 11:49:12 +03:00
Marko Mäkelä
101d10b883 Merge 10.3 into 10.4 2021-09-11 11:21:39 +03:00
Marko Mäkelä
bcd25e1066 Merge 10.2 into 10.3 2021-09-11 11:14:18 +03:00
Marko Mäkelä
d09426f9e6 MDEV-26537 InnoDB corrupts files due to incorrect st_blksize calculation
The st_blksize returned by fstat(2) is not documented to be
a power of 2, like we assumed in
commit 58252fff15 (MDEV-26040).
While on Linux, the st_blksize appears to report the file system
block size (which hopefully is not smaller than the sector size
of the underlying block device), on FreeBSD we observed
st_blksize values that might have been something similar to st_size.

Also IBM AIX was affected by this. A simple test case would
lead to a crash when using the minimum innodb_buffer_pool_size=5m
on both FreeBSD and AIX:

seq -f 'create table t%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
seq -f 'create table u%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&

We will fix this by not trusting st_blksize at all, and assuming that
the smallest allowed write size (for O_DIRECT) is 4096 bytes. We hope
that no storage systems with larger block size exist. Anything larger
than 4096 bytes should be unlikely, given that it is the minimum
virtual memory page size of many contemporary processors.

MariaDB Server on Microsoft Windows was not affected by this.

While the 512-byte sector size of the venerable Seagate ST-225 is still
in widespread use, the minimum innodb_page_size is 4096 bytes, and
innodb_log_file_size can be set in integer multiples of 65536 bytes.

The only occasion where InnoDB uses smaller data file block sizes than
4096 bytes is with ROW_FORMAT=COMPRESSED tables with KEY_BLOCK_SIZE=1
or KEY_BLOCK_SIZE=2 (or innodb_page_size=4096). For such tables,
we will from now on preallocate space in integer multiples of 4096 bytes
and let regular writes extend the file by 1024, 2048, or 3072 bytes.

The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES.FS_BLOCK_SIZE
should report the raw st_blksize.

For page_compressed tables, the function fil_space_get_block_size()
will map to 512 any st_blksize value that is larger than 4096.

os_file_set_size(): Assume that the file system block size is 4096 bytes,
and only support extending files to integer multiples of 4096 bytes.

fil_space_extend_must_retry(): Round down the preallocation size to
an integer multiple of 4096 bytes.
2021-09-10 19:15:41 +03:00
Eugene Kosov
d089b51d66 TSAN: unprotected global variable
WARNING: ThreadSanitizer: data race (pid=1510842)
  Write of size 8 at 0x0000067b1e98 by main thread:
    #0 os_file_pwrite(IORequest const&, int, unsigned char const*, unsigned long, unsigned long, dberr_t*) /storage/innobase/os/os0file.cc:2928:2 (mariadbd+0x234c5ac)
    #1 os_file_write_func(IORequest const&, char const*, int, void const*, unsigned long, unsigned long) /storage/innobase/os/os0file.cc:2963:20 (mariadbd+0x234c019)
    #2 file_os_io::write(char const*, unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:320:10 (mariadbd+0x22eaa50)
    #3 log_file_t::write(unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:434:18 (mariadbd+0x22eb1d8)
    #4 log_t::file::write(unsigned long, st_::span<unsigned char>) /storage/innobase/log/log0log.cc:496:29 (mariadbd+0x22ebb55)
    #5 log_write_buf(unsigned char*, unsigned long, unsigned long, unsigned long, unsigned long) /storage/innobase/log/log0log.cc:614:14 (mariadbd+0x22f1b51)
    #6 log_write(bool) /storage/innobase/log/log0log.cc:755:2 (mariadbd+0x22ed2ec)
    #7 log_write_up_to(unsigned long, bool, bool, completion_callback const*) /storage/innobase/log/log0log.cc:817:5 (mariadbd+0x22eca44)
    #8 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1734:5 (mariadbd+0x20d37c1)
    #9 log_checkpoint() /storage/innobase/buf/buf0flu.cc:1787:10 (mariadbd+0x20cd155)
    #10 buf_flush_wait_flushed(unsigned long) /storage/innobase/buf/buf0flu.cc:1867:5 (mariadbd+0x20ccf8f)
    #11 log_make_checkpoint() /storage/innobase/buf/buf0flu.cc:1793:3 (mariadbd+0x20cc4c9)
    #12 buf_dblwr_t::create() /storage/innobase/buf/buf0dblwr.cc:216:3 (mariadbd+0x209076a)
    #13 srv_start(bool) /storage/innobase/srv/srv0start.cc:1685:20 (mariadbd+0x256b4aa)
    #14 innodb_init(void*) /storage/innobase/handler/ha_innodb.cc:4188:8 (mariadbd+0x1ed40da)
    #15 ha_initialize_handlerton(st_plugin_int*) /sql/handler.cc:659:31 (mariadbd+0xf7c2b6)
    #16 plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool) /sql/sql_plugin.cc:1463:9 (mariadbd+0x160fedb)
    #17 plugin_init(int*, char**, int) /sql/sql_plugin.cc:1756:15 (mariadbd+0x160f53f)
    #18 init_server_components() /sql/mysqld.cc:5043:7 (mariadbd+0xd71462)
    #19 mysqld_main(int, char**) /sql/mysqld.cc:5655:7 (mariadbd+0xd6ae87)
    #20 main /sql/main.cc:34:10 (mariadbd+0xd661c8)

  Previous write of size 8 at 0x0000067b1e98 by thread T3:
    #0 os_file_pwrite(IORequest const&, int, unsigned char const*, unsigned long, unsigned long, dberr_t*) /storage/innobase/os/os0file.cc:2928:2 (mariadbd+0x234c5ac)
    #1 os_file_write_func(IORequest const&, char const*, int, void const*, unsigned long, unsigned long) /storage/innobase/os/os0file.cc:2963:20 (mariadbd+0x234c019)
    #2 file_os_io::write(char const*, unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:320:10 (mariadbd+0x22eaa50)
    #3 log_file_t::write(unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:434:18 (mariadbd+0x22eb1d8)
    #4 log_t::file::write(unsigned long, st_::span<unsigned char>) /storage/innobase/log/log0log.cc:496:29 (mariadbd+0x22ebb55)
    #5 log_write_checkpoint_info(unsigned long) /storage/innobase/log/log0log.cc:911:14 (mariadbd+0x22edd4e)
    #6 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1755:3 (mariadbd+0x20d3a3d)
    #7 buf_flush_sync_for_checkpoint(unsigned long) /storage/innobase/buf/buf0flu.cc:1947:7 (mariadbd+0x20d4163)
    #8 buf_flush_page_cleaner() /storage/innobase/buf/buf0flu.cc:2186:9 (mariadbd+0x20cdab1)
    #9 void std::__invoke_impl<void, void (*)()>(std::__invoke_other, void (*&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (mariadbd+0x20c3aaa)
    #10 std::__invoke_result<void (*)()>::type std::__invoke<void (*)()>(void (*&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (mariadbd+0x20c39bd)
    #11 void std:🧵:_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (mariadbd+0x20c3965)
    #12 std:🧵:_Invoker<std::tuple<void (*)()> >::operator()() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (mariadbd+0x20c3905)
    #13 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (*)()> > >::_M_run() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (mariadbd+0x20c37f9)
    #14 <null> <null> (libstdc++.so.6+0xd230f)

  Location is global 'os_n_file_writes' of size 8 at 0x0000067b1e98 (mariadbd+0x67b1e98)

  Make variable atomic.
2021-09-09 04:08:50 +06:00
Eugene Kosov
7f50edb215 TSAN: data race on a global counter
WARNING: ThreadSanitizer: data race (pid=1503350)
  Write of size 8 at 0x0000067b1f20 by thread T3:
    #0 os_file_sync_posix(int) /storage/innobase/os/os0file.cc:895:5 (mariadbd+0x23493f6)
    #1 os_file_flush_func(int) /storage/innobase/os/os0file.cc:983:8 (mariadbd+0x2349204)
    #2 file_os_io::flush() /storage/innobase/log/log0log.cc:326:10 (mariadbd+0x22eaaa9)
    #3 log_file_t::flush() /storage/innobase/log/log0log.cc:440:18 (mariadbd+0x22eb2d0)
    #4 log_t::file::flush() /storage/innobase/log/log0log.cc:507:29 (mariadbd+0x22ebe69)
    #5 log_write_flush_to_disk_low(unsigned long) /storage/innobase/log/log0log.cc:629:17 (mariadbd+0x22ed3f3)
    #6 log_write_up_to(unsigned long, bool, bool, completion_callback const*) /storage/innobase/log/log0log.cc:829:3 (mariadbd+0x22ecb04)
    #7 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1734:5 (mariadbd+0x20d37f1)
    #8 buf_flush_sync_for_checkpoint(unsigned long) /storage/innobase/buf/buf0flu.cc:1947:7 (mariadbd+0x20d4193)
    #9 buf_flush_page_cleaner() /storage/innobase/buf/buf0flu.cc:2186:9 (mariadbd+0x20cdad7)
    #10 void std::__invoke_impl<void, void (*)()>(std::__invoke_other, void (*&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (mariadbd+0x20c3aaa)
    #11 std::__invoke_result<void (*)()>::type std::__invoke<void (*)()>(void (*&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (mariadbd+0x20c39bd)
    #12 void std:🧵:_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (mariadbd+0x20c3965)
    #13 std:🧵:_Invoker<std::tuple<void (*)()> >::operator()() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (mariadbd+0x20c3905)
    #14 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (*)()> > >::_M_run() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (mariadbd+0x20c37f9)
    #15 <null> <null> (libstdc++.so.6+0xd230f)

  Previous write of size 8 at 0x0000067b1f20 by main thread:
    #0 os_file_sync_posix(int) /storage/innobase/os/os0file.cc:895:5 (mariadbd+0x23493f6)
    #1 os_file_flush_func(int) /storage/innobase/os/os0file.cc:983:8 (mariadbd+0x2349204)
    #2 fil_space_t::flush_low() /storage/innobase/fil/fil0fil.cc:504:5 (mariadbd+0x205cad5)
    #3 fil_flush_file_spaces() /storage/innobase/fil/fil0fil.cc:2947:13 (mariadbd+0x206523f)
    #4 log_checkpoint() /storage/innobase/buf/buf0flu.cc:1777:5 (mariadbd+0x20cd069)
    #5 buf_flush_wait_flushed(unsigned long) /storage/innobase/buf/buf0flu.cc:1867:5 (mariadbd+0x20ccf95)
    #6 log_make_checkpoint() /storage/innobase/buf/buf0flu.cc:1793:3 (mariadbd+0x20cc4c9)
    #7 buf_dblwr_t::create() /storage/innobase/buf/buf0dblwr.cc:216:3 (mariadbd+0x209076a)
    #8 srv_start(bool) /storage/innobase/srv/srv0start.cc:1685:20 (mariadbd+0x256b514)
    #9 innodb_init(void*) /storage/innobase/handler/ha_innodb.cc:4188:8 (mariadbd+0x1ed406a)
    #10 ha_initialize_handlerton(st_plugin_int*) /sql/handler.cc:659:31 (mariadbd+0xf7c246)
    #11 plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool) /sql/sql_plugin.cc:1463:9 (mariadbd+0x160fe6b)
    #12 plugin_init(int*, char**, int) /sql/sql_plugin.cc:1756:15 (mariadbd+0x160f4cf)
    #13 init_server_components() /sql/mysqld.cc:5043:7 (mariadbd+0xd713f2)
    #14 mysqld_main(int, char**) /sql/mysqld.cc:5655:7 (mariadbd+0xd6ae17)
    #15 main /sql/main.cc:34:10 (mariadbd+0xd66158)

This is a correct report by TSAN for an obvious case: unprotected global
counter. Fix it by making counter std::atomic.
2021-09-09 04:08:50 +06:00
Eugene Kosov
78084fa747 TSAN: data race on vptr (ctor/dtor vs virtual call)
Read of size 8 at 0x7fecf2e75fc8 by thread T2 (mutexes: write M1318):
    #0 tpool::thread_pool_generic::submit_task(tpool::task*) /tpool/tpool_generic.cc:823:9 (mariadbd+0x25fd2d2)
    #1 (anonymous namespace)::aio_uring::thread_routine((anonymous namespace)::aio_uring*) /tpool/aio_liburing.cc:173:20 (mariadbd+0x260b21b)
    #2 void std::__invoke_impl<void, void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*>(std::__invoke_other, void (*&&)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (mariadbd+0x260c62a)
    #3 std::__invoke_result<void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*>::type std::__invoke<void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*>(void (*&&)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (mariadbd+0x260c4ba)
    #4 void std:🧵:_Invoker<std::tuple<void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (mariadbd+0x260c442)
    #5 std:🧵:_Invoker<std::tuple<void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*> >::operator()() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (mariadbd+0x260c3c5)
    #6 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (*)((anonymous namespace)::aio_uring*), (anonymous namespace)::aio_uring*> > >::_M_run() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (mariadbd+0x260c189)
    #7 <null> <null> (libstdc++.so.6+0xd230f)

  Previous write of size 8 at 0x7fecf2e75fc8 by main thread:
    #0 tpool::task::task(void (*)(void*), void*, tpool::task_group*) /tpool/task.cc:40:46 (mariadbd+0x260a138)
    #1 tpool::aiocb::aiocb() /tpool/tpool.h:147:13 (mariadbd+0x2355943)
    #2 void std::_Construct<tpool::aiocb>(tpool::aiocb*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:109:38 (mariadbd+0x2355845)
    #3 tpool::aiocb* std::__uninitialized_default_n_1<false>::__uninit_default_n<tpool::aiocb*, unsigned long>(tpool::aiocb*, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_uninitialized.h:579:3 (mariadbd+0x235576c)
    #4 tpool::aiocb* std::__uninitialized_default_n<tpool::aiocb*, unsigned long>(tpool::aiocb*, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_uninitialized.h:638:14 (mariadbd+0x23556e9)
    #5 tpool::aiocb* std::__uninitialized_default_n_a<tpool::aiocb*, unsigned long, tpool::aiocb>(tpool::aiocb*, unsigned long, std::allocator<tpool::aiocb>&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_uninitialized.h:704:14 (mariadbd+0x2355641)
    #6 std::vector<tpool::aiocb, std::allocator<tpool::aiocb> >::_M_default_initialize(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:1606:4 (mariadbd+0x2354f3d)
    #7 std::vector<tpool::aiocb, std::allocator<tpool::aiocb> >::vector(unsigned long, std::allocator<tpool::aiocb> const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:512:9 (mariadbd+0x2354a19)
    #8 tpool::cache<tpool::aiocb>::cache(unsigned long, tpool::cache_notification_mode) /tpool/tpool_structs.h:73:20 (mariadbd+0x2354784)
    #9 io_slots::io_slots(int, int) /storage/innobase/os/os0file.cc:93:3 (mariadbd+0x235343b)
    #10 os_aio_init() /storage/innobase/os/os0file.cc:3780:22 (mariadbd+0x234ebce)
    #11 srv_start(bool) /storage/innobase/srv/srv0start.cc:1190:6 (mariadbd+0x256720c)
    #12 innodb_init(void*) /storage/innobase/handler/ha_innodb.cc:4188:8 (mariadbd+0x1ed3bda)
    #13 ha_initialize_handlerton(st_plugin_int*) /sql/handler.cc:659:31 (mariadbd+0xf7be06)
    #14 plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool) /sql/sql_plugin.cc:1463:9 (mariadbd+0x160fa1b)
    #15 plugin_init(int*, char**, int) /sql/sql_plugin.cc:1756:15 (mariadbd+0x160f07f)
    #16 init_server_components() /sql/mysqld.cc:5043:7 (mariadbd+0xd70fb2)
    #17 mysqld_main(int, char**) /sql/mysqld.cc:5655:7 (mariadbd+0xd6a9d7)
    #18 main /sql/main.cc:34:10 (mariadbd+0xd65d18)

I think the report is incorrect: it's not possible to have such a race
condition. I've checked it by reading the code and putting assertions.
Namely, no aio I/O is possible before the end of os_aio_init().
Most probably it's some bug in TSAN. But the patch fixes around 5 related
reports and this is a step toward TSAN usefullness. Currently it reports too
much noise.

std::unique_ptr is a step toward https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#r11-avoid-calling-new-and-delete-explicitly
There is no std::make_unique() in C++11, however.
2021-09-09 00:11:46 +06:00
Marko Mäkelä
e836673cc7 Merge 10.6 into 10.7 2021-09-07 10:37:49 +03:00
Marko Mäkelä
40ae9c5d10 Merge 10.5 into 10.6 2021-09-07 10:37:36 +03:00
Marko Mäkelä
eb2f2c1e5f MDEV-26547 fixup: Wait for read completion
buf_load(): Wait for the submitted reads to finish before updating
innodb_buffer_pool_load_status.
2021-09-07 08:55:08 +03:00
Marko Mäkelä
ddcb242b3c Merge 10.6 into 10.7 2021-08-04 11:52:39 +03:00
Oleksandr Byelkin
6efb5e9f5e Merge branch '10.5' into 10.6 2021-08-02 10:11:41 +02:00
Oleksandr Byelkin
ae6bdc6769 Merge branch '10.4' into 10.5 2021-07-31 23:19:51 +02:00
Oleksandr Byelkin
7841a7eb09 Merge branch '10.3' into 10.4 2021-07-31 22:59:58 +02:00
Marko Mäkelä
f50eb0d398 Merge 10.2 into 10.3 2021-07-27 10:47:17 +03:00
Marko Mäkelä
da094188f6 MDEV-24393 InnoDB disregards --skip-external-locking
On POSIX systems, InnoDB would unconditionally acquire advisory locks
on the files that it opens. On Linux, this would be observable by
a large number of entries in /proc/locks.

Other storage engines would only acquire advisory locks on files
based on the Boolean configuration parameter external_locking.

Let InnoDB do the same.

NOTE: The --skip-external-locking is activated by default. To have
InnoDB acquire advisory locks, --external-locking must be specified.

Reviewed by: Sergei Golubchik
2021-07-27 08:52:59 +03:00
Marko Mäkelä
ca501ffb04 MDEV-26195: Use a 32-bit data type for some tablespace fields
In the InnoDB data files, we allocate 32 bits for tablespace identifiers
and page numbers as well as tablespace flags. But, in main memory
data structures we allocate 32 or 64 bits, depending on the register
width of the processor. Let us always use 32-bit fields to eliminate
a mismatch and reduce the memory footprint on 64-bit systems.
2021-07-22 11:22:47 +03:00
Marko Mäkelä
0ad8a825a8 Merge 10.5 into 10.6 2021-07-02 17:00:05 +03:00
Marko Mäkelä
15dcb8bd3e Merge 10.4 into 10.5 2021-07-02 13:02:26 +03:00
Sergei Petrunia
eebe2090c8 Merge 10.3 -> 10.4 2021-06-30 18:41:46 +03:00
Sergei Petrunia
586870f9ef Merge 10.2->10.3 2021-06-30 15:06:54 +03:00
Marko Mäkelä
30edd5549d MDEV-26029: Sparse files are inefficient on thinly provisioned storage
The MariaDB implementation of page_compressed tables for InnoDB used
sparse files. In the worst case, in the data file, every data page
will consist of some data followed by a hole. This may be extremely
inefficient in some file systems.

If the underlying storage device is thinly provisioned (can compress
data on the fly), it would be good to write regular files (with sequences
of NUL bytes at the end of each page_compressed block) and let the
storage device take care of compressing the data.

For reads, sparse file regions and regions containing NUL bytes will be
indistinguishable.

my_test_if_disable_punch_hole(): A new predicate for detecting thinly
provisioned storage. (Not implemented yet.)

innodb_atomic_writes: Correct the comment.

buf_flush_page(): Support all values of fil_node_t::punch_hole.
On a thinly provisioned storage device, we will always write
NUL-padded innodb_page_size bytes also for page_compressed tables.

buf_flush_freed_pages(): Remove a redundant condition.

fil_space_t::atomic_write_supported: Remove. (This was duplicating
fil_node_t::atomic_write.)

fil_space_t::punch_hole: Remove. (Duplicated fil_node_t::punch_hole.)

fil_node_t: Remove magic_n, and consolidate flags into bitfields.
For punch_hole we introduce a third value that indicates a
thinly provisioned storage device.

fil_node_t::find_metadata(): Detect all attributes of the file.
2021-06-29 15:18:22 +03:00
Marko Mäkelä
58252fff15 MDEV-26040 os_file_set_size() may not work on O_DIRECT files
os_file_set_size(): Trim the current size down to the file system
block size, to obey the constraints for unbuffered I/O.
2021-06-29 14:28:23 +03:00
Marko Mäkelä
101da87228 Merge 10.5 into 10.6 2021-06-23 19:36:45 +03:00