mirror of https://github.com/MariaDB/server.git synced 2025-09-13 13:47:59 +03:00
Commit Graph

1253 Commits

Author SHA1 Message Date
Oleksandr Byelkin
a8d4642375 Merge branch '10.11' into 11.4 2025-04-26 10:53:02 +02:00
Marko Mäkelä
75ad1e9f00 Merge 10.6 into 10.11 2025-04-23 08:53:53 +03:00
Thirunarayanan Balathandayuthapani
4bedb222a8 MDEV-36304 InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY error
during mariabackup --prepare

Reason:
======
During --prepare of a partial backup, if InnoDB encounters redo log
for an excluded tablespace, it stores the space id in the dirty
tablespace list during recovery, anticipating that it may encounter
FILE_* redo log records later. However, even when we do encounter a
FILE_* record for the excluded tablespace, we fail to replace the
name in the dirty tablespace list. This leads to the error about
missing FILE_* redo log records.

Solution:
========
fil_name_process(): Replace the empty file name "" with the name
encountered in the FILE_* record

recv_init_missing_space(): Correct the condition to print the warning
message of missing tablespace during mariabackup restore process.
2025-04-22 17:53:08 +05:30
Marko Mäkelä
a524ec5951 MDEV-36587 InnoDB uses too much memory
log_t::clear_mmap(): Do not modify buf_size; we may have
file_size==0 here during bootstrap.

log_t::set_recovered(): If we are writing to a memory-mapped log,
update log_sys.buf_size to the record payload area of log_sys.buf.

This fixes up commit acd071f599
(MDEV-21923).
2025-04-14 10:33:22 +03:00
Marko Mäkelä
acd071f599 MDEV-21923: LSN allocation is a bottleneck
The parameter innodb_log_spin_wait_delay will be deprecated and
ignored, because there is no spin loop anymore.

Thanks to commit 685d958e38
and commit a635c40648
multiple mtr_t::commit() can concurrently copy their slice of
mtr_t::m_log to the shared log_sys.buf.  Each writer would allocate
their own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch.  This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.

It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub() as well as a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later)
can translate into loop-free code on AMD64.
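
A minimal sketch of the idea, not the actual log_t code: one atomic
counter stands in for write_lsn_offset, and each writer reserves its
slice of the log with a single fetch_add(). The names LogSketch,
reserve() and lsn_approx() are hypothetical, and the buffer-full
handling and the single-bit fetch_or() flag are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the simplified log state described above.
struct LogSketch {
  uint64_t base_lsn;                          // LSN at the last write-out
  std::atomic<uint64_t> write_lsn_offset{0};  // bytes appended since base_lsn

  explicit LogSketch(uint64_t base) : base_lsn(base) {}

  // Reserve `size` bytes with one atomic read-modify-write and
  // return the LSN at which the caller's slice starts.
  uint64_t reserve(uint64_t size) noexcept {
    assert(size > 0);
    uint64_t offset =
        write_lsn_offset.fetch_add(size, std::memory_order_relaxed);
    return base_lsn + offset;
  }

  // Approximate current LSN: base plus everything reserved so far.
  uint64_t lsn_approx() const noexcept {
    return base_lsn + write_lsn_offset.load(std::memory_order_relaxed);
  }
};
```

Two writers that call reserve(10) and reserve(5) on a log with base 1000
would receive the disjoint ranges starting at 1000 and 1010, in whichever
order their fetch_add() operations are serialized.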

Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.

log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove.
Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove.  For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.

recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.

In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Reviewed by: Debarun Banerjee
2025-04-10 13:02:17 +03:00
Marko Mäkelä
0bcc03a2e6 Recovery cleanup
recv_sys_t::report_progress(): Display the largest currently known LSN.

recv_scan_log(): Display an error with fewer function calls.

Reviewed by: Debarun Banerjee
2025-04-10 12:58:10 +03:00
Marko Mäkelä
f5bd250f5b Merge 10.11 into 11.4 2025-03-28 13:55:21 +02:00
Marko Mäkelä
027d815546 MDEV-29445 fixup: Make Valgrind fair again
recv_sys_t::wait_for_pool(): Also wait for pending writes, so that
previously written blocks can be evicted and reused.

buf_flush_sync_for_checkpoint(): Wait for pending writes, in order
to guarantee progress even if the scheduler is unfair.
2025-03-27 14:52:07 +02:00
Marko Mäkelä
ab0f2a00b6 Merge 10.6 into 10.11 2025-03-27 08:01:47 +02:00
Marko Mäkelä
b6923420f3 MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and allow the buffer pool size to be changed in arbitrary 1-megabyte
increments.

innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size.  If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB).  The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.
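
The rounding and the stated maximum can be illustrated as follows. This
is only a sketch under the assumption that the allocation unit depends
solely on the pointer width; round_up_to_extent() and max_pool_size are
hypothetical names, and overflow near the maximum is not handled.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = 1ULL << 20;

// Allocation unit per the text above: 8 MiB on 64-bit, 2 MiB on 32-bit.
constexpr uint64_t extent_size = (sizeof(void*) >= 8 ? 8 : 2) * MiB;

// Round a requested size up to the next multiple of the allocation unit.
constexpr uint64_t round_up_to_extent(uint64_t size) noexcept {
  return (size + extent_size - 1) / extent_size * extent_size;
}

// Largest multiple of the allocation unit that fits in the address space:
// 16EiB-8MiB on 64-bit systems, 4GiB-2MiB on 32-bit systems.
constexpr uint64_t max_pool_size =
    static_cast<uint64_t>(SIZE_MAX) / extent_size * extent_size;

static_assert(round_up_to_extent(1) == extent_size, "rounds up");
static_assert(max_pool_size % extent_size == 0, "maximum is aligned");
```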

The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.

Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement,
a client disconnect, or server shutdown.  If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.

SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows.  On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Resident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).

The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.

Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent.  The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.
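
A rough sketch of such a mapping follows. It is not the InnoDB code:
descriptor_size is a made-up value standing in for sizeof(buf_block_t),
and first_frame_in_extent() and descriptor_of() are hypothetical helpers
that only show why no lookup table is needed once extents are aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t extent_size = 8u << 20;  // 8 MiB, aligned in memory
constexpr size_t descriptor_size = 160;   // hypothetical sizeof(buf_block_t)

// Smallest number of leading pages whose bytes can hold one descriptor
// for every remaining page frame in the extent.
inline size_t first_frame_in_extent(size_t page_size) {
  const size_t pages = extent_size / page_size;
  size_t n = 1;
  while (n * page_size / descriptor_size < pages - n) n++;
  return n;
}

// Map the address of (any byte in) a page frame to the address of its
// block descriptor, using only arithmetic on the aligned extent.
inline uintptr_t descriptor_of(uintptr_t frame, size_t page_size) {
  const uintptr_t extent = frame & ~uintptr_t(extent_size - 1);
  const size_t page_no = (frame - extent) / page_size;
  const size_t first = first_frame_in_extent(page_size);
  assert(page_no >= first);  // the leading pages hold descriptors, not frames
  return extent + (page_no - first) * descriptor_size;
}
```

In the real code the per-page-size result is precomputed, which
corresponds to evaluating first_frame_in_extent() once for each
supported innodb_page_size.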

We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the extent size.

The minimum innodb_buffer_pool_size is 256*5/4 pages.  At the default
innodb_page_size=16k this corresponds to 5 MiB.  However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.
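
The arithmetic behind the 5 MiB figure can be checked at compile time.
The names min_pages and min_frame_bytes() are hypothetical; the extra
space for block descriptors that raises the practical minimum to 6m is
not modelled here.

```cpp
#include <cassert>
#include <cstddef>

// Minimum number of pages stated above: 256*5/4.
constexpr size_t min_pages = 256 * 5 / 4;

// Bytes needed for the page frames alone at a given page size.
constexpr size_t min_frame_bytes(size_t page_size) {
  return min_pages * page_size;
}

static_assert(min_pages == 320, "256*5/4 pages");
static_assert(min_frame_bytes(16384) == (5u << 20),
              "5 MiB at innodb_page_size=16k");
```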

my_large_virtual_alloc(): A new function, similar to my_large_malloc().

my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.

After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified.  This option is hard-coded on Microsoft Windows,
where VirtualFree(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately.  On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size.  Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.
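
The reserve/commit/decommit/release life cycle can be sketched on Linux
with plain mmap(2), mprotect(2) and madvise(2). This is an illustration
under the assumption of a Linux target and the read-only variant of
decommit; the vm_* helpers are hypothetical and are not the
my_virtual_mem_*() implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Reserve address space only; nothing is accessible yet.
inline void* vm_reserve(size_t size) {
  void* p = mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Make a page-aligned range inside the reservation readable and writable.
inline bool vm_commit(void* p, size_t size) {
  return mprotect(p, size, PROT_READ | PROT_WRITE) == 0;
}

// Give the pages back immediately. On Linux, MADV_DONTNEED on a private
// anonymous mapping frees the pages, and later reads observe zeroes.
// The range is left read-only afterwards.
inline bool vm_decommit(void* p, size_t size) {
  return madvise(p, size, MADV_DONTNEED) == 0 &&
         mprotect(p, size, PROT_READ) == 0;
}

// Release the whole reservation.
inline bool vm_release(void* p, size_t size) {
  return munmap(p, size) == 0;
}
```

Because the pages are freed at decommit time rather than lazily, the
resident set size drops right away, which is the behaviour the text
above prefers.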

opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.

buf_pool_t::chunk_t::create(): Remove.

buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.

buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().

buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.

buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().
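
The periodic release of a contended mutex can be sketched as follows.
This is a simplified illustration with std::mutex and std::list; the
function traverse_with_yield() is hypothetical. It assumes that no other
thread removes the current element while the mutex is released, which is
the problem that lru_scan_itr solves in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Walk a list under a mutex, briefly releasing the mutex once in every
// 32 iterations so that waiting threads can make progress.
// Returns how many times the mutex was released.
template <typename T, typename Visit>
size_t traverse_with_yield(std::list<T>& lru, std::mutex& m, Visit visit) {
  size_t releases = 0, n = 0;
  std::unique_lock<std::mutex> lock(m);
  for (auto it = lru.begin(); it != lru.end(); ++it) {
    visit(*it);
    if (++n % 32 == 0) {
      lock.unlock();  // let other threads acquire the mutex
      releases++;
      lock.lock();
    }
  }
  return releases;
}
```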

buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().

buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.

buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.

buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.

buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.

buf_pool_t::is_shrinking(): Return the total number of blocks that
were withdrawn or are to be withdrawn.

buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.

buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.

buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.

buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().

innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.

buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.

buf_buddy_condense_free(): Do not relocate blocks.

buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().

buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.

buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().

buf_buddy_alloc_from(): Remove the redundant parameter j.

buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.

buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.

trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:44 +02:00
Thirunarayanan Balathandayuthapani
a390aaaf23 MDEV-36180 Doublewrite recovery of innodb_checksum_algorithm=full_crc32 page_compressed pages does not work
- InnoDB fails to recover a full_crc32 page_compressed page
from the doublewrite buffer. The reason is that buf_dblwr_t::recover()
fails to identify the space id from the page because the page has been
compressed from the FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION bytes onwards.

Fix:
===
recv_dblwr_t::find_deferred_page(): Find the page which
has the same page number and try to decompress/decrypt the page
based on the tablespace metadata. After the decompression/decryption,
compare the space id and write the recovered page back to the file.

buf_page_t::read_complete(): If the page read from disk is corrupted,
try to read the page from the deferred pages in the doublewrite buffer.
2025-03-26 12:03:44 +01:00
Marko Mäkelä
49a6baec56 Merge 10.11 into 11.4 2025-03-03 11:07:56 +02:00
Marko Mäkelä
937ae4137e MDEV-36155: MSAN use-of-uninitialized-value innodb.log_file_size_online
Writing the resized redo log will write uninitialized data. There is
a MEM_MAKE_DEFINED construct in the code to bless this; however, it was
correct for the initial length but not for the changed length.

The MEM_MAKE_DEFINED is moved earlier in the code, to where the length
contains the correct value.
2025-02-27 08:19:07 +11:00
Marko Mäkelä
809a0cebdc MDEV-36152 mariadb-backup --backup crash during innodb_undo_log_truncate=ON, innodb_encrypt_log=ON
recv_sys_t::parse(): Allocate decrypt_buf also for storing==BACKUP
but limit its size to correspond to 1 byte of record payload.
Ensure that last_offset=0 for storing==BACKUP.
When parsing INIT_PAGE or FREE_PAGE, avoid an unnecessary
l.copy_if_needed() for storing!=YES.
When parsing EXTENDED in storing==BACKUP, properly invoke
l.copy_if_needed() on a large enough decrypt_buf.
When parsing WRITE, MEMMOVE, MEMSET in storing==BACKUP,
skip further validation (and potential overflow of the tiny decrypt_buf),
like we used to do before commit 46aaf328ce
(MDEV-35830).

Reviewed by: Debarun Banerjee
2025-02-25 11:41:49 +02:00
Marko Mäkelä
7e001b2a3c MDEV-36082 Race condition between log_t::resize_start() and log_t::resize_abort()
log_t::writer_update(): Add the parameter bool resizing,
to indicate whether log resizing is in progress.
We must enable log_writer_resizing only if resize_lsn>1,
to ensure that log_t::resize_abort() will not choose the wrong
log_sys.log_writer.

log_t::resize_initiator: The thread that successfully invoked
resize_start().

log_t::resize_start(): Simplify some logic, and assign resize_initiator
if we successfully started log resizing.

log_t::resize_abort(): Abort log resizing if we are the
resize_initiator.

innodb_log_file_size_update(): Clean up some logic.

Reviewed by: Debarun Banerjee
2025-02-17 15:55:58 +02:00
Sergei Golubchik
f1a7693bc0 Merge branch '10.11' into 11.4 2025-01-14 23:45:41 +01:00
Marko Mäkelä
46aaf328ce MDEV-35830 Fix innodb_undo_log_truncate in backup
recv_sys_t::parse(): Correctly handle the storing==BACKUP case,
and simplify some logic around storing==YES as well.

The added test mariabackup.undo_truncate is based on an idea of
Thirunarayanan Balathandayuthapani. It nondeterministically (not on
every run) covers this logic, including the function backup_undo_trunc(),
for both innodb_encrypt_log=ON and innodb_encrypt_log=OFF.

Reviewed by: Debarun Banerjee
2025-01-13 16:57:11 +02:00
Marko Mäkelä
aa35f62f1c MDEV-35810 Missing error handling in log resizing
log_t::resize_start(): If the ib_logfile101 cannot be created,
be sure to reset log_sys.resize_lsn.

log_t::resize_abort(): In case SET GLOBAL innodb_log_file_size is
aborted, delete the ib_logfile101.
2025-01-13 10:41:40 +02:00
Marko Mäkelä
42e6058629 MDEV-35785 innodb_log_file_mmap is not defined on 32-bit systems
innodb_log_file_mmap: Use a constant documentation string that
refers to persistent memory also when it is not available in the build.

HAVE_INNODB_MMAP: Remove, and unconditionally enable this code.

log_mmap(): On 32-bit systems, ensure that the size fits in 32 bits.

log_t::resize_start(), log_t::resize_abort(): Only handle memory-mapping
if HAVE_PMEM is defined. The generic memory-mapped interface is only for
reading the log in recovery. Writable memory mappings are only for
persistent memory, that is, Linux file systems with mount -o dax.

Reviewed by: Debarun Banerjee, Otto Kekäläinen
2025-01-13 07:27:17 +02:00
Sergei Golubchik
221aa5e08f Merge branch '10.6' into 10.11 2025-01-10 13:14:42 +01:00
Marko Mäkelä
4704435068 MDEV-35802 Race condition in log_t::persist()
log_t::persist(): Remove the parameter holding_latch, and assert
latch_holding_any(). We used to avoid acquiring a latch when log
resizing was not in progress. That allowed a race condition to occur
where log_t::write_checkpoint() has just completed log resizing.
In that case, we could wrongly invoke pmem_persist() on the old
log_sys.buf instead of the new one, which was shortly before known
as log_sys.resize_buf.

log_write_persist(): A non-inline wrapper function that will
invoke log_sys.persist() while holding a shared log_sys.latch.
2025-01-10 08:15:09 +02:00
Marko Mäkelä
bca4cc0bd6 Merge 10.11 into 11.4 2025-01-09 14:02:19 +02:00
Marko Mäkelä
81633f47c3 MDEV-35796 OPT_PAGE_CHECKSUM is ignored if innodb_encrypt_log=ON
recv_sys_t::parse(): When parsing an OPTION record, invoke
l.copy_if_needed() before checking if the payload is OPT_PAGE_CHECKSUM
followed by a 32-bit page checksum.

This fixes up the merge 57d4a242da of
commit 4179f93d28 (MDEV-18976).

The impact of this can be observed by running a debug instrumented
build on the test encryption.recovery_memory. There should be over
5,000 invocations of log_phys_t::page_checksum(). Without this fix,
there should be less than 100 of them (when the OPT_PAGE_CHECKSUM
byte happens to encrypt to itself).

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2025-01-09 13:21:38 +02:00
Marko Mäkelä
ed13d93a25 Fix mariadb-backup --backup with innodb_undo_log_truncate=ON
This fixes another regression that had been introduced in
commit b249a059da (MDEV-34850).

This should prevent failures of mariadb-backup --backup of
the following type:

mariabackup: Failed to read undo log tablespace space id …
and there is no undo tablespace truncation redo record.

This error has not been hit by our internal testing, and we
currently have no regression test to cover this.
2025-01-09 13:18:42 +02:00
Marko Mäkelä
ea19a6b38c MDEV-35699 Multi-batch recovery occasionally fails
recv_sys_t::parse<storing=NO>(): Do invoke
fil_space_set_recv_size_and_flags() and do parse enough of page 0
to facilitate that.

This fixes a regression that had been introduced in
commit b249a059da (MDEV-34850).
In a multi-batch crash recovery, we would fail to invoke
fil_space_set_recv_size_and_flags() while parsing the remaining log,
before starting the first recovery batch.

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2025-01-09 13:18:30 +02:00
Marko Mäkelä
17f01186f5 Merge 10.11 into 11.4 2025-01-09 07:58:08 +02:00
Marko Mäkelä
990b010b09 MDEV-35438 Annotate InnoDB I/O functions with noexcept
Most InnoDB functions do not throw any exceptions, not even indirectly
std::bad_alloc, which could be thrown by a C++ memory allocation function.
Let us annotate many functions with noexcept in order to reduce the code
footprint related to exception handling.
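
As a small illustration (with hypothetical helper names, not InnoDB
functions), the noexcept operator lets the annotation be checked at
compile time:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical I/O-style helper: it cannot throw, so it says so, and
// callers need no exception handling code around calls to it.
inline size_t copy_block(void* dst, const void* src, size_t len) noexcept {
  std::memcpy(dst, src, len);
  return len;
}

// The same helper without the annotation: the compiler must assume
// that a call to it may throw.
inline size_t copy_block_unannotated(void* dst, const void* src, size_t len) {
  std::memcpy(dst, src, len);
  return len;
}

// The noexcept operator inspects the declaration without calling anything.
static_assert(noexcept(copy_block(nullptr, nullptr, 0)),
              "annotated function is known not to throw");
static_assert(!noexcept(copy_block_unannotated(nullptr, nullptr, 0)),
              "unannotated function is assumed to be able to throw");
```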

Reviewed by: Thirunarayanan Balathandayuthapani
2025-01-09 07:43:24 +02:00
Marko Mäkelä
420d9eb27f Merge 10.6 into 10.11 2025-01-08 12:51:26 +02:00
Thirunarayanan Balathandayuthapani
f8cf493290 MDEV-34898 Doublewrite recovery of innodb_checksum_algorithm=full_crc32 encrypted pages does not work
- InnoDB fails to recover the full crc32 encrypted page from
doublewrite buffer. The reason is that buf_dblwr_t::recover()
fails to identify the space id from the page because the page has
been encrypted from FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION bytes.

Fix:
===
buf_dblwr_t::recover(): preserve any pages whose space_id
does not match a known tablespace. These could be encrypted pages
of tablespaces that had been created with
innodb_checksum_algorithm=full_crc32.

buf_page_t::read_complete(): If the page looks corrupted and the
tablespace is encrypted and in full_crc32 format, try to
restore the page from doublewrite buffer.

recv_dblwr_t::recover_encrypted_page(): Find the page which
has the same page number and try to decrypt the page using
space->crypt_data. After decryption, compare the space id.
Write the recovered page back to the file.
2025-01-07 19:33:56 +05:30
Marko Mäkelä
a54d151fc1 Merge 10.6 into 10.11 2024-12-19 15:38:53 +02:00
Marko Mäkelä
c391fb1ff1 MDEV-35577 Broken recovery after SET GLOBAL innodb_log_file_size
If InnoDB is killed in such a way that there had been no writes
to a newly resized ib_logfile101 after it replaced ib_logfile0
in log_t::write_checkpoint(), it is possible that recovery will
accidentally interpret some garbage at the end of the log as valid.

log_t::write_buf(): To prevent the corruption, write an extra NUL byte
at the end of log_sys.resize_buf, like we always did for the main
log_sys.buf. To remove some conditional branches from a time critical
code path, we instantiate a separate template for the rare case that the
log is being resized. Define as __attribute__((always_inline)) so that
this will be inlined also in the rare case the log is being resized.

log_t::writer: Pointer to the current implementation of
log_t::write_buf(). For quick access, this is located in the
same cache line with log_sys.latch, which protects it.

log_t::writer_update(): Update log_sys.writer.
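
The combination of a template parameter and a function pointer can be
sketched like this. It is a simplified model, not the log_t code:
LogSketch, append_terminated() and the std::string "files" are
hypothetical stand-ins, and only the trailing NUL byte and the writer
selection are modelled.

```cpp
#include <cassert>
#include <string>

struct LogSketch;
using writer_t = void (*)(LogSketch&, const std::string&);

struct LogSketch {
  std::string file;         // stand-in for the main log
  std::string resize_file;  // stand-in for the log being resized into
  writer_t writer = nullptr;
};

// Append data followed by a NUL byte; the next append overwrites the NUL,
// so that stale bytes after the end are never mistaken for valid log.
inline void append_terminated(std::string& file, const std::string& data) {
  if (!file.empty()) file.pop_back();
  file += data;
  file.push_back('\0');
}

// Two instantiations: the common one carries no branch for resizing.
template <bool resizing>
void write_buf(LogSketch& log, const std::string& data) {
  append_terminated(log.file, data);
  if (resizing) append_terminated(log.resize_file, data);
}

// Select the implementation once, when resizing starts or ends.
inline void writer_update(LogSketch& log, bool resizing) {
  if (resizing)
    log.writer = &write_buf<true>;
  else
    log.writer = &write_buf<false>;
}
```

The condition on resizing is a compile-time constant inside each
instantiation, so the check moves out of the hot path and into the rare
call of writer_update().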

log_t::resize_write_buf(): Remove ATTRIBUTE_NOINLINE ATTRIBUTE_COLD.
Now that log_t::write_buf() will be instantiated separately for the
rare case of log resizing being in progress, there is no need to forbid
this code from being inlined.

Thanks to Thirunarayanan Balathandayuthapani for finding the
root cause of this bug and suggesting the fix of writing an extra
NUL byte.

Reviewed by: Debarun Banerjee
2024-12-16 11:50:00 +02:00
Marko Mäkelä
bfe7c8ff0a MDEV-35494 fil_space_t::fil_space_t() may be unsafe with GCC -flifetime-dse
fil_space_t::create(): Instead of invoking the default fil_space_t
constructor on a zero-filled buffer, allocate an uninitialized buffer
and invoke an explicitly defined constructor on it. Also, specify
initializer expressions for all constant data members, so that all of them
will be initialized in the constructor.
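
The safer pattern can be sketched as follows. With -flifetime-dse, GCC
may treat stores made to the storage before the constructor runs (such
as zero-filling) as dead, so every member has to be initialized by the
constructor itself. SpaceSketch and its members are hypothetical and
much smaller than fil_space_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>

struct SpaceSketch {
  const uint32_t id;
  uint32_t n_pending = 0;       // initializer expressions cover every
  bool being_imported = false;  // member not set by the constructor
  explicit SpaceSketch(uint32_t id) noexcept : id(id) {}
};

// Allocate an uninitialized buffer and construct the object in it,
// instead of relying on the buffer having been zero-filled.
inline SpaceSketch* create_space(uint32_t id) {
  void* raw = std::malloc(sizeof(SpaceSketch));
  return raw ? new (raw) SpaceSketch(id) : nullptr;
}

inline void destroy_space(SpaceSketch* space) {
  if (!space) return;
  space->~SpaceSketch();
  std::free(space);
}
```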

fil_space_t::being_imported: Replaces part of fil_space_t::purpose.

fil_space_t::is_being_imported(), fil_space_t::is_temporary():
Replaces fil_space_t::purpose.

fil_space_t::id: Changed the type from ulint to uint32_t to reduce
incompatibility with later branches that include
commit ca501ffb04 (MDEV-26195).

fil_space_t::try_to_close(): Do not attempt to close files that are
in an I/O bound phase of ALTER TABLE…IMPORT TABLESPACE.

log_file_op, first_page_init: recv_spaces_t:
Use uint32_t for the tablespace id.

Reviewed by: Debarun Banerjee
2024-12-11 14:44:42 +02:00
Marko Mäkelä
1a557d087c MDEV-35608 Fake PMEM on /dev/shm no longer works
In commit 6acada713a the
logic for treating the file system of /dev/shm
as if it were persistent memory was broken.

Let us restore the original logic, so that we will have
some more CI coverage of the memory-mapped redo log interface.
2024-12-09 12:53:38 +02:00
Kristian Nielsen
0f47db8525 Merge 10.11 -> 11.4
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-12-05 11:01:42 +01:00
Kristian Nielsen
e7c6cdd842 Merge 10.6 -> 10.11
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-12-05 10:11:58 +01:00
Marko Mäkelä
2719cc4925 Merge 10.11 into 11.4 2024-12-02 11:35:34 +02:00
Marko Mäkelä
998a625d00 Clean up recv_sys.pages bookkeeping
Instead of repurposing buf_page_t::access_time for state()==MEMORY
blocks that are part of recv_sys.pages, let us define an anonymous
union around buf_page_t::hash.  In this way, we will be able to
declare access_time private.

Reviewed by: Vladislav Lesin
2024-11-29 14:16:11 +02:00
Marko Mäkelä
3d23adb766 Merge 10.6 into 10.11 2024-11-29 13:43:17 +02:00
Marko Mäkelä
895cd553a3 MDEV-32175: Reduce page_align(), page_offset() calls
When srv_page_size and innodb_page_size were introduced,
the functions page_align() and page_offset() got more expensive.
Let us try to replace such calls with simpler pointer arithmetic
with respect to the buffer page frame.
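
The difference can be illustrated with hypothetical helpers (these are
not the InnoDB functions). Recovering the frame from a record pointer
needs a mask derived from the run-time page size, whereas a caller that
already holds the frame pointer only needs a subtraction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mask-based variants: page_size must be a power of two, and in the
// server it is a variable that has to be loaded at run time.
inline const unsigned char* page_align_sketch(const unsigned char* ptr,
                                              size_t page_size) {
  assert((page_size & (page_size - 1)) == 0);
  return reinterpret_cast<const unsigned char*>(
      reinterpret_cast<uintptr_t>(ptr) & ~uintptr_t(page_size - 1));
}

inline size_t page_offset_sketch(const unsigned char* ptr, size_t page_size) {
  assert((page_size & (page_size - 1)) == 0);
  return reinterpret_cast<uintptr_t>(ptr) & (page_size - 1);
}

// When the page frame is passed along as a parameter, the offset of a
// record is a plain pointer difference and needs no page size at all.
inline size_t offset_in_frame(const unsigned char* frame,
                              const unsigned char* rec) {
  assert(rec >= frame);
  return static_cast<size_t>(rec - frame);
}
```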

page_rec_get_next_non_del_marked(): Add a page frame as a parameter,
and template<bool comp>.

page_rec_next_get(): A more efficient variant of page_rec_get_next(),
with template<bool comp> and const page_t* parameters.

lock_get_heap_no(): Replaces page_rec_get_heap_no() outside debug checks.

fseg_free_step(), fseg_free_step_not_header(): Take the header block
as a parameter.

Reviewed by: Vladislav Lesin
2024-11-21 11:01:30 +02:00
ParadoxV5
687377633d Extract some of #3360 fixes to 10.11.x
That PR uncovered countless issues on `my_snprintf` uses.
This commit backports a squashed subset of their fixes.
(Excludes previous parts #3485 and #3493)
2024-11-18 14:04:56 +11:00
Oleksandr Byelkin
69d033d165 Merge branch '10.11' into 11.2 2024-10-29 16:42:46 +01:00
Oleksandr Byelkin
3d0fb15028 Merge branch '10.6' into 10.11 2024-10-29 15:24:38 +01:00
Marko Mäkelä
b38edd09ff MDEV-34830 fixup: Relax an assertion
This follows up 1067046b7f
2024-10-22 11:35:33 +03:00
Marko Mäkelä
1067046b7f MDEV-34830 fixup: Relax an assertion
It is possible that recv_sys.scanned_lsn is ahead of recv_sys.recovered_lsn
by a few 512-byte log blocks in case the last mini-transaction in the log
had not been written out completely before the server was killed.
This is occasionally the case when running the test
innodb.innodb-32k-crash.
2024-10-22 09:09:11 +03:00
Marko Mäkelä
bea4adcb5a MDEV-35225 Bogus debug assertion failures in innodb.innodb-32k-crash
log_sort_flush_list(): Correct some debug assertions that had been added in
commit 0d175968d1 (MDEV-31354).
The writes of some blocks may be completed and the oldest_modification()
set to 1 at any time.

The bogus assertion failures led to occasional failures of the test
innodb.innodb-32k-crash.
2024-10-22 09:07:57 +03:00
Vladislav Vaintroub
e8db5c8760 MDEV-35171 OS_FILE_NORMAL and OS_FILE_AIO are misleading
Removed 'purpose' parameter from os_file_create() and related functions.
Always use FILE_FLAG_OVERLAPPED when opening Windows files.

No performance regression was measured, nor is there any measurable
improvement.
2024-10-21 15:31:32 +02:00
Marko Mäkelä
ebefef658e Merge 10.11 into 11.2 2024-10-18 11:32:22 +03:00
Marko Mäkelä
eca552a1a4 MDEV-34830: LSN in the future is not being treated as serious corruption
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record
must first have been durably written.

In crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.

Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
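
The invariant itself reduces to a comparison once the end of the log is
known. The following is only a sketch with hypothetical names
(PageLsnCheck, check_page_lsn); the actual checks also have to deal with
the doublewrite buffer and innodb_force_recovery.

```cpp
#include <cassert>
#include <cstdint>

enum class PageLsnCheck { OK, FUTURE_LSN };

// Write-ahead logging invariant: a page read from a data file must not
// carry an LSN that is newer than the end of the durably written log.
inline PageLsnCheck check_page_lsn(uint64_t page_lsn,
                                   uint64_t log_end_lsn) noexcept {
  return page_lsn <= log_end_lsn ? PageLsnCheck::OK
                                 : PageLsnCheck::FUTURE_LSN;
}
```

Because the whole log has been parsed before any data page is read, the
log_end_lsn argument corresponds to the final LSN that the text above
says can be relied on at all times.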

recv_max_page_lsn, recv_lsn_checks_on: Remove.

recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.

recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain about the LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".

recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.

recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.

buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have led to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.

buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.

buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.

Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.

recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.

buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.

IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.

fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.

Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.

Reviewed by: Debarun Banerjee
2024-10-18 10:12:47 +03:00
Marko Mäkelä
bb47e575de MDEV-34830: LSN in the future is not being treated as serious corruption
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record
must first have been durably written.

On crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.

Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.

A section of the test mariabackup.innodb_redo_overwrite
that is parsing some mariadb-backup --backup output has
been removed, because that output "redo log block is overwritten"
would often be missing in a Microsoft Windows environment
as a result of these changes.

recv_max_page_lsn, recv_lsn_checks_on: Remove.

recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.

recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain about the LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".

recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.

recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.

buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have led to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.

buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.

buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.

Datafile::read_first_page(): Handle FSP_SPACE_FLAGS=0xffffffff
in the same way on both 32-bit and 64-bit architectures.

Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.

recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.

buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.

IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.

fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.

Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.

Reviewed by: Debarun Banerjee
2024-10-17 17:24:20 +03:00
Marko Mäkelä
8a6a4c947a Cleanup: Replace some is_predefined_tablespace()
In some places, there were redundant comparisons against TRX_SYS_SPACE
or SRV_TMP_SPACE_ID. The temporary tablespace is never the subject of
log-based recovery.

Also, consistently check for SRV_SPACE_ID_UPPER_BOUND.

Reviewed by: Debarun Banerjee
2024-10-04 13:41:12 +03:00