1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-08 11:22:35 +03:00
Commit Graph

2108 Commits

Author SHA1 Message Date
Oleksandr Byelkin
89c7e2b9c7 Merge branch '10.11' into 11.4 2025-06-17 09:50:22 +02:00
Marko Mäkelä
1c7209e828 Merge 10.6 into 10.11 2025-05-21 07:36:35 +03:00
Sergei Golubchik
83e0438f62 MDEV-36536 post-review changes
that were apparently partially lost in a rebase
2025-04-29 11:34:35 +02:00
Monty
1b934a387c MDEV-36536 Add option to not collect statistics for long char/varchars
This is needed to make it easy for users to automatically ignore long
char and varchars when using  ANALYZE TABLE PERSISTENT.
These fields can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' space on disk
during analyze, which can easily be much bigger than the analyzed table.

This commit adds a new user variable, analyze_max_length, default value 4G.
Any field that is bigger than this in bytes, will be ignored by
ANALYZE TABLE PERSISTENT unless it is specified in FOR COLUMNS().

While doing this patch, I noticed that we do not skip GEOMETRY columns from
ANALYZE TABLE, like we do with BLOB. This should be fixed when merging
to the 'main' branch. At the same time we should add a resonable default
value for analyze_max_length, probably 1024, like we have for
max_sort_length.
2025-04-28 12:38:01 +03:00
Oleksandr Byelkin
a8d4642375 Merge branch '10.11' into 11.4 2025-04-26 10:53:02 +02:00
Marko Mäkelä
acd071f599 MDEV-21923: LSN allocation is a bottleneck
The parameter innodb_log_spin_wait_delay will be deprecated and
ignored, because there is no spin loop anymore.

Thanks to commit 685d958e38
and commit a635c40648
multiple mtr_t::commit() can concurrently copy their slice of
mtr_t::m_log to the shared log_sys.buf.  Each writer would allocate
their own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch.  This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.

It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub() as well as a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later)
can translate into loop-free code on AMD64.

Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.

log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove.
Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove.  For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.

recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.

In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Reviewed by: Debarun Banerjee
2025-04-10 13:02:17 +03:00
Marko Mäkelä
f5bd250f5b Merge 10.11 into 11.4 2025-03-28 13:55:21 +02:00
Marko Mäkelä
ba81009f63 MDEV-34863 RAM Usage Changed Significantly Between 10.11 Releases
innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size that
a Linux memory pressure event can lead to shrinking the buffer pool to.
On a memory pressure event, we will attempt to shrink
innodb_buffer_pool_size halfway between its current value and
innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is
specified as 0 or not specified on startup, its default value will be
adjusted to innodb_buffer_pool_size_max, that is, memory pressure events
will be disregarded by default.

buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink
the buffer pool in response to a memory pressure event.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:48 +02:00
Marko Mäkelä
b6923420f3 MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size to be changed in arbitrary 1-megabyte
increments.

innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size.  If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB).  The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.

The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.

Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement
a client disconnect, or server shutdown.  If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.

SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows.  On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Redident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).

The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.

Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent.  The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.

We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the of extents size.

The minimum innodb_buffer_pool_size is 256*5/4 pages.  At the default
innodb_page_size=16k this corresponds to 5 MiB.  However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.

my_large_virtual_alloc(): A new function, similar to my_large_malloc().

my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.

After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified.  This option is hard-coded on Microsoft Windows,
where VirtualMemory(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately.  On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size.  Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.

opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.

buf_pool_t::chunk_t::create(): Remove.

buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.

buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().

buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.

buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().

buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().

buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.

buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.

buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.

buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.

buf_pool_t::is_shrinking(): Return the total number or blocks that
were withdrawn or are to be withdrawn.

buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.

buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.

buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.

buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().

innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.

buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.

buf_buddy_condense_free(): Do not relocate blocks.

buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().

buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.

buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().

buf_buddy_alloc_from(): Remove the redundant parameter j.

buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.

buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.

trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:44 +02:00
mariadb-DebarunBanerjee
a8e35a1cc6 MDEV-36149 UBSAN in X is outside the range of representable values of type 'unsigned long' | page_cleaner_flush_pages_recommendation
Currently it is allowed to set innodb_io_capacity to very large value
up to unsigned 8 byte maximum value 18446744073709551615. While
calculating the number of pages to flush, we could sometime go beyond
innodb_io_capacity. Specifically, MDEV-24369 has introduced a logic
for aggressive flushing when dirty page percentage in buffer pool
exceeds innodb_max_dirty_pages_pct. So, when innodb_io_capacity is
set to very large value and dirty page percentage exceeds the
threshold, there is a multiplication overflow in Innodb page cleaner.

Fix: We should prevent setting io_capacity to unrealistic values and
define a practical limit to it. The patch introduces limits for
innodb_io_capacity_max and innodb_io_capacity to the maximum of 4 byte
unsigned integer i.e. 4294967295 (2^32-1). For 16k page size this limit
translates to 64 TiB/sec write IO speed which looks sufficient.

Reviewed by: Marko Mäkelä
2025-03-17 11:44:09 +05:30
Marko Mäkelä
38e420ba6e Merge 10.11 into 11.4 2025-03-11 13:47:46 +02:00
Marko Mäkelä
652f33e0a4 MDEV-30000: Force an InnoDB checkpoint in mariadb-backup
At the start of mariadb-backup --backup, trigger a flush of the
InnoDB buffer pool, so that as little log as possible will have
to be copied.

The previously debug-build-only interface
SET GLOBAL innodb_log_checkpoint_now=ON;
will be made available on all builds, and
mariadb-backup --backup will invoke it, unless the option
--skip-innodb-log-checkpoint-now is specified.

Reviewed by: Vladislav Vaintroub
2025-03-10 08:48:43 +02:00
Marko Mäkelä
49a6baec56 Merge 10.11 into 11.4 2025-03-03 11:07:56 +02:00
Marko Mäkelä
1ed09cfdcb MDEV-35000 preparation: Clean up dict_table_t::stat
innodb_stats_transient_sample_pages, innodb_stats_persistent_sample_pages:
Change the type to UNSIGNED, because the number of pages in a table
is limited to 32 bits by the InnoDB file format.

btr_get_size_and_reserved(), fseg_get_n_frag_pages(),
fseg_n_reserved_pages_low(), fseg_n_reserved_pages(): Return uint32_t.
The file format limits page numbers to 32 bits.

dict_table_t::stat: An Atomic_relaxed<uint32_t> that combines a
number of metadata fields.

innodb_copy_stat_flags(): Copy the statistics flags from
TABLE_SHARE or HA_CREATE_INFO.

dict_table_t::stats_initialized(), dict_table_t::stats_is_persistent():
Accessors to dict_table_t::stat.

Reviewed by: Thirunarayanan Balathandayuthapani
2025-02-28 08:55:16 +02:00
Sergei Petrunia
43c5d1303f MDEV-35958 Cost estimates for materialized derived tables are poor
Backport of commit 74f70c3944 to 10.11.
The new logic is disabled by default, to enable, use
optimizer_adjust_secondary_key_costs=fix_derived_table_read_cost.

== Original commit comment ==
Fixed costs in JOIN_TAB::estimate_scan_time() and HEAP

Estimate_scan_time() calculates the cost of scanning a derivied table.
The old code did not take into account that the temporary table heap table
may be converted to Aria.

  Things fixed:
  - Added checking if the temporary tables data will fit in the heap.
    If not, then calculate the cost based on the designated internal
    temporary table engine (Aria).
  - Removed MY_MAX(records, 1000) and instead trust the optimizer's
    estimate of records. This reduces the cost of temporary tables a bit
    for small tables, which caused a few changes in mtr results.
  - Fixed cost calculation for HEAP.
  - HEAP costs->row_next_find_cost was not set. This does not affect old
    costs calculation as this cost slot was not used anywhere.
    Now HEAP cost->row_next_find_cost is set, which allowed me to remove
    some duplicated computation in ha_heap::scan_time()
2025-02-10 21:14:01 +02:00
Julius Goryavsky
72f21560d5 Merge branch '10.6' into '10.11' 2025-02-02 23:17:20 +01:00
Julius Goryavsky
b3925982a0 MDEV-29755 post-merge for 10.6+
1) remove unnecessary restriction on changing mode;
2) adding lost wsrep_replicate_myisam_basic test.
2025-02-02 13:58:08 +01:00
Julius Goryavsky
53c693ec2f Merge branch '10.5' into '10.6' 2025-02-02 12:55:16 +01:00
Jan Lindström
7d69902d83 MDEV-29775 : Assertion `0' failed in void Protocol::end_statement() when adding data to the MyISAM table after setting wsrep_mode=replicate_myisam
If wsrep_replicate_myisam=ON we allow wsrep_forced_binlog_format
to be [DEFAULT|ROW].

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2025-02-02 04:16:05 +01:00
Sergei Golubchik
7d657fda64 Merge branch '10.11 into 11.4 2025-01-30 12:01:11 +01:00
Sergei Golubchik
e69f8cae1a Merge branch '10.6' into 10.11 2025-01-30 11:55:13 +01:00
Alexander Barkov
c1559f261f MDEV-35688 UBSAN: SUMMARY: UndefinedBehaviorSanitizer: nullptr-with-offset in my_casedn_utf8mb3
The functions MY_CHARSET_HANDLER::caseup() and MY_CHARSET_HANDLER::casedn()
in their virtual imlementations do "const char *end= src + srclen"
in the very beginning. Therefore src cannot be NULL to avoid
"UBSAN: SUMMARY: UndefinedBehaviorSanitizer: nullptr-with-offset".

Adding DBUG_ASSERT(src != NULL) into all virtual implementations,
to catch this problem in regular Debug builds (without UBSAN).

Fixing Master_info_index::get_master_info() to check connection_name->str.
If it is NULL then passing empty_clex_str into IdentBufferCasedn
instead of *connection_name.
2025-01-20 20:01:48 +04:00
Marko Mäkelä
98dbe3bfaf Merge 10.5 into 10.6 2025-01-20 09:57:37 +02:00
ParadoxV5
cbb24d9aa5 MDEV-35646: Limit pseudo_thread_id to UINT32_MAX
Although the `my_thread_id` type is 64 bits, binlog format specs
limits it to 32 bits in practice. (See also: MDEV-35706)

The writable SQL variable `pseudo_thread_id` didn’t realize this though
and had a range of `ULONGLONG_MAX` (at least `UINT64_MAX` in C/C++).
It consequentially accepted larger values silently, but only the lower
32 bits of whom gets binlogged; this could lead to inconsistency.

Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
2025-01-18 17:27:54 -07:00
Sergei Golubchik
f1a7693bc0 Merge branch '10.11' into 11.4 2025-01-14 23:45:41 +01:00
Marko Mäkelä
42e6058629 MDEV-35785 innodb_log_file_mmap is not defined on 32-bit systems
innodb_log_file_mmap: Use a constant documentation string that
refers to persistent memory also when it is not available in the build.

HAVE_INNODB_MMAP: Remove, and unconditionally enable this code.

log_mmap(): On 32-bit systems, ensure that the size fits in 32 bits.

log_t::resize_start(), log_t::resize_abort(): Only handle memory-mapping
if HAVE_PMEM is defined. The generic memory-mapped interface is only for
reading the log in recovery. Writable memory mappings are only for
persistent memory, that is, Linux file systems with mount -o dax.

Reviewed by: Debarun Banerjee, Otto Kekäläinen
2025-01-13 07:27:17 +02:00
Sergei Golubchik
221aa5e08f Merge branch '10.6' into 10.11 2025-01-10 13:14:42 +01:00
Sergei Golubchik
90bd638159 32-bit rdiff fixes 2025-01-09 10:00:36 +01:00
Marko Mäkelä
17f01186f5 Merge 10.11 into 11.4 2025-01-09 07:58:08 +02:00
Marko Mäkelä
420d9eb27f Merge 10.6 into 10.11 2025-01-08 12:51:26 +02:00
Monty
88d9348dfc Remove dates from all rdiff files 2025-01-05 16:40:11 +02:00
Monty
e600f9aebb MDEV-35750 Change MEM_ROOT allocation sizes to reduse calls to malloc() and avoid memory fragmentation
This commit updates default memory allocations size used with MEM_ROOT
objects to minimize the number of calls to malloc().

Changes:
- Updated MEM_ROOT block sizes in sql_const.h
- Updated MALLOC_OVERHEAD to also take into account the extra memory
  allocated by my_malloc()
- Updated init_alloc_root() to only take MALLOC_OVERHEAD into account as
  buffer size, not MALLOC_OVERHEAD + sizeof(USED_MEM).
- Reset mem_root->first_block_usage if and only if first block was used.
- Increase MEM_ROOT buffers sized used by my_load_defaults, plugin_init,
  Create_tmp_table, allocate_table_share, TABLE and TABLE_SHARE.
  This decreases number of malloc calls during queries.
- Use a small buffer for THD->main_mem_root in THD::THD. This avoids
  multiple malloc() call for new connections.

I tried the above changes on a complex select query with 12 tables.
The following shows the number of extra allocations that where used
to increase the size of the MEM_ROOT buffers.

Original code:
- Connection to MariaDB:   9 allocations
- First query run:       146 allocations
- Second query run:       24 allocations

Max memory allocated for thd when using with heap table:  61,262,408
Max memory allocated for thd when using Aria tmp table:      419,464

After changes:
Connection to MariaDB:     0 allocations
- First run:              25 allocations
- Second run:              7 allocations

Max memory allocated for thd when using with heap table:  61,347,424
Max memory allocated for thd when using Aria table:          529,168

The new code uses slightly more memory, but avoids memory fragmentation
and is slightly faster thanks to much fewer calls to malloc().

Reviewed-by: Sergei Golubchik <serg@mariadb.org>
2025-01-05 16:40:11 +02:00
Monty
52c29f3bdc MDEV-35469 Heap tables are calling mallocs to often
Heap tables are allocated blocks to store rows according to
my_default_record_cache (mapped to the server global variable
 read_buffer_size).
This causes performance issues when the record length is big
(> 1000 bytes) and the my_default_record_cache is small.

Changed to instead split the default heap allocation to 1/16 of the
allowed space and not use my_default_record_cache anymore when creating
the heap. The allocation is also aligned to be just under a power of 2.

For some test that I have been running, which was using record length=633,
the speed of the query doubled thanks to this change.

Other things:
- Fixed calculation of max_records passed to hp_create() to take
  into account padding between records.
- Updated calculation of memory needed by heap tables. Before we
  did not take into account internal structures needed to access rows.
- Changed block sized for memory_table from 1 to 16384 to get less
  fragmentation. This also avoids a problem where we need 1K
  to manage index and row storage which was not counted for before.
- Moved heap memory usage to a separate test for 32 bit.
- Allocate all data blocks in heap in powers of 2. Change reported
  memory usage for heap to reflect this.

Reviewed-by: Sergei Golubchik <serg@mariadb.org>
2025-01-05 16:40:11 +02:00
Oleksandr Byelkin
c770bce898 Merge branch '11.2' into 11.4 2024-10-30 15:11:17 +01:00
Oleksandr Byelkin
69d033d165 Merge branch '10.11' into 11.2 2024-10-29 16:42:46 +01:00
Oleksandr Byelkin
3d0fb15028 Merge branch '10.6' into 10.11 2024-10-29 15:24:38 +01:00
Vlad Lesin
8c7786e7d5 MDEV-34690 lock_rec_unlock_unmodified() causes deadlock
lock_rec_unlock_unmodified() is executed either under lock_sys.wr_lock()
or under a combination of lock_sys.rd_lock() + record locks hash table
cell latch. It also requests page latch to check if locked records were
changed by the current transaction or not.

Usually InnoDB requests page latch to find the certain record on the
page, and then requests lock_sys and/or record lock hash cell latch to
request record lock. lock_rec_unlock_unmodified() requests the latches
in the opposite order, what causes deadlocks. One of the possible
scenario for the deadlock is the following:

thread 1 - lock_rec_unlock_unmodified() is invoked under locks hash table
           cell latch, the latch is acquired;
thread 2 - purge thread acquires page latch and tries to remove
           delete-marked record, it invokes lock_update_delete(), which
           requests locks hash table cell latch, held by thread 1;
thread 1 - requests page latch, held by thread 2.

To fix it we need to release lock_sys.latch and/or lock hash cell latch,
acquire page latch and re-acquire lock_sys related latches.

When lock_sys.latch and/or lock hash cell latch are released in
lock_release_on_prepare() and lock_release_on_prepare_try(), the page on
which the current lock is held, can be merged. In this case the bitmap
of the current lock must be cleared, and the new lock must be added to
the end of trx->lock.trx_locks list, or bitmap of already existing lock
must be changed.

The new field trx_lock_t::set_nth_bit_calls indicates if new locks
(bits in existing lock bitmaps or new lock objects) were created during
the period when lock_sys was released in trx->lock.trx_locks list
iteration loop in lock_release_on_prepare() or
lock_release_on_prepare_try(). And, if so, we traverse the list again.

The block can be freed during pages merging, what causes assertion
failure in buf_page_get_gen(), as btr_block_get() passes BUF_GET as page
get mode to it. That's why page_get_mode parameter was added to
btr_block_get() to pass BUF_GET_POSSIBLY_FREED from
lock_release_on_prepare() and lock_release_on_prepare_try() to
buf_page_get_gen().

As searching for id of trx, which modified secondary index record, is
quite expensive operation, restrict its usage for master. System variable
was added to remove the restriction for testing simplifying. The
variable exists only either for debug build or for build with
-DINNODB_ENABLE_XAP_UNLOCK_UNMODIFIED_FOR_PRIMARY option to increase the
probability of catching bugs for release build with RQG.

Note that the code, which does primary index lookup to find out what
transaction modified secondary index record, is necessary only when
there is no primary key and no unique secondary key on replica with row
based replication, because only in this case extra X locks on unmodified
records can be set during scan phase.

Reviewed by Marko Mäkelä.
2024-10-23 12:36:17 +03:00
Sergei Golubchik
3a1cf2c85b MDEV-34679 ER_BAD_FIELD uses non-localizable substrings 2024-10-17 21:37:37 +02:00
Sergei Golubchik
5ebda30ccc Revert "MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2"
This reverts commit 8ae462a220.
2024-10-16 13:23:47 +02:00
Kristian Nielsen
8ae462a220 MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2
Implement variable legacy_xa_rollback_at_disconnect to support
backwards compatibility for applications that rely on the pre-10.5
behavior for connection disconnect, which is to rollback the
transaction (in violation of the XA specification).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-10-16 10:18:36 +02:00
Marko Mäkelä
b53b81e937 Merge 11.2 into 11.4 2024-10-03 14:32:14 +03:00
Marko Mäkelä
12a91b57e2 Merge 10.11 into 11.2 2024-10-03 13:24:43 +03:00
Marko Mäkelä
63913ce5af Merge 10.6 into 10.11 2024-10-03 10:55:08 +03:00
Marko Mäkelä
7e0afb1c73 Merge 10.5 into 10.6 2024-10-03 09:31:39 +03:00
Sergei Petrunia
1cda4726ca MDEV-34993, part2: backport optimizer_adjust_secondary_key_costs
...and make the fix for MDEV-34993 switchable. It is enabled by default
and controlled with @optimizer_adjust_secondary_key_costs=fix_card_multiplier
2024-10-02 10:52:09 +03:00
Sergei Petrunia
8166a5d33d MDEV-34993: Incorrect cardinality estimation causes poor query plan
When calculate_cond_selectivity_for_table() takes into account multi-
column selectivities from range access, it tries to take-into account
that selectivity for some columns may have been already taken into account.

For example, for range access on IDX1 using {kp1, kp2}, the selectivity
of restrictions on "kp2" might have already been taken into account
to some extent.
So, the code tries to "discount" that using rec_per_key[] estimates.

This seems to be wrong and unreliable: the "discounting" may produce a
rselectivity_multiplier number that hints that the overall selectivity
of range access on IDX1 was greater than 1.

Do a conservative fix: if we arrive at conclusion that selectivity of
range access on condition in IDX1 >1.0, clip it down to 1.
2024-10-02 10:52:09 +03:00
Marko Mäkelä
6acada713a MDEV-34062: Implement innodb_log_file_mmap on 64-bit systems
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.

We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").

We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.

Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.

The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().

Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.

It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.

::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.

xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
2024-09-26 18:47:12 +03:00
Sergei Petrunia
2c3b298337 Merge 11.2 into 11.4 2024-09-09 14:40:02 +03:00
Sergei Petrunia
abd98336d2 Merge 10.11 -> 11.2 2024-09-09 13:50:38 +03:00
Yuchen Pei
d002b1f503 Merge branch '10.6' into 10.11 2024-09-09 11:34:19 +10:00