In commit b6923420f3 (MDEV-29445)
we started to specify the MAP_POPULATE flag for allocating the
InnoDB buffer pool. This would cause a lot of time to be spent
on __mm_populate() inside the Linux kernel, such as 16 seconds
to pre-fault or commit innodb_buffer_pool_size=64G.
Let us revert to the previous way of allocating the buffer pool
at startup. Note: An attempt to increase the buffer pool size by
SET GLOBAL innodb_buffer_pool_size (up to innodb_buffer_pool_size_max)
will invoke my_virtual_mem_commit(), which will use MAP_POPULATE
to zero-fill and prefault the requested additional memory area, blocking
buf_pool.mutex.
Before MDEV-29445 we allocated the InnoDB buffer pool by invoking
mmap(2) once (via my_large_malloc()). After the change, we would
invoke mmap(2) twice, first via my_virtual_mem_reserve() and then
via my_virtual_mem_commit(). Outside Microsoft Windows, we are
reverting back to my_large_malloc() like allocation.
my_virtual_mem_reserve(): Define only for Microsoft Windows.
Other platforms should invoke my_large_virtual_alloc() and
update_malloc_size() instead of my_virtual_mem_reserve() and
my_virtual_mem_commit().
my_large_virtual_alloc(): Define only outside Microsoft Windows.
Do not specify MAP_NORESERVE nor MAP_POPULATE, to preserve compatibility
with my_large_malloc(). Were MAP_POPULATE specified, the mmap()
system call would be significantly slower, for example 18 seconds
to reserve 64 GiB upfront.
Fixing the code adding MySQL _0900_ collations as _uca1400_ aliases
not to perform deep initialization of the corresponding _uca1400_
collations.
Only basic initialization is now performed which allows to watch
these collations (both _0900_ and _uca1400_) in queries to
INFORMATION_SCHEMA tables COLLATIONS and
COLLATION_CHARACTER_SET_APPLICABILITY,
as well as in SHOW COLLATION statements.
Deep initialization is now performed only when a collation
(either the _0900_ alias or the corresponding _uca1400_ collation)
is used for the very first time after the server startup.
Refactoring was done to maintain the code easier:
- most of the _uca1400_ code was moved from ctype-uca.c
to a new file ctype-uca1400.c
- most of the _0900_ code was moved from type-uca.c
to a new file ctype-uca0900.c
Change details:
- The original function add_alias_for_collation() added by the patch for
"MDEV-20912 Add support for utf8mb4_0900_* collations in MariaDB Server"
was removed from mysys/charset.c, as it had two two problems:
a. it forced deep initialization of the _uca1400_ collations
when adding _0900_ aliases for them at the server startup
(the main reported problem)
b. the collation initialization code in add_alias_for_collation()
was related more to collations rather than to memory management,
so /strings should be a better place for it than /mysys.
The code from add_alias_for_collation() was split into separate functions.
Cyclic dependency was removed. `#include <my_sys.h>` was removed
from /strings/ctype-uca.c. Collations are now added using a callback
function MY_CHARSET_LOADED::add_collation, like it is done for
user collations defined in Index.xml. The code in /mysys sets
MY_CHARSET_LOADED::add_collation to add_compiled_collation().
- The function compare_collations() was removed.
A new virtual function was added into my_collation_handler_st instead:
my_bool (*eq_collation)(CHARSET_INFO *self, CHARSET_INFO *other);
because it is the collation handler who knows how to detect equal
collations by comparing only some of CHARSET_INFO members without
their deep initialization.
Three implementations were added:
- my_ci_eq_collation_uca() for UCA collations, it compares
_0900_ collations as equal to their corresponding _uca1400_ collations.
- my_ci_eq_collation_utf8mb4_bin(), it compares
utf8mb4_nopad_bin and utf8mb4_0900_bin as equal.
- my_ci_eq_collation_generic() - the default implementation,
which compares all collations as not equal.
A C++ wrapper CHARSET_INFO::eq_collations() was added.
The code in /sql was changes to use the wrapper instead of
the former calls for the removed function compare_collations().
- A part of add_alias_for_collation() was moved into a new function
my_ci_alloc(). It allocates a memory for a new charset_info_st
instance together with the collation name and the comment using a single
MY_CHARSET_LOADER::once_alloc call, which points to my_once_alloc()
in the server.
- A part of add_alias_for_collation() was moved into a new function
my_ci_make_comment_for_alias(). It makes an "Alias for xxx" string,
e.g. "Alias for utf8mb4_uca1400_swedish_ai_ci" in case of
utf8mb4_sv_0900_ai_ci.
- A part of the code in create_tailoring() was moved to
a new function my_uca1400_collation_get_initialized_shared_uca(),
to reuse the code between _uca1400_ and _0900_ collations.
- A new function my_collation_id_is_mysql_uca0900() was added
in addition to my_collation_id_is_mysql_uca1400().
- Functions to build collation names were added:
my_uca0900_collation_build_name()
my_uca1400_collation_build_name()
- A shared function function was added:
my_bool
my_uca1400_collation_alloc_and_init(MY_CHARSET_LOADER *loader,
LEX_CSTRING name,
LEX_CSTRING comment,
const uca_collation_def_param_t *param,
uint id)
It's reused to add _uca1400_ and _0900_ collations, with basic
initialization (without deep initialization).
- The function add_compiled_collation() changed its return type from
void to int, to make it compatible with MY_CHARSET_LOADER::add_collation.
- Functions mysql_uca0900_collation_definition_add(),
mysql_uca0900_utf8mb4_collation_definitions_add(),
mysql_utf8mb4_0900_bin_add() were added into ctype-uca0900.c.
They get MY_CHARSET_LOADER as a parameter.
- Functions my_uca1400_collation_definition_add(),
my_uca1400_collation_definitions_add() were moved from
charset-def.c to strings/ctype-uca1400.c.
The latter now accepts MY_CHARSET_LOADER as the first parameter
instead of initializing a MY_CHARSET_LOADER inside.
- init_compiled_charsets() now initializes a MY_CHARSET_LOADER
variable and passes it to all functions adding collations:
- mysql_utf8mb4_0900_collation_definitions_add()
- mysql_uca0900_utf8mb4_collation_definitions_add()
- mysql_utf8mb4_0900_bin_add()
- A new structure was added into ctype-uca.h:
typedef struct uca_collation_def_param
{
my_cs_encoding_t cs_id;
uint tailoring_id;
uint nopad_flags;
uint level_flags;
} uca_collation_def_param_t;
It simplifies reusing the code for _uca1400_ and _0900_ collations.
- The definition of MY_UCA1400_COLLATION_DEFINITION was
moved from ctype-uca.c to ctype-uca1400.h, to reuse
the code for _uca1400_ and _0900_ collations.
- The definitions of "MY_UCA_INFO my_uca_v1400" and
"MY_UCA_INFO my_uca1400_info_tailored[][]" were moved from
ctype-uca.c to ctype-uca1400.c.
- The definitions/declarations of:
- mysql_0900_collation_start,
- struct mysql_0900_to_mariadb_1400_mapping
- mysql_0900_to_mariadb_1400_mapping
- mysql_utf8mb4_0900_collation_definitions_add()
were moved from ctype-uca.c to ctype-uca0900.c
- Functions
my_uca1400_make_builtin_collation_id()
my_uca1400_collation_definition_init()
my_uca1400_collation_id_uca400_compat()
my_ci_get_collation_name_uca1400_context()
were moved from ctype-uca.c to ctype-uca1400.c and ctype-uca1400.h
- A part of my_uca1400_collation_definition_init()
was moved into my_uca0520_builtin_collation_by_id(),
to make functions smaller.
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size to be changed in arbitrary 1-megabyte
increments.
innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size. If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.
The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.
Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement
a client disconnect, or server shutdown. If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.
SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows. On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Redident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).
The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.
Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent. The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.
We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the of extents size.
The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default
innodb_page_size=16k this corresponds to 5 MiB. However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.
my_large_virtual_alloc(): A new function, similar to my_large_malloc().
my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.
After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified. This option is hard-coded on Microsoft Windows,
where VirtualMemory(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately. On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size. Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.
opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.
buf_pool_t::chunk_t::create(): Remove.
buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.
buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().
buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.
buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().
buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().
buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.
buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.
buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.
buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.
buf_pool_t::is_shrinking(): Return the total number or blocks that
were withdrawn or are to be withdrawn.
buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.
buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.
buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.
buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().
innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.
buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.
buf_buddy_condense_free(): Do not relocate blocks.
buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().
buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.
buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().
buf_buddy_alloc_from(): Remove the redundant parameter j.
buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.
buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.
trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.
Reviewed by: Debarun Banerjee
[Breaking]
Good news:
GCC now checks your `my_snprintf` (direct) calls (`-Wformat` was on);
Bad news:
The build process no longer lets your incorrect
formats/arguments sneak past (`-Werror` was also on).
As such, this commit also migrates all direct `my_snprintf` calls from
the old specifiers to MDEV-21978’s new `-Wformat`-compatible suffixes.
The next commits will cover non-direct calls to `my_snprintf`.
(I call them “`my_snprintf` descendants”.)
This commit does not update the ABI records because
there’re more ABI “changes” to come – half a dozen
`include/mysql/plugin_*.h.pp`s are now missing the new `__attribute__`s.
* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)
The problem was that get_collation_number_internal() loops over all
collations for finding a collation based on name. For looking up
utf8mb4_0900_ aliases it used 22633 character strings comparisons at
startup.
Fixed by adding the MariaDB internal collation number in the "0900" alias
lookup array. This is fine as collation numbers never changes.
Discussed-with: serg@mariadb.com
This commit updates default memory allocations size used with MEM_ROOT
objects to minimize the number of calls to malloc().
Changes:
- Updated MEM_ROOT block sizes in sql_const.h
- Updated MALLOC_OVERHEAD to also take into account the extra memory
allocated by my_malloc()
- Updated init_alloc_root() to only take MALLOC_OVERHEAD into account as
buffer size, not MALLOC_OVERHEAD + sizeof(USED_MEM).
- Reset mem_root->first_block_usage if and only if first block was used.
- Increase MEM_ROOT buffers sized used by my_load_defaults, plugin_init,
Create_tmp_table, allocate_table_share, TABLE and TABLE_SHARE.
This decreases number of malloc calls during queries.
- Use a small buffer for THD->main_mem_root in THD::THD. This avoids
multiple malloc() call for new connections.
I tried the above changes on a complex select query with 12 tables.
The following shows the number of extra allocations that where used
to increase the size of the MEM_ROOT buffers.
Original code:
- Connection to MariaDB: 9 allocations
- First query run: 146 allocations
- Second query run: 24 allocations
Max memory allocated for thd when using with heap table: 61,262,408
Max memory allocated for thd when using Aria tmp table: 419,464
After changes:
Connection to MariaDB: 0 allocations
- First run: 25 allocations
- Second run: 7 allocations
Max memory allocated for thd when using with heap table: 61,347,424
Max memory allocated for thd when using Aria table: 529,168
The new code uses slightly more memory, but avoids memory fragmentation
and is slightly faster thanks to much fewer calls to malloc().
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
This is done by mapping most of the existing MySQL unicode 0900 collations
to MariadB 1400 unicode collations. The assumption is that 1400 is a super
set of 0900 for all practical purposes.
I also added a new function 'compare_collations()' and changed most code
to use this instead of comparing character sets directly.
This enables one to seamlessly mix-and-match the corresponding 0900 and
1400 sets. Field comparision and alter table treats the character sets
as identical.
All MySQL 8.0 0900 collations are supported except:
- utf8mb4_ja_0900_as_cs
- utf8mb4_ja_0900_as_cs_ks
- utf8mb4_ru_0900_as_cs
- utf8mb4_zh_0900_as_cs
These do not have corresponding entries in the MariadB 01400 collations.
Other things:
- Added COMMENT colum to information_schema.collations. For utf8mb4_0900
colletions it contains the corresponding alias collation.
os_innodb_umask was of the incorrect type resulting in warnings
in clang-19. The correct type is mode_t.
As os_innodb_umask was set during innnodb_init from my_umask,
corrected the type there along with its companion my_umask_dir.
Because of this, the defaults mask values in innodb never
had an effect.
The resulting change allow found signed differences in
my_create{,_nosymlink}, open_nosymlinks:
mysys/my_create.c:47:20: error: operand of ?: changes signedness from ‘int’ to ‘mode_t’ {aka ‘unsigned int’} due to unsignedness of other operand [-Werror=sign-compare]
47 | CreateFlags ? CreateFlags : my_umask);
Ref: clang-19 warnings:
[55/123] Building CXX object storage/innobase/CMakeFiles/innobase.dir/os/os0file.cc.o
storage/innobase/os/os0file.cc:1075:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1075 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1249:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1249 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1381:45: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1381 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
Partial commit of the greater MDEV-34348 scope.
MDEV-34348: MariaDB is violating clang-16 -Wcast-function-type-strict
The functions queue_compare, qsort2_cmp, and qsort_cmp2
all had similar interfaces, and were used interchangable
and unsafely cast to one another.
This patch consolidates the functions all into the
qsort_cmp2 interface.
Reviewed By:
============
Marko Mäkelä <marko.makela@mariadb.com>
* preserve the graph in memory between statements
* keep it in a TABLE_SHARE, available for concurrent searches
* nodes are generally read-only, walking the graph doesn't change them
* distance to target is cached, calculated only once
* SIMD-optimized bloom filter detects visited nodes
* nodes are stored in an array, not List, to better utilize bloom filter
* auto-adjusting heuristic to estimate the number of visited nodes
(to configure the bloom filter)
* many threads can concurrently walk the graph. MEM_ROOT and Hash_set
are protected with a mutex, but walking doesn't need them
* up to 8 threads can concurrently load nodes into the cache,
nodes are partitioned into 8 mutexes (8 is chosen arbitrarily, might
need tuning)
* concurrent editing is not supported though
* this is fine for MyISAM, TL_WRITE protects the TABLE_SHARE and the
graph (note that TL_WRITE_CONCURRENT_INSERT is not allowed, because an
INSERT into the main table means multiple UPDATEs in the graph)
* InnoDB uses secondary transaction-level caches linked in a list in
in thd->ha_data via a fake handlerton
* on rollback the secondary cache is discarded, on commit nodes
from the secondary cache are invalidated in the shared cache
while it is exclusively locked
* on savepoint rollback both caches are flushed. this can be improved
in the future with a row visibility callback
* graph size is controlled by @@mhnsw_cache_size, the cache is flushed
when it reaches the threshold
savepoint support was added a year ago as get_last_memroot_block()
and free_all_new_blocks(), but it didn't work and was disabled.
* fix it to work
* instead of freeing the memory, only mark blocks free - this feature
is supposed to be used inside a loop (otherwise there is no need
to free anything, end of statement will do it anyway). And freeing
blocks inside a loop is a bad idea, as they'll be all malloc-ed
on the next iteration again. So, don't.
* fix a bug in mark_blocks_free() - it doesn't change the number of blocks
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
Apart from better performance when accessing thread local variables,
we'll get rid of things that depend on initialization/cleanup of
pthread_key_t variables.
Where appropriate, use compiler-dependent pre-C++11 thread-local
equivalents, where it makes sense, to avoid initialization check overhead
that non-static thread_local can suffer from.
Two new variables added:
- max_tmp_space_usage : Limits the the temporary space allowance per user
- max_total_tmp_space_usage: Limits the temporary space allowance for
all users.
New status variables: tmp_space_used & max_tmp_space_used
New field in information_schema.process_list: TMP_SPACE_USED
The temporary space is counted for:
- All SQL level temporary files. This includes files for filesort,
transaction temporary space, analyze, binlog_stmt_cache etc.
It does not include engine internal temporary files used for repair,
alter table, index pre sorting etc.
- All internal on disk temporary tables created as part of resolving a
SELECT, multi-source update etc.
Special cases:
- When doing a commit, the last flush of the binlog_stmt_cache
will not cause an error even if the temporary space limit is exceeded.
This is to avoid giving errors on commit. This means that a user
can temporary go over the limit with up to binlog_stmt_cache_size.
Noteworthy issue:
- One has to be careful when using small values for max_tmp_space_limit
together with binary logging and with non transactional tables.
If a the binary log entry for the query is bigger than
binlog_stmt_cache_size and one hits the limit of max_tmp_space_limit
when flushing the entry to disk, the query will abort and the
binary log will not contain the last changes to the table.
This will also stop the slave!
This is also true for all Aria tables as Aria cannot do rollback
(except in case of crashes)!
One way to avoid it is to use @@binlog_format=statement for
queries that updates a lot of rows.
Implementation:
- All writes to temporary files or internal temporary tables, that
increases the file size, are routed through temp_file_size_cb_func()
which updates and checks the temp space usage.
- Most of the temporary file monitoring is done inside IO_CACHE.
Temporary file monitoring is done inside the Aria engine.
- MY_TRACK and MY_TRACK_WITH_LIMIT are new flags for ini_io_cache().
MY_TRACK means that we track the file usage. TRACK_WITH_LIMIT means
that we track the file usage and we give an error if the limit is
breached. This is used to not give an error on commit when
binlog_stmp_cache is flushed.
- global_tmp_space_used contains the total tmp space used so far.
This is needed quickly check against max_total_tmp_space_usage.
- Temporary space errors are using EE_LOCAL_TMP_SPACE_FULL and
handler errors are using HA_ERR_LOCAL_TMP_SPACE_FULL.
This is needed until we move general errors to it's own error space
so that they cannot conflict with system error numbers.
- Return value of my_chsize() and mysql_file_chsize() has changed
so that -1 is returned in the case my_chsize() could not decrease
the file size (very unlikely and will not happen on modern systems).
All calls to _chsize() are updated to check for > 0 as the error
condition.
- At the destruction of THD we check that THD::tmp_file_space == 0
- At server end we check that global_tmp_space_used == 0
- As a precaution against errors in the tmp_space_used code, one can set
max_tmp_space_usage and max_total_tmp_space_usage to 0 to disable
the tmp space quota errors.
- truncate_io_cache() function added.
- Aria tables using static or dynamic row length are registered in 8K
increments to avoid some calls to update_tmp_file_size().
Other things:
- Ensure that all handler errors are registered. Before, some engine
errors could be printed as "Unknown error".
- Fixed bug in filesort() that causes a assert if there was an error
when writing to the temporay file.
- Fixed that compute_window_func() now takes into account write errors.
- In case of parallel replication, rpl_group_info::cleanup_context()
could call trans_rollback() with thd->error set, which would cause
an assert. Fixed by resetting the error before calling trans_rollback().
- Fixed bug in subselect3.inc which caused following test to use
heap tables with low value for max_heap_table_size
- Fixed bug in sql_expression_cache where it did not overflow
heap table to Aria table.
- Added Max_tmp_disk_space_used to slow query log.
- Fixed some bugs in log_slow_innodb.test
This task is to ensure we have a clear definition and rules of how to
repair or optimize a table.
The rules are:
- REPAIR should be used with tables that are crashed and are
unreadable (hardware issues with not readable blocks, blocks with
'unexpected data' etc)
- OPTIMIZE table should be used to optimize the storage layout for the
table (recover space for delete rows and optimize the index
structure.
- ALTER TABLE table_name FORCE should be used to rebuild the .frm file
(the table definition) and the table (with the original table row
format). If the table is from and older MariaDB/MySQL release with a
different storage format, it will convert the data to the new
format. ALTER TABLE ... FORCE is used as part of mariadb-upgrade
Here follows some more background:
The 3 ways to repair a table are:
1) ALTER TABLE table_name FORCE" (not other options).
As an alias we allow: "ALTER TABLE table_name ENGINE=original_engine"
2) "REPAIR TABLE" (without FORCE)
3) "OPTIMIZE TABLE"
All of the above commands will optimize row space usage (which means that
space will be needed to hold a temporary copy of the table) and
re-generate all indexes. They will also try to replicate the original
table definition as exact as possible.
For ALTER TABLE and "REPAIR TABLE without FORCE", the following will hold:
If the table is from an older MariaDB version and data conversion is
needed (for example for old type HASH columns, MySQL JSON type or new
TIMESTAMP format) "ALTER TABLE table_name FORCE, algorithm=COPY" will be
used.
The differences between the algorithms are
1) Will use the fastest algorithm the engine supports to do a full repair
of the table (except if data conversions are is needed).
2) Will use the storage engine internal REPAIR facility (MyISAM, Aria).
If the engine does not support REPAIR then
"ALTER TABLE FORCE, ALGORITHM=COPY" will be used.
If there was data incompatibilities (which means that FORCE was used)
then there will be a warning after REPAIR that ALTER TABLE FORCE is
still needed.
The reason for this is that REPAIR may be able to go around data
errors (wrong incompatible data, crashed or unreadable sectors) that
ALTER TABLE cannot do.
3) Will use the storage engine internal OPTIMIZE. If engine does not
support optimize, then "ALTER TABLE FORCE" is used.
The above will ensure that ALTER TABLE FORCE is able to
correct almost any errors in the row or index data. In case of
corrupted blocks then REPAIR possible followed by ALTER TABLE is needed.
This is important as mariadb-upgrade executes ALTER TABLE table_name
FORCE for any table that must be re-created.
Bugs fixed with InnoDB tables when using ALTER TABLE FORCE:
- No error for INNODB_DEFAULT_ROW_FORMAT=COMPACT even if row length
would be too wide. (Independent of innodb_strict_mode).
- Tables using symlinks will be symlinked after any of the above commands
(Independent of the setting of --symbolic-links)
If one specifies an algorithm together with ALTER TABLE FORCE, things
will work as before (except if data conversion is required as then
the COPY algorithm is enforced).
ALTER TABLE .. OPTIMIZE ALL PARTITIONS will work as before.
Other things:
- FORCE argument added to REPAIR to allow one to first run internal
repair to fix damaged blocks and then follow it with ALTER TABLE.
- REPAIR will not update frm_version if ha_check_for_upgrade() finds
that table is still incompatible with current version. In this case the
REPAIR will end with an error.
- REPAIR for storage engines that does not have native repair, like InnoDB,
is now using ALTER TABLE FORCE.
- REPAIR csv-table USE_FRM now works.
- It did not work before as CSV tables had extension list in wrong
order.
- Default error messages length for %M increased from 128 to 256 to not
cut information from REPAIR.
- Documented HA_ADMIN_XX variables related to repair.
- Added HA_ADMIN_NEEDS_DATA_CONVERSION to signal that we have to
do data conversions when converting the table (and thus ALTER TABLE
copy algorithm is needed).
- Fixed typo in error message (caused test changes).