This feature stores the DDLs of the tables/views that are used in
a query in the optimizer trace. It is controlled by the
system variable store_ddls_in_optimizer_trace, and is not enabled
by default. All DDLs are stored in a single JSON array, with each
element holding the table/view name and the associated CREATE
definition of the table/view.
The approach taken is to read the global query_tables list from
thd->lex in reverse order, create a record with the table name and
the DDL of the table, add the table name to a hash,
and dump the information to the trace.
dbName_plus_tableName is used as the hash key,
and duplicate entries are not added to the hash.
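For example, a minimal sketch (the table name is hypothetical, and it
is assumed the variable can be set at session scope):
```
SET optimizer_trace='enabled=on';
SET store_ddls_in_optimizer_trace=1;
SELECT * FROM t1;
-- the DDL array appears in the trace output:
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;
```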
The main suite tests are also run with the feature enabled, and they
all succeed.
Let's always disconnect a user connection before dropping said user.
MariaDB is traditionally very tolerant of active connections
belonging to a dropped user, which isn't the case for most other
databases. Let's avoid unintentionally spreading this incompatible
behavior and disconnect before the drop,
except in tests that specifically test such behavior.
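A minimal sketch of the pattern (connection and user names are
illustrative):
```
connection default;
disconnect con1;
DROP USER u1;
```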
Remove one of the major sources of race conditions in mariadb-test.
Normally, mariadb_close() sends COM_QUIT to the server and immediately
disconnects. In mariadb-test this means the test can switch to another
connection and send queries to the server before the server has even
started parsing the COM_QUIT packet, and these queries can see the
old connection as fully active, as it hasn't reached dispatch_command
yet.
This is a major source of instability in tests, and many tests (but
not all, still fewer than half) employ workarounds. The correct one
is the count_sessions.inc/wait_until_count_sessions.inc pair.
Also very popular was wait_until_disconnected.inc, which was completely
useless: it verifies that the connection is closed, which after a
disconnect it always is, but it didn't verify whether the server had
processed COM_QUIT. Sadly, the placebo was as widely used as the real
thing.
Let's fix this by making the mariadb-test `disconnect` command _wait_
for the server to confirm. This makes almost all workarounds redundant.
In some cases count_sessions.inc/wait_until_count_sessions.inc is still
needed, though, as only the `disconnect` command is changed (see the
sketch after this list):
* after external tools, like `exec $MYSQL`
* after failed `connect` command
* replication, after `STOP SLAVE`
* Federated/CONNECT/SPIDER/etc after `DROP TABLE`
and also in some XA tests, because an XA transaction is dissociated from
the THD very late, after the server has closed the client connection.
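For example, a sketch of the workaround that is still needed around an
external tool:
```
--source include/count_sessions.inc
--exec $MYSQL -e "SELECT 1"
--source include/wait_until_count_sessions.inc
```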
Collateral cleanups: fix comments, remove some redundant statements:
* DROP IF EXISTS if nothing is known to exist
* DROP table/view before DROP DATABASE
* REVOKE privileges before DROP USER
etc
Statements that intend to modify data have to acquire protection
against an ongoing backup. Before backup locks, protection against
FTWRL was acquired in the form of two shared metadata locks in the
GLOBAL (global read lock) and COMMIT namespaces. These two namespaces
were separate entities; they didn't share data structures or
locking primitives, and thus they were separate contention
points.
With backup locks, introduced by commit 7a9dfdd, these namespaces were
combined into a single BACKUP namespace. It became a single
contention point, which doubled the load on the BACKUP namespace data
structures and locking primitives compared to the GLOBAL and COMMIT
namespaces. In other words, system throughput halved.
MDL fast lanes solve this problem by allowing multiple contention
points for a single MDL_lock. A fast lane is a scalable multi-instance
registry for lightweight locks. Internally it is just a list of
granted tickets, a close counter, and a mutex.
The number of fast lanes (or contention points) is defined by the
metadata_locks_instances system variable. A value of 1 disables fast
lanes, and lock requests are served by the conventional MDL_lock data
structures.
Since fast lanes allow an arbitrary number of contention points, they
outperform the pre-backup-locks GLOBAL and COMMIT namespaces.
Fast lanes are enabled only for the BACKUP namespace. Support for
other namespaces is to be implemented separately.
Lock types are divided into two categories: lightweight and
heavyweight. Lightweight lock types represent DML: MDL_BACKUP_DML,
MDL_BACKUP_TRANS_DML, MDL_BACKUP_SYS_DML, MDL_BACKUP_DDL,
MDL_BACKUP_ALTER_COPY, MDL_BACKUP_COMMIT. They are fully compatible
with each other and are normally served by the corresponding fast
lane, which is determined by thread_id % metadata_locks_instances.
Heavyweight lock types represent an ongoing backup: MDL_BACKUP_START,
MDL_BACKUP_FLUSH, MDL_BACKUP_WAIT_FLUSH, MDL_BACKUP_WAIT_DDL,
MDL_BACKUP_WAIT_COMMIT, MDL_BACKUP_FTWRL1, MDL_BACKUP_FTWRL2,
MDL_BACKUP_BLOCK_DDL. These locks are always served by the
conventional MDL_lock data structures. Whenever such a lock is
requested, the fast lanes are closed and all tickets registered in
them are moved to the conventional MDL_lock data structures. Until
such locks are released or aborted, lightweight lock requests are
served by the conventional MDL_lock data structures.
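For example, the SQL-visible trigger for closing the lanes is an
ongoing backup (BACKUP STAGE START requests MDL_BACKUP_START):
```
BACKUP STAGE START;         -- heavyweight lock: fast lanes are closed
BACKUP STAGE BLOCK_COMMIT;  -- DML waits on the conventional MDL_lock
BACKUP STAGE END;           -- locks released; fast lanes serve DML again
```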
Strictly speaking, moving tickets from fast lanes to the conventional
MDL_lock data structures is not required, but it reduces
complexity and keeps methods like MDL_lock::visit_subgraph(),
MDL_lock::notify_conflicting_locks(), MDL_lock::reschedule_waiters(),
and MDL_lock::can_grant_lock() intact.
It is not even required to register tickets in fast lanes. They
could be implemented on top of an atomic variable that holds two
counters: granted lightweight locks, and granted/waiting heavyweight
locks, similar to the MySQL solution, which roughly speaking has a
"single atomic fast lane". However, it appears this wouldn't bring
any better performance, while the code complexity would be much
higher.
Added option 'aria-pagecache-segments', default 1.
For values > 1, this splits the aria-pagecache-buffer into the given
number of segments, each independent of the others. Having multiple
pagecaches improves performance when multiple connections run queries
concurrently on different tables.
Each pagecache will use aria-pagecache-buffer/segments amount of
memory, but at least 128K.
Each opened table has its index and data file use the segments in
a round-robin fashion.
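For example (values are illustrative): with
--aria-pagecache-buffer-size=1G and --aria-pagecache-segments=8,
each segment gets 128M. The setting and the combined statistics are
visible via:
```
SELECT @@aria_pagecache_segments;
SHOW GLOBAL STATUS LIKE 'Aria_pagecache%';
```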
Internal changes:
- All programs allocating the Aria pagecache themselves should now
  call multi_init_pagecache() instead of init_pagecache().
- Pagecache statistics are now stored in 'pagecache_stats' instead of
  maria_pagecache. One must call multi_update_pagecache_stats() to
  update the statistics.
- Added to PAGECACHE_FILE a pointer to the file's pagecache. This was
  done to ensure that the index and data file use the same
  pagecache, and it simplified the checkpoint code.
  I kept the pagecache pointer in TABLE_SHARE to minimize the changes.
- really_execute_checkpoint() was updated to handle a dynamic number
  of pagecaches.
- pagecache_collect_changed_blocks_with_lsn() was slightly changed to
  allow it to be called for each pagecache.
- Removed the unused functions maria_assign_pagecache() and
  maria_change_pagecache().
- ma_pagecaches.c is totally rewritten. It now contains all
  multi_pagecache functions.
Errors found by QA that are fixed:
MDEV-36872 UBSAN errors in ma_checkpoint.c
MDEV-36874 Behavior upon too small aria_pagecache_buffer_size in case of
multiple segments is not very user-friendly
MDEV-36914 ma_checkpoint.c(285,9): conversion from '__int64' to 'uint'
treated as an error
MDEV-36912 sys_vars.sysvars_server_embedded and
sys_vars.sysvars_server_notembedded fail on x86
Appliers need to verify foreign key constraints during normal
operation in multi-active topologies, and for this reason appliers
are configured to enable FK checking.
However, during node joining, in IST and the later catch-up period,
the node is still idle (from local connections), and the only source
of incoming transactions is the cluster sending certified write
sets for applying. IST happens with parallel applying, and there
is a possibility that foreign key checks cause lock conflicts between
appliers accessing FK child and parent tables. The excessive
FK checking will also slow down the IST process somewhat.
For these reasons, we can relax FK checks for appliers during IST
and catch-up periods. The relaxed FK check mode should, however, be
configurable, e.g. by a wsrep_mode flag: SKIP_APPLIER_FK_CHECKS_IN_IST.
When this operation mode is set, and the node is processing IST or
catch up, appliers should skip FK checking.
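A sketch of enabling the proposed flag (assuming no other wsrep_mode
flags are set):
```
SET GLOBAL wsrep_mode = 'SKIP_APPLIER_FK_CHECKS_IN_IST';
```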
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Promote the last few SQL-inaccessible replication options (command line
or `mariadb.cnf` only) to these GLOBAL read-only system variables:
```
@@master_info_file
@@replicate_same_server_id
@@show_slave_auth_info
```
Side effect: the latter two options changed from taking no argument
to taking an optional argument. Quoting `include/my_getopt.h`:
> It should be noted that for historical reasons variables with the
> combination arg_type=NO_ARG, my_option::var_type=GET_BOOL still
> accepts arguments. This is somewhat counter intuitive and care should
> be taken if the code is refactored.
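These are now readable from SQL, e.g.:
```
SELECT @@global.master_info_file,
       @@global.replicate_same_server_id,
       @@global.show_slave_auth_info;
```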
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
The main purpose of this change is to allow one to use the --read-only
option to ensure that no one can issue a query that can
block replication.
The --read-only option can now take 4 different values:
0  No read only (as before).
1  Blocks changes for users without the 'READ_ONLY ADMIN'
   privilege (as before).
2  Additionally blocks LOCK TABLES and SELECT ... IN SHARE MODE
   for users without 'READ_ONLY ADMIN'.
3  Additionally blocks all the previous statements also for
   'READ_ONLY ADMIN' users.
read_only is changed to an enum, and one can use the following
names for the lock levels:
OFF, ON, NO_LOCK, NO_LOCK_NO_ADMIN
To keep things compatible with config files from older versions, one
can still use the values FALSE and TRUE, which are mapped to OFF and ON.
The main visible changes are:
- 'show variables like "read_only"' now returns a string
instead of a number.
- Error messages related to read_only violations now contain
  the current value of read_only.
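For example:
```
SET GLOBAL read_only = 'NO_LOCK';
SHOW VARIABLES LIKE 'read_only';  -- returns NO_LOCK
```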
Other things:
- is_read_only_ctx() renamed to check_read_only_with_error()
- Moved TL_READ_SKIP_LOCKED to its logical place
Reviewed by: Sergei Golubchik <serg@mariadb.org>
MDEV-36563 Assertion `!mysql_bin_log.is_open()' failed in
THD::mark_tmp_table_as_free_for_reuse
The purpose of this commit is to ensure that creation of and changes
to temporary tables are properly and predictably logged to the binary
log. It also fixes some bugs where ROW logging was used in MIXED mode
when STATEMENT would be a better (and expected) choice.
In this commit message, STATEMENT stands for logging to the binary log
in STATEMENT format, MIXED stands for the MIXED binlog format, and ROW
for the ROW binlog format.
New rules for logging of temporary tables
- CREATE of temporary tables is now by default binlogged only if the
  STATEMENT binlog format is used. If it is binlogged, 1 is stored in
  TABLE_SHARE->table_creation_was_logged. The user can change this
  behavior by setting create_temporary_table_binlog_formats to
  MIXED,STATEMENT, in which case the create is logged in statement
  format also in MIXED mode (as before).
- Changes to temporary tables are binlogged if and only if
  the CREATE was logged. The logging happens under STATEMENT or MIXED.
  If binlog_format=ROW, temporary table changes are not binlogged. A
  temporary table that is changed under ROW is marked as 'not up to
  date in binlog' and no future row changes are logged. Any future
  statement that uses the temporary table will force row logging of
  the other tables it uses.
- DROP TEMPORARY is binlogged only if the CREATE was binlogged.
Changes done:
- Row logging is forced for any statement using temporary tables that
  are not up to date in the binary log.
  (Before, row logging was forced if the user had any temporary table.)
- If there is any change to a temporary table that is not binlogged,
  the table is marked as not up to date.
- TABLE_SHARE->table_creation_was_logged has a new definition for
  temporary tables:
  0 Table creation was not logged to the binary log.
  1 Table creation was logged to the binary log and the table is up
    to date.
  2 Table creation was logged to the binary log but some changes were
    not logged to the binary log.
  A table is considered not up to date in the binary log if the value
  is 0 or 2.
- If a multi-table-update or multi-table-delete fails then
all updated temporary tables are marked as not up to date.
- When dropping temporary tables, use IF EXISTS. This ensures
  that a slave will not stop if it has crashed and lost its
  temporary tables.
- Removed the comment and version from the DROP /*!4000 TEMPORARY..
  statement generated when a connection with open temporary tables
  closes. Added 'generated by server' at the end of the DROP.
Bugs fixed:
- When using temporary tables with commands that forced row-based
  logging, like INSERT INTO temporary_table VALUES (UUID()), the
  statement was never logged, which caused the temporary table to
  become inconsistent on master and slave.
- The binlog format used is now clearly defined. It now depends only
  on the current binlog_format and the tables used.
  Before, it depended on whether the user had ANY temporary tables and
  on the state of 'current_stmt_binlog_format' set by previous queries.
  This also caused temporary tables to be logged to the binary log in
  some cases.
- CREATE TABLE t1 LIKE not_logged_temporary_table caused replication
  to stop.
- A rename of a non-binlogged temporary table was written to the
  binary log, which caused replication to stop.
Changes in behavior:
- By default create_temporary_table_binlog_formats=STATEMENT, which
  means that CREATE TEMPORARY is not logged to the binary log under
  MIXED binary logging. This can be changed by setting
  create_temporary_table_binlog_formats to MIXED,STATEMENT.
- Using temporary tables that were not logged to the binary log causes
  any query that uses them to update other tables to be logged in
  ROW format. Before, all queries were logged in ROW format if the
  user had any temporary tables, even if they were not used by the
  query.
- The generated DROP TEMPORARY TABLE is now always using IF EXISTS and
  has a "generated by server" comment in the binary log.
The consequence of the above is that manipulating a lot of rows
through temporary tables will by default be slower in MIXED mode.
For example:
BEGIN;
CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM
large_table1 JOIN large_table2 ON ...;
INSERT INTO other_table SELECT b, c FROM tmp WHERE a <100;
DROP TEMPORARY TABLE tmp;
COMMIT;
By default this will create a huge entry in the binary log, compared
to just a few hundred bytes in STATEMENT mode. However, the change in
this commit makes the usage of temporary tables more reliable and
predictable and is thus worth it. Using STATEMENT mode or
create_temporary_table_binlog_formats can be used to avoid this issue.
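For example, to restore the old behavior of logging CREATE TEMPORARY
also in MIXED mode:
```
SET GLOBAL create_temporary_table_binlog_formats = 'MIXED,STATEMENT';
```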
This is needed to make it easy for users to automatically ignore long
CHAR and VARCHAR columns when using ANALYZE TABLE PERSISTENT.
These fields can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' bytes of disk
space during ANALYZE, which can easily be much more than the size of
the analyzed table.
This commit adds a new system variable, analyze_max_length, with a
default value of 4G. Any field that is bigger than this in bytes will
be ignored by ANALYZE TABLE PERSISTENT unless it is explicitly
specified in FOR COLUMNS().
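A sketch (table and column names are hypothetical; assuming the
variable has session scope):
```
SET SESSION analyze_max_length = 1024;
ANALYZE TABLE t1 PERSISTENT FOR ALL;  -- skips the wide column
ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (wide_col) INDEXES ();  -- includes it
```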
While doing this patch, I noticed that we do not skip GEOMETRY columns
in ANALYZE TABLE, like we do with BLOB. This should be fixed when
merging to the 'main' branch. At the same time we should add a
reasonable default value for analyze_max_length, probably 1024, like
we have for max_sort_length.
Add the ssl_passphrase server parameter, which works similarly
to the --passout/--passin openssl command line parameters.
The passphrase value can be formatted as follows:
- pass:password
  Provides the actual password after the pass: prefix.
- env:var
  Obtains the password from the environment variable 'var'.
- file:pathname
  Reads the password from the specified file pathname.
  Only the first line, up to the newline character, is read.
If ssl_passphrase was set, SHOW VARIABLES will show "file:", "env:" or
"pass:" (but won't reveal sensitive data)
This patch adds support for SYS_REFCURSOR (a weakly typed cursor)
for both sql_mode=ORACLE and sql_mode=DEFAULT.
Works as a regular stored routine variable, parameter and return value:
- can be passed as an IN parameter to stored functions and procedures
- can be passed as an INOUT and OUT parameter to stored procedures
- can be returned from a stored function
Note, strongly typed REF CURSOR will be added separately.
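A minimal sketch under sql_mode=DEFAULT (routine names are
hypothetical):
```
DELIMITER $$
CREATE FUNCTION f() RETURNS SYS_REFCURSOR
BEGIN
  DECLARE c SYS_REFCURSOR;
  OPEN c FOR SELECT 1;   -- OPEN ... FOR stmt
  RETURN c;
END$$
CREATE PROCEDURE p()
BEGIN
  DECLARE v INT;
  DECLARE c SYS_REFCURSOR DEFAULT f();  -- SYS_REFCURSOR return value
  FETCH c INTO v;
  CLOSE c;
END$$
DELIMITER ;
```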
Note, to maintain dependencies easier, some parts of sql_class.h
and item.h were moved to new header files:
- select_results.h:
class select_result_sink
class select_result
class select_result_interceptor
- sp_cursor.h:
class sp_cursor_statistics
class sp_cursor
- sp_rcontext_handler.h:
class Sp_rcontext_handler and its descendants
The implementation consists of the following parts:
- A new class sp_cursor_array deriving from Dynamic_array
- A new class Statement_rcontext which contains data shared
between sub-statements of a compound statement.
It has a member m_statement_cursors of the sp_cursor_array data type,
  as well as an open cursor counter. THD inherits from Statement_rcontext.
- A new data type handler Type_handler_sys_refcursor in plugins/type_cursor/
It is designed to store uint16 references -
positions of the cursor in THD::m_statement_cursors.
- Type_handler_sys_refcursor suppresses some derived numeric features.
When a SYS_REFCURSOR variable is used as an integer an error is raised.
- A new abstract class sp_instr_fetch_cursor. It's needed to share
the common code between "OPEN cur" (for static cursors) and
"OPER cur FOR stmt" (for SYS_REFCURSORs).
- New sp_instr classes:
  * sp_instr_copen_by_ref - OPEN sys_ref_cursor FOR stmt;
* sp_instr_cfetch_by_ref - FETCH sys_ref_cursor INTO targets;
* sp_instr_cclose_by_ref - CLOSE sys_ref_cursor;
* sp_instr_destruct_variable - to destruct SYS_REFCURSOR variables when
the execution goes out of the BEGIN..END block
where SYS_REFCURSOR variables are declared.
- New methods in LEX:
* sp_open_cursor_for_stmt - handles "OPEN sys_ref_cursor FOR stmt".
* sp_add_instr_fetch_cursor - "FETCH cur INTO targets" for both
static cursors and SYS_REFCURSORs.
* sp_close - handles "CLOSE cur" both for static cursors and SYS_REFCURSORs.
- Changes in cursor functions to handle both static cursors and SYS_REFCURSORs:
* Item_func_cursor_isopen
* Item_func_cursor_found
* Item_func_cursor_notfound
* Item_func_cursor_rowcount
- A new system variable @@max_open_cursors - to limit the number
of cursors (static and SYS_REFCURSORs) opened at the same time.
Its allowed range is [0-65536], with 50 by default.
- A new virtual method Type_handler::can_return_bool() telling
  if calling item->val_bool() is allowed for Items of this data type,
  or if the "Illegal parameter for operation" error should be raised
  at fix_fields() time.
- New methods in Sp_rcontext_handler:
* get_cursor()
* get_cursor_by_ref()
- A new class Sp_rcontext_handler_statement to handle top-level
  statement-wide cursors, which are shared by all substatements.
- A new virtual method expr_event_handler() in classes Item and Field.
  It's needed to close (and make available for a new OPEN)
  unused THD::m_statement_cursors elements which no longer have any
  references. This can happen at various moments in time, e.g.:
  * after evaluating the parameters of an SQL routine
  * after assigning a cursor expression to a SYS_REFCURSOR variable
  * when leaving a BEGIN..END block with SYS_REFCURSOR variables
  * after setting OUT/INOUT routine actual parameters from formal
    parameters.
The parameter innodb_log_spin_wait_delay is deprecated and
ignored, because there is no spin loop anymore.
Thanks to commit 685d958e38
and commit a635c40648,
multiple mtr_t::commit() calls can concurrently copy their slices of
mtr_t::m_log to the shared log_sys.buf. Each writer would allocate
its own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch. This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.
It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub(), and a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15, or
later) can translate into loop-free code on AMD64.
Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.
log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.
log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.
log_t::buf_free, log_t::max_buf_free, log_t::lsn: Remove;
replaced by base_lsn and write_lsn_offset.
log_t::buf_size: Always reflects the usable size in append_prepare().
log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.
log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.
log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.
recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.
In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().
Reviewed by: Debarun Banerjee
innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size
that a Linux memory pressure event can shrink the buffer pool to.
On a memory pressure event, we will attempt to shrink
innodb_buffer_pool_size halfway between its current value and
innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is
specified as 0 or not specified on startup, its default value will be
adjusted to innodb_buffer_pool_size_max, that is, memory pressure events
will be disregarded by default.
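A worked example (startup values are illustrative):
```
-- --innodb-buffer-pool-size=8G --innodb-buffer-pool-size-auto-min=2G
-- On a memory pressure event, the shrink target is halfway between
-- the current size and the minimum: (8G + 2G) / 2 = 5G.
SELECT @@innodb_buffer_pool_size_auto_min;
```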
buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink
the buffer pool in response to a memory pressure event.
Reviewed by: Debarun Banerjee
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size be changed in arbitrary 1-megabyte
increments.
innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size. If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.
The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.
Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement,
a client disconnect, or server shutdown. If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.
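For example:
```
SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;  -- blocks until done
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';   -- empty when idle
```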
SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows. On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Resident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).
The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.
Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent. The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.
We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent, in case the innodb_buffer_pool_size is not an integer multiple
of the extent size.
The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default
innodb_page_size=16k this corresponds to 5 MiB. However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.
my_large_virtual_alloc(): A new function, similar to my_large_malloc().
my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.
After my_virtual_mem_decommit(), the virtual memory range will be
read-only or inaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified. This option is hard-coded on Microsoft Windows,
where VirtualFree(MEM_DECOMMIT) will make the memory inaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately. On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size. Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.
opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMUs and ISAs.
buf_pool_t::chunk_t::create(): Remove.
buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.
buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().
buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.
buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().
buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().
buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.
buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.
buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.
buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.
buf_pool_t::is_shrinking(): Return the total number of blocks that
were withdrawn or are to be withdrawn.
buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.
buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.
buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.
buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().
innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.
buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.
buf_buddy_condense_free(): Do not relocate blocks.
buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().
buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.
buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().
buf_buddy_alloc_from(): Remove the redundant parameter j.
buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.
buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.
trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.
Reviewed by: Debarun Banerjee
Currently it is allowed to set innodb_io_capacity to a very large
value, up to the unsigned 8-byte maximum 18446744073709551615. While
calculating the number of pages to flush, we could sometimes go beyond
innodb_io_capacity. Specifically, MDEV-24369 introduced logic
for aggressive flushing when the dirty page percentage in the buffer
pool exceeds innodb_max_dirty_pages_pct. So, when innodb_io_capacity
is set to a very large value and the dirty page percentage exceeds the
threshold, there is a multiplication overflow in the InnoDB page
cleaner.
Fix: We should prevent setting io_capacity to unrealistic values and
define a practical limit for it. The patch limits
innodb_io_capacity_max and innodb_io_capacity to the maximum of a
4-byte unsigned integer, i.e. 4294967295 (2^32-1). For a 16k page
size, this limit translates to a 64 TiB/sec write IO speed, which
looks sufficient.
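For example, the new upper bound:
```
SET GLOBAL innodb_io_capacity_max = 4294967295;  -- 2^32-1, the new maximum
SET GLOBAL innodb_io_capacity     = 4294967295;
```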
Reviewed by: Marko Mäkelä
At the start of mariadb-backup --backup, trigger a flush of the
InnoDB buffer pool, so that as little log as possible will have
to be copied.
The previously debug-build-only interface
SET GLOBAL innodb_log_checkpoint_now=ON;
will be made available on all builds, and
mariadb-backup --backup will invoke it, unless the option
--skip-innodb-log-checkpoint-now is specified.
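In effect, the backup now starts with the equivalent of:
```
SET GLOBAL innodb_log_checkpoint_now = ON;
-- opt out with: mariadb-backup --backup --skip-innodb-log-checkpoint-now
```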
Reviewed by: Vladislav Vaintroub
innodb_stats_transient_sample_pages, innodb_stats_persistent_sample_pages:
Change the type to UNSIGNED, because the number of pages in a table
is limited to 32 bits by the InnoDB file format.
btr_get_size_and_reserved(), fseg_get_n_frag_pages(),
fseg_n_reserved_pages_low(), fseg_n_reserved_pages(): Return uint32_t.
The file format limits page numbers to 32 bits.
dict_table_t::stat: An Atomic_relaxed<uint32_t> that combines a
number of metadata fields.
innodb_copy_stat_flags(): Copy the statistics flags from
TABLE_SHARE or HA_CREATE_INFO.
dict_table_t::stats_initialized(), dict_table_t::stats_is_persistent():
Accessors to dict_table_t::stat.
Reviewed by: Thirunarayanan Balathandayuthapani
Backport of commit 74f70c3944 to 10.11.
The new logic is disabled by default; to enable it, use
optimizer_adjust_secondary_key_costs=fix_derived_table_read_cost.
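For example:
```
SET SESSION optimizer_adjust_secondary_key_costs = 'fix_derived_table_read_cost';
```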
== Original commit comment ==
Fixed costs in JOIN_TAB::estimate_scan_time() and HEAP
estimate_scan_time() calculates the cost of scanning a derived table.
The old code did not take into account that the temporary heap table
may be converted to Aria.
Things fixed:
- Added a check of whether the temporary table's data will fit in
  the heap. If not, the cost is calculated based on the designated
  internal temporary table engine (Aria).
- Removed MY_MAX(records, 1000) and instead trust the optimizer's
  estimate of records. This reduces the cost of temporary tables a bit
  for small tables, which caused a few changes in mtr results.
- Fixed cost calculation for HEAP.
- HEAP costs->row_next_find_cost was not set. This does not affect old
  cost calculations, as this cost slot was not used anywhere.
  Now HEAP cost->row_next_find_cost is set, which allowed me to remove
  some duplicated computation in ha_heap::scan_time()
* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)