This feature stores the DDLs of the tables/views that are used in
a query in the optimizer trace. It is controlled by the
system variable store_ddls_in_optimizer_trace, and is not enabled
by default. All DDLs are stored in a single JSON array, with each
element holding the table/view name and the associated CREATE
definition of the table/view.
The approach taken is to read the global query_tables list from
thd->lex in reverse order, create a record with the table name and
the DDL of the table, add the table name to a hash,
and dump the information to the trace.
dbName_plus_tableName is used as the hash key,
and duplicate entries are not added to the hash.
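For example, a minimal sketch (the table name is hypothetical, and it
is assumed the variable can be set at session scope):
```
SET optimizer_trace='enabled=on';
SET store_ddls_in_optimizer_trace=1;
SELECT * FROM t1;
-- the DDL array appears in the trace output:
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;
```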
The main suite tests are also run with the feature enabled, and they
all succeed.
Let's always disconnect a user connection before dropping said user.
MariaDB is traditionally very tolerant of active connections
belonging to a dropped user, which isn't the case for most other
databases. Let's avoid unintentionally spreading this incompatible
behavior and disconnect before the drop,
except in tests that specifically test such behavior.
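A minimal sketch of the pattern (connection and user names are
illustrative):
```
connection default;
disconnect con1;
DROP USER u1;
```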
Remove one of the major sources of race conditions in mariadb-test.
Normally, mariadb_close() sends COM_QUIT to the server and immediately
disconnects. In mariadb-test this means the test can switch to another
connection and send queries to the server before the server has even
started parsing the COM_QUIT packet, and these queries can see the
old connection as fully active, as it hasn't reached dispatch_command
yet.
This is a major source of instability in tests, and many tests (but
not all, still fewer than half) employ workarounds. The correct one
is the count_sessions.inc/wait_until_count_sessions.inc pair.
Also very popular was wait_until_disconnected.inc, which was completely
useless: it verifies that the connection is closed, which after a
disconnect it always is, but it didn't verify whether the server had
processed COM_QUIT. Sadly, the placebo was as widely used as the real
thing.
Let's fix this by making the mariadb-test `disconnect` command _wait_
for the server to confirm. This makes almost all workarounds redundant.
In some cases count_sessions.inc/wait_until_count_sessions.inc is still
needed, though, as only the `disconnect` command is changed (see the
sketch after this list):
* after external tools, like `exec $MYSQL`
* after failed `connect` command
* replication, after `STOP SLAVE`
* Federated/CONNECT/SPIDER/etc after `DROP TABLE`
and also in some XA tests, because an XA transaction is dissociated from
the THD very late, after the server has closed the client connection.
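For example, a sketch of the workaround that is still needed around an
external tool:
```
--source include/count_sessions.inc
--exec $MYSQL -e "SELECT 1"
--source include/wait_until_count_sessions.inc
```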
Collateral cleanups: fix comments, remove some redundant statements:
* DROP IF EXISTS if nothing is known to exist
* DROP table/view before DROP DATABASE
* REVOKE privileges before DROP USER
etc
Statements that intend to modify data have to acquire protection
against an ongoing backup. Before backup locks, protection against
FTWRL was acquired in the form of two shared metadata locks in the
GLOBAL (global read lock) and COMMIT namespaces. These two namespaces
were separate entities; they didn't share data structures or
locking primitives, and thus they were separate contention
points.
With backup locks, introduced by commit 7a9dfdd, these namespaces were
combined into a single BACKUP namespace. It became a single
contention point, which doubled the load on the BACKUP namespace data
structures and locking primitives compared to the GLOBAL and COMMIT
namespaces. In other words, system throughput halved.
MDL fast lanes solve this problem by allowing multiple contention
points for a single MDL_lock. A fast lane is a scalable multi-instance
registry for lightweight locks. Internally it is just a list of
granted tickets, a close counter, and a mutex.
The number of fast lanes (or contention points) is defined by the
metadata_locks_instances system variable. A value of 1 disables fast
lanes, and lock requests are served by the conventional MDL_lock data
structures.
Since fast lanes allow an arbitrary number of contention points, they
outperform the pre-backup-locks GLOBAL and COMMIT namespaces.
Fast lanes are enabled only for the BACKUP namespace. Support for
other namespaces is to be implemented separately.
Lock types are divided into two categories: lightweight and
heavyweight. Lightweight lock types represent DML: MDL_BACKUP_DML,
MDL_BACKUP_TRANS_DML, MDL_BACKUP_SYS_DML, MDL_BACKUP_DDL,
MDL_BACKUP_ALTER_COPY, MDL_BACKUP_COMMIT. They are fully compatible
with each other and are normally served by the corresponding fast
lane, which is determined by thread_id % metadata_locks_instances.
Heavyweight lock types represent an ongoing backup: MDL_BACKUP_START,
MDL_BACKUP_FLUSH, MDL_BACKUP_WAIT_FLUSH, MDL_BACKUP_WAIT_DDL,
MDL_BACKUP_WAIT_COMMIT, MDL_BACKUP_FTWRL1, MDL_BACKUP_FTWRL2,
MDL_BACKUP_BLOCK_DDL. These locks are always served by the
conventional MDL_lock data structures. Whenever such a lock is
requested, the fast lanes are closed and all tickets registered in
them are moved to the conventional MDL_lock data structures. Until
such locks are released or aborted, lightweight lock requests are
served by the conventional MDL_lock data structures.
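For example, the SQL-visible trigger for closing the lanes is an
ongoing backup (BACKUP STAGE START requests MDL_BACKUP_START):
```
BACKUP STAGE START;         -- heavyweight lock: fast lanes are closed
BACKUP STAGE BLOCK_COMMIT;  -- DML waits on the conventional MDL_lock
BACKUP STAGE END;           -- locks released; fast lanes serve DML again
```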
Strictly speaking, moving tickets from fast lanes to the conventional
MDL_lock data structures is not required, but it reduces
complexity and keeps methods like MDL_lock::visit_subgraph(),
MDL_lock::notify_conflicting_locks(), MDL_lock::reschedule_waiters(),
and MDL_lock::can_grant_lock() intact.
It is not even required to register tickets in fast lanes. They
could be implemented on top of an atomic variable that holds two
counters: granted lightweight locks, and granted/waiting heavyweight
locks, similar to the MySQL solution, which roughly speaking has a
"single atomic fast lane". However, it appears this wouldn't bring
any better performance, while the code complexity would be much
higher.
Added option 'aria-pagecache-segments', default 1.
For values > 1, this splits the aria-pagecache-buffer into the given
number of segments, each independent of the others. Having multiple
pagecaches improves performance when multiple connections run queries
concurrently on different tables.
Each pagecache will use aria-pagecache-buffer/segments amount of
memory, but at least 128K.
Each opened table has its index and data file use the segments in
a round-robin fashion.
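For example (values are illustrative): with
--aria-pagecache-buffer-size=1G and --aria-pagecache-segments=8,
each segment gets 128M. The setting and the combined statistics are
visible via:
```
SELECT @@aria_pagecache_segments;
SHOW GLOBAL STATUS LIKE 'Aria_pagecache%';
```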
Internal changes:
- All programs allocating the Aria pagecache themselves should now
  call multi_init_pagecache() instead of init_pagecache().
- Pagecache statistics are now stored in 'pagecache_stats' instead of
  maria_pagecache. One must call multi_update_pagecache_stats() to
  update the statistics.
- Added to PAGECACHE_FILE a pointer to the file's pagecache. This was
  done to ensure that the index and data file use the same
  pagecache, and it simplified the checkpoint code.
  I kept the pagecache pointer in TABLE_SHARE to minimize the changes.
- really_execute_checkpoint() was updated to handle a dynamic number
  of pagecaches.
- pagecache_collect_changed_blocks_with_lsn() was slightly changed to
  allow it to be called for each pagecache.
- Removed the unused functions maria_assign_pagecache() and
  maria_change_pagecache().
- ma_pagecaches.c is totally rewritten. It now contains all
  multi_pagecache functions.
Errors found by QA that are fixed:
MDEV-36872 UBSAN errors in ma_checkpoint.c
MDEV-36874 Behavior upon too small aria_pagecache_buffer_size in case of
multiple segments is not very user-friendly
MDEV-36914 ma_checkpoint.c(285,9): conversion from '__int64' to 'uint'
treated as an error
MDEV-36912 sys_vars.sysvars_server_embedded and
sys_vars.sysvars_server_notembedded fail on x86
Appliers need to verify foreign key constraints during normal
operation in multi-active topologies, and for this reason appliers
are configured to enable FK checking.
However, during node joining, in IST and the later catch-up period,
the node is still idle (from local connections), and the only source
of incoming transactions is the cluster sending certified write
sets for applying. IST happens with parallel applying, and there
is a possibility that foreign key checks cause lock conflicts between
appliers accessing FK child and parent tables. The excessive
FK checking will also slow down the IST process somewhat.
For these reasons, we can relax FK checks for appliers during IST
and catch-up periods. The relaxed FK check mode should, however, be
configurable, e.g. by a wsrep_mode flag: SKIP_APPLIER_FK_CHECKS_IN_IST.
When this operation mode is set, and the node is processing IST or
catch up, appliers should skip FK checking.
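A sketch of enabling the proposed flag (assuming no other wsrep_mode
flags are set):
```
SET GLOBAL wsrep_mode = 'SKIP_APPLIER_FK_CHECKS_IN_IST';
```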
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Promote the last few SQL-inaccessible replication options (command line
or `mariadb.cnf` only) to these GLOBAL read-only system variables:
```
@@master_info_file
@@replicate_same_server_id
@@show_slave_auth_info
```
Side effect: the latter two options changed from taking no argument
to taking an optional argument. Quoting `include/my_getopt.h`:
> It should be noted that for historical reasons variables with the
> combination arg_type=NO_ARG, my_option::var_type=GET_BOOL still
> accepts arguments. This is somewhat counter intuitive and care should
> be taken if the code is refactored.
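These are now readable from SQL, e.g.:
```
SELECT @@global.master_info_file,
       @@global.replicate_same_server_id,
       @@global.show_slave_auth_info;
```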
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
The main purpose of this change is to allow one to use the --read-only
option to ensure that no one can issue a query that can
block replication.
The --read-only option can now take 4 different values:
0  No read only (as before).
1  Blocks changes for users without the 'READ_ONLY ADMIN'
   privilege (as before).
2  Additionally blocks LOCK TABLES and SELECT ... IN SHARE MODE
   for users without 'READ_ONLY ADMIN'.
3  Additionally blocks all the previous statements also for
   'READ_ONLY ADMIN' users.
read_only is changed to an enum, and one can use the following
names for the lock levels:
OFF, ON, NO_LOCK, NO_LOCK_NO_ADMIN
To keep things compatible with config files from older versions, one
can still use the values FALSE and TRUE, which are mapped to OFF and ON.
The main visible changes are:
- 'show variables like "read_only"' now returns a string
instead of a number.
- Error messages related to read_only violations now contain
  the current value of read_only.
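For example:
```
SET GLOBAL read_only = 'NO_LOCK';
SHOW VARIABLES LIKE 'read_only';  -- returns NO_LOCK
```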
Other things:
- is_read_only_ctx() renamed to check_read_only_with_error()
- Moved TL_READ_SKIP_LOCKED to its logical place
Reviewed by: Sergei Golubchik <serg@mariadb.org>
MDEV-36563 Assertion `!mysql_bin_log.is_open()' failed in
THD::mark_tmp_table_as_free_for_reuse
The purpose of this commit is to ensure that creation of and changes
to temporary tables are properly and predictably logged to the binary
log. It also fixes some bugs where ROW logging was used in MIXED mode
when STATEMENT would be a better (and expected) choice.
In this commit message, STATEMENT stands for logging to the binary log
in STATEMENT format, MIXED stands for the MIXED binlog format, and ROW
for the ROW binlog format.
New rules for logging of temporary tables
- CREATE of temporary tables is now by default binlogged only if the
  STATEMENT binlog format is used. If it is binlogged, 1 is stored in
  TABLE_SHARE->table_creation_was_logged. The user can change this
  behavior by setting create_temporary_table_binlog_formats to
  MIXED,STATEMENT, in which case the create is logged in statement
  format also in MIXED mode (as before).
- Changes to temporary tables are binlogged if and only if
  the CREATE was logged. The logging happens under STATEMENT or MIXED.
  If binlog_format=ROW, temporary table changes are not binlogged. A
  temporary table that is changed under ROW is marked as 'not up to
  date in binlog' and no future row changes are logged. Any future
  statement that uses the temporary table will force row logging of
  the other tables it uses.
- DROP TEMPORARY is binlogged only if the CREATE was binlogged.
Changes done:
- Row logging is forced for any statement using temporary tables that
  are not up to date in the binary log.
  (Before, row logging was forced if the user had any temporary table.)
- If there is any change to a temporary table that is not binlogged,
  the table is marked as not up to date.
- TABLE_SHARE->table_creation_was_logged has a new definition for
  temporary tables:
  0 Table creation was not logged to the binary log.
  1 Table creation was logged to the binary log and the table is up
    to date.
  2 Table creation was logged to the binary log but some changes were
    not logged to the binary log.
  A table is considered not up to date in the binary log if the value
  is 0 or 2.
- If a multi-table-update or multi-table-delete fails then
all updated temporary tables are marked as not up to date.
- When dropping temporary tables, use IF EXISTS. This ensures
  that a slave will not stop if it has crashed and lost its
  temporary tables.
- Removed the comment and version from the DROP /*!4000 TEMPORARY..
  statement generated when a connection with open temporary tables
  closes. Added 'generated by server' at the end of the DROP.
Bugs fixed:
- When using temporary tables with commands that forced row-based
  logging, like INSERT INTO temporary_table VALUES (UUID()), the
  statement was never logged, which caused the temporary table to
  become inconsistent on master and slave.
- The binlog format used is now clearly defined. It now depends only
  on the current binlog_format and the tables used.
  Before, it depended on whether the user had ANY temporary tables and
  on the state of 'current_stmt_binlog_format' set by previous queries.
  This also caused temporary tables to be logged to the binary log in
  some cases.
- CREATE TABLE t1 LIKE not_logged_temporary_table caused replication
  to stop.
- A rename of a non-binlogged temporary table was written to the
  binary log, which caused replication to stop.
Changes in behavior:
- By default create_temporary_table_binlog_formats=STATEMENT, which
  means that CREATE TEMPORARY is not logged to the binary log under
  MIXED binary logging. This can be changed by setting
  create_temporary_table_binlog_formats to MIXED,STATEMENT.
- Using temporary tables that were not logged to the binary log causes
  any query that uses them to update other tables to be logged in
  ROW format. Before, all queries were logged in ROW format if the
  user had any temporary tables, even if they were not used by the
  query.
- The generated DROP TEMPORARY TABLE is now always using IF EXISTS and
  has a "generated by server" comment in the binary log.
The consequence of the above is that manipulating a lot of rows
through temporary tables will by default be slower in MIXED mode.
For example:
BEGIN;
CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM
large_table1 JOIN large_table2 ON ...;
INSERT INTO other_table SELECT b, c FROM tmp WHERE a <100;
DROP TEMPORARY TABLE tmp;
COMMIT;
By default this will create a huge entry in the binary log, compared
to just a few hundred bytes in STATEMENT mode. However, the change in
this commit makes the usage of temporary tables more reliable and
predictable and is thus worth it. Using STATEMENT mode or
create_temporary_table_binlog_formats can be used to avoid this issue.
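For example, to restore the old behavior of logging CREATE TEMPORARY
also in MIXED mode:
```
SET GLOBAL create_temporary_table_binlog_formats = 'MIXED,STATEMENT';
```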
This is needed to make it easy for users to automatically ignore long
CHAR and VARCHAR columns when using ANALYZE TABLE PERSISTENT.
These fields can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' bytes of disk
space during ANALYZE, which can easily be much more than the size of
the analyzed table.
This commit adds a new system variable, analyze_max_length, with a
default value of 4G. Any field that is bigger than this in bytes will
be ignored by ANALYZE TABLE PERSISTENT unless it is explicitly
specified in FOR COLUMNS().
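A sketch (table and column names are hypothetical; assuming the
variable has session scope):
```
SET SESSION analyze_max_length = 1024;
ANALYZE TABLE t1 PERSISTENT FOR ALL;  -- skips the wide column
ANALYZE TABLE t1 PERSISTENT FOR COLUMNS (wide_col) INDEXES ();  -- includes it
```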
While doing this patch, I noticed that we do not skip GEOMETRY columns
in ANALYZE TABLE, like we do with BLOB. This should be fixed when
merging to the 'main' branch. At the same time we should add a
reasonable default value for analyze_max_length, probably 1024, like
we have for max_sort_length.
Add the ssl_passphrase server parameter, which works similarly
to the --passout/--passin openssl command line parameters.
The passphrase value can be formatted as follows:
- pass:password
  Provides the actual password after the pass: prefix.
- env:var
  Obtains the password from the environment variable 'var'.
- file:pathname
  Reads the password from the specified file pathname.
  Only the first line, up to the newline character, is read.
If ssl_passphrase was set, SHOW VARIABLES will show "file:", "env:" or
"pass:" (but won't reveal sensitive data)
This patch adds support for SYS_REFCURSOR (a weakly typed cursor)
for both sql_mode=ORACLE and sql_mode=DEFAULT.
Works as a regular stored routine variable, parameter and return value:
- can be passed as an IN parameter to stored functions and procedures
- can be passed as an INOUT and OUT parameter to stored procedures
- can be returned from a stored function
Note, strongly typed REF CURSOR will be added separately.
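A minimal sketch under sql_mode=DEFAULT (routine names are
hypothetical):
```
DELIMITER $$
CREATE FUNCTION f() RETURNS SYS_REFCURSOR
BEGIN
  DECLARE c SYS_REFCURSOR;
  OPEN c FOR SELECT 1;   -- OPEN ... FOR stmt
  RETURN c;
END$$
CREATE PROCEDURE p()
BEGIN
  DECLARE v INT;
  DECLARE c SYS_REFCURSOR DEFAULT f();  -- SYS_REFCURSOR return value
  FETCH c INTO v;
  CLOSE c;
END$$
DELIMITER ;
```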
Note, to maintain dependencies easier, some parts of sql_class.h
and item.h were moved to new header files:
- select_results.h:
class select_result_sink
class select_result
class select_result_interceptor
- sp_cursor.h:
class sp_cursor_statistics
class sp_cursor
- sp_rcontext_handler.h:
class Sp_rcontext_handler and its descendants
The implementation consists of the following parts:
- A new class sp_cursor_array deriving from Dynamic_array
- A new class Statement_rcontext which contains data shared
between sub-statements of a compound statement.
It has a member m_statement_cursors of the sp_cursor_array data type,
  as well as an open cursor counter. THD inherits from Statement_rcontext.
- A new data type handler Type_handler_sys_refcursor in plugins/type_cursor/
It is designed to store uint16 references -
positions of the cursor in THD::m_statement_cursors.
- Type_handler_sys_refcursor suppresses some derived numeric features.
When a SYS_REFCURSOR variable is used as an integer an error is raised.
- A new abstract class sp_instr_fetch_cursor. It's needed to share
the common code between "OPEN cur" (for static cursors) and
"OPER cur FOR stmt" (for SYS_REFCURSORs).
- New sp_instr classes:
  * sp_instr_copen_by_ref - OPEN sys_ref_cursor FOR stmt;
* sp_instr_cfetch_by_ref - FETCH sys_ref_cursor INTO targets;
* sp_instr_cclose_by_ref - CLOSE sys_ref_cursor;
* sp_instr_destruct_variable - to destruct SYS_REFCURSOR variables when
the execution goes out of the BEGIN..END block
where SYS_REFCURSOR variables are declared.
- New methods in LEX:
* sp_open_cursor_for_stmt - handles "OPEN sys_ref_cursor FOR stmt".
* sp_add_instr_fetch_cursor - "FETCH cur INTO targets" for both
static cursors and SYS_REFCURSORs.
* sp_close - handles "CLOSE cur" both for static cursors and SYS_REFCURSORs.
- Changes in cursor functions to handle both static cursors and SYS_REFCURSORs:
* Item_func_cursor_isopen
* Item_func_cursor_found
* Item_func_cursor_notfound
* Item_func_cursor_rowcount
- A new system variable @@max_open_cursors - to limit the number
of cursors (static and SYS_REFCURSORs) opened at the same time.
Its allowed range is [0-65536], with 50 by default.
- A new virtual method Type_handler::can_return_bool() telling
  if calling item->val_bool() is allowed for Items of this data type,
  or if the "Illegal parameter for operation" error should be raised
  at fix_fields() time.
- New methods in Sp_rcontext_handler:
* get_cursor()
* get_cursor_by_ref()
- A new class Sp_rcontext_handler_statement to handle top-level
  statement-wide cursors, which are shared by all substatements.
- A new virtual method expr_event_handler() in classes Item and Field.
  It's needed to close (and make available for a new OPEN)
  unused THD::m_statement_cursors elements which no longer have any
  references. This can happen at various moments in time, e.g.:
  * after evaluating the parameters of an SQL routine
  * after assigning a cursor expression to a SYS_REFCURSOR variable
  * when leaving a BEGIN..END block with SYS_REFCURSOR variables
  * after setting OUT/INOUT routine actual parameters from formal
    parameters.
The parameter innodb_log_spin_wait_delay is deprecated and
ignored, because there is no spin loop anymore.
Thanks to commit 685d958e38
and commit a635c40648,
multiple mtr_t::commit() calls can concurrently copy their slices of
mtr_t::m_log to the shared log_sys.buf. Each writer would allocate
its own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch. This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.
It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub(), and a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15, or
later) can translate into loop-free code on AMD64.
Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.
log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.
log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.
log_t::buf_free, log_t::max_buf_free, log_t::lsn: Remove;
replaced by base_lsn and write_lsn_offset.
log_t::buf_size: Always reflects the usable size in append_prepare().
log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.
log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.
log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.
recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.
In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().
Reviewed by: Debarun Banerjee
innodb_buffer_pool_size_auto_min: A minimum innodb_buffer_pool_size
that a Linux memory pressure event can shrink the buffer pool to.
On a memory pressure event, we will attempt to shrink
innodb_buffer_pool_size halfway between its current value and
innodb_buffer_pool_size_auto_min. If innodb_buffer_pool_size_auto_min is
specified as 0 or not specified on startup, its default value will be
adjusted to innodb_buffer_pool_size_max, that is, memory pressure events
will be disregarded by default.
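A worked example (startup values are illustrative):
```
-- --innodb-buffer-pool-size=8G --innodb-buffer-pool-size-auto-min=2G
-- On a memory pressure event, the shrink target is halfway between
-- the current size and the minimum: (8G + 2G) / 2 = 5G.
SELECT @@innodb_buffer_pool_size_auto_min;
```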
buf_pool_t::garbage_collect(): For up to 15 seconds, attempt to shrink
the buffer pool in response to a memory pressure event.
Reviewed by: Debarun Banerjee
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size be changed in arbitrary 1-megabyte
increments.
innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size. If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB). The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.
The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.
Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement,
a client disconnect, or server shutdown. If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.
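For example:
```
SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;  -- blocks until done
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';   -- empty when idle
```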
SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows. On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Resident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).
The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.
Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent. The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.
We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent, in case the innodb_buffer_pool_size is not an integer multiple
of the extent size.
The minimum innodb_buffer_pool_size is 256*5/4 pages. At the default
innodb_page_size=16k this corresponds to 5 MiB. However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.
my_large_virtual_alloc(): A new function, similar to my_large_malloc().
my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.
After my_virtual_mem_decommit(), the virtual memory range will be
read-only or inaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified. This option is hard-coded on Microsoft Windows,
where VirtualFree(MEM_DECOMMIT) will make the memory inaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately. On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size. Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.
opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMUs and ISAs.
buf_pool_t::chunk_t::create(): Remove.
buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.
buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().
buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.
buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().
buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().
buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.
buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.
buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.
buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.
buf_pool_t::is_shrinking(): Return the total number of blocks that
were withdrawn or are to be withdrawn.
buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.
buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.
buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.
buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().
innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.
buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.
buf_buddy_condense_free(): Do not relocate blocks.
buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().
buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.
buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().
buf_buddy_alloc_from(): Remove the redundant parameter j.
buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.
buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.
trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.
Reviewed by: Debarun Banerjee
Currently it is allowed to set innodb_io_capacity to a very large
value, up to the unsigned 8-byte maximum 18446744073709551615. While
calculating the number of pages to flush, we could sometimes go beyond
innodb_io_capacity. Specifically, MDEV-24369 introduced logic
for aggressive flushing when the dirty page percentage in the buffer
pool exceeds innodb_max_dirty_pages_pct. So, when innodb_io_capacity
is set to a very large value and the dirty page percentage exceeds the
threshold, there is a multiplication overflow in the InnoDB page
cleaner.
Fix: We should prevent setting io_capacity to unrealistic values and
define a practical limit for it. The patch limits
innodb_io_capacity_max and innodb_io_capacity to the maximum of a
4-byte unsigned integer, i.e. 4294967295 (2^32-1). For a 16k page
size, this limit translates to a 64 TiB/sec write IO speed, which
looks sufficient.
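For example, the new upper bound:
```
SET GLOBAL innodb_io_capacity_max = 4294967295;  -- 2^32-1, the new maximum
SET GLOBAL innodb_io_capacity     = 4294967295;
```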
Reviewed by: Marko Mäkelä
At the start of mariadb-backup --backup, trigger a flush of the
InnoDB buffer pool, so that as little log as possible will have
to be copied.
The previously debug-build-only interface
SET GLOBAL innodb_log_checkpoint_now=ON;
will be made available on all builds, and
mariadb-backup --backup will invoke it, unless the option
--skip-innodb-log-checkpoint-now is specified.
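In effect, the backup now starts with the equivalent of:
```
SET GLOBAL innodb_log_checkpoint_now = ON;
-- opt out with: mariadb-backup --backup --skip-innodb-log-checkpoint-now
```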
Reviewed by: Vladislav Vaintroub
innodb_stats_transient_sample_pages, innodb_stats_persistent_sample_pages:
Change the type to UNSIGNED, because the number of pages in a table
is limited to 32 bits by the InnoDB file format.
btr_get_size_and_reserved(), fseg_get_n_frag_pages(),
fseg_n_reserved_pages_low(), fseg_n_reserved_pages(): Return uint32_t.
The file format limits page numbers to 32 bits.
dict_table_t::stat: An Atomic_relaxed<uint32_t> that combines a
number of metadata fields.
innodb_copy_stat_flags(): Copy the statistics flags from
TABLE_SHARE or HA_CREATE_INFO.
dict_table_t::stats_initialized(), dict_table_t::stats_is_persistent():
Accessors to dict_table_t::stat.
Reviewed by: Thirunarayanan Balathandayuthapani
Backport of commit 74f70c3944 to 10.11.
The new logic is disabled by default; to enable it, use
optimizer_adjust_secondary_key_costs=fix_derived_table_read_cost.
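For example:
```
SET SESSION optimizer_adjust_secondary_key_costs = 'fix_derived_table_read_cost';
```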
== Original commit comment ==
Fixed costs in JOIN_TAB::estimate_scan_time() and HEAP
estimate_scan_time() calculates the cost of scanning a derived table.
The old code did not take into account that the temporary heap table
may be converted to Aria.
Things fixed:
- Added a check of whether the temporary table's data will fit in
  the heap. If not, the cost is calculated based on the designated
  internal temporary table engine (Aria).
- Removed MY_MAX(records, 1000) and instead trust the optimizer's
  estimate of records. This reduces the cost of temporary tables a bit
  for small tables, which caused a few changes in mtr results.
- Fixed cost calculation for HEAP.
- HEAP costs->row_next_find_cost was not set. This does not affect old
  cost calculations, as this cost slot was not used anywhere.
  Now HEAP cost->row_next_find_cost is set, which allowed me to remove
  some duplicated computation in ha_heap::scan_time()
* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)