Statements that intend to modify data have to acquire protection
against an ongoing backup. Before backup locks, protection against
FTWRL was acquired in the form of two shared metadata locks in the
GLOBAL (global read lock) and COMMIT namespaces. These two namespaces
were separate entities: they didn't share data structures or locking
primitives, and thus they were separate contention points.
With backup locks, introduced by 7a9dfdd, these namespaces were
combined into a single BACKUP namespace. It became a single
contention point, which doubled the load on the BACKUP namespace data
structures and locking primitives compared to the GLOBAL and COMMIT
namespaces. In other words, system throughput was halved.
MDL fast lanes solve this problem by allowing multiple contention
points for a single MDL_lock. A fast lane is a scalable,
multi-instance registry for lightweight locks. Internally it is just
a list of granted tickets, a close counter and a mutex.
The number of fast lanes (or contention points) is defined by the
metadata_locks_instances system variable. A value of 1 disables fast
lanes, and lock requests are then served by the conventional MDL_lock
data structures.
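As a hedged illustration (whether the variable is dynamic or
startup-only is not specified here, so only the read-only inspection
form is shown):
```
-- Inspect the configured number of MDL fast lanes; 1 means fast
-- lanes are disabled and the conventional MDL_lock path is used.
SELECT @@global.metadata_locks_instances;
```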
Since fast lanes allow an arbitrary number of contention points, they
outperform the pre-backup-locks GLOBAL and COMMIT namespaces.
Fast lanes are enabled only for the BACKUP namespace. Support for
other namespaces is to be implemented separately.
Lock types are divided into two categories: lightweight and
heavyweight.
Lightweight lock types represent DML: MDL_BACKUP_DML,
MDL_BACKUP_TRANS_DML, MDL_BACKUP_SYS_DML, MDL_BACKUP_DDL,
MDL_BACKUP_ALTER_COPY, MDL_BACKUP_COMMIT. They are fully compatible
with each other. They are normally served by the corresponding fast
lane, which is determined by thread_id % metadata_locks_instances.
Heavyweight lock types represent an ongoing backup: MDL_BACKUP_START,
MDL_BACKUP_FLUSH, MDL_BACKUP_WAIT_FLUSH, MDL_BACKUP_WAIT_DDL,
MDL_BACKUP_WAIT_COMMIT, MDL_BACKUP_FTWRL1, MDL_BACKUP_FTWRL2,
MDL_BACKUP_BLOCK_DDL. These locks are always served by the
conventional MDL_lock data structures. Whenever such a lock is
requested, the fast lanes are closed and all tickets registered in
fast lanes are moved to the conventional MDL_lock data structures.
Until such locks are released or aborted, lightweight lock requests
are served by the conventional MDL_lock data structures.
Strictly speaking, moving tickets from fast lanes to the conventional
MDL_lock data structures is not required. But it reduces complexity
and keeps methods like MDL_lock::visit_subgraph(),
MDL_lock::notify_conflicting_locks(), MDL_lock::reschedule_waiters()
and MDL_lock::can_grant_lock() intact.
It is not even required to register tickets in fast lanes at all.
Fast lanes could be implemented based on an atomic variable that
holds two counters: granted lightweight locks and granted/waiting
heavyweight locks. This is similar to the MySQL solution, which
roughly speaking has a "single atomic fast lane". However, it does
not appear that this would bring any better performance, while the
code complexity would be much higher.
MSAN, despite being included in the HAVE_valgrind define, is best
differentiated from valgrind in the server identifier, as the two
have distinct and different sets of behaviours for these purposes.
MSAN has its own set of test inclusions that are different from
valgrind's, so including "valgrind" in a server string that gets
tested for valgrind would incorrectly exclude some tests that are
suitable for MSAN but not for valgrind.
There's a have_sanitizer system variable for exposing the sanitizer
being used, so there's no need for version verboseness.
Also correct the have_sanitizer system variable description to
include MSAN, which has been a possible value for a while.
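A hedged illustration of why the version string does not need to
carry this information:
```
-- The active sanitizer is already exposed via a system variable;
-- with a MemorySanitizer build this is expected to report MSAN.
SELECT @@have_sanitizer;
```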
Appliers need to verify foreign key constraints during normal
operation in multi-active topologies, and for this reason appliers
are configured to enable FK checking.
However, during node joining, in IST and the later catch-up period,
the node is still idle (as seen from local connections), and the only
source of incoming transactions is the cluster sending certified
write sets for applying. IST happens with parallel applying, and
there is a possibility that foreign key checks cause lock conflicts
between appliers accessing FK child and parent tables. Also, the
excessive FK checking will slow down the IST process somewhat.
For these reasons, we could relax FK checks for appliers during the
IST and catch-up periods. The relaxed FK check mode should, however,
be configurable, e.g. by a wsrep_mode flag:
SKIP_APPLIER_FK_CHECKS_IN_IST.
When this operation mode is set and the node is processing IST or
catch up, appliers should skip FK checking.
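A hedged sketch of enabling the proposed mode (assuming the flag is
added to the dynamic SET-type wsrep_mode variable as described):
```
-- Let appliers skip FK checks while the node is in IST / catch-up.
SET GLOBAL wsrep_mode = 'SKIP_APPLIER_FK_CHECKS_IN_IST';
```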
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
- when a definer for an SP/view is wrong - it should be
ER_MALFORMED_DEFINER, not ER_NO_SUCH_USER
- when one uses current_role as a definer or grantee but there's no
current role - it should be ER_INVALID_ROLE, not ER_MALFORMED_DEFINER
- when a non-existent user is specified - it should be ER_NO_SUCH_USER,
which should say "The user does not exist", not "Definer does not exist"
- clarify ER_CANT_CHANGE_TX_CHARACTERISTICS to say what cannot be changed
Promote the last few SQL-inaccessible replication options (command
line or `mariadb.cnf`) to these GLOBAL read-only system variables:
```
@@master_info_file
@@replicate_same_server_id
@@show_slave_auth_info
```
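These can now be inspected from SQL, for example (values depend on
the server configuration):
```
SELECT @@global.master_info_file,
       @@global.replicate_same_server_id,
       @@global.show_slave_auth_info;
```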
Side effect: The latter two options changed from taking no argument
to taking an optional argument. To quote `include/my_getopt.h`:
> It should be noted that for historical reasons variables with the
> combination arg_type=NO_ARG, my_option::var_type=GET_BOOL still
> accepts arguments. This is someone counter intuitive and care should
> be taken if the code is refactored.
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
The main purpose of this is to allow one to use the --read-only
option to ensure that no one can issue a query that can
block replication.
The --read-only option can now take 4 different values:
0 No read only (as before).
1 Blocks changes for users without the 'READ ONLY ADMIN'
privilege (as before).
2 Additionally blocks LOCK TABLES and SELECT IN SHARE MODE
for users without the 'READ ONLY ADMIN' privilege.
3 Additionally blocks 'READ_ONLY_ADMIN' users from all the
previous statements.
read_only is changed to an enum and one can use the following
names for the lock levels:
OFF, ON, NO_LOCK, NO_LOCK_NO_ADMIN
To keep things compatible with config files from older versions, one
can still use the values FALSE and TRUE, which are mapped to OFF and ON.
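A hedged sketch of the new levels in use, based on the mapping above
(assuming enum names are accepted by SET GLOBAL as for other enum
variables):
```
SET GLOBAL read_only = 'NO_LOCK';   -- level 2: also blocks LOCK TABLES and
                                    -- SELECT IN SHARE MODE for non-admin users
SHOW VARIABLES LIKE 'read_only';    -- now returns the name, not a number
```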
The main visible changes are:
- 'show variables like "read_only"' now returns a string
instead of a number.
- Error messages related to read_only violations now contain
the current value of read_only.
Other things:
- is_read_only_ctx() renamed to check_read_only_with_error()
- Moved TL_READ_SKIP_LOCKED to its logical place
Reviewed by: Sergei Golubchik <serg@mariadb.org>
The lock order of the mutexes must be LOCK_log followed by
LOCK_global_system_variables, as InnoDB can lock
LOCK_global_system_variables during a transaction commit while
LOCK_log is held.
The fix is to temporarily unlock LOCK_global_system_variables when
setting global binlog variables that need to use LOCK_log.
MDEV-36563 Assertion `!mysql_bin_log.is_open()' failed in
THD::mark_tmp_table_as_free_for_reuse
The purpose of this commit is to ensure that creation of and changes
to temporary tables are properly and predictably logged to the binary
log. It also fixes some bugs where ROW logging was used in MIXED mode
when STATEMENT would be a better (and expected) choice.
In this commit message, STATEMENT stands for logging to the binary
log in STATEMENT format, MIXED stands for the MIXED binlog format,
and ROW for the ROW binlog format.
New rules for logging of temporary tables:
- CREATE of a temporary table is now by default binlogged only if
the STATEMENT binlog format is used. If it is binlogged, 1 is stored
in TABLE_SHARE->table_creation_was_logged. The user can change this
behavior by setting create_temporary_table_binlog_formats to
MIXED,STATEMENT, in which case the create is logged in statement
format also in MIXED mode (as before).
- Changes to temporary tables are binlogged if and only if the CREATE
was logged. The logging happens under STATEMENT or MIXED. If
binlog_format=ROW, temporary table changes are not binlogged. A
temporary table that is changed under ROW is marked as 'not up to
date in binlog' and no future row changes are logged. Any future
statement using this temporary table will force row logging of the
other tables it uses.
- DROP TEMPORARY is binlogged only if the CREATE was binlogged.
Changes done:
- Row logging is forced for any statement using temporary tables that
are not up to date in the binary log.
(Before, row logging was forced if the user had a temporary table.)
- If there are any changes to the temporary table that are not
binlogged, the table is marked as not up to date.
- TABLE_SHARE->table_creation_was_logged has a new definition for
temporary tables:
0 Table creation was not logged to the binary log
1 Table creation was logged to the binary log and the table is up to
date.
2 Table creation was logged to the binary log but some changes were
not logged to the binary log.
A table is considered not up to date in the binary log if the value
is 0 or 2.
- If a multi-table-update or multi-table-delete fails then
all updated temporary tables are marked as not up to date.
- Enforce row logging if the query is using temporary tables
that are not up to date.
(Before, row logging was enforced if the user had any
temporary tables.)
- When dropping temporary tables, use IF EXISTS. This ensures
that the slave will not stop if it has crashed and lost its
temporary tables.
- Removed the comment and version from the DROP /*!4000 TEMPORARY..
statement generated when a connection with open temporary tables
closes. Added 'generated by server' at the end of the DROP.
Bugs fixed:
- When using temporary tables with commands that forced row based
logging, like INSERT INTO temporary_table VALUES (UUID()), this was
never logged, which caused the temporary table to become
inconsistent between master and slave.
- The binlog format used is now clearly defined. It now depends only
on the current binlog_format and the tables used.
Before, it depended on whether the user had ANY temporary tables and
on the state of 'current_stmt_binlog_format' set by previous queries.
This also caused temporary tables to be logged to the binary log in
some cases.
- CREATE TABLE t1 LIKE not_logged_temporary_table caused replication
to stop.
- Renames of non-binlogged temporary tables were written to the
binary log, which caused replication to stop.
Changes in behavior:
- By default create_temporary_table_binlog_formats=STATEMENT, which
means that CREATE TEMPORARY is not logged to binary log under MIXED
binary logging. This can be changed by setting
create_temporary_table_binlog_formats to MIXED,STATEMENT.
- Using temporary tables that were not logged to the binary log will
cause any query using them for updating other tables to be logged in
ROW format. Before, all queries were logged in ROW format if the user
had any temporary tables, even if they were not used by the query.
- Generated DROP TEMPORARY TABLE is now always using IF EXISTS and
has a "generated by server" comment in the binary log.
The consequence of the above is that manipulating a lot of rows
through temporary tables will by default be slower in mixed mode.
For example:
BEGIN;
CREATE TEMPORARY TABLE tmp AS SELECT a, b, c FROM
large_table1 JOIN large_table2 ON ...;
INSERT INTO other_table SELECT b, c FROM tmp WHERE a < 100;
DROP TEMPORARY TABLE tmp;
COMMIT;
By default this will create a huge entry in the binary log, compared
to just a few hundred bytes in statement mode. However, the change in
this commit makes the usage of temporary tables more reliable and
predictable and is thus worth it. Statement mode or
create_temporary_table_binlog_formats can be used to avoid this
issue, as in the example below.
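For example, to restore the pre-change behaviour under MIXED (a
hedged sketch; whether the variable is dynamically settable is an
assumption here):
```
SET GLOBAL create_temporary_table_binlog_formats = 'MIXED,STATEMENT';
```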
This is needed to make it easy for users to automatically ignore long
CHAR and VARCHAR columns when using ANALYZE TABLE PERSISTENT.
These columns can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' space on disk
during analyze, which can easily be much bigger than the analyzed table.
This commit adds a new user variable, analyze_max_length, with a
default value of 4G. Any field that is bigger than this in bytes will
be ignored by ANALYZE TABLE PERSISTENT unless it is specified in
FOR COLUMNS().
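A hedged usage sketch (the 1024 threshold is just an example value,
and session-level settability is an assumption):
```
-- Skip columns longer than 1024 bytes when collecting persistent statistics.
SET SESSION analyze_max_length = 1024;
ANALYZE TABLE t1 PERSISTENT FOR ALL;
```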
While doing this patch, I noticed that we do not skip GEOMETRY columns
in ANALYZE TABLE, like we do with BLOB. This should be fixed when
merging to the 'main' branch. At the same time we should add a
reasonable default value for analyze_max_length, probably 1024, like
we have for max_sort_length.
Add ssl_passphrase server parameter, which works similarly
to --passout/--passin openssl command line parameters.
The pass phrase value can be formatted as follows:
- pass:password
Provides the actual password after the pass: prefix.
- env:var
Obtains the password from the environment variable 'var'.
- file:pathname
Reads the password from the specified file pathname.
Only the first line, up to the newline character, is read from the stream.
If ssl_passphrase was set, SHOW VARIABLES will show "file:", "env:" or
"pass:" (but won't reveal sensitive data).
This patch adds support for SYS_REFCURSOR (a weakly typed cursor)
for both sql_mode=ORACLE and sql_mode=DEFAULT.
Works as a regular stored routine variable, parameter and return value:
- can be passed as an IN parameter to stored functions and procedures
- can be passed as an INOUT and OUT parameter to stored procedures
- can be returned from a stored function
Note, strongly typed REF CURSOR will be added separately.
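A hedged usage sketch in sql_mode=DEFAULT, following the
OPEN ... FOR / FETCH / CLOSE syntax described later in this message
(procedure names are illustrative only):
```
DELIMITER $$
CREATE PROCEDURE p_open(OUT c SYS_REFCURSOR)
BEGIN
  OPEN c FOR SELECT 1;           -- bind a statement to the weak cursor
END$$
CREATE PROCEDURE p_use()
BEGIN
  DECLARE c SYS_REFCURSOR;
  DECLARE v INT;
  CALL p_open(c);                -- cursor returned via OUT parameter
  FETCH c INTO v;
  CLOSE c;
  SELECT v;
END$$
DELIMITER ;
CALL p_use();
```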
Note, to maintain dependencies easier, some parts of sql_class.h
and item.h were moved to new header files:
- select_results.h:
class select_result_sink
class select_result
class select_result_interceptor
- sp_cursor.h:
class sp_cursor_statistics
class sp_cursor
- sp_rcontext_handler.h:
class Sp_rcontext_handler and its descendants
The implementation consists of the following parts:
- A new class sp_cursor_array deriving from Dynamic_array
- A new class Statement_rcontext which contains data shared
between sub-statements of a compound statement.
It has a member m_statement_cursors of the sp_cursor_array data type,
as well as an open cursor counter. THD inherits from Statement_rcontext.
- A new data type handler Type_handler_sys_refcursor in plugins/type_cursor/
It is designed to store uint16 references -
positions of the cursor in THD::m_statement_cursors.
- Type_handler_sys_refcursor suppresses some derived numeric features.
When a SYS_REFCURSOR variable is used as an integer, an error is raised.
- A new abstract class sp_instr_fetch_cursor. It's needed to share
the common code between "OPEN cur" (for static cursors) and
"OPEN cur FOR stmt" (for SYS_REFCURSORs).
- New sp_instr classes:
* sp_instr_copen_by_ref - OPEN sys_ref_cursor FOR stmt;
* sp_instr_cfetch_by_ref - FETCH sys_ref_cursor INTO targets;
* sp_instr_cclose_by_ref - CLOSE sys_ref_cursor;
* sp_instr_destruct_variable - to destruct SYS_REFCURSOR variables when
the execution goes out of the BEGIN..END block
where SYS_REFCURSOR variables are declared.
- New methods in LEX:
* sp_open_cursor_for_stmt - handles "OPEN sys_ref_cursor FOR stmt".
* sp_add_instr_fetch_cursor - "FETCH cur INTO targets" for both
static cursors and SYS_REFCURSORs.
* sp_close - handles "CLOSE cur" both for static cursors and SYS_REFCURSORs.
- Changes in cursor functions to handle both static cursors and SYS_REFCURSORs:
* Item_func_cursor_isopen
* Item_func_cursor_found
* Item_func_cursor_notfound
* Item_func_cursor_rowcount
- A new system variable @@max_open_cursors - to limit the number
of cursors (static and SYS_REFCURSORs) opened at the same time.
Its allowed range is [0-65536], with 50 by default.
- A new virtual method Type_handler::can_return_bool() telling
whether calling item->val_bool() is allowed for Items of this data
type, or whether the "Illegal parameter for operation" error should
be raised at fix_fields() time.
- New methods in Sp_rcontext_handler:
* get_cursor()
* get_cursor_by_ref()
- A new class Sp_rcontext_handler_statement to handle top level statement
wide cursors which are shared by all substatements.
- A new virtual method expr_event_handler() in classes Item and Field.
It's needed to close (and make available for a new OPEN)
unused THD::m_statement_cursors elements which do not have any references
any more. This can happen at various moments in time, e.g.
* after evaluating parameters of an SQL routine
* after assigning a cursor expression into a SYS_REFCURSOR variable
* when leaving a BEGIN..END block with SYS_REFCURSOR variables
* after setting OUT/INOUT routine actual parameters from formal
parameters.
Backport of commit 74f70c3944 to 10.11.
The new logic is disabled by default; to enable it, use
optimizer_adjust_secondary_key_costs=fix_derived_table_read_cost.
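A hedged example of enabling it (assuming the flag can be set per
session like other optimizer_adjust_secondary_key_costs flags):
```
SET SESSION optimizer_adjust_secondary_key_costs = 'fix_derived_table_read_cost';
```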
== Original commit comment ==
Fixed costs in JOIN_TAB::estimate_scan_time() and HEAP
Estimate_scan_time() calculates the cost of scanning a derived table.
The old code did not take into account that the temporary heap table
may be converted to Aria.
Things fixed:
- Added a check of whether the temporary table's data will fit in the
heap. If not, then the cost is calculated based on the designated
internal temporary table engine (Aria).
- Removed MY_MAX(records, 1000) and instead trust the optimizer's
estimate of records. This reduces the cost of temporary tables a bit
for small tables, which caused a few changes in mtr results.
- Fixed cost calculation for HEAP.
- HEAP costs->row_next_find_cost was not set. This does not affect the
old cost calculation as this cost slot was not used anywhere.
Now HEAP cost->row_next_find_cost is set, which allowed me to remove
some duplicated computation in ha_heap::scan_time().
* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)
Although the `my_thread_id` type is 64 bits, binlog format specs
limit it to 32 bits in practice. (See also: MDEV-35706)
The writable SQL variable `pseudo_thread_id` didn't reflect this,
though, and had a range of `ULONGLONG_MAX` (at least `UINT64_MAX` in
C/C++). It consequently accepted larger values silently, but only
their lower 32 bits got binlogged; this could lead to inconsistency.
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
The LOCK_global_system_variables must not be held when taking mutexes
such as LOCK_commit_ordered and LOCK_log, as this causes inconsistent
mutex locking order that can theoretically cause the server to
deadlock.
To avoid this, temporarily release LOCK_global_system_variables in two
system variable update functions, like it is done in many other
places.
Enforce the correct locking order at server startup, to more easily
catch (in debug builds) any remaining wrong orders that may be hidden
elsewhere in the code.
Note that when this is merged to 11.4, similar unlock/lock of
LOCK_global_system_variables must be added in update_binlog_space_limit()
as is done in binlog_checksum_update() and fix_max_binlog_size(), as this
is a new function added in 11.4 that also needs the same fix. Tests will
fail with wrong mutex order until this is done.
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Heap tables are allocated blocks to store rows according to
my_default_record_cache (mapped to the server global variable
read_buffer_size).
This causes performance issues when the record length is big
(> 1000 bytes) and the my_default_record_cache is small.
Changed to instead split the default heap allocation to 1/16 of the
allowed space and not use my_default_record_cache anymore when creating
the heap. The allocation is also aligned to be just under a power of 2.
For one test that I have been running, which used a record length of
633, the speed of the query doubled thanks to this change.
Other things:
- Fixed calculation of max_records passed to hp_create() to take
into account padding between records.
- Updated calculation of memory needed by heap tables. Before we
did not take into account internal structures needed to access rows.
- Changed the block size for memory_table from 1 to 16384 to get less
fragmentation. This also avoids a problem where we need 1K
to manage index and row storage, which was not accounted for before.
- Moved heap memory usage to a separate test for 32 bit.
- Allocate all data blocks in heap in powers of 2. Change reported
memory usage for heap to reflect this.
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
Disallow changing @@gtid_domain_id while a temporary table is open in
STATEMENT or MIXED binlog mode. Otherwise, a slave may try to replicate
events referring to the same temporary table in parallel, using
domain-based out-of-order parallel replication. This is not valid;
temporary tables are only available for use within a single thread at
a time.
One concrete consequence seen from this bug was a ROLLBACK on an
InnoDB temporary table running in one domain in parallel with DROP
TEMPORARY TABLE in another domain, causing an assertion inside InnoDB:
InnoDB: Failing assertion: table->get_ref_count() == 0 in
dict_sys_t::remove.
Use an existing error code that's somewhat close to the real issue
(ER_INSIDE_TRANSACTION_PREVENTS_SWITCH_GTID_DOMAIN_ID_SEQ_NO), to not add a
new error code in a GA release. When this is merged to the next GA release,
we could optionally introduce a new and more precise error code for an
attempt to change the domain_id while temporary tables are open.
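A hedged illustration of the new restriction (with binlog_format set
to STATEMENT or MIXED):
```
CREATE TEMPORARY TABLE t1 (a INT);
SET SESSION gtid_domain_id = 2;
-- now rejected with ER_INSIDE_TRANSACTION_PREVENTS_SWITCH_GTID_DOMAIN_ID_SEQ_NO
-- while the temporary table is open
```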
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When these members switched from plain `char*` to `LEX_CSTRING`,
not all usages were converted. Specifically, the ones fixed in this
commit are args of `my_snprintf` derivatives. Until MDEV-21978,
automated type checks were unavailable for those functions due to
their incompatibility, so these tools didn't catch them.
* preserve the graph in memory between statements
* keep it in a TABLE_SHARE, available for concurrent searches
* nodes are generally read-only, walking the graph doesn't change them
* distance to target is cached, calculated only once
* SIMD-optimized bloom filter detects visited nodes
* nodes are stored in an array, not List, to better utilize bloom filter
* auto-adjusting heuristic to estimate the number of visited nodes
(to configure the bloom filter)
* many threads can concurrently walk the graph. MEM_ROOT and Hash_set
are protected with a mutex, but walking doesn't need them
* up to 8 threads can concurrently load nodes into the cache,
nodes are partitioned into 8 mutexes (8 is chosen arbitrarily, might
need tuning)
* concurrent editing is not supported though
* this is fine for MyISAM, TL_WRITE protects the TABLE_SHARE and the
graph (note that TL_WRITE_CONCURRENT_INSERT is not allowed, because an
INSERT into the main table means multiple UPDATEs in the graph)
* InnoDB uses secondary transaction-level caches linked in a list
in thd->ha_data via a fake handlerton
* on rollback the secondary cache is discarded, on commit nodes
from the secondary cache are invalidated in the shared cache
while it is exclusively locked
* on savepoint rollback both caches are flushed. this can be improved
in the future with a row visibility callback
* graph size is controlled by @@mhnsw_cache_size, the cache is flushed
when it reaches the threshold
1. introduce alpha. the value of 1.1 is optimal, so hard-code it.
2. hard-code ef_construction=10, best by test
3. rename hnsw_max_connection_per_layer to mhnsw_max_edges_per_node
(max_connection is rather ambiguous in MariaDB) and add a help text
4. rename hnsw_ef_search to mhnsw_min_limit and add a help text
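A hedged way to inspect the renamed knobs (their scope and whether
they are dynamic is not covered here):
```
SELECT @@mhnsw_max_edges_per_node, @@mhnsw_min_limit, @@mhnsw_cache_size;
```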