Without this increase the mtr test case pre/post conditions will
fail as the stack usage has increased under MSAN with clang-20.1.
ASAN takes a 11M stack, however there was no obvious gain in MSAN
test success after 2M.
The resulting behaviour observed on smaller stack size was a
SEGV normally.
Hide the default stack size from the sysvar tests that expose
thread-stack as a variable with its default value.
This is needed to make it easy for users to automatically ignore long
char and varchars when using ANALYZE TABLE PERSISTENT.
These fields can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' space on disk
during analyze, which can easily be much bigger than the analyzed table.
This commit adds a new user variable, analyze_max_length, default value 4G.
Any field that is bigger than this in bytes, will be ignored by
ANALYZE TABLE PERSISTENT unless it is specified in FOR COLUMNS().
While doing this patch, I noticed that we do not skip GEOMETRY columns from
ANALYZE TABLE, like we do with BLOB. This should be fixed when merging
to the 'main' branch. At the same time we should add a resonable default
value for analyze_max_length, probably 1024, like we have for
max_sort_length.
Backport of commit 74f70c3944 to 10.11.
The new logic is disabled by default, to enable, use
optimizer_adjust_secondary_key_costs=fix_derived_table_read_cost.
== Original commit comment ==
Fixed costs in JOIN_TAB::estimate_scan_time() and HEAP
Estimate_scan_time() calculates the cost of scanning a derivied table.
The old code did not take into account that the temporary table heap table
may be converted to Aria.
Things fixed:
- Added checking if the temporary tables data will fit in the heap.
If not, then calculate the cost based on the designated internal
temporary table engine (Aria).
- Removed MY_MAX(records, 1000) and instead trust the optimizer's
estimate of records. This reduces the cost of temporary tables a bit
for small tables, which caused a few changes in mtr results.
- Fixed cost calculation for HEAP.
- HEAP costs->row_next_find_cost was not set. This does not affect old
costs calculation as this cost slot was not used anywhere.
Now HEAP cost->row_next_find_cost is set, which allowed me to remove
some duplicated computation in ha_heap::scan_time()
This commit updates default memory allocations size used with MEM_ROOT
objects to minimize the number of calls to malloc().
Changes:
- Updated MEM_ROOT block sizes in sql_const.h
- Updated MALLOC_OVERHEAD to also take into account the extra memory
allocated by my_malloc()
- Updated init_alloc_root() to only take MALLOC_OVERHEAD into account as
buffer size, not MALLOC_OVERHEAD + sizeof(USED_MEM).
- Reset mem_root->first_block_usage if and only if first block was used.
- Increase MEM_ROOT buffers sized used by my_load_defaults, plugin_init,
Create_tmp_table, allocate_table_share, TABLE and TABLE_SHARE.
This decreases number of malloc calls during queries.
- Use a small buffer for THD->main_mem_root in THD::THD. This avoids
multiple malloc() call for new connections.
I tried the above changes on a complex select query with 12 tables.
The following shows the number of extra allocations that where used
to increase the size of the MEM_ROOT buffers.
Original code:
- Connection to MariaDB: 9 allocations
- First query run: 146 allocations
- Second query run: 24 allocations
Max memory allocated for thd when using with heap table: 61,262,408
Max memory allocated for thd when using Aria tmp table: 419,464
After changes:
Connection to MariaDB: 0 allocations
- First run: 25 allocations
- Second run: 7 allocations
Max memory allocated for thd when using with heap table: 61,347,424
Max memory allocated for thd when using Aria table: 529,168
The new code uses slightly more memory, but avoids memory fragmentation
and is slightly faster thanks to much fewer calls to malloc().
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
(Variant 4, with @@optimizer_adjust_secondary_key_costs, reuse in two
places, and conditions are replaced with equivalent simpler forms in two more)
In best_access_path(), ReuseRangeEstimateForRef-3, the check
for whether
"all used key_part_i used key_part_i=const"
was incorrect: it may produced a "NO" answer for cases when we
had:
key_part1= const // some key parts are usable
key_part2= value_not_in_join_prefix //present but unusable
key_part3= non_const_value // unusable due to gap in key parts.
This caused the optimizer to fail to apply ReuseRangeEstimateForRef
heuristics. The consequence is poor query plan choice when the index
in question has very skewed data distribution.
The fix is enabled if its @@optimizer_adjust_secondary_key_costs flag
is set.
(Variant 2b: call greedy_search() twice, correct handling for limited
search_depth)
Modify the join optimizer to specifically try to produce join orders that
can short-cut their execution for ORDER BY..LIMIT clause.
The optimization is controlled by @@optimizer_join_limit_pref_ratio.
Default value 0 means don't construct short-cutting join orders.
Other value means construct short-cutting join order, and prefer it only
if it promises speedup of more than #value times.
In Optimizer Trace, look for these names:
* join_limit_shortcut_is_applicable
* join_limit_shortcut_plan_search
* join_limit_shortcut_choice
(With trivial fixes by sergey@mariadb.com)
Added option fix_innodb_cardinality to optimizer_adjust_secondary_key_costs
Using fix_innodb_cardinality disables the 'divide by 2' of rec_per_key_int
in InnoDB that in effect doubles the Cardinality for secondary keys.
This has the biggest effect for indexes where a few rows has the same key
value. Using this may also cause table scans for very small tables (which
in some cases may be better than an index scan).
The user visible effect is that 'SHOW INDEX FROM table_name' will for
InnoDB show the true Cardinality (and not 2x the real value). It will
also allow the optimizer to chose a better index in some cases as the
division by 2 could have a bad effect for tables with 2-5 identical values
per key.
A few notes about using fix_innodb_cardinality:
- It has direct affect for SHOW INDEX FROM table_name. SHOW INDEX
will also update the statistics in table share.
- The effect of fix_innodb_cardinality for query plans or EXPLAIN
is only visible after first open of the table. This is why one must
do a flush tables or use SHOW INDEX for the option to take effect.
- Using fix_innodb_cardinality can thus affect all user in their query
plans if they are using the same tables.
Because of this, it is strongly recommended that one uses
optimizer_adjust_secondary_key_costs=fix_innodb_cardinality mainly
in configuration files to not cause issues for other users.
PURGE BINARY LOGS did not always purge binary logs. This commit fixes
some of the issues and adds notifications if a binary log cannot be
purged.
User visible changes:
- 'PURGE BINARY LOG TO log_name' and 'PURGE BINARY LOGS BEFORE date'
worked differently. 'TO' ignored 'slave_connections_needed_for_purge'
while 'BEFORE' did not. Now both versions ignores the
'slave_connections_needed_for_purge variable'.
- 'PURGE BINARY LOG..' commands now returns 'note' if a binary log cannot
be deleted like
Note 1375 Binary log 'master-bin.000004' is not purged because it is
the current active binlog
- Automatic binary log purges, based on date or size, will write a
note to the error log if a binary log matching the size or date
cannot yet be deleted.
- If 'slave_connections_needed_for_purge' is set from a config or
command line, it is set to 0 if Galera is enabled and 1 otherwise
(old default). This ensures that automatic binary log purge works
with Galera as before the addition of
'slave_connections_needed_for_purge'.
If the variable is changed to 0, a warning will be printed to the error
log.
Code changes:
- Added THD argument to several purge_logs related functions that needed
THD.
- Added 'interactive' options to purge_logs functions. This allowed
me to remove testing of sql_command == SQLCOM_PURGE.
- Changed purge_logs_before_date() to first check if log is applicable
before calling can_purge_logs(). This ensures we do not get a
notification for logs that does not match the remove criteria.
- MYSQL_BIN_LOG::can_purge_log() will write notifications to the user
or error log if a log cannot yet be removed.
- log_in_use() will return reason why a binary log cannot be removed.
Changes to keep code consistent:
- Moved checking of binlog_format for Galera to be after Galera is
initialized (The old check never worked). If Galera is enabled
we now change the binlog_format to ROW, with a warning, instead of
aborting the server. If this change happens a warning will be printed to
the error log.
- Print a warning if Galera or FLASHBACK changes the binlog_format
to ROW. Before it was done silently.
Reviewed by: Sergei Golubchik <serg@mariadb.com>,
Kristian Nielsen <knielsen@knielsen-hq.org>
We have an issue if a user have the following in a configuration file:
log_slow_filter="" # Log everything to slow query log
log_queries_not_using_indexes=ON
This set log_slow_filter to 'not_using_index' which disables
slow_query_logging of most queries.
In effect, on should never use log_slow_filter="" in config files but
instead use log_slow_filter=ALL.
Fixed by changing log_slow_filter="" that comes either from a
configuration file or from the command line, when starting to the server,
to log_slow_filter=ALL.
A warning will be printed when this happens.
Other things:
- One can now use =ALL for any 'set' variable to set all options at once.
(backported from 10.6)
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.
The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.
It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a
record that does not exist in the current read view is made, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.
lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself on the directions of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.
The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.
Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.
Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.
binlog_space_limit is a variable in Percona server used to limit the total
size of all binary logs.
This implementation is based on code from Percona server 5.7.
In MariaDB we decided to call the variable max-binlog-total-size to be
similar to max-binlog-size. This makes it easier to find in the output
from 'mariadbd --help --verbose'). MariaDB will also support
binlog_space_limit for compatibility with Percona.
Some internal notes to explain implementation notes:
- When running MariaDB does not delete binary logs that are either
used by slaves or have active xid that are not yet committed.
Some implementation notes:
- max-binlog-total-size is by default 0 (no limit).
- max-binlog-total-size can be changed without server restart.
- Binlog file sizes are checked on startup, or if
max-binlog-total-size is set to a value > 0, not for every log write.
The total size of all binary logs is cached and dynamically updated
when updating the binary log on binary log rotation.
- max-binlog-total-size is checked against existing log files during
serverstart, binlog rotation, FLUSH LOGS, when writing to binary log
or when max-binlog-total-size changes value.
- Option --slave-connections-needed-for-purge with 1 as default added.
This allows one to ensure that we do not delete binary logs if there
is less than 'slave-connections-needed-for-purge' connected.
Without this option max-binlog-total-size would potentially delete
binlogs needed by slaves on server startup or when a slave disconnects
as there are then no connected slaves to protect active binlogs.
- PURGE BINARY LOGS TO ... will be executed as if
slave-connectitons-needed-for-purge would be zero. In other words
it will do the purge even if there is no slaves connected. If there
are connected slaves working on the logs, these will be protected.
- If binary log is on and max-binlog-total_size <> 0 then the status
variable 'Binlog_disk_use' shows the current size of all old binary
logs + the state of the current one.
- Removed test of strcmp(log_file_name, log_info.log_file_name) in
purge_logs_before_date() as this is tested in can_purge_logs()
- To avoid expensive calls of log_in_use() we cache the result for the
last log that is in use by a slave. Future calls to can_purge_logs()
for this binary log will be quickly detected and false will be returned
until a slave starts working on a new log.
- Note that after a binary log rotation caused by max_binlog_size,
the last log will not be purged directly as it is still in use
internally. The next binary log write will purge binlogs if needed.
Reviewer:Kristian Nielsen <knielsen@knielsen-hq.org>
In MariaDB up to 10.11, the test_if_cheaper_ordering() code (that tries
to optimizer how GROUP BY is executed) assumes that if a table scan is used
then if there is any index usable by GROUP BY it will be used.
The reason MySQL 10.4 provides a better plan is because of two differences:
- Plans using 'ref' has a cost of 1/10 of what it should be (as a
protection against table scans). This is why 'ref' is used in 10.4
and not in 10.5.
- When 'ref' is used, then GROUP BY will not use an index for GROUP BY.
In MariaDB 10.5 the chosen plan is a table scan (as it calculated to be
faster) but as 'ref' is not used, the test_if_cheaper_ordering()
optimizer phase decides (as ref is not usd) to use an index for GROUP BY,
which has bad performance.
Description of fix:
- All new code is protected by the "optimizer_adjust_secondary_key_costs"
variable, which is now a bit map, and is only executed if the option
"disable_forced_index_in_group_by" set.
- Corrects GROUP BY handling in test_if_cheaper_ordering() by making
the choise of using and index with GROUP BY cost based instead of rule
based.
- Adds TIME_FOR_COMPARE to all costs, when using group by, to make
read_time, index_scan_time and range_cost comparable.
Other things:
- Made optimizer_adjust_secondary_key_costs a bit map (compatible with old
code).
Notes:
Current code ignores costs for the algorithm used when doing GROUP
BY on the first table:
- Create an in-memory temporary table for handling group by and doing a
filesort of the result file
We can probably in 10.6 continue to ignore this cost.
This patch should NOT be merged to 11.0 series (not needed in 11.0).
Improve the performance of slave connect using B+-Tree indexes on each binlog
file. The index allows fast lookup of a GTID position to the corresponding
offset in the binlog file, as well as lookup of a position to find the
corresponding GTID position.
This eliminates a costly sequential scan of the starting binlog file
to find the GTID starting position when a slave connects. This is
especially costly if the binlog file is not cached in memory (IO
cost), or if it is encrypted or a lot of slaves connect simultaneously
(CPU cost).
The size of the index files is generally less than 1% of the binlog data, so
not expected to be an issue.
Most of the work writing the index is done as a background task, in
the binlog background thread. This minimises the performance impact on
transaction commit. A simple global mutex is used to protect index
reads and (background) index writes; this is fine as slave connect is
a relatively infrequent operation.
Here are the user-visible options and status variables. The feature is on by
default and is expected to need no tuning or configuration for most users.
binlog_gtid_index
On by default. Can be used to disable the indexes for testing purposes.
binlog_gtid_index_page_size (default 4096)
Page size to use for the binlog GTID index. This is the size of the nodes
in the B+-tree used internally in the index. A very small page-size (64 is
the minimum) will be less efficient, but can be used to stress the
BTree-code during testing.
binlog_gtid_index_span_min (default 65536)
Control sparseness of the binlog GTID index. If set to N, at most one
index record will be added for every N bytes of binlog file written.
This can be used to reduce the number of records in the index, at
the cost only of having to scan a few more events in the binlog file
before finding the target position
Two status variables are available to monitor the use of the GTID indexes:
Binlog_gtid_index_hit
Binlog_gtid_index_miss
The "hit" status increments for each successful lookup in a GTID index.
The "miss" increments when a lookup is not possible. This indicates that the
index file is missing (eg. binlog written by old server version
without GTID index support), or corrupt.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
optimizer-adjust_secondary_key_costs is added to provide 2 small
adjustments to the 10.x optimizer cost model. This can be used in the
case where the optimizer wrongly uses a secondary key instead of a
clustered primary key.
The reason behind this change is that MariaDB 10.x does not take into
account that for engines like InnoDB, that scanning a primary key can be
up to 7x faster than scanning a secondary key + read the row data trough
the primary key.
The different values for optimizer_adjust_secondary_key_costs are:
optimizer_adjust_secondary_key_costs=0
- No changes to current model
optimizer_adjust_secondary_key_costs=1
- Ensure that the cost of of secondary indexes has a cost of at
least 5x times the cost of a clustered primary key (if one exists).
This disables part of the worst_seek optimization described below.
optimizer_adjust_secondary_key_costs=2
- Disable "worst_seek optimization" and adjust filter cost slightly
(add cost of 1 if filter is used).
The idea behind 'worst_seek optimization' is that we limit the
cost for all non clustered ref access to the least of:
- best-rows-by-range (or all rows in no range found) / 10
- scan-time-table (roughly number of file blocks to scan table) * 3
In addition we also do not try to use rowid_filter if number of rows
estimated for 'ref' access is less than the worst_seek limitation.
The idea is that worst_seek is trying to take into account that if
we do a lot of accesses through a key, this is likely to be cached.
However it only does this for secondary keys, and not for clustered
keys or index only reads.
The effect of the worst_seek are:
- In some cases 'ref' will have a much lower cost than range or using
a clustered key.
- Some possible rowid filters for secondary keys will be ignored.
When implementing optimizer_adjust_secondary_key_costs=2, I noticed
that there is a slightly different costs for how ref+filter and
range+filter are calculated. This caused a lot of range and
range+filter to change to ref+filter, which is not good as
range+filter provides the optimizer a better estimate of how many
accepted rows there will be in the result set.
Adding a extra small cost (1 seek) when using filter mitigated the
above problems in almost all cases.
This patch should not be applied to MariaDB 11.0 as worst_seeks is
removed in 11.0 and the cost calculation for clustered keys, secondary
keys, index scan and filter is more exact.
Test case changes for --optimizer-adjust_secondary_key_costs=1
(Fix secondary key costs to be 5x of primary key):
- stat_tables_innodb:
- Complex change (probably ok as number of rows are really small)
- ref over 1 row changed to range over 10 rows with join buffer
- ref over 5 rows changed to eq_ref
- secondary ref over 1 row changed to ref of primary key over 4 rows
- Change of key to use longer key with index pushdown (a little
bit worse but not significant).
- Change to use secondary (1 row) -> primary (4 rows)
- rowid_filter_innodb:
- index_merge (2 rows) & ref (1) -> all (23 rows) -> primary eq_ref.
Test case changes for --optimizer-adjust_secondary_key_costs=2
(remove of worst_seeks & adjust filter cost):
- stat_tables_innodb:
- Join order change (probably ok as number of rows are really small)
- ref (5 rows) & ref(1 row) changed to range (10 rows & join buffer)
& eq_ref.
- selectivity_innodb:
- ref -> ref|filter (ok)
- rowid_filter_innodb:
- ref -> ref|filter (ok)
- range|filter (64 rows) changed to ref|filter (128 rows).
ok as ref|filter outputs wrong number of rows in explain.
- range, range_mrr_icp:
-ref (500 rows -> ALL (1000 rows) (ok)
- select_pkeycache, select, select_jcl6:
- ref|filter (2 rows) -> ref (2 rows) (ok)
- selectivity:
- ref -> ref_filter (ok)
- range:
- Change of 'filtered' but no stat or plan change (ok)
- selectivity:
- ref -> ref+filter (ok)
- Change of filtered but no plan change (ok)
- join_nested_jcl6:
- range -> ref|filter (ok as only 2 rows)
- subselect3, subselect3_jcl6:
- ref_or_null (4 rows) -> ALL (10 rows) (ok)
- Index_subquery (4 rows) -> ALL (10 rows) (ok)
- partition_mrr_myisam, partition_mrr_aria and partition_mrr_innodb:
- Uses ALL instead of REF for a key value that is the same for > 50%
of rows. (good)
order_by_innodb:
- range (200 rows) -> ref (20 rows)+filesort (ok)
- subselect_sj2_mat:
- One test changed. One ALL removed and replaced with eq_ref. Likely
to be better.
- join_cache:
- Changed ref over 60% of the rows to use hash join (ok)
- opt_tvc:
- Changed to use eq_ref instead of ref with plan change (probably ok)
- opt_trace:
- No worst/max seeks clipping (good).
- Almost double range_scan_time and index_scan_time (ok).
- rowid_filter:
- ref -> ref|filtered (ok)
- range|filter (77 rows) changed to ref|filter (151 rows). Proably
ok as ref|filter outputs wrong number of rows in explain.
Reviewer: Sergei Petrunia <sergey@mariadb.com>
rpl_semi_sync_slave_enabled_consistent.test and the first part of
the commit message comes from Brandon Nesterenko.
A test to show how to induce the "Read semi-sync reply magic number
error" message on a primary. In short, if semi-sync is turned on
during the hand-shake process between a primary and replica, but
later a user negates the rpl_semi_sync_slave_enabled variable while
the replica's IO thread is running; if the io thread exits, the
replica can skip a necessary call to kill_connection() in
repl_semisync_slave.slave_stop() due to its reliance on a global
variable. Then, the replica will send a COM_QUIT packet to the
primary on an active semi-sync connection, causing the magic number
error.
The test in this patch exits the IO thread by forcing an error;
though note a call to STOP SLAVE could also do this, but it ends up
needing more synchronization. That is, the STOP SLAVE command also
tries to kill the VIO of the replica, which makes a race with the IO
thread to try and send the COM_QUIT before this happens (which would
need more debug_sync to get around). See THD::awake_no_mutex for
details as to the killing of the replica’s vio.
Notes:
- The MariaDB documentation does not make it clear that when one
enables semi-sync replication it does not matter if one enables
it first in the master or slave. Any order works.
Changes done:
- The rpl_semi_sync_slave_enabled variable is now a default value for
when semisync is started. The variable does not anymore affect
semisync if it is already running. This fixes the original reported
bug. Internally we now use repl_semisync_slave.get_slave_enabled()
instead of rpl_semi_sync_slave_enabled. To check if semisync is
active on should check the @@rpl_semi_sync_slave_status variable (as
before).
- The semisync protocol conflicts in the way that the original
MySQL/MariaDB client-server protocol was designed (client-server
send and reply packets are strictly ordered and includes a packet
number to allow one to check if a packet is lost). When using
semi-sync the master and slave can send packets at 'any time', so
packet numbering does not work. The 'solution' has been that each
communication starts with packet number 1, but in some cases there
is still a chance that the packet number check can fail. Fixed by
adding a flag (pkt_nr_can_be_reset) in the NET struct that one can
use to signal that packet number checking should not be done. This
is flag is set when semi-sync is used.
- Added Master_info::semi_sync_reply_enabled to allow one to configure
some slaves with semisync and other other slaves without semisync.
Removed global variable semi_sync_need_reply that would not work
with multi-master.
- Repl_semi_sync_master::report_reply_packet() can now recognize
the COM_QUIT packet from semisync slave and not give a
"Read semi-sync reply magic number error" error for this case.
The slave will be removed from the Ack listener.
- On Windows, don't stop semisync Ack listener just because one
slave connection is using socket_id > FD_SETSIZE.
- Removed busy loop in Ack_receiver::run() by using
"Self-pipe trick" to signal new slave and stop Ack_receiver.
- Changed some Repl_semi_sync_slave functions that always returns 0
from int to void.
- Added Repl_semi_sync_slave::slave_reconnect().
- Removed dummy_function Repl_semi_sync_slave::reset_slave().
- Removed some duplicate semisync notes from the error log.
- Add test of "if (get_slave_enabled() && semi_sync_need_reply)"
before calling Repl_semi_sync_slave::slave_reply().
(Speeds up the code as we can skip all initializations).
- If epl_semisync_slave.slave_reply() fails, we disable semisync
for that connection.
- We do not call semisync.switch_off() if there are no active slaves.
Instead we check in Repl_semi_sync_master::commit_trx() if there are
no active threads. This simplices the code.
- Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is
flushed in case of asserts.
- Removed the internal rpl_semi_sync_slave_status as it is not needed
anymore. The @@rpl_semi_sync_slave_status status variable is now
mapped to rpl_semi_sync_enabled.
- Removed rpl_semi_sync_slave_enabled as it is not needed anymore.
Repl_semi_sync_slave::get_slave_enabled() contains the active status.
- Added checking that we do not add a slave twice with
Ack_receiver::add_slave(). This could happen with old code.
- Removed Repl_semi_sync_master::check_and_switch() as it is not
needed anymore.
- Ensure that when we call Ack_receiver::remove_slave() that the slave
is removed from the listener before function returns.
- Call listener.listen_on_sockets() outside of mutex for better
performance and less contested mutex.
- Ensure that listening is ignoring newly added slaves when checking for
responses.
- Fixed the master ack_receiver listener is not killed if there are no
connected slaves (and thus stop semisync handling of future
connections). This could happen if all slaves sockets where would be
marked as unreliable.
- Added unlink() to base_ilist_iterator and remove() to
I_List_iterator. This enables us to remove 'dead' slaves in
Ack_recever::run().
- kill_zombie_dump_threads() now does killing of dump threads properly.
- It can now kill several threads (should be impossible but could
happen if IO slaves reconnects very fast).
- We now wait until the dump thread is done before starting the
dump.
- Added an error if kill_zombie_dump_threads() fails.
- Set thd->variables.server_id before calling
kill_zombie_dump_threads(). This simplies the code.
- Added a lot of comments both in code and tests.
- Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used.
Test changes:
- rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with
semisync enabled.
- Some timings changed slight with startup of slave which caused
rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the
error log file before the slave had started properly. Fixed by
adding wait_for_pattern_in_file.inc that allows waiting for the
pattern to appear in the log file.
- Tests have been updated so that we first set
rpl_semi_sync_master_enabled on the master and then set
rpl_semi_sync_slave_enabled on the slaves (this is according to how
the MariaDB documentation document how to setup semi-sync).
- Error text "Master server does not have semi-sync enabled" has been
replaced with "Master server does not support semi-sync" for the
case when the master supports semi-sync but semi-sync is not
enabled.
Other things:
- Some trivial cleanups in Repl_semi_sync_master::update_sync_header().
- We should in 11.3 changed the default value for
rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE
does not make much sense as default. The main difference with using
FALSE is that we do not wait for semisync Ack if there are no slave
threads. In the case of TRUE we wait once, which did not bring any
notable benefits except slower startup of master configured for
using semisync.
Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
This solves the problem reported in MDEV-32960 where a new
slave may not be registered in time and the master disables
semi sync because of that.
Fix old_mode flags conflict between OLD_MODE_NO_NULL_COLLATION_IDS
and OLD_MODE_LOCK_ALTER_TABLE_COPY.
Both flags used to be 1 << 6, now OLD_MODE_LOCK_ALTER_TABLE_COPY changed
to be 1 << 7