The largest_started_sub_id needs to be set under LOCK_parallel_entry
together with testing stop_sub_id. However, in-between was the logic for
do_ftwrl_wait(), which temporarily releases the mutex. This could lead to
inconsistent stopping amongst worker threads and lost data.
Fix by moving all the stop-related logic out from unrelated do_gco_wait()
and do_ftwrl_wait() and into its own function do_stop_handling().
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem is that when a worker thread is (user) killed in
wait_for_prior_commit, the event group may complete out-of-order since the
wait for prior commit was aborted by the kill.
This fix ensures that event groups will always complete in-order, even
in the error case. This is done in finish_event_group() by doing an
extra wait_for_prior_commit(), if necessary, that ignores kills.
This fix supersedes the fix for MDEV-30780, so the earlier fix for
that is reverted in this patch.
Also fix that an error from wait_for_prior_commit() inside
finish_event_group() would not signal the error to
wakeup_subsequent_commits().
Based on earlier work by Brandon Nesterenko and Andrei Elkin, with
some changes to simplify the semantics of wait_for_prior_commit() and
make the code more robust to future changes.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem was an incorrect unmark_start_commit() in
signal_error_to_sql_driver_thread(). If an event group gets an error, this
unmark could run after the following GCO started, and the subsequent
re-marking could access de-allocated GCO.
The offending unmark_start_commit() looks obviously incorrect, and the fix
is to just remove it. It was introduced in the MDEV-8302 patch, the commit
message of which suggests it was added there solely to satisfy an assertion
in ha_rollback_trans(). So update this assertion instead to not trigger for
event groups that experienced an error (rgi->worker_error). When an error
occurs in an event group, all following event groups are skipped anyway, so
the unmark should never be needed in this case.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
At STOP SLAVE, worker threads will continue applying event groups until the
end of the current GCO before stopping. This is a left-over from when only
conservative mode was available. In optimistic and aggressive mode, often
_all_ queued event will be in the same GCO, and slave stop will be
needlessly delayed.
This patch instead records at STOP SLAVE time the latest (highest sub_id)
event group that has started. Then worker threads will continue to apply
event groups up to that event group, but skip any following. The result is
that each worker thread will complete its currently running event group, and
then the slave will stop.
If the slave is caught up, and STOP SLAVE is run in the middle of an event
group that is already executing in a worker thread, then that event group
will be rolled back and the slave stop immediately, as normal.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem is that a parallel replica would not immediately stop
running/queued transactions when issued STOP SLAVE. That is, it
allowed the current group of transactions to run, and sometimes the
transactions which belong to the next group could be started and run
through commit after STOP SLAVE was issued too, if the last group
had started committing. This would lead to long periods to wait for
all waiting transactions to finish.
This patch updates a parallel replica to try and abort immediately
and roll-back any ongoing transactions. The exception to this is any
transactions which are non-transactional (e.g. those modifying
sequences or non-transactional tables), and any prior transactions,
will be run to completion.
The specifics are as follows:
1. A new stage was added to SHOW PROCESSLIST output for the SQL
Thread when it is waiting for a replica thread to either rollback or
finish its transaction before stopping. This stage presents as
“Waiting for worker thread to stop”
2. Worker threads which error or are killed no longer perform GCO
cleanup if there is a concurrently running prior transaction. This
is because a worker thread scheduled to run in a future GCO could be
killed and incorrectly perform cleanup of the active GCO.
3. Refined cases when the FL_TRANSACTIONAL flag is added to GTID
binlog events to disallow adding it to transactions which modify
both transactional and non-transactional engines when the binlogging
configuration allow the modifications to exist in the same event,
i.e. when using binlog_direct_non_trans_update == 0 and
binlog_format == statement.
4. A few existing MTR tests relied on the completion of certain
transactions after issuing STOP SLAVE, and were re-recorded
(potentially with added synchronizations) under the new rollback
behavior.
Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
The problem was that mutex_init() was called after the worker was
put into the domain_hash, which allowed other threads to access it
before mutex was initialized.
When using binlog_row_image=FULL with sequence table inserts, a
replica can deadlock because it treats full inserts in a sequence as DDL
statements by getting an exclusive lock on the sequence table. It
has been observed that with parallel replication, this exclusive
lock on the sequence table can lead to a deadlock where one
transaction has the exclusive lock and is waiting on a prior
transaction to commit, whereas this prior transaction is waiting on
the MDL lock.
This fix for this is on the master side, to raise FL_DDL
flag on the GTID of a full binlog_row_image write of a sequence table.
This forces the slave to execute the statement serially so a deadlock
cannot happen.
A test verifies the deadlock also to prove it happen on the OLD (pre-fixes)
slave.
OLD (buggy master) -replication-> NEW (fixed slave) is provided.
As the pre-fixes master's full row-image may represent both
SELECT NEXT VALUE and INSERT, the parallel slave pessimistically
waits for the prior transaction to have committed before to take on the
critical part of the second (like INSERT in the test) event execution.
The waiting exploits a parallel slave's retry mechanism which is
controlled by `@@global.slave_transaction_retries`.
Note that in order to avoid any persistent 'Deadlock found' 2013 error
in OLD -> NEW, `slave_transaction_retries` may need to be set to a
higher than the default value.
START-SLAVE is an effective work-around if this still happens.
The error was seen by a number of mtr tests being caused
by overdue initialization of rpl_parallel::LOCK_parallel_entry.
Specifically, SHOW-SLAVE-STATUS might find in
rpl_parallel::workers_idle() a gtid domain hash entry
already inserted whose mutex had not done
mysql_mutex_init().
Fixed with swapping the mutex init and the its entry's stack insertion.
Tested with a generous number of `mtr --repeat` of a few of the reported
to fail tests, incl rpl.parallel_backup.
…with: Test assertion failed
Problem:
=======
Assertion text: 'Value returned by SSS and PS table for Last_Error_Number
should be same.'
Assertion condition: '"1146" = "0"'
Assertion condition, interpolated: '"1146" = "0"'
Assertion result: '0'
Analysis:
========
In parallel replication when slave is started the worker pool gets
activated and it gets cleared when slave stops. Each time the worker pool
gets activated a backup worker pool also gets created to store worker
specific perforance schema information in case of errors. On error, all
relevant information is copied from rpl_parallel_thread to rli and it gets
cleared from thread. Then server waits for all workers to complete their
work, during this stage performance schema table specific worker info is
stored into the backup pool and finally the actual pool gets cleared. If
users query the performance schema table to know the status of workers the
information from backup pool will be used. The test simulates
ER_NO_SUCH_TABLE error and verifies the worker information in pfs table.
Test works fine if execution occurs in following order.
Step 1. Error occurred 'worker information is copied to backup pool'.
Step 2. handle_slave_sql invokes 'rpl_parallel_resize_pool_if_no_slaves' to
deactivate worker pool, it marks the pool->count=0
Step 3. PFS table is queried, since actual pool is deactivated backup pool
information is read.
If the Step 3 happens prior to Step2 the pool is yet to be deactivated and
the actual pool is read, which doesn't have any error details as they were
cleared. Hence test ocasionally fails.
Fix:
===
Upon error mark the back pool as being active so that if PFS table is
quried since the backup pool is flagged as valid its information will be
read, in case it is not flagged regular pool will be read.
This work is one of the last pieces created by the late Sujatha Sivakumar.
The hang could be seen as show slave status displaying an error like
Last_Error: Could not execute Write_rows_v1
along with
Slave_SQL_Running: Yes
accompanied with one of the replication threads in show-processlist
characteristically having status like
2394 | system user | | NULL | Slave_worker | 50852| closing tables
It turns out that closing tables worker got entrapped in endless looping
in mark_start_commit_inner() across already garbage-collected gco items.
The reclaimed gco links are explained with actually possible
out-of-order groups of events termination due to the Last_Error.
This patch reinforces the correct ordering to perform
finish_event_group's cleanup actions, incl unlinking gco:s
from the active list.
Problem
========
On a parallel, delayed replica, Seconds_Behind_Master will not be
calculated until after MASTER_DELAY seconds have passed and the
event has finished executing, resulting in potentially very large
values of Seconds_Behind_Master (which could be much larger than the
MASTER_DELAY parameter) for the entire duration the event is
delayed. This contradicts the documented MASTER_DELAY behavior,
which specifies how many seconds to withhold replicated events from
execution.
Solution
========
After a parallel replica idles, the first event after idling should
immediately update last_master_timestamp with the time that it began
execution on the primary.
Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
There are separate flags DBUG_OFF for disabling the DBUG facility
and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility.
Let us allow debug builds without DEBUG_SYNC.
Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to
define ENABLED_DEBUG_SYNC.
This commit implements two phase binloggable ALTER.
When a new
@@session.binlog_alter_two_phase = YES
ALTER query gets logged in two parts, the START ALTER and the COMMIT
or ROLLBACK ALTER. START Alter is written in binlog as soon as
necessary locks have been acquired for the table. The timing is
such that any concurrent DML:s that update the same table are either
committed, thus logged into binary log having done work on the old
version of the table, or will be queued for execution on its new
version.
The "COMPLETE" COMMIT or ROLLBACK ALTER are written at the very point
of a normal "single-piece" ALTER that is after the most of
the query work is done. When its result is positive COMMIT ALTER is
written, otherwise ROLLBACK ALTER is written with specific error
happened after START ALTER phase.
Replication of two-phase binloggable ALTER is
cross-version safe. Specifically the OLD slave merely does not
recognized the start alter part, still being able to process and
memorize its gtid.
Two phase logged ALTER is read from binlog by mysqlbinlog to produce
BINLOG 'string', where 'string' contains base64 encoded
Query_log_event containing either the start part of ALTER, or a
completion part. The Query details can be displayed with `-v` flag,
similarly to ROW format events. Notice, mysqlbinlog output containing
parts of two-phase binloggable ALTER is processable correctly only by
binlog_alter_two_phase server.
@@log_warnings > 2 can reveal details of binlogging and slave side
processing of the ALTER parts.
The current commit also carries fixes to the following list of
reported bugs:
MDEV-27511, MDEV-27471, MDEV-27349, MDEV-27628, MDEV-27528.
Thanks to all people involved into early discussion of the feature
including Kristian Nielsen, those who helped to design, implement and
test: Sergei Golubchik, Andrei Elkin who took the burden of the
implemenation completion, Sujatha Sivakumar, Brandon
Nesterenko, Alice Sherepa, Ramesh Sivaraman, Jan Lindstrom.
Dead code cleanup:
part_info->num_parts usage was wrong and working incorrectly in
mysql_drop_partitions() because num_parts is already updated in
prep_alter_part_table(). We don't have to update part_info->partitions
because part_info is destroyed at alter_partition_lock_handling().
Cleanups:
- DBUG_EVALUATE_IF() macro replaced by shorter form DBUG_IF();
- Typo in ER_KEY_COLUMN_DOES_NOT_EXITS.
Refactorings:
- Splitted write_log_replace_delete_frm() into write_log_delete_frm()
and write_log_replace_frm();
- partition_info via DDL_LOG_STATE;
- set_part_info_exec_log_entry() removed.
DBUG_EVALUATE removed
DBUG_EVALUTATE was only added for consistency together with
DBUG_EVALUATE_IF. It is not used anywhere in the code.
DBUG_SUICIDE() fix on release build
On release DBUG_SUICIDE() was statement. It was wrong as
DBUG_SUICIDE() is used in expression context.
Problem:- When we issue FTWRL with shutdown in parallel, there is race between
FTWRL and shutdown. Shutdown might destroy the mutex (pool->LOCK_rpl_thread_pool)
before FTWRL can lock it. So we can get crash on FTWRL thread
Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait for
FTWRL thread to complete its work , and then destroy.
So slave_prepare_for_shutdown will just deactivate the pool, and mutex is destroyed
later in end_slave()
Step 3:
======
Preserve worker pool information on either STOP SLAVE/Error. In case STOP
SLAVE is executed worker threads will be gone, hence worker threads will be
unavailable. Querying the table at this stage will give empty rows. To
address this case when worker threads are about to stop, due to an error or
forced stop, create a backup pool and preserve the data which is relevant to
populate performance schema table. Clear the backup pool upon slave start.
Step2:
=====
Add two extra columns mentioned below.
---------------------------------------------------------------------------
|Column Name: | Description: |
|-------------------------------------------------------------------------|
| | |
|WORKER_IDLE_TIME | Total idle time in seconds that the worker |
| | thread has spent waiting for work from |
| | co-ordinator thread |
| | |
|LAST_TRANS_RETRY_COUNT | Total number of retries attempted by last |
| | transaction |
---------------------------------------------------------------------------
Step1:
=====
Backport 'replication_applier_status_by_worker' from upstream.
Iterate through rpl_parallel_thread_pool and display slave worker thread
specific information as part of 'replication_applier_status_by_worker'
table.
---------------------------------------------------------------------------
|Column Name: | Description: |
|-------------------------------------------------------------------------|
| | |
|CHANNEL_NAME | Name of replication channel through which the |
| | transaction is received. |
| | |
|THREAD_ID | Thread_Id as displayed in 'performance_schema. |
| | threads' table for thread with name |
| | 'thread/sql/rpl_parallel_thread' |
| | |
| | THREAD_ID will be NULL when worker threads are |
| | stopped due to an error/force stop |
| | |
|SERVICE_STATE | Thread is running or not |
| | |
|LAST_SEEN_TRANSACTION | Last GTID executed by worker |
| | |
|LAST_ERROR_NUMBER | Last Error that occured on a particular worker |
| | |
|LAST_ERROR_MESSAGE | Last error specific message |
| | |
|LAST_ERROR_TIMESTAMP | Time stamp of last error |
| | |
---------------------------------------------------------------------------
CHANNEL_NAME will be empty when the worker has not processed any
transaction. Channel_name points to valid source channel_name when it is
processing a transaction/event group.
Problem:
========
Auto purge of relaylogs stops when relay-log-file is
'slave-relay-log.999999' and slave_parallel_threads is enabled.
Analysis:
=========
The problem is that in Relay_log_info::inc_group_relay_log_pos() function,
when two log names are compared via strcmp() function, it gives correct
result, when log name sequence numbers are of same digits(6 digits), But
when the number goes to 7 digits, a 999999 compares greater than
1000000, which is wrong, hence the bug.
Fix:
====
Extract the numeric extension part of the file name, convert it into
unsigned long and compare.
Thanks to David Zhao for the contribution.