1
0
mirror of https://github.com/MariaDB/server.git synced 2025-11-30 05:23:50 +03:00
Commit Graph

3287 Commits

Author SHA1 Message Date
Michael Widenius
9c79227c96 Fixed two bugs with CREATE OR REPLACE and LOCK TABLES:
MDEV-6560 Assertion `! is_set() ' failed in Diagnostics_area::set_ok_status on killing CREATE OR REPLACE
MDEV-6525 Assertion `table->pos_in_locked _tables == __null || table->pos_in_locked_tables->table = table' failed in mark_used_tables_as_free_for_reuse, locking problems and binlogging problems on CREATE OR REPLACE under lock.
 

mysql-test/r/create_or_replace.result:
  Added test for MDEV-6560
mysql-test/t/create_or_replace.test:
  Added test for MDEV-6560
mysql-test/valgrind.supp:
  Added suppression for OpenSuse 12.3
sql/sql_base.cc:
  More DBUG
sql/sql_class.cc:
  Changed that thd_sqlcom_can_generate_row_events() does not report that CREATE OR REPLACE is generating row events.
  This is safe as this function is only used by InnoDB/XtraDB to check if a query is generating row events as part of another transaction. As CREATE is always run as it's own transaction, this isn't a problem.
  This fixed MDEV-6525.
sql/sql_table.cc:
  Remember if reopen_tables() generates an error (which can only happen in case of KILL).
  This fixed MDEV-6560
2014-09-08 20:56:56 +03:00
Kristian Nielsen
36f50be970 MDEV-6462: Slave replicating using GTID doesn't recover correctly when master crashes in the middle of transaction
If the slave gets a reconnect in the middle of a GTID event group, normally
it will re-fetch that event group, skipping the first part that was already
queued for the SQL thread.

However, if the master crashed while writing the event group, the group is
incomplete. This patch detects this case and makes sure that the
transaction is rolled back and nothing is skipped from any following
event groups.

Similarly, a network proxy might cause the reconnect to end up on a
different master server. Detect this by noticing a different server_id,
and similarly in this case roll back the partially received group.
2014-09-02 14:07:01 +02:00
Kristian Nielsen
f2cbca793c Change a couple of permissions that cause lintian warnings in .deb packaging and don't really hurt to fix. 2014-08-13 15:46:39 +02:00
Sergei Golubchik
b5ebc21169 sanity
mysql-test/mysql-test-run.pl:
  fix the message
2014-08-10 14:36:17 +02:00
Kristian Nielsen
935309bedc Fix test case that requires dbug to not fail in release build. 2014-08-20 15:02:10 +02:00
Kristian Nielsen
453c29c3f7 MDEV-6321: close_temporary_tables() in format description event not serialised correctly
Follow-up patch, fixing a possible deadlock issue.

If the master crashes in the middle of an event group, there can be an active
transaction in a worker thread when we encounter the following master restart
format description event. In this case, we need to notify that worker thread
to abort and roll back the partial event group. Otherwise a deadlock occurs:
the worker thread waits for the commit that never arrives, and the SQL driver
thread waits for the worker thread to complete its event group, which it never
does.
2014-08-19 14:26:42 +02:00
Sergei Golubchik
50e192a04f Bug#17638477 UNINSTALL AND INSTALL SEMI-SYNC PLUGIN CAUSES SLAVES TO BREAK
Fix the bug properly (plugin cannot be unloaded as long as it's locked).
Enable and fix the test case.
Significantly reduce number of LOCK_plugin locks for semisync
(practically all locks were removed)
2014-08-03 12:45:14 +02:00
Sergei Golubchik
1c6ad62a26 mysql-5.5.39 merge
~40% bugfixed(*) applied
~40$ bugfixed reverted (incorrect or we're not buggy)
~20% bugfixed applied, despite us being not buggy
(*) only changes in the server code, e.g. not cmakefiles
2014-08-02 21:26:16 +02:00
Sergei Golubchik
4b4de01fae 5.3 merge 2014-08-01 16:51:12 +02:00
Sergei Golubchik
4e6e720160 MDEV-6290 Crash in KILL HARD QUERY USER x@y when slave threads are running
KILL USER should ignore system threads where sctx->user=sctx->host=NULL
2014-07-23 19:36:15 +02:00
Sergei Golubchik
1907bf042a MDEV-6409 CREATE VIEW replication problem if error occurs in mysql_register_view
fix by Sriram Patil
2014-07-23 12:01:05 +02:00
Otto Kekäläinen
e3ac16d721 Add executable bit to scripts that are supposed to have it. More info at: https://mariadb.atlassian.net/browse/MDEV-6153?focusedCommentId=55397&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-55397 2014-07-20 20:55:44 +03:00
Kristian Nielsen
4cb1e0eea0 MDEV-6321: close_temporary_tables() in format description event not serialised correctly
When a master server starts up, it logs a special format_description event at
the start of a new binlog to mark that is has restarted. This is used by a
slave to drop all temporary tables - this is needed in case the master crashed
and did not have a chance to send explicit DROP TEMPORARY TABLE statements to
the slave.

In parallel replication, we need to be careful when dropping the temporary
tables - we need to be sure that no prior events are still executing that
might be using the temporary tables to be dropped, _and_ that no following
events have started executing that might have created new temporary tables
that should not be dropped.

This was not handled correctly, which could cause errors about access to not
existing temporary tables or even crashes. This patch implements that such
format_description events cause serialisation of event execution; all prior
events are executed to completion first, then the format_description event is
executed, dropping temporary tables, then following events are queued for
execution.

Master restarts should be sufficiently infrequent that the resulting loss of
parallelism should be of minimal impact.
2014-07-02 12:51:45 +02:00
Michael Widenius
bd2117d154 Automatic merge from 5.5
Fixed 2 failing tests by replacing result files
2014-08-19 21:35:14 +03:00
Kristian Nielsen
cfa1ce81bb MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0
The problem occured when using parallel replication, and an error occured that
caused the SQL thread to stop when the IO thread had already reached a
following binlog file from the master (or otherwise performed a relay log
rotation).

In this case, the Rotate Event at the end of the relay log file could still be
executed, even though an earlier event in that relay log file had gotten an
error. This would cause the position to be incorrectly updated, so that upon
restart of the SQL thread, the event that had failed would be silently skipped
and ignored, causing replication corruption.

Fixed by checking before executing Rotate Event, whether an earlier event
has failed. If so, the Rotate Event is not executed, just dequeued, same as
for other normal events following a failing event.
2014-08-15 11:31:13 +02:00
Kristian Nielsen
ec05fea0a0 MDEV-6549, failing to update gtid_slave_pos for a transaction that was retried.
The bug was that in some cases, if a replicated transaction was rolled back
due to deadlock, during the subsequent retry of that transaction, the
gtid_slave_pos would _not_ be updated with the new GTID, leaving the GTID
position of the slave incorrect.

Fix this by ensuring during the retry that we clear the flag that marks that
the GTID has already been recorded in gtid_slave_pos, so that the update of
gtid_slave_pos will be done again during the retry.

In the original bug, the symptom was an assertion due to OPTION_GTID_BEGIN not
being cleared during the retry of the transaction. The reason was some code in
handling of a COMMIT query event, which would not clear the flag when not
recording a GTID in gtid_slave_pos. This commit also fixes that code to always
clear the OPTION_GTID_BEGIN flag for clarity, though it is actually not
possible for OPTION_GTID_BEGIN to become set unless a GTID is pending for
update (after fixing the bug described above).
2014-08-13 13:34:28 +02:00
Sergei Golubchik
7a7e65b9fc Fix rpl.rpl_semi_sync_uninstall_plugin to work reliably 2014-08-07 18:08:50 +02:00
Sergei Golubchik
6fb17a0601 5.5.39 merge 2014-08-07 18:06:56 +02:00
Sergei Golubchik
071a14c93d cleanup: remove have_mysql_upgrade.inc 2014-08-06 13:31:55 +02:00
Sergey Vojtovich
9f252fceab MDEV-6469 - rpl.rpl_gtid_basic, rpl.rpl_gtid_stop_start,
rpl.rpl_gtid_crash fail on PPC64

This is an addition to the original patch.

Restored show binlog events output and adjusted filters to
replace [\d-\d-\d,\d-\d-\d,\d-\d-\d] with [#-#-#].
2014-08-06 13:07:16 +04:00
Sergei Golubchik
fece177fe2 mysqltest: support pairs of delimiters in replace_regex 2014-08-04 21:19:24 +02:00
Sergey Vojtovich
2b61466733 MDEV-6469 - rpl.rpl_gtid_basic, rpl.rpl_gtid_stop_start,
rpl.rpl_gtid_crash fail on PPC64

GTID order in @@gtid_binlog_pos depends on internal hash order,
so requires to be hidden for stable test output.
2014-07-22 14:54:38 +04:00
Sergey Vojtovich
7e02ba555e MDEV-6465 - rpl.rpl_gtid_master_promote fails on PPC64
GTID order in @@gtid_binlog_pos depends on internal hash order,
so requires --replace_result for stable test output.
2014-07-21 13:16:08 +04:00
Kristian Nielsen
501c56ef1e MDEV-5262, MDEV-5914, MDEV-5941, MDEV-6020: Deadlocks during parallel replication causing replication to fail.
Merge the patches into MariaDB 10.0 main.

With this patch, parallel replication will now automatically retry a
transaction that fails due to deadlock or other temporary error, same as
single-threaded replication.

We catch deadlocks with InnoDB transactions due to enforced commit order. If
T1 must commit before T2 in parallel replication and T1 ends up waiting for T2
inside InnoDB, we kill T2 and retry it later to resolve the deadlock
automatically.
2014-07-11 12:06:47 +02:00
Kristian Nielsen
fd0abecaf4 Fix test failure seen in buildbot on power8.
GTID order in @@gtid_binlog_pos depends on internal hash order,
so requires --replace_result for stable test output.
2014-07-11 11:17:50 +02:00
Kristian Nielsen
8f21a31669 MDEV-6435: Assertion `m_status == DA_ERROR' failed in Diagnostics_area::sql_errno() with parallel replication
When a MyISAM query is killed midway, the query is logged to the binlog marked
with the error.

The slave does not attempt to run the query, but aborts with a suitable error
message in the error log for the DBA to act on.

In this case, the parallel replication code would check the sql_errno() code,
even no my_error() had been set. In debug builds, this causes an assertion.

Fixed the code to check that we actually have an error set before querying for
an error code.
2014-07-10 13:55:53 +02:00
Luis Soares
5111df0814 BUG#13874553: rpl.rpl_stop_slave fails sporadically on pb2
The test case makes use of the fine DEBUG_SYNC facility. Furthermore,
since it needs synchronization on internal threads (dump and SQL
threads) the server code has DEBUG_SYNC commands internally deployed
and activated through the DBUG_EXECUTE_IF macro. The internal
DBUG_SYNC commands are then controlled from the test case through the
DEBUG variable.

There were three problems around the DEBUG + DEBUG_SYNC facility
usage:

1. When signaling the SQL thread to continue, the test would reset
   immediately the DEBUG_SYNC variable. This could mean that the SQL
   thread might loose the signal and continue to wait forever;

2. A similar scenario was happening with the dump thread on the
   master. This thread was instructed to wait, and later it would be
   signaled to continue, but immediately after the DEBUG_SYNC would be
   reset. This could lead to the dump thread missing the signal and
   wait forever;

3. The test was not cleaning itself up with respect to the
   instrumentation of the dump thread. This would leave the
   conditional execution of an internal DEBUG_SYNC command active
   (through the usage of DBUG_EXECUTE_IF). 

We fix #1 and #2 by waiting for the threads to receive the signal and
only then issue the reset. We fix #3 by reseting the DEBUG variable,
thus deactivating the dump thread internal DEBUG_SYNC command.
2014-06-26 12:54:27 +01:00
Luis Soares
18fa87d41a BUG#13874553: rpl.rpl_stop_slave fails sporadically on pb2
The test case makes use of the fine DEBUG_SYNC facility. Furthermore,
since it needs synchronization on internal threads (dump and SQL
threads) the server code has DEBUG_SYNC commands internally deployed
and activated through the DBUG_EXECUTE_IF macro. The internal
DBUG_SYNC commands are then controlled from the test case through the
DEBUG variable.

There were three problems around the DEBUG + DEBUG_SYNC facility
usage:

1. When signaling the SQL thread to continue, the test would reset
   immediately the DEBUG_SYNC variable. This could mean that the SQL
   thread might loose the signal and continue to wait forever;

2. A similar scenario was happening with the dump thread on the
   master. This thread was instructed to wait, and later it would be
   signaled to continue, but immediately after the DEBUG_SYNC would be
   reset. This could lead to the dump thread missing the signal and
   wait forever;

3. The test was not cleaning itself up with respect to the
   instrumentation of the dump thread. This would leave the
   conditional execution of an internal DEBUG_SYNC command active
   (through the usage of DBUG_EXECUTE_IF). 

We fix #1 and #2 by waiting for the threads to receive the signal and
only then issue the reset. We fix #3 by reseting the DEBUG variable,
thus deactivating the dump thread internal DEBUG_SYNC command.
2014-06-26 12:54:27 +01:00
Kristian Nielsen
9150a0c7cb MDEV-4937: sql_slave_skip_counter does not work with GTID
The sql_slave_skip_counter is important to be able to recover replication from
certain errors. Often, an appropriate solution is to set
sql_slave_skip_counter to skip over a problem event. But setting
sql_slave_skip_counter produced an error in GTID mode, with a suggestion to
instead set @@gtid_slave_pos to point past the problem event. This however is
not always possible; for example, in case of an INCIDENT event, that event
does not have any GTID to assign to @@gtid_slave_pos.

With this patch, sql_slave_skip_counter now works in GTID mode the same was as
in non-GTID mode. When set, that many initial events are skipped when the SQL
thread starts, plus as many extra events are needed to completely skip any
partially skipped event group. The GTID position is updated to point past the
skipped event(s).
2014-06-25 15:24:11 +02:00
Sergei Golubchik
086a81986b MDEV-5867 ALTER TABLE t1 ENGINE=InnoDB keeps bad options when t1 ENGINE is CONNECT
Comment out unknown options in SHOW CREATE TABLE unless IGNORE_BAD_TABLE_OPTIONS is used
2014-07-08 19:39:27 +02:00
Kristian Nielsen
439f75f849 Fix test failures in rpl.rpl_checksum and rpl.rpl_gtid_errorlog.
These tests use search_pattern_in_file.inc to search the error log for
expected output. However, search_pattern_in_file.inc by default searched only
the first 50000 bytes, so if the error log grew too big the tests would fail.

This patch extends search_pattern_in_file.inc with an option to specify how
much of the file to search, and whether to search from the start of the file
or from the end. Then the rpl.rpl_checksum and rpl.rpl_gtid_errorlog test
cases are fixed to search the last 50000 bytes of the error log, which will
work no matter how large prior tests have made it.
2014-06-30 13:59:21 +02:00
Kristian Nielsen
370318f894 MDEV-6386: Assertion `thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' fails with parallel replication
The direct cause of the assertion was missing error handling in
record_gtid(). If ha_commit_trans() fails for the statement commit, there was
missing code to catch the error and do ha_rollback_trans() in this case; this
caused close_thread_tables() to assert.

Normally, this error case is not hit, but in this case it was triggered due to
another bug: When a transaction T1 fails during parallel replication, the code
would signal following transactions that they could start to run without
properly marking the error condition. This caused subsequent transactions to
incorrectly start replicating, only to get an error later during their own
commit step. This was particularly serious if the subsequent transactions were
DDL or MyISAM updates, which cannot be rolled back and would leave replication
in an inconsistent state.

Fixed by 1) in case of error, only signal following transactions to continue
once the error has been properly marked and those transactions will know not
to start; and 2) implement proper error handling in record_gtid() in the case
that statement commit fails.
2014-06-27 13:34:29 +02:00
Kristian Nielsen
86362129a2 MDEV-6120: When slave stops with error, error message should indicate the failing GTID
If replication breaks in GTID mode, it is not trivial to determine the GTID of
the failing event group. This is a problem, as such GTID is needed eg. to
explicitly set @@gtid_slave_pos to skip to after that event group, or to
compare errors on different servers, etc.

Fix by ensuring that relevant slave errors logged to the error log include the
GTID of the event group containing the problem event.
2014-06-25 15:17:03 +02:00
Kristian Nielsen
00467e136e MDEV-5799: Error messages written upon LOST EVENTS incident are corrupted
This is MySQL Bug#59123. The message string stored in an INCIDENT event was
not zero-terminated. This caused any following checksum bytes (if enabled on
the master) to be output to the error log as trailing garbage when the message
was printed to the error log.

Backport the patch from MySQL 5.6:

  revno: 2876.228.200
  revision-id: zhenxing.he@sun.com-20110111051323-w2xnzvcjn46x6h6u
  committer: He Zhenxing <zhenxing.he@sun.com>
  timestamp: Tue 2011-01-11 13:13:23 +0800
  message:
    BUG#59123 rpl_stm_binlog_max_cache_size fails sporadically with found warnings

Also add a test case.
2014-06-25 13:08:30 +02:00
Kristian Nielsen
312219cc63 MDEV-6364: Migrate a slave from MySQL 5.6 to MariaDB 10 break replication
MySQL 5.6 implemented WL#344, which is about a MASTER_DELAY option to CHANGE
MASTER. But as part of this worklog, the format of the realy-log.info file was
changed. The new format is not understood by earlier versions, and nor by
MariaDB 10.0, so changing server to those versions would cause the slave to
abort with an error due to reading incorrect data out of relay-log.info.

Fix this by backporting from the WL#344 patch just the code that understands
the new relay-log.info format. We still write out the old format, and none of
the MASTER_DELAY feature is backported with this commit.
2014-06-24 14:43:08 +02:00
Sergei Golubchik
dc64ba2187 MDEV-6137 better help for SET/ENUM sysvars
Auto-generate the allowed list of values for enum/set/flagset options
in --help output. But don't do that when the help text already has them.

Also, remove lists of values from help strings of various options, where
they were simply listed without any additional information.
2014-06-19 12:02:23 +02:00
unknown
643738eec8 MDEV-6180: Error 1590 is not autoskippable
The INCIDENT_EVENT always caused slave error and abort, without checking
--slave-skip-errors.

Now, if error 1590, ER_SLAVE_INCIDENT is included in the --slave-skip-errors
list, incident events will be ignored.

This is a merge of this MySQL 5.6 patch:

revision-id: frazer@mysql.com-20110314170916-ypgin17otj3ucx95
committer: Frazer Clement <frazer@mysql.com>
timestamp: Mon 2011-03-14 17:09:16 +0000
message:
  Bug#11799671 NOT POSSIBLE TO SKIP INCIDENT ERRORS
2014-06-18 11:03:08 +02:00
unknown
081926f3d8 MDEV-6188: master_retry_count (ignored if disconnect happens on SET master_heartbeat_period)
That particular part of slave connect to master was missing code to handle
retry in case of network errors. The same problem is present in MySQL 5.5, but
fixed in MySQL 5.6.

Fixed with this patch, by adding the code (mostly identical to MySQL 5.6), and
also adding a test case.

I checked other queries done towards master during slave connect, and they now
all seem to handle reconnect in case of network failures.
2014-06-17 14:10:13 +02:00
unknown
bd4153a8c2 MDEV-5262, MDEV-5914, MDEV-5941, MDEV-6020: Deadlocks during parallel
replication causing replication to fail.

Remove the temporary fix for MDEV-5914, which used READ COMMITTED for parallel
replication worker threads. Replace it with a better, more selective solution.

The issue is with certain edge cases of InnoDB gap locks, for example between
INSERT and ranged DELETE. It is possible for the gap lock set by the DELETE to
block the INSERT, if the DELETE runs first, while the record lock set by
INSERT does not block the DELETE, if the INSERT runs first. This can cause a
conflict between the two in parallel replication on the slave even though they
ran without conflicts on the master.

With this patch, InnoDB will ask the server layer about the two involved
transactions before blocking on a gap lock. If the server layer tells InnoDB
that the transactions are already fixed wrt. commit order, as they are in
parallel replication, InnoDB will ignore the gap lock and allow the two
transactions to proceed in parallel, avoiding the conflict.

Improve the fix for MDEV-6020. When InnoDB itself detects a deadlock, it now
asks the server layer for any preferences about which transaction to roll
back. In case of parallel replication with two transactions T1 and T2 fixed to
commit T1 before T2, the server layer will ask InnoDB to roll back T2 as the
deadlock victim, not T1. This helps in some cases to avoid excessive deadlock
rollback, as T2 will in any case need to wait for T1 to complete before it can
itself commit.

Also some misc. fixes found during development and testing:

 - Remove thd_rpl_is_parallel(), it is not used or needed.

 - Use KILL_CONNECTION instead of KILL_QUERY when a parallel replication
   worker thread is killed to resolve a deadlock with fixed commit
   ordering. There are some cases, eg. in sql/sql_parse.cc, where a KILL_QUERY
   can be ignored if the query otherwise completed successfully, and this
   could cause the deadlock kill to be lost, so that the deadlock was not
   correctly resolved.

 - Fix random test failure due to missing wait_for_binlog_checkpoint.inc.

 - Make sure that deadlock or other temporary errors during parallel
   replication are not printed to the the error log; there were some places
   around the replication code with extra error logging. These conditions can
   occur occasionally and are handled automatically without breaking
   replication, so they should not pollute the error log.

 - Fix handling of rgi->gtid_sub_id. We need to be able to access this also at
   the end of a transaction, to be able to detect and resolve deadlocks due to
   commit ordering. But this value was also used as a flag to mark whether
   record_gtid() had been called, by being set to zero, losing the value. Now,
   introduce a separate flag rgi->gtid_pending, so rgi->gtid_sub_id remains
   valid for the entire duration of the transaction.

 - Fix one place where the code to handle ignored errors called reset_killed()
   unconditionally, even if no error was caught that should be ignored. This
   could cause loss of a deadlock kill signal, breaking deadlock detection and
   resolution.

 - Fix a couple of missing mysql_reset_thd_for_next_command(). This could
   cause a prior error condition to remain for the next event executed,
   causing assertions about errors already being set and possibly giving
   incorrect error handling for following event executions.

 - Fix code that cleared thd->rgi_slave in the parallel replication worker
   threads after each event execution; this caused the deadlock detection and
   handling code to not be able to correctly process the associated
   transactions as belonging to replication worker threads.

 - Remove useless error code in slave_background_kill_request().

 - Fix bug where wfc->wakeup_error was not cleared at
   wait_for_commit::unregister_wait_for_prior_commit(). This could cause the
   error condition to wrongly propagate to a later wait_for_prior_commit(),
   causing spurious ER_PRIOR_COMMIT_FAILED errors.

 - Do not put the binlog background thread into the processlist. It causes
   too many result differences in mtr, but also it probably is not useful
   for users to pollute the process list with a system thread that does not
   really perform any user-visible tasks...
2014-06-10 10:13:15 +02:00
Sergei Golubchik
e27c338634 5.5.38 merge 2014-06-06 00:07:27 +02:00
unknown
629b822913 MDEV-5262, MDEV-5914, MDEV-5941, MDEV-6020: Deadlocks during parallel
replication causing replication to fail.

In parallel replication, we run transactions from the master in parallel, but
force them to commit in the same order they did on the master. If we force T1
to commit before T2, but T2 holds eg. a row lock that is needed by T1, we get
a deadlock when T2 waits until T1 has committed.

Usually, we do not run T1 and T2 in parallel if there is a chance that they
can have conflicting locks like this, but there are certain edge cases where
it can occasionally happen (eg. MDEV-5914, MDEV-5941, MDEV-6020). The bug was
that this would cause replication to hang, eventually getting a lock timeout
and causing the slave to stop with error.

With this patch, InnoDB will report back to the upper layer whenever a
transactions T1 is about to do a lock wait on T2. If T1 and T2 are parallel
replication transactions, and T2 needs to commit later than T1, we can thus
detect the deadlock; we then kill T2, setting a flag that causes it to catch
the kill and convert it to a deadlock error; this error will then cause T2 to
roll back and release its locks (so that T1 can commit), and later T2 will be
re-tried and eventually also committed.

The kill happens asynchroneously in a slave background thread; this is
necessary, as the reporting from InnoDB about lock waits happen deep inside
the locking code, at a point where it is not possible to directly call
THD::awake() due to mutexes held.

Deadlock is assumed to be (very) rarely occuring, so this patch tries to
minimise the performance impact on the normal case where no deadlocks occur,
rather than optimise the handling of the occasional deadlock.

Also fix transaction retry due to deadlock when it happens after a transaction
already signalled to later transactions that it started to commit. In this
case we need to undo this signalling (and later redo it when we commit again
during retry), so following transactions will not start too early.

Also add a missing thd->send_kill_message() that got triggered during testing
(this corrects an incorrect fix for MySQL Bug#58933).
2014-06-03 10:31:11 +02:00
Sergei Golubchik
5d16592d44 mysql-5.5.38 merge 2014-06-03 09:55:08 +02:00
unknown
787c470cef MDEV-5262: Missing retry after temp error in parallel replication
Handle retry of event groups that span multiple relay log files.

 - If retry reaches the end of one relay log file, move on to the next.

 - Handle refcounting of relay log files, and avoid purging relay log
   files until all event groups have completed that might have needed
   them for transaction retry.
2014-05-15 15:52:08 +02:00
unknown
d60915692c MDEV-5262: Missing retry after temp error in parallel replication
Implement that if first retry fails, we can do another attempt.

Add testcases to test multi-retry that succeeds in second attempt, and
multi-retry that eventually fails due to exceeding slave_trans_retries.
2014-05-13 13:42:06 +02:00
Sergei Golubchik
edf1fbd25b MDEV-6153 Trivial Lintian errors in MariaDB sources: spelling errors and wrong executable bits 2014-05-13 11:53:30 +02:00
Sergei Golubchik
d3e2e1243b 5.5 merge 2014-05-09 12:35:11 +02:00
Venkatesh Duggirala
2870bd7423 Bug#17283409 4-WAY DEADLOCK: ZOMBIES, PURGING BINLOGS,
SHOW PROCESSLIST, SHOW BINLOGS

Problem:  A deadlock was occurring when 4 threads were
involved in acquiring locks in the following way
Thread 1: Dump thread ( Slave is reconnecting, so on
              Master, a new dump thread is trying kill
              zombie dump threads. It acquired thread's
              LOCK_thd_data and it is about to acquire
              mysys_var->current_mutex ( which LOCK_log)
Thread 2: Application thread is executing show binlogs and
               acquired LOCK_log and it is about to acquire
               LOCK_index.
Thread 3: Application thread is executing Purge binary logs
               and acquired LOCK_index and it is about to
               acquire LOCK_thread_count.
Thread 4: Application thread is executing show processlist
               and acquired LOCK_thread_count and it is
               about to acquire zombie dump thread's
               LOCK_thd_data.
Deadlock Cycle:
     Thread 1 -> Thread 2 -> Thread 3-> Thread 4 ->Thread 1

The same above deadlock was observed even when thread 4 is
executing 'SELECT * FROM information_schema.processlist' command and
acquired LOCK_thread_count and it is about to acquire zombie
dump thread's LOCK_thd_data.

Analysis:
There are four locks involved in the deadlock.  LOCK_log,
LOCK_thread_count, LOCK_index and LOCK_thd_data.
LOCK_log, LOCK_thread_count, LOCK_index are global mutexes
where as LOCK_thd_data is local to a thread.
We can divide these four locks in two groups.
Group 1 consists of LOCK_log and LOCK_index and the order
should be LOCK_log followed by LOCK_index.
Group 2 consists of other two mutexes
LOCK_thread_count, LOCK_thd_data and the order should
be LOCK_thread_count followed by LOCK_thd_data.
Unfortunately, there is no specific predefined lock order defined
to follow in the MySQL system when it comes to locks across these
two groups. In the above problematic example,
there is no problem in the way we are acquiring the locks
if you see each thread individually.
But If you combine all 4 threads, they end up in a deadlock.

Fix: 
Since everything seems to be fine in the way threads are taking locks,
In this patch We are changing the duration of the locks in Thread 4
to break the deadlock. i.e., before the patch, Thread 4
('show processlist' command) mysqld_list_processes()
function acquires LOCK_thread_count for the complete duration
of the function and it also acquires/releases
each thread's LOCK_thd_data.

LOCK_thread_count is used to protect addition and
deletion of threads in global threads list. While show
process list is looping through all the existing threads,
it will be a problem if a thread is exited but there is no problem
if a new thread is added to the system. Hence a new mutex is
introduced "LOCK_thd_remove" which will protect deletion
of a thread from global threads list. All threads which are
getting exited should acquire LOCK_thd_remove
followed by LOCK_thread_count. (It should take LOCK_thread_count
also because other places of the code still thinks that exit thread
is protected with LOCK_thread_count. In this fix, we are changing
only 'show process list' query logic )
(Eg: unlink_thd logic will be protected with
LOCK_thd_remove).

Logic of mysqld_list_processes(or file_schema_processlist)
will now be protected with 'LOCK_thd_remove' instead of
'LOCK_thread_count'.

Now the new locking order after this patch is:
LOCK_thd_remove -> LOCK_thd_data -> LOCK_log ->
LOCK_index -> LOCK_thread_count
2014-05-08 18:13:01 +05:30
Venkatesh Duggirala
33f15dc7ac Bug#17283409 4-WAY DEADLOCK: ZOMBIES, PURGING BINLOGS,
SHOW PROCESSLIST, SHOW BINLOGS

Problem:  A deadlock was occurring when 4 threads were
involved in acquiring locks in the following way
Thread 1: Dump thread ( Slave is reconnecting, so on
              Master, a new dump thread is trying kill
              zombie dump threads. It acquired thread's
              LOCK_thd_data and it is about to acquire
              mysys_var->current_mutex ( which LOCK_log)
Thread 2: Application thread is executing show binlogs and
               acquired LOCK_log and it is about to acquire
               LOCK_index.
Thread 3: Application thread is executing Purge binary logs
               and acquired LOCK_index and it is about to
               acquire LOCK_thread_count.
Thread 4: Application thread is executing show processlist
               and acquired LOCK_thread_count and it is
               about to acquire zombie dump thread's
               LOCK_thd_data.
Deadlock Cycle:
     Thread 1 -> Thread 2 -> Thread 3-> Thread 4 ->Thread 1

The same above deadlock was observed even when thread 4 is
executing 'SELECT * FROM information_schema.processlist' command and
acquired LOCK_thread_count and it is about to acquire zombie
dump thread's LOCK_thd_data.

Analysis:
There are four locks involved in the deadlock.  LOCK_log,
LOCK_thread_count, LOCK_index and LOCK_thd_data.
LOCK_log, LOCK_thread_count, LOCK_index are global mutexes
where as LOCK_thd_data is local to a thread.
We can divide these four locks in two groups.
Group 1 consists of LOCK_log and LOCK_index and the order
should be LOCK_log followed by LOCK_index.
Group 2 consists of other two mutexes
LOCK_thread_count, LOCK_thd_data and the order should
be LOCK_thread_count followed by LOCK_thd_data.
Unfortunately, there is no specific predefined lock order defined
to follow in the MySQL system when it comes to locks across these
two groups. In the above problematic example,
there is no problem in the way we are acquiring the locks
if you see each thread individually.
But If you combine all 4 threads, they end up in a deadlock.

Fix: 
Since everything seems to be fine in the way threads are taking locks,
In this patch We are changing the duration of the locks in Thread 4
to break the deadlock. i.e., before the patch, Thread 4
('show processlist' command) mysqld_list_processes()
function acquires LOCK_thread_count for the complete duration
of the function and it also acquires/releases
each thread's LOCK_thd_data.

LOCK_thread_count is used to protect addition and
deletion of threads in global threads list. While show
process list is looping through all the existing threads,
it will be a problem if a thread is exited but there is no problem
if a new thread is added to the system. Hence a new mutex is
introduced "LOCK_thd_remove" which will protect deletion
of a thread from global threads list. All threads which are
getting exited should acquire LOCK_thd_remove
followed by LOCK_thread_count. (It should take LOCK_thread_count
also because other places of the code still thinks that exit thread
is protected with LOCK_thread_count. In this fix, we are changing
only 'show process list' query logic )
(Eg: unlink_thd logic will be protected with
LOCK_thd_remove).

Logic of mysqld_list_processes(or file_schema_processlist)
will now be protected with 'LOCK_thd_remove' instead of
'LOCK_thread_count'.

Now the new locking order after this patch is:
LOCK_thd_remove -> LOCK_thd_data -> LOCK_log ->
LOCK_index -> LOCK_thread_count
2014-05-08 18:13:01 +05:30
unknown
b0b60f2498 MDEV-5262: Missing retry after temp error in parallel replication
Start implementing that an event group can be re-tried in parallel replication
if it fails with a temporary error (like deadlock).

Patch is very incomplete, just some very basic retry works.

Stuff still missing (not complete list):

 - Handle moving to the next relay log file, if event group to be retried
   spans multiple relay log files.

 - Handle refcounting of relay log files, to ensure that we do not purge a
   relay log file and then later attempt to re-execute events out of it.

 - Handle description_event_for_exec - we need to save this somehow for the
   possible retry - and use the correct one in case it differs between relay
   logs.

 - Do another retry attempt in case the first retry also fails.

 - Limit the max number of retries.

 - Lots of testing will be needed for the various edge cases.
2014-05-08 14:20:18 +02:00
Venkatesh Duggirala
3b43142623 Bug#17638477 UNINSTALL AND INSTALL SEMI-SYNC PLUGIN CAUSES SLAVES TO BREAK
Fixing post push failure
2014-05-07 14:33:58 +05:30