In SBR, if a statement does not fail, it is always written to the binary
log, regardless if rows are changed or not. If there is a failure, a
statement is only written to the binary log if a non-transactional (.e.g.
MyIsam) engine is updated.
INSERT ON DUPLICATE KEY UPDATE and INSERT IGNORE were not following the
rule above and were not written to the binary log, if then engine was
Innodb.
There are two calls to read_log_event() on master in mysql_binlog_send().
Each call reads 19 bytes in this test case and the error of the second
read_log_event() is reported to the slave.
The second read_log_event() starts from position 94 (75 + 19) to 113
(75 + 19 + 19). Usually, there are two events in the binary log:
. 0 - 3 - Header
. 4 - 105 - Format Descriptor Event
. 106 - 304 - Query Event
and both reads fail because operations are reading from invalid positions
as expected.
However, mysql_binlog_send() does not use the same IO_CACHE that is used to
write into binary log (i.e. mysql_bin_log.log_file) for the hot binary log.
It opens the binary log file directly by calling open_binlog() and creates a
separated IO_CACHE. So there is a possibly that after a master has flushed
the binary log file, the content has been cached by the filesystem, and has
not updated the disk file. If this happens, then a slave will only see part
of the file, and thus the second read_log_event() will report event truncated
error.
To fix the problem, if the first read_log_event() has failed, we ensure that
the second one will try to read from the same position.
rpl_packet got a timeout failure sporadically on PB when stopping
slave. The real reason of this bug is that STOP SLAVE stopped
IO thread first and then stopped SQL thread. It was
possible that IO thread stopped after replicating part of a
transaction which SQL thread was executing. SQL thread would
be hung if the transaction could not be rolled back safely.
After this patch, STOP SLAVE will stop SQL thread first and then stop IO
thread, which guarantees that IO thread will fetch the reset of the
events of the transaction that SQL thread is executing, so that SQL
thread can finish the transaction if it cannot be rolled back safely.
Added below auxiliary files to make the test code neater.
restart_slave_sql.inc
rpl_connection_master.inc
rpl_connection_slave.inc
rpl_connection_slave1.inc
On windows, an #endif in a wrong place was causing an early
return from mysql_load and thus the LOAD DATA LOCAL was not
executed. This problem was fixed by moving the #endif to the
right place.
The following code was missing
if ((stat_info.st_mode & S_IFIFO) == S_IFIFO)
is_fifo = 1;
which is required to properly configure and read from the
IO_CACHE when a named pipe is used. So it was re-introduced
before the #endif.
The test started failing on the same day patch for BUG 49978 was
pushed. BUG 49978 changed part of the replication testing
infrastructure in mysql-test-run. This caused the test to fail
sporadically with result differences on relay log file
names. When the test fails the relay-log filenames are shifted by
one, eg:
-show relaylog events in 'slave-relay-bin.000002' from <binlog_start>;
+show relaylog events in 'slave-relay-bin.000003' from <binlog_start>;
The problem was caused by a bad cleanup when using the include
files:
- include/setup_fake_relay_log.inc
- include/cleanup_fake_relay_log.inc
Which would leave a spurious relay-log file around (not listed in
slave-relay-bin.index), causing the server to shift the name of
the relay logs by one, even if cleaning up with RESET SLAVE.
We fix this by removing the relay-log file when it is not needed
anymore, ie at setup time and after recreating the fake relay-log
index.
Additionally, to make the affected test more resilient, we
deployed a call to rpl_reset.inc (which resets both master and
slave, including log files) before actually running the test
case.
Finally, appart from the reported bug, we also fix: (a) an
unrelated issue with the failing test itself - in some cases, the
test was not setting the log file name to use when it should;
(b) one typo.
Problem: master executed a statement that would fail on slave
(namely, DROP USER 'create_rout_db'@'localhost').
Then the test did:
--let $rpl_only_running_threads= 1
--source include/rpl_reset.inc
rpl_reset.inc calls rpl_sync.inc, which first checks which of
the threads are running and then syncs those threads that are
running. If the SQL thread fails after the check, the sync will
fail. So there was a race in the test and it failed on some
slow hosts.
Fix: Don't replicate the failing statement.
Normally, auto_increment value is generated for the column by
inserting either NULL or 0 into it. NO_AUTO_VALUE_ON_ZERO
suppresses this behavior for 0 so that only NULL generates
the auto_increment value. This behavior is also followed by
a slave, specifically by the SQL Thread, when applying events
in the statement format from a master. However, when applying
events in the row format, the flag was ignored thus causing
an assertion failure:
"Assertion failed: next_insert_id == 0, file .\handler.cc"
In fact, we never need to generate a auto_increment value for
the column when applying events in row format on slave. So we
don't allow it to happen by using 'MODE_NO_AUTO_VALUE_ON_ZERO'.
Refactoring: Get rid of all the sql_mode checks to rows_log_event
when applying it for avoiding problems caused by the inconsistency
of the sql_mode on slave and master as the sql_mode is not set for
Rows_log_event.
Normally, auto_increment value is generated for the column by
inserting either NULL or 0 into it. NO_AUTO_VALUE_ON_ZERO
suppresses this behavior for 0 so that only NULL generates
the auto_increment value. This behavior is also followed by
a slave, specifically by the SQL Thread, when applying events
in the statement format from a master. However, when applying
events in the row format, the flag was ignored thus causing
an assertion failure:
"Assertion failed: next_insert_id == 0, file .\handler.cc"
In fact, we never need to generate a auto_increment value for
the column when applying events in row format on slave. So we
don't allow it to happen by using 'MODE_NO_AUTO_VALUE_ON_ZERO'.
Refactoring: Get rid of all the sql_mode checks to rows_log_event
when applying it for avoiding problems caused by the inconsistency
of the sql_mode on slave and master as the sql_mode is not set for
Rows_log_event.
Major replication test framework cleanup. This does the following:
- Ensure that all tests clean up the replication state when they
finish, by making check-testcase check the output of SHOW SLAVE STATUS.
This implies:
- Slave must not be running after test finished. This is good
because it removes the risk for sporadic errors in subsequent
tests when a test forgets to sync correctly.
- Slave SQL and IO errors must be cleared when test ends. This is
good because we will notice if a test gets an unexpected error in
the slave threads near the end.
- We no longer have to clean up before a test starts.
- Ensure that all tests that wait for an error in one of the slave
threads waits for a specific error. It is no longer possible to
source wait_for_slave_[sql|io]_to_stop.inc when there is an error
in one of the slave threads. This is good because:
- If a test expects an error but there is a bug that causes
another error to happen, or if it stops the slave thread without
an error, then we will notice.
- When developing tests, wait_for_*_to_[start|stop].inc will fail
immediately if there is an error in the relevant slave thread.
Before this patch, we had to wait for the timeout.
- Remove duplicated and repeated code for setting up unusual replication
topologies. Now, there is a single file that is capable of setting
up arbitrary topologies (include/rpl_init.inc, but
include/master-slave.inc is still available for the most common
topology). Tests can now end with include/rpl_end.inc, which will clean
up correctly no matter what topology is used. The topology can be
changed with include/rpl_change_topology.inc.
- Improved debug information when tests fail. This includes:
- debug info is printed on all servers configured by include/rpl_init.inc
- User can set $rpl_debug=1, which makes auxiliary replication files
print relevant debug info.
- Improved documentation for all auxiliary replication files. Now they
describe purpose, usage, parameters, and side effects.
- Many small code cleanups:
- Made have_innodb.inc output a sensible error message.
- Moved contents of rpl000017-slave.sh into rpl000017.test
- Added mysqltest variables that expose the current state of
disable_warnings/enable_warnings and friends.
- Too many to list here: see per-file comments for details.
Post-push fixes:
- fixed platform dependent result files
- appeasing valgrind warnings:
Fault injection was also uncovering a previously existing
potential mem leaks. For BUG#46166 testing purposes, fixed
by forcing handling the leak when injecting faults.
Manual merge from mysql-5.1-bugteam into mysql-5.5-bugteam.
Conflicts
=========
Text conflict in sql/log.cc
Text conflict in sql/log.h
Text conflict in sql/slave.cc
Text conflict in sql/sql_parse.cc
Text conflict in sql/sql_priv.h
When a query fails with a different error on the slave,
the sql thread outputs a message (M) containing:
1. the error message format for the master error code
2. the master error code
3. the error message for the slave's error code
4. the slave error code
Given that the slave has no information on the error message
itself that the master outputs, it can only print its own
version of the message format (but stripped from the
additional data if the message format requires). This may
confuse users.
To fix this we augment the slave's message (M) to explicitly
state that the master's message is actually an error message
format, the one associated with the given master error code
and that the slave server knows about.
when generating new name.
If find_uniq_filename returns an error, then this error is not
being propagated upwards, and execution does not report error to
the user (although a entry in the error log is generated).
Additionally, some more errors were ignored in new_file_impl:
- when writing the rotate event
- when reopening the index and binary log file
This patch addresses this by propagating the error up in the
execution stack. Furthermore, when rotation of the binary log
fails, an incident event is written, because there may be a
chance that some changes for a given statement, were not properly
logged. For example, in SBR, LOAD DATA INFILE statement requires
more than one event to be logged, should rotation fail while
logging part of the LOAD DATA events, then the logged data would
become inconsistent with the data in the storage engine.
It is not necessary to support INSERT DELAYED for a single value insert,
while we do not support that for multi-values insert when binlog is
enabled in SBR.
The lock_type is upgrade to TL_WRITE from TL_WRITE_DELAYED for
INSERT DELAYED for single value insert as multi-values insert
did when binlog is enabled. Then it's safe. And binlog it as
INSERT without DELAYED.
Text conflict in mysql-test/suite/perfschema/r/dml_setup_instruments.result
Text conflict in mysql-test/suite/perfschema/r/global_read_lock.result
Text conflict in mysql-test/suite/perfschema/r/server_init.result
Text conflict in mysql-test/suite/perfschema/t/global_read_lock.test
Text conflict in mysql-test/suite/perfschema/t/server_init.test
bug #57006 "Deadlock between HANDLER and FLUSH TABLES WITH READ
LOCK" and bug #54673 "It takes too long to get readlock for
'FLUSH TABLES WITH READ LOCK'".
The first bug manifested itself as a deadlock which occurred
when a connection, which had some table open through HANDLER
statement, tried to update some data through DML statement
while another connection tried to execute FLUSH TABLES WITH
READ LOCK concurrently.
What happened was that FTWRL in the second connection managed
to perform first step of GRL acquisition and thus blocked all
upcoming DML. After that it started to wait for table open
through HANDLER statement to be flushed. When the first connection
tried to execute DML it has started to wait for GRL/the second
connection creating deadlock.
The second bug manifested itself as starvation of FLUSH TABLES
WITH READ LOCK statements in cases when there was a constant
stream of concurrent DML statements (in two or more
connections).
This has happened because requests for protection against GRL
which were acquired by DML statements were ignoring presence of
pending GRL and thus the latter was starved.
This patch solves both these problems by re-implementing GRL
using metadata locks.
Similar to the old implementation acquisition of GRL in new
implementation is two-step. During the first step we block
all concurrent DML and DDL statements by acquiring global S
metadata lock (each DML and DDL statement acquires global IX
lock for its duration). During the second step we block commits
by acquiring global S lock in COMMIT namespace (commit code
acquires global IX lock in this namespace).
Note that unlike in old implementation acquisition of
protection against GRL in DML and DDL is semi-automatic.
We assume that any statement which should be blocked by GRL
will either open and acquires write-lock on tables or acquires
metadata locks on objects it is going to modify. For any such
statement global IX metadata lock is automatically acquired
for its duration.
The first problem is solved because waits for GRL become
visible to deadlock detector in metadata locking subsystem
and thus deadlocks like one in the first bug become impossible.
The second problem is solved because global S locks which
are used for GRL implementation are given preference over
IX locks which are acquired by concurrent DML (and we can
switch to fair scheduling in future if needed).
Important change:
FTWRL/GRL no longer blocks DML and DDL on temporary tables.
Before this patch behavior was not consistent in this respect:
in some cases DML/DDL statements on temporary tables were
blocked while in others they were not. Since the main use cases
for FTWRL are various forms of backups and temporary tables are
not preserved during backups we have opted for consistently
allowing DML/DDL on temporary tables during FTWRL/GRL.
Important change:
This patch changes thread state names which are used when
DML/DDL of FTWRL is waiting for global read lock. It is now
either "Waiting for global read lock" or "Waiting for commit
lock" depending on the stage on which FTWRL is.
Incompatible change:
To solve deadlock in events code which was exposed by this
patch we have to replace LOCK_event_metadata mutex with
metadata locks on events. As result we have to prohibit
DDL on events under LOCK TABLES.
This patch also adds extensive test coverage for interaction
of DML/DDL and FTWRL.
Performance of new and old global read lock implementations
in sysbench tests were compared. There were no significant
difference between new and old implementations.