Problem:
========
1) DROP TABLE queries are re-generated by the server
before writing the events (queries) into the binlog,
for various reasons. If the table name/db name contains
non-regular characters (such as Latin characters),
the generated query is wrong and hence breaks
replication.
2) In the edge case where the table name/db name contains
64 characters, the server throws an assertion failure:
assert(M_TBLLEN < 128)
3) In the edge case where the db name contains 64 Latin
characters, the binlog content is interpreted badly,
which leads to replication failure.
Analysis & Fix :
================
1) The parser reads the table name from the query, converts
it to the standard character set (utf8), and stores it in the table_name variable.
When the DROP TABLE query is regenerated from the same table_name
variable, it should be converted back to the original character set
from the standard character set (utf8). (A reproduction sketch follows this list.)
2) A Latin character takes two bytes per character. The identifier
limit is 64 characters and SYSTEM_CHARSET_MBMAXLEN is set to 3,
so the table name/db name may occupy up to 3 * 64 bytes.
Hence the assert is changed to
(M_TBLLEN <= NAME_CHAR_LEN*SYSTEM_CHARSET_MBMAXLEN)
3) db_len in the binlog event header takes 1 byte and
ranges from 0 to 192 bytes (3 * 64).
While reading db_len from the event, the server
was casting it to uint instead of uchar, which led
to a bad db_len. This problem is fixed by changing the
cast type to uchar.
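A minimal reproduction sketch of case 1 (the table name is hypothetical;
any identifier containing non-ASCII Latin characters exercises the same
regeneration path):
SET NAMES utf8;
CREATE TABLE `tëst_tàble` (a INT);
DROP TABLE `tëst_tàble`;
-- before the fix, the DROP TABLE statement regenerated by the server for
-- the binlog carried the identifier in the wrong character set, so the
-- slave failed to find the table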
Problem & Analysis: If a DML statement invokes a trigger or a
stored function that inserts into an AUTO_INCREMENT column,
that DML has to be marked as an 'unsafe' statement. If the
tables are locked in the transaction prior to the DML statement
(using LOCK TABLES), then the same statement is not marked as
'unsafe'. The unsafeness check
is guarded by if (!thd->locked_tables_mode). Hence if
we lock the tables prior to the DML statement, execution does *not* enter
this if condition, and the statement is not marked
as unsafe.
Fix: Irrespective of the locked_tables_mode value, the unsafeness
check should be done. With this patch, the code is moved
into the decide_logging_format() function, where all these checks
happen, and without the if (!thd->locked_tables_mode) guard.
Along with the case specified in the bug scenario
(BINLOG_STMT_UNSAFE_AUTOINC_COLUMNS), we also identified that the
other cases BINLOG_STMT_UNSAFE_AUTOINC_NOT_FIRST,
BINLOG_STMT_UNSAFE_WRITE_AUTOINC_SELECT and BINLOG_STMT_UNSAFE_INSERT_TWO_KEYS
were also guarded by thd->locked_tables_mode, which is not right. All
of those checks are also moved to the decide_logging_format() function.
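A hedged sketch of the scenario (table and trigger names are hypothetical);
before the patch the final INSERT was not marked as unsafe
(BINLOG_STMT_UNSAFE_AUTOINC_COLUMNS) because it ran under LOCK TABLES:
CREATE TABLE t1 (a INT);
CREATE TABLE t2 (id INT AUTO_INCREMENT PRIMARY KEY, a INT);
CREATE TRIGGER trg AFTER INSERT ON t1
FOR EACH ROW INSERT INTO t2 (a) VALUES (NEW.a);
LOCK TABLES t1 WRITE;
INSERT INTO t1 VALUES (1); -- inserts into t2's AUTO_INCREMENT column via the trigger
UNLOCK TABLES;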
BINLOGGED INCORRECTLY - BREAKS A SLAVE
Submitted an incomplete patch with my previous push;
re-submitting the extra changes required to make
the patch complete.
Analysis:
In row based replication, the Master does not send temporary table information
to the Slave. If a DDL statement involves both a regular table (which needs
to be sent to the Slave) and a temporary table (which will not be available on the Slave),
the Master rewrites the query, replacing the temporary table with its definition.
Eg: create table regular_table like temptable.
In the rewrite logic, the server ignores the database of the regular table,
which can cause the problems mentioned in this bug.
Fix: do not ignore the database information (if available) while
rewriting the query.
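A hedged sketch of the problematic rewrite (database and table names are
hypothetical); before the fix the database qualifier of the regular table
was dropped from the rewritten statement:
CREATE TEMPORARY TABLE tmp1 (a INT);
CREATE TABLE other_db.regular_table LIKE tmp1;
-- the rewritten statement sent to the slave omitted the `other_db.`
-- qualifier, so the table could end up in the wrong database on the slave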
special character sets like utf16, utf32, ucs2.
Analysis: The MySQL server does not support a few special character sets,
such as utf16, utf32 and ucs2, as the client's character set.
This is a known limitation listed in the documentation:
http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html.
The default value for the default-character-set parameter is 'auto',
which means that if the server's character set is not supported as the
client's character set, then the server automatically changes the client's
character set to a predefined character set, which is 'latin1' in the current code.
Eg:
$ ./mysql -uroot -S$SOCKET_FILE --default-character-set=utf16
ERROR 1231 (42000): Variable 'character_set_client' can't be set to the value of 'utf16'
$ ./mysql -uroot -S$SOCKET_FILE connects successfully to the
server with 'latin1' as the default client-side character set.
When the IO thread tries to connect to the Master, it sets the server's character
set as the client's character set. When the Slave server is started with one of these
special character sets, the IO thread (which is like a regular client connection to the
Master) fails because of the above limitation.
Fix: Now the IO thread behaves the same way a regular client behaves,
i.e., if the server's character set is not supported as the client's character set,
then the default client character set (latin1) is set as the client's character set.
Fix:
===
Backport of Bug#11756194 to mysql-5.5: slave breaks if
'drop database' fails on master and tables are mismatched on
the slave.
'DROP TABLE <deleted tables>' was binlogged when
'DROP DATABASE' failed and at least one table was deleted
from the database. The logged event would cause the slave SQL thread
to stop if some of the tables did not exist on the slave.
After this patch, the statement is always binlogged with the 'IF EXISTS'
option.
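A hedged sketch (object names are hypothetical, and the exact generated
text may differ): if DROP DATABASE db1 removes t1 and then fails, the
statement written to the binlog is now of the form
DROP TABLE IF EXISTS `t1`
-- so a slave on which t1 never existed no longer stops the SQL thread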
The test case makes use of the fine DEBUG_SYNC facility. Furthermore,
since it needs synchronization on internal threads (dump and SQL
threads) the server code has DEBUG_SYNC commands internally deployed
and activated through the DBUG_EXECUTE_IF macro. The internal
DEBUG_SYNC commands are then controlled from the test case through the
DEBUG variable.
There were three problems around the DEBUG + DEBUG_SYNC facility
usage:
1. When signaling the SQL thread to continue, the test would
immediately reset the DEBUG_SYNC variable. This could mean that the SQL
thread might lose the signal and continue to wait forever;
2. A similar scenario was happening with the dump thread on the
master. This thread was instructed to wait, and later it would be
signaled to continue, but immediately afterwards the DEBUG_SYNC would be
reset. This could lead to the dump thread missing the signal and
waiting forever;
3. The test was not cleaning itself up with respect to the
instrumentation of the dump thread. This would leave the
conditional execution of an internal DEBUG_SYNC command active
(through the usage of DBUG_EXECUTE_IF).
We fix #1 and #2 by waiting for the threads to receive the signal and
only then issuing the reset. We fix #3 by resetting the DEBUG variable,
thus deactivating the dump thread's internal DEBUG_SYNC command.
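A hedged sketch of the corrected signalling pattern on the test side
(the signal names are hypothetical); the test now waits for an
acknowledgement before clearing DEBUG_SYNC instead of resetting right
after signalling:
SET DEBUG_SYNC= 'now SIGNAL signal.continue';
SET DEBUG_SYNC= 'now WAIT_FOR signal.continued'; -- wait until the instrumented thread has consumed the signal
SET DEBUG_SYNC= 'RESET';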
SHOW PROCESSLIST, SHOW BINLOGS
Problem: A deadlock was occurring when 4 threads were
involved in acquiring locks in the following way.
Thread 1: Dump thread (the slave is reconnecting, so on the
master a new dump thread is trying to kill
zombie dump threads). It acquired a thread's
LOCK_thd_data and is about to acquire
mysys_var->current_mutex (which is LOCK_log).
Thread 2: Application thread executing SHOW BINLOGS; it
acquired LOCK_log and is about to acquire
LOCK_index.
Thread 3: Application thread executing PURGE BINARY LOGS; it
acquired LOCK_index and is about to
acquire LOCK_thread_count.
Thread 4: Application thread executing SHOW PROCESSLIST; it
acquired LOCK_thread_count and is
about to acquire the zombie dump thread's
LOCK_thd_data.
Deadlock Cycle:
Thread 1 -> Thread 2 -> Thread 3 -> Thread 4 -> Thread 1
The same deadlock was observed even when Thread 4 was
executing a 'SELECT * FROM information_schema.processlist' command and
had acquired LOCK_thread_count and was about to acquire the zombie
dump thread's LOCK_thd_data.
Analysis:
There are four locks involved in the deadlock. LOCK_log,
LOCK_thread_count, LOCK_index and LOCK_thd_data.
LOCK_log, LOCK_thread_count and LOCK_index are global mutexes,
whereas LOCK_thd_data is local to a thread.
We can divide these four locks in two groups.
Group 1 consists of LOCK_log and LOCK_index and the order
should be LOCK_log followed by LOCK_index.
Group 2 consists of other two mutexes
LOCK_thread_count, LOCK_thd_data and the order should
be LOCK_thread_count followed by LOCK_thd_data.
Unfortunately, there is no predefined lock order to follow
in the MySQL system when it comes to locks across these
two groups. In the above problematic example,
there is no problem in the way the locks are acquired
if you look at each thread individually,
but if you combine all 4 threads, they end up in a deadlock.
Fix:
Since everything seems to be fine in the way the threads take locks,
this patch changes the duration of the locks in Thread 4
to break the deadlock. That is, before the patch, Thread 4's
('show processlist' command) mysqld_list_processes()
function acquired LOCK_thread_count for the complete duration
of the function and also acquired/released
each thread's LOCK_thd_data.
LOCK_thread_count is used to protect addition and
deletion of threads in the global threads list. While show
processlist is looping through all the existing threads,
it is a problem if a thread exits, but there is no problem
if a new thread is added to the system. Hence a new mutex,
"LOCK_thd_remove", is introduced, which protects deletion
of a thread from the global threads list. All threads which are
exiting should acquire LOCK_thd_remove
followed by LOCK_thread_count. (They should take LOCK_thread_count
as well because other places in the code still assume that thread exit
is protected by LOCK_thread_count. In this fix, we are changing
only the 'show processlist' query logic.)
(Eg: the unlink_thd logic will be protected with
LOCK_thd_remove.)
The logic of mysqld_list_processes (or fill_schema_processlist)
is now protected with 'LOCK_thd_remove' instead of
'LOCK_thread_count'.
Now the new locking order after this patch is:
LOCK_thd_remove -> LOCK_thd_data -> LOCK_log ->
LOCK_index -> LOCK_thread_count
Problem: Uninstallation of the semisync plugin causes replication to
break.
Analysis: Semisync-enabled replication is a mutual agreement between
Master and Slave established when the connection (I/O thread) is created.
Once the I/O thread is started and semisync is enabled on both
master and slave, the master appends a special magic header to events
using semisync plugin functions and sends them to the slave, and the slave
expects each event to have that special magic header format
and reads those bytes using semisync plugin functions.
When semisync replication is in use, if a user
uninstalls the plugin on the master, the slave gets confused while
interpreting the event's content because it expects the special
magic header at the beginning of the event. The slave SQL thread
stops with a "Missing magic number in the header" error.
A similar problem happens if the plugin is uninstalled
on the slave while semisync replication is in use. The master sends
the events with the magic header, and the slave does not know about the
added magic header, so it thinks that it received a corrupted event.
Hence the slave SQL thread stops with a "Found corrupted event" error.
Fix: Uninstallation of the semisync plugin is blocked when semisync
replication is in use, and throws an 'ER_UNKNOWN_ERROR' error.
To detect that semisync replication is in use, this patch uses the
semisync status variable values (a sketch follows the list below).
> On Master, it checks for 'Rpl_semi_sync_master_status' to be OFF
before allowing the uninstallation of rpl_semi_sync_master plugin.
>> Rpl_semi_sync_master_status is OFF when
>>> there is no dump thread running
>>> there are no semisync slaves
> On Slave, it checks for 'Rpl_semi_sync_slave_status' to be OFF
before allowing the uninstallation of rpl_semi_sync_slave plugin.
>> Rpl_semi_sync_slave_status is OFF when
>>> there is no I/O thread running
>>> replication is asynchronous.
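A hedged sketch of the new behaviour on the master (the exact error text
may vary):
SHOW STATUS LIKE 'Rpl_semi_sync_master_status'; -- ON while a semisync slave is connected
UNINSTALL PLUGIN rpl_semi_sync_master; -- now rejected with ER_UNKNOWN_ERROR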
BREAKS RBR
Analysis:
--------
A table created using a query of the format:
CREATE TABLE t1 AS SELECT REPEAT('A',1000) DIV 1 AS a;
breaks Row Based Replication.
The query above creates a table having a field of datatype
'bigint' with a display width of 3000 which is beyond the
maximum acceptable value of 255.
In RBR mode, a CREATE TABLE ... SELECT statement is
replicated as a combination of a CREATE TABLE statement
equivalent to the one returned by SHOW CREATE TABLE, and
row events for the rows inserted. When this CREATE TABLE event
is executed on the slave, an error is reported:
Display width out of range for column 'a' (max = 255)
The following is the output of 'SHOW CREATE TABLE t1':
CREATE TABLE t1(`a` bigint(3000) DEFAULT NULL)
ENGINE=InnoDB DEFAULT CHARSET=latin1;
The problem is due to the combination of two facts:
1) The above CREATE TABLE SELECT statement uses the display
width of the result of DIV operation as the display width
of the column created without validating the width for out
of bound condition.
2) The DIV operation incorrectly returns the length of its first
argument as the display width of its result; thus allowing
creation of a table with an incorrect display width of 3000
for the field.
Fix:
----
This fix changes the DIV operation implementation to correctly
evaluate the display width of its result. We check whether the
estimated width of DIV's result exceeds the maximum width for an integer
value (21) and, if so, set it to this maximum value.
This patch also fixes the maximum display width evaluation
for the DIV function when its first argument is in UCS2.
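With the fix, the display width for the same table is capped at the
integer maximum; a hedged sketch of the expected SHOW CREATE TABLE output
(surrounding table options may differ):
CREATE TABLE t1 (`a` bigint(21) DEFAULT NULL)
ENGINE=InnoDB DEFAULT CHARSET=latin1;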
This worklog aims at testing the two following scenarios:
1) Whenever the mysql_binlog_send method (dump thread)
reaches the end of file when reading events from the binlog, before
checking if it should wait for more events, there was a test to
check if the file being read was still active, i.e., whether it was the last
known binlog. However, it was possible that something was written to
the binary log and then a rotation happened after EOF was
detected and before the check for 'active' was performed. In this
case, the end of the binary log would not be read by the dump
thread, and this would cause the slave to lose updates.
This test verifies that the problem has been fixed. It waits during
this window while forcing a rotation in the binlog.
2) Verify that the dump thread can send events in the active file
correctly after encountering an IO error.
DURING INNODB RECOVERY
Problem:
=======
The connection 'master' is dropped by mysqltest after
rpl_end.inc. At this point, the dropping of temporary tables
on the connection 'master' has not been synced to the slave.
So, the temporary tables replicated from the master remain
on the slave, leading to an inconsistent close of the test.
The following test thus complains about the presence of
temporary table(s) left over from the previous test.
Fix:
===
- Put explicit drop commands in the replication tests so
that the temporary tables are dropped on the slave as well.
- Added a check on Slave_open_temp_tables in
mtr_check.sql to warn about any remaining temporary
tables at the close of a test (a sketch follows).
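A hedged sketch of the kind of check added (the exact query used in
mtr_check.sql may differ):
SHOW STATUS LIKE 'Slave_open_temp_tables'; -- expected to report 0 at the end of a test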
IN TIME RECOVERY FAILURE ON SLAVES
Problem:
DROP TEMPORARY TABLE IF EXISTS commands can cause point
in time recovery (re-applying the binlog) failures.
Analysis:
In RBR, 'DROP TEMPORARY TABLE' commands are
always binlogged with 'IF EXISTS' clauses added.
Also, the slave SQL thread does not check replicate.* filter
rules for "DROP TEMPORARY TABLE IF EXISTS" queries.
If log-slave-updates is enabled on the slave, these queries
will be binlogged in the format "USE `db`;
DROP TEMPORARY TABLE IF EXISTS `t1`;" irrespective
of the filtering rules and irrespective of whether `db` exists.
When users try to recover a slave from its own binlog, the
'USE `db`' command might fail if `db` is not present on the slave.
Fix:
At the time of writing the 'DROP TEMPORARY TABLE
IF EXISTS' query into the binlog, 'use `db`' will not be
present and the table name in the query will be a fully
qualified table name.
Eg:
'USE `db`; DROP TEMPORARY TABLE IF EXISTS `t1`;'
will be logged as
'DROP TEMPORARY TABLE IF EXISTS `db`.`t1`;'.
--BINLOG-IGNORE-DB AND FULLY QUALIFIED TABLE
Problem:
=======
An ALTER TABLE statement is not written to the binlog if the server is
started with "--binlog-ignore-db <some database>" and 'fully
qualified' table names are used in the ALTER TABLE statement,
altering a table different from the current database context.
Analysis:
========
The above mentioned problem affects not only "ALTER TABLE"
statements but all kinds of statements. Once the
current default database becomes "NULL", none of the
statements is binlogged.
The current behaviour is such that if the user has specified
restrictions on which databases need to be replicated and the
default db is not specified, then nothing is replicated.
This means that "NULL" is considered to be equivalent to
everything (default db = NULL implies ignore, i.e. do not log the
statement).
Fix:
===
"NULL" should not be considered as equivalent to everything.
Since the filtering criteria is not equal to "NULL" the
statement should be logged into binlog.
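A hedged sketch of the failing scenario (names are hypothetical); the
server is started with --binlog-ignore-db=ignored_db and the client has
no default database selected:
CREATE DATABASE db1;
CREATE TABLE db1.t1 (a INT);
ALTER TABLE db1.t1 ADD COLUMN b INT;
-- before the fix none of these statements were written to the binlog;
-- after the fix they are logged, since db1 does not match the ignore rule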
PLATFORM= MACOSX10.6 X86_64 MAX
Problem: The test was failing on pb2's Mac machine because
it was not cleaned up properly. The test checks whether
the command 'start slave until' throws a proper
error when issued with a wrong number/type of
parameters. After this, the replication stream was
stopped using the include file 'rpl_end.inc'.
The errors thrown earlier left the slave in an
inconsistent state to be closed by the include
file, which was caught by the Mac machine.
Fix: Started the slave by invoking start_slave.inc to have a
working slave before calling rpl_reset.inc.
Problem: The test file was not in good shape. It tested
the start slave until relay log file/pos combination
incorrectly. A couple of commands were executed on the
master and replicated to the slave. Next, the
coordinates in terms of relay log file and pos
were noted down, followed by reset slave and start
slave until the saved relay log file/pos. Reset slave
deletes all relay log files and makes the slave
forget its replication position. So, using the
saved coordinates after reset slave is wrong.
Fix: Split the test in two parts:
a) Test for start slave until master log file/pos and
checking for correct errors in the failure
scenarios.
b) Test for start slave until relay log file/pos.
Problem: The variables auto_increment_increment and
auto_increment_offset were set in the include
file rpl_init.inc. This was only configured for
some connections that are rarely used by test
cases, so it was likely to cause confusion.
If replication tests want to set up these variables,
they should do so explicitly.
Fix:
a) Removed the code that set the variables
auto_increment_increment and auto_increment_offset
in the include file.
b) Updated the test files that used them accordingly (see the
sketch below).
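A hedged sketch of a test setting the variables explicitly (values are
arbitrary):
SET @@global.auto_increment_increment = 2;
SET @@global.auto_increment_offset = 1;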
post push fix:
rpl_stm_until.test was disabled because of
this bug. Enabled and fixed it.
Removed a part of the test that was obsolete:
it tested replication from a 4.0 master to a 5.0
slave.
DOWNGRADED FROM 5.6.11 TO 5.6.10
The problem was that the new syntax is not accepted by previous versions.
Fixed by adding a version comment of /*!50531 around the
new syntax.
Like this in the .frm file:
'PARTITION BY KEY /*!50611 ALGORITHM = 2 */ () PARTITIONS 3'
and also changing the output from SHOW CREATE TABLE to:
CREATE TABLE t1 (a INT)
/*!50100 PARTITION BY KEY */ /*!50611 ALGORITHM = 1 */ /*!50100 ()
PARTITIONS 3 */
It will always add the ALGORITHM into the .frm for KEY [sub]partitioned
tables, but for SHOW CREATE TABLE it will only add it when it is the
non-default ALGORITHM = 1.
Also notice that for 5.5, it will say /*!50531 instead of /*!50611, which
will make an upgrade from 5.5 > 5.5.31 to 5.6 < 5.6.11 fail!
If one downgrades a fixed version to the same major version (5.5 or 5.6),
bug 14521864 will be visible again, but unless the .frm is updated, it will
work again when upgrading again.
Also fixed so that the .frm does not get an updated version
if a single partition check passes.
Due to an internal change in the server code between 5.1 and 5.5
(wl#2649) the hash function used in KEY partitioning changed
for numeric and date/time columns (from binary hash calculation
to character based hash calculation).
Also enum/set changed from latin1 ci based hash calculation to
binary hash between 5.1 and 5.5 (bug#11759782).
These changes make KEY [sub]partitioned tables on any of
the affected column types incompatible with 5.5 and above,
since the calculation of the partition id differs.
Also, since InnoDB asserts that a deleted row was previously
read (positioned), the server asserts on delete of a row that
is in the wrong partition.
The solution for this situation is:
1) The partitioning engine will check that a delete/update goes to the
partition the row was read from and give an error otherwise, consisting
of the row's partitioning fields. This will avoid asserts in InnoDB and
also alert the user that there is a misplaced row. A detailed error
message will be given, including an entry in the error log consisting
of the table name, partition and row content (PK if it exists, otherwise
all partitioning columns).
2) A new optional syntax for KEY () partitioning in 5.5 is allowed:
[SUB]PARTITION BY KEY [ALGORITHM = N] (list_of_cols)
Where N = 1 uses the same hashing as 5.1 (numeric/date/time fields use
binary hashing, ENUM/SET use charset hashing) and N = 2 uses the same
hashing as 5.5 (numeric/date/time fields use charset hashing,
ENUM/SET use binary hashing). If not set on CREATE/ALTER it will
default to 2.
This new syntax should probably be ignored by NDB.
3) Since there is a demand for avoiding a scan through the full
table, during upgrade the ALTER TABLE t PARTITION BY ... command is
considered a no-op (only a .frm change) if everything except ALGORITHM
is the same and ALGORITHM was not set before, which allows manually
upgrading such a table by something like:
ALTER TABLE t PARTITION BY KEY ALGORITHM = 1 () or
ALTER TABLE t PARTITION BY KEY ALGORITHM = 2 ()
4) Enhanced partitioning with CHECK/REPAIR to also check for/repair
misplaced rows. (Also works for ALTER TABLE t CHECK/REPAIR PARTITION)
CHECK FOR UPGRADE:
If the .frm version is < 5.5.3
and uses KEY [sub]partitioning
and an affected column type
then it will fail with a message:
KEY () partitioning changed, please run:
ALTER TABLE `test`.`t1` PARTITION BY KEY ALGORITHM = 1 (a)
PARTITIONS 12
(i.e. current partitioning clause, with the addition of
ALGORITHM = 1)
CHECK without FOR UPGRADE:
if MEDIUM (default) or EXTENDED options are given:
Scan all rows and verify that each row is in the correct partition.
Fail on the first misplaced row.
REPAIR:
if default or EXTENDED (i.e. not QUICK/USE_FRM):
Scan all rows and move every misplaced row into its correct
partition.
5) Updated mysqlcheck (called by mysql_upgrade) to handle the
new output from CHECK FOR UPGRADE, to run the ALTER statement
instead of running REPAIR.
This will allow mysql_upgrade (or CHECK TABLE t FOR UPGRADE) to upgrade
a KEY [sub]partitioned table that has any affected field type
and a .frm version < 5.5.3 to ALGORITHM = 1 without rebuild.
Also notice that if the .frm has a version >= 5.5.3 and ALGORITHM
is not set, it is not possible to know if it contains rows from
5.1 or 5.5! In these cases I suggest that the user does:
(optional)
LOCK TABLE t WRITE;
SHOW CREATE TABLE t;
(verify that it has no ALGORITHM = N and, to be safe, I would suggest
backing up the .frm file, to be used if one needs to change to another
ALGORITHM = N without needing to rebuild/repair)
ALTER TABLE t <old partitioning clause, but with ALGORITHM = N>;
which should set the ALGORITHM to N (if the table has rows from
5.1 I would suggest N = 1, otherwise N = 2)
CHECK TABLE t;
(here one could use the backed up .frm instead and change to a new N
and run CHECK again and see if it passes)
and if there are misplaced rows:
REPAIR TABLE t;
(optional)
UNLOCK TABLES;
RPL_ROTATE_LOGS has been failing sporadically in what seems to be a
problem related to routines that update the coordinates. However,
the test lacks proper assert statements, and because of this the
debug information upon failure simply points to the content
mismatch between the test and the result file.
Not as a solution, but as an improvement to the test to better
debug this failure, new assert statements were added to the test.
@rpl_rotate_logs.test
Added new assert statements, reducing the
dependency on the result file.
@rpl_rotate_logs.result
Added new content to the result file to
match the test changes.
Problem: The problem with the test is that the slave returns
from the start_slave.inc call too early, before the list
is actually updated. This caused the slave's stale
data to be reported.
Fix: Added a wait in the test until the slave's IO status
changes to "Waiting for master to send event", which
ensures the list is correctly updated.
=== Problem ===
The test is dependent on binlog positions and checks
to see if the command 'START SLAVE' functions correctly
with the 'UNTIL' clause added to it. The 'UNTIL' clause
is added to specify that the slave should start and run
until the SQL thread reaches a given point in the master
binary log or in the slave relay log.
The test uses hard coded values for MASTER_LOG_POS and
RELAY_LOG_POS, instead of extracting them using the
query_get_value() function. There is a test
'rpl.rpl_row_until' which does a similar thing but uses the
query_get_value() function to set the values of
MASTER_LOG_POS/RELAY_LOG_POS. To be precise,
rpl.rpl_row_until is a modified version of
engines/funcs.rpl_row_until.test.
The use of hard coded values may lead the slave to stop at a position
which may differ from the expected position in the binlog file,
an example being the failure of engines/funcs.rpl_row_until in
mysql-5.1 given as:
"query 'select * from t2' failed. Table 'test.t2' doesn't exist".
In this case, the slave actually ran a couple of extra commands,
as a result of which the slave first deleted the table and then
ran a select query on the table, leading to the above mentioned failure.
=== Fix ===
1) Fixed the code for failure seen in rpl.rpl_row_until.
This test was also failing although the symptoms of
failure were different.
2) Copied the contents from rpl.rpl_row_until into
engines/funcs.rpl_row_until.
3) Updated engines/funcs.rpl_row_until.result accordingly.
When a binlog is replayed into a server, e.g.:
$ mysqlbinlog binlog.000001 | mysql
it sets a pseudo slave mode on the client connection so that the server
is able to read binlog events; that is, a format description event is
needed to correctly read the following events.
This pseudo slave mode also applies, to the current connection, the
replication rules that are needed to correctly apply binlog events.
If a binlog dump is sourced on a connection, this pseudo slave mode
remains after it, which, from the customer's perspective, applies
unexpected rules to the following commands.
Added a new SET statement to the binlog dump that unsets pseudo slave
mode at the end of the dump file.
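A hedged sketch of the statement now appended at the end of the dump
output (the exact emitted form, including any version comment, is an
assumption):
SET @@SESSION.PSEUDO_SLAVE_MODE = 0;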
QUOTING IN REPLICATION
Problem: Misquoted or unquoted identifiers may lead to
incorrect statements being logged to the binary log.
Fix: we use specialized functions to append quoted identifiers in
the statements generated by the server.
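A hedged illustration (hypothetical identifier): an identifier that
itself contains a quote character must have that character doubled when
it is appended to a server-generated statement.
CREATE TEMPORARY TABLE `a``b` (x INT);
DROP TEMPORARY TABLE IF EXISTS `a``b`; -- the generated statement must double the embedded backtick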
rpl_cant_read_event_incident:
The slave applies updates from the bug11747416_32228_binlog.000001 file, which
contains a CREATE TABLE t statement and an incident; when the SQL thread is
running slowly, the IO thread may reach the incident before the SQL thread
executes the create table statement.
Execute "drop table if exists t" and also perform a RESET MASTER to
clean the slave's binary logs.
rpl_bug41902:
The suppression for the error "MYSQL_BIN_LOG::purge_logs was called with file
./master-bin.000001 not listed in the index." does not take the
Windows path into account, where it is ".\master-bin.000001".
Changed the suppression to "MYSQL_BIN_LOG::purge_logs was called with file
..master-bin.000001 not listed in the index" to match both ".\" and "./".
updating the result file. Because a multi-row insert now reserves the
auto increment values beforehand, if any explicitly specified auto
increment values are present, then some of the reserved values are lost.
Problem
========
SQL statements close to the size of max_allowed_packet produce binary
log events larger than max_allowed_packet.
This failure occurs because the event length is
more than the total of max_allowed_packet + the maximum event header
length. Since the event length exceeds this size, the master's dump
thread is unable to send the packet on to the slave.
That can happen e.g. with row-based replication in an Update_rows event.
Fix
====
The problem was fixed by increasing the max_allowed_packet for the
slave's threads (IO/SQL) to 1GB.
This is done using a new server option that is used to
regulate the max_allowed_packet of the slave threads (IO/SQL).
This allows the large packets to be received by the slave and applied
successfully.
Problem - The failure on PB2 is possibly due to the port number still being in
use even after the server restarts, which is not reflected in the
server restart.
Fix - The problem is fixed by starting the servers forcefully using the option
file and also passing the parameters for the server restart correctly.