1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-12 20:49:12 +03:00
Commit Graph

1286 Commits

Author SHA1 Message Date
Kristian Nielsen
72e1cc8f52 MDEV-35806: Error in read_log_event() corrupts relay log writer, crashes server
In Log_event::read_log_event(), don't use IO_CACHE::error of the relay log's
IO_CACHE to signal an error back to the caller. When reading the active
relay log, this flag is also being used by the IO thread, and setting it can
randomly cause the IO thread to wrongly detect IO error on writing and
permanently disable the relay log.

This was seen sporadically in test case rpl.rpl_from_mysql80. The read
error set by the SQL thread in the IO_CACHE would be interpreted as a
write error by the IO thread, which would cause it to throw a fatal
error and close the relay log. And this would later cause CHANGE
MASTER to try to purge a closed relay log, resulting in nullptr crash.

SQL thread is not able to parse an event read from the relay log. This
can happen like here when replicating unknown events from a MySQL master,
potentially also for other reasons.

Also fix a mistake in my_b_flush_io_cache() introduced back in 2001
(fa09f2cd7e) where my_b_flush_io_cache() could wrongly return an error set
in IO_CACHE::error, even if the flush operation itself succeeded.

Also fix another sporadic failure in rpl.rpl_from_mysql80 where the outout
of MASTER_POS_WAIT() depended on timing of SQL and IO thread.

Reviewed-by: Monty <monty@mariadb.org>
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2025-01-24 09:15:20 +00:00
Sergei Golubchik
addc828363 Merge branch '10.5' into 10.6 2025-01-09 10:15:53 +01:00
Sergei Golubchik
82fd202fa4 fix "enforce no trailing \n in Diagnostic_area messages"
cannot have an assert in Warning_info::push_warning()
because SQL command SIGNAL can set an absolutely arbitrary
message, even an empty one or ending with '\n'

move the assert into push_warning() and my_message_sql().

followup for 9508a44c37
2025-01-09 09:28:28 +01:00
Marko Mäkelä
b251cb6a4f Merge 10.5 into 10.6 2025-01-08 08:48:21 +02:00
Sergei Golubchik
9508a44c37 enforce no trailing \n in Diagnostic_area messages
that is in my_error(), push_warning(), etc
2025-01-07 16:31:39 +01:00
Monty
87ee1e75bc MDEV-35643 Add support for MySQL 8.0 binlog events
MDEV-29533 Crash when MariaDB is replica of MySQL 8.0

MySQL 8.0 has added the following new events in the MySQL binary log

PARTIAL_UPDATE_ROWS_EVENT
TRANSACTION_PAYLOAD_EVENT
HEARTBEAT_LOG_EVENT_V2

- PARTIAL_UPDATE_ROWS_EVENT is used by MySQL to generate update
  statements using JSON_SET, JSON_REPLACE and JSON_REMOVE to make
  update of JSON columns more efficient.  These events can be
  disabled by setting 'binlog-row-value-options=""'
- TRANSACTION_PAYLOAD_EVENT is used by MySQL to signal that a
  row event is compressed. It an be disably by setting
  'binlog_transaction_compression=0'.
- HEARTBEAT_LOG_EVENT_V2 is written to the binary log many times
  per seconds. It can be ignored by the server.

What this patch does:

- If PARTIAL_UPDATE_ROWS_EVENT or TRANSACTION_PAYLOAD_EVENT is found,
  the server will stop with an error message of how to disable the
  MySQL server to generate such events.
- HEARTBEAT_LOG_EVENT_V2 events are ignored.
- mariadb-binlog will write the name of the new events.
- mariadb-binlog will stop if PARTIAL_UPDATE_ROWS_EVENT or
  TRANSACTION_PAYLOAD_EVENT is found, unless --force is given.
- Fixes a crash in mariadb-binlog if a character set unknown to
  MariaDB is found. (MDEV-29533)

From Kristian Nielsen:
- Add test case for MySQL 8.0 to MariaDB replication and fixed a
  a small typo in post_header_len initialization.

Reviewer: knielsen@mariadb.org
2025-01-05 16:40:11 +02:00
Marko Mäkelä
5ba542e9ee Merge 10.5 into 10.6 2024-05-30 14:27:07 +03:00
Robin Newhouse
dc38d8ea80 Minimize unsafe C functions with safe_strcpy()
Similar to #2480.
567b681 introduced safe_strcpy() to minimize the use of C with
potentially unsafe memory overflow with strcpy() whose use is
discouraged.
Replace instances of strcpy() with safe_strcpy() where possible, limited
here to files in the `sql/` directory.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.
2024-05-17 13:33:16 +01:00
Sergei Golubchik
e83d92ee5e sporadic failures of rpl.rpl_semi_sync_fail_over
in the $case=2 - it's wrong to kill after the first binlog EOF,
because that might happen between INSERT(4) and INSERT(5).

So, wait for the slave to acknowledge INSERT(5) before killing
the master, that is, both connection threads must pass
repl_semisync_master.wait_after_sync()
2024-04-21 22:54:52 +02:00
Marko Mäkelä
829cb1a49c Merge 10.5 into 10.6 2024-04-17 14:14:58 +03:00
Kristian Nielsen
16aa4b5f59 Merge from 10.4 to 10.5
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-15 17:46:49 +02:00
Marko Mäkelä
59e7289b6c Fix g++-14 -Wmaybe-uninitialized 2024-03-19 08:10:42 +02:00
Michael Widenius
7af50e4df4 MDEV-32551: "Read semi-sync reply magic number error" warnings on master
rpl_semi_sync_slave_enabled_consistent.test and the first part of
the commit message comes from Brandon Nesterenko.

A test to show how to induce the "Read semi-sync reply magic number
error" message on a primary. In short, if semi-sync is turned on
during the hand-shake process between a primary and replica, but
later a user negates the rpl_semi_sync_slave_enabled variable while
the replica's IO thread is running; if the io thread exits, the
replica can skip a necessary call to kill_connection() in
repl_semisync_slave.slave_stop() due to its reliance on a global
variable. Then, the replica will send a COM_QUIT packet to the
primary on an active semi-sync connection, causing the magic number
error.

The test in this patch exits the IO thread by forcing an error;
though note a call to STOP SLAVE could also do this, but it ends up
needing more synchronization. That is, the STOP SLAVE command also
tries to kill the VIO of the replica, which makes a race with the IO
thread to try and send the COM_QUIT before this happens (which would
need more debug_sync to get around). See THD::awake_no_mutex for
details as to the killing of the replica’s vio.

Notes:
- The MariaDB documentation does not make it clear that when one
  enables semi-sync replication it does not matter if one enables
  it first in the master or slave. Any order works.

Changes done:
- The rpl_semi_sync_slave_enabled variable is now a default value for
  when semisync is started. The variable does not anymore affect
  semisync if it is already running. This fixes the original reported
  bug.  Internally we now use repl_semisync_slave.get_slave_enabled()
  instead of rpl_semi_sync_slave_enabled. To check if semisync is
  active on should check the @@rpl_semi_sync_slave_status variable (as
  before).
- The semisync protocol conflicts in the way that the original
  MySQL/MariaDB client-server protocol was designed (client-server
  send and reply packets are strictly ordered and includes a packet
  number to allow one to check if a packet is lost). When using
  semi-sync the master and slave can send packets at 'any time', so
  packet numbering does not work. The 'solution' has been that each
  communication starts with packet number 1, but in some cases there
  is still a chance that the packet number check can fail.  Fixed by
  adding a flag (pkt_nr_can_be_reset) in the NET struct that one can
  use to signal that packet number checking should not be done. This
  is flag is set when semi-sync is used.
- Added Master_info::semi_sync_reply_enabled to allow one to configure
  some slaves with semisync and other other slaves without semisync.
  Removed global variable semi_sync_need_reply that would not work
  with multi-master.
- Repl_semi_sync_master::report_reply_packet() can now recognize
  the COM_QUIT packet from semisync slave and not give a
  "Read semi-sync reply magic number error" error for this case.
  The slave will be removed from the Ack listener.
- On Windows, don't stop semisync Ack listener just because one
  slave connection is using socket_id > FD_SETSIZE.
- Removed busy loop in Ack_receiver::run() by using
 "Self-pipe trick" to signal new slave and stop Ack_receiver.
- Changed some Repl_semi_sync_slave functions that always returns 0
  from int to void.
- Added Repl_semi_sync_slave::slave_reconnect().
- Removed dummy_function Repl_semi_sync_slave::reset_slave().
- Removed some duplicate semisync notes from the error log.
- Add test of "if (get_slave_enabled() && semi_sync_need_reply)"
  before calling Repl_semi_sync_slave::slave_reply().
  (Speeds up the code as we can skip all initializations).
- If epl_semisync_slave.slave_reply() fails, we disable semisync
  for that connection.
- We do not call semisync.switch_off() if there are no active slaves.
  Instead we check in Repl_semi_sync_master::commit_trx() if there are
  no active threads. This simplices the code.
- Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is
  flushed in case of asserts.
- Removed the internal rpl_semi_sync_slave_status as it is not needed
  anymore. The @@rpl_semi_sync_slave_status status variable is now
  mapped to rpl_semi_sync_enabled.
- Removed rpl_semi_sync_slave_enabled  as it is not needed anymore.
  Repl_semi_sync_slave::get_slave_enabled() contains the active status.
- Added checking that we do not add a slave twice with
  Ack_receiver::add_slave(). This could happen with old code.
- Removed Repl_semi_sync_master::check_and_switch() as it is not
  needed anymore.
- Ensure that when we call Ack_receiver::remove_slave() that the slave
  is removed from the listener before function returns.
- Call listener.listen_on_sockets() outside of mutex for better
  performance and less contested mutex.
- Ensure that listening is ignoring newly added slaves when checking for
  responses.
- Fixed the master ack_receiver listener is not killed if there are no
  connected slaves (and thus stop semisync handling of future
  connections). This could happen if all slaves sockets where would be
  marked as unreliable.
- Added unlink() to base_ilist_iterator and remove() to
  I_List_iterator. This enables us to remove 'dead' slaves in
  Ack_recever::run().
- kill_zombie_dump_threads() now does killing of dump threads properly.
  - It can now kill several threads (should be impossible but could
    happen if IO slaves reconnects very fast).
  - We now wait until the dump thread is done before starting the
    dump.
- Added an error if kill_zombie_dump_threads() fails.
- Set thd->variables.server_id before calling
  kill_zombie_dump_threads(). This simplies the code.
- Added a lot of comments both in code and tests.
- Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used.

Test changes:
- rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with
  semisync enabled.
- Some timings changed slight with startup of slave which caused
  rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the
  error log file before the slave had started properly. Fixed by
  adding wait_for_pattern_in_file.inc that allows waiting for the
  pattern to appear in the log file.
- Tests have been updated so that we first set
  rpl_semi_sync_master_enabled on the master and then set
  rpl_semi_sync_slave_enabled on the slaves (this is according to how
  the MariaDB documentation document how to setup semi-sync).
- Error text "Master server does not have semi-sync enabled" has been
  replaced with "Master server does not support semi-sync" for the
  case when the master supports semi-sync but semi-sync is not
  enabled.

Other things:
- Some trivial cleanups in Repl_semi_sync_master::update_sync_header().
- We should in 11.3 changed the default value for
  rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE
  does not make much sense as default. The main difference with using
  FALSE is that we do not wait for semisync Ack if there are no slave
  threads.  In the case of TRUE we wait once, which did not bring any
  notable benefits except slower startup of master configured for
  using semisync.

Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>

This solves the problem reported in MDEV-32960 where a new
slave may not be registered in time and the master disables
semi sync because of that.
2024-01-23 13:03:11 +02:00
Marko Mäkelä
3a96eba25f Merge 10.5 into 10.6 2024-01-17 13:35:05 +02:00
Alexander Barkov
fa3171df08 MDEV-27666 User variable not parsed as geometry variable in geometry function
Adding GEOMETRY type user variables.
2024-01-16 18:53:23 +04:00
Sergei Golubchik
e95bba9c58 Merge branch '10.5' into 10.6 2023-12-17 11:20:43 +01:00
Sergei Golubchik
98a39b0c91 Merge branch '10.4' into 10.5 2023-12-02 01:02:50 +01:00
Brandon Nesterenko
c42aadc388 MDEV-32628: Cryptic ERROR message & inconsistent behavior on incorrect SHOW BINLOG EVENTS FROM ...
Calling SHOW BINLOG EVENTS FROM <offset> with an invalid offset
writes error messages into the server log about invalid reads. The
read errors that occur from this command should only be relayed back
to the user though, and not written into the server log. This is
because they are read-only and have no impact on server operation,
and the client only need be informed to correct the parameter.

This patch fixes this by omitting binary log read errors from the
server when the invocation happens from SHOW BINLOG EVENTS.
Additionally, redundant error messages are omitted when calling the
string based read_log_event from the IO_Cache based read_log_event,
as the later already will report the error of the former.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2023-11-17 09:43:56 -08:00
Kristian Nielsen
6fa69ad747 MDEV-27436: binlog corruption (/tmp no space left on device at the same moment)
This commit fixes several bugs in error handling around disk full when
writing the statement/transaction binlog caches:

1. If the error occurs during a non-transactional statement, the code
attempts to binlog the partially executed statement (as it cannot roll
back). The stmt_cache->error was still set from the disk full error. This
caused MYSQL_BIN_LOG::write_cache() to get an error while trying to read the
cache to copy it to the binlog. This was then wrongly interpreted as a disk
full error writing to the binlog file. As a result, a partial event group
containing just a GTID event (no query or commit) was binlogged. Fixed by
checking if an error is set in the statement cache, and if so binlog an
INCIDENT event instead of a corrupt event group, as for other errors.

2. For LOAD DATA LOCAL INFILE, if a disk full error occured while writing to
the statement cache, the code would attempt to abort and read-and-discard
any remaining data sent by the client. The discard code would however
continue trying to write data to the statement cache, and wrongly interpret
another disk full error as end-of-file from the client. This left the client
connection with extra data which corrupts the communication for the next
command, as well as again causing an corrupt/incomplete event to be
binlogged. Fixed by restoring the default read function before reading any
remaining data from the client connection.

Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-10-31 11:48:00 +01:00
Marko Mäkelä
5bada1246d Merge 10.5 into 10.6 2023-04-11 16:15:19 +03:00
Oleksandr Byelkin
ac5a534a4c Merge remote-tracking branch '10.4' into 10.5 2023-03-31 21:32:41 +02:00
Anel Husakovic
c596ad734d MDEV-30269: Remove rpl_semi_sync_[slave,master] usage in code
- Description:
  - Before 10.3.8 semisync was a plugin that is built into the server with
    MDEV-13073,starting with commit cbc71485e2.
    There are still some usage of `rpl_semi_sync_master` in mtr.
Note:
  - To recognize the replica in the `dump_thread`, replica is creating
    local variable `rpl_semi_sync_slave` (the keyword of plugin) in
    function `request_transmit`, that is catched by primary in
    `is_semi_sync_slave()`. This is the user variable and as such not
    related to the obsolete plugin.

 - Found in `sys_vars.all_vars` and `rpl_semi_sync_wait_point` tests,
   usage of plugins `rpl_semi_sync_master`, `rpl_semi_sync_slave`.
   The former test is disabled by default (`sys_vars/disabled.def`)
   and marked as `obsolete`, however this patch will remove the queries.

- Add cosmetic fixes to semisync codebase

Reviewer: <brandon.nesterenko@mariadb.com>
Closes PR #2528, PR #2380
2023-03-23 13:39:46 +01:00
Sergei Golubchik
900d7bf360 Merge branch '10.5' into 10.6 2022-10-02 22:14:21 +02:00
Sergei Golubchik
3a2116241b Merge branch '10.4' into 10.5 2022-10-02 14:38:13 +02:00
Sergei Golubchik
d4f6d2f08f Merge branch '10.3' into 10.4 2022-10-01 23:07:26 +02:00
Mikhail Chalov
9de9f105b5 Use memory safe snprintf() in Connect Engine and elsewhere (#2210)
Continue with similar changes as done in 19af1890 to replace sprintf(buf, ...)
with snprintf(buf, sizeof(buf), ...), specifically in the "easy" cases where buf
is allocated with a size known at compile time.

All new code of the whole pull request, including one or several files that are
either new files or modified ones, are contributed under the BSD-new license.  I
am contributing on behalf of my employer Amazon Web Services, Inc.
2022-09-28 15:45:25 +01:00
Marko Mäkelä
829e8111c7 Merge 10.5 into 10.6 2022-09-26 14:34:43 +03:00
Marko Mäkelä
6286a05d80 Merge 10.4 into 10.5 2022-09-26 13:34:38 +03:00
Marko Mäkelä
3c92050d1c Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF
There are separate flags DBUG_OFF for disabling the DBUG facility
and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility.
Let us allow debug builds without DEBUG_SYNC.

Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to
define ENABLED_DEBUG_SYNC.
2022-09-23 17:37:52 +03:00
sjaakola
ef2dbb8dbc MDEV-23328 Server hang due to Galera lock conflict resolution
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is

          wsrep_thd_LOCK()
          wsrep_kill_victim()
          lock_rec_other_has_conflicting()
          lock_clust_rec_read_check_and_lock()
          row_search_mvcc()
          ha_innobase::index_read()
          ha_innobase::rnd_pos()
          handler::ha_rnd_pos()
          handler::rnd_pos_by_record()
          handler::ha_rnd_pos_by_record()
          Rows_log_event::find_row()
          Update_rows_log_event::do_exec_row()
          Rows_log_event::do_apply_event()
          Log_event::apply_event()
          wsrep_apply_events()

and mutexes are taken in the order

          lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data

When a normal KILL statement is executed, the stack is

          innobase_kill_query()
          kill_handlerton()
          plugin_foreach_with_mask()
          ha_kill_query()
          THD::awake()
          kill_one_thread()

        and mutexes are

          victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.

In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.

TOI replication is used, in this approach,  purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.

This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2021-10-29 20:40:35 +02:00
Jan Lindström
d5bc05798f MDEV-25114: Crash: WSREP: invalid state ROLLED_BACK (FATAL)
Revert "MDEV-23328 Server hang due to Galera lock conflict resolution"

This reverts commit eac8341df4.
2021-10-29 20:38:11 +02:00
sjaakola
5c230b21bf MDEV-23328 Server hang due to Galera lock conflict resolution
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is

          wsrep_thd_LOCK()
          wsrep_kill_victim()
          lock_rec_other_has_conflicting()
          lock_clust_rec_read_check_and_lock()
          row_search_mvcc()
          ha_innobase::index_read()
          ha_innobase::rnd_pos()
          handler::ha_rnd_pos()
          handler::rnd_pos_by_record()
          handler::ha_rnd_pos_by_record()
          Rows_log_event::find_row()
          Update_rows_log_event::do_exec_row()
          Rows_log_event::do_apply_event()
          Log_event::apply_event()
          wsrep_apply_events()

and mutexes are taken in the order

          lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data

When a normal KILL statement is executed, the stack is

          innobase_kill_query()
          kill_handlerton()
          plugin_foreach_with_mask()
          ha_kill_query()
          THD::awake()
          kill_one_thread()

        and mutexes are

          victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.

In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.

TOI replication is used, in this approach,  purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.

This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2021-10-29 09:52:52 +03:00
Jan Lindström
aa7ca987db MDEV-25114: Crash: WSREP: invalid state ROLLED_BACK (FATAL)
Revert "MDEV-23328 Server hang due to Galera lock conflict resolution"

This reverts commit eac8341df4.
2021-10-29 09:52:40 +03:00
Marko Mäkelä
73f5cbd0b6 Merge 10.5 into 10.6 2021-10-21 16:06:34 +03:00
Marko Mäkelä
5f8561a6bc Merge 10.4 into 10.5 2021-10-21 15:26:25 +03:00
Marko Mäkelä
489ef007be Merge 10.3 into 10.4 2021-10-21 14:57:00 +03:00
Marko Mäkelä
e4a7c15dd6 Merge 10.2 into 10.3 2021-10-21 13:41:04 +03:00
Brandon Nesterenko
2291f8ef73 MDEV-25284: Assertion `info->type == READ_CACHE || info->type == WRITE_CACHE' failed
Problem:
========
This patch addresses two issues.

First, if a CHANGE MASTER command is issued and an error happens
while locating the replica’s relay logs, the logs can be put into an
invalid state where future updates fail and future CHANGE MASTER
calls crash the server. More specifically, right before a replica
purges the relay logs (part of the `CHANGE MASTER TO` logic), the
relay log is temporarily closed with state LOG_TO_BE_OPENED. If the
server errors in-between the temporary log closure and purge, i.e.
during the function find_log_pos, the log should be closed.
MDEV-25284 reveals the log is not properly closed.

Second, upon issuing a RESET SLAVE ALL command, a slave’s GTID
filters are not cleared (DO_DOMAIN_IDS, IGNORE_DOMIAN_IDS,
IGNORE_SERVER_IDS). MySQL had a similar bug report, Bug #18816897,
which fixed this issue to clear IGNORE_SERVER_IDS after issuing
RESET SLAVE ALL in version 5.7.

Solution:
=========

To fix the first problem, the CHANGE MASTER error handling logic was
extended to transition the relay log state to LOG_CLOSED from
LOG_TO_BE_OPENED.

To fix the second problem, the RESET SLAVE ALL logic is extended to
clear the domain_id filter and ignore_server_ids.

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2021-10-18 10:43:51 -06:00
Marko Mäkelä
f3fcf5f45c Merge 10.5 to 10.6 2021-08-19 12:25:00 +03:00
Sujatha
475f69b985 MDEV-25958: rpl_semi_sync_fail_over.test fails in buildbot
Analysis:
========
In case multi binlog truncation scenario debug sync points are in the
following order.

Two inserts are done on master as shown below.

INSERT INTO t1 VALUES (4, REPEAT("x", 4100)
commit_after_release_LOCK_after_binlog_sync

INSERT INTO t1 VALUES (5, REPEAT("x", 4100)
commit_after_release_LOCK_log

First insert debug sync ensures that transaction is synced to binlog and
not committed but it reached slave through semi sync.

Second insert debug sync ensures that transaction is synced to binlog and
not committed. It doesn't ensure that 'INSERT 5' reached slave.

Most of the times INSERT 5 reaches slave, hence when it is promoted as
master it sends 4,5 to slave. But occasionally 5 may not reach slave in
those cases post recovery master will have only 4. When row 6 is inserted
Master has 4-6 and Slave has 4,5,6.

This results in test failure.

Fix:
===
For the first insert use 'commit_before_get_LOCK_commit_ordered' debug sync
point, it will ensure that binlog was sent to slave and slave has
acknowledged the receipt. Now enable debug code such that when the next
transaction is written to binary log, dump thread will read and send it
across the network and notify the server to be get killed. Insert row 5
and wait for notification from dump thread. Kill the server. This ensures
that both 4 and 5 have reached the semi-sync slave.

Added a new test case:
Insert two rows on master such that first is present in master's binlog and
reached semi sync slave. Second insert should be flushed to binlog but not
sent to slave. Now crash and fail over to slave. The promoted master will send
the extra transaction to slave.
2021-08-19 11:59:39 +05:30
Marko Mäkelä
4a25957274 Merge 10.4 into 10.5 2021-08-18 18:22:35 +03:00
Brandon Nesterenko
46c3e7e353 MDEV-20215: binlog.show_concurrent_rotate failed in buildbot with wrong result
Problem:
=======
There are two issues that are addressed in this patch:
 1) SHOW BINARY LOGS uses caching to store the binary logs that exist
in the log directory; however, if new events are written to the logs,
the caching strategy is unaware. This is okay for users, as it is
okay for SHOW to return slightly old data. The test, however, can
result in inconsistent data. It runs two connections concurrently,
where one shows the logs, and the other adds a new file. The output
of SHOW BINARY LOGS then depends on when the cache is built, with
respect to the time that the second connection rotates the logs.
 2) There is a race condition between RESET MASTER and SHOW BINARY
LOGS. More specifically, where they both need the binary log lock to
begin, SHOW BINARY LOGS only needs the lock to build its cache. If
RESET MASTER is issued after SHOW BINARY LOGS has built its cache and
before it has returned the results, the presented data may be
incorrect.

Solution:
========
 1) As it is okay for users to see stale data, to make the test
consistent, use DEBUG_SYNC to force the race condition (problem 2) to
make SHOW BINARY LOGS build a cache before RESET MASTER is called.
Then, use additional logic from the next part of the solution to
rebuild the cache.
 2) Use an Atomic_counter to keep track of the number of times RESET
MASTER has been called. If the value of the counter changes after
building the cache, the cache should be rebuilt and the analysis
should be restarted.

Reviewed By:
============
Andrei Elkin: <andrei.elkin@mariadb.com>
2021-08-13 10:53:19 -06:00
Monty
85d6278fed Change replication to use uchar for all buffers instead of char
This change is to get rid of randomly failing tests, especially those
that reads random position of the binary log. From looking at the logs
it's clear that some failures is because of a read char (with value >= 128)
is converted to a big long value. Using uchar everywhere makes this much
less likely to happen.
Another benefit is that a lot of cast of char to uchar could be removed.

Other things:
- Removed some extra space before '=' and '+=' in assignments
- Fixed indentations and lines > 80 characters
- Replace '16' with 'element_size' (from class definition) in
  Gtid_list_log_event()
2021-05-19 22:54:12 +02:00
Monty
b6ff139aa3 Reduce usage of strlen()
Changes:
- To detect automatic strlen() I removed the methods in String that
  uses 'const char *' without a length:
  - String::append(const char*)
  - Binary_string(const char *str)
  - String(const char *str, CHARSET_INFO *cs)
  - append_for_single_quote(const char *)
  All usage of append(const char*) is changed to either use
  String::append(char), String::append(const char*, size_t length) or
  String::append(LEX_CSTRING)
- Added STRING_WITH_LEN() around constant string arguments to
  String::append()
- Added overflow argument to escape_string_for_mysql() and
  escape_quotes_for_mysql() instead of returning (size_t) -1 on overflow.
  This was needed as most usage of the above functions never tested the
  result for -1 and would have given wrong results or crashes in case
  of overflows.
- Added Item_func_or_sum::func_name_cstring(), which returns LEX_CSTRING.
  Changed all Item_func::func_name()'s to func_name_cstring()'s.
  The old Item_func_or_sum::func_name() is now an inline function that
  returns func_name_cstring().str.
- Changed Item::mode_name() and Item::func_name_ext() to return
  LEX_CSTRING.
- Changed for some functions the name argument from const char * to
  to const LEX_CSTRING &:
  - Item::Item_func_fix_attributes()
  - Item::check_type_...()
  - Type_std_attributes::agg_item_collations()
  - Type_std_attributes::agg_item_set_converter()
  - Type_std_attributes::agg_arg_charsets...()
  - Type_handler_hybrid_field_type::aggregate_for_result()
  - Type_handler_geometry::check_type_geom_or_binary()
  - Type_handler::Item_func_or_sum_illegal_param()
  - Predicant_to_list_comparator::add_value_skip_null()
  - Predicant_to_list_comparator::add_value()
  - cmp_item_row::prepare_comparators()
  - cmp_item_row::aggregate_row_elements_for_comparison()
  - Cursor_ref::print_func()
- Removes String_space() as it was only used in one cases and that
  could be simplified to not use String_space(), thanks to the fixed
  my_vsnprintf().
- Added some const LEX_CSTRING's for common strings:
  - NULL_clex_str, DATA_clex_str, INDEX_clex_str.
- Changed primary_key_name to a LEX_CSTRING
- Renamed String::set_quick() to String::set_buffer_if_not_allocated() to
  clarify what the function really does.
- Rename of protocol function:
  bool store(const char *from, CHARSET_INFO *cs) to
  bool store_string_or_null(const char *from, CHARSET_INFO *cs).
  This was done to both clarify the difference between this 'store' function
  and also to make it easier to find unoptimal usage of store() calls.
- Added Protocol::store(const LEX_CSTRING*, CHARSET_INFO*)
- Changed some 'const char*' arrays to instead be of type LEX_CSTRING.
- class Item_func_units now used LEX_CSTRING for name.

Other things:
- Fixed a bug in mysql.cc:construct_prompt() where a wrong escape character
  in the prompt would cause some part of the prompt to be duplicated.
- Fixed a lot of instances where the length of the argument to
  append is known or easily obtain but was not used.
- Removed some not needed 'virtual' definition for functions that was
  inherited from the parent. I added override to these.
- Fixed Ordered_key::print() to preallocate needed buffer. Old code could
  case memory overruns.
- Simplified some loops when adding char * to a String with delimiters.
2021-05-19 22:27:48 +02:00
Monty
36cdd5c3cd Optimize usage of c_ptr(), c_ptr_quick() and String::alloc()
The problem was that when one used String::alloc() to allocate a string,
the String ensures that there is space for an extra NULL byte in the
buffer and if not, reallocates the string. This is a problem with the
String::set_int() that calls alloc(21), which forces extra
malloc/free calls to happen.

- We do not anymore re-allocate String if alloc() is called with the
  Allocated_length. This reduces number of malloc() allocations,
  especially one big re-allocation in Protocol::send_result_Set_metadata()
  for almost every query that produced a result to the connnected client.
- Avoid extra mallocs when using LONGLONG_BUFFER_SIZE
  This can now be done as alloc() doesn't increase buffers if new length is
  not bigger than old one.
- c_ptr() is redesigned to be safer (but a bit longer) than before.
- Remove wrong usage of c_ptr_quick()
  c_ptr_quick() was used in many cases to get the pointer to the used
  buffer, even when it didn't need to be \0 terminated. In this case
  ptr() is a better substitute.
  Another problem with c_ptr_quick() is that it did not guarantee that
  the string would be \0 terminated.
- item_val_str(), an API function not used currently by the server,
  now always returns a null terminated string (before it didn't always
  do that).
- Ensure that all String allocations uses STRING_PSI_MEMORY_KEY. The old
  mixed usage of performance keys caused assert's when String buffers
  where shrunk.
- Binary_string::shrink() is simplifed
- Fixed bug in String(const char *str, size_t len, CHARSET_INFO *cs) that
  used Binary_string((char *) str, len) instead of Binary_string(str,len).
- Changed argument to String() creations and String.set() functions to use
  'const char*' instead of 'char*'. This ensures that Alloced_length is
  not set, which gives safety against someone trying to change the
  original string. This also would allow us to use !Alloced_length in
  c_ptr() if needed.
- Changed string_ptr_cmp() to use memcmp() instead of c_ptr() to avoid
  a possible malloc during string comparision.
2021-05-19 22:27:27 +02:00
Nikita Malyavin
3f55c56951 Merge branch bb-10.4-release into bb-10.5-release 2021-05-05 23:57:11 +03:00
Nikita Malyavin
509e4990af Merge branch bb-10.3-release into bb-10.4-release 2021-05-05 23:03:01 +03:00
Nikita Malyavin
a8a925dd22 Merge branch bb-10.2-release into bb-10.3-release 2021-05-04 14:49:31 +03:00
Sujatha
abe6eb10a6 MDEV-16146: MariaDB slave stops with following errors.
Problem:
========
180511 11:07:58 [ERROR] Slave I/O: Unexpected master's heartbeat data:
heartbeat is not compatible with local info;the event's data: log_file_name
mysql-bin.000009 log_pos 1054262041, Error_code: 1623

Analysis:
=========
In replication setup when master server doesn't have any events to send to
slave server it sends an 'Heartbeat_log_event'. This event carries the
current binary log filename and offset details. The offset values is stored
within 4 bytes of event header. When the size of binary log is higher than
UINT32_MAX the log_pos values will not fit in 4 bytes memory.  It overflows
and hence slave stops with an error.

Fix:
===
Since we cannot extend the common_header of Log_event class, a greater than
4GB value of Log_event::log_pos is made to be transported with a HeartBeat
event's sub-header.  Log_event::log_pos in such case is set to zero to
indicate that the 8 byte sub-header is allocated in the event.

In case of cross version replication following behaviour is expected

OLD - Server without fix
NEW - Server with fix

OLD<->NEW : works bidirectionally as long as the binlog offset is
            (normally) within 4GB.

When log_pos > UINT32_MAX
OLD->NEW  : The 'log_pos' is bound to overflow and NEW slave may report
            an invalid event/incompatible heart beat event error.
NEW->OLD  : Since patched server sets log_pos=0 on overflow, OLD slave will
            report invalid event error.
2021-04-30 20:34:31 +05:30
Sergei Petrunia
bf310b4cfb MDEV-25305: MyRocks: Killing server during RESET MASTER can lose last transactions
rocksdb_checkpoint_request() should call FlushWAL(sync=true) (which does
write-out and sync), not just SyncWAL()  (which just syncs without writing
out)
Followup: the test requires debug sync facility

(This is a backport to 10.5)
2021-03-31 10:26:00 +03:00