When non-transactional tables are updated inside a multi-statement transaction
with binlog_direct_non_transactional_updates=1, the non-transactional
updates are binlogged directly through the statement cache while the main
transaction is still adding to the transaction cache.
Thus, move the engine_binlog_info out from binlog_cache_mngr and into the
individual stmt/trx binlog_cache_data, so that we can have separate
engine_binlog_info active for the statement and the transaction cache.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Support for SAVEPOINT, ROLLBACK TO SAVEPOINT, rolling back a failed
statement (keeping active transaction), and rolling back transaction.
For savepoints (and start-of-statement), if the binlog data to be rolled
back is still in the in-memory part of the trx cache, we can just truncate the
cache back to that point.
But if we need to spill cache contents as out-of-band data containing one or
more savepoint/start-of-statement points, then split the spill at each point
and inform the engine of the savepoints.
In InnoDB, at savepoint set, save the state of the forest of perfect binary
trees being built. Then at rollback, restore the appropriate state.
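
The in-memory case above can be sketched as follows. This is only an
illustration of truncating a cache back to a recorded offset; the class and
method names are hypothetical and not taken from the server code, and the
spill-to-out-of-band case is not modelled.

```python
class TrxCache:
    """Toy in-memory transaction cache with named savepoints (hypothetical)."""

    def __init__(self):
        self.buf = bytearray()
        self.savepoints = {}  # name -> offset into buf, in creation order

    def write(self, data: bytes):
        self.buf += data

    def savepoint(self, name):
        # Re-using a name replaces the old savepoint and moves it to the end.
        self.savepoints.pop(name, None)
        self.savepoints[name] = len(self.buf)

    def rollback_to(self, name):
        pos = self.savepoints[name]
        # Everything written after the savepoint is discarded.
        del self.buf[pos:]
        # Savepoints created after this one are no longer valid.
        names = list(self.savepoints)
        for later in names[names.index(name) + 1:]:
            del self.savepoints[later]
```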
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This is actually an existing problem in the old binlog implementation, and
this patch is applicable to old binlog also. The problem is that RESET
MASTER can run concurrently with binlog dump threads / connected slaves.
This will remove the binlog from under the feet of the reader, which can
cause all sorts of strange behaviour.
This patch fixes the problem by disallowing RESET MASTER while dump
threads (or another RESET MASTER or SHOW BINARY LOGS) are running. An error is
thrown in this case; the user must stop slaves and/or kill dump threads to make
the RESET MASTER go through. A slave that connects in the middle of RESET
MASTER will wait for it to complete.
Fix a lot of test cases to kill any lingering dump threads before doing
RESET MASTER, mostly just by sourcing include/kill_binlog_dump_threads.inc.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This patch makes replication crash-safe with the new binlog implementation,
even when --innodb-flush-log-at-trx-commit=0|2. The point is to not send any
binlog events to the slave until they have become durable on master, thus
avoiding that a slave may replicate a transaction that is lost during master
recovery, diverging the slave from the master.
Keep track of which point in the binlog has been durably synced to disk
(meaning the corresponding LSN has been durably synced to disk in the InnoDB
redo log). Each write to the binlog inserts an entry with offset and
corresponding LSN in a FIFO. Dump threads will first read only up to the
durable point in the binlog. A dump thread will then check the LSN FIFO, and
do an InnoDB redo log sync if anything is pending. Then the FIFO is emptied
of any LSNs that have now become durable, and the durable point in the
binlog is updated and reading the binlog can continue.
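
A minimal sketch of the bookkeeping described above, assuming a single FIFO of
(binlog offset, LSN) pairs. All names here are hypothetical, and the redo log
sync is replaced by a stand-in that simply marks everything written so far as
durable.

```python
from collections import deque


class DurabilityTracker:
    """Toy model of the durable point in the binlog (hypothetical names)."""

    def __init__(self):
        self.pending = deque()   # (binlog_end_offset, lsn), in write order
        self.durable_offset = 0  # dump threads may read up to here
        self.flushed_lsn = 0     # stand-in for the durable redo log LSN

    def on_binlog_write(self, end_offset, lsn):
        self.pending.append((end_offset, lsn))

    def note_flushed(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def drain(self):
        # Remove entries whose LSN is durable and advance the durable point.
        while self.pending and self.pending[0][1] <= self.flushed_lsn:
            self.durable_offset = self.pending.popleft()[0]
        return self.durable_offset

    def advance(self):
        # What a dump thread does after reading up to the durable point:
        # sync the redo log if anything is pending, then drain the FIFO.
        if self.pending:
            self.note_flushed(self.pending[-1][1])  # stand-in for a redo sync
        return self.drain()
```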
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If the event group fitted in the binlog cache without the GTID event but not
with, the code would attempt to spill part of the GTID event as out-of-band
data, which is not correct. In release builds this would hang the server as
the spilling would try to lock an already owned mutex.
Fix by checking if the GTID event fits, and spilling any non-GTID data as
oob if it does not.
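
The check can be sketched like this, assuming sizes in bytes and a hypothetical
helper name. It is only a model of the decision, not the server code: the
GTID event is never split, and if it does not fit, the already cached
non-GTID data is what gets spilled.

```python
def plan_commit(cache_capacity, cached_bytes, gtid_size):
    """Return (oob_bytes, commit_record_bytes) for a toy event group."""
    if cached_bytes + gtid_size <= cache_capacity:
        # Everything fits: no out-of-band spill is needed.
        return 0, cached_bytes + gtid_size
    # The GTID event does not fit: spill the non-GTID data as out-of-band
    # and keep the GTID event whole in the commit record.
    return cached_bytes, gtid_size
```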
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Fix that spilling of out-of-band data to the binlog could happen
concurrently with binlog group commit, by holding LOCK_commit_ordered
over all binlog writes now.
Fix silly use-after-free bug where data was accessed in the old buffer after
realloc().
Improve the wording of the error when specifying an argument for --log-bin.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Keep track of, for each binlog file, how many open transactions have
out-of-band data starting in that file. Then at the start of each new binlog
file, in the header page, record the file_no of the earliest file whose OOB
records might be referenced by commit records in this file.
Use this in PURGE BINARY LOGS, so that when a dump thread (slave connection)
is active in file number N, and that file (or a later one) may require
looking back in an earlier file number M for out-of-band records, purge will
stop already at file number M. This way, purge does not accidentally delete
a binlog file that a dump thread would later fail on because it needs to
read out-of-band data from it.
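
A sketch of how the purge limit could be derived, assuming the per-file
header value is available as a mapping from file number to the earliest
referenced file number. The function and parameter names are hypothetical.

```python
def purge_limit(requested_upto, active_readers, earliest_oob_ref):
    """Return the file number below which files may be purged.

    requested_upto   -- purge was asked to remove files numbered below this
    active_readers   -- file numbers that dump threads are positioned in
    earliest_oob_ref -- file_no -> earliest file_no it may reference
                        (the value recorded in that file's header page)
    """
    limit = requested_upto
    for n in active_readers:
        # Keep the reader's own file and anything it may look back into.
        limit = min(limit, n, earliest_oob_ref.get(n, n))
    return limit
```

For example, with a reader in file 5 whose header points back to file 3, a
request to purge everything below file 6 stops at file 3.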
This patch also includes placeholder data for a similar facility for XA
references. The actual implementation of support for XA is for later though.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Mostly various fixes to avoid initializing or creating any data or files for
the legacy binlog.
A possible later refinement could be to sub-class the binlog class
differently for legacy and in-engine binlogs, writing separate virtual
functions for behaviour that differs, extracting common functionality into
sub-methods. This could remove some if (opt_binlog_engine_hton)
conditionals.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Still ToDo: restrict auto-purge so that it does not purge any binlog
file with out-of-band data that might still be needed by a connected slave.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Fix missing WORDS_BIGENDIAN define in ut0compr_int.cc.
Fix misaligned read buffer for O_DIRECT.
Fix wrong/missing update_binlog_end_pos() in binlog group commit.
Fix race where active_binlog_file_no incremented too early.
Fix wrong assertion when reader reaches the very start of (active+1).
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
With this commit, the out-of-band binlogging of large event groups in
multiple smaller records interleaved with other event groups is now working.
Instead of flushing the binlog cache to disk when it reaches
@@binlog_cache_size, the cache is binlogged as an out-of-band
record. Then at transaction commit, a commit record is written containing
just the GTID and a link to the out-of-band data.
To facilitate append-only operation, the binlogged records do not have a
"next" pointer. Instead, they are written out as a forest of perfect binary
trees, the leftmost leaf of one tree pointing to the root of the previous
tree. This structure is used in the binlog reader to efficiently read out
the event group data consecutively for the binlog dump thread, needing to
maintain only O(log(N)) amount of memory during the reading.
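
The idea can be illustrated with an in-memory model. This is a sketch under
assumptions, not the on-disk format: each appended record becomes either a
new leaf or the parent of the two most recent trees when they have equal
height, and each leaf remembers the root of the tree before it. All names are
hypothetical.

```python
class Node:
    def __init__(self, payload, left=None, right=None, prev_root=None):
        self.payload = payload
        self.left = left
        self.right = right
        self.prev_root = prev_root  # leaves only: root of the previous tree
        self.height = 0 if left is None else left.height + 1


class OobForest:
    """Append-only forest of perfect binary trees (toy model)."""

    def __init__(self):
        self.roots = []  # current tree roots, oldest first

    def append(self, payload):
        r = self.roots
        if len(r) >= 2 and r[-1].height == r[-2].height:
            # Two newest trees have equal height: join them under a new root.
            right = r.pop()
            left = r.pop()
            r.append(Node(payload, left, right))
        else:
            r.append(Node(payload, prev_root=r[-1] if r else None))


def roots_from_last(last):
    """Recover all roots from the newest one via leftmost-leaf links."""
    roots = []
    node = last
    while node is not None:
        roots.append(node)
        leaf = node
        while leaf.left is not None:
            leaf = leaf.left
        node = leaf.prev_root
    roots.reverse()
    return roots


def read_tree(node):
    # Post-order traversal yields records in the order they were appended.
    if node is None:
        return
    yield from read_tree(node.left)
    yield from read_tree(node.right)
    yield node.payload


def read_forest(last_root):
    for root in roots_from_last(last_root):
        yield from read_tree(root)
```

In this model the number of roots and the traversal depth both grow
logarithmically with the number of records, which is where the bounded
reader memory comes from.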
As part of this commit, the existing binlog reader code is refactored, with
a much cleaner explicit state machine and handling of chunk/page/file
boundaries etc.
Also fixes some bugs in gtid_search::find_gtid_pos().
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
To restore the binlog state, after finding the position in the old binlog to
continue from, read the full gtid state saved at the start of the binlog
file as well as the most recent differential gtid state written shortly
before the starting position. Then construct a binlog reader to read the
remaining few events (if any), and update with any GTIDs read to obtain the
final restored GTID binlog state.
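
The restore steps can be sketched as below, assuming the state is a mapping
from (domain_id, server_id) to the latest seq_no and that a differential
state lists the pairs changed since the full state at the start of the file.
The function name and data layout are hypothetical.

```python
def restore_binlog_state(full_state, last_diff, trailing_gtids):
    """Rebuild the GTID binlog state after restart (toy model).

    full_state     -- {(domain_id, server_id): seq_no} from the file start
    last_diff      -- most recent differential state before the position
    trailing_gtids -- (domain_id, server_id, seq_no) read after that state
    """
    state = dict(full_state)
    state.update(last_diff)          # changed pairs override the full state
    for domain_id, server_id, seq_no in trailing_gtids:
        state[(domain_id, server_id)] = seq_no
    return state
```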
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Every N bytes (hardcoded at 64k for now, to become a configurable setting),
write the binlog GTID state into the binlog tablespace. This makes it
possible to quickly find a given GTID position by binary search to the prior
GTID state in the tablespace and then a small linear scan from that point.
The full binlog state is dumped at the start of the binlog file; remaining
states dumped are differential states containing only the changed
(domain_id, server_id) pairs, to save space if binlog space is large.
This commit only implements the writing of the binlog state to the
tablespace at regular intervals. The binary search will be implemented in a
subsequent commit.
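
For illustration, the planned lookup could look roughly like this, simplified
to a single (domain_id, server_id) pair with increasing seq_no. The function
name and the list-based data layout are assumptions made for this sketch; a
real reader would seek in the tablespace instead of slicing a list.

```python
import bisect


def find_gtid_offset(checkpoints, events, target_seq_no):
    """Locate the event with target_seq_no (toy model).

    checkpoints -- sorted [(offset, last seq_no written before offset)]
    events      -- sorted [(offset, seq_no)]
    """
    seqs = [seq for _, seq in checkpoints]
    # Binary search: last saved state that is strictly before the target.
    i = bisect.bisect_left(seqs, target_seq_no) - 1
    start = checkpoints[i][0] if i >= 0 else 0
    # Stand-in for seeking to `start` in the file.
    offsets = [off for off, _ in events]
    k = bisect.bisect_left(offsets, start)
    # Small linear scan from the saved state onwards.
    for offset, seq_no in events[k:]:
        if seq_no == target_seq_no:
            return offset
    return None
```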
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The option --innodb-in-engine now causes InnoDB DML commits to include
binlogging in the same mtr. Binlog group commit now skips binlogging to
old file-based binlog and passes events to InnoDB instead.
Many things unfinished still, like allocating new tablespaces when the first
one is filled, writing large event groups out-of-band to not bloat the
InnoDB commit record in the redo log and exceed max mtr size, writing DDL
and all other events to the InnoDB binlog, skipping the creation of the
old-style binlog, reading the new style binlog from InnoDB, etc. etc.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The problem was that a transaction was BF-aborted after certification
succeeded; the transaction then tried to roll back, and during
rollback the binlog stmt cache containing sequence value reservations
was written into the binlog.
The transaction must replay because certification succeeded, but
it must not be written into the binlog yet; that will
be done during commit after the replay.
The fix is to skip the binlog write if the transaction must replay, and
to reset the binlog stmt cache in replay.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
The test issues a simple INSERT statement, while sql_log_bin = 0.
This option disables writes to binlog. However, since MDEV-7205,
the option does not affect Galera, so changes are still replicated.
So sql_log_bin=off only "partially" disables the binlog, and the INSERT
will involve both binlog and innodb, thus requiring an internal two-phase
commit (2PC). In 2PC the INSERT is first prepared, which makes it
transition to PREPARED state in innodb, and later committed, which
causes the new assertion from MDEV-24035 to fail.
Running the same test with sql_log_bin enabled also results in 2PC,
but the execution has one more step for ordered commit, between prepare
and commit. Ordered commit causes the transaction state to transition
back to TRX_STATE_NOT_STARTED, thus avoiding the assertion.
This patch makes sure that when sql_log_bin=off, the ordered commit
step is not skipped, thus going through the expected state transitions
in the storage engine.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
In Log_event::read_log_event(), don't use IO_CACHE::error of the relay log's
IO_CACHE to signal an error back to the caller. When reading the active
relay log, this flag is also being used by the IO thread, and setting it can
randomly cause the IO thread to wrongly detect IO error on writing and
permanently disable the relay log.
This was seen sporadically in test case rpl.rpl_from_mysql80. The read
error set by the SQL thread in the IO_CACHE would be interpreted as a
write error by the IO thread, which would cause it to throw a fatal
error and close the relay log. And this would later cause CHANGE
MASTER to try to purge a closed relay log, resulting in nullptr crash.
The read error occurs when the SQL thread is not able to parse an event read
from the relay log. This can happen, as here, when replicating unknown events
from a MySQL master, and potentially also for other reasons.
Also fix a mistake in my_b_flush_io_cache() introduced back in 2001
(fa09f2cd7e) where my_b_flush_io_cache() could wrongly return an error set
in IO_CACHE::error, even if the flush operation itself succeeded.
Also fix another sporadic failure in rpl.rpl_from_mysql80 where the output
of MASTER_POS_WAIT() depended on timing of SQL and IO thread.
Reviewed-by: Monty <monty@mariadb.org>
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The LOCK_global_system_variables must not be held when taking mutexes
such as LOCK_commit_ordered and LOCK_log, as this causes inconsistent
mutex locking order that can theoretically cause the server to
deadlock.
To avoid this, temporarily release LOCK_global_system_variables in two
system variable update functions, like it is done in many other
places.
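
The unlock/relock pattern can be sketched as follows. The two lock names are
taken from the text above; the function name is hypothetical, and Python
threading locks stand in for the server mutexes.

```python
import threading

LOCK_global_system_variables = threading.Lock()
LOCK_log = threading.Lock()


def update_sysvar_needing_log_lock(apply_change):
    """Called with LOCK_global_system_variables held, as update hooks are."""
    # Release first, so LOCK_log is never taken while holding
    # LOCK_global_system_variables.
    LOCK_global_system_variables.release()
    try:
        with LOCK_log:
            apply_change()
    finally:
        # The caller expects the lock to be held again on return.
        LOCK_global_system_variables.acquire()
```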
Enforce the correct locking order at server startup, to more easily
catch (in debug builds) any remaining wrong orders that may be hidden
elsewhere in the code.
Note that when this is merged to 11.4, similar unlock/lock of
LOCK_global_system_variables must be added in update_binlog_space_limit()
as is done in binlog_checksum_update() and fix_max_binlog_size(), as this
is a new function added in 11.4 that also needs the same fix. Tests will
fail with wrong mutex order until this is done.
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Fixing a wrong DBUG_ASSERT.
thd->start_time and thd->start_time_sec_part cannot be 0 at the same time.
But thd->start_time can be 0 when thd->start_time_sec_part is not 0,
e.g. after:
SET timestamp=0.99;
For a primary configured with wait_point=AFTER_SYNC, if two threads
T1 (binlogging through MYSQL_BIN_LOG::write()) and T2 were
binlogging at the same time, T1 could accidentally wait for its
semi-sync ACK using the binlog coordinates of T2. Prior to
MDEV-33551, this only resulted in delayed transactions, because all
transactions shared the same condition variable for ACK signaling.
However, with the MDEV-33551 changes, each thread has its own
condition variable to signal. So T1 could wait indefinitely when
either:
1) T1's ACK is received but not T2's when T1 goes into
wait_after_sync(), because the ACK receiver thread has already
notified about the T1 ACK, but T1 was _actually_ waiting on T2's
ACK, and therefore tries to wait (in vain).
2) T1 goes to wait_after_sync() before any ACKs have arrived. When
T1's ACK comes in, T1 is woken up; however, it sees that it needs to wait
more (because it was actually waiting on T2's ACK), and goes to wait
again (this time, in vain).
Note that the actual cause of T1 waiting on T2's binlog coordinates
is that when MYSQL_BIN_LOG::write() called
Repl_semisync_master::wait_after_sync(), the binlog offset parameter
was read as the end of MYSQL_BIN_LOG::log_file, which is shared
among transactions. So if T2 had updated the binary log _after_ T1
had released LOCK_log, but before T1 had invoked wait_after_sync(), T1
would use the end of the binary log file as the binlog offset, which
was that of T2 (or any future transaction).
The fix in this patch ensures that a transaction uses the same binary log
coordinates in report_binlog_update() and wait_after_sync().
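
The difference between the two behaviours can be shown with a toy model. The
class and function names are hypothetical; the point is only that the wait
position must be the one captured when the transaction wrote its own events,
not the shared end of the log read later.

```python
class ToyBinlog:
    def __init__(self):
        self.end_pos = 0  # shared among all transactions

    def write(self, nbytes):
        # Returns this writer's own end position, captured at write time
        # (in the server this happens while LOCK_log is still held).
        self.end_pos += nbytes
        return self.end_pos


def wait_position_before_fix(log, my_end_pos):
    # Reads the shared end of the log, which may already belong to a
    # later transaction.
    return log.end_pos


def wait_position_after_fix(log, my_end_pos):
    # Uses the coordinates that were reported for this transaction.
    return my_end_pos
```

With T1 writing 100 bytes and T2 then writing 50 more, the first variant
makes T1 wait for position 150 (T2's), the second for its own position 100.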
Reviewed By
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>