Most of the failures were caused by errors in the test suite itself.
The one real bug was that table_map_id was in some places defined as
ulong and in other places as ulonglong. On 64-bit Linux this is not a
problem, as ulong and ulonglong are both 64 bits wide (LP64), but on
Windows ulong is only 32 bits (LLP64), which caused failures.
Fixed by ensuring that all instances of table_map_id are ulonglong.
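For illustration, a minimal standalone C++ check (not MariaDB code) of
why the mismatch is invisible on 64-bit Linux but visible on Windows:

    #include <cstdio>

    int main()
    {
      // LP64 (64-bit Linux): unsigned long is 8 bytes, the same as
      // unsigned long long, so mixing the two types goes unnoticed.
      // LLP64 (64-bit Windows): unsigned long is only 4 bytes, so a
      // table_map_id stored as ulong drops the upper 32 bits.
      printf("sizeof(unsigned long)      = %zu\n", sizeof(unsigned long));
      printf("sizeof(unsigned long long) = %zu\n", sizeof(unsigned long long));

      unsigned long long id= 0x100000001ULL;       // a value above 32 bits
      unsigned long truncated= (unsigned long) id; // truncates on LLP64
      printf("id=%llu truncated=%lu\n", id, truncated);
      return 0;
    }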
This patch augments Gtid_log_event with the user thread-id.
In particular that compensates for the loss of this info in
Rows_log_events.
Gtid_log_event::thread_id becomes visible in mysqlbinlog output as a
64-bit unsigned integer, e.g.:
#231025 16:21:45 server id 1 end_log_pos 537 CRC32 0x1cf1d963 GTID 0-1-2 ddl thread_id=10
While the size of the Gtid event has grown by 8-9 bytes,
replication OLD <-> NEW is not affected by it.
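The on-disk layout details are in the actual patch; as a rough sketch
under assumed conventions (a little-endian trailing field, and readers
skipping unknown trailing bytes via the event length in the common
header), appending the thread id could look like:

    #include <cstdint>
    #include <vector>

    // Hypothetical helper: append a little-endian uint64 (the thread id)
    // at the end of the Gtid event body.  Because readers take the total
    // event size from the common header, an old reader can skip the extra
    // bytes, which is why OLD <-> NEW replication keeps working.
    static void append_u64_le(std::vector<unsigned char> &body, uint64_t v)
    {
      for (int i= 0; i < 8; i++)
        body.push_back((unsigned char) (v >> (8 * i)));
    }

    int main()
    {
      std::vector<unsigned char> gtid_body;  // fixed part omitted here
      append_u64_le(gtid_body, 10);          // thread_id=10, as above
      return (int) gtid_body.size() - 8;     // 0 on success
    }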
This work was started by the late Sujatha Sivakumar.
Brandon Nesterenko took it over, reviewed initial patches and extended
the work.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Summary
=======
With FULL_NODUP mode, the before image includes all columns and the
after image includes only the changed columns. Flashback will swap the
values of the changed columns from the after image into the before image.
For example:
BI: c1, c2, c3_old, c4_old
AI: c3_new, c4_new
flashback will reconstruct the before and after images to
BI: c1, c2, c3_new, c4_new
AI: c3_old, c4_old
Implementation
==============
When parsing the before and after images of an Update_rows_event whose
after image doesn't include all columns, the position and length of
each field are collected into bi_fields and ai_fields.
The changed fields are swapped between bi_fields and ai_fields.
Then the before and after images are recreated from bi_fields and
ai_fields. The null bit is set to 1 if the field is NULL, otherwise
it is set to 0.
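A simplified sketch of the swap, using hypothetical field descriptors
rather than the actual mysqlbinlog structures:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Hypothetical descriptor of one field inside a row image.
    struct Field_ref
    {
      const unsigned char *pos;  // start of the packed value
      size_t length;             // packed length
      bool is_null;              // corresponding null bit
    };

    // For each column present in the partial after image, exchange its
    // descriptor with the matching one from the before image.  Rebuilding
    // the images from bi_fields/ai_fields afterwards yields
    //   BI: c1, c2, c3_new, c4_new    AI: c3_old, c4_old
    static void swap_changed_fields(std::vector<Field_ref> &bi_fields,
                                    std::vector<Field_ref> &ai_fields,
                                    const std::vector<size_t> &changed_idx)
    {
      for (size_t i= 0; i < ai_fields.size(); i++)
        std::swap(bi_fields[changed_idx[i]], ai_fields[i]);
    }

    int main()
    {
      std::vector<Field_ref> bi_fields(4), ai_fields(2);
      std::vector<size_t> changed_idx= {2, 3};   // c3 and c4 changed
      swap_changed_fields(bi_fields, ai_fields, changed_idx);
      return 0;
    }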
This patch also optimizes flashback a little:
- calc_row_event_length is used instead of print_verbose_one_row
- swap_buff1 and swap_buff2 are removed.
Calling SHOW BINLOG EVENTS FROM <offset> with an invalid offset
writes error messages into the server log about invalid reads. The
read errors that occur from this command should only be relayed back
to the user, though, and not written into the server log. This is
because they are read-only, have no impact on server operation,
and the client only needs to be informed so it can correct the parameter.
This patch fixes this by omitting binary log read errors from the
server when the invocation happens from SHOW BINLOG EVENTS.
Additionally, redundant error messages are omitted when calling the
string-based read_log_event from the IO_Cache-based read_log_event,
as the latter will already report the error of the former.
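The idea in sketch form, with hypothetical names (user_invoked is not
the actual parameter):

    #include <cstdio>

    static void send_error_to_client(const char *msg)
    {
      printf("client gets: %s\n", msg);  // stand-in for the real protocol
    }

    // Read errors are always relayed to the client; they reach the server
    // error log only when the read was not initiated by a user command
    // such as SHOW BINLOG EVENTS.
    static void report_binlog_read_error(const char *msg, bool user_invoked)
    {
      send_error_to_client(msg);
      if (!user_invoked)
        fprintf(stderr, "[ERROR] %s\n", msg);  // server log path
    }

    int main()
    {
      report_binlog_read_error("invalid offset in binlog",
                               /*user_invoked=*/ true);
      return 0;
    }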
Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
Compute binlog checksums (when enabled) already when writing events
into the statement or transaction caches, where before it was done
when the caches are copied to the real binlog file. This moves the
checksum computation outside of holding LOCK_log, improving
scalability.
At stmt/trx cache write time, the final end_log_pos values are not
known, so with this patch these will be set to 0. Events that are
written directly to the binlog file (not through stmt/trx cache) keep
the correct end_log_pos value. The GTID and COMMIT/XID events at the
start and end of event groups are written directly, so the zero
end_log_pos is only for events in the middle of event groups, which
does not negatively affect replication.
An option --binlog-legacy-event-pos, off by default, is provided to
disable this behavior to provide backwards compatibility with any
external applications that might rely on end_log_pos in events in the
middle of event groups.
Checksums cannot be pre-computed when binlog encryption is enabled, as
encryption relies on correct end_log_pos to provide part of the
nonce/IV.
Checksum pre-computation is also disabled for WSREP/Galera, as it uses
events differently in its write-sets and so on. Pre-computation of
checksums could be extended to Galera, where it makes sense, in a
future patch.
The current --binlog-checksum configuration is saved in
binlog_cache_data at transaction start and used to pre-compute
checksums in cache, if applicable. When the cache is later copied to
the binlog, a check is made if the saved value still matches the
configured global value; if so, the events are block-copied directly
into the binlog file. If --binlog-checksum was changed during the
transaction, events are re-written to the binlog file one-by-one and
the checksums recomputed/discarded as appropriate.
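The copy-time decision in sketch form, with hypothetical stand-in types
(the real code operates on IO_CACHE contents under LOCK_log):

    #include <cstdio>
    #include <vector>

    enum checksum_alg { CHECKSUM_OFF, CHECKSUM_CRC32 };

    struct Cached_event { std::vector<unsigned char> bytes; };

    struct binlog_cache_sketch
    {
      checksum_alg alg_at_start;         // --binlog-checksum at trx start
      std::vector<Cached_event> events;  // checksummed at cache-write time
    };

    static void block_copy(const std::vector<Cached_event> &evs)
    { printf("block-copied %zu events\n", evs.size()); }

    static void rewrite_one(const Cached_event &, checksum_alg)
    { printf("re-wrote one event, checksum recomputed/discarded\n"); }

    // If --binlog-checksum did not change during the transaction, the
    // cache (checksums already included) is block-copied; otherwise each
    // event is re-written one-by-one with the current setting.
    static void copy_cache_to_binlog(const binlog_cache_sketch &cache,
                                     checksum_alg current_alg)
    {
      if (cache.alg_at_start == current_alg)
        block_copy(cache.events);
      else
        for (const Cached_event &ev : cache.events)
          rewrite_one(ev, current_alg);
    }

    int main()
    {
      binlog_cache_sketch cache{CHECKSUM_CRC32, {{}, {}}};
      copy_cache_to_binlog(cache, CHECKSUM_CRC32);  // fast path
      copy_cache_to_binlog(cache, CHECKSUM_OFF);    // slow path
      return 0;
    }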
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This is a preparatory commit for pre-computing checksums outside of
holding LOCK_log, no functional changes.
Which checksum algorithm is used (if any) when writing an event does
not belong in the event; it is a property of the log being written to.
Instead decide the checksum algorithm when constructing the
Log_event_writer object, and store it there.
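Schematically (simplified, hypothetical types), the algorithm becomes
a property of the writer, fixed at construction:

    enum enum_binlog_checksum_alg
    {
      BINLOG_CHECKSUM_ALG_OFF= 0,
      BINLOG_CHECKSUM_ALG_CRC32= 1
    };

    // Sketch: the writer is created for a particular log and carries the
    // checksum algorithm; events no longer decide it at write time.
    class Log_event_writer_sketch
    {
    public:
      explicit Log_event_writer_sketch(enum_binlog_checksum_alg alg)
        : checksum_alg(alg) {}
      enum_binlog_checksum_alg checksum_alg;
    };

    int main()
    {
      Log_event_writer_sketch w(BINLOG_CHECKSUM_ALG_CRC32);
      return w.checksum_alg == BINLOG_CHECKSUM_ALG_CRC32 ? 0 : 1;
    }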
Introduce a client-only Log_event::read_checksum_alg to be able to
print the checksum read, and a
Format_description_log_event::source_checksum_alg which is the
checksum algorithm (if any) to use when reading events from a log.
Also eliminate some redundant `enum` keywords on the enum_binlog_checksum_alg
type.
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This is a preparatory patch for precomputing binlog checksums outside
of holding LOCK_log, no functional changes.
Replace Log_event::writer with just passing the writer object as a
function parameter to Log_event::write().
This is mainly for code clarity. Having to set ev->writer before every
call to ev->write() is error-prone (what if it's forgotten in some
code place?), while passing it as a parameter, as usual, makes the
dataflow explicit.
As a minor point, it also improves the code, as the compiler can now
keep the function parameter in a register across nested calls (when it
is a class member, the compiler needs to reload it across nested calls
in case the object was modified during the call).
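The shape of the change in sketch form (hypothetical signatures):

    #include <cstdio>

    struct Writer { const char *name; };

    struct Event_sketch
    {
      // Before: a 'Writer *writer;' member set before each write() call,
      // easy to forget.  After: the writer is an explicit parameter, so
      // the compiler may also keep it in a register across nested calls.
      int write(Writer *writer)
      {
        printf("writing via %s\n", writer->name);
        return 0;
      }
    };

    int main()
    {
      Writer w{"binlog"};
      Event_sketch ev;
      return ev.write(&w);   // dataflow is visible at the call site
    }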
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Pack these fields together:
event_owns_temp_buf
cache_type
slave_exec_mode
checksum_alg
Make them bitfields to fit into a single 2-byte hole.
This saves 24 bytes per event.
SLAVE_EXEC_MODE_LAST_BIT is rewritten as
> SLAVE_EXEC_MODE_LAST= SLAVE_EXEC_MODE_IDEMPOTENT
to avoid a false-positive -Wbitfield-enum-conversion warning:
Bit-field 'slave_exec_mode' is not wide enough to store all enumerators of
'enum_slave_exec_mode'.
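A sketch of the packing with hypothetical field widths:

    #include <cstdio>

    enum enum_slave_exec_mode
    {
      SLAVE_EXEC_MODE_STRICT= 0,
      SLAVE_EXEC_MODE_IDEMPOTENT= 1,
      SLAVE_EXEC_MODE_LAST= SLAVE_EXEC_MODE_IDEMPOTENT
    };

    struct Packed_fields_sketch
    {
      // The four fields share one small allocation unit instead of four
      // separately aligned members; widths here are illustrative only.
      bool event_owns_temp_buf:1;
      unsigned int cache_type:3;
      enum_slave_exec_mode slave_exec_mode:2;  // wide enough up to ..._LAST
      unsigned int checksum_alg:8;
    };

    int main()
    {
      printf("sizeof(Packed_fields_sketch) = %zu\n",
             sizeof(Packed_fields_sketch));
      return 0;
    }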
Add a new virtual function that will increase the inserted rows count
for the insert log event and decrease it for the delete event.
Reuses Rows_log_event::m_row_count on the replication side, which was only
set on the logging side.
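The hook in sketch form, with hypothetical class and method names:

    #include <cstdint>

    struct Rows_event_sketch
    {
      virtual ~Rows_event_sketch() {}
      // Default: most events leave the inserted-rows count unchanged.
      virtual void update_inserted_rows(int64_t &) const {}
    };

    struct Write_rows_sketch : Rows_event_sketch
    {
      void update_inserted_rows(int64_t &count) const override { count++; }
    };

    struct Delete_rows_sketch : Rows_event_sketch
    {
      void update_inserted_rows(int64_t &count) const override { count--; }
    };

    int main()
    {
      int64_t inserted= 0;
      Write_rows_sketch w;
      Delete_rows_sketch d;
      w.update_inserted_rows(inserted);  // +1
      d.update_inserted_rows(inserted);  // -1
      return (int) inserted;             // 0
    }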
We can't rely on keys formed from columns that were added during this
ALTER. These columns can be set to non-deterministic values, which can
result in a broken or incorrect search.
The same applies to keys that contain reliable columns but also have
bogus ones: using them could narrow the search, but they are ignored
as well.
Also, added columns shouldn't be considered during the record match.
The table->has_value_set bitmap is used to determine them.
To fill the has_value_set bitmap in the find_key call, an extra
unpack_row call has been added.
In the replication case, extra replica-side columns can also appear;
we try to ignore them, too.
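The key-selection rule in sketch form (hypothetical structures; the
real code inspects key parts against table->has_value_set):

    #include <vector>

    struct Key_sketch { std::vector<int> part_columns; };

    // has_value_set[i] == true means column i carries a reliable value,
    // i.e. it is neither added by this ALTER nor a replica-only column.
    static bool key_is_usable(const Key_sketch &key,
                              const std::vector<bool> &has_value_set)
    {
      for (int col : key.part_columns)
        if (!has_value_set[col])
          return false;   // one bogus column disqualifies the whole key
      return true;
    }

    int main()
    {
      std::vector<bool> has_value_set= {true, true, false}; // col 2 added
      Key_sketch good{{0, 1}}, bad{{0, 2}};
      return (key_is_usable(good, has_value_set) &&
              !key_is_usable(bad, has_value_set)) ? 0 : 1;
    }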
* Log rows in online_alter_binlog.
* Table online data is replicated within dedicated binlog file
* Cached data is written on commit.
* Versioning is fully supported.
* Works both with and without the binlog enabled.
* For now, setting up savepoints is forbidden while an ONLINE ALTER
goes on. Extra support is required: we could simply log the SAVEPOINT
query events and replicate them together with row events, but this is
not implemented for now.
* Cache flipping:
We want to address in advance a possible bottleneck in reading and
writing the online alter binlog.
IO_CACHE does not provide anything better than sequential access;
besides, only a single write is mutex-protected, which is not suitable,
since we should write a transaction atomically.
To solve this, a special layer on top of Event_log is implemented.
There are two IO_CACHE files underneath: one for reading, and one for
writing.
Once the read cache is empty, an exclusive lock is acquired (we may
wait for a currently active transaction to finish writing), and flip()
is performed: the write cache is reopened for reading, and the read
cache is emptied and reopened for writing; see the sketch after this
list.
This resembles the buffer flip that happens in accelerated graphics
(DirectX/OpenGL/etc).
In this sense, Cache_flip_event_log is considered non-blocking for a
single reader and a single writer, with the only lock held by the
reader during the flip.
An alternative approach by implementing a fair concurrent circular buffer
is described in MDEV-24676.
* Cache managers:
We have two cache sinks: statement and transactional.
It is important that the changes are first cached per-statement and
per-transaction.
If a statement fails, then only statement data is rolled back. The
transaction moves along, however.
It turns out there's no guarantee that a TABLE will persist in
thd->open_tables until the transaction commit moment.
If an error occurs, tables from the statement are purged.
Therefore, we can't store the caches in TABLE. Ideally, it should be
the handlerton, but we cut the corner and store it in THD in a list.
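The flip in sketch form, with in-memory buffers standing in for the
two IO_CACHE files (locking simplified: the real code protects a whole
transaction, not a single byte):

    #include <mutex>
    #include <utility>
    #include <vector>

    class Flip_log_sketch
    {
      std::vector<unsigned char> read_buf, write_buf;
      std::mutex flip_lock;   // the only lock the reader ever takes
      size_t read_pos= 0;

    public:
      void write(unsigned char b)
      {
        std::lock_guard<std::mutex> g(flip_lock);
        write_buf.push_back(b);
      }

      // Reader side, called when read_buf is exhausted: waits out the
      // active writer via the lock, then the buffers swap roles.
      void flip()
      {
        std::lock_guard<std::mutex> g(flip_lock);
        read_buf.clear();
        std::swap(read_buf, write_buf);
        read_pos= 0;
      }

      bool read(unsigned char &b)
      {
        if (read_pos == read_buf.size())
          return false;       // caller should flip() and retry
        b= read_buf[read_pos++];
        return true;
      }
    };

    int main()
    {
      Flip_log_sketch log;
      log.write(42);
      unsigned char b;
      if (!log.read(b))       // read side empty: flip and retry
        log.flip();
      return (log.read(b) && b == 42) ? 0 : 1;
    }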
Event_log is supposed to be a basic logging class that can write
events to a single file.
MYSQL_BIN_LOG in comparison will also have:
* rotation support
* index files
* purging
* gtid and transactional information handling
and is dedicated to the general-purpose binlog.
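In sketch form (names suffixed, members hypothetical):

    #include <string>

    // Basic sink: appends events to a single file.
    class Event_log_sketch
    {
    public:
      virtual ~Event_log_sketch() {}
      virtual void append_event(const std::string &) { /* write to file */ }
    };

    // The general-purpose binlog layers rotation, index files, purging
    // and GTID/transactional bookkeeping on top of the basic sink.
    class MYSQL_BIN_LOG_sketch : public Event_log_sketch
    {
    public:
      void rotate() {}
      void purge_logs_before(long /*date*/) {}
      // index file and GTID state handling would live here too
    };

    int main()
    {
      MYSQL_BIN_LOG_sketch log;
      log.append_event("ev");
      return 0;
    }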
* Eliminate most usages of THD::use_trans_table. Only 3 are left, at
quite high levels, and they are really essential.
* Eliminate is_transactional argument when possible. Lots of places are
left though, because of some WSREP error handling in
MYSQL_BIN_LOG::set_write_error.
* Remove junk binlog functions from THD
* binlog_prepare_pending_rows_event is moved to log.cc inside
MYSQL_BIN_LOG and is no longer a template. Instead, it accepts an event
factory with a type code, and a callback to a constructing function in
it.
This patch adds a way to override default collations
(or "character set collations") for desired character sets.
The SQL standard says:
> Each collation known in an SQL-environment is applicable to one
> or more character sets, and for each character set, one or more
> collations are applicable to it, one of which is associated with
> it as its character set collation.
In MariaDB, character set collations have been hard-coded so far,
e.g. utf8mb4_general_ci has been the hard-coded character set collation
for utf8mb4.
This patch makes it possible to override (globally per server, or per
session) character set collations, so that, for example, uca1400_ai_ci
can be set as the character set collation for Unicode character sets
(instead of the compiled xxx_general_ci).
The array of overridden character set collations is stored in a new
(session and global) system variable @@character_set_collations and
can be set as a comma separated list of charset=collation pairs, e.g.:
SET @@character_set_collations='utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci';
The variable is empty by default, which means using the hard-coded
character set collations (e.g. utf8mb4_general_ci for utf8mb4).
The variable can also be set globally by passing it on the server
startup command line, and/or in my.cnf.
When opening and locking tables, if triggers will be invoked in a
separate database, thd->set_db() is invoked, thus freeing the memory
and headers which thd->db had previously pointed to. In row based
replication, the event execution logic initializes thd->db to point
to the database which the event targets, which is owned by the
corresponding table share (introduced in d9898c9 for MDEV-7409).
The problem, then, is that during the table opening and locking
process for a row event, memory which belongs to the table share
would be freed, which is not valid.
This patch replaces the thd->reset_db() calls with thd->set_db(),
which copies by value rather than by reference. Then, when the
memory is freed, our copy of the memory is freed, rather than memory
which belongs to a table share.
Notes:
1. The call to change thd->db now happens at a higher level, in
Rows_log_event::do_apply_event() rather than ::do_exec_row(), in the
call stack. This is because do_exec_row() is called within a loop,
and each invocation would redundantly set and unset the db to the
same value.
2. thd->set_db() is only used if triggers are to be invoked, as
there is no vulnerability in the non-trigger case, and copying
memory would be an unnecessary inefficiency.
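The difference in sketch form (a hypothetical, drastically simplified
THD; the real functions manage LEX_CSTRING and memory-root details):

    #include <cstdlib>
    #include <cstring>

    struct THD_sketch
    {
      char *db= nullptr;

      // reset_db-style: alias the caller's memory.  If that memory is
      // owned by a table share, freeing thd->db later corrupts the share.
      void reset_db_alias(char *name) { db= name; }

      // set_db-style: take a private copy; freeing it later is safe.
      void set_db_copy(const char *name)
      {
        free(db);
        size_t len= strlen(name) + 1;
        db= (char *) malloc(len);
        memcpy(db, name, len);
      }

      ~THD_sketch() { free(db); }
    };

    int main()
    {
      char share_owned[]= "mydb";   // pretend a table share owns this
      THD_sketch thd;
      thd.set_db_copy(share_owned); // safe: thd owns its own copy
      return strcmp(thd.db, "mydb");
    }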
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
If a replica failed to update the GTID slave state when committing
an XA PREPARE, the replica would retry the transaction and get an
out-of-order GTID error. This is because the commit phase of an XA
PREPARE is bifurcated. That is, first, the prepare is handled by the
relevant storage engines. Then second, the GTID slave state is
updated as a separate autocommit transaction. If the second phase
fails, and the transaction is retried, then the same transaction is
attempted to be committed again, resulting in a GTID out-of-order
error.
This patch fixes this error by immediately stopping the slave and
reporting the appropriate error. That is, there was logic to bypass
the error when updating the GTID slave state table if the underlying
error is allowed for retry on a parallel slave. This patch adds a
parameter to disallow that bypass, thereby forcing the error state
to surface.
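In sketch form, with a hypothetical parameter name (not the actual
server signature):

    #include <cstdio>

    // Previously, an error while updating the GTID slave state could be
    // bypassed if it was retriable on a parallel slave.  The new
    // parameter forces the error through, so the slave stops instead of
    // retrying the already-prepared XA transaction.
    static int update_gtid_slave_state(bool error_occurred,
                                       bool error_is_retriable,
                                       bool disallow_bypass /* new */)
    {
      if (!error_occurred)
        return 0;
      if (error_is_retriable && !disallow_bypass)
        return 0;   // old behavior: bypass now, retry the transaction
      fprintf(stderr,
              "[ERROR] slave stopped: GTID slave state update failed\n");
      return 1;     // surface the error, stop the slave
    }

    int main()
    {
      // XA PREPARE commit path: the bypass is disallowed.
      return update_gtid_slave_state(true, true, true) == 1 ? 0 : 1;
    }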
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>