Added NULL columns in testing, parameterized tests by file so that
testing combinations of options (add/drop/type conversion/NULL)
together is easier.
Some TODOs remain to be written.
One can have data loss in multi-master setups when 1) both masters
update the same table, 2) ALTER TABLE is run on one master which
re-arranges the column ordering, and 3) transactions are binlogged
in ROW binlog_format. This is because a slave identifies columns to
update using the column index number. That is, if a transaction
updates a table with <n> columns on the master, the binary log ROW
event will store its data column-by-column, from the first column to
the <n>th column, in order. When the slave applies this row to its
table, it simply updates each column of its new row using these same
values, in the same order. If the slave's table has its columns in a
different order (from some ALTER TABLE which can have added, removed,
or re-arranged these columns), the data will still be stored in the
order it had on the master table. This leads to data loss.
This patch adds the ability for a slave to look up columns by name
when applying ROW events, so if the ordering of columns on the master
and slave differs, the slave can still apply the data changes to the
correct columns. This is limited to when a master binlogs events
with binlog_row_metadata=FULL, as this extends the
Table_map_log_event metadata to contain column names. When this
column name metadata is missing, the lookup behavior is unchanged
(i.e. it will use the column index, as before).
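A simplified sketch of the lookup idea (hypothetical helper, not the
exact patch code): map each column name from the Table_map_log_event
metadata to the slave field with the same name.

  /* Hypothetical helper: return the index of the slave field whose
     name matches a master column name, or -1 if the slave has no
     such column. */
  static int slave_field_index_by_name(TABLE *table, const char *col_name)
  {
    for (uint i= 0; i < table->s->fields; i++)
      if (!my_strcasecmp(system_charset_info,
                         table->field[i]->field_name.str, col_name))
        return (int) i;
    return -1;
  }

A result of -1 means the master column has no counterpart on the
slave, which is the case that change 1 below has to handle while
unpacking.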
General code changes that extend beyond just this patch:
1. In unpack_row(), column "cleanup" for fields that don't exist on
the slave is now done as the unpacking proceeds. Prior to this patch,
this cleanup would happen only after unpacking the event data, as
the only place that the event data could have extra columns would be
after the slave fields had been processed. Now, the slave can be
missing _any_ field that was logged on the master, and we have to
account for that as we see it. That is, for fields in the event
that are missing on the slave, we must both move forward the
pointer that tracks the next spot to unpack data and account for
our null_mask for tracking null columns (see the sketch after this
list). To help ensure this new logic is correct, a new assertion is
added,
assert_master_idx_matches_null_tracking()
which verifies that our tracking of master indices aligns with
our null field tracking (albeit this can only be done when all
fields of a row event are binlogged, e.g. when logged with
binlog_row_image=FULL). Note that we can't meaningfully validate
that our master index is consistent with null bit tracking when we
have missing columns, as they can be missing in arbitrary places, and
if we were to track when a bit is written, we would have to refer to
master_idx, which would invalidate the assertion in the first
place.
2. table->{read_set, write_set, rpl_write_set} also needed to be
updated so their bits would adjust for offset columns on the slave.
Note that rpl_write_set is set in
TABLE::mark_columns_per_binlog_row_image(), which is pulled from
10.11. This fixes an existing bug in 10.6 where not all slave columns
would be written if the slave had additional columns
(the bug doesn't exist in 10.11, because it already calls
mark_columns_per_binlog_row_image()).
3. There was an issue with the conversion table, where its columns
weren't in sync with the table_def generated by the
Table_map_log_event. I.e., when the slave didn't have the column,
the conv_table generation logic would simply skip the rest of the
loop and go on to the next column (without adding a field for the
missing column). The compatible_with and unpack_row logic, however,
assumed the conv_table was in alignment with the columns from the
Rows_log_event. I thought of two ways to fix this:
a. Add a dummy field to the conv_table (and later reset it to
NULL).
b. Have a separate index variable which tracks the conv_table
fields.
This patch implements the second (b), as it is more efficient.
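A minimal sketch of the per-column loop from change 1 (hypothetical
names and helpers, not the actual unpack_row() code): a column missing
on the slave must still have its null bit consumed and, when it is not
NULL, its packed bytes skipped, so that later columns are read from
the right offsets.

  for (uint master_idx= 0; master_idx < master_cols; master_idx++)
  {
    bool is_null= (*null_ptr & null_mask) != 0;
    if ((null_mask<<= 1) == 0x100)       /* advance null bit tracking */
    {
      null_mask= 1;
      null_ptr++;
    }
    Field *field= slave_field_for(master_idx);  /* NULL if missing */
    if (!field)
    {
      if (!is_null)                      /* skip the packed bytes */
        pack_ptr+= packed_length_of(master_idx, pack_ptr);
      continue;                          /* "cleanup" done inline */
    }
    pack_ptr= unpack_field(field, pack_ptr, is_null);
  }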
Patch is currently still being drafted. TODOs left:
1. Finish MTR testing current changes
2. Add error-checking logic & MTR tests
3. Optimize master_to_slave_index_map and master_unmatched_cols
to use an array and bitmap (per Kristian's review suggestion)
4. Re-test rpl and multi_source MTR suites with
binlog_row_metadata=FULL (temporarily change default value) to
make sure nothing fails from that option alone
5. Cleanup patch (remove printfs for debugging)
6. Extensive QA testing
Reviewed By:
============
<TODO>
Test to show that the slave doesn't re-binlog correctly if it has
extra columns. Note this is only broken in 10.6; 10.11 fixes it (as
does my MDEV-36290 patch). The problem is that the slave doesn't set
table->rpl_write_set correctly. Also table->read_set and
table->write_set are not correct, but I haven't yet looked into the
implications of that.
To show this, this test uses chain replication with slave-specific
columns. The default value for the column is different on each slave.
The middle slave doesn't write its column to the binlog (despite being
configured with binlog_row_image=FULL), and so when the final slave in
the chain sees the event, it only sees one column being updated, and
then uses its own default for its extra column, which is different
from the middle slave's default. Then we use diff_tables.inc to show
the difference between the tables.
Problem:
=======
- While loading the foreign key constraints for the parent table,
if the child table wasn't open, InnoDB uses the parent table heap
to store the child table name in the fk_tables list. If a subsequent
foreign key relation for the parent table fails with an error,
InnoDB evicts the parent table from memory. But InnoDB accesses the
evicted table's memory again in dict_sys.load_table().
Solution:
========
dict_load_table_one(): In case of error, remove the child table
names which were added during dict_load_foreigns().
Problem:
========
- InnoDB performs consecutive instant ALTER operations; when the
first instant DDL fails, it fails to reset the old instant
information in the table during rollback. This leads the subsequent
instant ALTER to have wrong assumptions about the existing instant
column information.
Fix:
====
dict_table_t::instant_column(): Duplicate the instant information
field of the table. By doing this, InnoDB ALTER retains the old
instant information and resets it during the rollback operation.
The test in commit 1756b0f37d
is occasionally failing if there are unexpectedly many page cleaner
batches that are updating the log checkpoint by small amounts.
This occurs in particular when running the server under Valgrind.
Let us insert the same number of records with a larger number of
statements, in the hope that the test would then be more likely to
pass.
This prevents a crash in wsrep_report_error() which happened when
appliers ran with FK and UK checks disabled and erroneously executed
plain inserts as bulk inserts.
Moreover, in release builds such a behavior could lead to deadlocks between two applier
threads if a thread waiting for a table-level lock was ordered before the lock holder.
In that case the lock holder would proceed to commit order and wait forever for the
now-blocked other applier thread to commit before.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
The problem was that a transaction was BF-aborted after certification
succeeded, and the transaction tried to roll back; during
rollback the binlog stmt cache containing sequence value reservations
was written into the binlog.
The transaction must replay because certification succeeded, but the
transaction must not be written into the binlog yet; that will
be done during commit after the replay.
The fix is to skip the binlog write if the transaction must replay,
and in replay we need to reset the binlog stmt cache.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
The test issues a simple INSERT statement, while sql_log_bin = 0.
This option disables writes to binlog. However, since MDEV-7205,
the option does not affect Galera, so changes are still replicated.
So sql_log_bin=off "partially" disables the binlog, and the INSERT
will involve both the binlog and InnoDB, thus requiring an internal
two-phase commit (2PC). In 2PC the INSERT is first prepared, which
makes it transition to the PREPARED state in InnoDB, and later
committed, which causes the new assertion from MDEV-24035 to fail.
Running the same test with sql_log_bin enabled also results in 2PC,
but the execution has one more step, the ordered commit, between
prepare and commit. The ordered commit causes the transaction state to
transition back to TRX_STATE_NOT_STARTED, thus avoiding the assertion.
This patch makes sure that when sql_log_bin=off, the ordered commit
step is not skipped, thus going through the expected state transitions
in the storage engine.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Fix a regression on test galera_sr.GCF-572 introduced by commit
c9a6adba. This commit partially reverted a trivial test fix for
the galera_sr.GCF-572 test (commit 11ef59fb), which was targeted
at 10.6.
This backports the fix from commit 11ef59fb to 10.5, so
that the test stays the same in all versions >= 10.5.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Seen on Windows using the newest pcre2-10.45. Tests that use regular
expressions, e.g. mariabackup.partial, fail very often.
To fix, serialize regexec calls, which are documented as
non-thread-safe in https://www.pcre.org/current/doc/html/pcre2posix.html.
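A minimal sketch of such serialization (a hypothetical wrapper, not
necessarily how the patch implements it): guard every regexec() call
with a single mutex.

  #include <mutex>
  #include "pcre2posix.h"  /* POSIX wrapper of PCRE2: regcomp()/regexec() */

  static std::mutex regex_mutex;

  /* Serialized replacement for direct regexec() calls. */
  static int my_regexec(const regex_t *preg, const char *subject,
                        size_t nmatch, regmatch_t pmatch[], int eflags)
  {
    std::lock_guard<std::mutex> lock(regex_mutex);
    return regexec(preg, subject, nmatch, pmatch, eflags);
  }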
in row_update_for_mysql
932ec586 (MDEV-23644) added ha_delete_row() in TABLE::delete_row() for
the case of HA_ERR_FOREIGN_DUPLICATE_KEY. The problem is that the
ha_update_row() called beforehand may change m_last_part, which is
required by ha_delete_row() to delete from the correct partition.
The fix reverts m_last_part in case ha_partition::update_row() fails.
While the ALTER thread tries to notify the SELECT thread about a lock
conflict, it accesses its TABLE object (THD::notify_shared_lock()) and
lock data (mysql_lock_abort_for_thread()). As part of accessing lock
data it calls ha_partition::store_lock(), which iterates over all
partitions and does their store_lock().
The problem is that the SELECT opened 2 read partitions, but
ha_partition::store_lock() tries to access all partitions as indicated
by m_tot_parts, which is 4. So the last 2 partitions m_file[2] and
m_file[3] are uninitialized and store_lock() accesses uninitialized
data.
The code in ha_partition::store_lock() wrongly uses all partitions
specifically for the case of mysql_lock_abort_for_thread(); this is
accompanied by the comment:
  /*
    This can be called from get_lock_data() in mysql_lock_abort_for_thread(),
    even when thd != table->in_use. In that case don't use partition pruning,
    but use all partitions instead to avoid using another threads structures.
  */
  if (thd != table->in_use)
  {
    for (i= 0; i < m_tot_parts; i++)
      to= m_file[i]->store_lock(thd, to, lock_type);
  }
The explanation "to avoid using another threads structures" does
not really explain why this change was needed.
The change was originally introduced by:
commit 9b7cccaf319
Author: Mattias Jonsson <mattias.jonsson@oracle.com>
Date: Wed May 30 00:14:39 2012 +0200
WL#4443:
final code change for dlenevs review.
- Don't use pruning in lock_count().
- Don't use pruning in store_lock() if not owning thd.
- Renamed is_fields_used_in_trigger to
is_fields_updated_in_trigger() and check if they
may be updated.
- moved out mark_fields_used(TRG_EVENT_UPDATE)
from mark_columns_needed_for_update().
And reverted the changed call order. And call
mark_fields_used(TRG_EVENT_UPDATE) instead.
which also fails to explain the rationale of the change. The original
idea of WL#4443 is to reduce locks, and this change does not happen to
serve that goal.
So reverting this change restores the original behaviour of using only
the partitions marked for use, and fixes the invalid access to
uninitialized data.
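For illustration, the restored pruned iteration looks roughly like
this (simplified; bitmap_get_first_set()/bitmap_get_next_set() are the
server's MY_BITMAP iteration helpers):

  /* Iterate only over the partitions marked for use instead of
     walking all m_tot_parts partitions. */
  for (i= bitmap_get_first_set(&m_part_info->lock_partitions);
       i < m_tot_parts;
       i= bitmap_get_next_set(&m_part_info->lock_partitions, i))
    to= m_file[i]->store_lock(thd, to, lock_type);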
With --view-protocol, mtr transforms a SELECT query to two queries:
1. CREATE OR REPLACE VIEW mysqltest_tmp_v AS ...
2. SELECT * FROM mysqltest_tmp_v
where ... is the original query. Further, mtr may run the first query
in a separate connection.
On the other hand, if the data node is the same as the spider node,
spider_same_server_link is required for connecting to the data node.
Therefore, for mtr --view-protocol tests, spider_same_server_link
often needs to be set on both the session and global levels. In this
patch we
add the missing "SET GLOBAL spider_same_server_link=1" queries to
tests that fail with wrong results due to this issue.
It does not fix --view-protocol for all the affected tests, because
there are other issues fixed in subsequent patches.
During a regular iteration the page cleaner flushes from the flush
list with some flush target and then goes on to generate free pages
from the LRU tail. When asynchronous flush is triggered, i.e. when
7/8th of the LSN margin in the redo log is filled, the flush target
for the flush list is set to innodb_io_capacity_max. If it could flush
all of it, the flush bandwidth for the LRU flush is currently set to
zero. If the LRU tail has dirty pages, the page cleaner ends up
freeing no pages in one iteration. The scenario could repeat across
multiple iterations until the async flush target is reached. During
this time the DB system is starved of free pages, resulting in an
apparent stall and in some cases a dict_sys latch fatal error.
Fix: In the page cleaner iteration, before the LRU flush, ensure we
provide enough flush limit so that freeing pages is not blocked by
dirty pages in the LRU tail. Log I/O and flush state if the
doublewrite flush wait is long.
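The idea can be sketched as follows (hypothetical names, heavily
simplified; the real page cleaner logic is more involved):

  /* Even when the flush list batch consumed the whole I/O budget,
     grant the LRU stage a minimum quota so that free pages can still
     be generated from the LRU tail. */
  ulint lru_budget= io_capacity_max > n_flushed_from_flush_list
                    ? io_capacity_max - n_flushed_from_flush_list
                    : 0;
  if (lru_budget < min_lru_batch)  /* hypothetical minimum quota */
    lru_budget= min_lru_batch;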
Reviewed by: Marko Mäkelä
innodb_ft_aux_table_validate(): If the table is found in InnoDB
but not valid for the parameter, only invoke dict_sys.unlock() once.
This fixes a regression due to MDEV-36122.
log_checkpoint(): In cmake -DWITH_VALGRIND=ON builds, let us
wait for all outstanding writes to complete, in order to
avoid an unexpectedly large number of innodb_log_writes
in the test innodb.page_cleaner.
PCRE2 10.45 sets cmake_minimum_required to version 3.15.
With that, on MSVC, compile flags for choosing C runtime (/MT, /MD, etc.)
are ignored. Instead, CMAKE_MSVC_RUNTIME_LIBRARY must be passed when
building an external project for consistent linkage, if it creates a
static library.
make it strict, don't just ignore all the pesky numbers it shows:
replace the current mariadb version with X.Y.Z,
replace the first two components of the current version with X.Y,
replace the first component (or first-1, if the second is 0) with X
fix perl to write mysql_upgrade_info files in binary mode,
otherwise mariadb-upgrade reads them incorrectly on Windows.
fix failing main.mysql-interactive
* increase socat EOF timeout from 0.5s to 10s
* add an explicit exit to not wait 10s (or even 0.5s - the test now
finishes in 0.15s)
* enable it for libedit
prepare_inplace_alter_table_dict(): If an unexpected reference to the
table exists, wait for the purge subsystem to release its table handle,
similar to how we would do in case FULLTEXT INDEX existed. This function
is supposed to be protected by MDL_EXCLUSIVE on the table name.
If purge is going to access the table again later during this ALTER
TABLE operation, it will have access to an MDL-compatible name for it
and therefore should conflict with any MDL_EXCLUSIVE that would cover
ha_innobase::commit_inplace_alter_table(commit=true).
This change should prevent race conditions where purge had initially
looked up a table for which row-level undo log records had been
written by ALTER IGNORE TABLE, and purge did not finish before a
subsequent ALTER TABLE is trying to rebuild the table.
trx_purge_attach_undo_recs(), purge_sys_t::batch_cleanup():
Clear purge_sys.m_active only after all table handles have been released.
ha_innobase::truncate(): Before locking the data dictionary,
ensure that the purge subsystem is not holding a reference to the
table due to insufficient metadata locking related to an earlier
ALTER IGNORE TABLE operation.
Reviewed by: Debarun Banerjee
dict_table_open_on_id(): Simplify the logic.
dict_stats: A helper for acquiring MDL and opening the tables
mysql.innodb_table_stats and mysql.innodb_index_stats.
innodb_ft_aux_table_validate(): Continuously hold dict_sys.latch
while accessing the table that we open with dict_table_open_on_name().
lock_table_children(): Do not hold a table reference while invoking
dict_acquire_mdl_shared<false>(), which may temporarily release and
reacquire the shared dict_sys.latch that we are holding.
With these changes, no caller of dict_acquire_mdl_shared<false>
should be holding a table reference.
All remaining calls to dict_table_open_on_name(dict_locked=false)
except the one in fts_lock_table() and possibly in the DDL recovery
predicate innodb_check_version() should be protected by MDL, but there
currently is no assertion that would enforce this.
Reviewed by: Debarun Banerjee
trx_purge_close_tables(): Before releasing any metadata locks (MDL),
release all table references, in case an ALTER TABLE…ALGORITHM=COPY
operation has confused our logic.
trx_purge_table_acquire(), trx_purge_table_open(): Do not acquire any
table reference before successfully acquiring a necessary metadata lock.
In this way, if purge is waiting for MDL, a concurrent
ha_innobase::commit_inplace_alter_table(commit=true) that is holding
a conflicting MDL_EXCLUSIVE will only observe its own reference on
the table that it may need to replace.
dict_acquire_mdl_shared<false>(): Unless we hold an MDL or one is
not needed, do not hold a table reference when releasing dict_sys.latch.
After loading a table into the dictionary cache, we will look up the
table again, because the table could be evicted or dropped while we
were not holding any dict_sys.latch.
Reviewed by: Debarun Banerjee
Reason:
=======
- InnoDB DML commit aborts the server when InnoDB handles an online
virtual index. During online DDL, a concurrent DML commit
operation reads the undo log record and the related
current version of the clustered index record. Based on the
operation, InnoDB builds the old tuple and the new tuple for
the table. If the concurrent online index can be affected
by the operation, InnoDB builds the entry
for the index and logs the operation.
The problematic case is the update operation: InnoDB builds the
update vector, but while building the old row, InnoDB fails
to fill the non-affected virtual columns. This leads to a
server abort while building the entry for the index.
Fix:
===
- First, fill the virtual column entries for the new row.
Duplicate the old row based on the new row and change only the
affected fields in the old row based on the update vector.
- InnoDB fails to recover a full_crc32 page_compressed page
from the doublewrite buffer. The reason is that buf_dblwr_t::recover()
fails to identify the space id from the page, because the page
has been compressed from the FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION
bytes onwards.
Fix:
===
recv_dblwr_t::find_deferred_page(): Find the page which
has the same page number and try to decompress/decrypt the page
based on the tablespace metadata. After the decompression/decryption,
compare the space id and write the recovered page back to the file.
buf_page_t::read_complete(): If the page read from disk is corrupted,
then try to read the page from the deferred pages in the doublewrite
buffer.
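The flow of recv_dblwr_t::find_deferred_page() can be sketched
roughly as follows (hypothetical helpers, heavily simplified):

  /* For every doublewrite copy with a matching page number, undo the
     page_compressed/encryption transformation using the tablespace
     metadata, then verify the space id before restoring the page. */
  for (const byte *copy : dblwr_copies_with_page_no(page_no))
  {
    byte *restored= decompress_or_decrypt(copy, space); /* hypothetical */
    if (restored
        && mach_read_from_4(restored + FIL_PAGE_SPACE_ID) == space->id)
    {
      write_page_back(space, page_no, restored);        /* hypothetical */
      break;
    }
  }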
In resolving MDEV-33301 (76a27155b4) we
moved all the capabilities from CapabilityBoundingSet to
AmbientCapabilities, where only adding/moving CAP_IPC_LOCK was
intended.
The effect of this is that by default the running MariaDB HAS the
capabilities CAP_DAC_OVERRIDE and CAP_AUDIT_WRITE, allowing it to
access any file, even while running as a non-root user.
Resolve this by making CAP_IPC_LOCK apply to AmbientCapabilities and
leaving the remaining CAP_DAC_OVERRIDE and CAP_AUDIT_WRITE in
CapabilityBoundingSet for use by auth_pam_tool.
- During the prepare of an incremental backup, mariabackup creates a
new file in the target directory with a default file size of
4 * innodb_page_size. While applying a .delta file to the
corresponding data file, it encounters the FSP_SIZE modification on
page 0 and tries to extend the file to that size, which is 4 pages in
our case. Since the table is in compressed row format, the page_size
for the table is 8k. This leads to shrinking of the tablespace file
from 65536 to 32768. This issue happens only on Windows, because
os_file_set_size() doesn't check the current size
and shrinks the file.
Solution:
========
xtrabackup_apply_delta(): Check the current size before
setting the size of the file.
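A sketch of the guard (simplified; n_pages and page_size stand for
the values derived from the .delta file and the tablespace):

  /* Only grow the file; on Windows, os_file_set_size() would
     otherwise truncate the file to the smaller size. */
  os_offset_t cur_size= os_file_get_size(dst_file);
  os_offset_t new_size= (os_offset_t) n_pages * page_size;
  if (new_size > cur_size)
    os_file_set_size(dst_path, dst_file, new_size);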
In commit bda40ccb85 (MDEV-34803)
there was a spelling mistake that somehow causes the deprecated
parameter innodb_purge_rseg_truncate_frequency to be rejected
at server startup.
Fixed a compilation error caused by MDEV-27861 that
occurs when building with cmake -DPLUGIN_PARTITION=NO,
because the default_part_plugin variable is only
visible when WITH_PARTITION_STORAGE_ENGINE is defined.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
- This issue was caused by commit a032f14b342c782b82dfcd9235805bee446e6fe8 (MDEV-33559).
In MDEV-33559, matched_rec::block was changed to a pointer
and assigned with the help of buf_block_alloc(). But the patch
fails to check whether the block can be nullptr in
rtr_check_discard_page().
rtr_cur_search_with_match(): Acquire rtr_match_mutex before
creating shadow block for the matched records
rtr_pcur_move_to_next(): Copy the shadow block to page cursor
block under rtr_match_mutex
An attempt to create a procedure with the DEFINER clause resulted in
abnormal server termination in case the server was run with the option
--skip-grant-tables=1.
The reason for the abnormal termination is that, on handling of the
DEFINER clause, uninitialized data members of acl_cache are accessed,
which led to the server crash.
The behaviour of the server for the considered use case must be the
same as for the embedded server. That means, if the security subsystem
wasn't initialized (the server was started with the option
--skip-grant-tables=1), return success from get_current_user() without
further access to the acl_cache, which obviously is not initialized.
Additionally, AUTHID::is_role was modified to handle the case when
the host part of the user name isn't provided. Treat this case as if
an empty host name was provided.