The memory leak occurs on an error path: when backup_reset_alter_copy_lock
fails with a timeout, the ALTER is rolled back, but flush_unused is not
called.
Move the table flushing in error handling to a single place, and account
for more possible failures this time.
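A minimal toy model of the idea (stubbed names mirror the commit text, not
the server code): route every failure through one label so the flush
cannot be forgotten on any path.

  // Toy model: a single error label guarantees the flush runs on every
  // failure path, including the backup_reset_alter_copy_lock timeout.
  static bool copy_data() { return false; }
  static bool backup_reset_alter_copy_lock() { return true; } // times out
  static void flush_unused() { /* flush the altered table here */ }

  int alter_copy_end()
  {
    if (copy_data())
      goto err;
    if (backup_reset_alter_copy_lock())
      goto err;
    return 0;
  err:
    flush_unused();   // one place for table flushing on error
    return 1;
  }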
A second DML statement in a transaction, on a table of a non-rollbackable
engine, leads to cache corruption, because the cache was neither reset
after the statement end nor destroyed.
This patch resets the cache for reuse by subsequent statements in the
current transaction.
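A self-contained toy model of the fix (names hypothetical): at statement
end the per-statement cache is reset for reuse, not destroyed (later
statements in the transaction still need it) and not left dirty
(corruption).

  #include <vector>

  // Toy model: reset, don't destroy, at statement end.
  struct Stmt_cache
  {
    std::vector<char> events;
    void reset() { events.clear(); }  // ready for reuse, allocation kept
  };

  void on_statement_end(Stmt_cache &cache)
  {
    cache.reset();  // the next DML in this transaction starts clean
  }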
In case of a non-recovery XA rollback/commit in the same connection,
hton->rollback is called instead of rollback_by_xid, although previously
thd_ha_data was moved to thd->transaction->xid_state.xid in hton->prepare.
As if that weren't enough, XA PREPARE can be skipped by the user, so we
can end up in hton->commit/rollback with an unprepared XA transaction.
Checking xid_state.is_explicit_XA() is therefore not enough -- we should
check xid_state.get_state_code() == XA_PREPARED, which also guarantees
is_explicit_XA() == true.
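A self-contained model of the strengthened check (types simplified; the
real ones live in the server source): only XA_PREPARED guarantees that
hton->prepare ran and moved the data out, and it implies is_explicit_XA().

  #include <cassert>

  // Simplified model of XID_STATE and the check described above.
  enum xa_states { XA_ACTIVE, XA_IDLE, XA_PREPARED };

  struct XID_STATE
  {
    xa_states state;
    xa_states get_state_code() const { return state; }
  };

  // hton->commit/rollback should take the prepared-XA path only when the
  // transaction really went through hton->prepare.
  bool use_prepared_xa_path(const XID_STATE &xs)
  {
    return xs.get_state_code() == XA_PREPARED; // implies is_explicit_XA()
  }

  int main()
  {
    assert(use_prepared_xa_path(XID_STATE{XA_PREPARED}));
    assert(!use_prepared_xa_path(XID_STATE{XA_ACTIVE})); // PREPARE skipped
  }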
XA support for online alter was totally missing.
Tying it to binlog_hton made this hard to notice: with binlog_commit
called from xa_commit, it looked like things would automagically work for
online alter as well. That turns out to be wrong: all binlog does is
write "XA END" into the trx cache and flush it to the real binlog.
Online alter can't do the same, since online replication happens in a
single transaction.
Solution: make dedicated XA support (sketched after the list):
* Extend struct xid_t with a pointer to Online_alter_cache_list.
* On prepare: move the online alter cache from THD::ha_data to the XID
passed.
* On XA commit/rollback: use the online alter cache stored in this XID.
This makes us pass xid_cache_element->xid to xa_commit/xa_rollback
instead of lex->xid.
* Use manual memory management for the online alter cache list instead of
mem_root allocation, since we don't have a mem_root connected to the XA
transaction.
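A self-contained sketch of those bullets (member and helper names are
hypothetical; the real struct xid_t lives in the server source):

  // Sketch of the XID extension (names hypothetical).
  struct Online_alter_cache_list;            // opaque; manually managed

  struct xid_t
  {
    long formatID;
    long gtrid_length, bqual_length;
    char data[128];
    Online_alter_cache_list *online_alter_cache= nullptr; // new member
  };

  // On hton->prepare: the cache leaves THD::ha_data and travels with the
  // XID, so a later commit/rollback in the same connection can find it.
  void online_alter_prepare(xid_t *xid, Online_alter_cache_list *&ha_data)
  {
    xid->online_alter_cache= ha_data;
    ha_data= nullptr;
  }

  // On XA COMMIT/ROLLBACK: take the cache from the XID -- which is why
  // xid_cache_element->xid must be passed instead of lex->xid.
  Online_alter_cache_list *online_alter_detach(xid_t *xid)
  {
    Online_alter_cache_list *cache= xid->online_alter_cache;
    xid->online_alter_cache= nullptr;
    return cache;                            // caller frees it manually
  }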
Assertion `!writer.checksum_len || writer.remains == 0' fails upon
concurrent online ALTER and transactions with failing statements, with the
binary log enabled.
Also, another assertion, `pos != (~(my_off_t) 0)', fails in my_seek upon
reinit_io_cache, on a simplified test. This means that the IO_CACHE wasn't
properly initialized, or had hit an error before.
The overall problem is deep interference with the effects of an installed
binlog_hton: the assumption that a NULL thd->binlog_get_cache_mngr() is a
sufficient signal that we shouldn't run the binlog part of
binlog_commit/binlog_rollback is wrong. As it turns out, the binlog
handlerton can be not installed in the current thd while binlog_commit is
still called on behalf of binlog, as in the bug reported.
One separate condition found is XA recovery of an orphaned transaction,
when binlog_commit is also called, but it has nothing to do with online
alter.
Solution:
Extract the online alter operations into a separate handlerton.
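A sketch of the extraction (server plugin-init convention; the flag choice
and bodies are assumptions): with its own handlerton, online alter's
commit/rollback no longer depend on binlog_hton being installed in the
current thd or on thd->binlog_get_cache_mngr().

  // Sketch (server API, hedged): a dedicated online-alter handlerton.
  static int online_alter_commit(handlerton *, THD *, bool)
  { return 0; /* apply the online-alter caches here */ }
  static int online_alter_rollback(handlerton *, THD *, bool)
  { return 0; /* discard the online-alter caches here */ }

  static int online_alter_init(void *p)
  {
    handlerton *hton= (handlerton *) p;
    hton->commit=   online_alter_commit;    // called iff online alter is
    hton->rollback= online_alter_rollback;  // registered in this thd
    hton->flags=    HTON_NOT_USER_SELECTABLE | HTON_HIDDEN; // assumption
    return 0;
  }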
Error 1032 (Can't find record) could be emitted when ALTER TABLE was
executed against concurrent DELETE/UPDATE/other DML that requires a search
on the online ALTER's side.
InnoDB's INPLACE, in comparison, creates a new trx_t and uses it in the
scope of the ALTER TABLE context.
The ALTER TABLE class of statements (e.g. CREATE INDEX, OPTIMIZE, etc.)
is expected to be unaffected by the value of the current session's
transaction isolation.
This patch save-and-restores thd->tx_isolation, setting it to
ISO_REPEATABLE_READ for almost the whole duration of mysql_alter_table, to
avoid any possible side effect of it. This should primarily be done before
the lock_tables call, to initialize the storage engine's local value
correctly during the store_lock() call.
sql_table.cc: set thd->tx_isolation to ISO_REPEATABLE_READ in
mysql_alter_table and restore it to the original value at the end of the
call.
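A sketch of the save-and-restore (the patch does it inline in
mysql_alter_table; an RAII guard, with server types assumed, expresses the
same idea):

  // Sketch (server types assumed): pin ISO_REPEATABLE_READ for the span
  // of mysql_alter_table, restoring the session value on exit.
  struct Tx_isolation_guard
  {
    THD *thd;
    enum_tx_isolation saved;
    Tx_isolation_guard(THD *t) : thd(t), saved(t->tx_isolation)
    {
      thd->tx_isolation= ISO_REPEATABLE_READ; // before lock_tables, so
    }                                         // store_lock() sees it
    ~Tx_isolation_guard() { thd->tx_isolation= saved; }
  };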
Adding an AUTO_INCREMENT column online leads to undefined behavior.
Basically, any DEFAULT that depends on the row order in the table, or on a
function that is non-deterministic (in the scope of the ALTER TABLE
statement), is UB.
For example, NOW() is generally considered non-deterministic
(Item_func_now_utc is marked with VCOL_NON_DETERMINISTIC), but it is fixed
in the scope of a single statement. The same holds for any other function
that depends only on session/status variables apart from its arguments.
Only two UB cases are known:
* adding a new AUTO_INCREMENT column. Modifying an existing column may be
fine under certain circumstances, see MDEV-31058.
* adding a new column with DEFAULT(nextval(...)). Modifying an existing
column is possible, since its value will always be present in the online
event, except for the NULL -> NOT NULL modification.
Add a new virtual function that increases the inserted rows count for the
insert log event and decreases it for the delete event. It reuses
Rows_log_event::m_row_count on the replication side, which was previously
only set on the logging side.
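A self-contained model of the hook (the virtual function's name is
hypothetical; m_row_count is the real member):

  // Toy model: the insert event increases the applied-rows counter, the
  // delete event decreases it; other events leave it unchanged.
  struct Rows_log_event
  {
    long m_row_count= 0;    // now also used on the replication side
    virtual long row_count_delta() const { return 0; }
    virtual ~Rows_log_event() {}
  };
  struct Write_rows_log_event : Rows_log_event
  { long row_count_delta() const override { return +1; } };
  struct Delete_rows_log_event : Rows_log_event
  { long row_count_delta() const override { return -1; } };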
1. Make online disk writes unlimited, same as filesort does.
2. Add proper error handling -- in a 32-bit build the IO_CACHE capacity
limit is 4GB, so it is quite possible to overfill it there.
3. Event_log::write_cache was complicated by event reparsing and, as QA
proved, contained some mistakes. The rewrite introduces a simpler and much
faster version that doesn't reparse events and therefore copies a whole
buffer at once. This also disables checksums and crypto.
4. Handle read_log_event errors correctly: the error returned is -1 (the
EOF signal for alter table), and my_error is not called. Call my_error and
always return 1. There's no test for this, since it shouldn't happen, see
the next bullet.
5. An event could be written partially in case of error, if it's bigger
than the IO_CACHE buffer. Restore the position to where it was before the
error was emitted (see the sketch below).
As a result, online alter is untied from several binlog variables, which
was a second aim of this patch.
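A toy model of bullet 5 (names hypothetical, not the mysys API): remember
the position before the event and truncate back to it when the write fails
mid-event, so no partial event remains in the cache.

  #include <cstddef>
  #include <vector>

  struct Cache { std::vector<char> buf; };

  // Returns true on error; may stop partway through an event, like an
  // overfilled IO_CACHE would.
  bool raw_write(Cache &c, const char *p, size_t len, size_t fail_after)
  {
    for (size_t i= 0; i < len; i++)
    {
      if (i == fail_after)
        return true;
      c.buf.push_back(p[i]);
    }
    return false;
  }

  bool write_event(Cache &c, const char *p, size_t len, size_t fail_after)
  {
    size_t saved_pos= c.buf.size();   // position before the event
    if (raw_write(c, p, len, fail_after))
    {
      c.buf.resize(saved_pos);        // restore: drop the partial event
      return true;
    }
    return false;
  }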
Group all the checks in online_alter_check_supported().
There are now two groups of checks:
1. The technical availability of online, which is checked before
open_tables and affects table_list->lock_type. It's supposed to be safe to
make it TL_READ even if the COPY algorithm falls back to not-online, since
the MDL is SHARED_UPGRADEABLE anyway.
2. The 'online' availability for the COPY algorithm. It can be checked as
late as just before the copy_data_between_tables call. The lock_type
influence is described above, so the only other place it affects is
Alter_info::supports_lock, where the `online` flag is only used to decide
whether to report the error at the inplace preparation stage. We want to
do that as a last resort, which is COPY preparation, if no algorithm was
chosen by the user. So it's even better now.
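A minimal model of the two phases (function names hypothetical):

  // Toy model of the two groups of checks.
  enum thr_lock_type { TL_READ, TL_WRITE_ALLOW_READ };

  // Group 1: technical availability, decided before open_tables. TL_READ
  // is safe even if COPY later falls back to not-online, since the MDL is
  // SHARED_UPGRADEABLE either way.
  thr_lock_type pick_lock_type(bool technically_online)
  { return technically_online ? TL_READ : TL_WRITE_ALLOW_READ; }

  // Group 2: the final decision for COPY, made just before
  // copy_data_between_tables.
  bool copy_runs_online(bool technically_online, bool copy_online_ok)
  { return technically_online && copy_online_ok; }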
Some changes are required to the autoinc support detection, as the check
now happens after mysql_prepare_alter_table:
* alter_info->drop_list is empty
* instead, dropped columns are in tmp_set
* alter_info->create_list now has every field that's in the new table
* the column definition's change.str will be non-null if the column
remains in the new table (vs. if it was changed, as before). But it also
has the `field` member set.
* IF EXISTS doesn't have to be dealt with anymore
This means that the changes are now checked in more detail: a field's
definition shouldn't be changed, vs. a field shouldn't be mentioned in the
CHANGE list, as it was before. This is reflected by the line 193 test.
When committing a big transaction, online_alter_cache_log creates a cache
file. It wasn't properly closed, which was spotted by a memory leak from
my_register_filename. A temporary file also remained open.
Binlog wasn't affected by this, since it features its own file management.
The proper closing is a close_cached_file call: it deinits the io_cache
and closes the underlying file. After closing, the file is expected to be
deleted automagically.
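A sketch using the mysys cached-file API (flags and sizes illustrative):
pairing open_cached_file with close_cached_file both deinits the IO_CACHE
and closes, and thereby deletes, the temporary file.

  // Sketch (mysys API; flags/sizes illustrative).
  IO_CACHE cache;
  if (!open_cached_file(&cache, mysql_tmpdir, "online-alter-",
                        IO_SIZE, MYF(MY_WME)))
  {
    /* ... fill and flush the cache ... */
    close_cached_file(&cache);  // deinit io_cache + close the temp file
  }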
Non-transactional engines' changes are visible immediately after the row
operation succeeds, so we should apply the changes to the online buffer
immediately after the statement ends.
The row events were applied twice: once for ha_partition, and one more
time for the underlying storage engine.
There's no such problem in binlog/rpl, because ha_partition::row_logging
is normally set to false.
The fix makes the events replicate only when the handler is a root
handler. We try to *guess* this by comparing it to table->file; the same
approach is used in the MDEV-21540 fix, 231feabd. The assumption is that
the row methods are only called for table->file (and never for a cloned
handler), hence assertions are added in ha_innobase and ha_myisam to make
sure this is true, at least for those engines.
This also closes MDEV-31040; however, the test is not included, since we
have no convenient way to construct a deterministic version.
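A toy model of the guess (types simplified): only the root handler, the
one table->file points to, emits the online-alter row events; handlers
cloned by ha_partition stay silent.

  // Toy model (simplified types) of the root-handler check.
  struct handler;
  struct TABLE { handler *file; };

  bool is_root_handler(const handler *h, const TABLE *t)
  {
    return h == t->file;  // the assumption the new assertions verify
  }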
ASAN showed a use-after-free in binlog_online_alter_end_trans, while
running through thd->online_alter_cache_list.
In online_alter_binlog_get_cache_data, new_cache_data was allocated on
thd->mem_root in the autocommit=1 case, but this mem_root could be freed
in sp_head::execute, upon using stored functions.
It appears that thd->transaction->mem_root exists even in single-statement
transaction mode (i.e. autocommit=1), so it can be used in all cases. This
mem_root remains valid till the end of the transaction, including the
commit phase.
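A sketch of the allocation change (server API, hedged; the allocated type
name is illustrative):

  // Sketch: allocate on the transaction-scoped arena, which survives
  // sp_head::execute's mem_root manipulations and lives through commit.
  void *mem= alloc_root(&thd->transaction->mem_root,
                        sizeof(Online_alter_cache_data));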
This is a corner case of ONLINE ALTER vs ha_maria vs application-time
periods.
When a Delete_rows_log_event (or update) is executed, a lookup handler may
be created, normally to serve long unique index needs, by a call to
handler::prepare_for_insert. This function also creates a lookup handler
if an application-time period exists in the table.
The difference from a usual prepare_for_insert call is that transactions
are disabled for this table during ALTER TABLE. See the
mysql_trans_prepare_alter_copy_data call in copy_data_between_tables.
Then ha_maria calls _ma_tmp_disable_logging_for_table during
ha_maria::external_lock. It had never happened before that two handlers
would be created for writing to a single ha_maria table with transactions
disabled.
Hence, the fix handles this scenario.
It could be done differently, by not creating this lookup handler (since
it's not used during ONLINE ALTER anyway), but architecturally two
handlers should be supported.
Avoiding the creation of the lookup handler could additionally be done
here, but at the cost of slowing down other, more generic cases with an
extra check for an active online alter.
ONLINE ALTER TABLE uses binlog events like replication does. They were
never used outside of replication before, so a significant change was
required. For example, a single event had statement-like behavior: it
locked the tables, opened them, and closed them in the end. But for
ONLINE ALTER we use a preopened table.
The crash scenario is the following: lex->query_tables was set to NULL in
restore_empty_query_table_list when the alter event was applied. Then
lex->query_tables->prev_global was write-accessed in
LEX::first_lists_tables_same, leading to a segfault.
In replication, restore_empty_query_table_list would mean resetting lex
before the next query or event.
In ONLINE ALTER TABLE we reuse a locked table between the events, so we
should avoid that. Whether the lex state needs to be reset (or the tables
closed) can be determined by a nonzero rgi->tables_to_lock_count: if no
table is locked, the event doesn't own the tables.
The same was already done before for the rgi->slave_close_thread_tables
call.
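A sketch of the resulting guard (call shape illustrative; field names as
in the commit):

  // Sketch: reset the lex state / close the tables only when this event
  // actually locked tables, i.e. owns them; this mirrors the existing
  // condition around rgi->slave_close_thread_tables.
  if (rgi->tables_to_lock_count)
    restore_empty_query_table_list(thd->lex);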
Previously, fields with DEFAULTs were allowed only when the expression
was deterministic. In the online alter case, we should recursively check
that the underlying fields of the expression also either have explicit
values, or have DEFAULTs following this validity rule.
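A self-contained model of the recursive rule (structure hypothetical):

  #include <vector>

  // Toy model: a DEFAULT is valid online iff its expression is
  // deterministic and every field it depends on either has an explicit
  // value or satisfies this same rule recursively.
  struct Field
  {
    bool has_explicit_value;
    bool default_is_deterministic;
    std::vector<const Field*> default_depends_on;
  };

  bool default_valid_online(const Field *f)
  {
    if (!f->default_is_deterministic)
      return false;
    for (const Field *dep : f->default_depends_on)
      if (!dep->has_explicit_value && !default_valid_online(dep))
        return false;
    return true;
  }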
We can't rely on keys formed with columns that were added during this
ALTER. These columns can be set to non-deterministic values, which can end
up in a broken or incorrect search.
The same applies to keys that contain reliable columns but also bogus
ones: using them could narrow the search, but they're ignored as well.
Also, added columns shouldn't be considered during the record match. To
determine them, the table->has_value_set bitmap is used. To fill the
has_value_set bitmap in the find_key call, an extra unpack_row call has
been added.
In the replication case, extra replica-side columns can also fall under
this case; we try to ignore them, too.
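A toy model of the key filter (the bitmap modeled as a bool vector):

  #include <vector>

  // Toy model: a key qualifies for the lookup only if every one of its
  // columns carries a reliable value; one bogus (freshly added) column
  // disqualifies the whole key.
  bool key_is_usable(const std::vector<int> &key_columns,
                     const std::vector<bool> &has_value_set)
  {
    for (int col : key_columns)
      if (!has_value_set[col])
        return false;
    return true;
  }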
ONLINE ALTER TABLE adds the binlog handlerton into ha_list, so any
rollback command can end up calling binlog_rollback with no cache_mngr, if
binlog is not enabled.
The assertion should be fixed in the same manner as DBUG_ASSERT(WSREP(thd)).
1. ER_KEY_NOT_FOUND
This is a general replication problem, already fixed earlier; a test is
added.
2. ER_LOCK_WAIT_TIMEOUT
This is a problem specific to long unique indexes.
Sometimes, a lookup_handler is created for to->file. To properly free it,
ha_reset should be called. This is usually done by calling
close_thread_table, but ALTER TABLE does it differently. Hence, a single
ha_reset call is added to mysql_alter_table (sketched below).
Also, event_mem_root is removed. Normally, no per-event data should be
allocated on thd->mem_root; that would mean a leak. The lookup_handler, in
turn, is lazily allocated, but its lifetime matches the statement, not the
event.
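A sketch of the added call (placement illustrative):

  // Sketch: one ha_reset() in mysql_alter_table frees the lazily created
  // lookup_handler on the target table, which close_thread_table would
  // normally have done.
  to->file->ha_reset();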
Because ONLINE means we'll apply events from the binlog, and IGNORE means
that bad rows will be skipped, a bad Write_rows_log_event would be skipped
and a following Update_rows_log_event would fail to apply.
Division by zero is a good example. sql_mode is basically ignored by
replication, see Bug#56662.
The behavior for ONLINE should remain the same as for non-ONLINE ALTER.
We shouldn't rely on `fill_extra_persistent_columns`, as it only updates
fields with an index > cols->n_bits (the replication bitmap width).
Actually, it should never be used, as its approach is error-prone.
A normal update_virtual_fields + update_default_fields should be done
instead.
ALTER ONLINE TABLE acquires the table with TL_READ. MyISAM normally
acquires TL_WRITE for DML, which makes the DML hang until the table is
freed. We then deadlock once the ALTER upgrades its MDL lock.
Solution:
Unlock the table earlier. We don't need to hold TL_READ once we have
finished copying: relay log replication requires no data locks on the
`from` table.
In the catch-up phase of the online alter we apply row events; they're
unpacked into `from->record[0]` and then converted to `to->record[0]`.
This needs all fields of `from` to be in the `write_set`.
Although practically `Field::unpack()` does not assert the `write_set`,
and `Field::reset()` -- used when a field value is not present in the
after-image -- also doesn't assert the `write_set` for many types,
`Field_new_decimal::reset()` does.
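A sketch of the needed setup (server bitmap API):

  // Sketch: include every `from` field in the write_set before unpacking,
  // so Field_new_decimal::reset() on absent after-image columns doesn't
  // trip the assertion.
  bitmap_set_all(from->write_set);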