... upon replicating online ALTER
When an online event is applied and slave_exec_mode is idempotent,
Write_rows_log_event::do_before_row_operations had reset
thd->lex->sql_command to SQLCOM_REPLACE.
This caused the statement to be detected as row-type during binlogging,
and to be logged as not standalone.
So the corresponding Gtid_log_event, when applied on the replica, did not
exit early and created a new PSI transaction. Hence the difference with
non-online ALTER.
Adding an auto_increment column online leads to undefined behavior.
Basically, any DEFAULT that depends on the row order in the table, or on
a function that is non-deterministic (in the scope of the ALTER TABLE
statement), is UB.
For example, NOW() is considered generally non-deterministic
(Item_func_now_utc is marked with VCOL_NON_DETERMINISTIC), but it is fixed
in the scope of a single statement.
The same holds for any other function that, apart from its arguments,
depends only on session/status variables.
Only two UB cases are known:
* adding a new AUTO_INCREMENT column. Modifying an existing column may be
fine under certain circumstances, see MDEV-31058.
* adding a new column with DEFAULT(nextval(...)). Modifying an existing
column is possible, since its value will always be present in the online
event, except for the NULL -> NOT NULL modification.
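A hypothetical illustration of both cases (table, column, and sequence
names are invented; whether such a statement is refused or falls back to
locking is decided by the checks described further below):

  CREATE TABLE t1 (a INT) ENGINE=InnoDB;
  CREATE SEQUENCE s1;

  -- Case 1: the values the new column receives depend on the row order
  -- seen during the copy, which a replay of concurrent DML cannot
  -- reproduce.
  ALTER TABLE t1 ADD id INT AUTO_INCREMENT PRIMARY KEY, LOCK=NONE;

  -- Case 2: every evaluation of NEXTVAL advances the sequence, so the
  -- DEFAULT is non-deterministic even within the statement.
  ALTER TABLE t1 ADD c BIGINT DEFAULT NEXTVAL(s1), LOCK=NONE;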
Add a new virtual function that increases the inserted rows count for the
insert log event and decreases it for the delete event.
It reuses Rows_log_event::m_row_count on the replication side, which was
previously only set on the logging side.
The deadlock was caused by a too-strong MDL acquired by the start ALTER.
Replica-side ALTER TABLE replication consists of two phases:
1. Start ALTER (SA) -- the event is emitted at the very beginning,
allowing the replica to start the ALTER in parallel
2. Commit ALTER (CA) -- ensures that the master finished successfully
CA is normally received by a wait_for_master call.
If a parallel DML was run, the following sequence takes place:
|- SA
|- DML
|- CA
If CA is handled after the MDL upgrade, it will deadlock with the DML.
So, while the MDL is still shared by the start ALTER, wait for its second
part, to allow concurrent DMLs to grab the lock.
The fix uses wait_for_master reentrancy -- there is no need to avoid a
second call at the end of mysql_alter_table.
Since SA and CA are marked with FL_DDL, a DML issued in between cannot be
rescheduled before or after them. However, SA "commits" (by the call of
write_bin_log_start_alter and, subsequently,
thd->wakeup_subsequent_commits) before the copy stage begins, unblocking
the DMLs to run on this table. That is, these DMLs will be executed
concurrently with the copy stage, making online ALTER effective on
replicas as well.
Co-authored-by: Nikita Malyavin (nikitamalyavin@gmail.com)
1. Make online disk writes unlimited, same as filesort does.
2. Add proper error handling -- in a 32-bit build the IO_CACHE capacity
limit is 4GB, so it is quite possible to overfill it there.
3. Event_log::write_cache was complicated by event reparsing and, as QA
proved, contained some mistakes. The rewrite introduces a simpler and
much faster version that does no reparsing and therefore copies the whole
buffer at once. This also disables checksums and crypto.
4. Handle read_log_event errors correctly: the error returned was -1 (the
EOF signal for ALTER TABLE), and my_error was not called. Call my_error
and always return 1. There's no test for this, since it shouldn't happen,
see the next bullet.
5. An event could be written partially in case of an error, if it's
bigger than the IO_CACHE buffer. Restore the position to where it was
before the error was emitted.
As a result, online ALTER is untied from several binlog variables, which
was a second aim of this patch.
Group all the checks in online_alter_check_supported().
There are now two groups of checks:
1. The technical availability of online, which is checked before
open_tables and affects table_list->lock_type. It's supposed to be safe
to make it TL_READ even if the COPY algorithm falls back to not-online,
since the MDL is SHARED_UPGRADEABLE anyway.
2. The 'online' availability for the COPY algorithm. It can be checked as
late as just before the copy_data_between_tables call. The lock_type
influence is discussed above, so the only other place it affects is
Alter_info::supports_lock, where the `online` flag is only used to decide
whether to report the error at the inplace preparation stage. We want to
do that as a last resort, which is the COPY preparation, if no algorithm
was chosen by the user. So it's even better now.
Some changes are required to the autoinc support detection, as the check
now happens after mysql_prepare_alter_table:
* alter_info->drop_list is empty
* instead, dropped columns are in tmp_set
* alter_info->create_list now has every field that's in the new table
* the column definition's change.str will be non-null if the column
remains in the new table (rather than if it was changed, as before).
But it also has the `field` member set.
* IF EXISTS doesn't have to be dealt with anymore
This means that the changes are now checked in more detail: a field's
definition shouldn't be changed, versus a field shouldn't be mentioned
in the CHANGE list, as before. This is reflected by the line 193 test.
When a column is changed to autoinc, ALTER TABLE may update zero/NULL
values, if the NO_AUTO_VALUE_ON_ZERO mode is not enabled.
Forbid this under LOCK=NONE for the unreliable cases.
The cases are described in online_alter_check_autoinc.
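A minimal sketch of the unreliable case (names invented):

  SET sql_mode = '';          -- NO_AUTO_VALUE_ON_ZERO is not enabled
  CREATE TABLE t1 (a INT NULL) ENGINE=InnoDB;
  INSERT INTO t1 VALUES (0), (NULL), (5);

  -- The copy would silently replace the 0 and the NULL with generated
  -- sequence values, so concurrent changes to those rows could not be
  -- replayed reliably; this is refused under LOCK=NONE.
  ALTER TABLE t1 MODIFY a INT AUTO_INCREMENT PRIMARY KEY, LOCK=NONE;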
Assertion `!table->versioned(VERS_TRX_ID)' failed in
Write_rows_log_event::binlog_row_logging_function during ONLINE ALTER.
TRX_ID-versioned tables can't be replicated, so ONLINE ALTER is also
forbidden for these tables.
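For reference, a TRX_ID-versioned table looks like this (standard MariaDB
system-versioning syntax; the table name is invented):

  CREATE TABLE t1 (
    x INT,
    row_start BIGINT UNSIGNED GENERATED ALWAYS AS ROW START,
    row_end   BIGINT UNSIGNED GENERATED ALWAYS AS ROW END,
    PERIOD FOR SYSTEM_TIME (row_start, row_end)
  ) ENGINE=InnoDB WITH SYSTEM VERSIONING;

  ALTER TABLE t1 ADD y INT, ALGORITHM=COPY, LOCK=NONE;  -- now refused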
1. ER_KEY_NOT_FOUND
A general replication problem, already fixed earlier.
A test is added.
2. ER_LOCK_WAIT_TIMEOUT
This is a long-unique-specific problem.
Sometimes, a lookup_handler is created for to->file. To properly free it,
ha_reset should be called. This is usually done by calling
close_thread_table, but ALTER TABLE does it differently. Hence, a single
ha_reset call is added to mysql_alter_table.
Also, event_mem_root is removed. Normally, no per-event data should be
allocated on thd->mem_root, as that would mean a leak. Otherwise, the
lookup_handler is lazily allocated, but its lifetime matches the
statement, not the event.
ONLINE conflicts with IGNORE: online means we'll apply events from the
binlog, and ignore means that bad rows will be skipped. So a bad
Write_rows_log_event would be skipped, and a following
Update_rows_log_event would fail to apply.
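A hypothetical example of the forbidden combination (names invented):

  CREATE TABLE t1 (a INT) ENGINE=InnoDB;
  INSERT INTO t1 VALUES (1), (1);

  -- IGNORE would drop one of the duplicate rows from the new table;
  -- a logged update of the dropped row would then have nothing to
  -- apply to.
  ALTER IGNORE TABLE t1 ADD UNIQUE (a), ALGORITHM=COPY, LOCK=NONE;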
If ALTER TABLE ... LOCK=xxx is executed under LOCK TABLES,
ignore the LOCK clause, because ALTER should not downgrade
the already-taken EXCLUSIVE table lock to SHARED or NONE.
This commit preserves the existing behavior (LOCK was de facto ignored),
but makes it explicit.
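A sketch of the (unchanged) behavior, with invented names:

  CREATE TABLE t1 (a INT) ENGINE=InnoDB;
  LOCK TABLES t1 WRITE;

  -- The LOCK=NONE clause is ignored here: the exclusive lock taken by
  -- LOCK TABLES stays, and no concurrent DML is let in.
  ALTER TABLE t1 ADD b INT, LOCK=NONE;

  UNLOCK TABLES;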
ALTER ONLINE TABLE acquires the table with TL_READ. MyISAM normally
acquires TL_WRITE for DML, which makes the DML hang until the table is
freed. We deadlock once the ALTER upgrades its MDL lock.
Solution:
Unlock the table earlier. We don't need to hold TL_READ once we have
finished copying. Relay-log replication requires no data locks on the
`from` table.
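The hang as a two-connection sketch (table name invented):

  -- connection 1: holds TL_READ on the MyISAM table during the copy
  ALTER ONLINE TABLE t1 ADD b INT;

  -- connection 2, meanwhile: the DML wants TL_WRITE and blocks on the
  -- ALTER's TL_READ, while itself holding a shared MDL
  INSERT INTO t1 VALUES (1);

  -- so when connection 1 tries to upgrade its MDL to exclusive, the two
  -- wait for each other; releasing TL_READ after the copy is done lets
  -- the INSERT proceed instead.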
In the catch-up phase of the online ALTER we apply row events:
they're unpacked into `from->record[0]` and then converted
to `to->record[0]`.
This needs all fields of `from` to be in the `write_set`.
Although practically `Field::unpack()` does not assert the `write_set`,
and `Field::reset()` -- used when a field value is not present in the
after-image -- also doesn't assert the `write_set` for many types,
`Field_new_decimal::reset()` does.
If the online ALTER fails, the TABLE_SHARE can be freed while concurrent
transactions still have row events in their online_alter_cache_data.
On commit they'll try to flush them, writing to the TABLE_SHARE's
Cache_flip_event_log, which is already freed.
This caused a crash in the main.alter_table_online_debug test.
Don't simply set tdc->flushed; use flush_unused(1), which removes opened
but unused TABLE instances (that would otherwise prevent the TABLE_SHARE
from being closed by keeping ref_count > 0).
ht->start_consistent_snapshot() is not a way either,
because some engines (e.g. RocksDB) only do it read-only.
Instead, downgrade the lock after reading the first row
(which implicitly opens a read view).
* Log rows in online_alter_binlog.
* Table online data is replicated within a dedicated binlog file.
* Cached data is written on commit.
* Versioning is fully supported.
* Works both with and without the binlog enabled.
* For now, setting savepoints is forbidden while an ONLINE ALTER goes on.
Extra support is required: we could simply log the SAVEPOINT query events
and replicate them together with the row events, but it's not implemented
for now.
* Cache flipping:
We want to take care of a possible bottleneck in the online ALTER binlog
reading/writing in advance.
IO_CACHE does not provide anything better than sequential access;
besides, only a single write is mutex-protected, which is not suitable,
since we should write a transaction atomically.
To solve this, a special layer on top of Event_log is implemented.
There are two IO_CACHE files underneath: one for reading, and one for
writing.
Once the read cache is empty, an exclusive lock is acquired (we may wait
for a currently active transaction to finish writing), and flip() is
emitted, i.e. the write cache is reopened for reading, and the read cache
is emptied and reopened for writing.
This resembles the buffer flip that happens in accelerated graphics
(DirectX/OpenGL/etc).
Cache_flip_event_log is considered non-blocking for a single reader and a
single writer in this sense, with the only lock held by the reader during
the flip.
An alternative approach, implementing a fair concurrent circular buffer,
is described in MDEV-24676.
* Cache managers:
We have two cache sinks: statement and transactional.
It is important that the changes are first cached per statement and
per transaction.
If a statement fails, only the statement data is rolled back, while the
transaction moves along (see the sketch below).
It turns out there's no guarantee that a TABLE will persist in
thd->open_tables until the transaction commit moment.
If an error occurs, the statement's tables are purged.
Therefore, we can't store the caches in TABLE. Ideally, they should be
stored in the handlerton, but we cut a corner and store them in THD in a
list.
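A sketch of why both sinks are needed (schema invented; assume an online
ALTER of t1 is copying concurrently):

  CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;

  BEGIN;
  INSERT INTO t1 VALUES (1);       -- cached per-statement, then moved
                                   -- to the transaction cache
  INSERT INTO t1 VALUES (2), (2);  -- duplicate key: only this
                                   -- statement's cached rows are
                                   -- discarded
  COMMIT;                          -- the transaction cache (row 1) is
                                   -- flushed to online_alter_binlog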
Event_log is supposed to be a basic logging class that can write events to
a single file.
MYSQL_BIN_LOG, in comparison, will have:
* rotation support
* index files
* purging
* gtid and transactional information handling
* dedication to the general-purpose binlog
It was redundant, duplicating vcol_type == VCOL_GENERATED_STORED.
Note that VCOL_DEFAULT is not "stored": a "stored vcol" means that after
rnd_next or index_read/etc the field value is already in record[0]
and does not need to be calculated separately.
Make the TRANSACTIONAL table option behave similarly to other
engine-defined table options. If the engine doesn't support it:
* if specified explicitly in CREATE or ALTER -- it's ER_UNKNOWN_OPTION
* it's an error or a warning, depending on the IGNORE_BAD_TABLE_OPTIONS
sql_mode flag
* in ALTER TABLE from an engine that supports it to an engine that
doesn't -- silently preserved (no warning)
* it is commented out in SHOW CREATE unless IGNORE_BAD_TABLE_OPTIONS
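A sketch of the intended behavior (the engine choice is illustrative,
assuming it does not support the option):

  SET sql_mode = '';
  CREATE TABLE t1 (a INT) ENGINE=MyISAM TRANSACTIONAL=1;
  -- => ER_UNKNOWN_OPTION

  SET sql_mode = 'IGNORE_BAD_TABLE_OPTIONS';
  CREATE TABLE t1 (a INT) ENGINE=MyISAM TRANSACTIONAL=1;
  -- => warning only; SHOW CREATE TABLE then prints TRANSACTIONAL=1
  --    uncommented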
* invoke check_expression() for all vcol_info's in
mysql_prepare_create_table() to check for FK CASCADE
* also check for SET NULL and SET DEFAULT
* to check against existing FKs when a vcol is added in ALTER TABLE,
old FKs must be added to the new_key_list just like other indexes are
* check columns recursively: if vcol1 references vcol2,
the flags of vcol2 must be taken into account
* remove check_table_name_processor() and put that logic under
check_vcol_func_processor() to avoid walking the tree twice
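A hypothetical illustration of the pattern these checks catch (names
invented; the exact error is implementation-specific):

  CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
  CREATE TABLE child (
    pid INT,
    v INT AS (pid) STORED,  -- vcol depends on the FK column
    FOREIGN KEY (pid) REFERENCES parent (id) ON UPDATE CASCADE
  ) ENGINE=InnoDB;          -- expected to be rejected

  -- The ALTER TABLE path must catch the same pattern against an
  -- existing FK, e.g.:
  --   ALTER TABLE child ADD v2 INT AS (pid) STORED;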
Mark old keys in the ALTER TABLE with the `old` flag, not with
`key_create_info.check_for_duplicate_indexes`.
This allows marking old foreign keys too.