1. The system_versioning_insert_history session variable allows the
pseudocolumns ROW_START and ROW_END to be specified in INSERT,
INSERT..SELECT and LOAD DATA.
2. Cleaned up select_insert::send_data(): it no longer sets vers_write, as
this parameter is now set on TABLE initialization.
4. Replication of system_versioning_insert_history via option_bits in
OPTIONS_WRITTEN_TO_BIN_LOG.
When a range rowid filter was used with an index ref access, the cost of
accessing the index entries for the records rejected by the filter was not
taken into account. For a ref access by an index with a large average number
of records per key, this led to poor execution plans if the selectivity of the
used filter was high.
The patch resolves this problem. It also introduces a minor optimization
that skips look-ups into a filter that turns out to be empty.
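The effect of the missing term can be illustrated with a toy cost model. This is only a sketch: the struct, the function names and the unit costs are assumptions made for illustration and are not the optimizer's real formulas.

```cpp
// Toy cost model; all names and unit costs here are illustrative
// assumptions, not the server's actual cost calculation.
struct RefAccess {
  double records_per_key;   // average number of index entries per ref lookup
  double index_entry_cost;  // cost of reading one index entry
  double row_fetch_cost;    // cost of fetching one table row
};

// Estimate that ignores the index entries rejected by the filter.
double cost_ignoring_rejected(const RefAccess &a, double pass_fraction) {
  double passed = a.records_per_key * pass_fraction;
  return passed * (a.index_entry_cost + a.row_fetch_cost);
}

// Estimate that charges index entry access for all entries, but row
// fetches only for the entries that pass the filter.
double cost_counting_rejected(const RefAccess &a, double pass_fraction) {
  double passed = a.records_per_key * pass_fraction;
  return a.records_per_key * a.index_entry_cost + passed * a.row_fetch_cost;
}
```

With many records per key and a filter that rejects most of them, the first estimate is far too optimistic, which is the kind of plan distortion described above.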
With this patch the output of ANALYZE stmt reports the number of look-ups
into used rowid filters.
The patch also back-ports from 10.5 the code that properly sets the field
TABLE::file::table for opened temporary tables.
The test cases that were supposed to use rowid filters have been adjusted
in order to use similar execution plans after this fix.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
The ALTER related code cannot do at the same time both:
- modify partitions
- change column data types
Explicitly changing a column data type together with a partition change is
prohibited by the parser, so this is not allowed and returns a syntax error:
ALTER TABLE t MODIFY ts BIGINT, DROP PARTITION p1;
This fix additionally disables implicit data type upgrade
(e.g. from "MariaDB 5.3 TIME" to "MySQL 5.6 TIME", or the other way
around according to the current mysql56_temporal_format) in case of
an ALTER modifying partitions, e.g.:
ALTER TABLE t DROP PARTITION p1;
In such commands now only the partition change happens, while
the data types stay unchanged.
One can additionally run:
ALTER TABLE t FORCE;
either before or after the ALTER modifying partitions to
upgrade data types according to mysql56_temporal_format.
To prevent ASAN heap-use-after-poison in the MDEV-16549 part of
./mtr --repeat=6 main.derived
the initialization of Name_resolution_context was cleaned up.
The population of default values in INSERT SELECT was being
performed twice. With sequences, this resulted in every
second sequence value being used.
For INSERT SELECT we remove the second invocation of
table->update_default_fields(). It was already performed
in store_values(), which invokes fill_record_n_invoke_before_triggers(),
which in turn invokes update_default_fields().
We do need to return an error on duplicate values, so the
::store_values is extended to take the ignore option.
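The symptom can be reproduced with a small model. The Sequence struct and insert_select() below are hypothetical stand-ins, not server code; they only show why populating defaults twice per row makes every second sequence value reach the table.

```cpp
#include <vector>

// Hypothetical stand-in for a sequence used as a column DEFAULT.
struct Sequence {
  long next = 1;
  long next_value() { return next++; }
};

// Models inserting `rows` rows where default population runs
// `default_calls_per_row` times for each row. Every call consumes a
// sequence value; only the value from the last call is stored.
std::vector<long> insert_select(Sequence &seq, int rows,
                                int default_calls_per_row) {
  std::vector<long> stored;
  for (int r = 0; r < rows; r++) {
    long value = 0;
    for (int i = 0; i < default_calls_per_row; i++)
      value = seq.next_value();
    stored.push_back(value);
  }
  return stored;
}
```

In this model two calls per row store 2, 4, 6 for three rows, while one call per row stores 1, 2, 3.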
When ha_end_bulk_insert() failed, F_UNLCK was done twice: in
select_insert::prepare_eof() and in select_create::abort_result_set().
Now we avoid doing F_UNLCK in prepare_eof() if error is non-zero.
add_back_last_deleted_lock() was called when the lock was never
removed. The lock is removed in finalize_atomic_replace() in
close_all_tables_for_name(). finalize_atomic_replace() is done only
for a successful operation.
The non-atomic codepath drops the table first; if anything fails
later we don't need to restore the lock since the table no longer
exists. So the fix is required there as well.
Atomic CREATE OR REPLACE keeps the old table intact if the
command fails or the server crashes. That is done by creating
a table with a temporary name and filling it with the data
(for CREATE OR REPLACE .. SELECT), then renaming the original table
to another temporary (backup) name and renaming the replacement table
to the original name. The backup table is kept until the last chance of
failure; if a failure happens, the replacement table is discarded and
the backup is restored. When the command is complete and logged, the backup
table is deleted.
Atomic replace algorithm
Two DDL chains are used for CREATE OR REPLACE:
ddl_log_state_create (C) and ddl_log_state_rm (D).
  1. (C) Log CREATE_TABLE_ACTION of TMP table (drops TMP table);
  2. Create new table as TMP;
  3. Do everything with TMP (like insert data);
finalize_atomic_replace():
  4. Link chains: (D) is executed only if (C) is closed;
  5. (D) Log DROP_ACTION of BACKUP;
  6. (C) Log RENAME_TABLE_ACTION from ORIG to BACKUP (replays BACKUP -> ORIG);
  7. Rename ORIG to BACKUP;
  8. (C) Log CREATE_TABLE_ACTION of ORIG (drops ORIG);
  9. Rename TMP to ORIG;
finalize_ddl() in case of success:
  10. Close (C);
  11. Replay (D): BACKUP is dropped.
finalize_ddl() in case of error:
  10. Close (D);
  11. Replay (C):
      1) ORIG is dropped (only after finalize_atomic_replace());
      2) BACKUP renamed to ORIG (only after finalize_atomic_replace());
      3) drop TMP.
If a crash happens, (C) or (D) is replayed in reverse order. (C) is
replayed if the crash happens before it is closed; otherwise (D) is
replayed.
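The numbered steps can be walked through with a small in-memory simulation. This is a sketch under stated assumptions: the Catalog map, the Chain type and create_or_replace() are hypothetical, tables are plain strings, and chain linking (step 4) is reduced to choosing which chain to replay.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using Catalog = std::map<std::string, std::string>;  // table name -> contents
using Chain = std::vector<std::function<void(Catalog &)>>;

// Replay the logged actions in reverse order of logging.
void replay(const Chain &chain, Catalog &cat) {
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) (*it)(cat);
}

// Rename that tolerates a missing source, since a replayed action may
// refer to a step that never ran.
void rename_table(Catalog &cat, const std::string &from,
                  const std::string &to) {
  auto it = cat.find(from);
  if (it == cat.end()) return;
  cat[to] = it->second;
  cat.erase(it);
}

// fail_before_step == 0 means success; otherwise the command fails just
// before the given step number from the list above.
Catalog create_or_replace(Catalog cat, const std::string &new_contents,
                          int fail_before_step) {
  Chain c, d;  // (C) = create chain, (D) = rm chain
  auto failed = [&](int step) { return fail_before_step == step; };
  bool ok = false;
  do {
    c.push_back([](Catalog &k) { k.erase("TMP"); });                    // 1
    if (failed(2)) break;
    cat["TMP"] = new_contents;                                          // 2, 3
    if (failed(5)) break;
    d.push_back([](Catalog &k) { k.erase("BACKUP"); });                 // 5
    c.push_back([](Catalog &k) { rename_table(k, "BACKUP", "ORIG"); }); // 6
    if (failed(7)) break;
    rename_table(cat, "ORIG", "BACKUP");                                // 7
    c.push_back([](Catalog &k) { k.erase("ORIG"); });                   // 8
    if (failed(9)) break;
    rename_table(cat, "TMP", "ORIG");                                   // 9
    if (failed(10)) break;
    ok = true;
  } while (false);
  // 10-11: on success (C) is closed and (D) is replayed; on error (D) is
  // closed and (C) is replayed.
  replay(ok ? d : c, cat);
  return cat;
}
```

In this model a failure at any of the simulated points leaves only ORIG with its old contents, and a successful run leaves only ORIG with the new contents.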
Temporary table for CREATE OR REPLACE
Before dropping "old" table, CREATE OR REPLACE creates "tmp" table.
ddl_log_state_create holds the drop of the "tmp" table. When
everything is OK (data is inserted, "tmp" is ready) ddl_log_state_rm
is written to replace "old" with "tmp". Until ddl_log_state_create
is closed ddl_log_state_rm is not executed.
After the binlogging is done ddl_log_state_create is closed. At that
point ddl_log_state_rm is executed and "old" is replaced with
"tmp". That is: the final rename is done by the DDL log.
Given this important role of the DDL log in the CREATE OR REPLACE operation,
the replay of ddl_log_state_rm must fail at the first error and
print the error message if possible. For example, a foreign key error is
discovered at this phase: InnoDB refuses to drop the "old" table and
returns the corresponding foreign key error code.
Additional notes
- CREATE TABLE without REPLACE is not affected by this commit.
- Engines having HTON_EXPENSIVE_RENAME flag set are not affected by
this commit.
- CREATE TABLE .. SELECT XID usage is fixed and now there is no need
to log DROP TABLE via DDL_CREATE_TABLE_PHASE_LOG (see comments in
do_postlock()). XID is now correctly updated so it disables
DDL_LOG_DROP_TABLE_ACTION. Note that binary log is flushed at the
final stage when the table is ready. So if we have XID in the
binary log we don't need to drop the table.
- Three variations of CREATE OR REPLACE handled:
1. CREATE OR REPLACE TABLE t1 (..);
2. CREATE OR REPLACE TABLE t1 LIKE t2;
3. CREATE OR REPLACE TABLE t1 SELECT ..;
- Test case uses 6 combinations for engines (aria, aria_notrans,
myisam, ib, lock_tables, expensive_rename) and 2 combinations for
binlog types (row, stmt). Combinations help to check differences
between the results. Error failures are tested for the above three
variations.
- expensive_rename tests CREATE OR REPLACE without atomic
replace. The effect should be the same as with the old behaviour
before this commit.
- Triggers mechanism is unaffected by this change. This is tested in
create_replace.test.
- LOCK TABLES is affected. Lock restoration must be done after "rm"
chain is replayed.
- Moved ddl_log_complete() from send_eof() to finalize_ddl(). This
checkpoint was not executed before for normal CREATE TABLE but is
executed now.
- CREATE TABLE will now also roll back if writing to the binary
log fails. See rpl_gtid_strict.test
Rename and drop via DDL log
We replay ddl_log_state_rm to drop the old table and rename the
temporary table. In that case we must report the correct error
message if ddl_log_revert() fails (e.g. on an FK error).
If the table is deleted earlier, not via the DDL log, and a crash
happens, the create chain is not closed. The linked drop chain is not
executed and the new table is not installed, but the old table is
already deleted.
ddl_log.cc changes
Now we can place an action before DDL_LOG_DROP_INIT_ACTION and it will
be replayed after DDL_LOG_DROP_TABLE_ACTION.
The report_error parameter of ddl_log_revert() allows failing at the first
error and printing the error message if possible.
ddl_log_execute_action() can now print an error message.
Since we can now handle errors from ddl_log_execute_action() (in
case of non-recovery execution), unconditionally setting "error= TRUE"
is wrong (it was wrong anyway because it was overwritten at the end
of the function).
On XID usage
Like with all other atomic DDL operations XID is used to avoid
inconsistency between master and slave in the case of a crash after
binary log is written and before ddl_log_state_create is closed. On
recovery XIDs are taken from binary log and corresponding DDL log
events get disabled. That is done by
ddl_log_close_binlogged_events().
On linking two chains together
Chains are executed in the ascending order of entry_pos of execute
entries. But entry_pos assignment order is undefined: it may assign
bigger number for the first chain and then smaller number for the
second chain. So the execution order in that case will be reverse:
second chain will be executed first.
To avoid that we link one chain to another. While the base chain
(ddl_log_state_create) is active the secondary chain
(ddl_log_state_rm) is not executed. That is: only one chain can be
executed in two linked chains.
The interface ddl_log_link_chains() was done in "MDEV-22166
ddl_log_write_execute_entry() extension".
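The ordering problem and the effect of linking can be sketched as follows. The ChainEntry struct and recovery_order() are hypothetical and only model the rule described above; they are not the ddl_log_link_chains() interface.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model of an execute entry for one DDL log chain.
struct ChainEntry {
  int entry_pos;  // position of the chain's execute entry
  char name;      // label used only for this illustration
  bool active;    // true while the chain has not been closed
  int base_pos;   // entry_pos of the chain this one is linked to, or -1
};

// Returns the labels of the chains that would be executed, in ascending
// entry_pos order. A linked chain is skipped while its base is active.
std::vector<char> recovery_order(std::vector<ChainEntry> chains) {
  std::sort(chains.begin(), chains.end(),
            [](const ChainEntry &a, const ChainEntry &b) {
              return a.entry_pos < b.entry_pos;
            });
  std::vector<char> executed;
  for (const ChainEntry &c : chains) {
    if (!c.active) continue;
    bool base_active = false;
    for (const ChainEntry &b : chains)
      if (b.entry_pos == c.base_pos && b.active) base_active = true;
    if (base_active) continue;
    executed.push_back(c.name);
  }
  return executed;
}
```

If chain D happens to get a smaller entry_pos than chain C, an unlinked D would run first. Once D is linked to C, only C runs while it is active, and only D runs after C is closed.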
More on CREATE OR REPLACE .. SELECT
We use create_and_open_tmp_table() like in ALTER TABLE to create
temporary TABLE object (tmp_table is (NON_)TRANSACTIONAL_TMP_TABLE).
After we created such TABLE object we use create_info->tmp_table()
instead of table->s->tmp_table when we need to check for
parser-requested tmp-table.
External locking is required for the temporary table created by
create_and_open_tmp_table(). For example, it disables logging for Aria
transactional tables; without it (when no mysql_lock_tables()
is done) the table cannot work correctly.
To take the external lock, the patch requires the Aria table to work in
non-transactional mode. That is usually done by
ha_enable_transaction(false). But we cannot disable transactions
completely because: 1. binlog rollback removes pending row events
(binlog_remove_pending_rows_event()). The row events are added
during the CREATE .. SELECT data insertion phase. 2. the replication slave
depends heavily on the transaction and cannot work without it.
So we put the temporary Aria table into non-transactional mode with the
"thd->transaction->on hack". See the comment for the on_save variable.
Note that Aria table has internal_table mode. But we cannot use it
because:
    if (!internal_table)
    {
      mysql_mutex_lock(&THR_LOCK_myisam);
      old_info= test_if_reopen(name_buff);
    }
For internal_table test_if_reopen() is not called and we get a new
MARIA_SHARE for each file handler. In that case duplicate errors are
missed because insert and lookup in CREATE .. SELECT is done via two
different handlers (see create_lookup_handler()).
For temporary table before dropping TABLE_SHARE by
drop_temporary_table() we must do ha_reset(). ha_reset() releases
storage share. Without that the share is kept and the second CREATE
OR REPLACE .. SELECT fails with:
HA_ERR_TABLE_EXIST (156): MyISAM table '#sql-create-b5377-4-t2' is
in use (most likely by a MERGE table). Try FLUSH TABLES.
HA_EXTRA_PREPARE_FOR_DROP also removes MYISAM_SHARE, but that is
not needed as ha_reset() does the job.
ha_reset() is usually done by
mark_tmp_table_as_free_for_reuse(). But we don't need that mechanism
for our temporary table.
Atomic_info in HA_CREATE_INFO
Many functions in CREATE TABLE pass the same parameters. These
parameters are part of table creation info and should be in
HA_CREATE_INFO (or whatever). Passing parameters via single
structure is much easier for adding new data and
refactoring.
InnoDB changes (revised by Marko Mäkelä)
row_rename_table_for_mysql(): Specify the treatment of FOREIGN KEY
constraints in a 4-valued enum parameter. In cases where FOREIGN KEY
constraints cannot exist (partitioned tables, or internal tables of
FULLTEXT INDEX), we can use the mode RENAME_IGNORE_FK.
The mode RENAME_REBUILD is for any DDL operation that rebuilds the
table inside InnoDB, such as TRUNCATE and native ALTER TABLE
(or OPTIMIZE TABLE). The mode RENAME_ALTER_COPY is used solely
during non-native ALTER TABLE in ha_innobase::rename_table().
Normal ha_innobase::rename_table() will use the mode RENAME_FK.
CREATE OR REPLACE will rename the old table (if one exists) along
with its FOREIGN KEY constraints into a temporary name. The replacement
table will be initially created with another temporary name.
Unlike in ALTER TABLE, all FOREIGN KEY constraints must be renamed
and not inherited as part of these operations, using the mode RENAME_FK.
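The mapping from operation to mode described above can be summarized in a short sketch. The four mode names come from the text; the enum type name, the rename_context enum and choose_mode() are hypothetical and only restate the rules, they are not the InnoDB declarations.

```cpp
// Mode names as given in the text; the type name is an assumption.
enum rename_fk_mode {
  RENAME_IGNORE_FK,
  RENAME_REBUILD,
  RENAME_ALTER_COPY,
  RENAME_FK
};

// Hypothetical classification of the caller's situation.
enum class rename_context {
  FK_CANNOT_EXIST,         // partitioned tables, FULLTEXT internal tables
  REBUILD_INSIDE_INNODB,   // TRUNCATE, native ALTER TABLE, OPTIMIZE TABLE
  NON_NATIVE_ALTER_TABLE,  // copy ALTER via ha_innobase::rename_table()
  NORMAL_RENAME,           // ordinary ha_innobase::rename_table()
  CREATE_OR_REPLACE        // rename of the old and the replacement table
};

rename_fk_mode choose_mode(rename_context ctx) {
  switch (ctx) {
  case rename_context::FK_CANNOT_EXIST:        return RENAME_IGNORE_FK;
  case rename_context::REBUILD_INSIDE_INNODB:  return RENAME_REBUILD;
  case rename_context::NON_NATIVE_ALTER_TABLE: return RENAME_ALTER_COPY;
  case rename_context::NORMAL_RENAME:          return RENAME_FK;
  case rename_context::CREATE_OR_REPLACE:      return RENAME_FK;
  }
  return RENAME_FK;  // not reached for valid input
}
```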
dict_get_referenced_table(): Let the callers convert names when needed.
create_table_info_t::create_foreign_keys(): CREATE OR REPLACE creates
the replacement table with a temporary name, so for
self-references foreign->referenced_table will be a table with a
temporary name and charset conversion must be skipped for it.
Reviewed by:
Michael Widenius <monty@mariadb.org>
create_table duplicates select_insert::table_list. Since select_create
inherits select_insert and the functional role of the members is the
same we should remove one to eliminate the need of keeping them in
sync.
TABLEOP_HOOKS is a strange interface: proxy interface calls virtual
interface. Since it is used only for select_create::prepare() such
complexity is overwhelming.
There is a need in MDEV-25292 to have both C_ALTER_TABLE and
select_field_count in one call. Semantically creation mode and field
count are two different things. Making creation mode negative
constants and field count positive variable into one parameter seems
to be a lazy hack for not making the second parameter.
select_count does not make sense without alter_info->create_list, so
the natural way is to hold it in Alter_info too. select_count is now
stored in member select_field_count.
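A minimal sketch of why one overloaded parameter cannot carry both pieces of information. The constant values, the Decoded struct and the AlterInfoSketch struct are assumptions for illustration; only the names C_ALTER_TABLE and select_field_count come from the text.

```cpp
// Assumed values: the text only says that creation modes were negative
// constants sharing one parameter with a positive field count.
const int SKETCH_C_ORDINARY_CREATE = 0;  // hypothetical name and value
const int C_ALTER_TABLE = -1;            // value is an assumption

struct Decoded {
  int create_table_mode;
  unsigned select_field_count;
};

// Old style: a single int carries either a mode (negative) or a count
// (non-negative), so a mode always arrives with a count of zero.
Decoded decode_overloaded(int mode_or_count) {
  if (mode_or_count < 0) return {mode_or_count, 0};
  return {SKETCH_C_ORDINARY_CREATE, static_cast<unsigned>(mode_or_count)};
}

// New style: the count is kept next to the column list it describes.
struct AlterInfoSketch {
  unsigned select_field_count = 0;
};

Decoded decode_separate(int create_table_mode, const AlterInfoSketch &info) {
  return {create_table_mode, info.select_field_count};
}
```

With the overloaded form, passing C_ALTER_TABLE loses the field count; with the separate member, both values are available in one call.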
1. Store assignment failures on incompatible data types now raise errors if:
- STRICT_ALL_TABLES or STRICT_TRANS_TABLES sql_mode is used, and
- IGNORE is not used
Otherwise, only a warning is raised and the statement continues.
2. Changing the error/warning text as follows:
-ERROR HY000: Illegal parameter data types inet6 and int for operation 'SET'
+ERROR HY000: Cannot cast 'int' as 'inet6' in assignment of `db`.`t`.`col`
so in case of a big table it's easier to see which column has the problem.
The new error text is also applied to SP variables.
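The two rules above can be sketched as a pair of small functions. The flag constants, their bit values and the function names are hypothetical; only the sql_mode names and the message text come from this description.

```cpp
#include <string>

// Hypothetical bit values standing in for the two strict sql_mode flags.
const unsigned SKETCH_STRICT_TRANS_TABLES = 1u << 0;
const unsigned SKETCH_STRICT_ALL_TABLES = 1u << 1;

// An incompatible store assignment is an error only in a strict mode and
// without IGNORE; otherwise it is a warning and the statement continues.
bool store_failure_is_error(unsigned sql_mode, bool ignore) {
  bool strict = (sql_mode & (SKETCH_STRICT_TRANS_TABLES |
                             SKETCH_STRICT_ALL_TABLES)) != 0;
  return strict && !ignore;
}

// Builds the new message text, which names the target column.
std::string cast_error_text(const std::string &from, const std::string &to,
                            const std::string &db, const std::string &table,
                            const std::string &column) {
  return "Cannot cast '" + from + "' as '" + to + "' in assignment of `" +
         db + "`.`" + table + "`.`" + column + "`";
}
```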
Not a SPIDER issue: it happens with INSERT DELAYED.
field::make_new_field doesn't copy the LONG_UNIQUE_HASH_FIELD
flag to the new field, though Delayed_insert::get_local_table
copies field->vcol_info for this field. As a result,
parse_vcol_defs doesn't create the expression for that column,
so field->vcol_info->expr is NULL, which leads to a crash.
Backported the fix for this from 10.5: the flag is added in
Delayed_insert::get_local_table.
Another problem with the USING HASH key is that
parse_vcol_defs modifies the table->keys content. Then the same
parse_vcol_defs is called on the table copy that has keys already
modified. Backported the fix for that from 10.5: key copying is added
to Delayed_insert::get_local_table.
Finally, the created copy has to clear the expr_arena, as
this table is not in the thd->open_tables list and so won't be
cleared automatically.
1. For INSERT..SELECT statements: don't include table/view the data
is inserted into in the list of leaf tables
2. Remove duplicated and dead code related to table_count
Now INSERT, UPDATE, ALTER statements involving incompatible data type pairs, e.g.:
UPDATE t1 SET col_inet6=col_int;
INSERT INTO t1 (col_inet6) SELECT col_int FROM t2;
ALTER TABLE t1 MODIFY col_inet6 INT;
consistently return an error at the statement preparation time:
ERROR HY000: Illegal parameter data types inet6 and int for operation 'SET'
and abort the statement before starting to iterate over rows.
This error is the same as what is raised for queries like:
SELECT col_inet6 FROM t1 UNION SELECT col_int FROM t2;
SELECT COALESCE(col_inet6, col_int) FROM t1;
Before this change the error was caught only at execution time,
when a Field_xxx::store_xxx() was called for the very first row.
The behavior was not consistent between various statements and could do different things:
- abort the statement
- set a column to the data type default value (e.g. '::' for INET6)
- set a column to NULL
A typical old error was:
ERROR 22007: Incorrect inet6 value: '1' for column `test`.`t1`.`a` at row 1
EXCEPTION:
Note, there is an exception: a multi-row INSERT..VALUES, e.g.:
INSERT INTO t1 (col_a,col_b) VALUES (a1,b1),(a2,b2);
checks assignment compatibility at the preparation time for the very first row only:
(col_a,col_b) vs (a1,b1)
Other rows are still checked at the execution time and return the old warnings
or errors in case of a failure. This is done because catching all rows at the
preparation time would change behavior significantly. So it still works
according to the STRICT_XXX_TABLES sql_mode flags and the table transaction ability.
It is too late to change this behavior in 10.7.
There is no firm decision yet on whether the multi-row INSERT..VALUES
behavior will change in later versions.
MDEV-21810 MBR: Unexpected "Unsafe statement" warning for unsafe IODKU
The MDEV-17614 fixes for replication unsafety of INSERT ON DUP KEY UPDATE
on a table with two or more unique keys left a flaw. The fixes checked the
safety condition for each inserted record, with the idea of catching a user-supplied
value for an autoincrement column; when that succeeds, the autoincrement column
becomes a source of unsafety too.
It was not expected that after a duplicate error the next record's
write_set may be different, so that the unsafe decision computed for that
specific record corrupts the query's binlogging
state, and when @@binlog_format is MIXED nothing gets bin-logged.
This case has been already fixed in 10.5.2 by 91ab42a823 that
relocated/optimized THD::decide_logging_format_low() out of the record insert
loop. The safety decision is computed once and at the right time.
Pertinent parts of the commit are cherry-picked.
Also, a spurious warning about unsafety is removed when
@@binlog_format is MIXED; the original MDEV-17614 test result is corrected.
The original test of MDEV-17614 is extended and made more readable.
If UPDATE/DELETE does not change data it is skipped from
replication. We now force replication of such events when they trigger
partition auto-creation.
For ROLLBACK it is as simple as setting the OPTION_KEEP_LOG
flag. trans_cannot_safely_rollback() does the rest.
For UPDATE/DELETE .. LIMIT 0 we make additional binlog_query() calls
at the early points of return.
As a safety measure we also convert row format into statement format if it is
needed. The condition is decided by
binlog_need_stmt_format(). Basically, if there are some row events in the
cache we don't need that: the table open of a row event will trigger
auto-creation anyway.
Multi-update/delete works via mysql_select(). There are no early points
of return, so binlogging is always checked by
send_eof()/abort_result_set(). But we must comply with the above
measure of converting into statement format.
Rename OPTION_KEEP_LOG -> OPTION_BINLOG_THIS_TRX.
Meaning: the transaction cache will be written to the binlog even on rollback.
Convert log_current_statement to OPTION_BINLOG_THIS_STMT.
Meaning: the statement will be written to the binlog (or the trx binlog cache)
even if it normally wouldn't be.
Setting OPTION_BINLOG_THIS_STMT must always set OPTION_BINLOG_THIS_TRX,
otherwise the statement won't be logged if the transaction is rolled back.
Use OPTION_BINLOG_THIS to set both.
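The relationship between the three options can be sketched with bit flags. The bit positions below are placeholders chosen for illustration and the helper functions are hypothetical; only the option names and their stated meaning come from this description.

```cpp
#include <cstdint>

// Placeholder bit positions; the real values are not given in the text.
constexpr uint64_t OPTION_BINLOG_THIS_STMT = 1ULL << 0;
constexpr uint64_t OPTION_BINLOG_THIS_TRX = 1ULL << 1;
// Combined flag that sets both, as recommended above.
constexpr uint64_t OPTION_BINLOG_THIS =
    OPTION_BINLOG_THIS_STMT | OPTION_BINLOG_THIS_TRX;

// The statement is written to the binlog (or to the trx binlog cache)
// even if it normally wouldn't be.
bool statement_is_forced(uint64_t option_bits) {
  return (option_bits & OPTION_BINLOG_THIS_STMT) != 0;
}

// The transaction cache is written to the binlog even on rollback.
bool cache_written_on_rollback(uint64_t option_bits) {
  return (option_bits & OPTION_BINLOG_THIS_TRX) != 0;
}

// A forced statement reaches the binlog after a rollback only if the
// transaction-level flag is set as well.
bool forced_statement_survives_rollback(uint64_t option_bits) {
  return statement_is_forced(option_bits) &&
         cache_written_on_rollback(option_bits);
}
```

In this model, setting only OPTION_BINLOG_THIS_STMT is not enough for the statement to survive a rollback, which is why the combined OPTION_BINLOG_THIS is preferred.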