Updated tests: cases with bugs or which cannot be run
with the cursor-protocol were excluded with
"--disable_cursor_protocol"/"--enable_cursor_protocol"
Fix for v.10.5
This patch extends the timestamp from
2038-01-19 03:14:07.999999 to 2106-02-07 06:28:15.999999
for 64 bit hardware and OS where 'long' is 64 bits.
This is true for 64 bit Linux but not for Windows.
This is done by treating the 32 bit stored int as unsigned instead of
signed. This is safe as MariaDB has never accepted dates before the epoch
(1970).
The benefit of this approach that for normal timestamp the storage is
compatible with earlier version.
However for tables using system versioning we before stored a
timestamp with the year 2038 as the 'max timestamp', which is used to
detect current values. This patch stores the new 2106 year max value
as the max timestamp. This means that old tables using system
versioning needs to be updated with mariadb-upgrade when moving them
to 11.4. That will be done in a separate commit.
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
First part of the fix (row0mysql.cc) addresses external columns when adding history
row on referential action. The full data must be retrieved before the
row is inserted.
Second part of the fix (the rest) avoids duplicate primary key error between
the history row generated on referential action and the history row
generated by SQL command. Both command and referential action can
happen on same table since foreign key can be self-reference (parent
and child tables are same). Moreover, the self-reference can refer
multiple rows when the key is non-unique. In such case history is
generated by referential action occured on first row but processed all
rows by a matched key. The second round is when the next row is
processed by a command but history already exists. In such case we
check TRX_ID of existing history row and if it is the same we assume
the above situation and skip adding one more history row or failing
the command.
Cause:
* row_start != 0 treated as it exists. Probably, possible row permutations had not been taken in mind.
Solution:
* Checking both row_start and row_end is correct, so versioned() function is used
Cause:
* when autocommit=0 (or transaction is issued by user),
`ha_commit_trans` is called twice on ALTER TABLE, causing a duplicated
insert into `transaction_registry` (ER_DUP_ENTRY).
Solution:
* ALTER TABLE makes an implicit commit by a second call. We actually
need to make an insert only when it is a real commit. So is_real
variable is additionally checked.
Preparation for MDEV-16210:
replace.test:
key_type combinations: PK and UNIQUE.
foreign.test:
Preparation for key_type combinations.
Other fixes:
* Merged versioning.update2 into versioning.update;
* Removed test2 database and done individual drop instead.
1. Removed TIMESTAMP/TRANSACTION unit auto-detection in favor of default TIMESTAMP.
Reasons:
1.1. rare practical use and doubtful advantage of such auto-detection;
1.2. it conflicts with MDEV-16226 (TRX_ID-based versioned tables performance improvement).
Needless check_unit membership removed.
2. SQL: versioning type handling refactoring
Vers_type_handler hierarchy stores versioning properties of type.
virtual Type_handler::vers() accesses specialization of
Vers_type_handler for specific type.
virtual Vers_type_handler::kind() returns versioning kind
(timestamp/trx_id).
Removed Type_handler::Vers_history_point_check_unit() in favor of
Type_handler::vers().
Renames:
require_timestamp() -> require_timestamp_error()
require_trx_id() -> require_trx_id_error()
EDIT by Alexander Barkov (@abarkov):
check_sys_fields() moved to Vers_type_handler::check_sys_fields()
Patch fixes two bugs:
1) BEGIN_TIMESTAMP of MYSQL.TRANSACTION_REGISTY is '0-0-0' on replication record insertion
2) BEGIN_TIMESTAMP equals COMMMIT_TIMESTAMP in MYSQL.TRANSACTION_REGISTRY
Fixed by calling THD::set_time() at appropriate places
Changing columns WITH/WITHOUT SYSTEM VERSIONING doens't require to read data at
all. Thus it should be an instant operation.
Patch also fixes a bug when ALTER_COLUMN_UNVERSIONED wasn't passed to InnoDB
to change its internal structures.
change_field_versioning_try(): apply WITH/WITHOUT SYSTEM VERSIONING
change in SYS_COLUMNS for one field.
change_fields_versioning_try(): apply WITH/WITHOUT SYSTEM VERSIONING
change in SYS_COLUMNS for every changed field in a table.
change_fields_versioning_cache(): update cache for versioning property
of columns.
MDEV-16100 FOR SYSTEM_TIME erroneously resolves string user variables as transaction IDs
Problem:
Vers_history_point::resolve_unit() tested item->result_type() before
item->fix_fields() was called.
- Item_func_get_user_var::result_type() returned REAL_RESULT by default.
This caused MDEV-16100.
- Item_func_sp::result_type() crashed on assert.
This caused MDEV-16094
Changes:
1. Adding item->fix_fields() into Vers_history_point::resolve_unit()
before using data type specific properties of the history point
expression.
2. Adding a new virtual method Type_handler::Vers_history_point_resolve_unit()
3. Implementing type-specific
Type_handler_xxx::Type_handler::Vers_history_point_resolve_unit()
in the way to:
a. resolve temporal and general purpose string types to TIMESTAMP
b. resolve BIT and general purpose INT types to TRANSACTION
c. disallow use of non-relevant data type expressions in FOR SYSTEM_TIME
Note, DOUBLE and DECIMAL data types are disallowed intentionally.
- DOUBLE does not have enough precision to hold huge BIGINT UNSIGNED values
- DECIMAL rounds on conversion to INT
Both lack of precision and rounding might potentionally lead to
very unpredictable results when a wrong transaction ID would be chosen.
If one really wants dangerous use of DOUBLE and DECIMAL, explicit CAST
can be used:
FOR SYSTEM_TIME AS OF CAST(double_or_decimal AS UNSIGNED)
QQ: perhaps DECIMAL(N,0) could still be allowed.
4. Adding a new virtual method Item::type_handler_for_system_time(),
to make HEX hybrids and bit literals work as TRANSACTION rather
than TIMESTAMP.
5. sql_yacc.yy: replacing the rule temporal_literal to "TIMESTAMP TEXT_STRING".
Other temporal literals now resolve to TIMESTAMP through the new
Type_handler methods. No special grammar needed. This removed
a few shift/resolve conflicts.
(TIMESTAMP related conflicts in "history_point:" will be removed separately)
6. Removing the "timestamp_only" parameter from
vers_select_conds_t::resolve_units() and Vers_history_point::resolve_unit().
It was a hint telling that a table did not have any TRANSACTION-aware
system time columns, so it's OK to resolve to TIMESTAMP in case of uncertainty.
In the new reduction it works as follows:
- the decision between TIMESTAMP and TRANSACTION is first made
based only on the expression data type only
- then, in case if the expression resolved to TRANSACTION, the table
is checked if TRANSACTION-aware columns really exist.
This way is safer against possible ALTER TABLE statements changing
ROW START and ROW END columns from "BIGINT UNSIGNED" to "TIMESTAMP(x)"
or the other way around.
Store transaction start time in thd->transaction.start_time.
THD::transaction_time() wraps over transaction.start_time taking into
account current status of BEGIN.
Vers SQL: TRT fix getting TRX_ID by COMMIT_TS
Fixed wrong assumption that records are ordered by COMMIT_TS.
This is anyway a quick hack until tempesta-tech#314 is done.
See also FIXME and TODO in TR_table::query(MYSQL_TIME, bool).
Test: SEES case for trx_id.test [closes#456]
MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY
Move a test from innodb.rename_table_debug to innodb.alter_copy.
ha_innobase::extra(HA_EXTRA_BEGIN_ALTER_COPY): Register id-versioned
tables so that mysql.transaction_registry will be updated, even for
empty tables that are subjected to ALTER TABLE…ALGORITHM=COPY.
and the system_versioning_transaction_registry variable.
The user enables transaction registry by specifying BIGINT for
row_start/row_end columns.
check mysql.transaction_registry structure on the first open,
not on startup. Avoid warnings unless transaction_registry
is actually used.