Also expand vcol field index coverings to include indexes covering all
the fields in the expression. The reasoning goes as follows: let f(c1,
c2, ..., cn) be a function on applied to columns c1, c2, ..., cn, if
f(...) is covered by an index, so should vc whose expression is
f(...).
For example, if t.vf = t.c1 + t.c2, and t has three indexes (vf), (c1,
c2), (c1).
Before this change, vf's index covering is a singleton {(vf)}. Let's call
that the "conventional" index covering.
After this change vf's index covering is now {(vf), (c1, c2)}, since
(c1, c2) covers both c1 and c2. Let's call (c1, c2) in this case the
"extra" covering.
With the coverings updated, when an index in the "extra" covering is
chosen for keyread, the vcol also needs to be calculated. In this case
we mark vcol in the table read_set, and ensure it is computed.
With these changes, we see various improvements, including from using
full table scan + filesort to full index scan + filesort when ORDER BY
an indexed vcol (here vc = c + 1 is a vcol and both c and vc are
indexes):
explain select c + 1 from t order by vc;
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE t ALL NULL NULL NULL NULL 10000 Using filesort
+1 SIMPLE t index NULL c 5 NULL 10000 Using index; Using filesort
The substitutions are followed updates to all_fields which include a
copy of the ORDER BY/GROUP BY item pointers, as well as corresponding
updates to ref_pointer_array so that the all_fields and
ref_pointer_array remain in sync.
Another, related change is the recomputation of table index covering
on substitutions. It not only reflects the correct table index
covering after the substitutions, but also improve executions where
the vcol index can be chosen, such as this example (here vc = c + 1
and vc is the only index in the table), from full table scan +
filesort to full index scan:
select vc from t order by c + 1;
We do it in SELECT as well as in single table DELETE/UPDATE.
Remove one of the major sources of race condiitons in mariadb-test.
Normally, mariadb_close() sends COM_QUIT to the server and immediately
disconnects. In mariadb-test it means the test can switch to another
connection and sends queries to the server before the server even
started parsing the COM_QUIT packet and these queries can see the
connection as fully active, as it didn't reach dispatch_command yet.
This is a major source of instability in tests and many - but not all,
still less than a half - tests employ workarounds. The correct one
is a pair count_sessions.inc/wait_until_count_sessions.inc.
Also very popular was wait_until_disconnected.inc, which was completely
useless, because it verifies that the connection is closed, and after
disconnect it always is, it didn't verify whether the server processed
COM_QUIT. Sadly the placebo was as widely used as the real thing.
Let's fix this by making mariadb-test `disconnect` command _to wait_ for
the server to confirm. This makes almost all workarounds redundant.
In some cases count_sessions.inc/wait_until_count_sessions.inc is still
needed, though, as only `disconnect` command is changed:
* after external tools, like `exec $MYSQL`
* after failed `connect` command
* replication, after `STOP SLAVE`
* Federated/CONNECT/SPIDER/etc after `DROP TABLE`
and also in some XA tests, because an XA transaction is dissociated from
the THD very late, after the server has closed the client connection.
Collateral cleanups: fix comments, remove some redundant statements:
* DROP IF EXISTS if nothing is known to exist
* DROP table/view before DROP DATABASE
* REVOKE privileges before DROP USER
etc
Before MySQL 4.0.18, user-specified constraint names were ignored.
Starting with MySQL 4.0.18, the specified constraint name was
prepended with the schema name and '/'. Now we are transforming
into a format where the constraint name is prepended with the
dict_table_t::name and the impossible UTF-8 sequence 0xff.
Generated constraint names will be ASCII decimal numbers.
On upgrade, old FOREIGN KEY constraint names will be displayed
without any schema name prefix. They will be updated to the new
format on DDL operations.
dict_foreign_t::sql_id(): Return the SQL constraint name
without any schemaname/tablename\377 or schemaname/ prefix.
row_rename_table_for_mysql(), dict_table_rename_in_cache():
Simplify the logic: Just rename constraints to the new format.
dict_table_get_foreign_id(): Replaces dict_table_get_highest_foreign_id().
innobase_get_foreign_key_info(): Let my_error() refer to erroneous
anonymous constraints as "(null)".
row_delete_constraint(): Try to drop all 3 constraint name variants.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Matthias Leich
Starting with mysql/mysql-server@02f8eaa998
and commit 2e814d4702 the index ID of
indexes on virtual columns was being encoded insufficiently in
InnoDB undo log records. Only the least significant 32 bits were
being written. This could lead to some corruption of the affected
indexes on ROLLBACK, as well as to missed chances to remove some
history from such indexes when purging the history of committed
transactions that included DELETE or an UPDATE in the indexes.
dict_hdr_create(): In debug instrumented builds, initialize the
DICT_HDR_INDEX_ID close to the 32-bit barrier, instead of initializing
it to DICT_HDR_FIRST_ID (10). This will allow the changed code to
be exercised while running ./mtr --suite=gcol,vcol.
trx_undo_log_v_idx(): Encode large index->id in a similar way as
mysql/mysql-server@e00328b4d0
but using a different implementation.
trx_undo_read_v_idx_low(): Decode large index->id in a similar way
as mach_u64_read_much_compressed().
Reviewed by: Debarun Banerjee
stats_deinit(): Replaces dict_stats_deinit().
Deinitialize the statistics for persistent tables,
so that they will be reloaded or recalculated
on a subsequent ha_innobase::open().
ha_innobase::rename_table(): Invoke stats_deinit() so that the
subsequent ha_innobase::open() will reload the InnoDB persistent
statistics. That is, it will remain possible to have the InnoDB
persistent statistics reloaded by executing the following:
RENAME TABLE t TO tmp, tmp TO t;
dict_table_close(table): Replaced with table->release().
There will no longer be any logic that would attempt to ensure
that the InnoDB persistent statistics will be reloaded after
FLUSH TABLES has been executed. This also fixes the problem that
dict_table_t::stat_modified_counter would be frequently reset to 0,
whenever ha_innobase::open() is invoked after the table reference
count had dropped to 0.
dict_table_close(table, thd, mdl): Remove the parameter "dict_locked".
Do not try to invalidate the statistics.
ha_innobase::statistics_init(): Replaces dict_stats_init(table).
Reviewed by: Thirunarayanan Balathandayuthapani
* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)
(Review input addressed)
After this patch, the optimizer can handle virtual column expressions
in WHERE/ON clauses. If the table has an indexed virtual column:
ALTER TABLE t1
ADD COLUMN vcol INT AS (col1+1),
ADD INDEX idx1(vcol);
and the query uses the exact virtual column expression:
SELECT * FROM t1 WHERE col1+1 <= 100
then the optimizer will be able use index idx1 for it.
This is achieved by walking the WHERE/ON clauses and replacing instances
of virtual column expression (like "col1+1" above) with virtual column's
Item_field (like "vcol"). The latter can be processed by the optimizer.
Replacement is considered (and done) only in items that are potentially
usable to the range optimizer.
the information about index algorithm was stored in two
places inconsistently split between both.
BTREE index could have key->algorithm == HA_KEY_ALG_BTREE, if the user
explicitly specified USING BTREE or HA_KEY_ALG_UNDEF, if not.
RTREE index had key->algorithm == HA_KEY_ALG_RTREE
and always had key->flags & HA_SPATIAL
FULLTEXT index had key->algorithm == HA_KEY_ALG_FULLTEXT
and always had key->flags & HA_FULLTEXT
HASH index had key->algorithm == HA_KEY_ALG_HASH or HA_KEY_ALG_UNDEF
long unique index always had key->algorithm == HA_KEY_ALG_LONG_HASH
In this commit:
All indexes except BTREE and HASH always have key->algorithm
set, HA_SPATIAL and HA_FULLTEXT flags are not used anymore (except
for storage to keep frms backward compatible).
As a side effect ALTER TABLE now detects FULLTEXT index renames correctly
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.
The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.
It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a
record that does not exist in the current read view is made, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.
lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself on the directions of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.
The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.
Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.
Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.
In any test that uses wait_all_purged.inc, ensure that InnoDB tables
will be created without persistent statistics.
This is a follow-up to commit cd04673a17
after a similar failure was observed in the innodb_zip.blob test.
This imports and adapts a number of MySQL 5.7 test cases that are
applicable to MariaDB.
Some tests for old bug fixes are not that relevant because the code
has been refactored since then (especially starting with
MariaDB Server 10.6), and the tests would not reproduce the
original bug if the fix was reverted.
In the test innodb_fts.opt, there are many duplicate MATCH ranks, which
would make the results nondeterministic. The test was stabilized by
changing some LIMIT clauses or by adding sorted_result in those cases
where the purpose of a test was to show that no sorting took place
in the server.
In the test innodb_fts.phrase, MySQL 5.7 would generate FTS_DOC_ID that
are 1 larger than in MariaDB. In innodb_fts.index_table the difference is 2.
This is because in MariaDB, fts_get_next_doc_id() post-increments
cache->next_doc_id, while MySQL 5.7 pre-increments it.
Reviewed by: Thirunarayanan Balathandayuthapani
The motivation of introducing the parameter
innodb_purge_rseg_truncate_frequency in
mysql/mysql-server@28bbd66ea5 and
mysql/mysql-server@8fc2120fed
seems to have been to avoid stalls due to freeing undo log pages
or truncating undo log tablespaces. In MariaDB Server,
innodb_undo_log_truncate=ON should be a much lighter operation
than in MySQL, because it will not involve any log checkpoint.
Another source of performance stalls should be
trx_purge_truncate_rseg_history(), which is shrinking the history list
by freeing the undo log pages whose undo records have been purged.
To alleviate that, we will introduce a purge_truncation_task that will
offload this from the purge_coordinator_task. In that way, the next
innodb_purge_batch_size pages may be parsed and purged while the pages
from the previous batch are being freed and the history list being shrunk.
The processing of innodb_undo_log_truncate=ON will still remain the
responsibility of the purge_coordinator_task.
purge_coordinator_state::count: Remove. We will ignore
innodb_purge_rseg_truncate_frequency, and act as if it had been
set to 1 (the maximum shrinking frequency).
purge_coordinator_state::do_purge(): Invoke an asynchronous task
purge_truncation_callback() to free the undo log pages.
purge_sys_t::iterator::free_history(): Free those undo log pages
that have been processed. This used to be a part of
trx_purge_truncate_history().
purge_sys_t::clone_end_view(): Take a new value of purge_sys.head
as a parameter, so that it will be updated while holding exclusive
purge_sys.latch. This is needed for race-free access to the field
in purge_truncation_callback().
Reviewed by: Vladislav Lesin
The MDEV-29693 conflict resolution is from Monty, as well as is
a bug fix where ANALYZE TABLE wrongly built histograms for
single-column PRIMARY KEY.
Also includes a fix for safe_malloc error reporting.
Other things:
- Copied main.log_slow from 10.4 to avoid mtr issue
Disabled test:
- spider/bugfix.mdev_27239 because we started to get
+Error 1429 Unable to connect to foreign data source: localhost
-Error 1158 Got an error reading communication packets
- main.delayed
- Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED
This part is disabled for now as it fails randomly with different
warnings/errors (no corruption).
row_vers_vc_matches_cluster(): Invoke dtype_get_at_most_n_mbchars()
to extract the correct number of bytes corresponding to the number
of characters in a virtual column prefix index, just like we do in
row_sel_sec_rec_is_for_clust_rec().
The test case would occasionally reproduce the failure when this
fix is not present.
* Log rows in online_alter_binlog.
* Table online data is replicated within dedicated binlog file
* Cached data is written on commit.
* Versioning is fully supported.
* Works both wit and without binlog enabled.
* For now savepoints setup is forbidden while ONLINE ALTER goes on.
Extra support is required. We can simply log the SAVEPOINT query events
and replicate them together with row events. But it's not implemented
for now.
* Cache flipping:
We want to care for the possible bottleneck in the online alter binlog
reading/writing in advance.
IO_CACHE does not provide anything better that sequential access,
besides, only a single write is mutex-protected, which is not suitable,
since we should write a transaction atomically.
To solve this, a special layer on top Event_log is implemented.
There are two IO_CACHE files underneath: one for reading, and one for
writing.
Once the read cache is empty, an exclusive lock is acquired (we can wait
for a currently active transaction finish writing), and flip() is emitted,
i.e. the write cache is reopened for read, and the read cache is emptied,
and reopened for writing.
This reminds a buffer flip that happens in accelerated graphics
(DirectX/OpenGL/etc).
Cache_flip_event_log is considered non-blocking for a single reader and a
single writer in this sense, with the only lock held by reader during flip.
An alternative approach by implementing a fair concurrent circular buffer
is described in MDEV-24676.
* Cache managers:
We have two cache sinks: statement and transactional.
It is important that the changes are first cached per-statement and
per-transaction.
If a statement fails, then only statement data is rolled back. The
transaction moves along, however.
Turns out, there's no guarantee that TABLE well persist in
thd->open_tables to the transaction commit moment.
If an error occurs, tables from statement are purged.
Therefore, we can't store te caches in TABLE. Ideally, it should be
handlerton, but we cut the corner and store it in THD in a list.