- Replacing the old style inplace check_db_name() in make_table_name_list()
to the new style non-modifying code
- Adding "const" qualifier to the "db" parameter to ha_discover_table_names()
and its dependency functions.
The row events were applied "twice": once for the ha_partition, and one
more time for the underlying storage engine.
There's no such problem in binlog/rpl, because ha_partiton::row_logging
is normally set to false.
The fix makes the events replicate only when the handler is a root handler.
We will try to *guess* this by comparing it to table->file. The same
approach is used in the MDEV-21540 fix, 231feabd. The assumption is made,
that the row methods are only called for table->file (and never for a
cloned handler), hence the assertions are added in ha_innobase and
ha_myisam to make sure that this is true at least for those engines
Also closes MDEV-31040, however the test is not included, since we have no
convenient way to construct a deterministic version.
instead use only one (trx) IO_CACHE and truncate it if the
statement is rolled back.
don't use binlog_cache_mngr to accumulate the data,
use binlog_cache_data instead.
(binlog_cache_data owns one IO_CACHE, binlog_cache_mngr owns
two binlog_cache_data's, trx and stmt).
* Log rows in online_alter_binlog.
* Table online data is replicated within dedicated binlog file
* Cached data is written on commit.
* Versioning is fully supported.
* Works both wit and without binlog enabled.
* For now savepoints setup is forbidden while ONLINE ALTER goes on.
Extra support is required. We can simply log the SAVEPOINT query events
and replicate them together with row events. But it's not implemented
for now.
* Cache flipping:
We want to care for the possible bottleneck in the online alter binlog
reading/writing in advance.
IO_CACHE does not provide anything better that sequential access,
besides, only a single write is mutex-protected, which is not suitable,
since we should write a transaction atomically.
To solve this, a special layer on top Event_log is implemented.
There are two IO_CACHE files underneath: one for reading, and one for
writing.
Once the read cache is empty, an exclusive lock is acquired (we can wait
for a currently active transaction finish writing), and flip() is emitted,
i.e. the write cache is reopened for read, and the read cache is emptied,
and reopened for writing.
This reminds a buffer flip that happens in accelerated graphics
(DirectX/OpenGL/etc).
Cache_flip_event_log is considered non-blocking for a single reader and a
single writer in this sense, with the only lock held by reader during flip.
An alternative approach by implementing a fair concurrent circular buffer
is described in MDEV-24676.
* Cache managers:
We have two cache sinks: statement and transactional.
It is important that the changes are first cached per-statement and
per-transaction.
If a statement fails, then only statement data is rolled back. The
transaction moves along, however.
Turns out, there's no guarantee that TABLE well persist in
thd->open_tables to the transaction commit moment.
If an error occurs, tables from statement are purged.
Therefore, we can't store te caches in TABLE. Ideally, it should be
handlerton, but we cut the corner and store it in THD in a list.
Event_log is supposed to be a basic logging class that can write events in
a single file.
MYSQL_BIN_LOG in comparison will have:
* rotation support
* index files
* purging
* gtid and transactional information handling.
* is dedicated for a general-purpose binlog
* Eliminate most usages of THD::use_trans_table. Only 3 left, and they are
at quite high levels, and really essential.
* Eliminate is_transactional argument when possible. Lots of places are
left though, because of some WSREP error handling in
MYSQL_BIN_LOG::set_write_error.
* Remove junk binlog functions from THD
* binlog_prepare_pending_rows_event is moved to log.cc inside MYSQL_BIN_LOG
and is not anymore template. Instead it accepls event factory with a type
code, and a callback to a constructing function in it.
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity
Example output:
# Pages_accessed: 184 Pages_read: 95 Pages_updated: 0 Old_rows_read: 1
# Pages_read_time: 17.0204 Engine_time: 248.1297
Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.
The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.
Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.
Implementation details:
class ha_handler_stats holds all engine stats. This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.
handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.
Cloned or partition tables have the pointer set to the base table if
status are requested.
There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
the handler_status as part of table_init(). Can be optimized in the
future to only do this is log-slow-verbosity changes. For this to work
we have to update handler_status for all opened partitions and
also for all partitions opened in the future.
Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
always THD first (generates better code as other functions also have THD
first).
MDEV-29253 Detect incompatible MySQL partition scheme and either convert
them or report to user and in error log.
This task is about converting in place MySQL 5.6 and 5.7 partition tables
to MariaDB as part of mariadb-upgrade.
- Update TABLE_SHARE::init_from_binary_frm_image() to be able to read
MySQL frm files with partitions.
- Create .par file, if it do not exists, on open of partitioned table.
Executing mariadb-upgrade will create all the missing .par files.
The MySQL .frm file will be changed to MariaDB format after next
ALTER TABLE.
Other changes:
- If we are using stored mysql_version to distingush between MySQL and
MariaDB .frm file information, do not upgrade mysql_version in the
.frm file as part of CHECK TABLE .. FOR UPGRADE as this would cause
problems next time we parse the .frm file.
The problem seems to be a deadlock between KILL command execution
and BF abort issued by an applier, where:
* KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
* Applier has innodb side global lock mutex and victim trx mutex.
* KILL is calling innobase_kill_query, and is blocked by innodb
global lock mutex.
* Applier is in wsrep_innobase_kill_one_trx and is blocked by
victim's LOCK_thd_kill.
The fix in this commit removes the TOI replication of KILL command
and makes KILL execution less intrusive operation. Aborting the
victim happens now by using awake_no_mutex() and ha_abort_transaction().
If the KILL happens when the transaction is committing, the
KILL operation is postponed to happen after the statement
has completed in order to avoid KILL to interrupt commit
processing.
Notable changes in this commit:
* wsrep client connections's error state may remain sticky after
client connection is closed. This error message will then pop
up for the next client session issuing first SQL statement.
This problem raised with test galera.galera_bf_kill.
The fix is to reset wsrep client error state, before a THD is
reused for next connetion.
* Release THD locks in wsrep_abort_transaction when locking
innodb mutexes. This guarantees same locking order as with applier
BF aborting.
* BF abort from MDL was changed to do BF abort on server/wsrep-lib
side first, and only then do the BF abort on InnoDB side. This
removes the need to call back from InnoDB for BF aborts which originate
from MDL and simplifies the locking.
* Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
The manipulation of the wsrep_aborter can be done solely on
server side. Moreover, it is now debug only variable and
could be excluded from optimized builds.
* Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more
fine grained locking for SR BF abort which may require locking
of victim LOCK_thd_kill. Added explicit call for
wsrep_thd_kill_LOCK/UNLOCK where appropriate.
* Wsrep-lib was updated to version which allows external
locking for BF abort calls.
Changes to MTR tests:
* Disable galera_bf_abort_group_commit. This test is going to
be removed (MDEV-30855).
* Record galera_gcache_recover_manytrx as result file was incomplete.
Trivial change.
* Make galera_create_table_as_select more deterministic:
Wait until CTAS execution has reached MDL wait for multi-master
conflict case. Expected error from multi-master conflict is
ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open
wsrep transaction when it is waiting for MDL, query gets interrupted
instead of BF aborted. This should be addressed in separate task.
* A new test galera_kill_group_commit to verify correct behavior
when KILL is executed while the transaction is committing.
Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
If a replica failed to update the GTID slave state when committing
an XA PREPARE, the replica would retry the transaction and get an
out-of-order GTID error. This is because the commit phase of an XA
PREPARE is bifurcated. That is, first, the prepare is handled by the
relevant storage engines. Then second, the GTID slave state is
updated as a separate autocommit transaction. If the second phase
fails, and the transaction is retried, then the same transaction is
attempted to be committed again, resulting in a GTID out-of-order
error.
This patch fixes this error by immediately stopping the slave and
reporting the appropriate error. That is, there was logic to bypass
the error when updating the GTID slave state table if the underlying
error is allowed for retry on a parallel slave. This patch adds a
parameter to disallow the error bypass, thereby forcing the error
state to still happen.
Reviewed By
============
Andrei Elkin <andrei.elkin@mariadb.com>
If we are inside stored function or trigger we should not commit
or rollback current statement transaction.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
This error was discovered while working on
MDEV-30540 Wrong result with IN list length reaching
IN_PREDICATE_CONVERSION_THRESHOLD
If there is read error from handler::ha_rnd_next() during a recursive
query, st_select_lex_unit::exec_recursive() will crash as it will try to
get the error code from a structure that was deleted by the callee.
The code was using the construct:
sl->join->exec();
saved_error=sl->join->error;
This does not work as sl->join was freed by the exec() and sl->join would
be set to 0.
Fixed by having JOIN::exec() return the error code.
The included test case simulates the error in ha_rnd_next(), which causes
a crash without the patch.
scovered whle working on
MDEV-30540 Wrong result with IN list length reaching
IN_PREDICATE_CONVERSION_THRESHOLD
If there is read error from handler::ha_rnd_next() during a recursive
query, st_select_lex_unit::exec_recursive() will crash as it will try to
get the error code from a structure that was deleted by the callee.
The code was using the construct:
sl->join->exec();
saved_error=sl->join->error;
This does not work as sl->join was freed by the exec() and sl->join was
set to 0.
Fixed by having JOIN::exec() return the error code.
The included test case simulates the error in ha_rnd_next(), which causes
a crash without the patch.
1. Adding a separate MY_COLLATION_HANDLER
my_collation_ucs2_general_mysql500_ci_handler
implementing a proper order for ucs2_general_mysql500_ci
The problem happened because ucs2_general_mysql500_ci
erroneously used my_collation_ucs2_general_ci_handler.
2. Cosmetic changes: Renaming:
- plane00_mysql500 to my_unicase_mysql500_page00
- my_unicase_pages_mysql500 to my_unicase_mysql500_pages
to use the same naming style with:
- my_unicase_default_page00
- my_unicase_defaul_pages
3. Moving code fragments from
- handler::check_collation_compatibility() in handler.cc
- upgrade_collation() in table.cc
into new methods in class Charset, to reuse the code easier.
The reason for this is that we call file->index_flags(index, 0, 1)
multiple times in best_access_patch()when optimizing a table.
For example, in InnoDB, the calls is not trivial (4 if's and 2 assignments)
Now the function is inlined and is just a memory reference.
Other things:
- handler::is_clustering_key() and pk_is_clustering_key() are now inline.
- Added TABLE::can_use_rowid_filter() to simplify some code.
- Test if we should use a rowid_filter only if can_use_rowid_filter() is
true.
- Added TABLE::is_clustering_key() to avoid a memory reference.
- Simplify some code using the fact that HA_KEYREAD_ONLY is true implies
that HA_CLUSTERED_INDEX is false.
- Added DBUG_ASSERT to TABLE::best_range_rowid_filter() to ensure we
do not call it with a clustering key.
- Reorginized elements in struct st_key to get better memory alignment.
- Updated ha_innobase::index_flags() to not have
HA_DO_RANGE_FILTER_PUSHDOWN for clustered index
"select * from information_schema.tables limit 1" was giving the following
warning in the log:
[ERROR] Invalid (old?) table or database name '#rocksdb'
This includes:
- cleanup and optimization of filtering and pushdown engine code.
- Adjusted costs for rowid filters (based on extensive testing
and profiling).
This made a small two changes to the handler_rowid_filter_is_active()
API:
- One should not call it with a zero pointer!
- One does not need to call handler_rowid_filter_is_active() for every
row anymore. It is enough to check if filter is active by calling it
call it during index_init() or when handler::rowid_filter_changed()
is called
The changes was to avoid unnecessary function calls and checks if
pushdown conditions and rowid_filter is not used.
Updated costs for rowid_filter_lookup() to be closer to reality.
The old cost was based only on rowid_compare_cost. This is now
changed to take into account the overhead in checking the rowid.
Changed the Range_rowid_filter class to use DYNAMIC_ARRAY directly
instead of Dynamic_array<>. This was done to be able to use the new
append_dynamic() functions which gives a notable speed improvment
compared to the old code. Removing the abstraction also makes
the code easier to understand.
The cost of filtering is now slightly lower than before, which
is reflected in some test cases that is now using rowid filters.
This includes all test changes from
"Changing all cost calculation to be given in milliseconds"
and forwards.
Some of the things that caused changes in the result files:
- As part of fixing tests, I added 'echo' to some comments to be able to
easier find out where things where wrong.
- MATERIALIZED has now a higher cost compared to X than before. Because
of this some MATERIALIZED types have changed to DEPENDEND SUBQUERY.
- Some test cases that required MATERIALIZED to repeat a bug was
changed by adding more rows to force MATERIALIZED to happen.
- 'Filtered' in SHOW EXPLAIN has in many case changed from 100.00 to
something smaller. This is because now filtered also takes into
account the smallest possible ref access and filters, even if they
where not used. Another reason for 'Filtered' being smaller is that
we now also take into account implicit filtering done for subqueries
using FIRSTMATCH.
(main.subselect_no_exists_to_in)
This is caluculated in best_access_path() and stored in records_out.
- Table orders has changed because more accurate costs.
- 'index' and 'ALL' for small tables has changed to use 'range' or
'ref' because of optimizer_scan_setup_cost.
- index can be changed to 'range' as 'range' optimizer assumes we don't
have to read the blocks from disk that range optimizer has already read.
This can be confusing in the case where there is no obvious where clause
but instead there is a hidden 'key_column > NULL' added by the optimizer.
(main.subselect_no_exists_to_in)
- Scan on primary clustered key does not report 'Using Index' anymore
(It's a table scan, not an index scan).
- For derived tables, the number of rows is now 100 instead of 2,
which can be seen in EXPLAIN.
- More tests have "Using index for group by" as the cost of this
optimization is now more correct (lower).
- A primary key could be preferred for a normal key, even if it would
access more rows, as it's faster to do 1 lokoup and 3 'index_next' on a
clustered primary key than one lookup trough a secondary.
(main.stat_tables_innodb)
Notes:
- There was a 4.7% more calls to best_extension_by_limited_search() in
the main.greedy_optimizer test. However examining the test results
it looked that the plans where slightly better (eq_ref where more
chained together) so I assume this is ok.
- I have verified a few test cases where there was notable/unexpected
changes in the plan and in all cases the new optimizer plans where
faster. (main.greedy_optimizer and some others)