mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-08-07 00:04:31 +03:00

Author	SHA1	Message	Date
Nikita Malyavin	70491fb07b	MDEV-31677 Assertion failed upon online ALTER with binlog_row_image=NOBLOB Make binlog_prepare_row_images accept image type as an argument.	2023-08-15 14:00:28 +02:00
Nikita Malyavin	b3f988d260	Add const to get_foreign_key_list/get_parent_foreign_key_list	2023-08-15 10:16:13 +02:00
Nikita Malyavin	c76072db93	MDEV-31033 ER_KEY_NOT_FOUND upon online COPY ALTER on a partitioned table The row events were applied "twice": once for the ha_partition, and one more time for the underlying storage engine. There's no such problem in binlog/rpl, because ha_partiton::row_logging is normally set to false. The fix makes the events replicate only when the handler is a root handler. We will try to guess this by comparing it to table->file. The same approach is used in the MDEV-21540 fix, `231feabd`. The assumption is made, that the row methods are only called for table->file (and never for a cloned handler), hence the assertions are added in ha_innobase and ha_myisam to make sure that this is true at least for those engines Also closes MDEV-31040, however the test is not included, since we have no convenient way to construct a deterministic version.	2023-08-15 10:16:13 +02:00
Sergei Golubchik	ea46fdcea4	cleanup, remove dead code	2023-08-15 10:16:12 +02:00
Sergei Golubchik	64b55151f4	separate online_alter_cache_data from binlog_cache_data	2023-08-15 10:16:12 +02:00
Sergei Golubchik	d767ed5c89	remove handler::open_read_view() ht->start_consistent_snapshot() is also not a way, because some engines (e.g. rocksdb) only do it readonly. instead, downgrade the lock after reading the first row (which implicitly opens a read view).	2023-08-15 10:16:11 +02:00
Nikita Malyavin	ab4bfad206	MDEV-16329 [5/5] ALTER ONLINE TABLE * Log rows in online_alter_binlog. * Table online data is replicated within dedicated binlog file * Cached data is written on commit. * Versioning is fully supported. * Works both wit and without binlog enabled. * For now savepoints setup is forbidden while ONLINE ALTER goes on. Extra support is required. We can simply log the SAVEPOINT query events and replicate them together with row events. But it's not implemented for now. * Cache flipping: We want to care for the possible bottleneck in the online alter binlog reading/writing in advance. IO_CACHE does not provide anything better that sequential access, besides, only a single write is mutex-protected, which is not suitable, since we should write a transaction atomically. To solve this, a special layer on top Event_log is implemented. There are two IO_CACHE files underneath: one for reading, and one for writing. Once the read cache is empty, an exclusive lock is acquired (we can wait for a currently active transaction finish writing), and flip() is emitted, i.e. the write cache is reopened for read, and the read cache is emptied, and reopened for writing. This reminds a buffer flip that happens in accelerated graphics (DirectX/OpenGL/etc). Cache_flip_event_log is considered non-blocking for a single reader and a single writer in this sense, with the only lock held by reader during flip. An alternative approach by implementing a fair concurrent circular buffer is described in MDEV-24676. * Cache managers: We have two cache sinks: statement and transactional. It is important that the changes are first cached per-statement and per-transaction. If a statement fails, then only statement data is rolled back. The transaction moves along, however. Turns out, there's no guarantee that TABLE well persist in thd->open_tables to the transaction commit moment. If an error occurs, tables from statement are purged. Therefore, we can't store te caches in TABLE. Ideally, it should be handlerton, but we cut the corner and store it in THD in a list.	2023-08-15 10:16:11 +02:00
Nikita Malyavin	d2d0995cf2	MDEV-16329 [4/5] Refactor MYSQL_BIN_LOG: extract Event_log ancestor Event_log is supposed to be a basic logging class that can write events in a single file. MYSQL_BIN_LOG in comparison will have: * rotation support * index files * purging * gtid and transactional information handling. * is dedicated for a general-purpose binlog	2023-08-15 10:16:11 +02:00
Nikita Malyavin	6427e343cf	MDEV-16329 [3/5] use binlog_cache_data directly in most places * Eliminate most usages of THD::use_trans_table. Only 3 left, and they are at quite high levels, and really essential. * Eliminate is_transactional argument when possible. Lots of places are left though, because of some WSREP error handling in MYSQL_BIN_LOG::set_write_error. * Remove junk binlog functions from THD * binlog_prepare_pending_rows_event is moved to log.cc inside MYSQL_BIN_LOG and is not anymore template. Instead it accepls event factory with a type code, and a callback to a constructing function in it.	2023-08-15 10:16:11 +02:00
Nikita Malyavin	429f635f30	MDEV-16329 [2/5] refactor binlog and cache_mngr pump up binlog and cache manager to level of binlog_log_row_internal	2023-08-15 10:16:11 +02:00
Oleksandr Byelkin	f5fae75652	Merge branch '11.0' into 11.1	2023-08-09 08:25:14 +02:00
Oleksandr Byelkin	51f9d62005	Merge branch '10.11' into 11.0	2023-08-09 07:53:48 +02:00
Oleksandr Byelkin	036df5f970	Merge branch '10.10' into 10.11	2023-08-08 14:57:31 +02:00
Oleksandr Byelkin	34a8e78581	Merge branch '10.6' into 10.9	2023-08-04 08:01:06 +02:00
Oleksandr Byelkin	5ea5291d97	Merge branch '10.5' into 10.6	2023-08-04 07:52:54 +02:00
Sergei Golubchik	61acb43689	MDEV-31822 ALTER TABLE ENGINE=x started failing instead of producing warning on unsupported TRANSACTIONAL=1 make TRANSACTIONAL table option behave similar to other engine-defined table options. If the engine doesn't suport it: * if specified expicitly in CREATE or ALTER - it's ER_UNKNOWN_OPTION * an error or a warning depending on sql_mode IGNORE_BAD_TABLE_OPTIONS * in ALTER TABLE from the engine that suppors it to the engine that doesn't - silently preserved (no warning) * it is commented out in SHOW CREATE unless IGNORE_BAD_TABLE_OPTIONS	2023-08-02 14:45:31 +02:00
Sergei Golubchik	ab1191c039	cleanup: key->key_create_info.check_for_duplicate_indexes -> key->old mark old keys in the ALTER TABLE with the `old` flag, not with the `key_create_info.check_for_duplicate_indexes`. This allows to mark old foreign keys too.	2023-08-01 22:43:16 +02:00
Marko Mäkelä	e81fa34502	Merge 11.1 into 11.2	2023-07-26 15:49:24 +03:00
Marko Mäkelä	c6ac1e39b6	Merge 11.0 into 11.1	2023-07-26 15:13:43 +03:00
Marko Mäkelä	f2b4972bd4	Merge 10.11 into 11.0	2023-07-26 15:13:06 +03:00
Marko Mäkelä	bce3ee704f	Merge 10.10 into 10.11	2023-07-26 14:44:43 +03:00
Alexander Barkov	75f25e4ca7	MDEV-30164 System variable for default collations This patch adds a way to override default collations (or "character set collations") for desired character sets. The SQL standard says: > Each collation known in an SQL-environment is applicable to one > or more character sets, and for each character set, one or more > collations are applicable to it, one of which is associated with > it as its character set collation. In MariaDB, character set collations has been hard-coded so far, e.g. utf8mb4_general_ci has been a hard-coded character set collation for utf8mb4. This patch allows to override (globally per server, or per session) character set collations, so for example, uca1400_ai_ci can be set as a character set collation for Unicode character sets (instead of compiled xxx_general_ci). The array of overridden character set collations is stored in a new (session and global) system variable @@character_set_collations and can be set as a comma separated list of charset=collation pairs, e.g.: SET @@character_set_collations='utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci'; The variable is empty by default, which mean use the hard-coded character set collations (e.g. utf8mb4_general_ci for utf8mb4). The variable can also be set globally by passing to the server startup command line, and/or in my.cnf.	2023-07-17 14:56:17 +04:00
Marko Mäkelä	7cde5c539b	Merge 10.6 into 10.9	2023-07-10 11:22:21 +03:00
Monty	99bd226059	MDEV-31558 Add InnoDB engine information to the slow query log The new statistics is enabled by adding the "engine", "innodb" or "full" option to --log-slow-verbosity Example output: # Pages_accessed: 184 Pages_read: 95 Pages_updated: 0 Old_rows_read: 1 # Pages_read_time: 17.0204 Engine_time: 248.1297 Page_read_time is time doing physical reads inside a storage engine. (Writes cannot be tracked as these are usually done in the background). Engine_time is the time spent inside the storage engine for the full duration of the read/write/update calls. It uses the same code as 'analyze statement' for calculating the time spent. The engine statistics is done with a generic interface that should be easy for any engine to use. It can also easily be extended to provide even more statistics. Currently only InnoDB has counters for Pages_% and Undo_% status. Engine_time works for all engines. Implementation details: class ha_handler_stats holds all engine stats. This class is included in handler and THD classes. While a query is running, all statistics is updated in the handler. In close_thread_tables() the statistics is added to the THD. handler::handler_stats is a pointer to where statistics should be collected. This is set to point to handler::active_handler_stats if stats are requested. If not, it is set to 0. handler_stats has also an element, 'active' that is 1 if stats are requested. This is to allow engines to avoid doing any 'if's while updating the statistics. Cloned or partition tables have the pointer set to the base table if status are requested. There is a small performance impact when using --log-slow-verbosity=engine: - All engine calls in 'select' will be timed. - IO calls for InnoDB reads will be timed. - Incrementation of counters are done on local variables and accesses are inline, so these should have very little impact. - Statistics has to be reset for each statement for the THD and each used handler. This is only 40 bytes, which should be neglectable. - For partition tables we have to loop over all partitions to update the handler_status as part of table_init(). Can be optimized in the future to only do this is log-slow-verbosity changes. For this to work we have to update handler_status for all opened partitions and also for all partitions opened in the future. Other things: - Added options 'engine' and 'full' to log-slow-verbosity. - Some of the new files in the test suite comes from Percona server, which has similar status information. - buf_page_optimistic_get(): Do not increment any counter, since we are only validating a pointer, not performing any buf_pool.page_hash lookup. - Added THD argument to save_explain_data_intern(). - Switched arguments for save_explain_.*_data() to have always THD first (generates better code as other functions also have THD first).	2023-07-07 12:53:18 +03:00
Marko Mäkelä	3883eb63dc	Merge 11.0 into 11.1	2023-06-08 14:09:21 +03:00
Marko Mäkelä	5fb2c031f7	Merge 10.11 into 11.0	2023-06-08 13:49:48 +03:00
Monty	ded4ed3220	MDEV-30944 Range_rowid_filter::fill() leaves file->keyread at MAX_KEY This test case exposed 2 different bugs: - When replacing a range with an index scan on a covering key in test_if_skip_sort_order() we didn't disable filtering. Filtering does not make much sense in this case. - Fixed by disabling filtering in this case. - Range_rowid_filter::fill() did not take into account that keyread could already active, which caused an assert when it tried to activate another keyread. - Fixed by remembering old keyread state at start and restoring it at end. Other things: - ha_start_keyread() allowed multiple calls. This is wrong, especially as we do no check if the index changed! I added an assert() to ensure that we don't call it there is already an active keyread. - ha_end_keyread() always called ha_extra(), even if keyread was not active. Added a check to avoid the extra call.	2023-06-07 18:44:12 +03:00
Monty	07b02ab40e	MDEV-31356: Range cost calculations does not take into account join_buffer This patch also fixes MDEV-31391 Assertion `((best.records_out) == 0.0 ... failed Cost changes caused by this change: - range queries with join buffer now have a notable smaller cost. - range ranges are bit more expensive as the MULTI_RANGE_COST is now properly applied to it in all cases (this extra cost is equal to a key lookup). - table scan cost is slight smaller as we now assume data is cached in the engine after the first scan pass. (We did this before for range scans and other access methods). - partition tables had wrong values for max_row_blocks and max_index_blocks. Correcting this, causes range access on partitioned tables to have slightly higher cost because of the increased estimated IO. - Using first match + join buffer caused 'filtered' to be calcualted wrong. (Only affected EXPLAIN, not query costs). - Added cost_without_join_buffer to optimizer_trace. - check_quick_select() adjusted the number of rows according to persistent statistics, but did not adjust cost. Now fixed. The big change in the patch are: - In best_access_path(), where we now are using storing the cost in 'ALL_READ_COST cost' and only converting it to a double at the end. This allows us to more exactly calculate the effect of the join_cache. - In JOIN_TAB::estimate_scan_time(), store the cost also in a ALL_READ_COST object. One of effect if this change is that when joining very small tables: t1 some_access_method t2 range t3 ALL Use join buffer This is swiched to t1 some_access_method t3 ALL t2 range use join buffer Both plans has the same cost, but as table scan in this case has less cost than rang, the table scan will be considered first and thus have precidence. Test case changes: - optimizer_trace - Addition of cost_without_join_buffer - subselect_mat_cost_bugs - Small tables and scan versus range - range & range_mrr_icp - Range + join_cache is faster than ref - optimizer_trace - cost_without_join_buffer, smaller scan cost, range setup cost. - mrr - range+join_buffer used as smaller cost	2023-06-07 18:42:58 +03:00
Marko Mäkelä	c04284e747	Merge 10.10 into 10.11	2023-06-07 15:01:43 +03:00
Oleg Smirnov	3118132228	MDEV-25080 Allow pushdown of UNIONs to foreign engines Allow queries of multiple SELECTs combined together with UNIONs/EXCEPTs/INTERSECTs to be pushed down to foreign engines. If the foreign engine provides an interface method "create_unit" and the UNIT is a top-level unit of the SQL query then the server tries to push the whole SELECT_LEX_UNIT down to the engine for execution. The engine should perform necessary checks and if they succeed, execute the query. If the engine is unable to execute the whole unit, then another attempt is made to push down SELECTs composing the unit separately using the "create_select" interface method. In this case the results of separate SELECTs are combined at the server side thus composing the final result	2023-06-05 20:15:57 +02:00
Marko Mäkelä	0796b7ad5e	Merge 10.6 into 10.9	2023-05-22 09:13:51 +03:00
Teemu Ollakka	f307160218	MDEV-29293 MariaDB stuck on starting commit state This commit contains a merge from 10.5-MDEV-29293-squash into 10.6. Although the bug MDEV-29293 was not reproducible with 10.6, the fix contains several improvements for wsrep KILL query and BF abort handling, and addresses the following issues: * MDEV-30307 KILL command issued inside a transaction is problematic for galera replication: This commit will remove KILL TOI replication, so Galera side transaction context is not lost during KILL. * MDEV-21075 KILL QUERY maintains nodes data consistency but breaks GTID sequence: This is fixed as well as KILL does not use TOI, and thus does not change GTID state. * MDEV-30372 Assertion in wsrep-lib state: This was caused by BF abort or KILL when local transaction was in the middle of group commit. This commit disables THD::killed handling during commit, so the problem is avoided. * MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim in trx0trx.h:1065: The assertion happened when the victim was BF aborted via MDL while it was committing. This commit changes MDL BF aborts so that transactions which are committing cannot be BF aborted via MDL. The RQG grammar attached in the issue could not reproduce the crash anymore. Original commit message from 10.5 fix: MDEV-29293 MariaDB stuck on starting commit state The problem seems to be a deadlock between KILL command execution and BF abort issued by an applier, where: * KILL has locked victim's LOCK_thd_kill and LOCK_thd_data. * Applier has innodb side global lock mutex and victim trx mutex. * KILL is calling innobase_kill_query, and is blocked by innodb global lock mutex. * Applier is in wsrep_innobase_kill_one_trx and is blocked by victim's LOCK_thd_kill. The fix in this commit removes the TOI replication of KILL command and makes KILL execution less intrusive operation. Aborting the victim happens now by using awake_no_mutex() and ha_abort_transaction(). If the KILL happens when the transaction is committing, the KILL operation is postponed to happen after the statement has completed in order to avoid KILL to interrupt commit processing. Notable changes in this commit: * wsrep client connections's error state may remain sticky after client connection is closed. This error message will then pop up for the next client session issuing first SQL statement. This problem raised with test galera.galera_bf_kill. The fix is to reset wsrep client error state, before a THD is reused for next connetion. * Release THD locks in wsrep_abort_transaction when locking innodb mutexes. This guarantees same locking order as with applier BF aborting. * BF abort from MDL was changed to do BF abort on server/wsrep-lib side first, and only then do the BF abort on InnoDB side. This removes the need to call back from InnoDB for BF aborts which originate from MDL and simplifies the locking. * Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h. The manipulation of the wsrep_aborter can be done solely on server side. Moreover, it is now debug only variable and could be excluded from optimized builds. * Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more fine grained locking for SR BF abort which may require locking of victim LOCK_thd_kill. Added explicit call for wsrep_thd_kill_LOCK/UNLOCK where appropriate. * Wsrep-lib was updated to version which allows external locking for BF abort calls. Changes to MTR tests: * Disable galera_bf_abort_group_commit. This test is going to be removed (MDEV-30855). * Make galera_var_retry_autocommit result more readable by echoing cases and expectations into result. Only one expected result for reap to verify that server returns expected status for query. * Record galera_gcache_recover_manytrx as result file was incomplete. Trivial change. * Make galera_create_table_as_select more deterministic: Wait until CTAS execution has reached MDL wait for multi-master conflict case. Expected error from multi-master conflict is ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open wsrep transaction when it is waiting for MDL, query gets interrupted instead of BF aborted. This should be addressed in separate task. * A new test galera_bf_abort_registering to check that registering trx gets BF aborted through MDL. * A new test galera_kill_group_commit to verify correct behavior when KILL is executed while the transaction is committing. Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi> Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com> Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-05-22 00:42:05 +02:00
Igor Babaev	3a9358a410	MDEV-28883 Re-design the upper level of handling UPDATE and DELETE statements This patch introduces a new way of handling UPDATE and DELETE commands at the top level after the parsing phase. This new way of processing update and delete statements can be seen in the implementation of the prepare() and execute() methods from the new Sql_cmd_dml class. This class derived from the Sql_cmd class can be considered as an interface class for processing such commands as SELECT, INSERT, UPDATE, DELETE and other comands manipulating data in tables. With this patch processing of update and delete statements after parsing proceeds by the following schema: - precheck of the access rights is performed for the used tables - the used tables are opened - context analysis phase is performed for the statement - the used tables are locked - the statement is optimized and executed - clean-up is performed for the statement The implementation of the method Sql_cmd_dml::execute() adheres this schema. The virtual functions of the class Sql_cmd_dml used for precheck of the access rights, context analysis, optimization and execution allow to adjust this schema for processing data manipulation statements of any types. This schema of processing data manipulation statements is taken from the current MySQL code. Moreover the definition the class Sql_cmd_dml introduced in this patch is almost a full replica of such class in the existing MySQL. However the implementation of the derived classes for update and delete statements is quite different. This implementation employs the JOIN class for all kinds of update and delete statements. It allows to perform main bulk of context analysis actions by the function JOIN::prepare(). This guarantees that characteristics and properties of the statement tree discovered for optimization phase when doing context analysis are the same for single-table and multi-table updates and deletes. With this patch the following functions are gone: mysql_prepare_update(), mysql_multi_update_prepare(), mysql_update(), mysql_multi_update(), mysql_prepare_delete(), mysql_multi_delete_prepare(), mysql_delete(). The code within these functions have been used as much as possible though. The functions mysql_test_update() and mysql_test_delete() are also not needed anymore. The method Sql_cmd_dml::prepare() serves processing - update/delete statement - PREPARE stmt FROM "<update/delete statement>" - EXECUTE stmt when stmt is prepared from update/delete statement. Approved by Oleksandr Byelkin <sanja@mariadb.com>	2023-03-15 17:35:22 -07:00
Marko Mäkelä	2e431ff7e6	Merge 10.11 into 11.0	2023-02-16 13:34:45 +02:00
Marko Mäkelä	1fd0099839	Merge 10.10 into 10.11	2023-02-16 11:41:18 +02:00
Marko Mäkelä	0d55914d96	Merge 10.8 into 10.9	2023-02-16 10:25:34 +02:00
Marko Mäkelä	dbab3e8d90	Merge 10.6 into 10.8	2023-02-10 13:43:53 +02:00
Marko Mäkelä	6aec87544c	Merge 10.5 into 10.6	2023-02-10 13:03:01 +02:00
Monty	0a7d291756	MDEV-30328 Assertion `avg_io_cost != 0.0 \|\| index_cost.io + row_cost.io == 0' failed in Cost_estimate::total_cost() The assert was there to check that engines reports sensible numbers for IO. However this does not work in case of optimizer_disk_read_ratio=0. Fixed by removing the assert.	2023-02-10 12:58:50 +02:00
Monty	e3f56254a8	MDEV-30098 Server crashes in ha_myisam::index_read_map with index_merge_sort_intersection=on Fixes also MDEV-30104 Server crashes in handler_rowid_filter_check upon ANALYZE TABLE cancel_pushed_rowid_filter() didn't inform the handler that rowid_filter was canceled.	2023-02-10 12:58:50 +02:00
Monty	3fa99f0c0e	Change cost for REF to take into account cost for 1 extra key read_next The main difference in code path between EQ_REF and REF is that for REF we have to do an extra read_next on the index to check that there is no more matching rows. Before this patch we added a preference of EQ_REF by ensuring that REF would always estimate to find at least 2 rows. This patch adds the cost of the extra key read_next to REF access and removes the code that limited REF to at least 2 rows. For some queries this can have a big effect as the total estimated rows will be halved for each REF table with 1 rows. multi_range cost calculations are also changed to take into account the difference between EQ_REF and REF. The effect of the patch to the test suite: - About 80 test case changed - Almost all changes where for EXPLAIN where estimated rows for REF where changed from 2 to 1. - A few test cases using explain extended had a change of 'filtered'. This is because of the estimated rows are now closer to the calculated selectivity. - A very few test had a change of table order. This is because the change of estimated rows from 2 to 1 or the small cost change for REF (main.subselect_sj_jcl6, main.group_by, main.dervied_cond_pushdown, main.distinct, main.join_nested, main.order_by, main.join_cache) - No key statistics and the estimated rows are now smaller which cased estimated filtering to be lower. (main.subselect_sj_mat) - The number of total rows are halved. (main.derived_cond_pushdown) - Plans with 1 row changed to use RANGE instead of REF. (main.group_min_max) - ALL changed to REF (main.key_diff) - Key changed from ref + index_only to PRIMARY key for InnoDB, as OPTIMIZER_ROW_LOOKUP_COST + OPTIMIZER_ROW_NEXT_FIND_COST is smaller than OPTIMIZER_KEY_LOOKUP_COST + OPTIMIZER_KEY_NEXT_FIND_COST. (main.join_outer_innodb) - Cost changes printouts (main.opt_trace*) - Result order change (innodb_gis.rtree)	2023-02-10 12:58:50 +02:00
Marko Mäkelä	c41c79650a	Merge 10.4 into 10.5	2023-02-10 12:02:11 +02:00
Vicențiu Ciorbaru	08c852026d	Apply clang-tidy to remove empty constructors / destructors This patch is the result of running run-clang-tidy -fix -header-filter=.* -checks='-,modernize-use-equals-default' . Code style changes have been done on top. The result of this change leads to the following improvements: 1. Binary size reduction. For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by ~400kb. * A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb. 2. Compiler can better understand the intent of the code, thus it leads to more optimization possibilities. Additionally it enabled detecting unused variables that had an empty default constructor but not marked so explicitly. Particular change required following this patch in sql/opt_range.cc result_keys, an unused template class Bitmap now correctly issues unused variable warnings. Setting Bitmap template class constructor to default allows the compiler to identify that there are no side-effects when instantiating the class. Previously the compiler could not issue the warning as it assumed Bitmap class (being a template) would not be performing a NO-OP for its default constructor. This prevented the "unused variable warning".	2023-02-09 16:09:08 +02:00
Monty	ed0a723566	Cache file->index_flags(index, 0, 1) in table->key_info[index].index_flags The reason for this is that we call file->index_flags(index, 0, 1) multiple times in best_access_patch()when optimizing a table. For example, in InnoDB, the calls is not trivial (4 if's and 2 assignments) Now the function is inlined and is just a memory reference. Other things: - handler::is_clustering_key() and pk_is_clustering_key() are now inline. - Added TABLE::can_use_rowid_filter() to simplify some code. - Test if we should use a rowid_filter only if can_use_rowid_filter() is true. - Added TABLE::is_clustering_key() to avoid a memory reference. - Simplify some code using the fact that HA_KEYREAD_ONLY is true implies that HA_CLUSTERED_INDEX is false. - Added DBUG_ASSERT to TABLE::best_range_rowid_filter() to ensure we do not call it with a clustering key. - Reorginized elements in struct st_key to get better memory alignment. - Updated ha_innobase::index_flags() to not have HA_DO_RANGE_FILTER_PUSHDOWN for clustered index	2023-02-03 14:38:26 +03:00
Monty	66dde8a54e	Added rowid_filter support to Aria This includes: - cleanup and optimization of filtering and pushdown engine code. - Adjusted costs for rowid filters (based on extensive testing and profiling). This made a small two changes to the handler_rowid_filter_is_active() API: - One should not call it with a zero pointer! - One does not need to call handler_rowid_filter_is_active() for every row anymore. It is enough to check if filter is active by calling it call it during index_init() or when handler::rowid_filter_changed() is called The changes was to avoid unnecessary function calls and checks if pushdown conditions and rowid_filter is not used. Updated costs for rowid_filter_lookup() to be closer to reality. The old cost was based only on rowid_compare_cost. This is now changed to take into account the overhead in checking the rowid. Changed the Range_rowid_filter class to use DYNAMIC_ARRAY directly instead of Dynamic_array<>. This was done to be able to use the new append_dynamic() functions which gives a notable speed improvment compared to the old code. Removing the abstraction also makes the code easier to understand. The cost of filtering is now slightly lower than before, which is reflected in some test cases that is now using rowid filters.	2023-02-03 10:42:28 +03:00
Sergei Petrunia	0ba47126f1	Added MARIADB_NEW_COST_MODEL for ColumnStore to detect new cost model	2023-02-03 10:34:44 +03:00
Monty	d9d0e78039	Add limits for how many IO operations a table access will do This solves the current problem in the optimizer - SELECT FROM big_table - SELECT from small_table where small_table.eq_ref_key=big_table.id The old code assumed that each eq_ref access will cause an IO. As the cost of IO is high, this dominated the cost for the later table which caused the optimizer to prefer table scans + join cache over index reads. This patch fixes this issue by limit the number of expected IO calls, for rows and index separately, to the size of the table or index or the number of accesses that we except in a range for the index. The major changes are: - Adding a new structure ALL_READ_COST that is mainly used in best_access_path() to hold the costs parts of the cost we are calculating. This allows us to limit the number of IO when multiplying the cost with the previous row combinations. - All storage engine cost functions are changed to return IO_AND_CPU_COST. The virtual cost functions should now return in IO_AND_CPU_COST.io the number of disk blocks that will be accessed instead of the cost of the access. - We are not limiting the io_blocks for table or index scans as we assume that engines may not store these in the 'hot' part of the cache. Table and index scan also uses much less IO blocks than key accesses, so the original issue is not as critical with scans. Other things: OPT_RANGE now holds a 'Cost_estimate cost' instead a lot of different costs. All the old costs, like index_only_read, can be extracted from 'cost'. - Added to the start of some functions 'handler *file= table->file' to shorten the code that is using the handler. - handler->cost() is used to change a ALL_READ_COST or IO_AND_CPU_COST to 'cost in milliseconds' - New functions: handler::index_blocks() and handler::row_blocks() which are used to limit the IO. - Added index_cost and row_cost to Cost_estimate and removed all not needed members. - Removed cost coefficients from Cost_estimate as these don't make sense when costs (except IO_BLOCKS) are in milliseconds. - Removed handler::avg_io_cost() and replaced it with DISK_READ_COST. - Renamed best_range_rowid_filter_for_partial_join() to best_range_rowid_filter() as using the old name made rows too long. - Changed all SJ_MATERIALIZATION_INFO 'Cost_estimate' variables to 'double' as Cost_estimate power was not used for these and thus just caused storage and performance overhead. - Changed cost_for_index_read() to use 'worst_seeks' to only limit IO, not number of table accesses. With this patch worst_seeks is probably not needed anymore, but I kept it around just in case. - Applying cost for filter got to be much shorter and easier thanks to the API changes. - Adjusted cost for fulltext keys in collaboration with Sergei Golubchik. - Most test changes caused by this patch is that table scans are changed to use indexes. - Added ha_seq::keyread_time() and ha_seq::key_scan_time() to get make checking number of potential IO blocks easier during debugging.	2023-02-02 23:57:30 +03:00
Monty	009db2288b	Fixed limit optimization in range optimizer The issue was that when limit is used, SQL_SELECT::test_quick_select would set the cost of table scan to be unreasonable high to force a range to be used. The problem with this approach was that range was used even when the cost of range, when it would only read 'limit rows' would be higher than the cost of a table scan. This patch fixes it by not accepting ranges when the range can never have a lower cost than a table scan, even if every row would match the WHERE clause.	2023-02-02 23:54:57 +03:00
Monty	b66cdbd1ea	Changing all cost calculation to be given in milliseconds This makes it easier to compare different costs and also allows the optimizer to optimizer different storage engines more reliably. - Added tests/check_costs.pl, a tool to verify optimizer cost calculations. - Most engine costs has been found with this program. All steps to calculate the new costs are documented in Docs/optimizer_costs.txt - User optimizer_cost variables are given in microseconds (as individual costs can be very small). Internally they are stored in ms. - Changed DISK_READ_COST (was DISK_SEEK_BASE_COST) from a hard disk cost (9 ms) to common SSD cost (400MB/sec). - Removed cost calculations for hard disks (rotation etc). - Changed the following handler functions to return IO_AND_CPU_COST. This makes it easy to apply different cost modifiers in ha_..time() functions for io and cpu costs. - scan_time() - rnd_pos_time() & rnd_pos_call_time() - keyread_time() - Enhanched keyread_time() to calculate the full cost of reading of a set of keys with a given number of ranges and optional number of blocks that need to be accessed. - Removed read_time() as keyread_time() + rnd_pos_time() can do the same thing and more. - Tuned cost for: heap, myisam, Aria, InnoDB, archive and MyRocks. Used heap table costs for json_table. The rest are using default engine costs. - Added the following new optimizer variables: - optimizer_disk_read_ratio - optimizer_disk_read_cost - optimizer_key_lookup_cost - optimizer_row_lookup_cost - optimizer_row_next_find_cost - optimizer_scan_cost - Moved all engine specific cost to OPTIMIZER_COSTS structure. - Changed costs to use 'records_out' instead of 'records_read' when recalculating costs. - Split optimizer_costs.h to optimizer_costs.h and optimizer_defaults.h. This allows one to change costs without having to compile a lot of files. - Updated costs for filter lookup. - Use a better cost estimate in best_extension_by_limited_search() for the sorting cost. - Fixed previous issues with 'filtered' explain column as we are now using 'records_out' (min rows seen for table) to calculate filtering. This greatly simplifies the filtering code in JOIN_TAB::save_explain_data(). This change caused a lot of queries to be optimized differently than before, which exposed different issues in the optimizer that needs to be fixed. These fixes are in the following commits. To not have to change the same test case over and over again, the changes in the test cases are done in a single commit after all the critical change sets are done. InnoDB changes: - Updated InnoDB to not divide big range cost with 2. - Added cost for InnoDB (innobase_update_optimizer_costs()). - Don't mark clustered primary key with HA_KEYREAD_ONLY. This will prevent that the optimizer is trying to use index-only scans on the clustered key. - Disabled ha_innobase::scan_time() and ha_innobase::read_time() and ha_innobase::rnd_pos_time() as the default engine cost functions now works good for InnoDB. Other things: - Added --show-query-costs (\Q) option to mysql.cc to show the query cost after each query (good when working with query costs). - Extended my_getopt with GET_ADJUSTED_VALUE which allows one to adjust the value that user is given. This is used to change cost from microseconds (user input) to milliseconds (what the server is internally using). - Added include/my_tracker.h ; Useful include file to quickly test costs of a function. - Use handler::set_table() in all places instead of 'table= arg'. - Added SHOW_OPTIMIZER_COSTS to sys variables. These are input and shown in microseconds for the user but stored as milliseconds. This is to make the numbers easier to read for the user (less pre-zeros). Implemented in 'Sys_var_optimizer_cost' class. - In test_quick_select() do not use index scans if 'no_keyread' is set for the table. This is what we do in other places of the server. - Added THD parameter to Unique::get_use_cost() and check_index_intersect_extension() and similar functions to be able to provide costs to called functions. - Changed 'records' to 'rows' in optimizer_trace. - Write more information to optimizer_trace. - Added INDEX_BLOCK_FILL_FACTOR_MUL (4) and INDEX_BLOCK_FILL_FACTOR_DIV (3) to calculate usage space of keys in b-trees. (Before we used numeric constants). - Removed code that assumed that b-trees has similar costs as binary trees. Replaced with engine calls that returns the cost. - Added Bitmap::find_first_bit() - Added timings to join_cache for ANALYZE table (patch by Sergei Petrunia). - Added records_init and records_after_filter to POSITION to remember more of what best_access_patch() calculates. - table_after_join_selectivity() changed to recalculate 'records_out' based on the new fields from best_access_patch() Bug fixes: - Some queries did not update last_query_cost (was 0). Fixed by moving setting thd->...last_query_cost in JOIN::optimize(). - Write '0' as number of rows for const tables with a matching row. Some internals: - Engine cost are stored in OPTIMIZER_COSTS structure. When a handlerton is created, we also created a new cost variable for the handlerton. We also create a new variable if the user changes a optimizer cost for a not yet loaded handlerton either with command line arguments or with SET @@global.engine.optimizer_cost_variable=xx. - There are 3 global OPTIMIZER_COSTS variables: default_optimizer_costs The default costs + changes from the command line without an engine specifier. heap_optimizer_costs Heap table costs, used for temporary tables tmp_table_optimizer_costs The cost for the default on disk internal temporary table (MyISAM or Aria) - The engine cost for a table is stored in table_share. To speed up accesses the handler has a pointer to this. The cost is copied to the table on first access. If one wants to change the cost one must first update the global engine cost and then do a FLUSH TABLES. This was done to be able to access the costs for an open table without any locks. - When a handlerton is created, the cost are updated the following way: See sql/keycaches.cc for details: - Use 'default_optimizer_costs' as a base - Call hton->update_optimizer_costs() to override with the engines default costs. - Override the costs that the user has specified for the engine. - One handler open, copy the engine cost from handlerton to TABLE_SHARE. - Call handler::update_optimizer_costs() to allow the engine to update cost for this particular table. - There are two costs stored in THD. These are copied to the handler when the table is used in a query: - optimizer_where_cost - optimizer_scan_setup_cost - Simply code in best_access_path() by storing all cost result in a structure. (Idea/Suggestion by Igor)	2023-02-02 23:54:45 +03:00
Monty	5e651c9aea	Make the most important optimizer constants user variables Variables added: - optimizer_index_block_copy_cost - optimizer_key_copy_cost - optimizer_key_next_find_cost - optimizer_key_compare_cost - optimizer_row_copy_cost - optimizer_where_compare_cost Some rename of defines was done to make the internal defines similar to the visible ones: TIME_FOR_COMPARE -> WHERE_COST; WHERE_COST was also "inverted" to be a number between 0 and 1 that is multiply with accepted records (similar to other optimizer variables). TIME_FOR_COMPARE_IDX -> KEY_COMPARE_COST. This is also inverted, similar to TIME_FOR_COMPARE. TIME_FOR_COMPARE_ROWID -> ROWID_COMPARE_COST. This is also inverted, similar to TIME_FOR_COMPARE. All default costs are identical to what they where before this patch. Other things: - Compare factor in get_merge_buffers_cost() was inverted. - Changed namespace to static in filesort_utils.cc	2023-02-02 21:44:00 +03:00

1 2 3 4 5 ...

2241 Commits