Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
In most cases, ha_partition forwards calls to extra() to all
locked_partitions. It doesn't make sense to forward some calls for
partitions that were pruned away.
This patch introduces ha_partition::loop_read_partitions and makes
these calls use it:
- ha_partition::extra_opt(HA_EXTRA_KEYREAD)
- ha_partition::extra(HA_EXTRA_KEYREAD)
- ha_partition::extra(HA_EXTRA_NO_KEYREAD)
Reviewed-by: Monty
MDEV-33502 Slowdown when running nested statement with many partitions
This change was triggered to help some MariaDB users with close to
10000 bits in their bitmaps.
- Change underlaying storage to be 64 bit instead of 32bit.
- This reduses number of loops to scan bitmaps.
- This can cause some bitmaps to be 4 byte large.
- Ensure that all not used top-bits are always 0 (simplifes code as
the last 64 bit storage is not a special case anymore).
- Use my_find_first_bit() to find the first set bit which is much faster
than scanning trough things byte by byte and then bit by bit.
Other things:
- Added a bool to remember if my_bitmap_init() did allocate the bitmap
array. my_bitmap_free() will only free arrays it did allocate.
This allowed me to remove setting 'bitmap=0' before calling
my_bitmap_free() for cases where the bitmap's where allocated externally.
- my_bitmap_init() sets bitmap to 0 in case of failure.
- Added 'universal' asserts to most bitmap functions.
- Change all remaining calls to bitmap_init() to my_bitmap_init().
- To finish the change from 2014.
- Changed all usage of uint32 in my_bitmap.h to my_bitmap_map.
- Updated bitmap_copy() to handle bitmaps of different size.
- Removed const from bitmap_exists_intersection() as this caused casts
on all usage.
- Removed not used function bitmap_set_above().
- Renamed create_last_word_mask() to create_last_bit_mask() (to match
name changes in my_bitmap.cc)
- Extended bitmap-t with test for more bitmap functions.
- Moving get_canonical_filename() from a public function to a method in handler.
- Adding a helper method is_canonical_filename() to handler.
- Adding helper methods left(), substr(), starts_with() to Lex_cstring.
- Adding helper methods is_sane(), buffer_overlaps(),
max_data_size() to CharBuffer.
- Adding append_casedn() to CharBuffer. It implements the main functionality
that replaces the being removed my_casedn_str() call.
- Adding a class Table_path_buffer,
a descendant of CharBuffer with size FN_REFLEN.
- Changing get_canonical_filename() to get a pointer to Table_path_buffer
instead just a pointer to char.
- Changing the data type of the "path" parameter and the return type of
get_canonical_filename() from char* to Lex_cstring.
- This commit is different from 10.6 commit c438284863.
Due to Commit 045757af4c (MDEV-24621),
InnoDB does buffer and pre-sort the records for each index, and build
the indexes one page at a time.
Multiple large insert ignore statment aborts the server during bulk
insert operation. Problem is that InnoDB merge record exceeds
the page size. To avoid this scenario, InnoDB should catch
too big record while buffering the insert operation itself.
row_merge_buf_encode(): returns length of the encoded index record
row_merge_buf_write(): Catches the DB_TOO_BIG_RECORD earlier and
returns error
- HA_EXTRA_IGNORE_INSERT call is being called for every inserted row,
and on partitioned tables on every row * every partition.
This leads to slowness during load..data operation
- Under bulk operation, multiple insert statement error handling
will end up emptying the table. This behaviour introduced by the
commit 8ea923f55b (MDEV-24818).
This makes the HA_EXTRA_IGNORE_INSERT call redundant. We can
use the same behavior for insert..ignore statement as well.
- Removed the extra call HA_EXTRA_IGNORE_INSERT as the solution
to improve the performance of load command.
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity
Example output:
# Pages_accessed: 184 Pages_read: 95 Pages_updated: 0 Old_rows_read: 1
# Pages_read_time: 17.0204 Engine_time: 248.1297
Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.
The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.
Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.
Implementation details:
class ha_handler_stats holds all engine stats. This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.
handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.
Cloned or partition tables have the pointer set to the base table if
status are requested.
There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
the handler_status as part of table_init(). Can be optimized in the
future to only do this is log-slow-verbosity changes. For this to work
we have to update handler_status for all opened partitions and
also for all partitions opened in the future.
Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
always THD first (generates better code as other functions also have THD
first).
MDEV-29253 Detect incompatible MySQL partition scheme and either convert
them or report to user and in error log.
This task is about converting in place MySQL 5.6 and 5.7 partition tables
to MariaDB as part of mariadb-upgrade.
- Update TABLE_SHARE::init_from_binary_frm_image() to be able to read
MySQL frm files with partitions.
- Create .par file, if it do not exists, on open of partitioned table.
Executing mariadb-upgrade will create all the missing .par files.
The MySQL .frm file will be changed to MariaDB format after next
ALTER TABLE.
Other changes:
- If we are using stored mysql_version to distingush between MySQL and
MariaDB .frm file information, do not upgrade mysql_version in the
.frm file as part of CHECK TABLE .. FOR UPGRADE as this would cause
problems next time we parse the .frm file.
MDEV-31445 Server crashes in ha_partition::index_blocks / cost_for_index_read
The crash happened in the case where partition pruning finds 0
partitions.
This patch also fixes
MDEV-31391 Assertion `((best.records_out) == 0.0 ... failed
Cost changes caused by this change:
- range queries with join buffer now have a notable smaller cost.
- range ranges are bit more expensive as the MULTI_RANGE_COST is now
properly applied to it in all cases (this extra cost is equal to a
key lookup).
- table scan cost is slight smaller as we now assume data is cached in
the engine after the first scan pass. (We did this before for range
scans and other access methods).
- partition tables had wrong values for max_row_blocks and
max_index_blocks. Correcting this, causes range access on
partitioned tables to have slightly higher cost because of the
increased estimated IO.
- Using first match + join buffer caused 'filtered' to be calcualted
wrong. (Only affected EXPLAIN, not query costs).
- Added cost_without_join_buffer to optimizer_trace.
- check_quick_select() adjusted the number of rows according to persistent
statistics, but did not adjust cost. Now fixed.
The big change in the patch are:
- In best_access_path(), where we now are using storing the cost in
'ALL_READ_COST cost' and only converting it to a double at the end.
This allows us to more exactly calculate the effect of the join_cache.
- In JOIN_TAB::estimate_scan_time(), store the cost also in a
ALL_READ_COST object.
One of effect if this change is that when joining very small tables:
t1 some_access_method
t2 range
t3 ALL Use join buffer
This is swiched to
t1 some_access_method
t3 ALL
t2 range use join buffer
Both plans has the same cost, but as table scan in this case has less
cost than rang, the table scan will be considered first and thus have
precidence.
Test case changes:
- optimizer_trace - Addition of cost_without_join_buffer
- subselect_mat_cost_bugs - Small tables and scan versus range
- range & range_mrr_icp - Range + join_cache is faster than ref
- optimizer_trace - cost_without_join_buffer, smaller scan cost,
range setup cost.
- mrr - range+join_buffer used as smaller cost