mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-07-27 18:02:13 +03:00

Marko Mäkelä 3cef4f8f0f MDEV-515 Reduce InnoDB undo logging for insert into empty table

We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.

For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().

Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.

Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.

This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).

INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.

In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.

Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.

dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.

ins_node_t::bulk_insert: Whether bulk insert was initiated.

trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.

trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().

trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.

row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.

lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.

btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().

trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.

ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.

row_sel(), row_sel_try_search_shortcut(), row_search_mvcc(),
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.

row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().

lock_sec_rec_cons_read_sees(): Replaced with lower-level code.

btr_root_page_init(): Refactored from btr_create().

dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.

ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.

This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.

2021-01-25 18:41:27 +02:00

collections

…

include

Merge 10.5 into 10.6

2021-01-07 09:08:09 +02:00

lib

Merge 10.4 into 10.5

2020-12-02 18:29:49 +02:00

main

MDEV-515 Reduce InnoDB undo logging for insert into empty table

2021-01-25 18:41:27 +02:00

std_data

Merge 10.5 into 10.6

2020-11-02 12:49:19 +02:00

suite

MDEV-515 Reduce InnoDB undo logging for insert into empty table

2021-01-25 18:41:27 +02:00

asan.supp

…

CMakeLists.txt

…

dgcov.pl

…

lsan.supp

…

mtr.out-of-source

…

mysql-stress-test.pl

…

mysql-test-run.pl

Merge commit '10.4' into 10.5

2021-01-06 10:53:00 +01:00

purify.supp

…

README

…

README-gcov

…

README.stress

…

suite.pm

…

unstable-tests

MDEV-21452: Remove os_event_t, MUTEX_EVENT, TTASEventMutex, sync_array

2020-12-15 17:56:17 +02:00

valgrind.supp

…

README

This directory contains test suites for the MariaDB server. To run
currently existing test cases, execute ./mysql-test-run in this directory.

Some tests are known to fail on some platforms or be otherwise unreliable.
The file "unstable-tests" contains the list of such tests along with
a comment for every test.
To exclude them from the test run, execute
# ./mysql-test-run --skip-test-list=unstable-tests

In general, you do not have to run "make install", and you can have
a co-existing MariaDB installation; the tests will not conflict with it.
To run the tests in a source directory, you must run "make" first.

In Red Hat distributions, you should run the script as user "mysql".
The user is created with nologin shell, so the best bet is something like
# su -
# cd /usr/share/mysql-test
# su -s /bin/bash mysql -c "./mysql-test-run --skip-test-list=unstable-tests"

This will use the installed MariaDB executables, but will run a private
copy of the server process (using data files within /usr/share/mysql-test),
so you need not start the mysqld service beforehand.

You can omit the --skip-test-list option if you want to check whether
the listed failures occur for you.

To clean up afterwards, remove the created "var" subdirectory, e.g.
# su -s /bin/bash - mysql -c "rm -rf /usr/share/mysql-test/var"

If one or more tests fail on your system for reasons other than those
listed in the "unstable-tests" file, please read the following manual
section for instructions on how to report the problem:

https://mariadb.com/kb/en/reporting-bugs

If you want to use an already running MySQL server for specific tests,
use the --extern option to mysql-test-run. Please note that in this mode,
you are expected to provide names of the tests to run.

For example, here is the command to run the "alias" and "analyze" tests
with an external server:

# mysql-test-run --extern socket=/tmp/mysql.sock alias analyze

To match your setup, you might need to provide other relevant options.

With no test names on the command line, mysql-test-run will attempt
to execute the default set of tests, which will certainly fail, because
many tests cannot run with an external server (they need to control the
options with which the server is started, restart the server during
execution, etc.)
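
The requirement to name the tests explicitly can be captured in a small
wrapper. The following is only a sketch: the function name run_extern is
hypothetical, and it prints the command line instead of executing it, so
that it can be tried without a running server.

```shell
#!/bin/sh
# Hypothetical helper: build a mysql-test-run command line for an
# already running server, and refuse to continue without test names.
run_extern() {
    sock=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "run_extern: name at least one test to run" >&2
        return 1
    fi
    # Dry run: print the command that would be executed.
    echo ./mysql-test-run --extern "socket=$sock" "$@"
}

run_extern /tmp/mysql.sock alias analyze
```

Remove the "echo" in front of ./mysql-test-run to actually start the tests.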

You can create your own test cases. To create a test case, create a new
file in the main subdirectory using a text editor. The file should have a .test
extension. For example:

# xemacs t/test_case_name.test

In the file, put a set of SQL statements that create some tables,
load test data, and run some queries to manipulate it.

Your test should begin by dropping the tables you are going to create and
end by dropping them again. This ensures that you can run the test over
and over again.
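
As an illustration of this structure, the following sketch writes a minimal
test case that drops its table both at the start and at the end. The scratch
directory, the file name, and the table name t1 are placeholders chosen for
this example; in practice you would create the file in the test directory
with your editor.

```shell
#!/bin/sh
# Sketch: write a minimal, re-runnable test case into a scratch directory.
# The directory, the file name and the table name t1 are hypothetical.
set -eu

workdir=$(mktemp -d)
testfile="$workdir/test_case_name.test"

cat > "$testfile" <<'EOF'
# Remove leftovers from an earlier, possibly aborted run.
--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings

CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(20));
INSERT INTO t1 VALUES (1,'one'),(2,'two');
SELECT * FROM t1 ORDER BY a;

# Clean up, so that the test can be run over and over again.
DROP TABLE t1;
EOF

echo "wrote $testfile"
```

The --disable_warnings and --enable_warnings lines keep the "unknown table"
note of the first DROP TABLE out of the recorded result.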

If you are using mysqltest commands in your test case, you should create
the result file in one of the following ways:

# mysql-test-run --record test_case_name

# mysqltest --record < t/test_case_name.test

If you only have a simple test case consisting of SQL statements and
comments, you can create the result file in one of the following ways:

# mysql-test-run --record test_case_name

# mysql test < t/test_case_name.test > r/test_case_name.result

# mysqltest --record --database test --result-file=r/test_case_name.result < t/test_case_name.test

When this is done, take a look at r/test_case_name.result.
If the result is incorrect, you have found a bug. In this case, you should
edit the test result to the correct results so that we can verify that
the bug is corrected in future releases.

If you want to submit your test case, you can send it
to maria-developers@lists.launchpad.net or attach it to a bug report on
https://mariadb.org/jira/.

If the test case is really big or if it contains 'not public' data,
then put your .test file and .result file(s) into a tar.gz archive,
add a README that explains the problem, ftp the archive to
ftp://ftp.askmonty.org/private and submit a report to
https://mariadb.org/jira about it.
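
Packing the files could look like the following sketch. The directory name
bug_report and the placeholder file contents are assumptions made for this
example; copy your real .test and .result files in their place. The upload
itself is not shown.

```shell
#!/bin/sh
# Sketch: bundle a test case, its result file and a README into a tar.gz.
# All names and file contents below are placeholders.
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/bug_report"

# Stand-ins for the real files; copy yours here instead.
printf '# placeholder test case\n' > "$workdir/bug_report/test_case_name.test"
printf '# placeholder result\n' > "$workdir/bug_report/test_case_name.result"
printf 'Describe the problem and how to reproduce it.\n' > "$workdir/bug_report/README"

archive="$workdir/bug_report.tar.gz"
tar -czf "$archive" -C "$workdir" bug_report

# List the archive contents as a check before uploading.
tar -tzf "$archive"
```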

The latest information about mysql-test-run can be found at:
https://mariadb.com/kb/en/mariadb/mysqltest/

If you want to create .rdiff files, check
https://mariadb.com/kb/en/mariadb/mysql-test-auxiliary-files/