Set solution is to check if transaction, which modified a record, is
still active in lock_clust_rec_read_check_and_lock(). if yes, then just
request a lock. If no, then, depending on if the current transaction read
view can see the changes, return eighter DB_RECORD_CHANGED or request a
lock.
We can do the check in lock_clust_rec_read_check_and_lock() because
transaction tries to set a lock on the record which cursor points to after
transaction resuming and cursor position restoring. If the lock already
exists, then we don't request the lock again. But for the current commit
it's important that lock_clust_rec_read_check_and_lock() will be invoked
again for the same record, so we can do the check again after
transaction, which modified a record, was committed or rolled back.
MDEV-33802(4aa9291) is partially reverted. If some transaction holds
implicit lock on some record and transaction with snapshot isolation level
requests conflicting lock on the same record, it should be blocked instead
of returning DB_RECORD_CHANGED to have ability to continue execution when
implicit lock owner is rolled back.
The construction
--------------------------------------------------------------------------
let $wait_condition=
select count(*) = 1 from information_schema.processlist
where state = 'Updating' and info = 'UPDATE t SET b = 2 WHERE a';
--source include/wait_condition.inc
--------------------------------------------------------------------------
is not reliable enought to make sure transaction is blocked in test
case, the test failed sporadically with
--------------------------------------------------------------------------
./mtr --max-test-fail=1 --parallel=96 lock_isolation{,,,,,,,}{,,,}{,,} \
--repeat=500
--------------------------------------------------------------------------
command. That's why it was replaced with debug sync-points.
Reviewed by: Marko Mäkelä
Under unknown circumstances, the SQL layer may wrongly disregard an
invocation of thd_mark_transaction_to_rollback() when an InnoDB
transaction had been aborted (rolled back) due to one of the following errors:
* HA_ERR_LOCK_DEADLOCK
* HA_ERR_RECORD_CHANGED (if innodb_snapshot_isolation=ON)
* HA_ERR_LOCK_WAIT_TIMEOUT (if innodb_rollback_on_timeout=ON)
Such an error used to cause a crash of InnoDB during transaction commit.
These changes aim to catch and report the error earlier, so that not only
this crash can be avoided but also the original root cause be found and
fixed more easily later.
The idea of this fix is from Michael 'Monty' Widenius.
HA_ERR_ROLLBACK: A new error code that will be translated into
ER_ROLLBACK_ONLY, signalling that the current transaction
has been aborted and the only allowed action is ROLLBACK.
trx_t::state: Add TRX_STATE_ABORTED that is like
TRX_STATE_NOT_STARTED, but noting that the transaction had been
rolled back and aborted.
trx_t::is_started(): Replaces trx_is_started().
ha_innobase: Check the transaction state in various places.
Simplify the logic around SAVEPOINT.
ha_innobase::is_valid_trx(): Replaces ha_innobase::is_read_only().
The InnoDB logic around transaction savepoints, commit, and rollback
was unnecessarily complex and might have contributed to this
inconsistency. So, we are simplifying that logic as well.
trx_savept_t: Replace with const undo_no_t*. When we rollback to
a savepoint, all we need to know is the number of undo log records
that must survive.
trx_named_savept_t, DB_NO_SAVEPOINT: Remove. We can store undo_no_t
directly in the space allocated at innobase_hton->savepoint_offset.
fts_trx_create(): Do not copy previous savepoints.
fts_savepoint_rollback(): If a savepoint was not found, roll back
everything after the default savepoint of fts_trx_create().
The test innodb_fts.savepoint is extended to cover this code.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
convert_error_code_to_mysql(): Treat DB_DEADLOCK and DB_RECORD_CHANGED
in the same way, that is, signal to the SQL layer that the transaction
had been rolled back.
The fixes in b8a6719889 have not disabled
semi-consistent read for innodb_snapshot_isolation=ON mode, they just allowed
read uncommitted version of a record, that's why the test for MDEV-26643 worked
well.
The semi-consistent read should be disabled on upper level in
row_search_mvcc() for READ COMMITTED isolation level.
Reviewed by Marko Mäkelä.
Even after commit b8a6719889 there
is an anomaly where a locking read could return inconsistent results.
If a locking read would have to wait for a record lock, then by the
definition of a read view, the modifications made by the current lock
holder cannot be visible in the read view. This is because the read
view must exclude any transactions that had not been committed at the
time when the read view was created.
lock_rec_convert_impl_to_expl_for_trx(), lock_rec_convert_impl_to_expl():
Return an unsafe-to-dereference pointer to a transaction that holds or
held the lock, or nullptr if the lock was available.
lock_clust_rec_modify_check_and_lock(),
lock_sec_rec_read_check_and_lock(),
lock_clust_rec_read_check_and_lock():
Return DB_RECORD_CHANGED if innodb_strict_isolation=ON and the
lock was being held by another transaction.
The test case, which is based on a bug report by Zhuang Liu,
covers the function lock_sec_rec_read_check_and_lock().
Reviewed by: Vladislav Lesin
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.
The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.
It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a
record that does not exist in the current read view is made, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.
lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself on the directions of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.
The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.
Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.
Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.