This bug fix requires that Bug #58912 be fixed as well (bzr revision id
marko.makela@oracle.com-20101221093919-mcmmgd4zpse9567d). Otherwise,
another double BLOB free could occur when InnoDB would try to perform
an update-in-place as delete-and-insert-by-update-in-place.
row_upd_clust_rec_by_insert(): Do not disown the externally stored
columns from the old record (btr_cur_mark_extern_inherited_fields())
until after checking the foreign key constraints and successfully
inserting the updated record. If a lock wait timeout occurs between
the delete-marking of the old record and the insertion of the updated
record, mark the columns inherited before retrying the insert.
Distinguish the state UPD_NODE_INSERT_BLOB from
UPD_NODE_INSERT_CLUSTERED.
btr_cur_del_mark_set_clust_rec(): Replace the cursor with
block,rec,index,offsets so that the offsets need not be recalculated.
Assert that rec is on a clustered index leaf page.
btr_cur_disown_inherited_fields(): Renamed from
btr_cur_mark_extern_inherited_fields(). Use
upd_get_field_by_field_no(). Assert that there are externally stored
columns. Assert that a mini-transaction is passed. Remove the return
status. (The only caller, row_upd_clust_rec_by_insert(), will have
determined that some fields have changed ownership.)
btr_cur_mark_dtuple_inherited_extern(): Rename to
row_upd_clust_rec_by_insert_inherit_func() and declare as static. Add
the debug parameters rec, offsets. When rec is given, assert that the
off-page columns match those in the inesrt tuple and that the off-page
columns are owned by the record. Assert that the non-updated off-page
columns in the insert tuple are owned, and mark them inherited.
row_upd_clust_rec_by_insert_inherit(): A wrapper macro for
row_upd_clust_rec_by_insert_inherit_func().
row_undo_mod_upd_exist_sec(): Adjust a comment about
row_upd_clust_rec_by_insert().
rb:508 approved by Jimmy Yang
row_upd_changes_ord_field_binary(): Do not return TRUE if the update
vector changes a column that is covered by a prefix index, but does
not change the column prefix. Add the row_ext_t parameter for
determining whether the prefixes of externally stored columns match.
dfield_datas_are_binary_equal(): Add the parameter len, for comparing
column prefixes when len > 0.
innodb.test: Add a test case where the patch of Bug #55284 failed
without this fix.
rb:537 approved by Jimmy Yang
Replace the array of mutexes that used to protect
dict_index_t::stat_n_diff_key_vals[] with an array of rw locks that protects
all the stats related members in dict_table_t and all of its indexes.
Approved by: Jimmy (rb://503)
columns again
This is follow-up to Bug #54358. Not all occurrences of the bug were fixed.
We need to check all calls to btr_copy_externally_stored_field_prefix_low()
and do the right thing when the pointer to the off-page column is null
(full of zero bytes).
It turns out that only the call to btr_copy_externally_stored_field_prefix()
in row_sel_sec_rec_is_for_blob() needs to be changed.
For fetching complete off-page columns rather than prefixes, the function
btr_rec_copy_externally_stored_field() already checks if the pointer
is null (all-zero). Two of its callers (row_merge_copy_blobs() and
row_sel_fetch_columns()) are never executed as READ COMMITTED and can
rightfully assert that the fetch succeeded. The third caller,
row_sel_store_mysql_rec(), already does the right thing.
The calls in row_upd_ext_fetch() and trx_undo_page_fetch_ext() must
expect that the off-page column exists. Update and rollback are
locking operations, never READ UNCOMMITTED.
row_search_for_mysql(): When a secondary index record might not be
visible in the current transaction's read view and we consult the
clustered index and optionally some undo log records, return the
relevant columns of the clustered index record to MySQL instead of the
secondary index record.
ibuf_insert_to_index_page_low(): New function, refactored from
ibuf_insert_to_index_page().
ibuf_insert_to_index_page(): When we are inserting a record in place
of a delete-marked record and some fields of the record differ, update
that record just like row_ins_sec_index_entry_by_modify() would do.
btr_cur_update_alloc_zip(): Make the function public.
mysql_row_templ_t: Add clust_rec_field_no.
row_sel_store_mysql_rec(), row_sel_push_cache_row_for_mysql(): Add the
flag rec_clust, for returning data at clust_rec_field_no instead of
rec_field_no. Resurrect the debug assertion that the record not be
marked for deletion. (Bug #55626)
[UNIV_DEBUG || UNIV_IBUF_DEBUG] ibuf_debug, buf_page_get_gen(),
buf_flush_page_try():
Implement innodb_change_buffering_debug=1 for evicting pages from the
buffer pool, so that change buffering will be attempted more
frequently.
This is a regression from the fix for bug no 38999. A storage engine capable
of reading only a subset of a table's columns updates corresponding bits in
the read buffer to signal that it has read NULL values for the corresponding
columns. It cannot, and should not, update any other bits. Bug no 38999
occurred because the implementation of UPDATE statements compare the NULL bits
using memcmp, inadvertently comparing bits that were never requested from the
storage engine. The regression was caused by the storage engine trying to
alleviate the situation by writing to all NULL bits, even those that it had no
knowledge of. This has devastating effects for the index merge algorithm,
which relies on all NULL bits, except those explicitly requested, being left
unchanged.
The fix reverts the fix for bug no 38999 in both InnoDB and InnoDB plugin and
changes the server's method of comparing records. For engines that always read
entire rows, we proceed as usual. For engines capable of reading only select
columns, the record buffers are now compared on a column by column basis. An
assertion was also added so that non comparable buffers are never read. Some
relevant copy-pasted code was also consolidated in a new function.
This is a simple optimization issue. All stats are related to only indexed
columns, index size or number of rows in the whole table. UPDATEs that touch
only non-indexed columns cannot affect stats and we can avoid calling the
function row_update_statistics_if_needed() which may result in unnecessary I/O.
Approved by: Marko (rb://466)
Fix compiler warning:
row/row0vers.c: In function 'row_vers_impl_x_locked_off_kernel':
row/row0vers.c:74:9: error: variable 'err' set but not used [-Werror=unused-but-set-variable]
Fix compiler warning:
row/row0umod.c: In function 'row_undo_mod_clust_low':
row/row0umod.c:117:9: error: variable 'success' set but not used [-Werror=unused-but-set-variable]
Fix compiler warning:
row/row0purge.c: In function 'row_purge_step':
row/row0purge.c:687:9: error: variable 'err' set but not used [-Werror=unused-but-set-variable]
Remove a bogus debug assertion that triggered the bug.
Add assertions precisely where records must not be delete-marked.
And a comment to clarify when the record is allowed to be delete-marked.
The bug is due to a double delete of a BLOB, once via:
rollback -> btr_cur_pessimistic_delete()
and the second time via purge.
The bug is in row_upd_clust_rec_by_insert(). There we relinquish ownership
of the non-updated BLOB columns in btr_cur_mark_extern_inherited_fields()
before building the row entry that will be inserted and whose contents will
be logged in the UNDO log. However, we don't set the BLOB column later to
INHERITED so that a possible rollback will not free the original row's
non-updated BLOB entries. This is because the condition that checks for
that is in :
if (node->upd_ext) {}.
node->upd_ext is non-NULL only if a BLOB column was updated and that column
is part of some key ordering (see row_upd_replace()). This results in the
non-update BLOB columns being deleted during a rollback and subsequently by
purge again.
rb://413
(READ UNCOMMITTED access failure of off-page DYNAMIC or COMPRESSED columns).
Records that lack incompletely written externally stored columns may
be accessed by READ UNCOMMITTED transaction even without involving a
crash during an INSERT or UPDATE operation. I verified this as follows.
(1) added a delay after the mini-transaction for writing the clustered
index 'stub' record was committed (patch attached)
(2) started mysqld in gdb, setting breakpoints to the where the
assertions about READ UNCOMMITTED were added in the bug fix
(3) invoked ibtest3 --create-options=key_block_size=2
to create BLOBs in a COMPRESSED table
(4) invoked the following:
yes 'set transaction isolation level read uncommitted;
checksum table blobt3;select sleep(1);'|mysql -uroot test
(5) noted that one of the breakpoints was triggered
(return(NULL) in btr_rec_copy_externally_stored_field())
=== modified file 'storage/innodb_plugin/row/row0ins.c'
--- storage/innodb_plugin/row/row0ins.c 2010-06-30 08:17:25 +0000
+++ storage/innodb_plugin/row/row0ins.c 2010-06-30 08:17:25 +0000
@@ -2120,6 +2120,7 @@ function_exit:
rec_t* rec;
ulint* offsets;
mtr_start(&mtr);
+ os_thread_sleep(5000000);
btr_cur_search_to_nth_level(index, 0, entry, PAGE_CUR_LE,
BTR_MODIFY_TREE, &cursor, 0,
=== modified file 'storage/innodb_plugin/row/row0upd.c'
--- storage/innodb_plugin/row/row0upd.c 2010-06-30 08:11:55 +0000
+++ storage/innodb_plugin/row/row0upd.c 2010-06-30 08:11:55 +0000
@@ -1763,6 +1763,7 @@ row_upd_clust_rec(
rec_offs_init(offsets_);
mtr_start(mtr);
+ os_thread_sleep(5000000);
ut_a(btr_pcur_restore_position(BTR_MODIFY_TREE, pcur, mtr));
rec = btr_cur_get_rec(btr_cur);
dict_table_get_format(index->table)
The REDUNDANT and COMPACT formats store a local 768-byte prefix of
each externally stored column. No row_ext cache is needed, but we
initialized one nevertheless. When the BLOB pointer was zero, we would
ignore the locally stored prefix as well. This triggered an assertion
failure in row_undo_mod_upd_exist_sec().
row_build(): Allow ext==NULL when a REDUNDANT or COMPACT table
contains externally stored columns.
row_undo_search_clust_to_pcur(), row_upd_store_row(): Invoke
row_build() with ext==NULL on REDUNDANT and COMPACT tables.
rb://382 approved by Jimmy Yang
columns
When the server crashes after a record stub has been inserted and
before all its off-page columns have been written, the record will
contain incomplete off-page columns after crash recovery. Such records
may only be accessed at the READ UNCOMMITTED isolation level or when
rolling back a recovered transaction in recv_recovery_rollback_active().
Skip these records at the READ UNCOMMITTED isolation level.
TODO: Add assertions for checking the above assumptions hold when an
incomplete BLOB is encountered.
btr_rec_copy_externally_stored_field(): Return NULL if the field is
incomplete.
row_prebuilt_t::templ_contains_blob: Clarify what "BLOB" means in this
context. Hint: MySQL BLOBs are not the same as InnoDB BLOBs.
row_sel_store_mysql_rec(): Return FALSE if not all columns could be
retrieved. Previously this function always returned TRUE. Assert that
the record is not delete-marked.
row_sel_push_cache_row_for_mysql(): Return FALSE if not all columns
could be retrieved.
row_search_for_mysql(): Skip records containing incomplete off-page
columns. Assert that the transaction isolation level is READ
UNCOMMITTED.
rb://380 approved by Jimmy Yang
and clarifies the invariant in dict_table_get_on_id().
In Mar 2007 Marko observed a crash during recovery, the crash resulted from
an UNDO operation on a system table. His solution was to acquire an X lock on
the data dictionary, this in hindsight was an overkill. It is unclear what
caused the crash, current hypothesis is that it was a memory corruption.
The X lock results in performance issues by when undoing changes due to
rollback during normal operation on regular tables.
Why the change is safe:
======================
The InnoDB code has changed since the original X lock change was made. In the
new code we always lock the data dictionary in X mode during startup when
UNDOing operations on the system tables (this is a given). This ensures that
the crash Marko observed cannot happen as long as all transactions that update
the system tables follow the standard rules by setting the appropriate DICT_OP
flag when writing the log records when they make the changes.
If transactions violate the above mentioned rule then during recovery (at
startup) the rollback code (see trx0roll.c) will not acquire the X lock
and we will see the crash again. This will however be a different bug.
when renaming tables
Allocate the table name using ut_malloc() instead of table->heap because
the latter cannot be freed.
Adjust dict_sys->size calculations all over the code.
Change dict_table_t::name from const char* to char* because we need to
ut_malloc()/ut_free() it.
Reviewed by: Inaam, Marko, Heikki (rb://384)
Approved by: Heikki (rb://384)
(InnoDB plugin branch)
mysql-test/suite/innodb_plugin/r/innodb_mysql.result:
test case
mysql-test/suite/innodb_plugin/t/innodb_mysql.test:
test case
storage/innodb_plugin/row/row0sel.c:
init null bytes with default values as they might be
left uninitialized in some cases and these uninited bytes
might be copied into mysql record buffer that leads to
valgrind warnings on next use of the buffer.
In semi-consistent read, only unlock freshly locked non-matching records.
lock_rec_lock_fast(): Return LOCK_REC_SUCCESS,
LOCK_REC_SUCCESS_CREATED, or LOCK_REC_FAIL instead of TRUE/FALSE.
enum db_err: Add DB_SUCCESS_LOCKED_REC for indicating a successful
operation where a record lock was created.
lock_sec_rec_read_check_and_lock(),
lock_clust_rec_read_check_and_lock(), lock_rec_enqueue_waiting(),
lock_rec_lock_slow(), lock_rec_lock(), row_ins_set_shared_rec_lock(),
row_ins_set_exclusive_rec_lock(), sel_set_rec_lock(),
row_sel_get_clust_rec_for_mysql(): Return DB_SUCCESS_LOCKED_REC if a
new record lock was created. Adjust callers.
row_unlock_for_mysql(): Correct the function documentation.
row_prebuilt_t::new_rec_locks: Correct the documentation.
lock_rec_unlock(): Cache first_lock and rewrite while() loops as for().
btr_cur_optimistic_update(): Use common error handling return.
row_create_prebuilt(): Add Valgrind instrumentation.