This bug was originally filed and fixed as Bug#12612184. The original
fix was buggy, and it was patched by Bug#12704861. Also that patch was
buggy (potentially breaking crash recovery), and both fixes were
reverted.
This fix was not ported to the built-in InnoDB of MySQL 5.1, because
the function signatures of many core functions are different from
InnoDB Plugin and later versions. The block allocation routines and
their callers would have to changed so that they handle block
descriptors instead of page frames.
When a record is updated so that its size grows, non-updated columns
can be selected for external (off-page) storage. The bug is that the
initially inserted updated record contains an all-zero BLOB pointer to
the field that was not updated. Only after the BLOB pages have been
allocated and written, the valid pointer can be written to the record.
Between the release of the page latch in mtr_commit(mtr) after
btr_cur_pessimistic_update() and the re-latching of the page in
btr_pcur_restore_position(), other threads can see the invalid BLOB
pointer consisting of 20 zero bytes. Moreover, if the system crashes
at this point, the situation could persist after crash recovery, and
the contents of the non-updated column would be permanently lost.
The problem is amplified by the ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPRESSED that were introduced in
innodb_file_format=barracuda in InnoDB Plugin, but the bug does exist
in all InnoDB versions.
The fix is as follows. After a pessimistic B-tree operation that needs
to write out off-page columns, allocate the pages for these columns in
the mini-transaction that performed the B-tree operation (btr_mtr),
but write the pages in a separate mini-transaction (blob_mtr). Do
mtr_commit(blob_mtr) before mtr_commit(btr_mtr). A quirk: Do not reuse
pages that were previously freed in btr_mtr. Only write the off-page
columns to 'fresh' pages.
In this way, crash recovery will see redo log entries for blob_mtr
before any redo log entry for btr_mtr. It will apply the BLOB page
writes to pages that were marked free at that point. If crash recovery
fails to see all of the btr_mtr redo log, there will be some
unreachable BLOB data in free pages, but the B-tree will be in a
consistent state.
btr_page_alloc_low(): Renamed from btr_page_alloc(). Add the parameter
init_mtr. Return an allocated block, or NULL. If init_mtr!=mtr but
the page was already X-latched in mtr, do not initialize the page.
btr_page_alloc(): Wrapper for btr_page_alloc_for_ibuf() and
btr_page_alloc_low().
btr_page_free(): Add a debug assertion that the page was a B-tree page.
btr_lift_page_up(): Return the father block.
btr_compress(), btr_cur_compress_if_useful(): Add the parameter ibool
adjust, for adjusting the cursor position.
btr_cur_pessimistic_update(): Preserve the cursor position when
big_rec will be written and the new flag BTR_KEEP_POS_FLAG is defined.
Remove a duplicate rec_get_offsets() call. Keep the X-latch on
index->lock when big_rec is needed.
btr_store_big_rec_extern_fields(): Replace update_inplace with
an operation code, and local_mtr with btr_mtr. When not doing a
fresh insert and btr_mtr has freed pages, put aside any pages that
were previously X-latched in btr_mtr, and free the pages after
writing out all data. The data must be written to 'fresh' pages,
because btr_mtr will be committed and written to the redo log after
the BLOB writes have been written to the redo log.
btr_blob_op_is_update(): Check if an operation passed to
btr_store_big_rec_extern_fields() is an update or insert-by-update.
fseg_alloc_free_page_low(), fsp_alloc_free_page(),
fseg_alloc_free_extent(), fseg_alloc_free_page_general(): Add the
parameter init_mtr. Return an allocated block, or NULL. If
init_mtr!=mtr but the page was already X-latched in mtr, do not
initialize the page.
xdes_get_descriptor_with_space_hdr(): Assert that the file space
header is being X-latched.
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page().
fsp_page_create(): New function, for allocating, X-latching and
potentially initializing a page. If init_mtr!=mtr but the page was
already X-latched in mtr, do not initialize the page.
fsp_free_page(): Add ut_ad(0) to the error outcomes.
fsp_free_page(), fseg_free_page_low(): Increment mtr->n_freed_pages.
fsp_alloc_seg_inode_page(), fseg_create_general(): Assert that the
page was not previously X-latched in the mini-transaction. A file
segment or inode page should never be allocated in the middle of an
mini-transaction that frees pages, such as btr_cur_pessimistic_delete().
fseg_alloc_free_page_low(): If the hinted page was allocated, skip the
check if the tablespace should be extended. Return NULL instead of
FIL_NULL on failure. Remove the flag frag_page_allocated. Instead,
return directly, because the page would already have been initialized.
fseg_find_free_frag_page_slot() would return ULINT_UNDEFINED on error,
not FIL_NULL. Correct a bogus assertion.
fseg_alloc_free_page(): Redefine as a wrapper macro around
fseg_alloc_free_page_general().
buf_block_buf_fix_inc(): Move the definition from the buf0buf.ic to
buf0buf.h, so that it can be called from other modules.
mtr_t: Add n_freed_pages (number of pages that have been freed).
page_rec_get_nth_const(), page_rec_get_nth(): The inverse function of
page_rec_get_n_recs_before(), get the nth record of the record
list. This is faster than iterating the linked list. Refactored from
page_get_middle_rec().
trx_undo_rec_copy(): Add a debug assertion for the length.
trx_undo_add_page(): Return a block descriptor or NULL instead of a
page number or FIL_NULL.
trx_undo_report_row_operation(): Add debug assertions.
trx_sys_create_doublewrite_buf(): Assert that each page was not
previously X-latched.
page_cur_insert_rec_zip_reorg(): Make use of page_rec_get_nth().
row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG, so that
the repositioning of the cursor can be avoided.
row_ins_index_entry_low(): Add DEBUG_SYNC points before and after
writing off-page columns. If inserting by updating a delete-marked
record, do not reposition the cursor or commit the mini-transaction
before writing the off-page columns.
row_build(): Tighten a debug assertion about null BLOB pointers.
row_upd_clust_rec(): Add DEBUG_SYNC points before and after writing
off-page columns. Do not reposition the cursor or commit the
mini-transaction before writing the off-page columns.
rb:939 approved by Jimmy Yang
Bug#12612184 RACE CONDITION AFTER BTR_CUR_PESSIMISTIC_UPDATE()
The fix introduced potentially more severe crash recovery problems
than the bug causes. Revert the fix for now.
This was an attempt to address problems with the Bug#12612184 fix.
Even with this follow-up fix, crash recovery can be broken.
Let us fix the bug later.
The fix of Bug#12612184 broke crash recovery. When a record that
contains off-page columns (BLOBs) is updated, we must first write redo
log about the BLOB page writes, and only after that write the redo log
about the B-tree changes. The buggy fix would log the B-tree changes
first, meaning that after recovery, we could end up having a record
that contains a null BLOB pointer.
Because we will be redo logging the writes off the off-page columns
before the B-tree changes, we must make sure that the pages chosen for
the off-page columns are free both before and after the B-tree
changes. In this way, the worst thing that can happen in crash
recovery is that the BLOBs are written to free pages, but the B-tree
changes are not applied. The BLOB pages would correctly remain free in
this case. To achieve this, we must allocate the BLOB pages in the
mini-transaction of the B-tree operation. A further quirk is that BLOB
pages are allocated from the same file segment as leaf pages. Because
of this, we must temporarily "hide" any leaf pages that were freed
during the B-tree operation by "fake allocating" them prior to writing
the BLOBs, and freeing them again before the mtr_commit() of the
B-tree operation, in btr_mark_freed_leaves().
btr_cur_mtr_commit_and_start(): Remove this faulty function that was
introduced in the Bug#12612184 fix. The problem that this function was
trying to address was that when we did mtr_commit() the BLOB writes
before the mtr_commit() of the update, the new BLOB pages could have
overwritten clustered index B-tree leaf pages that were freed during
the update. If recovery applied the redo log of the BLOB writes but
did not see the log of the record update, the index tree would be
corrupted. The correct solution is to make the freed clustered index
pages unavailable to the BLOB allocation. This function is also a
likely culprit of InnoDB hangs that were observed when testing the
Bug#12612184 fix.
btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of
a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs,
or freed (nonfree=FALSE) before committing the mini-transaction.
btr_freed_leaves_validate(): A debug function for checking that all
clustered index leaf pages that have been marked free in the
mini-transaction are consistent (have not been zeroed out).
btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the
number of the allocated page, or FIL_NULL if out of space. Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or if this is a "fake allocation"
(init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE).
btr_page_alloc(): Add the parameter init_mtr, allowing the page to be
initialized and X-latched in a different mini-transaction than the one
that is used for the allocation. Invoke btr_page_alloc_low(). If a
clustered index leaf page was previously freed in mtr, remove it from
the memo of previously freed pages.
btr_page_free(): Assert that the page is a B-tree page and it has been
X-latched by the mini-transaction. If the freed page was a leaf page
of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to
the mini-transaction.
btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr,
which is NULL (old behaviour in inserts) and the same as local_mtr in
updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it
instead of the mini-transaction that is used for writing the BLOBs.
fsp_alloc_from_free_frag(): Refactored from
fsp_alloc_free_page(). Allocate the specified page from a partially
free extent.
fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or NULL if this is a "fake allocation"
that prevents the reuse of a previously freed B-tree page for BLOB
storage. If init_mtr==NULL, try harder to reallocate the specified page
and assert that it succeeded.
fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for
specifying the mini-transaction where the page should be initialized.
Do not allow init_mtr == NULL, because this function is never to be
used for "fake allocations".
mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag
mtr->freed_clust_leaf for quickly determining if any
MTR_MEMO_FREE_CLUST_LEAF operations have been posted.
row_ins_index_entry_low(): When columns are being made off-page in
insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass
the mini-transaction as the alloc_mtr to
btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
row_build(): Correct a comment, and add a debug assertion that a
record that contains NULL BLOB pointers must be a fresh insert.
row_upd_clust_rec(): When columns are being moved off-page, invoke
btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as
the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
buf_reset_check_index_page_at_flush(): Remove. The function
fsp_init_file_page_low() already sets
bpage->check_index_page_at_flush=FALSE.
There is a known issue in tablespace extension. If the request to
allocate a BLOB page leads to the tablespace being extended, crash
recovery could see BLOB writes to pages that are off the tablespace
file bounds. This should trigger an assertion failure in fil_io() at
crash recovery. The safe thing would be to write redo log about the
tablespace extension to the mini-transaction of the BLOB write, not to
the mini-transaction of the record update. However, there is no redo
log record for file extension in the current redo log format.
rb:693 approved by Sunny Bains
discarded in buf_page_create()
This bug turned out to be a false alarm, a bug in the UNIV_SYNC_DEBUG
diagnostic code. Because of this, the patch was not backported to the
built-in InnoDB in MySQL 5.1. Furthermore, there is no test case for
InnoDB Plugin in MySQL 5.1, because the delete buffering in MySQL 5.5
makes triggering the failure much easier.
When a freed page for which there exist orphaned buffered changes is
allocated and reused for something else, buf_page_create() will discard
the buffered changes by invoking ibuf_merge_or_delete_for_page().
This would violate the InnoDB latching order.
Tweak the latching order as follows. Move SYNC_IBUF_MUTEX below
SYNC_FSP_PAGE, where it logically belongs, and assign new latching
levels for the ibuf->index->lock and the insert buffer B-tree pages:
#define SYNC_IBUF_MUTEX 370 /* ibuf_mutex */
#define SYNC_IBUF_INDEX_TREE 360
#define SYNC_IBUF_TREE_NODE_NEW 359
#define SYNC_IBUF_TREE_NODE 358
btr_block_get(), btr_page_get(): In UNIV_SYNC_DEBUG, add the parameter
"index" for determining the appropriate latching order
(SYNC_IBUF_TREE_NODE or SYNC_TREE_NODE).
btr_page_alloc_for_ibuf(), btr_create(): Use SYNC_IBUF_TREE_NODE_NEW
instead of SYNC_TREE_NODE_NEW for insert buffer pages.
btr_cur_search_to_nth_level(), btr_pcur_restore_position_func(): Use
SYNC_IBUF_TREE_NODE instead of SYNC_TREE_NODE for insert buffer pages.
btr_search_guess_on_hash(): Assert that the index is not an insert buffer tree.
dict_index_add_to_cache(): Use SYNC_IBUF_INDEX_TREE for the insert
buffer tree (ibuf->index->lock).
ibuf0ibuf.c: Use SYNC_IBUF_TREE_NODE or SYNC_IBUF_TREE_NODE_NEW for
all B-tree pages.
ibuf_merge_or_delete_for_page(): Assert that the user page is
BUF_IO_READ fixed. Only in this way it is OK to latch it as
SYNC_IBUF_TREE_NODE instead of the proper SYNC_TREE_NODE (which would
violate the changed latching order).
sync_thread_add_level(): Remove the special tweak for
SYNC_IBUF_MUTEX. Add rules for the added latching levels.
rb:591 approved by Jimmy Yang
btr_cur_compress_if_useful(), btr_compress(): Add the parameter ibool
adjust. If adjust=TRUE, adjust the cursor position after compressing
the page.
btr_lift_page_up(): Return a pointer to the father page.
BTR_KEEP_POS_FLAG: A new flag for btr_cur_pessimistic_update().
btr_cur_pessimistic_update(): If *big_rec != NULL and flags &
BTR_KEEP_POS_FLAG, keep the cursor positioned on the updated record.
Also, do not release the index tree x-lock if *big_rec != NULL.
btr_cur_mtr_commit_and_start(): Commits and restarts a
mini-transaction so that it will retain an x-lock on index->lock and
the page of the cursor. This is invoked when
btr_cur_pessimistic_update() returns *big_rec != NULL.
In all callers of btr_cur_pessimistic_update() that do not pass
BTR_KEEP_POS_FLAG, assert that *big_rec == NULL.
btr_cur_compress(): Unused function [in the built-in MySQL 5.1], remove.
page_rec_get_nth(): Return the nth record on the page (an inverse
function of page_rec_get_n_recs_before()). Refactored from
page_get_middle_rec().
page_get_middle_rec(): Invoke page_rec_get_nth().
page_cur_insert_rec_zip_reorg(): Make use of the page directory
shortcuts in page_rec_get_nth() instead of scanning the whole list of
records.
row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG to
btr_cur_pessimistic_update().
row_ins_index_entry_low(): If row_ins_clust_index_entry_by_modify()
returns a big_rec, invoke btr_cur_mtr_commit_and_start() in order to
commit and start the mini-transaction without releasing the x-locks on
index->lock and the cursor page, and write the big_rec. Releasing the
page latch in mtr_commit() caused a race condition.
row_upd_clust_rec(): Pass BTR_KEEP_POS_FLAG to
btr_cur_pessimistic_update(). If it returns a big_rec, invoke
btr_cur_mtr_commit_and_start() in order to commit and start the
mini-transaction without releasing the x-locks on index->lock and the
cursor page, and write the big_rec. Releasing the page latch in
mtr_commit() caused a race condition.
sync_thread_add_level(): Add the parameter ibool relock. When TRUE,
bypass the latching order rules.
rw_lock_add_debug_info(): For nested X-lock requests, pass relock=TRUE
to sync_thread_add_level().
rb:678 approved by Jimmy Yang
This option is known to be broken when tablespaces contain off-page
columns after crash recovery. It has only been tested when creating
the data files from the scratch.
btr_blob_dbg_t: A map from page_no:heap_no:field_no to first_blob_page_no.
This map is instantiated for every clustered index in index->blobs.
It is protected by index->blobs_mutex.
btr_blob_dbg_msg_issue(): Issue a diagnostic message.
Invoked when btr_blob_dbg_msg is set.
btr_blob_dbg_rbt_insert(): Insert a btr_blob_dbg_t into index->blobs.
btr_blob_dbg_rbt_delete(): Remove a btr_blob_dbg_t from index->blobs.
btr_blob_dbg_cmp(): Comparator for btr_blob_dbg_t.
btr_blob_dbg_add_blob(): Add a BLOB reference to the map.
btr_blob_dbg_add_rec(): Add all BLOB references from a record to the map.
btr_blob_dbg_print(): Display the map of BLOB references in an index.
btr_blob_dbg_remove_rec(): Remove all BLOB references of a record from
the map.
btr_blob_dbg_is_empty(): Check that no BLOB references exist to or
from a page. Disowned references from delete-marked records are
tolerated.
btr_blob_dbg_op(): Perform an operation on all BLOB references on a
B-tree page.
btr_blob_dbg_add(): Add all BLOB references from a B-tree page to the
map.
btr_blob_dbg_remove(): Remove all BLOB references from a B-tree page
from the map.
btr_blob_dbg_restore(): Restore the BLOB references after a failed
page reorganize.
btr_blob_dbg_set_deleted_flag(): Modify the 'deleted' flag in the BLOB
references of a record.
btr_blob_dbg_owner(): Own or disown a BLOB reference.
btr_page_create(), btr_page_free_low(): Assert that no BLOB references exist.
btr_create(): Create index->blobs for clustered indexes.
btr_page_reorganize_low(): Invoke btr_blob_dbg_remove() before copying
the records. Invoke btr_blob_dbg_restore() if the operation fails.
btr_page_empty(), btr_lift_page_up(), btr_compress(), btr_discard_page():
Invoke btr_blob_dbg_remove().
btr_cur_del_mark_set_clust_rec(): Invoke btr_blob_dbg_set_deleted_flag().
Other cases of modifying the delete mark are either in the secondary
index or during crash recovery, which we do not promise to support.
btr_cur_set_ownership_of_extern_field(): Invoke btr_blob_dbg_owner().
btr_store_big_rec_extern_fields(): Invoke btr_blob_dbg_add_blob().
btr_free_externally_stored_field(): Invoke btr_blob_dbg_assert_empty()
on the first BLOB page.
page_cur_insert_rec_low(), page_cur_insert_rec_zip(),
page_copy_rec_list_end_to_created_page(): Invoke btr_blob_dbg_add_rec().
page_cur_insert_rec_zip_reorg(), page_copy_rec_list_end(),
page_copy_rec_list_start(): After failure, invoke
btr_blob_dbg_remove() and btr_blob_dbg_add().
page_cur_delete_rec(): Invoke btr_blob_dbg_remove_rec().
page_delete_rec_list_end(): Invoke btr_blob_dbg_op(btr_blob_dbg_remove_rec).
page_zip_reorganize(): Invoke btr_blob_dbg_remove() before copying the records.
page_zip_copy_recs(): Invoke btr_blob_dbg_add().
row_upd_rec_in_place(): Invoke btr_blob_dbg_rbt_delete() and
btr_blob_dbg_rbt_insert().
innobase_start_or_create_for_mysql(): Warn when UNIV_BLOB_DEBUG is enabled.
rb://550 approved by Jimmy Yang
Improve the diagnostics of buffer pool accesses for B-trees,
so that the file names and line numbers of the real calls are shown
instead of the line of the buf_page_get() call in btr_block_get().
btr_page_get(): Replaced with a macro.
btr_block_get_func(): Renamed from btr_block_get(). Add file, line.
btr_block_get(): A macro that passes the __FILE__, __LINE__ to
btr_block_get_func().
dict_truncate_index_tree(): Replace a btr_page_get() call
with btr_block_get(), since we are only latching the page, not accessing it.
Detailed revision comments:
r6768 | vasil | 2010-03-02 18:20:48 +0200 (Tue, 02 Mar 2010) | 5 lines
branches/zip:
Add a NOTE to the comment of btr_node_ptr_get_child_page_no()
to prevent mysterious bugs.
Detailed revision comments:
r6749 | vasil | 2010-02-20 18:45:41 +0200 (Sat, 20 Feb 2010) | 5 lines
Non-functional change: update copyright year to 2010 of the files
that have been modified after 2010-01-01 according to svn.
for f in $(svn log -v -r{2010-01-01}:HEAD |grep "^ M " |cut -b 16- |sort -u) ; do sed -i "" -E 's/(Copyright \(c\) [0-9]{4},) [0-9]{4}, (.*Innobase Oy.+All Rights Reserved)/\1 2010, \2/' $f ; done
Detailed revision comments:
r6559 | marko | 2010-02-04 13:21:18 +0200 (Thu, 04 Feb 2010) | 14 lines
branches/zip: Pass the file name and line number of the caller of the
b-tree cursor functions to the buffer pool requests, in order to make
the latch diagnostics more accurate.
buf_page_optimistic_get_func(): Renamed to buf_page_optimistic_get().
btr_page_get_father_node_ptr(), btr_insert_on_non_leaf_level(),
btr_pcur_open(), btr_pcur_open_with_no_init(), btr_pcur_open_on_user_rec(),
btr_pcur_open_at_rnd_pos(), btr_pcur_restore_position(),
btr_cur_open_at_index_side(), btr_cur_open_at_rnd_pos():
Rename the function to _func and add the parameters file, line.
Define wrapper macros with __FILE__, __LINE__.
btr_cur_search_to_nth_level(): Add the parameters file, line.