This bug was originally filed and fixed as Bug#12612184. The original
fix was buggy, and it was patched by Bug#12704861. Also that patch was
buggy (potentially breaking crash recovery), and both fixes were
reverted.
This fix was not ported to the built-in InnoDB of MySQL 5.1, because
the function signatures of many core functions are different from
InnoDB Plugin and later versions. The block allocation routines and
their callers would have to changed so that they handle block
descriptors instead of page frames.
When a record is updated so that its size grows, non-updated columns
can be selected for external (off-page) storage. The bug is that the
initially inserted updated record contains an all-zero BLOB pointer to
the field that was not updated. Only after the BLOB pages have been
allocated and written, the valid pointer can be written to the record.
Between the release of the page latch in mtr_commit(mtr) after
btr_cur_pessimistic_update() and the re-latching of the page in
btr_pcur_restore_position(), other threads can see the invalid BLOB
pointer consisting of 20 zero bytes. Moreover, if the system crashes
at this point, the situation could persist after crash recovery, and
the contents of the non-updated column would be permanently lost.
The problem is amplified by the ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPRESSED that were introduced in
innodb_file_format=barracuda in InnoDB Plugin, but the bug does exist
in all InnoDB versions.
The fix is as follows. After a pessimistic B-tree operation that needs
to write out off-page columns, allocate the pages for these columns in
the mini-transaction that performed the B-tree operation (btr_mtr),
but write the pages in a separate mini-transaction (blob_mtr). Do
mtr_commit(blob_mtr) before mtr_commit(btr_mtr). A quirk: Do not reuse
pages that were previously freed in btr_mtr. Only write the off-page
columns to 'fresh' pages.
In this way, crash recovery will see redo log entries for blob_mtr
before any redo log entry for btr_mtr. It will apply the BLOB page
writes to pages that were marked free at that point. If crash recovery
fails to see all of the btr_mtr redo log, there will be some
unreachable BLOB data in free pages, but the B-tree will be in a
consistent state.
btr_page_alloc_low(): Renamed from btr_page_alloc(). Add the parameter
init_mtr. Return an allocated block, or NULL. If init_mtr!=mtr but
the page was already X-latched in mtr, do not initialize the page.
btr_page_alloc(): Wrapper for btr_page_alloc_for_ibuf() and
btr_page_alloc_low().
btr_page_free(): Add a debug assertion that the page was a B-tree page.
btr_lift_page_up(): Return the father block.
btr_compress(), btr_cur_compress_if_useful(): Add the parameter ibool
adjust, for adjusting the cursor position.
btr_cur_pessimistic_update(): Preserve the cursor position when
big_rec will be written and the new flag BTR_KEEP_POS_FLAG is defined.
Remove a duplicate rec_get_offsets() call. Keep the X-latch on
index->lock when big_rec is needed.
btr_store_big_rec_extern_fields(): Replace update_inplace with
an operation code, and local_mtr with btr_mtr. When not doing a
fresh insert and btr_mtr has freed pages, put aside any pages that
were previously X-latched in btr_mtr, and free the pages after
writing out all data. The data must be written to 'fresh' pages,
because btr_mtr will be committed and written to the redo log after
the BLOB writes have been written to the redo log.
btr_blob_op_is_update(): Check if an operation passed to
btr_store_big_rec_extern_fields() is an update or insert-by-update.
fseg_alloc_free_page_low(), fsp_alloc_free_page(),
fseg_alloc_free_extent(), fseg_alloc_free_page_general(): Add the
parameter init_mtr. Return an allocated block, or NULL. If
init_mtr!=mtr but the page was already X-latched in mtr, do not
initialize the page.
xdes_get_descriptor_with_space_hdr(): Assert that the file space
header is being X-latched.
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page().
fsp_page_create(): New function, for allocating, X-latching and
potentially initializing a page. If init_mtr!=mtr but the page was
already X-latched in mtr, do not initialize the page.
fsp_free_page(): Add ut_ad(0) to the error outcomes.
fsp_free_page(), fseg_free_page_low(): Increment mtr->n_freed_pages.
fsp_alloc_seg_inode_page(), fseg_create_general(): Assert that the
page was not previously X-latched in the mini-transaction. A file
segment or inode page should never be allocated in the middle of an
mini-transaction that frees pages, such as btr_cur_pessimistic_delete().
fseg_alloc_free_page_low(): If the hinted page was allocated, skip the
check if the tablespace should be extended. Return NULL instead of
FIL_NULL on failure. Remove the flag frag_page_allocated. Instead,
return directly, because the page would already have been initialized.
fseg_find_free_frag_page_slot() would return ULINT_UNDEFINED on error,
not FIL_NULL. Correct a bogus assertion.
fseg_alloc_free_page(): Redefine as a wrapper macro around
fseg_alloc_free_page_general().
buf_block_buf_fix_inc(): Move the definition from the buf0buf.ic to
buf0buf.h, so that it can be called from other modules.
mtr_t: Add n_freed_pages (number of pages that have been freed).
page_rec_get_nth_const(), page_rec_get_nth(): The inverse function of
page_rec_get_n_recs_before(), get the nth record of the record
list. This is faster than iterating the linked list. Refactored from
page_get_middle_rec().
trx_undo_rec_copy(): Add a debug assertion for the length.
trx_undo_add_page(): Return a block descriptor or NULL instead of a
page number or FIL_NULL.
trx_undo_report_row_operation(): Add debug assertions.
trx_sys_create_doublewrite_buf(): Assert that each page was not
previously X-latched.
page_cur_insert_rec_zip_reorg(): Make use of page_rec_get_nth().
row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG, so that
the repositioning of the cursor can be avoided.
row_ins_index_entry_low(): Add DEBUG_SYNC points before and after
writing off-page columns. If inserting by updating a delete-marked
record, do not reposition the cursor or commit the mini-transaction
before writing the off-page columns.
row_build(): Tighten a debug assertion about null BLOB pointers.
row_upd_clust_rec(): Add DEBUG_SYNC points before and after writing
off-page columns. Do not reposition the cursor or commit the
mini-transaction before writing the off-page columns.
rb:939 approved by Jimmy Yang
hash index at shutdown
btr_search_disable(): Just drop the entire adaptive hash index,
without dropping every record separately.
buf_pool_clear_hash_index(): Renamed and simplified from
buf_pool_drop_hash_index(). Set block->index = NULL for every block in
the buffer pool. Do not release the btr_search_latch. The caller will
have to adjust other data structures.
Remove block->is_hashed. It is redundant, should be always equal to
block->index != NULL.
Remove btr_search_fully_disabled, btr_search_enabled_mutex, and
SYNC_SEARCH_SYS_CONF. We drop the AHI in one pass, without releasing
the btr_search_latch in between.
Replace void* with const rec_t* and add assertions on btr_search_latch
and btr_search_enabled to ha0ha.h, ha0ha.ic, ha0ha.c.
page_set_max_trx_id(): Ignore the adaptive hash index. I forgot to
push this in rb:750.
btr0sea.c: Always after acquiring btr_search_latch, check for
block->index==NULL or !btr_search_enabled. We can now set
block->index=NULL while only holding btr_search_latch in exclusive
mode. Always acquire btr_search_latch before reading block->index,
except in shortcuts when testing for block->index == NULL.
ha_clear(), ha_search(): Unused function, remove.
buf_page_peek_if_search_hashed(): Remove. This function may avoid
latching a page at the cost of doing a duplicate buf_pool->page_hash
lookup.
rb:775 approved by Inaam Rana
when buffered changes are to be discarded
ibuf_add_free_page(): Lower the latching order of the newly allocated page
to SYNC_IBUF_TREE_NODE_NEW after latching the insert buffer tree root.
This bug always was bogus UNIV_SYNC_DEBUG alarm. The function
buf_block_dbg_add_level() is a no-op unless UNIV_SYNC_DEBUG is defined.
Tweak the faulty UNIV_SYNC_DEBUG diagnostics a little bit more.
ibuf_add_free_page(): Lower the latching order of the newly allocated page
only after acquiring the ibuf_mutex.
discarded in buf_page_create()
This bug turned out to be a false alarm, a bug in the UNIV_SYNC_DEBUG
diagnostic code. Because of this, the patch was not backported to the
built-in InnoDB in MySQL 5.1. Furthermore, there is no test case for
InnoDB Plugin in MySQL 5.1, because the delete buffering in MySQL 5.5
makes triggering the failure much easier.
When a freed page for which there exist orphaned buffered changes is
allocated and reused for something else, buf_page_create() will discard
the buffered changes by invoking ibuf_merge_or_delete_for_page().
This would violate the InnoDB latching order.
Tweak the latching order as follows. Move SYNC_IBUF_MUTEX below
SYNC_FSP_PAGE, where it logically belongs, and assign new latching
levels for the ibuf->index->lock and the insert buffer B-tree pages:
#define SYNC_IBUF_MUTEX 370 /* ibuf_mutex */
#define SYNC_IBUF_INDEX_TREE 360
#define SYNC_IBUF_TREE_NODE_NEW 359
#define SYNC_IBUF_TREE_NODE 358
btr_block_get(), btr_page_get(): In UNIV_SYNC_DEBUG, add the parameter
"index" for determining the appropriate latching order
(SYNC_IBUF_TREE_NODE or SYNC_TREE_NODE).
btr_page_alloc_for_ibuf(), btr_create(): Use SYNC_IBUF_TREE_NODE_NEW
instead of SYNC_TREE_NODE_NEW for insert buffer pages.
btr_cur_search_to_nth_level(), btr_pcur_restore_position_func(): Use
SYNC_IBUF_TREE_NODE instead of SYNC_TREE_NODE for insert buffer pages.
btr_search_guess_on_hash(): Assert that the index is not an insert buffer tree.
dict_index_add_to_cache(): Use SYNC_IBUF_INDEX_TREE for the insert
buffer tree (ibuf->index->lock).
ibuf0ibuf.c: Use SYNC_IBUF_TREE_NODE or SYNC_IBUF_TREE_NODE_NEW for
all B-tree pages.
ibuf_merge_or_delete_for_page(): Assert that the user page is
BUF_IO_READ fixed. Only in this way it is OK to latch it as
SYNC_IBUF_TREE_NODE instead of the proper SYNC_TREE_NODE (which would
violate the changed latching order).
sync_thread_add_level(): Remove the special tweak for
SYNC_IBUF_MUTEX. Add rules for the added latching levels.
rb:591 approved by Jimmy Yang
It was the enabling of UNIV_DEBUG_FILE_ACCESSES that caught Bug #55284
in the first place. This is a very light piece of of debug code, and
there really is no reason why it is not enabled in all debug builds.
rb://551 approved by Jimmy Yang
row_search_for_mysql(): When a secondary index record might not be
visible in the current transaction's read view and we consult the
clustered index and optionally some undo log records, return the
relevant columns of the clustered index record to MySQL instead of the
secondary index record.
ibuf_insert_to_index_page_low(): New function, refactored from
ibuf_insert_to_index_page().
ibuf_insert_to_index_page(): When we are inserting a record in place
of a delete-marked record and some fields of the record differ, update
that record just like row_ins_sec_index_entry_by_modify() would do.
btr_cur_update_alloc_zip(): Make the function public.
mysql_row_templ_t: Add clust_rec_field_no.
row_sel_store_mysql_rec(), row_sel_push_cache_row_for_mysql(): Add the
flag rec_clust, for returning data at clust_rec_field_no instead of
rec_field_no. Resurrect the debug assertion that the record not be
marked for deletion. (Bug #55626)
[UNIV_DEBUG || UNIV_IBUF_DEBUG] ibuf_debug, buf_page_get_gen(),
buf_flush_page_try():
Implement innodb_change_buffering_debug=1 for evicting pages from the
buffer pool, so that change buffering will be attempted more
frequently.
Detailed revision comments:
r6348 | marko | 2009-12-22 11:04:34 +0200 (Tue, 22 Dec 2009) | 37 lines
branches/zip: Merge a change from MySQL:
r6351 | marko | 2009-12-22 11:11:18 +0200 (Tue, 22 Dec 2009) | 1 line
branches/zip: Remove an obsolete declaration of LOCK_thread_count.
r6352 | marko | 2009-12-22 12:33:01 +0200 (Tue, 22 Dec 2009) | 104 lines
branches/zip: Merge revisions 6206:6350 from branches/5.1,
except r6347, r6349, r6350 which were committed separately
to both branches, and r6310, which was backported from zip to 5.1.
------------------------------------------------------------------------
r6206 | jyang | 2009-11-20 09:38:43 +0200 (Fri, 20 Nov 2009) | 3 lines
Changed paths:
M /branches/5.1/handler/ha_innodb.cc
branches/5.1: Non-functional change, fix formatting.
------------------------------------------------------------------------
r6230 | sunny | 2009-11-24 23:52:43 +0200 (Tue, 24 Nov 2009) | 3 lines
Changed paths:
M /branches/5.1/mysql-test/innodb-autoinc.result
branches/5.1: Fix autoinc failing test results.
(this should be skipped when merging 5.1 into zip)
------------------------------------------------------------------------
r6231 | sunny | 2009-11-25 10:26:27 +0200 (Wed, 25 Nov 2009) | 7 lines
Changed paths:
M /branches/5.1/mysql-test/innodb-autoinc.result
M /branches/5.1/mysql-test/innodb-autoinc.test
M /branches/5.1/row/row0sel.c
branches/5.1: Fix BUG#49032 - auto_increment field does not initialize to last value in InnoDB Storage Engine.
We use the appropriate function to read the column value for non-integer
autoinc column types, namely float and double.
rb://208. Approved by Marko.
------------------------------------------------------------------------
r6232 | sunny | 2009-11-25 10:27:39 +0200 (Wed, 25 Nov 2009) | 2 lines
Changed paths:
M /branches/5.1/row/row0sel.c
branches/5.1: This is an interim fix, fix white space errors.
------------------------------------------------------------------------
r6233 | sunny | 2009-11-25 10:28:35 +0200 (Wed, 25 Nov 2009) | 2 lines
Changed paths:
M /branches/5.1/include/mach0data.h
M /branches/5.1/include/mach0data.ic
M /branches/5.1/mysql-test/innodb-autoinc.result
M /branches/5.1/mysql-test/innodb-autoinc.test
M /branches/5.1/row/row0sel.c
branches/5.1: This is an interim fix, fix tests and make read float/double arg const.
------------------------------------------------------------------------
r6234 | sunny | 2009-11-25 10:29:03 +0200 (Wed, 25 Nov 2009) | 2 lines
Changed paths:
M /branches/5.1/row/row0sel.c
branches/5.1: This is an interim fix, fix whitepsace issues.
------------------------------------------------------------------------
r6235 | sunny | 2009-11-26 01:14:42 +0200 (Thu, 26 Nov 2009) | 9 lines
Changed paths:
M /branches/5.1/handler/ha_innodb.cc
M /branches/5.1/mysql-test/innodb-autoinc.result
M /branches/5.1/mysql-test/innodb-autoinc.test
branches/5.1: Fix Bug#47720 - REPLACE INTO Autoincrement column with negative values.
This bug is similiar to the negative autoinc filter patch from earlier,
with the additional handling of filtering out the negative column values
set explicitly by the user.
rb://184
Approved by Heikki.
------------------------------------------------------------------------
r6242 | vasil | 2009-11-27 22:07:12 +0200 (Fri, 27 Nov 2009) | 4 lines
Changed paths:
M /branches/5.1/export.sh
branches/5.1:
Minor changes to support plugin snapshots.
------------------------------------------------------------------------
r6306 | calvin | 2009-12-14 15:12:46 +0200 (Mon, 14 Dec 2009) | 5 lines
Changed paths:
M /branches/5.1/mysql-test/innodb-autoinc.result
M /branches/5.1/mysql-test/innodb-autoinc.test
branches/5.1: fix bug#49267: innodb-autoinc.test fails on windows
because of different case mode
There is no change to the InnoDB code, only to fix test case by
changing "T1" to "t1".
------------------------------------------------------------------------
r6324 | jyang | 2009-12-17 06:54:24 +0200 (Thu, 17 Dec 2009) | 8 lines
Changed paths:
M /branches/5.1/handler/ha_innodb.cc
M /branches/5.1/include/lock0lock.h
M /branches/5.1/include/srv0srv.h
M /branches/5.1/lock/lock0lock.c
M /branches/5.1/log/log0log.c
M /branches/5.1/srv/srv0srv.c
M /branches/5.1/srv/srv0start.c
branches/5.1: Fix bug #47814 - Diagnostics are frequently not
printed after a long lock wait in InnoDB. Separate out the
lock wait timeout check thread from monitor information
printing thread.
rb://200 Approved by Marko.
------------------------------------------------------------------------
r6364 | marko | 2009-12-26 21:06:31 +0200 (Sat, 26 Dec 2009) | 4 lines
branches/zip: ibuf_bitmap_get_map_page():
Define a wrapper macro that passes __FILE__, __LINE__ of the caller
to buf_page_get_gen().
This will ease the diagnosis of the likes of Issue #135.
Detailed revision comments:
r6130 | marko | 2009-11-02 11:42:56 +0200 (Mon, 02 Nov 2009) | 9 lines
branches/zip: Free all resources at shutdown. Set pointers to NULL, so
that Valgrind will not complain about freed data structures that are
reachable via pointers. This addresses Bug #45992 and Bug #46656.
This patch is mostly based on changes copied from branches/embedded-1.0,
mainly c5432, c3439, c3134, c2994, c2978, but also some other code was
copied. Some added cleanup code is specific to MySQL/InnoDB.
rb://199 approved by Sunny Bains