mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-06-13 13:01:51 +03:00

Author	SHA1	Message	Date
Marko Mäkelä	ebefef658e	Merge 10.11 into 11.2	2024-10-18 11:32:22 +03:00
Marko Mäkelä	eca552a1a4	MDEV-34830: LSN in the future is not being treated as serious corruption The invariant of write-ahead logging is that before any change to a page is written to the data file, the corresponding log record must must first have been durably written. In crash recovery, there were some sloppy checks for this. Let us implement accurate checks and flag an inconsistency as a hard error, so that we can avoid further corruption of a corrupted database. For data extraction from the corrupted database, innodb_force_recovery can be used. Before recovery is reading any data pages or invoking buf_dblwr_t::recover() to recover torn pages from the doublewrite buffer, InnoDB will have parsed the log until the final LSN and updated log_sys.lsn to that. So, we can rely on log_sys.lsn at all times. The doublewrite buffer recovery has been refactored in such a way that the recv_sys.dblwr.pages may be consulted while discovering files and their page sizes, but nothing will be written back to data files before buf_dblwr_t::recover() is invoked. recv_max_page_lsn, recv_lsn_checks_on: Remove. recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging condition at the end of the recovery. recv_dblwr_t::validate_page(): Keep track of the maximum LSN (if we are checking a non-doublewrite copy of a page) but do not complain LSN being in the future. The doublewrite buffer is a special case, because it will be read early during recovery. Besides, starting with commit `762bcb81b5` the dblwr=true copies of pages may legitimately be "too new". recv_dblwr_t::find_page(): Find a valid page with the smallest FIL_PAGE_LSN that is in the valid range for recovery. recv_dblwr_t::restore_first_page(): Replaced by find_page(). Only buf_dblwr_t::recover() will write to data files. buf_dblwr_t::recover(): Simplify the message output. Do attempt doublewrite recovery on user page read error. Ignore doublewrite pages whose FIL_PAGE_LSN is outside the usable bounds. Previously, we could wrongly recover a too new page from the doublewrite buffer. It is unlikely that this could have lead to an actual error. Write back all recovered pages from the doublewrite buffer here, including for the first page of any tablespace. buf_page_is_corrupted(): Distinguish the return values CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER. buf_page_check_corrupt(): Return the error code DB_CORRUPTION in case the LSN is in the future. Datafile::read_first_page_flags(): Split from read_first_page(). Take a copy of the first page as a parameter. recv_sys_t::free_corrupted_page(): Take the file as a parameter and return whether a message was displayed. This avoids some duplicated and incomplete error messages. buf_page_t::read_complete(): Remove some redundant output and always display the name of the corrupted file. Never return DB_FAIL; use it only in internal error handling. IORequest::read_complete(): Assume that buf_page_t::read_complete() will have reported any error. fil_space_t::set_corrupted(): Return whether this is the first time the tablespace had been flagged as corrupted. Datafile::validate_first_page(), fil_node_open_file_low(), fil_node_open_file(), fil_space_t::read_page0(), fil_node_t::read_page0(): Add a parameter for a copy of the first page, and a parameter to indicate whether the FIL_PAGE_LSN check should be suppressed. Before buf_dblwr_t::recover() is invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the FSP_SPACE_FLAGS and the tablespace ID that may be present in a potentially too new copy of a page. Reviewed by: Debarun Banerjee	2024-10-18 10:12:47 +03:00
Marko Mäkelä	e782e416ac	Merge 10.11 into 11.2	2024-09-18 07:38:49 +03:00
Marko Mäkelä	b187414764	Merge 10.6 into 10.11	2024-09-16 10:58:40 +03:00
Marko Mäkelä	a74bea7ba9	MDEV-34879 InnoDB fails to merge the change buffer to ROW_FORMAT=COMPRESSED tables buf_page_t::read_complete(): Fix an incorrect condition that had been added in commit `aaef2e1d8c` (MDEV-27058). Also for compressed-only pages we must remember that buffered changes may exist. buf_read_page(): Correct the function comment; this is for a synchronous and not asynchronous read. Pass the parameter unzip=true to buf_read_page_low(), because each of our callers will be interested in the uncompressed page frame. This will cause the test encryption.innodb-compressed-blob to emit more errors when the correct keys for decrypting the clustered index root page are unavailable. Reviewed by: Debarun Banerjee	2024-09-12 10:52:55 +03:00
Marko Mäkelä	d73baa402a	Merge 10.11 into 11.0	2024-02-20 12:02:01 +02:00
Marko Mäkelä	77b4399545	MDEV-33421 innodb.corrupted_during_recovery fails due to error that the table is corrupted This fixes up the merge commit `7e39470e33` dict_table_open_on_name(): Report ER_TABLE_CORRUPT in a consistent fashion, with a pretty-printed table name.	2024-02-08 14:20:42 +02:00
Marko Mäkelä	e581396b7a	MDEV-29983 Deprecate innodb_file_per_table Before commit `6112853cda` in MySQL 4.1.1 introduced the parameter innodb_file_per_table, all InnoDB data was written to the InnoDB system tablespace (often named ibdata1). A serious design problem is that once the system tablespace has grown to some size, it cannot shrink even if the data inside it has been deleted. There are also other design problems, such as the server hang MDEV-29930 that should only be possible when using innodb_file_per_table=0 and innodb_undo_tablespaces=0 (storing both tables and undo logs in the InnoDB system tablespace). The parameter innodb_change_buffering was deprecated in commit `b5852ffbee`. Starting with commit `baf276e6d4` (MDEV-19229) the number of innodb_undo_tablespaces can be increased, so that the undo logs can be moved out of the system tablespace of an existing installation. If all these things (tables, undo logs, and the change buffer) are removed from the InnoDB system tablespace, the only variable-size data structure inside it is the InnoDB data dictionary. DDL operations on .ibd files was optimized in commit `86dc7b4d4c` (MDEV-24626). That should have removed any thinkable performance advantage of using innodb_file_per_table=0. Since there should be no benefit of setting innodb_file_per_table=0, the parameter should be deprecated. Starting with MySQL 5.6 and MariaDB Server 10.0, the default value is innodb_file_per_table=1.	2023-01-11 17:55:56 +02:00
Marko Mäkelä	c980350438	MDEV-13542 fixup: Improve a recovery error message A message used to say "failed to read or decrypt" but the "or decrypt" part was removed in commit `0b47c126e3` without adjusting rarely needed error message suppressions in some encryption tests. Let us improve the error message so that it mentions the file name, and adjust all error message suppressions in tests. Thanks to Oleksandr Byelkin for noticing one test failure.	2022-08-05 11:02:18 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	02d9b048a2	Merge 10.3 into 10.4	2019-04-05 11:41:03 +03:00
Marko Mäkelä	cad56fbaba	MDEV-18733 MariaDB slow start after crash recovery If InnoDB crash recovery was needed, the InnoDB function srv_start() would invoke extra validation, reading something from every InnoDB data file. This should be unnecessary now that MDEV-14717 made RENAME operations crash-safe inside InnoDB (which can be disabled in MariaDB 10.2 by setting innodb_safe_truncate=OFF). dict_check_sys_tables(): Skip tables that would be dropped by row_mysql_drop_garbage_tables(). Perform extra validation only if innodb_safe_truncate=OFF, innodb_force_recovery=0 and crash recovery was needed. dict_load_table_one(): Validate the root page of the table. In this way, we can deny access to corrupted or mismatching tables not only after crash recovery, but also after a clean shutdown.	2019-04-03 19:56:03 +03:00
Michael Widenius	b5615eff0d	Write information about restart in .result Idea comes from MySQL which does something similar	2019-04-01 19:47:24 +03:00
Marko Mäkelä	9835f7b80f	Merge 10.1 into 10.2	2019-03-04 16:46:58 +02:00
Marko Mäkelä	e39d6e0c53	MDEV-18601 Can't create table with ENCRYPTED=DEFAULT when innodb_default_encryption_key_id!=1 The problem with the InnoDB table attribute encryption_key_id is that it is not being persisted anywhere in InnoDB except if the table attribute encryption is specified and is something else than encryption=default. MDEV-17320 made it a hard error if encryption_key_id is specified to be anything else than 1 in that case. Ideally, we would always persist encryption_key_id in InnoDB. But, then we would have to be prepared for the case that when encryption is being enabled for a table whose encryption_key_id attribute refers to a non-existing key. In MariaDB Server 10.1, our best option remains to not store anything inside InnoDB. But, instead of returning the error that MDEV-17320 introduced, we should merely issue a warning that the specified encryption_key_id is going to be ignored if encryption=default. To improve the situation a little more, we will issue a warning if SET [GLOBAL\|SESSION] innodb_default_encryption_key_id is being set to something that does not refer to an available encryption key. Starting with MariaDB Server 10.2, thanks to MDEV-5800, we could open the table definition from InnoDB side when the encryption is being enabled, and actually fix the root cause of what was reported in MDEV-17320.	2019-02-28 23:20:31 +02:00
Marko Mäkelä	72005b7a1c	MDEV-13103: Improve 'cannot be decrypted' error message buf_page_check_corrupt(): Display the file name.	2018-06-13 16:02:40 +03:00
Marko Mäkelä	112cb56182	Add suppressions for background page read errors	2018-02-19 08:59:36 +02:00
Marko Mäkelä	2d8fdfbde5	Merge 10.1 into 10.2 Replace have_innodb_zip.inc with innodb_page_size_small.inc.	2017-06-08 12:45:08 +03:00
Jan Lindström	6b6987154a	MDEV-12114: install_db shows corruption for rest encryption and innodb_checksum_algorithm=strict_none Problem was that checksum check resulted false positives that page is both not encrypted and encryted when checksum_algorithm was strict_none. Encrypton checksum will use only crc32 regardless of setting. buf_zip_decompress: If compression fails report a error message containing the space name if available (not available during import). And note if space could be encrypted. buf_page_get_gen: Do not assert if decompression fails, instead unfix the page and return NULL to upper layer. fil_crypt_calculate_checksum: Use only crc32 method. fil_space_verify_crypt_checksum: Here we need to check crc32, innodb and none method for old datafiles. fil_space_release_for_io: Allow null space. encryption.innodb-compressed-blob is now run with crc32 and none combinations. Note that with none and strict_none method there is not really a way to detect page corruptions and page corruptions after decrypting the page with incorrect key. New test innodb-checksum-algorithm to test different checksum algorithms with encrypted, row compressed and page compressed tables.	2017-06-01 14:07:48 +03:00
Marko Mäkelä	f9cc391863	Merge 10.1 into 10.2 This only merges MDEV-12253, adapting it to MDEV-12602 which is already present in 10.2 but not yet in the 10.1 revision that is being merged. TODO: Error handling in crash recovery needs to be improved. If a page cannot be decrypted (or read), we should cleanly abort the startup. If innodb_force_recovery is specified, we should ignore the problematic page and apply redo log to other pages. Currently, the test encryption.innodb-redo-badkey randomly fails like this (the last messages are from cmake -DWITH_ASAN): 2017-05-05 10:19:40 140037071685504 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1635994 2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Missing MLOG_FILE_NAME or MLOG_FILE_DELETE before MLOG_CHECKPOINT for tablespace 1 2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[2201] with error Data structure corruption 2017-05-05 10:19:41 140037071685504 [Note] InnoDB: Starting shutdown... i================================================================= ==5226==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x612000018588 in thread T0 #0 0x736750 in operator delete(void) (/mariadb/server/build/sql/mysqld+0x736750) #1 0x1e4833f in LatchCounter::~LatchCounter() /mariadb/server/storage/innobase/include/sync0types.h:599:4 #2 0x1e480b8 in LatchMeta<LatchCounter>::~LatchMeta() /mariadb/server/storage/innobase/include/sync0types.h:786:17 #3 0x1e35509 in sync_latch_meta_destroy() /mariadb/server/storage/innobase/sync/sync0debug.cc:1622:3 #4 0x1e35314 in sync_check_close() /mariadb/server/storage/innobase/sync/sync0debug.cc:1839:2 #5 0x1dfdc18 in innodb_shutdown() /mariadb/server/storage/innobase/srv/srv0start.cc:2888:2 #6 0x197e5e6 in innobase_init(void) /mariadb/server/storage/innobase/handler/ha_innodb.cc:4475:3	2017-05-05 10:38:53 +03:00
Jan Lindström	765a43605a	MDEV-12253: Buffer pool blocks are accessed after they have been freed Problem was that bpage was referenced after it was already freed from LRU. Fixed by adding a new variable encrypted that is passed down to buf_page_check_corrupt() and used in buf_page_get_gen() to stop processing page read. This patch should also address following test failures and bugs: MDEV-12419: IMPORT should not look up tablespace in PageConverter::validate(). This is now removed. MDEV-10099: encryption.innodb_onlinealter_encryption fails sporadically in buildbot MDEV-11420: encryption.innodb_encryption-page-compression failed in buildbot MDEV-11222: encryption.encrypt_and_grep failed in buildbot on P8 Removed dict_table_t::is_encrypted and dict_table_t::ibd_file_missing and replaced these with dict_table_t::file_unreadable. Table ibd file is missing if fil_get_space(space_id) returns NULL and encrypted if not. Removed dict_table_t::is_corrupted field. Ported FilSpace class from 10.2 and using that on buf_page_check_corrupt(), buf_page_decrypt_after_read(), buf_page_encrypt_before_write(), buf_dblwr_process(), buf_read_page(), dict_stats_save_defrag_stats(). Added test cases when enrypted page could be read while doing redo log crash recovery. Also added test case for row compressed blobs. btr_cur_open_at_index_side_func(), btr_cur_open_at_rnd_pos_func(): Avoid referencing block that is NULL. buf_page_get_zip(): Issue error if page read fails. buf_page_get_gen(): Use dberr_t for error detection and do not reference bpage after we hare freed it. buf_mark_space_corrupt(): remove bpage from LRU also when it is encrypted. buf_page_check_corrupt(): @return DB_SUCCESS if page has been read and is not corrupted, DB_PAGE_CORRUPTED if page based on checksum check is corrupted, DB_DECRYPTION_FAILED if page post encryption checksum matches but after decryption normal page checksum does not match. In read case only DB_SUCCESS is possible. buf_page_io_complete(): use dberr_t for error handling. buf_flush_write_block_low(), buf_read_ahead_random(), buf_read_page_async(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), buf_read_recv_pages(), fil_aio_wait(): Issue error if page read fails. btr_pcur_move_to_next_page(): Do not reference page if it is NULL. Introduced dict_table_t::is_readable() and dict_index_t::is_readable() that will return true if tablespace exists and pages read from tablespace are not corrupted or page decryption failed. Removed buf_page_t::key_version. After page decryption the key version is not removed from page frame. For unencrypted pages, old key_version is removed at buf_page_encrypt_before_write() dict_stats_update_transient_for_index(), dict_stats_update_transient() Do not continue if table decryption failed or table is corrupted. dict0stats.cc: Introduced a dict_stats_report_error function to avoid code duplication. fil_parse_write_crypt_data(): Check that key read from redo log entry is found from encryption plugin and if it is not, refuse to start. PageConverter::validate(): Removed access to fil_space_t as tablespace is not available during import. Fixed error code on innodb.innodb test. Merged test cased innodb-bad-key-change5 and innodb-bad-key-shutdown to innodb-bad-key-change2. Removed innodb-bad-key-change5 test. Decreased unnecessary complexity on some long lasting tests. Removed fil_inc_pending_ops(), fil_decr_pending_ops(), fil_get_first_space(), fil_get_next_space(), fil_get_first_space_safe(), fil_get_next_space_safe() functions. fil_space_verify_crypt_checksum(): Fixed bug found using ASAN where FIL_PAGE_END_LSN_OLD_CHECKSUM field was incorrectly accessed from row compressed tables. Fixed out of page frame bug for row compressed tables in fil_space_verify_crypt_checksum() found using ASAN. Incorrect function was called for compressed table. Added new tests for discard, rename table and drop (we should allow them even when page decryption fails). Alter table rename is not allowed. Added test for restart with innodb-force-recovery=1 when page read on redo-recovery cant be decrypted. Added test for corrupted table where both page data and FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION is corrupted. Adjusted the test case innodb_bug14147491 so that it does not anymore expect crash. Instead table is just mostly not usable. fil0fil.h: fil_space_acquire_low is not visible function and fil_space_acquire and fil_space_acquire_silent are inline functions. FilSpace class uses fil_space_acquire_low directly. recv_apply_hashed_log_recs() does not return anything.	2017-04-26 15:19:16 +03:00

21 Commits